# Car price prediction through a web application
Starting from a dataset of car listings, we build a Machine Learning model that predicts car prices, then deploy that model.

## 1. Reading the data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [2]:
df = pd.read_csv("car_price_prediction.csv")
df.head()

Unnamed: 0,ID,Price,Levy,Manufacturer,Model,Prod. year,Category,Leather interior,Fuel type,Engine volume,Mileage,Cylinders,Gear box type,Drive wheels,Doors,Wheel,Color,Airbags
0,45654403,13328,1399,LEXUS,RX 450,2010,Jeep,Yes,Hybrid,3.5,186005 km,6.0,Automatic,4x4,04-May,Left wheel,Silver,12
1,44731507,16621,1018,CHEVROLET,Equinox,2011,Jeep,No,Petrol,3.0,192000 km,6.0,Tiptronic,4x4,04-May,Left wheel,Black,8
2,45774419,8467,-,HONDA,FIT,2006,Hatchback,No,Petrol,1.3,200000 km,4.0,Variator,Front,04-May,Right-hand drive,Black,2
3,45769185,3607,862,FORD,Escape,2011,Jeep,Yes,Hybrid,2.5,168966 km,4.0,Automatic,4x4,04-May,Left wheel,White,0
4,45809263,11726,446,HONDA,FIT,2014,Hatchback,Yes,Petrol,1.3,91901 km,4.0,Automatic,Front,04-May,Left wheel,Silver,4


In [3]:
df.dtypes

ID                    int64
Price                 int64
Levy                 object
Manufacturer         object
Model                object
Prod. year            int64
Category             object
Leather interior     object
Fuel type            object
Engine volume        object
Mileage              object
Cylinders           float64
Gear box type        object
Drive wheels         object
Doors                object
Wheel                object
Color                object
Airbags               int64
dtype: object

In [4]:
df.describe()

Unnamed: 0,ID,Price,Prod. year,Cylinders,Airbags
count,19237.0,19237.0,19237.0,19237.0,19237.0
mean,45576540.0,18555.93,2010.912824,4.582991,6.582627
std,936591.4,190581.3,5.668673,1.199933,4.320168
min,20746880.0,1.0,1939.0,1.0,0.0
25%,45698370.0,5331.0,2009.0,4.0,4.0
50%,45772310.0,13172.0,2012.0,4.0,6.0
75%,45802040.0,22075.0,2015.0,4.0,12.0
max,45816650.0,26307500.0,2020.0,16.0,16.0


## 2. Exploratory analysis and data visualization

In [5]:
df['Price'].unique()

array([13328, 16621,  8467, ..., 56814, 63886, 22075])

In [None]:
plt.hist(df["Price"])
# Display the plot
plt.show()

In [None]:
# collect the feature names into a list of columns
columns = df.columns.tolist()
columns

In [None]:
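# kdeplot and box plots need numeric data, so keep only the numeric columns
numeric_cols = df.select_dtypes(include="number").columns.tolist()
n_rows = len(numeric_cols)
plt.figure(figsize=(16, 4 * n_rows))

# one row per column: density on the left, box plot on the right
for i, col in enumerate(numeric_cols):
    plt.subplot(n_rows, 2, 2 * i + 1)
    sns.kdeplot(df[col], color = '#d1aa00', fill = True, warn_singular=False)
    plt.subplot(n_rows, 2, 2 * i + 2)
    df[col].plot.box()
plt.tight_layout()
plt.show()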


## 3. Handling outliers

Let's check whether our dataset contains any null values.

In [None]:
df.isnull().sum()

So there are no null values. Let's check for NaN as well (in pandas, `isna()` is an alias of `isnull()`, so the result is the same).

In [None]:
df.isna().sum()

There are no NaN values either. Let's also make sure there are no duplicated rows.

In [None]:
df.duplicated().sum()

The result shows that there are no duplicated rows. Our dataset is clean enough for prediction/classification.

Let's compute skewness and kurtosis to spot outliers.

In [None]:
df.head()

In [None]:
(df['Levy']=='-').sum()
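
The count above treats `-` as a placeholder for a missing levy. As a minimal sketch (on a toy Series that copies the `Levy` values from the preview, not on the full dataset), `pd.to_numeric` with `errors="coerce"` turns such placeholders into NaN while parsing the remaining values as numbers. The variable names are illustrative.

```python
import pandas as pd

# Toy sample mirroring the Levy values shown in the preview (assumption: "-" marks a missing levy).
levy = pd.Series(["1399", "1018", "-", "862", "446"], name="Levy")

# Count the placeholder, as in the cell above.
n_placeholder = (levy == "-").sum()

# Values that cannot be parsed as numbers (here "-") become NaN.
levy_numeric = pd.to_numeric(levy, errors="coerce")

print(n_placeholder)
print(levy_numeric)
```

Once the column is numeric, `isna().sum()` reports these entries as missing, which the earlier null checks could not do while the placeholder was still a string.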


In [None]:
df['Gear box type'].unique()

In [None]:
def toFloat(x):
    return float(x)

def removeKm(x):
    # strip the trailing " km" suffix (3 characters) before converting
    return toFloat(x[:-3])

# treat empty strings as missing values
df.replace({'':np.nan}, inplace = True)

# convert Mileage (e.g. "186005 km") to float
df['Mileage'] = df['Mileage'].map(removeKm)

# convert Price to float
df['Price'] = df['Price'].map(toFloat)

# convert the production year to float
df['Prod. year'] = df['Prod. year'].map(toFloat)
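
The mileage conversion above can also be sketched with pandas' vectorized string methods. This assumes every value ends with the literal suffix ` km`, as in the preview; the toy Series below stands in for `df['Mileage']`.

```python
import pandas as pd

# Toy sample copied from the Mileage column of the preview.
mileage = pd.Series(["186005 km", "192000 km", "200000 km"], name="Mileage")

# Remove the literal " km" suffix, then convert the whole column at once.
mileage_km = mileage.str.replace(" km", "", regex=False).astype(float)

print(mileage_km)
```

Unlike slicing off the last three characters, this only removes the text ` km`, so a value without the suffix would not silently lose digits.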



Let's assign an integer identifier to each distinct value of certain features.


We convert each array of unique values into a dictionary to obtain __key-value__ pairs.

__The Model feature__:

In [None]:
my_array_model = df['Model'].unique()
my_array_model

In [None]:
my_dict_model = {}
for i ,x in enumerate(my_array_model):
    variable = {x:i+1}
    my_dict_model.update(variable)
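
Since the same loop is repeated for every feature, it could be factored into a small helper. The sketch below uses a hypothetical function name, `build_mapping`, and a toy Series; it produces the same 1-based codes in order of first appearance.

```python
import pandas as pd

def build_mapping(series):
    """Map each distinct value to a 1-based integer, in order of first appearance."""
    return {value: code for code, value in enumerate(series.unique(), start=1)}

# Toy sample copied from the Model column of the preview.
sample = pd.Series(["RX 450", "Equinox", "FIT", "Escape", "FIT"], name="Model")
mapping = build_mapping(sample)

print(mapping)
```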
    

__The Levy feature__:

In [None]:
my_array_levy = df['Levy'].unique()


In [None]:
my_dict_levy = {}
for i ,x in enumerate(my_array_levy):
    variable = {x:i+1}
    my_dict_levy.update(variable)
    

__The Manufacturer feature__:

In [None]:
my_array_manufacturer = df['Manufacturer'].unique()


In [None]:
my_dict_manufacturer = {}
for i ,x in enumerate(my_array_manufacturer):
    variable = {x:i+1}
    my_dict_manufacturer.update(variable)
    

__The Category feature__:

In [None]:
my_array_category = df['Category'].unique()
my_array_category

In [None]:
my_dict_category = {}
for i ,x in enumerate(my_array_category):
    variable = {x:i+1}
    my_dict_category.update(variable)
print(my_dict_category)

__The Leather interior feature__:

In [None]:
my_array_leather = df['Leather interior'].unique()
my_array_leather

In [None]:
my_dict_leather = {}
for i ,x in enumerate(my_array_leather):
    variable = {x:i+1}
    my_dict_leather.update(variable)
print(my_dict_leather)

__The Fuel type feature__:

In [None]:
my_array_fuel = df['Fuel type'].unique()
my_array_fuel

In [None]:
my_dict_fuel = {}
for i ,x in enumerate(my_array_fuel):
    variable = {x:i+1}
    my_dict_fuel.update(variable)
print(my_dict_fuel)

__The Engine volume feature__:

In [None]:
my_array_engine = df['Engine volume'].unique()

In [None]:
my_dict_engine = {}
for i ,x in enumerate(my_array_engine):
    variable = {x:i+1}
    my_dict_engine.update(variable)

__The Gear box type feature__:

In [None]:
my_array_gearbox = df['Gear box type'].unique()
my_array_gearbox

In [None]:
my_dict_gearbox = {}
for i ,x in enumerate(my_array_gearbox):
    variable = {x:i+1}
    my_dict_gearbox.update(variable)
print(my_dict_gearbox)

__The Drive wheels feature__:

In [None]:
my_array_driverwheel = df['Drive wheels'].unique()
my_array_driverwheel

In [None]:
my_dict_driverwheel = {}
for i ,x in enumerate(my_array_driverwheel):
    variable = {x:i+1}
    my_dict_driverwheel.update(variable)
print(my_dict_driverwheel)

__The Doors feature__:

In [None]:
my_array_doors = df['Doors'].unique()
my_array_doors

In [None]:
my_dict_doors = {}
for i ,x in enumerate(my_array_doors):
    variable = {x:i+1}
    my_dict_doors.update(variable)
print(my_dict_doors)

__The Wheel feature__:

In [None]:
my_array_wheel = df['Wheel'].unique()
my_array_wheel

In [None]:
my_dict_wheel = {}
for i ,x in enumerate(my_array_wheel):
    variable = {x:i+1}
    my_dict_wheel.update(variable)
print(my_dict_wheel)

__The Color feature__:

In [None]:
my_array_color = df['Color'].unique()
my_array_color

In [None]:
my_dict_color = {}
for i ,x in enumerate(my_array_color):
    variable = {x:i+1}
    my_dict_color.update(variable)
print(my_dict_color)

Now, let's replace the values in each column with those from the corresponding dictionary defined above.

In [None]:
df['Manufacturer'] = df['Manufacturer'].map(my_dict_manufacturer)
df['Levy'] = df['Levy'].map(my_dict_levy)
df['Model'] = df['Model'].map(my_dict_model)
df['Category'] = df['Category'].map(my_dict_category)
df['Leather interior'] = df['Leather interior'].map(my_dict_leather)
df['Fuel type'] = df['Fuel type'].map(my_dict_fuel)
df['Engine volume'] = df['Engine volume'].map(my_dict_engine)
df['Gear box type'] = df['Gear box type'].map(my_dict_gearbox)
df['Drive wheels'] = df['Drive wheels'].map(my_dict_driverwheel)
df['Doors'] = df['Doors'].map(my_dict_doors)
df['Wheel'] = df['Wheel'].map(my_dict_wheel)
df['Color'] = df['Color'].map(my_dict_color)
df['Prod. year'] = df['Prod. year'].map(lambda x : toFloat(x))
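
The dictionary-plus-`map` encoding above is a form of label encoding. As a hedged sketch on a toy Series (not the full dataset), `pd.factorize` yields the same codes in one call; it numbers values from 0 in order of first appearance, so 1 is added to match the dictionaries.

```python
import pandas as pd

# Toy sample copied from the Category column of the preview.
category = pd.Series(["Jeep", "Jeep", "Hatchback", "Jeep", "Hatchback"], name="Category")

# factorize returns 0-based integer codes and the array of distinct values.
codes, uniques = pd.factorize(category)
encoded = pd.Series(codes + 1, index=category.index, name="Category")

# The equivalent dictionary, for reuse when encoding new inputs.
mapping = {value: code for code, value in enumerate(uniques, start=1)}

print(encoded.tolist())
print(mapping)
```

Note that integer codes impose an arbitrary order on the categories, which a linear model will interpret as a magnitude; one-hot encoding (for example with `pd.get_dummies`) is a common alternative when that matters.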

In [None]:
df.head()

In [None]:
df.dtypes

Let's drop the ID column.

In [None]:
df.drop('ID', inplace=True, axis=1)

We can verify this in the list of columns below.

In [None]:
columns = df.columns.tolist()
columns

In [None]:
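n_rows = len(columns)
plt.figure(figsize=(16, 4 * n_rows))

# one row per column: density on the left, box plot on the right
for i, col in enumerate(columns):
    plt.subplot(n_rows, 2, 2 * i + 1)
    sns.kdeplot(df[col], color = '#d1aa00', fill = True, warn_singular=False)
    plt.subplot(n_rows, 2, 2 * i + 2)
    df[col].plot.box()
plt.tight_layout()
plt.show()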


In [None]:
pd.DataFrame(data=[df[columns].skew(),df[columns].kurtosis()],index=['skewness','kurtosis'])
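
To help read that table: pandas' `skew()` is close to 0 for a symmetric distribution and positive when a long right tail is present, while `kurtosis()` reports excess kurtosis (0 for a normal distribution). The toy Series below are made up for illustration only.

```python
import pandas as pd

# Illustrative data: one symmetric sample, one with a single large value on the right.
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 100.0])

summary = pd.DataFrame(
    {
        "symmetric": [symmetric.skew(), symmetric.kurtosis()],
        "right_skewed": [right_skewed.skew(), right_skewed.kurtosis()],
    },
    index=["skewness", "kurtosis"],
)

print(summary)
```

Columns with large positive skewness and kurtosis, as one would expect for `Price` given its maximum of 26307500 against a median of 13172, are the ones most likely to contain extreme values.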

In [None]:
from scipy.stats import zscore
# count, for each column, the rows lying at least 3 standard deviations from the mean
for i in columns:
    y_outliers = df[abs(zscore(df[i])) >= 3 ]
    print('Number of outliers in', i, 'is', len(y_outliers))
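
The loop above only counts outliers. A minimal sketch of the next step, filtering them out, is shown below on toy prices (five values from the preview repeated, plus the dataset maximum). The z-score is computed with pandas using the population standard deviation (`ddof=0`), which matches the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Toy prices: ordinary values repeated, plus one extreme value (the maximum from describe()).
prices = pd.Series([13328, 16621, 8467, 3607, 11726] * 4 + [26307500], dtype=float)

# z-score with the population standard deviation, as scipy.stats.zscore does by default.
z = (prices - prices.mean()) / prices.std(ddof=0)

# Flag values at least 3 standard deviations from the mean, then split the data.
is_outlier = z.abs() >= 3
outliers = prices[is_outlier]
kept = prices[~is_outlier]

print(len(outliers), len(kept))
```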
    


In [None]:
df['Price'].describe()

In [None]:
def price_class(Price):
    if (Price >=1.000000e+00 and Price<=6576876.0):
        return "Moins Chèr"
    elif (Price > 6576876.0 and Price<=13153751.0):
        return "Moyenne"
    elif (Price >13153751.0  and Price<=19730626.0):
        return "Chèr"
    else:
        return "Très Chèr"

df['Classe'] = df['Price'].apply(price_class)
df.sample(frac=1).head(5)
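
The thresholds in `price_class` split the price range into four bands of roughly equal width (quarters of the maximum, 26307500). The same binning can be sketched with `pd.cut`, here on toy prices; the bin edges are copied from the function, and the sketch assumes all prices are at least 1, as reported by `describe()`.

```python
import numpy as np
import pandas as pd

# Toy prices chosen to fall in each band.
prices = pd.Series([1.0, 13328.0, 7_000_000.0, 15_000_000.0, 26_307_500.0])

# Right-closed intervals: (0, 6576876], (6576876, 13153751], (13153751, 19730626], (19730626, inf)
bins = [0, 6_576_876, 13_153_751, 19_730_626, np.inf]
labels = ["Moins Chèr", "Moyenne", "Chèr", "Très Chèr"]

classes = pd.cut(prices, bins=bins, labels=labels)

print(classes.tolist())
```

Because the 75th percentile of `Price` is 22075, at least three quarters of the cars fall into the first band with these edges. Quantile-based bins (for example `pd.qcut`) would give more balanced classes, if that is what the classification needs.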

Let's drop a few columns.

In [None]:
df.drop(columns=['Doors', 'Wheel', 'Leather interior', 'Levy'], inplace=True)

In [None]:
columns = df.columns.tolist()
columns

In [None]:
df = df.dropna()

In [None]:
df = df.drop_duplicates()
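
Keep in mind that `dropna()` and `drop_duplicates()` return a new DataFrame and leave the original untouched unless the result is assigned (or `inplace=True` is passed). A small sketch on made-up data:

```python
import pandas as pd

# Made-up frame with one duplicated row and one row containing a missing value.
df_demo = pd.DataFrame({"Price": [100.0, 100.0, None], "Airbags": [4, 4, 2]})

# Calling the method without assignment does not change df_demo.
df_demo.dropna()
rows_after_unassigned_call = len(df_demo)

# Assigning the result keeps the cleaned data.
cleaned = df_demo.dropna().drop_duplicates().reset_index(drop=True)

print(rows_after_unassigned_call, len(cleaned))
```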

In [None]:
df['Classe'].where(df['Classe']=="Chèr")

In [None]:
# restrict the correlation matrix to numeric columns (the 'Classe' labels are strings)
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True, cmap='coolwarm')
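
A correlation matrix is only defined for numeric columns, and the `Classe` column holds string labels. The sketch below, on made-up rows, shows one way to exclude non-numeric columns before calling `corr()`; recent pandas versions raise an error when `corr()` meets a string column.

```python
import pandas as pd

# Made-up rows: two numeric columns and one string column, as in df after adding 'Classe'.
demo = pd.DataFrame(
    {
        "Price": [13328.0, 16621.0, 8467.0, 3607.0],
        "Airbags": [12, 8, 2, 0],
        "Classe": ["Moins Chèr"] * 4,
    }
)

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = demo.select_dtypes(include="number").corr()

print(corr)
```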