<b>Objective:</b>

Design and evaluate a <font color='red'>pipeline</font> that predicts house prices from a set of features.

<b>Approach:</b>
1. Data preparation
2. Pipeline design
3. Pipeline configuration
4. Hyperparameter tuning
5. Performance report
6. Price prediction for a new house

<b>1. Data preparation</b>

Load the data from a CSV file

In [61]:
import pandas as pd

df_maisons=pd.read_csv('maisons.csv')
df_maisons.head()

,surface,nb_chambre,type,prix,cher
0,100,3,normal,300,0
1,150,4,haut standing,500,1
2,120,3,normal,400,0
3,80,2,normal,250,0
4,200,5,haut standing,600,1


Split the data into inputs (X) and output (y)

In [62]:
X=df_maisons[['surface','nb_chambre','type']]
y=df_maisons['prix']

In [63]:
X

,surface,nb_chambre,type
0,100,3,normal
1,150,4,haut standing
2,120,3,normal
3,80,2,normal
4,200,5,haut standing
5,110,3,normal
6,130,4,normal
7,90,2,normal
8,70,2,normal
9,180,4,haut standing


<b>2. Pipeline design</b>

<b>Pipeline step 1:</b> Encoding the categorical features

Build the preprocessor

In [64]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

features_cat = ['type']
preprocessor = ColumnTransformer(
                                    transformers=[
                                        ('cat', OneHotEncoder(), features_cat),
                                    ],
                                    remainder='passthrough'
                                )
preprocessor
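One caveat with the default `OneHotEncoder()`: it raises an error at transform time when it meets a category that was absent during `fit`. This can happen during cross-validation if a training fold happens to contain no `haut standing` house. A minimal sketch of the `handle_unknown='ignore'` option follows; the category `'luxe'` is hypothetical and does not appear in the original data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Fit on the two categories present in the notebook's data.
train = pd.DataFrame({'type': ['normal', 'haut standing']})
# 'luxe' is a hypothetical category, used only to trigger the unseen-category case.
unseen = pd.DataFrame({'type': ['luxe']})

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# An unknown category is encoded as a row of zeros instead of raising an error.
encoded_unseen = encoder.transform(unseen).toarray()
print(encoded_unseen)
```

Whether silently encoding an unknown type as all zeros is acceptable depends on the application; it is one option, not a requirement of the pipeline.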

In [65]:
#test
preprocessor.fit(X)
preprocessor.transform(X)

array([[  0.,   1., 100.,   3.],
       [  1.,   0., 150.,   4.],
       [  0.,   1., 120.,   3.],
       [  0.,   1.,  80.,   2.],
       [  1.,   0., 200.,   5.],
       [  0.,   1., 110.,   3.],
       [  0.,   1., 130.,   4.],
       [  0.,   1.,  90.,   2.],
       [  0.,   1.,  70.,   2.],
       [  1.,   0., 180.,   4.],
       [  0.,   1.,  95.,   2.],
       [  0.,   1., 140.,   3.],
       [  0.,   1.,  75.,   2.],
       [  1.,   0., 160.,   4.],
       [  0.,   1.,  85.,   2.],
       [  0.,   1., 105.,   3.],
       [  1.,   0., 195.,   5.],
       [  0.,   1., 125.,   3.],
       [  1.,   0., 165.,   4.],
       [  0.,   1., 115.,   3.]])

Integrate the preprocessor into a pipeline

In [66]:
from sklearn.pipeline import Pipeline

#pipeline
pipeline = Pipeline(steps=[
                        ('preprocessor', preprocessor),
])

#test
pipeline.fit(X)
pipeline.transform(X)

array([[  0.,   1., 100.,   3.],
       [  1.,   0., 150.,   4.],
       [  0.,   1., 120.,   3.],
       [  0.,   1.,  80.,   2.],
       [  1.,   0., 200.,   5.],
       [  0.,   1., 110.,   3.],
       [  0.,   1., 130.,   4.],
       [  0.,   1.,  90.,   2.],
       [  0.,   1.,  70.,   2.],
       [  1.,   0., 180.,   4.],
       [  0.,   1.,  95.,   2.],
       [  0.,   1., 140.,   3.],
       [  0.,   1.,  75.,   2.],
       [  1.,   0., 160.,   4.],
       [  0.,   1.,  85.,   2.],
       [  0.,   1., 105.,   3.],
       [  1.,   0., 195.,   5.],
       [  0.,   1., 125.,   3.],
       [  1.,   0., 165.,   4.],
       [  0.,   1., 115.,   3.]])

<b>Pipeline step 2:</b> Standardization

In [67]:
from sklearn.preprocessing import StandardScaler

#pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler())
])

#test
pipeline.fit(X)
pipeline.transform(X)

array([[-0.65465367,  0.65465367, -0.63369722, -0.15575224],
       [ 1.52752523, -1.52752523,  0.65956241,  0.88259602],
       [-0.65465367,  0.65465367, -0.11639337, -0.15575224],
       [-0.65465367,  0.65465367, -1.15100108, -1.1941005 ],
       [ 1.52752523, -1.52752523,  1.95282205,  1.92094429],
       [-0.65465367,  0.65465367, -0.37504529, -0.15575224],
       [-0.65465367,  0.65465367,  0.14225856,  0.88259602],
       [-0.65465367,  0.65465367, -0.89234915, -1.1941005 ],
       [-0.65465367,  0.65465367, -1.409653  , -1.1941005 ],
       [ 1.52752523, -1.52752523,  1.4355182 ,  0.88259602],
       [-0.65465367,  0.65465367, -0.76302319, -1.1941005 ],
       [-0.65465367,  0.65465367,  0.40091049, -0.15575224],
       [-0.65465367,  0.65465367, -1.28032704, -1.1941005 ],
       [ 1.52752523, -1.52752523,  0.91821434,  0.88259602],
       [-0.65465367,  0.65465367, -1.02167511, -1.1941005 ],
       [-0.65465367,  0.65465367, -0.50437126, -0.15575224],
       [ 1.52752523, -1.
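To make the standardization step concrete, here is a short check that `StandardScaler` subtracts the column mean and divides by the population standard deviation. It uses only the `surface` values of the first ten houses, so the numbers differ from the full-dataset output above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# 'surface' values of the first ten houses shown earlier.
surface = np.array([100, 150, 120, 80, 200, 110, 130, 90, 70, 180],
                   dtype=float).reshape(-1, 1)

scaled = StandardScaler().fit_transform(surface)

# Same computation by hand: subtract the mean, divide by the population std (ddof=0).
manual = (surface - surface.mean(axis=0)) / surface.std(axis=0)

print(np.allclose(scaled, manual))
```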

<b>Pipeline step 3:</b> Dimensionality reduction

In [68]:
from sklearn.decomposition import PCA

#pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2))
])

# test
pipeline.fit(X)
pipeline.transform(X)

array([[-1.05473753,  0.29252999],
       [ 2.30275018, -0.70824355],
       [-0.79444247,  0.50322004],
       [-1.82269218, -0.53399806],
       [ 3.46114743,  0.43431956],
       [-0.92459   ,  0.39787502],
       [-0.15663535,  1.22440307],
       [-1.69254465, -0.42865304],
       [-1.95283971, -0.63934308],
       [ 2.69319278, -0.39220849],
       [-1.62747088, -0.37598053],
       [-0.5341474 ,  0.71391008],
       [-1.88776595, -0.58667057],
       [ 2.43289771, -0.60289853],
       [-1.75761842, -0.48132555],
       [-0.98966376,  0.3452025 ],
       [ 3.39607366,  0.38164705],
       [-0.7293687 ,  0.55589255],
       [ 2.49797148, -0.55022602],
       [-0.85951623,  0.45054753]])
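One way to judge how much information the two retained components keep is to read `explained_variance_ratio_` from the fitted PCA step. The sketch below rebuilds the three-step pipeline on the first ten rows of `X` only, so the ratios it prints are illustrative rather than those of the full dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# First ten rows of X as displayed earlier (the full dataset has more rows).
X_sample = pd.DataFrame({
    'surface': [100, 150, 120, 80, 200, 110, 130, 90, 70, 180],
    'nb_chambre': [3, 4, 3, 2, 5, 3, 4, 2, 2, 4],
    'type': ['normal', 'haut standing', 'normal', 'normal', 'haut standing',
             'normal', 'normal', 'normal', 'normal', 'haut standing'],
})

pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(
        transformers=[('cat', OneHotEncoder(), ['type'])],
        remainder='passthrough')),
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
])
pipeline.fit(X_sample)

# Fraction of the total variance captured by each retained component.
ratios = pipeline.named_steps['pca'].explained_variance_ratio_
print(ratios, ratios.sum())
```

Because the two one-hot columns are exact complements of each other, the four standardized columns span at most three dimensions, so a fourth component would carry no variance.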

<b>Pipeline step 4:</b> Regression

In [69]:
from sklearn.linear_model import SGDRegressor

#pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)), 
    ('regressor', SGDRegressor(
                    max_iter=10000, 
                    tol=1e-3,
                    learning_rate='constant',
                    eta0=0.00001,
                    loss='squared_error'))
])

#test
pipeline.fit(X,y)
pipeline.predict(X)



array([278.0902994 , 448.53041209, 296.58761456, 219.76130312,
       534.60538112, 287.33895698, 345.66795327, 229.0099607 ,
       210.51264554, 476.27638483, 233.63428949, 315.08492972,
       215.13697433, 457.77906967, 224.38563191, 282.71462819,
       529.98105233, 301.21194335, 462.40339846, 291.96328577])

<b>3. Pipeline configuration</b>

Prepare the pipeline (leaving out the hyperparameters that will be tuned)

In [70]:
pipeline = Pipeline(steps = [
                                ('preprocessor', preprocessor),
                                ('scaler', StandardScaler()),
                                ('pca', PCA()), 
                                ('regressor', SGDRegressor(
                                                    max_iter=10000, 
                                                    learning_rate='constant',
                                                    loss='squared_error')
                                )
                            ]
                    )

#test
pipeline.fit(X,y)
pipeline.predict(X)

array([319.43574476, 469.82262227, 364.77193922, 259.5583137 ,
       597.70434502, 342.10384199, 401.98127304, 282.22641093,
       236.89021647, 537.82691396, 293.56045955, 410.10813368,
       248.22426509, 492.4907195 , 270.89236232, 330.76979337,
       586.3702964 , 376.10598783, 503.82476811, 353.4378906 ])

Specify the hyperparameters to tune and their candidate values

In [71]:
hyperparam_grille = {
    'pca__n_components': [2,3],
    'regressor__eta0': [0.1, 0.01, 0.001, 0.0001, 0.00001],
    'regressor__tol' : [0.01, 0.001],
}
hyperparam_grille

{'pca__n_components': [2, 3],
 'regressor__eta0': [0.1, 0.01, 0.001, 0.0001, 1e-05],
 'regressor__tol': [0.01, 0.001]}

<b>4. Hyperparameter tuning</b>

Define the performance metric (scoring) used during tuning

In [72]:
from sklearn.metrics import r2_score, make_scorer
scoring=make_scorer(r2_score, greater_is_better=True)

scoring

make_scorer(r2_score)
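For reference, the R² score compares the squared prediction errors with the spread of the targets around their mean: R² = 1 − SS_res / SS_tot. The sketch below recomputes it by hand from the prices of the first five houses and the first five predictions of the fixed-hyperparameter pipeline shown earlier, then compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Prices of the first five houses and the matching first five predictions
# printed earlier for the pipeline with fixed hyperparameters.
y_true = np.array([300, 500, 400, 250, 600], dtype=float)
y_pred = np.array([278.0902994, 448.53041209, 296.58761456,
                   219.76130312, 534.60538112])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

scikit-learn also accepts the built-in string `scoring='r2'`, which selects the same metric without calling `make_scorer`.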

Configure the search used to tune the hyperparameters by specifying:
- Pipeline
- Hyperparameters (candidate values)
- Cross-validation (number of folds, K)
- Performance metric (scoring)

In [73]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(pipeline,
                           hyperparam_grille,
                           cv=3,
                           scoring=scoring,  # scorer defined in the previous cell
                           verbose=0)
grid_search

Run the hyperparameter tuning

In [74]:
grid_search.fit(X, y)



<b>5. Performance report</b>

Best pipeline

In [75]:
# Best pipeline
print("Best hyperparameters found:")
print(grid_search.best_params_)
print("Best cross-validation score:")
print(grid_search.best_score_)

Best hyperparameters found:
{'pca__n_components': 3, 'regressor__eta0': 0.1, 'regressor__tol': 0.01}
Best cross-validation score:
0.9526742631912932


Detailed report

In [76]:
# all candidate pipelines, sorted by rank
results = pd.DataFrame(grid_search.cv_results_)
results.set_index('rank_test_score').sort_index()[['mean_test_score','params']]

rank_test_score,mean_test_score,params
1,0.952674,"{'pca__n_components': 3, 'regressor__eta0': 0...."
2,0.946857,"{'pca__n_components': 3, 'regressor__eta0': 0...."
3,0.946813,"{'pca__n_components': 3, 'regressor__eta0': 0...."
4,0.946109,"{'pca__n_components': 3, 'regressor__eta0': 0...."
5,0.945059,"{'pca__n_components': 3, 'regressor__eta0': 0...."
6,0.942884,"{'pca__n_components': 3, 'regressor__eta0': 0...."
7,0.937265,"{'pca__n_components': 3, 'regressor__eta0': 0...."
8,0.927133,"{'pca__n_components': 2, 'regressor__eta0': 0...."
9,0.926382,"{'pca__n_components': 2, 'regressor__eta0': 0...."
10,0.925452,"{'pca__n_components': 2, 'regressor__eta0': 0...."


<b>6. Predicting the price of a new house with the best pipeline</b>

In [77]:
# Best pipeline
grid_search.best_estimator_

In [82]:
# new house
x_new=pd.DataFrame([[400,4,'haut standing']], 
                   columns=['surface','nb_chambre','type'])
x_new

,surface,nb_chambre,type
0,400,4,haut standing


In [83]:
grid_search.best_estimator_.predict(x_new)

array([1067.527562])