<b>Objective:</b>

Design and evaluate an <font color='red'>optimal pipeline</font> for predicting house prices from their characteristics.

<b>Approach:</b>
1. Data preparation
2. Pipeline design
3. Pipeline configuration
4. Hyperparameter tuning
5. Visualizing the performance report
6. Predicting the price of a new house

<b>1. Data preparation</b>

Load the data from an Excel file

In [1]:
import pandas as pd

df_main = pd.read_excel("data/data_jour4.xlsx", sheet_name='data_etiquetée')
df_main.head()

Unnamed: 0,id_produit,id_temps,puht,quantite_vendue,date,categorie,nom,Classe
0,1,1,1000,15,2023-01-01,smartphone,iPhone,Non
1,1,2,1100,14,2023-02-01,smartphone,iPhone,Oui
2,1,3,1200,16,2023-03-01,smartphone,iPhone,Oui
3,1,4,1250,15,2023-04-01,smartphone,iPhone,Oui
4,1,5,1300,16,2023-05-01,smartphone,iPhone,Oui


Split the data into inputs (X) and output (y)

In [2]:
X = df_main.drop(['Classe', 'id_produit', 'id_temps'], axis=1)
y = df_main['Classe']

In [3]:
X.head()

Unnamed: 0,puht,quantite_vendue,date,categorie,nom
0,1000,15,2023-01-01,smartphone,iPhone
1,1100,14,2023-02-01,smartphone,iPhone
2,1200,16,2023-03-01,smartphone,iPhone
3,1250,15,2023-04-01,smartphone,iPhone
4,1300,16,2023-05-01,smartphone,iPhone


In [4]:
y.head()

0    Non
1    Oui
2    Oui
3    Oui
4    Oui
Name: Classe, dtype: object

Split into train and test sets

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=45, stratify=y)
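
The `stratify=y` argument keeps the proportion of each class roughly the same in the train and test subsets. A minimal sketch on made-up labels (the toy data and variable names below are hypothetical, not the notebook's):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 60% "Oui", 40% "Non"
toy_y = pd.Series(["Oui"] * 60 + ["Non"] * 40)
toy_X = pd.DataFrame({"feature": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, train_size=0.7, random_state=45, stratify=toy_y
)

# Class proportions in each subset stay close to the original 60/40
train_ratio = y_tr.value_counts(normalize=True)
test_ratio = y_te.value_counts(normalize=True)
print(train_ratio)
print(test_ratio)
```

Without `stratify`, a small dataset can end up with noticeably different class balances in the two subsets, which makes the test score harder to interpret.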

<b>2. Pipeline design</b>

<b>Data processing (preprocessor):</b>

The first component of the pipeline transforms the input data (X) into purely numeric data.

In our case, there are 3 non-numeric columns:
- date: a date, or a character string in a date format,
- nom: a character string that takes a finite number of discrete values (product names),
- categorie: a character string that takes a finite number of discrete values (product categories).

We can therefore define the following transformations on the input data (X):
- Transform the 'date' column into a 'month' column
- Encode the 'nom' and 'categorie' columns with One Hot Encoding (OHE)
- Leave the remaining columns unchanged
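
One-hot encoding replaces a categorical column with one 0/1 column per distinct value. A minimal sketch on a hypothetical toy column (not the notebook's data):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column, for illustration only
toy = pd.DataFrame({"nom": ["iPhone", "Samsung", "iPhone"]})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for display
encoded = encoder.fit_transform(toy[["nom"]]).toarray()

print(encoder.categories_)  # learned categories, in sorted order
print(encoded)              # one row per sample, one column per category
```

Each row contains exactly one 1, in the column of its category.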

Date-to-month transformer
- Input: 'date' column (pd.Series, 1D shape (N,))
- Output: 'month' column (np.ndarray, 2D shape (N, 1))

In [6]:
# Definition of month_transformer:

# Role: extracts the month from the date

from sklearn.preprocessing import FunctionTransformer

def extract_month(date_series):
    import pandas as pd
    return pd.to_datetime(date_series).dt.month.values.reshape(-1,1)

month_transformer = FunctionTransformer(extract_month, validate=False)

In [7]:
# Test the month transformer:
mois=month_transformer.fit_transform(X_train['date'])
mois

array([[ 8],
       [ 6],
       [ 4],
       [12],
       [ 6],
       [ 3],
       [10],
       [ 3],
       [12],
       [ 7],
       [10],
       [ 2],
       [ 6],
       [ 2],
       [10],
       [ 9],
       [11],
       [ 9],
       [ 5],
       [ 6],
       [11],
       [ 3],
       [ 7],
       [ 7],
       [12],
       [ 9],
       [ 2],
       [ 7],
       [11],
       [10],
       [ 8],
       [ 2],
       [ 8],
       [10],
       [ 3],
       [ 1],
       [11],
       [11],
       [ 5],
       [ 5],
       [ 6],
       [ 2],
       [ 9],
       [ 5]], dtype=int32)

In [8]:
# Definition of montant_transformer (amount = unit price * quantity sold):

def calculer_montant(X1):
    return (X1[:,0]*X1[:,1]).reshape(-1,1)

montant_transformer = FunctionTransformer(calculer_montant, validate=True)

In [9]:
# Test
montant_transformer.fit_transform(X_train[['puht','quantite_vendue']])

array([[15500],
       [23800],
       [18750],
       [ 9450],
       [14300],
       [19200],
       [13650],
       [11700],
       [11500],
       [ 1260],
       [10500],
       [15400],
       [  800],
       [ 5850],
       [ 8550],
       [ 9500],
       [12600],
       [28050],
       [  570],
       [10200],
       [37000],
       [10500],
       [16500],
       [ 9000],
       [16100],
       [ 7200],
       [13750],
       [22500],
       [12100],
       [19550],
       [24800],
       [  960],
       [10450],
       [11000],
       [  680],
       [  750],
       [22500],
       [10000],
       [ 8800],
       [11200],
       [13050],
       [ 8550],
       [12000],
       [20800]])
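
`calculer_montant` indexes its input with `X1[:, 0]`, which is NumPy syntax. That works because `validate=True` makes `FunctionTransformer` convert the incoming DataFrame to a 2D NumPy array before calling the function. A minimal sketch of this behavior on hypothetical toy values (the function and variable names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def multiply_columns(arr):
    # arr is a 2D NumPy array here because validate=True converts the input
    return (arr[:, 0] * arr[:, 1]).reshape(-1, 1)

# Hypothetical toy values, for illustration only
toy = pd.DataFrame({"puht": [1000, 200], "quantite_vendue": [2, 3]})

amount_transformer = FunctionTransformer(multiply_columns, validate=True)
amounts = amount_transformer.fit_transform(toy)
print(amounts)  # one amount per row, as a column vector
```

With `validate=False`, the DataFrame would be passed through unchanged and the positional slice `arr[:, 0]` would fail, so a function written for DataFrames would need `.iloc` or column names instead.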

In [26]:
# Combine the transformers into a single preprocessor

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

preprocessor = ColumnTransformer(
                    transformers=[
                        ('date_to_month_trsf', month_transformer, 'date'),
                        ('ohe_categorie_produit_trsf', OneHotEncoder(), ['categorie','nom']),
                        # ('montant', montant_transformer, ['puht', 'quantite_vendue'])
                    ],
                    remainder='passthrough'
)

In [27]:
# Integrate the preprocessor into a pipeline

from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor)
])

In [28]:
# Display the pipeline
pipeline

In [13]:
# Test the pipeline
pd.DataFrame(pipeline.fit_transform(X_train))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,8.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1550.0,10.0
1,6.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1400.0,17.0
2,4.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1250.0,15.0
3,12.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1050.0,9.0
4,6.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,650.0,22.0
5,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1200.0,16.0
6,10.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1050.0,13.0
7,3.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1300.0,9.0
8,12.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1150.0,10.0
9,7.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,210.0,6.0
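
In the output above, column 0 is the month, columns 1 to 11 are the one-hot columns, and the last two columns are the untouched `puht` and `quantite_vendue`: `ColumnTransformer` concatenates the results in the order of the `transformers` list and appends the `remainder` columns last. A small sketch of that ordering on hypothetical toy rows (`sparse_threshold=0` is added here only to force a dense result for printing):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

def to_month(dates):
    # Same idea as extract_month: Series of dates -> (N, 1) array of months
    return pd.to_datetime(dates).dt.month.values.reshape(-1, 1)

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    "puht": [1000, 300],
    "quantite_vendue": [15, 4],
    "date": ["2023-01-01", "2023-06-01"],
    "nom": ["iPhone", "Samsung"],
})

ct = ColumnTransformer(
    transformers=[
        ("month", FunctionTransformer(to_month, validate=False), "date"),
        ("ohe", OneHotEncoder(), ["nom"]),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

out = ct.fit_transform(toy)
# Column order: month | one-hot(nom) | puht | quantite_vendue
print(out)
```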


<b>Pipeline step 4:</b> logistic regression (classification)

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Full pipeline: preprocessing, scaling, then classification
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
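
The step names given to the pipeline ('scaler', 'classifier') also act as prefixes for hyperparameters: a parameter of a step is addressed as `<step name>__<parameter>`. A minimal sketch of how the hyperparameter-tuning step of the approach could look, assuming a grid over the regularization strength `C` and synthetic data from `make_classification` (both are illustrative choices, not taken from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, for illustration only
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# "classifier__C" targets the C parameter of the step named "classifier"
param_grid = {"classifier__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(toy_pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so no information from the validation fold leaks into the scaling.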

In [15]:
# Display the pipeline
pipeline


In [16]:
# Test the final pipeline
pipeline.fit(X_train,y_train)
y_pred=pipeline.predict(X_test)
y_pred

array(['Non', 'Oui', 'Oui', 'Oui', 'Non', 'Non', 'Oui', 'Oui', 'Non',
       'Oui', 'Oui', 'Oui', 'Oui', 'Oui', 'Non', 'Non', 'Non', 'Oui',
       'Non', 'Oui'], dtype=object)

In [17]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test,y_pred)

0.85
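
Accuracy alone does not show which class the errors fall on. A sketch of a fuller report with `confusion_matrix` and `classification_report`, using hypothetical labels rather than the notebook's `y_test` and `y_pred`:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels, for illustration only
y_true_toy = ["Oui", "Oui", "Non", "Non", "Oui", "Non"]
y_pred_toy = ["Oui", "Non", "Non", "Non", "Oui", "Oui"]

labels = ["Non", "Oui"]
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)

# Rows are true classes, columns are predicted classes, in the order of `labels`
print(cm)
print(accuracy_score(y_true_toy, y_pred_toy))
print(classification_report(y_true_toy, y_pred_toy, labels=labels))
```

The same three calls can be applied to `y_test` and `y_pred` from the cells above.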

Saving the pipeline

In [18]:
import dill

# Serialize the fitted pipeline to disk with dill
with open('pipeline.pkl', 'wb') as f:
    dill.dump(pipeline, f)
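
The standard `pickle` module stores a function as a reference to where it can be imported from, not as code, so a pipeline that embeds a function defined in the notebook (such as `extract_month`) can only be reloaded where that function is importable. `dill` can serialize such functions by value, which is presumably why it is used here. A small sketch of the underlying `pickle` behavior, using only the standard library:

```python
import math
import pickle

# A function importable from a module is pickled as a reference
# (module name + function name), not as code.
payload = pickle.dumps(math.sqrt)
print(payload)

# A function with no importable name (here a lambda) cannot be pickled
# by the standard library at all.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False
print(lambda_pickled)
```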

In [19]:
list(X.loc[0,:])

[np.int64(1000),
 np.int64(15),
 Timestamp('2023-01-01 00:00:00'),
 'smartphone',
 'iPhone']

In [20]:
# Test: a new sale to classify
df_vente_new = pd.DataFrame(
    [[1000, 15, '2023-01-01', 'smartphone', 'Samsung']],
    columns=['puht', 'quantite_vendue', 'date', 'categorie', 'nom']
)
df_vente_new

Unnamed: 0,puht,quantite_vendue,date,categorie,nom
0,1000,15,2023-01-01,smartphone,Samsung


In [21]:
# Load the model
with open('pipeline.pkl', 'rb') as f:
    pipeline = dill.load(f)

pipeline.predict(df_vente_new)

array(['Non'], dtype=object)
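
The prediction above succeeds, which implies that 'smartphone' and 'Samsung' both occur in the training data: with the default `OneHotEncoder()`, a category not seen during `fit` raises an error at prediction time. A hedged sketch of one way to tolerate new categories, on hypothetical toy data (whether ignoring unknown products is acceptable depends on the use case):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training column, for illustration only
train = pd.DataFrame({"nom": ["iPhone", "Samsung"]})
# A product name never seen during fit
new = pd.DataFrame({"nom": ["Nokia"]})

tolerant_encoder = OneHotEncoder(handle_unknown="ignore")
tolerant_encoder.fit(train)

# The unknown category is encoded as a row of zeros instead of raising
encoded_new = tolerant_encoder.transform(new).toarray()
print(encoded_new)
```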

<b>Recap:</b>

In [22]:
# from sklearn.pipeline import Pipeline
# from sklearn.preprocessing import FunctionTransformer
# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import OneHotEncoder
# from sklearn.linear_model import LogisticRegression
# from sklearn.preprocessing import StandardScaler

# month_transformer = FunctionTransformer(extract_month, validate=False)

# pipeline = Pipeline(steps=[ ('preprocessor', 
#                                 ColumnTransformer(
#                                         transformers=[
#                                             ('date_to_month_trsf', month_transformer, 'date'),
#                                             ('ohe_categorie_produit_trsf', OneHotEncoder(), ['categorie','nom'])
#                                         ],
#                                         remainder='passthrough'
#                                 )                                     
#                             ),
#                             ('scaler', StandardScaler()),
#                             ('classifier', LogisticRegression())
#                         ]
# )
# pipeline