# *Model selection*

In this notebook, we will:
- Run the imports ...
- Rebalance the dataset
- Select the model using nested cross-validation (Nested CV)

## 01 - Imports

To do this, we need to:
- import the required Python libraries
- import the dataset preprocessed in the previous chapter

In [1]:
# Import the required libraries

# The PyPI package is named imbalanced-learn; it is imported as imblearn
!pip install imbalanced-learn

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
import pickle
from sklearn.model_selection import GridSearchCV, train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import numpy as np
import pandas as pd
from sklearn.exceptions import DataConversionWarning
from sklearn.base import BaseEstimator, TransformerMixin
import warnings
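# Silence the warning scikit-learn raises when y is passed as a column vector instead of a 1-D array
warnings.filterwarnings(action='ignore', category=DataConversionWarning)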


In [2]:
# Load the dataset preprocessed in the previous chapter

with open('df_NC_tfidf_processed','rb') as fich:
    mon_depickler = pickle.Unpickler(fich)
    df_transformed = mon_depickler.load()
fich.closed

True
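
The later cells index `df_transformed[0]` and `df_transformed[1]`, so the pickled object is presumably a `(features, target)` tuple. A minimal round-trip sketch under that assumption follows; the file name and the toy data are hypothetical, not the real dataset.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy stand-ins for the real features and target (hypothetical values).
X_demo = pd.DataFrame({0: [0.1, 0.2], 1: [0.3, 0.4]})
y_demo = pd.DataFrame({"Root Cause Category": ["A", "B"]})

# Hypothetical path; the notebook reads 'df_NC_tfidf_processed' instead.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_processed.pkl")

# Write the (features, target) tuple to disk.
with open(demo_path, "wb") as f:
    pickle.dump((X_demo, y_demo), f)

# Read it back; the with-block closes the file automatically.
with open(demo_path, "rb") as f:
    X_loaded, y_loaded = pickle.load(f)

print(X_loaded.shape, y_loaded.shape)
```

`pickle.load` can execute arbitrary code, so it should only be used on files from a trusted source.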

## 02 - Rebalancing the dataset

To do this, we need to:
- split the dataset into features and targets
- split the data into a training set and a test set
- rebalance the training data
- evaluate another rebalancing method if needed (oversampling, undersampling...)
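
The order of these steps matters: the test set is split off first, and only the training split is rebalanced, so the test set keeps the original class distribution. The sketch below illustrates this on toy data. It uses plain random oversampling with pandas rather than SMOTE, and it adds `stratify`, which the notebook's own split does not use; both choices, and all variable names, are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data (hypothetical): 8 rows of class "A", 4 of class "B".
X_demo = pd.DataFrame({"f0": range(12), "f1": range(12, 24)})
y_demo = pd.Series(["A"] * 8 + ["B"] * 4, name="Root Cause Category")

# stratify keeps the class proportions the same in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# Random oversampling (not SMOTE): draw every class, with replacement,
# up to the size of the largest class. Only the training split is touched.
train = pd.concat([X_tr, y_tr], axis=1)
n_max = train[y_tr.name].value_counts().max()
parts = [
    grp.sample(n=n_max, replace=True, random_state=42)
    for _, grp in train.groupby(y_tr.name)
]
balanced = pd.concat(parts, ignore_index=True)

print(balanced[y_tr.name].value_counts().to_dict())
print(y_te.value_counts().to_dict())
```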

In [47]:
def changer_type_float(df, type_actuel = 'float64', nouveau_type = 'float32'):
    """
    Convert the columns of a given float dtype in a pandas DataFrame to another dtype.

    The DataFrame is modified in place and also returned.

    Args:
        df (DataFrame): The DataFrame whose columns should be converted.
        type_actuel (str): The dtype of the columns to convert (for example 'float64').
        nouveau_type (str): The dtype to convert those columns to (for example 'float32', 'float16', 'int').

    Returns:
        DataFrame: The same DataFrame, with the matching columns converted to nouveau_type.
    """
    # Get the current dtypes of the DataFrame
    types_de_donnees = df.dtypes

    # Select only the columns whose dtype is type_actuel
    colonnes_float = types_de_donnees[types_de_donnees == type_actuel].index.tolist()

    # Convert each selected column to the new dtype
    for colonne in colonnes_float:
        df[colonne] = df[colonne].astype(nouveau_type)
    
    return df
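
As a rough illustration of what the downcast buys, the sketch below converts the `float64` columns of a toy frame to `float32` and compares memory use and rounding error. It is a non-mutating variant built on `select_dtypes` and `astype`, not the function above, and the data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): two float64 columns and one integer column.
df_demo = pd.DataFrame({
    "a": np.linspace(0.0, 1.0, 1000),
    "b": np.linspace(-1.0, 1.0, 1000),
    "n": np.arange(1000),
})

# select_dtypes picks the float64 columns; astype with a dict converts only
# those columns and returns a new DataFrame (df_demo is left unchanged).
float_cols = df_demo.select_dtypes(include="float64").columns
df_small = df_demo.astype({col: "float32" for col in float_cols})

bytes_before = df_demo.memory_usage(deep=True).sum()
bytes_after = df_small.memory_usage(deep=True).sum()
print(bytes_before, bytes_after)

# Largest absolute rounding error introduced by the float32 conversion.
max_err = (df_demo["a"] - df_small["a"].astype("float64")).abs().max()
print(max_err)
```

`float32` keeps roughly 7 significant decimal digits and `float16` only about 3, so a `float16` downcast is worth checking against the precision the features need.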

In [22]:
# Class that performs the train/test split and reduces the training data to lighten the Nested CV

class Resampler(BaseEstimator, TransformerMixin):
    def __init__(self, test_size=0.2, random_state=42, nombre_lignes_par_classe = 100, target = 'Root Cause Category' , **kwargs):
        self.test_size = test_size
        self.random_state = random_state
        self.oversampler = SMOTE(random_state = self.random_state)
        #self.spliter = train_test_split()
        self.nombre_lignes_par_classe = nombre_lignes_par_classe
        self.bert_params = kwargs
        self.target = target

    def fit(self, X, y=None):
        self.X = X[0]
        self.y = X[1]
        return self

    def transform(self, X):
        # Note: the X argument is ignored; the data stored by fit() is used instead
        # Split the data into a training set and a test set
        X_train, X_test, y_train, y_test = train_test_split(self.X, self.y, test_size=self.test_size, random_state = self.random_state)

        # Balance the training data by oversampling (SMOTE)
        X_train_oversampled, y_train_oversampled = self.oversampler.fit_resample(X_train, y_train)

        # Reduce the number of rows per class
        df_oversampled = pd.concat([X_train_oversampled, y_train_oversampled], axis=1)
        groupes = df_oversampled.groupby(self.target)
        df_reduit = pd.DataFrame()
        for classe, groupe in groupes:
            echantillon = groupe.sample(n=self.nombre_lignes_par_classe, random_state=self.random_state)
            df_reduit = pd.concat([df_reduit, echantillon])
        df_reduit.reset_index(drop=True, inplace=True)

        # Separate the features from the target column
        X_train_reduced = df_reduit.drop(columns=[self.target])
        y_train_reduced = df_reduit[[self.target]]

        # Return the final split
        return X_train_reduced, X_test, y_train_reduced, y_test

    def get_feature_names_out(self):
        pass
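
The per-class reduction inside `transform` can also be written more compactly. The sketch below uses `DataFrameGroupBy.sample`, which is available in pandas 1.1 and later, on a toy frame; the data and the `_demo` names are hypothetical, and this is an alternative formulation rather than the code the class runs.

```python
import pandas as pd

# Toy oversampled frame (hypothetical): 5 rows for each of 3 classes.
target = "Root Cause Category"
df_demo = pd.DataFrame({
    "f0": range(15),
    target: ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
})

# Draw the same number of rows from every class in a single call.
df_reduced = (
    df_demo.groupby(target)
    .sample(n=2, random_state=42)
    .reset_index(drop=True)
)

# Separate the features from the target column.
X_reduced = df_reduced.drop(columns=[target])
y_reduced = df_reduced[target]
print(y_reduced.value_counts().to_dict())
```

As with the loop version, sampling without replacement raises a `ValueError` if a class has fewer rows than `n`.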

In [48]:
# Instantiate the resampler
resampling = Resampler(test_size=0.2, random_state=42, nombre_lignes_par_classe = 18, target = 'Root Cause Category')

# Downcast the float features (float32 to float16)
X_proc_newFloat = changer_type_float(df_transformed[0], type_actuel = 'float32', nouveau_type = 'float16')
y_proc_newFloat = df_transformed[1]

df_processed_newFloat = (X_proc_newFloat, y_proc_newFloat)

# Get X and y for the training and test sets
X_train_reduced, X_test, y_train_reduced, y_test = resampling.fit_transform(df_processed_newFloat)

In [52]:
df_transformed[0]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,-0.108215,0.132080,-0.176270,-0.063965,0.048004,0.046295,-0.140503,-0.022537,0.014786,-0.032776,...,-0.050079,-0.002365,0.013878,0.044891,-0.013550,0.007732,0.019180,-0.032593,-0.024612,0.004112
1,-0.057861,-0.020798,-0.140869,-0.045685,0.136230,-0.009010,0.052216,0.127075,0.003511,0.075317,...,0.027496,-0.018356,0.034210,0.002144,0.076599,-0.067444,-0.063721,-0.001893,-0.000267,-0.001756
2,-0.067749,0.039154,-0.025909,-0.006927,-0.022888,0.077820,-0.020050,0.041595,0.045959,-0.007919,...,0.035400,0.034149,0.010414,0.001386,0.003735,-0.052765,-0.031235,0.075623,-0.026184,-0.014618
3,-0.036316,-0.075806,0.023499,-0.062439,-0.027863,0.128418,-0.007820,-0.044983,-0.109741,0.083496,...,-0.015327,-0.085144,-0.019226,-0.039886,0.008690,0.058044,0.023224,-0.002058,-0.020340,-0.031799
4,-0.032379,-0.027512,0.000547,0.004711,-0.044708,0.080811,-0.004601,-0.020309,-0.056274,-0.013779,...,0.009575,-0.018173,-0.007935,-0.015175,-0.004005,0.042358,0.008789,0.018280,0.008286,-0.028763
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14314,-0.010620,-0.027267,-0.036346,-0.003677,-0.043396,-0.077820,0.039368,0.114807,-0.019577,0.015686,...,0.000406,0.037201,-0.066040,-0.015266,-0.081360,-0.038849,0.017212,-0.016602,-0.011299,-0.007668
14315,0.004654,0.126831,-0.009193,0.034973,0.049011,-0.129761,0.040649,0.050720,-0.168457,-0.002737,...,-0.035400,0.026031,-0.024658,0.005569,-0.002632,0.014832,0.000945,-0.027374,-0.061493,-0.026642
14316,-0.038025,0.054352,-0.115356,0.011238,0.012260,-0.057617,0.293213,0.017715,0.138916,0.048187,...,-0.032104,0.008034,-0.050995,0.026398,0.044037,0.012276,0.022400,-0.047272,-0.026291,-0.027222
14317,-0.035492,-0.065613,-0.245972,-0.068848,0.114319,-0.013451,0.084778,-0.041229,-0.038849,-0.213623,...,-0.014168,0.019394,0.013611,-0.013428,0.027878,0.028259,0.008934,-0.000869,-0.002033,-0.008011


## 03 - Nested cross-validation (Nested CV)

The goal here is to estimate the generalization error of each model as reliably as possible.
To do this, we need to:
- define the types of classifiers to compare
- for each type of classifier, instantiate a comparator/selector (a grid search)
- select the type of classifier with the best mean score (and the lowest standard deviation)
- optimize the hyperparameters of the selected classifier
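
The steps above can be sketched on synthetic data as follows. The inner loop (`GridSearchCV`) tunes the hyperparameters, and the outer loop (`cross_val_score`) scores each tuned model on a fold it never saw during tuning. The dataset, the grid, and the fold counts here are illustrative assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic 3-class data (hypothetical stand-in for the reduced training set).
X_demo, y_demo = make_classification(
    n_samples=150, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Inner loop: hyperparameter search on each outer training fold.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=500),
    param_grid={"C": np.logspace(-2, 2, 5)},
    cv=inner_cv,
)

# Outer loop: each fold scores a model that was tuned without seeing that fold.
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv)
print(f"outer accuracy {100 * scores.mean():.2f} +/- {100 * scores.std():.2f}")
```

Fixing `random_state` on both splitters makes the folds, and therefore the reported scores, reproducible from one run to the next.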

In [26]:
lr1 = RandomForestClassifier(random_state=22, n_estimators = 1000)
lr1.fit(X_train_reduced, y_train_reduced)

In [27]:
# Define the types of classifiers to compare

clf_lr = LogisticRegression(random_state=22, max_iter=200)
clf_rf = RandomForestClassifier(random_state=22)
clf_svc = SVC(random_state=22)


param_grid_lr = {'solver': ['liblinear', 'lbfgs'], 'C': np.logspace(-4, 2, 9)}

param_grid_rf = [{'n_estimators': [10, 50, 100, 250, 500, 1000], 
                  'min_samples_leaf': [1, 3, 5], 
                  'max_features': ['sqrt', 'log2']}]

param_grid_svc = [{'kernel': ['rbf'], 'C': np.logspace(-4, 4, 9), 'gamma': np.logspace(-4, 0, 4)}, 
                  {'kernel': ['linear'], 'C': np.logspace(-4, 4, 9)}]

# Suggestion from Dimitri: SVC is expensive, take the linear kernel out

In [28]:
# For each type of classifier, instantiate a comparator/selector (GridSearchCV)

gridcvs = {}

for pgrid, clf, name in zip((param_grid_rf, param_grid_svc, param_grid_lr),
                            (clf_rf, clf_svc, clf_lr),
                            ('RF', 'SVM', 'LogisticRegression')):
    gcv = GridSearchCV(clf, pgrid, cv=3, refit=True)
    gridcvs[name] = gcv

In [51]:
# Select the type of classifier with the best mean score (and the lowest standard deviation)

outer_cv = StratifiedKFold(n_splits=3, shuffle=True)
outer_scores = {}

for name, gs in gridcvs.items():
    nested_score = cross_val_score(gs, X_train_reduced, y_train_reduced, cv=outer_cv)
    outer_scores[name] = nested_score
    print(f'{name}: outer accuracy {100*nested_score.mean():.2f} +/- {100*nested_score.std():.2f}')

RF: outer accuracy 21.88 +/- 6.13
SVM: outer accuracy 19.44 +/- 0.98
LogisticRegression: outer accuracy 17.36 +/- 2.60
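
The selection rule (best mean, then lowest standard deviation) can be made explicit with a small ranking helper. The sketch below uses made-up per-fold scores and hypothetical model names purely for illustration; they are not results from this notebook.

```python
import numpy as np

# Hypothetical per-fold accuracies, for illustration only (not notebook results).
demo_scores = {
    "model_a": np.array([0.60, 0.70, 0.80]),
    "model_b": np.array([0.69, 0.70, 0.71]),
    "model_c": np.array([0.50, 0.55, 0.60]),
}

def rank_key(name):
    """Sort key: higher mean first, then lower standard deviation."""
    fold_scores = demo_scores[name]
    # Rounding the mean avoids floating-point noise deciding near-ties.
    return (-round(fold_scores.mean(), 4), fold_scores.std())

ranking = sorted(demo_scores, key=rank_key)
best_name = ranking[0]
print(ranking, best_name)
```

With only three outer folds, the standard deviation is a coarse estimate, so small differences between models should be read with caution.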


In [None]:
# Once the type of classifier is chosen, its hyperparameters still have to be optimized

from sklearn.metrics import accuracy_score

final_clf = gridcvs['LogisticRegression']
# Fit on the reduced training set returned by the Resampler (X_train and y_train are not defined in this notebook)
final_clf.fit(X_train_reduced, y_train_reduced)

print(f'Best Parameters: {final_clf.best_params_}')

train_acc = accuracy_score(y_true=y_train_reduced, y_pred=final_clf.predict(X_train_reduced))
test_acc = accuracy_score(y_true=y_test, y_pred=final_clf.predict(X_test))

print(f'Training Accuracy: {100*train_acc:.2f}')
print(f'Test Accuracy: {100*test_acc:.2f}')

In [32]:
X_train_reduced.dtypes

0     float64
1     float64
2     float64
3     float64
4     float64
       ...   
95    float64
96    float64
97    float64
98    float64
99    float64
Length: 100, dtype: object

**Results:**

- TF-IDF with PCA dimensionality reduction to 100 components
    - LogisticRegression: outer accuracy 35.02 +/- 0.30
    - ...