# Project 7: Implement a scoring model: Sampling

## Contents

 [1. Data import](#Imt)  

 [2. Preprocessing](#Cha)

 [3. SMOTE and dataset export](#Mod)



The goal of this notebook is to build the training and test sets. It also proposes a first approach to resampling and rebalancing the datasets with SMOTE.



<a name="Imt"></a>
# **Data import**

In [None]:
pip install scikit-plot

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
pip install shap

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import numpy as np 
import pandas as pd

## PLOT
import matplotlib.pyplot as plt
import seaborn as sns

## Preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder

## Resampling
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek
from imblearn.under_sampling import TomekLinks
from collections import Counter

## Split
from sklearn.model_selection import train_test_split

## Modeling
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb

## Scores
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import fbeta_score
from sklearn.metrics import precision_recall_fscore_support
import scikitplot as skplt
from sklearn.model_selection import cross_val_score
from sklearn.metrics import  make_scorer

## feature importance
import shap

## Threshold
from yellowbrick.classifier.threshold import discrimination_threshold

## Export
import pickle

## Warning
import warnings

In [None]:
# Additional imports (names already imported in the previous cell are not repeated)
import sklearn
from sklearn.utils import shuffle
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, loguniform, randint
from sklearn.metrics import recall_score, precision_score, accuracy_score
from sklearn.metrics import precision_recall_curve
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.exceptions import ConvergenceWarning
from sklearn.exceptions import FitFailedWarning

In [None]:
warnings.filterwarnings("ignore")

In [None]:
use_colab = True 

if use_colab:
    from google.colab import drive
    drive.mount('/content/drive')
    PATH ='/content/drive/MyDrive/'
else:
    PATH ='/data/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
dataset = pd.read_csv(PATH + 'df_final.csv')

In [None]:
dataset.shape

(307511, 47)

In [None]:
train_len = dataset.shape[0]

In [None]:
# Work on an explicit copy so that later edits do not touch `dataset`
train_dataset = dataset.iloc[:train_len].copy()
train_ids = train_dataset['SK_ID_CURR']
train_dataset = train_dataset.drop(columns=['SK_ID_CURR'])

* Define the features and the target variable for modeling

In [None]:
# Separate the target from the features
train_dataset['TARGET'] = train_dataset['TARGET'].astype(int)
target = train_dataset['TARGET']
features = train_dataset.drop(columns=['TARGET'])
print('x_train data shape: ', features.shape)
print('y_train data shape: ', target.shape)

x_train data shape:  (307511, 45)
y_train data shape:  (307511,)


In [None]:
target_sample = target
target_sample.shape

(307511,)

In [None]:
features_sample = features
features_sample.shape


(307511, 45)

* Number of columns: 45
* Number of observations (rows): 307511

<a name="Cha"></a>
# **Preprocessing**

The data preprocessing steps are as follows:
* Split the dataset (40% held out as the test set).
* Identify the attribute types (numeric, categorical, ...).
* Fill in missing values (imputation):
  * Numeric variables: median (because the distributions are skewed).
  * Categorical variables: most frequent value.
* Scale the features (RobustScaler, which is less sensitive to outliers).
* Encode the categorical data.
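
The choice of the median and of a robust scaler can be illustrated on a toy column. The sketch below is only an illustration under assumed values (the `income` array is hypothetical and not taken from the dataset): it imputes one missing entry with the median, then compares `RobustScaler` with `StandardScaler` when a single extreme value is present.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical income column: one missing value and one extreme outlier.
income = np.array([[100.0], [120.0], [np.nan], [110.0], [130.0], [10_000.0]])

# Median imputation: the gap is filled with the median of the observed
# values (120), whereas their mean (2092) is dragged up by the outlier.
imputed = SimpleImputer(strategy="median").fit_transform(income)

# RobustScaler centers on the median and divides by the interquartile range;
# StandardScaler uses the mean and standard deviation, both outlier-sensitive.
robust = RobustScaler().fit_transform(imputed)
standard = StandardScaler().fit_transform(imputed)

# Spread of the five ordinary points (outlier excluded) after each scaling.
robust_spread = robust[:5].max() - robust[:5].min()
standard_spread = standard[:5].max() - standard[:5].min()
print("imputed value:", imputed[2, 0])
print("spread with RobustScaler:  ", robust_spread)
print("spread with StandardScaler:", standard_spread)
```

With the standard scaler the ordinary values are squeezed into a very narrow band because the outlier inflates the standard deviation; the robust scaler keeps them distinguishable.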


In [None]:
# Train test Split
X_train, X_test, y_train, y_test = train_test_split(features_sample, target_sample, test_size = 0.4)
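
The split above is neither seeded nor stratified, so the rows in each split change between runs and the share of positives can drift slightly between train and test. A possible variant, shown here on hypothetical toy arrays (`X_toy`, `y_toy` and the seed `42` are assumptions, not values from the notebook), uses the `stratify` and `random_state` parameters of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 8% positives.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 80 + [0] * 920)

# stratify keeps the class ratio (almost) identical in both splits;
# random_state makes the split reproducible from one run to the next.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, stratify=y_toy, random_state=42
)

print("positive rate, train:", y_tr.mean())
print("positive rate, test: ", y_te.mean())
```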

In [None]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 123005 entries, 200000 to 222326
Data columns (total 45 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   NAME_CONTRACT_TYPE           123005 non-null  object 
 1   CODE_GENDER                  123005 non-null  object 
 2   CNT_CHILDREN                 123005 non-null  int64  
 3   AMT_INCOME_TOTAL             123005 non-null  float64
 4   AMT_CREDIT_x                 123005 non-null  float64
 5   NAME_TYPE_SUITE              122449 non-null  object 
 6   NAME_INCOME_TYPE             123005 non-null  object 
 7   NAME_EDUCATION_TYPE          123005 non-null  object 
 8   NAME_FAMILY_STATUS           123005 non-null  object 
 9   REGION_POPULATION_RELATIVE   123005 non-null  float64
 10  DAYS_BIRTH                   123005 non-null  int64  
 11  DAYS_EMPLOYED                123005 non-null  int64  
 12  OWN_CAR_AGE                  41946 non-null   float64

In [None]:
X_test['DAYS_INSTALMENT_delay'] = X_test['DAYS_INSTALMENT_delay'].mul(-1)

In [None]:
X_train['DAYS_INSTALMENT_delay'] = X_train['DAYS_INSTALMENT_delay'].mul(-1)

In [None]:
# X_test is already a DataFrame; reuse PATH instead of a hard-coded folder
X_test.to_csv(PATH + 'X_test.csv', index=False)

In [None]:
# Define categorical columns (everything that is not numeric)
categoric_attribute = list(features_sample.select_dtypes(exclude=["number"]).columns)
# Define numerical columns
numeric_attribute = list(features_sample.select_dtypes(include=["number"]).columns)


In [None]:
numeric_attribute

['CNT_CHILDREN',
 'AMT_INCOME_TOTAL',
 'AMT_CREDIT_x',
 'REGION_POPULATION_RELATIVE',
 'DAYS_BIRTH',
 'DAYS_EMPLOYED',
 'OWN_CAR_AGE',
 'CNT_FAM_MEMBERS',
 'REGION_RATING_CLIENT',
 'REGION_RATING_CLIENT_W_CITY',
 'HOUR_APPR_PROCESS_START',
 'REG_CITY_NOT_WORK_CITY',
 'TOTALAREA_MODE',
 'DEF_30_CNT_SOCIAL_CIRCLE',
 'FLAG_DOCUMENT_3',
 'AMT_REQ_CREDIT_BUREAU_YEAR',
 'DAYS_CREDIT',
 'AMT_CREDIT_SUM',
 'AMT_BALANCE',
 'AMT_PAYMENT_CURRENT',
 'SK_DPD_x',
 'SK_DPD_DEF',
 'DAYS_INSTALMENT_delay',
 'AMT_INSTALMENT_delta',
 'AMT_ANNUITY',
 'AMT_CREDIT_y',
 'AMT_DOWN_PAYMENT',
 'DAYS_DECISION',
 'CNT_PAYMENT',
 'DAYS_FIRST_DRAWING',
 'DAYS_LAST_DUE',
 'DAYS_TERMINATION',
 'CNT_INSTALMENT_FUTURE',
 'SK_DPD_y']

In [None]:
pip install category_encoders

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
import category_encoders as ce

from category_encoders import TargetEncoder

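encoder = LabelEncoder()

for col in categoric_attribute:
    # Fit on the values of both splits so that a category receives the same
    # integer code in X_train and X_test (fitting each split separately can
    # map one category to two different codes).
    encoder.fit(pd.concat([X_train[col], X_test[col]]))
    X_train[col] = encoder.transform(X_train[col])
    X_test[col] = encoder.transform(X_test[col])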

display(X_train)

Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT_x,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,REGION_POPULATION_RELATIVE,...,DAYS_DECISION,CNT_PAYMENT,DAYS_FIRST_DRAWING,DAYS_LAST_DUE,DAYS_TERMINATION,NAME_CONTRACT_STATUS,CODE_REJECT_REASON,NAME_CLIENT_TYPE,CNT_INSTALMENT_FUTURE,SK_DPD_y
277724,0,0,0,135000.0,1350000.0,6,1,0,0,0.009175,...,-1629.500000,7.166667,365243.000000,90142.250000,-1200.000000,0,7,2,,
126412,0,0,1,180000.0,312768.0,6,1,0,0,0.006671,...,-270.666667,23.333333,365243.000000,-192.000000,-188.000000,0,7,2,6.400000,0.000000
61551,0,0,0,216000.0,188685.0,5,1,1,0,0.002042,...,-717.000000,15.733333,365243.000000,60107.000000,60112.666667,2,1,2,17.880000,0.000000
291535,0,0,0,67500.0,152820.0,6,1,1,0,0.028663,...,-346.000000,3.000000,182489.000000,182538.500000,182540.000000,0,7,0,3.000000,0.000000
293124,0,1,0,225000.0,490495.5,6,1,1,2,0.018634,...,-1028.166667,8.666667,365243.000000,-1586.666667,-1510.666667,0,7,2,6.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100758,0,0,0,99000.0,152820.0,3,1,1,1,0.014464,...,-651.000000,12.000000,365243.000000,-888.000000,-879.000000,0,7,0,,
254741,0,1,0,135000.0,90000.0,6,1,1,0,0.005002,...,-1214.000000,6.000000,243364.333333,-914.333333,120857.333333,0,7,1,6.000000,1.230769
2852,0,0,1,540000.0,2156400.0,0,1,0,0,0.046220,...,-415.000000,12.000000,365243.000000,-229.000000,-223.000000,1,7,2,6.818182,0.000000
149365,0,0,0,77850.0,254700.0,6,1,0,0,0.072508,...,,,,,,4,9,4,,
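
When categories are mapped to integers, a given category has to receive the same code in the training and test sets. One possible way to guarantee this, sketched below on hypothetical toy frames, is to fit scikit-learn's `OrdinalEncoder` on the training data only and reuse it on the test data. The `unknown_value=-1` convention for categories unseen during fitting is an assumption made for this illustration, not the notebook's method.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy splits; "Unknown type" only appears in the test split.
train_toy = pd.DataFrame({"NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"]})
test_toy = pd.DataFrame({"NAME_CONTRACT_TYPE": ["Revolving loans", "Unknown type"]})

# Categories unseen at fit time are encoded as -1 instead of raising an error.
encoder_toy = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Learn the category-to-integer mapping on the training split only...
train_codes = encoder_toy.fit_transform(train_toy)
# ...and apply the very same mapping to the test split.
test_codes = encoder_toy.transform(test_toy)

print(train_codes.ravel())
print(test_codes.ravel())
```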


In [None]:
# Data transformation pipeline (imputation and scaling of the numeric columns):
def Preprocessing(numeric):
    """Build a ColumnTransformer that imputes and scales the `numeric` columns.

    All other columns are passed through unchanged.
    """
    numeric_transfs = [('imputer', SimpleImputer(missing_values=np.nan, strategy='median')),
                       ('scaler', RobustScaler())]
    numeric_pipeline = Pipeline(numeric_transfs)
    all_transfs = [("numeric", numeric_pipeline, numeric)]
    full_preprocessor = ColumnTransformer(all_transfs, remainder='passthrough')
    return full_preprocessor

In [None]:
y_test.shape[0]

123005

In [None]:
# Data Transformed
preprocessor_fitted = Preprocessing(numeric_attribute).fit(X_train)
X_train_transformed = preprocessor_fitted.transform(X_train)
X_test_transformed = preprocessor_fitted.transform(X_test)

In [None]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 184506 entries, 277724 to 3883
Data columns (total 45 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   NAME_CONTRACT_TYPE           184506 non-null  int64  
 1   CODE_GENDER                  184506 non-null  int64  
 2   CNT_CHILDREN                 184506 non-null  int64  
 3   AMT_INCOME_TOTAL             184506 non-null  float64
 4   AMT_CREDIT_x                 184506 non-null  float64
 5   NAME_TYPE_SUITE              184506 non-null  int64  
 6   NAME_INCOME_TYPE             184506 non-null  int64  
 7   NAME_EDUCATION_TYPE          184506 non-null  int64  
 8   NAME_FAMILY_STATUS           184506 non-null  int64  
 9   REGION_POPULATION_RELATIVE   184506 non-null  float64
 10  DAYS_BIRTH                   184506 non-null  int64  
 11  DAYS_EMPLOYED                184506 non-null  int64  
 12  OWN_CAR_AGE                  62636 non-null   float64
 

In [None]:
X_test_transformed.shape

(123005, 45)

* Number of columns after encoding: 45
* Number of observations in the training split: 184506
* Number of observations in the test split: 123005

<a name="Mod"></a>
# **SMOTE and dataset export**

* Random undersampling of the majority class: majority observations are removed at random.
* Random oversampling of the minority class: minority observations are drawn at random (with replacement) and added to the data.
* Synthetic oversampling (SMOTE, for Synthetic Minority Oversampling Technique) generates new minority observations that resemble the existing ones without duplicating them.
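
To make the SMOTE idea concrete: each synthetic point is placed on the segment between a minority observation and one of its nearest minority neighbours. The function below is a simplified sketch of that interpolation step only, not imbalanced-learn's implementation; the name `smote_like_samples`, its parameters and the toy points are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours: each point is returned as its own nearest
    # neighbour (assuming no duplicate rows), so column 0 is skipped below.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbour_idx = nn.kneighbors(X_minority)

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))             # random minority point
        neighbour = rng.choice(neighbour_idx[base, 1:])  # one of its k neighbours
        gap = rng.random()                               # position on the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic


# Hypothetical minority class with five 2-D observations.
X_minority_toy = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7]])
synthetic = smote_like_samples(X_minority_toy, n_new=10)
print(synthetic)
```

Because every synthetic point is a convex combination of two existing minority points, it stays inside the region already occupied by the minority class.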

The idea is to combine SMOTE with an undersampling technique (ENN, Tomek links) to handle the class imbalance more effectively.

In [None]:
# Define the SMOTE strategy
sm = SMOTE(random_state=42)
# Define SMOTE-Tomek links (over-sampling followed by under-sampling)
smtomek = SMOTETomek(tomek=TomekLinks(sampling_strategy='majority'), random_state=42)
# Define the random over-sampling strategy
oversample = RandomOverSampler(sampling_strategy='minority', random_state=42)
# Define the random under-sampling strategy
undersample = RandomUnderSampler(sampling_strategy='majority', random_state=42)

In [None]:
def resampling(features, target, resample):
    """Apply the `resample` sampler and print the class counts before and after."""
    print('Original dataset shape %s' % Counter(target))
    X, y = resample.fit_resample(features, target)
    print('Resampled dataset shape %s' % Counter(y))
    return X, y

In [None]:
#X_train_smtomek,y_train_smtomek = resampling (X_train_transformed, y_train, smtomek)

In [None]:
#X_test_smtomek,y_test_smtomek = resampling (X_test_transformed, y_test, smtomek)

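The datasets are then exported for reuse in the other notebooks. Because the resampling calls above are commented out, the exported files contain the preprocessed data without resampling, despite the `_smtomek` suffix in their names.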

In [None]:
pd.DataFrame(X_train_transformed).to_csv(PATH + 'X_train_smtomek.csv', index=False)
pd.DataFrame(y_train).to_csv(PATH + 'y_train_smtomek.csv', index=False)

In [None]:
pd.DataFrame(X_test_transformed).to_csv(PATH + 'X_test_smtomek.csv', index=False)
pd.DataFrame(y_test).to_csv(PATH + 'y_test_smtomek.csv', index=False)