# Project 7: Implement a scoring model: Sampling

## Contents

 [1. Data import](#Imt)  

 [2. Undersampling](#Mod)



The goal of this notebook is to build the training and test sets. It also proposes a first resampling solution to rebalance the classes, using random undersampling of the majority class (SMOTE is imported but not applied here).



<a name="Imt"></a>
# **Data import**

In [1]:
pip install scikit-plot

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-plot
  Downloading scikit_plot-0.3.7-py3-none-any.whl (33 kB)
Installing collected packages: scikit-plot
Successfully installed scikit-plot-0.3.7


In [2]:
pip install shap

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting shap
  Downloading shap-0.41.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (575 kB)
Collecting slicer==0.0.7
  Downloading slicer-0.0.7-py3-none-any.whl (14 kB)
Installing collected packages: slicer, shap
Successfully installed shap-0.41.0 slicer-0.0.7


In [3]:
import numpy as np 
import pandas as pd

## PLOT
import matplotlib.pyplot as plt
import seaborn as sns

## Preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder

## Resampling
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek
from imblearn.under_sampling import TomekLinks
from collections import Counter

##Split
from sklearn.model_selection import train_test_split

## Modelisation
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb

## Scores
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import fbeta_score
from sklearn.metrics import precision_recall_fscore_support
import scikitplot as skplt
from sklearn.model_selection import cross_val_score
from sklearn.metrics import  make_scorer

## feature importance
import shap

## Threshold
from yellowbrick.classifier.threshold import discrimination_threshold

## Export
import pickle

## Warning
import warnings

In [4]:
# Additional imports (names already imported in the previous cell are not repeated)
import sklearn
from sklearn.utils import shuffle
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, loguniform, randint
from sklearn.metrics import recall_score, precision_score, accuracy_score
from sklearn.metrics import precision_recall_curve
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.exceptions import ConvergenceWarning, FitFailedWarning

In [5]:
warnings.filterwarnings("ignore")

In [6]:
use_colab = True 

if use_colab:
    from google.colab import drive
    drive.mount('/content/drive')
    PATH ='/content/drive/MyDrive/'
else:
    PATH ='/data/'

Mounted at /content/drive


In [7]:
dataset = pd.read_csv(PATH + 'df_final.csv')

In [8]:
dataset.shape

(307511, 47)

In [9]:
train_len = dataset.shape[0]

In [10]:
# Work on an explicit copy so the edits below never touch `dataset`
train_dataset = dataset.iloc[:train_len].copy()
train_ids = train_dataset['SK_ID_CURR']
train_dataset = train_dataset.drop(columns=['SK_ID_CURR'])

* Define the features and the target variable for modeling

In [11]:
# Separate the target from the features
train_dataset['TARGET'] = train_dataset['TARGET'].astype(int)
target = train_dataset['TARGET']
features = train_dataset.drop(columns=['TARGET'])
print('x_train data shape: ', features.shape)
print('y_train data shape: ', target.shape)

x_train data shape:  (307511, 45)
y_train data shape:  (307511,)


In [12]:
target_sample = target
target_sample.shape

(307511,)

In [13]:
features_sample = features
features_sample.shape


(307511, 45)

* Number of columns: 45
* Number of observations (rows): 307511

### Preprocessing

The data preprocessing is as follows:
* Split the dataset (40% held out as the test set).
* Define the attributes (numeric, categorical, ...).
* Fill in missing values (imputation):
  * Numeric variables: median (because the variables are skewed).
  * Categorical variables: most frequent value.
* Feature scaling (RobustScaler, which is less sensitive to outliers).
* Encoding of the categorical data.
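The steps listed above can be sketched as a single scikit-learn `ColumnTransformer`. This is only an illustration on a toy frame with made-up values: the names `toy`, `numeric_cols` and `categorical_cols` are hypothetical, and one-hot encoding is used here as one possible encoding, whereas the notebook itself label-encodes the categorical columns in a separate step.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy data (hypothetical values) with a missing entry in each column type.
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100.0, 200.0, np.nan, 10_000.0],
    "CODE_GENDER": ["F", "M", "F", np.nan],
})
numeric_cols = ["AMT_INCOME_TOTAL"]
categorical_cols = ["CODE_GENDER"]

# Numeric branch: median imputation, then outlier-robust scaling.
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", RobustScaler()),
])
# Categorical branch: most-frequent imputation, then one-hot encoding.
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("numeric", numeric_pipeline, numeric_cols),
    ("categorical", categorical_pipeline, categorical_cols),
])

# Fit on the training data only, then reuse the fitted object on the test data.
toy_transformed = preprocessor.fit_transform(toy)
print(toy_transformed.shape)
```

Keeping both branches in one fitted object means the medians, scaling statistics and category vocabulary are all learned from the training split and then reapplied unchanged to the test split.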


In [14]:
# Train test Split
X_train, X_test, y_train, y_test = train_test_split(features_sample, target_sample, test_size = 0.4)
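The split above is purely random and unseeded, so the class proportions and the selected rows vary from run to run. As a hedged alternative (not what this notebook does), `train_test_split` accepts `stratify` and `random_state`. The sketch below uses synthetic data with about 8% positives, and the names `X_toy` and `y_toy` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data: 80 positives out of 1000 rows.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 80 + [0] * 920)

# stratify keeps the positive rate identical in both splits;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, stratify=y_toy, random_state=42
)
print(y_tr.mean(), y_te.mean())
```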

In [15]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 123005 entries, 129473 to 299858
Data columns (total 45 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   NAME_CONTRACT_TYPE           123005 non-null  object 
 1   CODE_GENDER                  123005 non-null  object 
 2   CNT_CHILDREN                 123005 non-null  int64  
 3   AMT_INCOME_TOTAL             123005 non-null  float64
 4   AMT_CREDIT_x                 123005 non-null  float64
 5   NAME_TYPE_SUITE              122490 non-null  object 
 6   NAME_INCOME_TYPE             123005 non-null  object 
 7   NAME_EDUCATION_TYPE          123005 non-null  object 
 8   NAME_FAMILY_STATUS           123005 non-null  object 
 9   REGION_POPULATION_RELATIVE   123005 non-null  float64
 10  DAYS_BIRTH                   123005 non-null  int64  
 11  DAYS_EMPLOYED                123005 non-null  int64  
 12  OWN_CAR_AGE                  42037 non-null   float64

In [16]:
X_test['DAYS_INSTALMENT_delay'] = X_test['DAYS_INSTALMENT_delay'].mul(-1)

In [17]:
X_train['DAYS_INSTALMENT_delay'] = X_train['DAYS_INSTALMENT_delay'].mul(-1)

In [18]:
# X_test is already a DataFrame; reuse PATH instead of repeating the Drive folder
X_test.to_csv(PATH + 'X_test.csv', index=False)

In [19]:
# Define categorical columns (everything that is not numeric)
categoric_attribute = list(features_sample.select_dtypes(exclude=["number"]).columns)
# Define numerical columns
numeric_attribute = list(features_sample.select_dtypes(include=["number"]).columns)


In [20]:
numeric_attribute

['CNT_CHILDREN',
 'AMT_INCOME_TOTAL',
 'AMT_CREDIT_x',
 'REGION_POPULATION_RELATIVE',
 'DAYS_BIRTH',
 'DAYS_EMPLOYED',
 'OWN_CAR_AGE',
 'CNT_FAM_MEMBERS',
 'REGION_RATING_CLIENT',
 'REGION_RATING_CLIENT_W_CITY',
 'HOUR_APPR_PROCESS_START',
 'REG_CITY_NOT_WORK_CITY',
 'TOTALAREA_MODE',
 'DEF_30_CNT_SOCIAL_CIRCLE',
 'FLAG_DOCUMENT_3',
 'AMT_REQ_CREDIT_BUREAU_YEAR',
 'DAYS_CREDIT',
 'AMT_CREDIT_SUM',
 'AMT_BALANCE',
 'AMT_PAYMENT_CURRENT',
 'SK_DPD_x',
 'SK_DPD_DEF',
 'DAYS_INSTALMENT_delay',
 'AMT_INSTALMENT_delta',
 'AMT_ANNUITY',
 'AMT_CREDIT_y',
 'AMT_DOWN_PAYMENT',
 'DAYS_DECISION',
 'CNT_PAYMENT',
 'DAYS_FIRST_DRAWING',
 'DAYS_LAST_DUE',
 'DAYS_TERMINATION',
 'CNT_INSTALMENT_FUTURE',
 'SK_DPD_y']

In [21]:
pip install category_encoders

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting category_encoders
  Downloading category_encoders-2.5.1.post0-py2.py3-none-any.whl (72 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.5.1.post0


In [22]:
from sklearn.preprocessing import LabelEncoder

In [23]:
import category_encoders as ce

from category_encoders import TargetEncoder

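# Fit one encoder per column on the categories seen in both splits, so that
# train and test share the same category-to-integer mapping.
for col in categoric_attribute:
    encoder = LabelEncoder()
    encoder.fit(pd.concat([X_train[col], X_test[col]]))
    X_train[col] = encoder.transform(X_train[col])
    X_test[col] = encoder.transform(X_test[col])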

display(X_train)

Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT_x,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,REGION_POPULATION_RELATIVE,...,DAYS_DECISION,CNT_PAYMENT,DAYS_FIRST_DRAWING,DAYS_LAST_DUE,DAYS_TERMINATION,NAME_CONTRACT_STATUS,CODE_REJECT_REASON,NAME_CLIENT_TYPE,CNT_INSTALMENT_FUTURE,SK_DPD_y
190394,0,1,0,135000.0,545040.0,6,1,0,0,0.030755,...,-1015.666667,14.0,243100.333333,121288.333333,121290.666667,0,7,2,28.636364,0.0
300403,0,0,0,135000.0,1129500.0,0,1,1,0,0.015221,...,-477.000000,12.0,365243.000000,-116.000000,-112.000000,0,7,0,6.000000,0.0
48550,0,0,0,180000.0,616500.0,6,1,1,0,0.010643,...,-202.500000,9.6,365243.000000,-137.000000,-135.000000,1,7,2,14.714286,0.0
139030,0,1,0,143775.0,1724220.0,6,1,0,1,0.018850,...,,,,,,4,9,4,,
250937,0,1,1,225000.0,294322.5,1,1,1,0,0.008625,...,-409.400000,18.0,365243.000000,243213.333333,243216.666667,0,7,2,44.500000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58648,0,0,0,99450.0,254700.0,6,1,1,0,0.022800,...,-888.857143,6.0,365243.000000,242623.000000,242624.666667,0,7,1,10.000000,0.0
82467,0,0,0,225000.0,590337.0,6,1,1,0,0.014464,...,-398.666667,18.0,365243.000000,-308.000000,-300.000000,1,7,2,6.000000,0.0
107651,0,0,1,225000.0,550980.0,1,1,1,0,0.006008,...,-625.500000,4.0,243094.333333,121063.333333,121227.000000,0,7,2,3.000000,0.0
141814,0,1,0,90000.0,225000.0,6,1,0,0,0.035792,...,-2110.750000,10.5,273616.750000,89836.500000,89841.500000,0,7,2,,


In [24]:
# Pipeline data transformation (imputation / scaling of the numeric columns):
def Preprocessing(numeric):
    """Median-impute and robust-scale the `numeric` columns; pass the others through unchanged."""
    numeric_transfs = [
        ('imputer', SimpleImputer(missing_values=np.nan, strategy='median')),
        ('scaler', RobustScaler()),
    ]
    numeric_pipeline = Pipeline(numeric_transfs)
    all_transfs = [("numeric", numeric_pipeline, numeric)]
    full_preprocessor = ColumnTransformer(all_transfs, remainder='passthrough')
    return full_preprocessor

In [25]:
y_test.shape[0]

123005

In [26]:
# Data Transformed
preprocessor_fitted = Preprocessing(numeric_attribute).fit(X_train)
X_train_transformed = preprocessor_fitted.transform(X_train)
X_test_transformed = preprocessor_fitted.transform(X_test)
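One point worth keeping in mind: a `ColumnTransformer` with `remainder='passthrough'` returns a NumPy array whose columns are reordered, with the transformed columns first and the passthrough columns after them in their original order. A minimal sketch of how the column names could be recovered is given below. The toy frame and the names `ct`, `ordered_cols` and `toy_out` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler

# Toy frame: one numeric column sitting between two already-encoded columns.
toy = pd.DataFrame({
    "CODE_GENDER": [0, 1, 0],
    "AMT_CREDIT_x": [1.0, 2.0, 3.0],
    "NAME_CONTRACT_TYPE": [1, 0, 1],
})
numeric_cols = ["AMT_CREDIT_x"]

ct = ColumnTransformer(
    [("numeric", RobustScaler(), numeric_cols)], remainder="passthrough"
)
arr = ct.fit_transform(toy)

# Transformed columns come first, passthrough columns follow in original order.
ordered_cols = numeric_cols + [c for c in toy.columns if c not in numeric_cols]
toy_out = pd.DataFrame(arr, columns=ordered_cols, index=toy.index)
print(toy_out.columns.tolist())
```

Rebuilding a labelled DataFrame this way keeps the exported CSV files interpretable, since a bare array is otherwise written with integer column headers.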

In [27]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 184506 entries, 190394 to 99318
Data columns (total 45 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   NAME_CONTRACT_TYPE           184506 non-null  int64  
 1   CODE_GENDER                  184506 non-null  int64  
 2   CNT_CHILDREN                 184506 non-null  int64  
 3   AMT_INCOME_TOTAL             184506 non-null  float64
 4   AMT_CREDIT_x                 184506 non-null  float64
 5   NAME_TYPE_SUITE              184506 non-null  int64  
 6   NAME_INCOME_TYPE             184506 non-null  int64  
 7   NAME_EDUCATION_TYPE          184506 non-null  int64  
 8   NAME_FAMILY_STATUS           184506 non-null  int64  
 9   REGION_POPULATION_RELATIVE   184506 non-null  float64
 10  DAYS_BIRTH                   184506 non-null  int64  
 11  DAYS_EMPLOYED                184506 non-null  int64  
 12  OWN_CAR_AGE                  62545 non-null   float64


In [28]:
X_test_transformed.shape

(123005, 45)

* Number of columns after encoding: 45
* Number of observations in the train split: 184506
* Number of observations in the test split: 123005

<a name="Mod"></a>
# **Undersampling**

* Random undersampling of the majority class: majority observations are removed at random.
* Random oversampling of the minority class: minority observations are drawn at random (with replacement) and added to the data.


In [30]:
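# RandomUnderSampler is already imported above; TomekLinks is not used here.
# sampling_strategy=0.9 targets a minority/majority ratio of 0.9 after resampling.
rus = RandomUnderSampler(sampling_strategy=0.9)

# Resample the training set only (the `_smtomek` suffix is kept for the exported
# file names, although no SMOTE or Tomek-link step is applied)
X_train_smtomek, y_train_smtomek = rus.fit_resample(X_train_transformed, y_train)

print('Original dataset shape', Counter(y_train))
print('Resample dataset shape', Counter(y_train_smtomek))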

Original dataset shape Counter({0: 169535, 1: 14971})
Resample dataset shape Counter({0: 16634, 1: 14971})
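The resampled counts can be checked by hand. With a float `sampling_strategy`, the undersampler keeps every minority row and reduces the majority class to roughly the minority count divided by that ratio. The arithmetic below reuses the counts printed above and is consistent with them. The truncation to an integer is an assumption about the rounding that happens to match the output.

```python
# Counts taken from the output above.
n_minority = 14971
n_majority_before = 169535
sampling_strategy = 0.9  # target minority / majority ratio

# Majority rows kept after undersampling (truncated to an integer).
n_majority_after = int(n_minority / sampling_strategy)

print("ratio before:", round(n_minority / n_majority_before, 3))
print("majority rows kept:", n_majority_after)
print("ratio after:", round(n_minority / n_majority_after, 3))
```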


In [31]:
from sklearn.utils import resample

The classes are rebalanced in two steps:

- Undersampling of the majority class (done in this notebook)
- LightGBM in the next notebook, with Weight_POS = 1.1

In [32]:
pd.DataFrame(X_train_smtomek).to_csv('/content/drive/MyDrive/X_train_smtomek.csv',index=False)
pd.DataFrame(y_train_smtomek).to_csv('/content/drive/MyDrive/y_train_smtomek.csv',index=False)

In [33]:
pd.DataFrame(X_test_transformed).to_csv('/content/drive/MyDrive/X_test_smtomek.csv',index=False)
pd.DataFrame(y_test).to_csv('/content/drive/MyDrive/y_test_smtomek.csv',index=False)