# Table of Contents --> TBU

1. [Global parameters](#1-bullet) <br>
    <br>
    
2. [Loading data](#2-bullet) <br>
    <br>

3. [Preprocessing](#3-bullet) <br>
    I - [Cleaning](#4-bullet) <br>
    II - [Split train/test and preprocessing](#5-bullet) <br>
    III - [Dimensionality reduction](#6-bullet) <br>
    IV - [Creation of folds for cv](#7-bullet) <br>
    <br>

4. [Model testing](#8-bullet) <br>
    I - [Dummy classifiers ](#9-bullet) <br>
    II - [Quick testing](#15-bullet) <br>
    III - [Linear models](#10-bullet) <br>
    IV - [KNN](#11-bullet) <br>
    V - [SVM](#12-bullet) <br>
    VI - [Trees and ensemble methods](#13-bullet) <br>
    VII - [Neural networks](#14-bullet) <br>
    VIII - [Compare](#16-bullet) <br>
    <br>

5. [xx](#xx-bullet) <br>
    I - [xx](#xx-bullet) <br>
    II - [xx](#xx-bullet) <br>
    III - [xx](#xx-bullet) <br>
    IV - [xx](#xx-bullet) <br>
    V - [xx](#xx-bullet) <br>
    VI - [xx](#xx-bullet) <br>
    VII - [xx](#xx-bullet) <br>
    <br>

# 1. Global parameters <a class="anchor" id="1-bullet"></a>

In [1]:
# General input
random_state = 50 

# Cross-validation
optimized_metric = 'roc_auc' 
num_folds = 5
stratified = True

# 2. Loading data <a class="anchor" id="2-bullet"></a>

In [2]:
# Classic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import timeit
from contextlib import contextmanager

import warnings
from pandas.core.common import SettingWithCopyWarning
# warnings.simplefilter(action='ignore', category=SettingWithCopyWarning)

# Project specific functions
from P7_functions import *

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn import manifold, decomposition
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import StandardScaler

# Sklearn models
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import NuSVC
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Evaluation
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.metrics import log_loss
from sklearn.model_selection import GridSearchCV
from sklearn.dummy import DummyClassifier

In [3]:
@contextmanager
def timer(title):
    t0 = time.time()
    yield
    print("{} - done in {:.0f}s".format(title, time.time() - t0))

In [4]:
baseline_data = pd.read_csv('./Clean_datas/baseline_data.csv', sep=",")
data = pd.read_csv('./Clean_datas/clean_data_1.csv', sep=",")

In [5]:
data.head()

Unnamed: 0.1,Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,...,CURRENT_LOAN_LTV,CURRENT_LOAN_INCOME_CREDIT_PERC,CURRENT_LOAN_PAYMENT_RATE,TOTAL_AMT_ANNUITY,TOTAL_AMT_CREDIT,TOTAL_EFFORT_RATE,TOTAL_INCOME_CREDIT_PERC,TOTAL_PAYMENT_RATE,DAYS_EMPLOYED_PERC,INCOME_PER_PERSON
0,0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,...,1.158397,0.498036,0.060749,247829.0815,888586.065,1.223847,0.22789,0.278903,0.067329,202500.0
1,1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,...,1.145199,0.208736,0.027598,292122.185803,2103502.5,1.081934,0.128357,0.138874,0.070862,135000.0
2,2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,...,1.0,0.5,0.05,,,,,,0.011814,67500.0
3,3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,...,1.052803,0.431748,0.094941,,,,,,0.159905,67500.0
4,4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,...,1.0,0.236842,0.042623,,,,,,0.152418,121500.0


In [6]:
data.drop(['Unnamed: 0', 'SK_ID_CURR'], axis=1, inplace=True)

In [7]:
y = data['TARGET']
x = data.drop(['TARGET'], axis=1)
baseline_y = baseline_data['TARGET']
baseline_x = baseline_data.drop(['TARGET'], axis=1) # Note : categorical data already encoded

In [8]:
print(x.shape)
print(y.shape)

(307507, 399)
(307507,)


In [9]:
# Look at the target class breakdown (share of each class)
y.value_counts(normalize=True)

0    0.91927
1    0.08073
Name: TARGET, dtype: float64

The classes are highly imbalanced (about 92% negatives vs. 8% positives), so we use StratifiedKFold for cross-validation for now and will consider SMOTE resampling later.
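
As a minimal sketch of why stratification matters here, the snippet below builds synthetic labels with roughly the same imbalance as `TARGET` and checks that each validation fold keeps the overall positive rate. The names `y_demo`, `X_demo` and `fold_rates` are illustrative and not part of the notebook; SMOTE is not shown.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with roughly the same imbalance as TARGET (about 8% positives)
rng = np.random.default_rng(50)
y_demo = (rng.random(10_000) < 0.08).astype(int)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=50)
overall_rate = y_demo.mean()
# Positive rate inside each validation fold
fold_rates = [y_demo[valid_idx].mean() for _, valid_idx in skf.split(X_demo, y_demo)]

print(f"overall positive rate: {overall_rate:.4f}")
for i, rate in enumerate(fold_rates):
    print(f"fold {i}: positive rate {rate:.4f}")
```

With a plain `KFold`, the per-fold rates could drift away from the overall rate, which makes fold-level metrics noisier on a minority class this small.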

# 3. Preprocessing <a class="anchor" id="3-bullet"></a>

## I - Cleaning <a class="anchor" id="4-bullet"></a>

In [10]:
x.describe()

Unnamed: 0,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,...,CURRENT_LOAN_LTV,CURRENT_LOAN_INCOME_CREDIT_PERC,CURRENT_LOAN_PAYMENT_RATE,TOTAL_AMT_ANNUITY,TOTAL_AMT_CREDIT,TOTAL_EFFORT_RATE,TOTAL_INCOME_CREDIT_PERC,TOTAL_PAYMENT_RATE,DAYS_EMPLOYED_PERC,INCOME_PER_PERSON
count,307507.0,307507.0,307507.0,307495.0,307229.0,307507.0,307507.0,252133.0,307507.0,307507.0,...,307229.0,307507.0,307495.0,216304.0,216314.0,216304.0,216314.0,216304.0,252133.0,307505.0
mean,0.417047,168797.7,599028.6,27108.666786,538397.7,0.020868,-16037.027271,-2384.142254,-4986.131376,-2994.20167,...,1.122994,0.399669,0.053695,941155.9,1926691.0,5.708165,0.154313,0.609689,0.15686,93106.08
std,0.722119,237124.6,402492.6,14493.798379,369447.2,0.013831,4363.982424,2338.327666,3522.88303,1509.454566,...,0.124036,0.507927,0.022481,5921754.0,2459518.0,33.373152,0.246091,3.542178,0.133548,101373.9
min,0.0,25650.0,45000.0,1615.5,40500.0,0.00029,-25229.0,-17912.0,-24672.0,-7197.0,...,0.15,0.011801,0.022073,3006.0,45000.0,0.00383,0.000603,0.001404,-0.0,2812.5
25%,0.0,112500.0,270000.0,16524.0,238500.0,0.010006,-19682.0,-3175.0,-7479.5,-4299.0,...,1.0,0.193802,0.0369,97185.56,752400.0,0.640621,0.072865,0.093214,0.056098,47250.0
50%,0.0,147150.0,513531.0,24903.0,450000.0,0.01885,-15750.0,-1648.0,-4504.0,-3254.0,...,1.1188,0.306272,0.05,300418.3,1305000.0,2.029257,0.116145,0.211615,0.118733,75000.0
75%,1.0,202500.0,808650.0,34596.0,679500.0,0.028663,-12413.0,-767.0,-2010.0,-1720.0,...,1.198,0.495376,0.064043,714291.3,2254457.0,4.373481,0.189675,0.395364,0.219167,112500.0
max,19.0,117000000.0,4050000.0,258025.5,4050000.0,0.072508,-7489.0,0.0,0.0,0.0,...,6.0,208.003328,0.12443,680207900.0,335684700.0,3702.839475,95.097017,264.392053,0.728811,39000000.0


In [11]:
# Defining numerical and categorical columns
categorical_cols = [col for col in x.columns if x[col].dtype == 'object']
numerical_cols = list(x.drop(categorical_cols, axis=1).columns)

In [12]:
# Checking infinite values
  
count = np.isinf(x[numerical_cols]).values.sum()
print("The df contains " + str(count) + " infinite values")

The df contains 8877 infinite values


In [13]:
# Replace infinite values with NaN so the imputer can handle them
x.replace([np.inf, -np.inf], np.nan, inplace=True)

## II - Split train/test and preprocessing <a class="anchor" id="5-bullet"></a>

In [14]:
# Split between train and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=random_state)

In [15]:
# Definition of preprocessing steps

# Preprocessing for numerical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('stdscaler', StandardScaler())
])

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

In [16]:
# Preprocess data: fit on the training set, then apply to the test set
x_train_processed = preprocessor.fit_transform(x_train)
x_test_processed = preprocessor.transform(x_test)

## III - Dimensionality reduction <a class="anchor" id="6-bullet"></a>

To speed up model selection, we reduce the dimensionality of the dataset with PCA, keeping the components that explain 99% of the variance.

In [17]:
# PCA on processed data

print("Dimensions x_train before PCA reduction : ", x_train_processed.shape)
print("Dimensions x_test before PCA reduction : ", x_test_processed.shape)
pca = decomposition.PCA(n_components=0.99)

print("")
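with timer("Proceed PCA on train and test set"):
    # Fit PCA on the training set only, then project the test set with the same
    # components. Refitting on the test set (fit_transform) yields a different
    # projection and component count, as in the 289 vs 284 shapes recorded below.
    x_train_pca = pca.fit_transform(x_train_processed)
    x_test_pca = pca.transform(x_test_processed)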

print("Dimensions x_train after PCA reduction : ", x_train_pca.shape)
print("Dimensions x_test after PCA reduction : ", x_test_pca.shape)

Dimensions x_train before PCA reduction :  (246005, 670)
Dimensions x_test before PCA reduction :  (61502, 670)

Proceed PCA on train and test set - done in 24s
Dimensions x_train after PCA reduction :  (246005, 289)
Dimensions x_test after PCA reduction :  (61502, 284)
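
The variance-ratio form of `n_components` and the fit-on-train convention can be sketched on synthetic data. This is only an illustration under assumed data (20 features driven by 5 latent factors); the `*_demo` and `*_red` names are not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(50)
# Synthetic correlated data: 20 observed features driven by 5 latent factors plus noise
latent = rng.normal(size=(1200, 5))
mixing = rng.normal(size=(5, 20))
X = latent @ mixing + 0.05 * rng.normal(size=(1200, 20))
X_train_demo, X_test_demo = X[:1000], X[1000:]

pca_demo = PCA(n_components=0.99)  # keep enough components for 99% of the variance
X_train_red = pca_demo.fit_transform(X_train_demo)  # learn the projection on train only
X_test_red = pca_demo.transform(X_test_demo)        # reuse the same projection on test

print(X_train_red.shape, X_test_red.shape)
print(pca_demo.explained_variance_ratio_.sum())
```

Because the test set goes through `transform` only, both reduced matrices share the same columns, which a downstream model fitted on the training projection requires.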


## IV - Creation of folds for cv <a class="anchor" id="7-bullet"></a>

In [18]:
folds = create_folds(x_train_pca, y_train, num_folds=num_folds, stratified=stratified, random_state=random_state)
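
`create_folds` comes from `P7_functions` and its body is not shown here. As a rough sketch of what such a helper might look like, assuming it returns a reusable list of `(train_idx, valid_idx)` pairs, the hypothetical `create_folds_sketch` below wraps scikit-learn's splitters.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

def create_folds_sketch(x, y, num_folds=5, stratified=True, random_state=None):
    """Hypothetical stand-in for create_folds: return a list of (train_idx, valid_idx) pairs."""
    splitter_cls = StratifiedKFold if stratified else KFold
    splitter = splitter_cls(n_splits=num_folds, shuffle=True, random_state=random_state)
    # Materialise the generator so the same splits can be reused by every model
    return list(splitter.split(x, y))

rng = np.random.default_rng(50)
x_demo = rng.normal(size=(500, 3))
y_demo = (rng.random(500) < 0.1).astype(int)

folds_demo = create_folds_sketch(x_demo, y_demo, num_folds=5, stratified=True, random_state=50)
print(len(folds_demo), [len(valid) for _, valid in folds_demo])
```

Precomputing the splits once means every model is scored on identical folds, so differences in score are not caused by different partitions.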

# 4. Model testing <a class="anchor" id="8-bullet"></a>

## I - Dummy classifiers <a class="anchor" id="9-bullet"></a>

In [19]:
dummies = test_dummy_classifiers(x, y, strategies_list=None, random_state=random_state, constant=0)
dummies



Unnamed: 0,most_frequent,prior,stratified,uniform,constant
accuracy,0.91927,0.91927,0.852345,0.500398,0.91927
f1,0.0,0.0,0.080629,0.139867,0.0
precision,0.0,0.0,0.08106,0.081222,0.0
recall,0.0,0.0,0.080201,0.503162,0.0
roc_auc,0.5,0.5,0.500178,0.5,0.5
cross_entropy,2.788311,2.788311,5.099888,17.255997,2.788311
fit_time,0.031356,0.01362,0.016092,0.012056,0.013378
predict_time,0.003092,0.002458,0.015548,0.007068,0.002066


In [20]:
print("Average roc_auc : {:.6f}".format(dummies.loc['roc_auc'].mean()))

Average roc_auc : 0.500036
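
The gap between accuracy and ROC AUC for a majority-class baseline can be reproduced on synthetic data. The sketch below assumes labels with about 8% positives and uninformative features; `x_demo`, `y_demo`, `acc` and `auc` are illustrative names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(50)
y_demo = (rng.random(20_000) < 0.08).astype(int)   # about 8% positives, like TARGET
x_demo = rng.normal(size=(len(y_demo), 2))          # uninformative features

# Always predict the majority class
clf = DummyClassifier(strategy='most_frequent').fit(x_demo, y_demo)
acc = accuracy_score(y_demo, clf.predict(x_demo))
auc = roc_auc_score(y_demo, clf.predict_proba(x_demo)[:, 1])

print(f"accuracy: {acc:.3f}, roc_auc: {auc:.3f}")
```

Accuracy rewards the baseline for the class imbalance alone, while ROC AUC stays at chance level, which is why `roc_auc` is a more informative metric to optimize here.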


## II - Quick testing <a class="anchor" id="15-bullet"></a>

In [23]:
# Test some models without hyperparameter optimization

models_list = [
    # 'GradientBoostingClassifier', 
    # 'RandomForestClassifier', 
    'KNeighborsClassifier',
    # 'GaussianProcessClassifier', 
    'LogisticRegression', 
    'RidgeClassifier', 
    # 'SGDClassifier',
    # 'LinearSVC', 
    # 'NuSVC', 
    ## 'SVC', 
    ## 'DecisionTreeClassifier'
]

with timer("Quick test of some classifiers"):
    quick_test_1 = quick_classifiers_test(x_train_pca, y_train, 
                                          models_list=models_list, random_state=random_state, max_iter=10000, n_jobs=-1)

quick_test_1

Quick test of some classifiers - done in 4449s


Unnamed: 0,KNeighborsClassifier,LogisticRegression,RidgeClassifier
accuracy,0.923453,0.91964,0.919331
f1,0.157939,0.055335,0.000705
precision,0.699406,0.53168,0.333333
recall,0.089021,0.029186,0.000353
roc_auc,0.906119,0.772575,0.769695
cross_entropy,2.64385,2.775543,2.786212
fit_time,0.27158,239.291882,1.423131
predict_time,1814.589553,0.139057,0.125254


In [24]:
print("Average roc_auc : {:.6f}".format(quick_test_1.loc['roc_auc'].mean()))

Average roc_auc : 0.816129


## III - Linear models <a class="anchor" id="10-bullet"></a>

In [19]:
# LogisticRegression

model = LogisticRegression(random_state=random_state, max_iter=10000)
param_grid = {'C' : np.linspace(0.1, 1, num=4)}

with timer("Proceed LogisticRegression"):
    LogisticRegression_clf = run_GridSearchCV(model, x_train_pca, y_train, folds, param_grid, optimized_metric)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
Best parameters on training set :
{'C': 0.1}
Best score on training set : 0.769
Proceed LogisticRegression - done in 98s
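
`run_GridSearchCV` is also defined in `P7_functions` and not shown. The sketch below is a hypothetical equivalent, assuming it wraps scikit-learn's `GridSearchCV` with the precomputed folds and the chosen scoring metric; `run_grid_search_sketch` and the synthetic data are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def run_grid_search_sketch(model, x, y, folds, param_grid, scoring):
    """Hypothetical stand-in for run_GridSearchCV: fit a grid search on fixed folds."""
    search = GridSearchCV(model, param_grid, scoring=scoring, cv=folds, verbose=1)
    search.fit(x, y)
    print("Best parameters (cross-validation):", search.best_params_)
    print("Best mean validation score: {:.3f}".format(search.best_score_))
    return search

# Small imbalanced synthetic problem (about 90% negatives)
x_demo, y_demo = make_classification(n_samples=600, n_features=10, weights=[0.9], random_state=50)
folds_demo = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=50).split(x_demo, y_demo))

search_demo = run_grid_search_sketch(
    LogisticRegression(max_iter=10000),
    x_demo, y_demo, folds_demo,
    param_grid={'C': np.linspace(0.1, 1, num=4)},
    scoring='roc_auc',
)
```

If the helper reports `GridSearchCV.best_score_`, that value is the mean score over the held-out validation folds, not a score computed on data the model was fitted on.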


In [20]:
# RidgeClassifier

model = RidgeClassifier(random_state=random_state, max_iter=10000)
param_grid = {'alpha' : np.linspace(1, 10, num=4, dtype=int)}

with timer("Proceed RidgeClassifier"):
    RidgeClassifier_clf = run_GridSearchCV(model, x_train_pca, y_train, folds, param_grid, optimized_metric)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
Best parameters on training set :
{'alpha': 10}
Best score on training set : 0.766
Proceed RidgeClassifier - done in 16s


## IV - KNN <a class="anchor" id="11-bullet"></a>

In [21]:
# KNeighborsClassifier

model = KNeighborsClassifier()
param_grid = {'n_neighbors' : np.linspace(3, 10, num=4, dtype=int)}

with timer("Proceed KNeighborsClassifier"):
    KNeighborsClassifier_clf = run_GridSearchCV(model, x_train_pca, y_train, folds, param_grid, optimized_metric)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
Best parameters on training set :
{'n_neighbors': 10}
Best score on training set : 0.609
Proceed KNeighborsClassifier - done in 1758s


## V - SVM <a class="anchor" id="12-bullet"></a>

In [22]:
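# LinearSVC
# Note: with loss='squared_hinge' (the default), penalty='l1' is not supported when
# dual=True, so the 'l1' candidates fail to fit and are scored NaN in the log below.

model = LinearSVC(random_state=random_state, max_iter=10000)
param_grid = {'penalty' : ['l1', 'l2'], 'C' : np.linspace(0.1, 1, num=4)}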

with timer("Proceed LinearSVC"):
    LinearSVC_clf = run_GridSearchCV(model, x_train_pca, y_train, folds, param_grid, optimized_metric)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


20 fits failed out of a total of 40.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\robin\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\robin\anaconda3\lib\site-packages\sklearn\svm\_classes.py", line 257, in fit
    self.coef_, self.intercept_, self.n_iter_ = _fit_liblinear(
  File "C:\Users\robin\anaconda3\lib\site-packages\sklearn\svm\_base.py", line 1185, in _fit_liblinear
    solver_type = _get_liblinear_solver_type(multi_class, penalty, loss, dual)
  File "C:\Users\robin\anaconda3\lib\site-packages\sklearn\svm\_base.py", li

Best parameters on training set :
{'C': 0.7, 'penalty': 'l2'}
Best score on training set : 0.768
Proceed LinearSVC - done in 13724s
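
The 20 failed fits are the `penalty='l1'` candidates (4 values of `C` times 5 folds). One way to make the whole grid valid, sketched here on synthetic data and not taken from the notebook, is to solve the primal problem with `dual=False`, which supports both penalties with the squared hinge loss.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

x_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=50)

# dual=False supports both 'l1' and 'l2' with the default squared hinge loss
model = LinearSVC(dual=False, random_state=50, max_iter=10000)
param_grid = {'penalty': ['l1', 'l2'], 'C': np.linspace(0.1, 1, num=4)}

search = GridSearchCV(model, param_grid, scoring='roc_auc', cv=5)
search.fit(x_demo, y_demo)

# Every candidate should now have a finite mean validation score
scores = search.cv_results_['mean_test_score']
print(len(scores), np.isnan(scores).sum())
```

The scikit-learn documentation also recommends `dual=False` when the number of samples exceeds the number of features, which is the case for the PCA-reduced training set.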




In [23]:
# SVC

model = SVC(kernel='rbf', random_state=random_state, max_iter=10000)
param_grid = {'C' : np.linspace(0.1, 1, num=4)}

with timer("Proceed SVC"):
    SVC_clf = run_GridSearchCV(model, x_train_pca, y_train, folds, param_grid, optimized_metric)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
Best parameters on training set :
{'C': 0.4}
Best score on training set : 0.599
Proceed SVC - done in 7941s




## VI - Trees and ensemble methods <a class="anchor" id="13-bullet"></a>

In [24]:
# DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=random_state)
param_grid = {'min_samples_split' : [2, 4, 8], 'min_samples_leaf' : [1, 3, 5]}

with timer("Proceed DecisionTreeClassifier"):
    DecisionTreeClassifier_clf = run_GridSearchCV(model, x_train_pca, y_train, folds, param_grid, optimized_metric)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
Best parameters on training set :
{'min_samples_leaf': 5, 'min_samples_split': 2}
Best score on training set : 0.541
Proceed DecisionTreeClassifier - done in 1815s


In [25]:
# GradientBoostingClassifier

model = GradientBoostingClassifier(random_state=random_state)
param_grid = {'n_estimators' : [10, 100, 500]}

with timer("Proceed GradientBoostingClassifier"):
    GradientBoostingClassifier_clf = run_GridSearchCV(model, x_train_pca, y_train, folds, param_grid, optimized_metric)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best parameters on training set :
{'n_estimators': 500}
Best score on training set : 0.760
Proceed GradientBoostingClassifier - done in 43891s


In [26]:
# RandomForestClassifier

model = RandomForestClassifier(random_state=random_state)
param_grid = {'n_estimators' : [10, 100, 500]}

with timer("Proceed RandomForestClassifier"):
    RandomForestClassifier_clf = run_GridSearchCV(model, x_train_pca, y_train, folds, param_grid, optimized_metric)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best parameters on training set :
{'n_estimators': 500}
Best score on training set : 0.670
Proceed RandomForestClassifier - done in 12003s


## VII - Neural networks <a class="anchor" id="14-bullet"></a>

## VIII - Compare <a class="anchor" id="16-bullet"></a>

In [None]:
# Compare scores in this iteration

cv_clfs = {
    'LogisticRegression' : LogisticRegression_clf,
    'RidgeClassifier' : RidgeClassifier_clf,
    'KNeighborsClassifier' : KNeighborsClassifier_clf,
    'LinearSVC' : LinearSVC_clf,
    'SVC' : SVC_clf,
    'DecisionTreeClassifier' : DecisionTreeClassifier_clf,
    'GradientBoostingClassifier' : GradientBoostingClassifier_clf,
    'RandomForestClassifier' : RandomForestClassifier_clf,
}

iteration_1 = pd.DataFrame()

for key, clf in cv_clfs.items():
    iteration_1[key] = [clf.best_score_, clf.best_params_]
    
iteration_1.index = ['best_score_ : ' + optimized_metric, 'best_params_']
iteration_1

In [None]:
iteration_1.to_csv('./Scores/iteration_1.csv')

Values tested in the first iteration (cleaned_data_1):
- LogisticRegression : {'C' : np.linspace(0.1, 1, num=4)}, best : C = 0.1 -> try smaller values
- RidgeClassifier : {'alpha' : np.linspace(1, 10, num=4, dtype=int)}, best : alpha = 10 -> try larger values
- KNeighborsClassifier : {'n_neighbors' : np.linspace(3, 10, num=4, dtype=int)}, best : n_neighbors = 10 -> try larger values
- LinearSVC : {'penalty' : ['l1', 'l2'], 'C' : np.linspace(0.1, 1, num=4)}, best :
    - C = 0.7 -> try nearby values
    - penalty = l2 -> keep
- SVC : {'C' : np.linspace(0.1, 1, num=4)}, best : C = 0.4 -> try nearby values
- DecisionTreeClassifier : {'min_samples_split' : [2, 4, 8], 'min_samples_leaf' : [1, 3, 5]}, best :
    - min_samples_split = 2 -> keep
    - min_samples_leaf = 5 -> try larger values
- GradientBoostingClassifier : {'n_estimators' : [10, 100, 500]}, best : n_estimators = 500 -> try larger values
- RandomForestClassifier : {'n_estimators' : [10, 100, 500]}, best : n_estimators = 500 -> try larger values
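
The notes above translate into shifted search ranges for a second iteration. The grids below are a sketch only: the specific bounds are illustrative assumptions, chosen to extend each range in the direction indicated, and `next_param_grids` is a hypothetical name.

```python
import numpy as np

# Hypothetical second-iteration grids; all ranges are illustrative assumptions
next_param_grids = {
    # Best C was the lower bound (0.1): extend the range downwards
    'LogisticRegression': {'C': np.logspace(-4, -1, num=4)},
    # Best alpha was the upper bound (10): extend the range upwards
    'RidgeClassifier': {'alpha': [10, 30, 100, 300]},
    # Best n_neighbors was the upper bound (10): extend the range upwards
    'KNeighborsClassifier': {'n_neighbors': [10, 20, 50, 100]},
    # Keep penalty='l2' and search around C = 0.7
    'LinearSVC': {'penalty': ['l2'], 'C': np.linspace(0.5, 0.9, num=5)},
    # Search around C = 0.4
    'SVC': {'C': np.linspace(0.2, 0.6, num=5)},
    # Keep min_samples_split=2 and try larger leaves
    'DecisionTreeClassifier': {'min_samples_split': [2], 'min_samples_leaf': [5, 10, 20, 50]},
    # Best n_estimators was the upper bound (500): extend the range upwards
    'GradientBoostingClassifier': {'n_estimators': [500, 1000]},
    'RandomForestClassifier': {'n_estimators': [500, 1000]},
}

for name, grid in next_param_grids.items():
    print(name, {key: list(values) for key, values in grid.items()})
```

Each grid keeps the previous best value as one of its candidates, so the second iteration can be compared directly against the first.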