# Project 7: Implement a scoring model - DISCH Anthony - Notebook

You are a Data Scientist at a financial company named "Prêt à dépenser", which offers consumer credit to people with little or no loan history.

The company wants to implement a "credit scoring" tool that computes the probability that a client will repay their loan, then classifies the application as granted or refused. It therefore wants to develop a classification algorithm that relies on varied data sources (behavioral data, data from other financial institutions, etc.).

In addition, the client relationship managers have reported that clients increasingly ask for transparency about credit-granting decisions. This demand for transparency is fully in line with the values the company wants to embody.

Prêt à dépenser has therefore decided to develop an interactive dashboard so that client relationship managers can explain credit-granting decisions as transparently as possible, and so that clients can access and easily explore their personal information.

# Your mission

- Build a scoring model that automatically predicts the probability that a client will default.
- Build an interactive dashboard for client relationship managers that lets them interpret the model's predictions and improves their knowledge of the client.

Michaël, your manager, encourages you to select a Kaggle kernel to ease the preparation of the data needed to build the scoring model. You will analyze this kernel and adapt it to make sure it meets the needs of your mission.

You can then focus on building the model, optimizing it and understanding it.

<b>I use the following Kaggle kernel: https://www.kaggle.com/code/willkoehrsen/start-here-a-gentle-introduction. In particular, I reused its file loading and exploratory data analysis.</b>

## Dashboard specifications

Michaël has given you specifications for the interactive dashboard. It must provide at least the following features:
- Display the score and its interpretation for each client in a way that is intelligible to someone who is not a data science expert.
- Display descriptive information about a client (through a filter system).
- Compare a client's descriptive information with that of all clients or of a group of similar clients.

# Metrics

The positive class (1) is a client who does not repay their loan.

- TP (true positives): cases where we predicted 1 and the actual value was also 1.
    - credit refused, client would not have repaid: no loss and no gain
- TN (true negatives): cases where we predicted 0 and the actual value was 0.
    - credit granted, client repays: gain of the loan interest
- FP (false positives): cases where we predicted 1 and the actual value was 0.
    - credit refused, client would have repaid: loss of the loan interest
- FN (false negatives): cases where we predicted 0 and the actual value was 1.
    - credit granted, client does not repay: loss of the loan interest and of part of the loan

<b>The overriding goal is to minimize FN.</b>

- Precision: the number of correct positive predictions divided by the number of positive predictions made by the classifier.

![PRECISION](img/precision.png)

- Recall: the number of correct positive predictions divided by the number of actual positive samples.

![RECALL](img/recall.png)

- F1-score: a single measure of a test's accuracy, defined as the harmonic mean of precision and recall.

![F1-SCORE](img/f1.png)

- AUC: the area under the ROC curve.

![AUC](img/auc.png)

- Confusion matrix


![CONFUSION-MATRIX](img/confusion_matrix.png)
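
As a minimal sketch of the definitions above, the following computes precision, recall and F1 directly from confusion-matrix counts. The function name `classification_metrics` and the counts in the usage lines are illustrative assumptions, not values from this notebook.

```python
def classification_metrics(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    # Precision: share of predicted positives that are truly positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up counts, for illustration only
precision, recall, f1 = classification_metrics(tp=30, fp=20, fn=10)
print(precision, recall, f1)
```

Because FN appears only in the recall denominator, a model tuned to minimize FN is in effect tuned for high recall, usually at the price of lower precision.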

# Imports

In [1]:
# numpy and pandas for data manipulation
import numpy as np
import pandas as pd 

# LightGBM for the classifier, sklearn for feature scaling
import lightgbm as lgb
from sklearn.preprocessing import MinMaxScaler

# File system management
import os

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

import joblib
from pydantic import BaseModel, create_model, Field

In [2]:
# import sys
# !{sys.executable} -m pip install -U scikit-learn

In [3]:
# import sklearn
# print(sklearn.__version__)

# The data

## Read application_train

In [5]:
data = pd.read_csv('data/application_train.csv')

In [6]:
data.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


## Add engineered features

In [7]:
data['CREDIT_INCOME_PERCENT'] = data['AMT_CREDIT'] / data['AMT_INCOME_TOTAL']
data['ANNUITY_INCOME_PERCENT'] = data['AMT_ANNUITY'] / data['AMT_INCOME_TOTAL']
data['CREDIT_TERM'] = data['AMT_ANNUITY'] / data['AMT_CREDIT']
data['DAYS_EMPLOYED_PERCENT'] = data['DAYS_EMPLOYED'] / data['DAYS_BIRTH']

In [8]:
# Age in whole years (DAYS_BIRTH is a negative number of days)
data['YEARS_BIRTH'] = (data['DAYS_BIRTH'] / -365).astype(int)
data = data.reset_index(drop=True)

In [9]:
# Columns stored as strings (dtype object) are the categorical features
categorical_feats = [
    f for f in data.columns if data[f].dtype == 'object'
]

for f_ in categorical_feats:
    # data[f_], _ = pd.factorize(data[f_])
    # Set feature type as categorical
    data[f_] = data[f_].astype('category')

In [10]:
len(categorical_feats)

16

In [11]:
numeric_feats = data.select_dtypes(include=np.number).columns.tolist()

In [12]:
len(numeric_feats)

111

## Features by importance split score

In [13]:
columns_feature_importance_split_score = ['AMT_CREDIT',
 'AMT_GOODS_PRICE',
 'CODE_GENDER',
 'CREDIT_TERM',
 'DAYS_BIRTH',
 'DAYS_EMPLOYED',
 'DAYS_EMPLOYED_PERCENT',
 'DEF_60_CNT_SOCIAL_CIRCLE',
 'EXT_SOURCE_1',
 'EXT_SOURCE_2',
 'EXT_SOURCE_3',
 'FLAG_DOCUMENT_3',
 'FLAG_EMP_PHONE',
 'NAME_CONTRACT_TYPE',
 'NAME_EDUCATION_TYPE',
 'OWN_CAR_AGE',
 'REGION_RATING_CLIENT',
 'REGION_RATING_CLIENT_W_CITY',
 'REG_CITY_NOT_LIVE_CITY',
 'DEF_30_CNT_SOCIAL_CIRCLE',
 'FLAG_WORK_PHONE',
 'OCCUPATION_TYPE',
 'ORGANIZATION_TYPE',
 'AMT_ANNUITY',
 'ANNUITY_INCOME_PERCENT',
 'NAME_INCOME_TYPE',
 'FLAG_OWN_CAR',
 'DAYS_ID_PUBLISH',
 'REG_CITY_NOT_WORK_CITY',
 'AMT_REQ_CREDIT_BUREAU_DAY',
 'DAYS_LAST_PHONE_CHANGE',
 'AMT_REQ_CREDIT_BUREAU_YEAR',
 'EMERGENCYSTATE_MODE',
 'FLOORSMAX_MEDI',
 'FLAG_PHONE',
 'LIVE_CITY_NOT_WORK_CITY',
 'NAME_FAMILY_STATUS',
 'CREDIT_INCOME_PERCENT',
 'AMT_REQ_CREDIT_BUREAU_WEEK',
 'FLAG_DOCUMENT_5',
 'REG_REGION_NOT_WORK_REGION',
 'FLAG_DOCUMENT_8',
 'FLOORSMAX_AVG',
 'NAME_HOUSING_TYPE',
 'AMT_REQ_CREDIT_BUREAU_QRT',
 'HOUSETYPE_MODE',
 'LIVINGAREA_AVG',
 'REG_REGION_NOT_LIVE_REGION',
 'TOTALAREA_MODE',
 'YEARS_BEGINEXPLUATATION_MEDI',
 'AMT_INCOME_TOTAL'] # need to add 'AMT_INCOME_TOTAL' for feature engineering

## Saving for the API/Dashboard

In [14]:
columns_feature_target = ['TARGET']

In [15]:
columns_feature_displayable = ['SK_ID_CURR','CODE_GENDER','YEARS_BIRTH','NAME_FAMILY_STATUS','CNT_CHILDREN',
             'NAME_EDUCATION_TYPE','FLAG_OWN_CAR','FLAG_OWN_REALTY','NAME_HOUSING_TYPE',
             'NAME_INCOME_TYPE','AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY']

In [16]:
print(len(columns_feature_displayable))

13


In [17]:
data[columns_feature_displayable]

Unnamed: 0,SK_ID_CURR,CODE_GENDER,YEARS_BIRTH,NAME_FAMILY_STATUS,CNT_CHILDREN,NAME_EDUCATION_TYPE,FLAG_OWN_CAR,FLAG_OWN_REALTY,NAME_HOUSING_TYPE,NAME_INCOME_TYPE,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY
0,100002,M,25,Single / not married,0,Secondary / secondary special,N,Y,House / apartment,Working,202500.0,406597.5,24700.5
1,100003,F,45,Married,0,Higher education,N,N,House / apartment,State servant,270000.0,1293502.5,35698.5
2,100004,M,52,Single / not married,0,Secondary / secondary special,Y,Y,House / apartment,Working,67500.0,135000.0,6750.0
3,100006,F,52,Civil marriage,0,Secondary / secondary special,N,Y,House / apartment,Working,135000.0,312682.5,29686.5
4,100007,M,54,Single / not married,0,Secondary / secondary special,N,Y,House / apartment,Working,121500.0,513000.0,21865.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...
307506,456251,M,25,Separated,0,Secondary / secondary special,N,N,With parents,Working,157500.0,254700.0,27558.0
307507,456252,F,56,Widow,0,Secondary / secondary special,N,Y,House / apartment,Pensioner,72000.0,269550.0,12001.5
307508,456253,F,41,Separated,0,Higher education,N,Y,House / apartment,Working,153000.0,677664.0,29979.0
307509,456254,F,32,Married,0,Secondary / secondary special,N,Y,House / apartment,Commercial associate,171000.0,370107.0,20205.0


In [18]:
print(len(columns_feature_importance_split_score))

51


In [19]:
data[columns_feature_importance_split_score]

Unnamed: 0,AMT_CREDIT,AMT_GOODS_PRICE,CODE_GENDER,CREDIT_TERM,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_EMPLOYED_PERCENT,DEF_60_CNT_SOCIAL_CIRCLE,EXT_SOURCE_1,EXT_SOURCE_2,...,FLAG_DOCUMENT_8,FLOORSMAX_AVG,NAME_HOUSING_TYPE,AMT_REQ_CREDIT_BUREAU_QRT,HOUSETYPE_MODE,LIVINGAREA_AVG,REG_REGION_NOT_LIVE_REGION,TOTALAREA_MODE,YEARS_BEGINEXPLUATATION_MEDI,AMT_INCOME_TOTAL
0,406597.5,351000.0,M,0.060749,-9461,-637,0.067329,2.0,0.083037,0.262949,...,0,0.0833,House / apartment,0.0,block of flats,0.0190,0,0.0149,0.9722,202500.0
1,1293502.5,1129500.0,F,0.027598,-16765,-1188,0.070862,0.0,0.311267,0.622246,...,0,0.2917,House / apartment,0.0,block of flats,0.0549,0,0.0714,0.9851,270000.0
2,135000.0,135000.0,M,0.050000,-19046,-225,0.011814,0.0,,0.555912,...,0,,House / apartment,0.0,,,0,,,67500.0
3,312682.5,297000.0,F,0.094941,-19005,-3039,0.159905,0.0,,0.650442,...,0,,House / apartment,,,,0,,,135000.0
4,513000.0,513000.0,M,0.042623,-19932,-3038,0.152418,0.0,,0.322738,...,1,,House / apartment,0.0,,,0,,,121500.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307506,254700.0,225000.0,M,0.108198,-9327,-236,0.025303,0.0,0.145570,0.681632,...,1,0.6042,With parents,,block of flats,0.1965,0,0.2898,0.9876,157500.0
307507,269550.0,225000.0,F,0.044524,-20775,365243,-17.580890,0.0,,0.115992,...,0,0.0833,House / apartment,,block of flats,0.0257,0,0.0214,0.9727,72000.0
307508,677664.0,585000.0,F,0.044239,-14966,-7921,0.529266,0.0,0.744026,0.535722,...,0,0.1667,House / apartment,0.0,block of flats,0.9279,0,0.7970,0.9816,153000.0
307509,370107.0,319500.0,F,0.054592,-11961,-4786,0.400134,0.0,,0.514163,...,0,0.0417,House / apartment,0.0,block of flats,0.0061,0,0.0086,0.9771,171000.0
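
In the output above, the row at index 307507 has `DAYS_EMPLOYED` equal to 365243 (about 1,000 years), which produces a `DAYS_EMPLOYED_PERCENT` of -17.58. The referenced Kaggle kernel handles this value as an anomaly by replacing it with a missing value. Below is a minimal sketch of that treatment on a toy frame; the three rows are copied from the table above, and the flag column name `DAYS_EMPLOYED_ANOM` is an assumption.

```python
import numpy as np
import pandas as pd

# Three rows copied from the output above (indices 307506 to 307508)
toy = pd.DataFrame({
    'DAYS_BIRTH': [-9327, -20775, -14966],
    'DAYS_EMPLOYED': [-236, 365243, -7921],
})

# Flag the placeholder value, then turn it into a missing value
toy['DAYS_EMPLOYED_ANOM'] = toy['DAYS_EMPLOYED'] == 365243
toy['DAYS_EMPLOYED'] = toy['DAYS_EMPLOYED'].replace(365243, np.nan)

# The ratio is now NaN for the flagged row instead of a large negative number
toy['DAYS_EMPLOYED_PERCENT'] = toy['DAYS_EMPLOYED'] / toy['DAYS_BIRTH']
print(toy)
```

Applied before the engineered features are computed, this keeps the placeholder from distorting `DAYS_EMPLOYED_PERCENT`.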


In [20]:
columns_features_for_api_dashboard = {'TARGET'}
columns_features_for_api_dashboard.update(columns_feature_displayable)
columns_features_for_api_dashboard.update(columns_feature_importance_split_score)

columns_features_for_api_dashboard

{'AMT_ANNUITY',
 'AMT_CREDIT',
 'AMT_GOODS_PRICE',
 'AMT_INCOME_TOTAL',
 'AMT_REQ_CREDIT_BUREAU_DAY',
 'AMT_REQ_CREDIT_BUREAU_QRT',
 'AMT_REQ_CREDIT_BUREAU_WEEK',
 'AMT_REQ_CREDIT_BUREAU_YEAR',
 'ANNUITY_INCOME_PERCENT',
 'CNT_CHILDREN',
 'CODE_GENDER',
 'CREDIT_INCOME_PERCENT',
 'CREDIT_TERM',
 'DAYS_BIRTH',
 'DAYS_EMPLOYED',
 'DAYS_EMPLOYED_PERCENT',
 'DAYS_ID_PUBLISH',
 'DAYS_LAST_PHONE_CHANGE',
 'DEF_30_CNT_SOCIAL_CIRCLE',
 'DEF_60_CNT_SOCIAL_CIRCLE',
 'EMERGENCYSTATE_MODE',
 'EXT_SOURCE_1',
 'EXT_SOURCE_2',
 'EXT_SOURCE_3',
 'FLAG_DOCUMENT_3',
 'FLAG_DOCUMENT_5',
 'FLAG_DOCUMENT_8',
 'FLAG_EMP_PHONE',
 'FLAG_OWN_CAR',
 'FLAG_OWN_REALTY',
 'FLAG_PHONE',
 'FLAG_WORK_PHONE',
 'FLOORSMAX_AVG',
 'FLOORSMAX_MEDI',
 'HOUSETYPE_MODE',
 'LIVE_CITY_NOT_WORK_CITY',
 'LIVINGAREA_AVG',
 'NAME_CONTRACT_TYPE',
 'NAME_EDUCATION_TYPE',
 'NAME_FAMILY_STATUS',
 'NAME_HOUSING_TYPE',
 'NAME_INCOME_TYPE',
 'OCCUPATION_TYPE',
 'ORGANIZATION_TYPE',
 'OWN_CAR_AGE',
 'REGION_RATING_CLIENT',
 'REGION_RATING

In [21]:
print(len(columns_features_for_api_dashboard))

56


In [22]:
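# Index with a list: recent pandas versions reject a set as a column indexer.
# dropna() keeps only the rows without any missing value (15,704 rows remain, see the shapes below).
data = data[list(columns_features_for_api_dashboard)].dropna()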

In [23]:
target = data[columns_feature_target]

features = data[list(columns_features_for_api_dashboard)].drop(columns=columns_feature_target)

In [24]:
from sklearn.model_selection import train_test_split

# Separate X and y
# Split the data into train and test sets (75/25, see test_size) with sklearn's train_test_split
X_train, X_test, y_train, y_test = train_test_split(features,
                                                    target,
                                                    stratify=target,
                                                    test_size = 0.25, random_state = 0)

X_test.reset_index(inplace=True, drop=True)
y_test.reset_index(inplace=True, drop=True)

In [25]:
X_test.shape

(3926, 55)

In [26]:
y_test.shape

(3926, 1)

In [27]:
db_test = X_test.copy()

In [28]:
db_test.loc[:, 'TARGET'] = y_test

In [29]:
db_test.shape

(3926, 56)

## Keeping only useful features

In [30]:
print("All categorical features", len(categorical_feats), categorical_feats)
categorical_feats_filtered = list(set(columns_feature_importance_split_score).intersection(categorical_feats))
print("\nFiltered categorical features", len(categorical_feats_filtered), categorical_feats_filtered)

All categorical features 16 ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE']

Filtered categorical features 11 ['NAME_INCOME_TYPE', 'CODE_GENDER', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'ORGANIZATION_TYPE', 'EMERGENCYSTATE_MODE', 'FLAG_OWN_CAR', 'NAME_CONTRACT_TYPE', 'HOUSETYPE_MODE', 'OCCUPATION_TYPE', 'NAME_HOUSING_TYPE']


In [31]:
print("All numeric features", len(numeric_feats), numeric_feats)
numeric_feats_filtered = list(set(columns_feature_importance_split_score).intersection(numeric_feats))
print("\nFiltered numeric features", len(numeric_feats_filtered), numeric_feats_filtered)

All numeric features 111 ['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGIN

## OneHot Encoding

In [32]:
cat_array = data[categorical_feats_filtered]
cat_array

Unnamed: 0,NAME_INCOME_TYPE,CODE_GENDER,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,ORGANIZATION_TYPE,EMERGENCYSTATE_MODE,FLAG_OWN_CAR,NAME_CONTRACT_TYPE,HOUSETYPE_MODE,OCCUPATION_TYPE,NAME_HOUSING_TYPE
51,Commercial associate,M,Higher education,Married,Services,No,Y,Cash loans,block of flats,Managers,House / apartment
71,Working,M,Secondary / secondary special,Married,Business Entity Type 3,No,Y,Cash loans,block of flats,Laborers,House / apartment
93,Commercial associate,F,Secondary / secondary special,Married,Business Entity Type 3,No,Y,Cash loans,block of flats,Sales staff,With parents
124,Working,F,Secondary / secondary special,Separated,Self-employed,No,Y,Cash loans,block of flats,Laborers,House / apartment
152,Commercial associate,F,Higher education,Married,Trade: type 7,No,Y,Cash loans,block of flats,Managers,House / apartment
...,...,...,...,...,...,...,...,...,...,...,...
307407,Commercial associate,F,Higher education,Married,Self-employed,No,Y,Cash loans,block of flats,Sales staff,House / apartment
307416,Working,F,Secondary / secondary special,Single / not married,Transport: type 3,Yes,Y,Cash loans,block of flats,High skill tech staff,House / apartment
307449,Working,M,Incomplete higher,Married,Business Entity Type 3,No,Y,Cash loans,block of flats,Sales staff,House / apartment
307456,Working,F,Secondary / secondary special,Married,Business Entity Type 2,No,Y,Cash loans,block of flats,Cleaning staff,House / apartment


In [33]:
from sklearn.preprocessing import OneHotEncoder

print('Features shape before one-hot encoding: \t', cat_array.shape)
# one-hot encoding of categorical variables
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(cat_array)
cat_array_encoded = ohe.transform(cat_array).toarray()
print('\nFeatures shape after one-hot encoding: \t\t', cat_array_encoded.shape)

Features shape before one-hot encoding: 	 (15704, 11)

Features shape after one-hot encoding: 		 (15704, 104)
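
To illustrate what `handle_unknown='ignore'` means for the API, here is a minimal pure-Python sketch of one-hot encoding in which a category not seen at fit time yields an all-zero block. It illustrates the behaviour and is not scikit-learn's implementation; the helper names `fit_categories` and `one_hot` are hypothetical.

```python
def fit_categories(rows):
    """Collect the sorted distinct values of each column, like a fitted encoder."""
    n_cols = len(rows[0])
    return [sorted({row[j] for row in rows}) for j in range(n_cols)]


def one_hot(row, categories):
    """Encode one row; an unseen value leaves its column block at all zeros."""
    encoded = []
    for value, cats in zip(row, categories):
        encoded.extend(1.0 if value == cat else 0.0 for cat in cats)
    return encoded


# Toy rows with two categorical columns
train_rows = [("M", "Working"), ("F", "State servant"), ("F", "Working")]
categories = fit_categories(train_rows)

known = one_hot(("M", "Working"), categories)
unknown = one_hot(("M", "Student"), categories)  # 'Student' was never seen
print(known)
print(unknown)
```

The practical consequence is that a client with an unseen category is still scored, but that category contributes no information to the prediction.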


In [34]:
ohe.categories_

[array(['Businessman', 'Commercial associate', 'State servant', 'Working'],
       dtype=object),
 array(['F', 'M'], dtype=object),
 array(['Academic degree', 'Higher education', 'Incomplete higher',
        'Lower secondary', 'Secondary / secondary special'], dtype=object),
 array(['Civil marriage', 'Married', 'Separated', 'Single / not married',
        'Widow'], dtype=object),
 array(['Advertising', 'Agriculture', 'Bank', 'Business Entity Type 1',
        'Business Entity Type 2', 'Business Entity Type 3', 'Cleaning',
        'Construction', 'Culture', 'Electricity', 'Emergency',
        'Government', 'Hotel', 'Housing', 'Industry: type 1',
        'Industry: type 10', 'Industry: type 11', 'Industry: type 12',
        'Industry: type 2', 'Industry: type 3', 'Industry: type 4',
        'Industry: type 5', 'Industry: type 6', 'Industry: type 7',
        'Industry: type 8', 'Industry: type 9', 'Insurance',
        'Kindergarten', 'Legal Services', 'Medicine', 'Military', 'Mobile',
    

In [35]:
cat_array_encoded.shape

(15704, 104)

## Scaling

In [36]:
num_array = data[numeric_feats_filtered]
num_array

Unnamed: 0,DAYS_EMPLOYED,OWN_CAR_AGE,REGION_RATING_CLIENT_W_CITY,AMT_REQ_CREDIT_BUREAU_DAY,FLOORSMAX_MEDI,LIVINGAREA_AVG,FLAG_DOCUMENT_5,AMT_INCOME_TOTAL,AMT_ANNUITY,AMT_REQ_CREDIT_BUREAU_QRT,...,ANNUITY_INCOME_PERCENT,FLAG_DOCUMENT_3,AMT_CREDIT,DEF_30_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,REGION_RATING_CLIENT,TOTALAREA_MODE,REG_REGION_NOT_LIVE_REGION,AMT_GOODS_PRICE,FLOORSMAX_AVG
51,-6977,7.0,2,0.0,0.4583,0.5878,0,540000.0,34596.0,0.0,...,0.064067,1,675000.0,0.0,-1285.0,2,0.5149,0,675000.0,0.4583
71,-892,22.0,2,0.0,0.3333,0.0933,0,103500.0,24435.0,0.0,...,0.236087,1,573628.5,1.0,-2053.0,2,0.1324,0,463500.0,0.3333
93,-1249,17.0,2,0.0,0.3333,0.2675,0,112500.0,27954.0,0.0,...,0.248480,1,862560.0,0.0,-1234.0,2,0.2576,0,720000.0,0.3333
124,-4375,8.0,2,0.0,0.1667,0.0903,0,202500.0,16789.5,0.0,...,0.082911,1,260725.5,0.0,-1782.0,2,0.0710,0,198000.0,0.1667
152,-2311,4.0,2,0.0,0.9167,0.7187,0,202500.0,53329.5,0.0,...,0.263356,0,675000.0,0.0,-1792.0,2,0.7334,0,675000.0,0.9167
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307407,-1641,4.0,2,0.0,0.1667,0.0519,0,261000.0,47673.0,0.0,...,0.182655,1,711454.5,2.0,-572.0,2,0.0454,0,643500.0,0.1667
307416,-1056,20.0,3,0.0,0.0417,0.0125,0,90000.0,21982.5,0.0,...,0.244250,1,327024.0,2.0,-955.0,3,0.0106,0,270000.0,0.0417
307449,-3389,15.0,2,0.0,0.1667,0.0641,0,315000.0,38974.5,0.0,...,0.123729,0,1175314.5,1.0,0.0,2,0.0821,0,1053000.0,0.1667
307456,-5452,5.0,2,0.0,0.0417,0.0067,0,94500.0,15075.0,0.0,...,0.159524,1,270000.0,0.0,-2299.0,2,0.0061,0,270000.0,0.0417


In [37]:
from sklearn.preprocessing import MinMaxScaler

print('Features shape before imputing/scaling: \t', num_array.shape)
# Scale each feature to 0-1
scaler = MinMaxScaler(feature_range = (0, 1))

# Fit on the full filtered dataset (num_array is built from data, not from X_train)
scaler.fit(num_array)
# Scale
num_array = scaler.transform(num_array)
print('\nFeatures shape after imputing/scaling: \t\t', num_array.shape)

Features shape before imputing/scaling: 	 (15704, 40)

Features shape after imputing/scaling: 		 (15704, 40)
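
For reference, min-max scaling maps each value to (x - min) / (max - min), using the minimum and maximum seen at fit time. The pure-Python sketch below (hypothetical helper names, made-up amounts) shows that a value outside the fitted range lands outside [0, 1], which can happen when a new client is sent to the API.

```python
def fit_min_max(values):
    """Return the (min, max) pair learned from the fitted values."""
    return min(values), max(values)


def min_max_scale(x, lo, hi):
    """Scale x with the fitted bounds; a constant feature maps to 0.0."""
    if hi == lo:
        return 0.0
    return (x - lo) / (hi - lo)


# Made-up income values used for fitting
fitted = [100000.0, 200000.0, 300000.0]
lo, hi = fit_min_max(fitted)

scaled = [min_max_scale(v, lo, hi) for v in fitted]
outside = min_max_scale(400000.0, lo, hi)  # larger than anything seen at fit time
print(scaled, outside)
```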


# Model

https://pydantic-docs.helpmanual.io/usage/models/#model-creation-from-namedtuple-or-typeddict

In [38]:
X = np.concatenate([cat_array_encoded, num_array], axis=1)
X = np.asarray(X)
y = data["TARGET"].astype(int)

In [39]:
X.shape

(15704, 144)

In [40]:
# Best model
# Dictionary of the best parameters (V2 is the one passed to the classifier below):
# V1 {'class_weight': 'balanced', 'learning_rate': 0.07, 'max_depth': 3, 'n_estimators': 500, 'reg_alpha': 0.1, 'reg_lambda': 0.0005623413251903491} 
# V2 {'boosting_type': 'gbdt', 'class_weight': 'balanced', 'learning_rate': 0.07, 'max_depth': 3, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 500, 'num_leaves': 31, 'reg_alpha': 0.1, 'reg_lambda': 0.0005623413251903491}
lgb_c = lgb.LGBMClassifier(boosting_type='gbdt', class_weight='balanced', learning_rate=0.07, max_depth=3, min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0, n_estimators=500, num_leaves=31, reg_alpha=0.1, reg_lambda=0.0005623413251903491)
lgb_c.fit(X, y)

## SHAP

In [41]:
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
df_cat = pd.DataFrame(data=cat_array_encoded, columns=ohe.get_feature_names())

In [42]:
df_num = pd.DataFrame(data=num_array, columns=numeric_feats_filtered)

In [43]:
# concatenating df_cat and df_num along columns
df_all = pd.concat([df_cat, df_num], axis=1)

In [44]:
df_all.shape

(15704, 144)

In [45]:
import shap

# shap.plots.initjs()

# Compute the SHAP values
explainer = shap.TreeExplainer(lgb_c)
# Index [1] selects the values for the positive class (TARGET = 1)
shap_values = explainer.shap_values(df_all)[1]
exp_value = explainer.expected_value[1]
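
SHAP values are additive: for one client, the base value plus the sum of that client's SHAP values gives the model output being explained. For a tree classifier explained with the default `TreeExplainer` settings, this output is typically the raw log-odds score, so a sigmoid turns it into a probability. The sketch below assumes that log-odds setting and uses made-up numbers, not values from this model; `sigmoid` and `explained_probability` are hypothetical helper names.

```python
import math


def sigmoid(z):
    """Map a log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def explained_probability(base_value, contributions):
    """Base value plus per-feature SHAP contributions, mapped to a probability."""
    raw_score = base_value + sum(contributions)
    return sigmoid(raw_score)


# Made-up base value and SHAP contributions for one client
base_value = -0.4
contributions = [0.9, -0.3, 0.2]

p = explained_probability(base_value, contributions)
print(p)
```

This is the reading a dashboard can give to a non-expert: each feature pushes the score up or down from a common starting point.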

# Persistence for use in the API and the Dashboard

https://joblib.readthedocs.io/en/latest/persistence.html

## API / Dashboard

Client data

In [46]:
# Make sure the output directory exists before writing to it
os.makedirs('bin', exist_ok=True)

compression_opts = dict(method='zip',
                        archive_name='data.csv')

db_test.to_csv('bin/data.zip', index=False, compression=compression_opts)

Columns used for prediction with the model

In [47]:
# One feature name per line
with open('bin/features_for_model_prediction.txt', 'w') as file:
    for feature in columns_feature_importance_split_score:
        file.write(feature + '\n')

Columns for the main table of the dashboard

In [48]:
# One feature name per line
with open('bin/features_for_dashboard_table.txt', 'w') as file:
    for feature in columns_feature_displayable:
        file.write(feature + '\n')
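
On the API or dashboard side, these one-name-per-line files can be read back into Python lists. The sketch below is an assumption about how a consumer might do it (the `read_feature_list` helper is hypothetical); it writes a small temporary file first so that it runs on its own.

```python
import os
import tempfile


def read_feature_list(path):
    """Return the non-empty lines of a text file as a list of feature names."""
    with open(path) as file:
        return [line.strip() for line in file if line.strip()]


# Round trip on a temporary file that uses the same one-name-per-line format
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'features.txt')
    with open(path, 'w') as file:
        for feature in ['SK_ID_CURR', 'CODE_GENDER', 'YEARS_BIRTH']:
            file.write(feature + '\n')
    loaded = read_feature_list(path)

print(loaded)
```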

Serialization of the objects needed by the API

In [49]:
os.makedirs('bin', exist_ok=True)

joblib.dump(ohe, 'bin/ohe.joblib')
# Despite the file name, this is the MinMaxScaler fitted above
joblib.dump(scaler, 'bin/std_scaler.joblib')
joblib.dump(lgb_c, 'bin/model.joblib')

['bin/model.joblib']

Serialization of the Client data model used by the prediction method of the prediction API

In [50]:
db_test_joblib = db_test[categorical_feats_filtered + numeric_feats_filtered]
db_test_joblib = db_test_joblib.astype(object)

# Map each column to (Python type of its first value, required field) for pydantic's create_model
data_model = {}
for column_name in db_test_joblib.columns:
    data_model[column_name] = (type(db_test_joblib.loc[0, column_name]), Field(...))
    
print(data_model)

{'NAME_INCOME_TYPE': (<class 'str'>, FieldInfo(default=Ellipsis, extra={})), 'CODE_GENDER': (<class 'str'>, FieldInfo(default=Ellipsis, extra={})), 'NAME_EDUCATION_TYPE': (<class 'str'>, FieldInfo(default=Ellipsis, extra={})), 'NAME_FAMILY_STATUS': (<class 'str'>, FieldInfo(default=Ellipsis, extra={})), 'ORGANIZATION_TYPE': (<class 'str'>, FieldInfo(default=Ellipsis, extra={})), 'EMERGENCYSTATE_MODE': (<class 'str'>, FieldInfo(default=Ellipsis, extra={})), 'FLAG_OWN_CAR': (<class 'str'>, FieldInfo(default=Ellipsis, extra={})), 'NAME_CONTRACT_TYPE': (<class 'str'>, FieldInfo(default=Ellipsis, extra={})), 'HOUSETYPE_MODE': (<class 'str'>, FieldInfo(default=Ellipsis, extra={})), 'OCCUPATION_TYPE': (<class 'str'>, FieldInfo(default=Ellipsis, extra={})), 'NAME_HOUSING_TYPE': (<class 'str'>, FieldInfo(default=Ellipsis, extra={})), 'DAYS_EMPLOYED': (<class 'int'>, FieldInfo(default=Ellipsis, extra={})), 'OWN_CAR_AGE': (<class 'float'>, FieldInfo(default=Ellipsis, extra={})), 'REGION_RATING_CL

In [51]:
joblib.dump(data_model, 'bin/data_dict.joblib')

['bin/data_dict.joblib']
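
The dictionary above has the shape expected by pydantic's `create_model`, where each field is given as a `(type, default)` tuple. As a hedged sketch of how the API might rebuild a request model from it, the example below uses a reduced three-field dictionary; the model name `Client` is an assumption, and the real dictionary holds every model feature.

```python
from pydantic import Field, ValidationError, create_model

# Reduced, hypothetical version of data_model with three of the fields
data_model = {
    'CODE_GENDER': (str, Field(...)),
    'DAYS_EMPLOYED': (int, Field(...)),
    'OWN_CAR_AGE': (float, Field(...)),
}

# Build a pydantic model class dynamically from the field definitions
Client = create_model('Client', **data_model)

# A complete payload is accepted and exposed as typed attributes
client = Client(CODE_GENDER='M', DAYS_EMPLOYED=-5215, OWN_CAR_AGE=26.0)
print(client.DAYS_EMPLOYED)

# A payload with missing required fields is rejected
try:
    Client(CODE_GENDER='M')
    missing_rejected = False
except ValidationError:
    missing_rejected = True
print(missing_rejected)
```

Since every field is declared with `Field(...)`, all features are required, which is why the API would reject a partially filled client record.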

SHAP values

In [52]:
# np.savetxt('bin/shap_shap_values.txt', shap_values, fmt="%f")

In [53]:
# better saving with compression
np.savez_compressed("bin/shap_shap_values.npz", shap_values=shap_values)

In [54]:
with open('bin/shap_expected_value.txt', 'w') as file:
    file.write(str(exp_value) + '\n')

In [55]:
# Decision threshold for the model, hard-coded here
best_treshold = 0.5040000000000002

with open('bin/model_best_treshold.txt', 'w') as file:
    file.write(str(best_treshold) + '\n')
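
The threshold above is hard-coded and this notebook does not show how it was obtained. As one possible approach, consistent with the goal of minimizing false negatives, the sketch below scans candidate thresholds and keeps the one with the lowest business cost. The 10:1 cost ratio between FN and FP, the helper names and the toy scores are all assumptions for illustration.

```python
def business_cost(y_true, y_proba, threshold, fn_cost=10.0, fp_cost=1.0):
    """Total cost of the decisions made at a given threshold (1 = default)."""
    cost = 0.0
    for truth, proba in zip(y_true, y_proba):
        predicted = 1 if proba >= threshold else 0
        if truth == 1 and predicted == 0:
            cost += fn_cost  # credit granted to a client who defaults
        elif truth == 0 and predicted == 1:
            cost += fp_cost  # credit refused to a client who would repay
    return cost


def best_threshold(y_true, y_proba, steps=100):
    """Return the threshold in [0, 1] with the lowest cost (first one on ties)."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda t: business_cost(y_true, y_proba, t))


# Toy labels and predicted default probabilities
y_true = [0, 0, 1, 0, 1, 0]
y_proba = [0.10, 0.35, 0.40, 0.55, 0.80, 0.20]

threshold = best_threshold(y_true, y_proba)
print(threshold, business_cost(y_true, y_proba, threshold))
```

Weighting FN more heavily than FP pushes the selected threshold down, so that more risky clients are refused at the expense of a few good ones.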

## For testing

In [56]:
db_test[categorical_feats_filtered + numeric_feats_filtered].loc[0,:]

NAME_INCOME_TYPE                         Commercial associate
CODE_GENDER                                                 M
NAME_EDUCATION_TYPE             Secondary / secondary special
NAME_FAMILY_STATUS                                    Married
ORGANIZATION_TYPE                      Business Entity Type 3
EMERGENCYSTATE_MODE                                        No
FLAG_OWN_CAR                                                Y
NAME_CONTRACT_TYPE                            Revolving loans
HOUSETYPE_MODE                                 block of flats
OCCUPATION_TYPE                                      Laborers
NAME_HOUSING_TYPE                           House / apartment
DAYS_EMPLOYED                                           -5215
OWN_CAR_AGE                                              26.0
REGION_RATING_CLIENT_W_CITY                                 1
AMT_REQ_CREDIT_BUREAU_DAY                                 0.0
FLOORSMAX_MEDI                                         0.3958
LIVINGAR

In [57]:
os.makedirs('json', exist_ok=True)
    
for i in range(0, 10):
    db_test[categorical_feats_filtered + numeric_feats_filtered].loc[i].to_json("json/row{}.json".format(i))

In [58]:
for i in range(0, 10):
    print(y_test.loc[i])

TARGET    0
Name: 0, dtype: int64
TARGET    0
Name: 1, dtype: int64
TARGET    1
Name: 2, dtype: int64
TARGET    0
Name: 3, dtype: int64
TARGET    0
Name: 4, dtype: int64
TARGET    0
Name: 5, dtype: int64
TARGET    0
Name: 6, dtype: int64
TARGET    0
Name: 7, dtype: int64
TARGET    0
Name: 8, dtype: int64
TARGET    0
Name: 9, dtype: int64


In [59]:
import json
d = db_test[categorical_feats_filtered + numeric_feats_filtered].loc[2].to_dict()
j = json.dumps(d)
j

'{"NAME_INCOME_TYPE": "Working", "CODE_GENDER": "F", "NAME_EDUCATION_TYPE": "Secondary / secondary special", "NAME_FAMILY_STATUS": "Married", "ORGANIZATION_TYPE": "Self-employed", "EMERGENCYSTATE_MODE": "No", "FLAG_OWN_CAR": "Y", "NAME_CONTRACT_TYPE": "Cash loans", "HOUSETYPE_MODE": "block of flats", "OCCUPATION_TYPE": "Managers", "NAME_HOUSING_TYPE": "House / apartment", "DAYS_EMPLOYED": -1550, "OWN_CAR_AGE": 23.0, "REGION_RATING_CLIENT_W_CITY": 3, "AMT_REQ_CREDIT_BUREAU_DAY": 0.0, "FLOORSMAX_MEDI": 0.3333, "LIVINGAREA_AVG": 0.2, "FLAG_DOCUMENT_5": 0, "AMT_INCOME_TOTAL": 270000.0, "AMT_ANNUITY": 62995.5, "AMT_REQ_CREDIT_BUREAU_QRT": 0.0, "EXT_SOURCE_2": 0.4260776081313265, "REG_CITY_NOT_LIVE_CITY": 0, "EXT_SOURCE_1": 0.5284375340835995, "REG_REGION_NOT_WORK_REGION": 0, "CREDIT_INCOME_PERCENT": 5.642066666666667, "YEARS_BEGINEXPLUATATION_MEDI": 0.9806, "FLAG_DOCUMENT_8": 0, "AMT_REQ_CREDIT_BUREAU_WEEK": 0.0, "CREDIT_TERM": 0.04135305030071723, "DAYS_EMPLOYED_PERCENT": 0.108225108225108

In [60]:
test = 0

print(type(test))
if type(test) is int and test >= 0:
    print('value is ', test)
else:
    print('no value')

<class 'int'>
value is  0
