<img src="OC.png" width="50" height="50" align="left">    

***

<center> <font size="6"> <span style='color:Blue'> P7: Implement a scoring model </span></font> </center>  

***

<font size="3"> <span style="font-size: 1.5em" > **Part 2 - 2/2: Modeling** </span> </font> 

**Context**  
**Prêt à dépenser** is a financial company that offers consumer credit to people with little or no credit history.
<img src="pretadepenser.png" width="200" height="200">
**Mission**   
* Build a scoring model that automatically predicts the probability that a client will default.
* Build an interactive dashboard for customer relationship managers, so they can interpret the model's predictions and improve their knowledge of each client.
* Deploy the scoring model behind an API, together with the interactive dashboard that calls the API for predictions.

In this notebook, we will:
* Recall the chosen model and its performance.
* Save the model. 

# Importing the data and the Python data science libraries

In [29]:
# ------------------------------------------
# Project: Implement a scoring model
# Data: https://www.kaggle.com/c/home-credit-default-risk/data
# Author: Rim BAHROUN
# Date: April 2023
# OpenClassrooms
# -------------------------------------------
# Import the Python data science libraries
# -------------------------------------------
import os
import csv
import numpy as np
import pandas as pd
import timeit

from sklearn.model_selection import train_test_split
from sklearn import pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from sklearn.metrics import make_scorer, f1_score, fbeta_score, roc_auc_score, roc_curve
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report


import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature

In [76]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

In [38]:
train_df = pd.read_csv("data/traited/df_credit_train_35.csv")
print(train_df.shape)
train_df.head(2)

(307507, 35)


Unnamed: 0,SK_ID_CURR,TARGET,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_REGISTRATION,DAYS_ID_PUBLISH,HOUR_APPR_PROCESS_START,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,...,PREV_NAME_CONTRACT_STATUS_Approved_MEAN,PREV_NAME_PAYMENT_TYPE_Cashthroughthebank_MEAN,PREV_NAME_TYPE_SUITE_nan_MEAN,PREV_NAME_PRODUCT_TYPE_XNA_MEAN,PREV_NAME_SELLER_INDUSTRY_Consumerelectronics_MEAN,PREV_NAME_YIELD_GROUP_high_MEAN,PREV_NAME_YIELD_GROUP_low_normal_MEAN,REFUSED_DAYS_DECISION_MAX,POS_MONTHS_BALANCE_MEAN,CC_AMT_CREDIT_LIMIT_ACTUAL_SUM
0,100002,1.0,0.018801,-9461,-3648.0,-2120,10,0.083037,0.262949,0.139376,...,1.0,0.0,1.0,1.0,0.0,0.0,1.0,-396.0,-10.0,3960000.0
1,100003,0.0,0.003541,-16765,-1186.0,-291,11,0.311267,0.622246,0.535276,...,1.0,0.666667,0.0,0.666667,0.333333,0.0,0.333333,-396.0,-43.785714,3960000.0


In [41]:
test_df = pd.read_csv("data/traited/df_credit_test_35.csv")
print(test_df.shape)
test_df.head(2)

(48744, 35)


Unnamed: 0,SK_ID_CURR,TARGET,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_REGISTRATION,DAYS_ID_PUBLISH,HOUR_APPR_PROCESS_START,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,...,PREV_NAME_CONTRACT_STATUS_Approved_MEAN,PREV_NAME_PAYMENT_TYPE_Cashthroughthebank_MEAN,PREV_NAME_TYPE_SUITE_nan_MEAN,PREV_NAME_PRODUCT_TYPE_XNA_MEAN,PREV_NAME_SELLER_INDUSTRY_Consumerelectronics_MEAN,PREV_NAME_YIELD_GROUP_high_MEAN,PREV_NAME_YIELD_GROUP_low_normal_MEAN,REFUSED_DAYS_DECISION_MAX,POS_MONTHS_BALANCE_MEAN,CC_AMT_CREDIT_LIMIT_ACTUAL_SUM
0,100001,,0.01885,-19241,-5170.0,-812,18,0.752614,0.789654,0.15952,...,1.0,1.0,0.0,1.0,0.0,1.0,0.0,-396.0,-72.555556,3960000.0
1,100005,,0.035792,-18064,-9118.0,-1623,9,0.56499,0.291656,0.432962,...,0.5,0.5,1.0,1.0,0.0,0.5,0.0,-396.0,-20.0,3960000.0


In [5]:
x = train_df.drop(columns=["SK_ID_CURR", 'TARGET'])
y = train_df.loc[:, train_df.columns=='TARGET']

In [6]:
x_train, x_val, y_train, y_val = train_test_split(x, y, 
                                                    test_size=0.33, 
                                                    random_state=42)
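
As a quick check of the split sizes, the sketch below reproduces the arithmetic behind `test_size=0.33`. It relies on the fact that scikit-learn rounds the held-out fraction up to the next integer; the variable names are illustrative, and the row count is the one printed for `train_df` above.

```python
import math

# Number of rows reported by train_df.shape
n_samples = 307507
test_size = 0.33

# scikit-learn takes the ceiling of test_size * n_samples for the held-out part
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(f"train rows: {n_train}, validation rows: {n_val}")
```

The validation size obtained this way equals the total of the four cells in each confusion matrix reported further down in this notebook.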

In [43]:
x_val

Unnamed: 0,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_REGISTRATION,DAYS_ID_PUBLISH,HOUR_APPR_PROCESS_START,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,DAYS_LAST_PHONE_CHANGE,BURO_DAYS_CREDIT_MIN,...,PREV_NAME_CONTRACT_STATUS_Approved_MEAN,PREV_NAME_PAYMENT_TYPE_Cashthroughthebank_MEAN,PREV_NAME_TYPE_SUITE_nan_MEAN,PREV_NAME_PRODUCT_TYPE_XNA_MEAN,PREV_NAME_SELLER_INDUSTRY_Consumerelectronics_MEAN,PREV_NAME_YIELD_GROUP_high_MEAN,PREV_NAME_YIELD_GROUP_low_normal_MEAN,REFUSED_DAYS_DECISION_MAX,POS_MONTHS_BALANCE_MEAN,CC_AMT_CREDIT_LIMIT_ACTUAL_SUM
232923,0.025164,-9267,-372.0,-785,15,0.044202,0.544933,0.535276,-1440.0,-1827.0,...,0.500000,0.300000,0.400000,0.800000,0.200000,0.300000,0.000000,-272.0,-43.150000,3960000.0
263698,0.015221,-10916,-532.0,-3534,13,0.543037,0.587365,0.692559,-3.0,-2891.0,...,0.800000,0.700000,0.500000,0.750000,0.200000,0.142857,0.111111,-396.0,-28.593750,3960000.0
36463,0.046220,-10066,-4059.0,-2511,14,0.505998,0.643635,0.535276,-1059.0,-1827.0,...,0.500000,0.500000,1.000000,1.000000,0.000000,0.500000,0.000000,-396.0,-29.000000,3960000.0
279380,0.046220,-18698,-8905.0,-2242,18,0.505998,0.746168,0.360613,-3638.0,-1877.0,...,0.500000,1.000000,0.500000,0.000000,0.000000,0.000000,0.000000,-639.0,-16.500000,3960000.0
148324,0.018801,-18162,-7108.0,-1522,7,0.505998,0.648460,0.486653,-1827.0,-2434.0,...,0.666667,0.333333,0.666667,0.333333,0.000000,0.000000,0.000000,-773.0,-54.000000,3960000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34300,0.006008,-8768,-8765.0,-1410,9,0.460036,0.696205,0.321735,-1061.0,-1345.0,...,0.600000,0.600000,0.400000,0.800000,0.400000,0.400000,0.200000,-396.0,-18.954545,3960000.0
304790,0.007274,-22431,-4526.0,-4526,9,0.753637,0.404618,0.531686,0.0,-1436.0,...,0.800000,0.700000,0.500000,0.750000,0.200000,0.142857,0.111111,-396.0,-28.593750,3960000.0
110630,0.002042,-14798,-1497.0,-4211,13,0.505998,0.489780,0.535276,-1115.0,-1827.0,...,0.600000,0.400000,0.400000,0.600000,0.000000,0.000000,0.200000,-396.0,-21.550000,585000.0
290716,0.018029,-17253,-5321.0,-762,12,0.682599,0.431879,0.165407,-1638.0,-2906.0,...,0.666667,0.500000,0.333333,0.666667,0.333333,0.166667,0.166667,-396.0,-34.826923,7200000.0


In [7]:
x_test = test_df.drop(columns=["SK_ID_CURR", 'TARGET'])

# Helper functions

In [8]:
def Custom_score(y_true, y_pred):
    """Business score in [0, 1]: 1 minus the normalised misclassification cost."""
    # Flatten both inputs so DataFrames, Series and (n, 1) arrays all work
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()

    # Cost of a false positive and of a false negative
    cout_fp = 1
    cout_fn = 10

    # Total number of positive and negative examples
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    # Maximum possible cost: every example misclassified
    max_cout = cout_fp * n_neg + cout_fn * n_pos

    # Number of false positives and false negatives
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()

    # Total cost, normalised by the maximum possible cost
    cout = cout_fp * fp + cout_fn * fn
    cout_normalise = cout / max_cout
    return round(float(1 - cout_normalise), 2)

custom_score = make_scorer(Custom_score, greater_is_better=True)
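
To make the normalisation concrete, here is a minimal pure-Python sketch of the same cost-based score on a hand-made example. The function name `cost_score` and the toy labels are illustrative assumptions, not part of the notebook; the costs (1 for a false positive, 10 for a false negative) are the ones used above.

```python
def cost_score(y_true, y_pred, cost_fp=1, cost_fn=10):
    """Return 1 minus the misclassification cost divided by the worst-case cost."""
    pairs = list(zip(y_true, y_pred))
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = sum(1 for t in y_true if t == 0)
    # Worst case: every negative is a false positive, every positive a false negative
    max_cost = cost_fp * n_neg + cost_fn * n_pos
    return 1 - (cost_fp * fp + cost_fn * fn) / max_cost


# Hypothetical toy labels: 2 defaulters (1) and 8 good clients (0)
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
perfect = list(y_true)
all_wrong = [1 - t for t in y_true]
one_fn = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # misses one defaulter
one_fp = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # flags one good client

score_perfect = cost_score(y_true, perfect)
score_all_wrong = cost_score(y_true, all_wrong)
score_one_fn = cost_score(y_true, one_fn)
score_one_fp = cost_score(y_true, one_fp)

print(score_perfect, score_all_wrong, round(score_one_fn, 3), round(score_one_fp, 3))
```

A single missed defaulter lowers the score ten times more than a single wrongly flagged client, which is the asymmetry the custom score is meant to encode.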

In [12]:
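def model_eval_score(model, Xval, yval):
    """Print and return the validation metrics of a fitted classifier."""
    # Hard class predictions (default 0.5 threshold)
    yval_pred = model.predict(Xval)
    
    # AUROC uses the predicted probability of the positive class;
    # beta=3.16 (about sqrt(10)) makes the F-beta score favour recall over precision
    metrics = {'AUROC_score': round(roc_auc_score(yval, model.predict_proba(Xval)[:, 1]), 2),
               'Costum_score': round(Custom_score(yval, yval_pred.reshape(-1, 1)), 2),
               'f_beta_score': round(fbeta_score(yval, yval_pred, beta=3.16), 2),
               'Accuracy_score': round(accuracy_score(yval, yval_pred), 2),
               'Recall_score': round(recall_score(yval, yval_pred), 2),
               'Presicion_score': round(precision_score(yval, yval_pred), 2)}

    for key, val in metrics.items():
        print(key + ' : ' + str(val))
    # Rows are true classes, columns are predicted classes
    print('conf_mat : \n' + str(confusion_matrix(yval, yval_pred)))
    print() 
    return metrics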

# Pipeline 

## LogisticRegression

### Logging the model

In [10]:
mlflow.set_tracking_uri("sqlite:///mlflow.db")

In [13]:
mlflow.sklearn.autolog(disable=True)

with mlflow.start_run(run_name='LogisticRegression'):
    params = {
        "solver": 'lbfgs',
        "class_weight" : 'balanced'
    }
    
    mlflow.set_tag("model_name", "LR_final")
    mlflow.log_params(params)
    
    pipeline_lr = pipeline.Pipeline([ ('scaler', StandardScaler()),
                             ('clf', LogisticRegression(**params))])

    # ravel() passes the target as a 1-D array, as scikit-learn expects
    pipeline_lr.fit(x_train, y_train.values.ravel())
    
    mlflow.log_metrics(model_eval_score(pipeline_lr, x_val, y_val))
    mlflow.sklearn.log_model(pipeline_lr, "sk_models") 



AUROC_score : 0.73
Costum_score : 0.67
f_beta_score : 0.5
Accuracy_score : 0.68
Recall_score : 0.65
Presicion_score : 0.15
conf_mat : 
[[63851 29457]
 [ 2853  5317]]
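
The threshold-based metrics printed above can all be recovered from the confusion matrix, which is a useful consistency check. The sketch below does this in plain Python, reading the matrix in scikit-learn's layout (rows are true classes, columns are predicted classes, so the cells are TN, FP, FN, TP). The function name `metrics_from_confusion` and the dictionary keys are illustrative assumptions; `beta` and the costs are the values used in the helper functions.

```python
def metrics_from_confusion(tn, fp, fn, tp, beta=3.16, cost_fp=1, cost_fn=10):
    """Derive the threshold-based metrics from the four confusion-matrix cells."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    # Cost-based score: 1 minus the observed cost over the worst-case cost
    max_cost = cost_fp * (tn + fp) + cost_fn * (fn + tp)
    custom = 1 - (cost_fp * fp + cost_fn * fn) / max_cost
    return {
        "custom": round(custom, 2),
        "f_beta": round(f_beta, 2),
        "accuracy": round(accuracy, 2),
        "recall": round(recall, 2),
        "precision": round(precision, 2),
    }


# Cells of the logistic regression confusion matrix reported above
lr_metrics = metrics_from_confusion(tn=63851, fp=29457, fn=2853, tp=5317)
print(lr_metrics)
```

The AUROC is not included because it depends on the predicted probabilities, not on the thresholded predictions.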





### Loading the model and predicting

In [27]:
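model_name = "LR_final"
stage = "Production" # "Staging"  

# A "models:/<name>/<stage>" URI requires a model registered under this name
# in the MLflow Model Registry, with a version assigned to this stage
model = mlflow.sklearn.load_model(model_uri=f"models:/{model_name}/{stage}")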

model.predict(x_val)

array([1., 0., 0., ..., 0., 1., 0.])

In [22]:
df_ = test_df.loc[:, ["SK_ID_CURR", 'TARGET']]
df_.loc[:, 'TARGET'] = model.predict_proba(x_test)[:, 1]
df_.head()

Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0.443131
1,100005,0.674566
2,100013,0.229805
3,100028,0.340662
4,100038,0.717714


On Kaggle, the AUROC on the test set is **0.72**.

In [None]:
#df_.to_csv('submission_lr.csv', index=False)

In [30]:
signature = infer_signature(x_train, y_train)



In [31]:
mlflow.sklearn.save_model(model, "LR_model", signature=signature)

In [None]:
# mlflow models serve -m LR_model/
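
The commented command above starts a local REST scoring server for the saved model. As a sketch of how a client (for example the API layer behind the dashboard) could query it, the snippet below builds a request for the `/invocations` endpoint. It assumes MLflow 2.x, whose scoring server accepts a JSON body with a `dataframe_split` key, and the default address `http://127.0.0.1:5000`. The helper name `build_invocation_request` is hypothetical, and the request is only built, not sent.

```python
import json
import urllib.request


def build_invocation_request(columns, rows, url="http://127.0.0.1:5000/invocations"):
    """Build (without sending) a POST request for an MLflow 2.x scoring server."""
    payload = {
        "dataframe_split": {
            "columns": list(columns),
            "data": [list(row) for row in rows],
        }
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustration only: two of the feature columns, with the values of the first
# training row shown earlier. A real call must send every column the model
# was trained on.
columns = ["EXT_SOURCE_2", "EXT_SOURCE_3"]
rows = [[0.262949, 0.139376]]
request = build_invocation_request(columns, rows)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# Once the server is running, the request could be sent with:
# with urllib.request.urlopen(request) as response:
#     result = json.loads(response.read())
```

Sending column names together with the values keeps the request self-describing, so the server can match each value to the feature it expects.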

## LGBMClassifier

### Logging the model

In [23]:
mlflow.sklearn.autolog(disable=True)

with mlflow.start_run(run_name='LGBMClassifier'):
    params = {
        "n_estimators": 500,
        "max_depth": 8, 
        "learning_rate": 0.02,
        "class_weight": 'balanced'
    }
    
    mlflow.set_tag("model_name", "LGBM_final")
    mlflow.log_params(params)
    
    pipeline_lgbm = pipeline.Pipeline([ ('scaler', StandardScaler()),
                             ('clf', LGBMClassifier(**params))])

    # ravel() passes the target as a 1-D array, as scikit-learn expects
    pipeline_lgbm.fit(x_train, y_train.values.ravel())
    
    mlflow.log_metrics(model_eval_score(pipeline_lgbm, x_val, y_val))
    mlflow.sklearn.log_model(pipeline_lgbm, "sk_models") 



AUROC_score : 0.74
Costum_score : 0.68
f_beta_score : 0.51
Accuracy_score : 0.7
Recall_score : 0.65
Presicion_score : 0.16
conf_mat : 
[[65414 27894]
 [ 2820  5350]]
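
Both confusion matrices come from the same validation split, so the two pipelines can be compared directly in cost units. The sketch below applies the notebook's weights (1 per false positive, 10 per false negative) to the reported counts; the function name `total_cost` is an illustrative assumption.

```python
def total_cost(fp, fn, cost_fp=1, cost_fn=10):
    """Weighted misclassification cost, in units of one false positive."""
    return cost_fp * fp + cost_fn * fn


# False positive / false negative counts from the two confusion matrices
lr_cost = total_cost(fp=29457, fn=2853)
lgbm_cost = total_cost(fp=27894, fn=2820)

saving = lr_cost - lgbm_cost
relative_saving = saving / lr_cost

print(f"LogisticRegression cost: {lr_cost}")
print(f"LGBMClassifier cost:     {lgbm_cost}")
print(f"Saving: {saving} ({relative_saving:.1%})")
```

On this split the LightGBM pipeline lowers the weighted cost by about 3%, which is what the one-point difference between the two custom scores reflects.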



### Loading the model and predicting

In [32]:
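model_name = "LGBM_final"
stage = "Staging"  # "Production"

# A "models:/<name>/<stage>" URI requires a model registered under this name
# in the MLflow Model Registry, with a version assigned to this stage
model = mlflow.sklearn.load_model(model_uri=f"models:/{model_name}/{stage}")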

model.predict(x_val)

array([1., 0., 1., ..., 1., 1., 1.])

In [25]:
df_ = test_df.loc[:, ["SK_ID_CURR", 'TARGET']]
df_.loc[:, 'TARGET'] = model.predict_proba(x_test)[:, 1]
df_.head()

Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0.23689
1,100005,0.57451
2,100013,0.190605
3,100028,0.340771
4,100038,0.73425


On Kaggle, the AUROC on the test set is **0.73**.

In [90]:
#df_.to_csv('submission_lgbm.csv', index=False)

In [33]:
#mlflow.sklearn.save_model(model, "LGBM_model", signature=signature)