<div style="display: flex; background-color: RGB(119, 150, 203);">
    <h1 style="margin: auto; padding: 30px 30px 30px 30px; color: RGB(255,255,255);">
        <center>
            <b>Scoring: Telecom Project</b><br/>
            <br/>
            Rime Boumezaoued, Romain Pénichon and Claire Gefflot<br/>
        </center>
    </h1>
</div>

<div class="alert alert-block alert-info">
<b><u>Project context</u></b><br/>
<br/>
• Carrying out a study<br/>
• <b>Data:</b> customer behaviour in mobile telephony<br/>
• <b>Objective:</b> find the scoring model that best identifies "churners"<br/>
• <b>Target event:</b> "churn" (attrition, customer departure) is defined as entering the invalidity period within 2 months<br/>
• <b>Setting:</b> 2 mobile operators (ope1 and ope2) + 1 fixed-line operator<br/>
• You are "ope1"<br/>
• Mobile, prepaid, consumer market<br/>
• <b>3 periods:</b> validity, grace (no outgoing calls), invalidity (after-grace: neither incoming nor outgoing calls)<br/>
• <b>Top-up types:</b> 25, 50, 100, 200 sesterces<br/>
• <b>Target variable:</b> AFTERGRACE_FLAG<br/>

</div>

<a class="anchor" id="table_of_contents"></a>
## Table of contents

* [Data pre-processing](#chapter1)
    * [Package and data imports](#section_1_1)
    * [Data exploration](#section_1_2)
    * [Data cleaning](#section_1_3)
* [Exploratory data analysis (EDA)](#chapter2)
* [Feature engineering](#chapter3)
    * [Indicator creation](#section_3_1)
    * [Variable pre-selection (Chi-squared and Student tests)](#section_3_2)
    * [Encoding and scaling (train and validation sets)](#section_3_3)
    * [Variable selection - logistic regression (RFE)](#section_3_4)
* [Modelling](#chapter4)
    * [Logistic regression](#section_4_1)
    * [XGBoost](#section_4_2)
    * [LightGBM](#section_4_3)
* [Final model choice: XGBoost](#chapter5)
    * [Results analysis](#section_5_1)
    * [Identifying the most important variables](#section_5_2)
    * [Model interpretation](#section_5_3)
    * [Model serialization and real-world deployment](#section_5_4)

<a class="anchor" id="chapter1"></a>
<div style="display: flex; background-color: RGB(119, 150, 203);">
    <h1 style="margin: auto; padding: 30px 30px 30px 30px; color: RGB(255,255,255);">
            <b>Data pre-processing</b>
    </h1>
</div>

<a class="anchor" id="section_1_1"></a>
<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: RGB(119, 150, 203); ">Package and data imports</h3>
</div>

In [None]:
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_object_dtype
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import scipy.stats as st
from pathlib import Path
import io
from unidecode import unidecode
from skimpy import skim
import sys
from statistics import mean
import warnings
import re

from sklearn.model_selection import train_test_split

from tqdm import tqdm
from tqdm.notebook import tqdm_notebook

import sklearn
from sklearn.preprocessing import MinMaxScaler, StandardScaler, MaxAbsScaler, RobustScaler, QuantileTransformer, PowerTransformer, OneHotEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import roc_auc_score, accuracy_score, mean_squared_error, mean_absolute_error, r2_score, confusion_matrix, f1_score
import category_encoders as ce
from pre_processing import pre_processing


from sklearn.exceptions import DataConversionWarning
from xgboost import XGBClassifier
#from lightgbm import LGBMClassifier

from sklearn.tree import DecisionTreeClassifier
#from catboost import CatBoostClassifier, Pool
#from catboost import cv
import optuna
import joblib

import shap

In [None]:
# Read the dataset :

pd.set_option("display.min_rows", 10)
pd.set_option("display.max_column", 1000)

df = pd.read_csv('../DATA/base_projet_teleco.csv', index_col=False, sep=";") #../DATA/base_projet_teleco.csv or gs://scoring-data-m2tide/DATA/base_projet_teleco.csv
df=df.iloc[:, 1:]
df

<a class="anchor" id="section_1_2"></a>
<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: RGB(119, 150, 203); ">Data exploration</h3>
</div>

In [None]:
df.info()

In [None]:
def df_analyse(df, columns, name_df):
    """
    Initial analysis on the DataFrame.

    Args:
        df (pandas.DataFrame): DataFrame to analyze.
        columns (list): Candidate key columns, in list format.
        name_df (str): DataFrame name.

    Returns:
        None. Prints the initial analysis of the DataFrame.
    """

    # Calculating the memory usage based on dataframe.info()
    buf = io.StringIO()
    df.info(buf=buf)
    memory_usage = buf.getvalue().split('\n')[-2]

    if df.empty:
        print("The", name_df, "dataset is empty. Please verify the file.")
    else:
        # identifying empty columns
        empty_cols = [col for col in df.columns if df[col].isna().all()]
        #identifying full duplicates rows
        df_rows_duplicates = df[df.duplicated()]

        # Creating a dataset based on Type object and records by columns
        type_cols = df.dtypes.apply(lambda x: x.name).to_dict()
        df_resume = pd.DataFrame(list(type_cols.items()), columns = ["Name", "Type"])
        df_resume["Records"] = list(df.count())
        df_resume["% of NaN"] = list(round((df.isnull().sum(axis = 0))/len(df),5)*100)
        df_resume["Unique"] = list(df.nunique())


        print("\nInitial Analysis of", name_df, "dataset")
        print("--------------------------------------------------------------------------")
        print("- Dataset shape:                 ", df.shape[0], "rows and", df.shape[1], "columns")
        print("- Total of NaN values:           ", df.isna().sum().sum())
        #print("- Percentage of NaN:             ", round((df.isna().sum().sum() / prod(df.shape)) * 100, 2), "%")
        print("- Total of full duplicates rows: ", df_rows_duplicates.shape[0])
        print("- Total of empty rows:           ", df.shape[0] - df.dropna(axis="rows", how="all").shape[0]) if df.dropna(axis="rows", how="all").shape[0] < df.shape[0] else \
                    print("- Total of empty rows:            0")
        print("- Total of empty columns:        ", len(empty_cols))
        print("  + The empty column is:         ", empty_cols) if len(empty_cols) == 1 else \
                    print("  + The empty column are:         ", empty_cols) if len(empty_cols) >= 1 else None

        print("\n- The key(s):", columns, "is not present multiple times in the dataframe.\n  It CAN be used as a primary key.") if df.size == df.drop_duplicates(columns).size else \
                    print("\n- The key(s):", columns, "is present multiple times in the dataframe.\n  It CANNOT be used as a primary key.")

        print("\n- Type object and records by columns         (",memory_usage,")")
        print("--------------------------------------------------------------------------")
        print(df_resume.sort_values("Records", ascending=False))



# Analyse df
df_analyse(df, ["CONTRACT_KEY"], "df")

<a class="anchor" id="section_1_3"></a>
<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: RGB(119, 150, 203); ">Data cleaning</h3>
</div>

<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: black; ">Basic data processing</h3>
</div>

In [None]:
#drop duplicate : 
df.duplicated().value_counts()
df.drop_duplicates(inplace=True, keep="first")

In [None]:
#drop useless col : "CONTRACT_KEY"
df.drop(["CONTRACT_KEY"], axis=1, inplace=True)

In [None]:
# Handling data formatting: make sure the data is in a consistent format (convert data types, change date formats, etc.)
# lowercase characters:
df = df.applymap(lambda s: s.lower() if isinstance(s, str) else s)

#drop white space :
#df = df.applymap(lambda s:s.strip() if type(s) == str else s) 

#drop multiple(double, triple) space :
#df = df.applymap(lambda s:s.replace("  ", " ") if type(s) == str else s) 

#replace " " by "_" :
#df = df.applymap(lambda s:s.replace(" ", "_") if type(s) == str else s) 

#remove accent :
#from unidecode import unidecode
#df = df.applymap(lambda s: unidecode(s) if type(s) == str else s) 

In [None]:
# study the type and the number of unique value of each covariate
df_nunique = pd.concat([df.nunique(), df.dtypes], axis=1).rename(columns={0: 'nunique', 1: 'dtypes'})
df_nunique

In [None]:
#drop columns which have only 1 unique value :
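
The cell above announces the removal of single-valued columns but contains no code. A minimal sketch of that step could look like the following; the helper name `drop_constant_columns` and the toy DataFrame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def drop_constant_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` without the columns holding at most one distinct non-null value."""
    constant_cols = [col for col in frame.columns if frame[col].nunique(dropna=True) <= 1]
    return frame.drop(columns=constant_cols)


# Hypothetical toy data: "a" is constant, "c" has a single observed value plus a missing one
toy = pd.DataFrame({"a": [1, 1, 1], "b": [1, 2, 3], "c": ["x", "x", None]})
reduced = drop_constant_columns(toy)
print(reduced.columns.tolist())
```

On the project data, the same idea would apply to `df`, using the `nunique` column of `df_nunique` computed in the previous cell.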

<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: black; ">Handling abnormal values</h3>
</div>

In [None]:
# Check the numerical variables for anomalies before fillna (columns up to PASS_AFTERGRACE_IND_M1):
df_describe = (df.describe()).T
df_describe

In [None]:
# Check the categorical variables for anomalies:
# display(df["var_cat"].unique())

In [None]:
# CUSTOMER_AGE: expected between 18 and 100 years old
# Can customer ages really range from 0 to 99?
display(df["CUSTOMER_AGE"].unique())

# take the absolute value to repair negative ages, then set ages under 18 to NaN
customer_age = df["CUSTOMER_AGE"].abs()
df["CUSTOMER_AGE"] = customer_age.where(customer_age >= 18)
display(df["CUSTOMER_AGE"].unique())

In [None]:
# CUSTOMER_GENDER: customer gender
display(df["CUSTOMER_GENDER"].unique())

df["CUSTOMER_GENDER"] = df["CUSTOMER_GENDER"].apply(lambda x : x.split("'")[1])
df["CUSTOMER_GENDER"] = df["CUSTOMER_GENDER"].apply(lambda x: np.nan if (x=="not ent" or x=="unknown") else x) #lambda x: "unknown" if (x=="not ent" or x=="unknown") else x

display(df["CUSTOMER_GENDER"].unique())

In [None]:
# CONTRACT_TENURE_DAYS: contract tenure in days
"""
We know that:
- the minimum age to sign a contract is 18
- if CONTRACT_TENURE_DAYS = 365, the tenure is 365/365 = 1 year, so the customer must be at least 18 + 1 = 19

If current age - (CONTRACT_TENURE_DAYS / 365) < 18,
then replace CONTRACT_TENURE_DAYS by np.nan;
otherwise keep the value.
"""

df["age_first_contract"] = df["CUSTOMER_AGE"] - (df["CONTRACT_TENURE_DAYS"]/365)
df.loc[df['age_first_contract'] < 18, 'CONTRACT_TENURE_DAYS'] = np.nan
df.drop(["age_first_contract"], axis=1, inplace=True)
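
To make the tenure rule above concrete, here is a small worked example on made-up rows. The three customers and the constant name `MIN_CONTRACT_AGE` are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

MIN_CONTRACT_AGE = 18  # minimum signing age assumed by the rule above

# Hypothetical customers: current age and contract tenure in days
toy = pd.DataFrame({
    "CUSTOMER_AGE": [19.0, 40.0, 25.0],
    "CONTRACT_TENURE_DAYS": [730.0, 3650.0, 365.0],
})

# Age the customer would have had when the contract started
age_at_signup = toy["CUSTOMER_AGE"] - toy["CONTRACT_TENURE_DAYS"] / 365

# A signing age under 18 is impossible, so the tenure is treated as missing
toy.loc[age_at_signup < MIN_CONTRACT_AGE, "CONTRACT_TENURE_DAYS"] = np.nan
print(toy)
```

Only the first row is invalidated: a 19-year-old with a 2-year contract would have signed at 17.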

In [None]:
# 'NO_OF_RECHARGES_6M': fewer recharges than failed recharges is inconsistent, so the count is set to NaN
df.loc[(df['NO_OF_RECHARGES_6M']-df['FAILED_RECHARGE_6M']) < 0, 'NO_OF_RECHARGES_6M'] = np.nan

No errors to report on these variables:
AVERAGE_CHARGE_6M, FAILED_RECHARGE_6M, AVERAGE_RECHARGE_TIME_6M, BALANCE_M3, BALANCE_M2, BALANCE_M1, FIRST_RECHARGE_VALUE, LAST_RECHARGE_VALUE, TIME_TO_GRACE, TIME_TO_AFTERGRACE, RECENCY_OF_LAST_RECHARGE, TOTAL_RECHARGE_6M, ZERO_BALANCE_IND_M3, ZERO_BALANCE_IND_M2, ZERO_BALANCE_IND_M1, PASS_GRACE_IND_M3, PASS_GRACE_IND_M2, PASS_GRACE_IND_M1, PASS_AFTERGRACE_IND_M3, PASS_AFTERGRACE_IND_M2, PASS_AFTERGRACE_IND_M1

<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: black; ">Variable transformation</h3>
</div>

In [None]:
#phones_data = pd.read_excel("marque_telephone.xlsx")
phones_data = pd.read_csv("../DATA/marque_telephone.csv") #../DATA/marque_telephone.csv or gs://scoring-data-m2tide/DATA/marque_telephone.csv
phones_data = phones_data.applymap(lambda x: x.lower() if isinstance(x, str) else x)

# Flag the models listed more than once (kept for inspection)
duplicates = phones_data.duplicated(subset='model', keep=False)

# Drop duplicated models so that 'model' can serve as a unique index
phones_data = phones_data.drop_duplicates(subset='model')

# Create a new 'marque' (brand) column in df by mapping each handset model to its brand
df['marque'] = df['CURR_HANDSET_MODE'].map(phones_data.set_index('model')['marque_tel'])

#drop "CURR_HANDSET_MODE":
df.drop(["CURR_HANDSET_MODE"], axis=1, inplace=True)

In [None]:
df['marque'].value_counts()

In [None]:
# replace negative values by their absolute value for the three variables concerned
col = ["INC_OUT_PROP_DUR_MIN_M1", "INC_OUT_PROP_DUR_MIN_M2", "INC_OUT_PROP_DUR_MIN_M3"]
for elem in col :
    df[elem] = df[elem].abs()

In [None]:
df.CONTRACT_TENURE_DAYS.sort_values(ascending=False)

<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: black; ">Data analysis after cleaning</h3>
</div>

In [None]:
df_describe = (df.describe()).T
df_describe

In [None]:
skim(df)


<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: black; ">Split df</h3>
</div>

To avoid any bias in our datasets when imputing missing values, we split the data right away: the imputation statistics are then computed on the training set only.
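
The principle can be sketched on a toy dataset: split first, learn the imputation statistic on the training rows, and reuse it unchanged on the validation rows. The column names `x` and `y` and the values below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: one feature with missing values and a balanced binary target
toy = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0, 9.0, 10.0],
    "y": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Split BEFORE imputing, stratifying on the target as in the notebook
toy_train, toy_val = train_test_split(toy, test_size=0.2, stratify=toy["y"], random_state=14)
toy_train, toy_val = toy_train.copy(), toy_val.copy()

# The statistic is learned on the training rows only...
train_median = toy_train["x"].median()
toy_train["x"] = toy_train["x"].fillna(train_median)
# ...and reused as-is on the validation rows, which never influence it
toy_val["x"] = toy_val["x"].fillna(train_median)
```

Imputing before the split would let validation values shape the median, which tends to make validation scores slightly optimistic.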

### Split data

In [None]:
target = "AFTERGRACE_FLAG"
df_train, df_val = train_test_split(df, test_size=0.2, stratify=df[target], random_state=14)

In [None]:
# verify that the target distribution is the same in both sets
# (the seaborn keyword is `stat`, not `stats`)
sns.displot(df_train[target], stat="percent")
sns.displot(df_val[target], stat="percent")

In [None]:
df_train.reset_index(drop = True, inplace=True)
df_val.reset_index(drop = True, inplace=True)

df_train


<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: black; ">Missing value imputation</h3>
</div>

### Processing df_train

In [None]:
#1/ fillna : (mean, median,...)

#1.1/ Drop the columns and rows that hold too many NaN.
# Share of NaN per column: drop the column when it exceeds 50%.
df_train_na_col = (pd.DataFrame(((df_train.isna().sum(axis=0))/len(df_train))*100)).rename(columns={0: 'Sum_NaN_%'})

#drop col :
df_train.drop(columns=list((df_train_na_col[df_train_na_col["Sum_NaN_%"]>50]).index), inplace=True)

# Share of NaN per row: drop the row when it exceeds 60%.
df_train_na_row = (pd.DataFrame(((df_train.isna().sum(axis=1))/len(df_train.columns))*100)).rename(columns={0: 'Sum_NaN_%'})

#drop row :
df_train.drop(index=list((df_train_na_row[df_train_na_row["Sum_NaN_%"]>60]).index), inplace=True)

#reset_index :
df_train.reset_index(drop = True, inplace=True)

df_train

In [None]:
#1.2/ dectect col which have NaN
df_train_na = (pd.DataFrame(df_train.isna().sum())).rename(columns={0: 'Sum_NaN'})
df_train_na_egal_0 = df_train_na[df_train_na['Sum_NaN']==0]
df_train_na_diff_0 = df_train_na[df_train_na['Sum_NaN']!=0]

In [None]:
df_train_na_diff_0

In [None]:
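#1.3/ Build the groupby keys for fillna:
# among the columns without NaN, find those most correlated with the target
# (numeric_only=True skips object columns, which recent pandas versions refuse to correlate)
target = "AFTERGRACE_FLAG"
corr_matrix = df_train[list(df_train_na_egal_0.index)].corr(numeric_only=True).abs().loc[target].drop([target], axis=0)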

In [None]:
corr_matrix = pd.concat([corr_matrix, df_train[list(corr_matrix.index)].nunique()], axis=1).rename(columns={0: 'nunique'})
corr_matrix.sort_values(by=["AFTERGRACE_FLAG", "nunique"])

The variables (without NaN) most correlated with the target and with a small number of distinct values are:
- PASS_GRACE_IND_M1
- PASS_GRACE_IND_M2
- PASS_GRACE_IND_M3

In [None]:
df_train_grouby_3var = df_train.groupby(["PASS_GRACE_IND_M1", "PASS_GRACE_IND_M2", "PASS_GRACE_IND_M3"])
df_train_grouby_2var = df_train.groupby(["PASS_GRACE_IND_M1", "PASS_GRACE_IND_M2"])
df_train_grouby_1var = df_train.groupby(["PASS_GRACE_IND_M1"])

In [None]:
# useless here :
"""
list_median_mode_imputation = []
list_cat_particular_imputation = ["marque"]
list_cont_particular_imputation = [i for i in list(df_train.columns) if i not in list_median_mode_imputation + list_cat_particular_imputation + [target] + ["PASS_GRACE_IND_M1", "PASS_GRACE_IND_M2", "PASS_GRACE_IND_M3"]]
"""

In [None]:
#1.4/ fillna on the dataset:

# numerical columns:
warnings.filterwarnings('ignore')

# fillna with a particular value:
#df_train["num_var"] = df_train["num_var"].fillna("numerical_value")
#df_train[list_cont_particular_imputation] = df_train[list_cont_particular_imputation].fillna(common value, example = 0)


# fillna with the median of the row's group, from the finest grouping to the coarsest,
# then with the global median for rows whose groups hold no observed value.
# (Calling fillna(global_median) on a groupby object would fill every NaN with the
# global median at the first step and ignore the groups.)
target = "AFTERGRACE_FLAG"
groupby_columns = ["PASS_GRACE_IND_M1", "PASS_GRACE_IND_M2", "PASS_GRACE_IND_M3"]
for i in (df_train_na_diff_0.index) : # or replace (df_train_na_diff_0.index) by list_median_mode_imputation
    if is_numeric_dtype(df_train[i]) and i != target :
        global_median = df_train[i].median()
        for n_keys in (3, 2, 1) :
            group_median = df_train.groupby(groupby_columns[:n_keys])[i].transform("median")
            df_train[i] = df_train[i].fillna(group_median)

        df_train[i] = df_train[i].fillna(global_median)

print(df_train.isna().sum())
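
As an illustration of group-wise median imputation with a global fallback, the sketch below runs on a toy frame. The helper `fill_with_group_median` and the data are assumptions made for this example, and it uses a single grouping key rather than the three-level cascade.

```python
import numpy as np
import pandas as pd


def fill_with_group_median(frame: pd.DataFrame, column: str, keys: list) -> pd.Series:
    """Fill NaN in `column` with the median of the row's group, then with the global median."""
    group_median = frame.groupby(keys)[column].transform("median")
    return frame[column].fillna(group_median).fillna(frame[column].median())


# Hypothetical data: group 2 has no observed balance at all
toy = pd.DataFrame({
    "PASS_GRACE_IND_M1": [0, 0, 0, 1, 1, 2],
    "BALANCE_M1": [10.0, np.nan, 30.0, 100.0, np.nan, np.nan],
})
toy["BALANCE_M1"] = fill_with_group_median(toy, "BALANCE_M1", ["PASS_GRACE_IND_M1"])
print(toy["BALANCE_M1"].tolist())
```

The missing value in group 0 receives that group's median (20), the one in group 1 receives 100, and the row in the all-missing group 2 falls back to the global median of the observed values (30).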

In [None]:
# categorical columns:

# fillna with a particular value:
#df_train["var"] = df_train["var"].fillna("string_value")
#df_train[list_cat_particular_imputation].fillna("cat_value", inplace=True)
df_train['marque'] = df_train['marque'].fillna("unknown")


# fillna with the mode of the row's group, from the finest grouping to the coarsest,
# then with the global mode
target = "AFTERGRACE_FLAG"
groupby_columns = ["PASS_GRACE_IND_M1", "PASS_GRACE_IND_M2", "PASS_GRACE_IND_M3"]

def first_mode(s) :
    # most frequent value of a Series, or NaN when the Series holds no observed value
    modes = s.mode()
    return modes.iloc[0] if not modes.empty else np.nan

for i in (df_train_na_diff_0.index) :  # or replace (df_train_na_diff_0.index) by list_median_mode_imputation
    if is_object_dtype(df_train[i]) and i != target :
        global_mode = df_train[i].mode()[0]
        for n_keys in (3, 2, 1) :
            group_mode = df_train.groupby(groupby_columns[:n_keys])[i].transform(first_mode)
            df_train[i] = df_train[i].fillna(group_mode)

        df_train[i] = df_train[i].fillna(global_mode)

print(df_train.isna().sum())

In [None]:
(pd.DataFrame(((df_train.isna().sum(axis=0))/len(df_train))*100)).rename(columns={0: 'Sum_NaN_%'})

### Processing df_val

In [None]:
#1.1/ Drop the columns that were removed from df_train (more than 50% of NaN in df_train).
#drop col :
df_val.drop(columns=list((df_train_na_col[df_train_na_col["Sum_NaN_%"]>50]).index), inplace=True)

# Counting the NaN per row and dropping rows is not needed for df_val.

In [None]:
#1.1/ verify if columns (which we use for groupby) have NaN or not :
groupby_columns = ["PASS_GRACE_IND_M1", "PASS_GRACE_IND_M2", "PASS_GRACE_IND_M3"]
df_val[groupby_columns].isna().sum(axis=0)

In [None]:
# if there is NaN, fillna them by the median/mean or mode from df_train

#numerical col :
for i in groupby_columns :
    df_val[i] = df_val[i].fillna(df_train[i].median())

#cat col :
for i in groupby_columns :
    df_val[i] = df_val[i].fillna(df_train[i].mode())

In [None]:
#1.2/dectect col which have NaN
df_val_na = (pd.DataFrame(df_val.isna().sum())).rename(columns={0: 'Sum_NaN'})
df_val_na_egal_0 = df_val_na[df_val_na['Sum_NaN']==0]
df_val_na_diff_0 = df_val_na[df_val_na['Sum_NaN']!=0]

In [None]:
#1.3/ fillna on the dataset:

# numerical columns:
warnings.filterwarnings('ignore')

# fillna with a particular value:
#df_val["var"] = df_val["var"].fillna("numerical_value")
#df_val[list_cont_particular_imputation] = df_val[list_cont_particular_imputation].fillna(0)


# fillna with the medians computed on df_train, never on df_val, so that no
# information from the validation set enters the imputation
target = "AFTERGRACE_FLAG"
groupby_columns = ["PASS_GRACE_IND_M1", "PASS_GRACE_IND_M2", "PASS_GRACE_IND_M3"]

def imputate_missing_from_train(col, agg_func, n_keys) :
    """Fill the NaN of df_val[col] with the df_train statistic of the row's group."""
    keys = groupby_columns[:n_keys]
    train_stat = df_train.groupby(keys)[col].agg(agg_func)
    # align each validation row with the statistic of its group in df_train;
    # rows whose group is absent from df_train stay NaN and fall through to the next step
    lookup = df_val[keys].join(train_stat, on=keys)[col]
    return df_val[col].fillna(lookup)

for i in (df_val_na_diff_0.index) : #or list_median_mode_imputation
    if is_numeric_dtype(df_val[i]) and i != target :
        for n_keys in (3, 2, 1) :
            df_val[i] = imputate_missing_from_train(i, "median", n_keys)
        df_val[i] = df_val[i].fillna(df_train[i].median())


pd.DataFrame(df_val.isna().sum()).sort_values(by=[0])

In [None]:
# categorical columns:
from pandas.api.types import is_numeric_dtype, is_object_dtype
import warnings
warnings.filterwarnings('ignore')

# fillna with a particular value:
#df_val["var"] = df_val["var"].fillna("categorical_value")
df_val['marque'] = df_val['marque'].fillna("unknown")


# fillna with the modes computed on df_train
target = "AFTERGRACE_FLAG"
groupby_columns = ["PASS_GRACE_IND_M1", "PASS_GRACE_IND_M2", "PASS_GRACE_IND_M3"]

def first_mode(s) :
    # most frequent value of a Series, or NaN when the Series holds no observed value
    modes = s.mode()
    return modes.iloc[0] if not modes.empty else np.nan

def imputate_missing_mode_from_train(col, n_keys) :
    """Fill the NaN of df_val[col] with the df_train mode of the row's group."""
    keys = groupby_columns[:n_keys]
    train_mode = df_train.groupby(keys)[col].agg(first_mode)
    # align each validation row with the mode of its group in df_train
    lookup = df_val[keys].join(train_mode, on=keys)[col]
    return df_val[col].fillna(lookup)


for i in (df_val_na_diff_0.index) : #or list_median_mode_imputation
    if is_object_dtype(df_val[i]) and i != target :
        if df_val[i].isnull().values.any() :
            for n_keys in (3, 2, 1) :
                df_val[i] = imputate_missing_mode_from_train(i, n_keys)
            # mode() returns a Series, so take its first value
            df_val[i] = df_val[i].fillna(df_train[i].mode()[0])


pd.DataFrame(df_val.isna().sum()).sort_values(by=[0])

In [None]:
df_val

### Processing df_test
Only the validation set is used here, so df_test needs no processing.

<a class="anchor" id="chapter2"></a>
<div style="display: flex; background-color: RGB(119, 150, 203);">
    <h1 style="margin: auto; padding: 30px 30px 30px 30px; color: RGB(255,255,255);">
            <b>Exploratory data analysis (EDA) on df_train</b>
    </h1>
</div>

In [None]:
skim(df_train)

In [None]:
# Build a DataFrame storing the data type of each df_train column
info_types = pd.DataFrame(df_train.dtypes, columns=['type'])
info_types.sort_values('type')  # sort by the 'type' column

# Define the target variable
target = "AFTERGRACE_FLAG"

# Select the numerical columns (the target is kept; uncomment the next line to exclude it)
var_num = df_train.select_dtypes(include=np.number).columns.tolist()
#var_num.remove("AFTERGRACE_FLAG")

# Select the categorical columns (CURR_HANDSET_MODE was already dropped during cleaning)
var_cat = df_train.select_dtypes(include=object).columns.tolist()
#var_cat.remove("CURR_HANDSET_MODE")

In [None]:
# Define a function plotting the distribution of a categorical variable against the target variable
def distrib_for_cat_by_target(var_cat: str, dataframe, target: str):
    temp = dataframe.copy()
    temp['Frequency'] = 0
    # number of rows per (target, category) pair, then share of each category within its target class
    counts = temp.groupby([target, var_cat]).count()
    freq_per_group = counts.div(counts.groupby(target).transform('sum')).reset_index()
    g = sns.catplot(x=target, y="Frequency", hue=var_cat, data=freq_per_group, kind="bar",
                  height=8, aspect=2, legend=False)
    ax = g.ax
    for p in ax.patches:
        ax.annotate(f"{p.get_height()*100:.2f}%", (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='center', fontsize=14, color='black', xytext=(0, 20),
                    textcoords='offset points')
    plt.title("Distribution of '" + var_cat + "' by target", fontsize=22)
    plt.legend(fontsize=14)
    plt.xlabel(target, fontsize=18)
    plt.ylabel('Frequency', fontsize=18)
    plt.show()

# Plot the distribution of each categorical variable against the target variable
for i in var_cat:
    distrib_for_cat_by_target(i,df_train,target)


In [None]:
warnings.filterwarnings("ignore")

# Define a function plotting the distribution of a numerical variable against the target variable
def distrib_for_num_by_target(var_num: str, dataframe, target: str):
    """
    Plot the distribution of an explanatory variable conditional on the target (x|y).
    var_num : explanatory variable to study
    dataframe : DataFrame holding var_num and target
    target : target variable
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(14, 7))
    # histplot with kde=True replaces sns.distplot, which seaborn has deprecated
    sns.histplot(dataframe[dataframe[target] == 0][var_num], ax=ax1, kde=True, stat="density")
    sns.histplot(dataframe[dataframe[target] == 1][var_num], ax=ax2, kde=True, stat="density")
    ax1.set_title("Distribution of " + var_num + f" \n for '{target}' = 0")
    ax2.set_title("Distribution of " + var_num + f" \n for '{target}' = 1")
    plt.show()

# Plot the distribution of each numerical variable against the target variable

for i in var_num:
    distrib_for_num_by_target(i, df_train, target)

In [None]:
# Correlation matrix
correlation_matrix = df_train[[elem for elem in df_train.columns if is_numeric_dtype(df_train[elem])]].corr()

# Select the variables whose absolute correlation with AFTERGRACE_FLAG exceeds 0.15
threshold = 0.15
target_correlations = correlation_matrix['AFTERGRACE_FLAG'][correlation_matrix['AFTERGRACE_FLAG'].abs() > threshold]

# Filter the original DataFrame on the selected variables
filtered_df = df_train[target_correlations.index]

# Build a new correlation matrix with the selected variables
filtered_correlation_matrix = filtered_df.corr()

# Draw the heatmap of the filtered correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(filtered_correlation_matrix, annot=True, cmap='coolwarm', center=0, fmt=".2f", annot_kws={"ha": 'center'})
plt.title("Correlation matrix of the variables correlated with AFTERGRACE_FLAG (|correlation| > 0.15)")
plt.show()


<a class="anchor" id="chapter3"></a>
<div style="display: flex; background-color: RGB(119, 150, 203);">
    <h1 style="margin: auto; padding: 30px 30px 30px 30px; color: RGB(255,255,255);">
            <b>Feature engineering</b>
    </h1>
</div>

<a class="anchor" id="section_3_1"></a>
<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: RGB(119, 150, 203); ">Indicator creation</h3>
</div>

For the training data:

In [None]:
# the data are frozen at the end of M1 (August), M2 = July, etc. back to M6 = March

# create 4 variables (since we look for churners within 2 months)

df_train["FLAG_RECHARGE_M1"] = df_train["RECENCY_OF_LAST_RECHARGE"].apply(lambda x : 1 if 0 <= x <= 31 else 0)

df_train["FLAG_RECHARGE_M2"] = df_train["RECENCY_OF_LAST_RECHARGE"].apply(lambda x : 1 if 32 <= x <= 62 else 0)

df_train["FLAG_RECHARGE_M3"] = df_train["RECENCY_OF_LAST_RECHARGE"].apply(lambda x : 1 if 63 <= x <= 92 else 0)

df_train["FLAG_RECHARGE_PLUS_M3"] = df_train["RECENCY_OF_LAST_RECHARGE"].apply(lambda x : 1 if x >= 93 else 0) # further back than M3
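
The four recency flags partition the same variable into consecutive day ranges, so they could also be derived in one pass with `pd.cut` and `pd.get_dummies`. The sketch below is an alternative formulation rather than the notebook's code, and it assumes recency is a non-negative whole number of days; the sample values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical recency values (days since the last recharge), chosen on the bin edges
recency = pd.Series([0, 31, 32, 62, 63, 92, 93, 400], name="RECENCY_OF_LAST_RECHARGE")

# Right-closed intervals: (-1, 31], (31, 62], (62, 92], (92, inf)
bins = [-1, 31, 62, 92, np.inf]
labels = ["M1", "M2", "M3", "PLUS_M3"]
bucket = pd.cut(recency, bins=bins, labels=labels)

# One 0/1 column per bucket, named like the flags built above
flags = pd.get_dummies(bucket, prefix="FLAG_RECHARGE").astype(int)
print(flags)
```

With integer day counts the buckets match the `0-31`, `32-62`, `63-92` and `93+` ranges used in the cell above; with fractional values the two formulations would differ at the boundaries.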

In [None]:
# balance-based approach: if balance M2 > M1 and balance M3 > M2, the customer made several recharges over the 3 months
# only valid if the balance is what remains of the recharges

balance_increasing = (df_train["BALANCE_M2"] > df_train["BALANCE_M1"]) & (df_train["BALANCE_M3"] > df_train["BALANCE_M2"])
df_train["AVERAGE_MULTIPLE_RECHARGE_M1_M2_M3"] = balance_increasing.astype(int)

In [None]:
# 1 if there is any incoming traffic, 0 otherwise

for month in ["M1", "M2", "M3"] :
    incoming = df_train["INC_DURATION_MINS_" + month] + df_train["INC_PROP_SMS_CALLS_" + month]
    df_train["FLAG_IN_" + month] = (incoming != 0).astype(int)

In [None]:
# 1 if there is any outgoing traffic, 0 otherwise
# (the original M2 test contained a stray "== 0" in the middle of the sum)

out_prefixes = ["OUT_DURATION_MINS_", "OUT_SMS_NO_", "OUT_INT_DURATION_MINS_", "OUT_888_DURATION_MINS_", "OUT_VMACC_NO_CALLS_"]
for month in ["M1", "M2", "M3"] :
    # skipna=False mirrors a plain "+" chain: a NaN in any column gives a NaN total
    outgoing = df_train[[prefix + month for prefix in out_prefixes]].sum(axis=1, skipna=False)
    df_train["FLAG_OUT_" + month] = (outgoing != 0).astype(int)

In [None]:
# contract type: old or new (rule: old if the tenure exceeds 2 years, new otherwise)

df_train["OLD_CONTRACT"] = (df_train["CONTRACT_TENURE_DAYS"] > 730).astype(int)

For the test/validation data:

In [None]:
# the data are frozen at the end of M1 (August), M2 = July, etc. back to M6 = March

# create 4 variables (since we look for churners within 2 months)

df_val["FLAG_RECHARGE_M1"] = df_val["RECENCY_OF_LAST_RECHARGE"].apply(lambda x : 1 if 0 <= x <= 31 else 0)

df_val["FLAG_RECHARGE_M2"] = df_val["RECENCY_OF_LAST_RECHARGE"].apply(lambda x : 1 if 32 <= x <= 62 else 0)

df_val["FLAG_RECHARGE_M3"] = df_val["RECENCY_OF_LAST_RECHARGE"].apply(lambda x : 1 if 63 <= x <= 92 else 0)

df_val["FLAG_RECHARGE_PLUS_M3"] = df_val["RECENCY_OF_LAST_RECHARGE"].apply(lambda x : 1 if x >= 93 else 0) # further back than M3

In [None]:
# balance-based approach: if balance M2 > M1 and balance M3 > M2, the customer made several recharges over the 3 months
# only valid if the balance is what remains of the recharges

balance_increasing = (df_val["BALANCE_M2"] > df_val["BALANCE_M1"]) & (df_val["BALANCE_M3"] > df_val["BALANCE_M2"])
df_val["AVERAGE_MULTIPLE_RECHARGE_M1_M2_M3"] = balance_increasing.astype(int)

In [None]:
# 1 if there is any incoming traffic, 0 otherwise

for month in ["M1", "M2", "M3"] :
    incoming = df_val["INC_DURATION_MINS_" + month] + df_val["INC_PROP_SMS_CALLS_" + month]
    df_val["FLAG_IN_" + month] = (incoming != 0).astype(int)

In [None]:
# si quelque chose sort 1 sinon 0

for index, row in df_val.iterrows() :
    if row["OUT_DURATION_MINS_M1"] + row["OUT_SMS_NO_M1"] + row["OUT_INT_DURATION_MINS_M1"] + row["OUT_888_DURATION_MINS_M1"] + row["OUT_VMACC_NO_CALLS_M1"] == 0:
        df_val.at[index, "FLAG_OUT_M1"] = 0

    else :
        df_val.at[index, "FLAG_OUT_M1"] = 1

    if row["OUT_DURATION_MINS_M2"] + row["OUT_SMS_NO_M2"] + row["OUT_INT_DURATION_MINS_M2"] + row["OUT_888_DURATION_MINS_M2"] + row["OUT_VMACC_NO_CALLS_M2"] == 0 :
        df_val.at[index, "FLAG_OUT_M2"] = 0

    else :
        df_val.at[index, "FLAG_OUT_M2"] = 1

    if row["OUT_DURATION_MINS_M3"] + row["OUT_SMS_NO_M3"] + row["OUT_INT_DURATION_MINS_M3"] + row["OUT_888_DURATION_MINS_M3"] + row["OUT_VMACC_NO_CALLS_M3"] == 0 :
        df_val.at[index, "FLAG_OUT_M3"] = 0

    else :
        df_val.at[index, "FLAG_OUT_M3"] = 1

In [None]:
# Contract type: old or new (rule: old if tenure exceeds 2 years, i.e. 730 days, otherwise new)

for index, row in df_val.iterrows() :
    if row["CONTRACT_TENURE_DAYS"] > 730 :
        df_val.at[index, "OLD_CONTRACT"] = 1

    else :
        df_val.at[index, "OLD_CONTRACT"] = 0
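The row-by-row loops above can also be written as column operations, which is usually much faster on large frames. The following is a minimal sketch on a made-up toy frame (the values are invented; only the column names come from the notebook), applying the same rules as the loops for `FLAG_IN_M1` and `OLD_CONTRACT`.

```python
import numpy as np
import pandas as pd

# Toy data: column names mirror the notebook, values are invented for illustration.
toy = pd.DataFrame({
    "INC_DURATION_MINS_M1": [0.0, 12.5, 0.0],
    "INC_PROP_SMS_CALLS_M1": [0.0, 0.3, 0.1],
    "CONTRACT_TENURE_DAYS": [100, 731, 730],
})

# 1 if there is any incoming activity in M1, else 0 (same rule as the loop).
incoming_m1 = toy["INC_DURATION_MINS_M1"] + toy["INC_PROP_SMS_CALLS_M1"]
toy["FLAG_IN_M1"] = (incoming_m1 != 0).astype(int)

# 1 if the contract is strictly older than 730 days, else 0.
toy["OLD_CONTRACT"] = np.where(toy["CONTRACT_TENURE_DAYS"] > 730, 1, 0)

print(toy[["FLAG_IN_M1", "OLD_CONTRACT"]])
```

One difference to keep in mind: the vectorised version produces integer columns, whereas filling a new column cell by cell with `df.at` may yield floats.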

<a class="anchor" id="section_3_2"></a>
<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: RGB(119, 150, 203); ">Variable pre-selection (filter methods)</h3>
</div>

In [None]:
# Categorical variables: Chi-squared test of independence
info_types = pd.DataFrame(df_train.dtypes)
list_var_cat = info_types[info_types[0]=="object"].index.tolist()
list_col_to_drop = []

target = "AFTERGRACE_FLAG"
for v in list_var_cat:
    if v!=target:
        cont = df_train[[v, target]].pivot_table(index=v, columns=target, aggfunc=len).fillna(0).copy().astype(int) # Build the contingency table
        st_chi2, st_p, st_dof, st_exp = st.chi2_contingency(cont)

        #col to drop :
        if st_p >= 0.05 :
            list_col_to_drop.append(v)

        #print(v + ": p-value test chi 2 = " + str(st_p))

In [None]:
# Numeric variables: Welch's t-test (equal_var=False) between the two target classes
info_df_num = df_train.describe()

for v in info_df_num.columns.tolist():
    if v!= target:
        a=list(df_train[df_train[target]==0][v])
        b=list(df_train[df_train[target]==1][v])
        st_test, st_p = st.ttest_ind(a, b, axis=0, equal_var=False, nan_policy='omit')

        #col to drop :
        if st_p >= 0.05 :
            list_col_to_drop.append(v)

        #print(v + ": p-value test Student = " + str(st_p))
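To make the two filter tests concrete, here is a small self-contained sketch on synthetic data. Everything in it (the `cat` and `num` features, the sample sizes, the 0.05 level held in `alpha`) is an assumption for illustration; it only mirrors the logic of the cells above, where a variable is dropped when its p-value is at least 0.05.

```python
import numpy as np
import pandas as pd
from scipy import stats as st

rng = np.random.default_rng(0)
target = np.array([0] * 100 + [1] * 100)

# Categorical feature: mostly "y" for class 0 and "x" for class 1, with 10 flips per class.
cat = np.where(target == 1, "x", "y")
cat[:10] = "x"
cat[-10:] = "y"

# Chi-squared test of independence on the contingency table.
cont = pd.crosstab(cat, target)
chi2, p_chi2, dof, expected = st.chi2_contingency(cont)

# Numeric feature whose mean is shifted by 3 for class 1, tested with Welch's t-test.
num = rng.normal(0, 1, 200) + 3 * target
_, p_t = st.ttest_ind(num[target == 0], num[target == 1], equal_var=False)

alpha = 0.05
to_drop = [name for name, p in {"cat": p_chi2, "num": p_t}.items() if p >= alpha]
print(p_chi2, p_t, to_drop)
```

Both synthetic features are strongly linked to the target, so neither ends up in `to_drop`. Note that these are univariate tests: a variable that fails them can still be useful in interaction with others.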

In [None]:
list_col_to_drop

In [None]:
df_train.drop(list_col_to_drop, axis=1, inplace=True)
df_val.drop(list_col_to_drop, axis=1, inplace=True)

In [None]:
# Save df_train so that app.py can compute Shapley values:
df_train.to_csv(path_or_buf="../DATA/df_train.csv", sep=';', index=False)

<a class="anchor" id="section_3_3"></a>
<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: RGB(119, 150, 203); ">Split df_train into x_train and y_train (same for df_val)</h3>
</div>

In [None]:
x_train = df_train.drop([target], axis=1)
y_train = df_train[target]

x_val = df_val.drop([target], axis=1)
y_val = df_val[target]

In [None]:
x_train

<a class="anchor" id="section_3_3"></a>
<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: RGB(119, 150, 203); ">Encoding and scaling (train and validation sets)</h3>
</div>

In [None]:
# Continuous columns: numeric with more than 2 distinct values
list_cont_col = x_train.select_dtypes(include=[np.number]).columns.tolist()
list_cont_col = [col for col in list_cont_col if x_train[col].nunique() > 2]
len(list_cont_col)

In [None]:
# Binary columns: numeric with at most 2 distinct values
list_binary_col = x_train.select_dtypes(include=[np.number]).columns.tolist()
list_binary_col = [col for col in list_binary_col if x_train[col].nunique() <= 2]
len(list_binary_col)

In [None]:
#categorical col :
#list_cat_col = x_train.select_dtypes(include=['object']).columns.tolist()
list_cat_col_OHE = ['CUSTOMER_GENDER']
list_cat_col_TE =  ['marque']

In [None]:
len(list_cont_col) + len(list_binary_col) + len(list_cat_col_OHE) + len(list_cat_col_TE)

In [None]:
sys.path.append("pre_processing.py")
a = pre_processing()

In [None]:
# Preprocessing for linear models:
# categorical variables are encoded and continuous variables are scaled (MinMaxScaler).
x_train_bis = x_train.copy() # used later

x_train_preprocessed = a.pre_processing(df=x_train, train=True, categorical_var_OHE= list_cat_col_OHE,
                           categorical_var_OrdinalEncoding={}, categorical_var_TE=list_cat_col_TE, target=y_train,
                           continious_var=list_cont_col, encoding_type_cont=MinMaxScaler())
x_train_preprocessed

In [None]:
x_val_bis = x_val.copy() # used later

x_val_preprocessed = a.pre_processing(df=x_val, train=False, categorical_var_OHE= list_cat_col_OHE,
                         categorical_var_OrdinalEncoding={}, categorical_var_TE=list_cat_col_TE, target=y_train,
                         continious_var=list_cont_col, encoding_type_cont=MinMaxScaler())
x_val_preprocessed

In [None]:
# Preprocessing for boosting models:
# only the categorical variables are encoded; continuous variables are left unscaled.
x_train = a.pre_processing(df=x_train, train=True, categorical_var_OHE= list_cat_col_OHE,
                           categorical_var_OrdinalEncoding={}, categorical_var_TE=list_cat_col_TE, target=y_train,
                           continious_var=[], encoding_type_cont=StandardScaler())
x_train

In [None]:
x_val = a.pre_processing(df=x_val, train=False, categorical_var_OHE= list_cat_col_OHE,
                         categorical_var_OrdinalEncoding={}, categorical_var_TE=list_cat_col_TE, target=y_train,
                         continious_var=[], encoding_type_cont=StandardScaler())
x_val

<a class="anchor" id="section_3_4"></a>
<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: RGB(119, 150, 203); ">Variable selection - Logistic regression (RFE)</h3>
</div>

In [None]:
from sklearn.model_selection import StratifiedKFold
warnings.filterwarnings('ignore')

rfe_score = {"nb_var" : 0,
             "best_score" : 0}

for i in tqdm(range(1,len(x_train_preprocessed.columns)+1)) :
    estimator = LogisticRegression(penalty=None)
    selector = RFE(estimator, n_features_to_select=i,step=1)
    selector.fit(x_train_preprocessed,y_train)

    x_train_preprocessed_new = x_train_preprocessed[list(selector.get_feature_names_out())]
    estimator.fit(x_train_preprocessed_new,y_train)
    
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
    skf.get_n_splits(x_train_preprocessed_new, y_train)   
    cross_val_score_ = mean(cross_val_score(estimator, x_train_preprocessed_new, y_train, cv=skf, scoring = "roc_auc")) #scoring : scoring = f1_weighted or accuracy

    if cross_val_score_ > rfe_score["best_score"] :
        rfe_score["nb_var"] = i
        rfe_score["best_score"] = cross_val_score_

In [None]:
rfe_score

In [None]:
warnings.filterwarnings('ignore')

estimator = LogisticRegression(penalty=None)
selector = RFE(estimator, n_features_to_select=rfe_score['nb_var'], step=1)
selector.fit(x_train_preprocessed,y_train)

print(list(selector.get_feature_names_out()))

In [None]:
# Linear model :
x_train_preprocessed = x_train_preprocessed[list(selector.get_feature_names_out())]
x_val_preprocessed = x_val_preprocessed[list(selector.get_feature_names_out())]

# Boosting model :
x_train = x_train[list(selector.get_feature_names_out())]
x_val = x_val[list(selector.get_feature_names_out())]

x_train
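The search over the number of features can also be delegated to scikit-learn's `RFECV`, which runs the recursive elimination inside each cross-validation fold and keeps the feature count with the best mean score. This is a related but different procedure from the loop above (which selects on the full training set and then cross-validates the selected subset), so the chosen features may differ. The sketch below uses synthetic data from `make_classification` and a default L2-penalised logistic regression; both are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for x_train_preprocessed / y_train.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=skf,
                 scoring="roc_auc", min_features_to_select=1)
selector.fit(X, y)

n_selected = selector.n_features_   # number of features kept
mask = selector.support_            # boolean mask of the kept features
print(n_selected, mask)
```

With a DataFrame as input, `selector.get_feature_names_out()` returns the retained column names, as in the cells above.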

<a class="anchor" id="chapter4"></a>
<div style="display: flex; background-color: RGB(119, 150, 203);">
    <h1 style="margin: auto; padding: 30px 30px 30px 30px; color: RGB(255,255,255);">
            <b>Modelling</b>
</div>

<a class="anchor" id="section_4_1"></a>
<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: RGB(119, 150, 203); ">Logistic regression</h3>
</div>

### Naive model

In [None]:
lr = LogisticRegression(penalty=None)
lr.fit(x_train_preprocessed, y_train)
print(lr.score(x_train_preprocessed, y_train)) # accuracy on the train set
print(lr.score(x_val_preprocessed, y_val)) # accuracy on the validation set

### roc_auc_score

In [None]:
#roc_auc_score
print(roc_auc_score(y_train, lr.predict_proba(x_train_preprocessed)[:, 1])) #, multi_class='ovr'
print(roc_auc_score(y_val, lr.predict_proba(x_val_preprocessed)[:, 1])) #, multi_class='ovr'

In [None]:
#Cross_val_score :
from sklearn.model_selection import cross_val_score
print(cross_val_score(lr, x_train_preprocessed, y_train, cv=5, scoring='roc_auc').mean())

In [None]:
# Stratified CV with shuffling and a fixed seed (for a classifier, cv=5 is already stratified but unshuffled)
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
skf.get_n_splits(x_train_preprocessed, y_train)

print(cross_val_score(lr, x_train_preprocessed, y_train, cv=skf, scoring='roc_auc').mean())

### f1_score

In [None]:
#f1_score
print(f1_score(y_train, lr.predict(x_train_preprocessed), average='micro', pos_label=1)) # pos_label is ignored unless average='binary'; with average='micro' the score equals accuracy on this binary target
print(f1_score(y_val, lr.predict(x_val_preprocessed), average='micro', pos_label=1))
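A point worth checking when reading these F1 values: on a single-label binary target, micro-averaged F1 is the same number as accuracy, while `average='binary'` scores only the positive class (the churners). The tiny hand-made example below (labels invented for illustration) shows the gap between the two.

```python
from sklearn.metrics import accuracy_score, f1_score

# 10 invented observations: 2 churners, of which 1 is caught, plus 1 false alarm.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_hat)                                # 8 correct out of 10
f1_micro = f1_score(y_true, y_hat, average="micro")                # same value as accuracy
f1_binary = f1_score(y_true, y_hat, average="binary", pos_label=1) # TP=1, FP=1, FN=1

print(acc, f1_micro, f1_binary)
```

Here accuracy and micro F1 are both 0.8, whereas the F1 of the churn class is 0.5, so `average='binary'` (or `scoring='f1'`) is the more informative choice when the positive class is the one of interest.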

In [None]:
from sklearn.model_selection import cross_val_score
print(cross_val_score(lr, x_train_preprocessed, y_train, cv=5, scoring='f1_micro').mean()) #or scoring="f1_weighted"

In [None]:
# Stratified CV with shuffling and a fixed seed (for a classifier, cv=5 is already stratified but unshuffled)
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
skf.get_n_splits(x_train_preprocessed, y_train)

print(cross_val_score(lr, x_train_preprocessed, y_train, cv=skf, scoring='f1_micro').mean())

### Hyperparameter optimisation

* `penalty`: {'l1', 'l2', 'elasticnet', None}
* `C`: float, default=1.0
* `class_weight`: None or 'balanced', default=None
* `solver`: 'saga' (used here because it supports every penalty above), default='lbfgs'
* `max_iter`: int, default=100
* `l1_ratio`: float, default=None (only used when penalty='elasticnet')

In [None]:
# Bayesian Optimisation (optuna) :
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

def objective(trial):
    penalty = trial.suggest_categorical('penalty', ["l1", "l2", "elasticnet", None])
    C = trial.suggest_float('C', 0.1, 5) #, step=0.1
    class_weight = trial.suggest_categorical('class_weight', ["balanced", None])
    max_iter = trial.suggest_int('max_iter', 1, 500)
    l1_ratio = trial.suggest_float('l1_ratio', 0.001, 0.999)

    model = LogisticRegression(solver='saga', penalty=penalty, C=C, class_weight=class_weight, max_iter=max_iter, l1_ratio=l1_ratio)

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
    skf.get_n_splits(x_train_preprocessed, y_train) 
    return cross_val_score(model, x_train_preprocessed, y_train, n_jobs=-1, cv=skf, scoring='roc_auc').mean() # mean ROC AUC over the 5 folds


study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

In [None]:
trial = study.best_trial
print('score : {}'.format(trial.value)) # best mean cross-validated ROC AUC
print("Best hyperparameters: {}".format(trial.params))

### Fit the model with the best hyperparameters

In [None]:
# Model evaluation: accuracy on the train and validation sets
best_model_LR = LogisticRegression(solver='saga', penalty=(trial.params)["penalty"], C=(trial.params)["C"], class_weight=(trial.params)["class_weight"],
                               max_iter=(trial.params)["max_iter"], l1_ratio=(trial.params)["l1_ratio"])
best_model_LR.fit(x_train_preprocessed, y_train)

display(best_model_LR.score(x_train_preprocessed, y_train)) # accuracy on the train set
display(best_model_LR.score(x_val_preprocessed, y_val)) # accuracy on the validation set

### roc_auc_score

In [None]:
#roc_auc_score
print(roc_auc_score(y_train, best_model_LR.predict_proba(x_train_preprocessed)[:, 1])) #, multi_class='ovr'
print(roc_auc_score(y_val, best_model_LR.predict_proba(x_val_preprocessed)[:, 1])) #, multi_class='ovr'

In [None]:
#Cross_val_score :
from sklearn.model_selection import cross_val_score
print(cross_val_score(best_model_LR, x_train_preprocessed, y_train, cv=5, scoring='roc_auc').mean())

In [None]:
# Stratified CV with shuffling and a fixed seed (for a classifier, cv=5 is already stratified but unshuffled)
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
skf.get_n_splits(x_train_preprocessed, y_train)

print(cross_val_score(best_model_LR, x_train_preprocessed, y_train, cv=skf, scoring='roc_auc').mean())

### f1_score

In [None]:
#f1_score
print(f1_score(y_train, best_model_LR.predict(x_train_preprocessed), average='micro', pos_label=1)) # pos_label is ignored unless average='binary'; with average='micro' the score equals accuracy on this binary target
print(f1_score(y_val, best_model_LR.predict(x_val_preprocessed), average='micro', pos_label=1))

In [None]:
from sklearn.model_selection import cross_val_score
print(cross_val_score(best_model_LR, x_train_preprocessed, y_train, cv=5, scoring='f1_micro').mean()) #or scoring="f1_weighted"

In [None]:
# Stratified CV with shuffling and a fixed seed (for a classifier, cv=5 is already stratified but unshuffled)
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
skf.get_n_splits(x_train_preprocessed, y_train)

print(cross_val_score(best_model_LR, x_train_preprocessed, y_train, cv=skf, scoring='f1_micro').mean())

<a class="anchor" id="section_4_2"></a>
<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: RGB(119, 150, 203); ">XGBoost</h3>
</div>

### Naive model

In [None]:
xgboost = XGBClassifier(random_state=10, tree_method='gpu_hist', predictor="gpu_predictor") #, tree_method='gpu_hist'
xgboost.fit(x_train, y_train)
print(xgboost.score(x_train, y_train)) # accuracy on the train set
print(xgboost.score(x_val, y_val)) # accuracy on the validation set

### roc_auc_score

In [None]:
#roc_auc_score
print(roc_auc_score(y_train, xgboost.predict_proba(x_train)[:, 1])) #, multi_class='ovr'
print(roc_auc_score(y_val, xgboost.predict_proba(x_val)[:, 1])) #, multi_class='ovr'

In [None]:
#Cross_val_score :
from sklearn.model_selection import cross_val_score
print(cross_val_score(xgboost, x_train,y_train, cv=5, scoring='roc_auc').mean())

In [None]:
# Stratified CV with shuffling and a fixed seed (for a classifier, cv=5 is already stratified but unshuffled)
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
skf.get_n_splits(x_train, y_train)

print(cross_val_score(xgboost, x_train, y_train, cv=skf, scoring='roc_auc').mean())

### f1_score

In [None]:
#f1_score
print(f1_score(y_train, xgboost.predict(x_train), average='micro', pos_label=1)) # pos_label is ignored unless average='binary'; with average='micro' the score equals accuracy on this binary target
print(f1_score(y_val, xgboost.predict(x_val), average='micro', pos_label=1)) #or weighted

In [None]:
from sklearn.model_selection import cross_val_score
print(cross_val_score(xgboost, x_train, y_train, cv=5, scoring='f1_micro').mean()) #or scoring="f1"

In [None]:
# Stratified CV with shuffling and a fixed seed (for a classifier, cv=5 is already stratified but unshuffled)
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
skf.get_n_splits(x_train, y_train)

print(cross_val_score(xgboost, x_train, y_train, cv=skf, scoring='f1_micro').mean())

### Hyperparameter optimisation

In [None]:
#Bayesian Optimisation (optuna) :
# TODO: find a way to rank all tried hyperparameter sets from best to worst
def objective(trial):
    #random_state = trial.suggest_int('random_state', 1,100)
    n_estimators = trial.suggest_int('n_estimators', 1,2000) # number of trees
    max_depth = trial.suggest_int('max_depth', 1, 10) # maximum tree depth
    min_child_weight = trial.suggest_int('min_child_weight', 1, 10) # minimum sum of instance weights (hessian) in a child
    learning_rate = trial.suggest_float('learning_rate', 0.01, 0.7)
    min_split_loss = trial.suggest_float('min_split_loss', 0, 5) # minimum loss reduction required to split (gamma)
    colsample_bytree = trial.suggest_float('colsample_bytree', 0.1, 1) # fraction of columns sampled for each tree
    subsample = trial.suggest_float('subsample', 0.1, 1) # fraction of rows sampled for each tree


    model = XGBClassifier(random_state=10, n_estimators=n_estimators, max_depth=max_depth, min_child_weight=min_child_weight,
                          learning_rate=learning_rate, min_split_loss=min_split_loss, colsample_bytree=colsample_bytree, subsample=subsample,
                          n_jobs=-1, tree_method='gpu_hist', predictor="gpu_predictor") #, tree_method='gpu_hist'

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
    skf.get_n_splits(x_train, y_train) 
    return cross_val_score(model, x_train, y_train, n_jobs=-1, cv=skf, scoring='f1_micro').mean() # or f1_weighted


study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

In [None]:
trial = study.best_trial
print('score: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))

### Fit the model with the best hyperparameters

In [None]:
#model evaluation : accuracy, precision...
best_model_xgboost = XGBClassifier(random_state=10, n_estimators=(trial.params)["n_estimators"], max_depth=(trial.params)["max_depth"],
                          min_child_weight=(trial.params)["min_child_weight"], learning_rate=(trial.params)["learning_rate"],
                          min_split_loss=(trial.params)["min_split_loss"],colsample_bytree=(trial.params)["colsample_bytree"],
                          subsample=(trial.params)["subsample"], n_jobs=-1, tree_method='gpu_hist', predictor="gpu_predictor") #, tree_method='gpu_hist'
best_model_xgboost.fit(x_train, y_train)

display(best_model_xgboost.score(x_train, y_train)) # accuracy on the train set
display(best_model_xgboost.score(x_val, y_val)) # accuracy on the validation set

### roc_auc_score

In [None]:
#roc_auc_score
print(roc_auc_score(y_train, best_model_xgboost.predict_proba(x_train)[:, 1])) #, multi_class='ovr'
print(roc_auc_score(y_val, best_model_xgboost.predict_proba(x_val)[:, 1])) #, multi_class='ovr'

In [None]:
#Cross_val_score :
from sklearn.model_selection import cross_val_score
print(cross_val_score(best_model_xgboost, x_train,y_train, cv=5, scoring='roc_auc').mean())

In [None]:
# Stratified CV with shuffling and a fixed seed (for a classifier, cv=5 is already stratified but unshuffled)
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
skf.get_n_splits(x_train, y_train)

print(cross_val_score(best_model_xgboost, x_train, y_train, cv=skf, scoring='roc_auc').mean())

### f1_score

In [None]:
#f1_score
print(f1_score(y_train, best_model_xgboost.predict(x_train), average='micro', pos_label=1)) # pos_label is ignored unless average='binary'; with average='micro' the score equals accuracy on this binary target
print(f1_score(y_val, best_model_xgboost.predict(x_val), average='micro', pos_label=1))

In [None]:
from sklearn.model_selection import cross_val_score
print(cross_val_score(best_model_xgboost, x_train, y_train, cv=5, scoring='f1_micro').mean()) #or scoring="f1"

In [None]:
# Stratified CV with shuffling and a fixed seed (for a classifier, cv=5 is already stratified but unshuffled)
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
skf.get_n_splits(x_train, y_train)

print(cross_val_score(best_model_xgboost, x_train, y_train, cv=skf, scoring='f1_micro').mean())

<a class="anchor" id="section_4_3"></a>
<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: RGB(119, 150, 203); ">LightGBM</h3>
</div>

In [None]:
!pip install lightgbm

In [None]:
lgbm = LGBMClassifier(random_state=15)  #, device='gpu' or 'cuda' or device_type="gpu"
lgbm.fit(x_train, y_train)
print(lgbm.score(x_train, y_train))  #or auc
print(lgbm.score(x_val, y_val))  #or auc

Available scoring metrics: accuracy, average_precision, precision, recall, f1 (for binary targets), roc_auc

### roc_auc_score

In [None]:
#roc_auc_score
print(roc_auc_score(y_train, lgbm.predict_proba(x_train)[:, 1])) #, multi_class='ovr'
print(roc_auc_score(y_val, lgbm.predict_proba(x_val)[:, 1])) #, multi_class='ovr'

In [None]:
#Cross_val_score :
from sklearn.model_selection import cross_val_score
print(cross_val_score(lgbm, x_train,y_train, cv=5, scoring='roc_auc').mean())

In [None]:
# Stratified CV with shuffling and a fixed seed (for a classifier, cv=5 is already stratified but unshuffled)
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
skf.get_n_splits(x_train, y_train)

print(cross_val_score(lgbm, x_train, y_train, cv=skf, scoring='roc_auc').mean())

### f1_score

In [None]:
#f1_score
print(f1_score(y_train, lgbm.predict(x_train), average='weighted', pos_label=1)) # pos_label is ignored unless average='binary'
print(f1_score(y_val, lgbm.predict(x_val), average='weighted', pos_label=1))

In [None]:
from sklearn.model_selection import cross_val_score
print(cross_val_score(lgbm, x_train, y_train, cv=5, scoring='f1_weighted').mean()) #or scoring="f1"

In [None]:
# Stratified CV with shuffling and a fixed seed (for a classifier, cv=5 is already stratified but unshuffled)
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
skf.get_n_splits(x_train, y_train)

print(cross_val_score(lgbm, x_train, y_train, cv=skf, scoring='f1_micro').mean())

### Hyperparameter optimisation

In [None]:
#Bayesian Optimisation (optuna) :
def objective(trial):
    max_depth = trial.suggest_int('max_depth', 1, 10) # maximum tree depth
    num_leaves = trial.suggest_int('num_leaves', 2,201)
    min_child_samples = trial.suggest_int('min_child_samples', 10,201)
    #colsample_bytree = trial.suggest_float('colsample_bytree', 0.1,1)
    subsample = trial.suggest_float('subsample', 0.1, 1)

    learning_rate = trial.suggest_float('learning_rate', 0.01, 0.7)
    n_estimators = trial.suggest_int('n_estimators', 1, 2000) # number of trees

    model = LGBMClassifier(random_state=10, max_depth=max_depth, num_leaves=num_leaves, min_child_samples=min_child_samples,
                           subsample=subsample,
                           learning_rate=learning_rate, n_estimators=n_estimators)  #, device='gpu' or "cuda"
    #colsample_bytree=colsample_bytree

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
    skf.get_n_splits(x_train, y_train) 
    return cross_val_score(model, x_train, y_train, n_jobs=-1, cv=skf, scoring='f1_weighted').mean()  #or auc #or recall #or f1


study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

In [None]:
trial = study.best_trial
print('score: {}'.format(trial.value))  #or auc
print("Best hyperparameters: {}".format(trial.params))

### Fit the model with the best hyperparameters

In [None]:
# Model evaluation: accuracy on the train and validation sets
best_model_lgbm = LGBMClassifier(random_state=10, max_depth=(trial.params)['max_depth'], num_leaves=(trial.params)['num_leaves'],
                           min_child_samples=(trial.params)['min_child_samples'],
                           subsample=(trial.params)['subsample'],learning_rate=(trial.params)['learning_rate'],
                           n_estimators=(trial.params)['n_estimators'])  #, device='gpu' or "cuda" #colsample_bytree=(trial.params)['colsample_bytree'],

best_model_lgbm.fit(x_train, y_train)
display(best_model_lgbm.score(x_train, y_train))  #or auc
display(best_model_lgbm.score(x_val, y_val)) #or auc

### roc_auc_score

In [None]:
#roc_auc_score
print(roc_auc_score(y_train, best_model_lgbm.predict_proba(x_train)[:, 1])) #, multi_class='ovr'
print(roc_auc_score(y_val, best_model_lgbm.predict_proba(x_val)[:, 1])) #, multi_class='ovr'

In [None]:
#Cross_val_score :
from sklearn.model_selection import cross_val_score
print(cross_val_score(best_model_lgbm, x_train,y_train, cv=5, scoring='roc_auc').mean())

In [None]:
# Stratified CV with shuffling and a fixed seed (for a classifier, cv=5 is already stratified but unshuffled)
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
skf.get_n_splits(x_train, y_train)

print(cross_val_score(best_model_lgbm, x_train, y_train, cv=skf, scoring='roc_auc').mean())

### f1_score

In [None]:
#f1_score
print(f1_score(y_train, best_model_lgbm.predict(x_train), average='weighted', pos_label=1)) # pos_label is ignored unless average='binary'
print(f1_score(y_val, best_model_lgbm.predict(x_val), average='weighted', pos_label=1))

In [None]:
from sklearn.model_selection import cross_val_score
print(cross_val_score(best_model_lgbm, x_train, y_train, cv=5, scoring='f1_weighted').mean()) #or scoring="f1"

In [None]:
# Stratified CV with shuffling and a fixed seed (for a classifier, cv=5 is already stratified but unshuffled)
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
skf.get_n_splits(x_train, y_train)

print(cross_val_score(best_model_lgbm, x_train, y_train, cv=skf, scoring='f1_micro').mean())

<a class="anchor" id="chapter5"></a>
<div style="display: flex; background-color: RGB(119, 150, 203);">
    <h1 style="margin: auto; padding: 30px 30px 30px 30px; color: RGB(255,255,255);">
            <b>Final model choice: XGBoost</b>
</div>

<a class="anchor" id="section_5_1"></a>
<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: RGB(119, 150, 203); ">Results analysis (on df_val/test)</h3>
</div>

### Confusion matrix

In [None]:
# prediction on val
y_pred = best_model_xgboost.predict(x_val)

# compute the confusion matrix
cm = confusion_matrix(y_val,y_pred)

#Plot the confusion matrix.
f, ax=plt.subplots(figsize=(6,6))
sns.heatmap(cm,annot=True,linewidths=0.5,linecolor="red",fmt=".0f",ax=ax)
plt.ylabel('Actual',fontsize=13)
plt.xlabel('Prediction',fontsize=13)
plt.title('Confusion Matrix on y_val',fontsize=17)
plt.show()

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_val, y_pred))

### ROC-AUC

ROC Curve - One vs Rest (OvR)

Compares each class with the rest of the classes

In [None]:
def calculate_tpr_fpr(y_real, y_pred):
    '''
    Calculates the True Positive Rate (tpr) and the False Positive Rate (fpr) based on real and predicted observations

    Args:
        y_real: The list or series with the real classes
        y_pred: The list or series with the predicted classes

    Returns:
        tpr: The True Positive Rate of the classifier
        fpr: The False Positive Rate of the classifier
    '''

    # Calculates the confusion matrix and recover each element
    cm = confusion_matrix(y_real, y_pred)
    TN = cm[0, 0]
    FP = cm[0, 1]
    FN = cm[1, 0]
    TP = cm[1, 1]

    # Calculates tpr and fpr
    tpr =  TP/(TP + FN) # sensitivity - true positive rate
    fpr = 1 - TN/(TN+FP) # 1-specificity - false positive rate

    return tpr, fpr

In [None]:
def get_all_roc_coordinates(y_real, y_proba):
    '''
    Calculates all the ROC Curve coordinates (tpr and fpr) by considering each point as a threshold for the prediction of the class.

    Args:
        y_real: The list or series with the real classes.
        y_proba: The array with the probabilities for each class, obtained by using the `.predict_proba()` method.

    Returns:
        tpr_list: The list of TPRs representing each threshold.
        fpr_list: The list of FPRs representing each threshold.
    '''
    tpr_list = [0]
    fpr_list = [0]
    for i in range(len(y_proba)):
        threshold = y_proba[i]
        y_pred = y_proba >= threshold
        tpr, fpr = calculate_tpr_fpr(y_real, y_pred)
        tpr_list.append(tpr)
        fpr_list.append(fpr)
    return tpr_list, fpr_list
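The function above recomputes a full confusion matrix for every observation, so its cost grows quadratically with the size of the validation set. As a sanity check of the idea, the sketch below computes the (FPR, TPR) point for each score used as a threshold on four invented observations and compares the result with `sklearn.metrics.roc_curve`, which produces the same points in a single pass. The data and the helper name `tpr_fpr_at` are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Four invented observations: true class and predicted probability of class 1.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

def tpr_fpr_at(threshold):
    """Return (fpr, tpr) when predicting 1 for every score >= threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fn = np.sum(~pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    tn = np.sum(~pred & (y_true == 0))
    return float(fp / (fp + tn)), float(tp / (tp + fn))

# One ROC point per observed score, as in get_all_roc_coordinates.
manual = sorted(tpr_fpr_at(t) for t in scores)

# Same points from scikit-learn (the first returned point is the (0, 0) origin).
fpr, tpr, _ = roc_curve(y_true, scores, drop_intermediate=False)
from_sklearn = sorted((float(a), float(b)) for a, b in zip(fpr[1:], tpr[1:]))

print(manual)
print(from_sklearn)
```

Both lists contain the same four points, so on larger data `roc_curve` can replace the manual loop.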

In [None]:
def plot_roc_curve(tpr, fpr, scatter = True, ax = None):
    '''
    Plots the ROC Curve by using the list of coordinates (tpr and fpr).

    Args:
        tpr: The list of TPRs representing each coordinate.
        fpr: The list of FPRs representing each coordinate.
        scatter: When True, the points used on the calculation will be plotted with the line (default = True).
    '''
    if ax is None:
        plt.figure(figsize = (5, 5))
        ax = plt.axes()

    if scatter:
        sns.scatterplot(x = fpr, y = tpr, ax = ax)
    sns.lineplot(x = fpr, y = tpr, ax = ax)
    sns.lineplot(x = [0, 1], y = [0, 1], color = 'green', ax = ax)
    plt.xlim(-0.05, 1.05)
    plt.ylim(-0.05, 1.05)
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")

In [None]:
y_pred = best_model_xgboost.predict(x_val)
y_proba = best_model_xgboost.predict_proba(x_val)

In [None]:
classes = best_model_xgboost.classes_
classes

In [None]:
# Plots the Probability Distributions and the ROC Curves One vs Rest
plt.figure(figsize = (12, 8))
bins = [i/20 for i in range(20)] + [1]
classes = best_model_xgboost.classes_
roc_auc_ovr = {}
for i in range(len(classes)):
    # Gets the class
    c = classes[i]

    # Prepares an auxiliary dataframe to help with the plots
    df_aux = x_val.copy()
    df_aux['class'] = [1 if y == c else 0 for y in y_val]
    df_aux['prob'] = y_proba[:, i]
    df_aux = df_aux.reset_index(drop = True)

    # Plots the probability distribution for the class and the rest
    ax = plt.subplot(2, 4, i+1)
    sns.histplot(x = "prob", data = df_aux, hue = 'class', color = 'b', ax = ax, bins = bins)
    ax.set_title(c)
    ax.legend([f"Class: {c}", "Rest"])
    ax.set_xlabel(f"P(x = {c})")

    # Calculates the ROC Coordinates and plots the ROC Curves
    ax_bottom = plt.subplot(2, 4, i+5)
    tpr, fpr = get_all_roc_coordinates(df_aux['class'], df_aux['prob'])
    plot_roc_curve(tpr, fpr, scatter = False, ax = ax_bottom)
    ax_bottom.set_title("ROC Curve OvR")

    # Calculates the ROC AUC OvR
    roc_auc_ovr[c] = roc_auc_score(df_aux['class'], df_aux['prob'])
plt.tight_layout()

In [None]:
from sklearn.metrics import roc_curve, auc

fpr_train_XGB, tpr_train_XGB, thresholds_train_XGB = roc_curve(y_train, best_model_xgboost.predict_proba(x_train)[:,1])
roc_auc_train_XGB = auc(fpr_train_XGB, tpr_train_XGB)


fpr_val_XGB, tpr_val_XGB, thresholds_val_XGB = roc_curve(y_val, best_model_xgboost.predict_proba(x_val)[:,1])
roc_auc_val_XGB = auc(fpr_val_XGB, tpr_val_XGB)


plt.figure()
lw = 2
plt.plot(fpr_train_XGB, tpr_train_XGB, color='darkorange',
         lw=lw, label='Train - ROC curve (area = %0.3f)' % roc_auc_train_XGB)

plt.plot(fpr_val_XGB, tpr_val_XGB, color='darkgreen',
         lw=lw, label='Val - ROC curve (area = %0.3f)' % roc_auc_val_XGB)

plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curves: train vs validation')
plt.legend(loc="lower right")
plt.show()

In [None]:
# Displays the ROC AUC for each class
avg_roc_auc = 0
i = 0
for k in roc_auc_ovr:
    avg_roc_auc += roc_auc_ovr[k]
    i += 1
    print(f"{k} ROC AUC OvR: {roc_auc_ovr[k]:.4f}")
print(f"average ROC AUC OvR: {avg_roc_auc/i:.4f}")

In [None]:
# Compares with sklearn (average only)
# "Macro" average = unweighted mean
roc_auc_score(y_val, y_proba[:,1], labels = classes, multi_class = 'ovo', average = 'macro')

### Precision/Recall curve :



In [None]:
#Precision/Recall curve :
y_train_pred = best_model_xgboost.predict(x_train)
y_train_proba = best_model_xgboost.predict_proba(x_train)

y_val_pred = best_model_xgboost.predict(x_val)
y_val_proba = best_model_xgboost.predict_proba(x_val)

In [None]:
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score

precision_train, recall_train, thresholds_train = precision_recall_curve(y_train,
                                                                         y_train_proba[:, 1])
precision_val, recall_val, thresholds_val = precision_recall_curve(y_val,
                                                                      y_val_proba[:, 1])
plt.figure()
lw = 2
plt.plot(recall_train,precision_train, color='darkorange',
         lw=lw, label='Train')

plt.plot(recall_val,precision_val, color='darkgreen',
         lw=lw, label='Val')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision / recall curves (train / val)')
plt.legend(loc="upper right")
plt.show()

In [None]:
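# Threshold choice: maximise recall to flag as many churners as possible
table_choix_seuil = pd.DataFrame()
# precision/recall have one more element than thresholds: the final point (precision=1, recall=0) has no threshold, so the padding goes at the end
table_choix_seuil["SEUIL"] = list(thresholds_train) + [1.0]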
table_choix_seuil["Precision_train"] = precision_train
table_choix_seuil["Recall_train"] = recall_train
table_choix_seuil.sort_values(by = "SEUIL", axis=0, ascending=False, inplace=True)

In [None]:
# Thresholds that keep the training recall at 95 % or more
table_choix_seuil[table_choix_seuil["Recall_train"]>=0.95].sort_values(by = "Recall_train", axis=0, ascending=True)
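
Each row of such a table pairs one threshold with the precision and recall obtained when every score greater than or equal to that threshold is predicted as churn. As a rough illustration on made-up toy values, the sketch below recomputes those pairs by hand; `precision_recall_at` is a hypothetical helper, and the convention of returning a precision of 1.0 when nothing is predicted positive is an assumption of this sketch.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention when nothing is flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (assumption, not taken from the project)
toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

# One (threshold, precision, recall) row per distinct score
toy_table = [(t, *precision_recall_at(toy_y, toy_scores, t))
             for t in sorted(set(toy_scores))]
for row in toy_table:
    print(row)
```

With the lowest score as threshold every customer is flagged, so recall is 1 and precision equals the churn rate of the sample.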

In [None]:
from tqdm import tqdm
from tqdm.notebook import tqdm_notebook
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score

best_seuil = {"seuil":0,
              "recall" : 0,
              "precision": 0,
              "tot_perte_sans_model" : 0,
              "tot_perte_avec_model": 0,
              "profit_net_sauve_grace_au_model_sur_1an" : 0}

# Goal: limit the profit lost over the next 12 months by making a retention offer
# (a monthly discount on the plan for 1 year) to the customers flagged by the model.
# Example from one run: the model recovers 94953.59 euros of profit over 1 year (for about 9800 customers).
# In other words, if the churners are kept one more year, the lost profit drops from 302486.4 euros
# (without the model) to 207532.8 euros (with the model).

# Orders of magnitude first assumed:
# - monthly plan for 1 customer = 18 euros
# - monthly profit margin per customer = 0.4 (40 %)
# - monthly cost of the offer for 1 customer (campaign + discount) = 3 euros, i.e. 12*3 euros over 1 year
# The code below derives the price and the offer cost from AVERAGE_CHARGE_6M instead.
prix_forfait_mensuel_par_client = x_train_bis.AVERAGE_CHARGE_6M.mean()/6
profit_mensuel_par_forfait_par_client_en_porucent = 0.4
profit_mensuel_par_forfait_par_client = profit_mensuel_par_forfait_par_client_en_porucent*prix_forfait_mensuel_par_client
cout_campagne_offre_par_client_par_mois = (x_train_bis.AVERAGE_CHARGE_6M.mean()/6)*0.2

for seuil in tqdm(table_choix_seuil['SEUIL']) :
    y_train_predict_seuil = (y_train_proba[:, 1]>=seuil)*1

    # Rows of the confusion matrix are observed classes, columns are predicted classes
    Confusion_matrix_train = confusion_matrix(y_train, y_train_predict_seuil)
    tn, fp, fn, tp = Confusion_matrix_train.ravel()

    # Amounts for 1 month (multiplied by 12 below for the yearly figure)
    Tot_OBS_1 = tp + fn   # all observed churners
    TP = tp               # true positives: churners flagged by the model
    Tot_PRED_1 = tp + fp  # all customers flagged as churners by the model

    tot_perte_avec_model = Tot_OBS_1*profit_mensuel_par_forfait_par_client - (TP*profit_mensuel_par_forfait_par_client - (Tot_PRED_1*cout_campagne_offre_par_client_par_mois))
    tot_perte_sans_model = Tot_OBS_1*profit_mensuel_par_forfait_par_client
    profit_net_sauve_grace_au_model_sur_1an = (tot_perte_sans_model - tot_perte_avec_model)*12

    if profit_net_sauve_grace_au_model_sur_1an > best_seuil["profit_net_sauve_grace_au_model_sur_1an"] :
        best_seuil["seuil"] = seuil
        best_seuil["recall"] = str(recall_score(y_train, y_train_predict_seuil))
        best_seuil["precision"] = str(precision_score(y_train, y_train_predict_seuil))
        best_seuil["tot_perte_sans_model"] = tot_perte_sans_model*12
        best_seuil["tot_perte_avec_model"] = tot_perte_avec_model*12
        best_seuil["profit_net_sauve_grace_au_model_sur_1an"] = profit_net_sauve_grace_au_model_sur_1an

best_seuil
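
The quantity maximised in the loop above simplifies to `TP * profit - (TP + FP) * cost` per month, because the loss on all observed churners cancels out in the difference between the two scenarios. The sketch below works through that arithmetic on hypothetical figures (a monthly profit of 6 and an offer cost of 3 per customer); `yearly_gain` is an illustrative helper, not part of the notebook. Like the loop, it assumes that every flagged churner who receives the offer stays.

```python
def yearly_gain(tp, fp, monthly_profit, monthly_offer_cost):
    """Profit recovered over 12 months by contacting every flagged customer.

    tp: churners correctly flagged (assumed to stay once they get the offer)
    fp: loyal customers flagged by mistake (they get the offer too)
    """
    saved = tp * monthly_profit               # profit kept on retained churners
    spent = (tp + fp) * monthly_offer_cost    # offer paid for every flagged customer
    return 12 * (saved - spent)


# Hypothetical figures, for illustration only
gain_precise = yearly_gain(tp=100, fp=50, monthly_profit=6.0, monthly_offer_cost=3.0)
gain_break_even = yearly_gain(tp=100, fp=100, monthly_profit=6.0, monthly_offer_cost=3.0)
print(gain_precise, gain_break_even)
```

The gain is positive only when precision, `TP / (TP + FP)`, exceeds `cost / profit`. With the ratios used in the cell above (offer cost at 0.2 and profit at 0.4 of the monthly price), that break-even precision is 0.5.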

In [None]:
# Apply the selected threshold to the training set

seuil = 0.48784318566322327
y_train_predict_seuil = (y_train_proba[:, 1]>=seuil)*1

print("Metrics on the training set: ")
print("\n Recall : " + str(recall_score(y_train, y_train_predict_seuil)))

print("\n Precision : " + str(precision_score(y_train, y_train_predict_seuil)))
Confusion_matrix_app = confusion_matrix(y_train, y_train_predict_seuil)
print(pd.DataFrame(Confusion_matrix_app))

In [None]:
# Threshold selection: trade recall against precision to get the best profit on the validation set.
# precision_recall_curve returns one more precision/recall value than thresholds:
# the final point (precision=1, recall=0) has no threshold, so it is dropped here.
table_choix_seuil_val = pd.DataFrame()
table_choix_seuil_val["SEUIL"] = thresholds_val
table_choix_seuil_val["Precision_val"] = precision_val[:-1]
table_choix_seuil_val["Recall_val"] = recall_val[:-1]
table_choix_seuil_val.sort_values(by = "SEUIL", axis=0, ascending=False, inplace=True)

In [None]:
table_choix_seuil_val

In [None]:
# Application to the validation set

# Goal: limit the profit lost over the next 12 months thanks to the model, i.e. make the gain as high as possible.
# Solution: offer a discount of 1 euro per month for 1 year to the potential churners.

def choose_best_seuil(x_df, y_df, prix_1recharge=5, pourcentage_profit=0.4, remise_pour_1churner_predit_1mois=1, table_choix_seuil=table_choix_seuil_val):
    y_proba = best_model_xgboost.predict_proba(x_df)
    
    best_seuil = {"seuil": 0,
                "recall" : 0,
                "precision": 0,
                "Profit_total_mois_prochain_sans_model_1mois" : 0,
                "Profit_total_mois_prochain_avec_model_1mois": 0,
                "Gain_de_profit_grace_model_1mois" : 0}


    # Profit gain over 1 month:
    nb_recharge_moyenne_1mois_1client = x_df["AVERAGE_CHARGE_6M"].mean()/6
    CA_1client_1mois = nb_recharge_moyenne_1mois_1client*prix_1recharge
    profit_1client_1mois = pourcentage_profit*CA_1client_1mois
    Profit_total_du_mois_actuel = len(x_df)*profit_1client_1mois # profit while the churners have not left yet
    best_seuil["Profit_total_du_mois_actuel"] = Profit_total_du_mois_actuel

    for i in tqdm(table_choix_seuil['SEUIL']) :
        seuil = i
        y_predict_seuil = (y_proba[:, 1]>=seuil)
        Confusion_matrix = confusion_matrix(y_df, y_predict_seuil)
        Confusion_matrix = pd.DataFrame(Confusion_matrix)

        # Computation (Confusion_matrix[col][row]: columns are predicted classes, rows are observed classes):
        nb_client_churner_obs = Confusion_matrix[1][1] + Confusion_matrix[0][1] # all observed churners (true positives + false negatives)
        Profit_total_mois_prochain_sans_model = Profit_total_du_mois_actuel - profit_1client_1mois*nb_client_churner_obs

        TP = Confusion_matrix[1][1] # TP = true positives
        FN = Confusion_matrix[0][1] # FN = false negatives
        Tot_PRED_1 = Confusion_matrix[1][1] + Confusion_matrix[1][0]  # all churners predicted by the model (true positives + false positives)
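        # Retained churners (TP) are already part of Profit_total_du_mois_actuel, so only the discounts and the missed churners (FN) are subtracted
        Profit_total_mois_prochain_avec_model = Profit_total_du_mois_actuel - ((Tot_PRED_1*remise_pour_1churner_predit_1mois) + (FN*profit_1client_1mois))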

        Gain_sur_perte_de_profit_grace_model_1mois = Profit_total_mois_prochain_avec_model - Profit_total_mois_prochain_sans_model  
        if Gain_sur_perte_de_profit_grace_model_1mois > best_seuil["Gain_de_profit_grace_model_1mois"] :
            best_seuil["seuil"] = seuil
            best_seuil["recall"] = str(recall_score(y_df, y_predict_seuil))
            best_seuil["precision"] = str(precision_score(y_df, y_predict_seuil))
            best_seuil["Profit_total_mois_prochain_sans_model_1mois"] = Profit_total_mois_prochain_sans_model
            best_seuil["Profit_total_mois_prochain_avec_model_1mois"] = Profit_total_mois_prochain_avec_model
            # Same key as the one initialised above and compared in the `if`, so that the best gain is actually tracked
            best_seuil["Gain_de_profit_grace_model_1mois"] = Gain_sur_perte_de_profit_grace_model_1mois

    return best_seuil

In [None]:
# The average recharge is known to be 12.5 per month
# At 1.2 euros per recharge, the average spend is estimated at 15 euros per month
# Profit = 0.4*15 = 6
best_seuil = choose_best_seuil(x_df=x_val, y_df=y_val, prix_1recharge=1.2, pourcentage_profit=0.4, remise_pour_1churner_predit_1mois=4)
best_seuil

In [None]:
seuil = best_seuil["seuil"]

y_val_predict_seuil = (y_val_proba[:, 1]>=seuil)

print("Metrics on the validation set with the best threshold: ")

print("\n Recall : " + str(recall_score(y_val, y_val_predict_seuil)))
print("\n Precision : " + str(precision_score(y_val, y_val_predict_seuil)))

Confusion_matrix_val = confusion_matrix(y_val, y_val_predict_seuil)
print(pd.DataFrame(Confusion_matrix_val))

### Lift curve :


In [None]:
# Lift curve
# Method 1: scikit-plot
import scikitplot as skplt

y_predict_proba = best_model_xgboost.predict_proba(x_val)
# plot_lift_curve creates its own figure, so the size is passed to it directly
skplt.metrics.plot_lift_curve(y_val, y_predict_proba, figsize=(7,7))
plt.show()

In [None]:
res_modele=pd.DataFrame()
res_modele["Target"]=y_val
res_modele["Proba_target"]= best_model_xgboost.predict_proba(x_val)[:,1]

res_modele.sort_values(by =["Proba_target"], inplace = True,ascending=False)

res_modele['QuantileRank']= pd.qcut(res_modele["Proba_target"], q = 10, labels = False)
res_modele.head(6)

In [None]:
agg_tmp=pd.DataFrame(res_modele.groupby('QuantileRank')['Target'].agg(['sum','count']))
agg_tmp.sort_values(by =["QuantileRank"], inplace = True,ascending=False)
print(agg_tmp)

In [None]:
agg_tmp["Precision"] = agg_tmp["sum"]/agg_tmp["count"]  # churn rate inside the decile
agg_tmp["Proba_aleatoire"]= y_val.mean()                # churn rate of a random selection
agg_tmp["lift"]=agg_tmp["Precision"]/agg_tmp["Proba_aleatoire"]
agg_tmp["Part population cible"]=agg_tmp["sum"]/y_val.sum()    # share of all churners captured
agg_tmp["Part population"]=agg_tmp["count"]/y_val.count()      # share of all customers
print(agg_tmp)
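
The lift of a bucket is its churn rate divided by the overall churn rate: a lift of 2.5 in the top bucket means that targeting it finds 2.5 times more churners than a random selection of the same size. The sketch below reproduces that computation on made-up toy values. It is a simplification: it cuts the ranked list into equal-size buckets by position rather than using `pd.qcut`, and `lift_by_bucket` is a hypothetical helper.

```python
def lift_by_bucket(y_true, scores, n_buckets=10):
    """Lift per bucket, from the highest scores (bucket 0) to the lowest."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    base_rate = sum(y_true) / n  # churn rate of a random selection
    lifts = []
    for b in range(n_buckets):
        start = b * n // n_buckets
        end = (b + 1) * n // n_buckets
        chunk = ranked[start:end]
        bucket_rate = sum(y for _, y in chunk) / len(chunk)
        lifts.append(bucket_rate / base_rate)
    return lifts


# Toy data (assumption): 10 customers, 4 churners, scores already well ordered
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
toy_y = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
toy_lifts = lift_by_bucket(toy_y, toy_scores, n_buckets=5)
print(toy_lifts)
```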

In [None]:
# Method 2: kds report
import kds
kds.metrics.report(y_val, best_model_xgboost.predict_proba(x_val)[:,1])

### Learning curve :

In [None]:
#add tool box
from sklearn.model_selection import learning_curve
from matplotlib import pyplot as plt
import numpy as np

N, train_score, val_score = learning_curve(best_model_xgboost, x_train, y_train, train_sizes= np.linspace(0.1,1,10) ,cv=10, scoring="f1") #or f1_weighted

plt.plot(N, train_score.mean(axis=1), label='train')
plt.plot(N, val_score.mean(axis=1), label='validation')
plt.xlabel('train_sizes')
plt.legend()

In [None]:
# Same learning curve, validation score only (reuses N and val_score computed in the previous cell)
plt.plot(N, val_score.mean(axis=1), label='validation')
plt.xlabel('train_sizes')
plt.legend()

<a class="anchor" id="section_5_3"></a>
<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: RGB(119, 150, 203); ">Features impact analysis</h3>
</div>

### Identification des variables les plus importantes

In [None]:
(pd.DataFrame({'Features': best_model_xgboost.feature_names_in_,
              'Features importance (in %)': (best_model_xgboost.feature_importances_)*100})).sort_values(by='Features importance (in %)', ascending=False)

### SHAP value des features

In [None]:
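# Global explanation:
# compute the SHAP values of the XGBoost model from its predict function,
# with x_train as background data and the first 5000 training rows as rows to explain
explainer = shap.Explainer(best_model_xgboost.predict, x_train)
shap_values = explainer(x_train[:5000])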

In [None]:
shap.plots.beeswarm(shap_values)

In [None]:
shap.summary_plot(shap_values, plot_type='violin', max_display=20)

In [None]:
shap.plots.bar(shap_values)

<a class="anchor" id="section_5_4"></a>
<div style="border: 1px solid RGB(119, 150, 203);" >
    <h3 style="margin: auto; padding: 20px; color: RGB(119, 150, 203); ">Model serialization and real-world deployment</h3>
</div>

In [None]:
# Save the best model (XGBoost with the tuned hyperparameters)

joblib.dump(value = best_model_xgboost, filename = '/home/jupyter/model/scoring_model.pkl')

In [None]:
#load model :
scoring_model = joblib.load(filename = '/home/jupyter/model/scoring_model.pkl')

In [None]:
#predict churn : (to review)
def predict_churn(model, feature_dict):

    df_for_pred = pd.DataFrame(feature_dict)

    # Feature engineering: recency of the last recharge, bucketed by month
    recency = df_for_pred["RECENCY_OF_LAST_RECHARGE"]
    df_for_pred["FLAG_RECHARGE_M1"] = recency.between(0, 31).astype(int)
    df_for_pred["FLAG_RECHARGE_M2"] = recency.between(32, 62).astype(int)
    df_for_pred["FLAG_RECHARGE_M3"] = recency.between(63, 92).astype(int)
    df_for_pred["FLAG_RECHARGE_PLUS_M3"] = (recency >= 93).astype(int)

    # 1 when the balance strictly increased from M3 to M2 to M1
    df_for_pred["AVERAGE_MULTIPLE_RECHARGE_M1_M2_M3"] = (
        (df_for_pred["BALANCE_M1"] > df_for_pred["BALANCE_M2"])
        & (df_for_pred["BALANCE_M2"] > df_for_pred["BALANCE_M3"])
    ).astype(int)

    # 1 when there was any incoming activity during the month
    for month in ("M1", "M2", "M3"):
        incoming = df_for_pred[f"INC_DURATION_MINS_{month}"] + df_for_pred[f"INC_PROP_SMS_CALLS_{month}"]
        df_for_pred[f"FLAG_IN_{month}"] = (incoming != 0).astype(int)

    # 1 when there was any outgoing activity during the month
    for month in ("M1", "M2", "M3"):
        outgoing = (df_for_pred[f"OUT_DURATION_MINS_{month}"] + df_for_pred[f"OUT_SMS_NO_{month}"]
                    + df_for_pred[f"OUT_INT_DURATION_MINS_{month}"] + df_for_pred[f"OUT_888_DURATION_MINS_{month}"]
                    + df_for_pred[f"OUT_VMACC_NO_CALLS_{month}"])
        df_for_pred[f"FLAG_OUT_{month}"] = (outgoing != 0).astype(int)

    # 1 when the contract is more than 730 days (2 years) old
    df_for_pred["OLD_CONTRACT"] = (df_for_pred["CONTRACT_TENURE_DAYS"] > 730).astype(int)

    # 1st feature selection :
    df_for_pred.drop(list_col_to_drop, axis=1, inplace=True)

    # Encode only cat features :
    df_for_pred = a.pre_processing(df=df_for_pred, train=False, categorical_var_OHE= list_cat_col_OHE,
                                   categorical_var_OrdinalEncoding={}, categorical_var_TE=list_cat_col_TE, target=y_train,
                                   continious_var=[], encoding_type_cont=StandardScaler())

    # 2nd feature selection RFE :
    #df_for_pred = df_for_pred[list(selector.get_feature_names_out())]
    
    # Reorder the features to match the order expected by the XGBoost model passed as argument :
    df_for_pred = df_for_pred[model.get_booster().feature_names]

    # Class probabilities of the first row, rounded to 2 decimals
    proba = [round(elem, 2) for elem in list(model.predict_proba(df_for_pred)[0])]

    return {"Churn" : (model.predict(df_for_pred))[0] ,
            "Proba 0 ": proba[0],
            "Proba 1 ": proba[1]
           } , df_for_pred


In [None]:
#example of prediction on random observation :
feature_dict = {'CUSTOMER_AGE': [43],'CONTRACT_TENURE_DAYS': [1168.0], 'AVERAGE_CHARGE_6M': [25.0], 'FAILED_RECHARGE_6M': [0],
             'AVERAGE_RECHARGE_TIME_6M': [200], 'BALANCE_M3': [30.24], 'BALANCE_M2': [77.64], 'BALANCE_M1': [98.44],
             'FIRST_RECHARGE_VALUE': [100], 'LAST_RECHARGE_VALUE': [50], 'TIME_TO_GRACE': [-20], 'TIME_TO_AFTERGRACE': [-30],
             'RECENCY_OF_LAST_RECHARGE': [10], 'TOTAL_RECHARGE_6M': [200],'NO_OF_RECHARGES_6M': [3], 'ZERO_BALANCE_IND_M2': [0],
             'ZERO_BALANCE_IND_M1': [0], 'PASS_GRACE_IND_M3': [1], 'PASS_GRACE_IND_M2': [1], 'PASS_GRACE_IND_M1': [1],
             'PASS_AFTERGRACE_IND_M3': [0], 'PASS_AFTERGRACE_IND_M2': [0], 'DATA_FLAG': [1], 'INT_FLAG': [0], 'NUM_HANDSET_USED_6M': [2],
             'INC_DURATION_MINS_M3': [37], 'INC_PROP_OPE1__MIN_M1': [0.52], 'INC_PROP_OPE2_MIN_M1': [0.32],
             'INC_PROP_OPE2_MIN_M2': [0.4],'INC_PROP_FIXED_MIN_M1': [0.16],'INC_PROP_FIXED_MIN_M3': [0.46],'OUT_DURATION_MINS_M2': [22],
             'OUT_DURATION_MINS_M3': [12],'OUT_SMS_NO_M1': [6],'OUT_SMS_NO_M2': [3],'OUT_SMS_NO_M3': [3],'OUT_INT_DURATION_MINS_M1': [0],
             'OUT_INT_DURATION_MINS_M2': [0],'OUT_888_DURATION_MINS_M1': [0],'OUT_888_DURATION_MINS_M2': [0],'OUT_888_DURATION_MINS_M3': [0],
             'OUT_VMACC_NO_CALLS_M2': [0],'OUT_VMACC_NO_CALLS_M3': [1],'OUT_PROP_SMS_CALLS_M1': [0.2],'OUT_PROP_SMS_CALLS_M2': [0.19],
             'OUT_PROP_SMS_CALLS_M3': [0.25],'OUT_PROP_OPE1__MIN_M1': [0.58],'OUT_PROP_OPE1__MIN_M2': [0],'OUT_PROP_OPE1__MIN_M3': [0],
             'OUT_PROP_OPE2_MIN_M1': [0],'OUT_PROP_FIXED_MIN_M1': [0],'OUT_PROP_FIXED_MIN_M2': [0.05],'OUT_PROP_FIXED_MIN_M3': [0],
             'INC_OUT_PROP_DUR_MIN_M1': [3.13],'INC_OUT_PROP_DUR_MIN_M2': [1.59],'INC_OUT_PROP_DUR_MIN_M3': [3.08],
             'CUSTOMER_GENDER': ["male"],'ZERO_BALANCE_IND_M3': [0],'ROAM_FLAG': [0],'INC_DURATION_MINS_M2': [0], 'INC_PROP_SMS_CALLS_M1': [0], 'INC_DURATION_MINS_M1':[0],
             'INC_PROP_OPE1__MIN_M3': [0],'INC_PROP_FIXED_MIN_M2': [0],'OUT_INT_DURATION_MINS_M3': [0],'OUT_PROP_OPE2_MIN_M2': [0],'OUT_PROP_OPE2_MIN_M3': [0],
             'INC_PROP_SMS_CALLS_M3': [1],  "OUT_DURATION_MINS_M1":[1], 'INC_PROP_SMS_CALLS_M2': [0], "marque": ["nokia"],
             'PASS_AFTERGRACE_IND_M1' : [0], 'INC_PROP_OPE1__MIN_M2': [0], 'INC_PROP_OPE2_MIN_M3': [0], 'OUT_VMACC_NO_CALLS_M1': [0]}

predict_churn(scoring_model, feature_dict)[0]

### Explication du résultat avec Shapley value

In [None]:
df_p_h = predict_churn(scoring_model, feature_dict)[1]

In [None]:
# compute the SHAP values of the XGBoost model for the single observation prepared above
explainer = shap.Explainer(best_model_xgboost.predict, x_train)
shap_values = explainer(df_p_h)

In [None]:
# Local explanation (single observation):
# Shapley values are additive: the waterfall plot shows how the feature contributions
# lead from shap_values.base_values to model.predict(X)[sample_ind]
shap.plots.waterfall(shap_values[0], max_display=10)
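
The additivity shown by the waterfall plot can be checked by hand on a tiny model. The sketch below computes exact Shapley values by averaging marginal contributions over every feature ordering. It is only an illustration under simplifying assumptions: `toy_model` is a made-up function, missing features are replaced by a single background row rather than averaged over a background dataset as `shap` does, and the enumeration is only tractable for a handful of features.

```python
from itertools import permutations


def shapley_values(predict, x, background):
    """Exact Shapley values of one row, averaging over all feature orderings.

    Features not yet "revealed" take their value from the background row.
    """
    n = len(x)
    contributions = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(background)
        previous = predict(current)
        for i in order:
            current[i] = x[i]            # reveal feature i
            value = predict(current)
            contributions[i] += value - previous
            previous = value
    return [c / len(orders) for c in contributions]


def toy_model(row):
    # Made-up model with one additive term and one interaction term
    a, b, c = row
    return 2.0 * a + b * c


x = [1.0, 2.0, 3.0]
background = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, background)
base_value = toy_model(background)

# Additivity: base value + sum of contributions equals the prediction
print(phi, base_value + sum(phi), toy_model(x))
```

The interaction term `b * c` is split evenly between the two features involved, and the contributions add up to the gap between the prediction and the base value.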

In [None]:
shap.initjs()
shap.plots.force(shap_values[0])