## 1) Introduction :
Le taux de désabonnement des clients se produit lorsque les clients ou les abonnés cessent de faire affaire avec une entreprise ou un service. Une entreprise veut savoir quels sont les clients qui vont se désabonner en examinant certains des attributs importants et en y appliquant le Machine Learning ou le Deep Learning.

## 2) Contexte du projet

Le taux de désabonnement fait référence au moment où un client met fin à sa relation avec une entreprise. Les entreprises en ligne considèrent un client désabonné une fois qu'un certain temps s'est écoulé depuis la dernière interaction du client avec le site ou le service.

Un modèle de désabonnement prédictif est un outil qui définit les étapes de l'attrition d'un client, ou le départ d'un client d'un service ou produit. En exploitant ce modèle de désabonnement en évolution, l'entreprise peut lutter pour cette rétention.


Le code doit bien être structuré, avec des explications sur le choix de l'architecture utilisée dans le modèle.

L'apprenant doit utiliser que les anciens notebooks pour réaliser ce travail (Interdit d'utiliser l'internet)


## 3) SCRIPT

### 1) Import des bibliothèques

In [7]:
import pandas as pd 
import matplotlib.pyplot as plt 
from sklearn.preprocessing import LabelEncoder,StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split,cross_val_score, GridSearchCV,cross_val_predict

from sklearn.metrics import mean_squared_error,r2_score,confusion_matrix,f1_score,precision_recall_curve,roc_curve,roc_auc_score
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.datasets import fetch_openml
import cv2

### 2) Import des données

In [21]:
df = pd.read_csv("./data.csv")
df

Unnamed: 0,num_ligne,ID_Client,Nom,Score_Credit,Pays,Sex,Age,Tenure,Balance,Num_Produit,il_a_CrCard,Membre_actif,Salaire_estime,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,Obijiaku,771,France,Male,39,5,0.00,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7,0.00,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772,Germany,Male,42,3,75075.31,2,1,0,92888.52,1


### 3) Prétraitement des données

#### nettoyage

In [22]:
df.isna().sum()

num_ligne         0
ID_Client         0
Nom               0
Score_Credit      0
Pays              0
Sex               0
Age               0
Tenure            0
Balance           0
Num_Produit       0
il_a_CrCard       0
Membre_actif      0
Salaire_estime    0
Exited            0
dtype: int64

In [23]:
df.isna().values.any()

False

In [24]:
df = df.drop(columns=["num_ligne", "ID_Client", "Nom"])

In [26]:
df.columns

Index(['Score_Credit', 'Pays', 'Sex', 'Age', 'Tenure', 'Balance',
       'Num_Produit', 'il_a_CrCard', 'Membre_actif', 'Salaire_estime',
       'Exited'],
      dtype='object')

#### encodage

In [27]:
# Récupération des index de colonnes catégorielles :
col = df.columns

col_num = df._get_numeric_data().columns
col_num

col_cat = list(set(col)-set(col_num))

# Création d'un système pour encoder plusieurs colones d'un coup et garder un dico avec les objets label encoder
# correspondant afin de pouvoir faire un inverse transform pour repasser en valeur qualitative si on le souhaite.
dico_lab = dict.fromkeys(col_cat, 0)
for col in col_cat:

    dico_lab[col] = LabelEncoder()
    dico_lab[col].fit(df[col])
    
    df[col] = dico_lab[col].transform(df[col])

print(dico_lab)
df.head(5)

{'Sex': LabelEncoder(), 'Pays': LabelEncoder()}


Unnamed: 0,Score_Credit,Pays,Sex,Age,Tenure,Balance,Num_Produit,il_a_CrCard,Membre_actif,Salaire_estime,Exited
0,619,0,0,42,2,0.0,1,1,1,101348.88,1
1,608,2,0,41,1,83807.86,1,0,1,112542.58,0
2,502,0,0,42,8,159660.8,3,1,0,113931.57,1
3,699,0,0,39,1,0.0,2,0,0,93826.63,0
4,850,2,0,43,2,125510.82,1,1,1,79084.1,0


In [28]:
# Pour décoder une colonne encodée avec ce système : 
dico_lab['Pays'].inverse_transform(df["Pays"])

array(['France', 'Spain', 'France', ..., 'France', 'Germany', 'France'],
      dtype=object)

#### Set d'apprentissage et d'essai

In [139]:
df.head()

Unnamed: 0,Score_Credit,Pays,Sex,Age,Tenure,Balance,Num_Produit,il_a_CrCard,Membre_actif,Salaire_estime,Exited
0,619,0,0,42,2,0.0,1,1,1,101348.88,1
1,608,2,0,41,1,83807.86,1,0,1,112542.58,0
2,502,0,0,42,8,159660.8,3,1,0,113931.57,1
3,699,0,0,39,1,0.0,2,0,0,93826.63,0
4,850,2,0,43,2,125510.82,1,1,1,79084.1,0


In [140]:
X = df.iloc[:,1:-1]
X.head()

Unnamed: 0,Pays,Sex,Age,Tenure,Balance,Num_Produit,il_a_CrCard,Membre_actif,Salaire_estime
0,0,0,42,2,0.0,1,1,1,101348.88
1,2,0,41,1,83807.86,1,0,1,112542.58
2,0,0,42,8,159660.8,3,1,0,113931.57
3,0,0,39,1,0.0,2,0,0,93826.63
4,2,0,43,2,125510.82,1,1,1,79084.1


In [141]:
y = df.iloc[:,-1:]
y.head()

Unnamed: 0,Exited
0,1
1,0
2,1
3,0
4,0


In [142]:
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.2,shuffle=True)

In [143]:
x_train.loc[1:1,:].shape

(1, 9)

In [74]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
import numpy as np

In [39]:
x_train2 = x_train.to_numpy()
x_test2 = x_test.to_numpy()

Standardisation des variables :

In [40]:
stand_sc = StandardScaler()
x_train2 = stand_sc.fit_transform(x_train2)
x_test2 = stand_sc.fit_transform(x_test2)

In [41]:
y_train2 = y_train.to_numpy()
y_test2 = y_test.to_numpy()

In [46]:
x_train2.shape

(8000, 9)

#### Création et entraînement du modèle de deep learning :

Les données étant unidimensionnelles et n'étant pas des images, je préfère essayer de faire un réseau de neuronne simple (ANN) plutôt qu'un CNN.

In [145]:
# Instanciation du modèle
model = Sequential()

# Couche d'entrée :
model.add(Dense(9, input_dim = 9, activation='relu'))

# Couches intermédiaires :
model.add(Dense(64, activation='relu'))
model.add(Dense(48, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dropout(.25))

# Couche de sortie :
model.add(Dense(1, activation='sigmoid'))

# Compilation du modèle
opt = Adam(learning_rate=0.0005)
model.compile(loss="binary_crossentropy",
              optimizer=opt,
              metrics=["accuracy"])

# Entraînement :
history = model.fit(x_train2,
                    y_train2,
                    epochs=50,
                    validation_split=0.2)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


Création de l'array de prédictions

In [117]:
predictions = model.predict(x_test2)
predictions

array([[0.22029078],
       [0.0279783 ],
       [0.2124083 ],
       ...,
       [0.00289929],
       [0.07093149],
       [0.01903743]], dtype=float32)

Visualisation des prédictions :

In [118]:
for i in range(0,len(y_test2)):
    print(y_test2[i])
    print(predictions[i])
    print()

[0]
[0.22029078]

[0]
[0.0279783]

[1]
[0.2124083]

[0]
[0.1622163]

[0]
[0.00926691]

[0]
[0.5242232]

[0]
[0.03311887]

[1]
[0.99846756]

[0]
[0.1679981]

[0]
[0.00929129]

[1]
[0.09682441]

[1]
[0.9955879]

[0]
[0.0439648]

[0]
[0.01694283]

[1]
[0.9989617]

[0]
[0.0141021]

[0]
[0.0501534]

[0]
[0.0255999]

[0]
[0.00464487]

[0]
[0.03024894]

[0]
[0.01560065]

[0]
[0.05784291]

[0]
[0.05530542]

[0]
[0.04540887]

[0]
[0.04310608]

[1]
[0.06158304]

[0]
[0.02396935]

[0]
[0.05303815]

[0]
[0.00377396]

[0]
[0.05220848]

[0]
[0.05255249]

[0]
[0.12574998]

[0]
[0.5272843]

[0]
[0.00132588]

[1]
[0.7478793]

[0]
[0.00173733]

[0]
[0.00667492]

[0]
[0.09713286]

[0]
[0.215523]

[0]
[0.24245065]

[0]
[0.11193794]

[0]
[0.29013675]

[0]
[0.30916375]

[0]
[0.06148648]

[0]
[0.01386613]

[0]
[0.17879373]

[0]
[0.01501605]

[0]
[0.0571577]

[0]
[0.09640545]

[1]
[0.4709997]

[0]
[0.0163438]

[0]
[0.0457544]

[0]
[0.00922087]

[0]
[0.6472752]

[0]
[0.16115957]

[1]
[0.18531492]

[0]
[0.05986

[0.94057715]

[0]
[0.04515356]

[0]
[0.05076551]

[0]
[0.05206326]

[0]
[0.0435732]

[0]
[0.12660557]

[0]
[0.50757366]

[0]
[0.18949565]

[0]
[0.2200166]

[0]
[0.13258135]

[0]
[0.07201296]

[1]
[0.9976262]

[0]
[0.10588133]

[0]
[0.15210146]

[0]
[0.04170424]

[0]
[0.1549994]

[0]
[0.01249388]

[1]
[0.2936039]

[0]
[0.02341339]

[1]
[0.1666283]

[0]
[0.18048587]

[0]
[0.91761595]

[0]
[0.0492236]

[0]
[0.22161633]

[0]
[0.06604391]

[0]
[0.07138014]

[0]
[0.08315775]

[0]
[0.05258894]

[1]
[0.9988502]

[0]
[0.2818724]

[0]
[0.13873684]

[0]
[0.04227754]

[0]
[0.26804352]

[0]
[0.28524524]

[0]
[0.49893054]

[0]
[0.01094103]

[0]
[0.34373367]

[0]
[0.05306685]

[0]
[0.21802074]

[1]
[0.76336145]

[0]
[0.03632981]

[0]
[0.00251573]

[1]
[0.02187803]

[0]
[0.03155741]

[0]
[0.01629302]

[1]
[0.11592567]

[0]
[0.03331184]

[1]
[0.8595536]

[0]
[0.06542599]

[0]
[0.06117257]

[0]
[0.13396597]

[0]
[0.0570741]

[1]
[0.46083403]

[0]
[0.0068503]

[0]
[0.05662844]

[0]
[0.11717802]

[0]
[0.0

[0]
[0.00868288]

[0]
[0.14222994]

[0]
[0.17649332]

[0]
[0.12675753]

[0]
[0.03128731]

[0]
[0.00339249]

[0]
[0.06480145]

[0]
[0.51341444]

[0]
[0.04300901]

[0]
[0.0022783]

[0]
[0.09523204]

[1]
[0.21572143]

[0]
[0.06657651]

[0]
[0.05163521]

[1]
[0.999205]

[0]
[0.19553319]

[1]
[0.20237073]

[0]
[0.02466547]

[0]
[0.02248591]

[1]
[0.17120796]

[1]
[0.16000286]

[0]
[0.02529299]

[0]
[0.01980731]

[0]
[0.07606035]

[0]
[0.17783359]

[1]
[0.03757834]

[0]
[0.2704397]

[0]
[0.00153819]

[0]
[0.02233332]

[0]
[0.04151922]

[0]
[0.37028444]

[0]
[0.25247073]

[0]
[0.06690177]

[0]
[0.01410753]

[0]
[0.00682485]

[1]
[0.03386274]

[0]
[0.06338081]

[1]
[0.6554532]

[0]
[0.04094356]

[0]
[0.05141696]

[0]
[0.01141623]

[0]
[0.28964362]

[0]
[0.11327463]

[0]
[0.04703638]

[0]
[0.05670565]

[0]
[0.21872571]

[0]
[0.08555347]

[1]
[0.6989662]

[1]
[0.52723986]

[0]
[0.13772976]

[0]
[0.10486522]

[0]
[0.05285117]

[0]
[0.01034817]

[0]
[0.18044963]

[0]
[0.09274197]

[0]
[0.12936717]


[1]
[0.0917702]

[0]
[0.3084194]

[0]
[0.01003352]

[1]
[0.0642724]

[0]
[0.04041091]

[0]
[0.07894918]

[0]
[0.2958573]

[1]
[0.93179625]

[0]
[0.01001927]

[0]
[0.11056164]

[1]
[0.92343855]

[0]
[0.06016341]

[1]
[0.05100682]

[0]
[0.02458322]

[1]
[0.9977778]

[0]
[0.0624868]

[0]
[0.1453548]

[1]
[0.36308998]

[0]
[0.01081622]

[0]
[0.08874899]

[0]
[0.44523636]

[0]
[0.28112075]

[0]
[0.00573498]

[0]
[0.30899653]

[0]
[0.04057446]

[0]
[0.16078576]

[0]
[0.42040297]

[0]
[0.00651756]

[0]
[0.01225206]

[0]
[0.4394776]

[0]
[0.08167186]

[0]
[0.17206171]

[0]
[0.12698689]

[0]
[0.06390402]

[0]
[0.5035335]

[1]
[0.80728054]

[0]
[0.05322355]

[0]
[0.1328752]

[0]
[0.01226783]

[0]
[0.02761102]

[0]
[0.04525784]

[1]
[0.99653286]

[0]
[0.0655762]

[0]
[0.01008469]

[1]
[0.34501353]

[0]
[0.06603745]

[1]
[0.08172441]

[0]
[0.06862861]

[0]
[0.06934786]

[0]
[0.28775477]

[0]
[0.3385218]

[0]
[0.06504849]

[1]
[0.96467733]

[0]
[0.07079303]

[0]
[0.03523344]

[0]
[0.00900024]

[0]

Binarisation de l'output décimal du modèle :

In [119]:
pred_bin = predictions.copy()
pred_bin[pred_bin<0.5] = 0
pred_bin[pred_bin>0.5] = 1
pred_bin

array([[0.],
       [0.],
       [0.],
       ...,
       [0.],
       [0.],
       [0.]], dtype=float32)

In [146]:
from sklearn.metrics import confusion_matrix
confusion_matrix(pred_bin, y_test2)

array([[1501,  184],
       [ 102,  213]], dtype=int64)

In [147]:
model.metrics_names

['loss', 'accuracy']

In [148]:
accuracy = model.evaluate(x_test2,y_test2)
print("loss =", accuracy[0])
print("accuracy =", accuracy[1])

loss = 0.3657515048980713
accuracy = 0.8514999747276306


Le modèle a une accuracy passable et il prédit bien les négatifs mais il prédit trop de faux positifs.

En jouant sur le seuil de la binarisation des labels prédits, on peut essayer d'améliorer vaguement les choses...

In [153]:
pred_bin = predictions.copy()
pred_bin[pred_bin<0.4] = 0
pred_bin[pred_bin>0.4] = 1
pred_bin

array([[0.],
       [0.],
       [0.],
       ...,
       [0.],
       [0.],
       [0.]], dtype=float32)

In [154]:
from sklearn.metrics import confusion_matrix
confusion_matrix(pred_bin, y_test2)

array([[1501,  184],
       [ 102,  213]], dtype=int64)

En regardant les données on voit qu'elles ne sont pas équilibrées et qu'il manque peut-être de matière pour entraîner les modèles sur les négatifs :

In [151]:
unique, counts = np.unique(y_train2, return_counts=True)
dict(zip(unique, counts))

{0: 6360, 1: 1640}

In [126]:
unique, counts = np.unique(y_test2, return_counts=True)
dict(zip(unique, counts))

{0: 1603, 1: 397}