# <b><center> Churn prediction </center></b>
# <center><i> Did you know that attracting a new customer costs <span style="color:#DC143C;">five times</span> as much as keeping an existing one?</i> </center>

  


<i align="left">Made by: Samar Rabeh / Wafaa Bousaid </i>

# <b> Table of Contents: </b>

1. [Introduction](#1)
    
    * [How can customer churn be reduced?](#3)
    * [Objectives](#4)
2. [Loading libraries and data](#5)
3. [Understanding the data](#6)
4. [Data Visualization](#7)
5. [Data Preprocessing](#8)
    * [Encoding the data](#51)
    * [Dealing with NaN values](#52)
    * [Features generation](#53)
6. [Data Modeling](#9)
   * [Random Forest](#61)
   * [LightGBM](#62)

7. [More Preprocessing and extra models](#10)
   * [Standardization](#71)
   * [KNN](#72)
   * [MLP](#73)

8. [Model Choice](#11)
9. [Recommendation on Discount](#12)
10. [Conclusion](#13)


## 1.   Introduction :


<a id = "3" ></a>
### <b>How can customer churn be reduced?</b>
<span align="justify">

<b>To reduce customer churn, telecom companies need to predict which customers are at high risk of churn.</b>

To detect early signs of potential churn, one must first develop a holistic view of customers and their interactions across channels such as store or branch visits, product purchase histories, customer service calls, web-based transactions, and social media interactions.

By addressing churn, these businesses can not only preserve their market position but also grow and thrive. The more customers they have in their network, the lower the cost of initiation and the larger the profit. The company's key focus for success is therefore reducing client attrition and implementing an effective retention strategy.
</span>
<a id="reduce"></a>


<a id = "4" ></a>
### <b>Objectives :</b>
<span align="justify">

TELCO is a telephone company facing a churn problem. They collected a dataset on their past customers and asked us to:
*  Rank their customers according to their probability of churn.
*  Identify which clients they should contact and which personalized discount to offer each of them in order to maximize the future profit of TELCO Inc (a tradeoff between churn prevention and reduced profit per customer after the discount).
</span>
<a id="objectives"></a>
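
The tradeoff in the second objective can be made concrete with a small expected-profit calculation. This is only a sketch: the function `expected_profit`, its parameters, and above all the churn probability after the offer (`p_churn_after`) are hypothetical, since the dataset does not tell us how customers react to a discount.

```python
def expected_profit(cltv, p_churn, discount=0.0, p_churn_after=None):
    """Expected future profit from one customer.

    cltv          -- customer lifetime value if the customer stays
    p_churn       -- predicted churn probability without any offer
    discount      -- fraction of the value given up (0.25 means 25%)
    p_churn_after -- HYPOTHETICAL churn probability once the offer is made
    """
    p = p_churn if p_churn_after is None else p_churn_after
    return (1.0 - p) * cltv * (1.0 - discount)


# Hypothetical customer: CLTV of 4000 and an 80% predicted churn risk.
no_offer = expected_profit(4000, p_churn=0.8)
# Assumption: a 25% discount lowers the churn risk to 30%.
with_offer = expected_profit(4000, p_churn=0.8, discount=0.25, p_churn_after=0.3)

print(no_offer, with_offer)
```

Under these assumed numbers the offer pays off; with a low initial churn risk the same discount would only reduce profit, which is why ranking customers by churn probability comes first.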

<a id = "5" ></a>
##  2. Loading libraries and data
<a id="loading"></a>

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import lightgbm as lgb
import tensorflow as tf
from tensorflow import keras

from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score, accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

In [6]:
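# Note: Colab exposes the mounted Drive under "/content/drive/MyDrive" (or "My Drive");
# files listed under "Shared with me" are only reachable after adding a shortcut to My Drive.
data_train_path = "/content/drive/Shared with me/Projet_Samar_Wafa/telco_train.csv"
train = pd.read_csv(data_train_path, delimiter=';')
data_test_path = "/content/drive/Shared with me/Projet_Samar_Wafa/telco_test.csv"
test = pd.read_csv(data_test_path, delimiter=';')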

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/Shared with me/Projet_Samar_Wafa/telco_train.csv'

<a id = "6" ></a>
##  3. Understanding the data
<a id = "Undertanding the data" ></a>

In [None]:
# Check for duplicate customer IDs (a result of 0 means there are none)
print(train.CustomerID.nunique() - train.shape[0])
test.CustomerID.nunique() - test.shape[0]

In [None]:
test_customer_id = test.CustomerID
train.drop("CustomerID", axis=1, inplace=True)
test.drop("CustomerID", axis=1, inplace=True)

In [None]:
train.head()

In [None]:
train.describe()

In [None]:
train.nunique()

In [None]:
train.info()

In [None]:
train.isna().sum()

We notice that some data are missing; we'll deal with them later on.

In [None]:
test.head()

In [None]:
train['Churn Label'].value_counts()

Imbalanced class distributions can hurt the performance of a machine learning model, and upsampling or downsampling could be used to overcome this. In our case the data is not that unbalanced, since a considerable number of churned customers already exists, so we won't apply any such mitigation.
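
To put a number on "not that unbalanced", the class shares can be computed directly. The helper below is a minimal sketch on toy labels (not the TELCO data); `class_shares` and `toy_labels` are names introduced here for illustration.

```python
from collections import Counter


def class_shares(labels):
    """Return the share of each class label as a fraction of all samples."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels, not the real TELCO data: 3 churners out of 10 customers.
toy_labels = ["No"] * 7 + ["Yes"] * 3
shares = class_shares(toy_labels)
print(shares)
```

On the notebook's data, `train['Churn Label'].value_counts(normalize=True)` gives the same kind of summary.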

<a id = "7" ></a>
## 4. Data Visualization :
In this section we'll split our visualizations into two parts, one for binary categorical data and the other for multiclass categorical and numerical data.
<a id = "datavisualization" ></a>

In [None]:
# Binary Categorical Features
binary_cols = [col for col in train.columns if train[col].value_counts().shape[0] == 2]
binary_cols

In [None]:
fig, axes = plt.subplots(1, 8, figsize=(12, 5), sharey=True)
fig.suptitle("Train Dataset")
for i, col in enumerate(binary_cols):
  train.groupby(col).size().plot(ax= axes[i], kind='bar', color=sns.palettes.mpl_palette('Set1'))

In [None]:
# Binary Categorical Features
binary_cols = [col for col in test.columns if test[col].value_counts().shape[0] == 2]
binary_cols

In [None]:
fig, axes = plt.subplots(1, 5, figsize=(10, 5), sharey=True)
fig.suptitle("Test Dataset")
for i, col in enumerate(binary_cols[:-1]):
  test.groupby(col).size().plot(ax= axes[i], kind='bar', color=sns.palettes.mpl_palette('Set1'))

In [None]:
total = train['Churn Label'].value_counts().values.sum()
_ = plt.pie(train['Churn Label'].value_counts().values, labels=train['Churn Label'].value_counts().index,
            autopct=lambda x:'{:.1f}%\n{:.0f}'.format(x, total*x/100), colors=sns.palettes.mpl_palette('Set1'))

* To begin with, the class imbalance can be tolerated, since a considerable percentage of clients have churned.

* Train and test have roughly the same distribution, which is a good sign: there is no disparity to compensate for.
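
The visual comparison of train and test can be backed by a simple number: the largest difference in category share between the two sets. This is a sketch on toy values; `max_share_gap` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def max_share_gap(train_values, test_values):
    """Largest absolute difference in category share between two samples."""
    train_counts, test_counts = Counter(train_values), Counter(test_values)
    n_train, n_test = len(train_values), len(test_values)
    categories = set(train_counts) | set(test_counts)
    return max(
        abs(train_counts[c] / n_train - test_counts[c] / n_test)
        for c in categories
    )


# Toy columns: 30% "Yes" in train, 28% "Yes" in test.
toy_train = ["Yes"] * 30 + ["No"] * 70
toy_test = ["Yes"] * 14 + ["No"] * 36
gap = max_share_gap(toy_train, toy_test)
print(round(gap, 3))
```

A gap of a few percentage points, as in this toy case, is the kind of result that would support treating the two sets as similarly distributed.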

<a id = "51" ></a>
### 5.1. Encoding the data :

In [None]:

# Map Yes/No answers to 1/0; "No phone service" and "No internet service" count as 0
yes_no = {'Yes': 1, 'No': 0, 'No phone service': 0, 'No internet service': 0}
binary_service_cols = [
    "Senior Citizen", "Partner", "Dependents", "Paperless Billing", "Phone Service",
    "Multiple Lines", "Online Security", "Online Backup", "Device Protection",
    "Tech Support", "Streaming TV", "Streaming Movies",
]

# Assign the result back instead of calling replace(..., inplace=True) on a
# column selection, which recent pandas versions no longer apply to the frame
for frame in (train, test):
    for col in binary_service_cols:
        frame[col] = frame[col].replace(yes_no)

In [None]:

valeurs_distinctes_liste = train["Streaming Movies"].unique().tolist()
print("Distinct values in the column:", valeurs_distinctes_liste)

In [None]:
print(train["Streaming Movies"].dtype)

In [None]:
# Count how many customers hold each contract type
count = train['Contract'].value_counts()
print("Number of customers per contract type:\n", count)


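Here we encode the remaining categorical columns as integers: Contract and Gender receive integer codes, and Churn Score is binned into three levels of churn consideration. The more likely the client is to change plan, the higher the level.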

In [None]:
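# Integer codes for the contract type (labels only: the codes do not follow contract length)
usage_lvl_dict = {
    "Month-to-month": 0,
    "Two year": 1,
    "One year": 2,
}
train["Contract"] = train["Contract"].replace(usage_lvl_dict)
test["Contract"] = test["Contract"].replace(usage_lvl_dict)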

In [None]:
# Integer codes for gender
usage_lvl_dict = {
    "Female": 0,
    "Male": 1,
}

train["Gender"] = train["Gender"].replace(usage_lvl_dict)
test["Gender"] = test["Gender"].replace(usage_lvl_dict)

In [None]:
# Define the churn_consideration dictionary: level -> range of Churn Score values
churn_consideration = {
    0: range(0, 40),    # scores from 0 to 39 inclusive
    1: range(40, 70),   # scores from 40 to 69 inclusive
    2: range(70, 101)   # scores from 70 to 100 inclusive
}

# Replace each score with its level (None if the score falls outside 0-100)
train["Churn Score"] = train["Churn Score"].apply(lambda x: next((key for key, value in churn_consideration.items() if x in value), None))
test["Churn Score"] = test["Churn Score"].apply(lambda x: next((key for key, value in churn_consideration.items() if x in value), None))
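
The same three-level binning can also be expressed with bin edges rather than a range lookup. The sketch below uses `np.digitize` on toy scores (not the TELCO data); it is an alternative formulation, not the notebook's method, and it differs in one respect: scores above 100 fall into level 2 instead of becoming `None`.

```python
import numpy as np

# Toy scores, not taken from the TELCO data.
scores = np.array([0, 39, 40, 69, 70, 100])

# Bin edges: below 40 -> 0, from 40 to 69 -> 1, 70 and above -> 2.
levels = np.digitize(scores, bins=[40, 70])
print(levels.tolist())
```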


* For speed and simplicity, we convert the Data column to int, since fractions of a MB have no effect on real-world pricing and no considerable effect on the results.

In [None]:
plt.figure(figsize=(10, 10))
ax = sns.heatmap(
    train.corr(numeric_only=True),  # skip the text columns still present in the frame
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
)

In [None]:
plt.figure(figsize=(10, 10))
ax = sns.heatmap(
    test.corr(numeric_only=True),  # skip the text columns still present in the frame
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
)

<a id = "8" ></a>
## 5. Data Preprocessing :
In this section we'll fix our data anomalies and generate important features that might help the modeling phase later on.
<a id = "datapreprocessing" ></a>

<a id = "53" ></a>
### 5.3. Features generation :

In [None]:
train.groupby(["Churn Value"]).mean(numeric_only=True)

In [None]:
plt.figure(figsize=(10, 10))
ax = sns.heatmap(
    train.corr(numeric_only=True),  # skip the text columns still present in the frame
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
)

<a id = "9" ></a>
## 6. Data Modeling :
In this section we'll work on different models and finetune them in order to get the best results.
<a id = "datamodeling" ></a>

In [None]:
# Drop the columns that are not used as features: payment, location and
# internet-service fields, plus the churn reason and label (dropped from train only).
unused_cols = ["Payment Method", "Country", "State", "City", "Internet Service",
               "Lat Long", "Latitude", "Longitude"]
train.drop(columns=unused_cols + ["Churn Reason", "Churn Label"], inplace=True)
test.drop(columns=unused_cols, inplace=True)

In [None]:
from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder instance
label_encoder = LabelEncoder()

# Convert the "Multiple Lines" column to integer codes (train only)
train['Multiple Lines'] = label_encoder.fit_transform(train['Multiple Lines'])

In [None]:
for frame in (train, test):
    # Replace decimal commas with points in the "Monthly Charges" column
    frame["Monthly Charges"] = frame["Monthly Charges"].str.replace(',', '.')
    # Convert the column to float, then truncate the values to integers
    frame["Monthly Charges"] = frame["Monthly Charges"].astype(float).astype(int)
    # Check the new data type
    print(frame["Monthly Charges"].dtype)

In [None]:

for frame in (train, test):
    # Replace decimal commas with points in the "Total Charges" column
    frame["Total Charges"] = frame["Total Charges"].str.replace(',', '.')
    # Fill blank values with the product of "Tenure Months" and "Monthly Charges"
    frame["Total Charges"] = np.where(frame["Total Charges"] == ' ',
                                      frame["Tenure Months"] * frame["Monthly Charges"],
                                      frame["Total Charges"])
    # Convert the column to float, then truncate the values to integers
    frame["Total Charges"] = frame["Total Charges"].astype(float).astype(int)
    # Check the new data type
    print(frame["Total Charges"].dtype)


In [None]:
y = train["Churn Value"]
X = train.drop("Churn Value", axis=1)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=2023)

<a id = "61" ></a>
### 6.1 Random Forest :

In [None]:
# The grid search takes around 10 minutes to run; random_state is fixed for reproducibility
rfc = RandomForestClassifier(random_state=2023)

param_grid = {
    'n_estimators': [200, 300, 500],
    'max_depth' : [4,5,6,7,8],
    'criterion' :['gini', 'entropy']
}

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)
CV_rfc.fit(x_train, y_train)

In [None]:
CV_rfc.best_params_

In [None]:
rfc1 = RandomForestClassifier(random_state=2023, n_estimators=200, max_depth=8, criterion='gini')
rfc1.fit(x_train, y_train)

In [None]:
# plot feature importance
feature_importances, feature_names= zip(*sorted(zip(rfc1.feature_importances_, x_train.columns), reverse=True))
plt.figure(figsize=(7,7))
plt.bar(feature_names, feature_importances)
plt.xticks(rotation=90)
plt.show()

In [None]:
pred=rfc1.predict(x_test)

In [None]:
print(classification_report(y_test,pred))

In [None]:
y_pred_prob = rfc1.predict_proba(x_test)[:, 1]
y_pred_prob

In [None]:
# Compute the false positive rate (FPR)
# and true positive rate (TPR) for different classification thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob, pos_label=1)

#Compute the ROC AUC score
roc_auc = roc_auc_score(y_test, y_pred_prob)
roc_auc
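
ROC AUC suits the ranking objective because it can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it that way on toy values; `pairwise_auc` is a name introduced here for illustration and is far slower than `roc_auc_score` on real data.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (churner, non-churner) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy labels and scores, not taken from the notebook's models.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

Three of the four churner/non-churner pairs are ordered correctly, hence 0.75.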

In [None]:
# Plot the ROC curve
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
# roc curve for tpr = fpr
plt.plot([0, 1], [0, 1], 'k--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()

<a id = "62" ></a>
### 6.2 LightGBM :

In [None]:
clf = lgb.LGBMClassifier()
clf.fit(x_train, y_train)

In [None]:
# predict the results
y_pred_lgbm = clf.predict_proba(x_test)[:, 1]

# Compute the false positive rate (FPR)
# and true positive rate (TPR) for different classification thresholds
fpr_lgbm, tpr_lgbm, thresholds_lgbm = roc_curve(y_test, y_pred_lgbm, pos_label=1)

#Compute the ROC AUC score
roc_auc_lgbm = roc_auc_score(y_test, y_pred_lgbm)
roc_auc_lgbm

In [None]:
# Plot the ROC curve
plt.plot(fpr, tpr, label='ROC curve RF(area = %0.2f)' % roc_auc)
plt.plot(fpr_lgbm, tpr_lgbm, label='ROC curve Lgbm (area = %0.2f)' % roc_auc_lgbm)
# roc curve for tpr = fpr
plt.plot([0, 1], [0, 1], 'k--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()

<a id = "10" ></a>
## 7. More preprocessing and extra models :

<a id = "71" ></a>
### 7.1 Standardization:
We tackle standardization only at this point, because tree-based models have no issue handling non-normalized data, and because scaling the whole data set before train_test_split would have caused data leakage.
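
Avoiding leakage here means that the mean and standard deviation are estimated on the training split only and then reused for the test split, which is what `fit_transform` on the training data followed by `transform` on the test data does. A minimal NumPy sketch on toy matrices (the arrays and their sizes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy feature matrices standing in for x_train and x_test.
toy_train = rng.normal(loc=50.0, scale=10.0, size=(80, 3))
toy_test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# The statistics come from the training rows only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)

train_scaled = (toy_train - mean) / std
test_scaled = (toy_test - mean) / std  # reuses the training statistics

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The scaled training columns have zero mean and unit variance by construction, while the test columns are only approximately standardized, which is expected.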

In [None]:
scaler = StandardScaler()
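# Rebuild the frames rather than writing floats into the integer columns through iloc
x_train = pd.DataFrame(scaler.fit_transform(x_train), columns=x_train.columns, index=x_train.index)
x_test = pd.DataFrame(scaler.transform(x_test), columns=x_test.columns, index=x_test.index)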

In [None]:
df = x_train.copy()
df["CHURNED"] = y_train
df

In [None]:
pca = PCA(n_components=3)
pca_result = pca.fit_transform(df)

df['pca-one'] = pca_result[:,0]
df['pca-two'] = pca_result[:,1]
df['pca-three'] = pca_result[:,2]

print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))

With only 3 principal components we preserve around 35% of the data variance. These components are added to `df`, which is then used below to compute the t-SNE transformation and try to separate the churners visually.
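
The explained variance ratio reported by PCA can be reproduced from the singular values of the centered data. The sketch below does this with NumPy on toy data (the matrix and its column scales are assumptions, so the numbers differ from the 35% above):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples, 6 features with decreasing spread.
toy = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 1.0, 0.5])

centered = toy - toy.mean(axis=0)
# The squared singular values of the centered data are proportional
# to the variance captured by each principal component.
singular_values = np.linalg.svd(centered, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)

# Share of the variance kept by the first 3 components.
kept = explained_ratio[:3].sum()
print(explained_ratio.round(3), round(float(kept), 3))
```

Summing the first three entries gives the share of variance retained, which is the quantity quoted in the text.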

In [None]:
tsne = TSNE(n_components=3, verbose=1, perplexity=40, n_iter=300)
tsne_results = tsne.fit_transform(df)
df['tsne-2d-one'] = tsne_results[:,0]
df['tsne-2d-two'] = tsne_results[:,1]

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 5), sharey=True)
sns.scatterplot(
    x="tsne-2d-one", y="tsne-2d-two",
    hue="CHURNED",
    palette=sns.color_palette("hls", 10),
    data=df,
    legend="full",
    alpha=0.3,
    ax=axes[0]
)

sns.scatterplot(
    x="pca-one", y="pca-two",
    hue="CHURNED",
    palette=sns.color_palette("hls", 10),
    data=df,
    legend="full",
    alpha=0.3,
    ax=axes[1]
)

axes[0].set_title("T-sne Reduced")
axes[1].set_title("PCA Reduced")

The plots above clearly show the power of t-SNE: unlike PCA, which is a linear projection, t-SNE is a probabilistic, non-linear technique.
We can see that the lower part is mostly composed of churned customers, whereas the upper part is made up of customers who won't leave. This is still far from a perfect separation, since information is lost in the reduction; we did this only to visualize that a separation is possible. Note that `df` includes the CHURNED column itself, so the label contributes to both embeddings.

<a id = "72" ></a>
### 7.2 KNN:

In [None]:
clf = KNeighborsClassifier(50)

clf = clf.fit(x_train, y_train)

# Check that the score holds up under 5-fold cross-validation
cv_scores = cross_val_score(clf, x_train, y_train, cv=5, scoring="roc_auc")
print('cv_scores mean:{}'.format(np.mean(cv_scores)))

#Predict the response for test dataset
y_pred_knn = clf.predict_proba(x_test)[:, 1]

# Compute the false positive rate (FPR)
# and true positive rate (TPR) for different classification thresholds
fpr_knn, tpr_knn, thresholds_knn = roc_curve(y_test, y_pred_knn, pos_label=1)

#Compute the ROC AUC score
roc_auc_knn = roc_auc_score(y_test, y_pred_knn)

In [None]:
# Plot the ROC curve
plt.plot(fpr, tpr, label='ROC curve RF(area = %0.2f)' % roc_auc)
plt.plot(fpr_lgbm, tpr_lgbm, label='ROC curve Lgbm (area = %0.2f)' % roc_auc_lgbm)
plt.plot(fpr_knn, tpr_knn, label='ROC curve KNN (area = %0.2f)' % roc_auc_knn)
# roc curve for tpr = fpr
plt.plot([0, 1], [0, 1], 'k--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()

<a id = "73" ></a>
### 7.3 MLP:

In [None]:
print(x_train.shape)

In [None]:
# define sequential model
model = keras.Sequential([
    # input layer
    keras.layers.Dense(16, input_shape=(21,), activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(25, activation='relu'),
    keras.layers.Dense(15,activation = 'relu'),
    # we use sigmoid for binary output
    # output layer
    keras.layers.Dense(1, activation='sigmoid')
]
)

In [None]:
# time for compilation of neural net.
model.compile(optimizer = 'adam',
             loss = 'binary_crossentropy',
             metrics = [tf.keras.metrics.AUC()])

callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=10)
# now we fit our model to training data
model.fit(x_train, y_train, epochs=100, callbacks=[callback])

In [None]:
model.evaluate(x_test, y_test)

In [None]:
y_pred_proba = model.predict(x_test)

In [None]:
# Compute the false positive rate (FPR)
# and true positive rate (TPR) for different classification thresholds
fpr_mlp, tpr_mlp, thresholds_mlp = roc_curve(y_test, y_pred_proba, pos_label=1)

#Compute the ROC AUC score
roc_auc_mlp = roc_auc_score(y_test, y_pred_proba)

In [None]:
# Plot the ROC curve
plt.plot(fpr, tpr, label='ROC curve RF(area = %0.2f)' % roc_auc)
plt.plot(fpr_lgbm, tpr_lgbm, label='ROC curve Lgbm (area = %0.2f)' % roc_auc_lgbm)
plt.plot(fpr_knn, tpr_knn, label='ROC curve KNN (area = %0.2f)' % roc_auc_knn)
plt.plot(fpr_mlp, tpr_mlp, label='ROC curve MLP (area = %0.2f)' % roc_auc_mlp)
# roc curve for tpr = fpr
plt.plot([0, 1], [0, 1], 'k--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()

<a id = "11" ></a>
## 8. Model Choice:
In this section, we move forward with LightGBM, since it gave the best results.

In [None]:
clf = lgb.LGBMClassifier()
clf.fit(X, y)

# predict the results
y_pred_lgbm = clf.predict_proba(test)[:, 1]

preds = clf.predict(test)
preds

In [None]:
test['CHURN_PROBABILITY'] = y_pred_lgbm
test['CHURN_LABEL'] = preds

<a id = "12" ></a>
## 9. Recommendation on Discount :
The approach we'll use when proposing a discount is as follows:
<br>
* For each client, search for N neighbors and calculate the mean of their CLTV (the higher the value, the more valuable the client).
* If the client in question has a higher CLTV than their neighbors, we propose a discount; otherwise we don't.
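
The two steps above can be sketched with a brute-force neighbor search in NumPy. This is an illustration on toy clients, not the notebook's implementation; `neighbor_mean_cltv` and the toy arrays are names and values assumed here.

```python
import numpy as np


def neighbor_mean_cltv(features, cltv, n_neighbors):
    """Mean CLTV over each client's nearest neighbors (Euclidean distance).

    Because the same rows are used to build and to query the neighborhood,
    each client is its own closest neighbor and is included in the mean.
    """
    diffs = features[:, None, :] - features[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    nearest = np.argsort(distances, axis=1)[:, :n_neighbors]
    return cltv[nearest].mean(axis=1)


# Toy clients: one feature, two well-separated groups.
toy_features = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
toy_cltv = np.array([100.0, 200.0, 300.0, 1000.0, 2000.0, 3000.0])

mean_cltv = neighbor_mean_cltv(toy_features, toy_cltv, n_neighbors=3)
offer_discount = toy_cltv > mean_cltv
print(mean_cltv, offer_discount)
```

In each toy group only the client with the highest CLTV exceeds the neighborhood mean, so only that client would be considered for a discount.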

In [None]:
nbrs = NearestNeighbors(n_neighbors=20, algorithm='auto').fit(test.drop(columns=['CHURN_LABEL']))
distances, indices = nbrs.kneighbors(test.drop(columns=['CHURN_LABEL']))

* We tried KMeans, DBSCAN and unsupervised NearestNeighbors. The latter seems to work best.
* KMeans and DBSCAN clustering gave poor results: KMeans had a high inertia, and DBSCAN labeled all samples as noise (-1).

In [None]:
print("validation Monthly Charges mean :",test["Monthly Charges"].mean())
print("validation CLTV mean :",test.CLTV.mean())

In [None]:
# Mean CLTV over each client's 20 nearest neighbors; `indices` holds positional row indices.
# A single vectorized assignment replaces the row-by-row chained assignment.
test['MEAN_CLTV'] = test['CLTV'].to_numpy()[indices].mean(axis=1)

In [None]:
test.head()

In [None]:
np.max((test['CLTV'])/test['MEAN_CLTV'])

In [None]:
np.min((test['CLTV'])/test['MEAN_CLTV'])

In [None]:
test['DISCOUNT'] = np.where((test['CHURN_LABEL']==1) & ((test['CLTV'] - test['MEAN_CLTV']) > 0),
                             (test['CLTV']-10)/(test['MEAN_CLTV'])*0.25, np.nan)

* Max discount will be 67% and min discount will be 5% with an average 25% discount, which seems fair.
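
For a single client, the discount rule from the cell above reads as follows. This is a sketch on toy values; the function name `proposed_discount` and the parameter names `base_rate` and `offset` are introduced here for the constants 0.25 and 10 used in the notebook.

```python
def proposed_discount(churn_label, cltv, mean_cltv, base_rate=0.25, offset=10):
    """Discount fraction for one client, or None when no offer is made.

    Only predicted churners whose CLTV exceeds the mean CLTV of their
    neighbors receive an offer.
    """
    if churn_label == 1 and cltv > mean_cltv:
        return (cltv - offset) / mean_cltv * base_rate
    return None


# Toy values, not taken from the TELCO data.
print(proposed_discount(1, 5010, 4000))  # 0.3125
print(proposed_discount(1, 3000, 4000))  # None (CLTV below the neighbors' mean)
print(proposed_discount(0, 5010, 4000))  # None (not predicted to churn)
```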

In [None]:
test.DISCOUNT.hist()

In [None]:
test['CLIENT_TO_CONTACT'] = np.where(test['DISCOUNT'].isnull(),'NO','YES')
test['CHURN_LABEL'] = test['CHURN_LABEL'].map({1 : "LEAVE", 0: "STAY"})

In [None]:
test

In [None]:
df = pd.concat([test_customer_id, test[["CHURN_PROBABILITY", "CHURN_LABEL", "CLIENT_TO_CONTACT", "DISCOUNT"]]], axis=1)
df.head()

In [None]:
df.to_csv('Result_samar.csv')

In [None]:
import os

# Full path to the location where the CSV file should be saved
chemin_complet = "/content/drive/My Drive/Projet_Samar_Wafa/Result_samar.csv"

# Save the DataFrame to the CSV file using the full path
df.to_csv(chemin_complet)


# Performance

In [None]:

# Load the predictions
df_predictions = pd.read_csv('/content/drive/My Drive/Projet_Samar_Wafa/Result_.csv')

# Load the true churn values
df_true_values = pd.read_excel('/content/drive/My Drive/Projet_Samar_Wafa/base_test2.xlsx', engine='openpyxl')

# Convert the labels to numeric values
df_predictions['CHURN_LABEL'] = df_predictions['CHURN_LABEL'].map({'LEAVE': 1, 'STAY': 0})


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Extract the numeric predictions and the true values
predictions_numeric = df_predictions['CHURN_LABEL'].values
true_values = df_true_values['Churn Value'].values

# Compute the metrics
accuracy = accuracy_score(true_values, predictions_numeric)
precision = precision_score(true_values, predictions_numeric)
recall = recall_score(true_values, predictions_numeric)
f1 = f1_score(true_values, predictions_numeric)
# Note: this AUC is computed from the hard 0/1 labels, not from CHURN_PROBABILITY
auc = roc_auc_score(true_values, predictions_numeric)

# Display the results
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"AUC: {auc}")


Accuracy: 0.813
Interpretation: About 81.3% of the model's predictions are correct. This is a good starting point, but accuracy alone does not give the full picture, especially if the data is imbalanced (that is, if one class is much more frequent than the other).

Precision: 0.942
Interpretation: When the model predicts that a customer will leave (churn), it is right 94.2% of the time. This is a fairly high value, which means the model is reliable in its positive (churn) predictions.

Recall: 0.733
Interpretation: Of all the customers who actually left, the model identified 73.3%. Although lower than the precision, a recall above 70% is considered good in many contexts, but it may also indicate room for improvement, especially if identifying every churn case is critical.

F1 Score: 0.824
Interpretation: The F1 score is the harmonic mean of precision and recall, giving a single measure of both. An F1 score of 82.4% is fairly good, indicating a reasonable balance between precision and recall.

AUC (Area Under the ROC Curve): 0.833
Interpretation: The AUC measures the model's ability to distinguish between the classes. An AUC of 83.3% is considered very good, indicating that the model discriminates well between customers who leave and those who stay.
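
As a quick consistency check, the F1 score can be recomputed from the reported precision and recall, since it is their harmonic mean. The helper name `f1_from` is introduced here for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported above.
print(round(f1_from(0.942, 0.733), 3))  # 0.824
```

The result matches the reported F1 score of 0.824.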


<a id = "13" ></a>
## 10. Conclusion :

* This project encompasses two primary challenges: first, addressing the underrepresented class ('LEAVE'), and second, devising an approach to compute discounts. To tackle the first challenge, a LightGBM model was employed, striking a favorable balance between precision and recall for the positive class. Managing such issues invariably involves a trade-off between precision and recall, particularly for the smaller class. Selecting the appropriate evaluation metric is therefore pivotal in such scenarios, with F-scores and ROC AUC commonly employed for imbalanced classification.
* Paths for further exploration include a voting classifier that combines the strengths of all the well-performing models. We could also generate synthetic data, using SMOTE for example, to better balance our data distribution.
* The second challenge involved clustering clients and extending discounts to those with a high probability of churn, especially those with higher telecom expenses compared to their peers. The optimal resolution to this issue lies in incorporating business-side logic by collaborating with sales professionals, emphasizing the need for more than just modeling.
<br> <br>
<b>
Overall, this project provided an invaluable opportunity to bring different sets of skills together in one project, as well as to acquire new ones for addressing churn problems in the telecommunication industry.
</b>