 # Telco Churn Prediction

**Objective** : Predict churn characteristics to retain customers. 

After a quick exploration of Telco's customer data, we will implement Machine Learning models to help the company to identify customers at risk of churn. This customer classification will allow the company to implement actions to try to keep these customers.

This notebook is my first contribution on Kaggle. I'm open to any kind of feedback that helps me improve my work and skills!

# 1. Libraries and data importation

In [None]:
#Basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#Preprocessing
from sklearn.preprocessing import RobustScaler

#Machine Learning
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif

#Metrics
from sklearn.metrics import f1_score, recall_score, confusion_matrix, classification_report, precision_recall_curve
from sklearn.model_selection import learning_curve
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings('ignore')

In [None]:
#data importation
data = pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')

# 2. Quick data exploration and cleaning

In [None]:
#quick data visualization
data.head()

Each row represents a customer; each column contains one of the customer's attributes, as described in the dataset's column metadata.

In [None]:
data.shape

The dataset gathers data from 7043 customers described by 21 attributes.

### Let's have a look at the 21 attributes

**Demographic info about customers**
* gender : the customer is a male or a female
* SeniorCitizen : the customer is a senior citizen (1 if yes, 0 if not)
* Partner : the customer has a partner (Yes or No)
* Dependents : the client has dependents (Yes or No)
* tenure :  number of months a customer has had an account
    
    
**Services that each customer has signed up for**
* PhoneService (Yes or No)
* MultipleLines  (Yes, No or No phone service)
* InternetService (DSL, Fiber optic or No)
* OnlineSecurity (Yes, No or No internet Service)
* OnlineBackup (Yes, No or No internet Service)
* DeviceProtection (Yes, No or No internet Service)
* TechSupport (Yes, No or No internet Service)
* StreamingTV (Yes, No or No internet Service)
* StreamingMovies (Yes, No or No internet Service)


**Customer account information**
* customerID : unique identification number given to each customer
* Contract : contract renewal (One year, Two year or Month-to-month)
* PaperlessBilling : online billing (Yes or No)
* PaymentMethod : payment method (Credit card (automatic), Electronic check, Bank transfer (automatic) or Mailed check)
* MonthlyCharges : from 18.25 to 118.75
* TotalCharges : from 0 to 8884.80


**Target**
* Churn : customers who left within the last month (Yes or No)

In [None]:
data.dtypes

We can notice some problems: 
* The SeniorCitizen attribute is treated as a numerical variable although it is a categorical variable.
* The TotalCharges attribute is treated as an object although it is a numerical variable.

In [None]:
#Convert SeniorCitizen to object
data['SeniorCitizen'] = data['SeniorCitizen'].apply(str)

#convert TotalCharges to float
data['TotalCharges'] = data['TotalCharges'].replace({" ":'0'})
data['TotalCharges'] = data['TotalCharges'].astype(float)
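
The blank entries that force `TotalCharges` to the object dtype can also be counted explicitly before they are replaced. Here is a minimal sketch on a toy Series (the values are made up for illustration, not taken from the dataset), assuming `pd.to_numeric` with `errors='coerce'` is an acceptable alternative to the `replace` call:

```python
import pandas as pd

# Toy stand-in for the TotalCharges column (illustrative values only)
raw = pd.Series(["29.85", " ", "108.15", " "])

# Entries that cannot be parsed, such as " ", become NaN instead of raising
as_number = pd.to_numeric(raw, errors="coerce")

# Count the blanks, then apply the same choice as in the cell above: blanks become 0
n_blank = int(as_number.isna().sum())
cleaned = as_number.fillna(0.0)

print(n_blank)
print(cleaned.tolist())
```

Counting the blanks first makes it easier to judge whether replacing them with 0 is harmless or whether dropping those rows would be safer.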

In [None]:
#Let's delete useless features
data = data.drop('customerID', axis=1)

### Separate quantitative and qualitative features

In [None]:
# Quantitative columns
numeric_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
# Qualitative columns
nominal_features = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']

### Quick exploration on numeric features

In [None]:
#Description of numerical variables
data.describe()

In [None]:
for col in numeric_features:
    plt.figure()
    sns.histplot(data[col], kde=True)   # histplot replaces the deprecated distplot

We can notice that none of the quantitative variables are normally distributed.

### Quick exploration on the nominal features + target

In [None]:
for col in data.select_dtypes('object'):
    plt.figure()
    data[col].value_counts().plot.pie()

**Demographic info**
* Gender : balanced distribution
* SeniorCitizen : unbalanced variable with only 20% of SeniorCitizen
* Partner : balanced distribution

**Services** 
* MultipleLines : balanced distribution between Yes and No. A minority of customers don't have a phone service
* InternetService : balanced distribution between DSL, Fiber optic and no
* OnlineSecurity : balanced distribution between Yes, No and no internet service (but we can notice a majority of No)
* OnlineBackup : balanced distribution between Yes, No and no internet service
* DeviceProtection : balanced distribution between Yes, No and no internet service (but we can notice a majority of No)
* TechSupport : balanced distribution between Yes, No and no internet service (but we can notice a majority of No)
* StreamingTV : balanced distribution between Yes, No and no internet service
* StreamingMovies : balanced distribution between Yes, No and no internet service

Even if these variables are equally distributed, we can note that the answer "No" and the answer "no internet/phone service" mean the same thing => The customer did not subscribe to the service. So there is a minority of customers subscribed for each service.
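
The merging of "No" with "No internet service" / "No phone service" can be sketched as follows. This is a toy illustration with made-up rows, not the dataset itself; the `toy` frame and `share_subscribed` name are hypothetical.

```python
import pandas as pd

# Toy column mimicking one of the service attributes (made-up rows)
toy = pd.DataFrame({
    "OnlineSecurity": ["Yes", "No", "No internet service",
                       "No", "Yes", "No internet service"]
})

# Treat "no service at all" the same way as "did not subscribe"
merged = toy["OnlineSecurity"].replace({
    "No internet service": "No",
    "No phone service": "No",
})

# Share of customers who actually subscribed to the service
share_subscribed = (merged == "Yes").mean()
print(merged.value_counts())
print(share_subscribed)
```

Once the two "No" variants are merged, the column becomes binary and the minority of subscribers is directly visible.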

**Customer account information**
* Contract : the majority of clients have a month-to-month contract
* PaperlessBilling : majority of paperless billing
* PaymentMethod : balanced distribution between Credit card (automatic), Electronic check, Bank transfer (automatic) and Mailed check

**Target**
* Churn : about a quarter of the customers have churned

### Explore relations between variables and target

In [None]:
columns = ['tenure', 'MonthlyCharges', 'TotalCharges','gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']

#### Heatmap crosstab : target distribution for each nominal feature

In [None]:
for col in nominal_features :
    plt.figure()
    sns.heatmap(pd.crosstab(data['Churn'], data[col]), annot=True, fmt='d')


#### Countplot : nominal features / target

In [None]:
for col in nominal_features:
    plt.figure()
    sns.countplot(x=col, hue='Churn', data=data)

We can notice characteristics of churn customers :
* Fiber optic 
* month to month contract 
* Paperless billing
* Electronic check

#### Timechart : churn distribution for each numeric feature

In [None]:
churn_df = data[data['Churn'] == 'Yes']
noChurn_df = data[data['Churn'] == 'No']

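for col in numeric_features:
    plt.figure()
    # histplot replaces the deprecated distplot; stat='density' keeps both groups comparable
    sns.histplot(churn_df[col], kde=True, stat='density', label='Yes')
    sns.histplot(noChurn_df[col], kde=True, stat='density', label='No')
    plt.legend()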

Customers who have had their account for less than 20 months are more likely to churn.
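
One way to check this kind of observation numerically is to compare churn rates on either side of the 20-month mark. The sketch below uses a small made-up frame, so the resulting rates are illustrative only; the bin edges and labels are assumptions.

```python
import pandas as pd

# Made-up customers, only to illustrate the computation
toy = pd.DataFrame({
    "tenure": [1, 5, 12, 18, 25, 40, 60, 72],
    "Churn": ["Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No"],
})

# Split tenure into two buckets around the 20-month mark
bucket = pd.cut(toy["tenure"], bins=[0, 20, 80],
                labels=["<=20 months", ">20 months"],
                include_lowest=True)

# Churn rate (share of "Yes") inside each bucket
churn_rate = (toy["Churn"] == "Yes").groupby(bucket, observed=True).mean()
print(churn_rate)
```

Applied to `data`, the same grouping would give one churn rate per tenure bucket to set beside the plots.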

# 3. Preprocessing

In [None]:
df = data.copy()

### Let's split the dataset as train and test set

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
trainset, testset = train_test_split(df, test_size=0.2, random_state=0)

In [None]:
trainset['Churn'].value_counts(normalize=True)

In [None]:
testset['Churn'].value_counts(normalize=True)

The churn proportions are similar between the train and the test set.
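
Similar proportions are not guaranteed by a plain random split. If they need to be enforced, `train_test_split` accepts a `stratify` argument. Here is a minimal sketch on synthetic labels (the 25% positive rate is an assumption chosen to keep the arithmetic simple):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with 25% positives (illustrative, not the real data)
y = np.array([1] * 25 + [0] * 75)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the class proportions identical in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(y_tr.mean(), y_te.mean())
```

In the notebook this would amount to passing `stratify=df['Churn']` to the split.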

### Encoding

In [None]:
#encoding for our services columns

#columns for label encoding
labelEndoding_cols =  ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'PhoneService', 'MultipleLines', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'PaperlessBilling','Churn']

#columns for oneHot encoding
oneHot_cols = ['InternetService','Contract', 'PaymentMethod']

In [None]:
#create encoding function
def encoding(df):
    
    #work on a copy so the caller's DataFrame is left untouched
    df = df.copy()
    
    code = {'Male':1,
        'Female':0,
        '1':1,
        '0':0,
        'Yes':1,
       'No':0,
       'No internet service':0,
       'No phone service':0}
        
    #map the binary columns to 0/1 (direct assignment gives the columns a numeric dtype)
    for col in labelEndoding_cols:
        df[col] = df[col].map(code)

    #one-hot encode the multi-category columns (the prefix defaults to the column name)
    df = pd.get_dummies(df,columns=oneHot_cols)
    
    return df
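
Because the function is applied separately to the train and test sets, `pd.get_dummies` only creates columns for the categories present in each subset. If a category were missing from one split, the two frames would end up with different columns. Here is a hedged sketch of one way to guard against that, on a toy example with made-up frames, using `reindex`:

```python
import pandas as pd

# Toy frames: the 'Two year' category is absent from the test part
train = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year"]})
test = pd.DataFrame({"Contract": ["Month-to-month", "One year"]})

train_enc = pd.get_dummies(train, columns=["Contract"], prefix="Contract")
test_enc = pd.get_dummies(test, columns=["Contract"], prefix="Contract")

# Give the test frame exactly the training columns; missing dummies are filled with 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
print(list(test_aligned.columns))
```

With a few thousand rows per split every category is likely to appear in both, so this is a safeguard rather than a required step.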

### Preprocessing function

In [None]:
def preprocessing(df):
    
    df = encoding(df)
    
    X = df.drop('Churn',axis=1)
    y = df['Churn']
    
    print(y.value_counts())
    
    return X,y

In [None]:
X_train, y_train = preprocessing(trainset)

In [None]:
X_test, y_test = preprocessing(testset)

# 4. Modeling

In this project, we focus on the recall and F1-score metrics. The company would rather identify as many customers likely to churn as possible, so that it can make them a different offer; it would be a shame to let customers leave without a counter-offer. We therefore seek to minimize the number of false negatives.
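
To make the link between false negatives and these metrics concrete, here is a small worked example. The counts are hypothetical, not results from this notebook.

```python
# Hypothetical confusion-matrix counts for the churn class
tp = 60   # churners correctly flagged
fn = 40   # churners missed (false negatives)
fp = 30   # loyal customers wrongly flagged

recall = tp / (tp + fn)        # share of real churners that were detected
precision = tp / (tp + fp)     # share of flagged customers who really churn
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of both

print(recall, precision, f1)
```

Every missed churner lowers recall directly, which is why recall is used as the scoring metric in the grid searches, while F1 keeps precision from being ignored entirely.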

### Evaluation function

In [None]:
def evaluation(name,model):
    
    model.fit(X=X_train, y=y_train)
    ypred = model.predict(X_test)
    
    print(name)
    print(confusion_matrix(y_test,ypred))
    print(classification_report(y_test,ypred))
    
    N, train_score, val_score = learning_curve(model, X_train, y_train, 
                                               cv=4, scoring='f1',
                                               train_sizes=np.linspace(0.1,1,10))
    
    
    plt.figure(figsize=(12,8))
    plt.title(name)
    plt.plot(N,train_score.mean(axis=1), label='train score')
    plt.plot(N,val_score.mean(axis=1), label='val score')
    plt.legend()

### Models

Models to test : 
* Decision Tree
* Random Forest
* Logistic Regression
* AdaBoost
* SVM
* KNN

In [None]:
preprocessor = make_pipeline(SelectKBest(f_classif,k=8))
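
`SelectKBest(f_classif, k=8)` keeps the 8 features with the highest ANOVA F-scores. The sketch below shows how the selected columns can be inspected; it runs on synthetic data from `make_classification`, so the selected indices say nothing about the Telco features.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 12 features, only some of them informative
X, y = make_classification(n_samples=200, n_features=12,
                           n_informative=4, random_state=0)

selector = SelectKBest(f_classif, k=8).fit(X, y)

# Boolean mask of the retained features and the reduced matrix
mask = selector.get_support()
X_reduced = selector.transform(X)

print(mask)
print(X_reduced.shape)
```

On the notebook's data, `X_train.columns[mask]` would list the names of the 8 retained features.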

In [None]:
DecisionTree = make_pipeline(preprocessor,DecisionTreeClassifier(random_state=0))
RandomForest = make_pipeline(preprocessor, RandomForestClassifier(random_state=0))
LR = make_pipeline(preprocessor,LogisticRegression(random_state=0))
AdaBoost = make_pipeline(preprocessor, AdaBoostClassifier(random_state=0))
SVM = make_pipeline(preprocessor,StandardScaler(), SVC(random_state=0))
KNN = make_pipeline(preprocessor,StandardScaler(), KNeighborsClassifier())

In [None]:
list_of_models = [DecisionTree, RandomForest, LR, AdaBoost, SVM, KNN]

### Models Evaluation

In [None]:
dict_of_models = {'DecisionTree': DecisionTree,
                 'RandomForest': RandomForest,
                 'LR': LR,
                 'AdaBoost': AdaBoost,
                 'SVM': SVM,
                 'KNN': KNN
                 }

In [None]:
for name, model in dict_of_models.items():
    evaluation(name,model)

Decision Tree and Random Forest are overfitting.
After analyzing the results, we will focus on the Logistic Regression, KNN, SVM and AdaBoost models.
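
Overfitting shows up as a large gap between the training score and the score on unseen data. Here is a minimal illustration with an unconstrained decision tree on noisy synthetic data (this is not the Telco dataset, and the 20% label noise is an assumption chosen to make the effect visible):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of the labels flipped at random
X, y = make_classification(n_samples=500, n_features=10,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# No depth limit: the tree can memorize the training set, noise included
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print(train_acc, test_acc)
```

Limiting `max_depth` or `min_samples_leaf` is a common way to narrow this gap if the tree-based models are revisited.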

### Logistic Regression Optimization

In [None]:
LR.get_params().keys()
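
The keys returned here follow the `make_pipeline` naming convention: each step is named after its class in lowercase, and a step's parameters are addressed as `<step>__<parameter>`. Here is a short sketch with a standalone pipeline, built only for illustration:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Step names are generated from the lowercase class names
step_names = [name for name, _ in pipe.steps]
print(step_names)

# The double underscore routes the value to the named step's parameter
pipe.set_params(logisticregression__C=0.5)
print(pipe.named_steps["logisticregression"].C)
```

This is the same syntax `GridSearchCV` expects for the keys of its parameter grid.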

In [None]:
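hyper_params_lr = {
    'logisticregression__solver':['liblinear'],           # supports both l1 and l2 (the default lbfgs only supports l2)
    'logisticregression__penalty':['l1', 'l2'],           # l1 is Lasso, l2 is Ridge; elasticnet needs the saga solver and l1_ratio
    'logisticregression__C': np.arange(1e-05, 3, 0.1),
}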

In [None]:
grid_lr = GridSearchCV(LR,hyper_params_lr,scoring='recall', cv=4)

grid_lr.fit(X_train,y_train)

print(grid_lr.best_params_)

y_pred = grid_lr.predict(X_test)

In [None]:
evaluation('Logistic Regression',grid_lr.best_estimator_)

### SVM Optimization

In [None]:
SVM.get_params().keys()

In [None]:
hyper_params_svm = {'svc__gamma':[1e-3, 1e-4, 0.0005],
                'svc__C':[1, 10, 100, 1000, 3000],
               }

In [None]:
grid_svm = GridSearchCV(SVM,hyper_params_svm,scoring='recall', cv=4)

grid_svm.fit(X_train,y_train)

print(grid_svm.best_params_)

y_pred = grid_svm.predict(X_test)

In [None]:
evaluation('SVM',grid_svm.best_estimator_)

### AdaBoost Optimization

In [None]:
AdaBoost.get_params().keys()

In [None]:
hyper_params_abc = {
     'adaboostclassifier__n_estimators': np.arange(10,300,10),
     'adaboostclassifier__learning_rate': [0.01, 0.05, 0.1, 1],
 }

In [None]:
grid_abc = GridSearchCV(AdaBoost,hyper_params_abc,scoring='recall', cv=4)

grid_abc.fit(X_train,y_train)

print(grid_abc.best_params_)

y_pred = grid_abc.predict(X_test)

In [None]:
evaluation('AdaBoost',grid_abc.best_estimator_)

### KNN Optimization

In [None]:
KNN.get_params().keys()

In [None]:
hyper_params_knn = {'kneighborsclassifier__n_neighbors':[4,5,6,7],
              'kneighborsclassifier__leaf_size':[1,3,5],
              }

In [None]:
grid_knn = GridSearchCV(KNN,hyper_params_knn,scoring='recall', cv=4)

grid_knn.fit(X_train,y_train)

print(grid_knn.best_params_)

y_pred = grid_knn.predict(X_test)

In [None]:
evaluation('KNN',grid_knn.best_estimator_)

### Model selection

* The SVM model could have been an interesting model if we had less data. 
* Logistic Regression has good results, but the Adaboost and KNN models have the best performance with a recall of 0.52 and an f1-score of 0.57. 

We will continue our project with the Adaboost model.

### Precision Recall Curve for the Adaboost model

In [None]:
precision, recall, threshold = precision_recall_curve(y_test,grid_abc.best_estimator_.decision_function(X_test))

In [None]:
plt.plot(threshold, precision[:-1], label='precision')
plt.plot(threshold, recall[:-1], label='recall')
plt.legend()

Let's choose a threshold of -0.1 in order to get a better recall.

In [None]:
def model_final(model, X, threshold=0) :
    return model.decision_function(X) > threshold
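
The effect of moving the threshold can be illustrated without a model. In the sketch below the scores and labels are made up, standing in for the output of `decision_function`; `recall_at` and `precision_at` are hypothetical helper names.

```python
import numpy as np

# Made-up labels and decision scores (1 = churn)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.30, 0.05, -0.05, -0.20, 0.10, -0.08, -0.30, -0.40])

def recall_at(threshold):
    """Share of true churners whose score exceeds the threshold."""
    predicted = scores > threshold
    return np.sum(predicted & (y_true == 1)) / np.sum(y_true == 1)

def precision_at(threshold):
    """Share of flagged customers who are true churners."""
    predicted = scores > threshold
    return np.sum(predicted & (y_true == 1)) / np.sum(predicted)

for t in (0.0, -0.1):
    print(t, recall_at(t), precision_at(t))
```

In this toy case, lowering the threshold from 0 to -0.1 catches one more churner but also flags one more loyal customer, so recall rises while precision drops.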

In [None]:
y_pred = model_final(grid_abc.best_estimator_, X_test, threshold=-0.1)

In [None]:
recall_score(y_test,y_pred)

In [None]:
f1_score(y_test, y_pred)

### Conclusion

With AdaBoost and some optimization, we obtained a model able to **detect 91% of Telco's customers who went on to churn**. With this type of model, the company could contact these customers and offer to modify their contract.