# Telecom Churn Prediction
Purpose: Predict behavior to retain customers

### Introduction

Churn prediction is a method for identifying customers who are unlikely to continue using a company's service, based on factors in the customer data. It helps companies forecast revenue, develop strategies to retain high-risk customers, and improve existing services to attract new customers.

## Exploratory Data Analysis

In [1]:
# IMPORT REQUIRED PACKAGES
import pandas as pd
import numpy as np
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

### Dataset Description
The dataset is described on its [source page](https://www.kaggle.com/datasets/blastchar/telco-customer-churn) as follows:

>Each row represents a customer, each column contains customer’s attributes described on the column Metadata.
>
>The data set includes information about:
> - Customers who left within the last month – the column is called Churn
> - Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
> - Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
> - Demographic info about customers – gender, age range, and if they have partners and dependents

In [2]:
original_dataset = pd.read_csv("./WA_Fn-UseC_-Telco-Customer-Churn.csv")
original_dataset.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


### Feature Selection

In [3]:
print("Total Data values in the dataset:", original_dataset.shape[0])
print("Total Columns:", original_dataset.shape[1])
print(original_dataset.columns.values)

Total Data values in the dataset: 7043
Total Columns: 21
['customerID' 'gender' 'SeniorCitizen' 'Partner' 'Dependents' 'tenure'
 'PhoneService' 'MultipleLines' 'InternetService' 'OnlineSecurity'
 'OnlineBackup' 'DeviceProtection' 'TechSupport' 'StreamingTV'
 'StreamingMovies' 'Contract' 'PaperlessBilling' 'PaymentMethod'
 'MonthlyCharges' 'TotalCharges' 'Churn']


Out of the given 21 columns, the features that should be used for analysis and prediction are:

gender, SeniorCitizen, Partner, Dependents, tenure, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges = 19 features

**Target Variable:** Churn

### Dataset Cleaning

In [4]:
original_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [5]:
# Work on an independent copy so that cleaning does not modify original_dataset
dataset_copy = original_dataset.copy()

In [6]:
# Converting TotalCharges to numeric
dataset_copy['TotalCharges'] = pd.to_numeric(dataset_copy['TotalCharges'], errors='coerce')

In [7]:
# Checking for any null values
dataset_copy.isnull().sum()

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64
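
The 11 missing values in `TotalCharges` appear only after the numeric conversion, which suggests that those entries were strings that could not be parsed as numbers. The sketch below uses made-up values (not taken from the dataset) to show how `errors='coerce'` turns such an entry into `NaN` instead of raising an error.

```python
import pandas as pd

# Illustrative values only: one blank string among numeric strings.
raw = pd.Series(["29.85", " ", "1889.5"])

# errors='coerce' replaces anything that cannot be parsed as a number with NaN.
converted = pd.to_numeric(raw, errors="coerce")

n_missing = int(converted.isna().sum())
print(converted)
print("Missing after conversion:", n_missing)
```

Counting the nulls right after the conversion, as the next cell does, is a quick way to see how many entries were affected.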

In [8]:
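# Removing rows with missing values and dropping customerID (the first column)
dataset_copy.dropna(inplace=True)
# .copy() makes df independent of dataset_copy, so later edits to df are safe
df = dataset_copy.iloc[:, 1:].copy()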
df.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [9]:
# Encode the target as 1 (churned) / 0 (stayed); assigning the result back
# avoids relying on an in-place update of a selected column
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

In [10]:
df = pd.get_dummies(df)
df.head()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,Churn,gender_Female,gender_Male,Partner_No,Partner_Yes,Dependents_No,...,StreamingMovies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,PaperlessBilling_No,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0,1,29.85,29.85,0,1,0,0,1,1,...,0,1,0,0,0,1,0,0,1,0
1,0,34,56.95,1889.5,0,0,1,1,0,1,...,0,0,1,0,1,0,0,0,0,1
2,0,2,53.85,108.15,1,0,1,1,0,1,...,0,1,0,0,0,1,0,0,0,1
3,0,45,42.3,1840.75,0,0,1,1,0,1,...,0,0,1,0,1,0,1,0,0,0
4,0,2,70.7,151.65,1,1,0,1,0,1,...,0,1,0,0,0,1,0,0,1,0
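
As a small illustration of what `pd.get_dummies` does here, the sketch below one-hot encodes a toy frame with made-up rows. Only the categorical column is expanded; the numeric column is left as it is. Passing `dtype=int` is an assumption made for this example so that the indicator columns come out as 0/1, as in the output above; newer pandas versions return booleans by default.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (illustrative values).
toy = pd.DataFrame({
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
})

# Object columns become one indicator column per category;
# numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```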


### Data Exploration

#### Correlation Analysis

In [11]:
correlation = df.corr()
fig = px.imshow(correlation, title="Correlation between all the features in the dataset", aspect='auto', height=1200)
fig.show()

In [12]:
correlation_churn = correlation['Churn'].sort_values(ascending=True)
fig = px.bar(correlation_churn, text_auto='.2', height=800, width=1200)
fig.show()

__Conclusion from Correlation Analysis:__

Major factors affecting Churn _positively_:
- Month-to-Month Contracts
- Absence of Online Security
- Absence of Tech Support

Major factors affecting Churn _negatively_:
- Tenure
- Two-year Contracts

#### Univariate Analysis

In [13]:
churn_distribution = dataset_copy['Churn'].value_counts()
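# Values are passed explicitly because the name of a value_counts() Series depends on the pandas version
px.pie(values=churn_distribution.values, names=churn_distribution.index, title="Data Distribution w.r.t. Churn").show()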

In [14]:
# GENDER DISTRIBUTION
gender_distribution = dataset_copy['gender'].value_counts()

# SENIORITY DISTRIBUTION
seniority_distribution = dataset_copy['SeniorCitizen'].value_counts()
seniority_distribution.rename(index={0: 'No', 1: 'Yes'}, inplace=True)

# PARTNER DISTRIBUTION
partner_distribution = dataset_copy['Partner'].value_counts()

# DEPENDENCE DISTRIBUTION
dependent_distribution = dataset_copy['Dependents'].value_counts()

In [15]:
fig = make_subplots(
    rows=2, cols=2,
    specs=[[{"type": "pie"}, {"type": "pie"}],[{"type": "pie"}, {"type": "pie"}]],
    subplot_titles=('Data Distribution w.r.t. Gender', 'Data Distribution w.r.t. Seniority', 'Data Distribution w.r.t. Partner', 'Data Distribution w.r.t. Dependents')
)
# Values are passed explicitly because the name of a value_counts() Series depends on the pandas version
fig.add_trace(
    px.pie(values=gender_distribution.values, names=gender_distribution.index).data[0],
    row=1, col=1
)
fig.add_trace(
    px.pie(values=seniority_distribution.values, names=seniority_distribution.index).data[0],
    row=1, col=2
)
fig.add_trace(
    px.pie(values=partner_distribution.values, names=partner_distribution.index).data[0],
    row=2, col=1
)
fig.add_trace(
    px.pie(values=dependent_distribution.values, names=dependent_distribution.index).data[0],
    row=2, col=2
)
fig.update_layout(height=800, width=1000)
fig.show()

In [16]:
tenure_distribution = dataset_copy['tenure'].value_counts()
fig = px.bar(tenure_distribution,
    title="Data Distribution w.r.t Tenure",
    labels={
        "index": "months",
        "value": "customers"
    }
)
fig.update_layout(
    xaxis_title="No. of Months",
    yaxis_title="No. of Customers",
    showlegend=False
)
fig.show()

In [17]:
contract_type_distribution = dataset_copy['Contract'].value_counts()
fig = px.bar(contract_type_distribution,
    title="Data Distribution w.r.t Contract Type",
    labels={
        "index": "type",
        "value": "customers"
    }
)
fig.update_layout(
    xaxis_title="Type of Contract",
    yaxis_title="No. of Customers",
    showlegend=False
)
fig.show()

In [18]:
services = ['PhoneService','MultipleLines','InternetService','OnlineSecurity', 'OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies']
fig = make_subplots(
    rows=3, cols=3,
    subplot_titles=services
)

for i, service in enumerate(services):
    service_df = dataset_copy[service].value_counts()
    # Fill the 3x3 grid row by row (subplot rows and columns are 1-based)
    row, col = i // 3 + 1, i % 3 + 1

    fig.add_trace(
        px.bar(
            service_df,
            labels={
                "index": "service_choice",
                "value": "customers"
            }
        ).data[0],
        row=row, col=col
    )
    
fig.update_layout(
    height=800,
    showlegend=False
)
fig.show()

__Conclusion from Univariate Analysis:__

- Data distribution is __highly uneven__ w.r.t. the target variable: Churn, which might affect the performance of the ML model.
- Data distribution is __highly uneven__ w.r.t. features like:
    - Whether the customer is a senior citizen or not.
    - Whether the customer has dependents or not.
- Data distribution is __fair__ w.r.t. features like:
    - Gender
    - Whether the customer has a partner or not.
- The majority of customers choose month-to-month plans over one-year or two-year contracts. It will be interesting to see how tenure and contract type relate to churn.
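
One practical consequence of the uneven target distribution is that plain accuracy can look good for a model that never predicts churn. The sketch below computes the accuracy of always predicting the majority class. It uses synthetic labels with roughly the 73:27 No:Yes split reported later in this notebook; the helper name is hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Synthetic labels with a 73:27 split (illustrative, not the real rows).
labels = ["No"] * 73 + ["Yes"] * 27

baseline = majority_baseline_accuracy(labels)
print(f"Majority-class baseline accuracy: {baseline:.2%}")
```

Any model accuracy reported later is easier to judge against this baseline, and per-class precision and recall are more informative than accuracy alone.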

#### Bivariate and Multivariate Analysis

In [19]:
contract_churn = dataset_copy.groupby(['Contract', 'Churn']).size().unstack()
contract_churn = (contract_churn.T*100/contract_churn.T.sum()).T

fig = px.bar(
        contract_churn,
        title="Churn Rate vs Type of Contract",
        labels={
            "value": "% Customers",
            "Contract": "Type of Contract"
        }
    )
fig.show()

In [20]:
a = dataset_copy['MonthlyCharges'][(dataset_copy['Churn']=='No')]
b = dataset_copy['MonthlyCharges'][(dataset_copy['Churn']=='Yes')]
fig = ff.create_distplot([a,b], ['Churn (No)', 'Churn (Yes)'], bin_size=.2)
fig.show()

In [21]:
a = dataset_copy['TotalCharges'][(dataset_copy['Churn']=='No')]
b = dataset_copy['TotalCharges'][(dataset_copy['Churn']=='Yes')]
fig = ff.create_distplot([a,b], ['Churn (No)', 'Churn (Yes)'], bin_size=.2)
fig.show()

__Conclusion from Bivariate and Multivariate Analysis:__

- Customers on longer contracts are unlikely to churn, whereas customers on month-to-month contracts cannot be classified reliably from contract type alone.
- Customers are more likely to churn when their monthly charges are high and less likely to churn when their total charges are high, as shown in the distplots.
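
The churn-rate-by-contract table used for the bar chart in this section is a row-normalized cross-tabulation. As an alternative sketch (an assumption, not the notebook's own code), `pd.crosstab` with `normalize="index"` produces the same kind of table in one call. The rows below are made up.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year"],
    "Churn": ["Yes", "No", "No", "No"],
})

# normalize="index" divides every row by its own total, so each contract
# type sums to 100 after scaling to percentages.
rates = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index") * 100
print(rates)
```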

## Model Training

Based on the conclusions and observations from the EDA, we will train binary classification models with the following algorithms:
- __Logistic Regression__

    Logistic regression is a regression-based classifier that is easy to implement. It generally performs well on binary classification problems with small datasets, which suits our use case.

- __Decision Tree__

    A decision tree is a graphical representation of the possible outcomes of a problem under given conditions. It is fast and easy to implement because it needs no normalization or feature scaling, which makes it suitable for our use case.
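
To make the logistic regression description more concrete, here is a minimal sketch of how such a model turns a weighted sum of features into a churn probability. The weights, bias, and feature values are hypothetical and chosen only for illustration; a fitted model learns them from the training data.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one row of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for two indicator features.
weights = [1.2, -0.8]
probability = predict_proba(weights, bias=0.0, features=[1, 0])

# The usual decision rule: predict churn when the probability is at least 0.5.
label = int(probability >= 0.5)
print(round(probability, 3), label)
```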

In [22]:
# IMPORTING REQUIRED PACKAGES
from sklearn.model_selection import train_test_split
from sklearn import metrics

from imblearn.combine import SMOTEENN

from sklearn.model_selection import GridSearchCV

### Feature and Target Variable Processing

In [23]:
labels = [f"{i} - {i+11}" for i in range(1, df['tenure'].max(), 12)]
df['tenure_group'] = pd.cut(df.tenure, range(1, 80, 12), right=False, labels=labels)
df.drop(columns=['tenure'], axis=1, inplace=True)
df = pd.get_dummies(df)
df.to_csv('final_dataset.csv')
df.head()

Unnamed: 0,SeniorCitizen,MonthlyCharges,TotalCharges,Churn,gender_Female,gender_Male,Partner_No,Partner_Yes,Dependents_No,Dependents_Yes,...,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,tenure_group_1 - 12,tenure_group_13 - 24,tenure_group_25 - 36,tenure_group_37 - 48,tenure_group_49 - 60,tenure_group_61 - 72
0,0,29.85,29.85,0,1,0,0,1,1,0,...,0,0,1,0,1,0,0,0,0,0
1,0,56.95,1889.5,0,0,1,1,0,1,0,...,0,0,0,1,0,0,1,0,0,0
2,0,53.85,108.15,1,0,1,1,0,1,0,...,0,0,0,1,1,0,0,0,0,0
3,0,42.3,1840.75,0,0,1,1,0,1,0,...,1,0,0,0,0,0,0,1,0,0
4,0,70.7,151.65,1,1,0,1,0,1,0,...,0,0,1,0,1,0,0,0,0,0
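
The tenure grouping relies on `pd.cut` with `right=False`, which makes every bin include its left edge and exclude its right edge. The sketch below applies the same bin edges to a few made-up tenure values to show where the boundaries fall.

```python
import pandas as pd

# A few made-up tenure values around the bin boundaries.
tenure = pd.Series([1, 12, 13, 72])

edges = list(range(1, 80, 12))  # [1, 13, 25, 37, 49, 61, 73]
labels = [f"{low} - {low + 11}" for low in edges[:-1]]

# right=False gives the bins [1, 13), [13, 25), ..., [61, 73).
groups = pd.cut(tenure, bins=edges, right=False, labels=labels)
group_names = groups.astype(str).tolist()
print(group_names)
```

With these edges, a value below 1 falls outside every bin, so `pd.cut` would return `NaN` for it.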


In [24]:
X = df.drop(columns=['Churn'], axis=1)
Y = df['Churn']
display(X.head())
display(Y.head())

Unnamed: 0,SeniorCitizen,MonthlyCharges,TotalCharges,gender_Female,gender_Male,Partner_No,Partner_Yes,Dependents_No,Dependents_Yes,PhoneService_No,...,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,tenure_group_1 - 12,tenure_group_13 - 24,tenure_group_25 - 36,tenure_group_37 - 48,tenure_group_49 - 60,tenure_group_61 - 72
0,0,29.85,29.85,1,0,0,1,1,0,1,...,0,0,1,0,1,0,0,0,0,0
1,0,56.95,1889.5,0,1,1,0,1,0,0,...,0,0,0,1,0,0,1,0,0,0
2,0,53.85,108.15,0,1,1,0,1,0,0,...,0,0,0,1,1,0,0,0,0,0
3,0,42.3,1840.75,0,1,1,0,1,0,1,...,1,0,0,0,0,0,0,1,0,0
4,0,70.7,151.65,1,0,1,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0


0    0
1    0
2    1
3    0
4    1
Name: Churn, dtype: int64

### Training and Testing

#### Logistic Regression

In [25]:
# CREATING TRAIN AND TEST DATA
# Using a ratio of 8:2 for train:test since the dataset is not large enough
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2)

##### Default Model

In [26]:
from sklearn.linear_model import LogisticRegression
model_lr = LogisticRegression()
model_lr.fit(X_train, Y_train)


lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
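
The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of both, under the assumption that a `StandardScaler` in front of the classifier is acceptable for this problem, is shown below. It runs on synthetic data from `make_classification`, not on the churn dataset, and the `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one feature is put on a much larger scale,
# imitating a column such as TotalCharges next to 0/1 indicator columns.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo[:, 0] *= 1000

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The scaler is fitted on the training split only, then the classifier is
# trained on the scaled features with a higher iteration limit.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(Xd_train, yd_train)

demo_accuracy = scaled_lr.score(Xd_test, yd_test)
print(f"Accuracy on synthetic data: {demo_accuracy:.2%}")
```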



In [27]:
Y_pred_model_lr = model_lr.predict(X_test)
print("Logistic Regression Model Accuracy (without hyper parameter tuning): {:.2f}%".format(model_lr.score(X_test, Y_test)*100))

Logistic Regression Model Accuracy (without hyper parameter tuning): 78.82%


In [28]:
print(metrics.classification_report(Y_test, Y_pred_model_lr, labels=[0,1]))

              precision    recall  f1-score   support

           0       0.83      0.90      0.86      1029
           1       0.64      0.49      0.55       378

    accuracy                           0.79      1407
   macro avg       0.73      0.69      0.71      1407
weighted avg       0.78      0.79      0.78      1407



In [29]:
# confusion_matrix sorts the labels, so row/column 0 is class 0 (no churn, "Negative")
# and row/column 1 is class 1 (churn, "Positive"): [[TN, FP], [FN, TP]]
fig = px.imshow(
    metrics.confusion_matrix(Y_test, Y_pred_model_lr),
    labels=dict(x="PREDICTED", y="ACTUAL", color='value'),
    x=['Negative', 'Positive'],
    y=['Negative', 'Positive'],
    title="Confusion Matrix for Logistic Regression Model (without hyper parameter tuning)"
)
fig.update_traces(
    text=[['TN','FP'],['FN','TP']],
    texttemplate="%{text}",
)
fig.show()
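
For 0/1 labels, scikit-learn's `confusion_matrix` puts the actual class in rows and the predicted class in columns, both in sorted label order, which gives `[[TN, FP], [FN, TP]]` when class 1 (churn) is treated as positive. The pure-Python sketch below reproduces that layout on made-up labels and derives precision and recall for the churn class from it.

```python
def confusion_counts(actual, predicted):
    """Return [[TN, FP], [FN, TP]] for 0/1 labels, with class 1 as positive."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 1 and p == 0:
            fn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return [[tn, fp], [fn, tp]]

# Made-up labels for illustration.
actual = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 0, 1]

matrix = confusion_counts(actual, predicted)
(tn, fp), (fn, tp) = matrix

precision = tp / (tp + fp)  # share of predicted churners who really churned
recall = tp / (tp + fn)     # share of actual churners the model caught
print(matrix, round(precision, 2), round(recall, 2))
```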

##### Hyper Parameter Tuning

In [30]:
gscv_lr = GridSearchCV(
    LogisticRegression(),
    {
        'penalty': ['l1','l2','elasticnet','none'],
        'C': [1,5,10,15],
        'solver': ['newton-cg','lbfgs','liblinear','sag','saga']

    },
    cv=5,
    return_train_score=False
)
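# Caution: the search is fitted on the full X, Y, which includes the rows in X_test,
# so scores computed on X_test afterwards are optimistic. Fitting on X_train, Y_train avoids this.
gscv_lr.fit(X,Y)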


The max_iter was reached which means the coef_ did not converge



The line search algorithm did not converge



lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



In [31]:
gscv_lr_result = pd.DataFrame(gscv_lr.cv_results_)

In [32]:
gscv_lr_result.sort_values(by='mean_test_score',ascending=False)[['param_penalty','param_C','param_solver', 'mean_test_score']].head()

Unnamed: 0,param_penalty,param_C,param_solver,mean_test_score
2,l1,1,liblinear,0.802901
65,l2,15,newton-cg,0.802901
45,l2,10,newton-cg,0.802901
22,l1,5,liblinear,0.802758
25,l2,5,newton-cg,0.802758
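
`mean_test_score` is the accuracy averaged over the five cross-validation folds. One way to keep a final test split completely unseen during tuning is to run the search on the training split only. The sketch below shows that pattern on synthetic data; the reduced grid over `C` and the scaling step are assumptions made for this example, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the churn dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class, hence the prefix.
search = GridSearchCV(pipeline, {"logisticregression__C": [0.1, 1, 10]}, cv=5)
search.fit(Xd_train, yd_train)  # the held-out split is never used here

held_out_accuracy = search.score(Xd_test, yd_test)
print(search.best_params_, f"{held_out_accuracy:.2%}")
```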


In [33]:
final_model_lr = gscv_lr.best_estimator_
final_model_lr

In [34]:
Y_pred_final_model_lr = final_model_lr.predict(X_test)
print("Best Logistic Regression Model Accuracy: {:.2f}%".format(final_model_lr.score(X_test, Y_test)*100))

Best Logistic Regression Model Accuracy: 79.32%


In [35]:
print(metrics.classification_report(Y_test, Y_pred_final_model_lr, labels=[0,1]))

              precision    recall  f1-score   support

           0       0.83      0.90      0.86      1029
           1       0.65      0.49      0.56       378

    accuracy                           0.79      1407
   macro avg       0.74      0.70      0.71      1407
weighted avg       0.78      0.79      0.78      1407



In [36]:
# Class 0 (no churn) comes first in the matrix: [[TN, FP], [FN, TP]]
fig = px.imshow(
    metrics.confusion_matrix(Y_test, Y_pred_final_model_lr),
    labels=dict(x="PREDICTED", y="ACTUAL", color='value'),
    x=['Negative', 'Positive'],
    y=['Negative', 'Positive'],
    title="Confusion Matrix for Logistic Regression Model (with hyper parameter tuning)"
)
fig.update_traces(
    text=[['TN','FP'],['FN','TP']],
    texttemplate="%{text}",
)
fig.show()

##### Training on Resampling

Our dataset is highly imbalanced [73:27 for Churn(No):Churn(Yes)], and an imbalanced dataset tends to hurt model performance on the minority class. Using [SMOTEENN](https://imbalanced-learn.org/stable/references/generated/imblearn.combine.SMOTEENN.html), we will resample the dataset to (hopefully) improve the performance of the model. For comparison, we will set the results on the resampled train/test split against those on the original train/test split.
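
A common precaution with resampling is to split first and resample only the training portion, so that the test rows stay real and untouched. The sketch below illustrates that ordering in plain Python. It uses simple random oversampling as a stand-in for SMOTEENN (a deliberate simplification), and the function name and data are hypothetical.

```python
import random

def split_then_oversample(rows, labels, test_fraction=0.2, seed=0):
    """Split first, then balance the classes in the training part only."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test = [(rows[i], labels[i]) for i in indices[:n_test]]
    train = [(rows[i], labels[i]) for i in indices[n_test:]]

    # Group the training pairs by class.
    by_class = {}
    for pair in train:
        by_class.setdefault(pair[1], []).append(pair)

    # Duplicate random minority rows until every class matches the largest one.
    target = max(len(items) for items in by_class.values())
    balanced = []
    for items in by_class.values():
        balanced.extend(items)
        balanced.extend(rng.choice(items) for _ in range(target - len(items)))
    rng.shuffle(balanced)
    return balanced, test

# Synthetic data: 100 row ids, 27 of them labelled as churn (1).
rows = list(range(100))
labels = [1 if i < 27 else 0 for i in rows]
train_balanced, test_untouched = split_then_oversample(rows, labels)
print(len(train_balanced), len(test_untouched))
```

With this ordering, a score on `test_untouched` reflects the original class balance and contains no duplicated or synthetic rows.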

In [37]:
smoteenn = SMOTEENN()
X_resampled, Y_resampled = smoteenn.fit_resample(X,Y)

Xr_train, Xr_test, Yr_train, Yr_test = train_test_split(X_resampled, Y_resampled,test_size=0.2)

In [38]:
# Get best params of best estimator to run training on resampled dataset
gscv_lr.best_params_

{'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}

In [39]:
final_model_lr_resampled = LogisticRegression(
    C=1,
    penalty='l1',
    solver='liblinear'
)
final_model_lr_resampled.fit(Xr_train, Yr_train)

In [40]:
Y_pred_model_lr_resampled = final_model_lr_resampled.predict(Xr_test)
print("Logistic Regression Model Accuracy (on resampled data): {:.2f}%".format(final_model_lr_resampled.score(Xr_test, Yr_test)*100))

Logistic Regression Model Accuracy (on resampled data): 94.25%


In [41]:
print(metrics.classification_report(Yr_test, Y_pred_model_lr_resampled, labels=[0,1]))

              precision    recall  f1-score   support

           0       0.94      0.93      0.94       526
           1       0.95      0.95      0.95       656

    accuracy                           0.94      1182
   macro avg       0.94      0.94      0.94      1182
weighted avg       0.94      0.94      0.94      1182



In [42]:
# Class 0 (no churn) comes first in the matrix: [[TN, FP], [FN, TP]]
fig = px.imshow(
    metrics.confusion_matrix(Yr_test, Y_pred_model_lr_resampled),
    labels=dict(x="PREDICTED", y="ACTUAL", color='value'),
    x=['Negative', 'Positive'],
    y=['Negative', 'Positive'],
    title="Confusion Matrix for Logistic Regression Model (on resampled dataset)"
)
fig.update_traces(
    text=[['TN','FP'],['FN','TP']],
    texttemplate="%{text}",
)
fig.show()

##### Conclusion

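After applying __Logistic Regression__ with default hyperparameters, tuned hyperparameters, and a resampled dataset, the model trained on the resampled dataset achieved the best accuracy (94.25%), with precision above 90% for both classes. Note that this score is measured on a test split drawn from the resampled data, so it is not directly comparable with the scores on the original test split.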

#### Decision Tree

In [43]:
# CREATING TRAIN AND TEST DATA
# Using a ratio of 8:2 for train:test since the dataset is not large enough
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2)

##### Default Model

In [44]:
from sklearn.tree import DecisionTreeClassifier
model_dtc = DecisionTreeClassifier()
model_dtc.fit(X_train, Y_train)

In [45]:
Y_pred_model_dtc = model_dtc.predict(X_test)
print("Decision Tree Accuracy (without hyper parameter tuning): {:.2f}%".format(model_dtc.score(X_test, Y_test)*100))

Decision Tree Accuracy (without hyper parameter tuning): 72.28%


In [46]:
print(metrics.classification_report(Y_test, Y_pred_model_dtc, labels=[0,1]))

              precision    recall  f1-score   support

           0       0.81      0.81      0.81      1021
           1       0.49      0.49      0.49       386

    accuracy                           0.72      1407
   macro avg       0.65      0.65      0.65      1407
weighted avg       0.72      0.72      0.72      1407



In [47]:
# Class 0 (no churn) comes first in the matrix: [[TN, FP], [FN, TP]]
fig = px.imshow(
    metrics.confusion_matrix(Y_test, Y_pred_model_dtc),
    labels=dict(x="PREDICTED", y="ACTUAL", color='value'),
    x=['Negative', 'Positive'],
    y=['Negative', 'Positive'],
    title="Confusion Matrix for Decision Tree (without hyper parameter tuning)"
)
fig.update_traces(
    text=[['TN','FP'],['FN','TP']],
    texttemplate="%{text}",
)
fig.show()

##### Hyper Parameter Tuning

In [48]:
gscv_dtc = GridSearchCV(
    DecisionTreeClassifier(),
    {
        'criterion': ['gini','entropy','log_loss'],
        'max_depth': [6,7,8,9,10],
        'splitter': ['best','random'],
        'min_samples_leaf': [6,7,8,9,10],
        'min_samples_split': [6,7,8,9,10],
        'random_state': [100,120,140]
    },
    cv=5,
    return_train_score=False
)
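# Caution: the search is fitted on the full X, Y, which includes the rows in X_test,
# so scores computed on X_test afterwards are optimistic. Fitting on X_train, Y_train avoids this.
gscv_dtc.fit(X,Y)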

In [49]:
gscv_dtc_result = pd.DataFrame(gscv_dtc.cv_results_)

In [50]:
gscv_dtc_result.sort_values(by='mean_test_score',ascending=False)[['param_criterion','param_max_depth','param_splitter','param_min_samples_leaf','param_min_samples_split','param_random_state','mean_test_score']].head()

Unnamed: 0,param_criterion,param_max_depth,param_splitter,param_min_samples_leaf,param_min_samples_split,param_random_state,mean_test_score
971,entropy,7,random,8,7,140,0.793939
1727,log_loss,7,random,8,8,140,0.793939
1715,log_loss,7,random,8,6,140,0.793939
965,entropy,7,random,8,6,140,0.793939
1733,log_loss,7,random,8,9,140,0.793939


In [51]:
final_model_dtc = gscv_dtc.best_estimator_
final_model_dtc

In [52]:
Y_pred_final_model_dtc = final_model_dtc.predict(X_test)
print("Best Decision Tree Accuracy: {:.2f}%".format(final_model_dtc.score(X_test, Y_test)*100))

Best Decision Tree Accuracy: 81.09%


In [53]:
print(metrics.classification_report(Y_test, Y_pred_final_model_dtc, labels=[0,1]))

              precision    recall  f1-score   support

           0       0.86      0.89      0.87      1021
           1       0.67      0.61      0.64       386

    accuracy                           0.81      1407
   macro avg       0.76      0.75      0.76      1407
weighted avg       0.81      0.81      0.81      1407



In [54]:
# Class 0 (no churn) comes first in the matrix: [[TN, FP], [FN, TP]]
fig = px.imshow(
    metrics.confusion_matrix(Y_test, Y_pred_final_model_dtc),
    labels=dict(x="PREDICTED", y="ACTUAL", color='value'),
    x=['Negative', 'Positive'],
    y=['Negative', 'Positive'],
    title="Confusion Matrix for Decision Tree (with hyper parameter tuning)"
)
fig.update_traces(
    text=[['TN','FP'],['FN','TP']],
    texttemplate="%{text}",
)
fig.show()

##### Training on Resampling

Our dataset is highly imbalanced [73:27 for Churn(No):Churn(Yes)], and an imbalanced dataset tends to hurt model performance on the minority class. Using [SMOTEENN](https://imbalanced-learn.org/stable/references/generated/imblearn.combine.SMOTEENN.html), we will resample the dataset to (hopefully) improve the performance of the model. For comparison, we will set the results on the resampled train/test split against those on the original train/test split.

In [55]:
smoteenn = SMOTEENN()
X_resampled, Y_resampled = smoteenn.fit_resample(X,Y)

Xr_train, Xr_test, Yr_train, Yr_test = train_test_split(X_resampled, Y_resampled,test_size=0.2)

In [56]:
# Get best params of best estimator to run training on resampled dataset
gscv_dtc.best_params_

{'criterion': 'entropy',
 'max_depth': 7,
 'min_samples_leaf': 8,
 'min_samples_split': 6,
 'random_state': 140,
 'splitter': 'random'}

In [57]:
final_model_dtc_resampled = DecisionTreeClassifier(
    criterion='entropy',
    max_depth=7,
    min_samples_leaf=8,
    min_samples_split=6,
    random_state=140,
    splitter='random'
)
final_model_dtc_resampled.fit(Xr_train, Yr_train)

In [58]:
Y_pred_model_dtc_resampled = final_model_dtc_resampled.predict(Xr_test)
print("Decision Tree Accuracy (on resampled data): {:.2f}%".format(final_model_dtc_resampled.score(Xr_test, Yr_test)*100))

Decision Tree Accuracy (on resampled data): 91.75%


In [59]:
print(metrics.classification_report(Yr_test, Y_pred_model_dtc_resampled, labels=[0,1]))

              precision    recall  f1-score   support

           0       0.91      0.90      0.90       494
           1       0.93      0.93      0.93       670

    accuracy                           0.92      1164
   macro avg       0.92      0.92      0.92      1164
weighted avg       0.92      0.92      0.92      1164



In [60]:
# Class 0 (no churn) comes first in the matrix: [[TN, FP], [FN, TP]]
fig = px.imshow(
    metrics.confusion_matrix(Yr_test, Y_pred_model_dtc_resampled),
    labels=dict(x="PREDICTED", y="ACTUAL", color='value'),
    x=['Negative', 'Positive'],
    y=['Negative', 'Positive'],
    title="Confusion Matrix for Decision Tree (on resampled dataset)"
)
fig.update_traces(
    text=[['TN','FP'],['FN','TP']],
    texttemplate="%{text}",
)
fig.show()

##### Conclusion

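After applying __Decision Tree__ with default hyperparameters, tuned hyperparameters, and a resampled dataset, the model trained on the resampled dataset achieved the best accuracy (91.75%), with precision above 90% for both classes. Note that this score is measured on a test split drawn from the resampled data, so it is not directly comparable with the scores on the original test split.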

### Comparison (Logistic Regression vs Decision Tree)

In [87]:
best_lr_model = final_model_lr_resampled
best_dtc_model = final_model_dtc_resampled
model_data = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree'],
    'Accuracy': [best_lr_model.score(Xr_test, Yr_test)*100, best_dtc_model.score(Xr_test, Yr_test)*100]
})
model_data

Unnamed: 0,Model,Accuracy
0,Logistic Regression,93.298969
1,Decision Tree,91.752577


In [88]:
fig = px.bar(model_data, x='Model', y='Accuracy', width=500)
fig.update_yaxes(range=[0,100])
fig.show()

Comparing the results above, the Logistic Regression model marginally outperforms the Decision Tree model (93.30% vs. 91.75% accuracy on the resampled test split).

## Risks and Benefits

Assuming we use our best trained model, i.e. the one trained using Logistic Regression, the risk analysis is as follows.

The formula for risk is
    `Risk = Severity * Chances of wrong prediction`

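Let's say the business defines severity based on monthly charges as follows (each range includes its lower bound and excludes its upper bound):
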
| Monthly Charges Range  | Severity |
|------------------------|----------|
| 0-30                   | 0        |
| 30-60                  | 1        |
| 60-90                  | 2        |
| 90-120                 | 3        |
| 120-150                | 4        |
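
As a worked example of the formula, the sketch below maps a monthly charge to its severity level from the table and multiplies it by the chance of a wrong prediction (100 minus the accuracy in percent). The function names are hypothetical; the accuracy value is the Logistic Regression accuracy from the comparison table above.

```python
def severity_for(monthly_charge):
    """Map a monthly charge to the severity level from the table."""
    # Bins are left-closed: [0, 30) -> 0, [30, 60) -> 1, ..., [120, 150) -> 4.
    if not 0 <= monthly_charge < 150:
        raise ValueError("monthly charge outside the 0-150 range covered by the table")
    return int(monthly_charge // 30)

def risk(monthly_charge, accuracy_percent):
    """Risk = Severity * Chances of wrong prediction (in percent)."""
    chance_of_wrong_prediction = 100 - accuracy_percent
    return severity_for(monthly_charge) * chance_of_wrong_prediction

# A customer paying 103.2 per month, scored with a 93.298969% accurate model.
example_risk = risk(103.2, 93.298969)
print(round(example_risk, 6))
```

A customer in the lowest band always gets a risk of 0 under this definition, because the severity factor is 0 regardless of the model's error rate.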

In [145]:
df = pd.read_csv("./final_dataset.csv", index_col=0).sample(50)
# Five severity levels (0-4) for the bins [0, 30), [30, 60), ..., [120, 150) from the table
labels = list(range(5))
df['severity'] = pd.cut(df['MonthlyCharges'], range(0, 151, 30), right=False, labels=labels)
df['severity'] = pd.to_numeric(df['severity'])
df.head()

Unnamed: 0,SeniorCitizen,MonthlyCharges,TotalCharges,Churn,gender_Female,gender_Male,Partner_No,Partner_Yes,Dependents_No,Dependents_Yes,...,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,tenure_group_1 - 12,tenure_group_13 - 24,tenure_group_25 - 36,tenure_group_37 - 48,tenure_group_49 - 60,tenure_group_61 - 72,severity
7039,0,103.2,7362.9,0,1,0,0,1,0,1,...,1,0,0,0,0,0,0,0,1,3
1468,0,92.3,5731.45,0,0,1,1,0,1,0,...,0,0,0,0,0,0,0,0,1,3
6415,0,45.35,2540.1,0,1,0,0,1,1,0,...,0,1,0,0,0,0,0,1,0,1
3534,0,105.2,4822.85,1,1,0,1,0,0,1,...,0,0,0,0,0,0,1,0,0,3
2369,0,99.25,6549.45,0,0,1,1,0,1,0,...,1,0,0,0,0,0,0,0,1,3


The "risk" column gives the risk associated with the application's prediction for a particular customer.

In [164]:
chance_of_wrong_prediction = 100 - model_data[model_data['Model'] == 'Logistic Regression'].Accuracy[0]
df['risk'] = df['severity'].multiply(chance_of_wrong_prediction)
df.head()

Unnamed: 0,SeniorCitizen,MonthlyCharges,TotalCharges,Churn,gender_Female,gender_Male,Partner_No,Partner_Yes,Dependents_No,Dependents_Yes,...,PaymentMethod_Electronic check,PaymentMethod_Mailed check,tenure_group_1 - 12,tenure_group_13 - 24,tenure_group_25 - 36,tenure_group_37 - 48,tenure_group_49 - 60,tenure_group_61 - 72,severity,risk
7039,0,103.2,7362.9,0,1,0,0,1,0,1,...,0,0,0,0,0,0,0,1,3,20.103093
1468,0,92.3,5731.45,0,0,1,1,0,1,0,...,0,0,0,0,0,0,0,1,3,20.103093
6415,0,45.35,2540.1,0,1,0,0,1,1,0,...,1,0,0,0,0,0,1,0,1,6.701031
3534,0,105.2,4822.85,1,1,0,1,0,0,1,...,0,0,0,0,0,1,0,0,3,20.103093
2369,0,99.25,6549.45,0,0,1,1,0,1,0,...,0,0,0,0,0,0,0,1,3,20.103093


In [165]:
# Average risk on the sample set of 50 customers
print("Average risk on the sample set:", df['risk'].mean())

Average risk on the sample set: 12.865979381443296


__Other Risks__

Since the application is based on a machine learning model, the data fed to it must be free of polluting data points such as outliers.

__Benefits of using Churn Prediction:__
- helps companies forecast revenue
- identifies services that are not beneficial to the company
- helps develop strategies to retain high-risk customers
- helps improve existing services to attract new customers