## 1. Introduction

#### **Customer Churn in TELCOs**

Companies usually have a greater focus on customer acquisition and keep retention as a secondary priority. However, it can cost five times more to attract a new customer than it does to retain an existing one. Increasing customer retention rates by 5% can increase profits by 25% to 95%, according to research done by Bain & Company.

_Churn_, also known as customer attrition, is a metric that captures the customers who stop doing business with a company or stop using a particular service. Traditionally, most businesses could only follow this metric, try to understand the reasons behind the churn numbers, and tackle those factors with reactive action plans.


The main goal is to develop a machine learning model capable of predicting customer churn from the available customer data.

<p>I'll be using a sample <a href="https://www.kaggle.com/blastchar/telco-customer-churn">Telco Customer Churn</a> dataset from Kaggle. The structure of this notebook is as follows:</p>
<ul>
<li>First, loading and viewing the dataset.</li>
<li>We will see that the dataset has a mixture of numerical and non-numerical features, that it contains values from different ranges, and that it contains a number of missing entries.</li>
<li>Preprocessing of the dataset to ensure the machine learning model we choose can make good classifications.</li>
<li>After our data is in good shape, exploratory data analysis to build our intuitions.</li>
<li>Finally, building a machine learning model that can predict whether a customer will churn from the service.</li>
</ul>


- `Author - Chinmay Gaikwad`
- `Email - chinnmaygaikwad123@gmail.com`

## 2. Exploratory Data Analysis

### 2.1 Data Load

In [None]:
# Importing numpy, pandas, and the plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Loading dataset
df = pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv',header=0)

In [None]:
# Inspecting data
df.head()

### 2.2 Data Summary

In [None]:
# Inspecting basic information out of columns
df.info()

In [None]:
# Displaying summary statistics
df.describe()

In [None]:
# Creating a list of object data type columns
obj_cols = df.select_dtypes(include='object').columns.tolist()
# Checking the categorical values in the Object columns
def check_value_counts(col_list):
  for col in col_list:
    print('-----------------------------')
    print(round((df[col].value_counts()/df.shape[0])*100,2))
    print('-----------------------------')

check_value_counts(obj_cols)
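
The percentage shares printed above can also be obtained with `value_counts(normalize=True)`. The following is a minimal sketch on a made-up two-column frame; the helper name `value_shares` and the toy values are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Toy frame standing in for the notebook's df (values are illustrative only)
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year"],
    "Churn": ["Yes", "No", "No", "No"],
})

def value_shares(frame, columns):
    """Return {column: percentage share of each category}, rounded to 2 decimals."""
    return {
        col: (frame[col].value_counts(normalize=True) * 100).round(2)
        for col in columns
    }

shares = value_shares(toy, ["Contract", "Churn"])
for col, pct in shares.items():
    print(f"--- {col} ---")
    print(pct)
```

Returning the shares instead of only printing them makes it easier to reuse them later, for example to check the class balance of `Churn`.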

### 2.3 Qualitative Data Analysis

There are a few quantitative features that we will convert into categorical ones before performing classification with a Decision Tree model.

In [None]:
# Mapping 0 and 1 to No and Yes
df['SeniorCitizen'] = df['SeniorCitizen'].map({0:'No',1:'Yes'})

In [None]:
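# Binning the tenure column
cut_labels = ['0-12', '13-24', '25-36', '37-48', '49-60', '61-72']
cut_bins = [0, 12, 24, 36, 48, 60, 72]
# include_lowest=True keeps a tenure of 0 in the first bin instead of leaving it as NaN
df['Tenure Period'] = pd.cut(df['tenure'], bins=cut_bins, labels=cut_labels, include_lowest=True)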
df['Tenure Period'].value_counts()
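
By default `pd.cut` builds right-closed intervals, so a value equal to the lowest bin edge falls outside every bin unless `include_lowest=True` is passed. The sketch below shows this on a handful of made-up tenure values, with the same bins and labels as above.

```python
import pandas as pd

# Illustrative tenure values, including the lower edge 0
tenure = pd.Series([0, 1, 12, 13, 72])
bins = [0, 12, 24, 36, 48, 60, 72]
labels = ['0-12', '13-24', '25-36', '37-48', '49-60', '61-72']

# Default: intervals are (0, 12], (12, 24], ... so 0 is left unbinned (NaN)
default_cut = pd.cut(tenure, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 12]
inclusive_cut = pd.cut(tenure, bins=bins, labels=labels, include_lowest=True)

print(default_cut.isna().sum())   # count of values that fell outside every bin
print(inclusive_cut.tolist())
```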

In [None]:
#Binning the MonthlyCharges column
cut_labels = ['0-20', '21-40', '41-60', '61-80','81-100','101-120']
cut_bins = [0, 20,40,60,80,100,120]
df['MonthlyCharges_Range'] = pd.cut(df['MonthlyCharges'], bins=cut_bins, labels=cut_labels)
df['MonthlyCharges_Range'].value_counts()

In [None]:
# Converting TotalCharges to numeric; entries that cannot be parsed become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'].describe()
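
With `errors='coerce'`, `pd.to_numeric` turns any entry it cannot parse into `NaN` instead of raising an error. The sketch below uses made-up strings, and the blank entry is an assumption about what an unparseable value might look like.

```python
import pandas as pd

# Illustrative raw values: the blank string mimics an empty CSV field
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# Unparseable entries become NaN rather than raising a ValueError
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.isna().sum())
print(numeric.dtype)
```

Any `NaN` values introduced this way carry through the binning step and need to be handled afterwards.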

In [None]:
# Binning the TotalCharges column
cut_labels = ['0-1000', '1001-2000','2001-4000','4001-6000','6001-8000','8001-10000']
cut_bins = [0, 1000,2000,4000,6000,8000,10000]
df['TotalCharges_Range'] = pd.cut(df['TotalCharges'], bins=cut_bins, labels=cut_labels)
df['TotalCharges_Range'].value_counts()

In [None]:
# Dropping columns that are not required
cols_to_drop = ['customerID','MonthlyCharges','tenure','TotalCharges']
df.drop(labels=cols_to_drop,axis=1,inplace=True)

In [None]:
# Sanity checks
df.head(4)

### 2.4 Missing values

In [None]:
# Checking count of null values by the columns
df.isna().sum()

Since the data is categorical, a reasonable strategy is to impute the missing entries with the most frequent value (the mode) of each column.

In [None]:
# Missing values imputation (reassigning instead of calling fillna in place on a column selection)
df['TotalCharges_Range'] = df['TotalCharges_Range'].fillna(df['TotalCharges_Range'].mode()[0])
df['Tenure Period'] = df['Tenure Period'].fillna(df['Tenure Period'].mode()[0])
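
As a minimal sketch of mode imputation on a categorical column, the example below fills one missing entry in a made-up series. The values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Illustrative categorical column with one missing entry
periods = pd.Series(
    pd.Categorical(['0-12', '13-24', '0-12', None, '25-36'],
                   categories=['0-12', '13-24', '25-36'])
)

most_frequent = periods.mode()[0]        # most common category
imputed = periods.fillna(most_frequent)  # returns a new Series

print(most_frequent)
print(imputed.isna().sum())
```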

### 2.5 Label Encoding

In [None]:
# Importing LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Instantiating LabelEncoder
le=LabelEncoder()

# Iterating over the columns and checking their dtypes
for col in df.columns.to_numpy():
    # Encoding only the object and category columns
    if df[col].dtypes in ('object','category'):
        # Using LabelEncoder to do the numeric transformation
        df[col]=le.fit_transform(df[col].astype(str))
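
One side effect worth keeping in mind: `LabelEncoder` assigns codes in sorted string order, so binned ranges are not necessarily encoded in their natural order. The sketch below compares it with ordered category codes, using the `MonthlyCharges_Range` labels. Using `cat.codes` is shown only as a possible alternative, not as what this notebook does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

labels = ['0-20', '21-40', '41-60', '61-80', '81-100', '101-120']
ranges = pd.Series(pd.Categorical(labels, categories=labels, ordered=True))

# LabelEncoder sorts the string labels, so '101-120' lands before '21-40'
le_codes = LabelEncoder().fit_transform(ranges.astype(str).to_numpy())

# Category codes follow the declared bin order instead
ordered_codes = ranges.cat.codes.to_numpy()

print(dict(zip(labels, le_codes.tolist())))
print(dict(zip(labels, ordered_codes.tolist())))
```

A tree splits on thresholds of the encoded value, so an encoding that preserves the bin order may give more interpretable splits.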

In [None]:
# Sanity Check
df.head()

### 2.6 Train Test Split

In [None]:
# Putting feature variable to X
X = df.drop('Churn',axis=1)

# Putting response variable to y
y = df['Churn']

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)
X_train.shape, X_test.shape
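
If the `Churn` classes turn out to be imbalanced, passing `stratify` to `train_test_split` keeps the class proportions similar in both splits. The sketch below runs on synthetic data; the 20% positive rate is a made-up figure, not a property of this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 20% positive class (proportions are made up)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=42, stratify=y_demo
)

# The positive-class share should be close to 0.2 in both splits
print(y_tr.mean(), y_te.mean())
```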

## 3. Model Building

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dt = DecisionTreeClassifier(max_depth=3,random_state=43)
dt.fit(X_train, y_train)

## 4. Visualization

In [None]:
# Installing the required dependencies
!pip install six
!pip install pydotplus
!pip install graphviz

In [None]:
# Importing required packages for visualization
from IPython.display import Image  
from six import StringIO
from sklearn.tree import export_graphviz
import pydotplus, graphviz

In [None]:
# plotting tree with max_depth=3
dot_data = StringIO()  

# class_names follow the sorted label order: 0 = 'No' (Not Churn), 1 = 'Yes' (Churn)
export_graphviz(dt, out_file=dot_data, filled=True, rounded=True,
                feature_names=list(X.columns),
                class_names=['Not Churn', 'Churn'])

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
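
If Graphviz is not available in the environment, a plain-text view of the tree is one alternative. The sketch below uses scikit-learn's `export_text` on synthetic data. The dataset and the feature names `f0` to `f3` are placeholders, not the notebook's features.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with hypothetical feature names, only to make the sketch runnable
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ['f0', 'f1', 'f2', 'f3']

tree = DecisionTreeClassifier(max_depth=3, random_state=43).fit(X_demo, y_demo)

# Plain-text rendering of the split rules, one line per node
rules = export_text(tree, feature_names=feature_names)
print(rules)
```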

In [None]:
# Uncomment below line of code to save the Decision Tree Viz to a pdf file.
#graph.write_pdf("dt_telco_churn.pdf")

## 5. Model Evaluation

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score

In [None]:
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)

In [None]:
print(accuracy_score(y_train, y_train_pred))
confusion_matrix(y_train, y_train_pred)

In [None]:
print(accuracy_score(y_test, y_test_pred))
confusion_matrix(y_test, y_test_pred)

In [None]:
# Let's check the overall accuracy.
trainaccuracy= accuracy_score(y_train, y_train_pred)
testaccuracy= accuracy_score(y_test, y_test_pred)

confusion_TRN = confusion_matrix(y_train, y_train_pred)
confusion_TST = confusion_matrix(y_test, y_test_pred)

In [None]:
TP = confusion_TRN[1,1] # true positive 
TN = confusion_TRN[0,0] # true negatives
FP = confusion_TRN[0,1] # false positives
FN = confusion_TRN[1,0] # false negatives

TP_TST = confusion_TST[1,1] # true positive 
TN_TST = confusion_TST[0,0] # true negatives
FP_TST = confusion_TST[0,1] # false positives
FN_TST = confusion_TST[1,0] # false negatives

trainsensitivity= TP / float(TP+FN)
trainspecificity= TN / float(TN+FP)

testsensitivity= TP_TST / float(TP_TST+FN_TST)
testspecificity= TN_TST / float(TN_TST+FP_TST)

# Let us compare the values obtained for Train & Test:
print('-'*30)
print('On Train Data')
print('-'*30)
print("Accuracy    : {} %".format(round((trainaccuracy*100),2)))
print("Sensitivity : {} %".format(round((trainsensitivity*100),2)))
print("Specificity : {} %".format(round((trainspecificity*100),2)))
print('-'*30)
print('On Test Data')
print('-'*30)
print("Accuracy    : {} %".format(round((testaccuracy*100),2)))
print("Sensitivity : {} %".format(round((testsensitivity*100),2)))
print("Specificity : {} %".format(round((testspecificity*100),2)))
print('-'*30)
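
Sensitivity is the recall of the positive class and specificity is the recall of the negative class, so both can be cross-checked with `recall_score`. This is a minimal sketch on made-up labels, assuming 1 stands for churn and 0 for no churn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 1 = churn, 0 = no churn
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # share of actual churners that were caught
specificity = tn / (tn + fp)   # share of non-churners correctly identified

# recall_score gives the same numbers when pointed at each class in turn
sens_check = recall_score(y_true, y_pred, pos_label=1)
spec_check = recall_score(y_true, y_pred, pos_label=0)

print(sensitivity, specificity)
```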

Decision Trees are simple and intuitive models. However, they are high-variance models: a slight change in the training data can produce a very different tree, and because they tend to overfit, performance on the test data may suffer.

Although our model gives stable results on the `Train` and `Test` data, it has low `Sensitivity` and high `Specificity`. To further improve the performance, we need to do hyperparameter tuning.
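
One possible way to approach the tuning is a grid search scored on recall, which corresponds to sensitivity for the positive class. The sketch below is a hedged illustration on synthetic data. The parameter grid, the choice of `scoring='recall'`, and the dataset are all assumptions, not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn data (not the notebook's dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=42, stratify=y_demo
)

# Hypothetical search grid; the ranges are illustrative, not tuned values
param_grid = {
    'max_depth': [2, 3, 5, 8],
    'min_samples_leaf': [5, 20, 50],
    'class_weight': [None, 'balanced'],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=43),
    param_grid,
    scoring='recall',   # recall on the positive class is the sensitivity
    cv=5,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Optimizing recall alone can reduce specificity, so both metrics should be checked on the held-out test data after tuning.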

## Thank You!