# **Customer Churn - Machine Learning approach**

First, upload your Kaggle credentials JSON file (kaggle.json). Then set up the Colab notebook.

In [1]:
! pip install kaggle
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json



Download the dataset and unzip it.

In [2]:
! kaggle datasets download yeanzc/telco-customer-churn-ibm-dataset
! unzip telco-customer-churn-ibm-dataset.zip

Downloading telco-customer-churn-ibm-dataset.zip to /content
  0% 0.00/1.25M [00:00<?, ?B/s]
100% 1.25M/1.25M [00:00<00:00, 148MB/s]
Archive:  telco-customer-churn-ibm-dataset.zip
  inflating: Telco_customer_churn.xlsx  


Import necessary libraries for the project

In [3]:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt

pd.set_option("display.precision", 3)

Load the downloaded dataset and inspect its shape, column types, and summary statistics.

In [4]:
dataset = pd.read_excel('/content/Telco_customer_churn.xlsx')
dataset.shape

(7043, 33)

In [5]:
dataset.head(10)

Unnamed: 0,CustomerID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude,Gender,...,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Label,Churn Value,Churn Score,CLTV,Churn Reason
0,3668-QPYBK,1,United States,California,Los Angeles,90003,"33.964131, -118.272783",33.964,-118.273,Male,...,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,1,86,3239,Competitor made better offer
1,9237-HQITU,1,United States,California,Los Angeles,90005,"34.059281, -118.30742",34.059,-118.307,Female,...,Month-to-month,Yes,Electronic check,70.7,151.65,Yes,1,67,2701,Moved
2,9305-CDSKC,1,United States,California,Los Angeles,90006,"34.048013, -118.293953",34.048,-118.294,Female,...,Month-to-month,Yes,Electronic check,99.65,820.5,Yes,1,86,5372,Moved
3,7892-POOKP,1,United States,California,Los Angeles,90010,"34.062125, -118.315709",34.062,-118.316,Female,...,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes,1,84,5003,Moved
4,0280-XJGEX,1,United States,California,Los Angeles,90015,"34.039224, -118.266293",34.039,-118.266,Male,...,Month-to-month,Yes,Bank transfer (automatic),103.7,5036.3,Yes,1,89,5340,Competitor had better devices
5,4190-MFLUW,1,United States,California,Los Angeles,90020,"34.066367, -118.309868",34.066,-118.31,Female,...,Month-to-month,No,Credit card (automatic),55.2,528.35,Yes,1,78,5925,Competitor offered higher download speeds
6,8779-QRDMV,1,United States,California,Los Angeles,90022,"34.02381, -118.156582",34.024,-118.157,Male,...,Month-to-month,Yes,Electronic check,39.65,39.65,Yes,1,100,5433,Competitor offered more data
7,1066-JKSGK,1,United States,California,Los Angeles,90024,"34.066303, -118.435479",34.066,-118.435,Male,...,Month-to-month,No,Mailed check,20.15,20.15,Yes,1,92,4832,Competitor made better offer
8,6467-CHFZW,1,United States,California,Los Angeles,90028,"34.099869, -118.326843",34.1,-118.327,Male,...,Month-to-month,Yes,Electronic check,99.35,4749.15,Yes,1,77,5789,Competitor had better devices
9,8665-UTDHZ,1,United States,California,Los Angeles,90029,"34.089953, -118.294824",34.09,-118.295,Male,...,Month-to-month,No,Electronic check,30.2,30.2,Yes,1,97,2915,Competitor had better devices


In [6]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 33 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CustomerID         7043 non-null   object 
 1   Count              7043 non-null   int64  
 2   Country            7043 non-null   object 
 3   State              7043 non-null   object 
 4   City               7043 non-null   object 
 5   Zip Code           7043 non-null   int64  
 6   Lat Long           7043 non-null   object 
 7   Latitude           7043 non-null   float64
 8   Longitude          7043 non-null   float64
 9   Gender             7043 non-null   object 
 10  Senior Citizen     7043 non-null   object 
 11  Partner            7043 non-null   object 
 12  Dependents         7043 non-null   object 
 13  Tenure Months      7043 non-null   int64  
 14  Phone Service      7043 non-null   object 
 15  Multiple Lines     7043 non-null   object 
 16  Internet Service   7043 

In [7]:
dataset.astype('object').describe().transpose()

Unnamed: 0,count,unique,top,freq
CustomerID,7043.0,7043.0,3668-QPYBK,1.0
Count,7043.0,1.0,1,7043.0
Country,7043.0,1.0,United States,7043.0
State,7043.0,1.0,California,7043.0
City,7043.0,1129.0,Los Angeles,305.0
Zip Code,7043.0,1652.0,90003,5.0
Lat Long,7043.0,1652.0,"33.964131, -118.272783",5.0
Latitude,7043.0,1652.0,33.964,5.0
Longitude,7043.0,1651.0,-121.995,8.0
Gender,7043.0,2.0,Male,3555.0


In [8]:
dataset[['Monthly Charges', 'Total Charges','Churn Score', 'CLTV']].describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Monthly Charges,7043.0,64.762,30.09,18.25,35.5,70.35,89.85,118.75
Churn Score,7043.0,58.699,21.525,5.0,40.0,61.0,75.0,100.0
CLTV,7043.0,4400.296,1183.057,2003.0,3469.0,4527.0,5380.5,6500.0


In [9]:
dataset['Churn Label'].value_counts()

No     5174
Yes    1869
Name: Churn Label, dtype: int64
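
The counts above show an imbalanced target. As a quick worked check, the class shares and the accuracy of a trivial majority-class predictor follow directly from those two numbers. This is only a sketch for context on the full 7043-row dataset, and the variable names are illustrative.

```python
# Counts copied from the value_counts() output above.
counts = {"No": 5174, "Yes": 1869}
total = sum(counts.values())

# Share of each class in the full dataset.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = max(counts.values()) / total

print(total)                        # 7043
print(round(shares["Yes"], 3))      # 0.265
print(round(baseline_accuracy, 3))  # 0.735
```

Any classifier trained on this data should be judged against that baseline rather than against 50%.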

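Drop the identifier, location, and redundant columns, then encode the binary features as 0/1. Rows whose service columns read 'No phone service' or 'No internet service' become NaN and are dropped, which leaves 4835 of the 7043 customers.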

In [10]:
df = dataset.drop(columns=['CustomerID', 'Count', 'Country', 'State', 'City', 'Zip Code',
       'Lat Long', 'Latitude', 'Longitude','Churn Value','Churn Reason','Phone Service'])
df.replace({'Yes': 1, 'No': 0, 'No phone service': np.nan, 'No internet service': np.nan}, inplace=True)
df.replace({'Female': 1, 'Male': 0},inplace = True)
df.dropna(inplace=True)
df.shape

(4835, 21)
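
To make the effect of the mapping concrete, here is a minimal sketch on a hypothetical three-row sample (the rows are invented for illustration, not taken from the dataset). It assumes the text columns are held as `object` dtype, as in the `info()` output above.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample; dtype=object mirrors how the text columns are stored.
toy = pd.DataFrame(
    {
        "Partner": ["Yes", "No", "Yes"],
        "Online Security": ["Yes", "No internet service", "No"],
    },
    dtype=object,
)

# Same idea as the cell above: Yes/No become 1/0, "no service" becomes missing.
mapping = {"Yes": 1, "No": 0, "No internet service": np.nan}
toy_encoded = toy.replace(mapping)

# dropna() removes every row that has at least one missing value.
toy_clean = toy_encoded.dropna()

print(len(toy), "->", len(toy_clean))
```

The middle row disappears entirely, which is why the real dataset shrinks so much: every customer without internet or phone service is removed, not just the affected cells.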

In [11]:
df.head()

Unnamed: 0,Gender,Senior Citizen,Partner,Dependents,Tenure Months,Multiple Lines,Internet Service,Online Security,Online Backup,Device Protection,...,Streaming TV,Streaming Movies,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Label,Churn Score,CLTV
0,0,0,0,0,2,0.0,DSL,1.0,1.0,0.0,...,0.0,0.0,Month-to-month,1,Mailed check,53.85,108.15,1,86,3239
1,1,0,0,1,2,0.0,Fiber optic,0.0,0.0,0.0,...,0.0,0.0,Month-to-month,1,Electronic check,70.7,151.65,1,67,2701
2,1,0,0,1,8,1.0,Fiber optic,0.0,0.0,1.0,...,1.0,1.0,Month-to-month,1,Electronic check,99.65,820.5,1,86,5372
3,1,0,1,1,28,1.0,Fiber optic,0.0,0.0,1.0,...,1.0,1.0,Month-to-month,1,Electronic check,104.8,3046.05,1,84,5003
4,0,0,0,1,49,1.0,Fiber optic,0.0,1.0,1.0,...,1.0,1.0,Month-to-month,1,Bank transfer (automatic),103.7,5036.3,1,89,5340


In [12]:
df = pd.concat([df,pd.get_dummies(df[['Internet Service','Contract','Payment Method']], dummy_na=False)],axis=1).drop(['Internet Service','Contract','Payment Method'],axis=1)
df = df.apply(lambda x: pd.to_numeric(x, errors = 'coerce')).dropna()
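
The two steps above, one-hot encoding and numeric coercion, can be sketched on a small hypothetical frame. The sample values are invented; the blank string stands in for the kind of unparseable entry that would explain why Total Charges is stored as `object`.

```python
import pandas as pd

# Hypothetical sample: one categorical column and one numeric column stored as text.
toy = pd.DataFrame(
    {
        "Contract": ["Month-to-month", "Two year", "One year"],
        "Total Charges": ["108.15", " ", "820.5"],
    }
)

# One-hot encode the categorical column; dtype=int yields 0/1 instead of booleans.
dummies = pd.get_dummies(toy[["Contract"]], dtype=int)

# Values that cannot be parsed as numbers (here a blank string) become NaN.
charges = pd.to_numeric(toy["Total Charges"], errors="coerce")

# Rows with a coerced NaN are dropped, as in the cell above.
encoded = pd.concat([dummies, charges], axis=1).dropna()

print(list(dummies.columns))
print(len(toy), "->", len(encoded))
```

Note that `errors='coerce'` hides bad values silently, so it is worth counting the NaNs it produces before dropping them.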

In [13]:
df.head()

Unnamed: 0,Gender,Senior Citizen,Partner,Dependents,Tenure Months,Multiple Lines,Online Security,Online Backup,Device Protection,Tech Support,...,CLTV,Internet Service_DSL,Internet Service_Fiber optic,Contract_Month-to-month,Contract_One year,Contract_Two year,Payment Method_Bank transfer (automatic),Payment Method_Credit card (automatic),Payment Method_Electronic check,Payment Method_Mailed check
0,0,0,0,0,2,0.0,1.0,1.0,0.0,0.0,...,3239,1,0,1,0,0,0,0,0,1
1,1,0,0,1,2,0.0,0.0,0.0,0.0,0.0,...,2701,0,1,1,0,0,0,0,1,0
2,1,0,0,1,8,1.0,0.0,0.0,1.0,0.0,...,5372,0,1,1,0,0,0,0,1,0
3,1,0,1,1,28,1.0,0.0,0.0,1.0,1.0,...,5003,0,1,1,0,0,0,0,1,0
4,0,0,0,1,49,1.0,0.0,1.0,1.0,0.0,...,5340,0,1,1,0,0,1,0,0,0


In [14]:
corr = df.corr()
corr.style.background_gradient(cmap='coolwarm').format(precision=2)

  


Unnamed: 0,Gender,Senior Citizen,Partner,Dependents,Tenure Months,Multiple Lines,Online Security,Online Backup,Device Protection,Tech Support,Streaming TV,Streaming Movies,Paperless Billing,Monthly Charges,Total Charges,Churn Label,Churn Score,CLTV,Internet Service_DSL,Internet Service_Fiber optic,Contract_Month-to-month,Contract_One year,Contract_Two year,Payment Method_Bank transfer (automatic),Payment Method_Credit card (automatic),Payment Method_Electronic check,Payment Method_Mailed check
Gender,1.0,0.01,-0.03,-0.01,-0.01,0.0,0.02,0.01,0.0,0.01,0.01,0.0,0.02,0.01,-0.0,0.01,0.01,0.01,-0.01,0.01,-0.0,0.01,-0.01,0.02,0.01,-0.01,-0.01
Senior Citizen,0.01,1.0,0.02,-0.16,0.0,0.11,-0.1,-0.01,-0.01,-0.12,0.03,0.05,0.11,0.14,0.03,0.11,0.07,-0.01,-0.21,0.21,0.11,-0.05,-0.1,-0.02,-0.03,0.13,-0.11
Partner,-0.03,0.02,1.0,0.32,0.4,0.15,0.18,0.18,0.19,0.15,0.15,0.15,-0.01,0.19,0.38,-0.17,-0.13,0.16,0.01,-0.01,-0.29,0.12,0.24,0.11,0.09,-0.11,-0.08
Dependents,-0.01,-0.16,0.32,1.0,0.15,-0.0,0.14,0.1,0.06,0.11,0.03,0.01,-0.07,-0.01,0.13,-0.25,-0.19,0.09,0.11,-0.11,-0.15,0.05,0.13,0.07,0.05,-0.11,0.02
Tenure Months,-0.01,0.0,0.4,0.15,1.0,0.37,0.39,0.44,0.43,0.38,0.33,0.34,-0.02,0.44,0.96,-0.4,-0.27,0.41,0.01,-0.01,-0.7,0.28,0.58,0.23,0.25,-0.27,-0.18
Multiple Lines,0.0,0.11,0.15,-0.0,0.37,1.0,0.05,0.16,0.17,0.06,0.22,0.22,0.09,0.46,0.42,-0.02,-0.0,0.14,-0.22,0.22,-0.15,0.03,0.16,0.07,0.07,0.01,-0.17
Online Security,0.02,-0.1,0.18,0.14,0.39,0.05,1.0,0.19,0.17,0.27,0.05,0.06,-0.13,0.09,0.35,-0.28,-0.19,0.17,0.24,-0.24,-0.39,0.14,0.34,0.1,0.15,-0.23,0.03
Online Backup,0.01,-0.01,0.18,0.1,0.44,0.16,0.19,1.0,0.19,0.21,0.15,0.14,-0.0,0.3,0.45,-0.2,-0.13,0.18,0.02,-0.02,-0.32,0.14,0.26,0.11,0.12,-0.14,-0.08
Device Protection,0.0,-0.01,0.19,0.06,0.43,0.17,0.17,0.19,1.0,0.24,0.28,0.29,-0.02,0.39,0.46,-0.18,-0.13,0.15,-0.0,0.0,-0.38,0.14,0.33,0.1,0.14,-0.13,-0.09
Tech Support,0.01,-0.12,0.15,0.11,0.38,0.06,0.27,0.21,0.24,1.0,0.18,0.18,-0.07,0.18,0.37,-0.27,-0.18,0.15,0.23,-0.23,-0.43,0.14,0.4,0.12,0.15,-0.24,0.02
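
A full correlation matrix is hard to scan. One possible follow-up, sketched here on hypothetical toy data rather than on `df`, is to pull out only the target column and rank the features by absolute correlation.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Telco dataset.
toy = pd.DataFrame(
    {
        "tenure": [1, 2, 3, 4],
        "charges": [10, 20, 30, 40],
        "churn": [1, 1, 0, 0],
    }
)

corr_toy = toy.corr()  # Pearson correlation by default

# Correlation of every feature with the target, strongest absolute value first.
target_corr = corr_toy["churn"].drop("churn")
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)

print(ranked)
```

On the notebook's data the same pattern would read `corr['Churn Label'].drop('Churn Label')`.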


Employ feature selection. We choose the 10 highest-scoring features using SelectKBest with the chi-squared score function.

In [15]:
from sklearn.feature_selection import SelectKBest, chi2

x = df.loc[:, df.columns != 'Churn Label']
y = df['Churn Label']

bestf = SelectKBest(score_func = chi2,k=10)
fit = bestf.fit(x,y)
dfscores = pd.DataFrame(fit.scores_)
featureScores = pd.concat([pd.DataFrame(x.columns),dfscores],axis = 1)
featureScores.columns = ['Specs','Score']
featureScores.nlargest(10, 'Score')

Unnamed: 0,Specs,Score
14,Total Charges,1139000.0
16,CLTV,42150.0
15,Churn Score,18180.0
4,Tenure Months,14310.0
21,Contract_Two year,325.4
19,Contract_Month-to-month,277.3
3,Dependents,242.7
6,Online Security,240.4
9,Tech Support,222.7
17,Internet Service_DSL,206.0
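
The four top scores are orders of magnitude larger than the rest. A likely contributing factor is that scikit-learn's `chi2` statistic grows with the magnitude of a feature, so large-valued columns such as Total Charges are favoured over 0/1 columns. The sketch below illustrates that property on hypothetical data; it is not a claim about how much of the ranking it explains.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical non-negative feature and binary target.
x_small = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

score_small, _ = chi2(x_small, y_toy)

# Same information, expressed in units 100 times larger.
score_large, _ = chi2(x_small * 100, y_toy)

print(score_large[0] / score_small[0])
```

Rescaling all features to a common range before scoring (for example with `MinMaxScaler`) is one way to make the scores more comparable; whether it changes the selected set here would need to be checked.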


In [16]:
f = featureScores.nlargest(10, 'Score')
#X = df[list(f.Specs)].drop('Churn Score',axis=1)
X = df[list(f.Specs)]
y = df['Churn Label']

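The Churn Score feature is highly correlated with the dependent variable, so the intent is to drop it and keep the remaining nine SelectKBest features. As written, however, the cell above leaves that drop commented out, so X still contains Churn Score. We apply some basic classification algorithms first.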
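
A minimal sketch of selecting the nine remaining features, using the scores printed in the SelectKBest output above. The names `feature_scores` and `selected` are illustrative, not part of the notebook.

```python
import pandas as pd

# Scores copied from the SelectKBest output above.
feature_scores = pd.DataFrame(
    {
        "Specs": ["Total Charges", "CLTV", "Churn Score", "Tenure Months",
                  "Contract_Two year", "Contract_Month-to-month", "Dependents",
                  "Online Security", "Tech Support", "Internet Service_DSL"],
        "Score": [1139000.0, 42150.0, 18180.0, 14310.0, 325.4,
                  277.3, 242.7, 240.4, 222.7, 206.0],
    }
)

top10 = feature_scores.nlargest(10, "Score")["Specs"].tolist()

# Exclude the feature that the text flags as too correlated with the target.
selected = [name for name in top10 if name != "Churn Score"]

print(len(selected))  # 9
```

In the notebook, `X = df[selected]` would then build the feature matrix without Churn Score.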

In [18]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

names = ["Logistic Regression",
    "Nearest Neighbors",
    "Linear SVM",
    "RBF SVM",
    "Decision Tree",
    "Naive Bayes",
]

classifiers = [LogisticRegression(solver='lbfgs', max_iter=500),
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    DecisionTreeClassifier(max_depth=5),
    GaussianNB(),
]

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

clf_scores = []

for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    clf_scores.append(score)



In [19]:
clfScores = pd.DataFrame(list(zip(names,clf_scores)))
clfScores.columns = ['Classifier','Score']
clfScores.style.hide(axis="index")

Classifier,Score
Logistic Regression,0.888
Nearest Neighbors,0.72
Linear SVM,0.91
RBF SVM,0.642
Decision Tree,0.897
Naive Bayes,0.852
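
Nearest Neighbors and the RBF SVM score well below the other models. Both are distance-based, so unscaled features with very different ranges (Total Charges versus 0/1 flags) may be part of the reason. The sketch below shows one way to test that idea with a `StandardScaler` pipeline; it runs on synthetic stand-in data, so its scores say nothing about the Telco results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; the first column is inflated to mimic a
# large-valued feature such as Total Charges.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same RBF SVM settings as above, without and with standardization.
raw_svm = SVC(gamma=2, C=1).fit(X_tr, y_tr)
scaled_svm = make_pipeline(StandardScaler(), SVC(gamma=2, C=1)).fit(X_tr, y_tr)

raw_score = raw_svm.score(X_te, y_te)
scaled_score = scaled_svm.score(X_te, y_te)
print("raw:   ", raw_score)
print("scaled:", scaled_score)
```

Putting the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking test-set statistics.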


Then we apply ensemble methods.

**Random Forest** classifier with cross-validation.

In [21]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

clf = RandomForestClassifier(n_estimators=50, max_depth=5,
    max_features = 0.6)

scores = cross_val_score(clf, X, y, cv=5)
scores.mean()


0.9124591862733133
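
A single mean hides how much the folds disagree, and the forest above has no fixed seed, so the figure will vary between runs. A hedged sketch of a more reproducible variant follows, on synthetic stand-in data. With `cv=5` and a classifier, `cross_val_score` already uses stratified folds; the explicit `StratifiedKFold` only adds shuffling with a fixed seed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for X and y.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=9, weights=[0.65, 0.35], random_state=0
)

# Same hyperparameters as above, plus a fixed seed for repeatability.
forest = RandomForestClassifier(
    n_estimators=50, max_depth=5, max_features=0.6, random_state=0
)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = cross_val_score(forest, X_demo, y_demo, cv=folds)
print(f"{cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```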

In [22]:
from sklearn.metrics import classification_report

# classification_report assigns names in sorted label order: 0 (no churn), then 1 (churn)
target_names=['no','yes']

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(n_estimators=50, max_depth=5,
    max_features = 0.6)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

          no       0.95      0.93      0.94       950
         yes       0.86      0.91      0.88       500

    accuracy                           0.92      1450
   macro avg       0.91      0.92      0.91      1450
weighted avg       0.92      0.92      0.92      1450
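
For reference, the per-class figures in the report come from the confusion matrix. The sketch below works through precision and recall for the churn class on hypothetical labels. scikit-learn orders labels by sorted value (0, then 1), and `target_names` is applied in that same order.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 0 = customer stayed, 1 = customer churned.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true labels, columns are predictions, both in sorted order (0, 1).
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_churn = tp / (tp + fp)  # of predicted churners, how many churned
recall_churn = tp / (tp + fn)     # of actual churners, how many were found

print(tn, fp, fn, tp)             # 4 1 1 2
print(precision_churn, recall_churn)
```

For churn work, recall on the positive class is often the number to watch, since a missed churner is usually the costlier mistake.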



**AdaBoost** classifier with cross-validation.

In [23]:
from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(n_estimators=50)
scores = cross_val_score(clf, X, y, cv=5)
scores.mean()

0.9163907926373642

In [26]:
from sklearn.metrics import classification_report

# classification_report assigns names in sorted label order: 0 (no churn), then 1 (churn)
target_names=['no','yes']

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

clf = AdaBoostClassifier(n_estimators=50)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

          no       0.95      0.92      0.93       950
         yes       0.86      0.90      0.88       500

    accuracy                           0.91      1450
   macro avg       0.90      0.91      0.91      1450
weighted avg       0.92      0.91      0.91      1450



**Gradient Boosting** classifier, scored on a hold-out test split.

In [25]:
from sklearn.ensemble import GradientBoostingClassifier

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
     max_depth=1, random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test)

0.9186206896551724

In [28]:
from sklearn.metrics import classification_report

# classification_report assigns names in sorted label order: 0 (no churn), then 1 (churn)
target_names=['no','yes']

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
     max_depth=1, random_state=0)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

          no       0.94      0.93      0.94       950
         yes       0.88      0.89      0.88       500

    accuracy                           0.92      1450
   macro avg       0.91      0.91      0.91      1450
weighted avg       0.92      0.92      0.92      1450
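
The three ensembles above are scored in slightly different ways. One way to put them on equal footing, sketched here on synthetic stand-in data with the same hyperparameters, is to run all of them through the same 5-fold cross-validation. The `random_state` values are an addition for repeatability.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X and y.
X_demo, y_demo = make_classification(n_samples=400, n_features=9, random_state=0)

ensembles = {
    "Random Forest": RandomForestClassifier(
        n_estimators=50, max_depth=5, max_features=0.6, random_state=0
    ),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0
    ),
}

# Mean 5-fold accuracy per model, computed on identical folds.
cv_means = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in ensembles.items()
}

for name, mean in cv_means.items():
    print(f"{name}: {mean:.3f}")
```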

