# DATA DESCRIPTION 
The dataset was obtained from Kaggle, which contains around 7000 rows and 21 columns with 17 categorical columns, and 3 numerical columns, and 1 label column.

customerID: A unique ID that identifies each customer.

gender: The customer’s gender: Female,Male

SeniorCitizen: Indicates if the customer is 65 or older: Yes, No

Partner: Indicates if the customer has partner: Yes, No

Dependents: Indicates if the customer lives with any dependents: Yes, No. Dependents could be children, parents, grandparents, etc.

tenure: Indicates the total amount of months that the customer has been with the company.

PhoneService: Indicates if the customer subscribes to home phone service with the company: Yes, No

Multiple Lines: Indicates if the customer subscribes to multiple telephone lines with the company: Yes, No, No phone service

Internet Service: Indicates if the customer subscribes to Internet service with the company: No, DSL, Fiber optic, No

Online Security: Indicates if the customer subscribes to an additional online security service provided by the company: Yes, No, No internet service

OnlineBackup: Indicates if the customer subscribes to an additional online backup service provided by the company: Yes, No, No internet service

DeviceProtection: Indicates if the customer subscribes to an additional device protection plan for their Internet equipment provided by the company: Yes, No, No internet service

TechSupport: Indicates if the customer subscribes to an additional technical support plan from the company with reduced wait times: Yes, No, No internet service

StreamingTV: Indicates if the customer uses their Internet service to stream television programing from a third party provider: Yes, No, No internet service

Streaming Movies: Indicates if the customer uses their Internet service to stream movies from a third party provider: Yes, No, No internet service

Contract: Indicates the customer’s current contract type: Month-to-Month, One Year, Two Year.

Paperless Billing: Indicates if the customer has chosen paperless billing: Yes, No

Payment Method: Indicates how the customer pays their bill: Bank transfer, Credit card, Electronic check, Mailed Check

Monthly Charge: Indicates the customer’s current total monthly charge for all their services from the company.

Total Charges: Indicates the customer’s total charges, calculated to the end of the quarter specified above.

Churn: Yes = the customer left the company this quarter. No = the customer remained with the company. Directly related to Churn Value.

# Data Prepration

Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
plt.rcParams['figure.figsize'] = (10.0, 8.0)
import seaborn as sns

In [2]:
df = pd.read_csv('Tel_Customer_Churn_Dataset.csv')

In [3]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


# Data Exploration

In [4]:
df.shape

(7043, 21)

In [5]:
# Numeric Columns
df.describe(include = [np.number])

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges
count,7043.0,7043.0,7043.0
mean,0.162147,32.371149,64.761692
std,0.368612,24.559481,30.090047
min,0.0,0.0,18.25
25%,0.0,9.0,35.5
50%,0.0,29.0,70.35
75%,0.0,55.0,89.85
max,1.0,72.0,118.75


In [6]:
# Object Columns
df.describe(include = [np.object])

Unnamed: 0,customerID,gender,Partner,Dependents,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,TotalCharges,Churn
count,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043.0,7043
unique,7043,2,2,2,2,3,3,3,3,3,3,3,3,3,2,4,6531.0,2
top,7590-VHVEG,Male,No,No,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,,No
freq,1,3555,3641,4933,6361,3390,3096,3498,3088,3095,3473,2810,2785,3875,4171,2365,11.0,5174


In [7]:
# Checking for Null Values
df.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

# Data Cleaning & Preprocessing

In [8]:
# As there is Unique ID for each customer, hence no role would this column play in deciding output, hence dropping ID colummn

In [9]:
df.drop('customerID', axis = 1, inplace = True)

In [10]:
# Total charge seems numeric in nature, but it is objective type, hence convert this into Numeric
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors = 'coerce')

Above syntax will convert object into numeric, and places where other then numeric value present, 
it will replace that value with Nan

In [11]:
# To find number of Null in TotalCharges Column
print("Total Null Values: {}".format(df['TotalCharges'].isnull().sum()))
print("% of Missing values: {}".format(df['TotalCharges'].isnull().sum() * 100 / df.shape[0]))

Total Null Values: 11
% of Missing values: 0.15618344455487718


As % is too small (Less than 10%) hence dropping them makes sense

In [12]:
df.dropna(how = 'any', inplace = True)

In [13]:
# Checking data again

df.shape

(7032, 20)

In [14]:
# Identify Objective variables and find its relation with Output
object = [feature for feature in df.columns if df[feature].dtypes == "O"]

In [15]:
# Identify distribution of Output
df['Churn'].value_counts()

No     5163
Yes    1869
Name: Churn, dtype: int64

It is Imbalanced dataset, we will handle this during model Building stage

In [16]:
for value in object:
    print(value)
    print(df[value].value_counts())
    print(pd.crosstab(df[value], df['Churn']))

gender
Male      3549
Female    3483
Name: gender, dtype: int64
Churn     No  Yes
gender           
Female  2544  939
Male    2619  930
Partner
No     3639
Yes    3393
Name: Partner, dtype: int64
Churn      No   Yes
Partner            
No       2439  1200
Yes      2724   669
Dependents
No     4933
Yes    2099
Name: Dependents, dtype: int64
Churn         No   Yes
Dependents            
No          3390  1543
Yes         1773   326
PhoneService
Yes    6352
No      680
Name: PhoneService, dtype: int64
Churn           No   Yes
PhoneService            
No             510   170
Yes           4653  1699
MultipleLines
No                  3385
Yes                 2967
No phone service     680
Name: MultipleLines, dtype: int64
Churn               No  Yes
MultipleLines              
No                2536  849
No phone service   510  170
Yes               2117  850
InternetService
Fiber optic    3096
DSL            2416
No             1520
Name: InternetService, dtype: int64
Churn              No

Columns like MultipleLines has three class, No , Yes and 'No Phone Service'. We need to convert "No Phone Service" into No class. We Will use where function from Numpy

In [17]:
df['MultipleLines'] = np.where(df['MultipleLines'] != "Yes", "No", "Yes")
df['OnlineSecurity'] = np.where(df['OnlineSecurity'] != "Yes", "No", "Yes")
df['OnlineBackup'] = np.where(df['OnlineBackup'] != "Yes", "No", "Yes")
df['DeviceProtection'] = np.where(df['DeviceProtection'] != "Yes", "No", "Yes")
df['TechSupport'] = np.where(df['TechSupport'] != "Yes", "No", "Yes")
df['StreamingTV'] = np.where(df['StreamingTV'] != "Yes", "No", "Yes")
df['StreamingMovies'] = np.where(df['StreamingMovies'] != "Yes", "No", "Yes")

In [18]:
#verifying above syntax
df['MultipleLines'].value_counts()

No     4065
Yes    2967
Name: MultipleLines, dtype: int64

In [19]:
# Using Label Encoder, we will convert all above object into numeric

from sklearn.preprocessing import LabelEncoder

lb = LabelEncoder()
for value in object:
    df[value] = lb.fit_transform(df[value])

In [20]:
# Separating Dependent and Independent variables

x = df.iloc[:,:-1]
y = df.iloc[:,-1]

In [21]:
# Using standard scaler, lets convert each numeric variable into similar numeric range
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x = sc.fit_transform(x)

x = pd.DataFrame(x)

In [22]:
# Split data into Train test 

from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, stratify = y, test_size = 0.2, random_state = 0)

In [23]:
# We will use StratifiedCV to verify output,
# This will iterate each model 5 times, to generate output accuracy

In [24]:
from sklearn.model_selection import StratifiedKFold

# Lets define a function to carryout StratifiedCV

def skfold_cv(x, y, algo, param, n_jobs = -1):
    skf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 0)
    
    for i, j in skf.split(x, y):
        xtrain, xtest = x.iloc[i, :], x.iloc[j, :]
        ytrain, ytest = y.iloc[i], y.iloc[j]
        
        model = algo(**param)
        model.fit(xtrain, ytrain)
        ypred = model.predict(xtest)
        print(accuracy_score(ytest, ypred))
        print(confusion_matrix(ytest, ypred))

In [25]:
# Lets call LogisticRegression and RandomForestClassifier to carryout classification

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [26]:
# Lets train basic model without any hyper parameter tuning

log_param = {}
logit_model = skfold_cv(x, y, LogisticRegression, log_param)

0.8002842928216063
[[914 119]
 [162 212]]
0.8116560056858564
[[935  98]
 [167 207]]
0.798719772403983
[[926 107]
 [176 197]]
0.7887624466571835
[[903 129]
 [168 206]]
0.8044096728307255
[[928 104]
 [171 203]]


In [27]:
#Now lets call RandomizedGridCV to carryout hyperparameter tuning

from sklearn.model_selection import RandomizedSearchCV

In [28]:
param_grid = {'penalty' : ['l1', 'l2'],
             'C' : [0.001, 0.01, 0.1, 1, 10]}

log_clf = LogisticRegression(n_jobs = -1, random_state = 0)

model = RandomizedSearchCV(estimator = log_clf, 
                          param_distributions = param_grid,
                          scoring = 'accuracy', 
                          verbose = 10,
                          n_jobs = -1,
                          cv = StratifiedKFold(5, shuffle = True))

model.fit(x, y)

print(f'Best score: {model.best_score_}')
print('Best parameters set:')

best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print(f'\t{param_name}: {best_parameters[param_name]}')

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best score: 0.8019062379627974
Best parameters set:
	C: 1
	penalty: l2


We have best parameters from above combination of  paramaters,

Best parameters set:

C: 0.1
    
penalty: l2

In [29]:
# Using above parameter lets train model and find accuracy

log_param = {'penalty':'l2', 'C' : 0.1}
logit_model = skfold_cv(x, y, LogisticRegression, log_param)

0.8024164889836531
[[921 112]
 [166 208]]
0.8137882018479033
[[937  96]
 [166 208]]
0.7994310099573257
[[930 103]
 [179 194]]
0.7908961593172119
[[908 124]
 [170 204]]
0.8065433854907539
[[934  98]
 [174 200]]


No Major change in accuracy, may be imbalanced data is culprit


Lets now use Randomforest classifier to classify data

In [30]:
rf_param = {}
rf_model = skfold_cv(x, y, RandomForestClassifier, rf_param)

0.7938877043354655
[[932 101]
 [189 185]]
0.8009950248756219
[[924 109]
 [171 203]]
0.8029871977240398
[[946  87]
 [190 183]]
0.783072546230441
[[923 109]
 [196 178]]
0.7880512091038406
[[938  94]
 [204 170]]


In [31]:
# Lets do some hyperparameter tunning to find best parameter to train Randomforest

param_grid = {'n_estimators' : [100, 200, 300, 400], 
             'max_depth' : [2,5,7,15],
             'criterion' : ['gini', 'entropy'], 
             'min_samples_split' : [2,5,10,20,100], 
             'min_samples_leaf' : [2,5,10], 
             'max_features' : ['log2', 'sqrt', 'None']}

rf_clf = RandomForestClassifier(n_jobs = -1, random_state = 0)

model = RandomizedSearchCV(estimator = rf_clf,
                          param_distributions = param_grid,
                          scoring = 'neg_log_loss',
                          verbose = 10,
                          n_jobs = 1,
                          cv = StratifiedKFold(5, shuffle = True))

model.fit(x, y)

print(f'Best score: {model.best_score_}')
print('Best parameters set:')

best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print(f'\t{param_name}: {best_parameters[param_name]}')

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5; 1/10] START criterion=entropy, max_depth=7, max_features=log2, min_samples_leaf=2, min_samples_split=20, n_estimators=200
[CV 1/5; 1/10] END criterion=entropy, max_depth=7, max_features=log2, min_samples_leaf=2, min_samples_split=20, n_estimators=200;, score=-0.411 total time=   0.5s
[CV 2/5; 1/10] START criterion=entropy, max_depth=7, max_features=log2, min_samples_leaf=2, min_samples_split=20, n_estimators=200
[CV 2/5; 1/10] END criterion=entropy, max_depth=7, max_features=log2, min_samples_leaf=2, min_samples_split=20, n_estimators=200;, score=-0.428 total time=   0.3s
[CV 3/5; 1/10] START criterion=entropy, max_depth=7, max_features=log2, min_samples_leaf=2, min_samples_split=20, n_estimators=200
[CV 3/5; 1/10] END criterion=entropy, max_depth=7, max_features=log2, min_samples_leaf=2, min_samples_split=20, n_estimators=200;, score=-0.387 total time=   0.3s
[CV 4/5; 1/10] START criterion=entropy, max_depth=7, max_

Ok, After so many iterations, we finally have best parameters to train Randomforest classifier

Best parameters set:

	criterion: entropy
    
	max_depth: 15
    
	max_features: sqrt
    
	min_samples_leaf: 10
    
	min_samples_split: 2
    
	n_estimators: 300
    

Let's verify these data using GridserchCV, Im not trying to be greedy, but just cant trust Randomseach :)

Now we will use values of parameters in viscinity of above values specified by Randomsearchcv

In [32]:
param_grid = {'criterion': ['entropy'],
             'max_depth' : [7,10,15,20],
             'max_features' : ['sqrt'],
             'min_samples_leaf' : [5, 8,10,12, 15],
             'min_samples_split' : [2,3,4,5,6,10],
             'n_estimators' : [100,200,300,400]}

In [33]:
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier()
gsearch = GridSearchCV(estimator=rf, param_grid = param_grid, cv = 3, n_jobs = -1)

In [34]:
# This will take lot of time

gsearch.fit(xtrain, ytrain)

GridSearchCV(cv=3, estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'criterion': ['entropy'], 'max_depth': [7, 10, 15, 20],
                         'max_features': ['sqrt'],
                         'min_samples_leaf': [5, 8, 10, 12, 15],
                         'min_samples_split': [2, 3, 4, 5, 6, 10],
                         'n_estimators': [100, 200, 300, 400]})


Please finish your breakfast, lunch and dinner, before you have final answer :)

hurray we have value, After a decade long wait

In [35]:
gsearch.best_estimator_

RandomForestClassifier(criterion='entropy', max_depth=20, max_features='sqrt',
                       min_samples_leaf=10, min_samples_split=10)

We should have trusted Randomizedserch, we have approximately same answer for each parameter :(

In [36]:
clf_final = RandomForestClassifier(criterion='entropy',
                                   max_depth=20,
                                   max_features='sqrt',
                                   min_samples_leaf=12,
                                   min_samples_split=3,
                                   n_estimators = 300)
clf_final.fit(xtrain, ytrain)
clf_pred = clf_final.predict(xtest)
print(accuracy_score(ytest, clf_pred))
print(confusion_matrix(ytest, clf_pred))

0.8017057569296375
[[945  88]
 [191 183]]


Wow, this is frustrating, after so many iterations, no measurable change in accuracy, 

It has to be imbalance present in dataset

Lets tackle that

Lets use SMOTETomek from imblearn library

You guys, if following this code, Please comment full form of SMOTETomek, be honest dont google it :)

In [37]:
from imblearn.combine import SMOTETomek
from collections import Counter
os=SMOTETomek(0.8)
xover,yover=os.fit_resample(x,y)
print("The number of classes before fit {}".format(Counter(y)))
print("The number of classes after fit {}".format(Counter(yover)))

The number of classes before fit Counter({0: 5163, 1: 1869})
The number of classes after fit Counter({0: 4848, 1: 3815})


So we increased minority class to 80% of majority class 

In [38]:
# lets train basic Randomforest classifier to get notion of how effective this imbalance is

rf_param = {}
rf_model = skfold_cv(xover, yover, RandomForestClassifier, rf_param)

0.8661281015579919
[[852 118]
 [114 649]]
0.8649740334679746
[[862 108]
 [126 637]]
0.8672821696480092
[[857 113]
 [117 646]]
0.8556581986143187
[[862 107]
 [143 620]]
0.8654734411085451
[[862 107]
 [126 637]]


That is clear now, we have found culprit, it is imbalanceness of data

Now, lets do some hyperparameter tuning

In [39]:
param_grid = {'n_estimators' : [100, 200, 300, 400], 
             'max_depth' : [2,5,7,15],
             'criterion' : ['gini', 'entropy'], 
             'min_samples_split' : [2,5,10,20,100], 
             'min_samples_leaf' : [2,5,10], 
             'max_features' : ['log2', 'sqrt', 'None']}

rf_clf = RandomForestClassifier(n_jobs = -1, random_state = 0)

model = RandomizedSearchCV(estimator = rf_clf,
                          param_distributions = param_grid,
                          scoring = 'neg_log_loss',
                          verbose = 10,
                          n_jobs = 1,
                          cv = StratifiedKFold(5, shuffle = True))

model.fit(xover, yover)

print(f'Best score: {model.best_score_}')
print('Best parameters set:')

best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print(f'\t{param_name}: {best_parameters[param_name]}')

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5; 1/10] START criterion=entropy, max_depth=7, max_features=None, min_samples_leaf=10, min_samples_split=10, n_estimators=400
[CV 1/5; 1/10] END criterion=entropy, max_depth=7, max_features=None, min_samples_leaf=10, min_samples_split=10, n_estimators=400;, score=nan total time=   0.1s
[CV 2/5; 1/10] START criterion=entropy, max_depth=7, max_features=None, min_samples_leaf=10, min_samples_split=10, n_estimators=400
[CV 2/5; 1/10] END criterion=entropy, max_depth=7, max_features=None, min_samples_leaf=10, min_samples_split=10, n_estimators=400;, score=nan total time=   2.5s
[CV 3/5; 1/10] START criterion=entropy, max_depth=7, max_features=None, min_samples_leaf=10, min_samples_split=10, n_estimators=400
[CV 3/5; 1/10] END criterion=entropy, max_depth=7, max_features=None, min_samples_leaf=10, min_samples_split=10, n_estimators=400;, score=nan total time=   2.9s
[CV 4/5; 1/10] START criterion=entropy, max_depth=7, max_fea

No Gridsearch now, I am not cruel :)

In [40]:
rf_params = {'n_estimators' : 400,
            'max_depth': 20,
            'max_features': 'log2',
            'criterion': 'entropy',
            'min_samples_leaf': 5,
            'min_samples_split': 20}

rf_model = skfold_cv(xover, yover, RandomForestClassifier, rf_params)

0.842469705712637
[[822 148]
 [125 638]]
0.8465089440276976
[[828 142]
 [124 639]]
0.8395845354875938
[[829 141]
 [137 626]]
0.8400692840646651
[[832 137]
 [140 623]]
0.8464203233256351
[[842 127]
 [139 624]]


Now, its upto you guyz, i will share new Notebook with Xgboost, Catboost using Hyperopt


Thanks to all of you and you should be proud of you, if you are reading this line, please make sure you follow alongwith me on your system