<h1> Customer Churn Prediction </h1>
<p> Acquiring new customers is expensive, so most companies want to retain the customers they already have. Predicting which customers are likely to churn is therefore important for most organizations. </p>

<p> Churn data sets are also typically imbalanced: the class one would like to predict (customers who churn) has far fewer samples than the other class. </p>
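<p> A minimal sketch of why imbalance matters, using made-up labels (90 "No", 10 "Yes") rather than the real data: a model that always predicts "No" looks accurate but never finds a churner. </p>

```python
# Toy labels (an assumption for illustration): 90 non-churners, 10 churners
labels = ["No"] * 90 + ["Yes"] * 10
# A trivial "model" that always predicts the majority class
preds = ["No"] * len(labels)

# Accuracy: share of predictions that match the true label
accuracy = sum(p == t for p, t in zip(preds, labels)) / len(labels)
# Recall: share of actual churners that were predicted as churners
true_positives = sum(p == "Yes" and t == "Yes" for p, t in zip(preds, labels))
recall = true_positives / labels.count("Yes")

print(accuracy, recall)
```

<p> This is why recall is tracked alongside accuracy later in the notebook. </p>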

<p> In this proof of concept, the data set being referenced is the <a href= "https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-Telco-Customer-Churn.csv?cm_mc_uid=98565687174815521433546&cm_mc_sid_50200000=19288781552143354626&cm_mc_sid_52640000=19735201552143354638">IBM Telecom Data Set</a>. </p>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

from sklearn.preprocessing import MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier, DMatrix
from sklearn.ensemble import AdaBoostClassifier
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
from sklearn.metrics import recall_score, accuracy_score, confusion_matrix, cohen_kappa_score

from keras.models import Sequential
from keras.layers import Dense

Using TensorFlow backend.


<h4> Read the Data Set </h4>
<p> The first step is to read the data set and display its first five rows. This confirms that the data has been loaded into the pandas DataFrame. </p>

In [2]:
data = pd.read_csv("data/CustomerChurn.csv")
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


<h4> Check for Missing values </h4>
<p> To check for missing values, we look for NaN as well as blank values, because blanks are sometimes loaded as ' '. </p>

In [3]:
print("Checking for NA")
print(data.isna().sum())
print("#######################################################")
print("Checking for Blank Data")
print(data.isin([' ','','  ']).sum())


Checking for NA
customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
#######################################################
Checking for Blank Data
customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges 

<p> From the above analysis, 11 records have blank values for TotalCharges. However, since we have both MonthlyCharges and TotalCharges, we decide to keep the MonthlyCharges variable and drop TotalCharges. </p>
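<p> As a side note, blank strings like these can also be surfaced by converting the column to numbers. The sketch below uses a three-row toy frame (not the real data) and assumes the blanks are whitespace-only strings. </p>

```python
import pandas as pd

# Toy frame (an assumption): one whitespace-only entry among numeric strings
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# Flag entries that are empty once surrounding whitespace is stripped
blank_mask = toy["TotalCharges"].str.strip().eq("")

# errors="coerce" turns anything that cannot be parsed into NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

print(blank_mask.sum(), toy["TotalCharges"].isna().sum())
```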

<p> Next, we check the unique values of each categorical column and how often each one occurs. For this we define a helper function that prints the value counts as percentages. </p>

In [4]:
## Print value Counts

def value_counts(df):
    colnms= df.columns
    for cnm in colnms:
        print("Column :" + cnm)
        print(str(round(df[cnm].value_counts()/len(df)*100)))
        
value_counts(data.drop(['MonthlyCharges','customerID','TotalCharges','tenure'],axis=1))

Column :gender
Male      50.0
Female    50.0
Name: gender, dtype: float64
Column :SeniorCitizen
0    84.0
1    16.0
Name: SeniorCitizen, dtype: float64
Column :Partner
No     52.0
Yes    48.0
Name: Partner, dtype: float64
Column :Dependents
No     70.0
Yes    30.0
Name: Dependents, dtype: float64
Column :PhoneService
Yes    90.0
No     10.0
Name: PhoneService, dtype: float64
Column :MultipleLines
No                  48.0
Yes                 42.0
No phone service    10.0
Name: MultipleLines, dtype: float64
Column :InternetService
Fiber optic    44.0
DSL            34.0
No             22.0
Name: InternetService, dtype: float64
Column :OnlineSecurity
No                     50.0
Yes                    29.0
No internet service    22.0
Name: OnlineSecurity, dtype: float64
Column :OnlineBackup
No                     44.0
Yes                    34.0
No internet service    22.0
Name: OnlineBackup, dtype: float64
Column :DeviceProtection
No                     44.0
Yes                    34.0
No

<p> We see that most categorical variables are coded as Yes/No, whereas SeniorCitizen is coded as 0 and 1. So we map 0 to No and 1 to Yes. </p>

In [5]:
print(data['SeniorCitizen'].value_counts()/len(data)*100)
data['SeniorCitizen']= data['SeniorCitizen'].map({0: 'No', 1: 'Yes'})
print(data.head())

0    83.785319
1    16.214681
Name: SeniorCitizen, dtype: float64
   customerID  gender SeniorCitizen Partner Dependents  tenure PhoneService  \
0  7590-VHVEG  Female            No     Yes         No       1           No   
1  5575-GNVDE    Male            No      No         No      34          Yes   
2  3668-QPYBK    Male            No      No         No       2          Yes   
3  7795-CFOCW    Male            No      No         No      45           No   
4  9237-HQITU  Female            No      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity  ...  DeviceProtection  \
0  No phone service             DSL             No  ...                No   
1                No             DSL            Yes  ...               Yes   
2                No             DSL            Yes  ...                No   
3  No phone service             DSL            Yes  ...               Yes   
4                No     Fiber optic             No  ...                No 

In [6]:
print(data['MultipleLines'].value_counts())
data['MultipleLines'] = data['MultipleLines'].map({'No': 'No','Yes': 'Yes','No phone service': 'NoPhoneService'})
print(data['MultipleLines'].value_counts())

No                  3390
Yes                 2971
No phone service     682
Name: MultipleLines, dtype: int64
No                3390
Yes               2971
NoPhoneService     682
Name: MultipleLines, dtype: int64


In [7]:
print(data['InternetService'].value_counts())
data['InternetService'] = data['InternetService'].map({'No': 'No','DSL': 'DSL','Fiber optic': 'FiberOptic'})
print(data['InternetService'].value_counts())

Fiber optic    3096
DSL            2421
No             1526
Name: InternetService, dtype: int64
FiberOptic    3096
DSL           2421
No            1526
Name: InternetService, dtype: int64


In [8]:
print(data['OnlineSecurity'].value_counts())
data['OnlineSecurity'] = data['OnlineSecurity'].map({'No': 'No','Yes': 'Yes','No internet service': 'NoInternetService'})
print(data['OnlineSecurity'].value_counts())

No                     3498
Yes                    2019
No internet service    1526
Name: OnlineSecurity, dtype: int64
No                   3498
Yes                  2019
NoInternetService    1526
Name: OnlineSecurity, dtype: int64


In [9]:
print(data['OnlineBackup'].value_counts())
data['OnlineBackup'] = data['OnlineBackup'].map({'No': 'No','Yes': 'Yes','No internet service': 'NoInternetService'})
print(data['OnlineBackup'].value_counts())

No                     3088
Yes                    2429
No internet service    1526
Name: OnlineBackup, dtype: int64
No                   3088
Yes                  2429
NoInternetService    1526
Name: OnlineBackup, dtype: int64


In [10]:
print(data['DeviceProtection'].value_counts())
data['DeviceProtection'] = data['DeviceProtection'].map({'No': 'No','Yes': 'Yes','No internet service': 'NoInternetService'})
print(data['DeviceProtection'].value_counts())

No                     3095
Yes                    2422
No internet service    1526
Name: DeviceProtection, dtype: int64
No                   3095
Yes                  2422
NoInternetService    1526
Name: DeviceProtection, dtype: int64


In [11]:
print(data['TechSupport'].value_counts())
data['TechSupport'] = data['TechSupport'].map({'No': 'No','Yes': 'Yes','No internet service': 'NoInternetService'})
print(data['TechSupport'].value_counts())

No                     3473
Yes                    2044
No internet service    1526
Name: TechSupport, dtype: int64
No                   3473
Yes                  2044
NoInternetService    1526
Name: TechSupport, dtype: int64


In [12]:
print(data['StreamingTV'].value_counts())
data['StreamingTV'] = data['StreamingTV'].map({'No': 'No','Yes': 'Yes','No internet service': 'NoInternetService'})
print(data['StreamingTV'].value_counts())

No                     2810
Yes                    2707
No internet service    1526
Name: StreamingTV, dtype: int64
No                   2810
Yes                  2707
NoInternetService    1526
Name: StreamingTV, dtype: int64


In [13]:
print(data['StreamingMovies'].value_counts())
data['StreamingMovies'] = data['StreamingMovies'].map({'No': 'No','Yes': 'Yes','No internet service': 'NoInternetService'})
print(data['StreamingMovies'].value_counts())

No                     2785
Yes                    2732
No internet service    1526
Name: StreamingMovies, dtype: int64
No                   2785
Yes                  2732
NoInternetService    1526
Name: StreamingMovies, dtype: int64


In [14]:
print(data['Contract'].value_counts())
data['Contract'] = data['Contract'].map({'Month-to-month':'M2M','Two year': 'TwoYear','One year': 'OneYear'})
print(data['Contract'].value_counts())

Month-to-month    3875
Two year          1695
One year          1473
Name: Contract, dtype: int64
M2M        3875
TwoYear    1695
OneYear    1473
Name: Contract, dtype: int64


In [15]:
data['PaperlessBilling'].value_counts()

Yes    4171
No     2872
Name: PaperlessBilling, dtype: int64

In [16]:
print(data['PaymentMethod'].value_counts())
data['PaymentMethod'] = data['PaymentMethod'].map({'Electronic check': 'ElectronicChk','Mailed check': 'MailedChk','Bank transfer (automatic)': 'BankTransferAuto','Credit card (automatic)': 'CreditCardAuto'})
print(data['PaymentMethod'].value_counts())

Electronic check             2365
Mailed check                 1612
Bank transfer (automatic)    1544
Credit card (automatic)      1522
Name: PaymentMethod, dtype: int64
ElectronicChk       2365
MailedChk           1612
BankTransferAuto    1544
CreditCardAuto      1522
Name: PaymentMethod, dtype: int64
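<p> The per-column relabeling above could also be written once for several columns. This is a minimal sketch on a toy frame, not the notebook's method; the list <code>internet_cols</code> and the frame <code>toy</code> are illustrative names. Unlike <code>map</code>, which turns any value missing from the dictionary into NaN, <code>replace</code> leaves unlisted values unchanged. </p>

```python
import pandas as pd

# Toy frame (an assumption) with two of the internet-dependent columns
toy = pd.DataFrame({
    "OnlineSecurity": ["No", "Yes", "No internet service"],
    "TechSupport": ["Yes", "No internet service", "No"],
})

# Only the label that needs changing is listed; "No" and "Yes" stay as they are
relabel = {"No internet service": "NoInternetService"}
internet_cols = ["OnlineSecurity", "TechSupport"]
toy[internet_cols] = toy[internet_cols].replace(relabel)

print(toy)
```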


<h4> Additional Features Added/Dropped </h4>
<p> We see that PaymentMethod carries information on whether the payment is automatic. This can be added as an additional variable. </p>

In [17]:
data['Automatic'] = data['PaymentMethod'].apply(lambda x: "Yes" if x in ['BankTransferAuto','CreditCardAuto'] else "No")

<p> Tenure is a continuous variable measured in months. We choose to bin it, using 12-month periods as the bins. </p>

In [18]:
def processTenure(x):
    if x<=12:
        return("LT12M")
    elif x>12 and x<=24:
        return("BT1Y2Y")
    elif x>24 and x<=36:
        return("BT2Y3Y")
    elif x>36 and x<=48:
        return("BT3Y4Y")
    elif x>48 and x<=60:
        return("BT4Y5Y")
    else:
        return("GT5Y")
data['TenureBinned'] = data.tenure.apply(lambda x: processTenure(int(x)))
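<p> The same bins can be sketched with <code>pd.cut</code>. This is an alternative to the function above, shown on a few made-up tenure values; the bin edges and labels are copied from <code>processTenure</code>. </p>

```python
import numpy as np
import pandas as pd

# Sample tenure values in months (an assumption), including the bin boundaries
tenure = pd.Series([0, 1, 12, 13, 24, 36, 48, 60, 61, 72])

# Right-closed intervals: (-inf, 12], (12, 24], ..., (60, inf)
bins = [-np.inf, 12, 24, 36, 48, 60, np.inf]
labels = ["LT12M", "BT1Y2Y", "BT2Y3Y", "BT3Y4Y", "BT4Y5Y", "GT5Y"]

# astype(str) converts the categorical result back to plain strings
binned = pd.cut(tenure, bins=bins, labels=labels).astype(str)
print(binned.tolist())
```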

In [19]:
## Scale the Monthly Charges variable using RobustScaler
scaler = RobustScaler()
data['MonthlyCharges'] = scaler.fit_transform(data['MonthlyCharges'].values.reshape(-1,1))
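<p> With its default settings, RobustScaler subtracts the median and divides by the interquartile range (25th to 75th percentile). The sketch below reproduces that arithmetic with NumPy on five made-up charges, which makes the negative values in the scaled column easier to read. </p>

```python
import numpy as np

# Made-up monthly charges (an assumption for illustration)
charges = np.array([20.0, 30.0, 50.0, 70.0, 110.0])

median = np.median(charges)                 # centre of the data
q1, q3 = np.percentile(charges, [25, 75])   # quartiles that define the IQR

# Same formula RobustScaler applies by default: (x - median) / IQR
scaled = (charges - median) / (q3 - q1)
print(scaled)
```

<p> Values below the median come out negative, and the large value 110 has less influence on the scale than it would under min-max scaling. </p>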

<p> Convert the categorical variables into Dummy variables </p>

In [20]:
data= pd.get_dummies(data=data, columns = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService',  'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport','StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling','PaymentMethod','Automatic','TenureBinned',  'Churn'],drop_first=True)
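<p> A small sketch of what <code>drop_first=True</code> does, on a toy Contract column: the first category in sorted order is dropped and becomes the implicit baseline, which avoids a redundant column. </p>

```python
import pandas as pd

# Toy column (an assumption) with the three relabeled contract types
toy = pd.DataFrame({"Contract": ["M2M", "OneYear", "TwoYear", "M2M"]})

# Categories are sorted, so "M2M" is dropped and acts as the baseline
dummies = pd.get_dummies(toy, columns=["Contract"], drop_first=True)
print(list(dummies.columns))
```

<p> A row with zeros in both remaining columns is a month-to-month contract. </p>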

In [21]:
data.drop(['customerID','TotalCharges','tenure'],axis=1, inplace=True)
data.head()

Unnamed: 0,MonthlyCharges,gender_Male,SeniorCitizen_Yes,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_NoPhoneService,MultipleLines_Yes,InternetService_FiberOptic,InternetService_No,...,PaymentMethod_CreditCardAuto,PaymentMethod_ElectronicChk,PaymentMethod_MailedChk,Automatic_Yes,TenureBinned_BT2Y3Y,TenureBinned_BT3Y4Y,TenureBinned_BT4Y5Y,TenureBinned_GT5Y,TenureBinned_LT12M,Churn_Yes
0,-0.74517,0,0,1,0,0,1,0,0,0,...,0,1,0,0,0,0,0,0,1,0
1,-0.24655,1,0,0,0,1,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,-0.303588,1,0,0,0,1,0,0,0,0,...,0,0,1,0,0,0,0,0,1,1
3,-0.516099,1,0,0,0,0,1,0,0,0,...,0,0,0,1,0,1,0,0,0,0
4,0.00644,0,0,0,0,1,0,0,1,0,...,0,1,0,0,0,0,0,0,1,1


<h4> Splitting the Data into training and testing </h4>

In [22]:
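## Features: every column except the target
X = data.drop(['Churn_Yes'],axis=1).values
## Target as a 1-D array, which is the shape scikit-learn estimators expect
y = data['Churn_Yes'].values
## stratify=y keeps the churn ratio the same in the training and testing sets
X_train, X_test,y_train, y_test = train_test_split(X,y,random_state=100, stratify=y)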
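<p> To illustrate the effect of <code>stratify</code>, here is a sketch on made-up labels with a 20% positive class. The names <code>X_toy</code> and <code>y_toy</code> are illustrative and not part of the notebook. </p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data (an assumption): 80 negatives, 20 positives
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Default test size is 25%; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=100, stratify=y_toy
)
print(len(y_tr), len(y_te), y_tr.mean(), y_te.mean())
```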

<h4> Model Development </h4>
<p> Here we create a single function for building and evaluating models. It takes the training and testing inputs, the model type (Logistic Regression, Decision Tree, Random Forest, Support Vector Machine, XGBoost, AdaBoost or a neural network) and the resampling method. Because the classes are imbalanced, the training data is resampled with one of two methods: <a href="https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html#smote-adasyn">Over Sampling (SMOTE)</a> or <a href="https://imbalanced-learn.readthedocs.io/en/stable/under_sampling.html">Under Sampling (NEARMISS)</a>. </p>

<p> This function returns Accuracy, Recall, Confusion Matrix and Cohen's Kappa Score </p>
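<p> For reference, the three scalar metrics can be worked out by hand from the four cells of a confusion matrix. The sketch below uses made-up counts, and the helper name <code>summarize</code> is hypothetical. Cohen's Kappa compares the observed agreement (the accuracy) with the agreement expected by chance. </p>

```python
def summarize(tn, fp, fn, tp):
    """Return accuracy, recall and Cohen's kappa from confusion-matrix counts."""
    n = tn + fp + fn + tp
    accuracy = (tp + tn) / n                  # observed agreement
    recall = tp / (tp + fn)                   # share of actual positives found
    # Chance agreement: product of actual and predicted shares, per class
    p_e = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / n ** 2
    kappa = (accuracy - p_e) / (1 - p_e)
    return accuracy, recall, kappa

# Made-up counts (an assumption): 60 actual negatives, 40 actual positives
accuracy, recall, kappa = summarize(tn=50, fp=10, fn=5, tp=35)
print(accuracy, recall, round(kappa, 4))
```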

In [23]:
def predictionF(X_tr,y_tr,X_tt,y_tt,model_type, imb_method='SMOTE'):
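    ## Resample the training data only; the test data keeps its original class balance
    ## fit_resample is the current name of the method that older imbalanced-learn releases called fit_sample
    if imb_method == 'SMOTE':
        sm = SMOTE()
        X_train,y_train = sm.fit_resample(X_tr,y_tr)
    elif imb_method == 'NEARMISS':
        nm = NearMiss()
        X_train,y_train = nm.fit_resample(X_tr,y_tr)
    else:
        raise ValueError("Invalid Imbalance Method: " + str(imb_method))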
    
    if model_type == 'LOGREG':
        model = LogisticRegression()
    elif model_type == 'DECISIONTREE':
        model = DecisionTreeClassifier()
    elif model_type == 'RANDOMFOREST':
        model = RandomForestClassifier()
    elif model_type == 'SVM':
        model = SVC()
    elif model_type == 'XGBOOST':
        model = XGBClassifier()
    elif model_type == 'ADABOOST':
        model = AdaBoostClassifier()
    elif model_type == 'NEURALNET':
        model=Sequential()
        model.add(Dense(60, input_dim=X_train.shape[1], kernel_initializer='normal', activation='relu'))
        model.add(Dense(30, kernel_initializer='normal', activation='relu'))
        model.add(Dense(10, kernel_initializer='normal', activation='tanh'))
        model.add(Dense(10, kernel_initializer='normal', activation='tanh'))
        model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
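    else:
        ## Fail loudly: returning an empty tuple would break the four-value unpacking at the call site
        raise ValueError("Invalid Model Type: " + str(model_type))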
    
    if model_type == 'NEURALNET':
        results = model.fit(X_train, y_train, epochs= 50, batch_size = 500, validation_data = (X_tt, y_tt),verbose=False)
        ypred = model.predict(X_tt)[:,0]
        ypred = np.round(ypred)
    else:        
        model.fit(X_train,y_train)        
        ypred = model.predict(X_tt)
        
    accuracy = accuracy_score(y_pred=ypred, y_true=y_tt)
    recall   = recall_score(y_pred=ypred, y_true=y_tt)
    conf_matrix = confusion_matrix(y_pred=ypred, y_true=y_tt)
    coh_kappa= cohen_kappa_score(y1=ypred, y2=y_tt)   
    
    return(accuracy, recall, conf_matrix,coh_kappa)
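<p> To give some intuition for the SMOTE step in the function above, here is a simplified NumPy sketch of its core idea: a synthetic minority row is placed somewhere on the line between a minority sample and one of its nearest minority neighbours. This is not the imbalanced-learn implementation; the function <code>smote_like</code> and its parameters are hypothetical. </p>

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))        # pick a random minority sample
        nb = rng.choice(neighbours[base])      # pick one of its k nearest neighbours
        lam = rng.random()                     # interpolation weight in [0, 1)
        synthetic[i] = X_min[base] + lam * (X_min[nb] - X_min[base])
    return synthetic

# Four made-up minority points at the corners of the unit square
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_like(X_minority, n_new=5)
print(new_rows.shape)
```

<p> NearMiss works in the opposite direction: it keeps all minority rows and discards majority rows, so the training set shrinks instead of growing. </p>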

In [24]:
## Run Logistic Regression
acc_lm_sm, recall_lm_sm, confm_lm_sm, coh_lm_sm = predictionF(X_tr=X_train,y_tr=y_train, X_tt=X_test, y_tt=y_test, imb_method='SMOTE',model_type='LOGREG')
acc_lm_nm, recall_lm_nm, confm_lm_nm, coh_lm_nm = predictionF(X_tr=X_train,y_tr=y_train, X_tt=X_test, y_tt=y_test, imb_method='NEARMISS',model_type='LOGREG')

In [25]:
## Run Decision Trees
acc_dt_sm, recall_dt_sm, confm_dt_sm, coh_dt_sm = predictionF(X_tr=X_train,y_tr=y_train, X_tt=X_test, y_tt=y_test, imb_method='SMOTE',model_type='DECISIONTREE')
acc_dt_nm, recall_dt_nm, confm_dt_nm, coh_dt_nm = predictionF(X_tr=X_train,y_tr=y_train, X_tt=X_test, y_tt=y_test, imb_method='NEARMISS',model_type='DECISIONTREE')

In [26]:
acc_rf_sm, recall_rf_sm, confm_rf_sm, coh_rf_sm = predictionF(X_tr=X_train,y_tr=y_train, X_tt=X_test, y_tt=y_test, imb_method='SMOTE',model_type='RANDOMFOREST')
acc_rf_nm, recall_rf_nm, confm_rf_nm, coh_rf_nm = predictionF(X_tr=X_train,y_tr=y_train, X_tt=X_test, y_tt=y_test, imb_method='NEARMISS',model_type='RANDOMFOREST')

In [27]:
acc_sv_sm, recall_sv_sm, confm_sv_sm, coh_sv_sm = predictionF(X_tr=X_train,y_tr=y_train, X_tt=X_test, y_tt=y_test, imb_method='SMOTE',model_type='SVM')
acc_sv_nm, recall_sv_nm, confm_sv_nm, coh_sv_nm = predictionF(X_tr=X_train,y_tr=y_train, X_tt=X_test, y_tt=y_test, imb_method='NEARMISS',model_type='SVM')

In [28]:
acc_xg_sm, recall_xg_sm, confm_xg_sm, coh_xg_sm = predictionF(X_tr=X_train,y_tr=y_train, X_tt=X_test, y_tt=y_test, imb_method='SMOTE',model_type='XGBOOST')
acc_xg_nm, recall_xg_nm, confm_xg_nm, coh_xg_nm = predictionF(X_tr=X_train,y_tr=y_train, X_tt=X_test, y_tt=y_test, imb_method='NEARMISS',model_type='XGBOOST')

In [29]:
acc_ad_sm, recall_ad_sm, confm_ad_sm, coh_ad_sm = predictionF(X_tr=X_train,y_tr=y_train, X_tt=X_test, y_tt=y_test, imb_method='SMOTE',model_type='ADABOOST')
acc_ad_nm, recall_ad_nm, confm_ad_nm, coh_ad_nm = predictionF(X_tr=X_train,y_tr=y_train, X_tt=X_test, y_tt=y_test, imb_method='NEARMISS',model_type='ADABOOST')

In [30]:
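## Run Neural Network
## The second call must pass imb_method='NEARMISS'; with 'SMOTE' in both calls, the NEARMISS columns for NEURALNET just report a second SMOTE run
acc_nn_sm, recall_nn_sm, confm_nn_sm, coh_nn_sm = predictionF(X_tr=X_train,y_tr=y_train, X_tt=X_test, y_tt=y_test, imb_method='SMOTE',model_type='NEURALNET')
acc_nn_nm, recall_nn_nm, confm_nn_nm, coh_nn_nm = predictionF(X_tr=X_train,y_tr=y_train, X_tt=X_test, y_tt=y_test, imb_method='NEARMISS',model_type='NEURALNET')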

In [31]:
comparisondf = pd.DataFrame(data= [[acc_lm_sm, acc_lm_nm, recall_lm_sm, recall_lm_nm, coh_lm_sm, coh_lm_nm],
                                   [acc_dt_sm,acc_dt_nm, recall_dt_sm, recall_dt_nm, coh_dt_sm, coh_dt_nm],
                                   [acc_rf_sm,acc_rf_nm, recall_rf_sm, recall_rf_nm, coh_rf_sm, coh_rf_nm],
                                   [acc_sv_sm,acc_sv_nm, recall_sv_sm, recall_sv_nm, coh_sv_sm, coh_sv_nm],
                                   [acc_xg_sm,acc_xg_nm, recall_xg_sm, recall_xg_nm, coh_xg_sm, coh_xg_nm],
                                   [acc_ad_sm,acc_ad_nm, recall_ad_sm, recall_ad_nm, coh_ad_sm, coh_ad_nm],
                                   [acc_nn_sm,acc_nn_nm, recall_nn_sm, recall_nn_nm, coh_nn_sm, coh_nn_nm]], columns=['ACCURACY_SMOTE','ACCURACY_NEARMISS', 'RECALL_SMOTE','RECALL_NEARMISS','COHEN_KAPPA_SMOTE','COHEN_KAPPA_NEARMISS'], index=['LOGREG','DECISIONTREE','RANDOMFOREST','SVC','XGBOOST','ADABOOST','NEURALNET'])

In [32]:
print(acc_lm_sm)
print(acc_lm_nm)
print(recall_lm_sm)
print(recall_lm_nm)

0.7410562180579217
0.6598523566155593
0.7837259100642399
0.7794432548179872


In [33]:
round(comparisondf.iloc[:,0:4]*100)

Unnamed: 0,ACCURACY_SMOTE,ACCURACY_NEARMISS,RECALL_SMOTE,RECALL_NEARMISS
LOGREG,74.0,66.0,78.0,78.0
DECISIONTREE,70.0,51.0,48.0,60.0
RANDOMFOREST,76.0,55.0,47.0,61.0
SVC,74.0,60.0,81.0,78.0
XGBOOST,77.0,51.0,67.0,78.0
ADABOOST,77.0,61.0,73.0,79.0
NEURALNET,73.0,73.0,77.0,75.0
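<p> One way to read the table, sketched below, is to look up the model with the highest recall under each sampling method. The numbers are the rounded recall values copied from the table above; the frame name <code>recall_table</code> is illustrative. </p>

```python
import pandas as pd

# Rounded recall percentages copied from the comparison table
recall_table = pd.DataFrame(
    {
        "RECALL_SMOTE": [78, 48, 47, 81, 67, 73, 77],
        "RECALL_NEARMISS": [78, 60, 61, 78, 78, 79, 75],
    },
    index=["LOGREG", "DECISIONTREE", "RANDOMFOREST", "SVC",
           "XGBOOST", "ADABOOST", "NEURALNET"],
)

# idxmax returns the row label of the largest value in each column
best = recall_table.idxmax()
print(best)
```

<p> Recall alone does not settle the choice, since accuracy under NearMiss is noticeably lower for most models in the same table. </p>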

