## Business Problem - Customer Churn Prediction
### Context - "Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs."
### Following is the data dictionary
#### customerID: Customer ID
#### Gender: Whether the customer is a male or a female
#### SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)
#### Partner: Whether the customer has a partner or not (Yes, No)
#### Dependents: Whether the customer has dependents or not (Yes, No)
#### Tenure: Number of months the customer has stayed with the company
#### PhoneService: Whether the customer has a phone service or not (Yes, No)
#### MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)
#### InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
#### OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
#### OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)
#### DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
#### TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
#### StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
#### StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)
#### Contract: The contract term of the customer (Month-to-month, One year, Two year)
#### PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
#### PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
#### MonthlyCharges: The amount charged to the customer monthly
#### TotalCharges: The total amount charged to the customer
#### Churn: Whether the customer churned or not (Yes or No)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import warnings
warnings.filterwarnings('ignore')  # Suppress warnings


In [2]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
churn_data=pd.read_csv(r"D:\Python Projects\DT_Customer_churn_prediction\ML 03 Classification Case Dataset.csv")

In [3]:
churn_data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [4]:
churn_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


churn_data.describe()

In [5]:
churn_data.isnull().sum()*100/churn_data.shape[0]        # percentage of missing values per column

customerID          0.0
gender              0.0
SeniorCitizen       0.0
Partner             0.0
Dependents          0.0
tenure              0.0
PhoneService        0.0
MultipleLines       0.0
InternetService     0.0
OnlineSecurity      0.0
OnlineBackup        0.0
DeviceProtection    0.0
TechSupport         0.0
StreamingTV         0.0
StreamingMovies     0.0
Contract            0.0
PaperlessBilling    0.0
PaymentMethod       0.0
MonthlyCharges      0.0
TotalCharges        0.0
Churn               0.0
dtype: float64

# No NULL values are reported, but TotalCharges is stored as text and contains blank strings, which are handled in the next cell.

In [6]:
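# Converting TotalCharges to numeric; blank strings become NaN
churn_data['TotalCharges'].describe()
churn_data['TotalCharges'] = churn_data['TotalCharges'].replace(' ', np.nan)
churn_data['TotalCharges'] = pd.to_numeric(churn_data['TotalCharges'])

# Intended imputation: median TotalCharges/MonthlyCharges ratio times each row's MonthlyCharges
value = (churn_data['TotalCharges']/churn_data['MonthlyCharges']).median()*churn_data['MonthlyCharges']
# NOTE: `== np.nan` is always False, so this line leaves TotalCharges unchanged and the
# NaNs remain (count is 7032 below); `churn_data['TotalCharges'].isna()` would select them.
churn_data['TotalCharges'] = value.where(churn_data['TotalCharges'] == np.nan, other =churn_data['TotalCharges'])
churn_data['TotalCharges'].describe()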

count    7032.000000
mean     2283.300441
std      2266.771362
min        18.800000
25%       401.450000
50%      1397.475000
75%      3794.737500
max      8684.800000
Name: TotalCharges, dtype: float64
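
The imputation idea in the cell above (median ratio of total to monthly charges, scaled by each row's monthly charge) can be sketched on a toy frame. This is a minimal illustration, not the notebook's executed code: the frame `toy` and its values are hypothetical, and `isna()` is swapped in for the equality test because NaN never compares equal to anything.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one blank TotalCharges entry
toy = pd.DataFrame({
    "MonthlyCharges": [20.0, 50.0, 80.0, 40.0],
    "TotalCharges": ["200.0", "1000.0", " ", "400.0"],
})

# Blank strings cannot be parsed, so errors="coerce" turns them into NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# An equality test against NaN matches nothing; isna() finds the missing row
eq_mask = toy["TotalCharges"] == np.nan
na_mask = toy["TotalCharges"].isna()

# Median total/monthly ratio (NaN rows are skipped), scaled per row
ratio = (toy["TotalCharges"] / toy["MonthlyCharges"]).median()
estimate = ratio * toy["MonthlyCharges"]

# Keep observed values, fill only the missing ones with the estimate
toy["TotalCharges"] = toy["TotalCharges"].where(~na_mask, estimate)
print(toy)
```

With this approach the missing rows are filled rather than dropped, so the row count would stay at the original size instead of shrinking by the number of blanks.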

### Converting some binary variables (Yes/No) to 0/1

In [7]:
# Data preparation
# List of binary (Yes/No) variables to map to 1/0

varlist =  ['PhoneService', 'PaperlessBilling', 'Churn', 'Partner', 'Dependents']

### Binary encoding

In [8]:
# Defining the map function
def binary_map(x):
    return x.map({'Yes': 1, "No": 0})

In [9]:
# Applying the function to the var list
churn_data[varlist] = churn_data[varlist].apply(binary_map)
churn_data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,1,0,1,0,No phone service,DSL,No,...,No,No,No,No,Month-to-month,1,Electronic check,29.85,29.85,0
1,5575-GNVDE,Male,0,0,0,34,1,No,DSL,Yes,...,Yes,No,No,No,One year,0,Mailed check,56.95,1889.5,0
2,3668-QPYBK,Male,0,0,0,2,1,No,DSL,Yes,...,No,No,No,No,Month-to-month,1,Mailed check,53.85,108.15,1
3,7795-CFOCW,Male,0,0,0,45,0,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,0,Bank transfer (automatic),42.3,1840.75,0
4,9237-HQITU,Female,0,0,0,2,1,No,Fiber optic,No,...,No,No,No,No,Month-to-month,1,Electronic check,70.7,151.65,1


### Creating dummy variables and removing the extra columns

In [10]:

#For categorical variables with multiple levels, create dummy features (one-hot encoded)
# Creating a dummy variable for some of the categorical variables and dropping the first one.
dummy1 = pd.get_dummies(churn_data[['Contract', 'PaymentMethod', 'gender', 'InternetService']], drop_first=True)

# Adding the results to the master dataframe
churn_data = pd.concat([churn_data, dummy1], axis=1)
churn_data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,TotalCharges,Churn,Contract_One year,Contract_Two year,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,gender_Male,InternetService_Fiber optic,InternetService_No
0,7590-VHVEG,Female,0,1,0,1,0,No phone service,DSL,No,...,29.85,0,0,0,0,1,0,0,0,0
1,5575-GNVDE,Male,0,0,0,34,1,No,DSL,Yes,...,1889.5,0,1,0,0,0,1,1,0,0
2,3668-QPYBK,Male,0,0,0,2,1,No,DSL,Yes,...,108.15,1,0,0,0,0,1,1,0,0
3,7795-CFOCW,Male,0,0,0,45,0,No phone service,DSL,Yes,...,1840.75,0,1,0,0,0,0,1,0,0
4,9237-HQITU,Female,0,0,0,2,1,No,Fiber optic,No,...,151.65,1,0,0,0,1,0,0,1,0


In [11]:
# Creating dummy variables for the remaining categorical variables and dropping the redundant 'No phone service' / 'No internet service' level.

# Creating dummy variables for the variable 'MultipleLines'
ml = pd.get_dummies(churn_data['MultipleLines'], prefix='MultipleLines')
# Dropping MultipleLines_No phone service column
ml1 = ml.drop(['MultipleLines_No phone service'], axis=1)
#Adding the results to the master dataframe
churn_data = pd.concat([churn_data,ml1], axis=1)

In [12]:
# Creating dummy variables for the variable 'OnlineSecurity'.

os = pd.get_dummies(churn_data['OnlineSecurity'], prefix='OnlineSecurity')
os1 = os.drop(['OnlineSecurity_No internet service'], axis=1)
# Adding the results to the master dataframe
churn_data = pd.concat([churn_data,os1], axis=1)



In [13]:
# Creating dummy variables for the variable 'OnlineBackup'.

ob = pd.get_dummies(churn_data['OnlineBackup'], prefix='OnlineBackup')
ob1 = ob.drop(['OnlineBackup_No internet service'], axis=1)
# Adding the results to the master dataframe
churn_data = pd.concat([churn_data,ob1], axis=1)


In [14]:
# Creating dummy variables for the variable 'DeviceProtection'. 

dp = pd.get_dummies(churn_data['DeviceProtection'], prefix='DeviceProtection')
dp1 = dp.drop(['DeviceProtection_No internet service'], axis=1)
# Adding the results to the master dataframe
churn_data = pd.concat([churn_data,dp1], axis=1)

In [15]:
# Creating dummy variables for the variable 'TechSupport'. 
ts = pd.get_dummies(churn_data['TechSupport'], prefix='TechSupport')
ts1 = ts.drop(['TechSupport_No internet service'], axis=1)
# Adding the results to the master dataframe
churn_data = pd.concat([churn_data,ts1], axis=1)

In [16]:
# Creating dummy variables for the variable 'StreamingTV'.
st =pd.get_dummies(churn_data['StreamingTV'], prefix='StreamingTV')
st1 = st.drop(['StreamingTV_No internet service'], axis=1)
# Adding the results to the master dataframe
churn_data = pd.concat([churn_data,st1], axis=1)


In [17]:
# Creating dummy variables for the variable 'StreamingMovies'. 
sm = pd.get_dummies(churn_data['StreamingMovies'], prefix='StreamingMovies')
sm1 = sm.drop(['StreamingMovies_No internet service'], axis=1)
# Adding the results to the master dataframe
churn_data= pd.concat([churn_data,sm1], axis=1)

In [18]:
churn_data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,OnlineBackup_No,OnlineBackup_Yes,DeviceProtection_No,DeviceProtection_Yes,TechSupport_No,TechSupport_Yes,StreamingTV_No,StreamingTV_Yes,StreamingMovies_No,StreamingMovies_Yes
0,7590-VHVEG,Female,0,1,0,1,0,No phone service,DSL,No,...,0,1,1,0,1,0,1,0,1,0
1,5575-GNVDE,Male,0,0,0,34,1,No,DSL,Yes,...,1,0,0,1,1,0,1,0,1,0
2,3668-QPYBK,Male,0,0,0,2,1,No,DSL,Yes,...,0,1,1,0,1,0,1,0,1,0
3,7795-CFOCW,Male,0,0,0,45,0,No phone service,DSL,Yes,...,1,0,0,1,0,1,1,0,1,0
4,9237-HQITU,Female,0,0,0,2,1,No,Fiber optic,No,...,1,0,1,0,1,0,1,0,1,0
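
The per-column dummy cells above all follow one pattern: encode, drop the "No internet service" level, concatenate. A compact sketch of that pattern, assuming a hypothetical two-column toy frame rather than the real dataset, could look like this. `dtype=int` is used so the dummies are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy frame with two of the service columns
toy = pd.DataFrame({
    "OnlineSecurity": ["No", "Yes", "No internet service"],
    "TechSupport": ["Yes", "No", "No internet service"],
})

service_cols = ["OnlineSecurity", "TechSupport"]

# One call encodes every listed column, prefixing each dummy with its column name
dummies = pd.get_dummies(toy[service_cols], dtype=int)

# Drop the level that InternetService_No already encodes
redundant = [c for c in dummies.columns if c.endswith("_No internet service")]
dummies = dummies.drop(columns=redundant)

# Replace the original text columns with their dummies
encoded = pd.concat([toy.drop(columns=service_cols), dummies], axis=1)
print(encoded)
```

This keeps both the `_No` and `_Yes` levels, as the notebook does; whether to keep both is a modelling choice, since the two are complementary for customers with internet service.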


In [19]:
# We have created dummies for the below variables, so we can drop them

churn_data = churn_data.drop(['Contract','PaymentMethod','gender','MultipleLines','InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies'], axis=1)
churn_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 32 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   customerID                             7043 non-null   object 
 1   SeniorCitizen                          7043 non-null   int64  
 2   Partner                                7043 non-null   int64  
 3   Dependents                             7043 non-null   int64  
 4   tenure                                 7043 non-null   int64  
 5   PhoneService                           7043 non-null   int64  
 6   PaperlessBilling                       7043 non-null   int64  
 7   MonthlyCharges                         7043 non-null   float64
 8   TotalCharges                           7032 non-null   float64
 9   Churn                                  7043 non-null   int64  
 10  Contract_One year                      7043 non-null   uint8  
 11  Cont

In [20]:
# Checking for outliers in the continuous variables

num_telecom = churn_data[['tenure','MonthlyCharges','SeniorCitizen','TotalCharges']]
# Checking outliers at 25%, 50%, 75%, 90%, 95% and 99%

num_telecom.describe(percentiles=[.25, .5, .75, .90, .95, .99])

Unnamed: 0,tenure,MonthlyCharges,SeniorCitizen,TotalCharges
count,7043.0,7043.0,7043.0,7032.0
mean,32.371149,64.761692,0.162147,2283.300441
std,24.559481,30.090047,0.368612,2266.771362
min,0.0,18.25,0.0,18.8
25%,9.0,35.5,0.0,401.45
50%,29.0,70.35,0.0,1397.475
75%,55.0,89.85,0.0,3794.7375
90%,69.0,102.6,1.0,5976.64
95%,72.0,107.4,1.0,6923.59
99%,72.0,114.729,1.0,8039.883


In [21]:
# Checking up the missing values (column-wise)
churn_data.isnull().sum()

customerID                                0
SeniorCitizen                             0
Partner                                   0
Dependents                                0
tenure                                    0
PhoneService                              0
PaperlessBilling                          0
MonthlyCharges                            0
TotalCharges                             11
Churn                                     0
Contract_One year                         0
Contract_Two year                         0
PaymentMethod_Credit card (automatic)     0
PaymentMethod_Electronic check            0
PaymentMethod_Mailed check                0
gender_Male                               0
InternetService_Fiber optic               0
InternetService_No                        0
MultipleLines_No                          0
MultipleLines_Yes                         0
OnlineSecurity_No                         0
OnlineSecurity_Yes                        0
OnlineBackup_No                 

In [22]:
# Removing NaN TotalCharges rows

churn_data = churn_data[~np.isnan(churn_data['TotalCharges'])]

In [23]:
# Checking percentage of missing values after removing the missing values

round(100*(churn_data.isnull().sum()/len(churn_data.index)), 2)

customerID                               0.0
SeniorCitizen                            0.0
Partner                                  0.0
Dependents                               0.0
tenure                                   0.0
PhoneService                             0.0
PaperlessBilling                         0.0
MonthlyCharges                           0.0
TotalCharges                             0.0
Churn                                    0.0
Contract_One year                        0.0
Contract_Two year                        0.0
PaymentMethod_Credit card (automatic)    0.0
PaymentMethod_Electronic check           0.0
PaymentMethod_Mailed check               0.0
gender_Male                              0.0
InternetService_Fiber optic              0.0
InternetService_No                       0.0
MultipleLines_No                         0.0
MultipleLines_Yes                        0.0
OnlineSecurity_No                        0.0
OnlineSecurity_Yes                       0.0
OnlineBack

In [24]:
# Putting feature variable to X

from sklearn.model_selection import train_test_split  # scikit-learn releases before 0.18 exposed this in 'cross_validation'
X = churn_data.drop(['Churn','customerID'], axis=1)
X.head()

# Putting response variable to y
y = churn_data['Churn']

y.head()

0    0
1    0
2    1
3    0
4    1
Name: Churn, dtype: int64

In [25]:
# Splitting the data into train and test

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

In [26]:
#Feature Scaling

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X_train[['tenure','MonthlyCharges','TotalCharges']] = scaler.fit_transform(X_train[['tenure','MonthlyCharges','TotalCharges']])

X_train.head()

Unnamed: 0,SeniorCitizen,Partner,Dependents,tenure,PhoneService,PaperlessBilling,MonthlyCharges,TotalCharges,Contract_One year,Contract_Two year,...,OnlineBackup_No,OnlineBackup_Yes,DeviceProtection_No,DeviceProtection_Yes,TechSupport_No,TechSupport_Yes,StreamingTV_No,StreamingTV_Yes,StreamingMovies_No,StreamingMovies_Yes
879,0,0,0,0.019693,1,1,-0.338074,-0.276449,0,0,...,0,1,1,0,1,0,1,0,1,0
5790,0,1,1,0.305384,0,1,-0.464443,-0.112702,0,0,...,0,1,1,0,1,0,0,1,0,1
6498,0,0,0,-1.286319,1,1,0.581425,-0.97443,0,0,...,0,1,0,1,1,0,1,0,1,0
880,0,0,0,-0.919003,1,1,1.505913,-0.550676,0,0,...,0,1,0,1,0,1,0,1,0,1
2784,0,0,1,-1.16388,1,1,1.106854,-0.835971,0,0,...,1,0,0,1,0,1,0,1,0,1
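
Scaling with `StandardScaler` is usually done by fitting on the training split only and reusing those statistics for the test split. The sketch below shows that pattern on hypothetical one-column arrays; the array names and values are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature train and test splits
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data
train_scaled = scaler.fit_transform(train)

# transform reuses the training statistics, so the test data is scaled consistently
test_scaled = scaler.transform(test)

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

A test value equal to the training mean maps to 0 here, which would not generally hold if the scaler were re-fitted on the test split.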


### Model Building

In [27]:
from sklearn.linear_model import LogisticRegression
# Instantiate the logistic regression model
logmodel=LogisticRegression(solver='lbfgs')

In [28]:
# Fit the model on the training set
logmodel.fit(X_train,y_train)

LogisticRegression()

In [29]:
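# Predict on the test set. Note: X_test has not been scaled at this point (only X_train was),
# which likely explains the degenerate all-1 predictions in the report below.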
predictions=logmodel.predict(X_test)

In [30]:
from sklearn.metrics import classification_report

In [31]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00      1528
           1       0.28      1.00      0.43       582

    accuracy                           0.28      2110
   macro avg       0.14      0.50      0.22      2110
weighted avg       0.08      0.28      0.12      2110



In [32]:
from sklearn.metrics import confusion_matrix

In [33]:
print(confusion_matrix(y_test,predictions))

[[   0 1528]
 [   0  582]]


In [34]:
import statsmodels.api as sm

In [35]:
def get_lrm(y_train, x_train):
    lrm = sm.GLM(y_train, (sm.add_constant(x_train)), family = sm.families.Binomial())
    lrm = lrm.fit()
    print(lrm.summary())
    return lrm

In [36]:
# running the logistic regression model once
lrm_1 =get_lrm(y_train, X_train)

                 Generalized Linear Model Regression Results                  
Dep. Variable:                  Churn   No. Observations:                 4922
Model:                            GLM   Df Residuals:                     4898
Model Family:                Binomial   Df Model:                           23
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2004.7
Date:                Sun, 26 Nov 2023   Deviance:                       4009.4
Time:                        10:58:47   Pearson chi2:                 6.07e+03
No. Iterations:                    10   Pseudo R-squ. (CS):             0.2844
Covariance Type:            nonrobust                                         
                                            coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------

In [37]:
X_train.shape

(4922, 30)

In [53]:
# Logistic regression model

import statsmodels.api as sm
logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
print(logm1.fit().summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                  Churn   No. Observations:                 4922
Model:                            GLM   Df Residuals:                     4898
Model Family:                Binomial   Df Model:                           23
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2004.7
Date:                Sun, 26 Nov 2023   Deviance:                       4009.4
Time:                        10:59:54   Pearson chi2:                 6.07e+03
No. Iterations:                    10   Pseudo R-squ. (CS):             0.2844
Covariance Type:            nonrobust                                         
                                            coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------

In [39]:
#Feature Selection Using RFE
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
from sklearn.feature_selection import RFE
rfe = RFE(logreg)            # no n_features_to_select given, so RFE keeps half of the features (15 of 30 here)
rfe = rfe.fit(X_train, y_train)
rfe.support_

list(zip(X_train.columns, rfe.support_, rfe.ranking_))


col = X_train.columns[rfe.support_]
X_train.columns[~rfe.support_]

Index(['Partner', 'Dependents', 'PhoneService',
       'PaymentMethod_Electronic check', 'gender_Male', 'MultipleLines_No',
       'MultipleLines_Yes', 'OnlineSecurity_Yes', 'OnlineBackup_No',
       'OnlineBackup_Yes', 'DeviceProtection_No', 'DeviceProtection_Yes',
       'TechSupport_Yes', 'StreamingTV_No', 'StreamingMovies_No'],
      dtype='object')
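
How many features RFE keeps is controlled by `n_features_to_select`; when it is left unset, half of the features are kept. A small sketch on synthetic data (generated with `make_classification`, not the churn dataset) illustrates both cases.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data with 8 features; sizes and seed are arbitrary choices
X, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=0)

# Default: keep half of the features
default_rfe = RFE(LogisticRegression(max_iter=1000)).fit(X, y)

# Explicit: keep exactly 3 features
explicit_rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)

n_default = int(default_rfe.support_.sum())
n_explicit = int(explicit_rfe.support_.sum())
print(n_default, n_explicit)
```

Selected features have `ranking_` equal to 1; larger ranks mark features eliminated earlier.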

In [40]:
#Adding a constant

X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
print(res.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                  Churn   No. Observations:                 4922
Model:                            GLM   Df Residuals:                     4906
Model Family:                Binomial   Df Model:                           15
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2013.3
Date:                Sun, 26 Nov 2023   Deviance:                       4026.5
Time:                        10:58:48   Pearson chi2:                 6.22e+03
No. Iterations:                     7   Pseudo R-squ. (CS):             0.2819
Covariance Type:            nonrobust                                         
                                            coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------

### Getting the predicted values on train set

In [41]:
# Getting the predicted values on the train set
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]


y_train_pred_final = pd.DataFrame({'Churn':y_train.values, 'Churn_Prob':y_train_pred})
y_train_pred_final['CustID'] = y_train.index
y_train_pred_final.head()

Unnamed: 0,Churn,Churn_Prob,CustID
879,0,0.191284,879
5790,0,0.28038,5790
6498,1,0.66637,6498
880,1,0.520629,880
2784,1,0.681216,2784


### Creating a new column predicted with 1 if churn > 0.5 else 0

In [42]:
#Creating new column 'predicted' with 1 if Churn_Prob > 0.5 else 0
y_train_pred_final['predicted'] = y_train_pred_final.Churn_Prob.map(lambda x: 1 if x > 0.5 else 0)

# Let's see the head
y_train_pred_final.head()

Unnamed: 0,Churn,Churn_Prob,CustID,predicted
879,0,0.191284,879,0
5790,0,0.28038,5790,0
6498,1,0.66637,6498,1
880,1,0.520629,880,1
2784,1,0.681216,2784,1


### Create a confusion matrix on train set and test

In [64]:
# Confusion matrix (stored under a new name so it does not shadow sklearn's confusion_matrix function)
from sklearn import metrics
confusion = metrics.confusion_matrix(y_train_pred_final.Churn, y_train_pred_final.predicted )
print(confusion)

[[3268  367]
 [ 584  703]]


In [63]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Churn, y_train_pred_final.predicted))

0.8067858594067452


In [65]:
print(classification_report(y_train_pred_final.Churn, y_train_pred_final.predicted))

              precision    recall  f1-score   support

           0       0.85      0.90      0.87      3635
           1       0.66      0.55      0.60      1287

    accuracy                           0.81      4922
   macro avg       0.75      0.72      0.73      4922
weighted avg       0.80      0.81      0.80      4922
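
The accuracy, precision and recall figures for the training set can be recomputed by hand from the training confusion matrix printed above. A short sketch, using scikit-learn's layout (rows are actual classes, columns are predicted classes):

```python
import numpy as np

# Training confusion matrix from the output above: [[TN, FP], [FN, TP]]
cm = np.array([[3268, 367],
               [584, 703]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all customers classified correctly
recall = tp / (tp + fn)              # share of actual churners that were caught
precision = tp / (tp + fp)           # share of predicted churners that really churned
specificity = tn / (tn + fp)         # share of non-churners correctly left alone

print(round(accuracy, 4), round(recall, 2), round(precision, 2), round(specificity, 2))
```

At the 0.5 cut-off the model misses a little under half of the actual churners, which is the kind of trade-off a lower probability threshold is meant to address.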



In [45]:
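# Making predictions on the test set
# NOTE: fit_transform re-fits the scaler on the test data; scaler.transform(...) would
# reuse the training-set mean and standard deviation, which is the usual practice.
X_test[['tenure','MonthlyCharges','TotalCharges']] = scaler.fit_transform(X_test[['tenure','MonthlyCharges','TotalCharges']])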
X_test = X_test[col]
X_test.head()

X_test_sm = sm.add_constant(X_test)
y_test_pred = res.predict(X_test_sm)
y_test_pred[:10]

942     0.473692
3730    0.307858
1761    0.003638
2283    0.615125
1872    0.008976
1970    0.713136
2532    0.347704
1616    0.004648
2485    0.558708
5914    0.112117
dtype: float64

In [46]:
# Converting y_test_pred (a pandas Series) to a dataframe
y_pred_1 = pd.DataFrame(y_test_pred)
y_pred_1.head()

Unnamed: 0,0
942,0.473692
3730,0.307858
1761,0.003638
2283,0.615125
1872,0.008976


In [47]:
# Converting y_test to dataframe
y_test_df = pd.DataFrame(y_test)

In [48]:
# Copying the index into a CustID column
y_test_df['CustID'] = y_test_df.index

In [49]:
# Resetting the index of both dataframes so they can be concatenated side by side
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)

In [50]:
# Appending y_test_df and y_pred_1
y_pred_final = pd.concat([y_test_df, y_pred_1],axis=1)
y_pred_final.head()

Unnamed: 0,Churn,CustID,0
0,0,942,0.473692
1,1,3730,0.307858
2,0,1761,0.003638
3,1,2283,0.615125
4,0,1872,0.008976


In [51]:
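# Rearranging the columns
# NOTE: the probability column is still named 0 here, so reindexing to 'Churn_Prob' yields an
# all-NaN column (see output); rename it first with y_pred_final.rename(columns={0: 'Churn_Prob'}).
y_pred_final = y_pred_final.reindex(['CustID','Churn','Churn_Prob'], axis=1)
# Let's see the head of y_pred_final
y_pred_final.head()
# Classify as churn when Churn_Prob > 0.42
y_pred_final['final_predicted'] = y_pred_final.Churn_Prob.map(lambda x: 1 if x > 0.42 else 0)
y_pred_final.head()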

Unnamed: 0,CustID,Churn,Churn_Prob,final_predicted
0,942,0,,0
1,3730,1,,0
2,1761,0,,0
3,2283,1,,0
4,1872,0,,0


In [52]:
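# Overall accuracy. Because Churn_Prob is all NaN above, every prediction is 0,
# so this equals the share of non-churners in the test set (1528/2110).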
metrics.accuracy_score(y_pred_final.Churn, y_pred_final.final_predicted)

0.7241706161137441
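
Assembling the test-set result frame depends on the probability column carrying the name that later steps expect. The sketch below walks through the same steps on hypothetical values (the labels and probabilities are made up for illustration), renaming the unnamed column `0` to `Churn_Prob` before applying the 0.42 cut-off.

```python
import pandas as pd

# Hypothetical labels and predicted probabilities sharing a customer index
y_test = pd.Series([0, 1, 0, 1], index=[942, 3730, 1761, 2283], name="Churn")
probs = pd.Series([0.47, 0.31, 0.004, 0.62], index=y_test.index)

# Keep the customer index as a column, then align both frames by position
y_test_df = y_test.to_frame()
y_test_df["CustID"] = y_test_df.index
merged = pd.concat(
    [y_test_df.reset_index(drop=True), probs.to_frame().reset_index(drop=True)],
    axis=1,
)

# An unnamed Series becomes a column called 0; give it a real name before selecting it
merged = merged.rename(columns={0: "Churn_Prob"})
merged = merged[["CustID", "Churn", "Churn_Prob"]]

# Apply the probability cut-off
merged["final_predicted"] = (merged["Churn_Prob"] > 0.42).astype(int)
print(merged)
```

Selecting columns with `merged[[...]]` raises a `KeyError` for a missing name, whereas `reindex` silently creates an all-NaN column, so the stricter form surfaces naming mistakes earlier.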