# Lab: Logistic Regression Analysis

For this lab, we will use the CustomerChurn.csv dataset. You can find a copy of the dataset in the GitHub folder. This dataset includes variables related to customer characteristics, as well as a variable indicating whether or not they churned. As discussed in class, the goal of this exercise is to predict whether or not a customer will churn.

In [24]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns

In [25]:
customer = pd.read_csv('Customer-Churn.csv')
customer.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In order to use 'Churn' as a target variable, we need to encode it as 0/1 (or True/False) instead of Yes/No. Use np.where to create a variable called y, which has the value 1 or True whenever 'Churn' is 'Yes', and 0 or False otherwise.

In [26]:
y = np.where(customer['Churn'] == 'Yes', 1, 0)
print(y)


[0 0 1 ... 0 1 0]
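The encoding step can be checked on a small example before trusting it on the full data. The following is a minimal sketch; the `toy` DataFrame is made up for illustration and is not part of the lab dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the 'Churn' column of the lab dataset
toy = pd.DataFrame({'Churn': ['No', 'No', 'Yes', 'No', 'Yes']})

# 1 where Churn is 'Yes', 0 otherwise
y_toy = np.where(toy['Churn'] == 'Yes', 1, 0)

# The mean of a 0/1 vector is the share of churned customers
churn_rate = y_toy.mean()
print(y_toy, churn_rate)
```

Because the target is coded 0/1, its mean is the observed churn rate, which is a useful baseline to keep in mind when reading the fitted probabilities later.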


First, we would like to use 'tenure' as an explanatory variable. Declare this as your variable X, add a constant term and run a logistic regression of 'Churn' on 'tenure'. Interpret the values of the model.

In [27]:
Y = y
X = customer['tenure']
X = sm.add_constant(X)  # adds the intercept column 'const'
model = sm.Logit(Y, X)
results = model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.510569
         Iterations 6


Dep. Variable:,y,No. Observations:,7043.0
Model:,Logit,Df Residuals:,7041.0
Method:,MLE,Df Model:,1.0
Date:,"Thu, 12 Nov 2020",Pseudo R-squ.:,0.1176
Time:,18:40:37,Log-Likelihood:,-3595.9
converged:,True,LL-Null:,-4075.1
Covariance Type:,nonrobust,LLR p-value:,2.106e-210

,coef,std err,z,P>|z|,[0.025,0.975]
const,0.0273,0.042,0.647,0.518,-0.055,0.110
tenure,-0.0388,0.001,-27.586,0.000,-0.042,-0.036


In [28]:
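# Interpretation: the tenure coefficient is negative (-0.0388) and highly significant (p < 0.001),
# so the longer a customer's tenure, the lower the estimated probability of churn.
# The constant is not significantly different from zero (p = 0.518).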
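To see what the coefficients imply on the probability scale, the linear predictor can be passed through the logistic function. This is a minimal sketch that hard-codes the two estimates from the summary table above; the helper name `churn_probability` is made up for illustration.

```python
import math

# Coefficients copied from the summary table above (Churn on tenure)
CONST = 0.0273
B_TENURE = -0.0388

def churn_probability(tenure_months):
    """Logistic function applied to the linear predictor const + b * tenure."""
    z = CONST + B_TENURE * tenure_months
    return 1.0 / (1.0 + math.exp(-z))

# Fitted churn probability for a new customer, after one year, and after six years
for months in (0, 12, 72):
    print(months, round(churn_probability(months), 3))
```

The fitted probability starts near one half for a brand-new customer and falls steadily as tenure grows, which matches the negative sign of the coefficient.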

Next, we would like to add the variable 'SeniorCitizen' to the model. Run a logistic regression of 'Churn' on 'tenure' and 'SeniorCitizen'. Interpret the values of the model.

In [29]:
Y = y
X = customer[['tenure', 'SeniorCitizen']]
X = sm.add_constant(X)
model = sm.Logit(Y, X)
results = model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.496871
         Iterations 6


Dep. Variable:,y,No. Observations:,7043.0
Model:,Logit,Df Residuals:,7040.0
Method:,MLE,Df Model:,2.0
Date:,"Thu, 12 Nov 2020",Pseudo R-squ.:,0.1413
Time:,18:40:38,Log-Likelihood:,-3499.5
converged:,True,LL-Null:,-4075.1
Covariance Type:,nonrobust,LLR p-value:,1.038e-250

,coef,std err,z,P>|z|,[0.025,0.975]
const,-0.1232,0.044,-2.801,0.005,-0.209,-0.037
tenure,-0.0405,0.001,-27.981,0.000,-0.043,-0.038
SeniorCitizen,1.0465,0.075,13.964,0.000,0.900,1.193


In [30]:
# Customers with longer tenure are less likely to have churned.

# The positive SeniorCitizen coefficient means that, all else being equal, senior citizens were more likely
# to have churned than non-senior citizens. No estimate is shown for non-senior citizens because they are the
# reference category (SeniorCitizen = 0): the coefficient already measures the difference between the two groups,
# so a separate coefficient for non-senior citizens would be redundant.
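Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below uses only the estimates and interval bounds printed in the summary table above; the dictionary names are illustrative.

```python
import math

# (coef, lower bound, upper bound) copied from the summary table above
estimates = {
    'tenure': (-0.0405, -0.043, -0.038),
    'SeniorCitizen': (1.0465, 0.900, 1.193),
}

# Exponentiating a logit coefficient gives an odds ratio;
# exponentiating the interval bounds gives its 95% confidence interval.
odds_ratios = {
    name: tuple(math.exp(v) for v in values)
    for name, values in estimates.items()
}

for name, (ratio, lower, upper) in odds_ratios.items():
    print(f"{name}: OR={ratio:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

Under these estimates, the odds of churn for a senior citizen are about 2.8 times those of a non-senior citizen with the same tenure, while each extra month of tenure multiplies the odds by a factor slightly below one.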

Finally, we would like to add the variable 'Contract' to the model. Please inspect the possible values for 'Contract'. What type of variable is it?

In [31]:
# First, count the distinct values in the Contract column.
# It is a categorical variable with three levels.
customer['Contract'].value_counts()

Month-to-month    3875
Two year          1695
One year          1473
Name: Contract, dtype: int64

Please convert Contract to dummy variables, and add it to the matrix of explanatory variables. Then run a logistic regression of 'Churn' on 'tenure', 'SeniorCitizen' and 'Contract'. Interpret the values of the model.

In [32]:
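# 'Month-to-month' is dropped and serves as the reference category.
# dtype=int keeps the dummies numeric (recent pandas versions default to bool).
cont = pd.get_dummies(customer['Contract'], drop_first=True, dtype=int)

X = pd.concat([X, cont], axis=1)
X.head()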


Unnamed: 0,const,tenure,SeniorCitizen,One year,Two year
0,1.0,1,0,0,0
1,1.0,34,0,1,0
2,1.0,2,0,0,0
3,1.0,45,0,1,0
4,1.0,2,0,0,0
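One category has to be left out because a full set of dummies adds up to the constant column, which would make the design matrix perfectly collinear. The following minimal sketch shows the effect of `drop_first=True` on a made-up column; the `toy` DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical toy column with the three contract types seen in value_counts()
toy = pd.DataFrame({'Contract': ['Month-to-month', 'One year', 'Two year', 'One year']})

# drop_first=True removes the first category in sorted order ('Month-to-month'),
# which becomes the reference level; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(toy['Contract'], drop_first=True, dtype=int)
print(dummies)
```

A row with zeros in both remaining columns is a month-to-month customer, so the coefficients on 'One year' and 'Two year' are read relative to that group.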


In [33]:
# X already contains 'const' from the previous model, so add_constant leaves it unchanged.
X = sm.add_constant(X)
results = sm.Logit(y, X).fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.466523
         Iterations 8


Dep. Variable:,y,No. Observations:,7043.0
Model:,Logit,Df Residuals:,7038.0
Method:,MLE,Df Model:,4.0
Date:,"Thu, 12 Nov 2020",Pseudo R-squ.:,0.1937
Time:,18:40:58,Log-Likelihood:,-3285.7
converged:,True,LL-Null:,-4075.1
Covariance Type:,nonrobust,LLR p-value:,0.0

,coef,std err,z,P>|z|,[0.025,0.975]
const,-0.1061,0.045,-2.344,0.019,-0.195,-0.017
tenure,-0.0197,0.002,-11.157,0.000,-0.023,-0.016
SeniorCitizen,0.7413,0.077,9.682,0.000,0.591,0.891
One year,-1.2895,0.097,-13.328,0.000,-1.479,-1.100
Two year,-2.4539,0.162,-15.130,0.000,-2.772,-2.136
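Both contract coefficients are negative and significant, so customers on one-year and two-year contracts are less likely to churn than month-to-month customers, the reference category. As a rough illustration, the sketch below plugs the estimates from the table above into the logistic function for one hypothetical customer profile; the function and dictionary names are made up for this example.

```python
import math

# Coefficients copied from the summary table above
CONST = -0.1061
B_TENURE = -0.0197
B_SENIOR = 0.7413

# 'Month-to-month' is the reference category, so it adds nothing to the linear predictor
contract_effect = {
    'Month-to-month': 0.0,
    'One year': -1.2895,
    'Two year': -2.4539,
}

def churn_probability(tenure, senior, contract):
    """Fitted churn probability for one customer profile."""
    z = CONST + B_TENURE * tenure + B_SENIOR * senior + contract_effect[contract]
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical profile: non-senior customer with 12 months of tenure
for contract in contract_effect:
    print(contract, round(churn_probability(12, 0, contract), 3))
```

Note that the tenure coefficient roughly halves once 'Contract' enters the model, which suggests that part of the tenure effect in the earlier models was picking up the contract type.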


## Bonus Challenge: Feature Selection

Use the above data set on customer churn, and try including and excluding different variables to build the best model. Which criteria can you use for deciding whether a variable is helpful for predicting whether a customer will churn or not?

In [18]:
# https://www.displayr.com/how-to-interpret-logistic-regression-coefficients/


In [34]:
customer.head(1)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No


In [35]:
# One-line version for columns with more than two distinct values (dummies + concat):
# X1 = pd.concat([X, pd.get_dummies(customer['Contract'], drop_first=True)], axis=1)
# For columns with only the values Yes and No, np.where is enough, e.g.:
# y = pd.DataFrame(np.where(customer['Churn'] == 'Yes', 1, 0))

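# The DataFrame is created without a column name, so the Partner indicator is labelled 0.
partner = pd.DataFrame(np.where(customer['Partner'] == 'Yes', 1, 0))

# Create dummies for InternetService and drop the 'No' column, making it the reference category.
internet = pd.get_dummies(customer['InternetService'], dtype=int)
internet.drop(columns='No', inplace=True)

X1 = pd.concat([X, partner, internet], axis=1)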
X1.head()

Unnamed: 0,const,tenure,SeniorCitizen,One year,Two year,0,DSL,Fiber optic
0,1.0,1,0,0,0,1,1,0
1,1.0,34,0,1,0,0,1,0
2,1.0,2,0,0,0,0,1,0
3,1.0,45,0,1,0,0,1,0
4,1.0,2,0,0,0,0,0,1


In [36]:
# The Partner indicator has the integer column label 0; to give it a readable name:
#X1.rename(columns={0: 'partner'}, inplace=True)
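An alternative to renaming afterwards is to build the indicator as a named Series, so the label is readable from the start. This is a minimal sketch on made-up data; `toy` and `design` are hypothetical names, not objects from the lab.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for customer['Partner']
toy = pd.DataFrame({'Partner': ['Yes', 'No', 'No', 'Yes']})

# A named Series keeps a readable label when concatenated into the design matrix
partner = pd.Series(np.where(toy['Partner'] == 'Yes', 1, 0),
                    index=toy.index, name='Partner')

# Stand-in design matrix holding only a constant column
design = pd.concat([pd.DataFrame({'const': 1.0}, index=toy.index), partner], axis=1)
print(design.columns.tolist())
```

Passing the original index keeps the rows aligned when the pieces are concatenated along `axis=1`.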


In [37]:
# X1 already contains the constant column inherited from X, so we fit the model directly.
model = sm.Logit(Y, X1)
results = model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.428982
         Iterations 8


Dep. Variable:,y,No. Observations:,7043.0
Model:,Logit,Df Residuals:,7035.0
Method:,MLE,Df Model:,7.0
Date:,"Thu, 12 Nov 2020",Pseudo R-squ.:,0.2586
Time:,18:42:25,Log-Likelihood:,-3021.3
converged:,True,LL-Null:,-4075.1
Covariance Type:,nonrobust,LLR p-value:,0.0

,coef,std err,z,P>|z|,[0.025,0.975]
const,-1.3559,0.105,-12.869,0.000,-1.562,-1.149
tenure,-0.0318,0.002,-15.802,0.000,-0.036,-0.028
SeniorCitizen,0.3881,0.081,4.796,0.000,0.229,0.547
One year,-0.8073,0.103,-7.869,0.000,-1.008,-0.606
Two year,-1.6279,0.169,-9.642,0.000,-1.959,-1.297
0,-0.0570,0.069,-0.829,0.407,-0.192,0.078
DSL,1.0199,0.118,8.677,0.000,0.789,1.250
Fiber optic,2.1671,0.116,18.629,0.000,1.939,2.395
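Several criteria can help decide whether a variable is worth keeping. One is the p-value of its coefficient: here the Partner indicator (the column labelled 0) has p = 0.407, so it does not add a detectable effect once the other variables are in the model. Others are the pseudo R-squared and information criteria such as AIC and BIC, which reward fit but penalise extra parameters. The sketch below computes AIC and BIC by hand from the log-likelihoods reported in the four summaries above; the model labels are informal, and the parameter counts are taken as Df Model plus one for the constant.

```python
import math

N_OBS = 7043  # No. Observations in every summary above

# (log-likelihood, number of estimated parameters incl. the constant), from the summaries
models = {
    'tenure': (-3595.9, 2),
    'tenure + SeniorCitizen': (-3499.5, 3),
    '+ Contract': (-3285.7, 5),
    '+ Partner + InternetService': (-3021.3, 8),
}

def aic(llf, k):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * k - 2 * llf

def bic(llf, k, n=N_OBS):
    """Bayesian information criterion: k * ln(n) - 2 * log-likelihood."""
    return k * math.log(n) - 2 * llf

for name, (llf, k) in models.items():
    print(f"{name}: AIC={aic(llf, k):.1f}, BIC={bic(llf, k):.1f}")

# Lower is better for both criteria
best = min(models, key=lambda m: aic(*models[m]))
print('Lowest AIC:', best)
```

The fitted statsmodels results object also exposes these values directly as `results.aic` and `results.bic`. A natural next step would be to refit without the Partner indicator and check whether the criteria improve.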
