> # Telco Customer Churn Analysis and Prediction

_Matteo Facchetti_, _Mario Damiano Russo_, _Mirko Frigerio_.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Univariate-and-Bivariate-Analysis" data-toc-modified-id="Univariate-and-Bivariate-Analysis-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Univariate and Bivariate Analysis</a></span><ul class="toc-item"><li><span><a href="#Customer-Churn" data-toc-modified-id="Customer-Churn-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Customer Churn</a></span></li><li><span><a href="#Gender-distribution" data-toc-modified-id="Gender-distribution-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Gender distribution</a></span></li><li><span><a href="#Age-distribution" data-toc-modified-id="Age-distribution-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Age distribution</a></span></li><li><span><a href="#Phone-Service-distribution" data-toc-modified-id="Phone-Service-distribution-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Phone Service distribution</a></span></li><li><span><a href="#Internet-service-distribution" data-toc-modified-id="Internet-service-distribution-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Internet service distribution</a></span></li><li><span><a href="#Tenure-distribution" data-toc-modified-id="Tenure-distribution-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Tenure distribution</a></span></li><li><span><a href="#Contract-distribution" data-toc-modified-id="Contract-distribution-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Contract distribution</a></span></li></ul></li><li><span><a href="#Dealing-with-Missing-Values" data-toc-modified-id="Dealing-with-Missing-Values-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Dealing with Missing Values</a></span><ul class="toc-item"><li><span><a href="#Encoding-the-dummy-variables" data-toc-modified-id="Encoding-the-dummy-variables-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Encoding the dummy variables</a></span></li></ul></li><li><span><a href="#Dealing-with-Outliers" 
data-toc-modified-id="Dealing-with-Outliers-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Dealing with Outliers</a></span></li><li><span><a href="#Logistic-regression" data-toc-modified-id="Logistic-regression-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Logistic regression</a></span></li><li><span><a href="#Resampling-methods" data-toc-modified-id="Resampling-methods-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Resampling methods</a></span></li></ul></div>

In [None]:
import pandas as pd
from IPython.display import display
pd.options.display.max_columns = None
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
import pylab as pl
import scikitplot as skplt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
from matplotlib import cm
import plotly.graph_objs as go
import plotly.offline as py
%matplotlib inline
import os
from PIL import  Image
import io
py.init_notebook_mode(connected=True)
import plotly.tools as tls
import plotly.figure_factory as ff
import matplotlib.ticker as mtick
import scipy

In [None]:
tcc = pd.read_csv(r"../input/WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [None]:
tcc.head()

## Univariate and Bivariate Analysis

### Customer Churn

Let's first have a look at the churn rate.

In [None]:
churn_pie = (tcc['Churn'].value_counts()*100.0 /len(tcc)).plot(kind='pie',\
        labels = ['No', 'Yes'], figsize = (7,7) , colors = ['yellow','blue'])

churn_pie.set_title('Churn rate')
churn_pie.legend(labels=['No','Yes']);

Roughly one customer in four churns.

### Gender distribution

In [None]:
gb = tcc.groupby("gender")["Churn"].value_counts().to_frame().rename({"Churn": "Number of Customers"}, axis = 1).reset_index()
sns.barplot(x = "gender", y = "Number of Customers", data = gb, hue = "Churn", palette = sns.color_palette("hls", 8)).set_title("Gender and relative Churn Rates in our population");

Men and women are evenly distributed in our sample and show roughly the same churn proportions.

### Age distribution

In [None]:
senior = (tcc['SeniorCitizen'].value_counts()*100.0 /len(tcc)).plot(kind='pie',\
        labels = ['No', 'Yes'], figsize = (7,7) , colors = ['yellow','green'], fontsize = 15)

senior.set_title('Seniors')
senior.legend(labels=['Non Senior','Senior']);

In [None]:
gb = tcc.groupby("SeniorCitizen")["Churn"].value_counts().to_frame().rename({"Churn": "Number of Customers"}, axis = 1).reset_index()
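# Relabel only the SeniorCitizen column, so the count column can never be touched by the replacement
gb["SeniorCitizen"] = gb["SeniorCitizen"].replace({0: "Young", 1: "Senior"})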
gb

In [None]:
tp = gb.groupby("SeniorCitizen")["Number of Customers"].sum().to_frame().reset_index().rename({"Number of Customers": "# Customers in Age Group"}, axis = 1)
gb = pd.merge(gb, tp, on = "SeniorCitizen")
gb["Churn Rate in Age Group"] = gb["Number of Customers"]/gb["# Customers in Age Group"]
gb = gb[gb.Churn == "Yes"]

sns.barplot(x = "SeniorCitizen", y = "Churn Rate in Age Group", data = gb).set_title("Churn Rate for Young and Senior customers");

Our sample is mainly composed of younger customers. Senior customers are more prone to churning.
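As a side note, the churn rate per age group can also be obtained in a single grouped mean. The following is a minimal sketch on hypothetical toy data (the `toy` frame and the `churned` column are illustrative names, not part of the notebook); it relies on the fact that the mean of a boolean column is the share of `True` values.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `tcc` frame.
toy = pd.DataFrame({
    "SeniorCitizen": [0, 0, 0, 0, 1, 1, 1, 1],
    "Churn": ["No", "No", "No", "Yes", "No", "Yes", "Yes", "No"],
})

# The mean of a boolean column is the share of True values,
# so this is the churn rate within each age group.
churn_rate = (
    toy.assign(churned=toy["Churn"].eq("Yes"))
       .groupby("SeniorCitizen")["churned"]
       .mean()
       .rename(index={0: "Young", 1: "Senior"})
)
print(churn_rate)
```

On the real data, the same pattern would replace the merge-and-divide steps used above, at the cost of hiding the intermediate counts.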

### Phone Service distribution

In [None]:
phone = (tcc['PhoneService'].value_counts()*100.0 /len(tcc)).plot(kind='bar', stacked = True,\
                                                rot = 0, color = ['red','lightblue'])

phone.yaxis.set_major_formatter(mtick.PercentFormatter())
phone.set_ylabel('Customers')
phone.set_xlabel('Phone Service')
phone.set_title('Phone service distribution');

Only a small share of customers do not have phone service.

### Internet service distribution

In [None]:
internet = (tcc['InternetService'].value_counts()*100.0 /len(tcc)).plot(kind='pie',\
        labels = ['Fiber optic', 'DSL', 'No'], figsize = (7,7) , colors = ['orange','purple', 'black'], fontsize = 15)

internet.set_title('Internet service distribution')
internet.legend(labels=['Fiber optic', 'DSL', 'No']);

Among customers with internet service, DSL and Fiber optic are almost equally common (the share of Fiber optic is slightly greater). Less than one fourth of our sample has no internet service.

### Tenure distribution

In [None]:
plt.hist(tcc.tenure)
plt.xlabel('tenure')
plt.title("Tenure Distribution");

The majority of the customers in our sample are new clients. There is also a large group of customers with a tenure of around 70 months. Most likely the company is no older than 72 months, and either there was a strong incentive to subscribe at launch (such as a competitive launch offer, which would explain the large group through high retention) or there was some form of selection bias (the offers were unique on the market and highly valued by a group of customers, leading to fast market saturation, which would explain the large group through high initial sales volumes at a constant retention rate). These two explanations seem the most plausible for such a sharp kickstart in subscriptions followed by a sudden drop.

### Contract distribution

In [None]:
contract = (tcc['Contract'].value_counts()*100.0 /len(tcc)).plot(kind='bar', stacked = True,\
                                                rot = 0, color = ['orange','blue','magenta'])

contract.yaxis.set_major_formatter(mtick.PercentFormatter())
contract.set_ylabel('Customers')
contract.set_xlabel('Contract')
contract.set_title('Contract distribution');

More than half of the customers have a month-to-month contract.

In [None]:
tcc.columns

## Dealing with Missing Values

In [None]:
# For every column, flag whether it contains at least one NaN and report its dtype
missing_values_table = pd.DataFrame({"Missing?": tcc.isna().any(), "dtype": tcc.dtypes})
missing_values_table

The dtypes are not consistent. There's no point in encoding TotalCharges as a string and MonthlyCharges as a float, or PhoneService as Yes/No and SeniorCitizen as a 0/1 dummy. Let's fix that.

In [None]:
try:
    tcc.TotalCharges.astype("float64")
except ValueError:
    print("We can't convert this column to floats, there must be some non-convertible values")

In [None]:
print(tcc.TotalCharges.value_counts().head())
print("")
print("We have 11 observations that take an empty string value. Let's drop them. The string we want to drop is:")
tcc.TotalCharges.value_counts().index[1]

Let's drop the observations with empty values and reset the index; we should then be able to convert the TotalCharges column to float:
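An alternative that does not depend on knowing the exact blank string is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into `NaN`. This is only a sketch on hypothetical values (`raw`, `total` and `clean` are illustrative names), not the approach taken in the cells below.

```python
import pandas as pd

# Hypothetical values mimicking the raw TotalCharges column.
raw = pd.Series(["29.85", "1889.5", " ", "108.15", " "], name="TotalCharges")

# errors="coerce" turns anything unparseable (here the blank strings) into NaN.
total = pd.to_numeric(raw, errors="coerce")

n_blank = int(total.isna().sum())              # rows that failed to parse
clean = total.dropna().reset_index(drop=True)  # numeric column without the blanks
print(n_blank, clean.dtype)
```

Counting the `NaN` values before dropping them gives a quick check of how many rows are about to be lost.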

In [None]:
tcc.drop(tcc[tcc.TotalCharges == " "].index, axis = 0, inplace = True)
tcc.reset_index(drop = True, inplace = True)

In [None]:
tcc.TotalCharges = tcc.TotalCharges.astype("float64")

Let's run a few last checks before extracting the dummy variables from our dataset and proceeding to the regression part.

In [None]:
for col in tcc.columns:
    print("{0}: {1}".format(col, tcc.loc[:, col].unique()))

In [None]:
fig, axis = plt.subplots(nrows = 2, ncols = 3, figsize = (16, 10))

gb = tcc.groupby("InternetService")["Churn"].value_counts()/len(tcc)
gb = gb.to_frame().rename({"Churn": "% of customers"}, axis = 1).reset_index()
sns.barplot(x = "InternetService", y = "% of customers", data = gb, hue = "Churn", ax = axis[0][0]).set_title("Internet Service and Churn");

gb = tcc.groupby("PhoneService")["Churn"].value_counts()/len(tcc)
gb = gb.to_frame().rename({"Churn": "% of customers"}, axis = 1).reset_index()
sns.barplot(x = "PhoneService", y = "% of customers", data = gb, hue = "Churn", ax = axis[0][1]).set_title("Phone Service and Churn");

gb = tcc.groupby("MultipleLines")["Churn"].value_counts()/len(tcc)
gb = gb.to_frame().rename({"Churn": "% of customers"}, axis = 1).reset_index()
sns.barplot(x = "MultipleLines", y = "% of customers", data = gb, hue = "Churn", ax = axis[0][2]).set_title("Multiple Lines Phone Option and Churn");

gb = tcc.groupby("Contract")["Churn"].value_counts()/len(tcc)
gb = gb.to_frame().rename({"Churn": "% of customers"}, axis = 1).reset_index()
sns.barplot(x = "Contract", y = "% of customers", data = gb, hue = "Churn", ax = axis[1][1]).set_title("Contract Type and Churn");

We notice that customers with Fiber optic churn a lot more than those with DSL or no internet service. Maybe the Internet connection offered is low-quality? (Another option: elderly customers don't need an internet connection. Spoiler: no. The following graph shows that senior customers are proportionally more connected than younger ones and make up only a small share of the population.)

MultipleLines do not seem to affect the churn rate.

Shorter-term contracts are strongly associated with churn. But this is most likely an omitted-variable issue: the more customers trust a provider, the more willing they are to commit to it for the long term.
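Note that the bar charts above divide every count by `len(tcc)`, so each bar is a share of all customers rather than a churn rate within its category. The sketch below, on hypothetical toy data, contrasts the two normalisations with `pd.crosstab`; the frame and variable names are illustrative.

```python
import pandas as pd

# Hypothetical toy data: six month-to-month customers, two on a two-year contract.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 6 + ["Two year"] * 2,
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"],
})

# Share of ALL customers in each (Contract, Churn) cell: what the bars above show.
share_of_all = pd.crosstab(toy["Contract"], toy["Churn"], normalize="all")

# Churn rate WITHIN each contract type: every row sums to 1.
within_group = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")

print(share_of_all)
print(within_group)
```

The within-group version is usually easier to read when the categories have very different sizes, since a large category no longer dominates the chart.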

In [None]:
gb = tcc.groupby("SeniorCitizen")["InternetService"].value_counts()/len(tcc)
gb = gb.to_frame().rename({"InternetService": "% of customers"}, axis = 1).reset_index()
sns.barplot(x = "SeniorCitizen", y = "% of customers", data = gb, hue = "InternetService").set_title("Age Group and Internet Connection");

Now we want to see how the "Additional Internet Services" that follow the variable pattern: ["No", "Yes", "No internet service"] affect the churn rate.

In [None]:
fig, axis = plt.subplots(nrows = 2, ncols = 3, figsize = (16, 10))

gb = tcc.groupby("OnlineSecurity")["Churn"].value_counts()/len(tcc)
gb = gb.to_frame().rename({"Churn": "% of customers"}, axis = 1).reset_index()
sns.barplot(x = "OnlineSecurity", y = "% of customers", data = gb, hue = "Churn", ax = axis[0][0]).set_title("OnlineSecurity Internet Service and Churn")

gb = tcc.groupby("OnlineBackup")["Churn"].value_counts()/len(tcc)
gb = gb.to_frame().rename({"Churn": "% of customers"}, axis = 1).reset_index()
sns.barplot(x = "OnlineBackup", y = "% of customers", data = gb, hue = "Churn", ax = axis[0][1]).set_title("OnlineBackup Internet Service and Churn")

gb = tcc.groupby("DeviceProtection")["Churn"].value_counts()/len(tcc)
gb = gb.to_frame().rename({"Churn": "% of customers"}, axis = 1).reset_index()
sns.barplot(x = "DeviceProtection", y = "% of customers", data = gb, hue = "Churn", ax = axis[0][2]).set_title("DeviceProtection Internet Service and Churn")

gb = tcc.groupby("TechSupport")["Churn"].value_counts()/len(tcc)
gb = gb.to_frame().rename({"Churn": "% of customers"}, axis = 1).reset_index()
sns.barplot(x = "TechSupport", y = "% of customers", data = gb, hue = "Churn", ax = axis[1][0]).set_title("TechSupport Internet Service and Churn")

gb = tcc.groupby("StreamingTV")["Churn"].value_counts()/len(tcc)
gb = gb.to_frame().rename({"Churn": "% of customers"}, axis = 1).reset_index()
sns.barplot(x = "StreamingTV", y = "% of customers", data = gb, hue = "Churn", ax = axis[1][1]).set_title("StreamingTV Internet Service and Churn")

gb = tcc.groupby("StreamingMovies")["Churn"].value_counts()/len(tcc)
gb = gb.to_frame().rename({"Churn": "% of customers"}, axis = 1).reset_index()
sns.barplot(x = "StreamingMovies", y = "% of customers", data = gb, hue = "Churn", ax = axis[1][2]).set_title("StreamingMovies Internet Service and Churn");

- OnlineSecurity, OnlineBackup and TechSupport seem to have a significant impact on lowering churn. If the company wants to lower the churn rate, it may be a good idea to include these services as standard, in the following order of priority: OnlineSecurity, TechSupport, OnlineBackup, DeviceProtection (although removing the internet connection service altogether, or at least the Fiber optic one, may potentially be more beneficial; see the graphs above for details). Although unlikely, it is also possible that these services are accumulated with tenure, so that their effect on churn only reflects the negative impact of tenure on the churn rate; in the next cells we try to figure out whether this is true.

- StreamingTV and StreamingMovies do not seem to have a large enough effect on customer Churn Rate.

To assess whether additional services are accumulated through tenure (e.g. through loyalty programs), we draw an lmplot for each additional Internet service.

In [None]:
gb = tcc[(tcc.OnlineSecurity != "No internet service")].replace(["Yes", "No"], [1, 0]).groupby("tenure")["OnlineSecurity"].sum().to_frame().reset_index()
sns.lmplot(x = "tenure", y = "OnlineSecurity", data = gb, line_kws={'color': 'red'}, lowess = True);
ax = plt.gca()
ax.set_title("Number of OnlineSecurity subscribers per Tenure level");

In [None]:
gb = tcc[(tcc.OnlineSecurity != "No internet service")].replace(["Yes", "No"], [1, 0]).groupby("tenure")["OnlineBackup"].sum().to_frame().reset_index()
sns.lmplot(x = "tenure", y = "OnlineBackup", data = gb, line_kws={'color': 'red'}, lowess = True)
ax = plt.gca()
ax.set_title("Number of OnlineBackup subscribers per Tenure level");

In [None]:
gb = tcc[(tcc.OnlineSecurity != "No internet service")].replace(["Yes", "No"], [1, 0]).groupby("tenure")["DeviceProtection"].sum().to_frame().reset_index()
sns.lmplot(x = "tenure", y = "DeviceProtection", data = gb, line_kws={'color': 'red'}, lowess = True)
ax = plt.gca()
ax.set_title("Number of DeviceProtection subscribers per Tenure level");

In [None]:
gb = tcc[(tcc.OnlineSecurity != "No internet service")].replace(["Yes", "No"], [1, 0]).groupby("tenure")["TechSupport"].sum().to_frame().reset_index()
sns.lmplot(x = "tenure", y = "TechSupport", data = gb, line_kws={'color': 'red'}, lowess = True)
ax = plt.gca()
ax.set_title("Number of TechSupport subscribers per Tenure level");

In [None]:
gb = tcc[(tcc.OnlineSecurity != "No internet service")].replace(["Yes", "No"], [1, 0]).groupby("tenure")["StreamingTV"].sum().to_frame().reset_index()
sns.lmplot(x = "tenure", y = "StreamingTV", data = gb, line_kws={'color': 'red'}, lowess = True)
ax = plt.gca()
ax.set_title("Number of StreamingTV subscribers per Tenure level");

In [None]:
gb = tcc[(tcc.OnlineSecurity != "No internet service")].replace(["Yes", "No"], [1, 0]).groupby("tenure")["StreamingMovies"].sum().to_frame().reset_index()
sns.lmplot(x = "tenure", y = "StreamingMovies", data = gb, line_kws={'color': 'red'}, lowess = True)
ax = plt.gca()
ax.set_title("Number of StreamingMovies subscribers per Tenure level");

The absolute number of subscribers to each Additional Service seems to move in sync with the others as tenure increases. There does not seem to be any significant correlation between the number of active Additional Services and tenure, although customers at the extremes of the tenure range have an extremely high number of Additional Services.

It's odd that so many customers with high tenures have so many additional services. Is it just that there are many customers with maximum tenure, while the percentage of additional services stays the same across tenure levels? Hypothesis: at the beginning, the company had an all-inclusive launch offer. Let's check the percentage of customers that have all these services at each tenure level.
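The per-tenure share of customers holding every service can be sketched as a grouped mean of a boolean flag. The example below uses hypothetical toy data with only two service columns (the notebook works with six), and the names `toy`, `services`, `has_all` and `share_all` are illustrative.

```python
import pandas as pd

# Hypothetical toy data with two tenure levels and two service columns.
toy = pd.DataFrame({
    "tenure":         [1, 1, 1, 1, 72, 72],
    "OnlineSecurity": ["Yes", "No", "No", "Yes", "Yes", "Yes"],
    "StreamingTV":    ["Yes", "No", "Yes", "No", "Yes", "Yes"],
})
services = ["OnlineSecurity", "StreamingTV"]

# True only when every listed service is "Yes" for that customer.
has_all = toy[services].eq("Yes").all(axis=1)

# Mean of a boolean per tenure level = share of customers with all services.
share_all = has_all.groupby(toy["tenure"]).mean()
print(share_all)
```

Because the result is already indexed by tenure, it can be plotted directly without building the index and value lists by hand.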

In [None]:
gb = tcc[(tcc.OnlineSecurity != "No internet service")].replace(["Yes", "No"], [1, 0])
gb["AllServices"] = gb.OnlineSecurity*gb.OnlineBackup*gb.DeviceProtection*gb.TechSupport*gb.StreamingTV*gb.StreamingMovies
sns.lmplot(x = "tenure", y = "AllServices", data = gb, line_kws={'color': 'red'}, lowess = True);
ax = plt.gca()
ax.set_title("Percentage of subscribers to all services per Tenure level");

In [None]:
tvc = gb.tenure.value_counts()
i = []
v = []
for tenure in tvc.index:
    i.append(tenure)
    v.append(len(gb[(gb.tenure == tenure) & (gb.AllServices == 1)])/len(gb[gb.tenure == tenure]))

In [None]:
df = pd.DataFrame(data = v, index = i, columns = ["%AllServices"]).reset_index().sort_values("index").reset_index(drop = True).rename({"index": "tenure"}, axis = 1)
sns.lmplot(x = "tenure", y = "%AllServices", data = df, line_kws={'color': 'red'}, lowess = True)
ax = plt.gca()
ax.set_title("Percentage of Customers with all Additional Services Active per Tenure level");

In [None]:
plt.plot(df.tenure, df["%AllServices"]);
ax = plt.gca()
ax.set_title("Trend in percentage of customers subscribed to all services for each tenure level");

<a id='HERE'></a>

Indeed, it seems that the customers who subscribed first have many additional services. Possible explanations:
- Launch offer: all additional services included forever at a discounted price.
- Selection bias: the first customers are the ones who most appreciate the services offered by the company.

In either case, we can dismiss the hypothesis that additional services are accumulated through tenure, for two reasons:
- There is a sharp spike in the percentage and number of users with all the services around tenure = 70. At the same time, the percentage of users with all the services grows steadily with tenure, while the absolute number of subscribers to the individual services stays pretty much constant across tenure levels. This means that the increase in percentage is best explained by a convenient all-inclusive launch offer, which accounts for the large number of active services among customers with extremely high tenure. [This regression](#another_cell) confirms our result;
- if benefits were accumulated, the drop at the 69th tenure value would be hard to justify, whereas it could be justified by a change in the offer or a decrease in interest towards the company.

### Encoding the dummy variables

In [None]:
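# Drop customerID (first column) and one-hot encode the categorical columns as 0/1 integers
tcc = pd.get_dummies(tcc.iloc[:, 1 :], dtype = int)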
tcc.head()

In [None]:
tcc.dtypes

Let's have a look at the variables.

In [None]:
for col in tcc.columns:
    print("{0}: {1}".format(col, tcc.loc[:, col].unique()))

## Dealing with Outliers

In [None]:
tcc.head()

In [None]:
sns.boxplot(x = tcc.MonthlyCharges).set_title("Monthly Charges Box Plot");

In [None]:
sns.distplot(tcc.MonthlyCharges).set_title("Monthly Charges Distribution");

In [None]:
sns.boxplot(x = tcc.TotalCharges).set_title("Total Charges Box Plot");

In [None]:
sns.distplot(tcc.TotalCharges).set_title("Total Charges Distribution");

In [None]:
tcc.columns

In [None]:
sns.lmplot(x = "TotalCharges", y = "Churn_Yes", data = tcc, line_kws={'color': 'red'}, lowess = True, height = 5)
plt.suptitle("Chances of Churn relative to Total Charges");

In [None]:
sns.lmplot(x = "MonthlyCharges", y = "Churn_Yes", data = tcc, line_kws={'color': 'red'}, lowess = True, height = 5)
plt.suptitle("Chances of Churn relative to Monthly Charges");

In [None]:
sns.lmplot(x = "tenure", y = "Churn_Yes", data = tcc, line_kws={'color': 'red'}, lowess = True, height = 5)
plt.suptitle("Chances of Churn relative to Tenure");

## Logistic regression

We want to build a predictive model using _Churn_ as our dependent variable. First we try to run the regression by including all the variables.

In [None]:
tcc.columns

Most of the variables differentiate between "No" and "No internet service". Given that the information about "Internet Service" or "No internet service" is already provided by the variable _InternetService_, we can just analyze the impact of having a service that implies having Internet Service versus not having it, without considering that a person could have for example no OnlineSecurity due to the fact that they do not have Internet Service.

We will also exclude _TotalCharges_ from our model, since it is likely to be correlated with _MonthlyCharges_ (we are going to test this hypothesis by calculating the Pearson correlation coefficient).

In [None]:
# Pearson correlation coefficient (computed once, then unpacked)
from scipy import stats
coefficient, p_value = stats.pearsonr(tcc["MonthlyCharges"], tcc["TotalCharges"])
print("Coefficient:", coefficient)
print("p-value:", p_value)

The two variables are highly correlated.

In [None]:
tcc.head()

Let's run a first regression including all the variables. We will then progressively improve our model.

In [None]:
# Intercept
tcc["intercept"] = 1.0

variables = tcc.copy()[['SeniorCitizen', 'tenure', 'MonthlyCharges', 
       'gender_Female', 'Partner_Yes', 'PhoneService_Yes',
        'Dependents_Yes', 'MultipleLines_Yes',
       'InternetService_DSL', 'InternetService_Fiber optic', 'OnlineSecurity_Yes',
        'OnlineBackup_Yes', 'DeviceProtection_Yes', 'TechSupport_Yes', 'StreamingTV_Yes',
       'StreamingMovies_Yes', 'Contract_One year', 'Contract_Two year', 'PaperlessBilling_Yes',
       'PaymentMethod_Bank transfer (automatic)', 'PaymentMethod_Credit card (automatic)',
       'PaymentMethod_Electronic check', 'intercept']]

# Setting the model
logistical_regression = sm.Logit(tcc["Churn_Yes"], variables)

# Fitting the model
fitted_model = logistical_regression.fit()
fitted_model.summary2()

To improve our results we can first work on _PaymentMethod_. We will transform it in order to analyze the difference between automatic and non-automatic payment methods. Clients who pay automatically are less likely to churn than clients who do not. We are not interested in the difference between Bank transfer and Credit card, or between Electronic check and Mailed check.
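The same automatic/non-automatic flag could also be derived from the raw string column, before one-hot encoding. This is a hypothetical alternative shown on the four payment labels that appear in the dummy column names; the notebook itself sums two dummy columns instead.

```python
import pandas as pd

# The four payment labels, taken from the dummy column names used in the regressions.
payment = pd.Series([
    "Bank transfer (automatic)",
    "Credit card (automatic)",
    "Electronic check",
    "Mailed check",
], name="PaymentMethod")

# regex=False so that the parentheses are matched literally.
is_automatic = payment.str.contains("(automatic)", regex=False).astype(int)
print(is_automatic.tolist())
```

Working from the raw labels avoids depending on the exact names that `pd.get_dummies` generates.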

In [None]:
# Transforming PaymentMethod
tcc["PaymentMethod_Automatic"] = tcc["PaymentMethod_Bank transfer (automatic)"] + tcc["PaymentMethod_Credit card (automatic)"]

In [None]:
variables = tcc[['SeniorCitizen', 'tenure', 'MonthlyCharges', 
       'gender_Female', 'Partner_Yes',
        'Dependents_Yes', 'MultipleLines_Yes',
       'InternetService_DSL', 'InternetService_Fiber optic', 'OnlineSecurity_Yes',
        'OnlineBackup_Yes', 'DeviceProtection_Yes', 'TechSupport_Yes', 'StreamingTV_Yes',
       'StreamingMovies_Yes', 'Contract_One year', 'Contract_Two year', 'PaperlessBilling_Yes',
       'PaymentMethod_Automatic', 'intercept']]

# Setting the model
logistical_regression = sm.Logit(tcc["Churn_Yes"], variables)

# Fitting the model
fitted_model = logistical_regression.fit()
fitted_model.summary2()

We will remove _OnlineBackup_, _DeviceProtection_, _gender_ and _Partner_ from our model, as they are not significant.

In [None]:
variables = tcc[['SeniorCitizen', 'tenure', 'MonthlyCharges', 
        'Dependents_Yes', 'MultipleLines_Yes',
       'InternetService_DSL', 'InternetService_Fiber optic', 'OnlineSecurity_Yes',
        'TechSupport_Yes', 'StreamingTV_Yes',
       'StreamingMovies_Yes', 'Contract_One year', 'Contract_Two year', 'PaperlessBilling_Yes',
       'PaymentMethod_Automatic', 'intercept']]

# Setting the model
logistical_regression = sm.Logit(tcc["Churn_Yes"], variables)

# Fitting the model
fitted_model = logistical_regression.fit()
fitted_model.summary2()

To improve the interpretability of our regression, instead of treating _tenure_ as a continuous variable we can divide it into 4 bins.
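One compact way to build such bins is `pd.cut` followed by `pd.get_dummies`. The sketch below uses hypothetical tenure values and bin edges chosen to match the intervals 0 to 18, 19 to 36, 37 to 54 and 55 to 72; `edges`, `labels`, `bins` and `dummies` are illustrative names.

```python
import pandas as pd

# Hypothetical tenure values covering every bin boundary.
tenure = pd.Series([0, 1, 18, 19, 36, 37, 54, 55, 72], name="tenure")

# pd.cut uses right-closed intervals (a, b]; the first edge is -1 so tenure 0 is included.
edges = [-1, 18, 36, 54, 72]
labels = ["tenure_0:18", "tenure_19:36", "tenure_37:54", "tenure_55:72"]
bins = pd.cut(tenure, bins=edges, labels=labels)

# One 0/1 indicator column per bin.
dummies = pd.get_dummies(bins, dtype=int)
print(dummies)
```

Each customer falls into exactly one bin, so one of the four columns has to be left out of a regression that also includes an intercept.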

In [None]:
tcc["tenure"].describe()

In [None]:
tcc["tenure_0:18"]  = 0
tcc["tenure_19:36"] = 0
tcc["tenure_37:54"] = 0
tcc["tenure_55:72"] = 0

tcc.loc[tcc.tenure <= 18, "tenure_0:18"] = 1
tcc.loc[((tcc.tenure >= 19) & (tcc.tenure <= 36)), "tenure_19:36"] = 1
tcc.loc[((tcc.tenure >= 37) & (tcc.tenure <= 54)), "tenure_37:54"] = 1
tcc.loc[tcc.tenure >= 55, "tenure_55:72"] = 1

In [None]:
tcc.head()

In [None]:
tcc.columns

Let's run a new regression.

In [None]:
variables = tcc[['SeniorCitizen', 'MonthlyCharges', 
        'Dependents_Yes', 'MultipleLines_Yes',
       'InternetService_DSL', 'InternetService_Fiber optic', 'OnlineSecurity_Yes',
        'TechSupport_Yes', 'StreamingTV_Yes',
       'StreamingMovies_Yes', 'Contract_One year', 'Contract_Two year', 'PaperlessBilling_Yes',
       'PaymentMethod_Automatic', 'tenure_19:36',
       'tenure_37:54', 'tenure_55:72','intercept']]

# Setting the model
logistical_regression = sm.Logit(tcc["Churn_Yes"], variables)

# Fitting the model
fitted_model = logistical_regression.fit()
fitted_model.summary2()

Let's calculate the VIF to see if there is multicollinearity among our variables.
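For reference, the VIF of a regressor is 1 / (1 - R²), where R² comes from regressing that regressor on all the others. The sketch below computes it by hand with NumPy on synthetic data (the variables `x1`, `x2`, `x3` and the helper `vif` are hypothetical), to show what a high value means; it is not the statsmodels implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly a copy of x1, so strongly collinear
x3 = rng.normal(size=n)              # unrelated to the other two
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns plus a constant."""
    y = X[:, j]
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print(vifs)
```

The two collinear columns get a very large VIF, while the independent one stays close to 1.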

In [None]:
vif = pd.DataFrame()
vif["Variables"]  = variables.columns[0:-1]
vif["VIF Factor"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1]-1)]
vif

The two _InternetService_ variables have a high VIF, along with _MonthlyCharges_. A possible explanation is that customers who have the Fiber optic connection pay a different price from those who have a DSL connection. For this reason, we are going to exclude _MonthlyCharges_ from our model, and then re-run our VIF analysis. This time around, we expect a low VIF for both _InternetService_ variables.

We are also going to re-insert in our regression _OnlineBackup_ and _DeviceProtection_, which we removed earlier on, as they might have been affected by multicollinearity.

In [None]:
variables = tcc[['SeniorCitizen',
        'Dependents_Yes', 'MultipleLines_Yes',
         'InternetService_DSL', 'InternetService_Fiber optic', 'OnlineSecurity_Yes',
        'TechSupport_Yes', "OnlineBackup_Yes", "DeviceProtection_Yes", 'StreamingTV_Yes',
       'StreamingMovies_Yes', 'Contract_One year', 'Contract_Two year', 'PaperlessBilling_Yes',
       'PaymentMethod_Automatic', 'tenure_19:36',
       'tenure_37:54', 'tenure_55:72','intercept']]

vif = pd.DataFrame()
vif["Variables"]  = variables.columns[0:-1]
vif["VIF Factor"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1]-1)]
vif

And, in fact, we are right: the VIF values suggest that _MonthlyCharges_ depends heavily on _InternetService_.

What we need to do now is just to run our logistic regression as before, this time without _MonthlyCharges_.

In [None]:
# Setting the model
logistical_regression = sm.Logit(tcc["Churn_Yes"], variables)

# Fitting the model
fitted_model = logistical_regression.fit()
fitted_model.summary2()

_OnlineBackup_ is now significant, whereas _DeviceProtection_ remains insignificant. We are going to remove the latter from the model.

In [None]:
variables = tcc[['SeniorCitizen',
        'Dependents_Yes', 'MultipleLines_Yes',
         'InternetService_DSL', 'InternetService_Fiber optic', 'OnlineSecurity_Yes',
        'TechSupport_Yes', "OnlineBackup_Yes", 'StreamingTV_Yes',
       'StreamingMovies_Yes', 'Contract_One year', 'Contract_Two year', 'PaperlessBilling_Yes',
       'PaymentMethod_Automatic', 'tenure_19:36',
       'tenure_37:54', 'tenure_55:72','intercept']]

# Setting the model
logistical_regression = sm.Logit(tcc["Churn_Yes"], variables)

# Fitting the model
fitted_model = logistical_regression.fit()
fitted_model.summary2()

This is our final model. Let's get the marginal effect of our variables in order to be able to easily interpret them.
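In a logit model the derivative of the predicted probability with respect to a regressor is p(1 - p) times its coefficient, and an average marginal effect is that quantity averaged over the observations. The sketch below illustrates the formula with NumPy on synthetic data and hypothetical coefficients (`beta` is made up for illustration, not taken from the fitted model), and checks it against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(size=200), np.ones(200)])  # one regressor plus an intercept
beta = np.array([0.8, -1.0])                               # hypothetical coefficients

def predict(X, beta):
    """Logistic predicted probabilities."""
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

p = predict(X, beta)

# Logit derivative dp/dx = p * (1 - p) * beta, averaged over all observations.
ame = np.mean(p * (1.0 - p)) * beta[0]

# Numerical check: nudge the regressor and average the change in predicted probability.
eps = 1e-6
X_up = X.copy()
X_up[:, 0] += eps
ame_numeric = np.mean((predict(X_up, beta) - p) / eps)
print(ame, ame_numeric)
```

Since p(1 - p) never exceeds 0.25, an average marginal effect is always smaller in magnitude than a quarter of the coefficient.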

In [None]:
margeff = fitted_model.get_margeff()
margeff.summary()

The results are largely consistent with our plots:
- Both _InternetService_ variables have a positive impact on the churn rate, with the Fiber optic effect being almost twice that of DSL. It might be a good idea to consider discontinuing the Fiber optic service at least, or improving it.
- Senior customers tend to churn more easily.
- Additional Internet Services (_OnlineSecurity_, _TechSupport_, _OnlineBackup_) negatively affect the Churn Rate and are therefore a potential lever for decreasing it in a managerial setting. _DeviceProtection_, on the other hand, is inconsistent with our plots: it appears that its effect is largely explained by the other variables of our model. _StreamingMovies_ and _StreamingTV_, contrary to expectations, are significant and positively affect the Churn Rate: the management might consider no longer offering those services.

Let's measure the goodness of our model by building the confusion matrix. We are going to plot the Kolmogorov–Smirnov statistic in order to set the threshold.
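For a binary classifier, the Kolmogorov–Smirnov statistic is the largest gap between the cumulative score distributions of the two classes, and the score at which that gap occurs is a natural threshold candidate. The following is a hand-rolled sketch on hypothetical scores; the plotting library may handle ties and interval conventions differently.

```python
import numpy as np

# Hypothetical labels and predicted churn probabilities.
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.60, 0.70, 0.25, 0.90])

thresholds = np.sort(np.unique(scores))

# Empirical CDF of the scores within each class, evaluated at every threshold.
cdf_neg = np.array([(scores[y == 0] <= t).mean() for t in thresholds])
cdf_pos = np.array([(scores[y == 1] <= t).mean() for t in thresholds])

gaps = np.abs(cdf_neg - cdf_pos)
ks_stat = gaps.max()                      # largest separation between the two classes
ks_threshold = thresholds[gaps.argmax()]  # score at which that separation occurs
print(ks_stat, ks_threshold)
```

In this toy example the classes are best separated at a score of 0.25, where five of the six non-churners but none of the churners lie at or below the threshold.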

In [None]:
# Compute the predicted probability
pred = np.array([ 1 - fitted_model.predict(), fitted_model.predict() ])

skplt.metrics.plot_ks_statistic(tcc["Churn_Yes"], pred.T, figsize=(5, 5));

0.274 is going to be our threshold. We begin the solution of the business case by considering as potential churners all the observations with a predicted probability of churning greater than 0.274.
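Applying the threshold and tabulating the outcome could look like the sketch below. The labels and probabilities are hypothetical; on the real data, `y_true` would be `tcc["Churn_Yes"]` and `proba` the output of `fitted_model.predict()`.

```python
import numpy as np

# Hypothetical true labels and predicted churn probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
proba = np.array([0.10, 0.30, 0.80, 0.20, 0.05, 0.60, 0.50, 0.27])
threshold = 0.274

# Flag as potential churners everyone above the threshold.
y_pred = (proba > threshold).astype(int)

tp = int(((y_pred == 1) & (y_true == 1)).sum())  # churners correctly flagged
fp = int(((y_pred == 1) & (y_true == 0)).sum())  # loyal customers flagged by mistake
fn = int(((y_pred == 0) & (y_true == 1)).sum())  # churners that were missed
tn = int(((y_pred == 0) & (y_true == 0)).sum())  # loyal customers correctly left alone

# Rows: actual class (0, 1); columns: predicted class (0, 1).
confusion = np.array([[tn, fp], [fn, tp]])
recall = tp / (tp + fn)
precision = tp / (tp + fp)
print(confusion, recall, precision)
```

A threshold this low trades precision for recall, which may be reasonable when missing a churner costs more than contacting a loyal customer.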

In [None]:
# Lift Chart
skplt.metrics.plot_lift_curve(tcc["Churn_Yes"], pred.T, figsize=(5, 5));

In [None]:
# Gain Chart
skplt.metrics.plot_cumulative_gain(tcc["Churn_Yes"], pred.T, figsize=(5, 5));

## Resampling methods

As a final part of our work, we are going to perform a nonparametric bootstrap to assess the precision of our estimates.
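A percentile bootstrap of a mean can be sketched in vectorised form as follows. The data are synthetic and the seed and number of replicates are arbitrary choices. Note that resampling predicted probabilities, as done below in the notebook, captures the sampling variability of their average and not the uncertainty in the fitted coefficients.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.uniform(0.0, 1.0, size=300)   # hypothetical stand-in for predicted probabilities

n_boot = 2000
# Each row of idx is one resample of the data, drawn with replacement.
idx = rng.integers(0, len(data), size=(n_boot, len(data)))
boot_means = data[idx].mean(axis=1)

# 95% percentile confidence interval for the mean.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(ci_low, data.mean(), ci_high)
```

Drawing all indices at once avoids the Python-level loop and makes the result reproducible through the seeded generator.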

In [None]:
churn = fitted_model.predict()

In [None]:
def bootstrap_replicate(data, function):
    bs_sample = np.random.choice(data, len(data))
    return function(bs_sample)

bs_results = np.empty(10000)

for i in range(10000):
    bs_results[i] = bootstrap_replicate(churn, np.mean)

_ = plt.hist(bs_results, bins = 30, density = True)
_ = plt.xlabel("Mean of Predicted Churners")
_ = plt.ylabel("Empirical Probability Density Function")
plt.show()

In [None]:
print("Bootstrap confidence interval:", np.percentile(bs_results, [2.5, 97.5]))
print("Bootstrap mean:", bs_results.mean())

Let's compare the estimates of our prediction with the mean of the sample.

In [None]:
print("Sample mean:", tcc.Churn_Yes.mean())

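Our mean prediction is almost identical to the actual churn rate of the sample. This is expected: for a logistic regression with an intercept fitted by maximum likelihood, the average predicted probability matches the observed share of positives in the estimation sample.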

<a id='another_cell'></a>

# Appendix

This regression aims to assess the association between additional internet services and _Churn_ for low values of tenure as well. We want to support the claim that the impact of additional services on _Churn_ is causal and not due to an omitted variable or selection bias. In other words, additional services are not simply given as a gift to clients with a higher tenure; they have an actual impact on churn.

In [None]:
# Keep only customers in the two lowest tenure bins (tenure <= 36)
low_tenure = tcc[tcc["tenure_0:18"]+tcc["tenure_19:36"]==1]

# The 'tenure_37:54' and 'tenure_55:72' dummies are all zero in this subsample,
# so they are left out to avoid a singular design matrix
variables = low_tenure[['SeniorCitizen',
        'Dependents_Yes', 'MultipleLines_Yes',
         'InternetService_DSL', 'InternetService_Fiber optic', 'OnlineSecurity_Yes',
        'TechSupport_Yes', "OnlineBackup_Yes", 'StreamingTV_Yes',
       'StreamingMovies_Yes', 'Contract_One year', 'Contract_Two year', 'PaperlessBilling_Yes',
       'PaymentMethod_Automatic', 'tenure_19:36', 'intercept']]

# Setting the model
logistical_regression_low = sm.Logit(low_tenure["Churn_Yes"], variables)

# Fitting the model on the low-tenure subsample (not the full-sample model)
fitted_model_low = logistical_regression_low.fit()
fitted_model_low.summary2()

Click [here](#HERE) to keep reading.