# Predicting Customer Churn

Idea from [Yhat](http://blog.yhat.com/posts/predicting-customer-churn-with-sklearn.html), then modified by Hassen Taidirt to include business considerations.

***

In this notebook, I'll try to predict customer churn from a business perspective, with the aim of reducing churn costs. I started with this blog post from [Yhat](http://blog.yhat.com/posts/predicting-customer-churn-with-sklearn.html), then modified their study because they didn't include business considerations in their predictions.

The goal is not only to predict which customers are most likely to churn. _Why_ do we want to predict customer churn? To reduce loss. There is an implicit business goal here: the predictive model doesn't only need to predict which customers are at risk, it also needs to reduce the company's loss due to churn. That's what I do in this notebook, and it is not covered in the original Yhat blog post (the post does mention this business perspective, but provides no additional details).
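To make the business goal concrete, here is a minimal sketch of how the cost of a set of predictions could be computed. The cost figures (`COST_MISSED_CHURN`, `COST_RETENTION_OFFER`) and the function name `churn_cost` are hypothetical placeholders introduced for illustration, not values from the telecom dataset, and the sketch assumes that every flagged customer accepts the retention offer and stays.

```python
# Hypothetical unit costs, for illustration only (not taken from the dataset).
COST_MISSED_CHURN = 100.0    # loss when a churner is not flagged and leaves
COST_RETENTION_OFFER = 20.0  # cost of an offer sent to any flagged customer


def churn_cost(y_true, y_pred):
    """Total cost of a set of predictions (1 = churn, 0 = stay).

    Assumes a flagged churner always accepts the offer and stays.
    """
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            # Flagged customers get an offer, whether or not they would churn.
            total += COST_RETENTION_OFFER
        elif actual == 1:
            # Unflagged churners are lost.
            total += COST_MISSED_CHURN
    return total


y_true = [1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0]
print(churn_cost(y_true, y_pred))
```

Under a cost function of this kind, two models with the same accuracy can produce very different costs, because a missed churner and a wasted offer are not priced equally.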

I will load the customer database of a telecom company and preprocess it. Then I will discuss some modeling considerations that are specific to this database. Next, I will build a few models and tweak them to find the predictor that best reduces costs. Finally, a conclusion summarizing the lessons learned will close this notebook.

Here we go!

_Note:_ I'm not going to define business terms unless they are useful for the study. Doing so is outside the scope of this study, whose goal is to show how to apply predictive modeling in a business context. Please refer to Wikipedia/Google if needed, or just email me at: htaidirt [at] gmail [dot] com

***

## Step 0: Preparation and data load

Nothing exciting here: only loading the necessary Python modules and the data from Dropbox.

In [None]:
%matplotlib inline

import numpy as np

import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)

import seaborn as sns
sns.set_style('whitegrid')
sns.set_context('poster')

from matplotlib import rcParams

In [118]:
# Source: https://dl.dropboxusercontent.com/u/75194/churn.csv
data_original = pd.read_csv('./data/data_original.csv')

In [119]:
data_original.head()

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.


In [72]:
data_original.tail()

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
3328,AZ,192,415,414-4276,no,yes,36,156.2,77,26.55,215.5,126,18.32,279.1,83,12.56,9.9,6,2.67,2,False.
3329,WV,68,415,370-3271,no,no,0,231.1,57,39.29,153.4,55,13.04,191.3,123,8.61,9.6,4,2.59,3,False.
3330,RI,28,510,328-8230,no,no,0,180.8,109,30.74,288.8,58,24.55,191.9,91,8.64,14.1,6,3.81,2,False.
3331,CT,184,510,364-6381,yes,no,0,213.8,105,36.35,159.6,84,13.57,139.2,137,6.26,5.0,10,1.35,2,False.
3332,TN,74,415,400-4344,no,yes,25,234.4,113,39.85,265.9,82,22.6,241.4,77,10.86,13.7,4,3.7,0,False.


Finally, we load the necessary `scikit-learn` modules for our experiments.

In [73]:
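# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split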

***

## Step 1: Loading and pre-processing the data

An obvious step! The data is already loaded, so I will now inspect it and preprocess it before building the models.

In [27]:
data_original.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3333 entries, 0 to 3332
Data columns (total 21 columns):
State             3333 non-null object
Account Length    3333 non-null int64
Area Code         3333 non-null int64
Phone             3333 non-null object
Int'l Plan        3333 non-null object
VMail Plan        3333 non-null object
VMail Message     3333 non-null int64
Day Mins          3333 non-null float64
Day Calls         3333 non-null int64
Day Charge        3333 non-null float64
Eve Mins          3333 non-null float64
Eve Calls         3333 non-null int64
Eve Charge        3333 non-null float64
Night Mins        3333 non-null float64
Night Calls       3333 non-null int64
Night Charge      3333 non-null float64
Intl Mins         3333 non-null float64
Intl Calls        3333 non-null int64
Intl Charge       3333 non-null float64
CustServ Calls    3333 non-null int64
Churn?            3333 non-null object
dtypes: float64(8), int64(8), object(5)
memory usage: 572.9+ KB


As we can see, we have 3333 customers and 21 attributes. There are no missing values, which will greatly help our predictions. We also have several categorical attributes that need to be converted into numerical values (0/1 flags or dummy variables).
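Both observations can also be checked programmatically rather than read off the `info()` output. The sketch below uses a small made-up frame (`sample`) in place of `data_original`, so the values shown are illustrative only.

```python
import pandas as pd

# Tiny stand-in for data_original; the rows are made up for illustration.
sample = pd.DataFrame({
    "State": ["KS", "OH", "NJ"],
    "Int'l Plan": ["no", "no", "yes"],
    "Day Mins": [265.1, 161.6, 243.4],
    "Churn?": ["False.", "False.", "True."],
})

# Total number of missing cells across the whole frame.
n_missing = int(sample.isnull().sum().sum())

# Columns holding only strings are the candidates for 0/1 or dummy encoding.
string_columns = [
    col for col in sample.columns
    if sample[col].map(lambda value: isinstance(value, str)).all()
]

print(n_missing)
print(string_columns)
```

Note that a check like this only finds string-typed columns. An attribute such as **Area Code** is stored as an integer but is still categorical, so it has to be identified by hand.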

In [28]:
# Work on a copy so that data_original stays untouched
data = data_original.copy()

***

### Remove unnecessary attributes

The **Phone** number is not useful for making predictions. Let's remove it:

In [48]:
data = data.drop('Phone', axis = 1)

***

### Convert boolean attributes to 0/1

The attributes **Int'l Plan**, **VMail Plan** and **Churn?** are booleans expressed as strings (`yes`/`no`, and `True.`/`False.` with a trailing period). I will convert them to 0/1 values with a comparison followed by an integer cast:

In [54]:
data["Int'l Plan"] = (data_original["Int'l Plan"] == 'yes').astype(int)
data["VMail Plan"] = (data_original["VMail Plan"] == 'yes').astype(int)
data["Churn?"] = (data_original["Churn?"] == 'True.').astype(int)

In [55]:
data.head()

Unnamed: 0,State,Account Length,Area Code,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,128,415,0,1,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,0
1,OH,107,415,0,1,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,0
2,NJ,137,415,0,0,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,0
3,OH,84,408,1,0,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,0
4,OK,75,415,1,0,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,0
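
One caveat of the `== 'yes'` comparison is that any unexpected spelling (for example `'Yes'` or a value with a trailing space) is silently mapped to 0. Here is a minimal sketch of a stricter conversion, using a hypothetical helper `to_binary` and made-up series; it is an optional safeguard, not something the conversion above requires.

```python
import pandas as pd


def to_binary(series, true_value, false_value):
    """Map a two-valued string column to 0/1, failing on unexpected values.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    unexpected = set(series.unique()) - {true_value, false_value}
    if unexpected:
        raise ValueError("Unexpected values: %s" % sorted(unexpected))
    return (series == true_value).astype(int)


# Made-up values in the same format as the churn data.
plan = pd.Series(["no", "yes", "no"])
churn = pd.Series(["False.", "True.", "False."])

print(to_binary(plan, "yes", "no").tolist())
print(to_binary(churn, "True.", "False.").tolist())
```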


***

### Convert categorical attributes to dummy variables

**State** and **Area Code** are categorical attributes. We can see how many unique values are included:

In [57]:
# How many different values are in the State and Area Code categorical attributes
print(str(data['State'].nunique()) + " unique States")
print(str(data['Area Code'].nunique()) + " unique Area Codes")

51 unique States
3 unique Area Codes


In [60]:
# Generate the dummy variables
dummy_states = pd.get_dummies(data['State'], prefix = 'State')
dummy_area_codes = pd.get_dummies(data['Area Code'], prefix = 'Area_Code')

In [63]:
# Join the new dummy variables to the data
data = data.join(dummy_states)
data = data.join(dummy_area_codes)

In [67]:
# Drop the old categorical attributes
data = data.drop('State', axis = 1)
data = data.drop('Area Code', axis = 1)

In [68]:
data.head()

Unnamed: 0,Account Length,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?,State_AK,State_AL,State_AR,State_AZ,State_CA,State_CO,State_CT,State_DC,State_DE,State_FL,State_GA,State_HI,State_IA,State_ID,State_IL,State_IN,State_KS,State_KY,State_LA,State_MA,State_MD,State_ME,State_MI,State_MN,State_MO,State_MS,State_MT,State_NC,State_ND,State_NE,State_NH,State_NJ,State_NM,State_NV,State_NY,State_OH,State_OK,State_OR,State_PA,State_RI,State_SC,State_SD,State_TN,State_TX,State_UT,State_VA,State_VT,State_WA,State_WI,State_WV,State_WY,Area_Code_408,Area_Code_415,Area_Code_510
0,128,0,1,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
1,107,0,1,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
2,137,0,0,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
3,84,1,0,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,75,1,0,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
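
The three cells above (generate, join, drop) can also be collapsed into a single `pd.get_dummies` call using its `columns` argument, which replaces the listed columns with their dummies. This is a sketch on a made-up frame; `dtype=int` is passed because recent pandas versions otherwise return boolean dummies, and the resulting column order may differ from the one shown above.

```python
import pandas as pd

# Made-up rows standing in for the churn data.
sample = pd.DataFrame({
    "State": ["KS", "OH", "OH"],
    "Area Code": [415, 415, 408],
    "Day Mins": [265.1, 161.6, 299.4],
})

encoded = pd.get_dummies(
    sample,
    columns=["State", "Area Code"],  # columns to replace with dummies
    prefix=["State", "Area_Code"],   # same prefixes as in the cells above
    dtype=int,                       # 0/1 integers instead of booleans
)

print(encoded.columns.tolist())
```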


***

Now we should have only numerical values for all our attributes:

In [70]:
# info() prints its report itself and returns None, so no print is needed
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3333 entries, 0 to 3332
Data columns (total 72 columns):
Account Length    3333 non-null int64
Int'l Plan        3333 non-null int64
VMail Plan        3333 non-null int64
VMail Message     3333 non-null int64
Day Mins          3333 non-null float64
Day Calls         3333 non-null int64
Day Charge        3333 non-null float64
Eve Mins          3333 non-null float64
Eve Calls         3333 non-null int64
Eve Charge        3333 non-null float64
Night Mins        3333 non-null float64
Night Calls       3333 non-null int64
Night Charge      3333 non-null float64
Intl Mins         3333 non-null float64
Intl Calls        3333 non-null int64
Intl Charge       3333 non-null float64
CustServ Calls    3333 non-null int64
Churn?            3333 non-null int64
State_AK          3333 non-null float64
State_AL          3333 non-null float64
State_AR          3333 non-null float64
State_AZ          3333 non-null float64
State_CA          3333 non-null f

We can now start working on our data, beginning by splitting the dataset into training and testing sets.

***

## Step 2: Splitting dataset into training and testing datasets

Let's now split our dataset into two datasets (80% training and 20% testing) and check that both have a similar churn rate.

In [77]:
X = data.drop('Churn?', axis = 1)
Y = data['Churn?']

In [113]:
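# Random 80/20 split; no random_state is set, so the exact numbers vary between runs
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size = 0.8)
# Absolute difference between the churn rates of the two sets
epsilon = np.abs(np.mean(Y_train) - np.mean(Y_test))

print("Mean of Y = " + str(np.mean(Y)))
print("Mean of Y_train = " + str(np.mean(Y_train)))
print("Mean of Y_test = " + str(np.mean(Y_test)))
print("epsilon = " + str(epsilon))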

Mean of Y = 0.144914491449
Mean of Y_train = 0.145161290323
Mean of Y_test = 0.143928035982
epsilon = 0.00123325434057
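
The split above is random, so the two churn rates match only approximately (here `epsilon` is about 0.001). `train_test_split` also accepts a `stratify` argument that preserves the class proportions by construction, and a `random_state` argument for reproducibility. Below is a sketch on synthetic labels, assuming a scikit-learn version that provides `sklearn.model_selection`; the seed value 0 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data: 1000 rows, 15% positives.
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 150 + [0] * 850)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    train_size=0.8,
    stratify=y,       # keep the class ratio the same in both parts
    random_state=0,   # arbitrary seed so the split is reproducible
)

print(y_train.mean(), y_test.mean())
```

With stratification, the two means should agree up to rounding, so a manual check such as `epsilon` becomes a safeguard rather than a necessity.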


In [114]:
# Reconstruct datasets
data_train = X_train.join(Y_train)
data_test = X_test.join(Y_test)

In [117]:
# Save the datasets in case we need to load them later (in another project)
#data_train.to_csv('./data/data_train.csv', index = False)
#data_test.to_csv('./data/data_test.csv', index = False)

***

## Step 3: Modeling Predictors