# Churn Modelling using Keras

In [1]:
import pandas as pd
#setting pandas to show all columns
pd.set_option('display.max_columns', None)

In [2]:
#Loading the CSV into a Data Frame Object
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

### Let's take a look at our Data

In [3]:
df.head(3)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes


<b>customerID</b> is just a unique identifier and tells us nothing about churn, so we can drop it from the Data Frame.

In [4]:
df.drop('customerID', axis=1, inplace=True)

With the exception of <b>tenure</b>, <b>TotalCharges</b>, <b>MonthlyCharges</b> and <b>Churn</b>, all the other variables are <b>categorical</b>, so we should get their dummy variables.<br>
Always remember to drop the first dummy column, for two basic reasons:<br>
- It is always possible to <b>infer</b> the values you dropped from the values you kept.<br>
- It avoids the <b>Dummy Variable Trap</b>.<br>
If you don't know what that is, take a look at this webpage:<br>
<a>http://www.algosome.com/articles/dummy-variable-trap-regression.html</a>
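
To make both reasons concrete, here is a minimal sketch on a toy column. The `Contract` values below are hypothetical stand-ins, not rows from the dataset. It shows that the full encoding always sums to 1 per row, and that the dropped column can be rebuilt from the ones that were kept.

```python
import pandas as pd

# Toy column with three categories (hypothetical values, not the Telco data).
contract = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year", "One year"]})

full = pd.get_dummies(contract, dtype=int)                      # one column per category
reduced = pd.get_dummies(contract, drop_first=True, dtype=int)  # first category dropped

# In the full encoding every row sums to 1, so the columns are
# perfectly collinear with a constant (intercept) term.
row_sums = full.sum(axis=1)

# The dropped column can be rebuilt from the columns that were kept.
recovered = 1 - reduced.sum(axis=1)

print(full.columns.tolist())
print(reduced.columns.tolist())
print(recovered.tolist())
```

The redundancy in the full encoding is exactly what the trap refers to, and `drop_first=True` removes it without losing any information.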

In [5]:
X = pd.get_dummies(df.drop(['tenure', 'MonthlyCharges', 'TotalCharges', 'Churn'], axis=1), drop_first=True)
#Putting together our independent variables
X = pd.concat([X, df[['tenure', 'MonthlyCharges', 'TotalCharges']]], axis=1)

In [6]:
#Our dependent variable is the df 'Churn' column
y = pd.get_dummies(df['Churn'], drop_first=True).values.ravel()

### Taking a deeper look

In [7]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 30 columns):
SeniorCitizen                            7043 non-null int64
gender_Male                              7043 non-null uint8
Partner_Yes                              7043 non-null uint8
Dependents_Yes                           7043 non-null uint8
PhoneService_Yes                         7043 non-null uint8
MultipleLines_No phone service           7043 non-null uint8
MultipleLines_Yes                        7043 non-null uint8
InternetService_Fiber optic              7043 non-null uint8
InternetService_No                       7043 non-null uint8
OnlineSecurity_No internet service       7043 non-null uint8
OnlineSecurity_Yes                       7043 non-null uint8
OnlineBackup_No internet service         7043 non-null uint8
OnlineBackup_Yes                         7043 non-null uint8
DeviceProtection_No internet service     7043 non-null uint8
DeviceProtection_Yes                   

Looking carefully, we can see that <b>TotalCharges</b> is being treated as an <b>object</b> column, and we need it to be a <b>float</b>.<br>
So, let us take care of that.

In [8]:
X['TotalCharges'] = pd.to_numeric(X['TotalCharges'], errors='coerce')
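#assigning the result back, since inplace=True on a selected column may not update X in newer pandas versions
X['TotalCharges'] = X['TotalCharges'].fillna(0)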
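
For readers unsure what `errors='coerce'` does, here is a small sketch on a hypothetical series. It assumes, as an illustration only, that the problematic entries are blank strings. Values that cannot be parsed become `NaN`, and `fillna(0)` then replaces them.

```python
import pandas as pd

# Hypothetical values: two numeric strings and one blank, as might appear in a CSV export.
raw = pd.Series(["29.85", " ", "108.15"])

as_numbers = pd.to_numeric(raw, errors="coerce")  # the blank string becomes NaN
filled = as_numbers.fillna(0)                     # NaN is replaced by 0

print(as_numbers.isna().sum())
print(filled.dtype)
print(filled.tolist())
```

Filling with 0 is one choice among several. Dropping the affected rows or filling with a different value would be equally easy to try.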

In [9]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 30 columns):
SeniorCitizen                            7043 non-null int64
gender_Male                              7043 non-null uint8
Partner_Yes                              7043 non-null uint8
Dependents_Yes                           7043 non-null uint8
PhoneService_Yes                         7043 non-null uint8
MultipleLines_No phone service           7043 non-null uint8
MultipleLines_Yes                        7043 non-null uint8
InternetService_Fiber optic              7043 non-null uint8
InternetService_No                       7043 non-null uint8
OnlineSecurity_No internet service       7043 non-null uint8
OnlineSecurity_Yes                       7043 non-null uint8
OnlineBackup_No internet service         7043 non-null uint8
OnlineBackup_Yes                         7043 non-null uint8
DeviceProtection_No internet service     7043 non-null uint8
DeviceProtection_Yes                   

<b>Done!</b><br>Now we can move on to the next step.

### Scaling the independent variables

It is common practice to scale our data, especially if some columns hold values large enough to make the smaller values in other columns seem insignificant.

In [10]:
from sklearn.preprocessing import StandardScaler
scl = StandardScaler()
X = scl.fit_transform(X)

It is, however, unnecessary to scale the dependent variable: we're expecting a binary outcome, so scaling it wouldn't add any value to our model.
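
As a rough illustration of what standardization does to each column, here is a NumPy sketch on three hypothetical rows. The array `X_demo` is made up for the example. Each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np

# Hypothetical rows: [tenure in months, TotalCharges]
X_demo = np.array([[1.0, 29.85],
                   [34.0, 1889.50],
                   [2.0, 108.15]])

mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)          # population std (ddof=0), which StandardScaler also uses
X_scaled = (X_demo - mean) / std  # z-score: subtract the column mean, divide by the column std

print(X_scaled.round(3))
```

After this step both columns live on the same scale, so neither dominates just because its raw numbers are bigger.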

### Splitting into Train and Test sets

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=101)
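
One thing worth knowing: in this notebook the scaler was fitted on the whole of `X` before the split, so the test rows contributed to the mean and standard deviation. A common alternative, sketched below on random stand-in data, is to split first and fit the scaler on the training rows only. All names ending in `_demo`, `_tr` or `_te` are hypothetical, and this is not what the cells above do.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(101)
X_raw_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # stand-in for the unscaled features
y_demo = rng.integers(0, 2, size=100)                         # stand-in for the churn labels

# Split first...
X_tr, X_te, y_tr, y_te = train_test_split(X_raw_demo, y_demo, test_size=0.2, random_state=101)

# ...then learn the scaling statistics from the training rows only.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # fit + transform on train
X_te_scaled = scaler.transform(X_te)      # test rows reuse the train statistics

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With a dataset of this size the difference in results is likely to be small, but the split-first order keeps the test set untouched until evaluation.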

## Now what?

Now that we have our <b>train</b> and <b>test</b> data, we must decide which way to go.<br>
We could simply <b>fit</b> the model to the train data and make some <b>predictions</b> with it, and it could even achieve a <b>high accuracy</b>. But who can guarantee that its accuracy wasn't <b>just a coincidence</b>?<br><br>
We'll get to that later on.

### The simplest way

In [12]:
#import keras
import keras
from keras.models import Sequential
from keras.layers import Dense

Using TensorFlow backend.


Why use <b>Keras</b>?<br>
<b>Keras</b> is a <b>High Level API</b> written in <b>Python</b> and it is, above all, <b>very easy</b> to learn and implement. <br><br>
So... let us proceed.

In [13]:
#creating the Model
model = Sequential()
model.add(Dense(units=16, kernel_initializer='uniform', activation='relu', input_dim=30)) # first hidden layer
model.add(Dense(units=16, kernel_initializer='uniform', activation='relu')) # second hidden layer
model.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid')) # output layer
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']) #compiling the model

### What the heck happened here?!
<br>
<b>Sequential</b> model is a linear stack of layers.<br>
<b>Dense</b> is a Densely-connected Neural Network Layer. It makes it incredibly easy to add layers to our NN.<br>
The <b>units</b> parameter is the number of neurons in the current layer. A common rule of thumb is to set it to the sum of the number of inputs and outputs divided by two. But feel free to mess around with it and see what happens.<br>
The <b>activation</b> parameter is the activation function to use in that particular layer.<br>
The <b>input_dim</b> parameter is the number of input features, which here is the 30 columns of <b>X</b>.<br>
<b>Compile</b> is the method that states the learning process of our model. It does so by defining an optimizer, a loss function (what the NN tries to minimize) and a list of metrics.<br><br>
For more info, take a look at Keras documentation:<br>
<a>https://keras.io/getting-started/sequential-model-guide/</a>
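
As a quick sanity check on the architecture, the sketch below applies the rule of thumb to our 30 inputs and 1 output, then counts the trainable parameters of the three Dense layers. The list `layer_sizes` is just a convenience for the arithmetic and is not part of the Keras API.

```python
# Layer sizes taken from the model above: 30 inputs, two hidden layers of 16 units, 1 output.
layer_sizes = [30, 16, 16, 1]

# Rule of thumb from the text: (inputs + outputs) / 2
rule_of_thumb = (layer_sizes[0] + layer_sizes[-1]) / 2

# A Dense layer has (inputs * units) weights plus one bias per unit.
params_per_layer = [n_in * n_out + n_out
                    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(params_per_layer)

print(rule_of_thumb)      # 15.5, close to the 16 units used above
print(params_per_layer)   # [496, 272, 17]
print(total_params)       # 785
```

Calling `model.summary()` on the compiled model should report the same per-layer counts.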

Now that our model is <b>built</b>, we should <b>fit</b> it.

In [14]:
model.fit(X_train, y_train, batch_size=35, epochs=50, verbose=1)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x23babc9b978>

And now let's make some <b>predictions</b>.

In [15]:
y_pred = model.predict(X_test)

### Is that it?
<br>
Hold your horses, cowboy!<br>
If you are at all familiar with the math of a <b>Sigmoid Function</b>, you know that it outputs a value between 0 and 1, which we read as the probability of some event.<br>
If we take a look at our <b>y_pred</b> variable, we see that it holds the probability that each customer churns.<br>
To see how our model performed, however, we need a <b>True</b>-<b>False</b> variable. To get one, we must define a threshold that decides when a value becomes 0 and when it becomes 1.<br><br>
Our threshold here will be <b>.5</b>

In [16]:
y_pred = (y_pred > 0.5)
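
For anyone who wants to see the sigmoid and the threshold side by side, here is a small NumPy sketch. The `logits` array is hypothetical. It stands in for the raw values the output layer computes before the activation is applied.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw outputs of the last layer, before the activation.
logits = np.array([-2.0, 0.0, 0.3, 3.0])
probabilities = sigmoid(logits)
labels = probabilities > 0.5   # same rule as above; a value of exactly 0.5 maps to False

print(probabilities.round(3))
print(labels.tolist())         # [False, False, True, True]
```

Lowering the threshold would flag more customers as churners, trading precision for recall, so .5 is a starting point rather than a fixed rule.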

### To the performance!

In [17]:
from sklearn.metrics import confusion_matrix, classification_report
print (confusion_matrix(y_test, y_pred))
print ('\n')
print (classification_report(y_test, y_pred))

[[913 113]
 [182 201]]


             precision    recall  f1-score   support

          0       0.83      0.89      0.86      1026
          1       0.64      0.52      0.58       383

avg / total       0.78      0.79      0.78      1409
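
The numbers in the report can be reproduced by hand from the confusion matrix, which is a handy way to remember what each metric means. The sketch below uses the four counts printed above. In scikit-learn's layout, rows are the true classes and columns are the predicted classes.

```python
# Confusion matrix printed above: rows are true classes, columns are predicted classes.
tn, fp = 913, 113   # true class 0: correctly kept, wrongly flagged as churn
fn, tp = 182, 201   # true class 1: missed churners, correctly flagged churners

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_churn = tp / (tp + fp)   # of the customers flagged as churn, how many really churned
recall_churn = tp / (tp + fn)      # of the customers who churned, how many were flagged
f1_churn = 2 * precision_churn * recall_churn / (precision_churn + recall_churn)

print(round(accuracy, 2), round(precision_churn, 2), round(recall_churn, 2), round(f1_churn, 2))
# 0.79 0.64 0.52 0.58
```

Note that the model catches only about half of the customers who actually churn, which the overall accuracy alone would hide.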



### Wait a second...
<br>
We got a relatively <b>lower accuracy</b> on the test set than on the train set. But <b>why</b>?<br>
It could mean two things:<br>
<b>Overfitting</b>, or...<br>
Remember when I said earlier that it <b>wasn't a good idea</b> to simply train the model? Well, I didn't. But I never said that it was a good one either.<br>
The <b>accuracy</b> obtained while training the model could have been just a <b>coincidence</b>, just as I said.<br><br>
One way to check that is scikit-learn's <b>Cross Validation</b>, which trains and evaluates the model on several different splits of the data instead of a single one.<br><br>
So let's do just that.
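
Before using the real thing, here is a minimal sketch of the idea behind k-fold cross validation, written with plain NumPy. It uses a simple, unshuffled split, which may differ in detail from what `cross_val_score` does internally, and the fold scores at the end are made up for illustration.

```python
import numpy as np

n_samples, k = 20, 5                   # small hypothetical sizes for readability
indices = np.arange(n_samples)
folds = np.array_split(indices, k)     # k disjoint validation folds

splits = []
for i, val_idx in enumerate(folds):
    # Everything outside the current fold is used for training.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    splits.append((train_idx, val_idx))
    # A real run would fit a fresh model on train_idx and score it on val_idx here.

# One score per fold is then summarised by its mean and standard deviation.
fold_scores = np.array([0.80, 0.78, 0.81, 0.79, 0.82])   # made-up scores, for illustration only
print(fold_scores.mean(), fold_scores.std())
```

Every sample is used for validation exactly once, so a lucky or unlucky single split can no longer decide the reported accuracy.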

In [18]:
from sklearn.model_selection import cross_val_score

In order to use <b>Cross Validation</b>, we should observe a couple of things:<br><br>
<li>
Cross Validation expects, as a parameter, an estimator.
</li>
<li>
But we can't just feed it the model we've created.<br><br>
</li>
So, we'll use a <b>wrapper</b> from Keras. It will give us a model that we can use with Cross Validation.

In [19]:
from keras.wrappers.scikit_learn import KerasClassifier

The KerasClassifier also needs something:<br>
A <b>building function</b>.<br>
We basically repeat our steps to build the NN, but within a function.

In [20]:
def build_model():
    #creating the Model
    model = Sequential()
    model.add(Dense(units=16, kernel_initializer='uniform', activation='relu', input_dim=30)) # first hidden layer
    model.add(Dense(units=16, kernel_initializer='uniform', activation='relu')) # second hidden layer
    model.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid')) # output layer
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']) #compiling the model
    return model

Now we instantiate a new <b>Keras Classifier</b> using our <b>building function</b>.<br>
And then, we are good to go.

In [21]:
model = KerasClassifier(build_fn=build_model, batch_size=35, epochs=50, verbose=0)
accuracies = cross_val_score(model, X_train, y_train, cv=10, verbose=0)

In [22]:
print ('Mean: ', accuracies.mean(), '-- Std: ', accuracies.std())

Mean:  0.7994252666850727 -- Std:  0.013430464178131032


In [23]:
model.fit(X_train, y_train, batch_size=35, epochs=50, verbose=0)
y_pred = model.predict(X_test)
y_pred = (y_pred > 0.5)
print (confusion_matrix(y_test, y_pred))
print ('\n')
print (classification_report(y_test, y_pred))

[[937  89]
 [199 184]]


             precision    recall  f1-score   support

          0       0.82      0.91      0.87      1026
          1       0.67      0.48      0.56       383

avg / total       0.78      0.80      0.78      1409



See? Now we know what to <b>expect</b> regarding accuracy from our NN.<br><br>

Right, right... Now we know what to expect, but <b>isn't this accuracy too low?!</b><br>
Well, it ain't. But rest assured, we can <b>make it better</b>.<br><br>

<b>But First...</b>

### How to prevent Overfitting?
<br>
Had our NN become overfitted, a simple thing that could help is <b>Dropout</b>.<br>
Which, in short, is <b>learning less, but learning BETTER</b>.<br>

<b>Dropout</b> acts on the layers to which it's added, randomly disabling some of their neurons during training.<br>
`But, isn't that bad?`<br>
Although it sounds <b>bad</b>, I can assure you, <b>it is not!</b><br>
By doing it, we make sure that our neurons become more independent within the layer.<br>
If <b>overfitting</b> is your problem, then you should use it.<br>
There is no rule of thumb for the <b>percentage of Dropout</b>, so you should play around and see what suits you best.<br>
<b>Be careful</b> though: at some point, the model will stop being <b>overfitted</b> and become <b>underfitted</b>.<br><br>
### Dropout

In [24]:
from keras.layers import Dropout

In [25]:
def build_model():
    #creating the Model
    model = Sequential()
    model.add(Dense(units=16, kernel_initializer='uniform', activation='relu', input_dim=30)) # first hidden layer
    #model.add(Dropout(rate=.1)) uncomment this if you want to add dropout
    model.add(Dense(units=16, kernel_initializer='uniform', activation='relu')) # second hidden layer
    #model.add(Dropout(rate=.2)) uncomment this if you want to add dropout
    model.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid')) # output layer
    #no dropout after the output layer: it would randomly zero the prediction during training
    #play around with the dropout value if you're experiencing overfit
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']) #compiling the model
    return model

We won't add dropout to our NN layers, simply because it ain't overfitting.<br>
But now you know how to do it if that is your case.
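
To make the "randomly disabling some neurons" part concrete, here is a NumPy sketch of the usual inverted-dropout formulation, in which the surviving units are rescaled so that the expected activation stays the same. The function `dropout` and the array `hidden` are hypothetical and written for illustration. This is not the Keras implementation itself.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a random fraction `rate` of units and rescale the survivors."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob   # True where a unit is kept
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((1000, 16))             # hypothetical activations of a 16-unit layer
dropped = dropout(hidden, rate=0.2, rng=rng)

print((dropped == 0).mean())   # fraction of disabled units, close to 0.2
print(dropped.mean())          # close to 1.0, because survivors are scaled by 1 / 0.8
```

Dropout is only active while training. At prediction time every neuron is used, which is why the rescaling during training matters.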

### Improving Overall Accuracy
<br>
We'll use <b>GridSearchCV</b> to improve our NN's accuracy.<br>
Basically, we give it a bunch of different <b>hyperparameters</b> and the Grid Search will tell us which combination <b>performed best</b>, by testing them against each other.<br><br>
So, without further ado...

In [26]:
from sklearn.model_selection import GridSearchCV

We'll redefine our building function with a little twist: we'll pass the optimizer as a parameter so we can play around with it as well.

In [27]:
def build_model(optimizer):
    #creating the Model
    model = Sequential()
    model.add(Dense(units=16, kernel_initializer='uniform', activation='relu', input_dim=30)) # first hidden layer
    #model.add(Dropout(rate=.1)) uncomment this if you want to add dropout
    model.add(Dense(units=16, kernel_initializer='uniform', activation='relu')) # second hidden layer
    #model.add(Dropout(rate=.2)) uncomment this if you want to add dropout
    model.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid')) # output layer
    #no dropout after the output layer: it would randomly zero the prediction during training
    #play around with the dropout value if you're experiencing overfit
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy']) #compiling the model
    return model

Just like <b>Cross Validation</b>, <b>Grid Search</b> also expects, as a parameter, an <b>estimator</b>.<br>
And, also, the <b>parameters</b> we want to test:

In [28]:
model = KerasClassifier(build_fn=build_model)
#please note that this time we are passing neither the 'batch_size' nor the 'epochs' number.
#we'll use those as parameters as well.
parameters = {'batch_size': [20,30],
             'epochs': [50,75],
             'optimizer': ['adam']}
grid_search = GridSearchCV(estimator=model, param_grid=parameters, scoring='accuracy', cv=10, verbose=0)

In [29]:
grid_search = grid_search.fit(X_train, y_train, verbose=0)
best_param = grid_search.best_params_
best_acc = grid_search.best_score_

It will take quite some time. So you'd better go do something else, like take a nap, for instance.<br>
No, seriously, go take a nap!
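
To get a feeling for why the nap is justified, the sketch below counts how much training the grid above implies. It only uses the standard library and repeats the same `parameters` dictionary and `cv=10` from the cells above. The helper names `combinations`, `n_fits` and `total_epochs` are mine.

```python
from itertools import product

# Same grid as in the cell above.
parameters = {'batch_size': [20, 30],
              'epochs': [50, 75],
              'optimizer': ['adam']}
cv = 10

names = list(parameters)
combinations = [dict(zip(names, values)) for values in product(*parameters.values())]

n_fits = len(combinations) * cv                               # one training run per combination per fold
total_epochs = sum(c['epochs'] for c in combinations) * cv    # epochs summed over all those runs

print(len(combinations))   # 4
print(n_fits)              # 40
print(total_epochs)        # 2500
```

On top of these 40 runs, GridSearchCV refits the best combination once on the whole training set by default. Every extra value added to the grid multiplies the total, which is why larger searches quickly take hours.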

In [30]:
best_acc

0.7962371317003905

In [31]:
best_param

{'batch_size': 20, 'epochs': 50, 'optimizer': 'adam'}

`Well.. It wasn't better at all..`<br>
I know, I know... But let me tell you a couple of things before you close this tab and move on.<br><br>
<li>Performing a Grid Search is computationally expensive.</li>
<li>I didn't use the best parameters to perform the grid search.</li>
<li>Instead, I used the ones that I knew would be fast.</li>
<li>I wasn't kidding... performing a grid search with several parameters takes HOURS.</li><br>

Let's make a comparison:

In [40]:
#creating the Model
model = Sequential()
model.add(Dense(units=16, kernel_initializer='uniform', activation='relu', input_dim=30)) # first hidden layer
#model.add(Dropout(rate=.1))
model.add(Dense(units=16, kernel_initializer='uniform', activation='relu')) # second hidden layer
#model.add(Dropout(rate=.1))
#model.add(Dense(units=16, kernel_initializer='uniform', activation='relu')) # third hidden layer
model.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid')) # output layer
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']) #compiling the model

In [41]:
model.fit(X_train, y_train, batch_size=20, epochs=150, verbose=0)

<keras.callbacks.History at 0x23bd8e86da0>

In [42]:
y_pred = model.predict(X_test)
y_pred = (y_pred > 0.5)

In [43]:
print (confusion_matrix(y_test, y_pred))
print ('\n')
print (classification_report(y_test, y_pred))

[[920 106]
 [197 186]]


             precision    recall  f1-score   support

          0       0.82      0.90      0.86      1026
          1       0.64      0.49      0.55       383

avg / total       0.77      0.78      0.78      1409



In [44]:
y_pred = grid_search.predict(X_test)
y_pred = (y_pred > 0.5)
print (confusion_matrix(y_test, y_pred))
print ('\n')
print (classification_report(y_test, y_pred))

[[960  66]
 [230 153]]


             precision    recall  f1-score   support

          0       0.81      0.94      0.87      1026
          1       0.70      0.40      0.51       383

avg / total       0.78      0.79      0.77      1409

