<h1>Deep Learning: Knowing Why Customers Leave</h1>

<p>In this tutorial, we will use Deep Learning to predict which customers of a bank are likely to leave.<br>
The bank has noticed that customers are leaving at an unusually high rate, and it wants to understand the problem so that it can assess and address it.
The dataset contains relevant information about the customers: 10,000 records from the past months, with fields such as the credit score, the balance, and the estimated salary.
The column <i>Exited</i> tells us whether the customer left: it equals 1 for customers who left and 0 for customers who stayed.
</p>
<p>The main task is to predict whether a customer will leave the bank (<i>Exited</i> = 1) or stay (<i>Exited</i> = 0).</p>

<h2>Data Preprocessing</h2>

In [27]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [28]:
# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
dataset.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


<p>Let's take a closer look at the dataset. It has 14 columns, and we will go through each one to decide whether it carries useful information for our model.<br><br>
<b>RowNumber</b> and <b>CustomerId</b> represent a unique ID for each row and customer. These have no impact on the prediction.<br>
<b>Surname</b> The same goes for the surname: being called Onio does not make you more likely to leave the bank than being called Mitchell.<br>
<b>CreditScore</b> The credit score is likely to have an impact on the customer's decision to stay or leave. We might expect customers with a low credit score to be more likely to leave the bank than customers with a high one.<br>
<b>Geography</b> The country might be valuable information for predicting the decision.<br>
<b>Gender</b> Possibly: maybe men are more likely to leave the bank than women.<br>
<b>Age</b> Younger customers may be more likely to leave the bank than older ones, because of advantages, taxes, or other factors.<br>
<b>Tenure</b> How long the customer has been with the bank. This probably affects the final decision: a longer tenure may come with more advantages, which encourages customers to stay.<br>
<b>Balance</b> Similarly, a customer with a zero balance is probably more likely to leave than a customer with a high balance.<br>
<b>NumOfProducts</b> Hard to say in advance; it may or may not matter.<br>
<b>HasCrCard</b> Customers with a credit card may be more likely to stay than those without one.<br>
<b>IsActiveMember</b> The same reasoning as for HasCrCard.<br>
<b>EstimatedSalary</b> The same logic as for Balance: customers with a high estimated salary may be more likely to stay than those with a low one.<br>
</p>
<p>We have listed all the columns and our intuitions, but in reality we don't know which independent variable has the most impact on the dependent variable (the one that we will predict). That is what our Artificial Neural Network will work out.</p>
<p>For now, we will not include <b>RowNumber</b>, <b>CustomerId</b>, and <b>Surname</b> in the model.</p>

In [29]:
# Defining the independent variables (X) and the dependent variable (y)
# Columns 3 to 12 run from CreditScore to EstimatedSalary; column 13 is Exited
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values

In [30]:
X.shape

(10000, 10)
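<p>Positional slices such as <i>3:13</i> depend on the column order. As a minimal sketch, assuming the column names shown in the table above, the same selection can be written by name. The toy frame below copies a few columns from the first rows of the dataset, and the names <i>toy</i>, <i>id_columns</i>, <i>X_toy</i> and <i>y_toy</i> are made up for this illustration.</p>

```python
import pandas as pd

# Toy frame reusing a few columns from the first rows shown above (illustration only).
toy = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [15634602, 15647311, 15619304],
    "Surname": ["Hargrave", "Hill", "Onio"],
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "France"],
    "Exited": [1, 0, 1],
})

# Identifier columns that carry no predictive information.
id_columns = ["RowNumber", "CustomerId", "Surname"]

# Features: everything except the identifiers and the target.
X_toy = toy.drop(columns=id_columns + ["Exited"]).values
# Target: the Exited column.
y_toy = toy["Exited"].values

print(X_toy.shape)
print(y_toy)
```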

<h3>Dealing with categorical variables</h3>
<p>To run the algorithm, we have to encode the categorical variables. Here we have two of them: <b>Geography</b> and <b>Gender</b>.</p>

In [31]:
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# For the geography variable
labelencoder_geo = LabelEncoder()
X[:, 1] = labelencoder_geo.fit_transform(X[:, 1])
# For the gender variable
labelencoder_gender = LabelEncoder()
X[:, 2] = labelencoder_gender.fit_transform(X[:, 2])
X

array([[619, 0, 0, ..., 1, 1, 101348.88],
       [608, 2, 0, ..., 0, 1, 112542.58],
       [502, 0, 0, ..., 1, 0, 113931.57],
       ..., 
       [709, 0, 0, ..., 0, 1, 42085.58],
       [772, 1, 1, ..., 1, 0, 92888.52],
       [792, 0, 0, ..., 1, 0, 38190.78]], dtype=object)

<p>The two columns are converted to integers. For the geography variable, each country gets one number: 0 for France, 1 for Germany and 2 for Spain.<br>
The same goes for gender: 0 for female and 1 for male (the numbers simply follow alphabetical order and carry no meaning).</p>
<p>However, if we leave it like this, the algorithm will assume that, because France is 0 and Spain is 2, Spain is greater and more valuable than France, which is not correct. These categorical values are nominal, so there is no order between them. To deal with this, we will use dummy variables.</p>

In [32]:
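# One-hot encode column 1 (Geography) into three dummy columns.
# Note: the categorical_features argument was deprecated in scikit-learn 0.20 and is not available
# in later releases; there, OneHotEncoder is wrapped in sklearn.compose.ColumnTransformer instead.
onehotencoder = OneHotEncoder(categorical_features = [1])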
X = onehotencoder.fit_transform(X).toarray()
X

array([[  1.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          1.00000000e+00,   1.00000000e+00,   1.01348880e+05],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00, ...,
          0.00000000e+00,   1.00000000e+00,   1.12542580e+05],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          1.00000000e+00,   0.00000000e+00,   1.13931570e+05],
       ..., 
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   1.00000000e+00,   4.20855800e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00, ...,
          1.00000000e+00,   0.00000000e+00,   9.28885200e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          1.00000000e+00,   0.00000000e+00,   3.81907800e+04]])
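<p>Recent scikit-learn releases no longer accept the <i>categorical_features</i> argument. The sketch below shows one possible way to get the same layout (dummy columns first, remaining columns after) with <i>ColumnTransformer</i>. It runs on a three-row toy array with made-up columns, not on the real dataset, and the names <i>X_toy</i>, <i>ct</i> and <i>X_encoded</i> are assumptions for this illustration.</p>

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows: [CreditScore, Geography, Age] (illustration only).
X_toy = np.array([
    [619, "France", 42],
    [608, "Spain", 41],
    [502, "Germany", 42],
], dtype=object)

# One-hot encode column 1 and pass the other columns through unchanged.
# sparse_threshold=0 forces a dense array as output.
ct = ColumnTransformer(
    transformers=[("geo", OneHotEncoder(), [1])],
    remainder="passthrough",
    sparse_threshold=0,
)

# Encoded columns come first (France, Germany, Spain), then CreditScore and Age.
X_encoded = np.asarray(ct.fit_transform(X_toy), dtype=float)
print(X_encoded)
```

<p>The categories are sorted alphabetically, so the first three columns correspond to France, Germany and Spain, in that order.</p>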

In [33]:
# Drop one dummy column (the first one, France) to avoid the dummy variable trap
X = X[:, 1:]
X.shape

(10000, 11)
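<p>The dummy variable trap can be checked numerically. The three country dummies always sum to 1, so together with a constant term one of them is redundant. The following sketch uses a four-row toy matrix; the names <i>dummies</i>, <i>full</i> and <i>reduced</i> are chosen for this illustration only.</p>

```python
import numpy as np

# Toy dummy columns in the order France, Germany, Spain (illustration only).
dummies = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
], dtype=float)

# A constant column, like the bias term of a model.
intercept = np.ones((4, 1))

# Keeping all three dummies: 4 columns, but one is a combination of the others.
full = np.hstack([intercept, dummies])
# Dropping the first dummy: 3 columns, all linearly independent.
reduced = np.hstack([intercept, dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)
print(reduced.shape[1], rank_reduced)
```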

In [34]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

<p>The features are on very different scales (compare Age with EstimatedSalary), and training involves a lot of computation, so we apply Feature Scaling to make that computation easier.</p>

In [35]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [36]:
X_train

array([[-0.5698444 ,  1.74309049,  0.16958176, ...,  0.64259497,
        -1.03227043,  1.10643166],
       [ 1.75486502, -0.57369368, -2.30455945, ...,  0.64259497,
         0.9687384 , -0.74866447],
       [-0.5698444 , -0.57369368, -1.19119591, ...,  0.64259497,
        -1.03227043,  1.48533467],
       ..., 
       [-0.5698444 , -0.57369368,  0.9015152 , ...,  0.64259497,
        -1.03227043,  1.41231994],
       [-0.5698444 ,  1.74309049, -0.62420521, ...,  0.64259497,
         0.9687384 ,  0.84432121],
       [ 1.75486502, -0.57369368, -0.28401079, ...,  0.64259497,
        -1.03227043,  0.32472465]])
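<p>To make the scaling step concrete, here is a small NumPy sketch of standardization: each column is shifted by its mean and divided by its standard deviation, and both statistics are computed on the training set only. The toy arrays <i>train</i> and <i>test</i> are made-up values, not rows from the dataset.</p>

```python
import numpy as np

# Toy data with two features on very different scales (illustration only).
train = np.array([[20.0, 1000.0],
                  [30.0, 3000.0],
                  [40.0, 5000.0]])
test = np.array([[30.0, 1000.0]])

# Statistics are learned from the training set only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# The same mean and std are applied to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled)
print(test_scaled)
```

<p>Reusing the training statistics on the test set mirrors the <i>fit_transform</i> / <i>transform</i> pair used above.</p>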

<h2>Creating the ANN model</h2>

In [37]:
# Importing the libraries
import keras
from keras.models import Sequential
# For creating the Layers
from keras.layers import Dense

<p>The first step in creating our Artificial Neural Network is initializing the model, which is defined as a sequence of layers.<br>
There are two ways to initialize a deep learning model: by <i>defining the sequence of layers</i> or by <i>defining a graph</i>. Since we will build an artificial neural network with successive layers, we will use the first method.<br>
We create an object of the Sequential class. Since our model is a classifier (it predicts whether the customer will leave or not), we will call it "classifier".
</p>

In [38]:
classifier = Sequential()

<p>Here we add the first layers of our ANN, using the Dense class.<br><br>
We already know the number of nodes in the input layer: it is simply the number of independent variables, which is 11 in this case.<br><br>
During forward propagation, from left to right, each neuron is activated by its activation function: the higher the value of the activation function for a neuron, the more impact that neuron has on the network. For the hidden layers we use the rectifier (ReLU) function, and for the output layer the sigmoid function is a good choice because this is a binary classification problem.<br><br>
With the sigmoid activation function, the output layer returns the probability that the customer leaves: the greater the probability, the more likely the customer is to leave. We can segment the customers according to their probability of leaving the bank and, depending on business constraints and goals, decide which actions will add value to the business.
</p>
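<p>The propagation described above can be sketched in plain NumPy for an 11-6-6-1 network. This is only an illustration of the forward pass: the weights are small random numbers standing in for trained ones, and the names <i>relu</i>, <i>sigmoid</i>, <i>W1</i>, <i>W2</i>, <i>W3</i> and <i>p_leave</i> are assumptions, not Keras internals.</p>

```python
import numpy as np

def relu(z):
    # Rectifier: negative values become 0, positive values pass through.
    return np.maximum(0.0, z)

def sigmoid(z):
    # Squashes any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# One customer described by 11 scaled features (random stand-in values).
x = rng.normal(size=(1, 11))

# Random weights and zero biases for the layers 11 -> 6 -> 6 -> 1.
W1, b1 = rng.uniform(-0.05, 0.05, size=(11, 6)), np.zeros(6)
W2, b2 = rng.uniform(-0.05, 0.05, size=(6, 6)), np.zeros(6)
W3, b3 = rng.uniform(-0.05, 0.05, size=(6, 1)), np.zeros(1)

h1 = relu(x @ W1 + b1)            # first hidden layer
h2 = relu(h1 @ W2 + b2)           # second hidden layer
p_leave = sigmoid(h2 @ W3 + b3)   # output: probability of leaving

print(p_leave.shape)
```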

In [39]:
# Adding the input layer and the first hidden layer
# How many nodes should the hidden layer have? Some say this is simply an "art": with practice and experiments
# we develop an intuition. A simple rule of thumb is (input nodes + output nodes) / 2 = (11 + 1) / 2 = 6.
# Keras 2 argument names are used here: units (formerly output_dim) and kernel_initializer (formerly init).
classifier.add(Dense(units = 6, kernel_initializer = "uniform", activation = "relu", input_dim = 11))



In [40]:
# Adding the second hidden layer
# We do not specify input_dim here because the input size was already set on the first layer
classifier.add(Dense(units = 6, kernel_initializer = "uniform", activation = "relu"))



In [41]:
# Adding the output layer
# For this layer we use the sigmoid function, since there is a single binary outcome
# If the output had more than two categories, we would use the softmax function instead
classifier.add(Dense(units = 1, kernel_initializer = "uniform", activation = "sigmoid"))



<p>In the steps above we created the ANN and initialized its weights, but that is only an initialization step. Compiling the model specifies the algorithm that will search for the optimal set of weights in the whole ANN.<br>
The <i>compile</i> method takes 3 parameters:<br>
- First, the <i>optimization algorithm</i>. We use stochastic gradient descent; there are several variants of it, and we choose Adam.<br>
- The second parameter is the <i>loss</i> function. Because the output is a single sigmoid probability, we use binary cross-entropy.<br>
- The last parameter is <i>metrics</i>, the criteria used to evaluate the model. Here we use accuracy.
</p>
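<p>As a rough illustration of what the binary cross-entropy loss measures, the sketch below computes it by hand with NumPy for two sets of made-up predictions. The function name <i>binary_crossentropy</i> and the clipping constant <i>eps</i> are assumptions for this example; Keras has its own implementation.</p>

```python
import numpy as np

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    # Clip probabilities so that log(0) never occurs.
    p = np.clip(p_pred, eps, 1.0 - eps)
    # Average of -[y*log(p) + (1-y)*log(1-p)] over all samples.
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

# Three customers: the first left, the other two stayed (made-up labels).
y_true = np.array([1.0, 0.0, 0.0])

# Predictions that agree with the labels, and predictions that contradict them.
good_predictions = np.array([0.9, 0.1, 0.1])
bad_predictions = np.array([0.1, 0.9, 0.9])

loss_good = binary_crossentropy(y_true, good_predictions)
loss_bad = binary_crossentropy(y_true, bad_predictions)
print(loss_good, loss_bad)
```

<p>Confident wrong predictions are penalized much more heavily than confident correct ones, which is the signal the optimizer uses to adjust the weights.</p>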

In [42]:
classifier.compile(optimizer = "adam", loss ="binary_crossentropy", metrics = ['accuracy'])

In [44]:
# Fitting the ANN to the training set
# epochs is the Keras 2 name for the former nb_epoch argument
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)

Epoch 1/100
1230/8000 [===>..........................] - ETA: 0s - loss: 0.3883 - acc: 0.8390
...
Epoch 100/100


<keras.callbacks.History at 0x2687865f4e0>

In [50]:
# After training our ANN on the train set, we will predict the test set
y_pred = classifier.predict(X_test)
y_pred

array([[ 0.18717389],
       [ 0.32767144],
       [ 0.14578097],
       ..., 
       [ 0.14106224],
       [ 0.15789446],
       [ 0.11409371]], dtype=float32)

<p>The result is an array of floats. As we said at the beginning, each output is the probability that a customer leaves the bank.<br>
For the first customer the value is about 0.19, so the estimated probability that this customer leaves the bank is 0.19.<br>
We convert these values into a decision by applying a threshold of 0.5: if the value is greater than the threshold, we predict that the customer will leave; otherwise we predict that they will stay.
</p>


In [52]:
y_pred = (y_pred > 0.5)
y_pred

array([[False],
       [False],
       [False],
       ..., 
       [False],
       [False],
       [False]], dtype=bool)

In [54]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[1543,   52],
       [ 265,  140]], dtype=int64)

<p>The confusion matrix shows 1543 + 140 = 1683 correct predictions and 52 + 265 = 317 incorrect predictions. Let's compute the accuracy of the prediction.</p>

In [56]:
print("The accuracy : ", (1543 + 140) / (2000))

The accuracy :  0.8415
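<p>The same figure can be derived from the matrix itself instead of typing the numbers by hand. The sketch below re-creates the confusion matrix printed above as a NumPy array. In scikit-learn's convention rows are the true classes and columns the predicted ones, so the second row describes the customers who actually left. The name <i>recall_leavers</i> is introduced here for illustration.</p>

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true class (0 = stayed, 1 = left); columns: predicted class.
cm = np.array([[1543, 52],
               [265, 140]])

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = np.trace(cm) / cm.sum()

# Share of customers who actually left that the model flagged as leaving.
recall_leavers = cm[1, 1] / cm[1].sum()

print("The accuracy : ", accuracy)
print("Leavers correctly identified : ", recall_leavers)
```

<p>The second number is worth a look: of the 405 customers who actually left in the test set, the model identified 140.</p>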


<p>As we said at the beginning, we can sort the probabilities from highest to lowest to find the customers most likely to leave the bank. For example, the bank can take the 20% of customers with the highest probabilities of leaving, treat them as a segment, and analyze that segment in more depth in order to take measures that prevent more customers from leaving.</p>
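<p>A minimal sketch of that segmentation, assuming we have one predicted probability per customer. The customer IDs and probabilities below are made-up values, and the names <i>customer_ids</i>, <i>probabilities</i> and <i>top_ids</i> are chosen for this example only.</p>

```python
import numpy as np

# Hypothetical customers and their predicted probabilities of leaving.
customer_ids = np.array([101, 102, 103, 104, 105, 106, 107, 108, 109, 110])
probabilities = np.array([0.18, 0.33, 0.15, 0.72, 0.05, 0.91, 0.40, 0.14, 0.16, 0.11])

# Indices sorted from the highest probability to the lowest.
order = np.argsort(-probabilities)

# Size of the top 20% segment (at least one customer).
n_top = max(1, len(probabilities) // 5)

# The customers most likely to leave, with their probabilities.
top_ids = customer_ids[order[:n_top]]
top_probabilities = probabilities[order[:n_top]]

print(top_ids)
print(top_probabilities)
```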