<h1>Deep Learning: Knowing Why Customers Leave</h1>

<p>In this tutorial, we will learn how to use Deep Learning to predict which customers of a bank are likely to leave.<br>
The bank has noticed that its customers are leaving at an unusually high rate, and it wants to understand the problem so that it can assess and address it.
The dataset contains relevant information about the customers. It holds 10,000 records collected over the past months, with attributes such as the estimated salary.
The column <i>Exited</i> tells us whether the customer left: it equals 1 if the customer left and 0 if the customer stayed.
</p>
<p>The main task is to predict, for each customer, whether <i>Exited</i> will be 1 (the customer leaves) or 0 (the customer stays with the bank).</p>

<h2>Data Preprocessing</h2>

In [21]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [30]:
# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
dataset.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


<p>Let's take a closer look at the dataset. It has 14 columns, and we will go through them one by one to decide whether each is likely to carry useful information for our model.<br><br>
<b>RowNumber</b> and <b>CustomerId</b> are unique identifiers for each row and each customer, so they have no impact on the prediction.<br>
<b>Surname</b> The same goes for the surname: being called Onio does not make you more likely to leave the bank than being called Mitchell.<br>
<b>CreditScore</b> The credit score may well have an impact on the decision to stay or leave. We might expect customers with a low credit score to be more likely to leave than customers with a high one.<br>
<b>Geography</b> The country might be valuable information for predicting the decision.<br>
<b>Gender</b> Possibly: men may be more likely to leave the bank than women.<br>
<b>Age</b> Younger customers may be more likely to leave than older ones, for example because of different advantages or taxes.<br>
<b>Tenure</b> This is how long the customer has been with the bank, and it plausibly affects the decision: longer-standing customers may enjoy more advantages and therefore stay.<br>
<b>Balance</b> Similarly, a customer with a balance of 0 may be more likely to leave than a customer with a high balance.<br>
<b>NumOfProducts</b> Hard to say in advance; it may or may not matter.<br>
<b>HasCrCard</b> Customers with a credit card may be more likely to stay than those without one.<br>
<b>IsActiveMember</b> Same reasoning as for HasCrCard.<br>
<b>EstimatedSalary</b> Same logic as for Balance: customers with a high estimated salary may be more likely to stay than those with a low one.<br>
</p>
<p>We have listed all the columns along with our intuitions, but in reality we don't know which independent variables have the most impact on the dependent variable (the one we will predict). That is what our Artificial Neural Network will learn from the data.</p>
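<p>One quick way to sanity-check intuitions like these before training anything is to compare the churn rate across groups. The sketch below uses a tiny made-up frame (the values are invented for illustration, not taken from Churn_Modelling.csv) that reuses three of the column names; on the real data, the same <code>groupby</code> calls could be run on <code>dataset</code>.</p>

```python
import pandas as pd

# Toy data with the tutorial's column names; values are invented for illustration
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "France", "Germany", "Germany", "Spain"],
    "IsActiveMember": [1, 1, 0, 0, 1, 0],
    "Exited": [0, 0, 1, 1, 0, 1],
})

# Exited is 0/1, so its mean within a group is that group's churn rate
churn_by_activity = toy.groupby("IsActiveMember")["Exited"].mean()
churn_by_country = toy.groupby("Geography")["Exited"].mean()

print(churn_by_activity)
print(churn_by_country)
```

<p>A large gap between groups hints that the column is informative, but it does not replace the model: the network can also pick up interactions between columns that a single <code>groupby</code> cannot show.</p>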
<p>For now, we will leave <b>RowNumber</b>, <b>CustomerId</b> and <b>Surname</b> out of the model.</p>

In [31]:
# Defining the independent variables (X) and the dependent variable (y)
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values

In [32]:
X.shape

(10000, 10)
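
<p>The slice <code>3:13</code> keeps positions 3 through 12, because the end of an <code>iloc</code> slice is exclusive. A minimal sketch on a one-row frame with the same 14 column names shows that this covers CreditScore through EstimatedSalary, and that selecting by name with <code>loc</code> picks the same columns. The name <code>toy</code> is only for illustration.</p>

```python
import pandas as pd

columns = ["RowNumber", "CustomerId", "Surname", "CreditScore", "Geography",
           "Gender", "Age", "Tenure", "Balance", "NumOfProducts", "HasCrCard",
           "IsActiveMember", "EstimatedSalary", "Exited"]
# First row of the head() output, just to have something to slice
row = [1, 15634602, "Hargrave", 619, "France", "Female", 42, 2, 0.0, 1, 1, 1, 101348.88, 1]
toy = pd.DataFrame([row], columns=columns)

by_position = toy.iloc[:, 3:13]                        # end index 13 is excluded
by_name = toy.loc[:, "CreditScore":"EstimatedSalary"]  # label slices include both ends

print(list(by_position.columns))
print(by_position.columns.equals(by_name.columns))
```

<p>Selecting by name is a little longer to write but keeps working if the column order of the CSV ever changes.</p>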

<h3>Dealing with categorical variables</h3>
<p>To run the algorithm, we have to encode the categorical variables. Here we have two: <b>Geography</b> and <b>Gender</b>.</p>

In [33]:
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# For the geography variable
labelencoder_geo = LabelEncoder()
X[:, 1] = labelencoder_geo.fit_transform(X[:, 1])
# For the gender variable
labelencoder_gender = LabelEncoder()
X[:, 2] = labelencoder_gender.fit_transform(X[:, 2])
X

array([[619, 0, 0, ..., 1, 1, 101348.88],
       [608, 2, 0, ..., 0, 1, 112542.58],
       [502, 0, 0, ..., 1, 0, 113931.57],
       ..., 
       [709, 0, 0, ..., 0, 1, 42085.58],
       [772, 1, 1, ..., 1, 0, 92888.52],
       [792, 0, 0, ..., 1, 0, 38190.78]], dtype=object)

<p>The two columns are converted to integers. For the geography variable, each country gets a number: 0 for France, 1 for Germany and 2 for Spain.<br>
The same goes for gender: 0 for female and 1 for male (the numbers follow alphabetical order and carry no meaning).</p>
<p>However, if we leave it like this, the algorithm will treat the codes as ordered: since France is 0 and Spain is 2, Spain would count as greater than France, which is not correct. These categorical values are nominal, so there is no order between them. To deal with this, we will use dummy variables. Gender has only two categories, so a single 0/1 column is already enough and only Geography needs to be expanded.</p>
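
<p>To make this concrete, here is a small sketch on four made-up rows. <code>LabelEncoder</code> numbers the categories in sorted order, which is where France = 0, Germany = 1, Spain = 2 comes from, and dummy variables replace that single integer column with one 0/1 column per country. Indexing <code>np.eye</code> is just one illustrative way to build the dummies by hand.</p>

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

countries = np.array(["Spain", "France", "Germany", "France"])

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)
print(encoder.classes_)  # categories in sorted order
print(codes)             # integer code of each row

# Row i of the identity matrix is the dummy vector for category i
dummies = np.eye(len(encoder.classes_), dtype=int)[codes]
print(dummies)
```

<p>In the dummy representation every country is equally far from every other one, so no artificial ordering is left for the model to pick up.</p>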

In [34]:
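# One-hot encode the Geography column (index 1); the dummy columns are placed first
# Note: categorical_features only exists in scikit-learn versions before 0.22
onehotencoder = OneHotEncoder(categorical_features = [1])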
X = onehotencoder.fit_transform(X).toarray()
X

array([[  1.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          1.00000000e+00,   1.00000000e+00,   1.01348880e+05],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00, ...,
          0.00000000e+00,   1.00000000e+00,   1.12542580e+05],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          1.00000000e+00,   0.00000000e+00,   1.13931570e+05],
       ..., 
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   1.00000000e+00,   4.20855800e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00, ...,
          1.00000000e+00,   0.00000000e+00,   9.28885200e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          1.00000000e+00,   0.00000000e+00,   3.81907800e+04]])
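
<p>On scikit-learn 0.22 and later, where <code>OneHotEncoder</code> no longer accepts <code>categorical_features</code>, a roughly equivalent step can be sketched with <code>ColumnTransformer</code>. The toy array and the transformer name <code>"geo"</code> below are illustrative assumptions, not part of the original notebook.</p>

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows: CreditScore, Geography, Age (values chosen for illustration)
X_toy = np.array([[619, "France", 42],
                  [608, "Spain", 41],
                  [502, "Germany", 42]], dtype=object)

# One-hot encode column 1 and pass the other columns through unchanged.
# sparse_threshold=0 forces a dense array; the encoded columns come first.
ct = ColumnTransformer([("geo", OneHotEncoder(), [1])],
                       remainder="passthrough", sparse_threshold=0)
X_encoded = ct.fit_transform(X_toy).astype(float)
print(X_encoded)
```

<p>As in the cell above, the three country columns end up on the left and the remaining features follow in their original order, so the later <code>X[:, 1:]</code> step would still drop the first dummy column.</p>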

In [35]:
# Drop one dummy column to avoid the dummy variable trap
X = X[:, 1:]
X.shape

(10000, 11)
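
<p>The reason for dropping a column is that the three country dummies always sum to 1, so any one of them can be recovered from the other two; next to a constant (bias) term they carry redundant information. A small NumPy sketch on made-up rows illustrates the redundancy through the matrix rank:</p>

```python
import numpy as np

# Dummy columns for France, Germany, Spain on four made-up customers
dummies = np.array([[1, 0, 0],
                    [0, 0, 1],
                    [0, 1, 0],
                    [1, 0, 0]])
ones = np.ones((dummies.shape[0], 1))  # constant column, playing the role of a bias term

full = np.hstack([ones, dummies])            # constant + all three dummies
reduced = np.hstack([ones, dummies[:, 1:]])  # constant + two dummies (first one dropped)

print(dummies.sum(axis=1))             # every row sums to 1
print(np.linalg.matrix_rank(full))     # 4 columns, but the rank is lower
print(np.linalg.matrix_rank(reduced))  # 3 columns, full rank
```

<p>With the first dummy removed, France becomes the baseline: a row with 0 in both remaining country columns is a French customer.</p>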

In [38]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

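<p>The features are on very different scales (compare Age with EstimatedSalary), and the network performs a large number of computations on them, so we apply feature scaling to ease the computation and to keep any single feature from dominating.</p>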

In [39]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [40]:
X_train

array([[-0.5698444 ,  1.74309049,  0.16958176, ...,  0.64259497,
        -1.03227043,  1.10643166],
       [ 1.75486502, -0.57369368, -2.30455945, ...,  0.64259497,
         0.9687384 , -0.74866447],
       [-0.5698444 , -0.57369368, -1.19119591, ...,  0.64259497,
        -1.03227043,  1.48533467],
       ..., 
       [-0.5698444 , -0.57369368,  0.9015152 , ...,  0.64259497,
        -1.03227043,  1.41231994],
       [-0.5698444 ,  1.74309049, -0.62420521, ...,  0.64259497,
         0.9687384 ,  0.84432121],
       [ 1.75486502, -0.57369368, -0.28401079, ...,  0.64259497,
        -1.03227043,  0.32472465]])
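
<p>What <code>StandardScaler</code> does can be sketched by hand: it subtracts each column's mean and divides by its standard deviation, with both statistics computed on the training set only and then reused for the test set. The arrays below are made up for illustration.</p>

```python
import numpy as np

# Two made-up features on very different scales (think age and salary)
train = np.array([[30.0, 40000.0],
                  [40.0, 60000.0],
                  [50.0, 80000.0]])
test = np.array([[40.0, 100000.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training mean and std

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)
```

<p>This is why the cell above calls <code>fit_transform</code> on <code>X_train</code> but only <code>transform</code> on <code>X_test</code>: fitting on the test set would let information from it leak into the preprocessing.</p>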

<h3>Creating the ANN model</h3>