## Artificial Neural Network - Classification Problem

###### Business Problem

We have a dataset of bank records for 10,000 customers, with details such as name, credit score, age, tenure, and balance. A column called 'Exited' indicates whether a customer has left the bank. The bank wants to understand why it has a high churn rate (many customers leaving). Our goal is to use this dataset to determine which customers are most likely to leave the bank.

In [1]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
rows, columns = dataset.shape
dataset.head(5)

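| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |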


#### Preprocessing the data

In [3]:
X = dataset.iloc[:, 3:columns-1].values
y = dataset.iloc[:, columns-1].values

In [4]:
# Checking the values of X and y
print("X Values")
print(X[:5])
print("\nY Values")
print(y[:10])

X Values
[[619 'France' 'Female' 42 2 0.0 1 1 1 101348.88]
 [608 'Spain' 'Female' 41 1 83807.86 1 0 1 112542.58]
 [502 'France' 'Female' 42 8 159660.8 3 1 0 113931.57]
 [699 'France' 'Female' 39 1 0.0 2 0 0 93826.63]
 [850 'Spain' 'Female' 43 2 125510.82 1 1 1 79084.1]]

Y Values
[1 0 1 0 0 1 0 1 0 0]


In [5]:
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Encoding the Gender as 0/1 (only 2 categories, so a single column is enough)
labelencoder_gender = LabelEncoder()
X[:, 2] = labelencoder_gender.fit_transform(X[:, 2])

# One-hot encoding is applied only to the Country column, which has 3 categories.
# The `categorical_features` argument of OneHotEncoder was removed in scikit-learn 0.22,
# so ColumnTransformer selects the column instead; the encoded columns are placed first.
ct = ColumnTransformer([('country', OneHotEncoder(), [1])],
                       remainder = 'passthrough', sparse_threshold = 0)
X = ct.fit_transform(X).astype(float)
# Removing 1 dummy column to avoid the dummy variable trap
X = X[:, 1:]

In [6]:
# Viewing the input data after preprocessing
print(["{0:0.2f}".format(i) for i in X[1,:]])

['0.00', '1.00', '608.00', '0.00', '41.00', '1.00', '83807.86', '1.00', '0.00', '1.00', '112542.58']


In [7]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [8]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

#### Building the ANN

In [9]:
import keras
# Module required to initialize the network
from keras.models import Sequential
# Module required to build the layers of the network
from keras.layers import Dense

Using TensorFlow backend.


In [10]:
# Initialising the ANN
classifier = Sequential()

Now that the network is initialized, we can start building its layers. First we create the input layer and the first hidden layer, which means choosing how many nodes the first hidden layer should have. There is no firm rule for this; for more complicated networks the number can be found by parameter tuning. For now we will take the average of the number of input and output nodes, (11 + 1) / 2 = 6. The input has 11 nodes because we have 11 independent variables.

In [11]:
# Adding the input layer and the first hidden layer
classifier.add(Dense(units=6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))

We use the 'uniform' initializer to set the weights of this layer, which gives a random set of weights that are uniformly distributed and close to 0. The activation for this layer is the rectifier function, selected with 'relu', which returns max(x, 0).
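To make the two ideas concrete, here is a minimal NumPy sketch of what one forward pass through such a layer computes. It is not Keras' own code: the weight range of [-0.05, 0.05] is assumed to match the default of Keras' `RandomUniform` initializer, and the input row `x` is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed range: small uniform weights in [-0.05, 0.05], for 11 inputs and 6 hidden nodes
weights = rng.uniform(low=-0.05, high=0.05, size=(11, 6))
bias = np.zeros(6)

def relu(x):
    """Rectifier activation: max(x, 0), applied element-wise."""
    return np.maximum(x, 0)

# One made-up, already scaled input row with 11 features
x = rng.normal(size=(1, 11))

# Output of the first hidden layer: relu(x . W + b)
hidden = relu(x @ weights + bias)
print(hidden.shape)
```

Because the weights start close to 0, the initial hidden activations are small, and 'relu' simply zeroes out the negative ones.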

In [12]:
# Adding the second hidden layer
classifier.add(Dense(units=6, kernel_initializer = 'uniform', activation = 'relu'))

This second hidden layer is not strictly necessary here, because a simpler network can solve the problem. Adding it does little harm, though, and we are trying to build a deep neural network, so we add one more hidden layer with the same number of nodes as the first.

Next we add the output layer. It has just 1 node because the outcome is binary (1 or 0). We use the 'sigmoid' activation function, which outputs a probability. If we had more than 2 categories, we would use one output node per category with the 'softmax' activation function.
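As a rough illustration of the difference between the two activations, the sketch below implements both in NumPy. The input values are made up; this is only meant to show the shape of the outputs, not how Keras implements them internally.

```python
import numpy as np

def sigmoid(z):
    """Squash a single score into a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z)      # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Binary case: one output node, one probability
p_exit = sigmoid(0.0)
print(p_exit)

# Multi-class case: one output node per category (made-up scores)
probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())
```

With a single sigmoid node, the probability of the other class is simply 1 minus the output, which is why one node is enough for a binary problem.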

In [13]:
# Adding the output layer
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

Now that all our layers are created, we need to compile the network. In this step we fix the following parameters of the network.

optimizer - The algorithm used to find the optimal set of weights for the network. We will use a form of stochastic gradient descent (SGD) for this problem; 'adam' is a widely used variant of it.

loss - The loss function that the optimizer minimizes. We will be using 'binary_crossentropy', which is the same logarithmic loss used in logistic regression. If the output had more than 2 categories, we would use 'categorical_crossentropy'.

metrics - The measures used to evaluate the performance of the model; here we track accuracy.
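To show what the loss actually measures, here is a small NumPy sketch of binary cross-entropy. The function name and the example labels and probabilities are made up for illustration; Keras computes the same quantity internally, with its own clipping constant.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) never occurs
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

# Confident and correct predictions give a small loss
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
# Confident but wrong predictions are penalised heavily
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

The loss grows quickly as the predicted probability moves away from the true label, which is what pushes the optimizer towards well-calibrated outputs.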

In [14]:
# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

Now we fit the ANN to our training set by passing in X_train and y_train. We also need to specify the number of epochs, which is how many times the full training data is passed through the network, and the batch_size, which is the number of training rows processed before the weights are updated. Again, there is no firm rule for choosing these values. We will use 100 epochs and a batch_size of 10, which means the weights are updated after every 10 training rows.
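The arithmetic behind these two settings can be sketched as follows. The training set size of 8,000 rows is derived from the 10,000 customers and the `test_size = 0.2` split used earlier.

```python
import math

n_train = 8000      # 80% of the 10,000 rows
batch_size = 10
epochs = 100

# One weight update per batch; the last batch may be smaller if sizes do not divide evenly
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch, total_updates)  # 800 80000
```

A smaller batch_size means more updates per epoch (and slower epochs), while a larger one gives fewer, smoother updates.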

In [15]:
# Fitting the ANN to the Training set
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)

Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x1a1db8e2e8>

#### Making the predictions and evaluating the model

In [16]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)

# We get a probability as output because we have used a sigmoid activation function.
# To get a binary output, we threshold the probability at 0.5 as below.
y_pred = (y_pred > 0.5)

In [17]:
print(y_pred)

[[False]
 [False]
 [False]
 ..., 
 [False]
 [False]
 [False]]


In [18]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[1546   49]
 [ 261  144]]


In [19]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy = (tn+tp)/(tn + fp + fn + tp) * 100
print("Accuracy: ",accuracy,"%")

Accuracy:  84.5 %
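Accuracy alone can hide how the model treats each class, so as a supplementary sketch, the other standard measures can be read off the same confusion matrix. The numbers below are copied from the matrix printed above; the names `precision`, `recall` and `baseline` are introduced here for illustration.

```python
import numpy as np

# Confusion matrix printed above: rows are actual classes, columns are predicted classes
cm = np.array([[1546, 49],
               [261, 144]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
precision = tp / (tp + fp)   # of the customers predicted to exit, how many did
recall = tp / (tp + fn)      # of the customers who exited, how many were caught
baseline = (tn + fp) / cm.sum()  # accuracy of always predicting "stays"

print("Accuracy:  %.4f" % accuracy)
print("Precision: %.4f" % precision)
print("Recall:    %.4f" % recall)
print("Baseline:  %.4f" % baseline)
```

On these numbers the recall is noticeably lower than the accuracy, so the network misses many of the customers who actually leave, which matters for a churn problem.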


### Predicting a single new observation 

###### Will the customer below exit the bank?

* Geography: France
* Credit Score: 600
* Gender: Male
* Age: 40 years old
* Tenure: 3 years
* Balance: \$60000
* Number of Products: 2
* Does this customer have a credit card? Yes
* Is this customer an Active Member: Yes
* Estimated Salary: \$50000

In [21]:
customer = np.array([[0.0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]])
scaled_customer = sc.transform(customer)
new_prediction = classifier.predict(scaled_customer)
new_prediction = (new_prediction > 0.5)
print(new_prediction)

[[False]]
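Writing the feature row by hand is error-prone, so a small helper can make the column order explicit. This is only a sketch: `encode_customer` is a hypothetical function, the column order is inferred from the preprocessed row printed earlier, and it assumes the three Geography categories are France, Germany and Spain (encoded alphabetically, with the France column dropped) and that Female is 0 and Male is 1.

```python
import numpy as np

def encode_customer(geography, credit_score, gender, age, tenure,
                    balance, num_products, has_card, is_active, salary):
    """Build one feature row in the assumed column order:
    [Germany, Spain, CreditScore, Gender, Age, Tenure, Balance,
     NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary]."""
    germany = 1.0 if geography == 'Germany' else 0.0
    spain = 1.0 if geography == 'Spain' else 0.0   # France is (0, 0)
    male = 1.0 if gender == 'Male' else 0.0
    return np.array([[germany, spain, credit_score, male, age, tenure,
                      balance, num_products, float(has_card),
                      float(is_active), salary]], dtype=float)

# The customer described above
customer = encode_customer('France', 600, 'Male', 40, 3, 60000, 2, True, True, 50000)
print(customer)
```

The resulting row still has to go through `sc.transform` before prediction, because the network was trained on scaled features.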


### Evaluating and Improving the performance of ANN

We will use K-Fold Cross Validation to evaluate the performance of the ANN and Grid Search to tune its hyperparameters. We will also see how the model's performance changes when we use dropout. The execution output for the parts below is not shown because it is very large; only the final result obtained for each section is included.

In [23]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score

In [24]:
def build_classifier():
    classifier = Sequential()
    classifier.add(Dense(units=6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
    classifier.add(Dense(units=6, kernel_initializer = 'uniform', activation = 'relu'))
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier 

In [None]:
classifier = KerasClassifier(build_fn = build_classifier, batch_size = 10, epochs = 100)
accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10, n_jobs=-1)
print("Accuracies: ",accuracies)
print("Accuracies Mean: ",accuracies.mean())
print("Accuracies SD: ",accuracies.std())

After training, the following results were obtained

* Accuracies: \[ 0.85374999  0.855  0.87625  0.83374999  0.88249999  0.8525  0.83625  0.825  0.85124999  0.86 \]
* Accuracies Mean:  0.852624994963
* Accuracies SD:  0.0170335144316
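For readers who want to see what the 10 folds look like, the sketch below splits 8,000 row indices into 10 folds by hand and recomputes the summary statistics from the accuracies listed above. It uses plain NumPy rather than scikit-learn's own splitter, which may also shuffle or stratify the folds, so treat it as an illustration of the idea only.

```python
import numpy as np

# Fold accuracies copied from the results above
fold_accuracies = np.array([0.85374999, 0.855, 0.87625, 0.83374999, 0.88249999,
                            0.8525, 0.83625, 0.825, 0.85124999, 0.86])
mean_accuracy = fold_accuracies.mean()
sd_accuracy = fold_accuracies.std()   # population SD, as accuracies.std() computes
print("Mean: %.4f, SD: %.4f" % (mean_accuracy, sd_accuracy))

# K-fold idea: each fold is the validation set once, the other k-1 folds are for training
n_rows, k = 8000, 10
folds = np.array_split(np.arange(n_rows), k)
splits = []
for i in range(k):
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    splits.append((train_idx, val_idx))

print(len(splits[0][0]), len(splits[0][1]))  # 7200 800
```

The standard deviation of about 1.7 percentage points gives a sense of how much a single accuracy figure can move just from the choice of rows.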

###### Improving the performance of the model using Dropout
Dropout regularization is used to reduce overfitting. At each update during training, a fraction `rate` of a layer's input units is randomly set to 0, so the network cannot rely too heavily on any single unit.
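The mechanism can be sketched in a few lines of NumPy. This is an illustrative version of "inverted" dropout, in which the surviving units are scaled by 1 / (1 - rate) so that the expected activation stays the same; the function name and the all-ones activations are made up for the example.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero out roughly `rate` of the units and rescale the rest (training time only)."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1000, 6))          # made-up activations of a 6-node layer
dropped = dropout(activations, rate=0.1, rng=rng)

zero_fraction = float((dropped == 0).mean())
print("Fraction of units dropped:", zero_fraction)
print("Mean activation after dropout:", dropped.mean())
```

At prediction time no units are dropped, which is why the rescaling during training is needed to keep the two phases consistent.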

In [None]:
from keras.layers import Dropout

classifier = Sequential()
classifier.add(Dense(units=6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
classifier.add(Dropout(rate= 0.1))
classifier.add(Dense(units=6, kernel_initializer = 'uniform', activation = 'relu'))
classifier.add(Dropout(rate= 0.1))
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy = (tn+tp)/(tn + fp + fn + tp) * 100
print("Accuracy: ",accuracy,"%")

After training with Dropout, the following accuracy was obtained.

Accuracy:  85.35 %

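The test accuracy rose from 84.5 % to 85.35 % with dropout. The gain is small, and smaller than the spread of about 1.7 percentage points seen across the cross-validation folds above, so it should be read as a modest improvement rather than a decisive one.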

###### Tuning the performance of the model using Grid Search

In [26]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

def build_classifier(optimizer):
    classifier = Sequential()
    classifier.add(Dense(units=6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
    classifier.add(Dense(units=6, kernel_initializer = 'uniform', activation = 'relu'))
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    classifier.compile(optimizer = optimizer, loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier 

classifier = KerasClassifier(build_fn = build_classifier)
parameters = {'batch_size': [25, 32], 'epochs': [100, 500], 'optimizer': ['adam', 'rmsprop']}

grid_search = GridSearchCV(estimator = classifier,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10)

grid_search = grid_search.fit(X_train,y_train)

best_accuracy = grid_search.best_score_
print("Best Accuracy: ", best_accuracy)

best_params = grid_search.best_params_
print("Best Parameters: ", best_params)

On execution, Grid Search provided the following outputs.

Best Accuracy:  0.8516

Best Parameters:  {'batch_size': 32, 'nb_epoch': 500, 'optimizer': 'rmsprop'}
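To get a feel for the cost of this search, the sketch below enumerates the parameter combinations with the standard library. The grid is copied from the code above; the count of model fits assumes the 10-fold cross validation used there, and the single extra refit on the full training set is scikit-learn's default behaviour.

```python
from itertools import product

parameters = {'batch_size': [25, 32],
              'epochs': [100, 500],
              'optimizer': ['adam', 'rmsprop']}

# Every combination of one value per parameter
combinations = [dict(zip(parameters, values))
                for values in product(*parameters.values())]
for combo in combinations:
    print(combo)

cv_folds = 10
n_fits = len(combinations) * cv_folds   # one fit per combination per fold
print(len(combinations), "combinations,", n_fits, "cross-validation fits")
```

Each of these fits trains a full network for 100 or 500 epochs, which is why the grid search takes much longer than any single training run.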


**When I trained the network using these parameters, it achieved an accuracy of 86.35 %.**