# Artificial Neural Network

* We have a bank dataset containing customer information (credit score, geography, gender, age, balance, tenure, number of products used, activity, estimated salary, ...) and whether each customer is still a client of the bank 6 months later
* The following ANN predicts whether a new or existing client, given the above information, will stay with the bank or leave
* This task is called churn modelling

## Importing the libraries

In [89]:
import numpy as np
import pandas as pd
import tensorflow as tf

In [90]:
tf.__version__   # prints the version of tensorflow being used

'2.6.0'

## Part 1: Data Preprocessing

### Importing the dataset

In [91]:
dataset = pd.read_csv('Churn_Modelling.csv')
# The first 3 columns in the dataset have no impact on the outcome, so we drop them
# (the network could learn to ignore them, but removing them is cleaner)
x = dataset.iloc[:, 3 : -1].values # taking columns from 4th till second last
y = dataset.iloc[:, -1].values # taking only last column

In [92]:
# the independent variables
print (x)

[[619 'France' 'Female' ... 1 1 101348.88]
 [608 'Spain' 'Female' ... 0 1 112542.58]
 [502 'France' 'Female' ... 1 0 113931.57]
 ...
 [709 'France' 'Female' ... 0 1 42085.58]
 [772 'Germany' 'Male' ... 1 0 92888.52]
 [792 'France' 'Female' ... 1 0 38190.78]]


In [93]:
# the dependent variable
print (y)

[1 0 1 ... 1 1 0]


### Encoding categorical data
There are two categorical variables: the customer's country and the customer's gender. Both need to be encoded as numbers.

Label encoding the 'Gender' column, with index 2

In [94]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
x[:, 2] = le.fit_transform(x[:, 2]) 

In [95]:
# We observe that 'Female' was encoded as 0 and 'Male' as 1
print (x)

[[619 'France' 0 ... 1 1 101348.88]
 [608 'Spain' 0 ... 0 1 112542.58]
 [502 'France' 0 ... 1 0 113931.57]
 ...
 [709 'France' 0 ... 0 1 42085.58]
 [772 'Germany' 1 ... 1 0 92888.52]
 [792 'France' 0 ... 1 0 38190.78]]
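As a minimal sketch of why 'Female' became 0 and 'Male' became 1: LabelEncoder stores its classes in sorted order and assigns each one its position in that order. The toy array and the names `genders_demo`, `le_demo` and `mapping_demo` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy values for illustration only (not rows from the dataset)
genders_demo = np.array(['Female', 'Male', 'Male', 'Female'])

le_demo = LabelEncoder()
codes_demo = le_demo.fit_transform(genders_demo)

# classes_ holds the labels in sorted order; the code is the index in that list
mapping_demo = {label: int(code) for code, label in enumerate(le_demo.classes_)}
print(mapping_demo)
print(codes_demo)
```

Inspecting `classes_` on the fitted encoder is a quick way to confirm which label received which integer.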


One-hot encoding the 'Geography' column (index 1). This is needed because the countries have no natural order, so integer codes such as 0, 1, 2 would suggest a ranking that does not exist. 'Gender' has only two values in our dataset, so a single 0/1 column is enough there.

In [96]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [1])], remainder = 'passthrough')
x = np.array(ct.fit_transform(x))

In [97]:
# We observe that France was encoded as 1, 0, 0; Germany as 0, 1, 0; and Spain as 0, 0, 1
print(x)

[[1.0 0.0 0.0 ... 1 1 101348.88]
 [0.0 0.0 1.0 ... 0 1 112542.58]
 [1.0 0.0 0.0 ... 1 0 113931.57]
 ...
 [1.0 0.0 0.0 ... 0 1 42085.58]
 [0.0 1.0 0.0 ... 1 0 92888.52]
 [1.0 0.0 0.0 ... 1 0 38190.78]]
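The dummy-variable layout observed above can be reproduced on a toy column. This is a small sketch, assuming a three-row array invented for illustration (`countries_demo`); it shows that OneHotEncoder orders its output columns by sorted category name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy single-column input; the values mirror the three countries in the dataset
countries_demo = np.array([['France'], ['Spain'], ['Germany']])

ohe_demo = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so convert it for printing
encoded_demo = ohe_demo.fit_transform(countries_demo).toarray()

print(ohe_demo.categories_[0])  # column order of the dummy variables
print(encoded_demo)
```

Because the columns follow the sorted category order, France maps to the first column, Germany to the second and Spain to the third.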


### Splitting dataset into Training set and Test set

In [98]:
from sklearn.model_selection import train_test_split
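# Note: train_size = 0.2 keeps 20% of the rows for training and 80% for testing,
# which is what the outputs below reflect; test_size = 0.2 is the more common choice
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.2, random_state = 0)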
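The split above passes train_size = 0.2, so only a fifth of the rows are used for training. The following sketch, on a made-up 10-row array (`x_demo`, `y_demo` are assumptions for illustration), contrasts that with the more common test_size = 0.2.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 features
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# train_size = 0.2: 20% of rows go to training, the rest to testing
a_train, a_test, _, _ = train_test_split(x_demo, y_demo, train_size=0.2, random_state=0)

# test_size = 0.2: 20% of rows go to testing, the rest to training
b_train, b_test, _, _ = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)

print(len(a_train), len(a_test))
print(len(b_train), len(b_test))
```

Which of the two is appropriate depends on how much data the model needs to train well versus how precise the test estimate should be.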

### Feature Scaling

Feature scaling is essential for deep learning; we apply it to all features

In [99]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train) # Scaler object is only fitted to training set to avoid information leakage
x_test = sc.transform(x_test)
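To illustrate the fit-on-train, transform-on-test pattern used above, here is a small sketch on synthetic data. The arrays and the `_demo` names are assumptions made for the example; the point is that the test set is scaled with the training set's mean and standard deviation, not its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for a training and a test feature matrix
train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

sc_demo = StandardScaler()
train_scaled = sc_demo.fit_transform(train_demo)  # learns mean/std from training data
test_scaled = sc_demo.transform(test_demo)        # reuses the training mean/std

# The same test scaling done by hand with the training statistics
manual_test = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(np.allclose(test_scaled, manual_test))
```

The scaled test set will generally not have mean exactly 0 and standard deviation exactly 1, and that is expected.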

## Part 2: Building the ANN

### Initialising the ANN

In [100]:
# Create the ANN as an instance of the Sequential class, which stacks layers in order
ann = tf.keras.models.Sequential()

### Adding the input layer and the first hidden layer

Using the Dense class, we specify the number of neurons and the activation function (the rectifier, 'relu')

In [101]:
ann.add(tf.keras.layers.Dense(units = 6, activation = 'relu'))
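As a rough sketch of what a fully connected layer with 6 units and a relu activation computes, the NumPy code below multiplies a batch of inputs by a weight matrix, adds a bias and clips negatives to zero. The random weights are stand-ins chosen for illustration; Keras initialises and then learns its own weights.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_demo = rng.normal(size=(4, 12))    # 4 toy rows with 12 features, as after encoding
weights_demo = rng.normal(size=(12, 6))  # random stand-in for the layer's weight matrix
bias_demo = np.zeros(6)                  # one bias per neuron

def relu(z):
    """Rectifier: keeps positive values and replaces negatives with 0."""
    return np.maximum(z, 0.0)

hidden_demo = relu(batch_demo @ weights_demo + bias_demo)
print(hidden_demo.shape)
```

Each row of the result holds the 6 activations that the next layer receives as its input.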

### Adding the second hidden layer

We simply reuse the code from the first layer above. The add method can append any layer this way at any stage of building the ANN. The number of neurons and the activation function can of course be changed

In [102]:
ann.add(tf.keras.layers.Dense(units = 6, activation = 'relu'))

### Adding the output layer

Dense class is used because we want output layer to be fully connected to the hidden layers. The number of units in the output required is 1, because we only need a binary outcome whether the customer left the bank (1) or stayed with the bank (0).

However, if we were classifying a non-binary dependent variable with, e.g., 3 classes A, B, C, we would need 3 output neurons, one per class, with the dependent variable one-hot encoded: A as 1, 0, 0; B as 0, 1, 0; and C as 0, 0, 1

We use the sigmoid activation function for the output layer instead of the rectifier (relu) because sigmoid squashes the output into the range 0 to 1, so it can be read as the probability that the binary outcome is 1

If the output were non-binary, we would use the softmax activation function instead

In [103]:
ann.add(tf.keras.layers.Dense(units = 1, activation = 'sigmoid'))
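The two output activations discussed above can be written in a few lines of NumPy. This is an illustrative sketch with made-up input values, not the Keras implementation: sigmoid maps a single score to a probability, while softmax maps a vector of scores to probabilities that sum to 1.

```python
import numpy as np

def sigmoid(z):
    """Maps any real score to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Maps a vector of scores to probabilities that sum to 1."""
    shifted = z - np.max(z)  # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

binary_probs = sigmoid(np.array([-2.0, 0.0, 2.0]))  # toy scores for a 1-neuron output
class_probs = softmax(np.array([2.0, 1.0, 0.1]))    # toy scores for 3 classes A, B, C

print(binary_probs)
print(class_probs, class_probs.sum())
```

With softmax, the predicted class is the one with the largest probability, which here is the first.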

## Part 3: Training the ANN

### Compiling the ANN

Compiling takes an optimiser, a loss function and metrics

The Adam optimiser is a variant of stochastic gradient descent that adapts the learning rate for each weight

For binary classification (as in our case), the loss function is binary_crossentropy

For non-binary classification, the loss function is categorical_crossentropy

Several metrics can be passed as a list, but here we use accuracy only

In [104]:
ann.compile(optimizer = 'adam',  loss = 'binary_crossentropy', metrics = ['accuracy'])
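To give a feel for the loss chosen above, here is a hand-written sketch of binary cross-entropy in NumPy. The labels and probabilities are invented for illustration, and the clipping constant `eps` is an assumption of this sketch rather than a statement about Keras internals.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all observations."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # keeps log() away from 0
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

y_true_demo = np.array([1, 0, 1, 0])
good_probs = np.array([0.9, 0.1, 0.8, 0.2])  # confident and correct
bad_probs = np.array([0.1, 0.9, 0.2, 0.8])   # confident and wrong

loss_good = binary_crossentropy(y_true_demo, good_probs)
loss_bad = binary_crossentropy(y_true_demo, bad_probs)
print(loss_good, loss_bad)
```

The loss is small when the predicted probabilities agree with the labels and grows quickly when the model is confidently wrong, which is what training tries to reduce.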

### Training the ANN 
The fit method is used to train the model, as with other ML models.

Batch learning is more efficient because, instead of comparing predictions to the real results one by one to compute and reduce the loss, we compare several of them in a batch.

The batch_size parameter sets how many predictions are compared with the real results in each batch. The classic value is 32, but it can be changed.

epochs is the number of full passes over the training set. Make sure not to choose too small a number, because a NN needs enough epochs to learn properly.
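To make batch_size and epochs concrete, the sketch below splits a training set into shuffled mini-batches for one epoch. It assumes 2,000 training rows, which is what the 8,000-row test confusion matrix later in the notebook implies for a 20/80 split; the helper `iterate_minibatches` is hypothetical and only mimics what fit does internally.

```python
import math
import numpy as np

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield index arrays that together cover every sample once (one epoch)."""
    order = rng.permutation(n_samples)  # shuffle the row indices
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

n_train_demo = 2000    # assumed training-set size (see lead-in)
batch_size_demo = 32

batches_demo = list(iterate_minibatches(n_train_demo, batch_size_demo, np.random.default_rng(0)))
print(len(batches_demo), len(batches_demo[-1]))
print(math.ceil(n_train_demo / batch_size_demo))
```

Under these assumptions each epoch performs one weight update per batch, and the last batch is smaller because 2,000 is not a multiple of 32.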

In [105]:
ann.fit(x_train, y_train, batch_size = 32, epochs = 100)

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

<keras.callbacks.History at 0x7ff74cfa2c10>

As we can observe, the accuracy improves over the epochs and ultimately converges to around 0.86, which means that about 86 out of every 100 training observations are predicted correctly


## Part 4: Making predictions and evaluating the model

### Predicting the result of a single observation

We can now predict whether a customer with the following details would leave the bank:

* Geography: France
* Credit Score: 600
* Gender: Male
* Age: 40
* Tenure: 3 years
* Balance: $60,000
* Number of products: 2
* Has a credit card: Yes
* Is an active member: Yes
* Estimated Salary: $50,000

The input to the predict method must be a 2-D array, hence the double pair of square brackets.

For the geography variable, we must enter the dummy-variable values for France, [1, 0, 0]. The other values are entered as they are, or as 0/1 where the variable is binary.

**NOTE: The predict method must be called on input scaled the same way as the training data, so we use sc.transform.**

REMEMBER: The output layer uses a sigmoid activation, so the prediction is returned as a probability.
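The feature order described above is easy to get wrong when typed by hand. One possible way to make it explicit is a small helper; `make_row`, `GEOGRAPHY_DUMMIES` and `GENDER_CODES` are hypothetical names introduced for this sketch, using the encodings observed earlier in the notebook.

```python
import numpy as np

# Encodings as observed earlier: one-hot for geography, 0/1 for gender
GEOGRAPHY_DUMMIES = {'France': [1, 0, 0], 'Germany': [0, 1, 0], 'Spain': [0, 0, 1]}
GENDER_CODES = {'Female': 0, 'Male': 1}

def make_row(geography, credit_score, gender, age, tenure, balance,
             n_products, has_card, is_active, salary):
    """Build a 2-D array (1 row, 12 columns) in the order the model was trained on."""
    values = GEOGRAPHY_DUMMIES[geography] + [
        credit_score, GENDER_CODES[gender], age, tenure, balance,
        n_products, int(has_card), int(is_active), salary,
    ]
    return np.array([values], dtype=float)

row_demo = make_row('France', 600, 'Male', 40, 3, 60000, 2, True, True, 50000)
print(row_demo.shape)
print(row_demo)
```

The resulting row still has to be passed through sc.transform before it is given to the predict method.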

In [106]:
print(ann.predict(sc.transform([[1, 0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000 ] ])))

[[0.09622014]]


So the probability of this customer leaving the bank is low (about 0.1)

If, instead of a probability, we want a binary outcome (whether the customer will leave the bank or not), we can compare the predicted probability with 0.5, as shown below

A threshold other than 0.5 could be chosen, depending on how costly each kind of error is

In [107]:
print(ann.predict(sc.transform([[1, 0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000 ] ])) > 0.5)

[[False]]


So the model predicts that this customer would not leave the bank
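The effect of the threshold can be seen on a handful of probabilities. In this sketch only 0.096 echoes the prediction above; the other values and the alternative thresholds are made up for illustration.

```python
import numpy as np

# Toy predicted probabilities of leaving the bank
probs_demo = np.array([0.096, 0.35, 0.55, 0.80])

flag_counts = {}
for threshold in (0.3, 0.5, 0.7):
    flagged = probs_demo > threshold  # True means "predicted to leave"
    flag_counts[threshold] = int(flagged.sum())
    print(threshold, flagged)

print(flag_counts)
```

A lower threshold flags more customers as likely to leave, which catches more real leavers at the cost of more false alarms.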

### Predicting the Test set results

In [108]:
y_pred = ann.predict(x_test)
y_pred = (y_pred > 0.5) # This is done to get a binary outcome because y_pred is a probability. 0.5 is the threshold here
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))

[[0 0]
 [0 1]
 [0 0]
 ...
 [0 0]
 [0 0]
 [0 0]]


The left column holds the predictions (y_pred) and the right column holds the real results (y_test)

### The Confusion Matrix

The confusion matrix and the accuracy score measure the final performance on the test set

In [109]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[6083  281]
 [ 989  647]]


0.84125

So out of every 100 customers in the test set, about 84 were predicted correctly as staying or leaving
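The accuracy can be recomputed directly from the confusion matrix printed above, along with precision and recall for the "left the bank" class. This sketch hard-codes the four counts from that output and relies on scikit-learn's convention that rows are true classes and columns are predicted classes; `cm_demo` is a name chosen for the example.

```python
import numpy as np

# Counts copied from the confusion matrix output above
cm_demo = np.array([[6083, 281],
                    [989, 647]])

# Row = true class, column = predicted class
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()
precision = tp / (tp + fp)  # of customers predicted to leave, how many did
recall = tp / (tp + fn)     # of customers who left, how many were caught

print(accuracy, round(precision, 3), round(recall, 3))
```

The recall is much lower than the accuracy, so the model misses a large share of the customers who actually left even though most predictions overall are correct.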