## Project Description - Bank Churn Prediction
**Bank Churn Prediction**

**Objective:**
Given a bank customer, build a neural-network-based classifier that predicts whether they will leave within the next 6 months.


**Context:**

Service businesses such as banks have to worry about 'churn', i.e. customers leaving to join another service provider. It is important to understand which aspects of the service influence a customer's decision to leave, so that management can prioritise service improvements accordingly.

**Data Description:**

The case study is from an open-source dataset from Kaggle. The dataset contains 10,000 sample points with 14 distinct features such as CustomerId, CreditScore, Geography, Gender, Age, Tenure, Balance, etc.
Link to the Kaggle project site: https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling

**Data Dictionary:**
RowNumber: Row number.

CustomerId: Unique identification key for different customers.

Surname: Surname of the customer

CreditScore: A measure of an individual's ability to pay back a borrowed amount, i.e. a numerical representation of their creditworthiness. A credit score is a 3-digit number in the range 300-900, 900 being the highest.

Geography: The country to which the customer belongs.

Gender: The gender of the customer.

Age: Age of the customer.

Tenure: The period of time a customer has been associated with the bank.

Balance: The account balance (the amount of money deposited in the bank account) of the customer.

NumOfProducts: The number of bank products (accounts and account-affiliated products) the customer holds.

HasCrCard: Does the customer have a credit card through the bank?

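IsActiveMember: Whether the customer is an active member of the bank (1) or not (0). What counts as "active" is subjective and is not defined further here.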

EstimatedSalary: Estimated salary of the customer.

Exited: Whether the customer left the bank (1) or stayed (0). This is the target variable.

**Points Distribution:**
The points distribution for this case is as follows:

1. Read the dataset
2. Drop the columns which are unique for all users like IDs (5 points)
3. Perform bivariate analysis and give your insights from the same (5 points) 
4. Distinguish the feature and target set and divide the data set into training and test sets (5 points)
5. Normalize the train and test data (10 points)
6. Initialize & build the model. Identify the points of improvement and implement the same. (20 points)
7. Predict the results using 0.5 as a threshold (10 points)
8. Print the Accuracy score and confusion matrix (5 points)


In [None]:
!pip install nbconvert 

In [None]:
!jupyter nbconvert --to html Project_BankChurnPrediction.ipynb

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install tensorflow==2.2

In [1]:
import tensorflow as tf
print(tf.__version__)

2.2.0


### **1. Data Preprocessing**

In [3]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

np.set_printoptions(threshold=np.inf)  # print arrays in full, without truncation

In [4]:


bank_data = pd.read_csv('/content/drive/My Drive/CoLab-Project/bank.xls')
bank_data.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [5]:
bank_data.shape

(10000, 14)

The dataset is small in terms of available features: most are numeric and only three are categorical.


In [6]:
bank_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


In [7]:
bank_data.isna().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

There appear to be no NaN values in our dataset, which is good
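The task list asks for a bivariate analysis, which the cells in this notebook do not show. Below is a minimal sketch of one way to do it, relating one categorical and one numeric feature to the target. The frame `toy` holds made-up rows (not the real dataset) with the same column names; on the real data it would be replaced by `bank_data`.

```python
import pandas as pd

# Toy stand-in for bank_data: made-up rows, real column names.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "Germany", "France", "Spain"],
    "Age":       [42, 41, 55, 60, 39, 30],
    "Exited":    [1, 0, 1, 1, 0, 0],
})

# Categorical feature vs. target: churn rate per category.
churn_by_geo = toy.groupby("Geography")["Exited"].mean()

# Numeric feature vs. target: mean of the feature within each class.
age_by_exit = toy.groupby("Exited")["Age"].mean()

print(churn_by_geo)
print(age_by_exit)
```

The same two `groupby` patterns can be repeated for the other features (for example Gender or Balance) to see which ones separate churners from non-churners.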

We can drop the columns that merely identify a customer and carry no predictive signal: CustomerId, RowNumber and Surname.

In [8]:
X = bank_data.drop(labels=['CustomerId', 'Surname', 'RowNumber', 'Exited'], axis = 1)
y = bank_data['Exited']

In [9]:
X.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,619,France,Female,42,2,0.0,1,1,1,101348.88
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58
2,502,France,Female,42,8,159660.8,3,1,0,113931.57
3,699,France,Female,39,1,0.0,2,0,0,93826.63
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1


In [10]:
# X = bank_data.iloc[:, 3:13].values
# y = bank_data.iloc[:, 13].values
# print(X[:10,:], '\n')
# print(y[:10])

Looking at the data, we can see two categorical features that need to be encoded: Gender and Geography.

In [11]:
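from sklearn.preprocessing import LabelEncoder
# LabelEncoder assigns integer codes in alphabetical order of the class names:
# Geography: France=0, Germany=1, Spain=2; Gender: Female=0, Male=1
label1 = LabelEncoder()
X['Geography'] = label1.fit_transform(X['Geography'])
label = LabelEncoder()
X['Gender'] = label.fit_transform(X['Gender'])
X.head()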


Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,619,0,0,42,2,0.0,1,1,1,101348.88
1,608,2,0,41,1,83807.86,1,0,1,112542.58
2,502,0,0,42,8,159660.8,3,1,0,113931.57
3,699,0,0,39,1,0.0,2,0,0,93826.63
4,850,2,0,43,2,125510.82,1,1,1,79084.1


In [12]:
X = pd.get_dummies(X, drop_first=True, columns=['Geography'])
X.head()

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Geography_1,Geography_2
0,619,0,42,2,0.0,1,1,1,101348.88,0,0
1,608,0,41,1,83807.86,1,0,1,112542.58,0,1
2,502,0,42,8,159660.8,3,1,0,113931.57,0,0
3,699,0,39,1,0.0,2,0,0,93826.63,0,0
4,850,0,43,2,125510.82,1,1,1,79084.1,0,1


In [13]:
X.shape[1]

11
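As a side note, the two encoding steps can also be done without integer-coding Geography first, which gives dummy columns with readable names. This is only a sketch on a made-up three-row frame (`toy` is hypothetical), not the encoding used in the rest of this notebook, where the columns are called Geography_1 and Geography_2.

```python
import pandas as pd

# Made-up rows with the two categorical columns of the dataset.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Female", "Male", "Female"],
})

# Binary feature: map to 0/1 (same codes LabelEncoder gives, alphabetical order).
toy["Gender"] = toy["Gender"].map({"Female": 0, "Male": 1})

# Three-level feature: one-hot encode and drop the first level (France) as the baseline.
encoded = pd.get_dummies(toy, columns=["Geography"], drop_first=True)

print(list(encoded.columns))
```

Dropping the first level avoids a redundant column: a customer with both dummies equal to 0 is from France.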

**Feature scaling and train/test split**: Feature scaling is very important for an ANN. It puts all inputs on a comparable range, so that differences in the predicted value come from the learned weights rather than from the raw magnitude of a feature.

In [14]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0, stratify = y)

In [15]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [40]:
print("X_train :" , X_train.shape)
print("X_test:",  X_test.shape)

X_train : (8000, 11)
X_test: (2000, 11)
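To make the scaling step concrete, here is a small NumPy sketch of what standardization does: each column is centred on the training mean and divided by the training standard deviation, and the test set reuses those same training statistics. The arrays `train` and `test` are made-up values, not rows from the dataset.

```python
import numpy as np

# Toy train/test data with two feature columns (made-up values).
train = np.array([[600.0, 30.0], [700.0, 40.0], [800.0, 50.0]])
test = np.array([[650.0, 45.0]])

# Statistics come from the training set only, as with scaler.fit_transform(X_train).
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics, as with scaler.transform(X_test)

print(train_scaled)
print(test_scaled)
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into the model.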


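**Normalizing X_test and X_train**

The data has already been standardized with StandardScaler above, so the row-wise normalization below is left commented out.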

In [16]:
# X_train = preprocessing.normalize(X_train)
# X_test = preprocessing.normalize(X_test)

### **2. Creating Artificial Neural Network**

In [17]:
from tensorflow.keras.models import Sequential  # to initialize the NN
from tensorflow.keras.layers import Dense  # used to create layers in the NN




**Step 1.** Randomly initialise the weights with small numbers close to zero but not zero. This is handled by the kernel_initializer argument of the Dense layer.

**Step 2.** Feed the features of the first observation from the dataset into the input layer, one feature per node. Our eleven independent variables therefore give an input layer with eleven nodes.



In [18]:
#Initialising the ANN - Defining as a sequence of layers or a Graph
classifier = Sequential()


**Adding the input layer and the first hidden layer**

*   units : number of nodes to add to the hidden layer.
    Tip: a common rule of thumb is to use the average of the number of nodes in the input layer (11) and in the output layer (1). In this case, (11 + 1) / 2 = 6.
*   kernel_initializer : how the weights are initialised; 'uniform' draws small random values close to zero from a uniform distribution.
*   activation : the activation function of the layer.
*   input_dim : number of nodes in the input layer, i.e. the number of features the hidden layer should expect.






In [20]:
#Input Layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11 ))

**Step 3.** **Forward-Propagation** : 
From the input to the output, the neurons are activated, and the impact each has on the predicted result is determined by its assigned weights. Depending on the number of hidden layers, the system propagates the activations layer by layer until it produces the predicted result y.

To define a hidden layer, we first have to choose an activation function. The rectifier function (ReLU) is a common and effective choice, so we use it for the hidden layers. Using a sigmoid function in the output layer lets us interpret the output as the probability of the positive class (the customer leaving the bank).

In the end, we will be able to rank the customers by their probability to leave the bank.
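The forward pass described above can be sketched in plain NumPy. This is an illustration only: the weights are random values drawn here for the example (they are not the weights Keras learns), the layer sizes 11 -> 6 -> 6 -> 1 follow the cells in this notebook, and `x` holds made-up, already-scaled customers.

```python
import numpy as np

def relu(z):
    """Rectifier: keeps positive values, zeroes out negative ones."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical weights and zero biases for an 11 -> 6 -> 6 -> 1 network.
W1, b1 = rng.uniform(-0.05, 0.05, (11, 6)), np.zeros(6)
W2, b2 = rng.uniform(-0.05, 0.05, (6, 6)), np.zeros(6)
W3, b3 = rng.uniform(-0.05, 0.05, (6, 1)), np.zeros(1)

x = rng.normal(size=(4, 11))  # four made-up customers, 11 scaled features each

h1 = relu(x @ W1 + b1)      # first hidden layer
h2 = relu(h1 @ W2 + b2)     # second hidden layer
p = sigmoid(h2 @ W3 + b3)   # output: one churn probability per customer

print(p.ravel())
```

Sorting customers by `p` in descending order gives the ranking by probability of leaving mentioned above.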

**Adding Second hidden layer**
There is no need to specify the input dimensions since our network already knows.

In [21]:
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))

**Adding Output layer**
There is no need to specify the input dimensions since our network already knows.

*   units : one node in the output layer, since the target is binary.
*   activation : sigmoid, which returns a probability for the binary outcome. With more than two output categories we would use softmax instead.

In [22]:
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

**Summary of the model**

In [23]:
classifier.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 6)                 72        
_________________________________________________________________
dense_1 (Dense)              (None, 6)                 42        
_________________________________________________________________
dense_2 (Dense)              (None, 6)                 42        
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 7         
Total params: 163
Trainable params: 163
Non-trainable params: 0
_________________________________________________________________
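The parameter counts in the summary can be checked by hand: a Dense layer has one weight per (input, unit) pair plus one bias per unit. A short sketch, where `dense_params` is a helper defined here for illustration and the layer sizes are read off the summary above (11 inputs, three hidden layers of 6 units, 1 output):

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters in a Dense layer: weights plus one bias per unit."""
    return (n_inputs + 1) * n_units

layer_sizes = [11, 6, 6, 6, 1]  # inputs, hidden layers, output, as listed in the summary
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts, sum(counts))
```

This gives 72, 42, 42 and 7 parameters per layer, 163 in total, in agreement with the summary.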


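We can experiment with the model by changing the number of dense layers, the learning rate, or the number of neurons in the hidden layers, and compare the resulting accuracy and loss. Note that the summary above lists four Dense layers, whereas the cells shown add three; a hidden-layer cell appears to have been run more than once in this session.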

**Step 4.** **Cost Function** : 

Measure the generated error by comparing the predicted value with the true value.

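Compiling the ANN (the weights are fitted with a stochastic gradient descent method):

*   optimizer : the algorithm used to find the weights that minimise the loss; 'adam' is a variant of stochastic gradient descent.
*   loss : the loss function that the optimizer minimises; binary cross-entropy for a binary target.
*   metrics : the criteria used to evaluate the model.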

In [24]:
classifier.compile(optimizer='adam', loss = 'binary_crossentropy', metrics=['accuracy'])
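To see what the binary cross-entropy loss measures, here is a small NumPy sketch. The function `binary_crossentropy` is written here for illustration (it is not the Keras implementation), and the labels and probabilities are made up.

```python
import numpy as np

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    p = np.clip(p_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([1, 0, 1, 0])
confident_right = np.array([0.9, 0.1, 0.8, 0.2])  # probabilities close to the true labels
confident_wrong = np.array([0.1, 0.9, 0.2, 0.8])  # probabilities far from the true labels

loss_right = binary_crossentropy(y_true, confident_right)
loss_wrong = binary_crossentropy(y_true, confident_wrong)

print(loss_right, loss_wrong)
```

The loss is small when the predicted probabilities agree with the labels and grows quickly when the model is confidently wrong, which is what the optimizer tries to avoid.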

**Step 5.** **Back-Propagation**: from the output to the input layer, the calculated error is back-propagated and the weights are updated according to how much each contributed to the error. The learning rate determines how large each update is.

**Step 6.** Repeat the steps above and update the weights either after each observation (online, or stochastic, learning) or after each batch of observations (batch learning).

**Step 7.** When the system has gone through the whole training dataset, one epoch has been completed. Repeat for more epochs.
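Steps 5 to 7 can be sketched with a single sigmoid neuron trained by mini-batch gradient descent. This is a simplified illustration under stated assumptions: plain gradient descent instead of Adam, one neuron instead of the full network, and made-up data (`X`, `y`) with a hypothetical learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                     # made-up scaled features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # made-up binary target

w, b = np.zeros(3), 0.0
learning_rate, batch_size, epochs = 0.1, 10, 10

def loss(w, b):
    """Binary cross-entropy of the current weights on the whole toy dataset."""
    p = np.clip(1.0 / (1.0 + np.exp(-(X @ w + b))), 1e-7, 1 - 1e-7)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

initial_loss = loss(w, b)
for epoch in range(epochs):                       # one epoch = one full pass over the data
    order = rng.permutation(len(X))               # shuffle the observations each epoch
    for start in range(0, len(X), batch_size):    # one weight update per batch
        idx = order[start:start + batch_size]
        p = 1.0 / (1.0 + np.exp(-(X[idx] @ w + b)))        # forward pass
        error = p - y[idx]                                 # gradient of the loss w.r.t. the pre-activation
        w -= learning_rate * X[idx].T @ error / len(idx)   # weight update scaled by the learning rate
        b -= learning_rate * error.mean()
final_loss = loss(w, b)
updates_per_epoch = len(X) // batch_size

print(initial_loss, final_loss, updates_per_epoch)
```

With the 8000 training rows of this notebook and a batch size of 10, the same logic gives 800 weight updates per epoch.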

**Fitting the ANN to the Training Set**

batch_size : number of observations after which we update the weights

epochs : number of full passes over the training set

In [26]:
classifier.fit(X_train, y_train.to_numpy(), batch_size = 10, epochs = 10 ,verbose = 1 )

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f5e8afbf588>

This is the trained ANN model which, after 10 epochs on the training set, reached an accuracy of around 86%.

Changing the number of epochs from 10 to 100 left the accuracy almost unchanged at 0.8378, while changing the batch size from 10 to 100 decreased the accuracy from 0.84 to 0.8016.

### **3. Making Predictions**
We've trained our ANN model and now we're ready to see how well it predicts churn on our test set.

In [27]:
#Predicting the Test set results
y_pred = classifier.predict(X_test)
#Threshold of 50%
y_pred = (y_pred > 0.5)

In [28]:
y_pred[:10]

array([[False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False]])

In [29]:
y_test[:10]

1344    1
8167    0
4747    0
5004    1
3124    1
1940    1
2090    0
3298    0
8364    1
9485    0
Name: Exited, dtype: int64

In [30]:
# Making the confusion matrix
from sklearn import metrics
from sklearn.metrics import confusion_matrix, accuracy_score
from mlxtend.plotting import plot_confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[1532,   61],
       [ 227,  180]])

The confusion matrix shows that, out of 2000 observations, the model made 1532 + 180 = 1712 correct predictions and 227 + 61 = 288 incorrect predictions.
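The usual metrics can be derived by hand from the four cells of the confusion matrix. The sketch below uses the values printed above, assuming scikit-learn's layout (rows are true classes, columns are predicted classes, so the matrix reads [[TN, FP], [FN, TP]]).

```python
# Cells of the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 1532, 61, 227, 180

total = tn + fp + fn + tp
accuracy = (tn + tp) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # of the customers flagged as churners, how many really left
recall = tp / (tp + fn)               # of the customers who really left, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 2), round(recall, 2), round(f1, 2))
```

The recall of about 0.44 for the churn class means the model misses more than half of the customers who actually leave, which the overall accuracy alone does not reveal.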

In [31]:
cr=metrics.classification_report(y_test,y_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.87      0.96      0.91      1593
           1       0.75      0.44      0.56       407

    accuracy                           0.86      2000
   macro avg       0.81      0.70      0.73      2000
weighted avg       0.85      0.86      0.84      2000



In [35]:
accuracy = (1532+180)/(2000)
accuracy

0.856

In [36]:
accuracy_score(y_test, y_pred)

0.856

So, in the end, our model was able to predict whether our clients will leave the bank with an accuracy of 85.6%. However, we did not do any hyperparameter tuning, so these results can probably be improved.

**Predicting a single observation**

In [37]:
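# Use scaler.transform to scale the new observation with the training-set statistics.
# Column order must match X: CreditScore, Gender, Age, Tenure, Balance, NumOfProducts,
# HasCrCard, IsActiveMember, EstimatedSalary, Geography_1, Geography_2
new_prediction = classifier.predict(scaler.transform(np.array([[600, 1, 40, 3, 60000, 2, 1, 1, 50000, 0, 0]])))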
new_prediction = (new_prediction > 0.5)
new_prediction

array([[False]])

In conclusion, this model predicted that our client will probably not leave the bank in the future.
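When predicting a single observation, the values must be passed in exactly the column order the scaler and the model were trained on. One way to guard against ordering mistakes is to key the values by column name and build the row from the list of training columns. This is a sketch: `feature_order` copies the column order of `X` shown earlier, and `customer` is a hypothetical client.

```python
# Column order of X after encoding, as shown by X.head() earlier in the notebook.
feature_order = [
    "CreditScore", "Gender", "Age", "Tenure", "Balance", "NumOfProducts",
    "HasCrCard", "IsActiveMember", "EstimatedSalary", "Geography_1", "Geography_2",
]

# Hypothetical customer, keyed by column name so the order of entry does not matter.
customer = {
    "Geography_1": 0, "Geography_2": 0,   # both 0 -> France (the dropped baseline level)
    "CreditScore": 600, "Gender": 1,      # 1 -> Male under LabelEncoder's alphabetical order
    "Age": 40, "Tenure": 3, "Balance": 60000, "NumOfProducts": 2,
    "HasCrCard": 1, "IsActiveMember": 1, "EstimatedSalary": 50000,
}

missing = set(feature_order) - set(customer)
assert not missing, f"missing features: {missing}"

# One row with 11 values, in training order, ready for scaler.transform(...)
row = [[customer[name] for name in feature_order]]
print(row)
```

The resulting `row` can be wrapped in `np.array(row)` and passed to `scaler.transform` and then `classifier.predict`.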