## Bank Churn Prediction Project

Service businesses such as banks have to worry about churn, i.e. customers leaving for another
service provider. It is important to understand which aspects of the service influence a customer's
decision to leave, so that management can focus its improvement efforts on those priorities.


**Objective:**
Given a bank customer, build a neural-network-based classifier that predicts whether the customer
will leave within the next 6 months.


Points Distribution:
The points distribution for this case is as follows:
1. Read the dataset
2. Drop the columns which are unique for all users like IDs (5 points)
3. Distinguish the features and target variable (5 points)
4. Divide the data set into training and test sets (5 points)
5. Normalize the train and test data (10 points)
6. Initialize & build the model. Identify the points of improvement and implement them (20 points)
7. Predict the results using 0.5 as a threshold (10 points)
8. Print the Accuracy score and confusion matrix (5 points)

### Dataset Description

The case study uses an open-source dataset from Kaggle.
The dataset contains 10,000 rows and 14 columns: identifiers such as RowNumber, CustomerId and Surname,
features such as CreditScore, Geography, Gender, Age, Tenure and Balance, and the target column Exited.
Link to the Kaggle project site:
https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling

In [None]:
!pip install tensorflow==2.0


In [None]:
import tensorflow as tf
print(tf.__version__)

2.3.0


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn import model_selection, preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.metrics import (accuracy_score, auc, confusion_matrix, f1_score,
                             precision_recall_curve, precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from tensorflow.keras import optimizers
from tensorflow.keras.layers import BatchNormalization, Dense, Dropout
from tensorflow.keras.models import Sequential


In [None]:
from google.colab import drive

In [None]:
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


Read the dataset

In [None]:
project_path = '/content/drive/My Drive/Colab Notebooks/'

In [None]:
dataset_file = project_path + 'bank.csv'

In [None]:
data = pd.read_csv(dataset_file)

In [None]:
data.head()

   RowNumber  CustomerId   Surname  CreditScore  Geography  Gender  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited
0          1    15634602  Hargrave          619     France  Female   42       2        0.0              1          1               1        101348.88       1
1          2    15647311      Hill          608      Spain  Female   41       1   83807.86              1          0               1        112542.58       0
2          3    15619304      Onio          502     France  Female   42       8   159660.8              3          1               0        113931.57       1
3          4    15701354      Boni          699     France  Female   39       1        0.0              2          0               0         93826.63       0
4          5    15737888  Mitchell          850      Spain  Female   43       2  125510.82              1          1               1          79084.1       0


In [None]:
data.isna().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

Drop the columns which are unique for all users like IDs (5 points)

In [None]:
data['Geography'].value_counts()

France     5014
Germany    2509
Spain      2477
Name: Geography, dtype: int64

In [None]:
# RowNumber, CustomerId and Surname identify individual customers and carry no predictive signal, so drop them
data = data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
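One way to check which columns behave like identifiers is to count the distinct values per column: a column with as many distinct values as rows cannot generalize to new customers. The sketch below is an illustration on a made-up five-row frame that mirrors the rows shown by `data.head()`; the names `toy`, `id_like` and `features` are hypothetical.

```python
import pandas as pd

# Toy frame mirroring the first five rows of the dataset (illustrative subset of columns)
toy = pd.DataFrame({
    "RowNumber": [1, 2, 3, 4, 5],
    "CustomerId": [15634602, 15647311, 15619304, 15701354, 15737888],
    "Surname": ["Hargrave", "Hill", "Onio", "Boni", "Mitchell"],
    "Geography": ["France", "Spain", "France", "France", "Spain"],
    "Exited": [1, 0, 1, 0, 0],
})

# Columns whose number of distinct values equals the number of rows act like identifiers
distinct = toy.nunique()
id_like = distinct[distinct == len(toy)].index.tolist()
print(id_like)

# Drop the identifier-like columns before modelling
features = toy.drop(columns=id_like)
print(features.columns.tolist())
```

This check is only a heuristic. In a larger sample surnames can repeat, so Surname is dropped because it identifies a person, not because it is guaranteed to be unique.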

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      10000 non-null  int64  
 1   Geography        10000 non-null  object 
 2   Gender           10000 non-null  object 
 3   Age              10000 non-null  int64  
 4   Tenure           10000 non-null  int64  
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64  
 7   HasCrCard        10000 non-null  int64  
 8   IsActiveMember   10000 non-null  int64  
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64  
dtypes: float64(2), int64(7), object(2)
memory usage: 859.5+ KB


Distinguish the features and target variable (5 points)

In [None]:
# Features: CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary
X = data.iloc[:, 0:10].values
# Target: Exited
y = data.iloc[:, 10].values

In [None]:
# Encode the Geography column (strings) as integer labels
print(X[:8,1], 'is: ')

label_X_country_encoder = LabelEncoder()
X[:,1] = label_X_country_encoder.fit_transform(X[:,1])
print(X[:8,1])

['France' 'Spain' 'France' 'France' 'Spain' 'Spain' 'France' 'Germany'] is: 
[0 2 0 0 2 2 0 1]


In [None]:
# Encoding gender
print(X[:6,2], 'is: ')

label_X_gender_encoder = LabelEncoder()
X[:,2] = label_X_gender_encoder.fit_transform(X[:,2])
print(X[:6,2])

['Female' 'Female' 'Female' 'Female' 'Female' 'Male'] is: 
[0 0 0 0 0 1]
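`LabelEncoder` assigns integer codes in the sorted order of the class labels, which is why France, Germany and Spain map to 0, 1 and 2 in the Geography output. A minimal sketch that recovers this mapping from the fitted encoder's `classes_` attribute, using the eight country values printed earlier (the names `countries` and `mapping` are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# The first eight Geography values, as printed in the notebook output
countries = ["France", "Spain", "France", "France", "Spain", "Spain", "France", "Germany"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ lists the categories in sorted order; the position is the integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

Knowing the mapping matters later, because the order of the one-hot columns follows the same sorted order.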


In [None]:
# One-hot encode the Geography codes (a nominal variable) into separate indicator columns
countryhotencoder = ColumnTransformer([("countries", OneHotEncoder(), [1])], remainder="passthrough")

X = countryhotencoder.fit_transform(X)

In [None]:
X.shape

(10000, 12)

In [None]:
X

array([[1.0, 0.0, 0.0, ..., 1, 1, 101348.88],
       [0.0, 0.0, 1.0, ..., 0, 1, 112542.58],
       [1.0, 0.0, 0.0, ..., 1, 0, 113931.57],
       ...,
       [1.0, 0.0, 0.0, ..., 0, 1, 42085.58],
       [0.0, 1.0, 0.0, ..., 1, 0, 92888.52],
       [1.0, 0.0, 0.0, ..., 1, 0, 38190.78]], dtype=object)

In [None]:
# Drop the first one-hot column (France) so that it becomes the reference category and the dummies are not collinear
X = X[:,1:] 
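The one-hot columns follow the sorted category order (France, Germany, Spain), so slicing off column 0 removes the France indicator and leaves France as the reference category. The sketch below shows the same idea with `pandas.get_dummies` and `drop_first=True`; this is an alternative shown for illustration, not the approach used in this notebook, and the four-row series is made up.

```python
import pandas as pd

geo = pd.Series(["France", "Spain", "France", "Germany"], name="Geography")

# Full one-hot encoding: one column per category, in sorted order
full = pd.get_dummies(geo).astype(int)
print(full.columns.tolist())

# drop_first=True removes the first category (France), which becomes the baseline
reduced = pd.get_dummies(geo, drop_first=True).astype(int)
print(reduced.columns.tolist())
print(reduced.values.tolist())
```

A row of all zeros in the reduced encoding therefore stands for a customer from France.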

Divide the data set into training and test sets (5 points)

In [None]:
# Split the dataset into the train and test set

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0)
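The split above does not stratify on the target. Because only about 20% of customers in this data churn, one option is to pass `stratify=y` so that the train and test sets keep the same class ratio. The sketch below demonstrates this on synthetic labels; `X_demo` and `y_demo` are made-up stand-ins, and stratification is a suggestion, not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with 20% positives, standing in for the Exited column
y_demo = np.array([1] * 200 + [0] * 800)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the share of positives the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```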

Normalize the train and test data (10 points)

In [None]:
# Feature Scaling
sc=StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
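The scaler is fitted on the training set only and then reused on the test set, so no test-set statistics leak into training. A small sketch on made-up numbers shows that `transform` applies z = (x - mean_train) / std_train, where the standard deviation is the population one (`ddof=0`); the arrays `train` and `test` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[5.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training data only
test_scaled = scaler.transform(test)        # reuses the training statistics

# Manual equivalent using the training mean and population standard deviation
mean, std = train.mean(axis=0), train.std(axis=0)
manual_test = (test - mean) / std
print(test_scaled, manual_test)
```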

Initialize & build the model. Identify the points of improvement and implement them (20 points)

In [None]:
# Initializing the ANN
classifier = Sequential()

In [None]:
classifier.add(Dense(activation = 'relu', input_dim = 11, units=6, kernel_initializer='uniform'))

In [None]:
# Second hidden layer (the previous Dense call already created the first one)
classifier.add(Dense(6, activation='sigmoid', kernel_initializer='uniform'))

In [None]:
# Adding the output layer
classifier.add(Dense(1, activation = 'sigmoid', kernel_initializer='uniform')) 

In [None]:
# Compile the model with SGD at its default learning rate and mean squared error as the loss
classifier.compile(optimizer='SGD', loss='mse', metrics=['accuracy'])

In [None]:
classifier.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_6 (Dense)              (None, 6)                 72        
_________________________________________________________________
dense_7 (Dense)              (None, 6)                 42        
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 7         
Total params: 121
Trainable params: 121
Non-trainable params: 0
_________________________________________________________________
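The parameter counts in the summary can be verified by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A short sketch of that arithmetic (the helper `dense_params` is a hypothetical name introduced here):

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# (inputs, units) for the three Dense layers: 11 -> 6 -> 6 -> 1
layer_sizes = [(11, 6), (6, 6), (6, 1)]
counts = [dense_params(i, u) for i, u in layer_sizes]
print(counts, sum(counts))  # [72, 42, 7] 121
```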


In [None]:
classifier.fit(X_train, y_train,           
          validation_data=(X_test,y_test),
          epochs=100,
          batch_size=32)

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7fa1653cdbe0>

Predict the results using 0.5 as a threshold (10 points)


In [None]:
y_pred = classifier.predict(X_test)
print(y_pred)

[[0.20556968]
 [0.20624334]
 [0.20172228]
 ...
 [0.20138817]
 [0.20420489]
 [0.2055082 ]]


In [None]:
# Convert the probability of a customer leaving the bank to boolean(true or false) using the cutoff value 0.5
y_pred = (y_pred > 0.5)
print(y_pred)

[[False]
 [False]
 [False]
 ...
 [False]
 [False]
 [False]]
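The printed probabilities all sit close to 0.2, so none of them crosses the 0.5 threshold. The sketch below applies the same threshold to the six values visible in the output above and measures how narrow their range is; it only covers those six values, not the full prediction array.

```python
import numpy as np

# The six probabilities visible in the printed output above
probs = np.array([0.20556968, 0.20624334, 0.20172228,
                  0.20138817, 0.20420489, 0.2055082])

# 1 = predicted to leave, 0 = predicted to stay
labels = (probs > 0.5).astype(int)

# Range of the predicted probabilities
spread = probs.max() - probs.min()
print(labels, round(float(spread), 4))
```

A spread this small suggests the network is producing nearly the same output for every customer, whatever the inputs.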


Print the Accuracy score and confusion matrix (5 points)

In [None]:
cm1 = confusion_matrix(y_test, y_pred)
print(cm1)

[[1595    0]
 [ 405    0]]


In [None]:
accuracy_model1 = ((cm1[0][0]+cm1[1][1])*100)/(cm1[0][0]+cm1[1][1]+cm1[0][1]+cm1[1][0])
print (accuracy_model1, '%')

79.75 %
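An accuracy of 79.75% looks reasonable in isolation, but the confusion matrix shows that the model never predicts a churner. The following sketch, using only the numbers printed above, compares the accuracy with the share of customers who stayed and computes the recall on the churn class:

```python
import numpy as np

# Confusion matrix of the first model: rows = actual, columns = predicted
cm1 = np.array([[1595, 0],
                [405, 0]])

total = cm1.sum()
accuracy = np.trace(cm1) / total
majority_share = cm1[0].sum() / total    # fraction of test customers who stayed
recall_churn = cm1[1, 1] / cm1[1].sum()  # fraction of churners that were caught
print(accuracy, majority_share, recall_churn)
```

The accuracy equals the majority-class share, which is what a classifier that always answers "stays" would score.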


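Let me try to improve (optimise) the model by using ReLU in the second hidden layer, the Adam optimizer, and binary cross-entropy as the loss. The number of parameters stays at 121.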

In [None]:
# Initializing the ANN
optimised_classifier = Sequential()
# This adds the input layer (by specifying input dimension) AND the first hidden layer (units)
optimised_classifier.add(Dense(activation = 'relu', input_dim = 11, units=6, kernel_initializer='uniform'))

In [None]:
# Add hidden layer
optimised_classifier.add(Dense(activation = 'relu', units=6, kernel_initializer='uniform')) 

In [None]:
# Add the output layer

optimised_classifier.add(Dense(activation = 'sigmoid', units=1, kernel_initializer='uniform')) 

In [None]:
optimised_classifier.compile(optimizer='adam', loss = 'binary_crossentropy', metrics=['accuracy'])
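Binary cross-entropy, L = -[y log(p) + (1 - y) log(1 - p)], penalizes confident wrong probabilities much more heavily than mean squared error does. The sketch below compares the two losses for a single churner (y = 1) at several predicted probabilities; the NumPy functions are hand-written illustrations, not the Keras implementations.

```python
import numpy as np

def mse(y_true, p):
    """Mean squared error between labels and predicted probabilities."""
    return np.mean((y_true - p) ** 2)

def binary_crossentropy(y_true, p, eps=1e-7):
    """Binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1.0])  # a customer who churned
for p in (0.5, 0.2, 0.01):
    pred = np.array([p])
    print(p, round(float(mse(y_true, pred)), 4),
          round(float(binary_crossentropy(y_true, pred)), 4))
```

As the predicted probability for a true churner falls towards 0, the squared error is bounded by 1 while the cross-entropy keeps growing.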

In [None]:
optimised_classifier.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_9 (Dense)              (None, 6)                 72        
_________________________________________________________________
dense_10 (Dense)             (None, 6)                 42        
_________________________________________________________________
dense_11 (Dense)             (None, 1)                 7         
Total params: 121
Trainable params: 121
Non-trainable params: 0
_________________________________________________________________


In [None]:
optimised_classifier.fit(X_train, y_train,           
          validation_data=(X_test,y_test),
          epochs=100,
          batch_size=32)

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7fa163a9e710>

Predict the results of the improved model using 0.5 as a threshold

In [None]:
y_pred = optimised_classifier.predict(X_test)
print(y_pred)

[[0.20273343]
 [0.3420826 ]
 [0.1396132 ]
 ...
 [0.18157512]
 [0.10916205]
 [0.24121255]]


In [None]:
# Convert the probability of a customer leaving the bank to boolean (True or False) using the cutoff value 0.5
y_pred = (y_pred > 0.5)
print(y_pred)

[[False]
 [False]
 [False]
 ...
 [False]
 [False]
 [False]]


Accuracy score and confusion matrix

In [None]:
cm2 = confusion_matrix(y_test, y_pred)
print(cm2)

[[1533   62]
 [ 212  193]]


In [None]:
accuracy_model2 = ((cm2[0][0]+cm2[1][1])*100)/(cm2[0][0]+cm2[1][1]+cm2[0][1]+cm2[1][0])
print (accuracy_model2, '%')

86.3 %
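Because churners are the minority class, precision and recall on that class say more than accuracy alone. A minimal sketch that derives them from the confusion matrix printed above (rows are actual classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix of the improved model
cm2 = np.array([[1533, 62],
                [212, 193]])
tn, fp, fn, tp = cm2.ravel()

accuracy = (tp + tn) / cm2.sum()
precision = tp / (tp + fp)  # share of predicted churners who really left
recall = tp / (tp + fn)     # share of actual churners that were caught
f1 = 2 * precision * recall / (precision + recall)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With these numbers the model catches a little under half of the customers who left, so lowering the threshold below 0.5 may be worth exploring if missed churners are costly.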


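Conclusion:
The improved (optimised) model has a better accuracy score on the test set: 86.3 % versus 79.75 %.
The first model predicted that no customer would leave, so its accuracy only reflects the share of customers who stayed.
The improved model correctly identifies 193 of the 405 customers who left.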