### Objective
Given a bank customer, build a neural-network classifier that predicts whether the customer will leave
the bank within the next 6 months.

### Context:

Service businesses such as banks have to manage the problem of churn, i.e. customers
leaving for another service provider. It is important to understand which aspects of the service
influence a customer's decision to leave, so that management can prioritize its improvement
efforts accordingly.

### Data Description:
The case study uses an open-source dataset from Kaggle.
The dataset contains 10,000 sample points with 14 columns, including identifiers and the target
`Exited`: CustomerId, CreditScore, Geography, Gender, Age, Tenure, Balance etc.

In [2]:
!pip install tensorflow==2.0


### Import the packages and dataframes that are needed

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf
from sklearn import preprocessing
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, precision_recall_curve, auc
import matplotlib.pyplot as plt
from tensorflow.keras import optimizers

In [2]:
import tensorflow as tf
print(tf.__version__)

2.0.0


In [3]:
df = pd.read_csv('Bank.csv')
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


### Check for shape of data and data types!

In [4]:
df.shape

(10000, 14)

In [5]:
df.dtypes

RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

In [6]:
df.isna().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


### Drop the identifier columns and print descriptive statistics using describe()

In [8]:
df.drop(['RowNumber','CustomerId', 'Surname'], axis=1, inplace=True)
df.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [9]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
CreditScore,10000.0,650.5288,96.653299,350.0,584.0,652.0,718.0,850.0
Age,10000.0,38.9218,10.487806,18.0,32.0,37.0,44.0,92.0
Tenure,10000.0,5.0128,2.892174,0.0,3.0,5.0,7.0,10.0
Balance,10000.0,76485.889288,62397.405202,0.0,0.0,97198.54,127644.24,250898.09
NumOfProducts,10000.0,1.5302,0.581654,1.0,1.0,1.0,2.0,4.0
HasCrCard,10000.0,0.7055,0.45584,0.0,0.0,1.0,1.0,1.0
IsActiveMember,10000.0,0.5151,0.499797,0.0,0.0,1.0,1.0,1.0
EstimatedSalary,10000.0,100090.239881,57510.492818,11.58,51002.11,100193.915,149388.2475,199992.48
Exited,10000.0,0.2037,0.402769,0.0,0.0,0.0,0.0,1.0


In [10]:
print(df['CreditScore'] == 0)
print(df['Tenure'] == 0)
print(df['EstimatedSalary'] == 0)

0       False
1       False
2       False
3       False
4       False
        ...  
9995    False
9996    False
9997    False
9998    False
9999    False
Name: CreditScore, Length: 10000, dtype: bool
0       False
1       False
2       False
3       False
4       False
        ...  
9995    False
9996    False
9997    False
9998    False
9999    False
Name: Tenure, Length: 10000, dtype: bool
0       False
1       False
2       False
3       False
4       False
        ...  
9995    False
9996    False
9997    False
9998    False
9999    False
Name: EstimatedSalary, Length: 10000, dtype: bool
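Printing a boolean Series only displays its first and last five rows, so zeros in the middle of a column stay hidden. A minimal sketch of counting them instead, using a small made-up frame (`toy` and its values are illustrative, not taken from `Bank.csv`):

```python
import pandas as pd

# Hypothetical toy frame standing in for the bank data.
toy = pd.DataFrame({
    "CreditScore": [619, 608, 502],
    "Tenure": [2, 0, 8],
    "EstimatedSalary": [101348.88, 112542.58, 113931.57],
})

# Comparing to 0 gives a boolean frame; summing counts the True values per column.
zero_counts = (toy == 0).sum()
print(zero_counts)
```

On the real data the same expression would be applied to `df[['CreditScore', 'Tenure', 'EstimatedSalary']]`.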


In [11]:
df.nunique()

CreditScore         460
Geography             3
Gender                2
Age                  70
Tenure               11
Balance            6382
NumOfProducts         4
HasCrCard             2
IsActiveMember        2
EstimatedSalary    9999
Exited                2
dtype: int64

In [12]:
df[df.duplicated()]  # rows that duplicate an earlier row; empty if there are none

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited


Observations:
1. No null values.
2. Balance is zero in some cases, but that is a valid scenario.
3. Geography and Gender are of object type and need to be label encoded or one-hot encoded.
4. No records have a CreditScore or EstimatedSalary of 0 (their minimums are 350 and 11.58). Tenure is 0 for 413 customers, presumably those with less than one year at the bank.
5. No duplicate rows in the dataset.

In [13]:
BankChurnDf = df.copy()

for i in BankChurnDf.columns:
    x = BankChurnDf[i].value_counts()
    print("Column name is:",i,"and it value is:",x)
    print()

Column name is: CreditScore and it value is: 850    233
678     63
655     54
705     53
667     53
      ... 
419      1
417      1
373      1
365      1
401      1
Name: CreditScore, Length: 460, dtype: int64

Column name is: Geography and it value is: France     5014
Germany    2509
Spain      2477
Name: Geography, dtype: int64

Column name is: Gender and it value is: Male      5457
Female    4543
Name: Gender, dtype: int64

Column name is: Age and it value is: 37    478
38    477
35    474
36    456
34    447
     ... 
92      2
88      1
82      1
85      1
83      1
Name: Age, Length: 70, dtype: int64

Column name is: Tenure and it value is: 2     1048
1     1035
7     1028
8     1025
5     1012
3     1009
4      989
9      984
6      967
10     490
0      413
Name: Tenure, dtype: int64

Column name is: Balance and it value is: 0.00         3617
105473.74       2
130170.82       2
113063.83       1
80242.37        1
             ... 
183555.24       1
137648.41       1
112689.95 

In [14]:
# Manual encoding for Geography and Gender
df.replace({'France' : 0, 'Germany' : 1, 'Spain' : 2,
              'Female' : 0, 'Male' : 1}, inplace = True)

In [15]:
df.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,0,0,42,2,0.0,1,1,1,101348.88,1
1,608,2,0,41,1,83807.86,1,0,1,112542.58,0
2,502,0,0,42,8,159660.8,3,1,0,113931.57,1
3,699,0,0,39,1,0.0,2,0,0,93826.63,0
4,850,2,0,43,2,125510.82,1,1,1,79084.1,0
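Integer codes such as 0/1/2 for Geography imply an ordering (France < Germany < Spain) that the countries do not have. One-hot encoding is the alternative mentioned in the observations. The sketch below shows it on a made-up frame; `toy` and `encoded` are illustrative names, and this is not the encoding the notebook actually uses.

```python
import pandas as pd

# Hypothetical toy frame with the two categorical columns.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Female", "Male", "Male"],
})

# One indicator column per country, so no artificial order is introduced.
encoded = pd.get_dummies(toy, columns=["Geography"], dtype=int)

# A binary column can stay as a single 0/1 feature.
encoded["Gender"] = encoded["Gender"].map({"Female": 0, "Male": 1})
print(encoded)
```

With this encoding the feature matrix would have 12 columns instead of 10, so the `input_shape` of the first layer would have to change accordingly.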


### Separate features and target

In [16]:
feature = df.drop(["Exited"],axis=1)
target = df["Exited"]

In [17]:
X_train, X_test, y_train, y_test = train_test_split(feature, target, test_size = 0.2, random_state = 7)

In [18]:
feature.shape

(10000, 10)

In [19]:
target.shape

(10000,)

In [20]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(8000, 10)
(2000, 10)
(8000,)
(2000,)
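The features span very different ranges (Balance goes up to roughly 250,000, while HasCrCard is 0 or 1), and neural networks usually train better on inputs of comparable scale. The sketch below shows one common approach on synthetic data: a stratified split followed by a `StandardScaler` fitted on the training part only. Both `stratify=y` and `StandardScaler` are additions for illustration, not steps from this notebook, and the arrays are randomly generated.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: two features on very different scales, ~20% positives.
rng = np.random.default_rng(7)
X = rng.normal(loc=[650.0, 76000.0], scale=[97.0, 62000.0], size=(1000, 2))
y = (rng.random(1000) < 0.2).astype(int)

# stratify=y keeps the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

# Fit the scaler on the training data only, then apply it to both parts,
# so train and test inputs go through the identical transformation.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print(X_tr_s.mean(axis=0).round(3), X_tr_s.std(axis=0).round(3))
```

The key design point is that whatever transformation is chosen must be applied to the training and test inputs alike.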


In [21]:
from sklearn.tree import  DecisionTreeClassifier
from sklearn.metrics import recall_score

dtc=DecisionTreeClassifier( max_depth=10)
dtc.fit(X_train,y_train)
pred=dtc.predict(X_test)
recall_score(y_test,pred)


0.46958637469586373

In [22]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,pred)

0.835

### Creating a model

A Keras model object can be created with the Sequential class.

At the outset the model is empty. It is completed by adding layers and compiling it.

In [23]:
model = Sequential()

### Adding layers [layers and activations]

Keras layers are added to the model one at a time.

Adding layers is like stacking Lego blocks one by one.

Because this is a binary classification problem, the final layer should use a sigmoid activation (softmax for multi-class problems).
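The difference between the two output activations can be illustrated numerically. The following is a small NumPy sketch, not Keras code: sigmoid maps one score to a single probability, while softmax maps a vector of scores to probabilities that sum to 1.

```python
import numpy as np

def sigmoid(z):
    """Map a single score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Map a vector of scores to probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

p_exit = sigmoid(0.0)                        # a score of 0 gives probability 0.5
probs = softmax(np.array([1.0, 1.0, 1.0]))   # equal scores give equal probabilities

print(p_exit, probs)
```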

In [24]:
model.add(Dense(32,input_shape = (10,), activation = 'relu'))
model.add(Dense(16, activation = 'tanh'))
model.add(Dense(1, activation = 'sigmoid'))

### Model compile [optimizers and loss functions]

A Keras model must be compiled before training.

The loss function and the optimizer are specified at compile time.
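To make the loss concrete, binary cross-entropy averages $-[y\log p + (1-y)\log(1-p)]$ over the samples. The NumPy sketch below is an illustration of that formula, not the Keras implementation. The helper name `binary_crossentropy` and the clipping constant `eps` are assumptions made for this example.

```python
import numpy as np

def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)]; p is clipped to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([1, 0, 0, 1])

# Confident and correct predictions give a small loss.
confident = binary_crossentropy(y_true, np.array([0.9, 0.1, 0.1, 0.9]))

# Predicting 0.5 everywhere gives log(2), about 0.693.
hedging = binary_crossentropy(y_true, np.array([0.5, 0.5, 0.5, 0.5]))

print(confident, hedging)
```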

In [25]:
# Note: despite the variable name `sgd`, this is the Adam optimizer.
sgd = optimizers.Adam(lr = 0.001)

In [26]:
from tensorflow.keras.metrics import Recall
# With a single sigmoid output, Recall() measures recall of the positive class (Exited = 1).
model.compile(optimizer = sgd, loss = 'binary_crossentropy', metrics=["accuracy", Recall(name="recall")])

In [27]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 32)                352       
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 897
Trainable params: 897
Non-trainable params: 0
_________________________________________________________________


### Training [Forward pass and Backpropagation]

Training the model

In [28]:
model.fit(X_train, y_train.values, batch_size = 700, epochs = 10, verbose = 1)

Train on 8000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x299b8b6c208>

In [30]:
model.compile(optimizer = sgd, loss = 'binary_crossentropy', metrics=["accuracy"])

In [31]:
model.fit(X_train, y_train.values, batch_size = 700, epochs = 10, verbose = 1)

Train on 8000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x299b9ed3788>

In [32]:
loss,acc = model.evaluate(X_test,y_test, verbose=1)
print('Accuracy: %.3f' % acc)
print('Loss: %.3f' % loss)

Accuracy: 0.794
Loss: 0.528


In [33]:
y_predict = model.predict(X_test)
y_predict



array([[0.27244136],
       [0.27244136],
       [0.27244136],
       ...,
       [0.20123327],
       [0.39182314],
       [0.20123327]], dtype=float32)

In [34]:
int(y_predict[1, 0] > 0.5)  # threshold the sigmoid probability; argmax over a single output is always 0

0

In [35]:
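from sklearn import metrics
# Threshold the sigmoid probabilities at 0.5 to get class labels.
# (np.argmax over a single-output row always returns 0.)
y_pred = (y_predict > 0.5).astype(int).ravel()
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)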

[[1589    0]
 [ 411    0]]
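For a model with a single sigmoid output, each prediction row has length 1, so `np.argmax` over that row is always 0 regardless of the probability. Class labels have to come from a threshold instead. A small sketch with made-up probabilities (`y_prob` is illustrative, not model output):

```python
import numpy as np

# Hypothetical sigmoid outputs, shaped (n_samples, 1) like model.predict().
y_prob = np.array([[0.27], [0.81], [0.39], [0.64]])

# argmax over a length-1 row can only ever return index 0.
argmax_labels = np.argmax(y_prob, axis=1)

# Thresholding at 0.5 recovers the intended 0/1 labels.
threshold_labels = (y_prob.ravel() > 0.5).astype(int)

print(argmax_labels)      # [0 0 0 0]
print(threshold_labels)   # [0 1 0 1]
```

`np.argmax` is the right tool only when the final layer has one output per class, as with softmax.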


In [36]:
cr = metrics.classification_report(y_test,y_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.79      1.00      0.89      1589
           1       0.00      0.00      0.00       411

    accuracy                           0.79      2000
   macro avg       0.40      0.50      0.44      2000
weighted avg       0.63      0.79      0.70      2000





#### Evaluation

A Keras model can be evaluated with the evaluate() function.

It returns the loss followed by the compiled metrics, as a list.

In [37]:
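# Caution: the model above was trained on the unnormalized X_train, so
# normalizing only X_test makes the test inputs inconsistent with training.
X_test = preprocessing.normalize(X_test)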

In [40]:
results = model.evaluate(X_test, y_test.values)



In [41]:
print(model.metrics_names)
print(results)   

['loss', 'accuracy']
[0.6104157781600952, 0.7945]


In [42]:
Y_pred_cls = model.predict_classes(X_test, batch_size=200, verbose=1)
print('Accuracy Model1 (Dropout): '+ str(model.evaluate(X_test,y_test.values)[1]))
print('Recall_score: ' + str(recall_score(y_test.values,Y_pred_cls)))
print('Precision_score: ' + str(precision_score(y_test.values, Y_pred_cls)))
print('F-score: ' + str(f1_score(y_test.values,Y_pred_cls)))
print(confusion_matrix(y_test.values, Y_pred_cls))

Accuracy Model1 (Dropout): 0.7945
Recall_score: 0.0
Precision_score: 0.0
F-score: 0.0
[[1589    0]
 [ 411    0]]
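The confusion matrix shows why the accuracy looks acceptable while recall is zero: every customer is predicted as class 0, so the accuracy simply equals the share of non-churners in the test set. The arithmetic, using the matrix printed above:

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes, columns are predictions.
cm = np.array([[1589, 0],
               [411, 0]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # 1589 / 2000
recall = tp / (tp + fn)                # 0 / 411
majority_share = (tn + fp) / cm.sum()  # fraction of customers who did not exit

print(accuracy, recall, majority_share)
```

Because accuracy and the majority-class share coincide here, the model does no better than always predicting "stays", which is why recall on the churn class is the more informative metric for this problem.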


