Data Description: Given a bank customer, can we build a neural-network classifier that predicts whether they will leave the bank?


The dataset contains 10,000 rows and 14 columns, including CustomerId, CreditScore, Geography, Gender, Age, Tenure and Balance, plus the target column Exited. Know your data: https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling

 

Context:
Service businesses such as banks have to worry about churn, i.e. customers leaving for another service provider. It is important to understand which aspects of the service influence a customer's decision to leave, so that management can focus improvement efforts on those priorities.

Steps and Milestones (100%):

- Setup Environment and Load Necessary Packages (5%)
- Data Preparation (40%)
  - Loading Data (5%)
  - Cleaning Data (10%)
  - Data Representation & Feature Engineering (If Any) (15%)
  - Creating Train and Validation Set (10%)
- Model Creation (30%)
  - Write & Configure Model (10%)
  - Compile Model (10%)
  - Build Model & Checking Summary (10%)
- Training and Evaluation (25%)
  - Run Multiple Experiments (10%)
  - Reason & Visualize Model Performance (5%)
  - Evaluate Model on Test Set (10%)

Learning Outcomes:

- Neural Networks for Predictive Analytics
- Fine-tuning Model
- Data Preparation
- Feature Engineering
- Visualization

 

The points distribution for this case is as follows:

1. Read the data set
2. Drop the columns which are unique for all users, like IDs (2.5 points)
3. Distinguish the feature and target set (2.5 points)
4. Divide the data set into training and test sets (2.5 points)
5. Normalize the train and test data (5 points)
6. Initialize & build the model (10 points)
7. Predict the results using 0.5 as a threshold (5 points)
8. Print the Accuracy score and confusion matrix (2.5 points)

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics, preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Plot defaults
plt.rc("font", size=14)
sns.set(style="whitegrid", color_codes=True)

# 1. Read the data set

In [0]:
# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [3]:
# Extract the Drive file id from the share link
link = "https://drive.google.com/open?id=1T5FQ3061Rbt5e6FHf4xub0SYmutrBKFg"
_, file_id = link.split('=')  # avoid shadowing the built-in id()
print(file_id)

# Download the file and load it into pandas DataFrames
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('bank.csv')
bank_df = pd.read_csv('bank.csv')
bank_df_2 = pd.read_csv('bank.csv')  # second copy of the raw data
bank_df.head()

1T5FQ3061Rbt5e6FHf4xub0SYmutrBKFg


Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0



In [5]:
#check for missing values
bank_df.isnull().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [6]:
# Let's analyze the distribution of the various attributes
bank_df.describe(include='all').transpose()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
RowNumber,10000,,,,5000.5,2886.9,1.0,2500.75,5000.5,7500.25,10000.0
CustomerId,10000,,,,15690900.0,71936.2,15565700.0,15628500.0,15690700.0,15753200.0,15815700.0
Surname,10000,2932.0,Smith,32.0,,,,,,,
CreditScore,10000,,,,650.529,96.6533,350.0,584.0,652.0,718.0,850.0
Geography,10000,3.0,France,5014.0,,,,,,,
Gender,10000,2.0,Male,5457.0,,,,,,,
Age,10000,,,,38.9218,10.4878,18.0,32.0,37.0,44.0,92.0
Tenure,10000,,,,5.0128,2.89217,0.0,3.0,5.0,7.0,10.0
Balance,10000,,,,76485.9,62397.4,0.0,0.0,97198.5,127644.0,250898.0
NumOfProducts,10000,,,,1.5302,0.581654,1.0,1.0,1.0,2.0,4.0


In [7]:
bank_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber          10000 non-null int64
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             10000 non-null int64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


In [8]:
%tensorflow_version 2.x

TensorFlow 2.x selected.


In [9]:
import tensorflow as tf
print(tf.__version__)

2.1.0


In [0]:
tf.random.set_seed(42)

# 2. Drop the columns which are unique for all users like IDs
## Observation

## 1. Dropping the columns "RowNumber", "CustomerId" and "Surname", as they identify customers and are not expected to help predict churn




In [11]:
bank_df = bank_df.drop(['RowNumber','CustomerId','Surname'], axis=1)
bank_df.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [12]:
bank_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             10000 non-null int64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(2), int64(7), object(2)
memory usage: 859.5+ KB


In [13]:
# Label encoding of the categorical columns
from sklearn.preprocessing import LabelEncoder

# Instantiate the LabelEncoder object
le = LabelEncoder()

# Boolean mask that is True for object-typed (categorical) columns
categorical_feature_mask = bank_df.dtypes == object
# Filter the categorical columns using the mask and turn them into a list
categorical_cols = bank_df.columns[categorical_feature_mask].tolist()
categorical_cols

['Geography', 'Gender']

In [0]:
# LabelEncoder assigns integer codes in sorted order of the class labels
bank_df['Geography'] = le.fit_transform(bank_df['Geography'])
bank_df['Gender'] = le.fit_transform(bank_df['Gender'])
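
Label encoding turns Geography into the integers 0, 1 and 2, which a dense network may read as an ordering between countries. One common alternative is one-hot encoding. The sketch below is only an illustration on a small made-up frame: `toy_df` and its rows are hypothetical and not taken from bank.csv.

```python
import pandas as pd

# Hypothetical toy rows, not taken from bank.csv
toy_df = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Female", "Male", "Male"],
    "Age": [42, 41, 35, 50],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy_df, columns=["Geography", "Gender"], dtype=int)
print(encoded.columns.tolist())
```

With one-hot columns the input width of the network grows, so the `input_shape` of the first layer would have to match the new number of feature columns.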

In [0]:
bank_df.head(10)
# Keep a label-encoded, not yet standardized copy for the cross-validation section
bank_df_cv = bank_df.copy(deep=True)

In [16]:
# Check for negative numbers in all columns
bank_df.lt(0).sum()


CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [17]:
bank_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
CreditScore        10000 non-null int64
Geography          10000 non-null int64
Gender             10000 non-null int64
Age                10000 non-null int64
Tenure             10000 non-null int64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(2), int64(9)
memory usage: 859.5 KB


# 3. Distinguish the feature and target set 
# 5. Normalize the train and test data

In [18]:
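from scipy.stats import zscore

# NOTE: zscore is applied to every column, so the target 'Exited' is
# standardized as well (0 -> -0.506, 1 -> 1.977 in the output below), and the
# statistics are computed on the full data set before the train/test split.
bank_df = bank_df.apply(zscore)
X_columns = bank_df.columns.tolist()[0:10]
Y_Columns = bank_df.columns.tolist()[-1:]

X = bank_df[X_columns].values    # CreditScore through EstimatedSalary
Y = np.array(bank_df['Exited'])  # Exited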

print(X_columns)
print(Y_Columns)
print(Y)
print(X)
print(X.shape)
print(Y.shape)

['CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']
['Exited']
[ 1.97716468 -0.50577476  1.97716468 ...  1.97716468  1.97716468
 -0.50577476]
[[-0.32622142 -0.90188624 -1.09598752 ...  0.64609167  0.97024255
   0.02188649]
 [-0.44003595  1.51506738 -1.09598752 ... -1.54776799  0.97024255
   0.21653375]
 [-1.53679418 -0.90188624 -1.09598752 ...  0.64609167 -1.03067011
   0.2406869 ]
 ...
 [ 0.60498839 -0.90188624 -1.09598752 ... -1.54776799  0.97024255
  -1.00864308]
 [ 1.25683526  0.30659057  0.91241915 ...  0.64609167 -1.03067011
  -0.12523071]
 [ 1.46377078 -0.90188624 -1.09598752 ...  0.64609167 -1.03067011
  -1.07636976]]
(10000, 10)
(10000,)
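
The cell above standardizes the whole frame, target included, before any split. A minimal sketch of an alternative order is shown here: split first, fit the scaler on the training rows only, and leave the 0/1 labels untouched. It assumes scikit-learn's `StandardScaler` in place of `scipy.stats.zscore`, and runs on synthetic stand-in data (`X_demo`, `y_demo` are hypothetical names, not the bank data).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical stand-in data: 1000 rows, 10 numeric features, ~20% positives
X_demo = rng.normal(loc=50.0, scale=10.0, size=(1000, 10))
y_demo = (rng.random(1000) < 0.2).astype(int)

# Split first; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then reuse it for the test rows
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3))  # close to 0 for every feature
print(np.unique(y_tr))                    # labels are not rescaled
```

The test rows are transformed with the training mean and standard deviation, so their own mean is only approximately zero. That is expected and is what keeps information from the test set out of the preprocessing.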


# 4. Divide the data set into training and test sets

In [19]:
# Split the data up in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

print(y_test)
print(y_train)

(8000, 10)
(8000,)
(2000, 10)
(2000,)
[-0.50577476 -0.50577476 -0.50577476 ...  1.97716468  1.97716468
  1.97716468]
[-0.50577476 -0.50577476  1.97716468 ...  1.97716468  1.97716468
 -0.50577476]


In [20]:
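from tensorflow.keras.utils import to_categorical

# One-hot encode the class labels. The labels are still z-scored at this point
# (-0.506 / 1.977); as the output below shows, they end up as classes 0 / 1.
y_train = to_categorical(y_train, 2, dtype='int')
y_test = to_categorical(y_test, 2, dtype='int')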

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
print(y_test)
print(y_train)


(8000, 10)
(8000, 2)
(2000, 10)
(2000, 2)
[[1 0]
 [1 0]
 [1 0]
 ...
 [0 1]
 [0 1]
 [0 1]]
[[1 0]
 [1 0]
 [0 1]
 ...
 [0 1]
 [0 1]
 [1 0]]
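
The one-hot output above is consistent with the z-scored labels being truncated to integers before encoding. The NumPy sketch below mimics that behaviour to make the mapping explicit. It is an illustration only, not the Keras implementation, and `z_labels` simply reuses the two label values printed earlier.

```python
import numpy as np

# The two standardized label values printed earlier in this notebook
z_labels = np.array([-0.50577476, 1.97716468, -0.50577476])

# Casting to int truncates toward zero, which happens to recover 0/1
int_labels = z_labels.astype(int)

# One-hot rows: index an identity matrix by class id
one_hot = np.eye(2, dtype=int)[int_labels]
print(int_labels)
print(one_hot)
```

The mapping works here only because both values fall strictly between -1 and 2. With a different class balance the standardized values could truncate to other integers, so keeping the target as plain 0/1 is the more robust choice.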


# 6. Initialize & build the model

In [0]:
import tensorflow as tf
from tensorflow.keras import models
from tensorflow.keras.layers import Dense
#Initialize Sequential Graph (model)
model = tf.keras.Sequential()

In [0]:
model.add(Dense(18, activation='relu', input_shape=(10,)))
model.add(Dense(20, activation='relu'))
model.add(Dense(20, activation='relu'))
model.add(Dense(2, activation='softmax'))


In [23]:
# Two softmax outputs with one-hot labels pair with categorical cross-entropy
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 18)                198       
_________________________________________________________________
dense_1 (Dense)              (None, 20)                380       
_________________________________________________________________
dense_2 (Dense)              (None, 20)                420       
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 42        
Total params: 1,040
Trainable params: 1,040
Non-trainable params: 0
_________________________________________________________________
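
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. A short worked check, using the layer widths from the model above:

```python
# Layer widths: 10 input features, then the four Dense layers
layer_sizes = [10, 18, 20, 20, 2]

# A Dense layer has (inputs * units) weights plus one bias per unit
param_counts = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
print(param_counts)       # [198, 380, 420, 42]
print(sum(param_counts))  # 1040
```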


In [24]:
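print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
# NOTE: the test split is also passed as validation data, so it is monitored
# during training and is not a fully independent hold-out set.
model.fit(X_train, y_train, epochs=25, validation_data=(X_test, y_test))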

(8000, 10)
(8000, 2)
(2000, 10)
(2000, 2)
Train on 8000 samples, validate on 2000 samples
Epoch 1/25
...
Epoch 25/25


<tensorflow.python.keras.callbacks.History at 0x7f4bc78d1ef0>

# 7. Predict the results using 0.5 as a threshold

In [25]:
# Rounding the softmax probabilities applies the 0.5 threshold to each class column
y_pred = np.round(model.predict(X_test))
print(y_pred.shape)
print(y_pred[0:10])

(2000, 2)
[[1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [1. 0.]]
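
Rounding works as a 0.5 threshold for two-class softmax output, but an explicit comparison makes the rule easier to read and to change. A minimal sketch on made-up probabilities follows (`proba` is hypothetical, not model output):

```python
import numpy as np

# Hypothetical softmax outputs: columns are P(stay), P(exit)
proba = np.array([
    [0.90, 0.10],
    [0.35, 0.65],
    [0.50, 0.50],
])

# Explicit rule: predict "exit" when P(exit) reaches the threshold
threshold = 0.5
pred_class = (proba[:, 1] >= threshold).astype(int)
print(pred_class)

# np.round sends an exact 0.5 to 0 (round half to even),
# so a tied row becomes [0, 0] instead of a valid one-hot vector
print(np.round(proba))
```

Exact ties are rare with floating-point probabilities, yet the explicit form also yields a 1-D class vector that can be passed directly to scikit-learn metrics.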


In [26]:
score = model.evaluate(X_test, y_test,verbose=1)

print(score)

[0.3530278091430664, 0.86]


# 8. Print the Accuracy score and confusion matrix

In [27]:
from sklearn import metrics

cm = metrics.confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))
cm

array([[1548,   59],
       [ 221,  172]])
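
Accuracy, precision and recall for the churn class can be read directly off the confusion matrix above. The sketch below recomputes them from the four printed counts; `cm_values` is a hand-copied array used so the snippet runs on its own.

```python
import numpy as np

# Confusion matrix printed above: rows = actual, columns = predicted
cm_values = np.array([[1548, 59],
                      [221, 172]])
tn, fp, fn, tp = cm_values.ravel()

accuracy = (tp + tn) / cm_values.sum()
precision = tp / (tp + fp)   # share of predicted churners who really left
recall = tp / (tp + fn)      # share of actual churners the model caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy agrees with the 0.86 returned by `model.evaluate`. The recall shows that fewer than half of the actual churners are flagged, which accuracy alone hides on an imbalanced target.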

# Grid Search CV

In [0]:
def create_model():
  model_2 = tf.keras.Sequential()
  model_2.add(Dense(18, activation='relu', input_shape=(10,)))
  model_2.add(Dense(20, activation='relu'))
  model_2.add(Dense(20, activation='relu'))
  model_2.add(Dense(2, activation='softmax'))
  model_2.compile(optimizer='sgd', loss='categorical_crossentropy',metrics=['accuracy'])
  return model_2

In [29]:
# Scikit-learn wrapper so the Keras model can be used with GridSearchCV
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

model_KC = KerasClassifier(build_fn=create_model)

# Only epochs and batch_size are searched. create_model() takes no arguments,
# so the optimizer and initializer candidates below are not part of the grid.
optimizers = ['rmsprop', 'adam']
init = ['glorot_uniform', 'normal', 'uniform']
batches = [100, 5000]
epochs = [1, 10, 50, 100, 150]
param_grid = dict(epochs=epochs, batch_size=batches)

Using TensorFlow backend.


In [0]:
# gs = GridSearchCV(model_2, param_grid=param_grid,cv=5,scoring='accuracy', n_jobs=1)
gs = GridSearchCV(model_KC, param_grid=param_grid,cv=5,scoring='accuracy')

In [31]:
grid_result = gs.fit(X_train,y_train.argmax(axis=1))


Train on 6400 samples
[... repeated per-fold epoch logs for each parameter combination; the captured output is cut off at epoch 18 of a 50-epoch run ...]

In [36]:



print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means,stds,params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.851250 using {'batch_size': 100, 'epochs': 150}
0.792750 (0.008915) with: {'batch_size': 100, 'epochs': 1}
0.794875 (0.010914) with: {'batch_size': 100, 'epochs': 10}
0.838500 (0.012485) with: {'batch_size': 100, 'epochs': 50}
0.850750 (0.012307) with: {'batch_size': 100, 'epochs': 100}
0.851250 (0.007202) with: {'batch_size': 100, 'epochs': 150}
0.445250 (0.095711) with: {'batch_size': 5000, 'epochs': 1}
0.614250 (0.154171) with: {'batch_size': 5000, 'epochs': 10}
0.793500 (0.009267) with: {'batch_size': 5000, 'epochs': 50}
0.794500 (0.010624) with: {'batch_size': 5000, 'epochs': 100}
0.793500 (0.011869) with: {'batch_size': 5000, 'epochs': 150}
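
The ten result rows correspond to the ten parameter combinations in the grid, each trained once per fold. A small sketch of that count is given below, assuming scikit-learn's `ParameterGrid` helper and the default `refit=True` of `GridSearchCV` (`demo_grid` repeats the grid values so the snippet runs on its own):

```python
from sklearn.model_selection import ParameterGrid

# Same grid values as in the search above
demo_grid = dict(epochs=[1, 10, 50, 100, 150], batch_size=[100, 5000])
n_candidates = len(ParameterGrid(demo_grid))
n_folds = 5

# Every candidate is trained once per fold, plus one final refit on all training rows
total_fits = n_candidates * n_folds + 1
print(n_candidates, total_fits)
```

Counting the fits up front helps estimate run time before adding further parameters such as the optimizer to the grid.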


## K-Fold Cross Validation

In [37]:
bank_df_cv.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,0,0,42,2,0.0,1,1,1,101348.88,1
1,608,2,0,41,1,83807.86,1,0,1,112542.58,0
2,502,0,0,42,8,159660.8,3,1,0,113931.57,1
3,699,0,0,39,1,0.0,2,0,0,93826.63,0
4,850,2,0,43,2,125510.82,1,1,1,79084.1,0


In [0]:
df_1 = bank_df_cv.copy(deep=True)
Y_cv = df_1['Exited']
X_cv = df_1.drop(['Exited'], axis=1)
X_cv = X_cv.values

In [40]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

# NOTE: X_cv comes from bank_df_cv, which was copied before z-scoring,
# so the network is cross-validated on unscaled features here.
model_KC_CV = KerasClassifier(build_fn=create_model, epochs=150, batch_size=10, verbose=0)

# 10-fold stratified CV keeps the churn ratio the same in every fold
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model_KC_CV, X_cv, Y_cv, cv=kfold, scoring='accuracy')
print(results.mean())

0.7962999999999999
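
The cross-validated accuracy of 0.7963 was obtained on features that were not standardized. One way to scale inside cross-validation is to put the scaler and the classifier into a single scikit-learn pipeline, so that the scaler is fit on the training folds only. The sketch below is a hedged illustration: it uses synthetic data and `LogisticRegression` as a lightweight stand-in for the Keras classifier, and the names `X_toy`, `y_toy`, `pipe` and `folds` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data with roughly 80/20 class balance
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=7
)
# Put two features on a much larger scale than the rest
X_toy = X_toy * np.array([1, 1, 1, 1, 1, 1e5, 1, 1, 1, 1e5])

# The scaler is refit inside every fold, so no validation rows leak into it
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(pipe, X_toy, y_toy, cv=folds, scoring="accuracy")
print(scores.mean())
```

In principle the wrapped Keras model could take the place of the last pipeline step, which would make the cross-validation run comparable to the scaled single-split experiment above.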