# Problem Description

* The data used in this implementation is a dummy dataset from a bank, containing information about its customers.

* Different types of information are recorded for each customer (see the CSV file).

* The goal is to build a model on the given dataset and predict whether a customer will leave the bank.

# Packages

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.models import load_model

Using TensorFlow backend.


# Import and Explore the Data

In [2]:
#Read the csv file
dataset = pd.read_csv('data/Churn_Modelling.csv')

In [3]:
#View the first few rows of the data
dataset.head()

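|  | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |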


In [4]:
#List of columns in the dataset
dataset.columns

Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
       'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Exited'],
      dtype='object')

In [5]:
#Check the shape of the dataset (Rows, Columns)
dataset.shape

(10000, 14)

In [6]:
#Check the information about the dataset
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber          10000 non-null int64
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             10000 non-null int64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


# Data Preprocessing

## Separating the Dependent and Independent Variables

In [7]:
#Independent variables: columns 3 to 12 (CreditScore through EstimatedSalary)
X = dataset.iloc[:, 3:13].values
#Dependent variable: column 13 (Exited)
Y = dataset.iloc[:, 13].values

## Encoding the Categorical data

In [8]:
#Check which columns need encoding: Geography (index 1) and Gender (index 2)
X[0]

array([619, 'France', 'Female', 42, 2, 0.0, 1, 1, 1, 101348.88],
      dtype=object)

In [9]:
#Check the unique values of the category
print(dataset['Geography'].unique())
print(dataset['Gender'].unique())

['France' 'Spain' 'Germany']
['Female' 'Male']


In [10]:
#Generate numeric index labels for each unique value in the data
#Eg: Female, Male -> 0, 1  OR France, Germany, Spain -> 0, 1, 2 (codes follow sorted order)
#NOTE: Each categorical column has its own encoder
label_encoder_geography = LabelEncoder()
label_encoder_gender = LabelEncoder()

#Encode Geography column data
X[:, 1] = label_encoder_geography.fit_transform(X[:, 1])

#Encode the Gender column data
X[:, 2] = label_encoder_gender.fit_transform(X[:, 2])


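#Check the encoded data for Geography Column
#NOTE: unique() lists values in order of appearance, while the codes follow sorted order
#(France -> 0, Germany -> 1, Spain -> 2), so the two printed lists are not aligned position by position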
print('Original Data: {} got encoded to {}'.format(dataset['Geography'].unique(), list(set(X[:, 1]))))

#Check the encoded data for Gender Column
print('Original Data: {} got encoded to {}'.format(dataset['Gender'].unique(), list(set(X[:, 2]))))

Original Data: ['France' 'Spain' 'Germany'] got encoded to [0, 1, 2]
Original Data: ['Female' 'Male'] got encoded to [0, 1]
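
As a minimal sketch of how `LabelEncoder` assigns its integers (it sorts the distinct values and numbers them in that order), the mapping can be reproduced in plain Python. The helper name `label_encode` and the five sample values are illustrative, not part of the notebook.

```python
def label_encode(values):
    """Map each distinct value to its index in sorted order, as LabelEncoder does."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[value] for value in values], mapping

codes, mapping = label_encode(['France', 'Spain', 'France', 'Germany', 'Spain'])
print(mapping)  # {'France': 0, 'Germany': 1, 'Spain': 2}
print(codes)    # [0, 2, 0, 1, 2]
```

Note that Spain receives code 2 and Germany code 1, regardless of which value appears first in the data.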


In [11]:
#One Hot Encoding - create dummy variables: one 0/1 column per Geography value
#NOTE: categorical_features was removed in scikit-learn 0.22, so this cell needs an older release

onehot_encoder = OneHotEncoder(categorical_features = [1])

X = onehot_encoder.fit_transform(X).toarray()

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
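
On current scikit-learn releases the same step is usually written with `ColumnTransformer`. The following is a hedged sketch rather than the notebook's own method: it assumes scikit-learn 0.21 or later (for `drop='first'`), uses a three-row toy array instead of the real dataset, and the names `X_toy` and `preprocess` are illustrative. `OneHotEncoder` accepts string columns directly, and `drop='first'` removes the first dummy column of each feature.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows with a subset of the notebook's columns: CreditScore, Geography, Gender, Age
X_toy = np.array([
    [619, 'France', 'Female', 42],
    [608, 'Spain', 'Female', 41],
    [699, 'Germany', 'Male', 39],
], dtype=object)

preprocess = ColumnTransformer(
    transformers=[
        ('geography', OneHotEncoder(drop='first'), [1]),
        ('gender', OneHotEncoder(drop='first'), [2]),
    ],
    remainder='passthrough',  # keep the numeric columns unchanged
    sparse_threshold=0,       # always return a dense array
)

# Output column order: Germany, Spain, Male, CreditScore, Age
X_encoded = preprocess.fit_transform(X_toy).astype(float)
print(X_encoded)
```

The encoded columns come first and the untouched columns follow, so the column order differs from the array produced by the cell above.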


In [12]:
#Check the one hot encoded data

[x for x in X[0]]

[1.0, 0.0, 0.0, 619.0, 0.0, 42.0, 2.0, 0.0, 1.0, 1.0, 1.0, 101348.88]

In [13]:
#Avoid the dummy variable trap: the dummy columns of one feature always sum to 1,
#so any one of them can be predicted from the others. Drop the first (France) column
X = X[:, 1:]

In [14]:
#Check the data
[x for x in X[0]]

[0.0, 0.0, 619.0, 0.0, 42.0, 2.0, 0.0, 1.0, 1.0, 1.0, 101348.88]
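
Why dropping one dummy column loses no information: the three Geography dummies always sum to 1, so the dropped value can be recomputed from the remaining two. A small plain-Python check, using hand-written one-hot rows in the (France, Germany, Spain) column order as an illustration:

```python
# One-hot rows in the column order (France, Germany, Spain)
one_hot_rows = [
    [1.0, 0.0, 0.0],  # France
    [0.0, 0.0, 1.0],  # Spain
    [0.0, 1.0, 0.0],  # Germany
]

# Drop the first column, as X[:, 1:] does for the France dummy
reduced_rows = [row[1:] for row in one_hot_rows]

# The dropped value is recoverable from the remaining two columns
recovered_france = [1.0 - sum(row) for row in reduced_rows]
print(recovered_france)  # [1.0, 0.0, 0.0]
```

Gender needs no such step here because it was label-encoded into a single 0/1 column.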

## Split the data into Train and Test set

In [15]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

## Feature Scaling

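In [16]:
standard_scaler = StandardScaler()

In [17]:
#Fit the scaler on the training set only, then scale the training set
X_train = standard_scaler.fit_transform(X_train)

In [18]:
#Scale the test set with the statistics learned from the training set
X_test = standard_scaler.transform(X_test)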

In [19]:
#Check the scaled values from train and test set
print('Scaled values sample from training set:\n{}'.format([round(x, 2) for x in X_train[0]]))

print('\nScaled values sample from test set:\n{}'.format([round(x, 2) for x in X_test[0]]))

Scaled values sample from training set:
[-0.57, 1.74, 0.17, -1.09, -0.46, 0.01, -1.22, 0.81, 0.64, -1.03, 1.11]

Scaled values sample from test set:
[1.75, -0.57, -0.55, -1.09, -0.37, 1.04, 0.88, -0.92, 0.64, 0.97, 1.61]
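
`StandardScaler` stores the per-column mean and population standard deviation seen during `fit` and reuses them in `transform`, which is why the test set is transformed without refitting. A minimal plain-Python sketch of that idea for a single column follows; the numbers are made up and the helpers `fit_scaler` and `apply_scaler` are hypothetical.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Return the mean and population standard deviation of a training column."""
    return mean(column), pstdev(column)

def apply_scaler(column, mu, sigma):
    """Standardise values using statistics learned from the training column."""
    return [(value - mu) / sigma for value in column]

train_ages = [30, 40, 50]  # illustrative values, not taken from the dataset
test_ages = [40, 60]

mu, sigma = fit_scaler(train_ages)  # learned from the training data only
train_scaled = apply_scaler(train_ages, mu, sigma)
test_scaled = apply_scaler(test_ages, mu, sigma)
print([round(v, 2) for v in train_scaled])
print([round(v, 2) for v in test_scaled])
```

Because the test values are scaled with the training statistics, a test value equal to the training mean maps to zero even though it is not the mean of the test column.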


# Build the Neural Network model

In [20]:
#Initialise the model and add layers to it
classifier_model = Sequential()

In [21]:
#First hidden layer: 6 ReLU units, taking the 11 preprocessed features as input
classifier_model.add(layer = Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_shape = (11, )))

In [22]:
#Second hidden layer: 6 ReLU units
classifier_model.add(layer = Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))

In [23]:
#Add the output layer
classifier_model.add(layer = Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

In [24]:
#Check the different layers added to the model
classifier_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 6)                 72        
_________________________________________________________________
dense_2 (Dense)              (None, 6)                 42        
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 7         
Total params: 121
Trainable params: 121
Non-trainable params: 0
_________________________________________________________________
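
The parameter counts in the summary follow from the layer sizes: a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. A quick check in plain Python (the helper `dense_params` is hypothetical):

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_sizes = [11, 6, 6, 1]  # input features, two hidden layers, output unit
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))  # [72, 42, 7] 121
```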


## Compile the Model

In [25]:
classifier_model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
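
`binary_crossentropy` penalises each prediction by `-(y*log(p) + (1-y)*log(1-p))`, averaged over the samples, where `p` is the predicted probability of class 1. A small plain-Python sketch with made-up predictions follows; the function name is illustrative, and the clipping constant `eps` is an assumption to keep `log` finite, not a statement about Keras internals.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(round(confident_right, 4), round(confident_wrong, 4))
```

Confident correct predictions give a small loss, while confident wrong predictions are penalised heavily, which is what the optimizer minimises during training.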

# Train the Model

In [26]:
#Fit the NN to the Training set
classifier_model.fit(x = X_train, y = Y_train, batch_size = 10, epochs = 100)

Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x7fad8f9bf0f0>

# Predictions on the Test set and Evaluate the Model

In [27]:
#Predict the result from the test set
Y_prediction = classifier_model.predict(X_test)

In [28]:
#The prediction produces a probability value
Y_prediction

array([[0.18823475],
       [0.2873297 ],
       [0.1536086 ],
       ...,
       [0.17667806],
       [0.1510222 ],
       [0.13656744]], dtype=float32)

In [29]:
#Select a threshold to turn the probabilities into class labels
Y_prediction = (Y_prediction > 0.5) # 50% threshold: above 0.5 the customer is predicted to leave the bank, otherwise to stay

In [30]:
Y_prediction

array([[False],
       [False],
       [False],
       ...,
       [False],
       [False],
       [False]])

In [31]:
#Check the Confusion Matrix
print(confusion_matrix(Y_test, Y_prediction))

[[1541   54]
 [ 268  137]]


In [32]:
#Check the accuracy score
print('Accuracy: {:.2f}%'.format(accuracy_score(Y_test, Y_prediction) * 100))

Accuracy: 83.90%
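
Accuracy alone can flatter a model on imbalanced data. Reading the confusion matrix with scikit-learn's convention (rows are actual classes, columns are predicted classes), the other common metrics can be derived by hand. A sketch in plain Python using the four counts printed above:

```python
# Counts from the confusion matrix above: rows = actual, columns = predicted
tn, fp = 1541, 54   # customers who stayed
fn, tp = 268, 137   # customers who left

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)    # of predicted leavers, how many actually left
recall = tp / (tp + fn)       # of actual leavers, how many were caught
f1 = 2 * precision * recall / (precision + recall)
baseline = (tn + fp) / total  # accuracy of always predicting "stays"

print('accuracy={:.4f} precision={:.4f} recall={:.4f} f1={:.4f} baseline={:.4f}'.format(
    accuracy, precision, recall, f1, baseline))
```

With these counts the model catches roughly a third of the customers who actually left, and always predicting "stays" would already reach about 79.75% accuracy on this test set.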


# Save the model

In [33]:
classifier_model.save(filepath = 'model/bank_customer_classification_model.h5')

# Generate a new dataset from the previous one

In [34]:
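#Select any five random rows from the original dataset to test the saved model
#NOTE: these rows may also have been part of the training split, so this is a smoke test, not an evaluation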
row_number = np.random.randint(low = 0, high = 10000, size = 5)

In [35]:
row_number

array([4103, 5330, 4108, 7680, 4631])

In [36]:
#Previous Dataset
dataset.head()

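|  | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |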


In [37]:
test_dataset = dataset.iloc[row_number,:]

In [38]:
test_dataset

|  | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4103 | 4104 | 15693337 | Perry | 683 | Spain | Male | 41 | 0 | 148863.17 | 1 | 1 | 1 | 163911.32 | 0 |
| 5330 | 5331 | 15626212 | Wark | 616 | France | Male | 29 | 9 | 0.0 | 1 | 1 | 1 | 166984.44 | 0 |
| 4108 | 4109 | 15769389 | Wan | 709 | Germany | Female | 39 | 9 | 124723.92 | 1 | 1 | 0 | 73641.86 | 0 |
| 7680 | 7681 | 15665181 | Chung | 808 | Spain | Male | 25 | 7 | 0.0 | 2 | 0 | 1 | 23180.37 | 0 |
| 4631 | 4632 | 15706116 | McKay | 659 | Germany | Female | 30 | 8 | 154159.51 | 1 | 1 | 0 | 40441.1 | 0 |


# Preprocess the Test dataset

In [39]:
#Select the independent variables (same columns as used for training)
X_new_test = test_dataset.iloc[:, 3:13].values

In [40]:
#View the data
X_new_test

array([[683, 'Spain', 'Male', 41, 0, 148863.17, 1, 1, 1, 163911.32],
       [616, 'France', 'Male', 29, 9, 0.0, 1, 1, 1, 166984.44],
       [709, 'Germany', 'Female', 39, 9, 124723.92, 1, 1, 0, 73641.86],
       [808, 'Spain', 'Male', 25, 7, 0.0, 2, 0, 1, 23180.37],
       [659, 'Germany', 'Female', 30, 8, 154159.51, 1, 1, 0, 40441.1]],
      dtype=object)

In [41]:
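#Encode the categorical data
#NOTE: new encoders are fitted on the five sampled rows here. The codes match the training
#codes only because all three Geography values and both Gender values occur in this sample;
#label_encoder_geography.transform and label_encoder_gender.transform would not rely on that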
geo_label_encoder_test = LabelEncoder()

gender_label_encoder_test = LabelEncoder()

In [42]:
X_new_test[:, 1] = geo_label_encoder_test.fit_transform(X_new_test[:, 1])

In [43]:
X_new_test

array([[683, 2, 'Male', 41, 0, 148863.17, 1, 1, 1, 163911.32],
       [616, 0, 'Male', 29, 9, 0.0, 1, 1, 1, 166984.44],
       [709, 1, 'Female', 39, 9, 124723.92, 1, 1, 0, 73641.86],
       [808, 2, 'Male', 25, 7, 0.0, 2, 0, 1, 23180.37],
       [659, 1, 'Female', 30, 8, 154159.51, 1, 1, 0, 40441.1]],
      dtype=object)

In [44]:
X_new_test[:, 2] = gender_label_encoder_test.fit_transform(X_new_test[:, 2])

In [45]:
X_new_test

array([[683, 2, 1, 41, 0, 148863.17, 1, 1, 1, 163911.32],
       [616, 0, 1, 29, 9, 0.0, 1, 1, 1, 166984.44],
       [709, 1, 0, 39, 9, 124723.92, 1, 1, 0, 73641.86],
       [808, 2, 1, 25, 7, 0.0, 2, 0, 1, 23180.37],
       [659, 1, 0, 30, 8, 154159.51, 1, 1, 0, 40441.1]], dtype=object)

In [46]:
#One hot encoding
onehot_encoder_test = OneHotEncoder(categorical_features = [1])

In [47]:
X_new_test = onehot_encoder_test.fit_transform(X_new_test).toarray()

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


In [48]:
#Check the sample of the encoded data
[x for x in X_new_test[0]]

[0.0, 0.0, 1.0, 683.0, 1.0, 41.0, 0.0, 148863.17, 1.0, 1.0, 1.0, 163911.32]

In [49]:
#Avoid the dummy variable trap
X_new_test = X_new_test[:, 1:]

In [50]:
[x for x in X_new_test[0]]

[0.0, 1.0, 683.0, 1.0, 41.0, 0.0, 148863.17, 1.0, 1.0, 1.0, 163911.32]

In [51]:
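#Scale the data
#NOTE: a scaler fitted on five rows standardises them against each other, not against the
#training statistics the model was trained with; standard_scaler.transform(X_new_test) would reuse those
test_scaler = StandardScaler()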

In [52]:
X_new_test = test_scaler.fit_transform(X_new_test)

In [53]:
X_new_test[0]

array([-0.81649658,  1.22474487, -0.1867447 ,  0.81649658,  1.33443634,
       -1.95133091,  0.89740508, -0.5       ,  0.5       ,  0.81649658,
        1.15501108])
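
The values above come from a scaler fitted on only these five rows, so each z-score depends on which other rows happened to be sampled. A plain-Python sketch of the effect for the CreditScore column follows; the second list of scores is made up for illustration and `zscore` is a hypothetical helper.

```python
from statistics import mean, pstdev

def zscore(value, column):
    """Standardise one value against the mean and population std of a column."""
    return (value - mean(column)) / pstdev(column)

sampled_scores = [683, 616, 709, 808, 659]  # CreditScore of the five sampled rows
other_scores = [683, 500, 520, 540, 560]    # hypothetical alternative sample

print(round(zscore(683, sampled_scores), 4))  # matches the third value printed above
print(round(zscore(683, other_scores), 4))    # same customer, different scaled value
```

Transforming `X_new_test` with the `standard_scaler` fitted in In [17] would remove this dependence on the sample, since the network was trained on features scaled with those training statistics.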

# Load the saved model

In [54]:
saved_model = load_model('model/bank_customer_classification_model.h5')
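
The `.h5` file stores only the network, not the preprocessing objects. One option, sketched here on the assumption that scikit-learn and NumPy are installed, is to serialise the fitted scaler with the standard-library `pickle` module so the same training statistics can be applied at prediction time. The toy numbers and variable names are illustrative.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit on illustrative "training" data (two features, three rows)
train_features = np.array([[600.0, 30.0], [650.0, 40.0], [700.0, 50.0]])
fitted_scaler = StandardScaler().fit(train_features)

# Serialise to bytes; pickle.dump() to an open file works the same way
blob = pickle.dumps(fitted_scaler)

# Later, or in another process: restore the scaler and transform new rows
restored_scaler = pickle.loads(blob)
new_rows = np.array([[650.0, 60.0]])
print(restored_scaler.transform(new_rows))
```

The restored scaler produces the same output as the original one, so new customers are scaled exactly as the training data was.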

## Predict the probability of each customer leaving the bank

In [55]:
y_test_predict = saved_model.predict(X_new_test)

In [56]:
#Probabilities
y_test_predict

array([[0.36057696],
       [0.03769911],
       [0.7195019 ],
       [0.01057194],
       [0.20874919]], dtype=float32)

In [57]:
test_dataset

|  | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4103 | 4104 | 15693337 | Perry | 683 | Spain | Male | 41 | 0 | 148863.17 | 1 | 1 | 1 | 163911.32 | 0 |
| 5330 | 5331 | 15626212 | Wark | 616 | France | Male | 29 | 9 | 0.0 | 1 | 1 | 1 | 166984.44 | 0 |
| 4108 | 4109 | 15769389 | Wan | 709 | Germany | Female | 39 | 9 | 124723.92 | 1 | 1 | 0 | 73641.86 | 0 |
| 7680 | 7681 | 15665181 | Chung | 808 | Spain | Male | 25 | 7 | 0.0 | 2 | 0 | 1 | 23180.37 | 0 |
| 4631 | 4632 | 15706116 | McKay | 659 | Germany | Female | 30 | 8 | 154159.51 | 1 | 1 | 0 | 40441.1 | 0 |


In [58]:
surnames = test_dataset['Surname']

In [59]:
surnames = [x for x in surnames]

In [60]:
probabilities = [x[0] for x in y_test_predict]

In [61]:
probabilities

[0.36057696, 0.037699107, 0.7195019, 0.01057194, 0.20874919]

In [62]:
for index, probability in enumerate(probabilities):
    print('{} has {:.2f}% probability of leaving the bank.\n'.format(surnames[index], probability*100))

Perry has 36.06% probability of leaving the bank.

Wark has 3.77% probability of leaving the bank.

Wan has 71.95% probability of leaving the bank.

Chung has 1.06% probability of leaving the bank.

McKay has 20.87% probability of leaving the bank.
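
To turn these probabilities into stay/leave labels, the same 50% threshold from In [29] can be applied. A short plain-Python sketch using the surnames and probabilities printed above, rounded to four decimals:

```python
surnames = ['Perry', 'Wark', 'Wan', 'Chung', 'McKay']
probabilities = [0.3606, 0.0377, 0.7195, 0.0106, 0.2087]
threshold = 0.5

predicted_to_leave = [name for name, p in zip(surnames, probabilities) if p > threshold]
print(predicted_to_leave)  # ['Wan']
```

All five sampled rows have `Exited` equal to 0, so under this threshold the prediction for Wan would be a false positive.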

