This notebook analyses a bank churn modelling dataset obtained from Kaggle. Customers end their relationship with a bank for many reasons, such as interest rates, debt, or personal circumstances. The goal here is to predict whether a customer will stop using the bank's services within the next six months. The workflow covers exploratory analysis, preprocessing, and training and evaluating a deep learning model built with Keras. We then use the evaluation results to judge the model's performance; its hyperparameters can also be tuned.

In [6]:
!pip install tensorflow



In [3]:
#we are going to first import the necessary libraries and use pandas to read the data frame.
import pandas as pd
import numpy as np
import tensorflow as tf

In [4]:
df=pd.read_csv("https://raw.githubusercontent.com/ibrahimabdike/Africa-Data-School-Curriculum-February/main/Notebooks/data/Churn_Modelling.csv")
df

,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,Obijiaku,771,France,Male,39,5,0.00,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7,0.00,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772,Germany,Male,42,3,75075.31,2,1,0,92888.52,1


In [5]:
#we will now explore the dataset by checking its structures and general statistics
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


In [6]:
df.shape

(10000, 14)

So the dataset has 10,000 rows and 14 columns, and df.info() shows that none of the columns has missing values.

In [7]:
#we can also check on the specific geographical location of the customers in this dataset
df.Geography.unique()

array(['France', 'Spain', 'Germany'], dtype=object)

In [8]:
#look at the statistics of the dataset
df.describe()

,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5000.5,15690940.0,650.5288,38.9218,5.0128,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037
std,2886.89568,71936.19,96.653299,10.487806,2.892174,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769
min,1.0,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,2500.75,15628530.0,584.0,32.0,3.0,0.0,1.0,0.0,0.0,51002.11,0.0
50%,5000.5,15690740.0,652.0,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0
75%,7500.25,15753230.0,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0
max,10000.0,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0


From the statistics above we can conclude that:
1. The oldest customer is 92 years old and the youngest is 18.
2. The highest credit score is 850 and the lowest is 350.
3. Customers hold between 1 and 4 bank products, with a median of 1.
4. About 20% of customers have left the bank (the mean of Exited is 0.2037).
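
The same minimum and maximum values can be read off directly instead of scanning the full table. A minimal sketch, assuming a small stand-in frame named `sample` built from only the five rows displayed at the top of this notebook, so its numbers differ from the full-data statistics above:

```python
import pandas as pd

# Stand-in for df: the first five rows shown earlier, three columns only.
sample = pd.DataFrame({
    "CreditScore": [619, 608, 502, 699, 850],
    "Age": [42, 41, 42, 39, 43],
    "NumOfProducts": [1, 1, 3, 2, 1],
})

# One row per statistic, one column per feature.
ranges = sample.agg(["min", "max"])
print(ranges)
```

On the real data, the same call on `df[["CreditScore", "Age", "NumOfProducts"]]` should reproduce the min and max rows of the describe() output.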

In [9]:
#Before training, drop the identifier columns (RowNumber, CustomerId, Surname), which carry no predictive signal, and separate the target. X holds the features and y holds the Exited label.
X = df.drop(labels= ['RowNumber', 'CustomerId', 'Surname', 'Exited'], axis=1)
y = df['Exited']

In [11]:
X.head()

,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,619,France,Female,42,2,0.0,1,1,1,101348.88
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58
2,502,France,Female,42,8,159660.8,3,1,0,113931.57
3,699,France,Female,39,1,0.0,2,0,0,93826.63
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1


So X is our feature set after removing the identifier columns and the target. No missing values needed handling, because every column is fully populated.

In [12]:
#One-hot encode the categorical columns (Geography and Gender). get_dummies turns each category into its own 0/1 column so the model can use it as a numeric input.
X = pd.get_dummies(X, dtype=int)
X.head()

,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
0,619,42,2,0.0,1,1,1,101348.88,1,0,0,1,0
1,608,41,1,83807.86,1,0,1,112542.58,0,0,1,1,0
2,502,42,8,159660.8,3,1,0,113931.57,1,0,0,1,0
3,699,39,1,0.0,2,0,0,93826.63,1,0,0,1,0
4,850,43,2,125510.82,1,1,1,79084.1,0,0,1,1,0


By default, recent versions of pandas return the dummy columns as booleans (True/False). Passing dtype=int makes them 0/1 integers instead, as the output above shows, so the whole feature matrix is uniformly numeric before it is scaled and fed to the network.
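
A minimal sketch of the encoding step in isolation, assuming a hypothetical three-row frame named `demo` that mirrors the two categorical columns:

```python
import pandas as pd

# Hypothetical frame with the same categorical columns as X.
demo = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Female", "Female", "Male"],
})

# dtype=int requests 0/1 integer columns for the dummies.
encoded = pd.get_dummies(demo, dtype=int)
print(encoded.columns.tolist())
print(encoded.dtypes)
```

Each row has exactly one 1 among the Geography columns and one among the Gender columns, which is what makes this a one-hot encoding.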

In [13]:
#preprocessing data in readiness to build our deep learning model
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

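X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then reuse its mean and
# standard deviation on the test split so no test statistics leak in.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)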
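
A minimal sketch of why the scaler's statistics should come from the training split only. The arrays `train` and `test` are synthetic stand-ins with made-up values, not the bank data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for the real splits (values are made up).
train = rng.normal(loc=50.0, scale=10.0, size=(800, 2))
test = rng.normal(loc=50.0, scale=10.0, size=(200, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train
test_scaled = scaler.transform(test)        # reuses the train mean/std

print(train_scaled.mean(axis=0).round(3))  # zero up to rounding, by construction
print(test_scaled.mean(axis=0).round(3))   # near zero, but not exactly zero
```

Calling fit_transform on the test split instead would standardise it with its own mean and standard deviation, so the same raw value would map to different scaled values in the two splits.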


In [16]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout

Sequential is a Keras container that stacks layers in order, which makes it convenient for building a simple feedforward neural network. We add two hidden layers with 1000 neurons each, with a Dropout layer (rate 0.5) between them to reduce overfitting. The hidden layers use the ReLU activation function, and the output layer uses sigmoid, which yields a churn probability between 0 and 1.

In [23]:

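model = Sequential()
# Declare the input shape explicitly: one value per feature column.
model.add(Input(shape=(X_train.shape[1],)))
model.add(Dense(1000, activation='relu'))
# Randomly zero half of the activations during training to limit overfitting.
model.add(Dropout(0.5))
model.add(Dense(1000, activation='relu'))
# A single sigmoid unit outputs the predicted probability of churn.
model.add(Dense(1, activation='sigmoid'))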

Next, we compile the model by selecting an optimizer, a loss function, and metrics. The optimizer updates the weights and biases, while the loss function measures the difference between the predicted and actual outputs. The metrics evaluate the model's performance during training and validation.
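
To make the loss concrete: for labels y and predicted probabilities p, binary cross-entropy is the mean of -[y log(p) + (1 - y) log(1 - p)]. The sketch below is a hand-written NumPy version for illustration, not the Keras implementation; the function name, the `eps` clipping value, and the example arrays are assumptions.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps guards against log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    per_sample = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(per_sample))

# Made-up labels and predicted probabilities.
y_true = [1, 0, 0, 1]
y_prob = [0.9, 0.1, 0.2, 0.6]

loss = binary_crossentropy(y_true, y_prob)
# Accuracy thresholds the probabilities at 0.5 before comparing.
accuracy = float(np.mean((np.asarray(y_prob) >= 0.5) == np.asarray(y_true)))
print(loss, accuracy)
```

All four predictions fall on the correct side of 0.5, so accuracy is perfect, yet the loss is still above zero because it also penalises low confidence (the 0.6 contributes the most).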

In [24]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


In [25]:
model.summary()
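
The size of this network can also be checked by hand. A dense layer has one weight per input-unit pair plus one bias per unit, and Dropout adds no parameters. A minimal sketch, assuming the 13 feature columns produced by the one-hot encoding above (`dense_params` is a hypothetical helper, not a Keras function):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 13  # columns in X after one-hot encoding
layers = [
    ("hidden 1", dense_params(n_features, 1000)),
    ("hidden 2", dense_params(1000, 1000)),
    ("output", dense_params(1000, 1)),
]
total = sum(count for _, count in layers)

for name, count in layers:
    print(f"{name}: {count:,}")
print(f"total: {total:,}")
```

Almost all of the roughly one million parameters sit in the second hidden layer, which is large relative to a dataset of 10,000 rows.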

In [26]:
#fit model
history= model.fit(X_train, y_train, epochs=5, batch_size=20, validation_split=0.2)

Epoch 1/5
320/320 ━━━━━━━━━━━━━━━━━━━━ 9s 22ms/step - accuracy: 0.8087 - loss: 0.4513 - val_accuracy: 0.8537 - val_loss: 0.3673
Epoch 2/5
320/320 ━━━━━━━━━━━━━━━━━━━━ 7s 21ms/step - accuracy: 0.8454 - loss: 0.3661 - val_accuracy: 0.8544 - val_loss: 0.3498
Epoch 3/5
320/320 ━━━━━━━━━━━━━━━━━━━━ 7s 21ms/step - accuracy: 0.8582 - loss: 0.3408 - val_accuracy: 0.8544 - val_loss: 0.3581
Epoch 4/5
320/320 ━━━━━━━━━━━━━━━━━━━━ 7s 22ms/step - accuracy: 0.8569 - loss: 0.3619 - val_accuracy: 0.8562 - val_loss: 0.3467
Epoch 5/5
320/320 ━━━━━━━━━━━━━━━━━━━━ 7s 21ms/step - accuracy: 0.8473 - loss: 0.3598 - val_accuracy: 0.8550 - val_loss: 0.3518
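
The step counts in the progress bars follow from the split sizes and batch sizes. A short worked calculation, using the values passed to train_test_split and model.fit above, plus the Keras default batch size of 32 for predict and evaluate:

```python
import math

n_rows = 10_000
n_test = n_rows // 5       # test_size=0.2
n_train = n_rows - n_test
n_val = n_train // 5       # validation_split=0.2, taken from the training split
n_fit = n_train - n_val    # rows actually used for weight updates

steps_per_epoch = math.ceil(n_fit / 20)  # batch_size=20
predict_steps = math.ceil(n_test / 32)   # default batch size of 32
print(n_fit, n_val, steps_per_epoch, predict_steps)  # 6400 1600 320 63
```

So each epoch trains on 6,400 rows in 320 batches and validates on 1,600 rows, while the 2,000 test rows are processed in 63 batches.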


In [27]:
y_pred = model.predict(X_test)

63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step


In [28]:
model.evaluate(X_test, y_test)

63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8594 - loss: 0.3458


[0.3494553864002228, 0.8585000038146973]

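So, on the test set the model reaches an accuracy of about 86% with a binary cross-entropy loss of about 0.35 (the loss is not a percentage). Accuracy alone does not tell us how likely customers are to churn: roughly 80% of customers in the dataset stayed (the mean of Exited is 0.2037), so a model that always predicts "stayed" would already score close to 80%. The network therefore improves on that baseline by only a few percentage points.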
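
Because the classes are imbalanced, it helps to compare accuracy with a majority-class baseline and to check how many actual churners are caught. A minimal sketch on made-up labels and probabilities (the arrays below are hypothetical, not model output). In this notebook, y_pred has shape (n, 1), so it would need flattening, for example with `.ravel()`, before the same steps apply.

```python
import numpy as np

# Hypothetical labels and sigmoid outputs, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 1])
y_prob = np.array([0.1, 0.3, 0.2, 0.6, 0.8, 0.1, 0.4, 0.2, 0.05, 0.7])

y_label = (y_prob >= 0.5).astype(int)       # probabilities -> 0/1 predictions
accuracy = (y_label == y_true).mean()

majority = np.bincount(y_true).argmax()     # most common class (0 = stayed)
baseline = (y_true == majority).mean()      # accuracy of always predicting it

churn_recall = y_label[y_true == 1].mean()  # share of actual churners caught
print(accuracy, baseline, churn_recall)
```

Here accuracy is 0.8 against a baseline of 0.7, while only two of the three churners are identified, which accuracy by itself would not reveal.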

You can also save the trained model so that it can be reloaded later for further training or prediction.

In [30]:
model.save('my_churnmodel.keras')

In [32]:
# if you want to load the model later
from tensorflow.keras.models import load_model
loaded_model= load_model('my_churnmodel.keras')

