##### Objective 

Given a bank customer, build a neural network-based classifier that can determine whether they will leave in the next 6 months.

In [None]:
import os 

In [None]:
os.getcwd()

Import the required libraries

In [None]:

import pandas as pd
import numpy as np
import tensorflow as tf

# For missing values
import missingno as msno

# Ignore warnings 
import warnings
pd.options.display.max_columns = None
pd.options.display.max_rows = None
warnings.filterwarnings("ignore")

# Visualization libraries
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns

# Train Test Split
from sklearn.model_selection import train_test_split

# Sklearn libraries
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, precision_recall_curve, auc,classification_report


# Tensorflow libraries
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras import optimizers



###### Read the dataset

In [None]:
data=pd.read_csv('bank.csv')

In [None]:
data.head()

In [None]:
data.dtypes

###### Drop the unwanted variables

In [None]:
data.drop(['RowNumber','CustomerId','Surname'],axis=1,inplace=True)

In [None]:
data.columns

##### Basic checks on the data before getting it ready for analysis 

In [None]:
def basic_checks(df):
    
    print('='*50)
    print('Shape of the dataframe is: \n',df.shape)
    print('='*50)
    print('Basic stats for the data: \n',df.describe())
    print('='*50)
    print('Data type and info :')
    print(df.info())
    print('='*50)
    print('Missing value information : \n',df.isnull().any())
    print('='*50)
    print('Sum of missing values if any : \n',df.isnull().sum())

In [None]:
basic_checks(data)

###### Missing values matrix

In [None]:
msno.matrix(data)

There are no missing values in any of the variables in the dataset.

##### Plotting correlations

In [None]:
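# Correlate numeric columns only; Geography and Gender are object columns (numeric_only needs pandas >= 1.5)
data.corr(numeric_only=True)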

In [None]:
plt.figure(figsize=(10,8))

sns.heatmap(data.corr(numeric_only=True),
            annot=True,
            linewidths=.5,
            center=0,
            cbar=False,
            cmap="YlGnBu")

plt.show()


`EDA PART 1`

1. There are float, int and object data types in the dataset.

2. RowNumber, CustomerId and Surname are not required for the analysis, so we drop these columns.

3. There are 10000 records and 11 columns in the dataset.

4. Average age in the dataset is 38 and average tenure is 5.01; the data in both these columns is only slightly skewed. Average balance is 76485.

5. Estimated salary averages 100090 and the data is not skewed.

6. There are no missing values in the dataset.

7. The target variable here is ‘Exited’, since we would like to predict the probability of the customer exiting.

8. There is no strong correlation between the target variable and the independent variables in the dataset.

9. Exited and Age show some correlation, at about 0.2. Number of products and balance seem to be slightly negatively correlated.
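
Points 8 and 9 can be checked by ranking the numeric features by their correlation with `Exited`. The following is a minimal sketch, assuming pandas 1.5 or later (for `numeric_only`). It uses a tiny made-up frame named `toy` rather than `bank.csv`, so the printed values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the bank data (values are invented for illustration).
toy = pd.DataFrame({
    "Age": [25, 40, 52, 33, 61, 47],
    "Balance": [0.0, 82000.0, 125000.0, 0.0, 99000.0, 110000.0],
    "Geography": ["France", "Spain", "Germany", "France", "Germany", "Spain"],
    "Exited": [0, 0, 1, 0, 1, 1],
})

# numeric_only=True skips object columns such as Geography.
corr_with_target = (
    toy.corr(numeric_only=True)["Exited"]
    .drop("Exited")                  # the target's correlation with itself is always 1
    .sort_values(ascending=False)
)
print(corr_with_target)
```

On the real data the same expression applied to `data` would give one correlation per numeric feature, which is easier to scan than the full heatmap.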

In [None]:
def catplot(variable):
    fig=plt.figure(figsize=(10,5))
    plt.subplot(1,2,1)
    sns.countplot(x=variable,data=data)
    plt.subplot(1,2,2)
    sns.countplot(x=variable, hue='Exited', data=data)

In [None]:
catplot(data['Geography'])

In [None]:
data[data['Geography']=='France'].shape[0]/data['Geography'].count()

In [None]:
data[data['Geography']=='Spain'].shape[0]/data['Geography'].count()

In [None]:
data[data['Geography']=='Germany'].shape[0]/data['Geography'].count()

In [None]:
data[(data['Geography']=='France')&(data['Exited']==1)].shape[0]/data['Geography'].count()

In [None]:
data[(data['Geography']=='Spain')&(data['Exited']==1)].shape[0]/data['Geography'].count()

In [None]:
data[(data['Geography']=='Germany')&(data['Exited']==1)].shape[0]/data['Geography'].count()

In [None]:
catplot(data['Gender'])

In [None]:
data[data['Gender']=='Male'].shape[0]/data['Gender'].count()

In [None]:
data[data['Gender']=='Female'].shape[0]/data['Gender'].count()

In [None]:
data[(data['Gender']=='Male')&(data['Exited']==1)].shape[0]/data['Gender'].count()

In [None]:
data[(data['Gender']=='Female')&(data['Exited']==1)].shape[0]/data['Gender'].count()

In [None]:
catplot(data['Tenure'])

In [None]:
def tenure_cat(x):
    # Bucket tenure (in years) into five bins; tenure 0 belongs to the first bin.
    if 0 <= x <= 2:
        return 0
    elif 2 < x <= 4:
        return 1
    elif 4 < x <= 6:
        return 2
    elif 6 < x <= 8:
        return 3
    elif 8 < x <= 10:
        return 4

In [None]:
data['Tenure_cat']=data['Tenure'].apply(tenure_cat)

In [None]:
catplot(data['Tenure_cat'])

In [None]:
data[data['Tenure_cat']==0].shape[0]/data['Tenure_cat'].count()

In [None]:
data[data['Tenure_cat']==1].shape[0]/data['Tenure_cat'].count()

In [None]:
data[data['Tenure_cat']==2].shape[0]/data['Tenure_cat'].count()

In [None]:
data[data['Tenure_cat']==3].shape[0]/data['Tenure_cat'].count()

In [None]:
data[data['Tenure_cat']==4].shape[0]/data['Tenure_cat'].count()

In [None]:
data[(data['Tenure_cat']==0)&(data['Exited']==1)].shape[0]/data[data['Exited']==1].shape[0]

In [None]:
data[(data['Tenure_cat']==1)&(data['Exited']==1)].shape[0]/data[data['Exited']==1].shape[0]

In [None]:
data[(data['Tenure_cat']==2)&(data['Exited']==1)].shape[0]/data[data['Exited']==1].shape[0]

In [None]:
data[(data['Tenure_cat']==3)&(data['Exited']==1)].shape[0]/data[data['Exited']==1].shape[0]

In [None]:
data[(data['Tenure_cat']==4)&(data['Exited']==1)].shape[0]/data[data['Exited']==1].shape[0]

In [None]:
catplot(data['NumOfProducts'])

In [None]:
data[data['NumOfProducts']==1].shape[0]/data['NumOfProducts'].count()

In [None]:
data[data['NumOfProducts']==2].shape[0]/data['NumOfProducts'].count()

In [None]:
data[data['NumOfProducts']==3].shape[0]/data['NumOfProducts'].count()

In [None]:
data[data['NumOfProducts']==4].shape[0]/data['NumOfProducts'].count()

In [None]:
data[(data['NumOfProducts']==1)&(data['Exited']==1)].shape[0]/data[data['Exited']==1].shape[0]

In [None]:
data[(data['NumOfProducts']==2)&(data['Exited']==1)].shape[0]/data[data['Exited']==1].shape[0]

In [None]:
data[(data['NumOfProducts']==3)&(data['Exited']==1)].shape[0]/data[data['Exited']==1].shape[0]

In [None]:
data[(data['NumOfProducts']==4)&(data['Exited']==1)].shape[0]/data[data['Exited']==1].shape[0]

In [None]:
catplot(data['HasCrCard'])

In [None]:
data[data['HasCrCard']==0].shape[0]/data['HasCrCard'].count()

In [None]:
data[data['HasCrCard']==1].shape[0]/data['HasCrCard'].count()

In [None]:
data[(data['HasCrCard']==0)&(data['Exited']==1)].shape[0]/data[data['Exited']==1].shape[0]

In [None]:
data[(data['HasCrCard']==1)&(data['Exited']==1)].shape[0]/data[data['Exited']==1].shape[0]

In [None]:
catplot(data['IsActiveMember'])

In [None]:
data[data['IsActiveMember']==0].shape[0]/data['IsActiveMember'].count()

In [None]:
data[data['IsActiveMember']==1].shape[0]/data['IsActiveMember'].count()

In [None]:
data[(data['IsActiveMember']==0)&(data['Exited']==1)].shape[0]/data[data['Exited']==1].shape[0]

In [None]:
data[(data['IsActiveMember']==1)&(data['Exited']==1)].shape[0]/data[data['Exited']==1].shape[0]

### Continuous variable

In [None]:
def plots(variable):
    fig=plt.figure(figsize=(10,5))
    plt.subplot(131)
    # histplot with a KDE overlay replaces the deprecated distplot (seaborn >= 0.11)
    sns.histplot(data[variable], kde=True)
    plt.xticks(rotation=90)
    plt.subplot(132)
    sns.boxplot(x=data[variable])
    plt.xticks(rotation=90)
    plt.subplot(133)
    sns.boxplot(x=data['Exited'],y=data[variable])
    plt.xticks(rotation=90)

In [None]:
plots('CreditScore')

In [None]:
plots('Age')

In [None]:
plots('Balance')


In [None]:
plots('EstimatedSalary')

###### Dependent variable distribution 

In [None]:
data['Exited'].value_counts()

In [None]:
data[data['Exited']==1].shape[0]/data['Exited'].count()

In [None]:
data[data['Exited']==0].shape[0]/data['Exited'].count()

In [None]:
sns.countplot(x='Exited', data=data)

In [None]:
# numeric_only=True leaves out the object columns (Geography, Gender)
data.groupby(['Exited']).mean(numeric_only=True)

In [None]:
data.groupby(['Exited']).median(numeric_only=True)

In [None]:
################################ EDA PART 2 #####################################

###### `EDA - Part 2`

`Categorical plots`

1. Geography - The largest numbers of exits are from France and Germany (each about 0.08 of all customers, i.e. 8%), followed by Spain (about 0.04, i.e. 4%). France also has the highest number of customers who do not exit, which can be attributed to the large share of French customers overall (50% of the data).
2. Gender - 54.5% of customers are male and 45.4% are female. The share of exits is greater for females than for males: 11.3% vs 8.9% of all customers.
3. Tenure - The data is evenly distributed across the tenure buckets (0-2, 2-4, …, 8-10 years). The % of exits is also fairly even across tenures.
4. Number of products - 50.8% of the customers have 1 bank-affiliated product, followed by 45% with 2 products and a minimal 3% with 3 or 4 products.
5. About 70% of the customers that exited have 1 bank-affiliated product, and about 30% have 2 or more products.
6. Has credit card - 70% of the customers have credit cards.
7. Active member - There is an even distribution between active and non-active members.
8. Non-active members make up 63% of the customers that exited the bank.
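
The percentages above are computed one cell at a time. As a sketch of a more compact route, the hypothetical helper `category_summary` below produces several of these shares in one table. The frame `toy` holds invented values and stands in for `data`.

```python
import pandas as pd

# Toy stand-in for the bank data (invented values).
toy = pd.DataFrame({
    "Geography": ["France", "France", "France", "France",
                  "Spain", "Spain", "Germany", "Germany"],
    "Exited":    [0, 0, 0, 1, 0, 1, 1, 1],
})

def category_summary(df, column, target="Exited"):
    """Share of customers, share of all exits, and exit rate per category."""
    grouped = df.groupby(column)[target]
    return pd.DataFrame({
        "share_of_customers": grouped.size() / len(df),
        "share_of_exits": grouped.sum() / df[target].sum(),
        "exit_rate": grouped.mean(),   # exits within the category
    })

summary = category_summary(toy, "Geography")
print(summary)
```

The cells above divide exit counts either by all customers (Geography, Gender) or by all exited customers (Tenure onward). The `exit_rate` column adds the within-category rate, which the notebook does not compute.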

`Continuous plots` 

Credit score 
1. The distribution of credit score is close to normal, with a few outliers visible in the boxplot.
2. Median credit score is above 625 for the overall data. For customers that exited the bank, the credit score distribution is similar to that of customers that have not exited, with a few outliers among those that exited.

Age 
1. Age is right-skewed, with a few outliers.
2. There are a few outliers among the customers that have not exited the bank.

Balance 
1. The distribution of Balance is roughly sinusoidal in shape (more than one peak) and the median balance is around 100000; for those that exited the bank the median is slightly higher than 100000.

Estimated Salary
1. The distribution of Estimated Salary is somewhat normal, with the median close to 100000 for both exited and non-exited customers.

`Dependent variable distribution` 

1. There is class imbalance between exited and non-exited customers: about 80% did not exit versus 20% who exited.
2. The credit scores of those that have exited the bank are slightly higher than those of customers that did not exit.
3. For those that exited, the balance and estimated salary are higher than for the others.
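
The imbalance in point 1 matters when reading accuracy figures later on. The sketch below uses hypothetical labels with an 80/20 split (not the real `Exited` column). It shows what a trivial "model" that always predicts that the customer stays would score.

```python
import numpy as np

# Hypothetical labels with the roughly 80/20 split described above.
y_true = np.array([0] * 80 + [1] * 20)

# A "model" that always predicts the majority class (customer stays).
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
recall = ((y_true == 1) & (y_pred == 1)).sum() / (y_true == 1).sum()
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Such a baseline reaches 80% accuracy while catching none of the exits, which is why recall is looked at alongside accuracy in the conclusions.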

In [None]:
################################ Preprocessing the data #####################################################

##### Preprocessing the data 

`Preprocessing the data` 

1. Drop the unnecessary columns 
2. Apply one-hot encoding to the categorical variables, Gender and Geography
3. Check the datatypes of the variables and see if any change in datatype is required

`Train Test Split` 

1. Identify the target and feature variables. X will hold all feature variables and Y will hold ‘Exited’ as the target variable, as we want to predict customer exit from the bank.
2. We will then split the 10000 records into train and test data as 80/20. The test portion is afterwards split 50/50 into test and validation sets.
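
The two-stage split can be sketched as follows on made-up data. The `stratify` argument is an optional addition that keeps the class ratio equal across the three sets; the notebook's own split does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1000 rows, 3 features, 20% positives.
rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 200 + [0] * 800)

# Stage 1: 80% train, 20% held out (stratify is an assumption, see above).
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)
# Stage 2: split the held-out 20% in half, giving 10% test and 10% validation.
X_test, X_val, y_test, y_val = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=7, stratify=y_hold
)
print(len(X_train), len(X_test), len(X_val))
```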

`Scaling the variables` 

1. We will scale the variables to bring them all onto one scale using StandardScaler.
2. The scaler is fitted on the training data and the same transformation is then applied to the training, validation and test datasets.
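
As a minimal sketch of the scaling step, the example below fits `StandardScaler` on made-up training features and reuses the fitted statistics for a validation set. The arrays are invented and only mimic columns on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up features on very different scales (think age vs. balance).
X_train = rng.normal(loc=[40, 75000], scale=[10, 60000], size=(800, 2))
X_val = rng.normal(loc=[40, 75000], scale=[10, 60000], size=(100, 2))

scaler = StandardScaler().fit(X_train)   # mean and std come from training data only
X_train_std = scaler.transform(X_train)
X_val_std = scaler.transform(X_val)      # reuse the training mean and std

print(X_train_std.mean(axis=0).round(3), X_train_std.std(axis=0).round(3))
```

Fitting on the training data alone keeps information from the validation and test sets out of the preprocessing, so their scaled means are close to, but not exactly, zero.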


In [None]:
#data['Tenure_cat']=data['Tenure_cat'].astype(str)

In [None]:
data.dtypes

In [None]:
data.drop(['Tenure_cat'],axis=1,inplace=True)

In [None]:
data1=pd.get_dummies(data,drop_first=True)

In [None]:
data1.head()

In [None]:
data1.columns

##### Train Test Split 

In [None]:
x_data = data1.loc[:,data1.columns!='Exited']

In [None]:
y_data = data1.loc[:,data1.columns=='Exited']

In [None]:
x_data.head()

In [None]:
x_data.shape

In [None]:
y_data.head()

In [None]:
y_data.shape

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.2, random_state = 7)

In [None]:
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size = 0.5, random_state = 7)

Scaling the variables (SMOTE for the data imbalance is used only in Experiment 3)

In [None]:
sc=StandardScaler()
sc.fit(x_train)

In [None]:
x_train_std=sc.transform(x_train)

In [None]:
x_val_std=sc.transform(x_val)

In [None]:
x_test_std=sc.transform(x_test)

In [None]:
##x_train=preprocessing.normalize(x_train) # Understand why we do this ?

In [None]:
print(x_train.shape)
print(x_test.shape)
print(x_val.shape)
print(y_train.shape)
print(y_test.shape)
print(y_val.shape)

##### Model Building & Architecture


`Model set up`

1. Construct a function defining the model architecture, with settings for a) the epochs, b) batch size, c) number of neurons, d) activation function, e) optimisers and f) learning rate
2. Compile the model
3. Print the model summary
4. Get an object of the above function
5. Train the model using x train, y train and pass the validation data as validation data
6. Plot the charts showing the accuracy and loss for both training and validation data
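
The model summary lists the trainable parameters per layer. For fully connected layers this is inputs × units + units. The sketch below applies that to layer sizes of 64, 32 and 1, assuming 11 input features (8 numeric columns plus 3 dummy columns after one-hot encoding with `drop_first=True`). The helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(n_inputs, layer_units):
    """Trainable parameters of a stack of fully connected layers."""
    total = 0
    for units in layer_units:
        total += n_inputs * units + units  # weights + biases
        n_inputs = units                   # this layer's output feeds the next
    return total

n_features = 11        # assumed number of input columns
neurons = [64, 32, 1]  # hidden, hidden, output
print(dense_param_count(n_features, neurons))  # 768 + 2080 + 33 = 2881
```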


Modeling using Accuracy for testing and validation datasets 

In [None]:
epoks=200
neurons=[64,32,1]
activation=['tanh','sigmoid']
batch_size=[10,20,30]
learning_rate=0.0001
optimizer=[optimizers.Adam(learning_rate=learning_rate),
           optimizers.SGD(learning_rate=learning_rate),
           optimizers.RMSprop(learning_rate=learning_rate)]

def dnn_model():

    model=Sequential()
    model.add(Dense(neurons[0], input_dim=x_train.shape[1],activation = activation[0]))
    model.add(Dense(neurons[1], activation = activation[0]))
    model.add(Dense(neurons[2], activation = activation[1]))

    model.compile(optimizer=optimizer[0],loss='binary_crossentropy',metrics=['accuracy'])

    model.summary()
    return model


In [None]:
# Import callbacks from tensorflow.keras so they match the Sequential model built above
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, CSVLogger, Callback, History, EarlyStopping

In [None]:
model=dnn_model()


model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint('best_model.h5', 
                                     monitor='val_loss', 
                                     verbose=0, 
                                     save_best_only=True, 
                                     save_freq='epoch')

history = model.fit(x_train_std,y_train, 
                    epochs=200, 
                    batch_size=20,
                    validation_data=(x_val_std,y_val),
                    callbacks=[model_checkpoint_callback]
                    )

##### Checking Model Performance 

In [None]:
acc      = history.history[     'accuracy' ]
val_acc  = history.history[ 'val_accuracy' ]
loss     = history.history[    'loss' ]
val_loss = history.history['val_loss' ]

epochs   = range(len(acc)) # Get number of epochs


plt.plot  ( epochs,     acc ,label='training')
plt.plot  ( epochs, val_acc,label='validation' )
plt.title ('Training and validation accuracy')
plt.legend()
plt.figure()

plt.plot  ( epochs,     loss , label='training')
plt.plot  ( epochs, val_loss , label='validation')
plt.title ('Training and validation loss'   )
plt.legend()
plt.show()

Checking the probabilities of a customer exiting the bank >0.5

In [None]:
y_predict=model.predict(x_test_std)

In [None]:
y_predict[:5]

Converting the probabilities into T/F or binary values 

In [None]:
y_predict = (y_predict > 0.5).astype(int)
print(y_predict[:5])

In [None]:
cm=confusion_matrix(y_test,y_predict)
print(cm)

In [None]:
# Confusion matrix: rows are observed classes, columns are predicted classes
cm = confusion_matrix(y_test, y_predict)
print('The Confusion Matrix is displayed below :')
print('')
print(cm)
print('')

# Plot the counts as a heatmap; fmt='d' shows them as integers
sns.heatmap(cm, annot=True, fmt='d', xticklabels=[0, 1], yticklabels=[0, 1])
plt.ylabel('Observed')
plt.xlabel('Predicted')
plt.show()

In [None]:
print(classification_report(y_test,y_predict))

##### Model details 

`Tweaking the hyper parameters` 

1. We need to tweak the hyperparameters of the model to get results that are more accurate, with a low loss value for both training and validation data, without overfitting the model.

2. We need to find the number of epochs at which the performance/accuracy is better than in our previous attempts. We started with 500 epochs and a batch size of 100 and then iterated to find the best batch size and the optimal number of epochs.
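
One way to organise this iteration is to list every combination of settings up front. The sketch below does this for the batch sizes and optimisers defined earlier. The optimisers are kept as plain names here (an assumption for illustration), to be turned into fresh optimizer objects for each run.

```python
from itertools import product

batch_sizes = [10, 20, 30]
optimizer_names = ["adam", "sgd", "rmsprop"]  # names only; build a new optimizer per run

# Every combination of batch size and optimizer to try.
grid = [
    {"batch_size": b, "optimizer": o}
    for b, o in product(batch_sizes, optimizer_names)
]
print(len(grid), grid[0])
```

Each configuration would then get its own freshly built model, with the configurations compared on the validation data rather than the test data.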

`Interpreting the charts` 

1. Accuracy/Recall - We see no improvement in validation accuracy/recall after 100 epochs; in fact the accuracy seems to drop slightly after 100 epochs. We see overfitting here.

2. Loss function - We see clearly that there is no drop in loss after 50-75 epochs.
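
The same reading of the charts can be done numerically on `history.history['val_loss']`. The sketch below uses a short hypothetical list of validation losses (not taken from the run above) and two illustrative helpers. One finds the best epoch. The other finds where a patience-based early-stopping rule would halt.

```python
def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)

def stop_epoch(val_losses, patience):
    """Epoch at which training would stop if it halts after `patience`
    consecutive epochs without a new best validation loss."""
    best, waited = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran to the end

# Hypothetical validation losses, for illustration only.
val_loss = [0.60, 0.48, 0.41, 0.38, 0.39, 0.40, 0.42, 0.45]
print(best_epoch(val_loss), stop_epoch(val_loss, patience=3))  # 3 6
```

The `EarlyStopping` callback imported earlier follows the same idea during training, although it is not used in the fit call above.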

`Prediction using the test data set` 

1. We predict the target variable on the unseen test data and look at the first 5 values to check that the array holds probability values.
2. Since the output layer uses sigmoid, we get the probabilities of the target variable.
3. We then apply a threshold to the probabilities, so anything >0.5 gives 1 and anything at or below 0.5 gives 0.
4. We use this output to print the confusion matrix.
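
Steps 3 and 4 can be sketched on a handful of hypothetical probabilities and labels (invented for illustration). The counts are arranged in the same layout that `confusion_matrix` uses, with observed classes in rows and predicted classes in columns.

```python
import numpy as np

# Hypothetical sigmoid outputs and true labels, for illustration only.
probs = np.array([0.91, 0.12, 0.55, 0.40, 0.08, 0.73])
y_true = np.array([1, 0, 0, 1, 0, 1])

# Threshold at 0.5: strictly greater gives 1, otherwise 0.
y_pred = (probs > 0.5).astype(int)

tn = int(((y_true == 0) & (y_pred == 0)).sum())
fp = int(((y_true == 0) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())
tp = int(((y_true == 1) & (y_pred == 1)).sum())

cm = np.array([[tn, fp],
               [fn, tp]])
print(cm)
```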



###### Modeling using Recall as primary metric - An experiment 

##### Plotting the recall and loss function 

`As an experiment and an alternative to the Accuracy metric ,we plot Recall and see if it gives us results that are more favourable to the business`

N.B - I've not kept the code below inactive after checking the metrics .

####         Conclusions 

` Business Insights `

True Negative (observed=0,predicted=0)

Predicted that the customer would not exit and they actually do not .

False Positive (observed=0,predicted=1)

Predicted that the customer would exit the bank  while the customer did not.

True Positive (observed=1,predicted=1)

Predicted that the customer would exit the bank and the customer did.

False Negative(observed=1,predicted=0)

Predicted the customer would not exit the bank when the customer did.
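
The metrics discussed below follow directly from these four counts. The sketch uses hypothetical counts, not the results of any experiment in this notebook, and the function name `classification_metrics` is illustrative.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, recall, precision and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        # Recall: share of actual exits that the model caught.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Precision: share of predicted exits that really exited.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
    }

# Hypothetical counts, for illustration only.
m = classification_metrics(tn=50, fp=10, fn=15, tp=25)
print({k: round(v, 3) for k, v in m.items()})
```

Recall depends on the false negatives, which is why it is the metric to watch when missed exits are the costly error.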

`Metrics of main interest`

From the problem statement given for the project, we look at the predictions in terms of how accurately bank exits are predicted.


*** Experiment 1 *** 
`Accuracy at 87% and Recall at 44%`

***True Classifications*** Of the 1000 records used in the test data, 777 were predicted as true negatives and 82 were predicted as positives and are positive, that is, the customer exited the bank. This is 86% accuracy: of all the customers predicted as leaving or staying with the bank, 86% were classified correctly.

We see that this accuracy on the test predictions is not drastically different from the training/validation accuracy.
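
For reference, the accuracy quoted for the test data follows from the two counts given above, with $N$ the number of test records:

$$
\text{accuracy} = \frac{TN + TP}{N} = \frac{777 + 82}{1000} = 0.859 \approx 86\%
$$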

*** Experiment 2 - Recall as a main metric ***
`An alternate approach by looking at ***Recall*** as a main metric instead of Accuracy` 

As an experiment we also look at recall, since this seems a more relevant metric to the business than accuracy alone; in fact accuracy and recall could be used in conjunction to decide on the best model.

`Accuracy at 87 % Recall at 47%`

The ***False negatives*** The lower this number, the better. These are the customers that we predict as not exiting the bank but who actually exit. The model above shows a lower number of false negatives than the one with Accuracy as main metric.

The ***True positives*** number has also gone up in the current model, which means that more of the predictions that the customer would exit the bank match the actual outcome.

The ***False Positives number*** False positives have gone down, indicating that overall error went down when we considered Recall as the main metric.
Accuracy has remained the same, while Recall, Precision and F1 score have all gone up by a small %.

*** Experiment - 3 + SMOTE  ***

Using SMOTE to upsample the data, since there's class imbalance.

`Accuracy at 81% and Recall at 65 %`

This would be ideal if we want to reduce the false negatives (we predict the customers would not exit and they DO exit), while also trying to strike a fair balance with the accuracy at 81%. F1 score and Recall are much better than in the other models, and the accuracy is a little compromised in this case. The hyperparameters need further adjustment.
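
The code for this experiment is not shown in the notebook. As a rough illustration of the idea behind SMOTE, the sketch below creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified NumPy version written for this note, not the implementation used for the figures above. The function name `smote_like` and the data are hypothetical.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows by interpolating between a
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # k nearest per sample

    base = rng.integers(0, len(X_min), size=n_new)              # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour for each
    gap = rng.random((n_new, 1))                                 # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Made-up minority-class rows with 2 features.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
X_new = smote_like(X_min, n_new=10)
print(X_new.shape)
```

Synthetic rows like these would be added to the training set only, after the train/validation/test split, so that the validation and test sets still reflect the original class balance. In practice a maintained implementation such as the `SMOTE` class in the imbalanced-learn package is the usual choice.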

In conclusion, these are the different options provided to the business; the most important metric needs to be decided first, and the best model is then picked accordingly from the 3 experiments we did.


