# Binary Classification of Loan data using Random Forest and Neural Network Learning methods and comparing 

## **Importing Libraries to be Used** 

In [0]:
# Linear Algebra
import numpy as np

# Data processing
import pandas as pd

#Algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix

#Data visualization
import seaborn as sns
import matplotlib.pyplot as plt

#Setting Warnings to Ignore
import warnings
warnings.filterwarnings("ignore")

#Date
import datetime

## Getting the Data

In [0]:
#Uploading the CSV file from the Local drive using files from Colab library
from google.colab import files
uploaded = files.upload()

Saving New Arise.csv to New Arise.csv


In [0]:
#Importing the file and reading into a Pandas Dataframe
import io
df = pd.read_csv(io.BytesIO(uploaded['New Arise.csv']))

## Data Exploration and Analysis:
A few things are notable from the table below. Firstly, we need to convert a lot of features into numeric ones later on, so that the machine learning algorithms can process them. Furthermore, we can see that some of the features have widely different ranges, that we will need to convert into roughly the same scale. We can also spot some more features, that contain Null which are missing values but are not empty cells, that we need to deal with.

In [0]:
#Previewing the Data
df.head()

Unnamed: 0,clientId,clientIncome,incomeVerified,clientAge,clientGender,clientMaritalStatus,clientLoanPurpose,clientResidentialStauts,clientState,clientTimeAtEmployer,...,dueDate,paidAt,loanAmount,interestRate,loanTerm,max_amount_taken,max_tenor_taken,firstPaymentRatio,Firstpayment,loanDefault
0,755398623,52500.0,False,29,FEMALE,Single,business,Rented,KANO,7,...,2018-09-10 06:35:11 UTC,2018-09-04 11:54:00 UTC,16000,20.0,60,1,1,0.0,0,0
1,915689736,52500.0,False,25,MALE,Single,business,Rented,LAGOS,21,...,2018-10-21 07:13:29 UTC,2018-09-06 04:44:04 UTC,14500,15.0,60,0,1,0.0,0,0
2,292629156,35000.0,False,32,MALE,Single,education,Rented,ANAMBRA,29,...,2018-10-23 11:00:00 UTC,Null,19500,15.0,60,0,1,0.0,1,1
3,671710636,35000.0,False,28,FEMALE,Married,business,Own Residence,OSUN,36+,...,2018-08-18 04:21:05 UTC,2018-07-10 11:23:31 UTC,19500,15.0,60,1,1,0.0,0,0
4,367769827,35000.0,False,34,MALE,Married,medical,Rented,ONDO,36+,...,2018-08-01 07:31:40 UTC,2018-08-09 06:05:37 UTC,17500,12.5,60,1,1,0.0,1,0


## Dropping Informationless Columns:
These are columns which bear singularly repetitive or Irrelevant data which should not be added as features

In [0]:
df.drop(['clientId', 'loanId', 'loanType', 'payout_status', 'declinedDate', 'applicationDate', 'approvalDate', 'dueDate'], axis=1, inplace=True)

## Data Exploration:
Combing through the data to view all the data column names, their number of rows and their data types

In [0]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159596 entries, 0 to 159595
Data columns (total 22 columns):
clientIncome                 159596 non-null float64
incomeVerified               159596 non-null object
clientAge                    159596 non-null int64
clientGender                 159596 non-null object
clientMaritalStatus          159596 non-null object
clientLoanPurpose            159596 non-null object
clientResidentialStauts      159596 non-null object
clientState                  159596 non-null object
clientTimeAtEmployer         159596 non-null object
clientNumberPhoneContacts    159596 non-null object
clientAvgCallsPerDay         159596 non-null object
loanNumber                   159596 non-null int64
disbursementDate             159596 non-null object
paidAt                       159596 non-null object
loanAmount                   159596 non-null int64
interestRate                 159596 non-null float64
loanTerm                     159596 non-null int64
max_amo

## Data Statistics:
Here we can see that about 28% of the clients defaulted on their payments and the average loan amount is about N35000. Also, the maximum number of times a loan was collected is 32.

In [0]:
df.describe()

Unnamed: 0,clientIncome,clientAge,loanNumber,loanAmount,interestRate,loanTerm,max_amount_taken,max_tenor_taken,firstPaymentRatio,Firstpayment,loanDefault
count,159596.0,159596.0,159596.0,159596.0,159596.0,159596.0,159596.0,159596.0,159596.0,159596.0,159596.0
mean,90839.06,33.691847,3.556806,35324.18419,13.331235,85.385599,0.705275,0.915374,0.097991,0.294268,0.277526
std,97280.33,7.18087,2.471578,27840.824297,4.467938,39.323756,0.455921,0.278325,0.278058,0.455714,0.447779
min,30.0,18.0,1.0,11000.0,4.5,60.0,0.0,0.0,0.0,0.0,0.0
25%,35000.0,28.0,2.0,20000.0,10.0,60.0,0.0,1.0,0.0,0.0,0.0
50%,55116.21,33.0,3.0,25500.0,12.5,60.0,1.0,1.0,0.0,0.0,0.0
75%,105000.0,38.0,4.0,37500.0,15.0,90.0,1.0,1.0,0.0,1.0,1.0
max,3925000.0,138.0,32.0,500000.0,20.0,180.0,1.0,1.0,1.0,1.0,1.0


## Checking for NaN:
This was a check for features with NaN(Not a Number) values which in essence are empty cells. It was discoveed that weren't any from the summary table which can be seen below. 

In [0]:
total = df.isnull().sum().sort_values(ascending=False)
percent_1 = df.isnull().sum()/df.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])
missing_data.head()

Unnamed: 0,Total,%
loanDefault,0,0.0
Firstpayment,0,0.0
incomeVerified,0,0.0
clientAge,0,0.0
clientGender,0,0.0


## Data Preprocessing:
Here, I used different methods such as one-hot encoding, label encoding, range setting to categorise the data into groups based on inference after probing into each feature.
For binary classification such as client gender, Income verified or classification which had less than 10 classes was label encoded. 
Some of the features had "Null" in them and had to be grouped into the class with the highest occurence.
Lambda functions were also assigned in some cases.
Lastly, some steps in the data analysis were not shown because it was a repetition of checks done before ( i.e checking the features value_count, unique and description were shown for some but not others) so as to make the notebook less clumsy.

In [0]:
#Grouping client Income by range. There was also no "null" here. 
data = [df]
for dataset in data:
    dataset['clientIncome'] = dataset['clientIncome'].astype(int)
    dataset.loc[(dataset['clientIncome'] > 0) & (dataset['clientIncome'] <= 50000), 'clientIncome'] = 0
    dataset.loc[(dataset['clientIncome'] > 50000) & (dataset['clientIncome'] <= 100000), 'clientIncome'] = 1
    dataset.loc[(dataset['clientIncome'] > 100000) & (dataset['clientIncome'] <= 200000), 'clientIncome'] = 2
    dataset.loc[ dataset['clientIncome'] > 200000, 'clientIncome'] = 3

In [0]:
#Import of label encoder and applying it 
from sklearn.preprocessing import LabelEncoder
lb_make = LabelEncoder()
df["incomeVerified"] = lb_make.fit_transform(df["incomeVerified"])

In [0]:
#Ranging client Age and assigning values to each range 
data = [df]
for dataset in data:
    dataset['clientAge'] = dataset['clientAge'].astype(int)
    dataset.loc[(dataset['clientAge'] > 17) & (dataset['clientAge'] <= 30), 'clientAge'] = 0
    dataset.loc[(dataset['clientAge'] > 30) & (dataset['clientAge'] <= 40), 'clientAge'] = 1
    dataset.loc[(dataset['clientAge'] > 40) & (dataset['clientAge'] <= 50), 'clientAge'] = 2
    dataset.loc[(dataset['clientAge'] > 50) & (dataset['clientAge'] <= 60), 'clientAge'] = 3
    dataset.loc[ dataset['clientAge'] > 60, 'clientAge'] = 4

In [0]:
#It can be seen that the largest client age group was between 18 and 30. There was only one client above 60 which was 138 years old from the describe above and was probably an oversight.
df['clientAge'].value_counts()

In [0]:
#label encoding client Gender
df["clientGender"] = lb_make.fit_transform(df["clientGender"])

In [0]:
#label encoding Marital status
df["clientMaritalStatus"] = lb_make.fit_transform(df["clientMaritalStatus"])


In [0]:
#label encoding Loan purpose
df["clientLoanPurpose"] = lb_make.fit_transform(df["clientLoanPurpose"])

In [0]:
#Inquiry into the client residential status feature elements 
df['clientResidentialStauts'].unique()

array(['Rented', 'Own Residence', 'Family Owned', 'Employer Provided',
       'Temp. Residence', 'Null'], dtype=object)

In [0]:
#"Null" was grouped to Rented because it ws the largest class while others were categorised unique
df.loc[df['clientResidentialStauts'] == "Null", "clientResidentialStauts"] = 0
df.loc[df['clientResidentialStauts'] == "Rented", "clientResidentialStauts"] = 0
df.loc[df['clientResidentialStauts'] == "Employer Provided", "clientResidentialStauts"] = 1
df.loc[df['clientResidentialStauts'] == "Family Owned", "clientResidentialStauts"] = 2
df.loc[df['clientResidentialStauts'] == "Own Residence", "clientResidentialStauts"] = 3
df.loc[df['clientResidentialStauts'] == "Temp. Residence", "clientResidentialStauts"] = 4

In [0]:
#The states were factorized and each category differentiated with an increment of 1.
df['clientState'].describe()

df['clientState'] = pd.factorize(df['clientState'], sort=True)[0] + 1 


In [0]:
# A check of the data showed that there was a number string 36+. This had to be adjusted to an integer.
df['clientTimeAtEmployer'].describe()

count     159596
unique        41
top          36+
freq       87294
Name: clientTimeAtEmployer, dtype: object

In [0]:
# Null was set to the first category while 36+ to 36. Some values in this column were also negative and had to be converted to positive values. A lambda function was used to accomplish
#this and another was set to divide the values by 5 and set their categorical value to the integer value.
df.loc[df['clientTimeAtEmployer'] == "Null", "clientTimeAtEmployer"]= 0
df.loc[df['clientTimeAtEmployer'] == "36+", "clientTimeAtEmployer"]= 36
df['clientTimeAtEmployer'] = df['clientTimeAtEmployer'].astype(int).apply(lambda x: -x if x<0 else x)
df['clientTimeAtEmployer'] = df['clientTimeAtEmployer'].astype(int).apply(lambda x: (x-1)//5 if x>0 else 0)

In [0]:
#The loan amount column was categorised by range of values 
data = [df]
for dataset in data:
    dataset['loanAmount'] = dataset['loanAmount'].astype(int)
    dataset.loc[(dataset['loanAmount'] > 0) & (dataset['loanAmount'] <= 50000), 'loanAmount'] = 0
    dataset.loc[(dataset['loanAmount'] > 50000) & (dataset['loanAmount'] <= 100000), 'loanAmount'] = 1
    dataset.loc[(dataset['loanAmount'] > 100000) & (dataset['loanAmount'] <= 150000), 'loanAmount'] = 2
    dataset.loc[(dataset['loanAmount'] > 150000) & (dataset['loanAmount'] <= 200000), 'loanAmount'] = 3
    dataset.loc[(dataset['loanAmount'] > 200000) & (dataset['loanAmount'] <= 250000), 'loanAmount'] = 4
    dataset.loc[(dataset['loanAmount'] > 250000) & (dataset['loanAmount'] <= 300000), 'loanAmount'] = 5 
    dataset.loc[ dataset['loanAmount'] > 300000, 'loanAmount'] = 6

In [0]:
# Loan Term was label encoded since there are only 3 options
df["loanTerm"] = lb_make.fit_transform(df["loanTerm"])


In [0]:
# Interest rates were different and had to be categorised by range 
data = [df]
for dataset in data:
    dataset['interestRate'] = dataset['interestRate'].astype(int)
    dataset.loc[(dataset['interestRate'] > 0) & (dataset['interestRate'] <= 5), 'interestRate'] = 0
    dataset.loc[(dataset['interestRate'] > 5) & (dataset['interestRate'] <= 10), 'interestRate'] = 1
    dataset.loc[(dataset['interestRate'] > 10) & (dataset['interestRate'] <= 15), 'interestRate'] = 2
    dataset.loc[(dataset['interestRate'] > 15) & (dataset['interestRate'] <= 20), 'interestRate'] = 3

In [0]:
# Same with first payment ratio
data = [df]
for dataset in data:
    dataset['firstPaymentRatio'] = dataset['firstPaymentRatio'].astype(int)
    dataset.loc[(dataset['firstPaymentRatio'] > 0) & (dataset['firstPaymentRatio'] <= 0.2), 'firstPaymentRatio'] = 0
    dataset.loc[(dataset['firstPaymentRatio'] > 0.2) & (dataset['firstPaymentRatio'] <= 0.4), 'firstPaymentRatio'] = 1
    dataset.loc[(dataset['firstPaymentRatio'] > 0.4) & (dataset['firstPaymentRatio'] <= 0.6), 'firstPaymentRatio'] = 2
    dataset.loc[(dataset['firstPaymentRatio'] > 0.6) & (dataset['firstPaymentRatio'] <= 0.8), 'firstPaymentRatio'] = 3
    dataset.loc[ dataset['firstPaymentRatio'] > 0.8, 'firstPaymentRatio'] = 4

In [0]:
# Two lambda functions were set. The first was to map null to zero then the second was to categorise the values by dividing by 2000.
data = [df]
zeromap = lambda x: 0 if x == "Null" else x
split = lambda x: 4 if x>8000 else (x-1)//2000 if x>0 else 0
for dataset in data:
  dataset['clientNumberPhoneContacts'] = dataset['clientNumberPhoneContacts'].apply(zeromap).astype(int)
  dataset['clientNumberPhoneContacts'] = dataset['clientNumberPhoneContacts'].apply(split).astype(int)


In [0]:
# The first lambda function was reused then the second was the same as above except now divided by 40 
data = [df]
split = lambda x: 4 if x> 160 else (x-1)//40 if x>=1 else 0
for dataset in data:
  dataset['clientAvgCallsPerDay'] = dataset['clientAvgCallsPerDay'].apply(zeromap).astype(float)
  dataset['clientAvgCallsPerDay'] = dataset['clientAvgCallsPerDay'].apply(split).astype(int)
  

In [0]:
# Parsing the date string to pandas datetime format then extracting the month were disbursement was made 
df['disbursementDate'] = pd.to_datetime(df['disbursementDate'], format = "%Y-%m-%d")
df['disbursementDate'] = df['disbursementDate'].dt.month

In [0]:
# The pandas date time infer datetime function was used due to the format of the date. Null was set to category zero. The day of the week was extracted and weekdays were mapped to zero
# while weekends to 1
df.loc[df['paidAt'] == "Null", "paidAt"]= 0   

df['paidAt'] = pd.to_datetime(df['paidAt'], infer_datetime_format=True).dt.dayofweek
df['paidAt'] = df['paidAt'].apply(lambda x: 0 if x<6 else 1).astype(int)


0    0
1    0
2    0
3    0
4    0
Name: paidAt, dtype: int64

## Building the Machine Learning models:
Pre-processing is completed and now the models are going to be built. First is the Random Forest classifier after which comes the Neural Network and there would be a comparison of the two methods. The training set will also be used to compare the algorithms with each other. I also used cross-validation to check the accuracy of the k-folds of the models to check how realistic they are. 
The SMOTE algorithm was used to balance the classes since the number of non-defaulters were about 3 times larger than the defaulters. The algorithm over-sampled the dafaulters to have an equal number of defaulters and non-defaulters.
I did a feature importance check and dropped the two least important features to get a better acuracy.
I did used several metrics such as f1 test, recall, precision, confusion matrix, accuracy and oob.

In [0]:
#The Y variable was set to the label and X to the input features then did a train test split with test size set to 30%
Y = df.loanDefault
X = df.drop('loanDefault', axis = 1)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,test_size=0.3)

In [0]:
# The SMOTE algorithm was used to balance the classes and can be seen in the new shape of Y_train_res which will be used further on
from imblearn.over_sampling import SMOTE
print("Before OverSampling, counts of label '1': {}".format(sum(Y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(Y_train==0)))

sm = SMOTE(random_state=2)
X_train_res, Y_train_res = sm.fit_sample(X_train, Y_train.ravel())

print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(Y_train_res.shape))

print("After OverSampling, counts of label '1': {}".format(sum(Y_train_res==1)))
print("After OverSampling, counts of label '0': {}".format(sum(Y_train_res==0)))

Before OverSampling, counts of label '1': 31058
Before OverSampling, counts of label '0': 80659 

After OverSampling, the shape of train_X: (161318, 21)
After OverSampling, the shape of train_y: (161318,) 

After OverSampling, counts of label '1': 80659
After OverSampling, counts of label '0': 80659


In [0]:
#The random forest is then trained on the training set and its accuracy is checked
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train_res, Y_train_res)

Y_prediction = random_forest.predict(X_test)

random_forest.score(X_train_res, Y_train_res)
acc_random_forest = round(random_forest.score(X_train_res, Y_train_res) * 100, 4)
print(acc_random_forest ,'%')

**An accuracy of about 99.5% is very high and this could suggest overfitting but this would be confirmed with the accuracy of the testing set which has been fitted on the training set.**

In [0]:
#The model was the cross validated with 10 folds to check for the reliability and accuracy of the model 
from sklearn.model_selection import cross_val_score
rf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(rf, X_train_res, Y_train_res, cv=10, scoring = "accuracy")
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())

Scores: [0.83387057 0.82891148 0.83132904 0.83529631 0.86362509 0.8652368
 0.86802628 0.86337714 0.8674064  0.86410415]
Mean: 0.8521183251196845
Standard Deviation: 0.016273974456636597


**The model was scored an average of 85.2% which is very realistic and seems very good to use. The estimates also differ by 1.62% which is really strong.**

In [0]:
# The features were compared based on their level of importance using random forest's feature importance function. The most importnt feature was Firstpayment while the least importnat was
# max tenor taken.
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(random_forest.feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False).set_index('feature')
importances

Unnamed: 0_level_0,importance
feature,Unnamed: 1_level_1
Firstpayment,0.288
clientState,0.109
disbursementDate,0.101
clientTimeAtEmployer,0.061
loanNumber,0.059
firstPaymentRatio,0.043
clientLoanPurpose,0.042
clientIncome,0.037
clientResidentialStauts,0.036
clientAvgCallsPerDay,0.033


In [0]:
# The two least important ffeatures were then dropped to boost accuracy
df  = df.drop("max_tenor_taken", axis=1)
df  = df.drop("loanAmount", axis=1)

In [0]:
# The random forest was then scored again on a training level 
random_forest = RandomForestClassifier(n_estimators=100, oob_score = True)
random_forest.fit(X_train_res, Y_train_res)
Y_prediction = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)

acc_random_forest = round(random_forest.score(X_train_res, Y_train_res) * 100, 2)
print(round(acc_random_forest,4,), "%")

99.5 %


**There was a slight increase in acccuracy and could be improved more if other less important features are dropped**

**There is also another way to evaluate a random-forest classifier, which is probably much more accurate than the score that was used before.  The out-of-bag samples to estimate the generalization accuracy. The out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set aside test set.**

In [0]:
print("oob score:", round(random_forest.oob_score_, 4)*100, "%")

oob score: 85.32 %


**This shows that the training on the test and scoring woould result in 85.32%. This is a pretty good classifier**

A Confusion matrix shows how good the classifier did. True positives and True negatives were about 6 times greater than False positives and False Negatives on the training set.

In [0]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
predictions = cross_val_predict(random_forest, X_train_res, Y_train_res, cv=3)
confusion_matrix(Y_train_res, predictions)

array([[69541, 11118],
       [15555, 65104]])

In [0]:
# The precision, f1-score, recall are all shown below.
from sklearn.metrics import classification_report
print(classification_report(Y_train_res,predictions))

              precision    recall  f1-score   support

           0       0.82      0.86      0.84     80659
           1       0.85      0.81      0.83     80659

   micro avg       0.83      0.83      0.83    161318
   macro avg       0.84      0.83      0.83    161318
weighted avg       0.84      0.83      0.83    161318



### The testing set is now classified on the fit of the training set.

In [0]:
random_forest = RandomForestClassifier()
random_forest.fit(X_train_res, Y_train_res)

Y_prediction = random_forest.predict(X_test)

acc_random_forest = round(random_forest.score(X_test, Y_test) * 100, 2)
print((acc_random_forest), "%")

81.02 %


**It did  much less better than the training set did which will suggest an overfit on the training data**

In [0]:
#Classification report on the testing set 
print('Results on the test set:')
print(classification_report(Y_test, Y_prediction))

Results on the test set:
              precision    recall  f1-score   support

           0       0.87      0.86      0.87     34645
           1       0.65      0.67      0.66     13234

   micro avg       0.81      0.81      0.81     47879
   macro avg       0.76      0.77      0.76     47879
weighted avg       0.81      0.81      0.81     47879



In [0]:
#Confusion matrix on the testing set
from sklearn.metrics import confusion_matrix
confusion_matrix(Y_test, Y_prediction)

array([[29897,  4748],
       [ 4341,  8893]])

### Hyperparameter tuning of the Random Forest Classifier using GridSearchCV

In [0]:
#Setting the parameters and fitting on the training data and then displaying the best parameters
# Also the tuples had to be restricted to a size of two to reduce search iteration time 
param_grid = { 
    "min_samples_leaf" : [1, 5], "min_samples_split" : [10, 16], "n_estimators": [300, 700]}
from sklearn.model_selection import GridSearchCV, cross_val_score
rf = RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=1, n_jobs=-1)
clf = GridSearchCV(estimator=rf, param_grid=param_grid, n_jobs=-1)
clf.fit(X_train_res, Y_train_res)
clf.best_params_

{'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 700}

**These best parameters are then plugged into the classifier and used to score how well the testing data did**

In [0]:
acc_random_forest = round(clf.score(X_test, Y_test) * 100, 2)
print((acc_random_forest), '%')

83.42 %


**The accuracy of the hyper parameter tuned Random forest is relatively better than that of the testing set alone but still less than the oob score.**

In [0]:
#Classification report for the Hyperparameter tuned Random Forest
y_prediction = clf.predict(X_test)
print('Results on the test set with hyperparameter tuning:')
print(classification_report(Y_test, y_prediction))

Results on the test set with hyperparameter tuning:
              precision    recall  f1-score   support

           0       0.90      0.87      0.88     34645
           1       0.68      0.75      0.71     13234

   micro avg       0.83      0.83      0.83     47879
   macro avg       0.79      0.81      0.80     47879
weighted avg       0.84      0.83      0.84     47879



**This is as good as the classification report of the cross validation set**

## Neural Network:
**The data is now passed through a NN classifier which ws a Multi Layer Perceptron (MLP). The neural network has difficulty converging before the maximum number of iterations allowed if the data is not normalized. Multi-layer Perceptron is sensitive to feature scaling, so I scaled the data. I also applied the same scaling to the test set for meaningful results. StandardScaler was used for standardization.**

In [0]:
#Importing standard scaler
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit only to the training data

scaler.fit(X_train_res)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [0]:
#Applying the transformations to the data 
x_train = scaler.transform(X_train_res)

x_test = scaler.transform(X_test)

**Training the model using Estimator Objects:**

In [0]:
#Importing MLP
from sklearn.neural_network import MLPClassifier
# Setting the number of hidden layers(3) and the sizes of the neurons at 21 each for the 21 input features
mlp = MLPClassifier(hidden_layer_sizes=(21,21,21))

In [0]:
#ffitting the processed and scaled data with the MLP classifier
mlp.fit(x_train,Y_train_res)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(21, 21, 21), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

### Scoring the training set:

In [0]:
mlp = MLPClassifier()
mlp.fit(x_train, Y_train_res)

acc_mlp = round(mlp.score(x_train, Y_train_res) * 100, 2)
print(round(acc_mlp,2,), "%")

81.04 %


**The training set scored 81.04% which means it's not overfitting but might be slightly underfitting. This will be decided later on.**

In [0]:
# Confusion matrix for the training set
predictions = cross_val_predict(mlp, x_train, Y_train_res, cv=3)
confusion_matrix(Y_train_res, predictions)

array([[68242, 12417],
       [22177, 58482]])

In [56]:
#Classification Report for the training set 
print(classification_report(Y_train_res,predictions))

              precision    recall  f1-score   support

           0       0.76      0.84      0.80     80659
           1       0.82      0.73      0.77     80659

   micro avg       0.79      0.79      0.79    161318
   macro avg       0.79      0.79      0.79    161318
weighted avg       0.79      0.79      0.79    161318



In [0]:
#Cross validation score for the NN
scores = cross_val_score(mlp, x_train, Y_train_res, cv=10, scoring = "accuracy")
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())

Scores: [0.80188445 0.79308207 0.79134639 0.80275229 0.80399207 0.80039673
 0.7983511  0.79264815 0.8000248  0.80588965]
Mean: 0.7990367702658892
Standard Deviation: 0.004818110213421101


**This shows an average of 79.9% and varies slightly by 0.48%.**

In [0]:
# Predictions using the fitted model
predictions = mlp.predict(X_test)

In [58]:
# Confusion matrix for the predictions
print(confusion_matrix(Y_test,predictions))

[[29862  4783]
 [ 3477  9757]]


In [59]:
#Classification Report for the predictions
print(classification_report(Y_test,predictions))

              precision    recall  f1-score   support

           0       0.90      0.86      0.88     34645
           1       0.67      0.74      0.70     13234

   micro avg       0.83      0.83      0.83     47879
   macro avg       0.78      0.80      0.79     47879
weighted avg       0.83      0.83      0.83     47879



In [71]:
#Scoring the MLP on the testing dat 
mlp = MLPClassifier()
mlp.fit(x_train, Y_train_res)

acc_mlp = round(mlp.score(x_test, Y_test) * 100, 2)
print(round(acc_mlp,2,), "%")

81.73 %


**The MLP scored 81.73% which is better than that of the training data. This will suggest that the MLP is underfitting slightly. It also scored a  much better classification report.**

### Hyperparameter tuning of the Multi-Layer perceptron Classifier using GridSearchCV

In [72]:
#Parameters are specified and again less are used to reduce running time.
mlp = MLPClassifier(max_iter=200)
parameter_space = {
    'hidden_layer_sizes': [(25,20,25)],
    'activation': ['relu'],
    'solver': ['adam'],
    'alpha': [0.0001],
    'learning_rate': ['constant'],
}
clf = GridSearchCV(mlp, parameter_space, n_jobs=-1, cv=3)
clf.fit(x_train, Y_train_res)

# Best parameter set
print('Best parameters found:\n', clf.best_params_)
# All results
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))


Best parameters found:
 {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (25, 20, 25), 'learning_rate': 'constant', 'solver': 'adam'}
0.785 (+/-0.034) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (25, 20, 25), 'learning_rate': 'constant', 'solver': 'adam'}


In [73]:
#These parameters are then used to predict x test
y_pred = clf.predict(x_test)
#Classification report for the hyper parameter tuned NN
print('Results on the test set:')
print(classification_report(Y_test, y_pred))

Results on the test set:
              precision    recall  f1-score   support

           0       0.90      0.85      0.88     34645
           1       0.66      0.75      0.70     13234

   micro avg       0.82      0.82      0.82     47879
   macro avg       0.78      0.80      0.79     47879
weighted avg       0.83      0.82      0.83     47879



**These results are very similar to the non-grid searched NN**

In [75]:
#Accuracy of the GridSearched NN on the testing set
acc_mlp = round(clf.score(x_test, Y_test) * 100, 2)
print((acc_mlp), '%')

82.46 %


**The Grid search had better results.**

## Final Verdict:
Both classifiers did well relatively but the title of better classifier would have to be given to the Random Forest.
This is because when hyperparameters were tuned,

1) It had a better accuracy on the testing set after hyper-parameter tuning which was 83.42% compared with NN's 82.46%.

2) RF had a better cross validation mean of 85.2% compared with NN's CV's score of 79.9%

3) It had a slightly better precision, recall and F1 score.

At  training level, the RF overfitted the training set with an accuracy of about 99.5% while the NN underfitted slightly with an accuracy score of 81.04%.

At testing level, the NN had a better accuracy compared with the RF and this was shown on the confusion matrix for each model. NN had less False positives and False Negatives. 