# Credit default prediction - Project overview

The goal of this project is, ***first***, to train a machine learning model that predicts whether a borrower will be able to repay a loan. In the ***second*** step we try to improve the model with standard methods: optimizing its hyperparameters with a grid search, filling in NaNs with a different strategy and rebalancing the dataset. Afterwards we analyse the remaining prediction errors and use the knowledge gained in the ***third*** step to improve the model's performance further. The emphasis of this project is **not** on building a model with the best possible accuracy but on demonstrating a general procedure for building a machine learning model and conducting an error analysis as a basis for further improvements.


The structure of the first part of this project is:

1. Data exploration
2. Data cleaning
3. Data preprocessing 
4. Training the models
5. Interim result

# 1. Data exploration
First we explore the dataset to get an overview of the data we are dealing with.

In [1]:
import pandas as pd
import numpy as np 

raw_df = pd.read_csv("https://github.com/TomMarq1/Credit_default_prediction_-ML-model-/raw/main/credit_risk_dataset.csv")

In [2]:
# showing a sample of the dataset
raw_df.sample(10)

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
18817,31,75000,RENT,1.0,PERSONAL,A,1200,6.54,0,0.02,N,5
17588,23,41000,RENT,7.0,EDUCATION,A,4700,7.9,0,0.11,N,4
17798,25,77078,MORTGAGE,1.0,DEBTCONSOLIDATION,B,3425,12.18,0,0.04,N,4
13928,25,98000,MORTGAGE,2.0,PERSONAL,A,6400,9.63,0,0.07,N,4
22868,28,59000,MORTGAGE,2.0,VENTURE,B,14000,10.36,0,0.24,N,5
22170,28,53000,MORTGAGE,12.0,HOMEIMPROVEMENT,B,17050,12.42,0,0.32,N,7
30289,43,72000,RENT,5.0,MEDICAL,B,6000,12.42,0,0.08,N,17
2001,22,32000,RENT,0.0,DEBTCONSOLIDATION,C,14000,13.49,1,0.44,N,2
904,25,52000,RENT,5.0,MEDICAL,B,19000,,1,0.37,N,4
16384,22,40000,RENT,2.0,DEBTCONSOLIDATION,B,7000,11.99,0,0.17,N,3


In [64]:
# show shape of the dataset
raw_df.shape

# the dataset consists of 32,581 rows and 12 columns

(32581, 12)

In [65]:
# check the amount of loan default in relation to all loans
# loan_status 1 stands for loan default according to the data description

len(raw_df.query("loan_status == 1"))/ len(raw_df)

# about 21.8% of all loans are defaults. Our dataset is therefore imbalanced, an issue we will look at later on.

0.21816396059052823
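The same class balance can also be read off with `value_counts`. The following is a minimal sketch on a hypothetical toy series (`loan_status` below is made up and only stands in for `raw_df["loan_status"]`):

```python
import pandas as pd

# Hypothetical toy target column; in the notebook this would be raw_df["loan_status"]
loan_status = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0], name="loan_status")

# Relative frequency of each class (0 = repaid, 1 = default)
class_shares = loan_status.value_counts(normalize=True)
print(class_shares)

# For a 0/1 column the mean equals the share of defaults
default_rate = loan_status.mean()
print(default_rate)
```

Because the target is coded as 0/1, the mean of the column is a compact alternative to dividing two row counts.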

In [66]:
# check for NaN-values
raw_df.isna().sum()

# There are NaN values in the columns person_emp_length and loan_int_rate.
# Those NaN values need to be dealt with before the data can be used to train the model.
# This will be the main topic of the next chapter: data cleaning.


person_age                       0
person_income                    0
person_home_ownership            0
person_emp_length              895
loan_intent                      0
loan_grade                       0
loan_amnt                        0
loan_int_rate                 3116
loan_status                      0
loan_percent_income              0
cb_person_default_on_file        0
cb_person_cred_hist_length       0
dtype: int64

In [67]:
# to see the features with the highest correlation with loan default, see 3.4 below
# (there the categorical features are encoded as numbers and appear in the correlation table)

# 2. Data cleaning

# 2.1 Dropping duplicated data

In [68]:
# Checking for duplicate rows
sum(raw_df.duplicated())

165

In [69]:
# drop the duplicate rows
raw_df.drop_duplicates(inplace=True)

# 2.2 Dealing with NaNs in the column "person_emp_length"

In [70]:
raw_df[raw_df["person_emp_length"].isna()].sample(20)

# A look at the sample below shows that rows with NaN values in "person_emp_length" are mostly people in their 20s.
# So maybe they have a NaN value in "person_emp_length" because they haven't worked before.
# If this is the case, we should fill in the NaNs with 0.
# Maybe we can find some more hints that support this assumption:

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
6772,24,50000,OWN,,VENTURE,B,21000,10.62,0,0.42,N,4
30178,38,48000,OWN,,VENTURE,A,7600,7.88,0,0.16,N,11
31745,39,134000,MORTGAGE,,VENTURE,B,25000,10.25,0,0.19,N,17
8385,26,57550,MORTGAGE,,DEBTCONSOLIDATION,A,3500,5.42,0,0.06,N,4
21244,27,46000,OWN,,VENTURE,A,8000,,0,0.17,N,9
5790,26,15436,RENT,,DEBTCONSOLIDATION,A,5500,7.88,1,0.36,N,3
30162,39,48000,MORTGAGE,,DEBTCONSOLIDATION,A,3000,5.42,0,0.06,N,11
21403,28,48000,OWN,,PERSONAL,D,8000,16.77,0,0.17,Y,9
29333,37,92000,RENT,,DEBTCONSOLIDATION,D,20000,14.46,1,0.22,N,15
18850,28,30000,RENT,,HOMEIMPROVEMENT,B,1500,9.63,1,0.05,N,10


In [71]:
raw_df[raw_df["person_emp_length"].isna()].describe()

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_cred_hist_length
count,887.0,887.0,0.0,887.0,820.0,887.0,887.0,887.0
mean,27.312289,44308.392334,,7059.188275,10.039902,0.316798,0.191218,5.636979
std,5.89721,37472.609235,,5208.977655,3.464062,0.46549,0.121127,3.847122
min,21.0,4200.0,,1000.0,5.42,0.0,0.01,2.0
25%,23.0,24000.0,,3200.0,6.99,0.0,0.09,3.0
50%,25.0,36000.0,,6000.0,9.91,0.0,0.17,4.0
75%,30.0,55000.0,,9775.0,12.69,1.0,0.265,8.0
max,70.0,648000.0,,35000.0,21.36,1.0,0.65,27.0


In [72]:
# As we can see above: 50% of the people with NaN in "person_emp_length" are 25 years old or younger
# and 75% are 30 years old or younger.
# So most of the people with NaN in "person_emp_length" can be considered young
# and at the start of their career. This supports our assumption that the NaN values exist
# because these persons don't have prior working experience.

# Because of this we will fill the NaNs with 0

In [73]:
cleaned_df = raw_df.copy()
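# assign the result back instead of calling fillna(..., inplace=True) on the selected column,
# which newer pandas versions warn about and which may not modify cleaned_df
cleaned_df["person_emp_length"] = cleaned_df["person_emp_length"].fillna(0)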


In [74]:
# check if it worked
sum(cleaned_df["person_emp_length"].isna())

0

# 2.3 Dealing with NaNs in the column loan_int_rate

In [75]:
# Let´s have a look at rows with NaN´s in loan_int_rate first

cleaned_df[cleaned_df["loan_int_rate"].isna()].sample(10) 


Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
9416,23,62000,OWN,2.0,VENTURE,B,5575,,0,0.09,N,3
8567,22,59000,MORTGAGE,5.0,MEDICAL,A,6000,,0,0.1,N,2
11646,21,75000,MORTGAGE,4.0,MEDICAL,B,25000,,0,0.33,N,2
18941,27,29120,OWN,0.0,VENTURE,A,4000,,0,0.14,N,7
13123,24,88494,OWN,8.0,MEDICAL,B,24000,,0,0.27,N,2
4596,22,41604,OWN,1.0,PERSONAL,B,14400,,0,0.35,N,4
17853,34,156000,RENT,4.0,EDUCATION,A,35000,,0,0.22,N,5
32024,38,20004,OWN,2.0,HOMEIMPROVEMENT,B,6250,,0,0.31,N,15
23763,29,49000,RENT,3.0,EDUCATION,A,9500,,1,0.19,N,5
9700,22,63500,MORTGAGE,3.0,EDUCATION,C,7000,,0,0.11,N,2


In [76]:
# Unfortunately, the sample gives no hint as to why there are NaNs in loan_int_rate: the affected rows
# have no values in common in the other columns. That means the loans are not similar,
# so we cannot fill in one single value of loan_int_rate for all of them.

# Instead we will have to estimate the missing value for each row.
# To keep it simple in the first iteration, but still make a substantiated estimate
# and not only a wild guess, we take the loan_grade into account.
# This is because, in general, the worse the loan grade gets, the higher the interest rate
# should be to compensate for the higher risk of default.

In [77]:
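# numeric_only=True restricts the mean to numeric columns (pandas >= 2.0 raises an error otherwise)
cleaned_df.groupby("loan_grade").mean(numeric_only=True)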

Unnamed: 0_level_0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_cred_hist_length
loan_grade,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
A,27.68336,66605.836121,4.913389,8545.70214,7.328423,0.099598,0.153757,5.754088
B,27.70078,66332.369789,4.634158,9992.228266,10.995756,0.163185,0.175283,5.790796
C,27.808326,64963.895464,4.359273,9219.17521,13.464579,0.207518,0.170098,5.870923
D,27.875414,63703.096961,4.614917,10855.759669,15.360698,0.590608,0.190981,5.896133
E,27.862928,70856.360332,4.30945,12919.911734,17.008409,0.64486,0.206106,5.82866
F,28.352697,77008.73029,4.165975,14717.323651,18.609159,0.705394,0.215643,6.128631
G,28.4375,76773.296875,6.125,17195.703125,20.251525,0.984375,0.243906,6.453125


In [78]:
# As we can see, the worse the loan grade gets, the higher the average interest rate.
# Because of this we are going to fill in the NaNs in loan_int_rate
# with the average interest rate of the row's loan_grade class.

In [79]:
# filling in the right interest rate average

##list of all categories in loan_grade
loan_grade_categories = ["A", "B", "C", "D", "E", "F", "G"]

##list of avg interest rates of corresponding loan_grade
## select the column before aggregating so that only loan_int_rate is averaged
corresp_avg_int_rates = cleaned_df.groupby("loan_grade")["loan_int_rate"].mean().to_list()
corresp_avg_int_rates

[7.328422732420467,
 10.99575559601585,
 13.464579101394389,
 15.360698096101542,
 17.00840909090909,
 18.609158878504672,
 20.25152542372881]

In [80]:

for grade, avg_rate in zip(loan_grade_categories, corresp_avg_int_rates):

    # boolean mask: rows with NaN in loan_int_rate and the current value in loan_grade
    mask = cleaned_df["loan_int_rate"].isna() & (cleaned_df["loan_grade"] == grade)

    # assign through .loc on the original dataframe so that cleaned_df itself is changed
    cleaned_df.loc[mask, "loan_int_rate"] = avg_rate

    # Beware: cleaned_df.loc[mask, "loan_int_rate"].fillna(avg_rate, inplace=True) does not work here
    # because the .loc selection returns a copy, so the original dataframe would stay unchanged
    

In [81]:
# check if it worked
cleaned_df.isna().sum()

person_age                    0
person_income                 0
person_home_ownership         0
person_emp_length             0
loan_intent                   0
loan_grade                    0
loan_amnt                     0
loan_int_rate                 0
loan_status                   0
loan_percent_income           0
cb_person_default_on_file     0
cb_person_cred_hist_length    0
dtype: int64
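As an alternative to looping over the grades, the group-wise fill can be written with `groupby(...).transform("mean")`. This is only a sketch on a hypothetical toy frame (`toy_df` and its values are made up), not the notebook's own method:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for cleaned_df
toy_df = pd.DataFrame({
    "loan_grade": ["A", "A", "A", "B", "B", "B"],
    "loan_int_rate": [7.0, np.nan, 8.0, 11.0, 12.0, np.nan],
})

# Mean interest rate of each row's own grade, aligned to the original index
grade_means = toy_df.groupby("loan_grade")["loan_int_rate"].transform("mean")

# Fill only the missing entries with the mean of the matching grade
toy_df["loan_int_rate"] = toy_df["loan_int_rate"].fillna(grade_means)
print(toy_df)
```

`transform` returns one value per row, so no manually maintained list of grades or averages is needed.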

# 3. Data preprocessing
Now that we have finished cleaning the data by deleting duplicate rows and filling NaNs, it's time to preprocess the values so that our machine learning model can make better use of the data later on.

3.1 Creating train and test sets <br>
Before we start preprocessing the data, we need to set aside a separate test data set.


In [82]:
X = cleaned_df.drop(columns=["loan_status"])
y = cleaned_df["loan_status"]

In [83]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=8)
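Since the target is imbalanced, one optional variant is a stratified split, which keeps the share of defaults the same in both subsets. A minimal sketch on hypothetical toy arrays (`X_toy` and `y_toy` are made up; the notebook itself splits without `stratify`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 non-defaults (0) and 20 defaults (1)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy keeps the class proportions equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=8, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```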

3.2 Preprocessing numerical values <br>
We are going to preprocess the numerical values by standardizing them (zero mean, unit variance).

In [84]:
# To specifically select the numeric values, we save their column names in a list
num_col = X_train.select_dtypes("number").columns.to_list()
num_col

['person_age',
 'person_income',
 'person_emp_length',
 'loan_amnt',
 'loan_int_rate',
 'loan_percent_income',
 'cb_person_cred_hist_length']

In [85]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train_num_col_scaled = sc.fit_transform(X_train[num_col])
X_test_num_col_scaled = sc.transform(X_test[num_col])
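The scaler is fitted on the training data only and then reused for the test data. The sketch below, on hypothetical toy values, illustrates that `transform` applies the training mean and standard deviation, i.e. z = (x - mean_train) / std_train:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature values (one numeric column)
train_values = np.array([[20.0], [30.0], [40.0]])
test_values = np.array([[50.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_values)  # learns mean and std from the train data
test_scaled = scaler.transform(test_values)        # reuses the train statistics

# Manual standardization with the train mean and (population) standard deviation
manual_test = (test_values - train_values.mean()) / train_values.std()
print(test_scaled, manual_test)
```

Fitting on the training data alone avoids leaking information about the test set into the preprocessing.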



3.3 Preprocessing categorical values <br>
Before we preprocess the categorical values, we have to split them into ordinal and non-ordinal columns because the two kinds are processed differently.

In [86]:
# categorical columns
cat_col = X_train.select_dtypes("object").columns.to_list()
cat_col

['person_home_ownership',
 'loan_intent',
 'loan_grade',
 'cb_person_default_on_file']

In [87]:
# columns with non-ordinal categories
## in our data set the only non-ordinal category is loan_intent because it's hard to tell e.g. whether the loan intent "homeimprovement" is better than "education" or "medical" is worse than "debtconsolidation"
non_ord_category_col = [cat_col[1]]
non_ord_category_col

['loan_intent']

In [88]:
# encoding the non ordinal categorical columns
from sklearn.preprocessing import OneHotEncoder

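# sparse_output replaces the sparse argument in scikit-learn >= 1.2 (use sparse=False on older versions)
ohenc = OneHotEncoder(sparse_output=False)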
X_train_non_ord_cat_col_enc = ohenc.fit_transform(X_train[non_ord_category_col])
X_test_non_ord_cat_col_enc = ohenc.transform(X_test[non_ord_category_col])

In [89]:
# columns with ordinal categories
## the rest of the categorical columns are ordinal
## they are listed explicitly because building the list from a set difference gives an order
## that can change between runs (Python sets are unordered);
## the order must match the order of cats_ord in the next cell
ord_category_col = ["person_home_ownership", "cb_person_default_on_file", "loan_grade"]
ord_category_col

['person_home_ownership', 'cb_person_default_on_file', 'loan_grade']
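The changing order comes from Python sets being unordered. A sketch of one way to keep the ordinal columns and their category orders aligned on every run is shown here; all names in it (`all_columns`, `category_orders`, ...) are hypothetical stand-ins for the notebook's variables:

```python
# Hypothetical column lists mirroring the notebook's variables
all_columns = ["person_age", "person_home_ownership", "loan_intent",
               "loan_grade", "cb_person_default_on_file"]
numeric_columns = ["person_age"]
non_ordinal_columns = ["loan_intent"]

# A list comprehension keeps the original column order on every run
excluded = set(numeric_columns) | set(non_ordinal_columns)
ordinal_columns = [col for col in all_columns if col not in excluded]

# Keying the category orders by column name keeps both lists aligned
category_orders = {
    "person_home_ownership": ["OTHER", "RENT", "MORTGAGE", "OWN"],
    "loan_grade": ["G", "F", "E", "D", "C", "B", "A"],
    "cb_person_default_on_file": ["Y", "N"],
}
categories_in_column_order = [category_orders[col] for col in ordinal_columns]
print(ordinal_columns)
```

Looking the category lists up by column name removes the need to rearrange them by hand.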

In [90]:
# encoding ordinal categorical columns
from sklearn.preprocessing import OrdinalEncoder

## Determine the order in which the categories should be ranked (must be done manually; by default the encoder sorts the categories alphabetically)
cb_person_default_on_file_cats = ["Y", "N"]
loan_grade_cats =["G", "F", "E", "D", "C", "B", "A"]
person_home_ownership_cats = ["OTHER", "RENT", "MORTGAGE", "OWN"]

cats_ord = [person_home_ownership_cats, cb_person_default_on_file_cats, loan_grade_cats]

In [91]:
## encode the ordinal category columns
ord_enc = OrdinalEncoder(categories= cats_ord)
X_train_ord_cat_enc = ord_enc.fit_transform(X_train[ord_category_col])
X_test_ord_cat_enc = ord_enc.transform(X_test[ord_category_col])

## the order of cats_ord must match the column order of ord_category_col; otherwise this cell raises an error

In [92]:
# build one dataframe for the train data out of the processed columns...
X_train_processed = pd.concat([pd.DataFrame(X_train_num_col_scaled, #dataframe with scaled numerical columns
                               columns = num_col),
                               pd.DataFrame(X_train_non_ord_cat_col_enc, #dataframe with encoded non ordinal categorical columns
                               columns = ohenc.get_feature_names_out()),
                               pd.DataFrame(X_train_ord_cat_enc, # dataframe with encoded ordinal categorical columns
                               columns = ord_category_col) 

                               
                               ], axis=1)

3.4 Overview of main features correlated with loan default

In [93]:
corr = X_train_processed.copy()
corr["loan_status"] = y_train.to_list()

In [94]:
corr.corr()["loan_status"].sort_values(ascending = False)
# The features with the highest correlation with loan_status in absolute terms are:
# loan_percent_income: ratio of loan amount to annual income
# loan_int_rate: interest rate of the loan
# loan_grade: not explained in the data description. According to internet research, an aggregated value
# that expresses the risk of default
# person_home_ownership: categorical value that describes whether the borrower owns a house, owns a house
# with a mortgage, rents, or other

loan_status                      1.000000
loan_percent_income              0.373258
loan_int_rate                    0.334425
loan_amnt                        0.099082
loan_intent_DEBTCONSOLIDATION    0.069874
loan_intent_MEDICAL              0.050191
loan_intent_HOMEIMPROVEMENT      0.045374
cb_person_cred_hist_length      -0.018377
loan_intent_PERSONAL            -0.023159
person_age                      -0.023393
loan_intent_EDUCATION           -0.049892
loan_intent_VENTURE             -0.081010
person_emp_length               -0.087095
person_income                   -0.170005
cb_person_default_on_file       -0.177667
person_home_ownership           -0.235235
loan_grade                      -0.375385
Name: loan_status, dtype: float64

In [95]:
X_train_processed

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_cred_hist_length,loan_intent_DEBTCONSOLIDATION,loan_intent_EDUCATION,loan_intent_HOMEIMPROVEMENT,loan_intent_MEDICAL,loan_intent_PERSONAL,loan_intent_VENTURE,person_home_ownership,cb_person_default_on_file,loan_grade
0,2.597058,-0.014781,2.696150,-1.202739,-0.960381,-1.319334,2.029945,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,6.0
1,-0.430591,-0.298444,-0.865631,0.065842,-1.149359,0.276065,-0.688305,0.0,0.0,0.0,0.0,0.0,1.0,3.0,1.0,6.0
2,0.047459,-0.278790,1.746342,-1.186882,0.399640,-1.225487,-0.194077,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,4.0
3,-0.749291,0.268958,-0.865631,0.224415,0.535952,-0.287017,-0.688305,0.0,0.0,0.0,0.0,1.0,0.0,2.0,1.0,4.0
4,-0.908641,1.232155,-1.103083,0.858706,-0.638189,-0.568558,-0.935418,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22686,-0.749291,-0.279603,0.321629,0.620847,0.381052,0.839147,-0.935418,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,4.0
22687,-0.111891,1.082341,-1.103083,0.577239,1.583075,-0.568558,0.300150,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,4.0
22688,-0.111891,-0.790333,-0.865631,-0.568449,0.763524,0.745300,0.300150,0.0,1.0,0.0,0.0,0.0,0.0,3.0,1.0,4.0
22689,0.206809,0.193294,-0.628179,0.541561,-0.232352,-0.005476,0.547263,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,5.0


In [96]:
# ..and the same for the test data
X_test_processed = pd.concat([pd.DataFrame(X_test_num_col_scaled, #dataframe with scaled numerical columns
                               columns = num_col),
                               pd.DataFrame(X_test_non_ord_cat_col_enc, #dataframe with encoded non ordinal categorical columns
                               columns = ohenc.get_feature_names_out()),
                               pd.DataFrame(X_test_ord_cat_enc, # dataframe with encoded ordinal categorical columns
                               columns = ord_category_col) 

                               
                               ], axis=1)

In [97]:
X_test_processed

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_cred_hist_length,loan_intent_DEBTCONSOLIDATION,loan_intent_EDUCATION,loan_intent_HOMEIMPROVEMENT,loan_intent_MEDICAL,loan_intent_PERSONAL,loan_intent_VENTURE,person_home_ownership,cb_person_default_on_file,loan_grade
0,-0.430591,-0.771417,-0.865631,-0.695307,0.473992,0.369912,-0.688305,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,4.0
1,0.206809,2.349707,0.084177,-0.885594,-1.356925,-1.413181,-0.194077,0.0,0.0,0.0,0.0,1.0,0.0,2.0,1.0,6.0
2,-0.749291,-0.479355,-1.103083,-0.647735,0.535952,-0.287017,-0.441191,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,4.0
3,1.481609,0.079799,-0.390727,-0.742879,0.365562,-0.943946,2.277059,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,5.0
4,-0.749291,-0.128276,0.796534,1.175851,-0.232352,1.120688,-0.441191,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9720,-0.589941,0.344621,0.084177,1.651570,0.257132,0.651453,-0.688305,0.0,0.0,0.0,0.0,1.0,0.0,3.0,1.0,5.0
9721,-0.908641,-0.203940,-0.390727,0.858706,1.118376,0.932994,-0.441191,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,3.0
9722,1.003559,-0.517944,-0.865631,-0.489162,-0.337684,-0.005476,0.300150,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,5.0
9723,-0.908641,-0.881129,0.321629,-1.297883,1.211316,-0.943946,-0.441191,0.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,3.0


# 5. Training and evaluating the models

5.1 Dummy model <br>
To evaluate the models we are going to train, we first establish a simple dummy model. Comparing the trained models against it shows whether they are actually an improvement over a simple rule.

In [98]:
# to build a dummy model, we first take a look at the correlation of the numeric values with the loan default

# numeric_only=True restricts the correlation to numeric columns (pandas >= 2.0 raises an error otherwise)
cleaned_df.corr(numeric_only=True)["loan_status"].sort_values(ascending = False)

# as we can see, loan_percent_income has the highest correlation (also in absolute terms) with a loan default

loan_status                   1.000000
loan_percent_income           0.379697
loan_int_rate                 0.334154
loan_amnt                     0.105736
cb_person_cred_hist_length   -0.016498
person_age                   -0.022698
person_emp_length            -0.087331
person_income                -0.145005
Name: loan_status, dtype: float64

In [99]:
cleaned_df.groupby('loan_status')['loan_percent_income'].describe()
# As we can see, 75% of the non-default loans have a loan_percent_income of 0.2 or lower,
# so we will use this value as the threshold for our simple dummy model to make predictions

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
loan_status,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,25327.0,0.148794,0.087252,0.0,0.08,0.13,0.2,0.83
1,7089.0,0.246906,0.132103,0.01,0.14,0.24,0.34,0.78


In [100]:
# to establish the model we create a new column "dummy_pred" in a copy of cleaned_df
# and fill the column with the value 0
dummy_df = cleaned_df.copy()
dummy_df["dummy_pred"] = 0

In [101]:
# now, as the prediction of the dummy model, we change the value in "dummy_pred" from 0 to 1
# for every row whose value in "loan_percent_income" is larger than 0.2
dummy_df.loc[dummy_df["loan_percent_income"] > 0.2, "dummy_pred"] = 1

In [102]:
dummy_df.sample(5)

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,dummy_pred
23094,32,51400,RENT,1.0,HOMEIMPROVEMENT,A,8000,7.51,0,0.16,N,10,0
1090,21,37232,RENT,3.0,MEDICAL,B,17500,12.53,1,0.47,N,2,1
27543,33,150000,MORTGAGE,17.0,VENTURE,A,14000,6.62,0,0.09,N,7,0
6379,22,36250,RENT,6.0,EDUCATION,C,6000,14.22,0,0.17,Y,3,0
32285,38,12000,OWN,0.0,EDUCATION,A,4800,7.29,1,0.4,N,12,1


In [103]:
# the performance of the dummy model can now be measured as the share of correct predictions.
# We select all rows in which the prediction equals the true label
# and count them with the .shape[0] attribute, which gives the number of rows of the selection.
# Then we divide the number of rows with correct predictions by the total number of rows
# of the dataframe.

dummy_df.loc[dummy_df["loan_status"] == dummy_df["dummy_pred"]].shape[0] / dummy_df.shape[0]

# the accuracy of the dummy model is about 72.3%

0.7229763079960513
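The same accuracy can be computed with `accuracy_score`, and it can be useful to put a majority-class baseline next to the threshold rule. For reference, the counts in the table above (25,327 non-defaults out of 32,416 loans) imply that always predicting "no default" would be correct for about 78.1% of the loans. A sketch on hypothetical toy values (the arrays below are made up):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy data: loan_percent_income values and true labels
loan_percent_income = np.array([0.05, 0.10, 0.15, 0.25, 0.30, 0.12, 0.40, 0.08, 0.22, 0.18])
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])

# Threshold rule from the notebook: predict default if the ratio exceeds 0.2
threshold_pred = (loan_percent_income > 0.2).astype(int)
threshold_acc = accuracy_score(y_true, threshold_pred)

# Majority-class baseline: always predicts the most frequent label
features = loan_percent_income.reshape(-1, 1)
majority = DummyClassifier(strategy="most_frequent").fit(features, y_true)
majority_acc = majority.score(features, y_true)

print(threshold_acc, majority_acc)
```

On an imbalanced target, accuracy alone can make a trivial rule look strong, which is one reason to keep both baselines in view.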

5.2 Decision tree

In [104]:
from sklearn.tree import DecisionTreeClassifier

#tree = DecisionTreeClassifier()
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train_processed, y_train)

In [105]:
# check the accuracy
from sklearn.metrics import accuracy_score

preds_tree_train = tree.predict(X_train_processed)
accuracy_score(preds_tree_train, y_train)

# With an accuracy of 100% the decision tree is overfitted to the train data.

1.0

In [106]:
# testing
preds_tree_test = tree.predict(X_test_processed)
accuracy_score(preds_tree_test, y_test)

0.8902827763496144
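A training accuracy of 100% next to a lower test accuracy is the typical sign of an unrestricted tree memorizing the training data. One common way to look at this is to compare it with a depth-limited tree. The sketch below uses synthetic stand-in data and a hypothetical `max_depth=4`; neither comes from the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the credit dataset), with some label noise
X_syn, y_syn = make_classification(n_samples=1000, n_features=10, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

# Unrestricted tree: grows until every training sample is classified correctly
full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# Depth-limited tree: max_depth=4 is a hypothetical value chosen for illustration
shallow_tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)

for name, model in [("full", full_tree), ("max_depth=4", shallow_tree)]:
    print(name, "train:", model.score(X_tr, y_tr), "test:", model.score(X_te, y_te))
```

The gap between train and test score, rather than the train score alone, is the quantity to watch when tuning the tree.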

5.2 KNN

In [107]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=150)

knn.fit(X_train_processed, y_train)

preds_knn = knn.predict(X_train_processed)
accuracy_score(preds_knn, y_train)

0.8795998413467895

In [108]:
preds_knn = knn.predict(X_test_processed)
accuracy_score(preds_knn, y_test)

0.8809254498714653

5.3 Random Forest

In [109]:
from sklearn.ensemble import RandomForestClassifier



In [110]:
random_forest = RandomForestClassifier(random_state= 42)

In [111]:
random_forest.fit(X_train_processed, y_train)

In [112]:
random_forest_pred_train = random_forest.predict(X_train_processed)

In [113]:
accuracy_score(random_forest_pred_train, y_train)
# an accuracy of 100% means the random forest is overfitted to the train data as well.

1.0

In [114]:
random_forest_pred_test = random_forest.predict(X_test_processed)

In [115]:
accuracy_score(random_forest_pred_test, y_test)

0.9346015424164524

# 6. Interim result
After cleaning and preprocessing the data, the random forest was the best-performing model with a test accuracy of 
93.46%, which clearly surpasses the dummy model used as the "benchmark" (accuracy 72.3%). 
<br>
For the next iterations we are going to stick with just the random forest, for several reasons:
1. A random forest reduces the overfitting problem of single decision trees because it builds many trees on random subsets of the data and combines the output of all the trees.
2. A random forest is usually robust to outliers.
3. The main disadvantage of a random forest, the computation time needed for the many decision trees, is irrelevant here because we are not doing any real-time calculations.
<br>
In the next step we will first optimise the hyperparameters of the overfitted random forest by running a grid search with cross-validation. After that we are going to try to improve the performance further by filling NaNs with a different strategy and rebalancing the dataset. We will then analyse the incorrect predictions of the optimised random forest specifically and attempt to find out why they were wrong.<br>
Afterwards we will use this knowledge to attempt to improve the performance even more.
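
The grid search mentioned above could look roughly like the following sketch. It runs on synthetic stand-in data, and the parameter grid, `n_estimators=25` and `cv=3` are hypothetical choices kept small for speed, not the settings of the later notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the credit dataset)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical, deliberately small grid of hyperparameters
param_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}

grid_search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=42),
    param_grid=param_grid,
    cv=3,                 # 3-fold cross-validation for every parameter combination
    scoring="accuracy",
)
grid_search.fit(X_syn, y_syn)

print(grid_search.best_params_, grid_search.best_score_)
```

Each combination is scored on held-out folds, so the selected parameters are chosen for validation performance rather than training accuracy.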