# Home Credit Risk Default: 
# Part 3: Training the Model
# Introduction: 
In this part, we load the balanced data we produced in Part 2 and use it to train a tree-based model (a random forest, which is an ensemble of decision trees).

To convert the .pkl data into a data frame, we use 'pd.read_pickle'.

In [1]:
import pandas as pd
unpickled_df = pd.read_pickle('balancedData.pkl')
print(unpickled_df)

        SK_ID_CURR  TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR  \
1           100003       0         Cash loans           F            N   
2           100004       0    Revolving loans           M            Y   
3           100006       0         Cash loans           F            N   
4           100007       0         Cash loans           M            N   
5           100008       0         Cash loans           M            N   
6           100009       0         Cash loans           F            Y   
7           100010       0         Cash loans           M            Y   
8           100011       0         Cash loans           F            N   
9           100012       0    Revolving loans           M            N   
10          100014       0         Cash loans           F            N   
11          100015       0         Cash loans           F            N   
12          100016       0         Cash loans           F            N   
13          100017       0         Cas
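
As a small illustration of the pickle round trip used above, the sketch below writes a made-up three-row frame to a temporary file and reads it back. The frame, its values, and the file name toy.pkl are assumptions for illustration only; they are not the notebook's data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in frame: column names mirror the real data, values are made up.
toy_df = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "TARGET": [0, 1, 0],
    "CODE_GENDER": ["F", "M", "F"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    toy_df.to_pickle(path)              # write the frame to disk
    restored_df = pd.read_pickle(path)  # read it back into a data frame

# Column names, values, and index survive the round trip.
same = restored_df.equals(toy_df)
print(same)
print(restored_df.dtypes)
```

Unlike a CSV file, a pickle keeps the column dtypes, so no re-parsing is needed after loading.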

# Non-numeric data: strings in the data frame
Looking at the data above reveals that some of the columns have non-numeric values, e.g., CODE_GENDER contains F or M values.
Before applying the tree-based model, we need to transform the non-numeric values into numeric ones. We need a dictionary that uniquely assigns numeric values to the non-numeric ones in each column. defaultdict(LabelEncoder) defines this dictionary, d, which creates one LabelEncoder per column on first access.
In the next line, we convert the values into numeric codes using the defined dictionary, d. Note that x.astype(str) converts every column to strings first, so the numeric columns are label-encoded as well.

In [2]:
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
d = defaultdict(LabelEncoder)
numeric_train_df = unpickled_df.apply(lambda x: d[x.name].fit_transform(x.astype(str)))
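
To see what the per-column encoders do, here is a minimal sketch on a made-up two-column frame. The names toy_df and encoders are assumptions for illustration. Each LabelEncoder sorts the distinct values of its column and numbers them from 0, and the fitted encoder can map the codes back to the original strings.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; the real frame has many more columns.
toy_df = pd.DataFrame({
    "CODE_GENDER": ["F", "M", "F", "M"],
    "FLAG_OWN_CAR": ["N", "Y", "N", "N"],
})

# One LabelEncoder per column, created on first access by the defaultdict.
encoders = defaultdict(LabelEncoder)
encoded_df = toy_df.apply(lambda col: encoders[col.name].fit_transform(col.astype(str)))

print(encoded_df)
print(list(encoders["CODE_GENDER"].classes_))  # sorted distinct values of the column

# The fitted encoder can translate the codes back to the original strings.
decoded = encoders["CODE_GENDER"].inverse_transform(encoded_df["CODE_GENDER"])
print(list(decoded))
```

Keeping the encoders in a dictionary keyed by column name is what makes the reverse mapping possible later.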

# Selecting inputs (features) and outputs (targets) for our model
To define a model based on the historical data in the balanced training set, we must determine its inputs and outputs.
We select all columns except the target and the applicant's ID as training features (the inputs of our model) and name them X.
We then select the TARGET values from numeric_train_df and call them y.

In [3]:
# All columns except the target and the applicant ID are used as features
columns_of_interest = numeric_train_df.drop(['TARGET', 'SK_ID_CURR'], axis=1).columns
X = numeric_train_df[columns_of_interest]
y = numeric_train_df.TARGET

# Splitting the training data
Now that X and y are ready, we can apply any machine learning algorithm we like and train a model that describes the relation between inputs and outputs in our training set. But wait! We need to keep some of the training data aside as a validation set to check the performance of our model. Therefore, we use train_test_split to split the data into training and validation sets. By default, 25% of the data is set aside for validation; we can change this percentage with the test_size argument of train_test_split.

In [4]:
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.25, random_state=20)

train_X and train_y contain the features and targets of the training set, and val_X and val_y hold those of the validation set.
We use .shape to check the sizes of the training and validation sets.

In [5]:
print('Training Features Shape:', train_X.shape)
print('Training Labels Shape:', train_y.shape)
print('Validation Features Shape:', val_X.shape)
print('Validation Labels Shape:', val_y.shape)

Training Features Shape: (435439, 120)
Training Labels Shape: (435439,)
Validation Features Shape: (145147, 120)
Validation Labels Shape: (145147,)
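
In the output above, 145147 of the 580586 rows (25%, rounded up) went to the validation set. The sketch below reproduces the same split on made-up data and adds the stratify argument of train_test_split, which keeps the class ratio the same in both parts. Using stratify is an optional variation, not what the notebook does, and X_toy and y_toy are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 features, perfectly balanced classes.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 50 + [1] * 50)

# Same test_size and random_state as in the notebook, plus stratification on the target.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=20, stratify=y_toy
)

print(X_tr.shape, X_val.shape)
print('Share of class 1 in the validation set:', y_val.mean())
```

Fixing random_state makes the split reproducible, so later model comparisons are made on the same validation rows.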


# Applying a Decision Tree Model to the training data:
Luckily, we can easily import the desired machine learning model, RandomForestClassifier here (an ensemble of decision trees), and use it to train our model. In the first step, we initiate the model structure and name it clf; by convention, clf stands for 'classifier'. In the next step, we train the classifier with '.fit' so that it learns how the training features relate to the training targets, train_y. This instruction may take a few minutes or more to fit the model to the training data. As a result, the specifications of the fitted classifier are displayed.

In [6]:
Method = 'RandomForestClassifier'  # We will use this name later when exporting the results
from sklearn.ensemble import RandomForestClassifier
# class_weight="balanced" reweights the classes; max_features=11 limits the features tried at each split
clf = RandomForestClassifier(class_weight="balanced", max_features=11)
clf.fit(train_X, train_y)

RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features=11,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
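
The specification above shows random_state=None, so repeated runs can produce slightly different forests. The sketch below, on synthetic data from make_classification, shows that fixing random_state makes the fit reproducible. The data, the helper fit_forest, and the parameter values are assumptions for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data.
X_toy, y_toy = make_classification(n_samples=300, n_features=12, random_state=0)


def fit_forest(seed):
    """Fit a small forest with a fixed seed (hypothetical helper)."""
    forest = RandomForestClassifier(
        n_estimators=10, class_weight="balanced", max_features=4, random_state=seed
    )
    return forest.fit(X_toy, y_toy)


# Two independent fits with the same seed give identical class probabilities.
first = fit_forest(seed=42).predict_proba(X_toy)
second = fit_forest(seed=42).predict_proba(X_toy)
reproducible = bool((first == second).all())
print(reproducible)
```

A fixed seed is useful when comparing models, because differences in the metrics then come from the settings and not from random variation.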

# Make Predictions on the Validation Set:
Now that clf is ready, we can predict the output for a given set of features (inputs) using the '.predict' method. We apply it to the validation features to predict the validation targets.

In [7]:
predictions = clf.predict(val_X)
predictions.shape

(145147,)

# Evaluate Classifier
We have predicted the output values (targets) for the validation data. Now, we can compare them with the real outputs and calculate the error.

In [8]:
import numpy as np
error = predictions - val_y
print('Mean Error:', round(np.mean(error), 4))

Mean Error: 0.0056


Alternatively: 

In [9]:
from sklearn.metrics import mean_absolute_error
print('Mean Absolute Error = %.2f'% (100*mean_absolute_error(val_y, predictions)),'%')

Mean Absolute Error = 0.56 %


The error is around 0.6%. This is a good indicator of the total error on the validation set. However, it does not show how the model performs on targets (TARGET = 1) and non-targets (TARGET = 0) separately. Other indicators, such as the log loss (stored as cv_scores below) and the F1 score, can better judge the performance of the trained model.

In [10]:
from sklearn.metrics import log_loss
cv_scores = log_loss(val_y, predictions)
print('cv_scores =', cv_scores)

cv_scores = 0.19465351100425762
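
The value above is a log loss computed from hard 0/1 predictions, where every wrong prediction receives the maximum (clipped) penalty, so it mostly restates the error rate. Log loss is usually computed from predicted probabilities, which in this notebook could be obtained with clf.predict_proba(val_X)[:, 1]. The sketch below uses made-up probabilities and checks the result against the formula; y_true and p_default are hypothetical.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 1, 1])
p_default = np.array([0.1, 0.4, 0.35, 0.8])

# Log loss as computed by scikit-learn.
loss_sklearn = log_loss(y_true, p_default)

# The same quantity from the definition: -mean(y*log(p) + (1-y)*log(1-p)).
loss_manual = -np.mean(
    y_true * np.log(p_default) + (1 - y_true) * np.log(1 - p_default)
)

print('log loss (sklearn) = %.4f' % loss_sklearn)
print('log loss (manual)  = %.4f' % loss_manual)
```

With probabilities, a confident wrong prediction is penalized more than a hesitant one, which the plain error rate cannot express.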


# Analyzing the model performance: False and true predictions

In [11]:
result = pd.DataFrame({'Validation': val_y, 'Predictions': predictions, 'Error': error})
result.describe()
#print(result)
# Error is +1 for a false positive, -1 for a false negative, and 0 for a correct prediction
print('Number of false predictions:', result['Error'].abs().sum())
print('Number of total predictions:', result['Error'].count())
# Use the absolute error here so that false negatives (-1) are also subtracted
print('Number of correct predictions:', result['Error'].count() - result['Error'].abs().sum())
fp = (result['Error'] == 1).sum()
fn = (result['Error'] == -1).sum()
tp = ((result['Predictions'] == 1) & (result['Validation'] == 1)).sum()
tn = ((result['Predictions'] == 0) & (result['Validation'] == 0)).sum()
print('Number of false positives:', fp)
print('Number of false negatives:', fn)
print('Number of true positives:', tp)
print('Number of true negatives:', tn)

Number of false predictions: 818
Number of total predictions: 145147
Number of correct predictions: 144329
Number of false positives: 818
Number of false negatives: 0
Number of true positives: 74354
Number of true negatives: 69975
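
The same four counts can also be read from scikit-learn's confusion_matrix, which for binary labels [0, 1] returns them in the order tn, fp, fn, tp when flattened. A minimal sketch on made-up labels (y_true and y_pred are hypothetical):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes; ravel() flattens row by row.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print('Number of false positives:', fp)
print('Number of false negatives:', fn)
print('Number of true positives:', tp)
print('Number of true negatives:', tn)
```

In the notebook, the equivalent call would take val_y and predictions as its two arguments.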


# Popular metrics: F1 score, precision, recall, and accuracy:

In [12]:
precision = round((tp*100)/(tp+fp),2)
print('Precision =', precision,'%')
recall = round((tp*100)/(tp+fn),2)
print('Recall = %.2f'%recall,'%')
F1 = round(2*(precision*recall)/(precision+recall),2)
print('F1 score = %.2f' %F1, '%')
accuracy = round(100*(tp +tn)/(tp+tn+fp+fn),2)
print('Accuracy = %.2f' %accuracy, '%')

Precision = 98.91 %
Recall = 100.00 %
F1 score = 99.45 %
Accuracy = 99.44 %
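
scikit-learn also provides these four metrics as ready-made functions, which can serve as a cross-check of the manual formulas. The sketch below uses made-up labels with 3 true positives, 1 false negative, 2 false positives, and 4 true negatives; y_true and y_pred are hypothetical.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: tp = 3, fn = 1, fp = 2, tn = 4.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision_toy = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall_toy = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1_toy = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
accuracy_toy = accuracy_score(y_true, y_pred)    # (tp + tn) / all predictions

print('Precision = %.2f %%' % (100 * precision_toy))
print('Recall = %.2f %%' % (100 * recall_toy))
print('F1 score = %.2f %%' % (100 * f1_toy))
print('Accuracy = %.2f %%' % (100 * accuracy_toy))
```

These functions return fractions between 0 and 1, so they are multiplied by 100 here to match the percentages printed above.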


The recall increased to 100%, with high accuracy and a high F1 score. There are no false negatives, which is very good: every applicant in the validation set whom we predicted to repay their loan did repay it.
There are a few applicants (the 818 false positives) whom we predicted not to repay their loan, although they did. So, the model is a bit conservative in identifying trustworthy applicants, which is acceptable. It identifies all the applicants in the validation set who did not repay their loan.

# Predict the TARGET column of the test data set
Now that the model is ready and its performance on the validation set is acceptable, we predict the targets of the test set.

In [13]:
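df_test = pd.read_csv('Data/application_test.csv')
Test_columns_of_interest = df_test.drop(['SK_ID_CURR'], axis=1).columns
Test_X = df_test[Test_columns_of_interest]

# Caution: fit_transform re-fits every encoder on the test data, so the integer
# codes are not guaranteed to match the ones learned from the training data.
# Using transform instead would reuse the training codes, but LabelEncoder
# raises an error on values it has not seen before.
p = clf.predict(Test_X.apply(lambda x: d[x.name].fit_transform(x.astype(str))))
f_result = pd.DataFrame({'SK_ID_CURR': df_test['SK_ID_CURR'], 'TARGET': p})
my_result = f_result.set_index('SK_ID_CURR')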
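
One possible way to keep the training and test encodings consistent is scikit-learn's OrdinalEncoder, which is fitted once on the training data and can map unseen test values to a reserved code. This is a sketch of an alternative technique, not the method used in this notebook. It assumes scikit-learn 0.24 or newer, and train_toy and test_toy are hypothetical frames.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training and test columns; the test column has a value never seen in training.
train_toy = pd.DataFrame({"NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"]})
test_toy = pd.DataFrame({"NAME_CONTRACT_TYPE": ["Revolving loans", "Cash loans", "Unseen type"]})

# Fit on the training data only; unseen categories are encoded as -1 instead of raising.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_toy)

train_codes = encoder.transform(train_toy)
test_codes = encoder.transform(test_toy)

print(train_codes.ravel())
print(test_codes.ravel())
```

Because the encoder is never re-fitted, 'Cash loans' receives the same code in both frames, which is what a model trained on the encoded training data expects.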

In [14]:
print('We predict that', f_result['TARGET'].sum(), 'out of', f_result['TARGET'].count(), 'applicants in total will not repay their loans.')
ratio = f_result['TARGET'].sum()/f_result['TARGET'].count()*100
print('In other words, based on the historical data given, %.2f' % ratio, '% of applicants are predicted not to repay their loans.')

We predict that 673 out of 48744 applicants in total will not repay their loans.
In other words, based on the historical data given, 1.38 % of applicants are predicted not to repay their loans.


# Export the result to SQL

In [15]:
import sqlite3
conn = sqlite3.connect('Summary.sqlite')
my_result.to_sql('MyResult', conn, if_exists='replace')
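
To check that the export worked, the table can be read back with pandas. The sketch below uses an in-memory SQLite database and a made-up three-row result so that it does not touch Summary.sqlite; toy_result and conn_toy are hypothetical names.

```python
import sqlite3

import pandas as pd

# Hypothetical predictions, indexed by applicant ID like my_result.
toy_result = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]}).set_index("SK_ID_CURR")

conn_toy = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
toy_result.to_sql("MyResult", conn_toy, if_exists="replace")

# The index is stored as an ordinary column, so it can be restored on reading.
read_back = pd.read_sql_query(
    "SELECT SK_ID_CURR, TARGET FROM MyResult", conn_toy, index_col="SK_ID_CURR"
)
conn_toy.close()

print(read_back)
```

Note that if_exists='replace' drops an existing table of the same name before writing, so each export overwrites the previous one.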

# What is next?
We need some trial and error to find the best model and to tune some parameters, e.g., by training with the original unbalanced data set and so on. The best way to compare different models is to collect the performance metrics of each method in one table and compare them side by side.

In [16]:
cur = conn.cursor()
cur.execute('DROP TABLE IF EXISTS Results')
cur.execute('''
CREATE TABLE Results (Method TEXT, Accuracy FLOAT, Precision FLOAT, F1 FLOAT, Recall FLOAT)''')
cur.execute('''INSERT INTO Results (Method, Accuracy, Precision, F1, Recall)
               VALUES (?, ?, ?, ?, ?)''', (Method, accuracy, precision, F1, recall))
conn.commit()
cur.close()
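
Once several methods have been stored, the table can be loaded into a data frame and sorted by the metric of interest. The sketch below uses an in-memory database holding only the row produced in this part (with the metric values printed above), so the comparison has a single entry; conn_toy and comparison are hypothetical names.

```python
import sqlite3

import pandas as pd

conn_toy = sqlite3.connect(":memory:")  # stand-in for Summary.sqlite
conn_toy.execute(
    "CREATE TABLE Results (Method TEXT, Accuracy FLOAT, Precision FLOAT, F1 FLOAT, Recall FLOAT)"
)
# The single row below repeats the metrics reported in this part.
conn_toy.execute(
    "INSERT INTO Results (Method, Accuracy, Precision, F1, Recall) VALUES (?, ?, ?, ?, ?)",
    ("RandomForestClassifier", 99.44, 98.91, 99.45, 100.0),
)
conn_toy.commit()

# Load all stored methods, best F1 score first.
comparison = pd.read_sql_query("SELECT * FROM Results ORDER BY F1 DESC", conn_toy)
conn_toy.close()

print(comparison)
```

To keep earlier rows when adding a new method, the DROP TABLE and CREATE TABLE statements would have to be skipped on later runs.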