# Machine Learning

With the EDA and preprocessing completed in R, we can move on to building the machine learning models. We will try three algorithms: a decision tree, a random forest, and XGBoost. We will then evaluate their performance and decide which one performs best on this dataset.

In [1]:
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

In [2]:
train = pd.read_csv('./archive/train.csv')
train

Unnamed: 0,Age,Annual_Income,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Num_of_Loan,Delay_from_due_date,Num_Credit_Inquiries,Credit_Mix,Outstanding_Debt,...,OccupationEntrepreneur,OccupationJournalist,OccupationLawyer,OccupationManager,OccupationMechanic,OccupationMedia_Manager,OccupationMusician,OccupationScientist,OccupationTeacher,OccupationWriter
0,23,19114.12,3,4,3,4,3,4.0,2,809.98,...,0,0,0,0,0,0,0,1,0,0
1,23,19114.12,3,4,3,4,-1,4.0,2,809.98,...,0,0,0,0,0,0,0,1,0,0
2,23,19114.12,3,4,3,4,3,4.0,2,809.98,...,0,0,0,0,0,0,0,1,0,0
3,23,19114.12,3,4,3,4,5,4.0,2,809.98,...,0,0,0,0,0,0,0,1,0,0
4,23,19114.12,3,4,3,4,6,4.0,2,809.98,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,25,39628.99,4,6,7,2,23,3.0,2,502.38,...,0,0,0,0,1,0,0,0,0,0
99996,25,39628.99,4,6,7,2,18,3.0,2,502.38,...,0,0,0,0,1,0,0,0,0,0
99997,25,39628.99,4,6,5729,2,27,3.0,2,502.38,...,0,0,0,0,1,0,0,0,0,0
99998,25,39628.99,4,6,7,2,20,3.0,2,502.38,...,0,0,0,0,1,0,0,0,0,0


In [3]:
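# Separate the features from the target column.
x = train.drop('Credit_Score', axis=1)
y = train['Credit_Score']
# Hold out 10% of the rows for testing; the fixed seed keeps the split reproducible.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42)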

In [4]:
x_train.shape

(90000, 30)

In [5]:
y_train

51994    2
77540    1
16382    1
83439    1
61618    2
        ..
6265     2
54886    0
76820    1
860      1
15795    0
Name: Credit_Score, Length: 90000, dtype: int64

In [6]:
y_train.value_counts()

Credit_Score
1    47867
0    26053
2    16080
Name: count, dtype: int64
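The counts above show that the three classes are not evenly represented, so it may help to keep a naive baseline in mind when reading accuracy figures later. The sketch below recomputes the class shares from those counts and derives the accuracy of a model that always predicts the most frequent class. It assumes the test split has roughly the same class distribution as the training split, which this notebook does not verify; the names `class_counts` and `baseline_accuracy` are illustrative only.

```python
# Class counts copied from the y_train.value_counts() output above.
class_counts = {1: 47867, 0: 26053, 2: 16080}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# A model that always predicts the most frequent class would reach this
# accuracy on data with the same class distribution (an assumption here).
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = shares[majority_label] * 100

for label, share in sorted(shares.items()):
    print(f"class {label}: {share:.1%} of training rows")
print(f"majority-class baseline accuracy: {baseline_accuracy:.2f}%")
```

Any model worth keeping should beat this baseline by a clear margin.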

# Model Building

The cells below build the decision tree, random forest, and XGBoost models, all with default hyperparameters, and generate predictions on the test set.

In [7]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

In [8]:
dt = DecisionTreeClassifier()
rf = RandomForestClassifier()
xgb = XGBClassifier()

In [9]:
dt.fit(x_train, y_train)
rf.fit(x_train, y_train)
xgb.fit(x_train, y_train)

In [10]:
dtprediction = dt.predict(x_test)
rfprediction = rf.predict(x_test)
xgbprediction = xgb.predict(x_test)

# Evaluation

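We judge the models on four metrics: accuracy, precision, recall, and F1 score. Precision, recall, and F1 are macro-averaged, so each of the three classes counts equally regardless of its size. All values are reported as percentages.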
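To make the `average='macro'` setting used in this notebook concrete, here is a minimal from-scratch sketch of macro-averaged precision, recall, and F1. It computes each metric per class and then takes the unweighted mean across classes. The helper `macro_scores` and the toy labels are made up for illustration and are not part of the credit dataset or of scikit-learn.

```python
def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, f1) as fractions in [0, 1]."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Guard against empty denominators by scoring the class as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    # Unweighted mean: every class contributes equally, whatever its size.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels, invented for this example.
y_true_demo = [0, 0, 1, 1, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 1, 0, 2, 1]

macro_p, macro_r, macro_f1 = macro_scores(y_true_demo, y_pred_demo)
print(f"precision={macro_p:.3f} recall={macro_r:.3f} f1={macro_f1:.3f}")
```

Because the mean is unweighted, poor performance on a small class lowers the macro score as much as poor performance on a large one, which is useful when the classes are imbalanced.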

In [11]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [17]:
predictions = [dtprediction, rfprediction, xgbprediction]

accuracy, precision, recall, f1 = [], [], [], []

for prediction in predictions:
    accuracy.append(accuracy_score(y_test, prediction) * 100)
    precision.append(precision_score(y_test, prediction, average='macro') * 100)
    recall.append(recall_score(y_test, prediction, average='macro') * 100)
    f1.append(f1_score(y_test, prediction, average='macro') * 100)

finaldf = pd.DataFrame({'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'F1_Score': f1}, index = ['Decision Tree', 'Random Forest', 'XGBoost'])

finaldf.sort_values('Accuracy', ascending=False)

Model,Accuracy,Precision,Recall,F1_Score
Random Forest,80.23,79.334735,78.983727,79.13658
XGBoost,74.56,72.60111,73.7588,73.138585
Decision Tree,72.62,71.238225,70.939036,71.085906


# Conclusion

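With default hyperparameters and a single 90/10 train/test split, Random Forest gives the best result on this dataset, at about 80.2% accuracy and 79.1% macro F1. XGBoost follows at 74.6% accuracy, and the decision tree comes last at 72.6%.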
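The ranking above comes from one train/test split, so one way to check whether it is stable is k-fold cross-validation. The sketch below is not part of the original analysis: it uses a small synthetic three-class dataset from `make_classification` as a stand-in for the credit data, fixes `random_state` on the models (the notebook does not), and leaves out XGBoost to keep the example dependent on scikit-learn only. The names `X_demo`, `y_demo`, and `cv_summary` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data, for illustration only (not the credit dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1_macro")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: macro F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

On the real data, the same loop could be run with `x` and `y` in place of the synthetic arrays; a gap between models that is larger than the fold-to-fold spread would support the conclusion more strongly than a single split does.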