# **Loan Default Prediction**: *Model Training*
In the previous notebook, we trained our baseline model using the `LogisticRegression()` algorithm. Although it gave us a good result, it had one limitation:
- Despite applying hyperparameter tuning with `RandomizedSearchCV`, the model's results did not change.

This meant that the model had reached its limit, scoring 69% on all the metrics and 76% ROC-AUC.

In this notebook, we pick up from there and use `DecisionTreeClassifier()` and `RandomForestClassifier()` to boost our model performance.

## **Data Preparation**
In this step, we import all the necessary packages, load the cleaned data, and split the data into training and test datasets.

In [1]:
# Import important dependencies
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report, roc_curve, roc_auc_score

In [2]:
# Load the cleaned data
loans_info = pd.read_csv("../Data/CleanData/Cleaned_Loans_Data.csv")

# Preview
loans_info.head()

Unnamed: 0,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner,Default
0,56,85994,50587,520,80,4,15.23,36,0.44,Bachelor's,Full-time,Divorced,Yes,Yes,Other,Yes,0
1,69,50432,124440,458,15,1,4.81,60,0.68,Master's,Full-time,Married,No,No,Other,Yes,0
2,46,84208,129188,451,26,3,21.17,24,0.31,Master's,Unemployed,Divorced,Yes,Yes,Auto,No,1
3,32,31713,44799,743,0,3,7.07,24,0.23,High School,Full-time,Married,No,No,Business,No,0
4,60,20437,9139,633,8,4,6.51,48,0.73,Bachelor's,Unemployed,Divorced,No,Yes,Auto,No,0


In [3]:
# Assign predictors and target variable to their respective variables.
X = loans_info.drop("Default", axis=1)
y = loans_info["Default"]

# Split the data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
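
The split above passes `stratify=y`, which keeps the share of defaults the same in the train and test sets. The sketch below illustrates this on a small made-up dataset; the `toy` frame and its 20% default rate are assumptions for illustration only, not values taken from the loans data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows with a 20% positive class (not the real loans data)
toy = pd.DataFrame({
    "feature": range(100),
    "Default": [1] * 20 + [0] * 80,
})

X_toy = toy.drop("Default", axis=1)
y_toy = toy["Default"]

# Same split settings as the notebook cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both subsets keep the original class proportion
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train default rate: {train_rate:.2f}")
print(f"test default rate:  {test_rate:.2f}")
```

Running the same comparison on `y_train.mean()` and `y_test.mean()` is a quick way to confirm the real split behaves the same way.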

## **`DecisionTreeClassifier()`**
First, we use the `DecisionTreeClassifier()` algorithm because:
- `DecisionTreeClassifier()` can be a better choice than `LogisticRegression()` in certain situations, thanks to its ease of interpretation and its ability to handle non-linear relationships and outliers. During our EDA, we noticed that the predictors had a negative relationship with the target variable.
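
To make the point about non-linear relationships concrete, here is a minimal sketch on synthetic XOR-style data, where no straight line separates the two classes. The data and the variable names (`X_xor`, `y_xor`) are assumptions made up for this illustration and say nothing about how the loans data behaves.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic XOR-style data: the label is 1 when exactly one feature is positive
rng = np.random.default_rng(0)
X_xor = rng.uniform(-1, 1, size=(1000, 2))
y_xor = ((X_xor[:, 0] > 0) ^ (X_xor[:, 1] > 0)).astype(int)

X_a, X_b, y_a, y_b = train_test_split(X_xor, y_xor, test_size=0.25, random_state=0)

# A linear decision boundary cannot separate this pattern
log_acc = LogisticRegression().fit(X_a, y_a).score(X_b, y_b)

# A tree can carve the plane into axis-aligned regions
tree_acc = DecisionTreeClassifier(random_state=0).fit(X_a, y_a).score(X_b, y_b)

print(f"logistic regression accuracy: {log_acc:.2f}")
print(f"decision tree accuracy:       {tree_acc:.2f}")
```

On this toy problem the tree should score well above the linear model on the held-out rows; whether a similar gap appears on the loans data is exactly what the rest of the notebook tests.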

Before we start modeling, we preprocess the data according to the needs of this specific algorithm:
- No standardization.
- Ordinal encoding of the categorical data.

In [6]:
# Copy train data
X_dt = X_train.copy()
y_dt = y_train.copy()

# Identify categorical columns
cat_cols = X_train.select_dtypes(include='object').columns

# Encode the categorical columns
encoder = OrdinalEncoder()
X_dt[cat_cols] = encoder.fit_transform(X_dt[cat_cols])

X_dt.head()

Unnamed: 0,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner
15826,45,37039,247916,624,19,2,4.62,60,0.85,1.0,3.0,1.0,0.0,1.0,0.0,0.0
147371,48,133963,66275,494,119,3,14.72,48,0.49,3.0,2.0,1.0,1.0,1.0,4.0,1.0
178180,47,100204,6967,718,108,3,5.51,24,0.76,1.0,2.0,2.0,1.0,1.0,1.0,1.0
126915,42,36078,25966,344,2,3,18.29,36,0.76,0.0,2.0,1.0,0.0,1.0,2.0,0.0
163930,20,99464,248557,318,74,3,19.45,60,0.45,0.0,1.0,0.0,0.0,0.0,0.0,1.0
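
The cell above fits the encoder on the training data only. When the test set is encoded later, the usual approach is to reuse the fitted encoder with `transform()` rather than calling `fit_transform()` again, so both sets share the same category-to-number mapping. A minimal sketch on made-up rows follows; `train_toy`, `test_toy`, and `toy_cat_cols` are hypothetical names, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frames standing in for X_train and X_test
train_toy = pd.DataFrame({
    "Age": [45, 48, 47],
    "Education": ["Bachelor's", "PhD", "High School"],
    "HasMortgage": ["No", "Yes", "Yes"],
})
test_toy = pd.DataFrame({
    "Age": [30, 52],
    "Education": ["PhD", "Bachelor's"],
    "HasMortgage": ["Yes", "No"],
})

toy_cat_cols = ["Education", "HasMortgage"]

toy_encoder = OrdinalEncoder()
train_enc = train_toy.copy()
test_enc = test_toy.copy()

# Learn the category mapping from the training rows only
train_enc[toy_cat_cols] = toy_encoder.fit_transform(train_toy[toy_cat_cols])

# Reuse the same mapping for the test rows
test_enc[toy_cat_cols] = toy_encoder.transform(test_toy[toy_cat_cols])

print(toy_encoder.categories_)
print(test_enc)
```

If the test set might contain a category that never appears in training, `OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)` is one option to avoid an error at `transform()` time.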


### **Model 1: `DecisionTreeClassifier()` baseline model**