# Model: Decision Tree

## Date: Nov 23, 2023

---

## Introduction

For the next model, a decision tree will be used. It adds variety to the models tried so far, since logistic regression is probability based and SVM is distance based. It also handles non-linearity better than linear SVM and logistic regression, and it trains much faster than SVMs on large datasets. Since the accuracies of the last few models stagnated, there may be some non-linearity in the data that they are not capturing well. A decision tree is therefore a good next model, although decision trees can be prone to overfitting.  

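Unlike with the previous models, the data does not need to be scaled, because tree splits depend only on the ordering of values within each feature. Categorical features still need to be numerically encoded, as scikit-learn's decision trees only accept numeric input. As the model is not linear, the linear assumptions do not need to be checked. 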

The hyperparameters to be used:  
1. max_depth: The maximum depth of the tree. This controls overfitting, as a higher depth allows the model to learn more detail from the training data.
2. min_samples_split: The minimum number of samples required to split an internal node. Higher values prevent the model from learning too much detail.
3. min_samples_leaf: The minimum number of samples required to be at a leaf node. Similar to min_samples_split, it controls overfitting.
4. max_features: The number of features to consider when looking for the best split.
5. max_leaf_nodes: Limits the number of leaf nodes in the tree, helping to control overfitting.
6. criterion: The function used to measure the quality of a split. Common criteria include 'gini' for Gini Impurity and 'entropy' for Information Gain.

--------

### Table of Contents

1. [Introduction](#Introduction)
   - [Table of Contents](#Table-of-Contents)
   - [Import Libraries](#Import-Libraries)
   - [Data Dictionary](#Data-Dictionary)
   - [Load the Data](#Load-the-Data)
2. [Decision Tree Model](#Decision-Tree-model)
   - [Preprocessing](#Preprocessing)
3. [Conclusion](#Conclusion)


### Import Libraries

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils import resample
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_selector

from statsmodels.stats.outliers_influence import variance_inflation_factor
from pathlib import Path
from sklearnex import patch_sklearn 


from helpers import display_corr_heatmap, data_dict

### Data Dictionary

In [3]:
data_dict()

| | LoanStatNew | Description |
|---|---|---|
| 0 | acc_now_delinq | The number of accounts on which the borrower is now delinquent. |
| 1 | acc_open_past_24mths | Number of trades opened in past 24 months. |
| 2 | addr_state | The state provided by the borrower in the loan application |
| 3 | all_util | Balance to credit limit on all trades |
| 4 | annual_inc | The self-reported annual income provided by the borrower during registration. |
| 5 | annual_inc_joint | The combined self-reported annual income provided by the co-borrowers during registration |
| 6 | application_type | Indicates whether the loan is an individual application or a joint application with two co-borrowers |
| 7 | avg_cur_bal | Average current balance of all accounts |
| 8 | bc_open_to_buy | Total open to buy on revolving bankcards. |
| 9 | bc_util | Ratio of total current balance to high credit/credit limit for all bankcard accounts. |


### Load the Data

In [11]:
# Define the relative path to the file
parquet_file_path = Path('../Data/Lending_club/model_cleaned')

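try:
    # Read the parquet file
    loans_df = pd.read_parquet(parquet_file_path)
except FileNotFoundError as e:
    # Print the full error message (e.args may hold a single element)
    print(e)
    print('Check file location')
    # Stop here, since the later cells need loans_df
    raise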

In [12]:
loans_df.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,purpose,dti,delinq_2yrs,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,collections_12_mths_ex_med,application_type,acc_now_delinq,tot_coll_amt,tot_cur_bal,total_rev_hi_lim,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
50867,23000.0,36,12.05,764.47998,3.0,RENT,50000.0,Source Verified,1,debt_consolidation,13.9,0.0,1.0,8.0,0.0,9976.0,45.599998,12.0,0.0,Individual,0.0,0.0,11973.0,21900.0,2.0,1710.0,11924.0,45.599998,0.0,0.0,110.0,119.0,5.0,5.0,0.0,5.0,2.0,4.0,4.0,4.0,6.0,5.0,5.0,7.0,4.0,8.0,0.0,1.0,81.800003,0.0,0.0,0.0,28207.0,11973.0,21900.0,6307.0
373353,35000.0,60,17.57,880.609985,2.0,RENT,110000.0,Verified,1,debt_consolidation,31.629999,0.0,0.0,27.0,0.0,50450.0,75.599998,38.0,0.0,Individual,0.0,541.0,304808.0,66700.0,3.0,11289.0,9169.0,81.599998,0.0,0.0,138.0,193.0,11.0,3.0,1.0,47.0,0.0,7.0,10.0,8.0,13.0,17.0,14.0,19.0,10.0,27.0,0.0,3.0,91.900002,50.0,0.0,0.0,336948.0,215616.0,49900.0,134250.0
185185,33100.0,36,13.99,1131.119995,1.0,MORTGAGE,72000.0,Source Verified,1,debt_consolidation,12.13,0.0,2.0,10.0,0.0,16993.0,37.599998,29.0,0.0,Individual,0.0,89.0,33172.0,45200.0,3.0,3686.0,8802.0,64.800003,0.0,0.0,143.0,247.0,10.0,10.0,0.0,43.0,0.0,3.0,4.0,3.0,9.0,7.0,9.0,22.0,4.0,10.0,0.0,1.0,100.0,33.299999,0.0,0.0,67371.0,33172.0,25000.0,22171.0
33164,1000.0,36,11.22,32.849998,10.0,RENT,40000.0,Verified,1,vacation,18.27,0.0,1.0,9.0,0.0,12175.0,39.799999,13.0,0.0,Individual,0.0,0.0,25333.0,30600.0,5.0,2815.0,17325.0,41.299999,0.0,0.0,19.0,132.0,1.0,1.0,0.0,1.0,0.0,4.0,4.0,6.0,8.0,2.0,7.0,11.0,4.0,9.0,0.0,3.0,100.0,16.700001,0.0,0.0,46125.0,25333.0,29500.0,15525.0
368586,4400.0,36,12.29,146.759995,3.0,RENT,34000.0,Verified,1,debt_consolidation,8.86,0.0,0.0,6.0,0.0,10915.0,78.0,17.0,0.0,Individual,0.0,0.0,10915.0,14000.0,0.0,2183.0,1985.0,84.599998,0.0,0.0,123.0,172.0,29.0,29.0,0.0,33.0,0.0,4.0,4.0,4.0,9.0,3.0,6.0,14.0,4.0,6.0,0.0,0.0,100.0,100.0,0.0,0.0,14000.0,10915.0,12900.0,0.0


### Decision Tree model

For the next model, a decision tree will be used. It adds variety to the models tried so far, since logistic regression is probability based and SVM is distance based. It also handles non-linearity better than linear SVM and logistic regression. Since the accuracies of the last few models stagnated, there may be some non-linearity in the data that they are not capturing well. A decision tree is therefore a good next model, although decision trees can be prone to overfitting.  

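Unlike with the previous models, the data does not need to be scaled, because tree splits depend only on the ordering of values within each feature. Categorical features still need to be numerically encoded, as scikit-learn's decision trees only accept numeric input. As the model is not linear, the linear assumptions do not need to be checked. 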
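The point about scaling can be checked with a small sketch. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the loans data) and compares a tree fitted on raw features with one fitted on standardised features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Lending Club features)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)

# Fit one tree on the raw features and one on standardised features
scaler = StandardScaler().fit(X_demo)
X_demo_scaled = scaler.transform(X_demo)
tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_demo_scaled, y_demo)

# Splits depend only on the ordering of values within a feature,
# so both trees should assign every sample to the same class
same = np.array_equal(tree_raw.predict(X_demo), tree_scaled.predict(X_demo_scaled))
print("Predictions identical after scaling:", same)
```

Standardisation is a monotonic transformation of each feature, so the candidate splits partition the samples in the same way before and after scaling.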

The hyperparameters to be used:  
1. max_depth: The maximum depth of the tree. This controls overfitting, as a higher depth allows the model to learn more detail from the training data.
2. min_samples_split: The minimum number of samples required to split an internal node. Higher values prevent the model from learning too much detail.
3. min_samples_leaf: The minimum number of samples required to be at a leaf node. Similar to min_samples_split, it controls overfitting.
4. max_features: The number of features to consider when looking for the best split.
5. max_leaf_nodes: Limits the number of leaf nodes in the tree, helping to control overfitting.
6. criterion: The function used to measure the quality of a split. Common criteria include 'gini' for Gini Impurity and 'entropy' for Information Gain.
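As a rough illustration of how the hyperparameters listed above constrain a tree, the sketch below compares an unconstrained tree with a constrained one. The data is synthetic and noisy (an assumption, generated with `make_classification`), and the specific values such as `max_depth=5` are illustrative rather than the ones chosen by the search.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (assumption: not the loans data)
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, n_informative=5,
                                     flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=0, stratify=y_demo)

# A fully grown tree versus one restricted by the six hyperparameters
unconstrained = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
constrained = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
    max_features=None,
    max_leaf_nodes=32,
    random_state=0,
).fit(X_tr, y_tr)

# Compare tree size and the gap between train and test accuracy
for name, model in [("unconstrained", unconstrained), ("constrained", constrained)]:
    print(f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}, "
          f"train acc={model.score(X_tr, y_tr):.3f}, test acc={model.score(X_te, y_te):.3f}")
```

The fully grown tree memorises the training set, while the constrained tree is forced to stop early. A large gap between train and test accuracy is the usual sign of overfitting.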

### Preprocessing

***Train test split***

In [13]:
# Separate the features from the target
X = loans_df.drop(columns=['loan_status'])
y = loans_df['loan_status']

# Split into train and test sets. Stratify so that any class imbalance in the original data is preserved in both sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=11, stratify=y)
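The effect of `stratify=y` in the split above can be seen on a toy example. The labels below are made up (an assumption: 80% class 1 and 20% class 0, not the real ratio), and the split uses the same `test_size` and `random_state` as the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (assumption: 800 of class 1, 200 of class 0)
y_demo = np.array([1] * 800 + [0] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=11, stratify=y_demo)

# With stratify, the share of class 1 is the same in both splits
print("train size:", len(y_tr), "share of class 1:", y_tr.mean())
print("test size:", len(y_te), "share of class 1:", y_te.mean())
```

Without `stratify`, the class shares in the two sets would vary with the random seed.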

***Data Imbalance***

Decision trees are sensitive to class imbalance, so the training data is balanced again by downsampling the majority class. 

In [14]:
print('Number of class 1 examples before:', X_train[y_train == 1].shape[0])

# Downsample majority class
X_downsampled, y_downsampled  = resample(X_train[y_train == 1],
                                   y_train[y_train == 1],
                                   replace=False,
                                   n_samples=X_train[y_train == 0].shape[0],
                                   random_state=1)

print('\nNumber of class 1 examples after:', X_downsampled.shape[0])

# Combine the downsampled successful loans with the failed loans.
# X is kept as a DataFrame so that the ColumnTransformer can select columns by name.
X_train_bal = pd.concat([X_train[y_train == 0], X_downsampled])
y_train_bal = np.hstack((y_train[y_train == 0], y_downsampled))

print("New X_train shape: ", X_train_bal.shape)
print("New y_train shape: ", y_train_bal.shape)
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)

Number of class 1 examples before: 247290

Number of class 1 examples after: 65003
New X_train shape:  (130006, 55)
New y_train shape:  (130006,)
X_test shape:  (133841, 55)
y_test shape:  (133841,)


No features need to be dropped, so there are now 25 extra features that were not available to the logistic regression and SVM models.

***Inspect Categorical Features***

Categorical features have to be numerically encoded. The encodings will be the same as in the other notebooks.

In [15]:
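# List the categorical (object dtype) columns
categorical_columns = X_train_bal.select_dtypes('object').columns.tolist()
display(categorical_columns)
# verification_status is ordered, so it is set aside for ordinal encoding
categorical_columns.remove('verification_status')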

['home_ownership', 'verification_status', 'purpose', 'application_type']

In [16]:
#instantiate onehot encoder
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

#instantiate ordinal encoder
ordinal_transformer = OrdinalEncoder(categories=[['Not Verified', 'Source Verified', 'Verified']])

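#combine into a ColumnTransformer
# NOTE: 'verification_status' appears in both column lists below, so it is one-hot
# encoded as well as ordinal encoded; passing categorical_columns to 'cat' would encode it only once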
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, ['home_ownership', 'verification_status', 'purpose', 'application_type']),
        ('ord', ordinal_transformer, ['verification_status'])],
    remainder='passthrough',
    n_jobs=2
)

#fit to the train set
preprocessor.fit(X_train_bal)

#transform the train and test sets
X_train_transformed = preprocessor.transform(X_train_bal)
X_test_transformed = preprocessor.transform(X_test)

print("Shape of train transformed: ", X_train_transformed.shape)
print("Shape of test transformed: ", X_test_transformed.shape)

Shape of train transformed:  (130006, 75)
Shape of test transformed:  (133841, 75)
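The cell above lists `verification_status` under both the one-hot and the ordinal transformer, so it is encoded twice. The sketch below shows the effect on the column count using a three-row toy frame (an assumption, not real loans) and a hypothetical helper `build` that mirrors the transformer set-up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy frame (assumption: three illustrative rows, not real loans)
toy = pd.DataFrame({
    "home_ownership": ["RENT", "MORTGAGE", "OWN"],
    "verification_status": ["Verified", "Not Verified", "Source Verified"],
    "loan_amnt": [1000.0, 23000.0, 4400.0],
})
order = [["Not Verified", "Source Verified", "Verified"]]

def build(one_hot_cols):
    """Hypothetical helper: one-hot, ordinal and passthrough parts in one transformer."""
    return ColumnTransformer(
        transformers=[
            ("cat", OneHotEncoder(handle_unknown="ignore"), one_hot_cols),
            ("ord", OrdinalEncoder(categories=order), ["verification_status"]),
        ],
        remainder="passthrough",
    )

# verification_status in both lists, as in the cell above
both = build(["home_ownership", "verification_status"]).fit_transform(toy)
# verification_status only ordinal encoded
ordinal_only = build(["home_ownership"]).fit_transform(toy)

print("Encoded twice:", both.shape)
print("Encoded once:", ordinal_only.shape)
```

On this toy frame the first version has three more columns than the second, one per level of `verification_status`. A tree can still train on the redundant columns, but they carry no additional information.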


***1st iteration***

Since decision trees have so many hyperparameters, it does not make sense to iterate over them manually one by one to find acceptable ranges for the final iteration. Instead, a randomized search, which evaluates a random sample of hyperparameter combinations, is more efficient. 

In [23]:
%%time
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score

# Decision tree classifier
dt_classifier = DecisionTreeClassifier(random_state=1)

# Hyperparameters grid to search
param_dist = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 7, 8, 9],
    "min_samples_split": [4, 5],
    "min_samples_leaf": [9, 10, 11],
    "max_features": [None, "sqrt", "log2"]
}

# Define scoring metrics
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score, average='weighted'),
    'recall': make_scorer(recall_score, average='weighted'),
    'f1': make_scorer(f1_score, average='weighted')
}

# Randomized search with cross-validation
random_search = RandomizedSearchCV(
    dt_classifier, 
    param_distributions=param_dist, 
    n_iter=100, 
    scoring=scoring, 
    refit='f1', 
    cv=5, 
    random_state=1, 
    verbose=10,
    n_jobs=2
)

# Perform the search
random_search.fit(X_train_transformed, y_train_bal)

# Best model
best_dt_model = random_search.best_estimator_

# Best hyperparameters
best_params = random_search.best_params_
print(f"Best parameters: {best_params}")

Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best parameters: {'min_samples_split': 3, 'min_samples_leaf': 10, 'max_features': None, 'max_depth': 8, 'criterion': 'entropy'}
CPU times: total: 9.45 s
Wall time: 11min 8s
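To make the size of the search above concrete, the sketch below counts the combinations in the search space and draws a random sample from it. It uses `ParameterGrid` and `ParameterSampler` from scikit-learn with the same `param_dist`, `n_iter` and `random_state` as the cell above; no models are fitted.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as the randomized search cell
param_dist = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 7, 8, 9],
    "min_samples_split": [4, 5],
    "min_samples_leaf": [9, 10, 11],
    "max_features": [None, "sqrt", "log2"],
}

# Every combination an exhaustive grid search would try
n_total = len(ParameterGrid(param_dist))

# A random sample of combinations; with list-valued parameters
# the sampling is done without replacement
sampled = list(ParameterSampler(param_dist, n_iter=100, random_state=1))

print("Combinations in the full grid:", n_total)
print("Combinations sampled:", len(sampled))
print("Fits with cv=5:", len(sampled) * 5)
print("First sampled candidate:", sampled[0])
```

The full grid holds 2 × 4 × 2 × 3 × 3 = 144 combinations, so sampling 100 of them with 5 folds gives the 500 fits reported in the log. With a space this small, the saving over an exhaustive grid search is modest; the benefit grows as more values are added per hyperparameter.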


In [22]:
# Best hyperparameters
best_params = random_search.best_params_
print(f"Best parameters: {best_params}")

Best parameters: {'min_samples_split': 5, 'min_samples_leaf': 8, 'max_features': None, 'max_depth': 10, 'criterion': 'entropy'}


In [24]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score, roc_curve, auc
import matplotlib.pyplot as plt

# Make predictions on the test data
y_pred = best_dt_model.predict(X_test_transformed)

# Calculate various metrics
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Print the results
print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

Accuracy: 0.6209308059563213
Confusion Matrix:
[[18639  9220]
 [41515 64467]]
Classification Report:
              precision    recall  f1-score   support

           0       0.31      0.67      0.42     27859
           1       0.87      0.61      0.72    105982

    accuracy                           0.62    133841
   macro avg       0.59      0.64      0.57    133841
weighted avg       0.76      0.62      0.66    133841
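The figures in the classification report can be reproduced by hand from the confusion matrix, which helps when reading them. The sketch below is plain Python and uses only the four counts printed above; rows are true classes and columns are predicted classes, following scikit-learn's convention.

```python
# Confusion matrix reported above: rows are true classes, columns are predicted classes
conf = [[18639, 9220],
        [41515, 64467]]

total = sum(sum(row) for row in conf)
accuracy = (conf[0][0] + conf[1][1]) / total

metrics = {}
for k in (0, 1):
    predicted_k = conf[0][k] + conf[1][k]  # column sum: samples predicted as class k
    actual_k = sum(conf[k])                # row sum: true samples of class k (support)
    precision = conf[k][k] / predicted_k
    recall = conf[k][k] / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    metrics[k] = (precision, recall, f1)
    print(f"class {k}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={actual_k}")

print(f"accuracy={accuracy:.4f}")
```

The low precision for class 0 comes from the 41515 class 1 loans predicted as class 0, which outnumber the 18639 correct class 0 predictions.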



### Conclusion

Note that decision trees do not need to be checked for multicollinearity, which leaves extra features compared to logistic regression and SVM, so comparisons between the models should be taken with a grain of salt. 