# Loan Approval Prediction Kaggle Competition
## October 28, 2024
DICHOSO, Aaron Gabrielle C.

This notebook is part of a series of notebooks documenting the methods used to train a Decision Tree Classifier (DTC) for the <a href="https://www.kaggle.com/competitions/playground-series-s4e10/"><b>2024 Loan Approval Prediction Kaggle Playground Series</b></a>.


For this notebook, I will focus on the methods that I utilized for model training and hyperparameter tuning.

To view the data cleaning itself, feel free to visit the following notebook: 

<ul>
    <li>1. Data Exploration, Cleaning, and Transformations</li>
</ul>



I chose to train DTCs because, from exploring and cleaning the data, I found that it contained many categorical features. A DTC does not require an embedding space for the vector representation of its features; it instead builds a tree that best splits the samples according to those features. It can therefore work directly with the one-hot encoded categorical columns in the cleaned data, hence my decision.

# 1. Import Cleaned Datasets

We first import the cleaned datasets from the previous notebook. In this repository, the cleaned datasets are saved in the <b>./outputs</b> directory.

In [1]:
##Python libraries
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier

Two kinds of training datasets are used: the training set without oversampling, and the training set that underwent ADASYN oversampling. I wish to compare the performance of the model on these two datasets during the hyperparameter tuning phase.

In [2]:
##Import Training Dataset
loans_train_df = pd.read_csv('./outputs/cleaned_loans_train.csv')
loans_train_df

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,loan_status,PERSON_HOME_OWNERSHIP_MORTGAGE,...,LOAN_GRADE_B,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below
0,1.569797,-1.081318,-1.896898,-0.578305,0.390423,0.117380,0,1.719062,0,0,...,1,0,0,0,0,0,1,0,0,0
1,-0.921741,-0.052550,0.601227,-0.937769,0.896212,-0.973222,0,-1.364513,0,0,...,0,1,0,0,0,0,0,0,0,1
2,0.240977,-1.508084,0.923860,-0.578305,-0.470628,0.553620,0,1.185873,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0.407079,0.435878,1.579649,0.500086,0.277050,0.117380,0,0.087481,0,0,...,1,0,0,0,0,0,0,0,1,0
4,-0.921741,0.098465,-0.486519,-0.578305,-1.318902,-0.646041,0,-0.721995,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58639,1.071489,1.615661,0.403331,2.836600,1.496063,0.553620,1,1.185873,0,1,...,0,0,1,0,0,0,0,0,1,0
58640,0.074874,-1.508084,-1.896898,0.140622,0.735902,2.080462,0,0.832270,1,0,...,0,1,0,0,0,0,0,0,1,0
58641,-0.755638,-0.580418,0.772652,-0.434520,1.506614,-0.100741,0,-1.364513,1,0,...,0,0,1,0,0,0,0,0,0,1
58642,-0.921741,-1.418731,-0.486519,-0.758037,-0.470628,0.117380,0,-0.721995,0,0,...,0,0,0,0,0,0,0,0,0,1


In [18]:
loans_train_ada_df = pd.read_csv('./outputs/cleaned_loans_train_ada.csv')
loans_train_ada_df

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,PERSON_HOME_OWNERSHIP_MORTGAGE,PERSON_HOME_OWNERSHIP_OTHER,...,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below,loan_status
0,1.569797,-1.081318,-1.896898,-0.578305,0.390423,0.117380,0,1.719062,0,0,...,0,0,0,0,0,1,0,0,0,0
1,-0.921741,-0.052550,0.601227,-0.937769,0.896212,-0.973222,0,-1.364513,0,0,...,1,0,0,0,0,0,0,0,1,0
2,0.240977,-1.508084,0.923860,-0.578305,-0.470628,0.553620,0,1.185873,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0.407079,0.435878,1.579649,0.500086,0.277050,0.117380,0,0.087481,0,0,...,0,0,0,0,0,0,0,1,0,0
4,-0.921741,0.098465,-0.486519,-0.578305,-1.318902,-0.646041,0,-0.721995,0,0,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
101122,0.163089,-1.281435,-1.896898,0.331528,0.751306,2.022542,0,0.719892,0,0,...,1,0,0,0,0,0,0,1,0,1
101123,0.074874,-1.416344,-1.542915,0.283617,0.780451,2.210615,0,0.832270,0,0,...,1,0,0,0,0,0,0,1,0,1
101124,-0.912367,-0.829553,0.203323,-0.366684,1.789834,0.207975,0,-1.364513,0,0,...,0,1,0,0,0,0,0,0,1,1
101125,-0.591730,-0.347942,0.921863,-0.221690,1.283043,0.006879,0,-1.364513,0,0,...,0,1,0,0,0,0,0,0,1,1


# 2. Hyperparameter Tuning

The decision tree classifier (DTC) has several hyperparameters that should be tuned to maximize its performance.

In this notebook, I will focus on tuning the following <a href="https://scikit-learn.org/dev/modules/generated/sklearn.tree.DecisionTreeClassifier.html">hyperparameters of the DTC</a> for their explainability, namely:

<ol>
    <li><b>criterion:</b> the formula used for determining the quality of the split performed at a node</li>
    <li><b>splitter:</b> the method used to split each node</li>
    <li><b>max_depth:</b> the maximum depth of the decision tree</li>
    <li><b>min_samples_split:</b> the minimum number of samples required for a node to be able to split</li>
    <li><b>max_features:</b> the maximum number of features to consider when looking for the best way to split a node</li>
    <li><b>oversampling_method:</b> The type of oversampling done in the dataset used.</li>
</ol>

The ranges of values I chose for these hyperparameters are as follows:
<ol>
    <li><b>criterion:</b> gini impurity or entropy.</li>
    <li><b>splitter:</b> "best" refers to using the feature that splits the node the best according to the criterion, while "random" refers to the best random split.</li>
    <li><b>max_depth:</b> A range from 1 to 200.</li>
    <li><b>min_samples_split:</b> A fraction, referring to the percentage of samples in the dataset for the minimum number of samples required, with a minimum value of 1e-8.</li>
    <li><b>max_features:</b> "sqrt" refers to considering sqrt(number of features) features for a split, "log2" refers to considering log2(number of features) features for a split, and None refers to considering all features for a split.</li>
    <li><b>oversampling_method:</b> Either the ADASYN oversampled dataset or the dataset with unbalanced labels.</li>
</ol>

In [4]:
df_hyper_tuning = pd.DataFrame(columns=['criterion', 'splitter', 'max_depth', 'min_samples_split', 'max_features', 'oversampling_method', 'roc_auc'])
df_hyper_tuning

Unnamed: 0,criterion,splitter,max_depth,min_samples_split,max_features,oversampling_method,roc_auc


For the specific method of hyperparameter tuning, I chose Bayesian optimization, which uses the results of past iterations of the tuning process to decide which configurations to test next. I chose this method over GridSearch because the number of possible configurations is too large to search exhaustively on my system. Additionally, I chose it over RandomSearch because Bayesian optimization uses my system resources more effectively: it concentrates the search on regions that are more likely to give high performance instead of testing configurations at random.
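
To give a rough sense of why an exhaustive grid is impractical here, the sketch below counts the configurations a grid would contain. The 100-point discretization of min_samples_split is purely an assumption for illustration (the actual search treats it as continuous), and the 500 evaluations refer to the n_calls value used later in this notebook.

```python
from math import prod

# Number of candidate values per hyperparameter if the space were gridded.
# The 100-point grid for min_samples_split is a hypothetical discretization;
# the actual search treats it as a continuous value in [1e-8, 1.0].
grid_sizes = {
    "criterion": 2,            # gini, entropy
    "splitter": 2,             # best, random
    "max_depth": 200,          # integers 1..200
    "min_samples_split": 100,  # hypothetical discretization
    "max_features": 3,         # sqrt, log2, None
    "oversampling_method": 2,  # none, ada
}

n_grid_configs = prod(grid_sizes.values())
n_bayes_calls = 500  # n_calls passed to gp_minimize in this notebook

print("Grid configurations:", n_grid_configs)
print("Fraction covered by 500 evaluations:", n_bayes_calls / n_grid_configs)
```

Under these assumptions, each grid configuration would also need its own 3-fold cross-validation, which is what makes the exhaustive approach costly.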

To perform bayesian optimization, I utilized the <a href="https://scikit-optimize.github.io/stable/modules/generated/skopt.gp_minimize.html">gp_minimize()</a> function provided by the <a href="https://scikit-optimize.github.io/stable/modules/generated/skopt.gp_minimize.html">scikit-optimize library</a>.

Following the instructions found in the documentation, I first initialized the search space to be used in the optimization process. This involved creating a list that specifies the hyperparameters to tune, the data type of each hyperparameter, and the range of possible values to test during tuning.

Afterwards, I created the objective function that the gp_minimize() function will execute. The objective function receives a candidate value for each hyperparameter in the search space defined earlier and evaluates it. It then returns the negative of the Area Under the ROC Curve (AUC) obtained from 3-fold cross-validation, since gp_minimize() minimizes the value returned. Configurations that raise an error during the tuning process are given a large positive value, steering the Bayesian optimization process away from such configurations.
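
The sign convention can be illustrated with a small, self-contained toy that does not use scikit-optimize at all. Here, toy_auc() is a hypothetical stand-in for the cross-validated AUC, and a plain min() over a few candidates stands in for the optimizer; only the idea of "negate the score, penalize failures" is taken from this notebook.

```python
def toy_auc(max_depth):
    """Hypothetical stand-in for a cross-validated AUC; not the real model."""
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    # Peaks at max_depth == 6, purely for illustration.
    return 0.9 - 0.01 * abs(max_depth - 6)

PENALTY = 100000  # same penalty value the notebook returns for invalid configs

def objective(max_depth):
    """Value to minimize: negative score, or a large penalty on failure."""
    try:
        return -toy_auc(max_depth)
    except ValueError:
        return PENALTY

candidates = [0, 2, 6, 12]
values = {d: objective(d) for d in candidates}
best_depth = min(values, key=values.get)  # smallest objective = highest score

print(values)
print("Best max_depth:", best_depth)
```

Because the minimizer looks for the smallest value, the candidate with the highest score wins, and the invalid candidate (max_depth of 0) is never preferred.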

I also limit the number of calls performed by the function due to my limited resources.

In [5]:
from sklearn.model_selection import cross_val_score
from skopt import gp_minimize
from skopt.space import Real, Integer, Categorical
from skopt.utils import use_named_args
# Define the search space
search_space = [
    Categorical(['gini', 'entropy'], name='criterion'),
    Categorical(['best', 'random'], name='splitter'),
    Integer(1, 200, name='max_depth'),
    Real(1e-8, 1.0, name='min_samples_split'),
    Categorical(['sqrt', 'log2', None], name='max_features'),
    Categorical(['none', 'ada'], name='oversampling_method')
]

# Define the objective function: gp_minimize minimizes, so it returns the negative mean ROC AUC
@use_named_args(search_space)
def objective_function(criterion, splitter, max_depth, min_samples_split, max_features, oversampling_method):
    print("================")
    print("Configuration:")
    print("Criterion:", criterion)
    print("Splitter:", splitter)
    print("Max Depth:", max_depth)
    print("Min Samples Split:", min_samples_split)
    print("Max Features:", max_features)
    print("Oversampling Method:", oversampling_method)
    print("----------------")
    try:
        if oversampling_method == 'none':
            X = loans_train_df.loc[:, loans_train_df.columns != "loan_status"]
            y = loans_train_df["loan_status"]
        elif oversampling_method == 'ada':
            X = loans_train_ada_df.loc[:, loans_train_ada_df.columns != "loan_status"]
            y = loans_train_ada_df["loan_status"]
            
        model = DecisionTreeClassifier(class_weight='balanced', criterion=criterion, splitter=splitter, max_depth=max_depth, min_samples_split=min_samples_split, max_features=max_features)
        roc_auc = cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean()

        print("Results:", -roc_auc)
        df_hyper_tuning.loc[len(df_hyper_tuning.index)] = [criterion, splitter, max_depth, min_samples_split, max_features, oversampling_method, roc_auc] 
        print("================")
        return -roc_auc
    except Exception:
        # Any failure (e.g. an invalid configuration) is penalized with a large positive value
        print("Invalid Config")
        return 100000
        

# Perform Bayesian Optimization
res = gp_minimize(objective_function, search_space, n_calls=500)

# Print best parameters
print("Best parameters:", res.x)


Configuration:
Criterion: gini
Splitter: best
Max Depth: 195
Min Samples Split: 0.8144094203590194
Max Features: sqrt
Oversampling Method: ada
----------------
Results: -0.6288689686702629
Configuration:
Criterion: gini
Splitter: random
Max Depth: 37
Min Samples Split: 0.4295083636030497
Max Features: log2
Oversampling Method: ada
----------------
Results: -0.6759119789416145
Configuration:
Criterion: entropy
Splitter: random
Max Depth: 103
Min Samples Split: 0.9592417451472894
Max Features: None
Oversampling Method: ada
----------------
Results: -0.5965319026367978
Configuration:
Criterion: gini
Splitter: random
Max Depth: 188
Min Samples Split: 0.7603885776165377
Max Features: sqrt
Oversampling Method: ada
----------------
Results: -0.6258685073217751
Configuration:
Criterion: gini
Splitter: best
Max Depth: 176
Min Samples Split: 0.39415985063422715
Max Features: sqrt
Oversampling Method: none
----------------
Results: -0.8021733447465865
Configuration:
Criterion: entropy
Splitter: b

These tuning results are saved in a dataframe. The hyperparameter tuning results for all models can be viewed in the <b>./hyper_tuning</b> directory.

In [6]:
df_hyper_tuning.sort_values(by=['roc_auc'], ascending=False)

Unnamed: 0,criterion,splitter,max_depth,min_samples_split,max_features,oversampling_method,roc_auc
27,entropy,random,142,4.703690e-04,,ada,0.936239
26,entropy,best,145,1.713412e-02,,none,0.925171
126,gini,best,90,3.698953e-02,,none,0.921511
206,entropy,best,147,5.111672e-02,,none,0.919239
209,entropy,best,146,5.255929e-02,,none,0.919186
...,...,...,...,...,...,...,...
133,entropy,random,1,9.552860e-01,log2,ada,0.598228
2,entropy,random,103,9.592417e-01,,ada,0.596532
8,gini,random,5,8.175817e-01,sqrt,none,0.585911
101,entropy,random,200,1.000000e+00,log2,ada,0.565203
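
As a small aside, the best row can also be pulled out programmatically instead of being read off the sorted table. The sketch below rebuilds only three rows copied from the output above (it is not the full tuning log) and uses pandas' idxmax() to locate the highest roc_auc.

```python
import pandas as pd

# Three rows copied from the sorted output above (index labels preserved).
sample = pd.DataFrame(
    {
        "criterion": ["entropy", "entropy", "gini"],
        "splitter": ["random", "best", "best"],
        "max_depth": [142, 145, 90],
        "min_samples_split": [4.703690e-04, 1.713412e-02, 3.698953e-02],
        "max_features": [None, None, None],
        "oversampling_method": ["ada", "none", "none"],
        "roc_auc": [0.936239, 0.925171, 0.921511],
    },
    index=[27, 26, 126],
)

best_idx = sample["roc_auc"].idxmax()  # index label of the highest AUC
best_row = sample.loc[best_idx]
print(best_idx, best_row["criterion"], best_row["max_depth"])
```

Applied to the full df_hyper_tuning dataframe, the same two lines would return the configuration reported in the Results section.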


In [7]:
df_hyper_tuning.to_csv('hyper_tuning/dtc_hyper_tuning.csv', index=False, header=True, encoding='utf-8')

# 3. Results

The results showed that the following configuration produced the best results:

<ol>
    <li><b>criterion: </b>entropy</li>
    <li><b>splitter: </b>random</li>
    <li><b>max_depth: </b>142</li>
    <li><b>min_samples_split: </b>4.703690e-04 = 0.04703690%</li>
    <li><b>max_features: </b>None</li>
    <li><b>oversampling_method: </b>ADASYN Oversampling</li>
</ol>

Diving deeper into the results, we can see that the entropy criterion performed the best. This may be because entropy is less affected by the distribution of classes in a dataset, owing to its logarithmic nature, as opposed to the Gini impurity criterion, which is a quadratic function of the class proportions and obtained the lower average AUC. Recall how the logarithmic transformation used during the data cleaning process scaled down outlier values and lessened the skewness of the data; a similar idea may apply here.

Considering how heavily skewed the classes in the original training set were (four of the five best configurations used no oversampling at all, even though the single best one used ADASYN), a criterion that performs well despite the class distribution is favorable. In this case, the entropy criterion appears to have been less affected by the skewed class distribution than the Gini impurity criterion.
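
For reference, the two criteria can be compared directly on a binary node. The sketch below uses the standard textbook formulas for Gini impurity and Shannon entropy and evaluates them at a few class proportions chosen for illustration. It only shows how the two measures behave as a node becomes more imbalanced; it does not by itself establish which criterion is more robust on this dataset.

```python
from math import log2

def gini(p):
    """Gini impurity of a binary node with positive-class proportion p."""
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)

def entropy(p):
    """Shannon entropy (in bits) of a binary node with positive-class proportion p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node has zero entropy
    return -(p * log2(p) + (1.0 - p) * log2(1.0 - p))

# Illustrative proportions, from balanced to heavily imbalanced.
for p in (0.5, 0.25, 0.1, 0.01):
    print(f"p={p:<5} gini={gini(p):.4f} entropy={entropy(p):.4f}")
```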

We can also observe that the best configuration for the DTC considers all the features at every node (max_features=None) and, with the "random" splitter, picks the best among randomly drawn candidate splits according to entropy. Considering all features is advantageous compared to the other configurations, as it does not restrict the set of features that the DTC can choose from when splitting a node.

Additionally, the minimum number of samples needed for a node to split is fairly low: only 0.04703690% of the training samples, or around 48 of the 101,126 samples in the ADASYN-oversampled set, need to be present in a node for it to be split. In this case, a lower minimum may be beneficial compared to a higher one, as it allows the DTC to perform more splits overall.
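
The conversion from a fraction to a sample count can be checked with a few lines of arithmetic. scikit-learn documents that a float min_samples_split is turned into ceil(min_samples_split * n_samples); the row count below is read off the ADASYN dataframe shown earlier (indices 0 to 101125), and the per-fold training size is an approximation for 3-fold cross-validation.

```python
from math import ceil

min_samples_split = 4.703690e-04  # best value from the tuning table
n_full = 101126                   # rows in the ADASYN training set (indices 0..101125)
n_cv_train = n_full * 2 // 3      # approximate training size per fold in 3-fold CV

full_threshold = ceil(min_samples_split * n_full)    # when fitting on all rows
cv_threshold = ceil(min_samples_split * n_cv_train)  # inside each CV fit

print("Threshold on the full set:", full_threshold)
print("Threshold inside 3-fold CV:", cv_threshold)
```

The threshold therefore differs between the cross-validation runs and the final fit on the full dataset, since it scales with the number of training samples.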

Finally, the maximum depth was set at 142, which lies inside the tuning range of 1 to 200 rather than at its upper limit. This suggests that the DTC is most effective on the dataset when it is allowed to ask many questions in sequence, as opposed to limiting the maximum depth and thereby the number of splits that the DTC can make.

# 4. Exporting Model

The model with the best configuration found during the hyperparameter tuning process is saved in the <b>./outputs</b> directory.

In [9]:
clf = DecisionTreeClassifier(class_weight='balanced',
                             criterion=res.x[0], 
                             splitter=res.x[1], 
                             max_depth=res.x[2], 
                             min_samples_split=res.x[3], 
                             max_features=res.x[4])

In [10]:
if res.x[5] == 'none':
    X = loans_train_df.loc[:, loans_train_df.columns != "loan_status"]
    y = loans_train_df["loan_status"]
elif res.x[5] == 'ada':
    X = loans_train_ada_df.loc[:, loans_train_ada_df.columns != "loan_status"]
    y = loans_train_ada_df["loan_status"]

clf.fit(X,y)

# Calculate the ROC AUC score
roc_auc = cross_val_score(clf, X, y, cv=3, scoring='roc_auc').mean()
print("Validation AUC:", roc_auc)

Validation AUC: 0.9312499466833102


Validation AUC from 3-fold cross-validation: 0.9312499466833102
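
This value differs slightly from the 0.936239 recorded for the same configuration during tuning. A likely reason (an inference, not something verified in this notebook) is that the "random" splitter draws different splits on each run when no random_state is set. The sketch below, on synthetic stand-in data rather than the loan dataset, shows that fixing random_state makes the cross-validated score repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data; not the loan dataset.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=0
)

def cv_auc(seed):
    """Mean 3-fold ROC AUC of a random-splitter tree with a fixed seed."""
    model = DecisionTreeClassifier(
        criterion="entropy",
        splitter="random",
        class_weight="balanced",
        random_state=seed,
    )
    return cross_val_score(model, X_demo, y_demo, cv=3, scoring="roc_auc").mean()

first = cv_auc(seed=42)
second = cv_auc(seed=42)
print(first, second)  # the two values match because the seed is fixed
```

Passing a random_state to the DecisionTreeClassifier in the tuning and export cells would be one way to make the reported AUC reproducible.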

In [11]:
from joblib import dump
clf.fit(X,y)
dump(clf, './outputs/dtc_model.joblib')

['./outputs/dtc_model.joblib']

# 5. Fitting into Test Data

Finally, we can now generate the predictions made by the DTC on the test data. This is done by isolating the features of the test samples, passing them to the DTC for prediction, and pairing the predicted class labels with the corresponding IDs of the test data. Predictions for the models can be found in the <b>./predictions</b> directory.

In [12]:
##Import Testing Dataset
loans_test_df = pd.read_csv('./outputs/cleaned_loans_test.csv')
loans_test_df

Unnamed: 0,id,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,PERSON_HOME_OWNERSHIP_MORTGAGE,...,LOAN_GRADE_B,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below
0,58645,-0.755638,0.404383,-0.117198,2.836600,1.455666,2.189522,0,-1.364513,0,...,0,0,0,0,1,0,0,0,0,1
1,58646,-0.257331,1.127233,0.601227,0.140622,0.722635,-0.646041,1,-0.266122,1,...,0,1,0,0,0,0,0,0,0,1
2,58647,-0.257331,-1.418731,0.403331,-0.937769,1.748450,-0.318861,1,-1.364513,0,...,0,0,0,1,0,0,0,0,0,1
3,58648,0.905387,-0.300610,0.169270,-0.398573,-0.470628,-0.209801,0,0.620670,0,...,0,0,0,0,0,0,0,0,1,0
4,58649,-0.257331,1.259932,0.923860,1.039281,1.573370,-0.100741,1,-0.266122,1,...,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39093,97738,-0.921741,-1.332883,-0.486519,-1.117500,0.044689,-0.646041,0,-0.266122,1,...,1,0,0,0,0,0,0,0,0,1
39094,97739,-0.921741,-0.389963,0.601227,-0.398573,-1.782989,-0.100741,0,-0.721995,1,...,0,0,0,0,0,0,0,0,0,1
39095,97740,3.895232,0.098465,-1.896898,1.039281,-1.043084,0.989861,0,2.637868,1,...,0,0,0,0,0,0,0,1,0,0
39096,97741,-0.921741,-1.019656,0.169270,0.859550,1.425586,2.516703,1,-0.266122,1,...,0,0,1,0,0,0,0,0,0,1


In [13]:
X_test = loans_test_df.loc[:, loans_test_df.columns != "id"]
X_test

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,PERSON_HOME_OWNERSHIP_MORTGAGE,PERSON_HOME_OWNERSHIP_OTHER,...,LOAN_GRADE_B,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below
0,-0.755638,0.404383,-0.117198,2.836600,1.455666,2.189522,0,-1.364513,0,0,...,0,0,0,0,1,0,0,0,0,1
1,-0.257331,1.127233,0.601227,0.140622,0.722635,-0.646041,1,-0.266122,1,0,...,0,1,0,0,0,0,0,0,0,1
2,-0.257331,-1.418731,0.403331,-0.937769,1.748450,-0.318861,1,-1.364513,0,0,...,0,0,0,1,0,0,0,0,0,1
3,0.905387,-0.300610,0.169270,-0.398573,-0.470628,-0.209801,0,0.620670,0,0,...,0,0,0,0,0,0,0,0,1,0
4,-0.257331,1.259932,0.923860,1.039281,1.573370,-0.100741,1,-0.266122,1,0,...,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39093,-0.921741,-1.332883,-0.486519,-1.117500,0.044689,-0.646041,0,-0.266122,1,0,...,1,0,0,0,0,0,0,0,0,1
39094,-0.921741,-0.389963,0.601227,-0.398573,-1.782989,-0.100741,0,-0.721995,1,0,...,0,0,0,0,0,0,0,0,0,1
39095,3.895232,0.098465,-1.896898,1.039281,-1.043084,0.989861,0,2.637868,1,0,...,0,0,0,0,0,0,0,1,0,0
39096,-0.921741,-1.019656,0.169270,0.859550,1.425586,2.516703,1,-0.266122,1,0,...,0,0,1,0,0,0,0,0,0,1


In [14]:
y_pred = clf.predict(X_test)
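
The cell above produces hard 0/1 class labels. If the submission is scored by ROC AUC, as the tuning metric used in this notebook suggests (an assumption here), predicted probabilities may be worth considering instead, since they preserve the ranking between samples. The sketch below contrasts predict() and predict_proba() on synthetic stand-in data; the names demo_clf, X_va, and y_va are hypothetical and not part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the loan dataset.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

demo_clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

labels = demo_clf.predict(X_va)              # hard 0/1 labels
scores = demo_clf.predict_proba(X_va)[:, 1]  # estimated probability of class 1

print("AUC from labels:       ", roc_auc_score(y_va, labels))
print("AUC from probabilities:", roc_auc_score(y_va, scores))
```

In this notebook, the analogous call would be clf.predict_proba(X_test)[:, 1], with the resulting values written to the loan_status column.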

In [15]:
loans_predictions_df = loans_test_df["id"].copy(deep=True)
loans_predictions_df = loans_predictions_df.to_frame()
loans_predictions_df.insert(1, 'loan_status', y_pred, True)

In [16]:
loans_predictions_df

Unnamed: 0,id,loan_status
0,58645,1
1,58646,0
2,58647,1
3,58648,0
4,58649,1
...,...,...
39093,97738,0
39094,97739,0
39095,97740,0
39096,97741,1


In [17]:
loans_predictions_df.to_csv('predictions/dtc_predictions.csv', index=False, header=True, encoding='utf-8')