# Loan Approval Prediction Kaggle Competition
## October 28, 2024
DICHOSO, Aaron Gabrielle C.

This Notebook is part of a series of notebooks that will contain documentation and methods used for training a VotingClassifier used in the <a href="https://www.kaggle.com/competitions/playground-series-s4e10/"><b>2024 Loan Approval Prediction Kaggle Playground Series</b></a>. 


For this notebook, I will focus on the methods that I utilized for model training and hyperparameter training.

To view the data cleaning itself, feel free to visit the following notebook: 

<ul>
    <li>1. Data Exploration, Cleaning, and Transformations</li>
</ul>

To explore the base models used for the VotingClassifier, feel free to visit the following notebooks:

To view the model training itself, feel free to visit the following notebooks: 

<ul>
    <li>2.a: Training Phase - MLPClassifier</li>
    <li>2.b: Training Phase - DTCClassifier</li>
    <li>2.c: Training Phase - KNNClassifier</li>
    <li>2.d: Training Phase - LRClassifier</li>
</ul>

I had decided to attempt using a voting classifier because of just how distinct each model that I had trained from one another. The diversity of algorithms I had tested meant that they had different ways of learning how to generalize the data from one another. Hence, I wish to see what would happen if these algorithms had combined their knowledge together.

# 1. Import Cleaned Datasets

We first import the cleaned datasets from the previous notebook first. In this repository, the cleaned datasets are saved in the <b>./output</b> directory.

In [1]:
##Python libraries
import pandas as pd
import numpy as np

As observed, there are two kinds of training datasets used, the train set without oversampling, and the dataset that underwent ADASYN oversampling. I wish to test the performance of the model comparing these two methods during the hyperparameter tuning phase.

In [2]:
##Import Training Dataset
loans_train_df = pd.read_csv('./outputs/cleaned_loans_train.csv')
loans_train_df.head(5)

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,loan_status,PERSON_HOME_OWNERSHIP_MORTGAGE,...,LOAN_GRADE_B,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below
0,1.569797,-1.081318,-1.896898,-0.578305,0.390423,0.11738,0,1.719062,0,0,...,1,0,0,0,0,0,1,0,0,0
1,-0.921741,-0.05255,0.601227,-0.937769,0.896212,-0.973222,0,-1.364513,0,0,...,0,1,0,0,0,0,0,0,0,1
2,0.240977,-1.508084,0.92386,-0.578305,-0.470628,0.55362,0,1.185873,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0.407079,0.435878,1.579649,0.500086,0.27705,0.11738,0,0.087481,0,0,...,1,0,0,0,0,0,0,0,1,0
4,-0.921741,0.098465,-0.486519,-0.578305,-1.318902,-0.646041,0,-0.721995,0,0,...,0,0,0,0,0,0,0,0,0,1


In [3]:
loans_train_ada_df = pd.read_csv('./outputs/cleaned_loans_train_ada.csv')
loans_train_ada_df.head(5)

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,PERSON_HOME_OWNERSHIP_MORTGAGE,PERSON_HOME_OWNERSHIP_OTHER,...,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below,loan_status
0,1.569797,-1.081318,-1.896898,-0.578305,0.390423,0.11738,0,1.719062,0,0,...,0,0,0,0,0,1,0,0,0,0
1,-0.921741,-0.05255,0.601227,-0.937769,0.896212,-0.973222,0,-1.364513,0,0,...,1,0,0,0,0,0,0,0,1,0
2,0.240977,-1.508084,0.92386,-0.578305,-0.470628,0.55362,0,1.185873,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0.407079,0.435878,1.579649,0.500086,0.27705,0.11738,0,0.087481,0,0,...,0,0,0,0,0,0,0,1,0,0
4,-0.921741,0.098465,-0.486519,-0.578305,-1.318902,-0.646041,0,-0.721995,0,0,...,0,0,0,0,0,0,0,0,1,0


# 2. Hyperparameter Tuning

For my Voting Classifier, there are only two hyperparameters that I wish to tune, which are the set of models I have already trained previously and which ones I should include in the Voting Classifier, and the type of dataset I should use for fitting the voting classifier, either the set not oversampled, or the set oversampled using ADASYN.

To do so, I first imported all of the previously trained models by the other notebooks from the <b>./outputs</b> directory.

In [4]:
from joblib import load

In [5]:
from sklearn.neighbors import KNeighborsClassifier
knn_clf = load('./outputs/knn_model.joblib')

In [6]:
from sklearn.neural_network import MLPClassifier
mlp_clf = load('./outputs/mlp_model.joblib')

In [7]:
from sklearn.tree import DecisionTreeClassifier
dtc_clf = load('./outputs/dtc_model.joblib')

In [8]:
from sklearn.linear_model import LogisticRegression
lrc_clf = load('./outputs/lrc_model.joblib')

Afterwards, I set up my hyperparameter tuning method, which will be a standard grid search. I am utilizing grid search instead of bayesian optimization for the Voting classifier because the hyperparameters for the Voting classifier only consist of a small range of categorical data.

Only 31 configurations will be tested, whether to use a model or not and whether to use the ADASYN oversampled dataset or not (2 * 5 - 1 = 31).

In [9]:
df_hyper_tuning = pd.DataFrame(columns=['use_knn', 'use_mlp', 'use_dtc', 'use_lrc', 'oversampling_method', 'roc_auc'])
df_hyper_tuning

Unnamed: 0,use_knn,use_mlp,use_dtc,use_lrc,oversampling_method,roc_auc


To go through the search space, I utilized a binary number that corresponds to the flags of whether to use a model/ADASYN dataset. I increment this binary number and depending on the places with a 1, those flags will be enabled for that iteration of the grid search. Additionally, I utilized a soft voting scheme as opposed to a hard voting scheme so to give weight to the predictions made by each classifier. Finally, I examined the performance of each hyperparameter configuration using the Area obtained under the ROC (AUC) resulting from 3-fold cross validation.

In [10]:
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

a = "100001"
for i in range(31): 
    estimators = []
    # Calculating binary value using function
    sum = bin(int(a, 2) + 1)
    a = sum[2:]

    config = sum[2:][1:]
    use_knn = False
    use_mlp = False
    use_dtc = False
    use_lrc = False
    use_ada = False
    print("CONFIG", config)
    
    if config[0] == '1':
        estimators.append(('knn', knn_clf))
        use_knn = True
        print("KNN")
    if config[1] == '1':
        estimators.append(('mlp', mlp_clf))
        use_mlp = True
        print("MLP")
    if config[2] == '1':
        estimators.append(('dtc', dtc_clf))
        use_dtc = True
        print("DTC")
    if config[3] == '1':
        estimators.append(('lrc', knn_clf))
        use_lrc = True
        print("LRC")
    if config[4] == '1':
        X = loans_train_ada_df.loc[:, loans_train_ada_df.columns != "loan_status"]
        y = loans_train_ada_df["loan_status"]
        use_ada = True
        print("ADA")
    else:
        X = loans_train_df.loc[:, loans_train_df.columns != "loan_status"]
        y = loans_train_df["loan_status"]
        print("NO OVERSAMPLING")


    voting_clf = VotingClassifier(estimators=estimators, voting='soft', verbose=True)
    roc_auc = cross_val_score(voting_clf, X, y, cv=3, scoring='roc_auc').mean()
    print("Results:", roc_auc)
    
    df_hyper_tuning.loc[len(df_hyper_tuning.index)] = [use_knn, use_mlp, use_dtc, use_lrc, use_ada, roc_auc]
    
    print("================")

CONFIG 00010
LRC
NO OVERSAMPLING
[Voting] ...................... (1 of 1) Processing lrc, total=   0.1s
[Voting] ...................... (1 of 1) Processing lrc, total=   0.1s
[Voting] ...................... (1 of 1) Processing lrc, total=   0.1s
Results: 0.9047054724827651
CONFIG 00011
LRC
ADA
[Voting] ...................... (1 of 1) Processing lrc, total=   0.2s
[Voting] ...................... (1 of 1) Processing lrc, total=   0.2s
[Voting] ...................... (1 of 1) Processing lrc, total=   0.2s
Results: 0.953963287686896
CONFIG 00100
DTC
NO OVERSAMPLING
[Voting] ...................... (1 of 1) Processing dtc, total=   0.1s
[Voting] ...................... (1 of 1) Processing dtc, total=   0.1s
[Voting] ...................... (1 of 1) Processing dtc, total=   0.1s
Results: 0.8756369331732045
CONFIG 00101
DTC
ADA
[Voting] ...................... (1 of 1) Processing dtc, total=   0.1s
[Voting] ...................... (1 of 1) Processing dtc, total=   0.1s
[Voting] ...................

ValueError: 
All the 3 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
3 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\Aaron\AppData\Roaming\Python\Python311\site-packages\sklearn\model_selection\_validation.py", line 890, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Aaron\AppData\Roaming\Python\Python311\site-packages\sklearn\base.py", line 1351, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Aaron\AppData\Roaming\Python\Python311\site-packages\sklearn\ensemble\_voting.py", line 351, in fit
    return super().fit(X, transformed_y, sample_weight)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Aaron\AppData\Roaming\Python\Python311\site-packages\sklearn\ensemble\_voting.py", line 77, in fit
    names, clfs = self._validate_estimators()
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Aaron\AppData\Roaming\Python\Python311\site-packages\sklearn\ensemble\_base.py", line 217, in _validate_estimators
    raise ValueError(
ValueError: Invalid 'estimators' attribute, 'estimators' should be a non-empty list of (string, estimator) tuples.


These tuning results are saved in a dataframe. The hyperparameter tuning results for all models can be viewed in the <b>./hyper_tuning</b> directory.

In [11]:
df_hyper_tuning.sort_values(by=['roc_auc'], ascending=False)

Unnamed: 0,use_knn,use_mlp,use_dtc,use_lrc,oversampling_method,roc_auc
29,True,True,True,True,True,0.965153
21,True,False,True,True,True,0.964872
19,True,False,True,False,True,0.963707
27,True,True,True,False,True,0.963371
13,False,True,True,True,True,0.962717
5,False,False,True,True,True,0.962406
25,True,True,False,True,True,0.961027
23,True,True,False,False,True,0.959067
9,False,True,False,True,True,0.958519
17,True,False,False,True,True,0.953963


In [12]:
df_hyper_tuning.to_csv('hyper_tuning/voting_hyper_tuning.csv', index=False, header=True, encoding='utf-8')

# 3. Results

The results showed that the following configuration produced the best results:

<ol>
    <li><b>estimators:</b> all previously trained models</li>
    <li><b>oversampling_method: </b>ADASYN Oversampling</li>
</ol>

Interestingly, the VotingClassifier had the best performance by utilizing all previously trained models. This is interesting because it may be said that by utilizing the different learnings by each model, we are able to improve overall performance.

Additionally, using the ADASYN oversampling method worked the best for the VotingClassifier as opposed to no oversampling. This aligns with the characteristics of the individual models used in the VotingClassifier, as only the LRClassifier had obtained the best performance using no oversampling, while the rest preferred the ADASYN oversampling method. Hence, an all configurations that had used the ADASYN oversampled training set performed better than without oversampling.

# 4. Exporting Model

The model with the best configuration found during the hyperparameter tuning process is saved in the <b>./outputs</b> directory.

In [13]:
result = df_hyper_tuning.sort_values(by=['roc_auc'], ascending=False).iloc[0]
estimators = []

if result.use_knn:
    estimators.append(('knn', knn_clf))
    print("KNN")
if result.use_mlp:
    estimators.append(('mlp', mlp_clf))
    print("MLP")
if result.use_dtc:
    estimators.append(('dtc', dtc_clf))
    print("DTC")
if result.use_lrc:
    estimators.append(('lrc', knn_clf))
    print("LRC")
if result.oversampling_method:
    X = loans_train_ada_df.loc[:, loans_train_ada_df.columns != "loan_status"]
    y = loans_train_ada_df["loan_status"]
    use_ada = True
    print("ADA")
else:
    X = loans_train_df.loc[:, loans_train_df.columns != "loan_status"]
    y = loans_train_df["loan_status"]
    print("NO OVERSAMPLING")


voting_clf = VotingClassifier(estimators=estimators, voting='soft', verbose=True)
roc_auc = cross_val_score(voting_clf, X, y, cv=3, scoring='roc_auc').mean()
print("Validation AUC:", roc_auc)

KNN
MLP
DTC
LRC
ADA
[Voting] ...................... (1 of 4) Processing knn, total=   0.2s
Iteration 1, loss = 0.52334950
Validation score: 0.778107
Iteration 2, loss = 0.43832029
Validation score: 0.808514
Iteration 3, loss = 0.40115556
Validation score: 0.820231
Iteration 4, loss = 0.38018684
Validation score: 0.829872
Iteration 5, loss = 0.36773462
Validation score: 0.837289
Iteration 6, loss = 0.35875984
Validation score: 0.842777
Iteration 7, loss = 0.35162410
Validation score: 0.838920
Iteration 8, loss = 0.34580327
Validation score: 0.843815
Iteration 9, loss = 0.34151981
Validation score: 0.846781
Iteration 10, loss = 0.33721283
Validation score: 0.845001
Iteration 11, loss = 0.33324841
Validation score: 0.844853
Iteration 12, loss = 0.33046036
Validation score: 0.844705
Iteration 13, loss = 0.32671607
Validation score: 0.851973
Iteration 14, loss = 0.32366851
Validation score: 0.850489
Iteration 15, loss = 0.32081549
Validation score: 0.852121
Iteration 16, loss = 0.31846717
V

Validation AUC from 3-fold cross validation: 0.96627494121771

In [14]:
from joblib import dump
voting_clf.fit(X, y)
dump(voting_clf, './outputs/voting_model.joblib')

[Voting] ...................... (1 of 4) Processing knn, total=   0.4s
Iteration 1, loss = 0.48103228
Validation score: 0.788490
Iteration 2, loss = 0.40889811
Validation score: 0.808168
Iteration 3, loss = 0.38091457
Validation score: 0.822011
Iteration 4, loss = 0.36628356
Validation score: 0.828538
Iteration 5, loss = 0.35633914
Validation score: 0.834273
Iteration 6, loss = 0.35018952
Validation score: 0.838327
Iteration 7, loss = 0.34392596
Validation score: 0.838030
Iteration 8, loss = 0.33970494
Validation score: 0.840502
Iteration 9, loss = 0.33681415
Validation score: 0.841986
Iteration 10, loss = 0.33294038
Validation score: 0.842876
Iteration 11, loss = 0.33028590
Validation score: 0.846336
Iteration 12, loss = 0.32728187
Validation score: 0.844557
Iteration 13, loss = 0.32505105
Validation score: 0.845249
Iteration 14, loss = 0.32196602
Validation score: 0.846139
Iteration 15, loss = 0.32001391
Validation score: 0.847127
Iteration 16, loss = 0.31833557
Validation score: 0.8

['./outputs/voting_model.joblib']

# 5. Fitting into Test Data

Finally, we can now generate the predictions made by the DTC on the test data. This is done by isolating the features of the test samples, forwarding it to the DTC for prediction, and appending the predicted class labels with the corresponding IDs of the test data.

In [15]:
##Import Testing Dataset
loans_test_df = pd.read_csv('./outputs/cleaned_loans_test.csv')
loans_test_df

Unnamed: 0,id,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,PERSON_HOME_OWNERSHIP_MORTGAGE,...,LOAN_GRADE_B,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below
0,58645,-0.755638,0.404383,-0.117198,2.836600,1.455666,2.189522,0,-1.364513,0,...,0,0,0,0,1,0,0,0,0,1
1,58646,-0.257331,1.127233,0.601227,0.140622,0.722635,-0.646041,1,-0.266122,1,...,0,1,0,0,0,0,0,0,0,1
2,58647,-0.257331,-1.418731,0.403331,-0.937769,1.748450,-0.318861,1,-1.364513,0,...,0,0,0,1,0,0,0,0,0,1
3,58648,0.905387,-0.300610,0.169270,-0.398573,-0.470628,-0.209801,0,0.620670,0,...,0,0,0,0,0,0,0,0,1,0
4,58649,-0.257331,1.259932,0.923860,1.039281,1.573370,-0.100741,1,-0.266122,1,...,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39093,97738,-0.921741,-1.332883,-0.486519,-1.117500,0.044689,-0.646041,0,-0.266122,1,...,1,0,0,0,0,0,0,0,0,1
39094,97739,-0.921741,-0.389963,0.601227,-0.398573,-1.782989,-0.100741,0,-0.721995,1,...,0,0,0,0,0,0,0,0,0,1
39095,97740,3.895232,0.098465,-1.896898,1.039281,-1.043084,0.989861,0,2.637868,1,...,0,0,0,0,0,0,0,1,0,0
39096,97741,-0.921741,-1.019656,0.169270,0.859550,1.425586,2.516703,1,-0.266122,1,...,0,0,1,0,0,0,0,0,0,1


In [16]:
X_test = loans_test_df.loc[:, loans_test_df.columns != "id"]
X_test

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,PERSON_HOME_OWNERSHIP_MORTGAGE,PERSON_HOME_OWNERSHIP_OTHER,...,LOAN_GRADE_B,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below
0,-0.755638,0.404383,-0.117198,2.836600,1.455666,2.189522,0,-1.364513,0,0,...,0,0,0,0,1,0,0,0,0,1
1,-0.257331,1.127233,0.601227,0.140622,0.722635,-0.646041,1,-0.266122,1,0,...,0,1,0,0,0,0,0,0,0,1
2,-0.257331,-1.418731,0.403331,-0.937769,1.748450,-0.318861,1,-1.364513,0,0,...,0,0,0,1,0,0,0,0,0,1
3,0.905387,-0.300610,0.169270,-0.398573,-0.470628,-0.209801,0,0.620670,0,0,...,0,0,0,0,0,0,0,0,1,0
4,-0.257331,1.259932,0.923860,1.039281,1.573370,-0.100741,1,-0.266122,1,0,...,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39093,-0.921741,-1.332883,-0.486519,-1.117500,0.044689,-0.646041,0,-0.266122,1,0,...,1,0,0,0,0,0,0,0,0,1
39094,-0.921741,-0.389963,0.601227,-0.398573,-1.782989,-0.100741,0,-0.721995,1,0,...,0,0,0,0,0,0,0,0,0,1
39095,3.895232,0.098465,-1.896898,1.039281,-1.043084,0.989861,0,2.637868,1,0,...,0,0,0,0,0,0,0,1,0,0
39096,-0.921741,-1.019656,0.169270,0.859550,1.425586,2.516703,1,-0.266122,1,0,...,0,0,1,0,0,0,0,0,0,1


In [17]:
y_pred = voting_clf.predict(X_test)

In [18]:
loans_predictions_df = loans_test_df["id"].copy(deep=True)
loans_predictions_df = loans_predictions_df.to_frame()
loans_predictions_df.insert(1, 'loan_status', y_pred, True)

In [19]:
loans_predictions_df

Unnamed: 0,id,loan_status
0,58645,1
1,58646,0
2,58647,1
3,58648,0
4,58649,1
...,...,...
39093,97738,0
39094,97739,0
39095,97740,0
39096,97741,1


In [20]:
loans_predictions_df.to_csv('predictions/voting_predictions.csv', index=False, header=True, encoding='utf-8')