# Loan Approval Prediction Kaggle Competition
## October 28, 2024
DICHOSO, Aaron Gabrielle C.

This notebook is part of a series of notebooks documenting the methods used to train an SGDClassifier for the <a href="https://www.kaggle.com/competitions/playground-series-s4e10/"><b>2024 Loan Approval Prediction Kaggle Playground Series</b></a>.


For this notebook, I will focus on hyperparameter tuning of the SGDClassifier, exporting the tuned model, and generating predictions on the test data.

To view the data cleaning itself, feel free to visit the following notebook: 

<ul>
    <li>1. Data Exploration, Cleaning, and Transformations</li>
</ul>



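An SGDClassifier is a linear classifier (for example a linear SVM or a logistic regression, depending on the chosen loss) that is trained with stochastic gradient descent (SGD).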

## 1. Import Cleaned Datasets

We first import the cleaned datasets from the previous notebook. In this repository, the cleaned datasets are saved in the <b>./outputs</b> directory.

In [1]:
##Python libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

Two versions of the training set are used: the train set without oversampling, and a version that underwent ADASYN oversampling. I want to compare the performance of the model on these two versions during the hyperparameter tuning phase.

In [2]:
##Import Training Dataset
loans_train_df = pd.read_csv('./outputs/cleaned_loans_train.csv')
loans_train_df.head(5)

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,loan_status,PERSON_HOME_OWNERSHIP_MORTGAGE,...,LOAN_GRADE_B,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below
0,1.569797,-1.081318,-1.896898,-0.578305,0.390423,0.11738,0,1.719062,0,0,...,1,0,0,0,0,0,1,0,0,0
1,-0.921741,-0.05255,0.601227,-0.937769,0.896212,-0.973222,0,-1.364513,0,0,...,0,1,0,0,0,0,0,0,0,1
2,0.240977,-1.508084,0.92386,-0.578305,-0.470628,0.55362,0,1.185873,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0.407079,0.435878,1.579649,0.500086,0.27705,0.11738,0,0.087481,0,0,...,1,0,0,0,0,0,0,0,1,0
4,-0.921741,0.098465,-0.486519,-0.578305,-1.318902,-0.646041,0,-0.721995,0,0,...,0,0,0,0,0,0,0,0,0,1


In [3]:
loans_train_ada_df = pd.read_csv('./outputs/cleaned_loans_train_ada.csv')
loans_train_ada_df.head(5)

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,PERSON_HOME_OWNERSHIP_MORTGAGE,PERSON_HOME_OWNERSHIP_OTHER,...,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below,loan_status
0,1.569797,-1.081318,-1.896898,-0.578305,0.390423,0.11738,0,1.719062,0,0,...,0,0,0,0,0,1,0,0,0,0
1,-0.921741,-0.05255,0.601227,-0.937769,0.896212,-0.973222,0,-1.364513,0,0,...,1,0,0,0,0,0,0,0,1,0
2,0.240977,-1.508084,0.92386,-0.578305,-0.470628,0.55362,0,1.185873,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0.407079,0.435878,1.579649,0.500086,0.27705,0.11738,0,0.087481,0,0,...,0,0,0,0,0,0,0,1,0,0
4,-0.921741,0.098465,-0.486519,-0.578305,-1.318902,-0.646041,0,-0.721995,0,0,...,0,0,0,0,0,0,0,0,1,0


## 2. Hyperparameter Tuning

The SGDClassifier has several hyperparameters that should be tuned to maximize its performance.

In this notebook, I will focus on tuning the following <a href="https://scikit-learn.org/dev/modules/generated/sklearn.linear_model.SGDClassifier.html">hyperparameters of the SGDClassifier</a>, namely:

<ol>
    <li><b>loss:</b> the loss function minimized during training</li>
    <li><b>alpha:</b> the constant that multiplies the regularization term</li>
    <li><b>eta0:</b> the initial learning rate used by the "constant", "invscaling", and "adaptive" schedules</li>
    <li><b>tol:</b> the tolerance used as the stopping criterion for training</li>
    <li><b>learning_rate:</b> the learning rate schedule</li>
    <li><b>oversampling_method:</b> the type of oversampling done in the dataset used</li>
</ol>

The ranges of values I chose for these hyperparameters are as follows:
<ol>
    <li><b>loss:</b> "hinge", "log_loss", "modified_huber", "squared_hinge", or "perceptron".</li>
    <li><b>alpha:</b> A real value from 0.0001 to 100.</li>
    <li><b>eta0:</b> A real value from 0.0 to 100.0.</li>
    <li><b>tol:</b> A real value from 0.0001 to 0.1.</li>
    <li><b>learning_rate:</b> "constant", "optimal", "invscaling", or "adaptive".</li>
    <li><b>oversampling_method:</b> Either the ADASYN oversampled dataset ("ada") or the dataset with unbalanced labels ("none").</li>
</ol>
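The `learning_rate` option of the SGDClassifier tuned in this notebook selects a step-size schedule. As a rough sketch of what the non-adaptive schedules compute, based on the formulas in the scikit-learn documentation, the helper below (`learning_rate_at` is a hypothetical name of mine, not a scikit-learn function) returns the step size at update `t`. The `t0` argument is a placeholder: scikit-learn chooses it internally with a heuristic. The "adaptive" schedule is not shown because it depends on training progress: it starts at `eta0` and is divided by 5 whenever training stops improving.

```python
def learning_rate_at(schedule, t, eta0=1.0, alpha=0.0001, power_t=0.5, t0=1.0):
    """Step size at update t (t >= 1) for the non-adaptive SGD schedules."""
    if schedule == "constant":
        # The step size never changes.
        return eta0
    if schedule == "invscaling":
        # The step size decays with a power of the update counter.
        return eta0 / (t ** power_t)
    if schedule == "optimal":
        # eta0 is ignored; t0 is a placeholder for scikit-learn's internal heuristic.
        return 1.0 / (alpha * (t + t0))
    raise ValueError(f"unsupported schedule: {schedule}")


for schedule in ("constant", "invscaling", "optimal"):
    steps = [learning_rate_at(schedule, t, eta0=0.5) for t in (1, 100, 10_000)]
    print(schedule, steps)
```

This also suggests why `eta0` appears to have little effect on the "optimal" runs in the log: that schedule does not use it.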

In [4]:
df_hyper_tuning = pd.DataFrame(columns=['loss', 'alpha', 'eta0', 'tol', 'learning_rate', 'oversampling_method', 'roc_auc'])
df_hyper_tuning

Unnamed: 0,loss,alpha,eta0,tol,learning_rate,oversampling_method,roc_auc


For the specific method of hyperparameter tuning, I chose Bayesian optimization, which uses the results of past iterations of the tuning process to decide which configurations to test next. I chose this method over GridSearch because exhaustively searching through the numerous possible configurations is not feasible on my system. Additionally, I chose it over RandomSearch because Bayesian optimization uses my system resources more effectively by searching in regions with a higher probability of giving high performance, as opposed to testing configurations at random.
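To make the cost of an exhaustive grid concrete, here is a rough count. It uses the categorical options from the search space defined later in this notebook and assumes a hypothetical grid of 10 values for each of the three continuous hyperparameters (alpha, eta0, tol). The resolution of 10 is my assumption for illustration only.

```python
from itertools import product

losses = ["hinge", "log_loss", "modified_huber", "squared_hinge", "perceptron"]
learning_rates = ["constant", "optimal", "invscaling", "adaptive"]
oversampling_methods = ["none", "ada"]

points_per_continuous = 10  # assumed grid resolution for alpha, eta0 and tol
continuous_axes = 3         # alpha, eta0, tol
cv_folds = 3                # same number of folds as the tuning code

# Every combination of the categorical choices.
categorical_combos = len(list(product(losses, learning_rates, oversampling_methods)))

# Full grid: categorical combinations times the continuous grid points.
grid_size = categorical_combos * points_per_continuous ** continuous_axes
total_fits = grid_size * cv_folds

print("Grid configurations:", grid_size)
print("Model fits with 3-fold CV:", total_fits)
```

Under these assumptions the grid has 40,000 configurations (120,000 model fits), compared with the 500 calls given to the Bayesian optimizer.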

To perform Bayesian optimization, I utilized the <a href="https://scikit-optimize.github.io/stable/modules/generated/skopt.gp_minimize.html">gp_minimize()</a> function provided by the scikit-optimize library.

Following the instructions found in the documentation, I first initialized the search space to be used in the optimization process. This involved creating a list that specifies the hyperparameters to tune, their data types, and the range of possible values to test during the hyperparameter tuning process.

Afterwards, I created the objective function that the gp_minimize() function will execute. The objective function receives a configuration drawn from the search space defined earlier and evaluates it. It then returns the negative of the Area Under the ROC Curve (AUC), since gp_minimize() minimizes the value it is given. Invalid configurations encountered during the hyperparameter tuning process are given a large positive value, which steers the Bayesian optimization process away from such configurations.

I also limit the number of calls performed by the function due to my limited resources.

In [5]:
from sklearn.model_selection import cross_val_score
from skopt import gp_minimize
from skopt.space import Real, Integer, Categorical
from skopt.utils import use_named_args

# Define the search space
search_space = [
    Categorical(['hinge', 'log_loss', 'modified_huber', 'squared_hinge', 'perceptron'], name='loss'),
    Real(0.0001, 100, name='alpha'),
    Real(0.0, 100.0, name='eta0'),
    Real(0.0001, 0.1, name='tol'),
    Categorical(['constant', 'optimal', 'invscaling', 'adaptive'], name='learning_rate'),
    Categorical(['none', 'ada'], name='oversampling_method')
]

# Define the objective function: gp_minimize() minimizes it, so it returns the negative ROC AUC
@use_named_args(search_space)
def objective_function(loss, alpha, eta0, tol, learning_rate, oversampling_method):
    print("================")
    print("Configuration:")
    print("Loss:", loss)
    print("Tolerance:", tol)
    print("Alpha:", alpha)
    print("Eta0:", eta0)
    print("Learning Rate:", learning_rate)
    print("Oversampling Method:", oversampling_method)
    print("----------------")
    try:
        if oversampling_method == 'none':
            X = loans_train_df.loc[:, loans_train_df.columns != "loan_status"]
            y = loans_train_df["loan_status"]
        elif oversampling_method == 'ada':
            X = loans_train_ada_df.loc[:, loans_train_ada_df.columns != "loan_status"]
            y = loans_train_ada_df["loan_status"]
            
        model = SGDClassifier(class_weight='balanced', loss=loss, alpha=alpha, eta0=eta0, max_iter=200, tol=tol, learning_rate=learning_rate)
        roc_auc = cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean()

        print("Results:", -roc_auc)
        print("================")
        df_hyper_tuning.loc[len(df_hyper_tuning.index)] = [loss, alpha, eta0, tol, learning_rate, oversampling_method, roc_auc] 
        return -roc_auc
    except Exception as exc:
        # Penalize configurations that fail to fit so the optimizer avoids them
        print("Invalid Config:", exc)
        return 100000
        

# Perform Bayesian Optimization
res = gp_minimize(objective_function, search_space, n_calls=500)

# Print best parameters
print("Best parameters:", res.x)


Configuration:
Loss: hinge
Tolerance: 0.05412956686808547
Alpha: 84.74747382102345
Eta0: 77.96390028256833
Learning Rate: constant
Oversampling Method: ada
----------------
Results: -0.5217856009318392
Configuration:
Loss: perceptron
Tolerance: 0.039649019886471154
Alpha: 64.41220646667635
Eta0: 39.32627781516281
Learning Rate: constant
Oversampling Method: none
----------------
Results: -0.5503153537364779
Configuration:
Loss: log_loss
Tolerance: 0.023449328206767754
Alpha: 45.96927560370874
Eta0: 75.48336816805381
Learning Rate: constant
Oversampling Method: ada
----------------
Results: -0.595869505490533
Configuration:
Loss: squared_hinge
Tolerance: 0.05056476131422954
Alpha: 52.30544704291901
Eta0: 14.63490172247202
Learning Rate: adaptive
Oversampling Method: ada
----------------
Results: -0.7919216452048573
Configuration:
Loss: modified_huber
Tolerance: 0.021058851618259957
Alpha: 1.8102984788305057
Eta0: 65.79344412146754
Learning Rate: optimal
Oversampling Method: ada
----------------
...



Results: -0.7880911176336429




Configuration:
Loss: squared_hinge
Tolerance: 0.078463123744636
Alpha: 90.50282028158718
Eta0: 6.654634705424524
Learning Rate: optimal
Oversampling Method: none
----------------
Results: -0.8609697341185649
Configuration:
Loss: log_loss
Tolerance: 0.09980305527632294
Alpha: 99.34845790365219
Eta0: 14.615092840557523
Learning Rate: optimal
Oversampling Method: none
----------------
Results: -0.8614667588091574
Configuration:
Loss: hinge
Tolerance: 0.04536313958257223
Alpha: 6.2001535312486284
Eta0: 100.0
Learning Rate: optimal
Oversampling Method: none
----------------
Results: -0.8616211004867323
Configuration:
Loss: log_loss
Tolerance: 0.07865158248633465
Alpha: 88.49190039725987
Eta0: 10.208303992631096
Learning Rate: optimal
Oversampling Method: none
----------------
Results: -0.8621446368388171
Configuration:
Loss: hinge
Tolerance: 0.0001
Alpha: 0.0001
Eta0: 100.0
Learning Rate: optimal
Oversampling Method: ada
----------------
Results: -0.8835768289171858
...



Configuration:
Loss: log_loss
Tolerance: 0.09440841686684166
Alpha: 44.784077832825105
Eta0: 78.08743796595049
Learning Rate: invscaling
Oversampling Method: ada
----------------
Results: -0.5970939079556626




Configuration:
Loss: modified_huber
Tolerance: 0.09772442178501022
Alpha: 73.76078982695252
Eta0: 55.370495849561905
Learning Rate: invscaling
Oversampling Method: none
----------------
Results: -0.7428637131913467
Configuration:
Loss: log_loss
Tolerance: 0.0001
Alpha: 0.0001
Eta0: 100.0
Learning Rate: optimal
Oversampling Method: ada
----------------
Results: -0.8845799205142163




Configuration:
Loss: hinge
Tolerance: 0.035727394117202674
Alpha: 79.56007281574153
Eta0: 1.749369243229904
Learning Rate: constant
Oversampling Method: ada
----------------
Results: -0.5310817901124255




Configuration:
Loss: log_loss
Tolerance: 0.03861768165063565
Alpha: 79.16575219549003
Eta0: 80.65781474850814
Learning Rate: invscaling
Oversampling Method: none
----------------
Results: -0.5142437585231904




Configuration:
Loss: squared_hinge
Tolerance: 0.022982447347402447
Alpha: 10.943659583631915
Eta0: 53.899532351619115
Learning Rate: invscaling
Oversampling Method: ada
----------------
Results: -0.5476953760941595




Configuration:
Loss: modified_huber
Tolerance: 0.09268951714375327
Alpha: 24.288342845504218
Eta0: 78.39569410277477
Learning Rate: constant
Oversampling Method: ada
----------------
Results: -0.5545838118313217




Configuration:
Loss: perceptron
Tolerance: 0.05944382205975051
Alpha: 86.55149212265135
Eta0: 17.25933242078595
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.802797126465831
Configuration:
Loss: perceptron
Tolerance: 0.0001
Alpha: 0.0001
Eta0: 100.0
Learning Rate: optimal
Oversampling Method: none
----------------
Results: -0.8528628656435302




Configuration:
Loss: perceptron
Tolerance: 0.057573736923642246
Alpha: 66.1784991476255
Eta0: 66.8040032470372
Learning Rate: optimal
Oversampling Method: none
----------------
Results: -0.8565025169495067
Configuration:
Loss: log_loss
Tolerance: 0.0001
Alpha: 13.539877683994483
Eta0: 100.0
Learning Rate: optimal
Oversampling Method: none
----------------
Results: -0.8626591996210183
Configuration:
Loss: log_loss
Tolerance: 0.0005292701899918859
Alpha: 99.50758189658117
Eta0: 100.0
Learning Rate: optimal
Oversampling Method: none
----------------
Results: -0.8614850013408821
Configuration:
Loss: modified_huber
Tolerance: 0.06254958505813381
Alpha: 97.57932429068147
Eta0: 60.6287091818959
Learning Rate: optimal
Oversampling Method: none
----------------
Results: -0.8520665654538218
Configuration:
Loss: modified_huber
Tolerance: 0.0058788919127716315
Alpha: 100.0
Eta0: 80.24026911101888
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.8625743286322995
...



Configuration:
Loss: squared_hinge
Tolerance: 0.06809264772404269
Alpha: 52.97300312923799
Eta0: 23.558935051992275
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.8630034430385948




Configuration:
Loss: squared_hinge
Tolerance: 0.00543949038019718
Alpha: 44.89309110285409
Eta0: 36.12229164858618
Learning Rate: adaptive
Oversampling Method: ada
----------------
Results: -0.7919123491699444




Configuration:
Loss: modified_huber
Tolerance: 0.08665024359305702
Alpha: 61.691218168337514
Eta0: 37.343892672707454
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.8627453809263387




Configuration:
Loss: hinge
Tolerance: 0.07510351248959285
Alpha: 27.120041713146698
Eta0: 67.46402420901224
Learning Rate: optimal
Oversampling Method: ada
----------------
Results: -0.7910637638522063




Configuration:
Loss: squared_hinge
Tolerance: 0.07995286292358053
Alpha: 90.93319396878985
Eta0: 56.92363230735309
Learning Rate: invscaling
Oversampling Method: none
----------------
Results: -0.5446445152644906
Configuration:
Loss: log_loss
Tolerance: 0.0001
Alpha: 99.44071685949797
Eta0: 0.0
Learning Rate: optimal
Oversampling Method: none
----------------
Results: -0.8614786348944324
Configuration:
Loss: log_loss
Tolerance: 0.0001
Alpha: 100.0
Eta0: 3.557044860097338
Learning Rate: optimal
Oversampling Method: none
----------------
Results: -0.8614704498888849
Configuration:
Loss: log_loss
Tolerance: 0.0001
Alpha: 100.0
Eta0: 100.0
Learning Rate: optimal
Oversampling Method: none
----------------
Results: -0.8614424644647194
Configuration:
Loss: log_loss
Tolerance: 0.0001
Alpha: 100.0
Eta0: 12.306182974973542
Learning Rate: optimal
Oversampling Method: none
----------------
Results: -0.861440879581712




Configuration:
Loss: perceptron
Tolerance: 0.06763601903856743
Alpha: 71.80876909453598
Eta0: 56.431717579884165
Learning Rate: optimal
Oversampling Method: none
----------------
Results: -0.8497241663158901
Configuration:
Loss: log_loss
Tolerance: 0.0001
Alpha: 99.73139895650904
Eta0: 82.68619646712084
Learning Rate: optimal
Oversampling Method: none
----------------
Results: -0.8614364721673436
Configuration:
Loss: modified_huber
Tolerance: 0.0001
Alpha: 88.8091703403351
Eta0: 100.0
Learning Rate: optimal
Oversampling Method: none
----------------
Results: -0.8609542465713839




Configuration:
Loss: modified_huber
Tolerance: 0.07074205635924263
Alpha: 27.10772430149118
Eta0: 62.14635193281342
Learning Rate: optimal
Oversampling Method: none
----------------
Results: -0.8631838207213378
Configuration:
Loss: hinge
Tolerance: 0.0001
Alpha: 0.0001
Eta0: 100.0
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.901946344898238




Configuration:
Loss: modified_huber
Tolerance: 0.002759855059128887
Alpha: 45.48644005530842
Eta0: 29.856531952717074
Learning Rate: constant
Oversampling Method: none
----------------
Results: -0.7299621071626937
Configuration:
Loss: modified_huber
Tolerance: 0.0001
Alpha: 0.0001
Eta0: 94.51852927247108
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9040918789208977
Configuration:
Loss: modified_huber
Tolerance: 0.0001
Alpha: 0.0001
Eta0: 96.52658067819037
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9040822426773699




Configuration:
Loss: modified_huber
Tolerance: 0.05651538590388674
Alpha: 88.04948705147582
Eta0: 83.83257904284446
Learning Rate: constant
Oversampling Method: none
----------------
Results: -0.5




Configuration:
Loss: log_loss
Tolerance: 0.09797349740219594
Alpha: 25.279356784625413
Eta0: 94.35429955152946
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.8620772426153976
Configuration:
Loss: hinge
Tolerance: 0.0001
Alpha: 14.475581268792832
Eta0: 100.0
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.8616347567298922
Configuration:
Loss: log_loss
Tolerance: 0.0001
Alpha: 0.0001
Eta0: 100.0
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.904633909620903




Configuration:
Loss: modified_huber
Tolerance: 0.028409546997040343
Alpha: 43.190277855695655
Eta0: 51.720587173189514
Learning Rate: constant
Oversampling Method: none
----------------
Results: -0.4175326375782571
Configuration:
Loss: modified_huber
Tolerance: 0.1
Alpha: 0.0001
Eta0: 77.03797643415137
Learning Rate: optimal
Oversampling Method: none
----------------
Results: -0.8328147311794791
Configuration:
Loss: log_loss
Tolerance: 0.01330992318840209
Alpha: 0.0001
Eta0: 97.29356463687314
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9046286594999832
Configuration:
Loss: modified_huber
Tolerance: 0.03827487538316947
Alpha: 0.0001
Eta0: 100.0
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9040716072645439
Configuration:
Loss: log_loss
Tolerance: 0.02752451152810618
Alpha: 0.0001
Eta0: 100.0
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9046370825330046
...



Results: -0.8173386901753288
Configuration:
Loss: log_loss
Tolerance: 0.0012152344686190385
Alpha: 0.0001
Eta0: 82.77661878539719
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9046348174697317
Configuration:
Loss: modified_huber
Tolerance: 0.0003139335717886879
Alpha: 0.0001
Eta0: 43.65802253291381
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9040808640341927
Configuration:
Loss: log_loss
Tolerance: 0.009237749939532688
Alpha: 0.0001
Eta0: 51.85690752705104
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9046319809578832
Configuration:
Loss: log_loss
Tolerance: 0.0001
Alpha: 0.0001
Eta0: 91.67978914962576
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9046268083567219
Configuration:
Loss: modified_huber
Tolerance: 0.00014588168268514837
Alpha: 0.0001
Eta0: 72.5255016471863
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9040760...



Configuration:
Loss: perceptron
Tolerance: 0.01161843652817615
Alpha: 79.92387646095324
Eta0: 23.595318663012694
Learning Rate: invscaling
Oversampling Method: ada
----------------
Results: -0.5446017675365954
Configuration:
Loss: log_loss
Tolerance: 0.0001
Alpha: 0.0001
Eta0: 67.46548372206918
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.904637338659068
Configuration:
Loss: log_loss
Tolerance: 0.0048376819965610055
Alpha: 0.0001
Eta0: 100.0
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.904636052284467
Configuration:
Loss: log_loss
Tolerance: 0.004788453535830111
Alpha: 0.0001
Eta0: 68.11188412121521
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9046393616302697
Configuration:
Loss: log_loss
Tolerance: 0.0001
Alpha: 0.0001
Eta0: 79.35048516367462
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9046284096857216
...



Configuration:
Loss: log_loss
Tolerance: 0.05786727461447874
Alpha: 95.7155372582
Eta0: 27.522421479067678
Learning Rate: invscaling
Oversampling Method: ada
----------------
Results: -0.6141074685781666
Configuration:
Loss: log_loss
Tolerance: 0.0001
Alpha: 0.0001
Eta0: 59.71147177767316
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9046253088761781
Configuration:
Loss: log_loss
Tolerance: 0.0001
Alpha: 0.0001
Eta0: 97.2662161269348
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9046301014234678
Configuration:
Loss: log_loss
Tolerance: 0.0001
Alpha: 0.0001
Eta0: 76.08207940836463
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9046398311619033
Configuration:
Loss: log_loss
Tolerance: 0.0001
Alpha: 0.0001
Eta0: 44.86518155384958
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9046353024521667
...



Configuration:
Loss: log_loss
Tolerance: 0.04666413063325838
Alpha: 89.86077400115252
Eta0: 81.84668612322838
Learning Rate: constant
Oversampling Method: none
----------------
Results: -0.32098762522791285
Configuration:
Loss: modified_huber
Tolerance: 0.022166575015964994
Alpha: 0.0001
Eta0: 78.16382866283001
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9040927073815777
Configuration:
Loss: log_loss
Tolerance: 0.0006832334921429543
Alpha: 0.0001
Eta0: 91.98228196577834
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9046282022259802
Configuration:
Loss: log_loss
Tolerance: 0.0001
Alpha: 0.0001
Eta0: 85.88082121700933
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9046274019317341
Configuration:
Loss: log_loss
Tolerance: 0.022259731878947524
Alpha: 0.0001
Eta0: 60.22139142264595
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.9046307182677805
...



Configuration:
Loss: hinge
Tolerance: 0.050349274188254516
Alpha: 77.33255188269172
Eta0: 87.55481702004381
Learning Rate: constant
Oversampling Method: none
----------------
Results: -0.58200986440154




Configuration:
Loss: squared_hinge
Tolerance: 0.008016924546043322
Alpha: 47.66060914633564
Eta0: 63.42334634734921
Learning Rate: constant
Oversampling Method: ada
----------------
Results: -0.5
Configuration:
Loss: hinge
Tolerance: 0.061085069793886955
Alpha: 2.804993587496848
Eta0: 81.6332310025716
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.862327704354712
Configuration:
Loss: log_loss
Tolerance: 0.055365206013973094
Alpha: 1.7833655545930158
Eta0: 64.88863968787861
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.8664025735051513
Configuration:
Loss: squared_hinge
Tolerance: 0.07933086612972681
Alpha: 57.707035826049996
Eta0: 50.414394215772894
Learning Rate: adaptive
Oversampling Method: none
----------------
Results: -0.8629726024611158
Configuration:
Loss: modified_huber
Tolerance: 0.008481869207872925
Alpha: 0.0001
Eta0: 91.79000526529029
Learning Rate: adaptive
Oversampling Method: none
----------------
...

These tuning results are saved in a dataframe. The hyperparameter tuning results for all models can be viewed in the <b>./hyper_tuning</b> directory.

In [6]:
df_hyper_tuning.sort_values(by=['roc_auc'], ascending=False)

Unnamed: 0,loss,alpha,eta0,tol,learning_rate,oversampling_method,roc_auc
117,log_loss,0.000100,36.078762,0.004037,adaptive,none,0.904657
72,log_loss,0.000100,100.000000,0.005083,adaptive,none,0.904645
76,log_loss,0.000100,48.278881,0.007399,adaptive,none,0.904645
80,log_loss,0.000100,50.937640,0.007440,adaptive,none,0.904645
159,log_loss,0.000100,87.314916,0.000100,adaptive,none,0.904644
...,...,...,...,...,...,...,...
67,modified_huber,43.190278,51.720587,0.028410,constant,none,0.417533
315,log_loss,74.409319,96.326592,0.013963,constant,ada,0.416761
245,log_loss,76.353531,98.128656,0.012756,constant,ada,0.411720
448,log_loss,71.996511,94.993027,0.016532,constant,ada,0.405449


In [7]:
df_hyper_tuning.to_csv('hyper_tuning/sgd_hyper_tuning.csv', index=False, header=True, encoding='utf-8')

## 3. Results

The results showed that the following configuration produced the best results, with a mean cross-validated ROC AUC of 0.904657:

<ol>
    <li><b>loss: </b>log_loss</li>
    <li><b>alpha: </b>0.0001</li>
    <li><b>eta0: </b>36.078762</li>
    <li><b>tol: </b>0.004037</li>
    <li><b>learning_rate: </b>adaptive</li>
    <li><b>oversampling_method: </b>none</li>
</ol>
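The best configuration can also be read from the tuning table programmatically. The sketch below rebuilds a tiny stand-in DataFrame (`tuning_demo`, a hypothetical name) from three rows of the sorted table in Section 2, using the rounded values as displayed there. In the notebook itself the same lookup would presumably run on `df_hyper_tuning`.

```python
import pandas as pd

# Three rows copied from the sorted tuning table (rounded as displayed).
tuning_demo = pd.DataFrame(
    {
        "loss": ["log_loss", "log_loss", "modified_huber"],
        "alpha": [0.0001, 0.0001, 43.190278],
        "eta0": [36.078762, 100.0, 51.720587],
        "tol": [0.004037, 0.005083, 0.028410],
        "learning_rate": ["adaptive", "adaptive", "constant"],
        "oversampling_method": ["none", "none", "none"],
        "roc_auc": [0.904657, 0.904645, 0.417533],
    },
    index=[117, 72, 67],
)

# idxmax() returns the index label of the row with the highest ROC AUC.
best_row = tuning_demo.loc[tuning_demo["roc_auc"].idxmax()]
print(best_row.to_dict())
```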

## 4. Exporting Model

The model with the best configuration found during the hyperparameter tuning process is saved in the <b>./outputs</b> directory.

In [8]:
clf = SGDClassifier(class_weight='balanced', loss=res.x[0], alpha=res.x[1], eta0=res.x[2], max_iter=200, tol=res.x[3], learning_rate=res.x[4])

In [9]:
if res.x[5] == 'none':
    X = loans_train_df.loc[:, loans_train_df.columns != "loan_status"]
    y = loans_train_df["loan_status"]
elif res.x[5] == 'ada':
    X = loans_train_ada_df.loc[:, loans_train_ada_df.columns != "loan_status"]
    y = loans_train_ada_df["loan_status"]
    
clf.fit(X,y)

# Calculate the ROC AUC score
roc_auc = cross_val_score(clf, X, y, cv=3, scoring='roc_auc').mean()
print("Validation AUC:", roc_auc)

Validation AUC: 0.9046316593278497


In [10]:
from joblib import dump
clf.fit(X,y)
dump(clf, './outputs/sgd_model.joblib')

['./outputs/sgd_model.joblib']

## 5. Fitting into Test Data

Finally, we can now generate the predictions made by the SGDClassifier on the test data. This is done by isolating the features of the test samples, passing them to the model for prediction, and pairing the predicted class labels with the corresponding IDs of the test data.

In [11]:
##Import Testing Dataset
loans_test_df = pd.read_csv('./outputs/cleaned_loans_test.csv')
loans_test_df

Unnamed: 0,id,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,PERSON_HOME_OWNERSHIP_MORTGAGE,...,LOAN_GRADE_B,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below
0,58645,-0.755638,0.404383,-0.117198,2.836600,1.455666,2.189522,0,-1.364513,0,...,0,0,0,0,1,0,0,0,0,1
1,58646,-0.257331,1.127233,0.601227,0.140622,0.722635,-0.646041,1,-0.266122,1,...,0,1,0,0,0,0,0,0,0,1
2,58647,-0.257331,-1.418731,0.403331,-0.937769,1.748450,-0.318861,1,-1.364513,0,...,0,0,0,1,0,0,0,0,0,1
3,58648,0.905387,-0.300610,0.169270,-0.398573,-0.470628,-0.209801,0,0.620670,0,...,0,0,0,0,0,0,0,0,1,0
4,58649,-0.257331,1.259932,0.923860,1.039281,1.573370,-0.100741,1,-0.266122,1,...,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39093,97738,-0.921741,-1.332883,-0.486519,-1.117500,0.044689,-0.646041,0,-0.266122,1,...,1,0,0,0,0,0,0,0,0,1
39094,97739,-0.921741,-0.389963,0.601227,-0.398573,-1.782989,-0.100741,0,-0.721995,1,...,0,0,0,0,0,0,0,0,0,1
39095,97740,3.895232,0.098465,-1.896898,1.039281,-1.043084,0.989861,0,2.637868,1,...,0,0,0,0,0,0,0,1,0,0
39096,97741,-0.921741,-1.019656,0.169270,0.859550,1.425586,2.516703,1,-0.266122,1,...,0,0,1,0,0,0,0,0,0,1


In [12]:
X_test = loans_test_df.loc[:, loans_test_df.columns != "id"]
X_test

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,PERSON_HOME_OWNERSHIP_MORTGAGE,PERSON_HOME_OWNERSHIP_OTHER,...,LOAN_GRADE_B,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below
0,-0.755638,0.404383,-0.117198,2.836600,1.455666,2.189522,0,-1.364513,0,0,...,0,0,0,0,1,0,0,0,0,1
1,-0.257331,1.127233,0.601227,0.140622,0.722635,-0.646041,1,-0.266122,1,0,...,0,1,0,0,0,0,0,0,0,1
2,-0.257331,-1.418731,0.403331,-0.937769,1.748450,-0.318861,1,-1.364513,0,0,...,0,0,0,1,0,0,0,0,0,1
3,0.905387,-0.300610,0.169270,-0.398573,-0.470628,-0.209801,0,0.620670,0,0,...,0,0,0,0,0,0,0,0,1,0
4,-0.257331,1.259932,0.923860,1.039281,1.573370,-0.100741,1,-0.266122,1,0,...,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39093,-0.921741,-1.332883,-0.486519,-1.117500,0.044689,-0.646041,0,-0.266122,1,0,...,1,0,0,0,0,0,0,0,0,1
39094,-0.921741,-0.389963,0.601227,-0.398573,-1.782989,-0.100741,0,-0.721995,1,0,...,0,0,0,0,0,0,0,0,0,1
39095,3.895232,0.098465,-1.896898,1.039281,-1.043084,0.989861,0,2.637868,1,0,...,0,0,0,0,0,0,0,1,0,0
39096,-0.921741,-1.019656,0.169270,0.859550,1.425586,2.516703,1,-0.266122,1,0,...,0,0,1,0,0,0,0,0,0,1


In [13]:
y_pred = clf.predict(X_test)

In [14]:
loans_predictions_df = loans_test_df["id"].copy(deep=True)
loans_predictions_df = loans_predictions_df.to_frame()
loans_predictions_df.insert(1, 'loan_status', y_pred, True)

In [15]:
loans_predictions_df

Unnamed: 0,id,loan_status
0,58645,1
1,58646,0
2,58647,1
3,58648,0
4,58649,1
...,...,...
39093,97738,0
39094,97739,0
39095,97740,0
39096,97741,1


In [16]:
loans_predictions_df.to_csv('predictions/sgd_predictions.csv', index=False, header=True, encoding='utf-8')