# Loan Approval Prediction Kaggle Competition
## October 28, 2024
DICHOSO, Aaron Gabrielle C.

This notebook is part of a series documenting the methods used to train a Multilayer Perceptron (MLP) for the <a href="https://www.kaggle.com/competitions/playground-series-s4e10/"><b>2024 Loan Approval Prediction Kaggle Playground Series</b></a>.


This notebook focuses on the methods I used for model training and hyperparameter tuning.

To view the data cleaning itself, feel free to visit the following notebook: 

<ul>
    <li>1. Data Exploration, Cleaning, and Transformations</li>
</ul>

I chose an MLP because it can handle high-dimensional data that is not linearly separable, a capability that comes from the non-linear activation functions applied at each hidden node. Additionally, my preprocessing put all the features on roughly the same scale. My hope is that the MLP will learn the importance of each feature through its hidden layers.
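As a small illustration of why the non-linear activation matters, the sketch below hand-picks weights for a tiny one-hidden-layer network that reproduces XOR, a classic problem that no purely linear model can solve. The weights and variable names are chosen for illustration only and are unrelated to the model trained in this notebook.

```python
import numpy as np

# The four XOR inputs; the target outputs are 0, 1, 1, 0.
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hand-picked (not learned) weights for two hidden units.
W_hidden = np.array([[1.0, 1.0], [1.0, 1.0]])  # shape: (inputs, hidden units)
b_hidden = np.array([0.0, -1.0])
w_out = np.array([1.0, -2.0])

pre_activation = X_xor @ W_hidden + b_hidden

# With relu in the hidden layer, the network outputs XOR exactly.
xor_output = np.maximum(0, pre_activation) @ w_out

# Without the activation, the same weights collapse to a linear function of the inputs.
linear_output = pre_activation @ w_out

print(xor_output)     # [0. 1. 1. 0.]
print(linear_output)  # [2. 1. 1. 0.]
```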

# 1. Import Cleaned Datasets

We first import the cleaned datasets from the previous notebook. In this repository, the cleaned datasets are saved in the <b>./outputs</b> directory.

In [1]:
##Python libraries
import pandas as pd
import numpy as np
from sklearn.neural_network import MLPClassifier

Two training datasets are used: the original training set without oversampling, and a version that underwent ADASYN oversampling. I compare the model's performance on both during the hyperparameter tuning phase.

In [2]:
##Import Training Dataset
loans_train_df = pd.read_csv('./outputs/cleaned_loans_train.csv')
loans_train_df.head(5)

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,loan_status,PERSON_HOME_OWNERSHIP_MORTGAGE,...,LOAN_GRADE_B,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below
0,1.569797,-1.081318,-1.896898,-0.578305,0.390423,0.11738,0,1.719062,0,0,...,1,0,0,0,0,0,1,0,0,0
1,-0.921741,-0.05255,0.601227,-0.937769,0.896212,-0.973222,0,-1.364513,0,0,...,0,1,0,0,0,0,0,0,0,1
2,0.240977,-1.508084,0.92386,-0.578305,-0.470628,0.55362,0,1.185873,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0.407079,0.435878,1.579649,0.500086,0.27705,0.11738,0,0.087481,0,0,...,1,0,0,0,0,0,0,0,1,0
4,-0.921741,0.098465,-0.486519,-0.578305,-1.318902,-0.646041,0,-0.721995,0,0,...,0,0,0,0,0,0,0,0,0,1


In [3]:
loans_train_ada_df = pd.read_csv('./outputs/cleaned_loans_train_ada.csv')
loans_train_ada_df.head(5)

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,PERSON_HOME_OWNERSHIP_MORTGAGE,PERSON_HOME_OWNERSHIP_OTHER,...,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below,loan_status
0,1.569797,-1.081318,-1.896898,-0.578305,0.390423,0.11738,0,1.719062,0,0,...,0,0,0,0,0,1,0,0,0,0
1,-0.921741,-0.05255,0.601227,-0.937769,0.896212,-0.973222,0,-1.364513,0,0,...,1,0,0,0,0,0,0,0,1,0
2,0.240977,-1.508084,0.92386,-0.578305,-0.470628,0.55362,0,1.185873,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0.407079,0.435878,1.579649,0.500086,0.27705,0.11738,0,0.087481,0,0,...,0,0,0,0,0,0,0,1,0,0
4,-0.921741,0.098465,-0.486519,-0.578305,-1.318902,-0.646041,0,-0.721995,0,0,...,0,0,0,0,0,0,0,0,1,0


# 2. Hyperparameter Tuning

The MLP has several hyperparameters that should be tuned to maximize its performance.

In this notebook, I will focus on tuning the following <a href="https://scikit-learn.org/dev/modules/generated/sklearn.neural_network.MLPClassifier.html">hyperparameters of the MLP</a> for their explainability, namely:

<ol>
    <li><b>activation:</b> the activation function used for the hidden layers of the MLP</li>
    <li><b>tol:</b> the tolerance for the optimization, which is used to determine model convergence: training stops once the score fails to improve by at least this amount for several consecutive iterations</li>
    <li><b>beta_1:</b> the exponential decay rate for the first moment estimate of the Adam optimizer</li>
    <li><b>beta_2:</b> the exponential decay rate for the second moment estimate of the Adam optimizer</li>
    <li><b>epsilon:</b> the value used for numerical stability in Adam</li>
    <li><b>oversampling_method:</b> the type of oversampling applied to the dataset used</li>
</ol>

Additionally, I will not be tuning the shape or size of the hidden layers, as doing so is too resource-intensive for my system. For this reason, I use scikit-learn's default setting for the MLP's hidden layers.
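To see which defaults are left in place, the estimator's parameters can be inspected directly. This is a minimal sketch; the printed values depend on the installed scikit-learn version, although the default hidden layer has long been a single layer of 100 units.

```python
from sklearn.neural_network import MLPClassifier

# get_params() returns the constructor arguments, including untouched defaults.
defaults = MLPClassifier().get_params()

for name in ["hidden_layer_sizes", "activation", "solver", "tol", "beta_1", "beta_2", "epsilon"]:
    print(f"{name}: {defaults[name]}")
```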

The ranges of values I chose for these hyperparameters are as follows:
<ol>
    <li><b>activation:</b> 
        <ol>
            <li>the identity function (f(x) = x) </li>
            <li>the rectified linear unit (relu) function (f(x) = max(0,x))</li>
            <li>the logistic sigmoid function (f(x) = 1 / (1 + exp(-x)))</li>
            <li>the hyperbolic tangent function (f(x) = tanh(x))</li>
        </ol>
    </li>
    <li><b>tol:</b> [0.0001, 0.1]</li>
    <li><b>beta_1:</b> [0, 0.9999]</li>
    <li><b>beta_2:</b> [0, 0.9999]</li>
    <li><b>epsilon:</b> [1e-10, 0.1]</li>
    <li><b>oversampling_method:</b> Either using the ADASYN oversampled data set or the unbalanced labels dataset.</li>
</ol>
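For reference, the four activation functions listed above can be written in a few lines of NumPy. This is only a sketch of the formulas as stated; the function names are mine, and scikit-learn uses its own internal implementations.

```python
import numpy as np

def identity(x):
    # f(x) = x
    return x

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0, x)

def logistic(x):
    # f(x) = 1 / (1 + exp(-x))
    return 1 / (1 + np.exp(-x))

def hyperbolic_tangent(x):
    # f(x) = tanh(x)
    return np.tanh(x)

sample = np.array([-2.0, 0.0, 2.0])
for fn in (identity, relu, logistic, hyperbolic_tangent):
    print(fn.__name__, fn(sample))
```

Only the identity function is linear; with it, the MLP behaves like a linear model regardless of how many hidden layers it has.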

In [4]:
df_hyper_tuning = pd.DataFrame(columns=['activation', 'tol', 'beta_1', 'beta_2', 'epsilon', 'oversampling_method', 'roc_auc'])
df_hyper_tuning

Unnamed: 0,activation,tol,beta_1,beta_2,epsilon,oversampling_method,roc_auc


For the tuning method itself, I chose Bayesian optimization, which uses the results of past iterations to decide which configurations to test next. I chose it over GridSearch because the number of possible configurations is too large to search exhaustively on my system. I chose it over RandomSearch because Bayesian optimization uses my system resources more effectively: it concentrates the search on regions that are more likely to yield high performance rather than testing configurations at random.
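A rough count shows why an exhaustive grid is impractical here. The sketch below assumes a hypothetical grid with 10 values for each continuous hyperparameter, and compares it with the 100-call budget and 3-fold cross-validation used later in this notebook.

```python
# Hypothetical grid resolution: 10 values per continuous hyperparameter.
values_per_continuous = 10
n_continuous = 4        # tol, beta_1, beta_2, epsilon
n_activation = 4        # identity, relu, logistic, tanh
n_oversampling = 2      # none, ada
cv_folds = 3

grid_configs = n_activation * n_oversampling * values_per_continuous ** n_continuous
grid_fits = grid_configs * cv_folds

bayes_calls = 100
bayes_fits = bayes_calls * cv_folds

print(grid_configs, grid_fits, bayes_fits)  # 80000 240000 300
```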

To perform Bayesian optimization, I utilized the <a href="https://scikit-optimize.github.io/stable/modules/generated/skopt.gp_minimize.html">gp_minimize()</a> function provided by the scikit-optimize library.

Following the instructions in the documentation, I first initialized the search space for the optimization process. This is a list that specifies each hyperparameter to tune, its data type, and the range of possible values to test.
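One design choice worth noting for ranges that span many orders of magnitude, such as epsilon in [1e-10, 0.1], is how values are sampled. The NumPy sketch below compares uniform sampling with log-uniform sampling over that range. It is an illustration only, not what this notebook ran; scikit-optimize's Real dimension accepts a prior argument (for example prior='log-uniform') if the second behavior is wanted.

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = 1e-10, 0.1
n_draws = 10_000

# Uniform sampling: almost every draw lands near the top of the range.
uniform_draws = rng.uniform(low, high, size=n_draws)

# Log-uniform sampling: draw the exponent uniformly, then exponentiate.
log_uniform_draws = 10 ** rng.uniform(np.log10(low), np.log10(high), size=n_draws)

frac_small_uniform = np.mean(uniform_draws < 1e-4)
frac_small_log = np.mean(log_uniform_draws < 1e-4)

print("share of draws below 1e-4 (uniform):    ", frac_small_uniform)
print("share of draws below 1e-4 (log-uniform):", frac_small_log)
```

Under uniform sampling roughly 0.1% of draws fall below 1e-4, compared with about two thirds under log-uniform sampling.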

Afterwards, I created the objective function that gp_minimize() executes. The objective function receives a configuration sampled from the search space defined earlier and evaluates it. It then returns the negative of the Area Under the ROC Curve (AUC) obtained from 3-fold cross-validation, since gp_minimize() minimizes its objective. Invalid configurations are given a large positive value, which steers the Bayesian optimization process away from them.

I also limit the number of calls performed by the function due to my limited resources.
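The sign convention and the penalty for invalid configurations can be sketched on synthetic data without scikit-optimize. The dataset, the helper name negative_cv_auc, and the small max_iter are assumptions made to keep the example fast; passing error_score="raise" makes cross_val_score propagate fit errors so that the except branch is actually reached.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Small synthetic, imbalanced binary problem (a stand-in for the loan data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0
)

INVALID_PENALTY = 100000  # same sentinel value as in the notebook's objective

def negative_cv_auc(**mlp_params):
    """Return minus the mean 3-fold ROC AUC, or a large penalty if fitting fails."""
    try:
        model = MLPClassifier(max_iter=50, random_state=0, **mlp_params)
        scores = cross_val_score(
            model, X_demo, y_demo, cv=3, scoring="roc_auc", error_score="raise"
        )
        return -scores.mean()
    except ValueError:
        return INVALID_PENALTY

valid_result = negative_cv_auc(activation="relu")   # lies between -1 and 0
invalid_result = negative_cv_auc(beta_1=1.5)         # beta_1 must be in [0, 1)
print(valid_result, invalid_result)
```

A minimizer prefers more negative values, so a higher AUC is always favored, and the penalty sits far above any valid score.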

In [5]:
from sklearn.model_selection import cross_val_score
from skopt import gp_minimize
from skopt.space import Real, Integer, Categorical
from skopt.utils import use_named_args  # decorator applied to the objective function below

# Define the search space
search_space = [
    Categorical(['identity', 'logistic', 'tanh', 'relu'], name='activation'),
    Real(0.0001, 0.1, name='tol'),
    Real(0, 0.9999, name='beta_1'),
    Real(0, 0.9999, name='beta_2'),
    Real(1e-10, 0.1, name='epsilon'),
    Categorical(['none', 'ada'], name='oversampling_method')
]

# Define the objective function: returns the negative mean ROC AUC, since gp_minimize minimizes
@use_named_args(search_space)
def objective_function(activation, tol, beta_1, beta_2, epsilon, oversampling_method):
    print("================")
    print("Configuration:")
    print("Activation:", activation)
    print("Tolerance:", tol)
    print("Beta 1:", beta_1)
    print("Beta 2:", beta_2)
    print("Epsilon:", epsilon)
    print("Oversampling Method:", oversampling_method)
    print("----------------")
    try:
        if oversampling_method == 'none':
            X = loans_train_df.loc[:, loans_train_df.columns != "loan_status"]
            y = loans_train_df["loan_status"]
        elif oversampling_method == 'ada':
            X = loans_train_ada_df.loc[:, loans_train_ada_df.columns != "loan_status"]
            y = loans_train_ada_df["loan_status"]

        model = MLPClassifier(activation=activation, beta_1=beta_1, beta_2=beta_2, epsilon=epsilon, early_stopping=True, tol=tol)
        roc_auc = cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean()

        print("Results:", -roc_auc)
        df_hyper_tuning.loc[len(df_hyper_tuning.index)] = [activation, tol, beta_1, beta_2, epsilon, oversampling_method, roc_auc] 
        print("================")
        return -roc_auc
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print("Invalid Config")
        return 100000
        

# Perform Bayesian Optimization
res = gp_minimize(objective_function, search_space, n_calls=100)

# Print best parameters
print("Best parameters:", res.x)

Configuration:
Activation: relu
Tolerance: 0.016496541669711085
Beta 1: 0.83686881851584
Beta 2: 0.6243229026863576
Epsilon: 0.0643763560005888
Oversampling Method: none
----------------
Results: -0.9024631800084547
Configuration:
Activation: tanh
Tolerance: 0.07188842519043259
Beta 1: 0.25239852068545804
Beta 2: 0.06331496979783176
Epsilon: 0.0827951596286516
Oversampling Method: ada
----------------
Results: -0.8451398007840879
Configuration:
Activation: tanh
Tolerance: 0.058689947274921564
Beta 1: 0.30429335879779357
Beta 2: 0.050713741188076936
Epsilon: 0.009688964315612373
Oversampling Method: ada
----------------
Results: -0.8975646951036431
Configuration:
Activation: logistic
Tolerance: 0.08486644361963162
Beta 1: 0.5649601635734669
Beta 2: 0.6937460784681164
Epsilon: 0.07780812750903812
Oversampling Method: ada
----------------
Results: -0.8065784961475219
Configuration:
Activation: tanh
Tolerance: 0.09751733893305348
Beta 1: 0.2303872406234658
Beta 2: 0.5838965982413987
Epsilo



Results: -0.9275038236267017
Configuration:
Activation: relu
Tolerance: 0.0001
Beta 1: 0.00012311156366385894
Beta 2: 0.9999
Epsilon: 1e-10
Oversampling Method: ada
----------------
Results: -0.9322555933695185
Configuration:
Activation: relu
Tolerance: 0.0001
Beta 1: 0.30993297732485975
Beta 2: 0.9999
Epsilon: 1e-10
Oversampling Method: ada
----------------
Results: -0.9319013360653886
Configuration:
Activation: relu
Tolerance: 0.0001
Beta 1: 0.8500452746259589
Beta 2: 0.9999
Epsilon: 1e-10
Oversampling Method: ada
----------------
Results: -0.9321899556535405
Configuration:
Activation: relu
Tolerance: 0.0001
Beta 1: 0.05687258514855831
Beta 2: 0.9999
Epsilon: 1e-10
Oversampling Method: ada
----------------
Results: -0.930593651449367
Configuration:
Activation: relu
Tolerance: 0.0001
Beta 1: 0.18711018917457112
Beta 2: 0.9999
Epsilon: 1e-10
Oversampling Method: ada
----------------
Results: -0.9300837320970171
Configuration:
Activation: tanh
Tolerance: 0.0001
Beta 1: 0.030855148470857



Results: -0.9245296002923471
Configuration:
Activation: identity
Tolerance: 0.0001
Beta 1: 0.9999
Beta 2: 0.9999
Epsilon: 1e-10
Oversampling Method: ada
----------------
Results: -0.8828816944092496
Configuration:
Activation: relu
Tolerance: 0.0001
Beta 1: 0.9134738876810988
Beta 2: 0.0
Epsilon: 0.1
Oversampling Method: ada
----------------
Results: -0.917907417380963
Configuration:
Activation: logistic
Tolerance: 0.049497968360230196
Beta 1: 0.285741592739779
Beta 2: 0.49973194124452847
Epsilon: 0.07091366665840661
Oversampling Method: ada
----------------
Results: -0.805971388286101
Configuration:
Activation: relu
Tolerance: 0.06285582725087735
Beta 1: 0.0
Beta 2: 0.0
Epsilon: 1e-10
Oversampling Method: none
----------------
Results: -0.9265416985457074
Configuration:
Activation: logistic
Tolerance: 0.024931787241026116
Beta 1: 0.23835818181291782
Beta 2: 0.9988835719491237
Epsilon: 0.062215712669068585
Oversampling Method: ada
----------------
Results: -0.8097573475838056
Configuration:
Activation: identity
Tolerance: 0.0001
Beta 1: 0.10805971169590194
Beta 2: 0.0
Epsilon: 0.1
Oversampling Method: none
----------------
Results: -0.9012497376757782
Configuration:
Activation: relu
Tolerance: 0.0001
Beta 1: 0.2370082136662886
Beta 2: 0.9999
Epsilon: 0.013195810279745106
Oversampling Method: none
----------------
Results: -0.9276822102917807
Configuration:
Activation: relu
Tolerance: 0.0001
Beta 1: 0.42269310422491996
Beta 2: 0.9999
Epsilon: 0.012820548227350292
Oversampling Method: none
----------------
Results: -0.9258833374881613
Configuration:
Activation: identity
Tolerance: 0.0001
Beta 1: 0.8420323167823088
Beta 2: 0.0
Epsilon: 0.1
Oversampling Method: ada
----------------
Results: -0.8815848797294222
Configuration:
Activation: relu



Configuration:
Activation: tanh
Tolerance: 0.035185055582388595
Beta 1: 0.2986592493566309
Beta 2: 0.6850367320899565
Epsilon: 0.07172452477466389
Oversampling Method: ada
----------------
Results: -0.8483532713601029
Configuration:
Activation: relu
Tolerance: 0.0001
Beta 1: 0.694458518028561
Beta 2: 0.9999
Epsilon: 1e-10
Oversampling Method: ada
----------------
Results: -0.9307152202129779
Configuration:
Activation: relu
Tolerance: 0.0001
Beta 1: 0.6732914682888953
Beta 2: 0.9999
Epsilon: 1e-10
Oversampling Method: ada
----------------
Results: -0.9309694620892559
Configuration:
Activation: relu
Tolerance: 0.0001
Beta 1: 0.6451557870336069
Beta 2: 0.9999
Epsilon: 1e-10
Oversampling Method: ada
----------------
Results: -0.9318542689964855
Configuration:
Activation: relu
Tolerance: 0.0001
Beta 1: 0.9836126240967187
Beta 2: 0.9999
Epsilon: 1e-10
Oversampling Method: ada
----------------
Results: -0.933023823114381
Configuration:
Activation: relu
Tolerance: 0.0001
Beta 1: 0.989918124095



Configuration:
Activation: logistic
Tolerance: 0.04577552951391579
Beta 1: 0.37580226197446515
Beta 2: 0.7589280346682357
Epsilon: 0.019065101988103766
Oversampling Method: ada
----------------
Results: -0.8308208064964265
Configuration:
Activation: identity
Tolerance: 0.1
Beta 1: 0.7978260285734755
Beta 2: 0.0
Epsilon: 1e-10
Oversampling Method: ada
----------------
Results: -0.663329260068367
Configuration:
Activation: logistic
Tolerance: 0.1
Beta 1: 0.0
Beta 2: 0.0
Epsilon: 1e-10
Oversampling Method: none
----------------
Results: -0.9033197458181726
Configuration:
Activation: identity
Tolerance: 0.1
Beta 1: 0.0
Beta 2: 0.0
Epsilon: 0.1
Oversampling Method: none
----------------
Results: -0.8979024312379127
Configuration:
Activation: identity
Tolerance: 0.04459538221928184
Beta 1: 0.46847605110769813
Beta 2: 0.0
Epsilon: 1e-10
Oversampling Method: none
----------------
Results: -0.5559741939718994
Configuration:
Activation: identity
Tolerance: 0.09680195221800815
Beta 1: 0.179641729



Results: -0.9181037407618508
Configuration:
Activation: relu
Tolerance: 0.0001
Beta 1: 0.5280115285024548
Beta 2: 0.6154545723243792
Epsilon: 1e-10
Oversampling Method: none
----------------
Results: -0.927469181148301
Configuration:
Activation: relu
Tolerance: 0.0001
Beta 1: 0.9999
Beta 2: 0.0
Epsilon: 0.1
Oversampling Method: none
----------------
Results: -0.906046560975141
Configuration:
Activation: relu
Tolerance: 0.0001
Beta 1: 0.7499506843073827
Beta 2: 0.0
Epsilon: 0.037478595683078784
Oversampling Method: none
----------------
Results: -0.9261838947449257
Configuration:
Activation: relu
Tolerance: 0.1
Beta 1: 0.03207935994167879
Beta 2: 0.49294782062744513
Epsilon: 0.019164909806257912
Oversampling Method: none
----------------
Results: -0.9157959403333694
Best parameters: ['relu', 0.0001, 0.9650399101730573, 0.9999, 1e-10, 'ada']


These tuning results are saved in a dataframe. The hyperparameter tuning results for all models can be viewed in the <b>./hyper_tuning</b> directory.

In [17]:
df_hyper_tuning.sort_values(by=['roc_auc'], ascending=False)

Unnamed: 0,activation,tol,beta_1,beta_2,epsilon,oversampling_method,roc_auc
55,relu,0.000100,0.965040,0.999900,1.000000e-10,ada,0.933728
31,relu,0.000100,0.595415,0.866017,1.000000e-10,ada,0.933098
62,relu,0.000100,0.983613,0.999900,1.000000e-10,ada,0.933024
57,relu,0.000100,0.551232,0.999900,1.000000e-10,ada,0.932803
63,relu,0.000100,0.989918,0.999900,1.000000e-10,ada,0.932731
...,...,...,...,...,...,...,...
3,logistic,0.084866,0.564960,0.693746,7.780813e-02,ada,0.806578
44,logistic,0.049498,0.285742,0.499732,7.091367e-02,ada,0.805971
89,logistic,0.059797,0.999900,0.999900,1.000000e-01,none,0.690429
65,identity,0.100000,0.797826,0.000000,1.000000e-10,ada,0.663329


In [7]:
df_hyper_tuning.to_csv('hyper_tuning/mlp_hyper_tuning.csv', index=False, header=True, encoding='utf-8')
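Once the results are in a dataframe, a per-group summary makes it easier to compare activations or oversampling methods. The sketch below uses a hypothetical helper, best_per_group, and only the nine rows displayed in the excerpt above, so its numbers are not a summary of the full run; the same call could be applied to df_hyper_tuning.

```python
import pandas as pd

# Only the rows visible in the sorted excerpt above (top five and bottom four).
excerpt = pd.DataFrame(
    [
        ("relu", "ada", 0.933728),
        ("relu", "ada", 0.933098),
        ("relu", "ada", 0.933024),
        ("relu", "ada", 0.932803),
        ("relu", "ada", 0.932731),
        ("logistic", "ada", 0.806578),
        ("logistic", "ada", 0.805971),
        ("logistic", "none", 0.690429),
        ("identity", "ada", 0.663329),
    ],
    columns=["activation", "oversampling_method", "roc_auc"],
)

def best_per_group(results, group_cols):
    """Best ROC AUC reached within each group, highest first."""
    return results.groupby(group_cols)["roc_auc"].max().sort_values(ascending=False)

best_by_activation = best_per_group(excerpt, ["activation"])
print(best_by_activation)
```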

# 3. Results

The results showed that the following configuration produced the best results:

<ol>
    <li><b>activation:</b> relu</li>
    <li><b>tol:</b> 0.0001</li>
    <li><b>beta_1:</b> 0.965040</li>
    <li><b>beta_2:</b> 0.9999</li>
    <li><b>epsilon:</b> 1e-10</li>
    <li><b>oversampling_method: </b> ADASYN Oversampling</li>
</ol>

The relu activation function being present in the configuration that produced the best results is highly favorable, as it is significantly less expensive in terms of computational costs compared to the tanh and logistic sigmoid functions. 

Additionally, the utilization of the dataset that underwent ADASYN oversampling indicates that the oversampling method was useful for MLPs, allowing the model to better classify the two classes. This may be due to how ADASYN oversamples the dataset, creating samples for the minority class that are harder to differentiate from the majority class. Thus, during training, the MLP may have been able to better create the distinction between the two classes due to the synthetically generated edge cases.

Finally, the tol, beta_1, beta_2, and epsilon hyperparameters govern how the MLP is trained and when it is considered to have converged.

For epsilon, the search settled on the lowest allowed value of 1e-10. This may be because epsilon is only meant to provide numerical stability by preventing division by zero during parameter updates. Higher values may distort the updates and lead to improper adjustment of the MLP's weights.

The tolerance was likewise at its lowest allowed value of 1e-4. This is expected, as higher tolerance values may cause the MLP to stop training prematurely, treating the model as converged even though it could still learn more.
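The effect of the tolerance on stopping can be sketched with a simplified version of the rule used when early stopping is enabled: an epoch counts as stalled if the validation score does not beat the best score so far by at least tol, and training stops once more than n_iter_no_change consecutive epochs have stalled. The function name and the synthetic score sequence are assumptions for illustration.

```python
def epochs_before_stop(val_scores, tol, n_iter_no_change=10):
    """Return the epoch at which training would stop under the simplified rule."""
    best = float("-inf")
    stalled = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score < best + tol:
            stalled += 1   # improvement smaller than tol
        else:
            stalled = 0    # meaningful improvement resets the counter
        best = max(best, score)
        if stalled > n_iter_no_change:
            return epoch
    return len(val_scores)

# Synthetic validation scores that improve by 0.001 every epoch for 30 epochs.
slow_but_steady = [0.80 + 0.001 * i for i in range(30)]

stop_high_tol = epochs_before_stop(slow_but_steady, tol=0.1)
stop_low_tol = epochs_before_stop(slow_but_steady, tol=0.0001)
print(stop_high_tol, stop_low_tol)  # 12 30
```

With a tolerance of 0.1, the steady gains of 0.001 per epoch are never counted as progress and training stops at epoch 12. With a tolerance of 0.0001, all 30 epochs run.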

For beta_1 and beta_2, these are the decay rates that the Adam optimizer uses for its running averages of the gradient and of the squared gradient. Values closer to 1 give more weight to past gradients, so the averages change slowly and the weight updates are smoother. Because these averages start at 0, they are biased towards 0 early in training, and more so when the decay rates are close to 1; Adam's bias-correction step compensates for this. Smooth updates are desirable because we do not want to overcorrect the weights of the MLP. Rather, it is more favorable to update the weights gradually until the loss function reaches a minimum.
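To make the roles of beta_1, beta_2, and epsilon concrete, here is a minimal sketch of the textbook Adam moment estimates for a single weight. The function names are hypothetical, and scikit-learn's internal implementation may arrange the same computation differently.

```python
import math

def adam_moments(grads, beta_1, beta_2):
    """Track raw and bias-corrected moment estimates over a gradient sequence."""
    m = v = 0.0
    history = []
    for t, g in enumerate(grads, start=1):
        m = beta_1 * m + (1 - beta_1) * g       # running average of the gradient
        v = beta_2 * v + (1 - beta_2) * g * g   # running average of the squared gradient
        m_hat = m / (1 - beta_1 ** t)           # bias correction for starting at zero
        v_hat = v / (1 - beta_2 ** t)
        history.append((m, m_hat, v, v_hat))
    return history

def adam_step(m_hat, v_hat, learning_rate=0.001, epsilon=1e-8):
    """Size of the weight update for the given corrected moments."""
    return learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)

# Constant gradient of 1 with the best configuration's decay rates.
history = adam_moments([1.0] * 5, beta_1=0.965040, beta_2=0.9999)
first_raw_m, first_corrected_m = history[0][0], history[0][1]

step_small_eps = adam_step(1.0, 1.0, epsilon=1e-10)
step_large_eps = adam_step(1.0, 1.0, epsilon=0.1)

print(first_raw_m, first_corrected_m)
print(step_small_eps, step_large_eps)
```

With a constant gradient of 1, the raw first moment after one step is only 1 - beta_1 (about 0.035 here), while the bias-corrected value is 1. An epsilon of 0.1 shrinks the step by about 9% when the gradient scale is 1, and by much more when gradients are small.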

# 4. Exporting Model

The model with the best configuration found during the hyperparameter tuning process is saved in the <b>./outputs</b> directory.

In [8]:
clf = MLPClassifier(activation=res.x[0],early_stopping=True, tol=res.x[1], beta_1=res.x[2], beta_2=res.x[3], epsilon=res.x[4],verbose=True)

In [9]:
if res.x[5] == 'none':
    X = loans_train_df.loc[:, loans_train_df.columns != "loan_status"]
    y = loans_train_df["loan_status"]
elif res.x[5] == 'ada':
    X = loans_train_ada_df.loc[:, loans_train_ada_df.columns != "loan_status"]
    y = loans_train_ada_df["loan_status"]
    
clf.fit(X,y)

# Calculate the ROC AUC score
roc_auc = cross_val_score(clf, X, y, cv=3, scoring='roc_auc').mean()
print("Validation AUC from 3-fold cross validation:", roc_auc)

Iteration 1, loss = 0.48024405
Validation score: 0.799268
Iteration 2, loss = 0.40816724
Validation score: 0.822308
Iteration 3, loss = 0.38038027
Validation score: 0.830021
Iteration 4, loss = 0.36580037
Validation score: 0.839612
Iteration 5, loss = 0.35647445
Validation score: 0.840206
Iteration 6, loss = 0.34961774
Validation score: 0.839415
Iteration 7, loss = 0.34373210
Validation score: 0.847325
Iteration 8, loss = 0.33976193
Validation score: 0.845743
Iteration 9, loss = 0.33565637
Validation score: 0.844062
Iteration 10, loss = 0.33252108
Validation score: 0.850193
Iteration 11, loss = 0.32893781
Validation score: 0.853159
Iteration 12, loss = 0.32651435
Validation score: 0.853555
Iteration 13, loss = 0.32436091
Validation score: 0.851775
Iteration 14, loss = 0.32148650
Validation score: 0.856818
Iteration 15, loss = 0.32017166
Validation score: 0.853950
Iteration 16, loss = 0.31754698
Validation score: 0.857510
Iteration 17, loss = 0.31612302
Validation score: 0.857807
Iterat

Validation AUC from 3-fold cross validation: 0.9325054374797473

In [10]:
from joblib import dump
dump(clf, './outputs/mlp_model.joblib')

['./outputs/mlp_model.joblib']
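The saved file can later be restored with joblib's load function. The sketch below round-trips a small stand-in model through a temporary directory rather than touching the real ./outputs path; the synthetic data and variable names are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Small stand-in model trained on synthetic data.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
demo_clf = MLPClassifier(max_iter=50, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "mlp_model.joblib")
    dump(demo_clf, model_path)       # serialize the fitted estimator
    restored_clf = load(model_path)  # read it back into memory

# The restored model should reproduce the original predictions exactly.
same_predictions = bool(
    np.array_equal(demo_clf.predict(X_demo), restored_clf.predict(X_demo))
)
print(same_predictions)
```

Loading a joblib file is generally only reliable with the same scikit-learn version that produced it.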

# 5. Fitting into Test Data

Finally, we can now generate the MLP's predictions on the test data. This is done by isolating the features of the test samples, passing them to the MLP for prediction, and pairing the predicted class labels with the corresponding IDs of the test data. Predictions for the models can be found in the <b>./predictions</b> directory.

In [11]:
##Import Testing Dataset
loans_test_df = pd.read_csv('./outputs/cleaned_loans_test.csv')
loans_test_df

Unnamed: 0,id,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,PERSON_HOME_OWNERSHIP_MORTGAGE,...,LOAN_GRADE_B,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below
0,58645,-0.755638,0.404383,-0.117198,2.836600,1.455666,2.189522,0,-1.364513,0,...,0,0,0,0,1,0,0,0,0,1
1,58646,-0.257331,1.127233,0.601227,0.140622,0.722635,-0.646041,1,-0.266122,1,...,0,1,0,0,0,0,0,0,0,1
2,58647,-0.257331,-1.418731,0.403331,-0.937769,1.748450,-0.318861,1,-1.364513,0,...,0,0,0,1,0,0,0,0,0,1
3,58648,0.905387,-0.300610,0.169270,-0.398573,-0.470628,-0.209801,0,0.620670,0,...,0,0,0,0,0,0,0,0,1,0
4,58649,-0.257331,1.259932,0.923860,1.039281,1.573370,-0.100741,1,-0.266122,1,...,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39093,97738,-0.921741,-1.332883,-0.486519,-1.117500,0.044689,-0.646041,0,-0.266122,1,...,1,0,0,0,0,0,0,0,0,1
39094,97739,-0.921741,-0.389963,0.601227,-0.398573,-1.782989,-0.100741,0,-0.721995,1,...,0,0,0,0,0,0,0,0,0,1
39095,97740,3.895232,0.098465,-1.896898,1.039281,-1.043084,0.989861,0,2.637868,1,...,0,0,0,0,0,0,0,1,0,0
39096,97741,-0.921741,-1.019656,0.169270,0.859550,1.425586,2.516703,1,-0.266122,1,...,0,0,1,0,0,0,0,0,0,1


In [12]:
X_test = loans_test_df.loc[:, loans_test_df.columns != "id"]
X_test

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,PERSON_HOME_OWNERSHIP_MORTGAGE,PERSON_HOME_OWNERSHIP_OTHER,...,LOAN_GRADE_B,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below
0,-0.755638,0.404383,-0.117198,2.836600,1.455666,2.189522,0,-1.364513,0,0,...,0,0,0,0,1,0,0,0,0,1
1,-0.257331,1.127233,0.601227,0.140622,0.722635,-0.646041,1,-0.266122,1,0,...,0,1,0,0,0,0,0,0,0,1
2,-0.257331,-1.418731,0.403331,-0.937769,1.748450,-0.318861,1,-1.364513,0,0,...,0,0,0,1,0,0,0,0,0,1
3,0.905387,-0.300610,0.169270,-0.398573,-0.470628,-0.209801,0,0.620670,0,0,...,0,0,0,0,0,0,0,0,1,0
4,-0.257331,1.259932,0.923860,1.039281,1.573370,-0.100741,1,-0.266122,1,0,...,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39093,-0.921741,-1.332883,-0.486519,-1.117500,0.044689,-0.646041,0,-0.266122,1,0,...,1,0,0,0,0,0,0,0,0,1
39094,-0.921741,-0.389963,0.601227,-0.398573,-1.782989,-0.100741,0,-0.721995,1,0,...,0,0,0,0,0,0,0,0,0,1
39095,3.895232,0.098465,-1.896898,1.039281,-1.043084,0.989861,0,2.637868,1,0,...,0,0,0,0,0,0,0,1,0,0
39096,-0.921741,-1.019656,0.169270,0.859550,1.425586,2.516703,1,-0.266122,1,0,...,0,0,1,0,0,0,0,0,0,1


In [13]:
y_pred = clf.predict(X_test)

In [14]:
loans_predictions_df = loans_test_df["id"].copy(deep=True)
loans_predictions_df = loans_predictions_df.to_frame()
loans_predictions_df.insert(1, 'loan_status', y_pred, True)

In [15]:
loans_predictions_df

Unnamed: 0,id,loan_status
0,58645,1
1,58646,0
2,58647,1
3,58648,0
4,58649,0
...,...,...
39093,97738,0
39094,97739,0
39095,97740,0
39096,97741,1


In [16]:
loans_predictions_df.to_csv('predictions/mlp_predictions.csv', index=False, header=True, encoding='utf-8')
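The cells above write hard 0/1 class labels. Since the tuning metric in this notebook is ROC AUC, which depends on how samples are ranked, a probability score per row preserves more information than a thresholded label. The sketch below shows that alternative on synthetic data using predict_proba. Whether the competition expects probabilities is an assumption to check against its evaluation page, and the ids and variable names here are made up.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: first 150 rows for training, last 50 as a mock test set.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
demo_clf = MLPClassifier(max_iter=50, random_state=0).fit(X_demo[:150], y_demo[:150])

# predict_proba returns one column per class; pick the column for class 1.
positive_column = list(demo_clf.classes_).index(1)
demo_scores = demo_clf.predict_proba(X_demo[150:])[:, positive_column]

demo_submission = pd.DataFrame({"id": range(150, 200), "loan_status": demo_scores})
print(demo_submission.head())
```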