# Loan Approval Prediction Kaggle Competition
## October 28, 2024
DICHOSO, Aaron Gabrielle C.

This notebook is part of a series documenting the methods used to train a KNN classifier for the <a href="https://www.kaggle.com/competitions/playground-series-s4e10/"><b>2024 Loan Approval Prediction Kaggle Playground Series</b></a>.


This notebook focuses on the methods I used for model training and hyperparameter tuning.

To view the data cleaning process, feel free to visit the following notebook:

<ul>
    <li>1. Data Exploration, Cleaning, and Transformations</li>
</ul>

I chose to test a KNN model because of how I cleaned the data. At the end of the data cleaning process, all the features of the dataset were on roughly the same scale. Models that use the distance between data points as a criterion for classification benefit from this, since no single feature dominates the distance computation. Additionally, KNN is non-parametric and does not assume any distribution for the data. This is useful because many of the dataset features, even after cleaning, are not normally distributed. Finally, the transformations applied during data cleaning should reduce the influence of outliers, which matters because KNN can be highly affected by noisy data.
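To illustrate why feature scale matters for a distance-based model, here is a minimal sketch in plain Python. The rows and the helper names (`standardize`, `nearest`) are hypothetical and are not taken from the competition data or from the cleaning notebook; they only show how the nearest neighbor of a point can change once the columns are standardized.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(rows):
    """Column-wise z-score using the population standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in rows]

def nearest(query_idx, rows):
    """Index of the row closest to rows[query_idx], excluding itself."""
    q = rows[query_idx]
    others = [i for i in range(len(rows)) if i != query_idx]
    return min(others, key=lambda i: euclidean(q, rows[i]))

# Hypothetical (age in years, income in dollars) rows
raw = [
    (25, 50_000),  # query point
    (60, 50_500),  # very different age, almost the same income
    (26, 52_000),  # almost the same age, slightly different income
    (45, 90_000),
]

raw_nn = nearest(0, raw)                  # income dominates the distance
scaled_nn = nearest(0, standardize(raw))  # both features contribute

print("Nearest neighbor on raw features:   ", raw_nn)
print("Nearest neighbor on scaled features:", scaled_nn)
```

On the raw values the income column dominates the distance, so the 60-year-old is treated as the closest match; after standardization the 26-year-old becomes the nearest neighbor.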

# 1. Import Cleaned Datasets

We first import the cleaned datasets from the previous notebook. In this repository, the cleaned datasets are saved in the <b>./outputs</b> directory.

In [1]:
##Python libraries
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

Two kinds of training datasets are used: the training set without oversampling, and the training set that underwent ADASYN oversampling. I compare the performance of the model on these two datasets during the hyperparameter tuning phase.

In [2]:
##Import Training Dataset
loans_train_df = pd.read_csv('./outputs/cleaned_loans_train.csv')
loans_train_df.head(5)

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,loan_status,PERSON_HOME_OWNERSHIP_MORTGAGE,...,LOAN_GRADE_B,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below
0,1.569797,-1.081318,-1.896898,-0.578305,0.390423,0.11738,0,1.719062,0,0,...,1,0,0,0,0,0,1,0,0,0
1,-0.921741,-0.05255,0.601227,-0.937769,0.896212,-0.973222,0,-1.364513,0,0,...,0,1,0,0,0,0,0,0,0,1
2,0.240977,-1.508084,0.92386,-0.578305,-0.470628,0.55362,0,1.185873,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0.407079,0.435878,1.579649,0.500086,0.27705,0.11738,0,0.087481,0,0,...,1,0,0,0,0,0,0,0,1,0
4,-0.921741,0.098465,-0.486519,-0.578305,-1.318902,-0.646041,0,-0.721995,0,0,...,0,0,0,0,0,0,0,0,0,1


In [3]:
loans_train_ada_df = pd.read_csv('./outputs/cleaned_loans_train_ada.csv')
loans_train_ada_df.head(5)

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,PERSON_HOME_OWNERSHIP_MORTGAGE,PERSON_HOME_OWNERSHIP_OTHER,...,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below,loan_status
0,1.569797,-1.081318,-1.896898,-0.578305,0.390423,0.11738,0,1.719062,0,0,...,0,0,0,0,0,1,0,0,0,0
1,-0.921741,-0.05255,0.601227,-0.937769,0.896212,-0.973222,0,-1.364513,0,0,...,1,0,0,0,0,0,0,0,1,0
2,0.240977,-1.508084,0.92386,-0.578305,-0.470628,0.55362,0,1.185873,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0.407079,0.435878,1.579649,0.500086,0.27705,0.11738,0,0.087481,0,0,...,0,0,0,0,0,0,0,1,0,0
4,-0.921741,0.098465,-0.486519,-0.578305,-1.318902,-0.646041,0,-0.721995,0,0,...,0,0,0,0,0,0,0,0,1,0


# 2. Hyperparameter Tuning

The KNN classifier has several hyperparameters that should be tuned to maximize its performance.

In this notebook, I will focus on tuning the following <a href="https://scikit-learn.org/dev/modules/generated/sklearn.neighbors.KNeighborsClassifier.html">hyperparameters of the KNN classifier</a> for their explainability, together with the choice of training dataset, namely:

<ol>
    <li><b>n_neighbors:</b> the number of neighbors to consider when searching for the class label</li>
    <li><b>weights:</b> the weight function applied to the votes of the k nearest neighbors</li>
    <li><b>algorithm:</b> the algorithm used for finding the k nearest neighbors</li>
    <li><b>metric:</b> the distance metric used</li>
    <li><b>oversampling_method:</b> The type of oversampling done in the dataset used.</li>
</ol>

The ranges of values I chose for these hyperparameters are as follows:
<ol>
    <li><b>n_neighbors:</b> [1, 320]</li>
    <li><b>weights:</b> "uniform" gives all the neighbors the same weight in finding the target label (the target label becomes the mode of the neighbors' labels). "distance" weights each neighbor by the inverse of its distance to the target point, so closer neighbors have more influence.</li>
    <li><b>algorithm:</b> "auto" causes the function to determine the algorithm to use by itself, "ball_tree" uses the <a href="https://scikit-learn.org/dev/modules/generated/sklearn.neighbors.BallTree.html#sklearn.neighbors.BallTree">BallTree algorithm</a>, and "kd_tree" uses the <a href="https://scikit-learn.org/dev/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree">KDTree algorithm</a>.</li>
    <li><b>metric:</b> all available distance metrics that are compatible with the dataset, according to the <a href="https://scikit-learn.org/dev/modules/generated/sklearn.neighbors.KNeighborsClassifier.html">scikit-learn documentation</a>.</li>
    <li><b>oversampling_method:</b> Either using the ADASYN oversampled data set or the unbalanced labels dataset.</li>
</ol>
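As a rough illustration of how the <b>weights</b> setting can change a prediction, the sketch below implements a toy one-dimensional KNN vote in plain Python. The function `knn_predict` and the data are hypothetical and simplified; this is not scikit-learn's implementation, only the idea behind "uniform" and "distance" voting.

```python
from collections import defaultdict

def knn_predict(train, query, k, weights="uniform"):
    """Toy 1-D KNN vote. `train` is a list of (value, label) pairs."""
    neighbors = sorted(train, key=lambda pl: abs(pl[0] - query))[:k]
    votes = defaultdict(float)
    for value, label in neighbors:
        d = abs(value - query)
        if weights == "uniform":
            votes[label] += 1.0      # every neighbor counts equally
        else:
            if d == 0:
                return label         # simplification: an exact match decides
            votes[label] += 1.0 / d  # closer neighbors count more
    return max(votes, key=votes.get)

# Hypothetical training points: one close neighbor of class 1,
# two farther neighbors of class 0
train = [(1.0, 1), (4.0, 0), (5.0, 0)]
query = 2.0

uniform_label = knn_predict(train, query, k=3, weights="uniform")
distance_label = knn_predict(train, query, k=3, weights="distance")

print("uniform :", uniform_label)
print("distance:", distance_label)
```

With uniform voting the two farther class-0 points outvote the single class-1 point, while with inverse-distance voting the close class-1 point carries a weight of 1.0 against 0.5 + 0.33 for class 0.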

In [4]:
df_hyper_tuning = pd.DataFrame(columns=['n_neighbors', 'weights', 'algorithm', 'metric', 'oversampling_method', 'roc_auc'])

For the specific method of hyperparameter tuning, I chose to perform Bayesian optimization, a tuning method that uses the results of past iterations to decide which configurations to test next. I chose this method over GridSearch because exhaustively searching the large number of possible configurations was not feasible on my system. Additionally, I chose it over RandomSearch because Bayesian optimization should use my system resources more effectively by concentrating the search on regions that are more likely to give high performance, as opposed to testing configurations at random.

To perform bayesian optimization, I utilized the <a href="https://scikit-optimize.github.io/stable/modules/generated/skopt.gp_minimize.html">gp_minimize()</a> function provided by the scikit-optimize library.

Following the instructions found in the documentation, I first initialized the search space to be used in the optimization process. This involved creating a list that specifies the hyperparameters to tune, the data type of each hyperparameter, and the range of possible values to test in the hyperparameter tuning process.

Afterwards, I created the objective function that the gp_minimize() function will execute. The objective function receives a configuration sampled from the search space defined earlier and evaluates it. It then returns the negative of the mean Area Under the ROC Curve (ROC AUC) obtained from 3-fold cross-validation, since gp_minimize() minimizes its objective. Invalid configurations encountered during tuning are given a large positive value (100000), which steers the Bayesian optimization process away from such configurations.

I also limit the number of calls performed by the function due to my limited resources.
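The sign convention and the penalty for invalid configurations can be sketched without any optimization library. In the example below, `score` is a hypothetical stand-in for the cross-validated ROC AUC, and a plain loop over a few hand-picked candidates replaces gp_minimize(); it is not Bayesian optimization, only an illustration of what the minimizer sees.

```python
PENALTY = 1.0  # any value worse than every valid (negative) objective value

def score(config):
    """Hypothetical score standing in for the mean cross-validated ROC AUC."""
    k, metric = config
    if metric == "unsupported":
        raise ValueError("metric not supported by the chosen algorithm")
    # Toy score in [0, 1] that peaks at k = 16
    return max(0.0, 1.0 - abs(k - 16) / 100.0)

def objective(config):
    """Value handed to a minimizer: lower is better."""
    try:
        return -score(config)  # negate so that a higher score gives a lower value
    except ValueError:
        return PENALTY         # invalid configurations look bad to the minimizer

candidates = [(1, "l1"), (16, "l1"), (16, "unsupported"), (50, "l1")]
results = [(objective(c), c) for c in candidates]
best_value, best_config = min(results)

print("Best configuration:", best_config)
print("Best score:", -best_value)
```

Negating the returned minimum recovers the original score, which is why the tuning log prints negative ROC AUC values.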

In [5]:
from sklearn.model_selection import cross_val_score
from skopt import gp_minimize
from skopt.space import Real, Integer, Categorical
from skopt.utils import use_named_args
# Define the search space
search_space = [
    Integer(1, 320, name='n_neighbors'),
    Categorical(['uniform', 'distance'], name='weights'),
    Categorical(['auto', 'ball_tree', 'kd_tree'], name='algorithm'),
    Categorical(['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan', 'nan_euclidean'], name='metric'),
    Categorical(['none', 'ada'], name='oversampling_method')
]

# Define the objective function (returns the negative mean ROC AUC, since gp_minimize minimizes)
@use_named_args(search_space)
def objective_function(n_neighbors, weights, algorithm, metric, oversampling_method):
    print("================")
    print("Configuration:")
    print("K:", n_neighbors)
    print("Weights:", weights)
    print("Algorithm:", algorithm)
    print("Distance Metric:", metric)
    print("Oversampling Method:", oversampling_method)
    print("----------------")
    try:
        if oversampling_method == 'none':
            X = loans_train_df.loc[:, loans_train_df.columns != "loan_status"]
            y = loans_train_df["loan_status"]
        elif oversampling_method == 'ada':
            X = loans_train_ada_df.loc[:, loans_train_ada_df.columns != "loan_status"]
            y = loans_train_ada_df["loan_status"]
        
        model = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights, algorithm=algorithm, metric=metric)
        roc_auc = cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean()

        print("Results:", -roc_auc)
        print("================")
        df_hyper_tuning.loc[len(df_hyper_tuning.index)] = [n_neighbors, weights, algorithm, metric, oversampling_method, roc_auc] 
        return -roc_auc
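    except Exception:
        # Unsupported hyperparameter combinations raise an error during fitting or scoring;
        # return a large positive value so the optimizer avoids them
        print("Invalid Config")
        return 100000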
        

# Perform Bayesian Optimization
res = gp_minimize(objective_function, search_space, n_calls=100)

# Print best parameters
print("Best parameters:", res.x)


Configuration:
K: 231
Weights: distance
Algorithm: ball_tree
Distance Metric: manhattan
Oversampling Method: none
----------------
Results: -0.9179495626731932
Configuration:
K: 22
Weights: distance
Algorithm: ball_tree
Distance Metric: euclidean
Oversampling Method: ada
----------------
Results: -0.9390929956793529
Configuration:
K: 79
Weights: uniform
Algorithm: kd_tree
Distance Metric: l2
Oversampling Method: ada
----------------
Results: -0.8755127006969278
Configuration:
K: 301
Weights: distance
Algorithm: kd_tree
Distance Metric: cosine
Oversampling Method: ada
----------------
Invalid Config
Configuration:
K: 44
Weights: uniform
Algorithm: auto
Distance Metric: l2
Oversampling Method: ada
----------------
Results: -0.8919358951743788
Configuration:
K: 275
Weights: uniform
Algorithm: ball_tree
Distance Metric: cityblock
Oversampling Method: none
----------------
Results: -0.9152299066862611
Configuration:
K: 282
Weights: uniform
Algorithm: ball_tree
Distance Metric: cityblock
Ove



Configuration:
K: 199
Weights: uniform
Algorithm: kd_tree
Distance Metric: cityblock
Oversampling Method: ada
----------------
Results: -0.8682792006998836
Configuration:
K: 162
Weights: uniform
Algorithm: ball_tree
Distance Metric: cityblock
Oversampling Method: ada
----------------
Results: -0.8722296145923103
Configuration:
K: 183
Weights: uniform
Algorithm: auto
Distance Metric: euclidean
Oversampling Method: ada
----------------
Results: -0.8570039270635439
Configuration:
K: 156
Weights: uniform
Algorithm: ball_tree
Distance Metric: cityblock
Oversampling Method: ada
----------------
Results: -0.8729531345870883
Configuration:
K: 320
Weights: distance
Algorithm: auto
Distance Metric: cityblock
Oversampling Method: none
----------------
Results: -0.9166357432267693




Configuration:
K: 39
Weights: uniform
Algorithm: auto
Distance Metric: l2
Oversampling Method: none
----------------
Results: -0.9121337408642917
Configuration:
K: 183
Weights: uniform
Algorithm: auto
Distance Metric: l2
Oversampling Method: ada
----------------
Results: -0.8570038883401457
Configuration:
K: 161
Weights: uniform
Algorithm: ball_tree
Distance Metric: cityblock
Oversampling Method: ada
----------------
Results: -0.8723081309433699




Configuration:
K: 199
Weights: uniform
Algorithm: ball_tree
Distance Metric: cosine
Oversampling Method: ada
----------------
Invalid Config
Configuration:
K: 164
Weights: uniform
Algorithm: ball_tree
Distance Metric: cityblock
Oversampling Method: ada
----------------
Results: -0.8719907934033455




Configuration:
K: 59
Weights: distance
Algorithm: ball_tree
Distance Metric: manhattan
Oversampling Method: none
----------------
Results: -0.9175382892672506
Configuration:
K: 160
Weights: uniform
Algorithm: ball_tree
Distance Metric: cityblock
Oversampling Method: ada
----------------
Results: -0.8724412748373029




Configuration:
K: 16
Weights: distance
Algorithm: kd_tree
Distance Metric: cityblock
Oversampling Method: ada
----------------
Results: -0.953963287686896
Configuration:
K: 150
Weights: uniform
Algorithm: ball_tree
Distance Metric: cityblock
Oversampling Method: ada
----------------
Results: -0.873733004716777
Configuration:
K: 146
Weights: uniform
Algorithm: auto
Distance Metric: cityblock
Oversampling Method: ada
----------------
Results: -0.8742655838181932




Configuration:
K: 73
Weights: uniform
Algorithm: kd_tree
Distance Metric: cosine
Oversampling Method: none
----------------
Invalid Config
Configuration:
K: 320
Weights: distance
Algorithm: auto
Distance Metric: l1
Oversampling Method: ada
----------------
Results: -0.8924211861962984




Configuration:
K: 87
Weights: uniform
Algorithm: ball_tree
Distance Metric: euclidean
Oversampling Method: ada
----------------
Results: -0.8731232967171286




Configuration:
K: 265
Weights: distance
Algorithm: auto
Distance Metric: cityblock
Oversampling Method: none
----------------
Results: -0.9173794370978601




Configuration:
K: 320
Weights: distance
Algorithm: ball_tree
Distance Metric: manhattan
Oversampling Method: none
----------------
Results: -0.9166357432267693




Configuration:
K: 268
Weights: distance
Algorithm: kd_tree
Distance Metric: nan_euclidean
Oversampling Method: none
----------------
Invalid Config
Configuration:
K: 168
Weights: uniform
Algorithm: ball_tree
Distance Metric: l1
Oversampling Method: ada
----------------
Results: -0.8715162471496339
Configuration:
K: 164
Weights: uniform
Algorithm: auto
Distance Metric: l1
Oversampling Method: ada
----------------
Results: -0.8719909852600595
Configuration:
K: 155
Weights: uniform
Algorithm: auto
Distance Metric: l1
Oversampling Method: ada
----------------
Results: -0.873091481494695
Configuration:
K: 165
Weights: uniform
Algorithm: ball_tree
Distance Metric: l1
Oversampling Method: ada
----------------
Results: -0.871881941975678
Configuration:
K: 258
Weights: distance
Algorithm: auto
Distance Metric: cityblock
Oversampling Method: none
----------------
Results: -0.9174116318835434
Configuration:
K: 1
Weights: distance
Algorithm: kd_tree
Distance Metric: euclidean
Oversampling Method: 



Configuration:
K: 129
Weights: distance
Algorithm: auto
Distance Metric: l1
Oversampling Method: none
----------------
Results: -0.9189732179421085
Configuration:
K: 307
Weights: distance
Algorithm: auto
Distance Metric: cityblock
Oversampling Method: none
----------------
Results: -0.9167917712068819
Configuration:
K: 167
Weights: uniform
Algorithm: ball_tree
Distance Metric: cityblock
Oversampling Method: ada
----------------
Results: -0.8716464134547349
Configuration:
K: 260
Weights: distance
Algorithm: auto
Distance Metric: cityblock
Oversampling Method: none
----------------
Results: -0.9174179316722744
Configuration:
K: 168
Weights: uniform
Algorithm: ball_tree
Distance Metric: cityblock
Oversampling Method: ada
----------------
Results: -0.8715162471496339
Configuration:
K: 163
Weights: uniform
Algorithm: ball_tree
Distance Metric: cityblock
Oversampling Method: ada
----------------
Results: -0.8721050960324241
Configuration:
K: 165
Weights: uniform
Algorithm: auto
Distance Metr



Configuration:
K: 191
Weights: uniform
Algorithm: auto
Distance Metric: manhattan
Oversampling Method: ada
----------------
Results: -0.8690616858391765
Configuration:
K: 320
Weights: distance
Algorithm: kd_tree
Distance Metric: l2
Oversampling Method: none
----------------
Results: -0.9131882037022758
Best parameters: [16, 'distance', 'kd_tree', 'cityblock', 'ada']


These tuning results are saved in a dataframe. The hyperparameter tuning results for all models can be viewed in the <b>./hyper_tuning</b> directory.

In [6]:
df_hyper_tuning.sort_values(by=['roc_auc'], ascending=False)

Unnamed: 0,n_neighbors,weights,algorithm,metric,oversampling_method,roc_auc
72,16,distance,kd_tree,cityblock,ada,0.953963
1,22,distance,ball_tree,euclidean,ada,0.939093
16,1,uniform,ball_tree,cityblock,ada,0.926088
11,1,distance,auto,cityblock,ada,0.926088
23,1,uniform,kd_tree,manhattan,ada,0.926088
...,...,...,...,...,...,...
28,1,distance,kd_tree,cityblock,none,0.791903
18,1,distance,ball_tree,cityblock,none,0.791903
56,1,uniform,auto,manhattan,none,0.791903
33,1,distance,auto,l1,none,0.791903


In [7]:
df_hyper_tuning.to_csv('hyper_tuning/knn_hyper_tuning.csv', index=False, header=True, encoding='utf-8')

# 3. Results

The results showed that the following configuration produced the best results:

<ol>
    <li><b>n_neighbors:</b> 16</li>
    <li><b>weights:</b> distance</li>
    <li><b>algorithm:</b> kd_tree</li>
    <li><b>metric:</b> cityblock</li>
    <li><b>oversampling_method: </b>ADASYN Oversampling</li>
</ol>

From the results, it can be seen that using the ADASYN oversampled dataset performed the best, suggesting that the oversampling method was useful for KNN. This may be due to how ADASYN oversamples the dataset, creating samples for the minority class in regions where it is harder to differentiate from the majority class. Thus, the KNN may have had a more comprehensive view of the feature space, making the boundary between the two classes clearer. Note, however, that the cross-validation folds for this dataset also contain synthetic samples, so its scores are not directly comparable with those obtained on the original dataset.

Distance weighting was also selected for the neighbors' votes. With distance weighting, data points closer to the target point have their labels weighted more heavily than neighbors farther away. In concept, this makes sense, as data points that are more similar to one another will tend to be found near each other in the feature space.

The number of neighbors chosen from hyperparameter tuning is fairly low at 16. This means that to find the label of a target point, the KNN model takes into account the labels of the 16 closest data points in the feature space. A low number of neighbors keeps the prediction local by only surveying the immediate area around the target point. Too high a <i>k</i> value would allow data points that are not close to the target point to exert influence on the classification, while a very low <i>k</i> value makes the prediction more sensitive to noise, which is consistent with the <i>k</i> = 1 configurations scoring lower in the table above. Additionally, a low <i>k</i> value would help in incremental learning applications, where new data may be introduced to the feature space. While this may not apply to my situation, considering only a small number of neighbors would speed up such dynamic applications.

The KDTree algorithm was also chosen by the tuning process. KDTree divides the data by splitting it along the feature axes, while BallTree divides it into nested hyperspheres. Therefore, <a href="https://www.geeksforgeeks.org/ball-tree-and-kd-tree-algorithms/">KDTree is generally faster and uses less memory than BallTree</a> due to its simpler construction, which lends itself to faster and more lightweight computations for KNNs. Since the algorithm only changes how the nearest neighbors are searched for, this choice mainly affects computation time rather than the predictions themselves.

Finally, the cityblock distance was used, which is the sum of the absolute differences between the corresponding dimensions of two vectors. In scikit-learn, "cityblock", "l1", and "manhattan" are names for the same metric, as are "euclidean" and "l2". Because cityblock distance requires no squaring or square root, it may be slightly cheaper to compute than the euclidean distance.
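The axis-splitting idea behind a k-d tree can be sketched in a few lines of plain Python. This is a simplified illustration with hypothetical helper names (`build_kdtree`, `nearest`) and toy points, not scikit-learn's KDTree; it splits on the median along alternating axes and prunes branches that cannot contain a closer point under the cityblock distance.

```python
def cityblock(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def build_kdtree(points, depth=0):
    """Recursively split the points on the median along alternating axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return the stored point closest to `query` (cityblock distance)."""
    if node is None:
        return best
    if best is None or cityblock(node["point"], query) < cityblock(best, query):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    # The far side can only hold a closer point if the splitting plane
    # is nearer to the query than the current best match
    if abs(diff) < cityblock(best, query):
        best = nearest(far, query, best)
    return best

# Hypothetical 2-D points
points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
query = (9, 1.5)
found = nearest(tree, query)

print("Nearest point:", found)
```

The pruning step is what lets a tree-based search skip large parts of the dataset, whereas a brute-force search would compute the distance to every point.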
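For reference, the two distance families discussed above can be written out in plain Python. The vectors are hypothetical and chosen so the arithmetic is easy to check by hand.

```python
import math

def cityblock(a, b):
    """Cityblock (l1, manhattan) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Euclidean (l2) distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

p = (0.0, 0.0)
q = (3.0, 4.0)

d_cityblock = cityblock(p, q)  # |3| + |4|
d_euclidean = euclidean(p, q)  # sqrt(3^2 + 4^2)

print("cityblock:", d_cityblock)
print("euclidean:", d_euclidean)
```

For these two points the cityblock distance is 7.0 and the euclidean distance is 5.0.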

# 4. Exporting Model

The model with the best configuration found during the hyperparameter tuning process is saved in the <b>./outputs</b> directory.

In [8]:
clf = KNeighborsClassifier(n_neighbors=res.x[0], weights=res.x[1], algorithm=res.x[2], metric=res.x[3])

In [9]:
if res.x[4] == 'none':
    X = loans_train_df.loc[:, loans_train_df.columns != "loan_status"]
    y = loans_train_df["loan_status"]
elif res.x[4] == 'ada':
    X = loans_train_ada_df.loc[:, loans_train_ada_df.columns != "loan_status"]
    y = loans_train_ada_df["loan_status"]

clf.fit(X,y)

# Calculate the ROC AUC score
roc_auc = cross_val_score(clf, X, y, cv=3, scoring='roc_auc').mean()
print("Validation AUC:", roc_auc)

Validation AUC: 0.953963287686896



In [10]:
from joblib import dump
clf.fit(X,y)
dump(clf, './outputs/knn_model.joblib')

['./outputs/knn_model.joblib']

# 5. Predicting on Test Data

Finally, we can now generate the predictions made by the KNN model on the test data. This is done by isolating the features of the test samples, passing them to the KNN model for prediction, and pairing the predicted class labels with the corresponding IDs of the test data. Predictions for the models can be found in the <b>./predictions</b> directory.

In [11]:
##Import Testing Dataset
loans_test_df = pd.read_csv('./outputs/cleaned_loans_test.csv')
loans_test_df

Unnamed: 0,id,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,PERSON_HOME_OWNERSHIP_MORTGAGE,...,LOAN_GRADE_B,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below
0,58645,-0.755638,0.404383,-0.117198,2.836600,1.455666,2.189522,0,-1.364513,0,...,0,0,0,0,1,0,0,0,0,1
1,58646,-0.257331,1.127233,0.601227,0.140622,0.722635,-0.646041,1,-0.266122,1,...,0,1,0,0,0,0,0,0,0,1
2,58647,-0.257331,-1.418731,0.403331,-0.937769,1.748450,-0.318861,1,-1.364513,0,...,0,0,0,1,0,0,0,0,0,1
3,58648,0.905387,-0.300610,0.169270,-0.398573,-0.470628,-0.209801,0,0.620670,0,...,0,0,0,0,0,0,0,0,1,0
4,58649,-0.257331,1.259932,0.923860,1.039281,1.573370,-0.100741,1,-0.266122,1,...,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39093,97738,-0.921741,-1.332883,-0.486519,-1.117500,0.044689,-0.646041,0,-0.266122,1,...,1,0,0,0,0,0,0,0,0,1
39094,97739,-0.921741,-0.389963,0.601227,-0.398573,-1.782989,-0.100741,0,-0.721995,1,...,0,0,0,0,0,0,0,0,0,1
39095,97740,3.895232,0.098465,-1.896898,1.039281,-1.043084,0.989861,0,2.637868,1,...,0,0,0,0,0,0,0,1,0,0
39096,97741,-0.921741,-1.019656,0.169270,0.859550,1.425586,2.516703,1,-0.266122,1,...,0,0,1,0,0,0,0,0,0,1


In [12]:
X_test = loans_test_df.loc[:, loans_test_df.columns != "id"]
X_test

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,PERSON_HOME_OWNERSHIP_MORTGAGE,PERSON_HOME_OWNERSHIP_OTHER,...,LOAN_GRADE_B,LOAN_GRADE_C,LOAN_GRADE_D,LOAN_GRADE_E,LOAN_GRADE_F,LOAN_GRADE_G,CB_PERSON_CRED_HIST_LENGTH_11_17,CB_PERSON_CRED_HIST_LENGTH_18_above,CB_PERSON_CRED_HIST_LENGTH_5_10,CB_PERSON_CRED_HIST_LENGTH_5_below
0,-0.755638,0.404383,-0.117198,2.836600,1.455666,2.189522,0,-1.364513,0,0,...,0,0,0,0,1,0,0,0,0,1
1,-0.257331,1.127233,0.601227,0.140622,0.722635,-0.646041,1,-0.266122,1,0,...,0,1,0,0,0,0,0,0,0,1
2,-0.257331,-1.418731,0.403331,-0.937769,1.748450,-0.318861,1,-1.364513,0,0,...,0,0,0,1,0,0,0,0,0,1
3,0.905387,-0.300610,0.169270,-0.398573,-0.470628,-0.209801,0,0.620670,0,0,...,0,0,0,0,0,0,0,0,1,0
4,-0.257331,1.259932,0.923860,1.039281,1.573370,-0.100741,1,-0.266122,1,0,...,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39093,-0.921741,-1.332883,-0.486519,-1.117500,0.044689,-0.646041,0,-0.266122,1,0,...,1,0,0,0,0,0,0,0,0,1
39094,-0.921741,-0.389963,0.601227,-0.398573,-1.782989,-0.100741,0,-0.721995,1,0,...,0,0,0,0,0,0,0,0,0,1
39095,3.895232,0.098465,-1.896898,1.039281,-1.043084,0.989861,0,2.637868,1,0,...,0,0,0,0,0,0,0,1,0,0
39096,-0.921741,-1.019656,0.169270,0.859550,1.425586,2.516703,1,-0.266122,1,0,...,0,0,1,0,0,0,0,0,0,1


In [13]:
y_pred = clf.predict(X_test)

In [14]:
loans_predictions_df = loans_test_df["id"].copy(deep=True)
loans_predictions_df = loans_predictions_df.to_frame()
loans_predictions_df.insert(1, 'loan_status', y_pred, True)

In [15]:
loans_predictions_df

Unnamed: 0,id,loan_status
0,58645,1
1,58646,0
2,58647,1
3,58648,1
4,58649,1
...,...,...
39093,97738,0
39094,97739,0
39095,97740,0
39096,97741,1


In [16]:
loans_predictions_df.to_csv('predictions/knn_predictions.csv', index=False, header=True, encoding='utf-8')