# ML Final Project

In this final project, I will analyze a dataset containing loan eligibility data. The goal is to accurately classify whether a loan will be approved based on the applicant's features. There are 614 observations. The dataset has 13 columns: one unique ID (Loan_ID), 11 predictor features, and the target (Loan_Status). The descriptions of the columns are below.
 
| Columns	| Description |
| :-----------: | :-----------: |
| Loan_ID	| Unique loan ID |
| Gender	| Male/Female |
| Married	| Applicant married (Y/N) |
| Dependents	| Number of dependents |
| Education	| Applicant Education (Graduate/Not Graduate)|
| Self_Employed	| Self employed (Y/N) |
| ApplicantIncome	| Applicant income |
| CoapplicantIncome	| Coapplicant income |
| LoanAmount	| Loan amount in thousands |
| Loan_Amount_Term	| Term of loan in months |
| Credit_History	| Credit history meets guidelines |
| Property_Area	| Urban/Semi-Urban/Rural |
| Loan_Status	| (Target) Loan approved (Y/N) |

In [1]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
# import tensorflow as tf
# import keras

In [2]:
# read in the file
df = pd.read_csv("Loan_Data.csv")

In [3]:
# take a look at the dataset
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [4]:
df.columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

In [5]:
# There are a lot of NaN values across the columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [9]:
# gauge the breakout between yes and no loan statuses
df["Loan_Status"].value_counts()

Y    422
N    192
Name: Loan_Status, dtype: int64
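
The class counts above give a simple reference point for the models that follow: a classifier that always predicts the majority class. The sketch below computes that baseline from the two counts printed above (about 68.7%); the variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the value_counts() output above
counts = {"Y": 422, "N": 192}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of always predicting the majority class
baseline_accuracy = counts[majority_label] / total

print(f"Majority class: {majority_label}")
print(f"Baseline accuracy: {baseline_accuracy:.2%}")
```

Any model worth keeping should beat this number on data it has not seen.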

Several columns contain NaN values. Most of them are non-numeric, so we will fill their missing values by sampling from the existing values in proportion to how often each one occurs. For the numeric columns, we will use the mean.

In [12]:
# generate a list of columns with NaN values
nan_columns = df.columns[df.isna().any()].tolist()
nan_columns

['Gender',
 'Married',
 'Dependents',
 'Self_Employed',
 'LoanAmount',
 'Loan_Amount_Term',
 'Credit_History']

In [36]:
# check to make sure these columns have NaNs
df[nan_columns].isna().sum()

Gender              13
Married              3
Dependents          15
Self_Employed       32
LoanAmount          22
Loan_Amount_Term    14
Credit_History      50
dtype: int64

In [78]:
df2 = df.copy()

In [79]:
# fill the NaNs in the numeric columns with the mean of their non-null values
# exclude Credit_History because it is a boolean flag stored as a float

numeric_cols = ["LoanAmount", "Loan_Amount_Term"]
df2[numeric_cols] = df2[numeric_cols].fillna(df2[numeric_cols].mean())

In [80]:
# let us make sure we removed the NaN values from those numeric columns
df2[nan_columns].isna().sum()

Gender              13
Married              3
Dependents          15
Self_Employed       32
LoanAmount           0
Loan_Amount_Term     0
Credit_History      50
dtype: int64

In [81]:
# let us start with imputing the Gender column
gender = df2["Gender"].value_counts(normalize=True)
gender

Male      0.813644
Female    0.186356
Name: Gender, dtype: float64

In [82]:
df2.loc[df2["Gender"].isna(), 'Gender'] = np.random.choice(gender.index, p=gender.values, size=df2["Gender"].isna().sum())
df2

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,146.412162,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.000000,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.000000,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.000000,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.000000,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.000000,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.000000,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.000000,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.000000,360.0,1.0,Urban,Y


In [83]:
# sanity check
df2[nan_columns].isna().sum()

Gender               0
Married              3
Dependents          15
Self_Employed       32
LoanAmount           0
Loan_Amount_Term     0
Credit_History      50
dtype: int64

In [84]:
# imputing the Married column
married = df2["Married"].value_counts(normalize=True)
df2.loc[df2["Married"].isna(), 'Married'] = np.random.choice(married.index, p=married.values, size=df2["Married"].isna().sum())

df2[nan_columns].isna().sum()

Gender               0
Married              0
Dependents          15
Self_Employed       32
LoanAmount           0
Loan_Amount_Term     0
Credit_History      50
dtype: int64

In [85]:
# imputing the Dependents column
dependents = df2["Dependents"].value_counts(normalize=True)
df2.loc[df2["Dependents"].isna(), 'Dependents'] = np.random.choice(dependents.index, p=dependents.values, 
                                                                   size=df2["Dependents"].isna().sum())

df2[nan_columns].isna().sum()

Gender               0
Married              0
Dependents           0
Self_Employed       32
LoanAmount           0
Loan_Amount_Term     0
Credit_History      50
dtype: int64

In [86]:
# imputing the Self_Employed column
self_employed = df2["Self_Employed"].value_counts(normalize=True)
df2.loc[df2["Self_Employed"].isna(), 'Self_Employed'] = np.random.choice(self_employed.index, p=self_employed.values, 
                                                                   size=df2["Self_Employed"].isna().sum())

df2[nan_columns].isna().sum()

Gender               0
Married              0
Dependents           0
Self_Employed        0
LoanAmount           0
Loan_Amount_Term     0
Credit_History      50
dtype: int64

In [88]:
# imputing the Credit_History column
credit_history = df2["Credit_History"].value_counts(normalize=True)
df2.loc[df2["Credit_History"].isna(), 'Credit_History'] = np.random.choice(credit_history.index, p=credit_history.values, 
                                                                   size=df2["Credit_History"].isna().sum())

df2[nan_columns].isna().sum()

Gender              0
Married             0
Dependents          0
Self_Employed       0
LoanAmount          0
Loan_Amount_Term    0
Credit_History      0
dtype: int64

Great! We were able to impute all the missing values.
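
The per-column cells above all repeat the same pattern, so it can be wrapped in a helper. The following is a minimal sketch, not the code used to produce the outputs above: `impute_by_proportion`, the toy DataFrame, and the seeded generator are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd


def impute_by_proportion(series, rng):
    """Fill NaNs by sampling observed values in proportion to their frequency."""
    filled = series.copy()
    missing = filled.isna()
    if not missing.any():
        return filled
    # relative frequency of each observed (non-null) value
    freqs = filled.value_counts(normalize=True)
    filled.loc[missing] = rng.choice(
        freqs.index.to_numpy(), p=freqs.to_numpy(), size=int(missing.sum())
    )
    return filled


# Hypothetical toy data standing in for the loan columns
toy = pd.DataFrame({
    "Gender": ["Male", "Male", None, "Female", "Male", None],
    "Married": ["Yes", None, "No", "Yes", "Yes", "No"],
})

rng = np.random.default_rng(42)  # assumed seed so the fill is reproducible
for col in ["Gender", "Married"]:
    toy[col] = impute_by_proportion(toy[col], rng)

print(toy.isna().sum())
```

Passing a seeded generator makes the imputed values repeatable between runs, which the unseeded `np.random.choice` calls above do not guarantee.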

In [89]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             614 non-null    object 
 2   Married            614 non-null    object 
 3   Dependents         614 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      614 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         614 non-null    float64
 9   Loan_Amount_Term   614 non-null    float64
 10  Credit_History     614 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [106]:
# split the data into features (X) and the target (y), encoding Y/N as 1/0

X = df2.iloc[:,0:-1]
y = df2.iloc[:, -1].values
y = np.where((y == "Y"), 1, 0)

In [107]:
print(X.shape)
print(y.shape)

(614, 12)
(614,)


In [109]:
# one-hot-encode the categorical columns
# note: Loan_ID is still in X here, so each of its 614 unique values becomes its own column
X = pd.get_dummies(X)

In [118]:
# save the column names for later
feature_names = X.columns
X.shape

(614, 634)
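
The jump from 12 to 634 columns is consistent with `Loan_ID` being one-hot-encoded along with the real categorical features: 614 ID columns, plus 15 category indicators, plus 5 numeric columns. The sketch below shows the same effect on a hypothetical four-row frame (the IDs and values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy frame: one ID column, one categorical, one numeric
toy = pd.DataFrame({
    "Loan_ID": ["LP1", "LP2", "LP3", "LP4"],
    "Gender": ["Male", "Female", "Male", "Male"],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
})

with_id = pd.get_dummies(toy)
without_id = pd.get_dummies(toy.drop(columns="Loan_ID"))

print(with_id.shape)     # (4, 7): 1 numeric + 4 ID indicators + 2 Gender indicators
print(without_id.shape)  # (4, 3): 1 numeric + 2 Gender indicators
```

Each ID indicator is 1 for exactly one row, so these columns let a model memorize training rows without helping it on new ones.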

In [120]:
from sklearn.model_selection import train_test_split

In [121]:
# split the data into training, testing, and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42) # 0.25 x 0.8 = 0.2

In [122]:
# check that the splitting worked correctly
print("X Training Shape:", X_train.shape)
print("Y Training Shape:", y_train.shape)
print("X Validation Shape:", X_val.shape)
print("Y Validation Shape:", y_val.shape)
print("X Test Shape:", X_test.shape)
print("Y Test Shape:", y_test.shape)

X Training Shape: (368, 634)
Y Training Shape: (368,)
X Validation Shape: (123, 634)
Y Validation Shape: (123,)
X Test Shape: (123, 634)
Y Test Shape: (123,)
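
The shapes above follow from the two split fractions. The sketch below reproduces the arithmetic, assuming each held-out size is the fraction of the available rows rounded up, which matches the printed shapes.

```python
import math

n_samples = 614

# First split: 20% of all rows held out for testing (rounded up)
n_test = math.ceil(0.2 * n_samples)
n_remaining = n_samples - n_test

# Second split: 25% of the remaining rows held out for validation (rounded up)
n_val = math.ceil(0.25 * n_remaining)
n_train = n_remaining - n_val

print(n_train, n_val, n_test)  # 368 123 123
```

This gives roughly a 60/20/20 split between training, validation, and test data.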


Now that we split our data into training, validation, and testing sets, we can begin building our models. We will start with a logistic regression.

## Logistic Regression

In [123]:
from sklearn.linear_model import LogisticRegression

In [124]:
# candidate values for C, the inverse of the regularization strength
# (smaller C means stronger regularization)

strength = [0.001, 0.01, 0.1, 1, 10]

To compare solvers and C values more efficiently, we will use the function below. It fits a logistic regression with the given settings and prints its training and validation accuracy.

In [125]:
# helper that fits a logistic regression and reports its accuracy
def run_reg(solver, penalty, C, random_state, max_iter, x_train, y_train, x_val, y_val):
    # create a logistic regression object with the requested settings and fit it
    model = LogisticRegression(
        solver=solver,
        penalty=penalty,
        C=C,
        random_state=random_state,
        max_iter=max_iter,
    )
    model.fit(x_train, y_train)

    # print the accuracy scores as percentages (the header line is printed in bold)
    print(f"\033[1m{solver.title()} with C Value of {C}\033[0m")
    print("Training Accuracy Score:", model.score(x_train, y_train)*100)
    print("Validation Accuracy Score:", model.score(x_val, y_val)*100)
    print("----------------------------------------------")

In [127]:
for c in strength:
    run_reg("sag","l2", c, 42, 3000, X_train, y_train, X_val, y_val)

Sag with C Value of 0.001
Training Accuracy Score: 70.65217391304348
Validation Accuracy Score: 66.66666666666666
----------------------------------------------
Sag with C Value of 0.01
Training Accuracy Score: 70.65217391304348
Validation Accuracy Score: 66.66666666666666
----------------------------------------------
Sag with C Value of 0.1
Training Accuracy Score: 70.65217391304348
Validation Accuracy Score: 66.66666666666666
----------------------------------------------
Sag with C Value of 1
Training Accuracy Score: 70.65217391304348
Validation Accuracy Score: 66.66666666666666
----------------------------------------------
Sag with C Value of 10
Training Accuracy Score: 70.65217391304348
Validation Accuracy Score: 66.66666666666666
----------------------------------------------


In [128]:
for c in strength:
    run_reg("liblinear","l2", c, 42, 3000, X_train, y_train, X_val, y_val)

Liblinear with C Value of 0.001
Training Accuracy Score: 70.65217391304348
Validation Accuracy Score: 66.66666666666666
----------------------------------------------
Liblinear with C Value of 0.01
Training Accuracy Score: 71.19565217391305
Validation Accuracy Score: 66.66666666666666
----------------------------------------------
Liblinear with C Value of 0.1
Training Accuracy Score: 81.25
Validation Accuracy Score: 73.98373983739837
----------------------------------------------
Liblinear with C Value of 1
Training Accuracy Score: 83.96739130434783
Validation Accuracy Score: 74.79674796747967
----------------------------------------------
Liblinear with C Value of 10
Training Accuracy Score: 100.0
Validation Accuracy Score: 75.60975609756098
----------------------------------------------


In [129]:
for c in strength:
    run_reg("saga","l2", c, 42, 3000, X_train, y_train, X_val, y_val)

Saga with C Value of 0.001
Training Accuracy Score: 70.65217391304348
Validation Accuracy Score: 66.66666666666666
----------------------------------------------
Saga with C Value of 0.01
Training Accuracy Score: 70.65217391304348
Validation Accuracy Score: 66.66666666666666
----------------------------------------------
Saga with C Value of 0.1
Training Accuracy Score: 70.65217391304348
Validation Accuracy Score: 66.66666666666666
----------------------------------------------
Saga with C Value of 1
Training Accuracy Score: 70.65217391304348
Validation Accuracy Score: 66.66666666666666
----------------------------------------------
Saga with C Value of 10
Training Accuracy Score: 70.65217391304348
Validation Accuracy Score: 66.66666666666666
----------------------------------------------
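
The sag and saga runs report identical scores for every value of C. One possible explanation, not verified here, is feature scale: the scikit-learn documentation notes that fast convergence of these two solvers is only guaranteed when features are on approximately the same scale, and the income columns are far larger than the 0/1 indicator columns. The sketch below shows how standardization could be combined with saga in a pipeline. It uses synthetic data from `make_classification` as a stand-in, so its score says nothing about the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan features (not the notebook's data)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)
X_demo[:, 0] *= 5000  # mimic one large-scale column such as an income

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Standardize the features, then fit logistic regression with the saga solver
scaled_saga = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", C=1, max_iter=3000, random_state=42),
)
scaled_saga.fit(X_tr, y_tr)

val_accuracy = scaled_saga.score(X_va, y_va)
print(f"Validation accuracy: {val_accuracy:.3f}")
```

Because the scaler sits inside the pipeline, it is fitted on the training rows only and then applied unchanged to the validation rows.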


In [130]:
for c in strength:
    run_reg("lbfgs","l2", c, 42, 3000, X_train, y_train, X_val, y_val)

Lbfgs with C Value of 0.001
Training Accuracy Score: 70.65217391304348
Validation Accuracy Score: 66.66666666666666
----------------------------------------------
Lbfgs with C Value of 0.01
Training Accuracy Score: 71.19565217391305
Validation Accuracy Score: 66.66666666666666
----------------------------------------------
Lbfgs with C Value of 0.1
Training Accuracy Score: 80.97826086956522
Validation Accuracy Score: 73.17073170731707
----------------------------------------------
Lbfgs with C Value of 1
Training Accuracy Score: 83.69565217391305
Validation Accuracy Score: 76.42276422764228
----------------------------------------------
Lbfgs with C Value of 10
Training Accuracy Score: 98.91304347826086
Validation Accuracy Score: 75.60975609756098
----------------------------------------------
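
With twenty runs printed, it helps to collect the scores and pick the configuration with the best validation accuracy. The sketch below does this with values copied from the outputs above, rounded to two decimals; the dictionary layout is only an illustration.

```python
# (training %, validation %) copied from the printed runs above, rounded to two decimals
scores = {
    ("liblinear", 0.001): (70.65, 66.67),
    ("liblinear", 0.01): (71.20, 66.67),
    ("liblinear", 0.1): (81.25, 73.98),
    ("liblinear", 1): (83.97, 74.80),
    ("liblinear", 10): (100.00, 75.61),
    ("lbfgs", 0.001): (70.65, 66.67),
    ("lbfgs", 0.01): (71.20, 66.67),
    ("lbfgs", 0.1): (80.98, 73.17),
    ("lbfgs", 1): (83.70, 76.42),
    ("lbfgs", 10): (98.91, 75.61),
}

# sag and saga reported the same scores for every C
for solver in ("sag", "saga"):
    for c in (0.001, 0.01, 0.1, 1, 10):
        scores[(solver, c)] = (70.65, 66.67)

# choose the configuration with the highest validation accuracy
best = max(scores, key=lambda key: scores[key][1])
train_acc, val_acc = scores[best]

print("Best by validation accuracy:", best, val_acc)
print("Train/validation gap:", round(train_acc - val_acc, 2))
```

By this measure lbfgs with C = 1 comes out ahead. The runs with C = 10 fit the training set almost perfectly without a matching gain on validation, which suggests overfitting.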


In [133]:
from sklearn.model_selection import GridSearchCV

In [134]:
# rebuild the feature matrix from the imputed data (all columns except the target)
X2 = df2.iloc[:,0:-1]

In [136]:
# drop Loan_ID, which is an identifier rather than a predictor (pop returns the removed column)
X2.pop("Loan_ID")

0      LP001002
1      LP001003
2      LP001005
3      LP001006
4      LP001008
         ...   
609    LP002978
610    LP002979
611    LP002983
612    LP002984
613    LP002990
Name: Loan_ID, Length: 614, dtype: object

In [138]:
X2 = pd.get_dummies(X2)

In [149]:
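# 70/30 train/test split; no separate validation set because GridSearchCV cross-validates on the training data
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y, test_size=0.3, random_state=42)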

In [153]:
# grid search over C and solver with an L2 penalty (5-fold cross-validation by default)
parameters = {"C":strength, "solver":["lbfgs", "liblinear","saga", "sag"],
              "random_state":[42], "max_iter":[3000], "penalty":["l2"]}
lr = LogisticRegression()
clf = GridSearchCV(lr, parameters)
clf.fit(X_train2, y_train2)

In [166]:
# repeat the search with an L1 penalty, which only liblinear and saga support among these solvers
parameters2 = {"C":strength, "solver":["liblinear","saga"],
              "random_state":[42], "max_iter":[3000], "penalty":["l1"]}
lr2 = LogisticRegression()
clf2 = GridSearchCV(lr2, parameters2)
clf2.fit(X_train2, y_train2)
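
Once a grid search has been fitted, the winning settings and scores can be read from the search object. The sketch below shows the usual attributes on a small synthetic problem. The data comes from `make_classification` and is not the loan data, so the printed values are not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the notebook's data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

param_grid = {"C": [0.001, 0.01, 0.1, 1, 10], "solver": ["lbfgs", "liblinear"]}
search = GridSearchCV(LogisticRegression(max_iter=3000, random_state=42), param_grid)
search.fit(X_tr, y_tr)

# Best settings and their mean cross-validated accuracy
print("Best parameters:", search.best_params_)
print("Mean cross-validated accuracy:", search.best_score_)

# The search refits the best model on all training rows, so it can score held-out data
test_accuracy = search.score(X_te, y_te)
print("Held-out test accuracy:", test_accuracy)
```

The same attributes are available on `clf` and `clf2` above, with `X_test2` and `y_test2` as the held-out data.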