# Mortgage Loan Approval Capstone Project 
# Machine Learning Modelling

<img src="https://images.unsplash.com/photo-1560518883-ce09059eeffa?q=80&w=1973&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" width="700">

### Importing the Libraries and Dataset

In [4]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Data splitting, model selection and preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, ParameterGrid
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.impute import SimpleImputer

# Evaluation metrics
from sklearn import metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import roc_curve, auc, precision_recall_curve

# Models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import lightgbm as lgb
import xgboost as xgb

In [5]:
# Set the path 
path = 'ML_data.csv'

# Read the dataframe
df = pd.read_csv(path)

# Set option to display all the columns
pd.set_option('display.max_columns', None)

# Display the first five rows of the dataframe
df.head()

Unnamed: 0,ethnicity,race,sex,outcome,preapproval,loan_type,reverse_mortgage,business_or_commercial_purpose,loan_amount,loan_to_value_ratio,interest_rate,loan_term,interest_only_payment,property_value,occupancy_type,total_units,income,debt_to_income_ratio,applicant_credit_score_type,co-applicant_credit_score_type,age,co_applicant,intermediary,denial_reason
0,Not Hispanic or Latino,White,Joint,1,2,1,2,1,165000.0,75.0,5.75,300.0,2,225000.0,3,4.0,371.0,50%-<60%,3,2,55-64,0,1,10
1,Not Hispanic or Latino,White,Male,1,2,1,2,2,325000.0,79.75,5.75,12.0,1,405000.0,1,1.0,62.0,40%-<50%,1,10,55-64,1,1,10
2,Not Hispanic or Latino,White,Male,1,2,1,2,2,75000.0,62.93,4.75,240.0,2,115000.0,1,1.0,43.0,30%-<40%,3,10,45-54,1,1,10
3,Not Hispanic or Latino,White,Joint,1,2,1,2,2,725000.0,79.98,5.5,360.0,2,905000.0,2,1.0,662.0,<20%,1,1,55-64,0,1,10
4,Not Hispanic or Latino,White,Male,1,2,1,2,1,145000.0,76.92,6.11,300.0,2,185000.0,3,1.0,176.0,<20%,1,10,45-54,1,1,10


In [6]:
# Check the datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4839365 entries, 0 to 4839364
Data columns (total 24 columns):
 #   Column                          Dtype  
---  ------                          -----  
 0   ethnicity                       object 
 1   race                            object 
 2   sex                             object 
 3   outcome                         int64  
 4   preapproval                     int64  
 5   loan_type                       int64  
 6   reverse_mortgage                int64  
 7   business_or_commercial_purpose  int64  
 8   loan_amount                     float64
 9   loan_to_value_ratio             float64
 10  interest_rate                   float64
 11  loan_term                       float64
 12  interest_only_payment           int64  
 13  property_value                  float64
 14  occupancy_type                  int64  
 15  total_units                     float64
 16  income                          float64
 17  debt_to_income_ratio       

In [7]:
# Check the dimensions of the dataset
df.shape

(4839365, 24)

In [8]:
df['outcome'].value_counts()

outcome
1    4147252
0     692113
Name: count, dtype: int64

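Because the dataset is very large (about 4.8 million rows) and the classes are imbalanced (roughly 86% of rows have `outcome = 1` and 14% have `outcome = 0`), I will draw a balanced sample of 2,000 rows per class, so that the modelling data is split 50-50 between the two outcomes.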
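As a rough illustration of the imbalance and of the per-class sampling idea, the sketch below recomputes the class shares from the counts printed above and then balances a small toy frame. The toy data and the names `toy`, `n_per_class` and `balanced_toy` are hypothetical and stand in for the real file; `pd.concat` over the groups is just one possible way to draw the same number of rows from each class.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({1: 4_147_252, 0: 692_113}, name="count")
shares = counts / counts.sum()
print(shares.round(3))

# Hypothetical toy frame with a similar imbalance (not the real data)
toy = pd.DataFrame({"outcome": [1] * 90 + [0] * 10, "x": range(100)})

# Draw the same number of rows from each class, then stack the samples
n_per_class = 5
balanced_toy = pd.concat(
    [group.sample(n_per_class, random_state=42) for _, group in toy.groupby("outcome")]
)
print(balanced_toy["outcome"].value_counts())
```

With 2,000 rows per class, the resulting sample is a small fraction of the original rows, so results may vary noticeably with the random seed.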

### Train-Test-Split

In [11]:
# Take a balanced sample for training
balanced = df.groupby('outcome', group_keys=False).apply(lambda x: x.sample(2000, random_state=42))

In [12]:
# Make a copy of the dataframe
data = balanced.copy()

In [13]:
# Drop unnecessary columns from the DataFrame
data = data.drop(columns = [
                             'ethnicity', 
                             'race',
                             'sex',  
                             'preapproval',  # indicates whether the mortgage will be approved
                             'loan_type',
                             'reverse_mortgage', 
                             'business_or_commercial_purpose', 
                             #'loan_amount',
                             'loan_to_value_ratio',
                             'interest_rate', # contains too many null values
                             'loan_term',
                             'interest_only_payment', 
                            # 'property_value', 
                             'occupancy_type',
                             'total_units', 
                            # 'income', 
                             #'debt_to_income_ratio',
                             'applicant_credit_score_type', 
                             'co-applicant_credit_score_type', 
                             'age',
                             #'co_applicant', 
                             'intermediary', 
                             'denial_reason'  # recorded after the application is decided, so it must be dropped
                           ])

In [14]:
# Convert categorical variables into dummy/indicator variables
data = pd.get_dummies(data, 
                      columns=[#'ethnicity', 
                               #'race', 
                               #'sex',
                               'debt_to_income_ratio', 
                               # 'age'
                      ],
                      drop_first=True, 
                      dtype = int)


In [15]:
# Get the list of column names from the dataset
features = list(data.columns)

# Remove the target variable 'outcome' from the list of features
features.remove('outcome')

# Assign the target variable 'outcome' to y
y = data['outcome']

# Assign the features to X
X = data[features]

In [16]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [17]:
# Impute missing values using SimpleImputer
imputer = SimpleImputer(strategy='median')  # Example imputation strategy (median)

# Fit and transform on training set
X_train_imputed = imputer.fit_transform(X_train)
# Transform the test set using parameters learned from training set
X_test_imputed = imputer.transform(X_test)

# Convert imputed arrays back to DataFrames, keeping the original row index
X_train_imputed = pd.DataFrame(X_train_imputed, columns=X.columns, index=X_train.index)
X_test_imputed = pd.DataFrame(X_test_imputed, columns=X.columns, index=X_test.index)

In [18]:
# Scale the data using MinMaxScaler
scaler = MinMaxScaler()

# Fit and transform on training set
X_train_scaled = scaler.fit_transform(X_train_imputed)
# Transform the test set using parameters learned from training set
X_test_scaled = scaler.transform(X_test_imputed)

# Convert scaled arrays back to DataFrames, keeping the original row index
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns, index=X_test.index)

In [19]:
# Check that the feature and target indices line up in both splits.
# Only the last expression of a cell is displayed, so combine the two checks.
train_aligned = all(X_train.index == y_train.index)
test_aligned = all(X_test.index == y_test.index)
train_aligned and test_aligned

True
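The imputer and scaler above are fitted once on the full training split, which means that any later cross-validation on `X_train_scaled` sees statistics computed from its own validation folds. One common way to avoid that is to wrap the preprocessing and the model in a scikit-learn `Pipeline`, so each fold refits the imputer and scaler on its own training part. The following is a minimal sketch on synthetic data, not the notebook's actual workflow; `X_demo`, `y_demo` and `pipe` are hypothetical names.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with roughly 10% missing values (hypothetical)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan
y_demo = (rng.random(200) < 0.5).astype(int)

# Imputation and scaling are refitted inside every cross-validation fold
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
    ("model", LogisticRegression()),
])

scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```

The same `pipe` object can be passed to `GridSearchCV` as the estimator, with parameter names prefixed by the step name (for example `model__C`).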

### Random Forest

In [21]:
# Initialize a baseline Random Forest with a single tree (n_estimators=1)
rf_clf = RandomForestClassifier(random_state=42, n_estimators=1)

# Fit the model on the imputed and scaled training data
rf_clf.fit(X_train_scaled, y_train)

In [22]:
# Evaluate on training set
y_train_pred = rf_clf.predict(X_train_scaled)

accuracy_train = accuracy_score(y_train, y_train_pred)
print(f'Accuracy on Training Set: {accuracy_train}')

print('Classification Report on Training Set:')
print(classification_report(y_train, y_train_pred))

print('Confusion Matrix on Training Set:')
print(confusion_matrix(y_train, y_train_pred))

Accuracy on Training Set: 0.8721875
Classification Report on Training Set:
              precision    recall  f1-score   support

           0       0.87      0.87      0.87      1578
           1       0.87      0.88      0.87      1622

    accuracy                           0.87      3200
   macro avg       0.87      0.87      0.87      3200
weighted avg       0.87      0.87      0.87      3200

Confusion Matrix on Training Set:
[[1368  210]
 [ 199 1423]]


In [23]:
# Evaluate on testing set
y_test_pred = rf_clf.predict(X_test_scaled)

accuracy_test = accuracy_score(y_test, y_test_pred)
print(f'Accuracy on Testing Set: {accuracy_test}')

print('Classification Report on Testing Set:')
print(classification_report(y_test, y_test_pred))

print('Confusion Matrix on Testing Set:')
print(confusion_matrix(y_test, y_test_pred))

Accuracy on Testing Set: 0.64125
Classification Report on Testing Set:
              precision    recall  f1-score   support

           0       0.66      0.66      0.66       422
           1       0.62      0.62      0.62       378

    accuracy                           0.64       800
   macro avg       0.64      0.64      0.64       800
weighted avg       0.64      0.64      0.64       800

Confusion Matrix on Testing Set:
[[279 143]
 [144 234]]
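To make the link between the confusion matrix and the reported scores explicit, the sketch below recomputes accuracy, precision and recall for class 1 from the test-set matrix printed above. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative only.

```python
import numpy as np

# Test-set confusion matrix of the single-tree baseline, copied from the output above.
# Rows are the true class (0, 1); columns are the predicted class (0, 1).
cm = np.array([[279, 143],
               [144, 234]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all rows classified correctly
precision_1 = tp / (tp + fp)         # of rows predicted 1, share that are truly 1
recall_1 = tp / (tp + fn)            # of rows that are truly 1, share predicted 1
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.5f}, precision={precision_1:.2f}, "
      f"recall={recall_1:.2f}, f1={f1_1:.2f}")
```

The recomputed values agree with the classification report: accuracy 0.64125, and precision, recall and F1 of 0.62 for class 1.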


In [24]:
# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(estimator=rf_clf, param_grid=param_grid, cv=5, scoring='precision', verbose=1, n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
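The candidate count in the log follows from the grid itself: every combination of the four lists is tried, and each combination is fitted once per fold. A small sketch using `ParameterGrid` (already imported at the top of the notebook) reproduces the numbers; `param_grid_rf` is simply a copy of the grid defined above under a different name.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the Random Forest grid defined above
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

n_candidates = len(ParameterGrid(param_grid_rf))  # 3 * 4 * 3 * 3 combinations
n_folds = 5
print(n_candidates, n_candidates * n_folds)  # 108 540
```

Counting candidates this way before calling `fit` gives a quick estimate of how long a search is likely to take.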


In [25]:
# Best parameters found by GridSearchCV
print("\nBest Parameters:")
print(grid_search.best_params_)


Best Parameters:
{'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 200}


In [26]:
# Evaluate the best model
best_rf_model = grid_search.best_estimator_

In [27]:
# Evaluate on training set
y_train_pred_best = best_rf_model.predict(X_train_scaled)

accuracy_train_best = accuracy_score(y_train, y_train_pred_best)
print(f'Best Model Accuracy on Training Set: {accuracy_train_best}')

print('Best Model Classification Report on Training Set:')
print(classification_report(y_train, y_train_pred_best))

print('Best Model Confusion Matrix on Training Set:')
print(confusion_matrix(y_train, y_train_pred_best))

Best Model Accuracy on Training Set: 0.8475
Best Model Classification Report on Training Set:
              precision    recall  f1-score   support

           0       0.88      0.80      0.84      1578
           1       0.82      0.90      0.86      1622

    accuracy                           0.85      3200
   macro avg       0.85      0.85      0.85      3200
weighted avg       0.85      0.85      0.85      3200

Best Model Confusion Matrix on Training Set:
[[1258  320]
 [ 168 1454]]


In [28]:
# Evaluate on testing set
y_test_pred_best = best_rf_model.predict(X_test_scaled)

accuracy_test_best = accuracy_score(y_test, y_test_pred_best)
print(f'Best Model Accuracy on Testing Set: {accuracy_test_best}')

print('Best Model Classification Report on Testing Set:')
print(classification_report(y_test, y_test_pred_best))

print('Best Model Confusion Matrix on Testing Set:')
print(confusion_matrix(y_test, y_test_pred_best))

Best Model Accuracy on Testing Set: 0.705
Best Model Classification Report on Testing Set:
              precision    recall  f1-score   support

           0       0.76      0.65      0.70       422
           1       0.66      0.77      0.71       378

    accuracy                           0.70       800
   macro avg       0.71      0.71      0.70       800
weighted avg       0.71      0.70      0.70       800

Best Model Confusion Matrix on Testing Set:
[[273 149]
 [ 87 291]]


### Logistic Regression

In [30]:
# Initialize and train a Logistic Regression classifier
log_reg = LogisticRegression(random_state=42)

# Fit the model on the training data
log_reg.fit(X_train_scaled, y_train)

In [31]:
# Evaluate on training set
y_train_pred = log_reg.predict(X_train_scaled)

accuracy_train = accuracy_score(y_train, y_train_pred)
print(f'Accuracy on Training Set: {accuracy_train}')

print('Classification Report on Training Set:')
print(classification_report(y_train, y_train_pred))

print('Confusion Matrix on Training Set:')
print(confusion_matrix(y_train, y_train_pred))

Accuracy on Training Set: 0.68
Classification Report on Training Set:
              precision    recall  f1-score   support

           0       0.77      0.50      0.61      1578
           1       0.64      0.85      0.73      1622

    accuracy                           0.68      3200
   macro avg       0.70      0.68      0.67      3200
weighted avg       0.70      0.68      0.67      3200

Confusion Matrix on Training Set:
[[ 792  786]
 [ 238 1384]]


In [32]:
# Evaluate on testing set
y_test_pred = log_reg.predict(X_test_scaled)

accuracy_test = accuracy_score(y_test, y_test_pred)
print(f'Accuracy on Testing Set: {accuracy_test}')

print('Classification Report on Testing Set:')
print(classification_report(y_test, y_test_pred))

print('Confusion Matrix on Testing Set:')
print(confusion_matrix(y_test, y_test_pred))

Accuracy on Testing Set: 0.65875
Classification Report on Testing Set:
              precision    recall  f1-score   support

           0       0.76      0.52      0.62       422
           1       0.60      0.81      0.69       378

    accuracy                           0.66       800
   macro avg       0.68      0.67      0.65       800
weighted avg       0.68      0.66      0.65       800

Confusion Matrix on Testing Set:
[[219 203]
 [ 70 308]]


In [33]:
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10],  # Regularization strength
    'penalty': ['l2']                # Only L2 penalty, as 'lbfgs' does not support 'l1'
}


grid_search = GridSearchCV(estimator=log_reg, param_grid=param_grid, cv=5, verbose=1, n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


In [34]:
# Best parameters found by GridSearchCV
print("\nBest Parameters:")
print(grid_search.best_params_)


Best Parameters:
{'C': 10, 'penalty': 'l2'}


In [35]:
# Evaluate the best model
best_log_reg_model = grid_search.best_estimator_

In [36]:
# Evaluate on training set
y_train_pred_best = best_log_reg_model.predict(X_train_scaled)

accuracy_train_best = accuracy_score(y_train, y_train_pred_best)
print(f'Best Model Accuracy on Training Set: {accuracy_train_best}')

print('Best Model Classification Report on Training Set:')
print(classification_report(y_train, y_train_pred_best))

print('Best Model Confusion Matrix on Training Set:')
print(confusion_matrix(y_train, y_train_pred_best))

Best Model Accuracy on Training Set: 0.678125
Best Model Classification Report on Training Set:
              precision    recall  f1-score   support

           0       0.77      0.49      0.60      1578
           1       0.64      0.86      0.73      1622

    accuracy                           0.68      3200
   macro avg       0.70      0.68      0.67      3200
weighted avg       0.70      0.68      0.67      3200

Best Model Confusion Matrix on Training Set:
[[ 779  799]
 [ 231 1391]]


In [37]:
# Evaluate on testing set
y_test_pred_best = best_log_reg_model.predict(X_test_scaled)

accuracy_test_best = accuracy_score(y_test, y_test_pred_best)
print(f'Best Model Accuracy on Testing Set: {accuracy_test_best}')

print('Best Model Classification Report on Testing Set:')
print(classification_report(y_test, y_test_pred_best))

print('Best Model Confusion Matrix on Testing Set:')
print(confusion_matrix(y_test, y_test_pred_best))

Best Model Accuracy on Testing Set: 0.65125
Best Model Classification Report on Testing Set:
              precision    recall  f1-score   support

           0       0.75      0.51      0.61       422
           1       0.60      0.81      0.69       378

    accuracy                           0.65       800
   macro avg       0.67      0.66      0.65       800
weighted avg       0.68      0.65      0.64       800

Best Model Confusion Matrix on Testing Set:
[[216 206]
 [ 73 305]]


### Support Vector Machine

In [39]:
# Initialize and train an SVM classifier
svm = SVC(random_state=42, probability=True)

# Fit the model on the imputed and scaled training data
svm.fit(X_train_scaled, y_train)


In [40]:
# Evaluate on training set
y_train_pred = svm.predict(X_train_scaled)

accuracy_train = accuracy_score(y_train, y_train_pred)
print(f'Accuracy on Training Set: {accuracy_train}')

print('Classification Report on Training Set:')
print(classification_report(y_train, y_train_pred))

print('Confusion Matrix on Training Set:')
print(confusion_matrix(y_train, y_train_pred))

Accuracy on Training Set: 0.666875
Classification Report on Training Set:
              precision    recall  f1-score   support

           0       0.81      0.43      0.56      1578
           1       0.62      0.90      0.73      1622

    accuracy                           0.67      3200
   macro avg       0.71      0.66      0.65      3200
weighted avg       0.71      0.67      0.65      3200

Confusion Matrix on Training Set:
[[ 673  905]
 [ 161 1461]]


In [41]:
# Evaluate on testing set
y_test_pred = svm.predict(X_test_scaled)

accuracy_test = accuracy_score(y_test, y_test_pred)
print(f'Accuracy on Testing Set: {accuracy_test}')

print('Classification Report on Testing Set:')
print(classification_report(y_test, y_test_pred))

print('Confusion Matrix on Testing Set:')
print(confusion_matrix(y_test, y_test_pred))

Accuracy on Testing Set: 0.64
Classification Report on Testing Set:
              precision    recall  f1-score   support

           0       0.78      0.44      0.56       422
           1       0.58      0.86      0.69       378

    accuracy                           0.64       800
   macro avg       0.68      0.65      0.63       800
weighted avg       0.69      0.64      0.63       800

Confusion Matrix on Testing Set:
[[186 236]
 [ 52 326]]


In [42]:
# Hyperparameter tuning with GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['linear', 'rbf']
}

grid_search = GridSearchCV(estimator=svm, param_grid=param_grid, cv=5, verbose=1, n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

Fitting 5 folds for each of 32 candidates, totalling 160 fits


In [43]:
# Best parameters found by GridSearchCV
print("\nBest Parameters:")
print(grid_search.best_params_)


Best Parameters:
{'C': 100, 'gamma': 1, 'kernel': 'rbf'}
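In scikit-learn's `SVC`, `gamma` only affects the `rbf`, `poly` and `sigmoid` kernels, so the grid above refits the linear kernel four times for every value of `C` with identical results. One possible refinement, shown here as a sketch rather than what the notebook ran, is to pass a list of grids so that `gamma` is only varied for the RBF kernel. The name `param_grid_svm` is hypothetical.

```python
from sklearn.model_selection import ParameterGrid

# Original grid: gamma is crossed with both kernels
original_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['linear', 'rbf'],
}

# Alternative: one sub-grid per kernel, gamma only where it has an effect
param_grid_svm = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001]},
]

print(len(ParameterGrid(original_grid)))   # 32
print(len(ParameterGrid(param_grid_svm)))  # 20
```

`GridSearchCV` accepts the list form directly as `param_grid`, which here would cut the number of fits from 160 to 100 with 5-fold cross-validation.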


In [44]:
# Evaluate the best model
best_svm_model = grid_search.best_estimator_

In [45]:
# Evaluate on training set
y_train_pred_best = best_svm_model.predict(X_train_scaled)

accuracy_train_best = accuracy_score(y_train, y_train_pred_best)
print(f'Best Model Accuracy on Training Set: {accuracy_train_best}')

print('Best Model Classification Report on Training Set:')
print(classification_report(y_train, y_train_pred_best))

print('Best Model Confusion Matrix on Training Set:')
print(confusion_matrix(y_train, y_train_pred_best))

Best Model Accuracy on Training Set: 0.6909375
Best Model Classification Report on Training Set:
              precision    recall  f1-score   support

           0       0.76      0.54      0.63      1578
           1       0.65      0.84      0.73      1622

    accuracy                           0.69      3200
   macro avg       0.71      0.69      0.68      3200
weighted avg       0.71      0.69      0.68      3200

Best Model Confusion Matrix on Training Set:
[[ 856  722]
 [ 267 1355]]


In [46]:
# Evaluate on testing set
y_test_pred_best = best_svm_model.predict(X_test_scaled)

accuracy_test_best = accuracy_score(y_test, y_test_pred_best)
print(f'Best Model Accuracy on Testing Set: {accuracy_test_best}')

print('Best Model Classification Report on Testing Set:')
print(classification_report(y_test, y_test_pred_best))

print('Best Model Confusion Matrix on Testing Set:')
print(confusion_matrix(y_test, y_test_pred_best))

Best Model Accuracy on Testing Set: 0.65125
Best Model Classification Report on Testing Set:
              precision    recall  f1-score   support

           0       0.73      0.54      0.62       422
           1       0.60      0.78      0.68       378

    accuracy                           0.65       800
   macro avg       0.67      0.66      0.65       800
weighted avg       0.67      0.65      0.65       800

Best Model Confusion Matrix on Testing Set:
[[228 194]
 [ 85 293]]


### XGBoost

In [48]:
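# Ensure that the column names are valid strings without prohibited characters

# XGBoost rejects feature names containing '[', ']' or '<', so replace them with '_'.
# regex=False treats each character literally ('[' on its own is not a valid regular expression).
X_train_scaled.columns = X_train_scaled.columns.str.replace('[', '_', regex=False).str.replace(']', '_', regex=False).str.replace('<', '_', regex=False)
X_test_scaled.columns = X_test_scaled.columns.str.replace('[', '_', regex=False).str.replace(']', '_', regex=False).str.replace('<', '_', regex=False)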
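The same renaming can be expressed with a single regular expression. The helper below is a hypothetical convenience function, not part of XGBoost or pandas, and the demo column names are chosen to resemble the dummy columns created earlier.

```python
import re

import pandas as pd


def sanitize_columns(columns):
    """Replace the characters '[', ']' and '<' in each column name with '_'."""
    return [re.sub(r"[\[\]<]", "_", str(name)) for name in columns]


# Hypothetical empty frame whose column names mimic the dummy columns
demo = pd.DataFrame(columns=[
    "debt_to_income_ratio_<20%",
    "debt_to_income_ratio_50%-<60%",
    "income",
])
demo.columns = sanitize_columns(demo.columns)
print(list(demo.columns))
```

Applying one helper to both the training and test frames keeps their column names identical, which the models rely on at prediction time.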

In [49]:
# Initialize and train an XGBoost classifier
xgb_clf = xgb.XGBClassifier(random_state=42)

# Fit the model on the training data
xgb_clf.fit(X_train_scaled, y_train)

In [50]:
# Evaluate on training set
y_train_pred = xgb_clf.predict(X_train_scaled)

accuracy_train = accuracy_score(y_train, y_train_pred)
print(f'Accuracy on Training Set: {accuracy_train}')

print('Classification Report on Training Set:')
print(classification_report(y_train, y_train_pred))

print('Confusion Matrix on Training Set:')
print(confusion_matrix(y_train, y_train_pred))

Accuracy on Training Set: 0.90625
Classification Report on Training Set:
              precision    recall  f1-score   support

           0       0.93      0.87      0.90      1578
           1       0.88      0.94      0.91      1622

    accuracy                           0.91      3200
   macro avg       0.91      0.91      0.91      3200
weighted avg       0.91      0.91      0.91      3200

Confusion Matrix on Training Set:
[[1380  198]
 [ 102 1520]]


In [51]:
# Evaluate on testing set
y_test_pred = xgb_clf.predict(X_test_scaled)

accuracy_test = accuracy_score(y_test, y_test_pred)
print(f'Accuracy on Testing Set: {accuracy_test}')

print('Classification Report on Testing Set:')
print(classification_report(y_test, y_test_pred))

print('Confusion Matrix on Testing Set:')
print(confusion_matrix(y_test, y_test_pred))

Accuracy on Testing Set: 0.69625
Classification Report on Testing Set:
              precision    recall  f1-score   support

           0       0.73      0.67      0.70       422
           1       0.66      0.73      0.69       378

    accuracy                           0.70       800
   macro avg       0.70      0.70      0.70       800
weighted avg       0.70      0.70      0.70       800

Confusion Matrix on Testing Set:
[[281 141]
 [102 276]]


In [52]:
param_grid = {
    'n_estimators': [50, 100, 150],        # Number of boosting rounds
    'learning_rate': [0.01, 0.1, 0.3],     # Step size shrinkage to prevent overfitting
    'max_depth': [3, 4, 5],                # Maximum depth of a tree
    'min_child_weight': [1, 3, 5],         # Minimum sum of instance weight needed in a child
    'subsample': [0.8, 0.9, 1.0],          # Subsample ratio of the training instances
    'colsample_bytree': [0.8, 0.9, 1.0]    # Subsample ratio of columns when constructing each tree
}


grid_search = GridSearchCV(estimator=xgb_clf, param_grid=param_grid, cv=5, scoring='precision', verbose=1, n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

Fitting 5 folds for each of 729 candidates, totalling 3645 fits


In [53]:
# Best parameters found by GridSearchCV
print("\nBest Parameters:")
print(grid_search.best_params_)



Best Parameters:
{'colsample_bytree': 0.9, 'learning_rate': 0.1, 'max_depth': 3, 'min_child_weight': 3, 'n_estimators': 100, 'subsample': 0.8}


In [54]:
# Evaluate the best model
best_xgb_model = grid_search.best_estimator_

In [55]:
# Evaluate on training set
y_train_pred_best = best_xgb_model.predict(X_train_scaled)

accuracy_train_best = accuracy_score(y_train, y_train_pred_best)
print(f'Best Model Accuracy on Training Set: {accuracy_train_best}')

print('Best Model Classification Report on Training Set:')
print(classification_report(y_train, y_train_pred_best))

print('Best Model Confusion Matrix on Training Set:')
print(confusion_matrix(y_train, y_train_pred_best))

Best Model Accuracy on Training Set: 0.7540625
Best Model Classification Report on Training Set:
              precision    recall  f1-score   support

           0       0.79      0.69      0.73      1578
           1       0.73      0.82      0.77      1622

    accuracy                           0.75      3200
   macro avg       0.76      0.75      0.75      3200
weighted avg       0.76      0.75      0.75      3200

Best Model Confusion Matrix on Training Set:
[[1083  495]
 [ 292 1330]]


In [56]:
# Evaluate on testing set
y_test_pred_best = best_xgb_model.predict(X_test_scaled)

accuracy_test_best = accuracy_score(y_test, y_test_pred_best)
print(f'Best Model Accuracy on Testing Set: {accuracy_test_best}')

print('Best Model Classification Report on Testing Set:')
print(classification_report(y_test, y_test_pred_best))

print('Best Model Confusion Matrix on Testing Set:')
print(confusion_matrix(y_test, y_test_pred_best))

Best Model Accuracy on Testing Set: 0.70625
Best Model Classification Report on Testing Set:
              precision    recall  f1-score   support

           0       0.76      0.65      0.70       422
           1       0.66      0.77      0.71       378

    accuracy                           0.71       800
   macro avg       0.71      0.71      0.71       800
weighted avg       0.71      0.71      0.71       800

Best Model Confusion Matrix on Testing Set:
[[273 149]
 [ 86 292]]


### LightGBM

In [58]:
# Initialize and train a LightGBM classifier
lgb_clf = lgb.LGBMClassifier(random_state=42)

# Fit the model on the imputed and scaled training data
lgb_clf.fit(X_train_scaled, y_train)


[LightGBM] [Info] Number of positive: 1622, number of negative: 1578
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000246 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 537
[LightGBM] [Info] Number of data points in the train set: 3200, number of used features: 9
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.506875 -> initscore=0.027502
[LightGBM] [Info] Start training from score 0.027502


In [59]:
# Evaluate on training set
y_train_pred = lgb_clf.predict(X_train_scaled)

accuracy_train = accuracy_score(y_train, y_train_pred)
print(f'Accuracy on Training Set: {accuracy_train}')

print('Classification Report on Training Set:')
print(classification_report(y_train, y_train_pred))

print('Confusion Matrix on Training Set:')
print(confusion_matrix(y_train, y_train_pred))

Accuracy on Training Set: 0.85
Classification Report on Training Set:
              precision    recall  f1-score   support

           0       0.88      0.80      0.84      1578
           1       0.82      0.90      0.86      1622

    accuracy                           0.85      3200
   macro avg       0.85      0.85      0.85      3200
weighted avg       0.85      0.85      0.85      3200

Confusion Matrix on Training Set:
[[1265  313]
 [ 167 1455]]


In [60]:
# Evaluate on testing set
y_test_pred = lgb_clf.predict(X_test_scaled)

accuracy_test = accuracy_score(y_test, y_test_pred)
print(f'Accuracy on Testing Set: {accuracy_test}')

print('Classification Report on Testing Set:')
print(classification_report(y_test, y_test_pred))

print('Confusion Matrix on Testing Set:')
print(confusion_matrix(y_test, y_test_pred))

Accuracy on Testing Set: 0.7025
Classification Report on Testing Set:
              precision    recall  f1-score   support

           0       0.76      0.64      0.69       422
           1       0.66      0.77      0.71       378

    accuracy                           0.70       800
   macro avg       0.71      0.71      0.70       800
weighted avg       0.71      0.70      0.70       800

Confusion Matrix on Testing Set:
[[271 151]
 [ 87 291]]


In [61]:
param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}


grid_search = GridSearchCV(estimator=lgb_clf, param_grid=param_grid, cv=5, scoring='precision', verbose=1, n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

Fitting 5 folds for each of 32 candidates, totalling 160 fits
[LightGBM] [Info] Number of positive: 1297, number of negative: 1263
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000855 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 514
[LightGBM] [Info] Number of data points in the train set: 2560, number of used features: 9
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.506641 -> initscore=0.026564
[LightGBM] [Info] Start training from score 0.026564
[LightGBM] [Info] Number of positive: 1298, number of negative: 1262
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001300 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 521
[LightGBM] [Info] Number of data points in the train set: 2560, number of used features: 9
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.507031 -> initscore=0.028127
[LightGBM] [Info]

In [62]:
# Best parameters found by GridSearchCV
print("\nBest Parameters:")
print(grid_search.best_params_)


Best Parameters:
{'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.8}


In [63]:
# Evaluate the best model
best_lgb_model = grid_search.best_estimator_

In [64]:
# Evaluate on training set
y_train_pred_best = best_lgb_model.predict(X_train_scaled)

accuracy_train_best = accuracy_score(y_train, y_train_pred_best)
print(f'Best Model Accuracy on Training Set: {accuracy_train_best}')

print('Best Model Classification Report on Training Set:')
print(classification_report(y_train, y_train_pred_best))

print('Best Model Confusion Matrix on Training Set:')
print(confusion_matrix(y_train, y_train_pred_best))

Best Model Accuracy on Training Set: 0.7578125
Best Model Classification Report on Training Set:
              precision    recall  f1-score   support

           0       0.80      0.68      0.73      1578
           1       0.73      0.84      0.78      1622

    accuracy                           0.76      3200
   macro avg       0.76      0.76      0.76      3200
weighted avg       0.76      0.76      0.76      3200

Best Model Confusion Matrix on Training Set:
[[1069  509]
 [ 266 1356]]


In [65]:
# Evaluate on testing set
y_test_pred_best = best_lgb_model.predict(X_test_scaled)

accuracy_test_best = accuracy_score(y_test, y_test_pred_best)
print(f'Best Model Accuracy on Testing Set: {accuracy_test_best}')

print('Best Model Classification Report on Testing Set:')
print(classification_report(y_test, y_test_pred_best))

print('Best Model Confusion Matrix on Testing Set:')
print(confusion_matrix(y_test, y_test_pred_best))

Best Model Accuracy on Testing Set: 0.70625
Best Model Classification Report on Testing Set:
              precision    recall  f1-score   support

           0       0.77      0.63      0.69       422
           1       0.66      0.79      0.72       378

    accuracy                           0.71       800
   macro avg       0.71      0.71      0.71       800
weighted avg       0.72      0.71      0.71       800

Best Model Confusion Matrix on Testing Set:
[[267 155]
 [ 80 298]]


In [66]:
feature_importances = best_lgb_model.feature_importances_
feature_importances

array([129, 208, 139,  27,  13,  10,  33,   1,  41], dtype=int32)

In [67]:
# Print the model's features sorted from most to least important
print(sorted(list(zip(X_train_scaled.columns, best_lgb_model.feature_importances_)), key=lambda x: x[1], reverse=True))

[('property_value', 208), ('income', 139), ('loan_amount', 129), ('debt_to_income_ratio_>60%', 41), ('debt_to_income_ratio_50%-_60%', 33), ('co_applicant', 27), ('debt_to_income_ratio_30%-_40%', 13), ('debt_to_income_ratio_40%-_50%', 10), ('debt_to_income_ratio__20%', 1)]
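The raw values are easier to compare as shares of the total. With LightGBM's default `importance_type='split'`, each value counts how many times a feature was used to split a node, so the sketch below simply normalizes the counts printed above. The dictionary is copied from that output; the variable names are illustrative.

```python
import pandas as pd

# Split counts copied from the output above
importances = pd.Series({
    'property_value': 208,
    'income': 139,
    'loan_amount': 129,
    'debt_to_income_ratio_>60%': 41,
    'debt_to_income_ratio_50%-_60%': 33,
    'co_applicant': 27,
    'debt_to_income_ratio_30%-_40%': 13,
    'debt_to_income_ratio_40%-_50%': 10,
    'debt_to_income_ratio__20%': 1,
})

# Convert counts to shares of all splits and sort from largest to smallest
shares = (importances / importances.sum()).sort_values(ascending=False)
print(shares.round(3))
print(f"Top three features: {shares.iloc[:3].sum():.1%} of all splits")
```

The three continuous features (`property_value`, `income`, `loan_amount`) account for about 79% of the 601 splits. Split counts tend to favour continuous features over binary dummies, so this ranking should be read with some caution.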


### Model Accuracy Analysis

Here are the accuracy scores of the tuned models, taken from the outputs above:

- **Random Forest**: Train = 84.8%, Test = 70.5%
- **Logistic Regression**: Train = 67.8%, Test = 65.1%
- **SVM**: Train = 69.1%, Test = 65.1%
- **XGBoost**: Train = 75.4%, Test = 70.6%
- **LightGBM**: Train = 75.8%, Test = 70.6%

#### Analysis

1. **Random Forest**:
   - High training accuracy (84.8%) but much lower test accuracy (70.5%); the gap of about 14 points indicates overfitting.
2. **Logistic Regression**:
   - Similar training and test accuracy (67.8% and 65.1%), indicating good generalization but lower overall accuracy.
3. **SVM**:
   - Training and test accuracy are fairly close (69.1% and 65.1%), so like logistic regression it generalizes reasonably well but with lower accuracy.
4. **XGBoost**:
   - Balanced performance (75.4% training, 70.6% test), suggesting good generalization with higher accuracy than logistic regression and SVM.
5. **LightGBM**:
   - Balanced performance (75.8% training, 70.6% test). Its test accuracy equals that of XGBoost (both classify 565 of the 800 test rows correctly), with slightly higher recall for class 1 (0.79 vs. 0.77).

#### Conclusion

**LightGBM** appears to be the best model, though only narrowly. It shares the highest test accuracy (70.6%) with XGBoost, has the highest class 1 recall on the test set (0.79), and its training accuracy (75.8%) is reasonably close to its test accuracy, suggesting it generalizes without the heavy overfitting seen in the Random Forest. The three tree-based models differ by at most one correctly classified test row, so the ranking among them should not be treated as decisive.
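To put the differences between models in perspective, the sketch below computes each tuned model's train-test gap and a rough standard error for its test accuracy. The accuracies are copied from the outputs above. The standard error uses the usual normal approximation for a proportion, which is an assumption added here and not something the notebook computed.

```python
import math

# (train accuracy, test accuracy) of the tuned models, copied from the outputs above
results = {
    "Random Forest": (0.8475, 0.705),
    "Logistic Regression": (0.678125, 0.65125),
    "SVM": (0.6909375, 0.65125),
    "XGBoost": (0.7540625, 0.70625),
    "LightGBM": (0.7578125, 0.70625),
}
n_test = 800  # number of rows in the test split

gaps = {}
std_errors = {}
for name, (train_acc, test_acc) in results.items():
    gaps[name] = train_acc - test_acc
    # Normal-approximation standard error of a proportion
    std_errors[name] = math.sqrt(test_acc * (1 - test_acc) / n_test)
    print(f"{name:<20} gap={gaps[name]:.3f}  test={test_acc:.3f} +/- {1.96 * std_errors[name]:.3f}")
```

With 800 test rows the approximate 95% interval is about 3 percentage points either side of each test accuracy, which is far wider than the differences between Random Forest, XGBoost and LightGBM.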
