## **Loan Default Prediction**

### **Goal:**
#### -Identify if a borrower is at risk of failing to repay their loan

### **Process:**
#### -Use applicant data to develop a model that predicts if a borrower will default on their loan


In [1]:
#Metrics needed to determine if a borrower is at risk of default 

#Quantitative metrics
#1. PD probability of default -> found using Machine learning 
#2. credit score -> located in data set
#3. Loss Given Default (LGD) and Exposure at Default (EAD): (N/A for this project)

#Qualitative metrics
#1. income stability, employment history, DTI ratio (debt to income ratio = total monthly debt/total monthly income *100)
#2. Loan properties (size of loan, term, interest rate, collateral assets) 
#3. Management quality (for businesses), history of management team   (N/A for this project)
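
The DTI formula in the comment above can be sketched in a few lines. This is a minimal illustration only: the function name `dti_ratio` and the sample figures are hypothetical and not taken from the dataset. The formula as written yields a percentage, while the `DTIRatio` column in this dataset appears to be stored as a fraction (0.1 to 0.9), so the sketch shows both forms.

```python
def dti_ratio(total_monthly_debt: float, total_monthly_income: float) -> float:
    """Debt-to-income ratio as a fraction (0.35 means 35%)."""
    if total_monthly_income <= 0:
        raise ValueError("monthly income must be positive")
    return total_monthly_debt / total_monthly_income

# Hypothetical borrower: 1,400 in monthly debt payments, 4,000 in monthly income
fraction = dti_ratio(1400, 4000)
percent = fraction * 100   # percentage form, as in the formula above
print(f"DTI = {fraction:.2f} ({percent:.0f}%)")
```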

#### **1. Data Loading and Cleaning**

In [2]:
#imports
import pandas as pd
import numpy as np
import torch
import torchinfo 
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
#Load data into pandas dataframe
df = pd.read_csv("Loan_default.csv")

#check and drop any rows that contain empty values      *(for this exercise the data is already clean)
print(df.isnull().sum())
df = df.dropna(axis=0)

#drop loanID as it will not be used in the analysis
df = df.drop(['LoanID'],axis=1)

LoanID            0
Age               0
Income            0
LoanAmount        0
CreditScore       0
MonthsEmployed    0
NumCreditLines    0
InterestRate      0
LoanTerm          0
DTIRatio          0
Education         0
EmploymentType    0
MaritalStatus     0
HasMortgage       0
HasDependents     0
LoanPurpose       0
HasCoSigner       0
Default           0
dtype: int64


#### **2. Data Manipulation**

#### **2.1 Baseline (majority class) accuracy**

In [4]:
#calculating the most frequent class   (0 or 1)
#final model performance must be above the baseline to indicate accurate predictions
default_count = pd.DataFrame(df['Default']).value_counts()
print(default_count)

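#baseline accuracy = share of the majority class (computed from the counts instead of hard-coded values)
total = default_count.sum()
baseline = default_count.max()/total

print(baseline)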

Default
0          225694
1           29653
Name: count, dtype: int64
0.8838717509898295


### About 88% of the dataset belongs to the non-default class, so the imbalance is handled with downsampling and upweighting; the next cell performs the downsampling step. This keeps the model from simply predicting the most common class.

In [5]:
#get the indices for each class, remove samples at random indices from the majority class,combine into new dataframe
majority_index = df[df['Default'] == 0].index
minority_index = df[df['Default'] == 1].index

# Downsample majority to minority size
downsample_size = len(minority_index)
downsample_index = np.random.choice(majority_index, size=downsample_size, replace=False)

#Combined new indices
balanced_index = np.concatenate([downsample_index, minority_index])
df_rebalanced = df.loc[balanced_index].reset_index(drop=True)

print(f"Re-Balanced dataset: {len(df_rebalanced)} total sample")
print(pd.DataFrame(df_rebalanced['Default']).value_counts())

Re-Balanced dataset: 59306 total sample
Default
0          29653
1          29653
Name: count, dtype: int64
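
The downsampling step above keeps roughly one majority sample for every 7.6 in the original data. The sketch below works through that arithmetic using the class counts printed earlier. The names `majority_weight` and `minority_weight` are hypothetical, and using the downsampling factor as an example weight in the loss is an assumption about how upweighting is typically done, not something shown in this notebook.

```python
# Class counts taken from the value_counts() output before and after rebalancing
majority_before = 225694
minority = 29653
majority_after = 29653

# Downsampling factor: how many original majority rows each kept row stands for
downsampling_factor = majority_before / majority_after

# Assumption: upweighting gives each kept majority example a loss weight equal to
# the downsampling factor, so the weighted class balance matches the original data
majority_weight = downsampling_factor
minority_weight = 1.0

weighted_share = (majority_after * majority_weight) / (
    majority_after * majority_weight + minority * minority_weight
)

print(f"downsampling factor: {downsampling_factor:.2f}")
print(f"weighted non-default share: {weighted_share:.4f}")
```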


#### **2.2 Dataset Contents**

In [6]:
'''
18 total columns (original)

independent variables (features):
9 numeric variables                 (excluding binary label 0,1 for default result)
4 relevant non-numeric variables    (excluding loanID)
3 boolean variables                 (yes, no  values, convert to 1,0)

dependent variable (target):
1 boolean column values to label as default or not
'''


In [7]:
#get unique values (for non-numeric variables)
unique_education = df['Education'].unique()

unique_employment = df['EmploymentType'].unique()

unique_marital = df['MaritalStatus'].unique()

unique_purpose = df['LoanPurpose'].unique()

print(f"education: {unique_education}\nemployment: {unique_employment}\nmarital status: {unique_marital}\nloan purpose: {unique_purpose}")

#get min and max values to estimate a range for each column in the training/testing data
maxCreditscore = df['CreditScore'].max()
print(f"max credit score: {maxCreditscore}")

monthsEmployed_min = df['MonthsEmployed'].min()
monthsEmployed_max = df['MonthsEmployed'].max()
print(f"months employed min and max: {monthsEmployed_min}, {monthsEmployed_max}")

numCreditLines_min = df['NumCreditLines'].min()
numCreditLines_max = df['NumCreditLines'].max()
print(f"num credit lines min and max: {numCreditLines_min}, {numCreditLines_max}")

interestRate_min = df['InterestRate'].min()
interestRate_max = df['InterestRate'].max()
print(f"InterestRate min and max: {interestRate_min}, {interestRate_max}")

loanTerm_min = df['LoanTerm'].min()
loanTerm_max = df['LoanTerm'].max()
print(f"LoanTerm min and max: {loanTerm_min}, {loanTerm_max}")

DTIRatio_min = df['DTIRatio'].min()
DTIRatio_max = df['DTIRatio'].max()
print(f"DTI Ratio min and max: {DTIRatio_min}, {DTIRatio_max}")

education: ["Bachelor's" "Master's" 'High School' 'PhD']
employment: ['Full-time' 'Unemployed' 'Self-employed' 'Part-time']
marital status: ['Divorced' 'Married' 'Single']
loan purpose: ['Other' 'Auto' 'Business' 'Home' 'Education']
max credit score: 849
months employed min and max: 0, 119
num credit lines min and max: 1, 4
InterestRate min and max: 2.0, 25.0
LoanTerm min and max: 12, 60
DTI Ratio min and max: 0.1, 0.9


#### **2.3 Training and Testing Data**

In [8]:
from torch.utils.data import TensorDataset, DataLoader,random_split
import pytorch_lightning as L
import ModelsFinal as models

In [9]:
#Training and Testing data

df_train_test = pd.DataFrame(df_rebalanced)
#display(df_train_test)

#one-hot encode the non-numeric categorical columns (education, employment, marital status, loan purpose) and the yes/no boolean columns
#drop_first=True drops the first dummy column of each variable to avoid multicollinearity, i.e. one dummy column being fully determined by the others
df_encoded = pd.get_dummies(df_train_test, columns=['Education','EmploymentType','MaritalStatus','LoanPurpose','HasMortgage','HasDependents','HasCoSigner'],drop_first=True, dtype=float)
#display(df_encoded)

#define independent and dependent variables (x, y)
#LoanID was already dropped during cleaning
#rather than listing every feature column, take the encoded dataframe and drop the target column (dependent variable Default)
x_df = df_encoded.drop(['Default'],axis=1)
y_df = df_encoded['Default']

#convert to torch tensor format
x_full = torch.tensor(x_df.to_numpy(),dtype=torch.float32)
y_full = torch.tensor(y_df.to_numpy(),dtype=torch.float32)

#train test split (80% train, 20% test)
dataset_full = TensorDataset(x_full, y_full)

dataset_size = len(dataset_full)
train_split: int = int(0.8*dataset_size)
test_split: int = dataset_size-train_split

#set seed to 25 for pytorch rng consistency
torch.manual_seed(25)
train_set, test_set = random_split(dataset_full, [train_split, test_split])

#Final DataLoaders (batch size of 32, a common default)
train_dataloader = DataLoader(train_set, batch_size=32, shuffle=True)
test_dataloader = DataLoader(test_set, batch_size=32, shuffle=False)
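
The split sizes and the number of batches per epoch follow directly from the dataset size. The sketch below reproduces that arithmetic in plain Python, assuming the 59,306-row rebalanced dataset from above and PyTorch's default of keeping the final partial batch. The resulting batch counts should match the totals in the progress bars Lightning prints during training and testing.

```python
import math

dataset_size = 59306   # rows in the rebalanced dataset
batch_size = 32

# Same arithmetic as the 80/20 split in the cell above
train_size = int(0.8 * dataset_size)
test_size = dataset_size - train_size

# DataLoader keeps the last partial batch by default (drop_last=False),
# so the number of batches per epoch is rounded up
train_batches = math.ceil(train_size / batch_size)
test_batches = math.ceil(test_size / batch_size)

print(train_size, test_size)
print(train_batches, test_batches)
```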


## **3. Logistic Regression (Linear Classifier)**

### A basic logistic regression model, implemented as a neural network with one linear layer and a sigmoid activation function
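
To make the "one linear layer plus sigmoid" idea concrete, here is a minimal sketch of the forward pass in plain Python. It is not the `ModelsFinal.LinearModel` implementation, which is not shown in this notebook; the helper names, weights, and feature values are hypothetical and use 3 inputs rather than the model's 24.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Single linear layer followed by a sigmoid: p = sigmoid(w . x + b)."""
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(logit)

# Hypothetical 3-feature example
weights = [0.5, -1.0, 0.25]
bias = 0.1
features = [1.0, 0.5, 2.0]

p_default = predict_proba(features, weights, bias)
predicted_label = int(p_default >= 0.5)   # classify as default when p >= 0.5
print(f"P(default) = {p_default:.3f}, predicted label = {predicted_label}")
```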

In [10]:
#Linear model (total features = 24 after encoding)
model_LR = models.LinearModel(x_df.shape[1])
torchinfo.summary(model_LR)

Layer (type:depth-idx)                   Param #
LinearModel                              --
├─BinaryAccuracy: 1-1                    --
├─Sequential: 1-2                        --
│    └─Linear: 2-1                       25
Total params: 25
Trainable params: 25
Non-trainable params: 0
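
The 25 parameters in the summary can be traced back to the encoding step. The sketch below counts the encoded features from the category lists printed earlier, assuming each yes/no column has exactly two values; the dictionary and variable names are only illustrative.

```python
# Number of categories per one-hot encoded column, from the unique values and
# yes/no columns described earlier in the notebook
categories = {
    "Education": 4,
    "EmploymentType": 4,
    "MaritalStatus": 3,
    "LoanPurpose": 5,
    "HasMortgage": 2,
    "HasDependents": 2,
    "HasCoSigner": 2,
}
numeric_features = 9

# drop_first=True keeps (k - 1) dummy columns for a column with k categories
dummy_features = sum(k - 1 for k in categories.values())
total_features = numeric_features + dummy_features

# A single Linear layer with one output has one weight per feature plus one bias
linear_params = total_features * 1 + 1

print(total_features, linear_params)
```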

In [11]:

print(model_LR)
trainer = L.Trainer(max_epochs=3)
trainer.fit(model_LR,train_dataloaders=train_dataloader)
trainer.test(model_LR,dataloaders=test_dataloader)




GPU available: False, used: False
TPU available: False, using: 0 TPU cores
c:\Users\Master\AppData\Local\Programs\Python\Python313\Lib\site-packages\pytorch_lightning\trainer\connectors\logger_connector\logger_connector.py:76: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
c:\Users\Master\AppData\Local\Programs\Python\Python313\Lib\site-packages\pytorch_lightning\trainer\configuration_validator.py:70: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.

  | Name     | Type           | Params | Mode  | FLOPs
------------------------------------------------------------
0 | accuracy | BinaryAccuracy | 0     

LinearModel(
  (accuracy): BinaryAccuracy()
  (model): Sequential(
    (0): Linear(in_features=24, out_features=1, bias=True)
  )
)
Epoch 2: 100%|██████████| 1483/1483 [00:17<00:00, 84.55it/s, v_num=35, train_step_acc=0.350]

`Trainer.fit` stopped: `max_epochs=3` reached.


Epoch 2: 100%|██████████| 1483/1483 [00:17<00:00, 84.39it/s, v_num=35, train_step_acc=0.350]


c:\Users\Master\AppData\Local\Programs\Python\Python313\Lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:434: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Testing DataLoader 0: 100%|██████████| 371/371 [00:03<00:00, 122.75it/s]
────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────
      test_step_acc         0.5092732906341553
     test_step_loss          8.999670028686523
────────────────────────────────────────────────

[{'test_step_acc': 0.5092732906341553, 'test_step_loss': 8.999670028686523}]

In [12]:
print(f"Final Metrics: {trainer.logged_metrics}")

Final Metrics: {'test_step_acc': tensor(0.5093), 'test_step_loss': tensor(8.9997)}


### A test accuracy of about 51% on the class-balanced data is close to chance, which indicates that this basic logistic regression model, as trained here, does not capture the structure of the dataset.
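
A quick way to judge the result is to compare it with the majority-class baseline. The sketch below recomputes that baseline from the class counts printed earlier; the helper name `majority_baseline` is hypothetical. Note that the test split is a random subset of the rebalanced data, so its class balance is only approximately 50/50.

```python
def majority_baseline(class_counts):
    """Accuracy of always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())

# Class counts printed earlier in the notebook
original = {0: 225694, 1: 29653}
rebalanced = {0: 29653, 1: 29653}

logreg_test_acc = 0.5092732906341553   # test_step_acc reported above

margin = logreg_test_acc - majority_baseline(rebalanced)
print(f"baseline (original data):   {majority_baseline(original):.4f}")
print(f"baseline (rebalanced data): {majority_baseline(rebalanced):.4f}")
print(f"margin over rebalanced baseline: {margin:.4f}")
```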

## **4. Multi Layer Perceptron (MLP)**

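An MLP with one hidden layer of 100 units and a ReLU activation, followed by a single output unit (see the model summary below).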

In [13]:
trainer_mlp = L.Trainer(max_epochs=34)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores


In [15]:
model_MLP = models.MLP(x_df.shape[1])
torchinfo.summary(model_MLP)

Layer (type:depth-idx)                   Param #
MLP                                      --
├─BinaryAccuracy: 1-1                    --
├─Sequential: 1-2                        --
│    └─Linear: 2-1                       2,500
│    └─ReLU: 2-2                         --
│    └─Linear: 2-3                       101
Total params: 2,601
Trainable params: 2,601
Non-trainable params: 0
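
The parameter counts in the summary pin down the hidden layer width. A short sketch of the arithmetic follows; the helper name `linear_params` is hypothetical, and the hidden size of 100 is inferred from the summary rather than read from `ModelsFinal`.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

in_features = 24   # encoded feature count
hidden = 100       # inferred from the summary: 24 * 100 + 100 = 2,500

# ReLU has no trainable parameters, so only the two Linear layers count
layer_sizes = [(in_features, hidden), (hidden, 1)]
per_layer = [linear_params(i, o) for i, o in layer_sizes]

print(per_layer, sum(per_layer))
```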

In [16]:
trainer_mlp.fit(model_MLP,train_dataloaders=train_dataloader)
trainer_mlp.test(model_MLP,dataloaders=test_dataloader)


  | Name     | Type           | Params | Mode  | FLOPs
------------------------------------------------------------
0 | accuracy | BinaryAccuracy | 0      | train | 0    
1 | model    | Sequential     | 2.6 K  | train | 0    
------------------------------------------------------------
2.6 K     Trainable params
0         Non-trainable params
2.6 K     Total params
0.010     Total estimated model params size (MB)
5         Modules in train mode
0         Modules in eval mode
0         Total Flops


Epoch 33: 100%|██████████| 1483/1483 [00:19<00:00, 77.20it/s, v_num=36, train_step_acc=0.550]

`Trainer.fit` stopped: `max_epochs=34` reached.


Epoch 33: 100%|██████████| 1483/1483 [00:19<00:00, 76.88it/s, v_num=36, train_step_acc=0.550]
Testing DataLoader 0: 100%|██████████| 371/371 [00:03<00:00, 121.90it/s]
────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────
      test_step_acc         0.5228460431098938
     test_step_loss          5.409454345703125
────────────────────────────────────────────────

[{'test_step_acc': 0.5228460431098938, 'test_step_loss': 5.409454345703125}]

In [None]:
print(f"Final Metrics: {trainer_mlp.logged_metrics}")

Final Metrics: {'test_step_acc': tensor(0.5219), 'test_step_loss': tensor(4.9107)}


## **5. Conclusion**

### Both models performed well below the ideal in this experiment, reaching roughly 51% to 52% test accuracy rather than the 80% or more one would hope for. The MLP slightly outperformed the logistic regression model, which may indicate that it captured somewhat more detailed relationships between the input features, although the margin is small.
### Potential improvements include: L1 regularization in both models for better feature selection, experimenting with different optimizers such as stochastic gradient descent, a different learning rate configuration, and more layers for the MLP model.
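
As a rough illustration of the L1 regularization idea, the penalty is the sum of absolute weight values scaled by a strength factor and added to the loss. The sketch below uses plain Python with hypothetical weights, loss value, and strength `lam`; in a PyTorch model the same term would usually be computed from the model's parameter tensors inside the training step.

```python
def l1_penalty(weights, lam):
    """L1 regularization term: lam * sum(|w|)."""
    return lam * sum(abs(w) for w in weights)

def regularized_loss(base_loss, weights, lam=0.01):
    """Add the L1 penalty to an already computed loss value."""
    return base_loss + l1_penalty(weights, lam)

# Hypothetical weights and loss value, for illustration only
weights = [0.8, -0.05, 0.0, 1.2]
base_loss = 0.69

print(f"penalty: {l1_penalty(weights, 0.01):.4f}")
print(f"total loss: {regularized_loss(base_loss, weights):.4f}")
```

Because the penalty grows with every non-zero weight, minimizing the combined loss tends to push uninformative weights toward zero, which is why L1 is often associated with feature selection.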