# Loan Default Prediction 3 - Modeling 1


We’ll use two models for baseline evaluation:

- Logistic Regression: interpretable and a common starting point for binary classification.
- Gradient Boosting: a stronger, non-linear model. This notebook uses scikit-learn's `GradientBoostingClassifier`; XGBoost or LightGBM would be alternatives.

At the end we will compare the two.


#### Imports

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score, roc_curve, accuracy_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler


In [2]:
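# index_col=0 makes the first CSV column the index. In the output below that
# column is person_income, so it is not part of the feature matrix later on.
df = pd.read_csv("../data/data_for_model.csv", index_col=0)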

df.head()

| person_income | person_emp_length | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_cred_hist_length | loan_intent_EDUCATION | loan_intent_HOMEIMPROVEMENT | loan_intent_MEDICAL | loan_intent_PERSONAL | ... | loan_grade_B | loan_grade_C | loan_grade_D | loan_grade_E | loan_grade_F | loan_grade_G | person_home_ownership_OTHER | person_home_ownership_OWN | person_home_ownership_RENT | cb_person_default_on_file_Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9600 | 5.0 | 1000 | 11.14 | 0 | 0.1 | 2 | True | False | False | False | ... | True | False | False | False | False | False | False | True | False | False |
| 9600 | 1.0 | 5500 | 12.87 | 1 | 0.57 | 3 | False | False | True | False | ... | False | True | False | False | False | False | False | False | False | False |
| 65500 | 4.0 | 35000 | 15.23 | 1 | 0.53 | 2 | False | False | True | False | ... | False | True | False | False | False | False | False | False | True | False |
| 54400 | 8.0 | 35000 | 14.27 | 1 | 0.55 | 4 | False | False | True | False | ... | False | True | False | False | False | False | False | False | True | True |
| 9900 | 2.0 | 2500 | 7.14 | 1 | 0.25 | 2 | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |


### Train-Test Split

In [3]:
# we are doing an 80/20 split

X = df.drop(columns=['loan_status'])
y = df['loan_status'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=99)
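
The split above does not stratify on the target. Because defaults are the smaller class, one option is to pass `stratify=y` so that both splits keep the same default rate. A minimal sketch on synthetic labels; `X_demo` and `y_demo` are made-up stand-ins, not the loan data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 1000 rows, 22% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 220 + [0] * 780)

# stratify=y_demo keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=99, stratify=y_demo
)

print(f"train default rate: {y_tr.mean():.3f}")
print(f"test default rate:  {y_te.mean():.3f}")
```

Whether this changes the results on the real data is something to check rather than assume.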

## Scale Numerical Data


In [4]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
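
`StandardScaler` learns each column's mean and standard deviation from the training set and reuses them for the test set, which is why the cell calls `fit_transform` on train and only `transform` on test. A small sketch of that arithmetic for a single column, using made-up values:

```python
import statistics

# Hypothetical values for a single numeric column
train_col = [1000.0, 5500.0, 35000.0, 2500.0]
test_col = [12000.0]

# "fit": the statistics come from the training data only
mean = statistics.fmean(train_col)
std = statistics.pstdev(train_col)  # population std, as StandardScaler uses

# "transform": the same mean and std are applied to both sets
train_scaled = [(v - mean) / std for v in train_col]
test_scaled = [(v - mean) / std for v in test_col]

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```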

## Logistic Regression

In [5]:
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)
y_pred_lr = log_reg.predict(X_test_scaled)
y_pred_lr_proba = log_reg.predict_proba(X_test_scaled)[:, 1]

#### Evaluation


In [6]:
print("Logistic Regression Performance:")
print(classification_report(y_test, y_pred_lr))
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_lr_proba):.2f}")

Logistic Regression Performance:
              precision    recall  f1-score   support

           0       0.89      0.95      0.92      4601
           1       0.76      0.59      0.66      1291

    accuracy                           0.87      5892
   macro avg       0.83      0.77      0.79      5892
weighted avg       0.86      0.87      0.86      5892

AUC-ROC: 0.87


**Precision:**
- Class 0 (Non-default): 0.89 - Out of all the loans predicted as non-default, 89% were actually non-default.
- Class 1 (Default): 0.76 - Out of all loans predicted as default, 76% were actually default.

**Recall:**
- Class 0 (Non-default): 0.95 - The model identified 95% of the actual non-default loans correctly.
- Class 1 (Default): 0.59 - The model only identified 59% of the actual default loans correctly, which indicates some room for improvement in detecting defaults.

**F1-Score:**
- Class 0 (Non-default): 0.92 - Strong balance of precision and recall for non-defaults.
- Class 1 (Default): 0.66 - Lower F1-score for defaults, reflecting lower recall for this class.

**Accuracy:** 0.87 (87%) - The model correctly classified 87% of loans overall.

**AUC-ROC:** 0.87 - Indicates good separability between the two classes, suggesting the model is performing well at distinguishing between defaults and non-defaults.

**Low recall for defaults.** The recall for defaults (Class 1) is low (0.59), meaning the model misses about 41% of actual defaults. This is not surprising: defaults are the minority class (1,291 of the 5,892 test loans, about 22%), and at the default 0.5 decision threshold Logistic Regression tends to favor the majority class (Non-default, Class 0).

**Good performance for non-defaults.** Precision, recall, and F1-score for non-defaults (Class 0) are strong, which is helpful for identifying reliable borrowers.
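
To make the three per-class metrics concrete, here is a short sketch that computes them from confusion-matrix counts. The counts are hypothetical and chosen for easy arithmetic; they are not taken from this notebook's predictions.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from confusion-matrix counts."""
    precision = tp / (tp + fp)   # of everything predicted positive, how much was right
    recall = tp / (tp + fn)      # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts for the default class
tp, fp, fn = 60, 20, 40
precision, recall, f1 = class_metrics(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of precision and recall, which is why a low recall drags down the F1-score for defaults.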


## Gradient Boosting

In [7]:
gb_clf = GradientBoostingClassifier(random_state=42)
gb_clf.fit(X_train, y_train)
y_pred_gb = gb_clf.predict(X_test)
y_pred_gb_proba = gb_clf.predict_proba(X_test)[:, 1]

#### Evaluation

In [8]:
print("\nGradient Boosting Performance:")
print(classification_report(y_test, y_pred_gb))
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_gb_proba):.2f}")


Gradient Boosting Performance:
              precision    recall  f1-score   support

           0       0.92      0.98      0.95      4601
           1       0.91      0.69      0.79      1291

    accuracy                           0.92      5892
   macro avg       0.92      0.84      0.87      5892
weighted avg       0.92      0.92      0.91      5892

AUC-ROC: 0.92


**Precision:**
- Class 0 (Non-default): 0.92 - Out of all loans predicted as non-default, 92% were actually non-default.
- Class 1 (Default): 0.91 - Out of all loans predicted as default, 91% were actually default. This is a significant improvement over Logistic Regression (0.76 for Class 1 precision).

**Recall:**
- Class 0 (Non-default): 0.98 - The model identified 98% of the actual non-default loans correctly, which is very strong.
- Class 1 (Default): 0.69 - The model identified 69% of the actual default loans correctly. This is an improvement over Logistic Regression (0.59 recall for Class 1).

**F1-Score:**
- Class 0 (Non-default): 0.95 - Excellent balance of precision and recall for non-defaults.
- Class 1 (Default): 0.79 - Improved F1-score for defaults compared to Logistic Regression (0.66).

**Accuracy:** 0.92 (92%) - The model correctly classified 92% of loans overall, which is higher than Logistic Regression (87%).

**AUC-ROC:** 0.92 - Indicates excellent separability between the two classes, showing that the model is effectively distinguishing defaults from non-defaults.
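
One way to read AUC-ROC is as the probability that a randomly chosen default receives a higher predicted score than a randomly chosen non-default. The sketch below computes it by brute force over all pairs; the scores are invented for illustration, and `pairwise_auc` is a hypothetical helper, not part of scikit-learn.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical predicted default probabilities
default_scores = [0.9, 0.8, 0.4]
non_default_scores = [0.7, 0.3, 0.2, 0.1]

auc = pairwise_auc(default_scores, non_default_scores)
print(f"AUC: {auc:.3f}")
```

This reading also explains why AUC-ROC does not depend on the 0.5 decision threshold: it only uses the ordering of the scores.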



Precision (0.91) and recall (0.69) for defaults are both better than for Logistic Regression (0.76 and 0.59), making this model more effective at identifying high-risk borrowers.

**Overall performance:** Gradient Boosting has better overall accuracy, AUC-ROC, and F1-scores for both classes compared to Logistic Regression. It is likely capturing non-linear relationships in the data that Logistic Regression cannot.

**Class imbalance impact:** While Gradient Boosting performs better on the minority class (defaults), recall (0.69) for Class 1 could still be improved further.

### Next Step - Ways to improve the models

**Logistic Regression**
The model identified only 59% of the actual defaults. This is expected, since defaults (Class 1) are the minority class in the dataset and Logistic Regression tends to favor the majority class (Non-default, Class 0).

To address this we will use techniques to handle class imbalance and improve recall for defaults. We will consider: 

- Class weights: assign a higher weight to the minority class.
- Oversampling: use SMOTE (Synthetic Minority Oversampling Technique) to generate synthetic examples for the default class.
- Undersampling: reduce the number of majority class (non-default) samples.

In addition to handling class imbalance, we can also perform hyperparameter tuning using GridSearchCV or RandomizedSearchCV. We will first focus on the class imbalance before further tuning.

**Gradient Boosting** This model has better overall accuracy, but we can still see how well it performs after handling class imbalance. 

## Approach
We will compare the three methods and see which one works best.

##### 1. Class Weights
Adjust the model to assign a higher weight to the minority class (loan_status=1) to penalize the model more for misclassifying defaults.

**Implementation:**
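Use `class_weight='balanced'` when constructing `LogisticRegression`. scikit-learn's `GradientBoostingClassifier` has no `class_weight` parameter, so weighting that model means passing `sample_weight` to `fit` instead.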
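
For reference, the `'balanced'` setting weights each class by `n_samples / (n_classes * class_count)`. A short sketch of that formula on hypothetical label counts; `balanced_class_weights` is an illustrative helper, not a scikit-learn function:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights as n_samples / (n_classes * class_count), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical training labels: 780 non-defaults and 220 defaults
y_demo = [0] * 780 + [1] * 220
weights = balanced_class_weights(y_demo)
print({cls: round(w, 3) for cls, w in weights.items()})
```

The rarer class gets the larger weight, so that each class contributes the same total weight to the loss.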

##### 2. Oversampling with SMOTE
We will use SMOTE (Synthetic Minority Oversampling Technique) to create synthetic examples for the minority class.

**Implementation:**
Use the SMOTE class from imblearn to oversample the training data.
We will need to be careful to only apply SMOTE to the training set to avoid data leakage.
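
The core idea of SMOTE is interpolation: pick a minority-class point, pick one of its nearest minority-class neighbours, and create a new point somewhere on the line between them. The sketch below is a simplified pure-Python illustration of that idea, not imblearn's implementation; `smote_sketch` and the sample rows are hypothetical.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority points
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        lam = rng.random()  # position along the segment from x to neighbour
        synthetic.append([a + lam * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Hypothetical minority-class rows, taken from the training set only
minority_train = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]]
new_rows = smote_sketch(minority_train, n_new=3)
print(new_rows)
```

Since the synthetic rows are built from training rows, resampling before the train-test split would let information from test rows shape the training data, which is the leakage the note above warns about.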

##### 3. Undersampling
Randomly remove samples from the majority class to balance the dataset.

**Implementation:**
Use the RandomUnderSampler class from imblearn.
Like SMOTE, we need to make sure we are applying undersampling only to the training set.
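
Random undersampling is simple enough to sketch by hand: keep all minority rows and draw an equally sized random subset of the majority rows. This is an illustration of the idea with made-up data, not the `RandomUnderSampler` implementation; `random_undersample` is a hypothetical helper.

```python
import random

def random_undersample(rows, labels, seed=42):
    """Drop random majority-class rows until both classes have equal counts."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    n_keep = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        kept = rng.sample(members, n_keep)  # sampling without replacement
        out_rows.extend(kept)
        out_labels.extend([label] * n_keep)
    return out_rows, out_labels

# Hypothetical training split: 8 non-defaults, 2 defaults
train_rows = [[i] for i in range(10)]
train_labels = [0] * 8 + [1] * 2
bal_rows, bal_labels = random_undersample(train_rows, train_labels)
print(bal_labels.count(0), bal_labels.count(1))
```

The cost is visible in the toy example: six of the eight majority rows are discarded, so the model sees much less of the non-default class.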

## Class Imbalance in Logistic Regression

In [12]:
# Helper function: fit a model, then print its classification report and AUC-ROC on the test set
def evaluate_model(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    print(classification_report(y_test, y_pred))
    print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_proba):.3f}")



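# 1. Logistic Regression with Class Weights
# Note: the three runs in this cell fit on the unscaled X_train / X_test,
# unlike the baseline above, which used the StandardScaler output.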
print("Logistic Regression with Class Weights")
weighted_model = LogisticRegression(class_weight='balanced', random_state=42, solver='liblinear')
evaluate_model(weighted_model, X_train, X_test, y_train, y_test)

# 2. Logistic Regression with SMOTE
print("\nLogistic Regression with SMOTE")
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
oversampled_model = LogisticRegression(random_state=42, solver='liblinear')
evaluate_model(oversampled_model, X_train_smote, X_test, y_train_smote, y_test)

# 3. Logistic Regression with Undersampling
print("\nLogistic Regression with Undersampling")
undersampler = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = undersampler.fit_resample(X_train, y_train)
undersampled_model = LogisticRegression(random_state=42, solver='liblinear')
evaluate_model(undersampled_model, X_train_under, X_test, y_train_under, y_test)


Logistic Regression with Class Weights
              precision    recall  f1-score   support

           0       0.93      0.80      0.86      4601
           1       0.53      0.78      0.63      1291

    accuracy                           0.80      5892
   macro avg       0.73      0.79      0.74      5892
weighted avg       0.84      0.80      0.81      5892

AUC-ROC: 0.872

Logistic Regression with SMOTE
              precision    recall  f1-score   support

           0       0.91      0.86      0.88      4601
           1       0.58      0.69      0.63      1291

    accuracy                           0.82      5892
   macro avg       0.74      0.77      0.75      5892
weighted avg       0.83      0.82      0.83      5892

AUC-ROC: 0.848

Logistic Regression with Undersampling
              precision    recall  f1-score   support

           0       0.92      0.79      0.85      4601
           1       0.51      0.76      0.61      1291

    accuracy                           0.
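
The three runs above were fit on the unscaled `X_train`, while the baseline used scaled features, so the comparison mixes two changes. One way to keep scaling attached to the model is a scikit-learn pipeline. A minimal sketch on synthetic data (the `make_classification` data is a stand-in, not the loan dataset, and no claim is made about what the numbers would be on the real data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (hypothetical stand-in for the loan features)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.78, 0.22], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=99, stratify=y_demo
)

# The scaler is fit on the training data only, inside the pipeline
weighted_pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", random_state=42),
)
weighted_pipe.fit(X_tr, y_tr)

recall_default = recall_score(y_te, weighted_pipe.predict(X_te))
print(f"recall for class 1: {recall_default:.2f}")
```

The pipeline object exposes `fit`, `predict`, and `predict_proba`, so it could be passed to a helper like `evaluate_model` in place of a bare estimator.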

## Observations
### Class Weights

**Strengths:**
- Best recall for defaults (78%, up from 59% for the baseline), meaning the model identified most default cases.
- Highest AUC-ROC (0.872), essentially unchanged from the unweighted baseline (0.87), so the ranking of loans is about as good as before.

**Weaknesses:**
- Lower precision (53%) means more false positives, and accuracy drops from 87% to 80%. This is acceptable when the goal is to minimize missed defaults.

### SMOTE (Oversampling)
**Strengths:**
- Better precision (58%) compared to class weights, reducing false positives.
- Maintains good recall (69%) and accuracy (82%).

**Weaknesses:**
- Slightly lower AUC-ROC (0.848), indicating the model struggles more to separate classes.

### Undersampling
**Strengths:**
- Comparable recall (76%) to class weights.
- Simplifies the dataset by balancing classes.

**Weaknesses:**
- Lowest precision (51%) and F1-score (0.61) of the three methods, likely because the model is trained on far fewer non-default examples.
- Loses information about non-default patterns by removing majority-class examples.

## Which Method is Best?
The choice depends on the problem you're solving:

#### Use Class Weights:

If identifying as many defaults as possible (high recall) is the priority, class weights are the best option.
This method balances performance without modifying the dataset.

#### Use SMOTE:

If reducing false positives (higher precision) is important, SMOTE is a solid choice.
It slightly sacrifices recall for better precision and overall accuracy.

#### Avoid Undersampling:

While it maintains decent recall, undersampling is the least effective method due to its loss of information and lower precision.

### Conclusion
 In this case, Class Weights emerged as the best approach for Logistic Regression, offering a great balance of recall and overall model performance.

**In the next step, I’ll apply these techniques to Gradient Boosting to see how well it handles imbalanced data compared to Logistic Regression.**

## Gradient Boosting with Class Imbalance

In [13]:
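# 1. Gradient Boosting with Class Weights
# Note: GradientBoostingClassifier has no class_weight parameter and none is
# set here, so this run is the same unweighted model as the baseline above.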
print("Gradient Boosting with Class Weights")
weighted_model = GradientBoostingClassifier(random_state=42)
evaluate_model(weighted_model, X_train, X_test, y_train, y_test)

# 2. Gradient Boosting with SMOTE
print("\nGradient Boosting with SMOTE")
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
oversampled_model = GradientBoostingClassifier(random_state=42)
evaluate_model(oversampled_model, X_train_smote, X_test, y_train_smote, y_test)

# 3. Gradient Boosting with Undersampling
print("\nGradient Boosting with Undersampling")
undersampler = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = undersampler.fit_resample(X_train, y_train)
undersampled_model = GradientBoostingClassifier(random_state=42)
evaluate_model(undersampled_model, X_train_under, X_test, y_train_under, y_test)


Gradient Boosting with Class Weights
              precision    recall  f1-score   support

           0       0.92      0.98      0.95      4601
           1       0.91      0.69      0.79      1291

    accuracy                           0.92      5892
   macro avg       0.92      0.84      0.87      5892
weighted avg       0.92      0.92      0.91      5892

AUC-ROC: 0.921

Gradient Boosting with SMOTE
              precision    recall  f1-score   support

           0       0.93      0.93      0.93      4601
           1       0.76      0.74      0.75      1291

    accuracy                           0.89      5892
   macro avg       0.84      0.84      0.84      5892
weighted avg       0.89      0.89      0.89      5892

AUC-ROC: 0.906

Gradient Boosting with Undersampling
              precision    recall  f1-score   support

           0       0.94      0.89      0.91      4601
           1       0.66      0.79      0.72      1291

    accuracy                           0.87    

## Analysis of Gradient Boosting Results with Class Imbalance Techniques
### 1. Gradient Boosting with Class Weights
#### Performance Metrics:

**Precision (Class 1)**: 0.91 - Excellent precision, meaning 91% of loans predicted as defaults were actually defaults.

**Recall (Class 1):** 0.69 - Identified 69% of actual defaults, the lowest recall of the three runs.

**F1-Score (Class 1):** 0.79 - Balanced precision and recall.

**AUC-ROC:** 0.921 - Outstanding ability to distinguish between defaults and non-defaults.

#### Observations:

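No weighting was actually applied in this run, because `GradientBoostingClassifier` was constructed with default settings. The numbers are therefore identical to the Gradient Boosting baseline (precision 0.91, recall 0.69, F1 0.79) and serve as the unweighted reference for the two resampling methods below.
High AUC-ROC indicates the model effectively separates the two classes.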
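
`GradientBoostingClassifier` does not accept a `class_weight` argument, so the run labelled "Class Weights" above was fit without any weighting. A possible way to weight it is to compute per-row weights with `compute_sample_weight` and pass them to `fit`. The sketch below uses synthetic data as a stand-in for the loan features; results on the real data would need to be run and are not claimed here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data (hypothetical stand-in for the loan features)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.78, 0.22], random_state=0
)

# One weight per row; rows of the rarer class get a larger weight
row_weights = compute_sample_weight(class_weight="balanced", y=y_demo)

# n_estimators=50 is only to keep this demo quick
gb_weighted = GradientBoostingClassifier(n_estimators=50, random_state=42)
gb_weighted.fit(X_demo, y_demo, sample_weight=row_weights)

# With 'balanced' weights, both classes carry the same total weight
for cls in (0, 1):
    print(cls, round(row_weights[y_demo == cls].sum(), 1))
```

To use this with a helper like `evaluate_model`, the helper would need an extra argument that forwards `sample_weight` to `model.fit`.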

### 2. Gradient Boosting with SMOTE (Oversampling)
#### Performance Metrics:

**Precision (Class 1)**: 0.76 - Lower than class weights, meaning more false positives.

**Recall (Class 1):** 0.74 - Improved recall compared to class weights, catching more defaults.

**F1-Score (Class 1):** 0.75 - Balanced but slightly lower than class weights.

**AUC-ROC:** 0.906 - Slightly lower separability compared to class weights.

#### Observations:

SMOTE trades precision for better recall, making it a good option when capturing defaults is critical.
The slight drop in AUC-ROC may reflect noise introduced by the synthetic samples.

### 3. Gradient Boosting with Undersampling
#### Performance Metrics:

**Precision (Class 1)**: 0.66 - Lower precision, indicating more false positives.

**Recall (Class 1):** 0.79 - The best recall among all techniques, capturing the highest proportion of defaults.

**F1-Score (Class 1):** 0.72 - Balanced but slightly lower due to lower precision.

**AUC-ROC:** 0.921 - Matches class weights, indicating good class separability.

#### Observations:

Undersampling is effective at maximizing recall for defaults but struggles with precision.
Slightly lower accuracy and F1-score overall due to reduced training data.


### Class Weights:
#### Best Overall Option:
- High precision (0.91), but the lowest recall of the three runs (0.69).
- Best AUC-ROC (0.921), indicating excellent separability.
- Suitable for scenarios where false positives are costly and some missed defaults can be tolerated.

### SMOTE (Oversampling):
#### Best for Balanced Recall and Precision:
- Improves recall (0.74) while maintaining decent precision (0.76).
- Slightly lower AUC-ROC (0.906), possibly due to noise from synthetic samples.
- Ideal if you prioritize recall without sacrificing too much precision.

### Undersampling:
#### Best for Maximizing Recall:
- Achieved the highest recall (0.79), making it effective for identifying defaults.
- Lower precision (0.66) and accuracy (87%) due to reduced training data.
- Suitable when default identification is critical, but false positives are less costly.

### Comparison with Logistic Regression
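Gradient Boosting outperforms Logistic Regression on most metrics for all three techniques:

- Higher precision and F1-score for defaults in every run, and higher recall with SMOTE (0.74 vs. 0.69) and undersampling (0.79 vs. 0.76). The exception is recall in the class-weights run, where Logistic Regression is higher (0.78 vs. 0.69).
- Stronger AUC-ROC: Gradient Boosting achieves better class separability (0.921 vs. 0.872 for class weights).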

## Conclusion
For Gradient Boosting:

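- Class Weights: Best overall, with the highest AUC-ROC and precision. Because `GradientBoostingClassifier` was fit without any weighting, this is effectively the unweighted baseline.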
- SMOTE: A solid choice for slightly better recall without sacrificing much precision.
- Undersampling: Use only if maximizing recall is your primary objective.
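
To see the trade-offs side by side, the Class 1 (default) metrics reported in the outputs above can be collected into one small table. The sketch below only restates those printed numbers and picks the best run per metric; the `reported` list and its labels are an illustrative structure, not something computed from the data.

```python
# Class 1 (default) metrics as printed in the classification reports above:
# (model, method, precision, recall, f1)
reported = [
    ("Logistic Regression", "class weights", 0.53, 0.78, 0.63),
    ("Logistic Regression", "SMOTE", 0.58, 0.69, 0.63),
    ("Logistic Regression", "undersampling", 0.51, 0.76, 0.61),
    ("Gradient Boosting", "class weights", 0.91, 0.69, 0.79),
    ("Gradient Boosting", "SMOTE", 0.76, 0.74, 0.75),
    ("Gradient Boosting", "undersampling", 0.66, 0.79, 0.72),
]

best_precision = max(reported, key=lambda r: r[2])
best_recall = max(reported, key=lambda r: r[3])
best_f1 = max(reported, key=lambda r: r[4])

for label, row in [("precision", best_precision), ("recall", best_recall), ("f1", best_f1)]:
    print(f"best {label}: {row[0]} + {row[1]}")
```

Which run is "best" therefore depends on the metric: the answer differs for recall and for precision or F1.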

### We will be diving into feature engineering before revisiting modeling.