# Advanced Modeling

## 1. Introduction

In this notebook, we evaluate advanced models to improve upon the logistic regression baseline. We test Random Forest, Gradient Boosting, CatBoost, and LightGBM classifiers, using AUC-ROC as our evaluation metric.


## 2. Train and Evaluate Multiple Models

We train each model on the same train-test split and score it with AUC-ROC, so the results are directly comparable.
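The helpers imported from `src.utils` are not shown in this notebook, so the following is only a minimal sketch of what a shared-split evaluation could look like, not their actual implementation. The function name `evaluate_on_shared_split`, the 80/20 stratified split, and the synthetic data are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_on_shared_split(X, y, models, test_size=0.2, random_state=42):
    """Fit each model on one fixed split and return its test-set AUC-ROC.

    Hypothetical helper; the notebook's src.utils functions may differ.
    """
    # A fixed random_state gives every model the same train and test rows.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # AUC-ROC ranks positive-class scores, so use probabilities, not labels.
        positive_scores = model.predict_proba(X_test)[:, 1]
        scores[name] = roc_auc_score(y_test, positive_scores)
    return scores


# Synthetic stand-in data (assumption: this is not the loan dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}
auc_scores = evaluate_on_shared_split(X, y, models)
for name, auc in auc_scores.items():
    print(f"{name}: {auc:.4f}")
```

Holding the split fixed means any difference in AUC comes from the models rather than from which rows landed in the test set.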


In [14]:
# Import Libraries and Load Dataset
import pandas as pd
from src.utils import (
    evaluate_model_on_dataset,
    evaluate_catboost_on_dataset,
    evaluate_lightgbm_on_dataset,
)
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

train_preprocessed_dataset = pd.read_csv("data/train_preprocessed.csv")

In [15]:
# Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
result_random_forest = evaluate_model_on_dataset(train_preprocessed_dataset, model)

In [16]:
# Gradient Boosting Classifier
model = GradientBoostingClassifier(n_estimators=100, random_state=42)
result_gradient_boosting = evaluate_model_on_dataset(train_preprocessed_dataset, model)

In [17]:
# CatBoost Classifier
categorical_cols = ['cb_person_default_on_file', 'person_home_ownership', 'loan_intent', 'loan_grade']
result_catboost = evaluate_catboost_on_dataset(train_preprocessed_dataset, target_col='loan_status', categorical_cols=categorical_cols)

In [18]:
# LightGBM Classifier
result_lightgbm = evaluate_lightgbm_on_dataset(train_preprocessed_dataset, target_col="loan_status")

## 3. AUC Score Comparison

We compare AUC scores across all models tested.


In [19]:
# Create DataFrame For AUC Results
data = {
    'Model': ['CatBoost', 'Gradient Boosting', 'Random Forest', 'LightGBM'],
    'AUC': [result_catboost, result_gradient_boosting, result_random_forest, result_lightgbm]
}
df = pd.DataFrame(data)
df = df.sort_values(by='AUC', ascending=False)
df = df.reset_index(drop=True)
df.head()

|   | Model | AUC |
|---|-------|-----|
| 0 | LightGBM | 0.955654 |
| 1 | CatBoost | 0.943104 |
| 2 | Gradient Boosting | 0.941029 |
| 3 | Random Forest | 0.93325 |


## 4. Interpretation and Insights

Among the models tested, **LightGBM** achieved the highest AUC of **~0.9557**, followed by **CatBoost (~0.9431)**, **Gradient Boosting (~0.9410)**, and **Random Forest (~0.9333)**.

Key insights:
- All advanced models outperformed the baseline logistic regression (AUC: 0.7814), indicating the value of tree-based ensembles for this classification problem.
- LightGBM offered the best performance overall, making it the top candidate for further tuning and interpretation.
- While Random Forest and Gradient Boosting also performed strongly, the efficiency and flexibility of LightGBM make it preferable for final optimization.
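To make the comparison with the baseline concrete, the short sketch below computes each model's absolute AUC gain over logistic regression. The AUC values are copied from the results table and the baseline figure quoted above; the variable names are illustrative.

```python
# AUC values copied from the comparison table; baseline from the insights list.
baseline_auc = 0.7814
model_auc = {
    "LightGBM": 0.955654,
    "CatBoost": 0.943104,
    "Gradient Boosting": 0.941029,
    "Random Forest": 0.93325,
}

# Absolute gain in AUC over the logistic regression baseline.
auc_gain = {name: auc - baseline_auc for name, auc in model_auc.items()}

for name, gain in auc_gain.items():
    print(f"{name}: +{gain:.4f}")
```

The gains range from roughly 0.15 to 0.17, while the spread among the four ensembles themselves is only about 0.02.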

