# Kaggle: Predict Loan Payback ‚Äî Model Training

**Notebook:** `04_model_training.ipynb`
**Author:** Brice Nelson
**Organization:** Kaggle Series | Brice Machine Learning Projects
**Date Created:** November 16, 2025
**Last Updated:** November 19, 2025

---

## üß≠ Purpose

This notebook initiates the **modeling phase** for the *Predict Loan Payback* competition.

After completing data cleaning and feature engineering in previous notebooks, we now transition into selecting, training, evaluating, and comparing machine-learning models capable of predicting whether a borrower will repay the loan.

This step turns the carefully prepared dataset into an **actionable predictive system**.

### **Objectives**
1. Load feature-engineered train/test datasets from `/data/processed/`.
2. Define the target variable and feature matrix.
3. Train baseline models to establish initial performance benchmarks.
4. Evaluate models using appropriate metrics (AUC, accuracy, precision/recall, etc.).
5. Compare multiple algorithms and select the strongest candidate(s).
6. Export predictions for Kaggle submission.

---

## üß± Model Training Roadmap

The modeling plan for this notebook includes:

### **1. Baseline Models**
- Logistic Regression (regularized)
- Decision Tree (simple depth-limited version)

Purpose: establish ‚Äúfloor‚Äù performance quickly.

---

### **2. Core Machine Learning Models**
- Random Forest
- Gradient Boosting (e.g., XGBoost or LightGBM)
- Extra Trees Classifier
- Support Vector Machine (if practical)

These will form the backbone of your model comparison phase.

---

### **3. Hyperparameter Tuning**
- RandomizedSearchCV for broad sweeps
- GridSearchCV for refining top models
- Evaluation via stratified cross-validation
- Tracking overfitting by comparing train vs. validation scores

---

### **4. Model Evaluation Metrics**
Depending on competition scoring:

- **ROC AUC** (typical for binary classification)
- **Accuracy**
- **Precision / Recall**
- **Confusion matrix**
- **Calibration curves** (optional but useful for loan risk)

---

### **5. Prediction & Export**
- Predict on the processed test dataset
- Format output to match Kaggle‚Äôs expected submission CSV
- Save to `/data/submissions/`

---

## üì• Load Feature-Engineered Data

This notebook begins by importing:

- `../data/processed/loan_train_features.csv`
- `../data/processed/loan_test_features.csv`

(or whichever filenames you created in the feature engineering notebook)

These will be used to construct the feature matrix `X` and target vector `y` for training and validation.


In [25]:
import os
import optuna
from pathlib import Path
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier


## Load Processed Data

In [2]:
loan_train_features = pd.read_csv("../data/processed/loan_train_features.csv")
loan_test_features = pd.read_csv("../data/processed/loan_test_features.csv")

---

## ‚öôÔ∏è Step 1: Define Features and Target

With the feature-engineered datasets loaded, the next step is to construct:

- **X_train** ‚Üí Feature matrix
- **y_train** ‚Üí Target vector (`loan_payed_back`)
- **X_test** ‚Üí Feature matrix for Kaggle submission

This section will:
1. Separate predictors from the target column.
2. Confirm dataset shapes and check for any remaining inconsistencies.
3. Prepare the data for model training and baseline evaluation.

---

## üß™ Step 2: Baseline Models

Before diving into advanced algorithms, we start with simple baseline models to:

- Establish a performance benchmark
- Verify that our preprocessing is correct
- Catch issues like data leakage or extreme imbalance early

The baseline models we will train:

### **1. Logistic Regression (Regularized)**
A reliable, interpretable starting point for binary classification.

### **2. Decision Tree (Depth-Limited)**
Helps visualize splitting patterns and provides an early non-linear alternative.

We‚Äôll evaluate each using:

- ROC-AUC
- Accuracy
- Precision / Recall
- Confusion matrix

This gives us a solid ‚Äúfloor‚Äù before moving into more powerful ensemble methods.

---


In [3]:
# ----------------
# Loan Features Head()
# ----------------
print(f'Loan Train Feature:\n{loan_train_features.head()}')
print(f'Loan Test Featurs: \n{loan_test_features.head()}')

Loan Train Feature:
   id  annual_income  debt_to_income_ratio  credit_score  loan_amount  \
0   0      -0.705461             -0.535135      0.993849    -1.803484   
1   1      -0.977248              0.660668     -0.810394    -1.505401   
2   2       0.050689             -0.345556      0.236067     0.286558   
3   3      -0.050687             -0.812211     -2.668764    -1.492497   
4   4      -0.850388             -0.987206     -0.287163    -0.409421   

   interest_rate  loan_paid_back     grade  subgrade  gender_Female  ...  \
0       0.653899             1.0 -0.401966  0.008691            1.0  ...   
1       0.280571             0.0  0.613154  0.008691            0.0  ...   
2      -1.292385             1.0 -0.401966  1.434819            0.0  ...   
3       1.863482             1.0  2.643393 -1.417436            1.0  ...   
4      -1.068388             1.0  0.613154 -1.417436            0.0  ...   

   grade_x_loan_purpose_Car  grade_x_loan_purpose_Debt consolidation  \
0           

In [4]:
# ----------------
# Loan Features Info()
# ----------------

print('Loan Train Features:\n', loan_train_features.info())
print('Loan Test Features: \n', loan_test_features.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 593994 entries, 0 to 593993
Data columns (total 53 columns):
 #   Column                                   Non-Null Count   Dtype  
---  ------                                   --------------   -----  
 0   id                                       593994 non-null  int64  
 1   annual_income                            593994 non-null  float64
 2   debt_to_income_ratio                     593994 non-null  float64
 3   credit_score                             593994 non-null  float64
 4   loan_amount                              593994 non-null  float64
 5   interest_rate                            593994 non-null  float64
 6   loan_paid_back                           593994 non-null  float64
 7   grade                                    593994 non-null  float64
 8   subgrade                                 593994 non-null  float64
 9   gender_Female                            593994 non-null  float64
 10  gender_Male                     

In [5]:
# -----------------------------------------------
# Step 1: Define Features (X) and Target (y)
# -----------------------------------------------

# The target column from the training set
target_col = "loan_paid_back"

# Feature matrix and target for training
X_train = loan_train_features.drop(columns=[target_col])
y_train = loan_train_features[target_col]

# Test set has no target column ‚Äî that's correct
X_test = loan_test_features.copy()

print("X_train:", X_train.shape)
print("y_train:", y_train.shape)
print("X_test:", X_test.shape)



X_train: (593994, 52)
y_train: (593994,)
X_test: (254569, 52)


In [6]:
# -----------------------------------------------
# Step 2: Train/Validate Split
# -----------------------------------------------

X_train_split, X_valid, y_train_split, y_valid = train_test_split(
    X_train,
    y_train,
    test_size=0.20,
    random_state=42,
    stratify=y_train
)

print("Train split:", X_train_split.shape)
print("Valid split:", X_valid.shape)


Train split: (475195, 52)
Valid split: (118799, 52)


In [7]:
# -----------------------------------------------
# Step 3: Baseline Logistics Regression
# -----------------------------------------------

log_reg = LogisticRegression(
    max_iter=1000,
    class_weight="balanced",
    n_jobs=-1
)

log_reg.fit(X_train_split, y_train_split)

# Predictions
y_pred_lr = log_reg.predict(X_valid)
y_prob_lr = log_reg.predict_proba(X_valid)[:, 1]

roc_lr = roc_auc_score(y_valid, y_prob_lr)

print(f"ROC-AUC (Logistic Regression): {roc_lr:.4f}")
print("\nClassification Report:")
print(classification_report(y_valid, y_pred_lr))


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


ROC-AUC (Logistic Regression): 0.9058

Classification Report:
              precision    recall  f1-score   support

         0.0       0.60      0.78      0.68     23900
         1.0       0.94      0.87      0.90     94899

    accuracy                           0.85    118799
   macro avg       0.77      0.82      0.79    118799
weighted avg       0.87      0.85      0.86    118799



---

## üìä Baseline Model Results: Logistic Regression

The first model trained‚Äîregularized Logistic Regression‚Äîserves as the baseline for evaluating all future models. Despite being a simple linear classifier, it produced **surprisingly strong results**, indicating that the engineered features contain significant predictive signal.

### **üîé Performance Summary**
- **ROC-AUC:** 0.9058
- **Recall (Class 1 ‚Äì Paid Back):** 0.87
- **Precision (Class 1 ‚Äì Paid Back):** 0.94
- **Recall (Class 0 ‚Äì Not Paid Back):** 0.78
- **Overall Accuracy:** 0.85

### **üìà Interpretation**
- An ROC-AUC above **0.90** from a baseline model is exceptional for a credit-risk dataset and confirms that the feature engineering phase was effective.
- High **precision** for repaid loans (1.0) and good **recall** for non-paid loans (0.0) indicate that the model is capturing both sides of the classification boundary.
- The class imbalance (loan_paid_back = 1 is more common) is handled well by the model, especially with `class_weight="balanced"`.
- The convergence warning from `lbfgs` is expected due to the dataset size and feature heterogeneity; it does not invalidate the results.

This strong baseline establishes a **performance floor** that subsequent models must exceed.

---

## üß≠ Next Steps: Advancing Beyond the Baseline

With the baseline complete, the next phase focuses on more expressive non-linear models. The dataset includes ratios, interaction terms, and many one-hot encoded features‚Äîconditions under which tree-based ensemble methods typically outperform linear models.

### **üöÄ Upcoming Modeling Steps**

#### **1. Train Non-Linear Baseline Models**
- **Random Forest Classifier**
  Establishes an early non-linear benchmark.

- **Gradient Boosting Models:**
  - XGBoost
  - LightGBM
  - CatBoost
  These models are well-known for dominating tabular data competitions.

#### **2. Compare Performance Using Key Metrics**
- ROC-AUC
- Precision/Recall
- F1 Score
- PR-AUC (important for imbalanced datasets)

Evaluate all models on the same validation split for a fair comparison.

#### **3. Hyperparameter Tuning**
Once a top-performing algorithm is identified, apply:
- **Optuna** (recommended for speed/efficiency), or
- **GridSearchCV / RandomizedSearchCV**

Goal: improve generalization and push leaderboard performance higher.

#### **4. Save the Best Model**
Export final tuned model using:
- `joblib.dump(model, "model.pkl")`

This ensures reproducibility and supports prediction generation later.

#### **5. Generate Kaggle Submission**
Use the selected model to create predictions on `X_test` and save them as:
- `/data/submissions/submission_<date>.csv`

---

This roadmap transitions the project from a strong baseline into competitive modeling territory and prepares the foundation for leaderboard submissions.


---

## üå≤ Random Forest Classifier ‚Äî Non-Linear Baseline

With the Logistic Regression baseline established, the next step is to introduce a more expressive non-linear model. Random Forests are ensemble methods that combine many decision trees trained on bootstrapped samples of the data. They naturally capture:

- Non-linear relationships
- Interaction effects
- Hierarchical decision boundaries
- Feature importance signals

Given our dataset includes engineered ratios, one-hot encodings, and interaction terms, Random Forests provide a strong early benchmark for tree-based models.

---

### **üéØ Goals of This Model**
1. Establish a non-linear baseline model.
2. Compare its performance against Logistic Regression.
3. Evaluate improvements in capturing complex relationships.
4. Examine feature importance as an interpretability step.

---

### **üîß Model Configuration**
For this first pass, we will use a moderate-sized forest:

- `n_estimators = 300`
- `max_depth = None` (allow deep trees)
- `min_samples_leaf = 2`
- `max_features = "sqrt"`
- `class_weight = "balanced"` (handles class imbalance)

This configuration keeps training efficient while still leveraging the power of ensemble methods.

---

### **üìà Evaluation Metrics**
As with Logistic Regression, we will evaluate using:

- ROC-AUC
- Precision / Recall
- F1-score
- Classification Report

These metrics help determine whether non-linearity materially improves model performance.

---


In [8]:
# -----------------------------------------------
# Random Forest Classifier ‚Äî Non-Linear Baseline
# -----------------------------------------------

rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    min_samples_leaf=2,
    max_features="sqrt",
    class_weight="balanced",
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train_split, y_train_split)

# Predictions
y_pred_rf = rf.predict(X_valid)
y_prob_rf = rf.predict_proba(X_valid)[:, 1]

roc_rf = roc_auc_score(y_valid, y_prob_rf)

print(f"ROC-AUC (Random Forest): {roc_rf:.4f}")
print("\nClassification Report:")
print(classification_report(y_valid, y_pred_rf))


ROC-AUC (Random Forest): 0.9129

Classification Report:
              precision    recall  f1-score   support

         0.0       0.83      0.64      0.72     23900
         1.0       0.91      0.97      0.94     94899

    accuracy                           0.90    118799
   macro avg       0.87      0.80      0.83    118799
weighted avg       0.90      0.90      0.90    118799



In [9]:
# -----------------------------------------------
# Initialize model comparison table if not defined
# -----------------------------------------------
if "results" not in globals():
    results = pd.DataFrame(columns=["Model", "ROC-AUC"])

# If Logistic Regression results exist, add them here
# Only add if roc_lr is defined
if "roc_lr" in globals():
    results.loc[len(results)] = ["Logistic Regression", roc_lr]

# Only add Decision Tree if it exists
if "roc_dt" in globals():
    results.loc[len(results)] = ["Decision Tree", roc_dt]

results


Unnamed: 0,Model,ROC-AUC
0,Logistic Regression,0.90583


In [10]:
# --------------------------
# Append Random Forest Results
# --------------------------

results.loc[len(results)] = ["Random Forest", roc_rf]
results.sort_values("ROC-AUC", ascending=False)



Unnamed: 0,Model,ROC-AUC
1,Random Forest,0.912929
0,Logistic Regression,0.90583


---

## üå≥ Extra Trees Classifier ‚Äî Enhanced Ensemble Baseline

Following the Random Forest model, the next ensemble to evaluate is the **Extra Trees Classifier** (Extremely Randomized Trees). While similar to Random Forests, this model adds additional randomness by:

- Selecting split thresholds **at random**, rather than by searching for the best possible split
- Reducing variance and overfitting in many cases
- Training faster due to fewer split evaluations

This makes Extra Trees a valuable comparison point and often a strong performer on high-dimensional tabular data.

### **üéØ Goals**
- Evaluate the performance of Extra Trees compared to Random Forest and Logistic Regression
- Identify whether additional randomness improves generalization
- Capture non-linear and interaction effects that linear models cannot

### **üìà Evaluation Metrics**
We will evaluate the model using:
- ROC-AUC
- Precision / Recall
- F1-score
- Classification Report

The goal is to determine whether Extra Trees surpasses Random Forest or provides complementary insights.

---


In [12]:
# -----------------------------------------------
# Extra Trees Classifier
# -----------------------------------------------

et = ExtraTreesClassifier(
    n_estimators=300,
    max_depth=None,
    min_samples_leaf=2,
    max_features="sqrt",
    class_weight="balanced",
    random_state=42,
    n_jobs=-1
)

et.fit(X_train_split, y_train_split)

# Predictions
y_pred_et = et.predict(X_valid)
y_prob_et = et.predict_proba(X_valid)[:, 1]

roc_et = roc_auc_score(y_valid, y_prob_et)

print(f"ROC-AUC (Extra Trees): {roc_et:.4f}")
print("\nClassification Report:")
print(classification_report(y_valid, y_pred_et))


ROC-AUC (Extra Trees): 0.9101

Classification Report:
              precision    recall  f1-score   support

         0.0       0.73      0.71      0.72     23900
         1.0       0.93      0.93      0.93     94899

    accuracy                           0.89    118799
   macro avg       0.83      0.82      0.82    118799
weighted avg       0.89      0.89      0.89    118799



In [13]:
results.loc[len(results)] = ["Extra Trees Classifier", roc_et]
results.sort_values("ROC-AUC", ascending=False)


Unnamed: 0,Model,ROC-AUC
1,Random Forest,0.912929
2,Extra Trees Classifier,0.910075
0,Logistic Regression,0.90583


---

## üö´ Why Support Vector Machines Are Not Used

Although Support Vector Machines (SVMs) are powerful classifiers, especially for smaller or medium-sized datasets, they are **not practical for this project** due to the size and structure of the data. The loan dataset contains nearly **600,000 rows** and over **50 engineered features**, which creates several performance challenges for SVMs.

### **1. Computational Complexity**
SVMs scale between:

- **O(n¬≤)** and **O(n¬≥)** in memory and compute
- where *n* is the number of samples (‚âà 600k here)

This makes SVMs extremely slow‚Äîeven for linear kernels‚Äîand often unusable for datasets of this size.

### **2. Kernel SVMs Are Completely Infeasible**
A kernelized SVM requires computing an **n √ó n kernel matrix**, which would be:

>600,000 √ó 600,000 ‚Üí 360,000,000,000 entries

Even storing this matrix is impossible on typical hardware.

### **3. Long Training Times on Limited Hardware**
On a laptop CPU:

- **LinearSVC** can take 20‚Äì60 minutes
- **RBF/Polynomial SVM** can take **hours**, or fail due to memory exhaustion

Given that Random Forest completed in ~7 minutes, an SVM would be dramatically slower with no performance gain.

### **4. Limited Benefit for Tabular Data**
For large, structured datasets with:

- numeric features
- one-hot encodings
- interaction terms
- engineered ratios

tree-based ensemble methods (Random Forest, XGBoost, LightGBM, CatBoost) consistently outperform SVMs. They model non-linear relationships and feature interactions far more efficiently.

### **5. No Probabilities Without Extra Cost**
SVMs do not natively produce predicted probabilities.
To compute AUC or PR-AUC properly, models require:

- **Platt scaling** or
- **cross-validation calibration**

These steps further increase runtime.

---

### **üìå Summary: Why SVM Was Skipped**

| Reason | Impact |
|-------|--------|
| Very poor scaling on 600k rows | Training becomes impractically long |
| Kernel matrix is impossible to compute | Kernel SVM is not feasible |
| High RAM usage | Likely to crash on laptop |
| Not competitive for tabular data | RF/GBM models outperform SVM |
| Extra work for probability outputs | Slower evaluation pipeline |

Given these limitations, SVMs do not align with the project‚Äôs efficiency, hardware constraints, or performance targets.

---

## ‚úÖ Next Step: Gradient Boosting with LightGBM

LightGBM is designed for:

- **large-scale tabular data**
- **high-dimensional feature spaces**
- **fast training on CPUs**
- **strong leaderboard performance**

It will form the backbone of the next modeling phase.

---


---

## ‚ö° LightGBM ‚Äî Gradient Boosting Optimized for Tabular Data

LightGBM (Light Gradient Boosting Machine) is one of the most powerful algorithms for structured/tabular datasets. It is specifically engineered for **speed**, **scalability**, and **high predictive accuracy**, making it ideal for this competition.

Unlike Random Forests or Extra Trees, which average many deep trees, LightGBM builds trees **sequentially**, with each new tree correcting the errors of the previous one (gradient boosting). It also uses advanced optimizations such as:

- **Histogram-based splitting** (much faster than exact splits)
- **Leaf-wise tree growth** (increases accuracy)
- **Efficient handling of high-dimensional data**
- **Native support for missing values**

Given the size of this dataset (~600k rows √ó 50 features), LightGBM is particularly well suited.

---

### üéØ **Goals for This Model**
- Establish the first gradient boosting baseline
- Compare performance against Random Forest and Extra Trees
- Determine whether boosting provides a significant accuracy lift
- Build a foundation for later hyperparameter tuning (Optuna or GridSearch)

---

### ‚öôÔ∏è **Model Configuration (Laptop-Optimized)**

To ensure LightGBM trains quickly even on lower-power hardware (e.g., a laptop):

- `n_estimators = 300`
- `learning_rate = 0.05`
- `num_leaves = 31`
- `max_depth = -1` (no forced limit; but leaves small enough to avoid overfitting)
- `class_weight = "balanced"`
- `n_jobs = -1`

This configuration provides competitive performance without long compute time.

---

### üìà **Evaluation Metrics**
We will again evaluate:

- ROC-AUC
- Precision / Recall
- F1-score
- Classification Report

This will help determine whether LightGBM surpasses the tree ensemble baselines.

---


In [15]:
# -----------------------------------------------
# LightGBM Classifier ‚Äî Gradient Boosting Baseline
# -----------------------------------------------

lgbm = LGBMClassifier(
    n_estimators=300,
    learning_rate=0.05,
    num_leaves=31,
    max_depth=-1,
    class_weight="balanced",
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1
)

lgbm.fit(X_train_split, y_train_split)

# Predictions
y_pred_lgb = lgbm.predict(X_valid)
y_prob_lgb = lgbm.predict_proba(X_valid)[:, 1]

roc_lgb = roc_auc_score(y_valid, y_prob_lgb)

print(f"ROC-AUC (LightGBM): {roc_lgb:.4f}")
print("\nClassification Report:")
print(classification_report(y_valid, y_pred_lgb))


[LightGBM] [Info] Number of positive: 379595, number of negative: 95600
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.033995 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2674
[LightGBM] [Info] Number of data points in the train set: 475195, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000
ROC-AUC (LightGBM): 0.9201

Classification Report:
              precision    recall  f1-score   support

         0.0       0.63      0.79      0.70     23900
         1.0       0.94      0.88      0.91     94899

    accuracy                           0.87    118799
   macro avg       0.79      0.84      0.81    118799
weighted avg       0.88      0.87      0.87    118799



In [16]:
results.loc[len(results)] = ["LightGBM", roc_lgb]
results.sort_values("ROC-AUC", ascending=False)


Unnamed: 0,Model,ROC-AUC
3,LightGBM,0.920053
1,Random Forest,0.912929
2,Extra Trees Classifier,0.910075
0,Logistic Regression,0.90583


---

## üêà CatBoost ‚Äî Powerful Gradient Boosting for Tabular Data

CatBoost (Categorical Boosting) is one of the strongest gradient boosting algorithms for structured/tabular datasets. It excels in scenarios with:

- many engineered features
- non-linear relationships
- interaction terms
- imbalanced datasets
- one-hot encodings (even though it prefers raw categorical columns)

Unlike other boosting methods, CatBoost incorporates:

- **Ordered boosting**, which reduces overfitting
- **Efficient handling of categorical patterns**
- **Symmetric tree structures**, which improve speed and generalization
- **Fast CPU performance**, making it ideal for laptop environments

Given the size and structure of this dataset (~600k rows, 50+ engineered features), CatBoost is a natural next model in the competitive modeling phase.

---

### üéØ Goals for This Model
- Benchmark CatBoost against LightGBM, Random Forest, Extra Trees, and Logistic Regression
- Evaluate whether its regularization and tree symmetry improve ROC-AUC
- Prepare the model for potential hyperparameter tuning with Optuna

---

### ‚öôÔ∏è Model Configuration (Laptop-Friendly)
To ensure CatBoost runs efficiently on CPU:

- `iterations = 300`
- `learning_rate = 0.05`
- `depth = 6`
- `l2_leaf_reg = 3`
- `loss_function = "Logloss"`
- `eval_metric = "AUC"`
- `class_weights = {0: w0, 1: w1}` (CatBoost handles class imbalance well)

This setup provides strong early performance without overheating the system.

---

### üìà Evaluation Metrics
As before, we evaluate:

- ROC-AUC
- Precision / Recall
- F1-score
- Classification Report

The goal is to determine whether CatBoost surpasses LightGBM‚Äôs baseline.

---


In [18]:
# -----------------------------------------------
# CatBoost Classifier ‚Äî Gradient Boosting Baseline
# -----------------------------------------------

# Compute class imbalance for CatBoost weights
# (Because it's ratio-based, not "balanced" like sklearn)
pos_weight = (y_train_split == 0).sum() / (y_train_split == 1).sum()
neg_weight = 1

cat_model = CatBoostClassifier(
    iterations=300,
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3,
    loss_function="Logloss",
    eval_metric="AUC",
    class_weights=[neg_weight, pos_weight],
    verbose=50,          # Print progress every 50 iterations
    random_seed=42,
    task_type="CPU"
)

cat_model.fit(
    X_train_split,
    y_train_split,
    eval_set=(X_valid, y_valid),
    use_best_model=True
)

# Predictions
y_pred_cat = cat_model.predict(X_valid)
y_prob_cat = cat_model.predict_proba(X_valid)[:, 1]

roc_cat = roc_auc_score(y_valid, y_prob_cat)

print(f"ROC-AUC (CatBoost): {roc_cat:.4f}")
print("\nClassification Report:")
print(classification_report(y_valid, y_pred_cat))


0:	test: 0.8972484	best: 0.8972484 (0)	total: 164ms	remaining: 49.1s
50:	test: 0.9134485	best: 0.9134485 (50)	total: 7.76s	remaining: 37.9s
100:	test: 0.9149204	best: 0.9149204 (100)	total: 12.9s	remaining: 25.5s
150:	test: 0.9158680	best: 0.9158680 (150)	total: 20.5s	remaining: 20.3s
200:	test: 0.9165813	best: 0.9165813 (200)	total: 25.7s	remaining: 12.7s
250:	test: 0.9170486	best: 0.9170486 (250)	total: 31.4s	remaining: 6.13s
299:	test: 0.9175801	best: 0.9175801 (299)	total: 36.6s	remaining: 0us

bestTest = 0.9175800883
bestIteration = 299

ROC-AUC (CatBoost): 0.9176

Classification Report:
              precision    recall  f1-score   support

         0.0       0.63      0.79      0.70     23900
         1.0       0.94      0.88      0.91     94899

    accuracy                           0.86    118799
   macro avg       0.79      0.83      0.81    118799
weighted avg       0.88      0.86      0.87    118799



In [19]:
results.loc[len(results)] = ["CatBoost", roc_cat]
results.sort_values("ROC-AUC", ascending=False)


Unnamed: 0,Model,ROC-AUC
3,LightGBM,0.920053
4,CatBoost,0.91758
1,Random Forest,0.912929
2,Extra Trees Classifier,0.910075
0,Logistic Regression,0.90583


---

## üöÄ XGBoost ‚Äî Gradient Boosting with Robust Regularization

XGBoost (Extreme Gradient Boosting) is one of the most influential algorithms in modern machine learning. It dominated Kaggle competitions for years and remains a go-to choice in fintech, risk modeling, credit scoring, fraud detection, and structured/tabular ML.

While LightGBM is typically faster, XGBoost offers:

- Highly effective regularization (L1 + L2)
- Strong handling of noisy or imperfect features
- Excellent performance on large, structured datasets
- Predictable, stable behavior under most conditions

For this project, XGBoost provides a valuable comparison point alongside LightGBM and CatBoost, and completing it ensures a thorough modeling phase.

---

### üéØ Goals for This Model
- Benchmark XGBoost against LightGBM, CatBoost, Random Forest, and Extra Trees
- Understand how different boosting strategies impact performance
- Build foundational experience with XGBoost for real-world ML workflows

---

### ‚öôÔ∏è Model Configuration (Laptop-Friendly)
To avoid long training times while still capturing performance:

- `n_estimators = 300`
- `learning_rate = 0.05`
- `max_depth = 6`
- `subsample = 0.8`
- `colsample_bytree = 0.8`
- `reg_alpha = 0.0`
- `reg_lambda = 1.0`
- `objective = "binary:logistic"`
- `eval_metric = "auc"`

This configuration balances speed and quality for a large dataset (~600k rows).

---

### üìà Evaluation Metrics
We will evaluate the model using:

- ROC-AUC
- Precision / Recall
- F1-score
- Classification Report

This determines whether XGBoost approaches or surpasses LightGBM's current lead.

---


In [22]:
# -----------------------------------------------
# XGBoost Classifier ‚Äî Gradient Boosting Baseline
# -----------------------------------------------

xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.0,
    reg_lambda=1.0,
    objective="binary:logistic",
    eval_metric="auc",
    tree_method="hist",        # Fastest CPU method (VERY important)
    random_state=42,
    n_jobs=-1
)

xgb.fit(
    X_train_split,
    y_train_split,
    eval_set=[(X_valid, y_valid)],
    verbose=False
)

# Predictions
y_pred_xgb = xgb.predict(X_valid)
y_prob_xgb = xgb.predict_proba(X_valid)[:, 1]

roc_xgb = roc_auc_score(y_valid, y_prob_xgb)

print(f"ROC-AUC (XGBoost): {roc_xgb:.4f}")
print("\nClassification Report:")
print(classification_report(y_valid, y_pred_xgb))


ROC-AUC (XGBoost): 0.9187

Classification Report:
              precision    recall  f1-score   support

         0.0       0.89      0.60      0.71     23900
         1.0       0.91      0.98      0.94     94899

    accuracy                           0.90    118799
   macro avg       0.90      0.79      0.83    118799
weighted avg       0.90      0.90      0.90    118799



In [23]:
results.loc[len(results)] = ["XGBoost", roc_xgb]
results.sort_values("ROC-AUC", ascending=False)


Unnamed: 0,Model,ROC-AUC
3,LightGBM,0.920053
5,XGBoost,0.918702
4,CatBoost,0.91758
1,Random Forest,0.912929
2,Extra Trees Classifier,0.910075
0,Logistic Regression,0.90583


---

## üéõÔ∏è Hyperparameter Tuning ‚Äî What It Is and Why We Need It

Now that all baseline models have been trained and compared, the next step is to **optimize** the top-performing algorithms. Out of all models tested so far, **LightGBM** and **XGBoost** have shown the strongest ROC-AUC scores and are the best candidates for tuning.

Hyperparameter tuning is the process of systematically searching for the best settings (hyperparameters) that control how a model learns. These settings can dramatically affect:

- Model accuracy
- Overfitting vs. generalization
- Training speed
- Final leaderboard performance

Baseline models give us a strong starting point, but they are rarely optimized for maximum AUC.

---

### **Why We Use Optuna**
Optuna is a modern hyperparameter optimization framework that uses **Bayesian optimization** and **smart search strategies** to find high-performing configurations efficiently.

Compared to manual tuning or grid search:

- üöÄ **Faster** (finds good configs in fewer trials)
- üß† **Smarter** (uses previous results to guide future searches)
- üíª **Resource-efficient** (great for laptops and limited hardware)
- üìà **Consistently improves model AUC**

For large tabular datasets like this one, Optuna is a near-essential tool for pushing model performance into the top tier (0.94‚Äì0.96 AUC range).

---

### **What We Will Tune**
We will focus on the two strongest models:

1. **LightGBM**
2. **XGBoost**

Key parameters that influence performance:

- Number of leaves / tree depth
- Learning rate
- Number of boosting rounds
- Subsample ratios
- Feature sampling ratios
- Regularization (L1/L2 penalties)
- Minimum child weight / min data in leaf

These control how the model grows trees and how much it generalizes.

---

### **Goal of This Phase**
The objective is to find the **best possible configuration** for the model that achieves:

- Higher ROC-AUC
- Stronger precision-recall characteristics
- Better ranking of default risk
- Improved stability on unseen data

After tuning, we will:

- Re-train the best model
- Save it under `/models/`
- Use it to generate final Kaggle submission predictions

This marks the final stage of the modeling workflow.

---


---

## üîß Hyperparameter Tuning: LightGBM + Optuna

With LightGBM currently leading model performance (ROC-AUC = 0.92005), the next step is to tune its hyperparameters to push the model toward higher accuracy and better generalization.

LightGBM is highly sensitive to its core hyperparameters, including:

- **num_leaves**
- **max_depth**
- **learning_rate**
- **subsample** and **colsample_bytree**
- **min_child_samples**
- **lambda_l1 / lambda_l2** (regularization)

Manually tuning these would be slow and inefficient.
Instead, we use **Optuna**, which performs:

- intelligent hyperparameter search
- guided by Bayesian optimization
- efficient even on CPU
- ideal for large tabular datasets

The goal is to discover a configuration that significantly improves ROC-AUC over the baseline while maintaining reasonable training time.

After tuning, the best LightGBM model will be retrained on the full training split and saved for later evaluation and submission.

---


In [None]:
# ------------------------------------------------------
# Optuna Objective Function for LightGBM
# ------------------------------------------------------
def objective(trial):

    params = {
        "n_estimators": trial.suggest_int("n_estimators", 200, 800),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.15),
        "num_leaves": trial.suggest_int("num_leaves", 15, 60),
        "max_depth": trial.suggest_int("max_depth", -1, 12),
        "min_child_samples": trial.suggest_int("min_child_samples", 10, 60),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "lambda_l1": trial.suggest_float("lambda_l1", 0.0, 5.0),
        "lambda_l2": trial.suggest_float("lambda_l2", 0.0, 5.0),
        "random_state": 42,
        "n_jobs": -1,
        "class_weight": "balanced",
        "verbosity": -1     # <-- silence training output safely
    }

    model = LGBMClassifier(**params)

    model.fit(
        X_train_split,
        y_train_split,
        eval_set=[(X_valid, y_valid)],
        eval_metric="auc"
    )

    preds = model.predict_proba(X_valid)[:, 1]
    auc = roc_auc_score(y_valid, preds)

    return auc

# ------------------------------------------------------
# Run Optuna Study
# ------------------------------------------------------
study = optuna.create_study(
    direction="maximize",
    study_name="lightgbm_opt"
)

study.optimize(
    objective,
    n_trials=25,     # keep small for laptop; increase to 50‚Äì100 on desktop
    show_progress_bar=True
)

print("Best ROC-AUC:", study.best_value)
print("Best Hyperparameters:", study.best_params)


[I 2025-11-21 01:05:05,299] A new study created in memory with name: lightgbm_opt
Best trial: 0. Best value: 0.919935:   4%|‚ñç         | 1/25 [00:22<08:57, 22.38s/it]

[I 2025-11-21 01:05:27,679] Trial 0 finished with value: 0.9199348856289009 and parameters: {'n_estimators': 355, 'learning_rate': 0.03366493246417459, 'num_leaves': 32, 'max_depth': 12, 'min_child_samples': 53, 'subsample': 0.9613508147206281, 'colsample_bytree': 0.7488672455637454, 'lambda_l1': 1.98944238703894, 'lambda_l2': 1.6577568762284183}. Best is trial 0 with value: 0.9199348856289009.


Best trial: 0. Best value: 0.919935:   8%|‚ñä         | 2/25 [00:41<07:51, 20.50s/it]

[I 2025-11-21 01:05:46,861] Trial 1 finished with value: 0.9141593493739061 and parameters: {'n_estimators': 697, 'learning_rate': 0.06810896335860102, 'num_leaves': 31, 'max_depth': 1, 'min_child_samples': 13, 'subsample': 0.9432970916613678, 'colsample_bytree': 0.8018301437845513, 'lambda_l1': 3.8164306378833746, 'lambda_l2': 1.0120548304357362}. Best is trial 0 with value: 0.9199348856289009.
