# Kaggle: Predict Loan Payback - Model Training

**Notebook:** `04_model_training.ipynb`
**Author:** Brice Nelson
**Organization:** Kaggle Series | Brice Machine Learning Projects
**Date Created:** November 16, 2025
**Last Updated:** November 19, 2025

---

## 🧭 Purpose

This notebook initiates the **modeling phase** for the *Predict Loan Payback* competition.

After completing data cleaning and feature engineering in previous notebooks, we now transition into selecting, training, evaluating, and comparing machine-learning models capable of predicting whether a borrower will repay the loan.

This step turns the carefully prepared dataset into an **actionable predictive system**.

### **Objectives**
1. Load feature-engineered train/test datasets from `/data/processed/`.
2. Define the target variable and feature matrix.
3. Train baseline models to establish initial performance benchmarks.
4. Evaluate models using appropriate metrics (AUC, accuracy, precision/recall, etc.).
5. Compare multiple algorithms and select the strongest candidate(s).
6. Export predictions for Kaggle submission.

---

## 🧱 Model Training Roadmap

The modeling plan for this notebook includes:

### **1. Baseline Models**
- Logistic Regression (regularized)
- Decision Tree (simple depth-limited version)

Purpose: establish “floor” performance quickly.

---

### **2. Core Machine Learning Models**
- Random Forest
- Gradient Boosting (e.g., XGBoost or LightGBM)
- Extra Trees Classifier
- Support Vector Machine (if practical)

These will form the backbone of our model comparison phase.

---

### **3. Hyperparameter Tuning**
- RandomizedSearchCV for broad sweeps
- GridSearchCV for refining top models
- Evaluation via stratified cross-validation
- Tracking overfitting by comparing train vs. validation scores
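
The tuning plan above can be sketched as follows. This is a minimal illustration on a small synthetic dataset, not the notebook's actual tuning run: the `_demo` names, the parameter ranges, and the roughly 80/20 class balance are assumptions chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for the loan data (assumption: ~80% positives).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.2, 0.8], random_state=42
)

# Stratified folds keep the class ratio the same in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": [25, 50],
        "max_depth": [3, 5, None],
        "min_samples_leaf": [1, 2, 5],
    },
    n_iter=4,                 # broad but cheap sweep
    scoring="roc_auc",
    cv=cv,
    return_train_score=True,  # needed to compare train vs. validation
    random_state=42,
)
search.fit(X_demo, y_demo)

# A large train/validation gap is a sign of overfitting.
best = search.best_index_
train_auc = search.cv_results_["mean_train_score"][best]
valid_auc = search.cv_results_["mean_test_score"][best]
print(f"best params: {search.best_params_}")
print(f"train AUC={train_auc:.3f}  validation AUC={valid_auc:.3f}  "
      f"gap={train_auc - valid_auc:.3f}")
```

A `GridSearchCV` refinement would follow the same pattern, with a narrow grid centered on the best parameters found here.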

---

### **4. Model Evaluation Metrics**
Depending on competition scoring:

- **ROC AUC** (typical for binary classification)
- **Accuracy**
- **Precision / Recall**
- **Confusion matrix**
- **Calibration curves** (optional but useful for loan risk)
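
As a rough sketch of how these metrics can be computed with scikit-learn, the example below scores a simple classifier on synthetic data. The data, the 0.5 threshold, and the variable names are illustrative assumptions; the notebook's own models and validation split would take their place.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.2, 0.8], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = clf.predict_proba(X_va)[:, 1]   # probability of class 1
pred = (prob >= 0.5).astype(int)       # hard labels at a 0.5 threshold

metrics = {
    "roc_auc": roc_auc_score(y_va, prob),      # uses scores, not labels
    "accuracy": accuracy_score(y_va, pred),
    "precision": precision_score(y_va, pred),
    "recall": recall_score(y_va, pred),
}
cm = confusion_matrix(y_va, pred)              # rows: true, cols: predicted

# Calibration: observed positive rate vs. mean predicted probability per bin.
frac_pos, mean_pred = calibration_curve(y_va, prob, n_bins=5)

print(metrics)
print(cm)
```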

---

### **5. Prediction & Export**
- Predict on the processed test dataset
- Format output to match Kaggle’s expected submission CSV
- Save to `/data/submissions/`
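
A minimal sketch of the export step is shown below. The `build_submission` helper is hypothetical, and the column names `id` and `loan_paid_back` are assumptions that should be checked against the competition's sample submission file. The example writes to a temporary directory rather than the project's submissions folder.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def build_submission(ids, probabilities, target_col="loan_paid_back"):
    """Hypothetical helper: pair each test id with its predicted probability."""
    return pd.DataFrame({"id": ids, target_col: probabilities})


# Placeholder ids and probabilities standing in for real model output.
demo_ids = np.arange(5)
demo_probs = np.array([0.91, 0.12, 0.77, 0.50, 0.99])
submission = build_submission(demo_ids, demo_probs)

# Write without the pandas index so the CSV has exactly two columns.
out_dir = Path(tempfile.mkdtemp()) / "submissions"
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / "submission_demo.csv"
submission.to_csv(out_path, index=False)
print(out_path.read_text())
```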

---

## 📥 Load Feature-Engineered Data

This notebook begins by importing:

- `../data/processed/loan_train_features.csv`
- `../data/processed/loan_test_features.csv`

(or whichever filenames you created in the feature engineering notebook)

These will be used to construct the feature matrix `X` and target vector `y` for training and validation.


In [2]:
import os
import optuna
from pathlib import Path
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier







## Load Processed Data

In [3]:
loan_train_features = pd.read_csv("../data/processed/loan_train_features.csv")
loan_test_features = pd.read_csv("../data/processed/loan_test_features.csv")

---

## ⚙️ Step 1: Define Features and Target

With the feature-engineered datasets loaded, the next step is to construct:

- **X_train** → Feature matrix
- **y_train** → Target vector (`loan_paid_back`)
- **X_test** → Feature matrix for Kaggle submission

This section will:
1. Separate predictors from the target column.
2. Confirm dataset shapes and check for any remaining inconsistencies.
3. Prepare the data for model training and baseline evaluation.

---

## 🧪 Step 2: Baseline Models

Before diving into advanced algorithms, we start with simple baseline models to:

- Establish a performance benchmark
- Verify that our preprocessing is correct
- Catch issues like data leakage or extreme imbalance early

The baseline models we will train:

### **1. Logistic Regression (Regularized)**
A reliable, interpretable starting point for binary classification.

### **2. Decision Tree (Depth-Limited)**
Helps visualize splitting patterns and provides an early non-linear alternative.

We’ll evaluate each using:

- ROC-AUC
- Accuracy
- Precision / Recall
- Confusion matrix

This gives us a solid “floor” before moving into more powerful ensemble methods.
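
The depth-limited Decision Tree baseline could look roughly like the sketch below. It runs on synthetic data, and `max_depth=5` is an assumed value, not a setting taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (assumption: ~80% positives).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.2, 0.8], random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# A shallow tree keeps the baseline fast and easy to inspect.
dt_demo = DecisionTreeClassifier(
    max_depth=5,              # assumed depth limit for illustration
    class_weight="balanced",
    random_state=42,
)
dt_demo.fit(X_tr, y_tr)

roc_dt_demo = roc_auc_score(y_va, dt_demo.predict_proba(X_va)[:, 1])
print(f"ROC-AUC (Decision Tree, synthetic data): {roc_dt_demo:.4f}")
```

In the notebook itself, the same estimator would be fit on `X_train_split` and scored on `X_valid`. Storing the score as `roc_dt` would let the model comparison cell pick it up.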

---


In [4]:
# ----------------
# Loan Features Head()
# ----------------
print(f'Loan Train Feature:\n{loan_train_features.head()}')
print(f'Loan Test Features: \n{loan_test_features.head()}')

Loan Train Feature:
   id  annual_income  debt_to_income_ratio  credit_score  loan_amount  \
0   0      -0.705461             -0.535135      0.993849    -1.803484   
1   1      -0.977248              0.660668     -0.810394    -1.505401   
2   2       0.050689             -0.345556      0.236067     0.286558   
3   3      -0.050687             -0.812211     -2.668764    -1.492497   
4   4      -0.850388             -0.987206     -0.287163    -0.409421   

   interest_rate  loan_paid_back     grade  subgrade  gender_Female  ...  \
0       0.653899             1.0 -0.401966  0.008691            1.0  ...   
1       0.280571             0.0  0.613154  0.008691            0.0  ...   
2      -1.292385             1.0 -0.401966  1.434819            0.0  ...   
3       1.863482             1.0  2.643393 -1.417436            1.0  ...   
4      -1.068388             1.0  0.613154 -1.417436            0.0  ...   

   grade_x_loan_purpose_Car  grade_x_loan_purpose_Debt consolidation  \
0           

In [5]:
# ----------------
# Loan Features Info()
# ----------------

# info() prints its report directly and returns None, so call it on its own line
print('Loan Train Features:')
loan_train_features.info()
print('Loan Test Features:')
loan_test_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 593994 entries, 0 to 593993
Data columns (total 53 columns):
 #   Column                                   Non-Null Count   Dtype  
---  ------                                   --------------   -----  
 0   id                                       593994 non-null  int64  
 1   annual_income                            593994 non-null  float64
 2   debt_to_income_ratio                     593994 non-null  float64
 3   credit_score                             593994 non-null  float64
 4   loan_amount                              593994 non-null  float64
 5   interest_rate                            593994 non-null  float64
 6   loan_paid_back                           593994 non-null  float64
 7   grade                                    593994 non-null  float64
 8   subgrade                                 593994 non-null  float64
 9   gender_Female                            593994 non-null  float64
 10  gender_Male                     

In [6]:
# -----------------------------------------------
# Step 1: Define Features (X) and Target (y)
# -----------------------------------------------

# The target column from the training set
target_col = "loan_paid_back"

# Feature matrix and target for training
X_train = loan_train_features.drop(columns=[target_col])
y_train = loan_train_features[target_col]

# Test set has no target column, which is expected
X_test = loan_test_features.copy()

print("X_train:", X_train.shape)
print("y_train:", y_train.shape)
print("X_test:", X_test.shape)



X_train: (593994, 52)
y_train: (593994,)
X_test: (254569, 52)
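
Matching shapes do not guarantee matching columns, so a quick schema check can help with the "remaining inconsistencies" step. The sketch below uses a hypothetical helper, `compare_feature_columns`, on two tiny illustrative frames. In the notebook it would be called on `X_train` and `X_test`.

```python
import pandas as pd


def compare_feature_columns(train_X, test_X):
    """Hypothetical helper: report column mismatches between two frames."""
    train_cols, test_cols = list(train_X.columns), list(test_X.columns)
    return {
        "only_in_train": sorted(set(train_cols) - set(test_cols)),
        "only_in_test": sorted(set(test_cols) - set(train_cols)),
        "same_order": train_cols == test_cols,
    }


# Tiny illustrative frames: same columns, different order.
demo_train = pd.DataFrame(
    {"id": [0, 1], "credit_score": [0.9, -0.8], "grade": [-0.4, 0.6]}
)
demo_test = pd.DataFrame(
    {"id": [2, 3], "grade": [0.1, 0.2], "credit_score": [0.3, 0.4]}
)

report = compare_feature_columns(demo_train, demo_test)
print(report)

# Models fit on NumPy arrays rely on column position, so align the order.
demo_test_aligned = demo_test[demo_train.columns]
```

Note that `id` is still part of the feature matrix at this point. Whether to keep it as a predictor is a separate modeling decision that this check does not make.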


In [7]:
# -----------------------------------------------
# Step 2: Train/Validate Split
# -----------------------------------------------

X_train_split, X_valid, y_train_split, y_valid = train_test_split(
    X_train,
    y_train,
    test_size=0.20,
    random_state=42,
    stratify=y_train
)

print("Train split:", X_train_split.shape)
print("Valid split:", X_valid.shape)


Train split: (475195, 52)
Valid split: (118799, 52)


In [8]:
# -----------------------------------------------
# Step 3: Baseline Logistic Regression
# -----------------------------------------------

log_reg = LogisticRegression(
    max_iter=1000,
    class_weight="balanced",
    n_jobs=-1
)

log_reg.fit(X_train_split, y_train_split)

# Predictions
y_pred_lr = log_reg.predict(X_valid)
y_prob_lr = log_reg.predict_proba(X_valid)[:, 1]

roc_lr = roc_auc_score(y_valid, y_prob_lr)

print(f"ROC-AUC (Logistic Regression): {roc_lr:.4f}")
print("\nClassification Report:")
print(classification_report(y_valid, y_pred_lr))


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


ROC-AUC (Logistic Regression): 0.9038

Classification Report:
              precision    recall  f1-score   support

         0.0       0.59      0.78      0.67     23900
         1.0       0.94      0.86      0.90     94899

    accuracy                           0.85    118799
   macro avg       0.76      0.82      0.79    118799
weighted avg       0.87      0.85      0.85    118799



---

## 📊 Baseline Model Results: Logistic Regression

The first model trained, a regularized Logistic Regression, serves as the baseline for evaluating all future models. Despite being a simple linear classifier, it produced **surprisingly strong results**, indicating that the engineered features contain significant predictive signal.

### **🔎 Performance Summary**
- **ROC-AUC:** 0.9038
- **Recall (Class 1 – Paid Back):** 0.86
- **Precision (Class 1 – Paid Back):** 0.94
- **Recall (Class 0 – Not Paid Back):** 0.78
- **Overall Accuracy:** 0.85

### **📈 Interpretation**
- An ROC-AUC above **0.90** from a baseline model is a strong result and suggests that the feature engineering phase was effective.
- High **precision** for repaid loans (class 1, 0.94) and good **recall** for non-paid loans (class 0, 0.78) indicate that the model is capturing both sides of the classification boundary. Precision for class 0 is weaker (0.59), so a sizable share of loans flagged as not paid back were in fact repaid.
- The class imbalance (`loan_paid_back = 1` is roughly four times as common as 0) is handled reasonably well with `class_weight="balanced"`.
- The `lbfgs` convergence warning means the solver hit `max_iter=1000` before converging. The `id` column is still part of `X_train` and is not scaled like the other features, which is a likely contributor. The reported metrics remain usable as a baseline, but the fit is not guaranteed to be optimal.

This strong baseline establishes a **performance floor** that subsequent models must exceed.
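
One way to probe the convergence warning is to put a scaler in front of the classifier. The sketch below is an assumption-laden illustration on synthetic data, with an unscaled, id-like column added on purpose. It is not a rerun of the notebook's model and makes no claim about how the real scores would change.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, random_state=0
)

# Prepend a row counter that mimics an unscaled id column (0, 1, 2, ...).
row_id = np.arange(len(X_demo)).reshape(-1, 1)
X_with_id = np.hstack([row_id, X_demo])

# Scaling inside the pipeline puts every column on a comparable range,
# which generally helps lbfgs converge in fewer iterations.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
scaled_lr.fit(X_with_id, y_demo)

# n_iter_ holds the number of solver iterations actually used.
n_iter_used = int(scaled_lr[-1].n_iter_[0])
print(f"lbfgs iterations used: {n_iter_used}")
```

Dropping an identifier column from the features altogether is another option worth testing, since a row id should carry no real signal.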

---

## 🧭 Next Steps: Advancing Beyond the Baseline

With the baseline complete, the next phase focuses on more expressive non-linear models. The dataset includes ratios, interaction terms, and many one-hot encoded features, conditions under which tree-based ensemble methods typically outperform linear models.

### **🚀 Upcoming Modeling Steps**

#### **1. Train Non-Linear Baseline Models**
- **Random Forest Classifier**
  Establishes an early non-linear benchmark.

- **Gradient Boosting Models:**
  - XGBoost
  - LightGBM
  - CatBoost
  These models are well-known for dominating tabular data competitions.

#### **2. Compare Performance Using Key Metrics**
- ROC-AUC
- Precision/Recall
- F1 Score
- PR-AUC (important for imbalanced datasets)

Evaluate all models on the same validation split for a fair comparison.
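
A comparison on a shared split might be sketched as follows, using synthetic data and two stand-in models. Because class 1 is the majority here, the sketch also computes PR-AUC with class 0 treated as the positive class, which is usually the more informative view under imbalance. The column names are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.2, 0.8], random_state=0
)
# One split, reused for every model, so the scores are comparable.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    prob = model.predict_proba(X_va)[:, 1]
    rows.append({
        "Model": name,
        "ROC-AUC": roc_auc_score(y_va, prob),
        "PR-AUC (class 1)": average_precision_score(y_va, prob),
        # Flip labels and scores to score the minority class 0.
        "PR-AUC (class 0)": average_precision_score(
            (y_va == 0).astype(int), 1 - prob
        ),
        "F1": f1_score(y_va, (prob >= 0.5).astype(int)),
    })

comparison = pd.DataFrame(rows).sort_values("ROC-AUC", ascending=False)
print(comparison)
```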

#### **3. Hyperparameter Tuning**
Once a top-performing algorithm is identified, apply:
- **Optuna** (recommended for speed/efficiency), or
- **GridSearchCV / RandomizedSearchCV**

Goal: improve generalization and push leaderboard performance higher.

#### **4. Save the Best Model**
Export final tuned model using:
- `joblib.dump(model, "model.pkl")`

This ensures reproducibility and supports prediction generation later.
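
A minimal save-and-reload round trip with `joblib` is sketched below. The toy model and the temporary directory are placeholders. In the project, the tuned model and a models folder would take their place.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy model standing in for the final tuned model.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Persist to disk (temporary directory used here for illustration).
model_dir = Path(tempfile.mkdtemp()) / "models"
model_dir.mkdir(parents=True, exist_ok=True)
model_path = model_dir / "model.pkl"
joblib.dump(model, model_path)

# Reload and confirm the restored model predicts identically.
restored = joblib.load(model_path)
same = np.allclose(model.predict_proba(X_demo), restored.predict_proba(X_demo))
print(f"saved to {model_path.name}, predictions match: {same}")
```

Pickled models are generally only safe to reload with the same library versions used to save them, so recording the scikit-learn version alongside the file is a sensible habit.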

#### **5. Generate Kaggle Submission**
Use the selected model to create predictions on `X_test` and save them as:
- `/data/submissions/submission_<date>.csv`

---

This roadmap transitions the project from a strong baseline into competitive modeling territory and prepares the foundation for leaderboard submissions.


---

## 🌲 Random Forest Classifier: Non-Linear Baseline

With the Logistic Regression baseline established, the next step is to introduce a more expressive non-linear model. Random Forests are ensemble methods that combine many decision trees trained on bootstrapped samples of the data. They naturally capture:

- Non-linear relationships
- Interaction effects
- Hierarchical decision boundaries
- Feature importance signals

Given our dataset includes engineered ratios, one-hot encodings, and interaction terms, Random Forests provide a strong early benchmark for tree-based models.

---

### **🎯 Goals of This Model**
1. Establish a non-linear baseline model.
2. Compare its performance against Logistic Regression.
3. Evaluate improvements in capturing complex relationships.
4. Examine feature importance as an interpretability step.
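
The feature importance step can be sketched as below, on synthetic data with placeholder feature names. In the notebook, the fitted `rf` model and `X_train_split.columns` would be used instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # placeholders

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1 across features.
importances = (
    pd.Series(rf_demo.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances)
```

Impurity-based importances tend to favor continuous or high-cardinality features, so an identifier-like column can rank high without being meaningful. `sklearn.inspection.permutation_importance` on the validation split is a common cross-check.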

---

### **🔧 Model Configuration**
For this first pass, we will use a moderate-sized forest:

- `n_estimators = 300`
- `max_depth = None` (allow deep trees)
- `min_samples_leaf = 2`
- `max_features = "sqrt"`
- `class_weight = "balanced"` (handles class imbalance)

This configuration keeps training efficient while still leveraging the power of ensemble methods.

---

### **📈 Evaluation Metrics**
As with Logistic Regression, we will evaluate using:

- ROC-AUC
- Precision / Recall
- F1-score
- Classification Report

These metrics help determine whether non-linearity materially improves model performance.

---


In [9]:
# -----------------------------------------------
# Random Forest Classifier: Non-Linear Baseline
# -----------------------------------------------

rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    min_samples_leaf=2,
    max_features="sqrt",
    class_weight="balanced",
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train_split, y_train_split)

# Predictions
y_pred_rf = rf.predict(X_valid)
y_prob_rf = rf.predict_proba(X_valid)[:, 1]

roc_rf = roc_auc_score(y_valid, y_prob_rf)

print(f"ROC-AUC (Random Forest): {roc_rf:.4f}")
print("\nClassification Report:")
print(classification_report(y_valid, y_pred_rf))


ROC-AUC (Random Forest): 0.9129

Classification Report:
              precision    recall  f1-score   support

         0.0       0.83      0.64      0.72     23900
         1.0       0.91      0.97      0.94     94899

    accuracy                           0.90    118799
   macro avg       0.87      0.80      0.83    118799
weighted avg       0.90      0.90      0.90    118799



In [10]:
# -----------------------------------------------
# Initialize model comparison table if not defined
# -----------------------------------------------
if "results" not in globals():
    results = pd.DataFrame(columns=["Model", "ROC-AUC"])

# If Logistic Regression results exist, add them here
# Only add if roc_lr is defined
if "roc_lr" in globals():
    results.loc[len(results)] = ["Logistic Regression", roc_lr]

# Only add Decision Tree if it exists
if "roc_dt" in globals():
    results.loc[len(results)] = ["Decision Tree", roc_dt]

results


| | Model | ROC-AUC |
|---|-------|---------|
| 0 | Logistic Regression | 0.903793 |


In [11]:
# --------------------------
# Append Random Forest Results
# --------------------------

results.loc[len(results)] = ["Random Forest", roc_rf]
results.sort_values("ROC-AUC", ascending=False)



| | Model | ROC-AUC |
|---|-------|---------|
| 1 | Random Forest | 0.912929 |
| 0 | Logistic Regression | 0.903793 |


---

## 🌳 Extra Trees Classifier: Enhanced Ensemble Baseline

Following the Random Forest model, the next ensemble to evaluate is the **Extra Trees Classifier** (Extremely Randomized Trees). It is similar to a Random Forest but adds randomness by drawing split thresholds **at random** for each candidate feature, rather than searching for the best possible threshold. In practice this tends to:

- Reduce variance (often at the cost of slightly higher bias)
- Train faster, because fewer candidate splits are evaluated

This makes Extra Trees a valuable comparison point and often a strong performer on high-dimensional tabular data.

### **🎯 Goals**
- Evaluate the performance of Extra Trees compared to Random Forest and Logistic Regression
- Identify whether additional randomness improves generalization
- Capture non-linear and interaction effects that linear models cannot

### **📈 Evaluation Metrics**
We will evaluate the model using:
- ROC-AUC
- Precision / Recall
- F1-score
- Classification Report

The goal is to determine whether Extra Trees surpasses Random Forest or provides complementary insights.

---


In [12]:
# -----------------------------------------------
# Extra Trees Classifier
# -----------------------------------------------

et = ExtraTreesClassifier(
    n_estimators=300,
    max_depth=None,
    min_samples_leaf=2,
    max_features="sqrt",
    class_weight="balanced",
    random_state=42,
    n_jobs=-1
)

et.fit(X_train_split, y_train_split)

# Predictions
y_pred_et = et.predict(X_valid)
y_prob_et = et.predict_proba(X_valid)[:, 1]

roc_et = roc_auc_score(y_valid, y_prob_et)

print(f"ROC-AUC (Extra Trees): {roc_et:.4f}")
print("\nClassification Report:")
print(classification_report(y_valid, y_pred_et))


ROC-AUC (Extra Trees): 0.9101

Classification Report:
              precision    recall  f1-score   support

         0.0       0.73      0.71      0.72     23900
         1.0       0.93      0.93      0.93     94899

    accuracy                           0.89    118799
   macro avg       0.83      0.82      0.82    118799
weighted avg       0.89      0.89      0.89    118799



In [13]:
results.loc[len(results)] = ["Extra Trees Classifier", roc_et]
results.sort_values("ROC-AUC", ascending=False)


| | Model | ROC-AUC |
|---|-------|---------|
| 1 | Random Forest | 0.912929 |
| 2 | Extra Trees Classifier | 0.910075 |
| 0 | Logistic Regression | 0.903793 |


---

## 🚫 Why Support Vector Machines Are Not Used

Although Support Vector Machines (SVMs) are powerful classifiers, especially for smaller or medium-sized datasets, they are **not practical for this project** due to the size and structure of the data. The loan dataset contains nearly **600,000 rows** and over **50 engineered features**, which creates several performance challenges for SVMs.

### **1. Computational Complexity**
Training a kernel SVM (as in scikit-learn's `SVC`) scales between:

- **O(n²)** and **O(n³)** in compute time
- where *n* is the number of samples (≈ 600k here)

This makes kernel SVMs impractical for datasets of this size. Linear SVMs (`LinearSVC`) scale far better, but are not expected to offer an advantage over the tree ensembles here.

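### **2. Kernel SVMs Are Infeasible at This Scale**
A kernelized SVM works with an **n × n kernel matrix**, which in full would be:

>600,000 × 600,000 → 360,000,000,000 entries

At 8 bytes per entry that is about 2.9 TB, so the full matrix cannot be held in memory on typical hardware. Solvers that cache only part of the matrix avoid the storage problem but pay for it in compute time.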
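
The size estimate can be checked with a few lines of arithmetic. The 8 bytes per entry assumes the matrix would be stored as 64-bit floats.

```python
n_rows = 600_000                      # approximate number of training rows
entries = n_rows * n_rows             # full n x n kernel matrix
bytes_float64 = entries * 8           # assumption: 8 bytes per float64 entry

print(f"{entries:,} entries")                        # 360,000,000,000 entries
print(f"{bytes_float64 / 1e12:.2f} TB as float64")   # 2.88 TB as float64
```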

### **3. Long Training Times on Limited Hardware**
As a rough expectation on a laptop CPU (not measured in this notebook):

- **LinearSVC** can take 20–60 minutes
- **RBF/Polynomial SVM** can take **hours**, or fail due to memory exhaustion

Given that Random Forest completed in ~7 minutes, an SVM would likely be dramatically slower, with no expected performance gain.

### **4. Limited Benefit for Tabular Data**
For large, structured datasets with:

- numeric features
- one-hot encodings
- interaction terms
- engineered ratios

tree-based ensemble methods (Random Forest, XGBoost, LightGBM, CatBoost) typically outperform SVMs. They model non-linear relationships and feature interactions far more efficiently.

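### **5. No Probabilities Without Extra Cost**
SVMs do not natively produce predicted probabilities.
ROC-AUC and PR-AUC only need a ranking score, so `decision_function` output is enough for those metrics. Calibrated probabilities, which matter for loan-risk estimates, require:

- **Platt scaling** or
- **cross-validation calibration**

These steps further increase runtime.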
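
As a side note, ranking metrics such as ROC-AUC depend only on the order of the scores, so an SVM's raw decision values can be scored directly. The sketch below illustrates this with `LinearSVC` on a small synthetic dataset. It is not a benchmark on the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

svm_demo = LinearSVC(C=1.0, max_iter=5000).fit(X_tr, y_tr)

# Signed distance to the separating hyperplane; not a probability.
scores = svm_demo.decision_function(X_va)
auc_svm_demo = roc_auc_score(y_va, scores)
print(f"ROC-AUC from decision_function: {auc_svm_demo:.4f}")
```

When probabilities are required, wrapping the estimator in `sklearn.calibration.CalibratedClassifierCV` is the usual route, at the cost of extra fits.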

---

### **📌 Summary: Why SVM Was Skipped**

| Reason | Impact |
|-------|--------|
| Very poor scaling on 600k rows | Training becomes impractically long |
| Full kernel matrix is too large to store | Kernel SVM is not feasible |
| High RAM usage | Likely to crash on laptop |
| Rarely competitive on tabular data | RF/GBM models typically outperform SVM |
| Extra work for probability outputs | Slower evaluation pipeline |

Given these limitations, SVMs do not align with the project’s efficiency, hardware constraints, or performance targets.

---

## ✅ Next Step: Gradient Boosting with LightGBM

LightGBM is designed for:

- **large-scale tabular data**
- **high-dimensional feature spaces**
- **fast training on CPUs**
- **strong leaderboard performance**

It will form the backbone of the next modeling phase.

---



## ⚡ LightGBM: Gradient Boosting Optimized for Tabular Data

LightGBM (Light Gradient Boosting Machine) is one of the most powerful algorithms for structured/tabular datasets. It is specifically engineered for **speed**, **scalability**, and **high predictive accuracy**, making it ideal for this competition.

Unlike Random Forests or Extra Trees, which average many deep trees, LightGBM builds trees **sequentially**, with each new tree correcting the errors of the previous one (gradient boosting). It also uses advanced optimizations such as:

- **Histogram-based splitting** (much faster than exact splits)
- **Leaf-wise tree growth** (often lowers loss faster than level-wise growth, but can overfit if `num_leaves` is too large)
- **Efficient handling of high-dimensional data**
- **Native support for missing values**

Given the size of this dataset (~600k rows × 50 features), LightGBM is particularly well suited.

---

### 🎯 **Goals for This Model**
- Establish the first gradient boosting baseline
- Compare performance against Random Forest and Extra Trees
- Determine whether boosting provides a significant accuracy lift
- Build a foundation for later hyperparameter tuning (Optuna or GridSearch)

---

### ⚙️ **Model Configuration (Laptop-Optimized)**

To ensure LightGBM trains quickly even on lower-power hardware (e.g., a laptop):

- `n_estimators = 300`
- `learning_rate = 0.05`
- `num_leaves = 31`
- `max_depth = -1` (no depth limit; `num_leaves` is what bounds tree size)
- `class_weight = "balanced"`
- `subsample = 0.8` (row sampling; LightGBM only applies it when `subsample_freq` is greater than 0, which is not set here)
- `colsample_bytree = 0.8` (feature sampling per tree)
- `random_state = 42`
- `n_jobs = -1`

This configuration provides competitive performance without long compute time.

---

### 📈 **Evaluation Metrics**
We will again evaluate:

- ROC-AUC
- Precision / Recall
- F1-score
- Classification Report

This will help determine whether LightGBM surpasses the tree ensemble baselines.

---


In [14]:
# -----------------------------------------------
# LightGBM Classifier: Gradient Boosting Baseline
# -----------------------------------------------

lgbm = LGBMClassifier(
    n_estimators=300,
    learning_rate=0.05,
    num_leaves=31,
    max_depth=-1,
    class_weight="balanced",
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1
)

lgbm.fit(X_train_split, y_train_split)

# Predictions
y_pred_lgb = lgbm.predict(X_valid)
y_prob_lgb = lgbm.predict_proba(X_valid)[:, 1]

roc_lgb = roc_auc_score(y_valid, y_prob_lgb)

print(f"ROC-AUC (LightGBM): {roc_lgb:.4f}")
print("\nClassification Report:")
print(classification_report(y_valid, y_pred_lgb))


[LightGBM] [Info] Number of positive: 379595, number of negative: 95600
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.030544 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2674
[LightGBM] [Info] Number of data points in the train set: 475195, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000
ROC-AUC (LightGBM): 0.9201

Classification Report:
              precision    recall  f1-score   support

         0.0       0.63      0.79      0.70     23900
         1.0       0.94      0.88      0.91     94899

    accuracy                           0.87    118799
   macro avg       0.79      0.84      0.81    118799
weighted avg       0.88      0.87      0.87    118799



In [15]:
results.loc[len(results)] = ["LightGBM", roc_lgb]
results.sort_values("ROC-AUC", ascending=False)


| | Model | ROC-AUC |
|---|-------|---------|
| 3 | LightGBM | 0.920053 |
| 1 | Random Forest | 0.912929 |
| 2 | Extra Trees Classifier | 0.910075 |
| 0 | Logistic Regression | 0.903793 |


---

## 🐈 CatBoost: Powerful Gradient Boosting for Tabular Data

CatBoost (Categorical Boosting) is one of the strongest gradient boosting algorithms for structured/tabular datasets. It excels in scenarios with:

- many engineered features
- non-linear relationships
- interaction terms
- imbalanced datasets
- one-hot encodings (even though it prefers raw categorical columns)

Unlike other boosting methods, CatBoost incorporates:

- **Ordered boosting**, which reduces overfitting
- **Efficient handling of categorical patterns**
- **Symmetric tree structures**, which improve speed and generalization
- **Fast CPU performance**, making it ideal for laptop environments

Given the size and structure of this dataset (~600k rows, 50+ engineered features), CatBoost is a natural next model in the competitive modeling phase.

---

### 🎯 Goals for This Model
- Benchmark CatBoost against LightGBM, Random Forest, Extra Trees, and Logistic Regression
- Evaluate whether its regularization and tree symmetry improve ROC-AUC
- Prepare the model for potential hyperparameter tuning with Optuna

---

### ⚙️ Model Configuration (Laptop-Friendly)
To ensure CatBoost runs efficiently on CPU:

- `iterations = 300`
- `learning_rate = 0.05`
- `depth = 6`
- `l2_leaf_reg = 3`
- `loss_function = "Logloss"`
- `eval_metric = "AUC"`
- `class_weights = [w0, w1]` with `w0 = 1` and `w1 = n_neg / n_pos`, so that both classes carry equal total weight

This setup provides strong early performance without overheating the system.
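
The class-weight choice can be sanity-checked by hand. Using the class counts reported in the LightGBM training log for the training split (379,595 positives and 95,600 negatives), the short sketch below shows that a weight of 1 for class 0 and `n_neg / n_pos` for class 1 gives both classes the same total weight. The variable names are illustrative.

```python
import math

n_pos = 379_595   # class 1 count in the training split (from the LightGBM log)
n_neg = 95_600    # class 0 count in the training split (from the LightGBM log)

neg_weight = 1.0
pos_weight = n_neg / n_pos            # majority class is down-weighted

total_neg = neg_weight * n_neg        # total weight carried by class 0
total_pos = pos_weight * n_pos        # total weight carried by class 1

print(f"pos_weight = {pos_weight:.4f}")                     # pos_weight = 0.2518
print(f"balanced: {math.isclose(total_neg, total_pos)}")    # balanced: True
```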

---

### 📈 Evaluation Metrics
As before, we evaluate:

- ROC-AUC
- Precision / Recall
- F1-score
- Classification Report

The goal is to determine whether CatBoost surpasses LightGBM’s baseline.

---


In [16]:
# -----------------------------------------------
# CatBoost Classifier: Gradient Boosting Baseline
# -----------------------------------------------

# Compute explicit class weights for CatBoost.
# Class 0 keeps weight 1; class 1 (the majority) is down-weighted by
# n_neg / n_pos so that both classes carry equal total weight.
# (CatBoost's auto_class_weights="Balanced" is an alternative.)
pos_weight = (y_train_split == 0).sum() / (y_train_split == 1).sum()
neg_weight = 1

cat_model = CatBoostClassifier(
    iterations=300,
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3,
    loss_function="Logloss",
    eval_metric="AUC",
    class_weights=[neg_weight, pos_weight],
    verbose=50,          # Print progress every 50 iterations
    random_seed=42,
    task_type="CPU"
)

cat_model.fit(
    X_train_split,
    y_train_split,
    eval_set=(X_valid, y_valid),
    use_best_model=True
)

# Predictions
y_pred_cat = cat_model.predict(X_valid)
y_prob_cat = cat_model.predict_proba(X_valid)[:, 1]

roc_cat = roc_auc_score(y_valid, y_prob_cat)

print(f"ROC-AUC (CatBoost): {roc_cat:.4f}")
print("\nClassification Report:")
print(classification_report(y_valid, y_pred_cat))


0:	test: 0.8972484	best: 0.8972484 (0)	total: 116ms	remaining: 34.6s
50:	test: 0.9134369	best: 0.9134369 (50)	total: 2.8s	remaining: 13.7s
100:	test: 0.9152246	best: 0.9152246 (100)	total: 4.63s	remaining: 9.13s
150:	test: 0.9160386	best: 0.9160386 (150)	total: 6.43s	remaining: 6.34s
200:	test: 0.9167290	best: 0.9167290 (200)	total: 8.81s	remaining: 4.34s
250:	test: 0.9173444	best: 0.9173444 (250)	total: 11.2s	remaining: 2.19s
299:	test: 0.9178097	best: 0.9178097 (299)	total: 13.8s	remaining: 0us

bestTest = 0.9178097176
bestIteration = 299

ROC-AUC (CatBoost): 0.9178

Classification Report:
              precision    recall  f1-score   support

         0.0       0.63      0.79      0.70     23900
         1.0       0.94      0.88      0.91     94899

    accuracy                           0.86    118799
   macro avg       0.79      0.83      0.81    118799
weighted avg       0.88      0.86      0.87    118799



In [17]:
results.loc[len(results)] = ["CatBoost", roc_cat]
results.sort_values("ROC-AUC", ascending=False)


| | Model | ROC-AUC |
|---|-------|---------|
| 3 | LightGBM | 0.920053 |
| 4 | CatBoost | 0.91781 |
| 1 | Random Forest | 0.912929 |
| 2 | Extra Trees Classifier | 0.910075 |
| 0 | Logistic Regression | 0.903793 |


---

## 🚀 XGBoost: Gradient Boosting with Robust Regularization

XGBoost (Extreme Gradient Boosting) is one of the most influential algorithms in modern machine learning. It dominated Kaggle competitions for years and remains a go-to choice in fintech, risk modeling, credit scoring, fraud detection, and structured/tabular ML.

While LightGBM is typically faster, XGBoost offers:

- Highly effective regularization (L1 + L2)
- Strong handling of noisy or imperfect features
- Excellent performance on large, structured datasets
- Predictable, stable behavior under most conditions

For this project, XGBoost provides a valuable comparison point alongside LightGBM and CatBoost, and completing it ensures a thorough modeling phase.

---

### 🎯 Goals for This Model
- Benchmark XGBoost against LightGBM, CatBoost, Random Forest, and Extra Trees
- Understand how different boosting strategies impact performance
- Build foundational experience with XGBoost for real-world ML workflows

---

### ⚙️ Model Configuration (Laptop-Friendly)
To avoid long training times while still capturing performance:

- `n_estimators = 300`
- `learning_rate = 0.05`
- `max_depth = 6`
- `subsample = 0.8`
- `colsample_bytree = 0.8`
- `reg_alpha = 0.0`
- `reg_lambda = 1.0`
- `objective = "binary:logistic"`
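- `eval_metric = "auc"`
- `tree_method = "hist"` (histogram-based tree construction)

This configuration balances speed and quality for a large dataset (~600k rows). Unlike the other models, no class weighting (e.g., `scale_pos_weight`) is applied, so its precision/recall profile at the default 0.5 threshold is not directly comparable to theirs.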

---

### 📈 Evaluation Metrics
We will evaluate the model using:

- ROC-AUC
- Precision / Recall
- F1-score
- Classification Report

This determines whether XGBoost approaches or surpasses LightGBM's current lead.

---


In [18]:
# -----------------------------------------------
# XGBoost Classifier: Gradient Boosting Baseline
# -----------------------------------------------

xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.0,
    reg_lambda=1.0,
    objective="binary:logistic",
    eval_metric="auc",
    tree_method="hist",        # Fastest CPU method (VERY important)
    random_state=42,
    n_jobs=-1
)

xgb.fit(
    X_train_split,
    y_train_split,
    eval_set=[(X_valid, y_valid)],
    verbose=False
)

# Predictions
y_pred_xgb = xgb.predict(X_valid)
y_prob_xgb = xgb.predict_proba(X_valid)[:, 1]

roc_xgb = roc_auc_score(y_valid, y_prob_xgb)

print(f"ROC-AUC (XGBoost): {roc_xgb:.4f}")
print("\nClassification Report:")
print(classification_report(y_valid, y_pred_xgb))


ROC-AUC (XGBoost): 0.9187

Classification Report:
              precision    recall  f1-score   support

         0.0       0.89      0.60      0.71     23900
         1.0       0.91      0.98      0.94     94899

    accuracy                           0.90    118799
   macro avg       0.90      0.79      0.83    118799
weighted avg       0.90      0.90      0.90    118799



In [19]:
results.loc[len(results)] = ["XGBoost", roc_xgb]
results.sort_values("ROC-AUC", ascending=False)


| | Model | ROC-AUC |
|---|-------|---------|
| 3 | LightGBM | 0.920053 |
| 5 | XGBoost | 0.918702 |
| 4 | CatBoost | 0.91781 |
| 1 | Random Forest | 0.912929 |
| 2 | Extra Trees Classifier | 0.910075 |
| 0 | Logistic Regression | 0.903793 |


---

## 🎛️ Hyperparameter Tuning: What It Is and Why We Need It

Now that all baseline models have been trained and compared, the next step is to **optimize** the top-performing algorithms. Out of all models tested so far, **LightGBM** and **XGBoost** have shown the strongest ROC-AUC scores and are the best candidates for tuning.

Hyperparameter tuning is the process of systematically searching for the best settings (hyperparameters) that control how a model learns. These settings can dramatically affect:

- Model accuracy
- Overfitting vs. generalization
- Training speed
- Final leaderboard performance

Baseline models give us a strong starting point, but they are rarely optimized for maximum AUC.

---

### **Why We Use Optuna**
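Optuna is a modern hyperparameter optimization framework. Its default sampler, the Tree-structured Parzen Estimator (TPE), is a form of **Bayesian optimization**: it uses the results of earlier trials to decide which configurations to try next, which helps it find high-performing settings efficiently.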

Compared to manual tuning or grid search:

- 🚀 **Faster** (typically finds good configurations in fewer trials)
- 🧠 **Smarter** (uses previous results to guide future searches)
- 💻 **Resource-efficient** (practical on laptops and limited hardware)
- 📈 **Often improves model AUC**, though gains over a strong baseline can be small

For large tabular datasets like this one, Optuna is a practical tool for pushing model performance toward the top tier (roughly the 0.94 to 0.96 AUC range).

---

### **What We Will Tune**
We will focus on the two strongest models:

1. **LightGBM**
2. **XGBoost**

Key parameters that influence performance:

- Number of leaves / tree depth
- Learning rate
- Number of boosting rounds
- Subsample ratios
- Feature sampling ratios
- Regularization (L1/L2 penalties)
- Minimum child weight / min data in leaf

These control how the model grows trees and how much it generalizes.
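
Two of these settings interact in LightGBM: a binary tree limited to depth `d` can hold at most `2**d` leaves, so a `num_leaves` value above that cap cannot be reached. The helper below is a small sketch of that arithmetic. `effective_num_leaves` is a hypothetical name, not a LightGBM function, and it assumes LightGBM's convention that `max_depth <= 0` means no depth limit.

```python
def effective_num_leaves(num_leaves: int, max_depth: int) -> int:
    """Upper bound on leaves per tree given both limits (illustrative helper)."""
    if max_depth <= 0:
        # Assumption: non-positive max_depth means "no depth limit" (LightGBM convention).
        return num_leaves
    # A binary tree of depth d has at most 2**d leaves.
    return min(num_leaves, 2 ** max_depth)

print(effective_num_leaves(num_leaves=22, max_depth=4))    # depth cap binds
print(effective_num_leaves(num_leaves=44, max_depth=-1))   # only num_leaves binds
print(effective_num_leaves(num_leaves=47, max_depth=1))    # a depth-1 tree is a stump
```

When both parameters are searched at once, a wide `num_leaves` range adds little if most sampled depths are small.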

---

### **Goal of This Phase**
The objective is to find the **best possible configuration** for the model that achieves:

- Higher ROC-AUC
- Stronger precision-recall characteristics
- Better ranking of default risk
- Improved stability on unseen data
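
Because ROC-AUC is the metric behind every comparison in this notebook, it helps to recall what it measures: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The pure-Python sketch below computes it by counting pairs. `pairwise_auc` is an illustrative name, the data are made up, and `roc_auc_score` from scikit-learn remains the practical choice for real datasets.

```python
def pairwise_auc(y_true, y_score):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the data size, so this is
    only suitable for small illustrations.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("Need at least one positive and one negative example.")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (not from the competition): 3 positives, 2 negatives.
y_true = [1, 0, 1, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.2]
auc = pairwise_auc(y_true, y_score)
print(f"AUC = {auc:.4f}")
```

Since only the ordering of scores matters, AUC does not depend on the classification threshold, which is why it can stay high while recall on class 0 at the default cutoff remains modest.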

After tuning, we will:

- Re-train the best model
- Save it under `/models/`
- Use it to generate final Kaggle submission predictions

This marks the final stage of the modeling workflow.

---



## 🔧 Hyperparameter Tuning: LightGBM + Optuna

With LightGBM currently leading model performance (ROC-AUC = 0.92005), the next step is to tune its hyperparameters to push the model toward higher accuracy and better generalization.

LightGBM is highly sensitive to its core hyperparameters, including:

- **num_leaves**
- **max_depth**
- **learning_rate**
- **subsample** and **colsample_bytree**
- **min_child_samples**
- **lambda_l1 / lambda_l2** (regularization)

Manually tuning these would be slow and inefficient.
Instead, we use **Optuna**, which offers:

- intelligent hyperparameter search guided by Bayesian optimization
- efficient operation even on CPU
- a good fit for large tabular datasets

The goal is to discover a configuration that significantly improves ROC-AUC over the baseline while maintaining reasonable training time.

After tuning, the best LightGBM model will be retrained on the full training set (`X_train`, `y_train`) and saved for later evaluation and submission.

---


In [20]:
# ------------------------------------------------------
# Optuna Objective Function for LightGBM
# ------------------------------------------------------
def objective(trial):

    params = {
        "n_estimators": trial.suggest_int("n_estimators", 200, 800),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.15),
        "num_leaves": trial.suggest_int("num_leaves", 15, 60),
        "max_depth": trial.suggest_int("max_depth", -1, 12),   # values <= 0 mean no depth limit in LightGBM
        "min_child_samples": trial.suggest_int("min_child_samples", 10, 60),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),   # only takes effect when subsample_freq > 0
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "lambda_l1": trial.suggest_float("lambda_l1", 0.0, 5.0),
        "lambda_l2": trial.suggest_float("lambda_l2", 0.0, 5.0),
        "random_state": 42,
        "n_jobs": -1,
        "class_weight": "balanced",   # fixed here, so it is not included in study.best_params
        "verbosity": -1     # silence training output
    }

    model = LGBMClassifier(**params)

    model.fit(
        X_train_split,
        y_train_split,
        eval_set=[(X_valid, y_valid)],
        eval_metric="auc"
    )

    preds = model.predict_proba(X_valid)[:, 1]
    auc = roc_auc_score(y_valid, preds)

    return auc

# ------------------------------------------------------
# Run Optuna Study
# ------------------------------------------------------
study = optuna.create_study(
    direction="maximize",
    study_name="lightgbm_opt"
)

study.optimize(
    objective,
    n_trials=25,     # keep small for laptop; increase to 50-100 on desktop
    show_progress_bar=True
)

print("Best ROC-AUC:", study.best_value)
print("Best Hyperparameters:", study.best_params)


[I 2025-11-21 10:07:11,230] A new study created in memory with name: lightgbm_opt
[I 2025-11-21 10:07:27,554] Trial 0 finished with value: 0.9205927588022342 and parameters: {'n_estimators': 445, 'learning_rate': 0.02777117973040999, 'num_leaves': 44, 'max_depth': 0, 'min_child_samples': 51, 'subsample': 0.7964167566220961, 'colsample_bytree': 0.6692720947270885, 'lambda_l1': 3.780160448561026, 'lambda_l2': 4.220245305005463}. Best is trial 0 with value: 0.9205927588022342.


[I 2025-11-21 10:07:37,348] Trial 1 finished with value: 0.9180638323210041 and parameters: {'n_estimators': 657, 'learning_rate': 0.07475150851993297, 'num_leaves': 35, 'max_depth': 2, 'min_child_samples': 36, 'subsample': 0.6531778511448821, 'colsample_bytree': 0.8976368222279854, 'lambda_l1': 4.328769886063157, 'lambda_l2': 2.563098335896151}. Best is trial 0 with value: 0.9205927588022342.


[I 2025-11-21 10:07:49,068] Trial 2 finished with value: 0.9186679160901343 and parameters: {'n_estimators': 292, 'learning_rate': 0.025402304888714138, 'num_leaves': 40, 'max_depth': 10, 'min_child_samples': 12, 'subsample': 0.8793516724892867, 'colsample_bytree': 0.6784829750735965, 'lambda_l1': 2.360550597209539, 'lambda_l2': 1.9788308856634118}. Best is trial 0 with value: 0.9205927588022342.


[I 2025-11-21 10:07:53,176] Trial 3 finished with value: 0.9136251128208933 and parameters: {'n_estimators': 244, 'learning_rate': 0.12139033412660803, 'num_leaves': 47, 'max_depth': 1, 'min_child_samples': 60, 'subsample': 0.7564302432150071, 'colsample_bytree': 0.6452474807798109, 'lambda_l1': 2.0061084162388783, 'lambda_l2': 2.466660634508226}. Best is trial 0 with value: 0.9205927588022342.


[I 2025-11-21 10:07:59,387] Trial 4 finished with value: 0.9203640659849731 and parameters: {'n_estimators': 305, 'learning_rate': 0.06900598309230256, 'num_leaves': 21, 'max_depth': 12, 'min_child_samples': 60, 'subsample': 0.7610484395724241, 'colsample_bytree': 0.9550510833073976, 'lambda_l1': 1.3765588692516513, 'lambda_l2': 1.6942977410195448}. Best is trial 0 with value: 0.9205927588022342.


[I 2025-11-21 10:08:12,464] Trial 5 finished with value: 0.9211858174167198 and parameters: {'n_estimators': 797, 'learning_rate': 0.09327894922885692, 'num_leaves': 24, 'max_depth': 3, 'min_child_samples': 12, 'subsample': 0.8482042792985489, 'colsample_bytree': 0.9178589247329457, 'lambda_l1': 0.5943650512390863, 'lambda_l2': 4.799882263653815}. Best is trial 5 with value: 0.9211858174167198.


[I 2025-11-21 10:08:20,257] Trial 6 finished with value: 0.9216030771494963 and parameters: {'n_estimators': 415, 'learning_rate': 0.13981042545072897, 'num_leaves': 29, 'max_depth': 10, 'min_child_samples': 22, 'subsample': 0.6567505703681153, 'colsample_bytree': 0.7560293763044622, 'lambda_l1': 4.561441548948368, 'lambda_l2': 3.6390882945230625}. Best is trial 6 with value: 0.9216030771494963.


[I 2025-11-21 10:08:31,481] Trial 7 finished with value: 0.9201994822859679 and parameters: {'n_estimators': 609, 'learning_rate': 0.03860269845853867, 'num_leaves': 46, 'max_depth': 5, 'min_child_samples': 35, 'subsample': 0.7077541907080236, 'colsample_bytree': 0.7127464307549571, 'lambda_l1': 3.990704371278404, 'lambda_l2': 1.9110100835016248}. Best is trial 6 with value: 0.9216030771494963.


[I 2025-11-21 10:08:41,558] Trial 8 finished with value: 0.9208067061475311 and parameters: {'n_estimators': 516, 'learning_rate': 0.044851694674216507, 'num_leaves': 43, 'max_depth': 7, 'min_child_samples': 17, 'subsample': 0.6712901648263607, 'colsample_bytree': 0.984196462388612, 'lambda_l1': 1.8762184960902646, 'lambda_l2': 2.9475964783351785}. Best is trial 6 with value: 0.9216030771494963.


[I 2025-11-21 10:08:46,883] Trial 9 finished with value: 0.9144666040235422 and parameters: {'n_estimators': 435, 'learning_rate': 0.13102387747952435, 'num_leaves': 42, 'max_depth': 1, 'min_child_samples': 10, 'subsample': 0.8204664553293886, 'colsample_bytree': 0.9605514420178709, 'lambda_l1': 4.3107312844754295, 'lambda_l2': 4.466476155269864}. Best is trial 6 with value: 0.9216030771494963.


[I 2025-11-21 10:08:54,709] Trial 10 finished with value: 0.9211466927997133 and parameters: {'n_estimators': 390, 'learning_rate': 0.14038839501439476, 'num_leaves': 60, 'max_depth': 8, 'min_child_samples': 25, 'subsample': 0.9815844782096658, 'colsample_bytree': 0.8334819192218762, 'lambda_l1': 3.122846378007818, 'lambda_l2': 0.16612369479151745}. Best is trial 6 with value: 0.9216030771494963.


[I 2025-11-21 10:09:04,966] Trial 11 finished with value: 0.9222003441580107 and parameters: {'n_estimators': 732, 'learning_rate': 0.11229002060723453, 'num_leaves': 22, 'max_depth': 4, 'min_child_samples': 24, 'subsample': 0.8795712892926907, 'colsample_bytree': 0.7756669726085643, 'lambda_l1': 0.38768355024669177, 'lambda_l2': 4.9671871817965}. Best is trial 11 with value: 0.9222003441580107.


[I 2025-11-21 10:09:17,176] Trial 12 finished with value: 0.9221148650397355 and parameters: {'n_estimators': 776, 'learning_rate': 0.10855237592090938, 'num_leaves': 15, 'max_depth': 5, 'min_child_samples': 25, 'subsample': 0.6102993932281877, 'colsample_bytree': 0.764554562217135, 'lambda_l1': 0.29061607027575265, 'lambda_l2': 3.6142072004626815}. Best is trial 11 with value: 0.9222003441580107.


[I 2025-11-21 10:09:28,469] Trial 13 finished with value: 0.9220233120779674 and parameters: {'n_estimators': 797, 'learning_rate': 0.10202162441649992, 'num_leaves': 17, 'max_depth': 5, 'min_child_samples': 28, 'subsample': 0.9126227027205643, 'colsample_bytree': 0.8235368606207177, 'lambda_l1': 0.051741121464883744, 'lambda_l2': 3.6918411573630046}. Best is trial 11 with value: 0.9222003441580107.


[I 2025-11-21 10:09:38,475] Trial 14 finished with value: 0.9220580757494172 and parameters: {'n_estimators': 699, 'learning_rate': 0.11282028619958198, 'num_leaves': 15, 'max_depth': 4, 'min_child_samples': 34, 'subsample': 0.9345015622966164, 'colsample_bytree': 0.7689783535249737, 'lambda_l1': 0.7732064197480846, 'lambda_l2': 4.938988164835865}. Best is trial 11 with value: 0.9222003441580107.


[I 2025-11-21 10:09:49,329] Trial 15 finished with value: 0.9214877217403696 and parameters: {'n_estimators': 708, 'learning_rate': 0.08775202648677595, 'num_leaves': 28, 'max_depth': -1, 'min_child_samples': 45, 'subsample': 0.6010032918738436, 'colsample_bytree': 0.8558189080308486, 'lambda_l1': 0.1288490458099242, 'lambda_l2': 3.57685488393612}. Best is trial 11 with value: 0.9222003441580107.


[I 2025-11-21 10:09:58,065] Trial 16 finished with value: 0.9217141536205351 and parameters: {'n_estimators': 571, 'learning_rate': 0.10982835715490029, 'num_leaves': 21, 'max_depth': 7, 'min_child_samples': 20, 'subsample': 0.9988098598591848, 'colsample_bytree': 0.6091047115231359, 'lambda_l1': 1.1519013216319685, 'lambda_l2': 0.9486282737473548}. Best is trial 11 with value: 0.9222003441580107.


[I 2025-11-21 10:10:07,554] Trial 17 finished with value: 0.9199428956863676 and parameters: {'n_estimators': 735, 'learning_rate': 0.06591388029081524, 'num_leaves': 34, 'max_depth': 3, 'min_child_samples': 30, 'subsample': 0.6062388428814585, 'colsample_bytree': 0.7721600494734877, 'lambda_l1': 3.1805796047940778, 'lambda_l2': 4.301014843126243}. Best is trial 11 with value: 0.9222003441580107.


[I 2025-11-21 10:10:17,074] Trial 18 finished with value: 0.9217523514649641 and parameters: {'n_estimators': 603, 'learning_rate': 0.1491551745561493, 'num_leaves': 28, 'max_depth': 6, 'min_child_samples': 42, 'subsample': 0.7339550639649226, 'colsample_bytree': 0.7283298059793121, 'lambda_l1': 1.4073794481621027, 'lambda_l2': 3.1977620171224403}. Best is trial 11 with value: 0.9222003441580107.


[I 2025-11-21 10:10:27,664] Trial 19 finished with value: 0.9218820996257593 and parameters: {'n_estimators': 764, 'learning_rate': 0.123901029610605, 'num_leaves': 15, 'max_depth': 9, 'min_child_samples': 17, 'subsample': 0.8777395738335008, 'colsample_bytree': 0.7952297419044134, 'lambda_l1': 0.5213977996398076, 'lambda_l2': 4.075233646303732}. Best is trial 11 with value: 0.9222003441580107.


[I 2025-11-21 10:10:36,600] Trial 20 finished with value: 0.9190306686329061 and parameters: {'n_estimators': 660, 'learning_rate': 0.057617991573365904, 'num_leaves': 53, 'max_depth': 3, 'min_child_samples': 29, 'subsample': 0.9495653110532898, 'colsample_bytree': 0.8636740371985527, 'lambda_l1': 1.0030709177977322, 'lambda_l2': 4.708630784246206}. Best is trial 11 with value: 0.9222003441580107.


[I 2025-11-21 10:10:46,629] Trial 21 finished with value: 0.9220629670981185 and parameters: {'n_estimators': 713, 'learning_rate': 0.11088020086672988, 'num_leaves': 15, 'max_depth': 4, 'min_child_samples': 34, 'subsample': 0.9202282136640969, 'colsample_bytree': 0.7355355681834763, 'lambda_l1': 0.6572400897481405, 'lambda_l2': 4.878975281604759}. Best is trial 11 with value: 0.9222003441580107.


[I 2025-11-21 10:10:56,203] Trial 22 finished with value: 0.9219565302216701 and parameters: {'n_estimators': 679, 'learning_rate': 0.09823868691093746, 'num_leaves': 19, 'max_depth': 4, 'min_child_samples': 39, 'subsample': 0.8910302465737919, 'colsample_bytree': 0.7174596267370453, 'lambda_l1': 0.2987321214936222, 'lambda_l2': 4.9749943450100265}. Best is trial 11 with value: 0.9222003441580107.


[I 2025-11-21 10:11:07,626] Trial 23 finished with value: 0.9218935996301022 and parameters: {'n_estimators': 748, 'learning_rate': 0.08462804398282397, 'num_leaves': 23, 'max_depth': 6, 'min_child_samples': 23, 'subsample': 0.8463924157502148, 'colsample_bytree': 0.7996588036027376, 'lambda_l1': 1.6219164162230713, 'lambda_l2': 4.046185445242018}. Best is trial 11 with value: 0.9222003441580107.


[I 2025-11-21 10:11:16,628] Trial 24 finished with value: 0.9219915280112162 and parameters: {'n_estimators': 626, 'learning_rate': 0.10866808050382025, 'num_leaves': 25, 'max_depth': 4, 'min_child_samples': 31, 'subsample': 0.948234296519552, 'colsample_bytree': 0.7441053615977136, 'lambda_l1': 0.8863153563080481, 'lambda_l2': 3.2028812291175304}. Best is trial 11 with value: 0.9222003441580107.
Best ROC-AUC: 0.9222003441580107
Best Hyperparameters: {'n_estimators': 732, 'learning_rate': 0.11229002060723453, 'num_leaves': 22, 'max_depth': 4, 'min_child_samples': 24, 'subsample': 0.8795712892926907, 'colsample_bytree': 0.7756669726085643, 'lambda_l1': 0.38768355024669177, 'lambda_l2': 4.9671871817965}





In [21]:
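# best_trial.params holds only the suggested values; the settings fixed inside
# the objective (class_weight="balanced", verbosity=-1) are not carried over here.
best_lgbm_params = study.best_trial.params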
best_lgbm = LGBMClassifier(
    **best_lgbm_params,
    objective='binary',
    random_state=42,
    n_jobs=-1
)

best_lgbm.fit(X_train, y_train)


| Parameter | Value |
|---|---|
| boosting_type | 'gbdt' |
| num_leaves | 22 |
| max_depth | 4 |
| learning_rate | 0.11229002060723453 |
| n_estimators | 732 |
| subsample_for_bin | 200000 |
| objective | 'binary' |
| class_weight | |
| min_split_gain | 0.0 |
| min_child_weight | 0.001 |
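
One detail worth checking when retraining: Optuna's `best_params` contains only the values drawn through `trial.suggest_*` calls, so settings that were fixed inside the objective (here `class_weight` and `verbosity`) are not carried over automatically. A minimal sketch of rebuilding the full configuration with plain dictionaries follows. `FIXED_LGBM_PARAMS` and `build_final_params` are hypothetical names, and the tuned values are rounded copies of a subset of the best trial's parameters.

```python
# Settings held constant inside the objective (copied from the tuning cell).
FIXED_LGBM_PARAMS = {
    "random_state": 42,
    "n_jobs": -1,
    "class_weight": "balanced",
    "verbosity": -1,
}

def build_final_params(tuned: dict, fixed: dict) -> dict:
    """Merge fixed and tuned settings; refuse silent overrides."""
    overlap = set(tuned) & set(fixed)
    if overlap:
        raise ValueError(f"Parameters defined twice: {sorted(overlap)}")
    return {**fixed, **tuned}

# Stand-in for study.best_params (subset, values rounded from the best trial).
tuned = {"n_estimators": 732, "learning_rate": 0.1123, "num_leaves": 22, "max_depth": 4}

final_params = build_final_params(tuned, FIXED_LGBM_PARAMS)
print(final_params["class_weight"])   # the tuning-time setting is preserved
```

Passing the merged dictionary to `LGBMClassifier(**final_params)` would then train the final model under the same configuration that was scored during the search.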


---

## 🔧 Hyperparameter Tuning: XGBoost + Optuna

With LightGBM tuned and leading our model comparison (best validation ROC-AUC = 0.9222), the next step is to tune **XGBoost**, our second-best baseline model.

While LightGBM is often faster, XGBoost offers:

- strong regularization (L1/L2)
- robust performance on noisy features
- stable tree growth patterns
- excellent generalization on credit-risk style datasets

Tuning XGBoost is valuable both for practical performance and for deepening understanding of how boosted trees behave under different hyperparameters.

Optuna is used again because it provides:

- efficient Bayesian optimization
- smarter parameter search
- fewer wasted trials
- practical run times on CPU, even for large datasets

The goal: determine whether a tuned XGBoost model can approach, or even challenge, the tuned LightGBM score.

---


In [22]:
# ------------------------------------------------------
# Optuna Objective Function for XGBoost
# ------------------------------------------------------
def xgb_objective(trial):

    params = {
        "n_estimators": trial.suggest_int("n_estimators", 200, 800),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.20),

        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "min_child_weight": trial.suggest_float("min_child_weight", 1.0, 20.0),

        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),

        "gamma": trial.suggest_float("gamma", 0.0, 10.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 5.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.0, 5.0),

        "objective": "binary:logistic",
        "eval_metric": "auc",
        "tree_method": "hist",         # fastest CPU method
        "random_state": 42,
        "n_jobs": -1
    }

    model = XGBClassifier(**params)

    model.fit(
        X_train_split,
        y_train_split,
        eval_set=[(X_valid, y_valid)],
        verbose=False
    )

    preds = model.predict_proba(X_valid)[:, 1]
    auc = roc_auc_score(y_valid, preds)

    return auc

# ------------------------------------------------------
# Run Optuna Study
# ------------------------------------------------------
xgb_study = optuna.create_study(
    direction="maximize",
    study_name="xgboost_opt"
)

xgb_study.optimize(
    xgb_objective,
    n_trials=25,                    # increase on desktop for more power
    show_progress_bar=True
)

print("Best ROC-AUC:", xgb_study.best_value)
print("Best Hyperparameters:", xgb_study.best_params)


[I 2025-11-21 10:11:25,347] A new study created in memory with name: xgboost_opt
[I 2025-11-21 10:11:41,785] Trial 0 finished with value: 0.9199892063180495 and parameters: {'n_estimators': 626, 'learning_rate': 0.11196235992498715, 'max_depth': 10, 'min_child_weight': 13.039667952189012, 'subsample': 0.668142934117381, 'colsample_bytree': 0.9984763776810714, 'gamma': 3.3985390472134505, 'reg_alpha': 4.505844616827078, 'reg_lambda': 3.1420008341028076}. Best is trial 0 with value: 0.9199892063180495.


[I 2025-11-21 10:11:53,774] Trial 1 finished with value: 0.9195037798609145 and parameters: {'n_estimators': 340, 'learning_rate': 0.04096182196651328, 'max_depth': 9, 'min_child_weight': 9.622184698949026, 'subsample': 0.8028456981934992, 'colsample_bytree': 0.9961888476516878, 'gamma': 0.2560289385701553, 'reg_alpha': 1.232409934985168, 'reg_lambda': 0.09526697578262744}. Best is trial 0 with value: 0.9199892063180495.


[I 2025-11-21 10:12:07,168] Trial 2 finished with value: 0.9190309395661832 and parameters: {'n_estimators': 715, 'learning_rate': 0.09249907146589244, 'max_depth': 8, 'min_child_weight': 4.867901525676685, 'subsample': 0.9496920800574504, 'colsample_bytree': 0.8370140329579077, 'gamma': 4.40552844732384, 'reg_alpha': 2.3498851961791094, 'reg_lambda': 4.990915543665048}. Best is trial 0 with value: 0.9199892063180495.


[I 2025-11-21 10:12:29,112] Trial 3 finished with value: 0.919524333754349 and parameters: {'n_estimators': 649, 'learning_rate': 0.036642577858659105, 'max_depth': 7, 'min_child_weight': 11.822166731222275, 'subsample': 0.7321678677057772, 'colsample_bytree': 0.9130951661055644, 'gamma': 4.600228229932184, 'reg_alpha': 3.8428845377901197, 'reg_lambda': 4.555845152985486}. Best is trial 0 with value: 0.9199892063180495.


[I 2025-11-21 10:12:45,059] Trial 4 finished with value: 0.9188991524616285 and parameters: {'n_estimators': 475, 'learning_rate': 0.02546863817961256, 'max_depth': 8, 'min_child_weight': 4.456424641374461, 'subsample': 0.8343095808338101, 'colsample_bytree': 0.7417864022895927, 'gamma': 4.4050203600808455, 'reg_alpha': 4.286724326137349, 'reg_lambda': 4.005115412075392}. Best is trial 0 with value: 0.9199892063180495.


[I 2025-11-21 10:13:00,992] Trial 5 finished with value: 0.9178379471572969 and parameters: {'n_estimators': 413, 'learning_rate': 0.01404005830248626, 'max_depth': 10, 'min_child_weight': 1.9082848907250651, 'subsample': 0.743225536538475, 'colsample_bytree': 0.7162450718925164, 'gamma': 5.887348103369941, 'reg_alpha': 4.8525428403849515, 'reg_lambda': 2.8245581563768605}. Best is trial 0 with value: 0.9199892063180495.


[I 2025-11-21 10:13:39,260] Trial 6 finished with value: 0.918689187989821 and parameters: {'n_estimators': 554, 'learning_rate': 0.17521687708793285, 'max_depth': 5, 'min_child_weight': 19.471834336254297, 'subsample': 0.7031150680993186, 'colsample_bytree': 0.6992849760452404, 'gamma': 8.543146106117177, 'reg_alpha': 2.3505396278073567, 'reg_lambda': 2.436343377310659}. Best is trial 0 with value: 0.9199892063180495.


[I 2025-11-21 10:15:35,959] Trial 7 finished with value: 0.9199445025477648 and parameters: {'n_estimators': 689, 'learning_rate': 0.0829185979016174, 'max_depth': 9, 'min_child_weight': 14.298753904405576, 'subsample': 0.9960839546269578, 'colsample_bytree': 0.8202764811811916, 'gamma': 0.7895089903206898, 'reg_alpha': 1.8070096726444924, 'reg_lambda': 2.409815882222681}. Best is trial 0 with value: 0.9199892063180495.


[I 2025-11-21 10:15:53,799] Trial 8 finished with value: 0.9147600249831787 and parameters: {'n_estimators': 301, 'learning_rate': 0.024681092113528615, 'max_depth': 4, 'min_child_weight': 19.89627967141153, 'subsample': 0.7889065377013785, 'colsample_bytree': 0.6610661037923257, 'gamma': 5.372500392100932, 'reg_alpha': 4.427776596200916, 'reg_lambda': 3.3231717847748388}. Best is trial 0 with value: 0.9199892063180495.


[I 2025-11-21 10:16:11,536] Trial 9 finished with value: 0.9197899416164139 and parameters: {'n_estimators': 451, 'learning_rate': 0.12607739319532593, 'max_depth': 10, 'min_child_weight': 8.38090225161018, 'subsample': 0.8415032639962352, 'colsample_bytree': 0.729578458261793, 'gamma': 3.6765244596823266, 'reg_alpha': 2.9018027794312387, 'reg_lambda': 4.1083396366726745}. Best is trial 0 with value: 0.9199892063180495.


[I 2025-11-21 10:16:23,080] Trial 10 finished with value: 0.9198987273895818 and parameters: {'n_estimators': 222, 'learning_rate': 0.14893259328780248, 'max_depth': 6, 'min_child_weight': 15.813406002677729, 'subsample': 0.6045533654103092, 'colsample_bytree': 0.6025732180562658, 'gamma': 2.25162542374814, 'reg_alpha': 0.040795891734422884, 'reg_lambda': 1.1250790540794875}. Best is trial 0 with value: 0.9199892063180495.


[I 2025-11-21 10:17:07,847] Trial 11 finished with value: 0.9174526130202907 and parameters: {'n_estimators': 787, 'learning_rate': 0.08323175354874454, 'max_depth': 10, 'min_child_weight': 14.455130372279829, 'subsample': 0.9992268406890001, 'colsample_bytree': 0.8513158586481347, 'gamma': 0.0010245538593156578, 'reg_alpha': 1.2455885261810997, 'reg_lambda': 1.9803009114936019}. Best is trial 0 with value: 0.9199892063180495.


[I 2025-11-21 10:17:30,810] Trial 12 finished with value: 0.9195436498199958 and parameters: {'n_estimators': 590, 'learning_rate': 0.06913038225776394, 'max_depth': 8, 'min_child_weight': 13.855992221521888, 'subsample': 0.6019192849307644, 'colsample_bytree': 0.9980441563410227, 'gamma': 2.0288134893239826, 'reg_alpha': 3.354715521903998, 'reg_lambda': 1.6886850573571393}. Best is trial 0 with value: 0.9199892063180495.


[I 2025-11-21 10:17:45,662] Trial 13 finished with value: 0.9199705756320273 and parameters: {'n_estimators': 684, 'learning_rate': 0.11619616422697153, 'max_depth': 3, 'min_child_weight': 16.415524492358237, 'subsample': 0.9000882324185637, 'colsample_bytree': 0.916853967998394, 'gamma': 2.0342228721453264, 'reg_alpha': 1.2913558365999163, 'reg_lambda': 3.0850966576653294}. Best is trial 0 with value: 0.9199892063180495.


[I 2025-11-21 10:18:01,798] Trial 14 finished with value: 0.9174238973996622 and parameters: {'n_estimators': 779, 'learning_rate': 0.12091164730103414, 'max_depth': 3, 'min_child_weight': 17.44965698684365, 'subsample': 0.8941271751308666, 'colsample_bytree': 0.9279196988857042, 'gamma': 7.180196797879397, 'reg_alpha': 0.20173551814481638, 'reg_lambda': 3.4263394696660443}. Best is trial 0 with value: 0.9199892063180495.


[I 2025-11-21 10:18:16,424] Trial 15 finished with value: 0.9204140396169262 and parameters: {'n_estimators': 606, 'learning_rate': 0.1506905532062914, 'max_depth': 3, 'min_child_weight': 11.413058243377595, 'subsample': 0.6562368784847591, 'colsample_bytree': 0.9299779558517239, 'gamma': 2.665311301551456, 'reg_alpha': 1.1942778916743546, 'reg_lambda': 3.235757099385635}. Best is trial 15 with value: 0.9204140396169262.


[I 2025-11-21 10:18:34,084] Trial 16 finished with value: 0.9194880311201591 and parameters: {'n_estimators': 551, 'learning_rate': 0.19943985984037185, 'max_depth': 5, 'min_child_weight': 11.692213428984658, 'subsample': 0.6537547909869371, 'colsample_bytree': 0.9490652510080779, 'gamma': 2.6156347263800503, 'reg_alpha': 0.5972962670549318, 'reg_lambda': 3.8341164863884996}. Best is trial 15 with value: 0.9204140396169262.


[I 2025-11-21 10:18:46,581] Trial 17 finished with value: 0.9192977953967444 and parameters: {'n_estimators': 606, 'learning_rate': 0.15060931484887527, 'max_depth': 6, 'min_child_weight': 7.316829090609721, 'subsample': 0.6688109864558928, 'colsample_bytree': 0.8814967031372991, 'gamma': 6.826218022109943, 'reg_alpha': 3.2128591178460555, 'reg_lambda': 1.611472209611733}. Best is trial 15 with value: 0.9204140396169262.


[I 2025-11-21 10:18:58,173] Trial 18 finished with value: 0.9203337922224383 and parameters: {'n_estimators': 523, 'learning_rate': 0.14752958989871384, 'max_depth': 4, 'min_child_weight': 11.16175947466123, 'subsample': 0.6538332566830977, 'colsample_bytree': 0.9620342477734285, 'gamma': 3.4271927650045337, 'reg_alpha': 1.7116389255615991, 'reg_lambda': 0.867124467158682}. Best is trial 15 with value: 0.9204140396169262.


[I 2025-11-21 10:19:08,362] Trial 19 finished with value: 0.9177162328625884 and parameters: {'n_estimators': 531, 'learning_rate': 0.14485346660670334, 'max_depth': 4, 'min_child_weight': 10.097534228540452, 'subsample': 0.6464548130602101, 'colsample_bytree': 0.7854751014196992, 'gamma': 9.762249953688972, 'reg_alpha': 1.8761893353369314, 'reg_lambda': 0.2670046189742594}. Best is trial 15 with value: 0.9204140396169262.


[I 2025-11-21 10:19:18,823] Trial 20 finished with value: 0.9205079203121963 and parameters: {'n_estimators': 393, 'learning_rate': 0.1774198577233517, 'max_depth': 4, 'min_child_weight': 5.934510979845253, 'subsample': 0.7166234678306819, 'colsample_bytree': 0.9624917866152527, 'gamma': 0.9810574551378264, 'reg_alpha': 0.845967560392323, 'reg_lambda': 0.8447888751678477}. Best is trial 20 with value: 0.9205079203121963.


[I 2025-11-21 10:19:28,251] Trial 21 finished with value: 0.9204626942954238 and parameters: {'n_estimators': 392, 'learning_rate': 0.1770559186562281, 'max_depth': 4, 'min_child_weight': 7.264164803336599, 'subsample': 0.7128858941876535, 'colsample_bytree': 0.9546257523553244, 'gamma': 1.0097379117054552, 'reg_alpha': 0.7337362650925596, 'reg_lambda': 0.6880598356274032}. Best is trial 20 with value: 0.9205079203121963.


[I 2025-11-21 10:19:36,799] Trial 22 finished with value: 0.9199908824008048 and parameters: {'n_estimators': 387, 'learning_rate': 0.17943789262109835, 'max_depth': 3, 'min_child_weight': 6.225637371732132, 'subsample': 0.7257301774182532, 'colsample_bytree': 0.8849128933670227, 'gamma': 1.1227258656810886, 'reg_alpha': 0.6728524533111231, 'reg_lambda': 0.7062760263899501}. Best is trial 20 with value: 0.9205079203121963.


Best trial: 20. Best value: 0.920508:  96%|█████████▌| 24/25 [08:19<00:09,  9.97s/it]

[I 2025-11-21 10:19:44,653] Trial 23 finished with value: 0.9200722395415236 and parameters: {'n_estimators': 296, 'learning_rate': 0.17340482009159533, 'max_depth': 5, 'min_child_weight': 2.182307344402872, 'subsample': 0.7695503956238315, 'colsample_bytree': 0.9528204307989638, 'gamma': 1.3330600630603175, 'reg_alpha': 0.6852001004569843, 'reg_lambda': 1.3095444508670013}. Best is trial 20 with value: 0.9205079203121963.


Best trial: 24. Best value: 0.920509: 100%|██████████| 25/25 [08:29<00:00, 20.38s/it]

[I 2025-11-21 10:19:54,785] Trial 24 finished with value: 0.9205090635668548 and parameters: {'n_estimators': 410, 'learning_rate': 0.1984083815625296, 'max_depth': 4, 'min_child_weight': 8.228745664724588, 'subsample': 0.693540964450998, 'colsample_bytree': 0.8713661566694098, 'gamma': 1.392901044000212, 'reg_alpha': 0.9683744086910002, 'reg_lambda': 0.5396653156362508}. Best is trial 24 with value: 0.9205090635668548.
Best ROC-AUC: 0.9205090635668548
Best Hyperparameters: {'n_estimators': 410, 'learning_rate': 0.1984083815625296, 'max_depth': 4, 'min_child_weight': 8.228745664724588, 'subsample': 0.693540964450998, 'colsample_bytree': 0.8713661566694098, 'gamma': 1.392901044000212, 'reg_alpha': 0.9683744086910002, 'reg_lambda': 0.5396653156362508}





In [23]:
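# Make an explicit alias so the name is clear.
# Caution: `study` must still refer to the XGBoost study here. If a later
# LightGBM tuning cell reassigned it, XGBoost receives LightGBM-only
# parameters (e.g. `num_leaves`) and warns that they are not used.
study_xgb = study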

best_xgb_params = study_xgb.best_trial.params

best_xgb = XGBClassifier(
    **best_xgb_params,
    objective='binary:logistic',
    eval_metric='logloss',
    random_state=42,
    n_jobs=-1,
    tree_method='hist'
)

best_xgb.fit(X_train, y_train)


Parameters: { "lambda_l1", "lambda_l2", "min_child_samples", "num_leaves" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


XGBClassifier(objective='binary:logistic', colsample_bytree=0.7756669726085643, enable_categorical=False, ...)
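The "are not used" warning in the output above names four parameters that belong to LightGBM rather than XGBoost. A minimal sketch of a guard that separates them before the dictionary is unpacked into `XGBClassifier` is shown here. The helper name `split_xgb_params` and the example dictionary are hypothetical; the set of names is copied from the warning and is not an exhaustive list of LightGBM-only parameters.

```python
# LightGBM-only names copied from the XGBoost warning; extend as needed.
LGBM_ONLY_PARAMS = {"lambda_l1", "lambda_l2", "min_child_samples", "num_leaves"}


def split_xgb_params(params):
    """Return (kept, dropped): parameters to pass on, and LightGBM-only names."""
    kept = {k: v for k, v in params.items() if k not in LGBM_ONLY_PARAMS}
    dropped = sorted(k for k in params if k in LGBM_ONLY_PARAMS)
    return kept, dropped


# Hypothetical mixed dictionary, for illustration only.
example_params = {"n_estimators": 410, "max_depth": 4, "num_leaves": 31, "lambda_l1": 0.1}
kept, dropped = split_xgb_params(example_params)
print("kept:", kept)
print("dropped:", dropped)
```

A non-empty `dropped` list is a hint that the dictionary came from the wrong study, so it is worth checking which study produced it rather than silently discarding the extras.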


# 🧪 Model Blending Overview

Model blending is a simple but powerful ensembling strategy where predictions from multiple strong models are **combined** to produce a single, more stable output. Because different algorithms capture different patterns in the data, a blend often performs better than any single model on its own.

For this project, we blend the two best-performing tuned models:

- **LightGBM (tuned)**: best ROC-AUC **0.92249**
- **XGBoost (tuned)**: best ROC-AUC **0.92034**

Both models are gradient-boosted tree algorithms, but they differ in tree-growth strategy (LightGBM grows trees leaf-wise by default, XGBoost depth-wise), regularization behavior, and sensitivity to feature interactions. As a result, each captures slightly different structure in the data.

By averaging their predicted probabilities, we:

- Reduce variance and stabilize predictions
- Smooth out model-specific biases
- Capture patterns that only one model may detect
- Often gain a small but meaningful bump in ROC-AUC

This approach is widely used in Kaggle competitions because it's inexpensive, easy to maintain, and often effective.

## 🧮 Blending Strategy

We implement a **soft-voting ensemble** using a simple weighted average:

- **LightGBM weight:** 0.60
- **XGBoost weight:** 0.40

These weights reflect the relative performance of each tuned model while still allowing both to contribute meaningfully to the final output.
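To make the weighted average concrete, here is a small standard-library sketch on toy data. The helper names `roc_auc` and `weighted_blend` and all of the numbers are illustrative assumptions, not values from this project; in the notebook itself `roc_auc_score` from scikit-learn plays the role of `roc_auc`.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_blend(probs_a, probs_b, weight_a):
    """Soft-voting blend: weight_a on model A, the remainder on model B."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be in [0, 1]")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(probs_a, probs_b)]


# Toy labels and probabilities (hypothetical, not project data).
y_toy = [0, 0, 1, 1, 0, 1]
probs_a = [0.20, 0.60, 0.55, 0.90, 0.30, 0.70]  # stands in for LightGBM
probs_b = [0.40, 0.30, 0.80, 0.60, 0.65, 0.75]  # stands in for XGBoost

blend = weighted_blend(probs_a, probs_b, weight_a=0.60)
print("AUC A:    ", roc_auc(y_toy, probs_a))
print("AUC B:    ", roc_auc(y_toy, probs_b))
print("AUC blend:", roc_auc(y_toy, blend))
```

On this toy data each model misranks one of the nine positive/negative pairs, but they misrank different pairs, so the blend ranks all nine correctly. Real gains are usually much smaller and are not guaranteed.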

## 🛠️ Code Implementation

```python
from sklearn.metrics import roc_auc_score

# Predict positive-class probabilities with both tuned models
lgbm_probs = best_lgbm.predict_proba(X_val)[:, 1]
xgb_probs = best_xgb.predict_proba(X_val)[:, 1]

# Weighted blend (soft voting)
blend_probs = 0.60 * lgbm_probs + 0.40 * xgb_probs

# Evaluate the blended predictions on the hold-out set
blend_auc = roc_auc_score(y_val, blend_probs)
blend_auc
```


## 🔧 Step 1: Generate Validation Predictions

In [24]:
# Global hold-out validation set for blending
X_train_blend, X_val, y_train_blend, y_val = train_test_split(
    X_train,
    y_train,
    test_size=0.2,
    random_state=42,
    stratify=y_train
)
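The `stratify=y_train` argument keeps the class ratio the same in both partitions, which matters here because the classes are imbalanced. The following standard-library sketch illustrates the idea only; it is not scikit-learn's implementation, and the function name `stratified_split` and the 20/80 toy labels are assumptions made for the example.

```python
import random


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx), splitting each class at roughly `test_size`."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                       # shuffle within the class
        n_test = round(len(idxs) * test_size)   # same fraction for every class
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Imbalanced toy labels: 200 negatives, 800 positives (hypothetical).
labels = [0] * 200 + [1] * 800
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=42)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"positive rate, train: {train_rate:.3f}  test: {test_rate:.3f}")
```

Both partitions end up with the same positive rate, so AUC on the hold-out set is measured on the same class mix the models were trained on.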

In [25]:
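# Preliminary predictions: the models are still fit on the full training data
# here, so these values are recomputed after the refit on the new split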
lgbm_val_probs = best_lgbm.predict_proba(X_val)[:, 1]
xgb_val_probs = best_xgb.predict_proba(X_val)[:, 1]


In [26]:
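# ----------------------------------------------
# Retrain the Best Models Using the New Split
# ----------------------------------------------
# Only XGBoost is refit in this cell. If best_lgbm was fit on the full
# X_train, it has already seen the X_val rows and its validation AUC will be
# optimistic; refit it on X_train_blend as well before comparing the models.

best_xgb.fit(X_train_blend, y_train_blend)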


Parameters: { "lambda_l1", "lambda_l2", "min_child_samples", "num_leaves" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


XGBClassifier(objective='binary:logistic', colsample_bytree=0.7756669726085643, enable_categorical=False, ...)


In [27]:
lgbm_val_probs = best_lgbm.predict_proba(X_val)[:, 1]
xgb_val_probs = best_xgb.predict_proba(X_val)[:, 1]


In [28]:
# ----------------------------------
# Try 50/50 Blend
# ----------------------------------

blend_50_50 = (lgbm_val_probs + xgb_val_probs) / 2


In [29]:
# Evaluate blend
auc_blend_50_50 = roc_auc_score(y_val, blend_50_50)
print("Blend AUC (50/50):", auc_blend_50_50)


Blend AUC (50/50): 0.9247894773483247


In [30]:
best_auc = 0
best_w   = None

for w in np.linspace(0, 1, 21):  # 0.00, 0.05, ..., 1.00
    blended = w * lgbm_val_probs + (1 - w) * xgb_val_probs
    auc = roc_auc_score(y_val, blended)

    if auc > best_auc:
        best_auc = auc
        best_w = w

print(f"Best Blend Weight: {best_w:.2f}")
print(f"Best Blend AUC:   {best_auc:.6f}")


Best Blend Weight: 1.00
Best Blend AUC:   0.927604
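A best weight of 1.00 means the search found no mix that beats LightGBM alone on this hold-out set, so the "blend" reduces to a single model. A quick way to check whether blending helps at all is to compare the best blended AUC with the better of the two individual AUCs. The sketch below does this on toy data using only the standard library; `roc_auc`, `search_blend_weight`, and the numbers are illustrative assumptions, not project code or results.

```python
def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def search_blend_weight(y_true, probs_a, probs_b, n_steps=21):
    """Grid-search the weight on model A; return (best_weight, best_auc)."""
    best_w, best_auc = None, -1.0
    for i in range(n_steps):
        w = i / (n_steps - 1)  # 0.00, 0.05, ..., 1.00 for n_steps=21
        blended = [w * a + (1 - w) * b for a, b in zip(probs_a, probs_b)]
        auc = roc_auc(y_true, blended)
        if auc > best_auc:     # strict '>' keeps the first weight on ties
            best_w, best_auc = w, auc
    return best_w, best_auc


# Toy data (hypothetical): the two models make different ranking mistakes.
y_toy = [0, 0, 1, 1, 0, 1]
probs_a = [0.20, 0.60, 0.55, 0.90, 0.30, 0.70]
probs_b = [0.40, 0.30, 0.80, 0.60, 0.65, 0.75]

best_w, best_auc = search_blend_weight(y_toy, probs_a, probs_b)
single_best = max(roc_auc(y_toy, probs_a), roc_auc(y_toy, probs_b))
gain = best_auc - single_best
print(f"best weight: {best_w:.2f}  blend AUC: {best_auc:.4f}  gain: {gain:.4f}")
```

When `gain` is zero and the best weight sits at 0 or 1, as in the run above, the simplest explanation is that one model dominates on this split. Note also that a weight chosen and scored on the same hold-out set is slightly optimistic; selecting it on one fold and reporting on another gives a fairer estimate.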


In [31]:
# Final blend with the optimal weight from the search above.
# A weight of 1.00 puts everything on LightGBM, so the "blend" is LightGBM alone.
best_w = 1.00

final_val_probs = best_w * lgbm_val_probs + (1 - best_w) * xgb_val_probs


In [32]:

auc_final = roc_auc_score(y_val, final_val_probs)
print("Final Blended ROC-AUC:", auc_final)


Final Blended ROC-AUC: 0.9276036200742114


In [33]:
# Convert Probabilities to Class Predictions (0/1)
# Use 0.5 threshold for now (fine tune threshold later)

final_preds = (final_val_probs >= 0.5).astype(int)
print(classification_report(y_val, final_preds))


              precision    recall  f1-score   support

         0.0       0.89      0.62      0.73     23900
         1.0       0.91      0.98      0.95     94899

    accuracy                           0.91    118799
   macro avg       0.90      0.80      0.84    118799
weighted avg       0.91      0.91      0.90    118799
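The report shows the cost of the default 0.5 threshold on imbalanced data: recall is 0.98 for class 1 but only 0.62 for class 0. One way to tune the threshold is to sweep candidate values and keep the one that maximizes a metric that weights both classes equally. The sketch below uses balanced accuracy, which is an assumption on my part (the notebook does not say which criterion it will use), and runs on hypothetical toy data with only the standard library.

```python
def balanced_accuracy(y_true, probs, threshold):
    """Mean of recall on class 1 and recall on class 0 at the given threshold."""
    tp = fn = tn = fp = 0
    for y, p in zip(y_true, probs):
        pred = 1 if p >= threshold else 0
        if y == 1:
            tp += pred
            fn += 1 - pred
        else:
            fp += pred
            tn += 1 - pred
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def best_threshold(y_true, probs, thresholds):
    """Return the first threshold with the highest balanced accuracy."""
    return max(thresholds, key=lambda t: balanced_accuracy(y_true, probs, t))


# Toy data (hypothetical): 3 negatives, 7 positives, scores skewed high.
y_toy = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
p_toy = [0.30, 0.55, 0.62, 0.58, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95]

candidates = [0.40, 0.50, 0.60, 0.65, 0.70, 0.80]
best_t = best_threshold(y_toy, p_toy, candidates)
print("balanced accuracy at 0.50:", round(balanced_accuracy(y_toy, p_toy, 0.50), 3))
print("best threshold:", best_t,
      "balanced accuracy:", round(balanced_accuracy(y_toy, p_toy, best_t), 3))
```

Raising the threshold trades some class 1 recall for better class 0 recall. As with the blend weight, the threshold should be chosen on data that is not also used to report the final score.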

