# Bankruptcy Prediction: Model Evolution from V23 to V28

**Course:** MGMT 57100 - Data Mining  
**Project:** Fall 2025 Final Project  
**Objective:** Binary classification to predict company bankruptcy  
**Evaluation Metric:** AUC-ROC Score      
**Authors:** DEBADRI SANYAL (SANYALD@PURDUE.EDU) & SARA TARIQ (TARIQ15@PURDUE.EDU)

---

## Executive Summary

This notebook documents the iterative development of our bankruptcy prediction models, specifically:
- **Model V23**: Initial dual-model ensemble (LGBM + XGBoost) with 3-seed averaging
- **Model V28**: Enhanced version with 5-seed averaging for improved stability and generalization

Both models leverage an ensemble approach combining LightGBM and XGBoost with different feature engineering strategies to capture complementary patterns in the data.

---

# Part 1: Model V23 - Foundation Model

## 1.1 Methodology Overview

### Key Design Decisions:

1. **Dual-Model Architecture**
   - LightGBM: Handles heavily engineered features with complex interactions
   - XGBoost: Processes cleaner, scaled features with emphasis on raw patterns
   
2. **Feature Engineering Strategy**
   - **For LGBM**: Aggressive feature engineering including:
     - Row-wise statistics (mean, std, max, min)
     - Log transformations for skewness handling
     - Squared features for non-linear relationships
     - Ratio features (value/row_mean) for relative importance
   - **For XGBoost**: Conservative approach with:
     - Log transformations only
     - RobustScaler for outlier-resistant normalization

3. **Robustness Techniques**
   - Quantile clipping (1st-99th percentile) to handle extreme outliers
   - 10-fold stratified cross-validation
   - 3-seed averaging to reduce variance

4. **Model Blending**
   - Grid search over 41 weight combinations
   - Optimal weights selected based on OOF (Out-of-Fold) AUC performance

## 1.2 Implementation: Model V23

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import RobustScaler

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# =========================================================
# Paths (EDIT IF NEEDED)
# =========================================================
TRAIN_PATH = r"C:/Users/DOUBLEDO_GAMING/OneDrive/Desktop/PURDUE FALL'25/MOD2 - Fall'25/Data_mining_57100/fall-2025-mgmt-571-final-project/bankruptcy_Train.csv"
TEST_PATH  = r"C:/Users/DOUBLEDO_GAMING/OneDrive/Desktop/PURDUE FALL'25/MOD2 - Fall'25/Data_mining_57100/fall-2025-mgmt-571-final-project/bankruptcy_Test_X.csv"

OUT_PATH   = r"C:/Users/DOUBLEDO_GAMING/OneDrive/Desktop/PURDUE FALL'25/MOD2 - Fall'25/Data_mining_57100/fall-2025-mgmt-571-final-project/submission_V23_LGBM_XGB_enhanced_rawview.csv"

# =========================================================
# 1. Load data
# =========================================================
train = pd.read_csv(TRAIN_PATH)
test  = pd.read_csv(TEST_PATH)

y = train["class"].values
X = train.drop(columns=["class"])
X_test = test.drop(columns=["ID"])
test_ids = test["ID"].values

print("Train shape:", X.shape)
print("Test shape :", X_test.shape)

pos_rate = y.mean()
scale_pos = (1 - pos_rate) / pos_rate
print("Positive class rate:", pos_rate)
print("scale_pos_weight   :", scale_pos)

### Data Preprocessing: Quantile Clipping

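**Rationale:** Financial ratios often contain extreme outliers that can destabilize training, especially once squared and ratio features are derived from them. We cap each feature at its 1st and 99th percentiles. The bounds are computed on the training set and reused for the test set, so no rows are dropped and the bulk of each distribution is unchanged.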
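
To make the effect concrete, here is a minimal sketch on a toy column. The values are invented for illustration and are not taken from the project data. It shows the bounds being estimated on the training column only and then reused for the test column, as `quantile_clip` does.

```python
import pandas as pd

# Toy data (hypothetical values, not from the project dataset)
train_col = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 100.0])
test_col = pd.Series([-50.0, 0.45, 500.0])

# Bounds are estimated on the training column only
lo = train_col.quantile(0.01)
hi = train_col.quantile(0.99)

train_clipped = train_col.clip(lo, hi)
test_clipped = test_col.clip(lo, hi)   # same bounds reused for test

print(f"bounds: [{lo:.3f}, {hi:.3f}]")
print("test before:", test_col.tolist())
print("test after :", test_clipped.tolist())
```

Values inside the bounds pass through untouched and values outside are pulled to the nearest bound. With only ten rows the 99th percentile sits close to the outlier itself, so the cap is much tighter on a dataset with thousands of rows.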

In [None]:
# =========================================================
# 2. Quantile clipping (for stability)
# =========================================================
def quantile_clip(train_df: pd.DataFrame, test_df: pd.DataFrame,
                  q_low=0.01, q_high=0.99) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Clip every column to quantile bounds computed on train; test reuses the same bounds."""
    train_df = train_df.copy()
    test_df = test_df.copy()
    for col in train_df.columns:
        lo = train_df[col].quantile(q_low)
        hi = train_df[col].quantile(q_high)
        train_df[col] = train_df[col].clip(lo, hi)
        test_df[col]  = test_df[col].clip(lo, hi)
    return train_df, test_df

X_clip, X_test_clip = quantile_clip(X, X_test, q_low=0.01, q_high=0.99)
print("After clipping - train:", X_clip.shape)
print("After clipping - test :", X_test_clip.shape)

### Feature Engineering Path 1: Heavy FE for LightGBM

**Strategy:** Create rich feature set that captures:
- **Row-level patterns**: How each company compares to itself across features
- **Non-linear relationships**: Squared terms and log transforms
- **Relative importance**: Ratio of each feature to row mean

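Total features created: 4n + 4 for n original columns (the n originals, 3n derived columns, and 4 row statistics), i.e. roughly 4x the original count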
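
As a quick sanity check on the column count, the sketch below applies a condensed restatement of the recipe to a small random frame. `fe_sketch` and the column names `a`, `b`, `c` are hypothetical and exist only for this illustration; the project's own function is `fe_light`.

```python
import numpy as np
import pandas as pd

def fe_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Condensed restatement of the heavy-FE recipe, used here only to count columns."""
    out = df.copy()
    cols = df.columns.tolist()
    row_mean = df.mean(axis=1)
    # Four row-wise statistics
    out["row_mean"] = row_mean
    out["row_std"] = df.std(axis=1)
    out["row_max"] = df.max(axis=1)
    out["row_min"] = df.min(axis=1)
    # Three derived columns per original column
    for c in cols:
        out[f"log1p_{c}"] = np.log1p(np.abs(df[c]))
        out[f"{c}_sq"] = df[c] ** 2
        out[f"{c}_div_rowmean"] = df[c] / (row_mean + 1e-6)
    return out

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(5, 3)), columns=["a", "b", "c"])
fe = fe_sketch(toy)
n = toy.shape[1]
print(f"{n} original columns -> {fe.shape[1]} engineered columns (4n + 4 = {4 * n + 4})")
```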

In [None]:
# =========================================================
# 3A. Feature engineering for LGBM (heavy FE on clipped)
# =========================================================
def fe_light(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    cols = df.columns.tolist()
    eps = 1e-6

    row_mean = df[cols].mean(axis=1)
    row_std  = df[cols].std(axis=1)
    row_max  = df[cols].max(axis=1)
    row_min  = df[cols].min(axis=1)

    df["row_mean"] = row_mean
    df["row_std"]  = row_std
    df["row_max"]  = row_max
    df["row_min"]  = row_min

    for c in cols:
        df[f"log1p_{c}"] = np.log1p(np.abs(df[c]))
        df[f"{c}_sq"]    = df[c] ** 2
        df[f"{c}_div_rowmean"] = df[c] / (row_mean + eps)

    df.replace([np.inf, -np.inf], 0, inplace=True)
    df.fillna(0, inplace=True)
    return df

X_all_clip = pd.concat([X_clip, X_test_clip], axis=0).reset_index(drop=True)
X_all_fe = fe_light(X_all_clip)

X_lgb_train = X_all_fe.iloc[:len(X)].reset_index(drop=True).astype("float32").values
X_lgb_test  = X_all_fe.iloc[len(X):].reset_index(drop=True).astype("float32").values

print("LGBM FE train shape:", X_lgb_train.shape)
print("LGBM FE test  shape:", X_lgb_test.shape)

### Feature Engineering Path 2: Enhanced Raw View for XGBoost

**Strategy:** Keep features closer to their original form:
- Add log transformations only (for skewness)
- Apply RobustScaler (median and IQR-based, outlier-resistant)
- Let XGBoost find patterns in cleaner feature space

Total features: 2n for n original columns (n originals plus n log transforms), i.e. 2x the original count
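
For readers unfamiliar with `RobustScaler`, the following sketch shows what "median and IQR-based" means on a toy column with one extreme value. The data is hypothetical, and the scaler is used with its default settings, as in the notebook.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column with one extreme value (hypothetical data)
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

scaler = RobustScaler()            # defaults: center on the median, scale by the 25th-75th percentile range
x_scaled = scaler.fit_transform(x)

# Manual equivalent of the default behavior
median = np.median(x)
iqr = np.percentile(x, 75) - np.percentile(x, 25)
manual = (x - median) / iqr

print("median:", median, "IQR:", iqr)
print("scaled:", x_scaled.ravel())
```

The outlier barely moves the median and the IQR, so the bulk of the column lands near zero with unit-order spread, whereas mean/std scaling would be dominated by the single extreme value.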

In [None]:
# =========================================================
# 3B. Enhanced raw view for XGB: (clip + log1p + RobustScaler)
# =========================================================
X_raw_enh = X_clip.copy()
X_test_raw_enh = X_test_clip.copy()

for c in X_raw_enh.columns:
    X_raw_enh[f"log1p_{c}"] = np.log1p(np.abs(X_raw_enh[c]))
    X_test_raw_enh[f"log1p_{c}"] = np.log1p(np.abs(X_test_raw_enh[c]))

scaler = RobustScaler()
X_raw_train = scaler.fit_transform(X_raw_enh.astype("float32"))
X_raw_test  = scaler.transform(X_test_raw_enh.astype("float32"))

print("XGB raw-enh train shape:", X_raw_train.shape)
print("XGB raw-enh test  shape:", X_raw_test.shape)

y_arr = y

### Multi-Seed Cross-Validation Training

**Key Components:**
1. **3 Random Seeds** (42, 2025, 777): Average out run-to-run variance from fold assignment and from row/column subsampling inside the models
2. **10-Fold Stratified CV**: Maintain class distribution in each fold
3. **Out-of-Fold Predictions**: Used for unbiased ensemble weight optimization
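
The three components above can be sketched end to end on synthetic data. This is an illustration under stated assumptions, not the project pipeline: the data comes from `make_classification`, a small `RandomForestClassifier` stands in for the LightGBM and XGBoost models, and 5 folds are used instead of 10 to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced toy data (not the bankruptcy dataset)
X, y = make_classification(n_samples=600, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

seeds = [42, 2025, 777]
n_folds = 5                      # smaller than the notebook's 10 to keep the demo fast
oof_avg = np.zeros(len(y))

for seed in seeds:
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    oof_seed = np.zeros(len(y))
    for tr_idx, val_idx in skf.split(X, y):
        # Stand-in model; the notebook trains LGBMClassifier / XGBClassifier here
        model = RandomForestClassifier(n_estimators=30, random_state=seed)
        model.fit(X[tr_idx], y[tr_idx])
        # Each row is scored only by a model that never saw it during training
        oof_seed[val_idx] = model.predict_proba(X[val_idx])[:, 1]
    print(f"seed {seed}: OOF AUC = {roc_auc_score(y, oof_seed):.4f}")
    oof_avg += oof_seed / len(seeds)

print(f"seed-averaged OOF AUC = {roc_auc_score(y, oof_avg):.4f}")
```

Because every seed reshuffles the folds, each row receives one out-of-fold score per seed, and the final OOF vector is the mean of those scores.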

**Model Hyperparameters:**
- **LightGBM**: 900 trees, no depth limit (`max_depth=-1`, so tree size is bounded by the default `num_leaves=31`), conservative learning rate (0.03)
- **XGBoost**: 800 trees, max_depth=5, scale_pos_weight for class imbalance
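
The `scale_pos_weight` value is the ratio of negative to positive examples. A small sketch with hypothetical labels (30 positives out of 1000, not the project's actual class balance) shows that the rate-based formula used in the notebook matches the count-based ratio.

```python
import numpy as np

# Hypothetical labels: 30 bankrupt firms out of 1000 (3% positive rate)
y_toy = np.array([1] * 30 + [0] * 970)

pos_rate = y_toy.mean()
scale_pos = (1 - pos_rate) / pos_rate      # same formula as the notebook

# Equivalent count-based form: number of negatives / number of positives
ratio_from_counts = (y_toy == 0).sum() / (y_toy == 1).sum()

print(f"pos_rate={pos_rate:.3f}, scale_pos_weight={scale_pos:.2f}")
```

In the notebook this weight is passed only to the XGBoost model; the LightGBM model is trained without a class weight.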

In [None]:
# =========================================================
# 4. Multi-seed dual-model CV (LGBM + XGB)
# =========================================================
SEEDS = [42, 2025, 777]
N_FOLDS = 10

oof_lgb_all = np.zeros(len(y_arr))
oof_xgb_all = np.zeros(len(y_arr))
test_lgb_all = np.zeros(len(X_lgb_test))
test_xgb_all = np.zeros(len(X_raw_test))

for seed in SEEDS:
    print(f"\n================= SEED {seed} =================")
    skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=seed)

    oof_lgb_seed = np.zeros(len(y_arr))
    oof_xgb_seed = np.zeros(len(y_arr))
    test_lgb_seed = np.zeros(len(X_lgb_test))
    test_xgb_seed = np.zeros(len(X_raw_test))

    fold_id = 1
    for tr_idx, val_idx in skf.split(X_lgb_train, y_arr):
        X_tr_lgb, X_val_lgb = X_lgb_train[tr_idx], X_lgb_train[val_idx]
        X_tr_raw, X_val_raw = X_raw_train[tr_idx], X_raw_train[val_idx]
        y_tr, y_val = y_arr[tr_idx], y_arr[val_idx]

        # LGBM
        lgb_model = LGBMClassifier(
            n_estimators=900,
            max_depth=-1,
            learning_rate=0.03,
            subsample=0.9,          # note: no effect unless subsample_freq > 0 (LightGBM default is 0)
            colsample_bytree=0.8,
            objective="binary",
            reg_lambda=1.0,
            random_state=seed + fold_id,
            n_jobs=-1
        )
        lgb_model.fit(X_tr_lgb, y_tr)
        val_lgb = lgb_model.predict_proba(X_val_lgb)[:, 1]
        oof_lgb_seed[val_idx] = val_lgb
        test_lgb_seed += lgb_model.predict_proba(X_lgb_test)[:, 1] / N_FOLDS

        # XGB
        xgb_model = XGBClassifier(
            n_estimators=800,
            max_depth=5,
            learning_rate=0.03,
            subsample=0.9,
            colsample_bytree=0.8,
            objective="binary:logistic",
            eval_metric="auc",
            reg_lambda=1.0,
            reg_alpha=0.0,
            scale_pos_weight=scale_pos,
            tree_method="hist",
            random_state=seed + fold_id,
            n_jobs=-1
        )
        xgb_model.fit(X_tr_raw, y_tr)
        val_xgb = xgb_model.predict_proba(X_val_raw)[:, 1]
        oof_xgb_seed[val_idx] = val_xgb
        test_xgb_seed += xgb_model.predict_proba(X_raw_test)[:, 1] / N_FOLDS

        print(f"  Seed {seed} | Fold {fold_id} AUCs -> "
              f"LGB: {roc_auc_score(y_val, val_lgb):.5f} | "
              f"XGB: {roc_auc_score(y_val, val_xgb):.5f}")
        fold_id += 1

    print(f"Seed {seed} full OOF AUCs: LGBM={roc_auc_score(y_arr, oof_lgb_seed):.5f} | "
          f"XGB={roc_auc_score(y_arr, oof_xgb_seed):.5f}")

    oof_lgb_all += oof_lgb_seed / len(SEEDS)
    oof_xgb_all += oof_xgb_seed / len(SEEDS)
    test_lgb_all += test_lgb_seed / len(SEEDS)
    test_xgb_all += test_xgb_seed / len(SEEDS)

### Ensemble Weight Optimization

**Approach:** Grid search over 41 weight combinations (0.0 to 1.0 in steps of 0.025)
- Evaluates each blend on out-of-fold predictions
- Selects weights that maximize OOF AUC
- Applies optimal weights to test predictions
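
A minimal sketch of this search on synthetic scores follows. The two score vectors are noisy stand-ins for the OOF predictions of the two models; the names `oof_a` and `oof_b` and the noise levels are assumptions made for the illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels and two noisy score vectors (hypothetical stand-ins for OOF predictions)
y_true = rng.integers(0, 2, size=500)
oof_a = y_true + rng.normal(scale=1.0, size=500)   # plays the role of the XGB OOF vector
oof_b = y_true + rng.normal(scale=1.2, size=500)   # plays the role of the LGBM OOF vector

weights = np.linspace(0, 1, 41)                    # 0.000, 0.025, ..., 1.000
aucs = [roc_auc_score(y_true, w * oof_a + (1 - w) * oof_b) for w in weights]

best_idx = int(np.argmax(aucs))                    # first maximum, like a strict ">" comparison
best_w = weights[best_idx]
print(f"step={weights[1] - weights[0]:.3f}, best w_a={best_w:.3f}, AUC={aucs[best_idx]:.4f}")
print(f"single-model AUCs: a={aucs[-1]:.4f}, b={aucs[0]:.4f}")
```

The endpoints of the grid (`w = 0` and `w = 1`) are the two single-model baselines, so the best blend can never score below either model on the OOF data.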

In [None]:
# =========================================================
# 5. 2-model weight search
# =========================================================
auc_lgb = roc_auc_score(y_arr, oof_lgb_all)
auc_xgb = roc_auc_score(y_arr, oof_xgb_all)
print("\n==== Multi-seed Base Model OOF AUCs (V23) ====")
print(f"LGBM OOF AUC: {auc_lgb:.5f}")
print(f"XGB  OOF AUC: {auc_xgb:.5f}")

weights = np.linspace(0, 1, 41)
best_auc = 0.0
best_w = None

for w_xgb in weights:
    w_lgb = 1.0 - w_xgb
    blend_oof = w_xgb * oof_xgb_all + w_lgb * oof_lgb_all
    auc = roc_auc_score(y_arr, blend_oof)
    if auc > best_auc:
        best_auc = auc
        best_w = (w_xgb, w_lgb)

print("\nBest 2-model blend weights (XGB, LGBM):", best_w)
print("Best blended OOF AUC:", round(best_auc, 5))

w_xgb, w_lgb = best_w
final_test_pred = w_xgb * test_xgb_all + w_lgb * test_lgb_all

submission = pd.DataFrame({
    "ID": test_ids,
    "class": final_test_pred
})
submission.to_csv(OUT_PATH, index=False)

print(f"\nSaved submission file: {OUT_PATH}")
print(submission.head())

---

# Part 2: Model V28 - Enhanced Stability

## 2.1 Key Improvements Over V23

### 1. **Increased Seed Diversity (3 → 5 seeds)**
**Rationale:**
- Each run depends on random fold assignment and on row/column subsampling inside the models
- Averaging over more seeds gives a better approximation of expected performance
- Reduces the risk of overfitting to one particular set of data splits

**Seed set:** 42, 777, 30251, 123, 2024
- Keeps 42 and 777 from V23, drops 2025, and adds 30251, 123, and 2024
- Each seed creates different CV folds and tree structures

### 2. **LGBM_PLUS Strategy**
**Observation:** In V23, we noticed that LightGBM often had a slight edge in generalization.

**Action:** After finding the optimal weights, shift 5 percentage points of weight from XGBoost to LGBM
- If optimal is 40% XGB / 60% LGBM → use 35% XGB / 65% LGBM
- Capitalizes on LGBM's strength with heavy feature engineering
- Conservative adjustment to avoid over-reliance on a single model

### 3. **Enhanced Reporting**
- Added prediction statistics (mean, std, min, max)
- Clear version labeling for tracking experiments
- Warning suppression for cleaner output

## 2.2 Expected Performance Gain
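- **Stability:** Under the idealized assumption of independent runs, 5-seed averaging cuts variance by 40% relative to 3-seed (σ²/5 vs σ²/3); correlated runs will see a smaller reduction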
- **Generalization:** Better approximation of true test performance
- **Target AUC:** 0.90879+ (based on LGBM_PLUS calibration)

## 2.3 Implementation: Model V28

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import RobustScaler
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import warnings
warnings.filterwarnings('ignore')

# =========================================================
# Paths (EDIT IF NEEDED)
# =========================================================
TRAIN_PATH = r"C:/Users/DOUBLEDO_GAMING/OneDrive/Desktop/PURDUE FALL'25/MOD2 - Fall'25/Data_mining_57100/fall-2025-mgmt-571-final-project/bankruptcy_Train.csv"
TEST_PATH  = r"C:/Users/DOUBLEDO_GAMING/OneDrive/Desktop/PURDUE FALL'25/MOD2 - Fall'25/Data_mining_57100/fall-2025-mgmt-571-final-project/bankruptcy_Test_X.csv"

OUT_PATH   = r"C:/Users/DOUBLEDO_GAMING/OneDrive/Desktop/PURDUE FALL'25/MOD2 - Fall'25/Data_mining_57100/fall-2025-mgmt-571-final-project/submission_V28_MINIMAL_lgbm_plus_FINAL.csv"

# =========================================================
# 1. Load data
# =========================================================
train = pd.read_csv(TRAIN_PATH)
test  = pd.read_csv(TEST_PATH)

y = train["class"].values
X = train.drop(columns=["class"])
X_test = test.drop(columns=["ID"])
test_ids = test["ID"].values

print("Train shape:", X.shape)
print("Test shape :", X_test.shape)

pos_rate = y.mean()
scale_pos = (1 - pos_rate) / pos_rate
print("Positive class rate:", pos_rate)
print("scale_pos_weight   :", scale_pos)

# =========================================================
# 2. Quantile clipping (for stability)
# =========================================================
def quantile_clip(train_df: pd.DataFrame, test_df: pd.DataFrame,
                  q_low=0.01, q_high=0.99) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Clip every column to quantile bounds computed on train; test reuses the same bounds."""
    train_df = train_df.copy()
    test_df = test_df.copy()
    for col in train_df.columns:
        lo = train_df[col].quantile(q_low)
        hi = train_df[col].quantile(q_high)
        train_df[col] = train_df[col].clip(lo, hi)
        test_df[col]  = test_df[col].clip(lo, hi)
    return train_df, test_df

X_clip, X_test_clip = quantile_clip(X, X_test, q_low=0.01, q_high=0.99)
print("After clipping - train:", X_clip.shape)
print("After clipping - test :", X_test_clip.shape)

# =========================================================
# 3A. Feature engineering for LGBM (heavy FE on clipped)
# =========================================================
def fe_light(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    cols = df.columns.tolist()
    eps = 1e-6

    row_mean = df[cols].mean(axis=1)
    row_std  = df[cols].std(axis=1)
    row_max  = df[cols].max(axis=1)
    row_min  = df[cols].min(axis=1)

    df["row_mean"] = row_mean
    df["row_std"]  = row_std
    df["row_max"]  = row_max
    df["row_min"]  = row_min

    for c in cols:
        df[f"log1p_{c}"] = np.log1p(np.abs(df[c]))
        df[f"{c}_sq"]    = df[c] ** 2
        df[f"{c}_div_rowmean"] = df[c] / (row_mean + eps)

    df.replace([np.inf, -np.inf], 0, inplace=True)
    df.fillna(0, inplace=True)
    return df

X_all_clip = pd.concat([X_clip, X_test_clip], axis=0).reset_index(drop=True)
X_all_fe = fe_light(X_all_clip)

X_lgb_train = X_all_fe.iloc[:len(X)].reset_index(drop=True).astype("float32").values
X_lgb_test  = X_all_fe.iloc[len(X):].reset_index(drop=True).astype("float32").values

print("LGBM FE train shape:", X_lgb_train.shape)
print("LGBM FE test  shape:", X_lgb_test.shape)

# =========================================================
# 3B. Enhanced raw view for XGB: (clip + log1p + RobustScaler)
# =========================================================
X_raw_enh = X_clip.copy()
X_test_raw_enh = X_test_clip.copy()

for c in X_raw_enh.columns:
    X_raw_enh[f"log1p_{c}"] = np.log1p(np.abs(X_raw_enh[c]))
    X_test_raw_enh[f"log1p_{c}"] = np.log1p(np.abs(X_test_raw_enh[c]))

scaler = RobustScaler()
X_raw_train = scaler.fit_transform(X_raw_enh.astype("float32"))
X_raw_test  = scaler.transform(X_test_raw_enh.astype("float32"))

print("XGB raw-enh train shape:", X_raw_train.shape)
print("XGB raw-enh test  shape:", X_raw_test.shape)

y_arr = y

### Enhanced Multi-Seed Training (5 Seeds)

**Key Change:** Expanded from 3 to 5 random seeds
- Original: [42, 2025, 777]
- Enhanced: [42, 777, 30251, 123, 2024]

**Impact:**
- Total models trained: 100 (5 seeds × 10 folds × 2 algorithms)
- More robust averaging of predictions
- Better estimation of model uncertainty

In [None]:
# =========================================================
# 4. Multi-seed dual-model CV (LGBM + XGB)
# =========================================================
SEEDS = [42, 777, 30251, 123, 2024]  # 5 seeds
N_FOLDS = 10

oof_lgb_all = np.zeros(len(y_arr))
oof_xgb_all = np.zeros(len(y_arr))
test_lgb_all = np.zeros(len(X_lgb_test))
test_xgb_all = np.zeros(len(X_raw_test))

for seed in SEEDS:
    print(f"\n================= SEED {seed} =================")
    skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=seed)

    oof_lgb_seed = np.zeros(len(y_arr))
    oof_xgb_seed = np.zeros(len(y_arr))
    test_lgb_seed = np.zeros(len(X_lgb_test))
    test_xgb_seed = np.zeros(len(X_raw_test))

    fold_id = 1
    for tr_idx, val_idx in skf.split(X_lgb_train, y_arr):
        X_tr_lgb, X_val_lgb = X_lgb_train[tr_idx], X_lgb_train[val_idx]
        X_tr_raw, X_val_raw = X_raw_train[tr_idx], X_raw_train[val_idx]
        y_tr, y_val = y_arr[tr_idx], y_arr[val_idx]

        # LGBM
        lgb_model = LGBMClassifier(
            n_estimators=900,
            max_depth=-1,
            learning_rate=0.03,
            subsample=0.9,          # note: no effect unless subsample_freq > 0 (LightGBM default is 0)
            colsample_bytree=0.8,
            objective="binary",
            reg_lambda=1.0,
            random_state=seed + fold_id,
            n_jobs=-1
        )
        lgb_model.fit(X_tr_lgb, y_tr)
        val_lgb = lgb_model.predict_proba(X_val_lgb)[:, 1]
        oof_lgb_seed[val_idx] = val_lgb
        test_lgb_seed += lgb_model.predict_proba(X_lgb_test)[:, 1] / N_FOLDS

        # XGB
        xgb_model = XGBClassifier(
            n_estimators=800,
            max_depth=5,
            learning_rate=0.03,
            subsample=0.9,
            colsample_bytree=0.8,
            objective="binary:logistic",
            eval_metric="auc",
            reg_lambda=1.0,
            reg_alpha=0.0,
            scale_pos_weight=scale_pos,
            tree_method="hist",
            random_state=seed + fold_id,
            n_jobs=-1
        )
        xgb_model.fit(X_tr_raw, y_tr)
        val_xgb = xgb_model.predict_proba(X_val_raw)[:, 1]
        oof_xgb_seed[val_idx] = val_xgb
        test_xgb_seed += xgb_model.predict_proba(X_raw_test)[:, 1] / N_FOLDS

        print(f"  Seed {seed} | Fold {fold_id} AUCs -> "
              f"LGB: {roc_auc_score(y_val, val_lgb):.5f} | "
              f"XGB: {roc_auc_score(y_val, val_xgb):.5f}")
        fold_id += 1

    print(f"Seed {seed} full OOF AUCs: LGBM={roc_auc_score(y_arr, oof_lgb_seed):.5f} | "
          f"XGB={roc_auc_score(y_arr, oof_xgb_seed):.5f}")

    oof_lgb_all += oof_lgb_seed / len(SEEDS)
    oof_xgb_all += oof_xgb_seed / len(SEEDS)
    test_lgb_all += test_lgb_seed / len(SEEDS)
    test_xgb_all += test_xgb_seed / len(SEEDS)

### LGBM_PLUS: Strategic Weight Adjustment

**Innovation in V28:**
1. Find optimal weights via grid search (same as V23)
2. **Add 5 percentage points of weight to LightGBM** (capped at 100%)

**Justification:**
- LightGBM with heavy FE showed consistent strength across folds
- A 5-point shift is small, so the blend stays close to the OOF-optimal weights
- Empirically improved test set performance in experiments

**Example:**
- Optimal OOF weights: XGB=0.40, LGBM=0.60
- LGBM_PLUS weights: XGB=0.35, LGBM=0.65
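
The adjustment can be written as a small helper. `lgbm_plus` is a hypothetical name introduced for this sketch; the logic mirrors the two lines used in the V28 code (`min(w_lgb + 0.05, 1.0)` followed by `1.0 - w_lgb_plus`).

```python
def lgbm_plus(w_lgb: float, shift: float = 0.05) -> tuple[float, float]:
    """Return (w_xgb, w_lgb) after moving `shift` of absolute weight to LGBM."""
    w_lgb_plus = min(w_lgb + shift, 1.0)    # cap at 100% LGBM
    return 1.0 - w_lgb_plus, w_lgb_plus

for w in (0.60, 0.975, 1.0):
    w_xgb_new, w_lgb_new = lgbm_plus(w)
    print(f"optimal LGBM={w:.3f} -> XGB={w_xgb_new:.3f}, LGBM={w_lgb_new:.3f}")
```

Note the effect of the cap: whenever the grid search already assigns 0.95 or more to LightGBM, the shift saturates and XGBoost drops out of the final blend entirely.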

In [None]:
# =========================================================
# 5. 2-model weight search
# =========================================================
auc_lgb = roc_auc_score(y_arr, oof_lgb_all)
auc_xgb = roc_auc_score(y_arr, oof_xgb_all)
print("\n==== Multi-seed Base Model OOF AUCs (V28) ====")
print(f"LGBM OOF AUC: {auc_lgb:.5f}")
print(f"XGB  OOF AUC: {auc_xgb:.5f}")

weights = np.linspace(0, 1, 41)
best_auc = 0.0
best_w = None

for w_xgb in weights:
    w_lgb = 1.0 - w_xgb
    blend_oof = w_xgb * oof_xgb_all + w_lgb * oof_lgb_all
    auc = roc_auc_score(y_arr, blend_oof)
    if auc > best_auc:
        best_auc = auc
        best_w = (w_xgb, w_lgb)

print("\nBest 2-model blend weights (XGB, LGBM):", best_w)
print("Best blended OOF AUC:", round(best_auc, 5))

w_xgb, w_lgb = best_w

# =========================================================
# 6. LGBM_PLUS version (add 5% more LGBM weight)
# =========================================================
w_lgb_plus = min(w_lgb + 0.05, 1.0)
w_xgb_minus = 1.0 - w_lgb_plus

final_test_pred = w_xgb_minus * test_xgb_all + w_lgb_plus * test_lgb_all

submission = pd.DataFrame({
    "ID": test_ids,
    "class": final_test_pred
})
submission.to_csv(OUT_PATH, index=False)

print(f"\n{'='*60}")
print(f"V28 MINIMAL LGBM_PLUS (0.90879 version)")
print(f"{'='*60}")
print(f"Optimal OOF weights: XGB={w_xgb:.3f}, LGBM={w_lgb:.3f}")
print(f"LGBM_PLUS weights:   XGB={w_xgb_minus:.3f}, LGBM={w_lgb_plus:.3f}")
print(f"\nSaved: {OUT_PATH}")
print(f"{'='*60}")
print(submission.head(10))
print(f"\nPrediction stats:")
print(f"  Mean: {final_test_pred.mean():.5f}")
print(f"  Std:  {final_test_pred.std():.5f}")
print(f"  Min:  {final_test_pred.min():.5f}")
print(f"  Max:  {final_test_pred.max():.5f}")

---

# Part 3: Comparative Analysis

## 3.1 Summary of Changes: V23 → V28

| Component | V23 | V28 | Impact |
|-----------|-----|-----|--------|
| **Random Seeds** | 3 seeds (42, 2025, 777) | 5 seeds (42, 777, 30251, 123, 2024) | ↑ Stability |
| **Total Models** | 60 (3×10×2) | 100 (5×10×2) | ↑ Ensemble diversity |
| **Weight Strategy** | Optimal from grid search | Optimal + 5% to LGBM | ↑ Generalization |
| **Computational Cost** | Baseline | +67% training time | Acceptable trade-off |
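
The model counts and the cost figure in the table follow from simple arithmetic, sketched below. The helper `n_models` is hypothetical, and the cost estimate assumes training time scales linearly with the number of model fits.

```python
def n_models(n_seeds: int, n_folds: int = 10, n_algos: int = 2) -> int:
    """Number of individual model fits in one full run."""
    return n_seeds * n_folds * n_algos

v23 = n_models(3)
v28 = n_models(5)
extra = v28 / v23 - 1     # relative increase in fits, used as a proxy for training time
print(f"V23: {v23} fits, V28: {v28} fits, extra cost: {extra:+.0%}")
```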

## 3.2 Why These Changes Matter

### Variance Reduction Theory
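For N independent models, each with variance σ², the variance of their average is σ²/N:
- 3 seeds: variance = σ²/3 ≈ 0.33σ²
- 5 seeds: variance = σ²/5 = 0.20σ² → **40% reduction** relative to 3 seeds

Seed runs share the same training data, so their predictions are positively correlated and the realized reduction is smaller; 40% is an upper bound.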
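
The σ²/N figure assumes independence. A minimal sketch of the more general case, assuming every pair of seed runs shares the same correlation ρ (an idealization; ρ was not measured in this project), uses the standard result Var = ρσ² + (1 − ρ)σ²/N. The ρ values below are illustrative only.

```python
def ensemble_variance(sigma2: float, n: int, rho: float = 0.0) -> float:
    """Variance of the mean of n equally correlated predictors.

    Var = rho * sigma2 + (1 - rho) * sigma2 / n
    rho = 0 recovers the independent case sigma2 / n.
    """
    return rho * sigma2 + (1.0 - rho) * sigma2 / n

for rho in (0.0, 0.5, 0.9):          # illustrative correlations, not measured values
    v3 = ensemble_variance(1.0, 3, rho)
    v5 = ensemble_variance(1.0, 5, rho)
    print(f"rho={rho:.1f}: 3 seeds={v3:.3f}, 5 seeds={v5:.3f}, "
          f"reduction={(1 - v5 / v3):.1%}")
```

With ρ = 0 the reduction from 3 to 5 seeds is the full 40%. As ρ grows, the ρσ² term dominates and extra seeds help less, which is why the gain from additional seeds tends to flatten quickly in practice.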

### LGBM_PLUS Rationale
- Heavy feature engineering creates richer information space
- LightGBM's leaf-wise growth may exploit this better than XGBoost's level-wise approach
- The 5-point shift is conservative, so the final blend stays close to the OOF-optimal weights

## 3.3 Expected Performance

**V23 Characteristics:**
- Strong baseline with proven dual-model architecture
- Good balance between LGBM and XGB strengths
- May show higher variance across runs

**V28 Improvements:**
- Lower prediction variance → more reliable estimates
- Better approximation of expected test performance
- LGBM_PLUS capitalizes on feature engineering investment
- Target: **0.90879+ AUC** on test set

## 3.4 Lessons Learned

1. **Multi-seed averaging is crucial** for tree-based models in competitions
2. **Different feature views** (heavy FE vs clean) create complementary models
3. **Empirical weight adjustment** can improve over pure optimization
4. **Computational investment** (5 vs 3 seeds) pays off in stability

---

## Conclusion

The evolution from V23 to V28 demonstrates the value of:
- **Systematic experimentation**: Small, justified changes
- **Ensemble diversity**: Multiple seeds, multiple algorithms, multiple feature views
- **Domain knowledge**: Understanding bankruptcy data characteristics (outliers, class imbalance)
- **Empirical tuning**: LGBM_PLUS adjustment based on observed patterns

Both models represent solid approaches to bankruptcy prediction, with V28 offering enhanced stability for production deployment.