# Amex Pipeline - Step 4: CatBoost & XGBoost Training + Model Ensemble (v6 - FIXED)

This notebook trains CatBoost and XGBoost models and combines them with the LightGBM models from previous notebooks.

**Pipeline:**
1. Loads `train_processed.parquet` and `column_lists.json`.
2. **FIXES categorical features** for CatBoost (converts floats to strings).
3. Trains 3-seed, 5-fold CatBoost models (15 models total).
4. Trains 3-seed, 5-fold XGBoost models (15 models total) with proper GPU/CPU fallback.
5. Loads pre-trained LightGBM models.
6. Creates a weighted ensemble of all three model types.
7. Generates OOF predictions and final submission.

**Key Fixes:**
- Categorical features are properly converted to strings for CatBoost
- XGBoost has CPU fallback if GPU is unavailable
- Proper initialization of ensemble weights
- Robust error handling for library compatibility

In [1]:
import pandas as pd
import numpy as np
import lightgbm as lgb
import catboost as cb
import xgboost as xgb
import gc
import os
import time
import matplotlib.pyplot as plt
import seaborn as sns
import random
import joblib
import json
import warnings
from sklearn.model_selection import StratifiedKFold
from tqdm.auto import tqdm

warnings.filterwarnings('ignore')

# Import the official metric from amex_metric.py
from amex_metric import amex_metric

print(f"LightGBM version: {lgb.__version__}")
print(f"CatBoost version: {cb.__version__}")
print(f"XGBoost version: {xgb.__version__}")
sns.set_style('whitegrid')

# --- Define Paths ---
FE_DATA_DIR = '../data_fe/'
CSV_DATA_DIR = '../data/'
MODEL_DIR = './models/'
PREPROCESSOR_DIR = './preprocessors/'

TRAIN_PATH = os.path.join(FE_DATA_DIR, 'train_processed.parquet')
TEST_PATH = os.path.join(FE_DATA_DIR, 'test_processed.parquet') 
SUB_PATH = os.path.join(CSV_DATA_DIR, 'sample_submission.csv')

# Create model subdirectories
CB_MODEL_DIR = os.path.join(MODEL_DIR, 'catboost')
XGB_MODEL_DIR = os.path.join(MODEL_DIR, 'xgboost')
for d in [MODEL_DIR, CB_MODEL_DIR, XGB_MODEL_DIR]:
    if not os.path.exists(d):
        os.makedirs(d)

print("Directories created successfully.")

LightGBM version: 4.6.0
CatBoost version: 1.2.8
XGBoost version: 3.1.1
Directories created successfully.


  from .autonotebook import tqdm as notebook_tqdm


## Step 1: Utility Functions

In [2]:
def seed_everything(seed):
    """Set seeds for reproducibility"""
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

def amex_metric_mod(y_true, y_pred):
    """Wrapper for amex_metric that works with arrays"""
    dummy_index = range(len(y_true))
    y_true_df = pd.DataFrame({'target': y_true}, index=dummy_index)
    y_pred_df = pd.DataFrame({'prediction': y_pred}, index=dummy_index)
    y_true_df.index.name = 'customer_ID'
    y_pred_df.index.name = 'customer_ID'
    return amex_metric(y_true_df, y_pred_df)

def check_gpu_availability():
    """Check if GPU is available for XGBoost"""
    try:
        import subprocess
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True, timeout=5)
        return result.returncode == 0
    except:
        return False

print("Utility functions defined.")
print(f"GPU Available: {check_gpu_availability()}")

Utility functions defined.
GPU Available: True


## Step 2: Load Processed Data

In [3]:
print(f"Loading PROCESSED train features from {TRAIN_PATH}...")
start_time = time.time()
train_df = pd.read_parquet(TRAIN_PATH)
print(f"Processed train data loaded in {time.time() - start_time:.2f}s. Shape: {train_df.shape}")

# --- Separate X and y ---
y_train = train_df['target']
X_train = train_df.drop(columns=['target'])
del train_df; gc.collect()
print("Separated y_train and X_train.")

# --- Load column lists ---
print("Loading column lists...")
with open(os.path.join(PREPROCESSOR_DIR, 'column_lists.json'), 'r') as f:
    column_lists = json.load(f)

features = column_lists['all_features']
categorical_cols = [col for col in column_lists['categorical_cols_for_lgb'] if col in features]

print(f"Features ({len(features)}) and categorical features ({len(categorical_cols)}) loaded.")
print(f"Sample categorical columns: {categorical_cols[:5]}")
gc.collect()

Loading PROCESSED train features from ../data_fe/train_processed.parquet...
Processed train data loaded in 13.14s. Shape: (458913, 7007)
Separated y_train and X_train.
Loading column lists...
Features (7005) and categorical features (99) loaded.
Sample categorical columns: ['last6_D_120_last', 'D_64_nunique', 'last3_D_68_nunique', 'last6_D_64_nunique', 'last3_B_38_nunique']


0

## Step 3: Define Training Configuration

In [4]:
# Training configuration
SEEDS = [42, 52, 62]
N_SPLITS = 5
EARLY_STOPPING_ROUNDS = 300

# Check GPU availability
USE_GPU = check_gpu_availability()
print(f"\nGPU Training: {USE_GPU}")

# CatBoost parameters (GPU optimized)
CB_PARAMS = {
    'iterations': 4500,
    'learning_rate': 0.03,
    'depth': 8,
    'l2_leaf_reg': 5,
    'min_data_in_leaf': 2000,
    'max_bin': 255,
    'random_strength': 0.5,
    'bagging_temperature': 0.2,
    'od_type': 'Iter',
    'od_wait': EARLY_STOPPING_ROUNDS,
    'task_type': 'GPU' if USE_GPU else 'CPU',
    'devices': '0' if USE_GPU else None,
    'eval_metric': 'Logloss',
    'verbose': 100,
    'random_state': 42
}

# Remove None values
CB_PARAMS = {k: v for k, v in CB_PARAMS.items() if v is not None}

# XGBoost parameters (GPU/CPU adaptive)
XGB_PARAMS = {
    'n_estimators': 4500,
    'learning_rate': 0.03,
    'max_depth': 8,
    'min_child_weight': 50,
    'subsample': 0.8,
    'colsample_bytree': 0.6,
    'gamma': 0,
    'reg_alpha': 0.1,
    'reg_lambda': 5,
    'tree_method': 'hist',  # Use 'hist' which works on both GPU and CPU
    'device': 'cuda' if USE_GPU else 'cpu',  # XGBoost 2.0+ uses 'device' parameter
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'verbosity': 1,
    'random_state': 42
}

print("Training configuration defined.")
print(f"Seeds: {SEEDS}")
print(f"Folds: {N_SPLITS}")
print(f"Total CatBoost models: {len(SEEDS) * N_SPLITS}")
print(f"Total XGBoost models: {len(SEEDS) * N_SPLITS}")
print(f"\nCatBoost task_type: {CB_PARAMS['task_type']}")
print(f"XGBoost device: {XGB_PARAMS['device']}")


GPU Training: True
Training configuration defined.
Seeds: [42, 52, 62]
Folds: 5
Total CatBoost models: 15
Total XGBoost models: 15

CatBoost task_type: GPU
XGBoost device: cuda


## Step 4: Train CatBoost Models

In [5]:
print("\n" + "="*50)
print("STARTING CATBOOST TRAINING")
print("="*50 + "\n")

X_train_cb = X_train[features].copy()

# **FIX: Convert categorical columns to string type for CatBoost**
print("Converting categorical features to string type for CatBoost...")
for col in categorical_cols:
    if col in X_train_cb.columns:
        # Fill NaN with a placeholder, then convert to string
        X_train_cb[col] = X_train_cb[col].fillna(-999).astype(str)

print(f"Categorical features converted. Sample values from {categorical_cols[0]}:")
print(X_train_cb[categorical_cols[0]].value_counts().head())

# Store OOF predictions
oof_cb_all_seeds = []
cv_scores_cb = []

for seed in SEEDS:
    print(f"\n{'='*50}")
    print(f"--- CATBOOST: TRAINING WITH SEED {seed} ---")
    print(f"{'='*50}\n")
    
    seed_everything(seed)
    skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=seed)
    oof_cb_seed = np.zeros(len(y_train))
    
    for fold, (train_idx, val_idx) in enumerate(skf.split(X_train_cb, y_train)):
        print(f"\n--- Fold {fold+1}/{N_SPLITS} (Seed {seed}) ---")
        fold_start = time.time()
        
        X_tr, X_val = X_train_cb.iloc[train_idx], X_train_cb.iloc[val_idx]
        y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]
        
        # Create CatBoost datasets with categorical feature names
        train_pool = cb.Pool(X_tr, y_tr, cat_features=categorical_cols)
        val_pool = cb.Pool(X_val, y_val, cat_features=categorical_cols)
        
        # Train model
        cb_params_fold = CB_PARAMS.copy()
        cb_params_fold['random_state'] = seed
        
        model = cb.CatBoostClassifier(**cb_params_fold)
        model.fit(
            train_pool,
            eval_set=val_pool,
            verbose=100,
            early_stopping_rounds=EARLY_STOPPING_ROUNDS
        )
        
        # Predict OOF
        oof_cb_seed[val_idx] = model.predict_proba(X_val)[:, 1]
        
        # Calculate fold score
        fold_score = amex_metric_mod(y_val.values, oof_cb_seed[val_idx])
        print(f"Fold {fold+1} Amex Score: {fold_score:.6f}")
        
        # Save model
        model_path = os.path.join(CB_MODEL_DIR, f'catboost_seed_{seed}_fold_{fold}.cbm')
        model.save_model(model_path)
        print(f"Model saved to {model_path}")
        print(f"Fold {fold+1} complete in {time.time() - fold_start:.2f}s")
        
        del X_tr, X_val, y_tr, y_val, train_pool, val_pool, model
        gc.collect()
    
    # Calculate OOF score for this seed
    oof_score = amex_metric_mod(y_train.values, oof_cb_seed)
    cv_scores_cb.append(oof_score)
    print(f"\n--- OOF Score for Seed {seed}: {oof_score:.6f} ---\n")
    
    oof_cb_all_seeds.append(oof_cb_seed)

# Average OOF predictions across seeds
oof_cb = np.mean(oof_cb_all_seeds, axis=0)
final_oof_score_cb = amex_metric_mod(y_train.values, oof_cb)

print("\n" + "="*50)
print("CATBOOST TRAINING COMPLETE")
print("="*50)
print(f"Final OOF Score (averaged): {final_oof_score_cb:.6f}")
print(f"CV Scores by seed: {[f'{s:.6f}' for s in cv_scores_cb]}")

del X_train_cb
gc.collect()


STARTING CATBOOST TRAINING

Converting categorical features to string type for CatBoost...
Categorical features converted. Sample values from last6_D_120_last:
last6_D_120_last
0.0    372530
1.0     86383
Name: count, dtype: int64

--- CATBOOST: TRAINING WITH SEED 42 ---


--- Fold 1/5 (Seed 42) ---
0:	learn: 0.6531368	test: 0.6530695	best: 0.6530695 (0)	total: 830ms	remaining: 1h 2m 15s
100:	learn: 0.2323561	test: 0.2328938	best: 0.2328938 (100)	total: 55.2s	remaining: 40m 2s
200:	learn: 0.2215758	test: 0.2240537	best: 0.2240537 (200)	total: 1m 48s	remaining: 38m 45s
300:	learn: 0.2160006	test: 0.2207977	best: 0.2207977 (300)	total: 2m 43s	remaining: 37m 57s
400:	learn: 0.2118852	test: 0.2189251	best: 0.2189251 (400)	total: 3m 36s	remaining: 36m 55s
500:	learn: 0.2087506	test: 0.2179135	best: 0.2179135 (500)	total: 4m 28s	remaining: 35m 40s
600:	learn: 0.2061142	test: 0.2172074	best: 0.2172074 (600)	total: 5m 23s	remaining: 34m 58s
700:	learn: 0.2035287	test: 0.2166985	best: 0.216698

0

## Step 5: Train XGBoost Models

In [6]:
print("\n" + "="*50)
print("STARTING XGBOOST TRAINING")
print("="*50 + "\n")

X_train_xgb = X_train[features].copy()

# **FIX: Properly handle categorical features for XGBoost**
print("Preparing categorical features for XGBoost...")
# XGBoost can handle numeric categories, but we need to ensure they're integers
for col in categorical_cols:
    if col in X_train_xgb.columns:
        # Convert to category dtype (XGBoost handles this automatically)
        X_train_xgb[col] = X_train_xgb[col].astype('category')

print(f"Categorical features prepared. Sample dtypes:")
print(X_train_xgb[categorical_cols[:3]].dtypes)

# Store OOF predictions
oof_xgb_all_seeds = []
cv_scores_xgb = []

for seed in SEEDS:
    print(f"\n{'='*50}")
    print(f"--- XGBOOST: TRAINING WITH SEED {seed} ---")
    print(f"{'='*50}\n")
    
    seed_everything(seed)
    skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=seed)
    oof_xgb_seed = np.zeros(len(y_train))
    
    for fold, (train_idx, val_idx) in enumerate(skf.split(X_train_xgb, y_train)):
        print(f"\n--- Fold {fold+1}/{N_SPLITS} (Seed {seed}) ---")
        fold_start = time.time()
        
        X_tr, X_val = X_train_xgb.iloc[train_idx], X_train_xgb.iloc[val_idx]
        y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]
        
        # Train model with error handling
        xgb_params_fold = XGB_PARAMS.copy()
        xgb_params_fold['random_state'] = seed
        
        try:
            model = xgb.XGBClassifier(**xgb_params_fold)
            model.fit(
                X_tr, y_tr,
                eval_set=[(X_val, y_val)],
                verbose=100
            )
        except Exception as e:
            print(f"GPU training failed: {e}")
            print("Falling back to CPU...")
            xgb_params_fold['device'] = 'cpu'
            model = xgb.XGBClassifier(**xgb_params_fold)
            model.fit(
                X_tr, y_tr,
                eval_set=[(X_val, y_val)],
                verbose=100
            )
        
        # Predict OOF
        oof_xgb_seed[val_idx] = model.predict_proba(X_val)[:, 1]
        
        # Calculate fold score
        fold_score = amex_metric_mod(y_val.values, oof_xgb_seed[val_idx])
        print(f"Fold {fold+1} Amex Score: {fold_score:.6f}")
        
        # Save model
        model_path = os.path.join(XGB_MODEL_DIR, f'xgboost_seed_{seed}_fold_{fold}.json')
        model.save_model(model_path)
        print(f"Model saved to {model_path}")
        print(f"Fold {fold+1} complete in {time.time() - fold_start:.2f}s")
        
        del X_tr, X_val, y_tr, y_val, model
        gc.collect()
    
    # Calculate OOF score for this seed
    oof_score = amex_metric_mod(y_train.values, oof_xgb_seed)
    cv_scores_xgb.append(oof_score)
    print(f"\n--- OOF Score for Seed {seed}: {oof_score:.6f} ---\n")
    
    oof_xgb_all_seeds.append(oof_xgb_seed)

# Average OOF predictions across seeds
oof_xgb = np.mean(oof_xgb_all_seeds, axis=0)
final_oof_score_xgb = amex_metric_mod(y_train.values, oof_xgb)

print("\n" + "="*50)
print("XGBOOST TRAINING COMPLETE")
print("="*50)
print(f"Final OOF Score (averaged): {final_oof_score_xgb:.6f}")
print(f"CV Scores by seed: {[f'{s:.6f}' for s in cv_scores_xgb]}")

del X_train_xgb
gc.collect()


STARTING XGBOOST TRAINING

Preparing categorical features for XGBoost...
Categorical features prepared. Sample dtypes:
last6_D_120_last      category
D_64_nunique          category
last3_D_68_nunique    category
dtype: object

--- XGBOOST: TRAINING WITH SEED 42 ---


--- Fold 1/5 (Seed 42) ---
GPU training failed: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, the experimental DMatrix parameter`enable_categorical` must be set to `True`.  Invalid columns:B_30_last: category, B_30_nunique: category, B_30_count: category, B_38_last: category, B_38_nunique: category, B_38_count: category, D_114_last: category, D_114_nunique: category, D_114_count: category, D_116_last: category, D_116_nunique: category, D_116_count: category, D_117_last: category, D_117_nunique: category, D_117_count: category, D_120_last: category, D_120_nunique: category, D_120_count: category, D_126_last: category, D_126_nunique: category, D_126_count: category, D_63_

ValueError: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, the experimental DMatrix parameter`enable_categorical` must be set to `True`.  Invalid columns:B_30_last: category, B_30_nunique: category, B_30_count: category, B_38_last: category, B_38_nunique: category, B_38_count: category, D_114_last: category, D_114_nunique: category, D_114_count: category, D_116_last: category, D_116_nunique: category, D_116_count: category, D_117_last: category, D_117_nunique: category, D_117_count: category, D_120_last: category, D_120_nunique: category, D_120_count: category, D_126_last: category, D_126_nunique: category, D_126_count: category, D_63_last: category, D_63_nunique: category, D_63_count: category, D_64_last: category, D_64_nunique: category, D_64_count: category, D_66_last: category, D_66_nunique: category, D_66_count: category, D_68_last: category, D_68_nunique: category, D_68_count: category, last3_B_30_last: category, last3_B_30_nunique: category, last3_B_30_count: category, last3_B_38_last: category, last3_B_38_nunique: category, last3_B_38_count: category, last3_D_114_last: category, last3_D_114_nunique: category, last3_D_114_count: category, last3_D_116_last: category, last3_D_116_nunique: category, last3_D_116_count: category, last3_D_117_last: category, last3_D_117_nunique: category, last3_D_117_count: category, last3_D_120_last: category, last3_D_120_nunique: category, last3_D_120_count: category, last3_D_126_last: category, last3_D_126_nunique: category, last3_D_126_count: category, last3_D_63_last: category, last3_D_63_nunique: category, last3_D_63_count: category, last3_D_64_last: category, last3_D_64_nunique: category, last3_D_64_count: category, last3_D_66_last: category, last3_D_66_nunique: category, last3_D_66_count: category, last3_D_68_last: category, last3_D_68_nunique: category, last3_D_68_count: category, last6_B_30_last: category, last6_B_30_nunique: category, last6_B_30_count: category, last6_B_38_last: category, last6_B_38_nunique: category, last6_B_38_count: category, last6_D_114_last: category, last6_D_114_nunique: category, last6_D_114_count: category, last6_D_116_last: category, last6_D_116_nunique: category, last6_D_116_count: category, last6_D_117_last: category, last6_D_117_nunique: category, last6_D_117_count: category, last6_D_120_last: category, last6_D_120_nunique: category, last6_D_120_count: category, last6_D_126_last: category, last6_D_126_nunique: category, last6_D_126_count: category, last6_D_63_last: category, last6_D_63_nunique: category, last6_D_63_count: category, last6_D_64_last: category, last6_D_64_nunique: category, last6_D_64_count: category, last6_D_66_last: category, last6_D_66_nunique: category, last6_D_66_count: category, last6_D_68_last: category, last6_D_68_nunique: category, last6_D_68_count: category

## Step 6: Load LightGBM OOF Predictions and Create Ensemble

In [None]:
print("\n" + "="*50)
print("CREATING MODEL ENSEMBLE")
print("="*50 + "\n")

# Try to load saved LightGBM OOF predictions first
lgbm_oof_path = os.path.join(MODEL_DIR, 'oof_lgbm.npy')
if os.path.exists(lgbm_oof_path):
    print("Loading saved LightGBM OOF predictions...")
    oof_lgb = np.load(lgbm_oof_path)
    final_oof_score_lgb = amex_metric_mod(y_train.values, oof_lgb)
    print(f"LightGBM OOF Score: {final_oof_score_lgb:.6f}\n")
else:
    # Generate LightGBM OOF predictions from saved models
    print("Generating LightGBM OOF predictions from saved models...")
    X_train_lgb = X_train[features]
    oof_lgb_all_seeds = []
    
    for seed in SEEDS:
        skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=seed)
        oof_lgb_seed = np.zeros(len(y_train))
        
        for fold, (train_idx, val_idx) in enumerate(skf.split(X_train_lgb, y_train)):
            model_path = os.path.join(MODEL_DIR, f'model_seed_{seed}_fold_{fold}.txt')
            if os.path.exists(model_path):
                model = lgb.Booster(model_file=model_path)
                oof_lgb_seed[val_idx] = model.predict(X_train_lgb.iloc[val_idx])
                del model
                gc.collect()
            else:
                print(f"Warning: LightGBM model not found at {model_path}")
        
        oof_lgb_all_seeds.append(oof_lgb_seed)
    
    oof_lgb = np.mean(oof_lgb_all_seeds, axis=0)
    final_oof_score_lgb = amex_metric_mod(y_train.values, oof_lgb)
    print(f"LightGBM OOF Score: {final_oof_score_lgb:.6f}\n")
    del X_train_lgb
    gc.collect()

# **FIX: Initialize best_weights with default values**
best_score = 0
best_weights = (0.4, 0.3, 0.3)  # Default: slightly favor LightGBM

# Test different ensemble weights
print("Testing ensemble weight combinations...")

# Test various weight combinations
weight_combinations = [
    (0.4, 0.3, 0.3),  # Equal-ish
    (0.5, 0.25, 0.25),  # Favor LightGBM
    (0.33, 0.33, 0.34),  # Equal
    (0.45, 0.30, 0.25),  # Slight LightGBM favor
    (0.40, 0.35, 0.25),  # Favor LightGBM + CatBoost
    (0.35, 0.40, 0.25),  # Favor CatBoost
    (0.35, 0.35, 0.30),  # More balanced
]

for w_lgb, w_cb, w_xgb in weight_combinations:
    oof_ensemble = w_lgb * oof_lgb + w_cb * oof_cb + w_xgb * oof_xgb
    score = amex_metric_mod(y_train.values, oof_ensemble)
    print(f"Weights (LGB={w_lgb}, CB={w_cb}, XGB={w_xgb}): Score = {score:.6f}")
    
    if score > best_score:
        best_score = score
        best_weights = (w_lgb, w_cb, w_xgb)

print(f"\nBest Ensemble Weights: LightGBM={best_weights[0]}, CatBoost={best_weights[1]}, XGBoost={best_weights[2]}")
print(f"Best Ensemble OOF Score: {best_score:.6f}")
print(f"\nImprovement over best individual model: {best_score - max(final_oof_score_lgb, final_oof_score_cb, final_oof_score_xgb):.6f}")

# Save best weights
ensemble_config = {
    'lgb_weight': float(best_weights[0]),
    'cb_weight': float(best_weights[1]),
    'xgb_weight': float(best_weights[2]),
    'oof_score': float(best_score),
    'lgb_score': float(final_oof_score_lgb),
    'cb_score': float(final_oof_score_cb),
    'xgb_score': float(final_oof_score_xgb)
}

with open(os.path.join(MODEL_DIR, 'ensemble_config.json'), 'w') as f:
    json.dump(ensemble_config, f, indent=4)

print("\nEnsemble configuration saved.")

# Save OOF predictions
np.save(os.path.join(MODEL_DIR, 'oof_lgbm.npy'), oof_lgb)
np.save(os.path.join(MODEL_DIR, 'oof_catboost.npy'), oof_cb)
np.save(os.path.join(MODEL_DIR, 'oof_xgboost.npy'), oof_xgb)
print("OOF predictions saved.")

del X_train, y_train
gc.collect()

## Step 7: Inference on Test Set

In [None]:
print("\n" + "="*50)
print("STARTING TEST SET INFERENCE")
print("="*50 + "\n")

print(f"Loading PROCESSED test features from {TEST_PATH}...")
start_time = time.time()
X_test = pd.read_parquet(TEST_PATH)
print(f"Processed test data loaded in {time.time() - start_time:.2f}s. Shape: {X_test.shape}")
gc.collect()

X_test_features = X_test[features]

# --- LightGBM Predictions ---
print("\nGenerating LightGBM predictions...")
test_preds_lgb_all = []
for seed in SEEDS:
    test_preds_lgb_seed = np.zeros(len(X_test_features))
    for fold in range(N_SPLITS):
        model_path = os.path.join(MODEL_DIR, f'model_seed_{seed}_fold_{fold}.txt')
        if os.path.exists(model_path):
            model = lgb.Booster(model_file=model_path)
            test_preds_lgb_seed += model.predict(X_test_features) / N_SPLITS
            del model
            gc.collect()
    test_preds_lgb_all.append(test_preds_lgb_seed)

test_preds_lgb = np.mean(test_preds_lgb_all, axis=0)
print(f"LightGBM predictions complete. Shape: {test_preds_lgb.shape}")

# --- CatBoost Predictions ---
print("\nGenerating CatBoost predictions...")
# Convert categorical columns to string (same as training)
X_test_cb = X_test_features.copy()
for col in categorical_cols:
    if col in X_test_cb.columns:
        X_test_cb[col] = X_test_cb[col].fillna(-999).astype(str)

test_preds_cb_all = []
for seed in SEEDS:
    test_preds_cb_seed = np.zeros(len(X_test_cb))
    for fold in range(N_SPLITS):
        model_path = os.path.join(CB_MODEL_DIR, f'catboost_seed_{seed}_fold_{fold}.cbm')
        model = cb.CatBoostClassifier()
        model.load_model(model_path)
        test_preds_cb_seed += model.predict_proba(X_test_cb)[:, 1] / N_SPLITS
        del model
        gc.collect()
    test_preds_cb_all.append(test_preds_cb_seed)

test_preds_cb = np.mean(test_preds_cb_all, axis=0)
print(f"CatBoost predictions complete. Shape: {test_preds_cb.shape}")

# --- XGBoost Predictions ---
print("\nGenerating XGBoost predictions...")
# Convert categorical columns for XGBoost
X_test_xgb = X_test_features.copy()
for col in categorical_cols:
    if col in X_test_xgb.columns:
        X_test_xgb[col] = X_test_xgb[col].astype('category')

test_preds_xgb_all = []
for seed in SEEDS:
    test_preds_xgb_seed = np.zeros(len(X_test_xgb))
    for fold in range(N_SPLITS):
        model_path = os.path.join(XGB_MODEL_DIR, f'xgboost_seed_{seed}_fold_{fold}.json')
        model = xgb.XGBClassifier()
        model.load_model(model_path)
        test_preds_xgb_seed += model.predict_proba(X_test_xgb)[:, 1] / N_SPLITS
        del model
        gc.collect()
    test_preds_xgb_all.append(test_preds_xgb_seed)

test_preds_xgb = np.mean(test_preds_xgb_all, axis=0)
print(f"XGBoost predictions complete. Shape: {test_preds_xgb.shape}")

# --- Create Ensemble Predictions ---
print("\nCreating ensemble predictions...")
test_preds_ensemble = (
    best_weights[0] * test_preds_lgb + 
    best_weights[1] * test_preds_cb + 
    best_weights[2] * test_preds_xgb
)

print(f"Ensemble predictions complete. Shape: {test_preds_ensemble.shape}")
print(f"Prediction range: [{test_preds_ensemble.min():.6f}, {test_preds_ensemble.max():.6f}]")

# Save individual model predictions
np.save(os.path.join(MODEL_DIR, 'test_preds_lgbm.npy'), test_preds_lgb)
np.save(os.path.join(MODEL_DIR, 'test_preds_catboost.npy'), test_preds_cb)
np.save(os.path.join(MODEL_DIR, 'test_preds_xgboost.npy'), test_preds_xgb)
print("\nIndividual model predictions saved.")

del test_preds_lgb_all, test_preds_cb_all, test_preds_xgb_all, X_test_cb, X_test_xgb
gc.collect()

## Step 8: Create Submission File

In [None]:
print("\n" + "="*50)
print("CREATING SUBMISSION FILE")
print("="*50 + "\n")

# Create submission DataFrame
submission_df = pd.DataFrame({
    'customer_ID': X_test['customer_ID'],
    'prediction': test_preds_ensemble
})

# Merge with sample submission to ensure correct format
sample_sub_df = pd.read_csv(SUB_PATH)
submission_df = sample_sub_df[['customer_ID']].merge(submission_df, on='customer_ID', how='left')
submission_df['prediction'] = submission_df['prediction'].fillna(0.0)

# Save submission
submission_path = 'submission_v6_ensemble.csv'
submission_df.to_csv(submission_path, index=False)

print(f"Submission file saved to: {submission_path}")
print("\nSubmission file head:")
print(submission_df.head())
print(f"\nSubmission shape: {submission_df.shape}")
print(f"Prediction statistics:")
print(submission_df['prediction'].describe())

# Also create individual model submissions for comparison
for model_name, preds in [('lgbm', test_preds_lgb), ('catboost', test_preds_cb), ('xgboost', test_preds_xgb)]:
    sub_df = pd.DataFrame({
        'customer_ID': X_test['customer_ID'],
        'prediction': preds
    })
    sub_df = sample_sub_df[['customer_ID']].merge(sub_df, on='customer_ID', how='left')
    sub_df['prediction'] = sub_df['prediction'].fillna(0.0)
    sub_path = f'submission_v6_{model_name}.csv'
    sub_df.to_csv(sub_path, index=False)
    print(f"\n{model_name.upper()} submission saved to: {sub_path}")

print("\n" + "="*50)
print("PIPELINE COMPLETE!")
print("="*50)
print(f"\nFinal Model Performance Summary:")
print(f"LightGBM OOF Score: {final_oof_score_lgb:.6f}")
print(f"CatBoost OOF Score: {final_oof_score_cb:.6f}")
print(f"XGBoost OOF Score: {final_oof_score_xgb:.6f}")
print(f"Ensemble OOF Score: {best_score:.6f}")
print(f"\nBest Ensemble Weights:")
print(f"  LightGBM: {best_weights[0]}")
print(f"  CatBoost: {best_weights[1]}")
print(f"  XGBoost: {best_weights[2]}")

del X_test, X_test_features, submission_df, sample_sub_df
gc.collect()

## Notes & Fixes Applied

**Key Fixes in This Version:**

1. **CatBoost Categorical Features Fix**:
   - CatBoost requires categorical features to be strings or integers, not floats
   - Solution: Convert all categorical columns to string type using `.astype(str)`
   - Fill NaN values with -999 placeholder before conversion

2. **XGBoost GPU/CPU Compatibility**:
   - Changed `tree_method` from 'gpu_hist' to 'hist' (works on both GPU and CPU)
   - Use `device='cuda'` or `device='cpu'` parameter (XGBoost 2.0+ syntax)
   - Added try-except fallback to CPU if GPU fails
   - Removed deprecated `enable_categorical=True` (now automatic)

3. **Ensemble Weights Initialization**:
   - Initialize `best_weights` with default values (0.4, 0.3, 0.3)
   - Prevents `None` type errors
   - Ensures ensemble always has valid weights

4. **GPU Detection**:
   - Added `check_gpu_availability()` function
   - Automatically configures models for GPU or CPU
   - Graceful fallback if GPU is unavailable

5. **Categorical Feature Handling**:
   - Consistent treatment across all models
   - CatBoost: string type
   - XGBoost: category dtype (pandas categorical)
   - LightGBM: numeric (already handled in preprocessing)

**Expected Performance:**
- The ensemble should improve upon individual model scores
- CatBoost typically provides strong performance on categorical features
- XGBoost adds diversity to the ensemble
- LightGBM models from previous notebook provide the baseline

**Hardware Requirements:**
- Works on both GPU and CPU
- GPU recommended for faster training (~3-4 hours)
- CPU fallback available (~8-12 hours)

**Troubleshooting:**
- If categorical errors persist, check the data types in preprocessing step
- For GPU errors, ensure CUDA drivers are properly installed
- Memory errors: Reduce `n_estimators` or use smaller batch sizes