# Inference: LightGBM + CatBoost Test Predictions (MEMORY-EFFICIENT)

**Memory Issue Fixed:**
- The test dataset is too large to materialize as one dense array (924,621 samples × 7,005 features ≈ 48.3 GB at 8 bytes per value)
- This notebook processes predictions in **batches** to avoid memory errors
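
As a quick sanity check on that figure, the arithmetic below reproduces it under the assumption that the failed allocation was a dense float64 array covering every test row and feature (the shapes are the ones printed in Step 1). The dtype is an assumption; the notebook does not show the original error message.

```python
# Back-of-the-envelope size of a dense matrix holding the whole test set.
# Assumption: 8 bytes per value (float64); the actual dtype of the failed
# allocation is not shown in this notebook.
n_rows = 924_621      # test samples (from the Step 1 output)
n_features = 7_005    # feature columns (from column_lists.json)
bytes_per_value = 8   # float64

total_bytes = n_rows * n_features * bytes_per_value
total_gib = total_bytes / 2**30

# Size of a single batch under the same assumption.
batch_rows = 500
batch_mib = batch_rows * n_features * bytes_per_value / 2**20

print(f"Full matrix: {total_gib:.1f} GiB")
print(f"One batch of {batch_rows} rows: {batch_mib:.1f} MiB")
```

Under these assumptions a 500-row batch is a few tens of MiB, which is why converting one batch at a time stays well within RAM.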

**What this does:**
1. Load the test data once, keeping only `customer_ID` and the feature columns
2. Generate predictions batch-by-batch
3. Combine results efficiently
4. Create ensemble submissions

**Memory-efficient approach:**
- Batch size: 500 samples at a time
- Aggressive garbage collection
- Sequential processing to minimize RAM usage
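
The batching scheme above can be sketched as a small generator that yields half-open row ranges. `iter_batches` is a hypothetical helper name used for illustration; the notebook computes the same indices inline.

```python
from typing import Iterator, Tuple


def iter_batches(n_samples: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) row ranges covering 0..n_samples in order.

    `end` is exclusive, so each pair can be used directly in
    `array[start:end]` or `df.iloc[start:end]`.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)


# Same sample count and batch size as this notebook.
batches = list(iter_batches(924_621, 500))
print(len(batches), batches[0], batches[-1])
```

The final range is shorter than the others whenever the sample count is not a multiple of the batch size, so downstream code should not assume a fixed batch length.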

In [5]:
import pandas as pd
import numpy as np
import lightgbm as lgb
import catboost as cb
import gc
import os
import time
import json
from tqdm.auto import tqdm

print(f"LightGBM version: {lgb.__version__}")
print(f"CatBoost version: {cb.__version__}")

# --- Define Paths ---
FE_DATA_DIR = '../data_fe/'
CSV_DATA_DIR = '../data/'
MODEL_DIR = './models/'
PREPROCESSOR_DIR = './preprocessors/'

TRAIN_PATH = os.path.join(FE_DATA_DIR, 'train_processed.parquet')
TEST_PATH = os.path.join(FE_DATA_DIR, 'test_processed.parquet') 
SUB_PATH = os.path.join(CSV_DATA_DIR, 'sample_submission.csv')
CB_MODEL_DIR = os.path.join(MODEL_DIR, 'catboost')

# Training configuration
SEEDS = [42, 52, 62]
N_SPLITS = 5
BATCH_SIZE = 500  # Process 500 samples at a time to avoid memory errors

print(f"\nBatch size: {BATCH_SIZE} samples")
print("This will prevent memory errors by processing data in chunks.")

LightGBM version: 4.6.0
CatBoost version: 1.2.8

Batch size: 500 samples
This will prevent memory errors by processing data in chunks.




In [6]:
# Load column configuration
print("Loading column configuration...")
with open(os.path.join(PREPROCESSOR_DIR, 'column_lists.json'), 'r') as f:
    column_lists = json.load(f)

features = column_lists['all_features']
categorical_cols = [col for col in column_lists['categorical_cols_for_lgb'] if col in features]

print(f"Total features: {len(features)}")
print(f"Categorical features: {len(categorical_cols)}")

Loading column configuration...
Total features: 7005
Categorical features: 99


## Step 1: Load Test Data (with customer IDs)

In [7]:
print("\n" + "="*70)
print("LOADING TEST DATA")
print("="*70 + "\n")

print(f"Loading test data from {TEST_PATH}...")
start_time = time.time()

# Load only customer_ID and features columns to save memory
cols_to_load = ['customer_ID'] + features
X_test = pd.read_parquet(TEST_PATH, columns=cols_to_load)

print(f"Test data loaded in {time.time() - start_time:.2f}s")
print(f"Test shape: {X_test.shape}")
print(f"Memory usage: {X_test.memory_usage(deep=True).sum() / 1e9:.2f} GB")

# Store customer IDs separately
customer_ids = X_test['customer_ID'].copy()
X_test_features = X_test[features]
del X_test
gc.collect()

n_samples = len(X_test_features)
n_batches = (n_samples + BATCH_SIZE - 1) // BATCH_SIZE

print(f"\nTotal samples: {n_samples:,}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Number of batches: {n_batches}")
print(f"Features shape: {X_test_features.shape}")


LOADING TEST DATA

Loading test data from ../data_fe/test_processed.parquet...
Test data loaded in 53.00s
Test shape: (924621, 7006)
Memory usage: 13.52 GB

Total samples: 924,621
Batch size: 500
Number of batches: 1850
Features shape: (924621, 7005)


## Step 2: LightGBM Test Predictions (Batch Processing)

In [9]:
print("\n" + "="*70)
print("GENERATING LIGHTGBM TEST PREDICTIONS (BATCH MODE)")
print("="*70 + "\n")

# Initialize predictions array
test_preds_lgb = np.zeros(n_samples, dtype=np.float32)

for seed in SEEDS:
    print(f"\n{'='*70}")
    print(f"Processing LightGBM seed {seed}")
    print("="*70)
    
    seed_preds = np.zeros(n_samples, dtype=np.float32)
    
    for fold in range(N_SPLITS):
        print(f"\n  Fold {fold+1}/{N_SPLITS}")
        model_path = os.path.join(MODEL_DIR, f'model_seed_{seed}_fold_{fold}.txt')
        
        if not os.path.exists(model_path):
            print(f"    ✗ Model not found: {model_path}")
            continue
        
        # Load model once
        model = lgb.Booster(model_file=model_path)
        print(f"    Model loaded from: {model_path}")
        
        # Process in batches
        for batch_idx in tqdm(range(n_batches), desc=f"    Predicting"):
            start_idx = batch_idx * BATCH_SIZE
            end_idx = min((batch_idx + 1) * BATCH_SIZE, n_samples)
            
            # Get batch data
            batch_data = X_test_features.iloc[start_idx:end_idx]
            
            # Predict on batch (convert to numpy to avoid pandas memory issues)
            batch_preds = model.predict(batch_data.values)
            seed_preds[start_idx:end_idx] += batch_preds / N_SPLITS
            
            del batch_data, batch_preds
            gc.collect()
        
        del model
        gc.collect()
        print(f"    ✓ Fold {fold+1} complete")
    
    # Add seed predictions to overall predictions
    test_preds_lgb += seed_preds / len(SEEDS)
    print(f"\n  Seed {seed} avg prediction: {seed_preds.mean():.6f}")
    del seed_preds
    gc.collect()

print(f"\n{'='*70}")
print(f"LightGBM predictions complete")
print(f"  Shape: {test_preds_lgb.shape}")
print(f"  Range: [{test_preds_lgb.min():.6f}, {test_preds_lgb.max():.6f}]")
print(f"  Mean: {test_preds_lgb.mean():.6f}")
print("="*70)

# Save predictions
np.save(os.path.join(MODEL_DIR, 'test_preds_lgbm.npy'), test_preds_lgb)
print(f"\n✓ Saved: {os.path.join(MODEL_DIR, 'test_preds_lgbm.npy')}")


GENERATING LIGHTGBM TEST PREDICTIONS (BATCH MODE)


Processing LightGBM seed 42

  Fold 1/5
    Model loaded from: ./models/model_seed_42_fold_0.txt


    Predicting: 100%|██████████| 1850/1850 [03:57<00:00,  7.79it/s]


    ✓ Fold 1 complete

  Fold 2/5
    Model loaded from: ./models/model_seed_42_fold_1.txt


    Predicting: 100%|██████████| 1850/1850 [03:54<00:00,  7.87it/s]


    ✓ Fold 2 complete

  Fold 3/5
    Model loaded from: ./models/model_seed_42_fold_2.txt


    Predicting: 100%|██████████| 1850/1850 [03:58<00:00,  7.75it/s]


    ✓ Fold 3 complete

  Fold 4/5
    Model loaded from: ./models/model_seed_42_fold_3.txt


    Predicting: 100%|██████████| 1850/1850 [03:59<00:00,  7.73it/s]


    ✓ Fold 4 complete

  Fold 5/5
    Model loaded from: ./models/model_seed_42_fold_4.txt


    Predicting: 100%|██████████| 1850/1850 [03:57<00:00,  7.78it/s]


    ✓ Fold 5 complete

  Seed 42 avg prediction: 0.249028

Processing LightGBM seed 52

  Fold 1/5
    Model loaded from: ./models/model_seed_52_fold_0.txt


    Predicting: 100%|██████████| 1850/1850 [03:58<00:00,  7.75it/s]


    ✓ Fold 1 complete

  Fold 2/5
    Model loaded from: ./models/model_seed_52_fold_1.txt


    Predicting: 100%|██████████| 1850/1850 [04:04<00:00,  7.58it/s]


    ✓ Fold 2 complete

  Fold 3/5
    Model loaded from: ./models/model_seed_52_fold_2.txt


    Predicting: 100%|██████████| 1850/1850 [04:05<00:00,  7.52it/s]


    ✓ Fold 3 complete

  Fold 4/5
    Model loaded from: ./models/model_seed_52_fold_3.txt


    Predicting: 100%|██████████| 1850/1850 [04:02<00:00,  7.64it/s]


    ✓ Fold 4 complete

  Fold 5/5
    Model loaded from: ./models/model_seed_52_fold_4.txt


    Predicting: 100%|██████████| 1850/1850 [05:17<00:00,  5.83it/s]


    ✓ Fold 5 complete

  Seed 52 avg prediction: 0.248813

Processing LightGBM seed 62

  Fold 1/5
    Model loaded from: ./models/model_seed_62_fold_0.txt


    Predicting: 100%|██████████| 1850/1850 [04:40<00:00,  6.59it/s]


    ✓ Fold 1 complete

  Fold 2/5
    Model loaded from: ./models/model_seed_62_fold_1.txt


    Predicting: 100%|██████████| 1850/1850 [04:39<00:00,  6.63it/s]


    ✓ Fold 2 complete

  Fold 3/5
    Model loaded from: ./models/model_seed_62_fold_2.txt


    Predicting: 100%|██████████| 1850/1850 [04:39<00:00,  6.61it/s]


    ✓ Fold 3 complete

  Fold 4/5
    Model loaded from: ./models/model_seed_62_fold_3.txt


    Predicting: 100%|██████████| 1850/1850 [04:39<00:00,  6.61it/s]


    ✓ Fold 4 complete

  Fold 5/5
    Model loaded from: ./models/model_seed_62_fold_4.txt


    Predicting: 100%|██████████| 1850/1850 [04:42<00:00,  6.55it/s]


    ✓ Fold 5 complete

  Seed 62 avg prediction: 0.249149

LightGBM predictions complete
  Shape: (924621,)
  Range: [0.000048, 0.999842]
  Mean: 0.248997

✓ Saved: ./models/test_preds_lgbm.npy
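
One caveat with the loop above: each fold adds `batch_preds / N_SPLITS`, so if a model file is missing and the fold is skipped, the seed average is silently scaled down. A minimal sketch of one alternative, assuming it is acceptable to average over only the models that were actually found, accumulates a sum and divides by the number of models used. `average_available` and the toy predictors are hypothetical names for illustration, not part of the notebook.

```python
import numpy as np


def average_available(predictors, X, batch_size=500):
    """Average predictions over the predictors that are not None.

    `predictors` is a list where a missing model is represented by None
    (standing in for a model file that was not found on disk).
    """
    n_samples = X.shape[0]
    total = np.zeros(n_samples, dtype=np.float64)
    n_used = 0
    for predict in predictors:
        if predict is None:
            continue  # skipped models do not count toward the divisor
        for start in range(0, n_samples, batch_size):
            end = min(start + batch_size, n_samples)
            total[start:end] += predict(X[start:end])
        n_used += 1
    if n_used == 0:
        raise RuntimeError("no models were available")
    return (total / n_used).astype(np.float32), n_used


# Toy stand-ins for loaded models: each returns a constant probability.
X_demo = np.zeros((1200, 3))
models = [lambda x: np.full(len(x), 0.2), None, lambda x: np.full(len(x), 0.4)]
preds, n_used = average_available(models, X_demo)
print(n_used, preds[:3])
```

In the run logged above all 15 LightGBM models were found, so the two approaches would give the same result there; the difference only matters when a file is missing.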


## Step 3: CatBoost Test Predictions (Batch Processing)

In [None]:
print("\n" + "="*70)
print("GENERATING CATBOOST TEST PREDICTIONS (BATCH MODE)")
print("="*70 + "\n")

# Prepare categorical features (convert to string)
print("Preparing categorical features for CatBoost...")
X_test_cb = X_test_features.copy()
for col in categorical_cols:
    if col in X_test_cb.columns:
        X_test_cb[col] = X_test_cb[col].fillna(-999).astype(str)
print("✓ Categorical features converted to string type\n")

# Initialize predictions array
test_preds_cb = np.zeros(n_samples, dtype=np.float32)

for seed in SEEDS:
    print(f"{'='*70}")
    print(f"Processing CatBoost seed {seed}")
    print("="*70)
    
    seed_preds = np.zeros(n_samples, dtype=np.float32)
    
    for fold in range(N_SPLITS):
        print(f"\n  Fold {fold+1}/{N_SPLITS}")
        model_path = os.path.join(CB_MODEL_DIR, f'catboost_seed_{seed}_fold_{fold}.cbm')
        
        if not os.path.exists(model_path):
            print(f"    ✗ Model not found: {model_path}")
            continue
        
        # Load model once
        model = cb.CatBoostClassifier()
        model.load_model(model_path)
        print(f"    Model loaded from: {model_path}")
        
        # Process in batches
        for batch_idx in tqdm(range(n_batches), desc=f"    Predicting"):
            start_idx = batch_idx * BATCH_SIZE
            end_idx = min((batch_idx + 1) * BATCH_SIZE, n_samples)
            
            # Get batch data
            batch_data = X_test_cb.iloc[start_idx:end_idx]
            
            # Predict on batch
            batch_preds = model.predict_proba(batch_data)[:, 1]
            seed_preds[start_idx:end_idx] += batch_preds / N_SPLITS
            
            del batch_data, batch_preds
            gc.collect()
        
        del model
        gc.collect()
        print(f"    ✓ Fold {fold+1} complete")
    
    # Add seed predictions to overall predictions
    test_preds_cb += seed_preds / len(SEEDS)
    print(f"\n  Seed {seed} avg prediction: {seed_preds.mean():.6f}")
    del seed_preds
    gc.collect()

print(f"\n{'='*70}")
print(f"CatBoost predictions complete")
print(f"  Shape: {test_preds_cb.shape}")
print(f"  Range: [{test_preds_cb.min():.6f}, {test_preds_cb.max():.6f}]")
print(f"  Mean: {test_preds_cb.mean():.6f}")
print("="*70)

# Save predictions
np.save(os.path.join(MODEL_DIR, 'test_preds_catboost.npy'), test_preds_cb)
print(f"\n✓ Saved: {os.path.join(MODEL_DIR, 'test_preds_catboost.npy')}")

# Clean up
del X_test_cb, X_test_features
gc.collect()


GENERATING CATBOOST TEST PREDICTIONS (BATCH MODE)

Preparing categorical features for CatBoost...
✓ Categorical features converted to string type

Processing CatBoost seed 42

  Fold 1/5
    Model loaded from: ./models/catboost\catboost_seed_42_fold_0.cbm


    Predicting: 100%|██████████| 1850/1850 [32:55<00:00,  1.07s/it]


    ✓ Fold 1 complete

  Fold 2/5
    Model loaded from: ./models/catboost\catboost_seed_42_fold_1.cbm


    Predicting: 100%|██████████| 1850/1850 [50:26<00:00,  1.64s/it]


    ✓ Fold 2 complete

  Fold 3/5
    Model loaded from: ./models/catboost\catboost_seed_42_fold_2.cbm


    Predicting: 100%|██████████| 1850/1850 [42:18<00:00,  1.37s/it]


    ✓ Fold 3 complete

  Fold 4/5
    Model loaded from: ./models/catboost\catboost_seed_42_fold_3.cbm


    Predicting: 100%|██████████| 1850/1850 [31:36<00:00,  1.03s/it]


    ✓ Fold 4 complete

  Fold 5/5
    Model loaded from: ./models/catboost\catboost_seed_42_fold_4.cbm


    Predicting: 100%|██████████| 1850/1850 [28:44<00:00,  1.07it/s]


    ✓ Fold 5 complete

  Seed 42 avg prediction: 0.250170
Processing CatBoost seed 52

  Fold 1/5
    Model loaded from: ./models/catboost\catboost_seed_52_fold_0.cbm


    Predicting: 100%|██████████| 1850/1850 [26:40<00:00,  1.16it/s]


    ✓ Fold 1 complete

  Fold 2/5
    Model loaded from: ./models/catboost\catboost_seed_52_fold_1.cbm


    Predicting:  36%|███▌      | 662/1850 [09:41<16:52,  1.17it/s]
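
The cell above copies the full feature frame (`X_test_features.copy()`) before converting categorical columns, which temporarily doubles the memory held by the test data. A possible alternative, sketched here under the assumption that converting one slice at a time produces the same strings as the whole-frame conversion, is to convert only the current batch. `prepare_cb_batch` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np
import pandas as pd


def prepare_cb_batch(batch: pd.DataFrame, categorical_cols) -> pd.DataFrame:
    """Return a copy of `batch` with categorical columns as strings.

    Mirrors the notebook's whole-frame conversion
    (`fillna(-999).astype(str)`), but only touches one batch at a time.
    """
    out = batch.copy()  # small: one batch, not the full test set
    for col in categorical_cols:
        if col in out.columns:
            out[col] = out[col].fillna(-999).astype(str)
    return out


# Toy data standing in for one slice of X_test_features.
demo = pd.DataFrame({"cat_a": [1.0, np.nan, 3.0], "num_b": [0.1, 0.2, 0.3]})
batch = prepare_cb_batch(demo.iloc[0:2], ["cat_a"])
print(batch["cat_a"].tolist())
```

Inside the prediction loop this would replace `X_test_cb.iloc[start_idx:end_idx]` with `prepare_cb_batch(X_test_features.iloc[start_idx:end_idx], categorical_cols)`. The trade-off is that the conversion is repeated for every model, so it exchanges extra CPU time for a lower peak memory footprint.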

## Step 4: Find Optimal Ensemble Weights

In [None]:
print("\n" + "="*70)
print("FINDING OPTIMAL ENSEMBLE WEIGHTS")
print("="*70 + "\n")

# Try to load OOF predictions for weight optimization
oof_lgb_path = os.path.join(MODEL_DIR, 'oof_lgbm.npy')
oof_cb_path = os.path.join(MODEL_DIR, 'oof_catboost.npy')

best_weights = (0.5, 0.5)  # Default

if os.path.exists(oof_cb_path):
    try:
        from amex_metric import amex_metric
        
        # Load training target
        train_df = pd.read_parquet(TRAIN_PATH, columns=['target'])
        y_train = train_df['target']
        del train_df
        gc.collect()
        
        # Load OOF predictions
        if os.path.exists(oof_lgb_path):
            oof_lgb = np.load(oof_lgb_path)
        else:
            oof_lgb = None
        
        oof_cb = np.load(oof_cb_path)
        
        if oof_lgb is not None:
            def amex_metric_mod(y_true, y_pred):
                dummy_index = range(len(y_true))
                y_true_df = pd.DataFrame({'target': y_true}, index=dummy_index)
                y_pred_df = pd.DataFrame({'prediction': y_pred}, index=dummy_index)
                y_true_df.index.name = 'customer_ID'
                y_pred_df.index.name = 'customer_ID'
                return amex_metric(y_true_df, y_pred_df)
            
            print("Testing ensemble weights...\n")
            best_score = -np.inf  # any real score beats this, even a negative one
            
            weight_combinations = [
                (0.5, 0.5), (0.6, 0.4), (0.7, 0.3), (0.4, 0.6), (0.3, 0.7),
                (0.55, 0.45), (0.45, 0.55), (0.65, 0.35), (0.35, 0.65)
            ]
            
            for w_lgb, w_cb in weight_combinations:
                oof_ensemble = w_lgb * oof_lgb + w_cb * oof_cb
                score = amex_metric_mod(y_train.values, oof_ensemble)
                print(f"  Weights (LGB={w_lgb:.2f}, CB={w_cb:.2f}): Score = {score:.6f}")
                
                if score > best_score:
                    best_score = score
                    best_weights = (w_lgb, w_cb)
            
            print(f"\n✓ Best weights: LGB={best_weights[0]:.3f}, CB={best_weights[1]:.3f}")
            print(f"✓ Best OOF score: {best_score:.6f}")
        else:
            print("LightGBM OOF not found, using equal weights (0.5, 0.5)")
    except Exception as e:
        print(f"Could not optimize weights: {e}")
        print("Using default equal weights (0.5, 0.5)")
else:
    print("OOF predictions not found, using default equal weights (0.5, 0.5)")

print(f"\nFinal ensemble weights: LGB={best_weights[0]:.3f}, CB={best_weights[1]:.3f}")
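
The nine hand-picked weight pairs above can be generalized to an evenly spaced grid. The sketch below assumes a metric where higher is better and uses a toy accuracy function in place of `amex_metric`, which lives in a local module not shown here; `search_blend_weight` and `toy_accuracy` are hypothetical names.

```python
import numpy as np


def search_blend_weight(oof_a, oof_b, y_true, metric, n_steps=21):
    """Grid-search w in [0, 1] for the blend w * oof_a + (1 - w) * oof_b.

    Returns (best_w, best_score). Ties keep the first (smallest) w.
    """
    best_w, best_score = None, -np.inf
    for w in np.linspace(0.0, 1.0, n_steps):
        score = metric(y_true, w * oof_a + (1.0 - w) * oof_b)
        if score > best_score:
            best_w, best_score = float(w), float(score)
    return best_w, best_score


def toy_accuracy(y_true, y_pred):
    """Stand-in metric: accuracy at a 0.5 threshold (not the AMEX metric)."""
    return float(np.mean((y_pred > 0.5) == (y_true == 1)))


y = np.array([1, 0, 1, 0])
oof_a = np.array([0.9, 0.1, 0.8, 0.2])   # separates the classes
oof_b = np.array([0.1, 0.9, 0.2, 0.8])   # inverted on purpose
w, score = search_blend_weight(oof_a, oof_b, y, toy_accuracy)
print(w, score)
```

Weights chosen this way are fitted to the OOF predictions, so a very fine grid can overfit slightly; a coarse grid like the one in the cell above is a reasonable default.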

## Step 5: Create Submission Files

In [None]:
print("\n" + "="*70)
print("CREATING SUBMISSION FILES")
print("="*70 + "\n")

# Load sample submission
sample_sub = pd.read_csv(SUB_PATH)
print(f"Sample submission loaded: {sample_sub.shape}")

# Create ensemble predictions
test_preds_ensemble = best_weights[0] * test_preds_lgb + best_weights[1] * test_preds_cb

print(f"\nEnsemble predictions:")
print(f"  Shape: {test_preds_ensemble.shape}")
print(f"  Range: [{test_preds_ensemble.min():.6f}, {test_preds_ensemble.max():.6f}]")
print(f"  Mean: {test_preds_ensemble.mean():.6f}")

# 1. Ensemble submission
submission_ensemble = pd.DataFrame({
    'customer_ID': customer_ids,
    'prediction': test_preds_ensemble
})
submission_ensemble = sample_sub[['customer_ID']].merge(submission_ensemble, on='customer_ID', how='left')
submission_ensemble['prediction'] = submission_ensemble['prediction'].fillna(0.0)
ensemble_path = 'submission_lgbm_catboost_ensemble.csv'
submission_ensemble.to_csv(ensemble_path, index=False)
print(f"\n✓ Ensemble submission saved: {ensemble_path}")

# 2. LightGBM only
submission_lgbm = pd.DataFrame({
    'customer_ID': customer_ids,
    'prediction': test_preds_lgb
})
submission_lgbm = sample_sub[['customer_ID']].merge(submission_lgbm, on='customer_ID', how='left')
submission_lgbm['prediction'] = submission_lgbm['prediction'].fillna(0.0)
lgbm_path = 'submission_lgbm_only.csv'
submission_lgbm.to_csv(lgbm_path, index=False)
print(f"✓ LightGBM submission saved: {lgbm_path}")

# 3. CatBoost only
submission_cb = pd.DataFrame({
    'customer_ID': customer_ids,
    'prediction': test_preds_cb
})
submission_cb = sample_sub[['customer_ID']].merge(submission_cb, on='customer_ID', how='left')
submission_cb['prediction'] = submission_cb['prediction'].fillna(0.0)
cb_path = 'submission_catboost_only.csv'
submission_cb.to_csv(cb_path, index=False)
print(f"✓ CatBoost submission saved: {cb_path}")

print(f"\n{'='*70}")
print("✅ ALL SUBMISSIONS CREATED!")
print("="*70)
print(f"\nPreview of ensemble submission:")
print(submission_ensemble.head(10))
print(f"\nSubmission statistics:")
print(submission_ensemble['prediction'].describe())
print("="*70)
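
Because missing predictions are filled with 0.0, a mismatch between `customer_ids` and the sample submission would go unnoticed. A hedged sketch of a stricter merge, using the `validate` and `indicator` options of pandas `merge`, is shown below; `build_submission` is a hypothetical helper and the IDs are toy values.

```python
import numpy as np
import pandas as pd


def build_submission(sample_sub: pd.DataFrame, customer_ids, preds) -> pd.DataFrame:
    """Align predictions to the sample submission order, failing loudly.

    Raises if customer IDs are duplicated on either side (validate) or if
    any ID in the sample submission has no prediction (indicator check).
    """
    pred_df = pd.DataFrame({
        "customer_ID": np.asarray(customer_ids),
        "prediction": np.asarray(preds),
    })
    merged = sample_sub[["customer_ID"]].merge(
        pred_df, on="customer_ID", how="left",
        validate="one_to_one", indicator=True,
    )
    n_missing = int((merged["_merge"] != "both").sum())
    if n_missing:
        raise ValueError(f"{n_missing} customer_ID values have no prediction")
    return merged.drop(columns="_merge")


# Toy sample submission in a different order than the predictions.
sample = pd.DataFrame({"customer_ID": ["c", "a", "b"], "prediction": 0.0})
sub = build_submission(sample, ["a", "b", "c"], [0.1, 0.2, 0.3])
print(sub)
```

Whether to raise or to fall back to a default value is a design choice; raising is stricter and makes an ID mismatch visible before a file is submitted.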

## Summary

### Memory-Efficient Processing
- ✅ Processed test data in batches of 500 samples
- ✅ Avoided the 48.3 GB memory allocation error
- ✅ Passed NumPy arrays (`batch_data.values`) to LightGBM instead of pandas DataFrames
- ✅ Aggressive garbage collection after each batch

### Files Created
1. **submission_lgbm_catboost_ensemble.csv** - Ensemble (RECOMMENDED)
2. **submission_lgbm_only.csv** - LightGBM only
3. **submission_catboost_only.csv** - CatBoost only
4. **models/test_preds_lgbm.npy** - Saved predictions
5. **models/test_preds_catboost.npy** - Saved predictions

### Models Used
- LightGBM: 15 models (3 seeds × 5 folds)
- CatBoost: 15 models (3 seeds × 5 folds)
- Total: 30 models averaged

### Performance Tips
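- If you still get memory errors, reduce `BATCH_SIZE` to 250 or 100
- Batching is slow at this scale: in the run logged above, each LightGBM model took about 4 to 5 minutes (roughly 65 minutes for all 15), and each completed CatBoost model took about 27 to 50 minutes
- Monitor memory usage in Task Manager during execution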