# Amex 1st Place Pipeline - Step 2: Model Training (Memory Fix)

This notebook loads the advanced features created by `v4_LGBM_Features.ipynb` and trains a LightGBM (DART) model with the 1st place solution's parameters, using 5-fold cross-validation and optional seed blending.

**Memory Fix:** This version loads and preprocesses `train` and `test` data sequentially to avoid MemoryErrors. It trains all models first, frees memory, then loads and predicts on the test set.

In [1]:
import pandas as pd
import numpy as np
import lightgbm as lgb
import gc
import os
import time
import matplotlib.pyplot as plt
import seaborn as sns
import random
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from tqdm.auto import tqdm

# Import the official metric from amex_metric.py
from amex_metric import amex_metric

print(f"LightGBM version: {lgb.__version__}")
sns.set_style('whitegrid')

LightGBM version: 4.6.0




## Step 1: Metric Wrapper & Seed Function

In [2]:
# --- LightGBM feval wrapper around the official Amex metric ---

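def amex_metric_lgbm(y_pred, y_true):
    """
    LightGBM-compatible wrapper around the Amex metric, for use as `feval`.
    LightGBM calls it as feval(preds, eval_data), so despite its name,
    y_true is the lgb.Dataset and y_pred is the numpy array of predictions.
    """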
    # Extract the actual numpy array of labels from the lgb.Dataset
    y_true_labels = y_true.get_label()
    
    # Create a dummy index that matches the length
    dummy_index = range(len(y_true_labels))
    
    # Create DataFrames with the expected column names AND the dummy index
    y_true_df = pd.DataFrame({'target': y_true_labels}, index=dummy_index)
    y_pred_df = pd.DataFrame({'prediction': y_pred}, index=dummy_index)
    
    # Add the customer_ID index NAME as required by the metric
    y_true_df.index.name = 'customer_ID'
    y_pred_df.index.name = 'customer_ID'
    
    return 'amex_metric', amex_metric(y_true_df, y_pred_df), True

print("amex_metric_lgbm wrapper defined.")

def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

print("seed_everything function defined.")

amex_metric_lgbm wrapper defined.
seed_everything function defined.
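
The wrapper above follows LightGBM's `feval` contract: the function receives the predictions and the evaluation dataset, and returns a `(name, value, is_higher_better)` tuple. The sketch below illustrates that contract without LightGBM installed. `FakeDataset` and `accuracy_feval` are hypothetical names introduced only for this illustration, and the example assumes the predictions are probabilities, as they are for the built-in `binary` objective.

```python
import numpy as np


class FakeDataset:
    """Hypothetical stand-in for lgb.Dataset that exposes only get_label()."""

    def __init__(self, labels):
        self._labels = np.asarray(labels, dtype=float)

    def get_label(self):
        return self._labels


def accuracy_feval(preds, eval_data):
    """Toy custom metric with the same shape as amex_metric_lgbm."""
    labels = eval_data.get_label()
    # Threshold the probabilities at 0.5 and compare with the true labels.
    accuracy = float(np.mean((preds > 0.5) == (labels == 1)))
    # (metric name, metric value, True because higher is better)
    return "accuracy", accuracy, True


name, value, higher_is_better = accuracy_feval(
    np.array([0.9, 0.2, 0.6, 0.4]), FakeDataset([1, 0, 0, 1])
)
print(name, value, higher_is_better)
```

The third element matters when early stopping or best-round selection is used, because it tells LightGBM which direction counts as an improvement.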


## Step 2: Define Paths & Load Train Data

This step loads *only* the pre-computed train features.

In [3]:
# --- Paths ---
FE_DATA_DIR = '../data_fe/'
CSV_DATA_DIR = '../data/'

TRAIN_PATH = os.path.join(FE_DATA_DIR, 'train_fe.parquet')
TEST_PATH = os.path.join(FE_DATA_DIR, 'test_fe.parquet') 
SUB_PATH = os.path.join(CSV_DATA_DIR, 'sample_submission.csv')

# --- Load Only Train Data ---
print(f"Loading processed train features from {TRAIN_PATH}...")
start_time = time.time()
train_df = pd.read_parquet(TRAIN_PATH)
print(f"Train data loaded in {time.time() - start_time:.2f}s. Shape: {train_df.shape}")

Loading processed train features from ../data_fe/train_fe.parquet...
Train data loaded in 31.03s. Shape: (458913, 7007)


## Step 3: Preprocess Train Data

This step applies imputation, categorical encoding, and `float16` conversion to the train data, then deletes the original `train_df`.

In [4]:
print(f"Original X_train shape: {train_df.shape}")

# --- 1. Separate X and y ---
y_train = train_df['target']
X_train = train_df.drop(columns=['target'])
del train_df; gc.collect()
print("Separated y_train and X_train. Original train_df deleted.")

# --- 2. Imputation (Fill with 0) ---
print("Applying Zero Imputation to X_train...")
# We process column-by-column to be memory safe, even for train.
all_features = X_train.columns.tolist()
for col in tqdm(all_features):
    if col == 'customer_ID':
        continue
    if X_train[col].isnull().any():
        X_train[col] = X_train[col].fillna(0)
print("Imputation complete.")

# --- 3. Handle Categoricals ---
print("Converting categorical features for X_train...")
original_cat_features = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']
categorical_cols_for_lgb = [col for col in all_features if any(cat_feat in col for cat_feat in original_cat_features) and col.endswith(('_last', '_nunique', '_count'))]
categorical_cols_for_lgb += [col for col in original_cat_features if col in all_features]
categorical_cols_for_lgb = list(set([col for col in categorical_cols_for_lgb if col in X_train.columns]))

print(f"Found {len(categorical_cols_for_lgb)} total categorical features for LightGBM.")

# LabelEncode object-type categoricals
obj_cols = [col for col in categorical_cols_for_lgb if X_train[col].dtype == 'object']
print(f"Found {len(obj_cols)} object-type columns to LabelEncode: {obj_cols}")

# Store fitted encoders to use on test set
encoders = {}
for col in tqdm(obj_cols):
    le = LabelEncoder()
    X_train[col] = le.fit_transform(X_train[col].astype(str))
    encoders[col] = le # Save the fitted encoder

# Convert all categoricals to 'category' dtype
print("Converting dtypes to 'category'...")
for col in tqdm(categorical_cols_for_lgb):
    X_train[col] = X_train[col].astype('category')

# --- 4. Downcast to float16 for memory --- 
print("Downcasting X_train to float16...")
num_cols = list(X_train.dtypes[(X_train.dtypes == 'float32') | (X_train.dtypes == 'float64') | (X_train.dtypes == 'int64')].index)
num_cols = [col for col in num_cols if col not in categorical_cols_for_lgb + ['customer_ID']]

for col in tqdm(num_cols):
    X_train[col] = X_train[col].astype(np.float16)

print("X_train preprocessing complete.")
print(f"Final X_train shape: {X_train.shape}")
gc.collect()

Original X_train shape: (458913, 7007)
Separated y_train and X_train. Original train_df deleted.
Applying Zero Imputation to X_train...


100%|██████████| 7006/7006 [00:09<00:00, 710.24it/s] 


Imputation complete.
Converting categorical features for X_train...
Found 99 total categorical features for LightGBM.
Found 0 object-type columns to LabelEncode: []


0it [00:00, ?it/s]


Converting dtypes to 'category'...


100%|██████████| 99/99 [00:00<00:00, 180.54it/s]


Downcasting X_train to float16...



X_train preprocessing complete.
Final X_train shape: (458913, 7006)





0
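
The `float16` downcast trades precision and range for memory. The sketch below estimates an upper bound on the size of a matrix with the reported `X_train` shape for three dtypes, and shows two side effects of half precision: values above the `float16` maximum become `inf`, and integers above 2048 are no longer all exactly representable. The size estimate assumes every column shares one dtype, which is not true of the real frame (it also holds `category` columns and `customer_ID`), so treat it as a rough guide only.

```python
import numpy as np

n_rows, n_cols = 458913, 7006  # X_train shape reported above

# Rough upper bound on memory if every column had the given dtype.
for dtype in (np.float64, np.float32, np.float16):
    gib = n_rows * n_cols * np.dtype(dtype).itemsize / 2**30
    print(f"{np.dtype(dtype).name}: ~{gib:.1f} GiB")

# float16 cannot hold values above its maximum; they overflow to inf.
max_half = float(np.finfo(np.float16).max)
with np.errstate(over="ignore"):
    overflowed = np.array([70000.0], dtype=np.float32).astype(np.float16)

# float16 has an 11-bit significand, so 2049 rounds to 2048.
rounded = np.float16(2049.0)

print(max_half, overflowed, float(rounded))
```

If a feature has large magnitudes (for example counts or sums), checking `np.isinf(X_train[col]).any()` after the cast is one way to spot columns that should stay in `float32`.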

## Step 4: Train Model (Seed Blending with 1st Place Params)

This cell trains all models and keeps them in memory (in `models_all_seeds`); it does *not* load the test set.

In [None]:
print("Starting 3-Seed DART Model Training...")

# --- LightGBM Parameters (from 1st Place S5_LGB_main.py) ---
lgb_params = {
    'objective': 'binary',
    'metric': 'binary_logloss', # reported alongside the custom feval (amex_metric), not replaced by it
    'boosting': 'dart',
    'max_depth': -1,
    'num_leaves': 64,
    'learning_rate': 0.035,
    'bagging_freq': 5,
    'bagging_fraction': 0.75,
    'feature_fraction': 0.05,
    'min_data_in_leaf': 256,
    'max_bin': 63,
    'min_data_in_bin': 256,
    'tree_learner': 'serial',
    'boost_from_average': 'false',
    'lambda_l1': 0.1,
    'lambda_l2': 30,
    'num_threads': -1,
    'verbosity': 1,
    # --- GPU Parameters --- (gpu_platform_id=1 selects the NVIDIA OpenCL platform on this machine)
    'device': 'gpu', 
    'gpu_platform_id': 1,  
    'gpu_device_id': 0
}

# --- Setup CV --- 
N_SPLITS = 5
SEEDS = [42] # Single seed for now; add more seeds here to enable seed blending

# --- Create feature lists --- 
features = [col for col in X_train.columns if col not in ['customer_ID']]
categorical_cols = [col for col in categorical_cols_for_lgb if col in features]

print(f"Training with {len(features)} features.")
print(f"Found {len(categorical_cols)} categorical features for LightGBM.")

# --- Arrays to store blended predictions ---
oof_predictions_all_seeds = []
models_all_seeds = [] # Store all trained models
models_seed_42 = [] 
fold_evals_results_seed_42 = [] # This will store the results for plotting

X_train_lgb = X_train[features]

for seed in SEEDS:
    print(f"\n{'='*50}")
    print(f"--- TRAINING WITH SEED {seed} ---")
    print(f"{'='*50}")
    
    seed_everything(seed)
    lgb_params['seed'] = seed
    
    oof_predictions_seed = np.zeros(X_train.shape[0])
    models_seed = [] # Store models for this seed

    skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=seed)
    
    for fold, (train_index, val_index) in enumerate(skf.split(X_train_lgb, y_train)):
        fold_start_time = time.time()
        print(f"\n--- Fold {fold+1}/{N_SPLITS} (Seed {seed}) ---")
        
        X_train_fold, X_val_fold = X_train_lgb.iloc[train_index], X_train_lgb.iloc[val_index]
        y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]
        
        lgb_train = lgb.Dataset(X_train_fold, y_train_fold, categorical_feature = categorical_cols)
        lgb_valid = lgb.Dataset(X_val_fold, y_val_fold, categorical_feature = categorical_cols)
        
        # --- FIX: Create the dictionary to store results ---
        evals_result_dic = {} 
        
        model = lgb.train(
            params = lgb_params,
            train_set = lgb_train,
            num_boost_round = 1, # Smoke-test value; the final run uses 4500 rounds
            valid_sets = [lgb_valid], # Only evaluate on validation set
            
            # --- FIX: Pass record_evaluation to the callbacks list ---
            # This is the correct way to get the evals_result_dic populated.
            callbacks=[
                lgb.log_evaluation(period=1),
                lgb.record_evaluation(evals_result_dic) # This will populate the dictionary
            ], 
            
            feval = amex_metric_lgbm
            )
        
        # --- FIX: Set best_iteration correctly ---
        # Early stopping is off, so the best iteration is the final one.
        # model.best_iteration stays 0 in that case, so set it to the number
        # of rounds actually trained instead of a hard-coded value.
        model.best_iteration = model.current_iteration()

        val_preds = model.predict(X_val_fold)
        oof_predictions_seed[val_index] = val_preds
        
        # Don't predict on test set here, just save the model
        models_seed.append(model)
        
        if seed == SEEDS[0]:
            models_seed_42.append(model)
            # --- FIX: Save the dictionary, not the attribute ---
            fold_evals_results_seed_42.append(evals_result_dic)
        
        print(f"Fold {fold+1} complete in {time.time() - fold_start_time:.2f}s")
        del X_train_fold, X_val_fold, y_train_fold, y_val_fold, lgb_train, lgb_valid, evals_result_dic
        gc.collect()

    oof_df_seed = pd.DataFrame({'customer_ID': X_train['customer_ID'], 'target': y_train, 'prediction': oof_predictions_seed})
    oof_score_seed = amex_metric(oof_df_seed[['customer_ID', 'target']], oof_df_seed[['customer_ID', 'prediction']])
    print(f"\n--- OOF Score for Seed {seed}: {oof_score_seed:.6f} ---")
    
    oof_predictions_all_seeds.append(oof_predictions_seed)
    models_all_seeds.append(models_seed) # Save the list of 5 models for this seed

print("\n--- Training Complete ---")

# --- Calculate Final Blended OOF Score ---
oof_predictions = np.mean(oof_predictions_all_seeds, axis=0)
oof_df = pd.DataFrame({'customer_ID': X_train['customer_ID'], 'target': y_train, 'prediction': oof_predictions})
oof_score = amex_metric(oof_df[['customer_ID', 'target']], oof_df[['customer_ID', 'prediction']])

print(f"\n{'='*51}")
print(f"  Final Blended CV (OOF) Amex Score: {oof_score:.6f}")
print(f"{'='*51}")

# --- FREE UP MEMORY ---
print("Deleting X_train, y_train, and OOF data to free memory...")
del X_train_lgb, X_train, y_train, oof_predictions, oof_df, oof_df_seed, oof_predictions_all_seeds, oof_predictions_seed
gc.collect()

Starting 3-Seed DART Model Training...
Training with 7005 features.
Found 99 categorical features for LightGBM.

--- TRAINING WITH SEED 42 ---

--- Fold 1/5 (Seed 42) ---
[LightGBM] [Info] Number of positive: 95062, number of negative: 272068
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 247898
[LightGBM] [Info] Number of data points in the train set: 367130, number of used features: 6814
[LightGBM] [Info] Using requested OpenCL platform 1 device 0
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce RTX 4060 Laptop GPU, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 2750 dense feature groups (963.54 MB) transferred to GPU in 0.201106 secs. 1 sparse feature groups
[1]	valid_0's binary_logloss: 0.670859	valid_0's amex_metric: 0.700862
Fold 1 complete in 154.60s

--- Fold 2/5 (Seed 42) ---
[LightGBM] [Info] Numbe
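
The out-of-fold (OOF) bookkeeping in the training cell can be easier to follow on a toy problem. The sketch below reproduces only the structure: for each seed, every row receives exactly one validation prediction, and the per-seed OOF vectors are then averaged. It is a simplified illustration under stated assumptions, not the notebook's training code: `kfold_indices` uses a plain shuffled split rather than `StratifiedKFold`, `fit_predict_stub` is a hypothetical stand-in for `lgb.train` plus `predict`, and the seeds other than 42 are made up.

```python
import numpy as np


def kfold_indices(n_rows, n_splits, seed):
    """Shuffle row indices and cut them into n_splits validation folds (not stratified)."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_rows), n_splits)


def fit_predict_stub(x_train, y_train, x_valid):
    """Hypothetical model: a sigmoid centred on the midpoint of the two class means."""
    mean_pos = x_train[y_train == 1].mean()
    mean_neg = x_train[y_train == 0].mean()
    midpoint = (mean_pos + mean_neg) / 2.0
    direction = np.sign(mean_pos - mean_neg)
    return 1.0 / (1.0 + np.exp(-direction * (x_valid - midpoint)))


# Toy data: one feature whose mean depends on the label.
data_rng = np.random.default_rng(0)
y = data_rng.integers(0, 2, size=200)
x = data_rng.normal(loc=y * 2.0, scale=1.0)

seeds = [42, 52, 62]  # 52 and 62 are illustrative only
oof_per_seed = []
for seed in seeds:
    oof = np.full(len(y), np.nan)  # NaN marks rows not yet predicted
    for valid_idx in kfold_indices(len(y), 5, seed):
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[valid_idx] = False
        oof[valid_idx] = fit_predict_stub(x[train_mask], y[train_mask], x[valid_idx])
    oof_per_seed.append(oof)

# Seed blending: average the OOF vectors row by row.
blended_oof = np.mean(oof_per_seed, axis=0)
print(blended_oof[:5])
```

Initialising the OOF array with `NaN` instead of zeros makes it easy to confirm afterwards that no row was skipped by the fold loop.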

## Step 5: Load and Preprocess Test Data (Memory-Efficiently)

Now that training is complete and memory is freed, we load and process the test set.

In [None]:
print(f"Loading processed test features from {TEST_PATH}...")
start_time = time.time()
# --- MEMORY FIX: Load directly into X_test, do not copy test_df ---
X_test = pd.read_parquet(TEST_PATH)
print(f"Test data loaded in {time.time() - start_time:.2f}s. Shape: {X_test.shape}")
gc.collect()

print("Applying Imputation, Dtype Conversion (column-wise) to X_test...")

# Reuse the column lists defined in Step 3 (the train preprocessing cell):
# obj_cols, categorical_cols_for_lgb, num_cols and encoders are in the global scope

all_test_cols = X_test.columns.tolist()

# Process columns in chunks to save memory
chunk_size = 500 # Process 500 features at a time
for i in tqdm(range(0, len(all_test_cols), chunk_size)):
    chunk_cols = all_test_cols[i:i + chunk_size]
    
    for col in chunk_cols:
        if col == 'customer_ID':
            continue
        
        # 1. Fill NA
        X_test[col] = X_test[col].fillna(0)

        # 2. Handle Categoricals
        if col in obj_cols:
            le = encoders[col] # Get the encoder fitted on train data
            
            # Find values in test that were not in train
            unseen = set(X_test[col].astype(str).unique()) - set(le.classes_)
            if unseen:
                print(f"  Warning: Unseen categories in {col}: {unseen}. Appending them as new classes.")
                # Extend the fitted encoder so unseen values get new integer codes
                # after the existing ones. Codes for values seen in train are unchanged.
                le.classes_ = np.array(le.classes_.tolist() + sorted(unseen))
            X_test[col] = le.transform(X_test[col].astype(str))
            
            X_test[col] = X_test[col].astype('category')
        
        elif col in categorical_cols_for_lgb:
            X_test[col] = X_test[col].astype('category')
        
        # 3. Downcast Numerics (that are not categoricals)
        elif col in num_cols:
            X_test[col] = X_test[col].astype(np.float16)
    
    gc.collect()

print("X_test preprocessing complete.")
print(f"Final X_test shape: {X_test.shape}")
gc.collect()
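
An alternative to extending the fitted encoder is to send every unseen test value to one reserved code. The sketch below shows that idea with a plain dictionary; `fit_mapping`, `encode`, and the example category strings are hypothetical, and `-1` is assumed not to collide with any code produced from the train data.

```python
def fit_mapping(train_values):
    """Assign integer codes 0..k-1 to the sorted unique train values."""
    return {value: code for code, value in enumerate(sorted(set(train_values)))}


def encode(values, mapping, unknown_code=-1):
    """Encode values with the train mapping; anything unseen gets unknown_code."""
    return [mapping.get(value, unknown_code) for value in values]


mapping = fit_mapping(["CO", "CR", "CL", "CO"])
train_codes = encode(["CO", "CR", "CL", "CO"], mapping)
test_codes = encode(["CR", "XZ", "CO"], mapping)  # "XZ" never appeared in train

print(mapping)      # {'CL': 0, 'CO': 1, 'CR': 2}
print(train_codes)  # [1, 2, 0, 1]
print(test_codes)   # [2, -1, 1]
```

With this design all unseen values share a single code, so the model treats them as one group instead of as several categories it never saw during training.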

## Step 6: Inference

Now we use the trained models (stored in `models_all_seeds`) to predict on the processed `X_test`.

In [None]:
print("Starting Inference on Test Set...")
start_time = time.time()

X_test_lgb = X_test[features]
test_predictions_all_seeds = []

for seed_idx, seed_models in enumerate(models_all_seeds):
    print(f"Predicting with models from SEED {SEEDS[seed_idx]}...")
    test_preds_seed = np.zeros(X_test_lgb.shape[0])
    
    for fold_idx, model in enumerate(seed_models):
        print(f"  Predicting with Fold {fold_idx+1} model...")
        test_preds_seed += model.predict(X_test_lgb) / N_SPLITS
    
    test_predictions_all_seeds.append(test_preds_seed)
    gc.collect()

# Blend the predictions from all seeds
test_predictions = np.mean(test_predictions_all_seeds, axis=0)

print(f"Inference complete in {time.time() - start_time:.2f}s")

# Clean up test data to save memory for plotting
del X_test_lgb, test_preds_seed, test_predictions_all_seeds
gc.collect()
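
The inference loop averages fold predictions within each seed and then averages across seeds. When every seed has the same number of fold models, this equals the plain mean over all models, which the sketch below checks on toy data. The stub models are hypothetical callables standing in for trained boosters, and the running sum divides by `len(seed_models)` rather than a global constant so it stays correct if the fold count changes.

```python
import numpy as np


def make_stub_model(offset):
    """Hypothetical model: row mean plus an offset, clipped to [0, 1]."""
    return lambda X: np.clip(X.mean(axis=1) + offset, 0.0, 1.0)


# Two seeds with three fold models each.
models_by_seed = [
    [make_stub_model(o) for o in (-0.02, 0.00, 0.02)],
    [make_stub_model(o) for o in (-0.01, 0.01, 0.03)],
]
X = np.random.default_rng(0).random((6, 4))

per_seed = []
for seed_models in models_by_seed:
    running = np.zeros(X.shape[0])
    for model in seed_models:
        # Accumulate a running average so only one prediction vector is held at a time.
        running += model(X) / len(seed_models)
    per_seed.append(running)

blended = np.mean(per_seed, axis=0)

# Reference: the flat mean over all six models.
flat_mean = np.mean([m(X) for models in models_by_seed for m in models], axis=0)
print(np.allclose(blended, flat_mean))
```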

## Step 7: Analyze Model Performance (Seed 42)

These plots show the learning curves and feature importance from the **first seed (42)**.

In [None]:
print("--- Plotting Learning Curves (Seed 42) ---")

plt.figure(figsize=(12, 6))
for fold, results in enumerate(fold_evals_results_seed_42):
    metric_values = results['valid_0']['amex_metric']  # the single validation set is named 'valid_0'
    plt.plot(metric_values, label=f'Fold {fold+1} Amex Metric')

plt.title('LightGBM DART Learning Curves (Amex Metric, Seed 42)', fontsize=16)
plt.xlabel('Boosting Rounds', fontsize=12)
plt.ylabel('Amex Metric', fontsize=12)
plt.legend()
plt.grid(True)
plt.show()

print("\n--- Analyzing Feature Importance (Seed 42) ---")

feature_importance_df = pd.DataFrame()
feature_importance_df['feature'] = features
feature_importance_df['importance'] = 0

for fold, model in enumerate(models_seed_42):
    feature_importance_df['importance'] += model.feature_importance(importance_type='gain') / N_SPLITS

feature_importance_df = feature_importance_df.sort_values(by='importance', ascending=False)

print(f"Top 40 Most Important Features (by 'gain', Seed 42):")
print(feature_importance_df.head(40).to_string())

plt.figure(figsize=(12, 10))
sns.barplot(
    x='importance',
    y='feature',
    data=feature_importance_df.head(40),
    palette='viridis'
)
plt.title('Top 40 Feature Importances (Averaged Over 5 Folds, Seed 42)', fontsize=16)
plt.xlabel('Average Importance (Gain)', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.tight_layout()
plt.show()
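
The dictionary filled by `lgb.record_evaluation` is nested as `{dataset_name: {metric_name: [value per round]}}`, and an unnamed validation set is called `valid_0`. The sketch below builds such a dictionary by hand and picks the best round from it. The numbers are invented for illustration and `best_round` is a hypothetical helper, not a LightGBM function.

```python
# Hand-written example with the same nesting that record_evaluation produces.
evals_result = {
    "valid_0": {
        "binary_logloss": [0.67, 0.61, 0.58],
        "amex_metric": [0.70, 0.74, 0.73],
    }
}


def best_round(evals_result, data_name, metric_name, higher_is_better=True):
    """Return (1-based round number, metric value) of the best recorded round."""
    history = evals_result[data_name][metric_name]
    pick = max if higher_is_better else min
    best_idx = pick(range(len(history)), key=history.__getitem__)
    return best_idx + 1, history[best_idx]


best_amex = best_round(evals_result, "valid_0", "amex_metric")
best_logloss = best_round(evals_result, "valid_0", "binary_logloss", higher_is_better=False)
print(best_amex)     # (2, 0.74)
print(best_logloss)  # (3, 0.58)
```

Printing `results.keys()` for one fold before plotting is a quick way to confirm the dataset name that the learning-curve loop should index.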

## Step 8: Create Submission File

This creates `submission_v4_blend.csv` using the final blended predictions.

In [None]:
print("Creating submission file...")

submission_df = pd.DataFrame({
    'customer_ID': X_test['customer_ID'],
    'prediction': test_predictions
})

sample_sub_df = pd.read_csv(SUB_PATH)
submission_df = sample_sub_df[['customer_ID']].merge(submission_df, on='customer_ID', how='left')
submission_df['prediction'] = submission_df['prediction'].fillna(0.0)

submission_path = 'submission_v4_blend.csv'
submission_df.to_csv(submission_path, index=False)

print(f"Submission file saved to: {submission_path}")
print("File head:")
print(submission_df.head())
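
Because missing predictions are filled with `0.0`, a customer that silently drops out of the merge would go unnoticed. The sketch below shows one way to count such rows before filling them, using a tiny made-up sample submission. The frames and IDs are illustrative, and `validate='one_to_one'` is an optional pandas check that raises if either side contains duplicate IDs.

```python
import pandas as pd

# Toy stand-ins for sample_submission.csv and the model predictions.
sample = pd.DataFrame({"customer_ID": ["a", "b", "c"], "prediction": [0, 0, 0]})
preds = pd.DataFrame({"customer_ID": ["c", "a"], "prediction": [0.9, 0.1]})

# A left merge keeps the row order of the sample submission.
merged = sample[["customer_ID"]].merge(
    preds, on="customer_ID", how="left", validate="one_to_one"
)

# Count customers that received no prediction before filling them.
n_missing = int(merged["prediction"].isna().sum())
print(f"{n_missing} customer(s) without a prediction")

merged["prediction"] = merged["prediction"].fillna(0.0)
print(merged)
```

A non-zero count on the real data would suggest a mismatch between `test_fe.parquet` and the sample submission rather than something to paper over with a default value.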

### Next Steps:

1.  **Run this Notebook**: This sequential process should respect your system's memory limits.
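2.  **Submit to Kaggle**: Submit `submission_v4_blend.csv`. Expect a competitive score only after retraining with the full `num_boost_round` (4500); the value in the training cell above is a quick test setting.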
3.  **True 1st Place (Ensembling)**: To get the *actual* 1st place score, you would need to:
    * Implement `S3_series_feature.py` to create LGBM OOF features.
    * Implement `S6_NN_main.py` to train a separate Neural Network model.
    * Implement `S7_ensemble.py` to blend the predictions from your LGBM and NN models.