# Numerai Modeling: Feature Engineering, Ensembling, and Advanced Training (with Checkpoints)

This notebook demonstrates several advanced modeling techniques for the Numerai tournament. It is split into parts with checkpoints to limit resource usage and is tuned for low-RAM environments (about 12 GB):
1. **Setup & Configuration**
2. **Part 1: Data Loading & Initial Exploration** (Checkpoint: Saves raw train/validation data with optimized types)
3. **Part 2: Feature Engineering** (Checkpoint: Saves data with engineered features and fitted transformers/column lists)
4. **Part 3: Base Model Training** (Checkpoint: Saves trained base models)
5. **Part 4: Stacked Ensembling** (Checkpoint: Saves OOF predictions and meta-model)
6. **Part 5: Era-Invariant MLP Training (Optional)** (Checkpoint: Saves trained MLP model)
7. **Part 6: Final Evaluation & Prediction Function**
8. **Part 7: Model Pickling for Upload**

**Workflow:** Run the parts in order. Each part first looks for its own checkpoint files and, if they exist, loads them instead of recomputing, so a restarted session resumes from the last completed checkpoint.
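
The load-or-compute pattern that each part follows can be sketched as below. This is a minimal illustration rather than the notebook's own code: `load_or_compute` and the demo path are hypothetical names, and plain `pickle` stands in for the Parquet and `cloudpickle` files that the real checkpoints use.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute_fn):
    """Return (object, from_checkpoint): load `path` if it exists, else compute and save."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f), True
    result = compute_fn()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result, False


# Hypothetical checkpoint location used only for this demo
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_checkpoint.pkl")

# First call computes and saves; second call ignores its compute_fn and loads the file
first, first_from_checkpoint = load_or_compute(demo_path, lambda: {"rows": 3})
second, second_from_checkpoint = load_or_compute(demo_path, lambda: {"rows": 999})
print(first, first_from_checkpoint, second, second_from_checkpoint)
```

Because a present checkpoint always wins, changing a configuration value has no effect until the matching checkpoint file is deleted.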

## Setup & Configuration

In [None]:
# Install dependencies
!pip install -q numerapi pandas pyarrow matplotlib lightgbm scikit-learn cloudpickle==2.2.1 seaborn umap-learn tensorflow torch ctgan tqdm

# Inline plots
%matplotlib inline

In [None]:
import pandas as pd
import numpy as np
import json
import gc
import os
import tensorflow as tf # Import TF early to potentially suppress warnings
from numerapi import NumerAPI
import lightgbm as lgb
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler, QuantileTransformer
import cloudpickle
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
from tqdm.notebook import tqdm
import warnings

# Ignore specific warnings if needed (e.g., from CTGAN or TF)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning, module='sklearn.preprocessing._data')
tf.get_logger().setLevel('ERROR') # Suppress TensorFlow warnings

# --- Configuration ---
DATA_VERSION = "v5.0"
MAIN_TARGET = "target_cyrusd_20"
AUX_TARGETS = [
  "target_victor_20",
  "target_xerxes_20",
  "target_teager2b_20"
]
TARGET_CANDIDATES = [MAIN_TARGET] + AUX_TARGETS
ERA_COL = "era"
DATA_TYPE_COL = "data_type"
TARGET_COL = "target" # Alias for MAIN_TARGET in original notebook
PREDICTION_COL = "prediction"
ID_COL = "id" # Define id column name

# Feature Engineering Hyperparameters (Adjusted for lower RAM)
UMAP_N_COMPONENTS = 30 # Reduced UMAP components
AE_ENCODING_DIM = 32   # Reduced AE dimensions
CONTRASTIVE_EMB_DIM = 32 # Reduced Contrastive dimensions
CTGAN_EPOCHS = 20      # Significantly reduced CTGAN epochs
AE_EPOCHS = 3          # Reduced AE epochs
CTGAN_SYNTH_RATIO = 0.2 # Generate fewer synthetic samples

# Stacking Ensemble Config
N_FOLDS = 5 # Number of folds for OOF predictions
STACKING_MODEL_TYPE = 'LGBM' # 'LGBM' or 'Linear'

# PyTorch MLP Config (Adjusted for lower RAM)
MLP_EPOCHS = 3          # Reduced MLP epochs
MLP_BATCH_SIZE = 2048   # Larger batches mean fewer steps per epoch but more memory per step; lower this on out-of-memory errors
MLP_LR = 0.001
VARIANCE_PENALTY_WEIGHT = 0.01 # lambda1
FEATURE_EXPOSURE_WEIGHT = 0.01 # lambda2
TOP_N_FEATURES_FOR_EXPOSURE = 30 # Reduced features for exposure penalty

# Model Selection Flags
USE_STACKING = True # Set to True to use Stacking, False for MLP
USE_MLP = False      # Set to True to use MLP (requires PyTorch)

# Speedup Options (Highly Recommended for faster iteration)
DOWNSAMPLE_TRAIN_ERAS = 4 # Use every Nth era for training (e.g., 4 or 10)
DOWNSAMPLE_VALID_ERAS = 4 # Use every Nth era for validation (e.g., 4 or 10)
FEATURE_SET_SIZE = "small" # 'small', 'medium', or 'all'

# Checkpoint File Paths
CHECKPOINT_DIR = "checkpoints"
os.makedirs(CHECKPOINT_DIR, exist_ok=True)
CP1_TRAIN_PATH = os.path.join(CHECKPOINT_DIR, "train_part1.parquet")
CP1_VALID_PATH = os.path.join(CHECKPOINT_DIR, "validation_part1.parquet")
CP2_TRAIN_PATH = os.path.join(CHECKPOINT_DIR, "train_part2.parquet")
CP2_VALID_PATH = os.path.join(CHECKPOINT_DIR, "validation_part2.parquet")
CP2_FE_INFO_PATH = os.path.join(CHECKPOINT_DIR, "fe_info_part2.pkl") # Fitted transformers and column lists
CP3_MODELS_PATH = os.path.join(CHECKPOINT_DIR, "base_models_part3.pkl")
CP4_OOF_PATH = os.path.join(CHECKPOINT_DIR, "oof_preds_part4.parquet")
CP4_META_MODEL_PATH = os.path.join(CHECKPOINT_DIR, "meta_model_part4.pkl")
CP5_MLP_MODEL_PATH = os.path.join(CHECKPOINT_DIR, "mlp_model_part5.pkl")

pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: f'{x:.6f}')

# GPU note: the autoencoder, CTGAN, and MLP steps run much faster on a GPU.
# If one is available, install the GPU builds of TensorFlow and PyTorch.

## Part 1: Data Loading & Initial Exploration

Load training and validation data, perform initial filtering, downsampling, and **data type optimization**.
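
To give a rough sense of what the data type optimization saves, the sketch below compares the memory footprint of the same values stored as `int64` and as `int8`. The array shape is an arbitrary assumption made for the demo; the real feature matrix is far larger.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature block: 1,000 rows x 50 features with integer values 0-4
features_int64 = rng.integers(0, 5, size=(1000, 50), dtype=np.int64)
features_int8 = features_int64.astype(np.int8)

bytes_int64 = features_int64.nbytes
bytes_int8 = features_int8.nbytes
# The cast is lossless because every value fits in the int8 range
values_preserved = np.array_equal(features_int64, features_int8)

print(f"int64: {bytes_int64:,} bytes, int8: {bytes_int8:,} bytes, lossless: {values_preserved}")
```

The same eightfold ratio applies at full scale, which is why the cast is done before any feature engineering.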

In [None]:
napi = NumerAPI()

# Function to optimize data types
def optimize_dtypes(df, feature_cols, target_cols):
    print("Optimizing data types...")
    for col in df.columns:
        if col in feature_cols:
            df[col] = df[col].astype(np.int8) # Numerai features are 0-4
        elif col in target_cols:
            df[col] = df[col].astype(np.float32) # Targets are 0, 0.25, ..., 1; float32 because some pyarrow versions cannot write float16 to Parquet
        elif col == ERA_COL:
             df[col] = df[col].astype(np.int16) # Eras are integers
    gc.collect()
    return df

# Check if checkpoint exists
if os.path.exists(CP1_TRAIN_PATH) and os.path.exists(CP1_VALID_PATH):
    print(f"Loading data from Checkpoint 1...")
    train = pd.read_parquet(CP1_TRAIN_PATH).set_index(ID_COL)
    validation = pd.read_parquet(CP1_VALID_PATH).set_index(ID_COL)
    # Load metadata to reconstruct column lists
    if not os.path.exists(f"{DATA_VERSION}/features.json"):
        print("Downloading metadata...")
        napi.download_dataset(f"{DATA_VERSION}/features.json")
    with open(f"{DATA_VERSION}/features.json") as f:
        feature_metadata = json.load(f)
    feature_sets = feature_metadata["feature_sets"]
    original_feature_cols = feature_sets[FEATURE_SET_SIZE]
    target_cols_meta = feature_metadata["targets"]
    # Recreate targets_df if needed for exploration (or save/load it too)
    target_cols = [col for col in [MAIN_TARGET] + AUX_TARGETS if col in train.columns]
    targets_to_keep = [ERA_COL] + target_cols
    targets_df = train[[col for col in targets_to_keep if col in train.columns]].copy()
    print("Data loaded from Checkpoint 1.")
else:
    print("Checkpoint 1 not found. Loading data from source...")
    # Download metadata and training data
    print("Downloading metadata...")
    napi.download_dataset(f"{DATA_VERSION}/features.json")
    print("Downloading training data...")
    napi.download_dataset(f"{DATA_VERSION}/train.parquet")
    print("Downloading validation data...")
    napi.download_dataset(f"{DATA_VERSION}/validation.parquet")

    # Load feature metadata and define feature sets
    print("Loading feature metadata...")
    with open(f"{DATA_VERSION}/features.json") as f:
        feature_metadata = json.load(f)
    feature_sets = feature_metadata["feature_sets"]
    original_feature_cols = feature_sets[FEATURE_SET_SIZE]
    target_cols_meta = feature_metadata["targets"]

    # Define columns to read (including 'id')
    read_columns = [ID_COL, ERA_COL, DATA_TYPE_COL] + original_feature_cols + target_cols_meta

    # Load training data - read specified columns first
    print("Loading training data...")
    train_raw = pd.read_parquet(f"{DATA_VERSION}/train.parquet", columns=read_columns)
    train = train_raw[train_raw[DATA_TYPE_COL] == "train"].set_index(ID_COL).copy()
    del train[DATA_TYPE_COL]
    del train_raw
    gc.collect()

    # Load validation data - read specified columns first
    print("Loading validation data...")
    validation_raw = pd.read_parquet(f"{DATA_VERSION}/validation.parquet", columns=read_columns)
    validation = validation_raw[validation_raw[DATA_TYPE_COL] == "validation"].set_index(ID_COL).copy()
    del validation[DATA_TYPE_COL]
    del validation_raw
    gc.collect()

    # --- Initial Preprocessing & Downsampling ---
    # Check if 'target' is an alias and update target_cols
    if TARGET_COL in train.columns:
        if not train[TARGET_COL].equals(train[MAIN_TARGET]):
            warnings.warn(f"'{TARGET_COL}' column is present but not equal to '{MAIN_TARGET}'. Check data consistency.")
        else:
            print(f"'{TARGET_COL}' column confirmed as alias for '{MAIN_TARGET}'.")
        target_cols = [col for col in target_cols_meta if col != TARGET_COL] # Use original meta list
        train = train.drop(columns=[TARGET_COL], errors='ignore')
        validation = validation.drop(columns=[TARGET_COL], errors='ignore')
    else:
        target_cols = target_cols_meta

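    # Restrict target_cols to the main and auxiliary targets (this overrides the list built above;
    # other target columns are not dropped from the dataframes here)
    target_cols = [col for col in [MAIN_TARGET] + AUX_TARGETS if col in train.columns]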

    # Optimize Dtypes *before* downsampling/embargo if possible, otherwise after
    train = optimize_dtypes(train, original_feature_cols, target_cols)
    validation = optimize_dtypes(validation, original_feature_cols, target_cols)

    # Downsample eras if configured
    if DOWNSAMPLE_TRAIN_ERAS > 1:
        print(f"Downsampling training data to every {DOWNSAMPLE_TRAIN_ERAS}th era...")
        train = train[train[ERA_COL].isin(train[ERA_COL].unique()[::DOWNSAMPLE_TRAIN_ERAS])].copy()
        gc.collect()
    if DOWNSAMPLE_VALID_ERAS > 1:
        print(f"Downsampling validation data to every {DOWNSAMPLE_VALID_ERAS}th era...")
        validation = validation[validation[ERA_COL].isin(validation[ERA_COL].unique()[::DOWNSAMPLE_VALID_ERAS])].copy()
        gc.collect()

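    # Embargo validation eras
    # ERA_COL was cast to int16 in optimize_dtypes, so compare against integers;
    # zero-padded strings would never match and nothing would be dropped.
    # Note: this runs after downsampling, so last_train_era can be up to
    # DOWNSAMPLE_TRAIN_ERAS - 1 eras earlier than the true last training era.
    last_train_era = int(train[ERA_COL].max())
    eras_to_embargo = list(range(last_train_era + 1, last_train_era + 5))
    validation = validation[~validation[ERA_COL].isin(eras_to_embargo)].copy()
    print(f"Embargoed eras from validation: {eras_to_embargo}")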
    gc.collect()

    # Save Checkpoint 1
    print("Saving Checkpoint 1...")
    train.reset_index().to_parquet(CP1_TRAIN_PATH)
    validation.reset_index().to_parquet(CP1_VALID_PATH)
    print("Checkpoint 1 saved.")
    
    # Recreate targets_df for exploration after potential drops/downsampling
    targets_to_keep = [ERA_COL] + target_cols
    targets_df = train[[col for col in targets_to_keep if col in train.columns]].copy()

# --- Exploration Code (Runs whether loading from checkpoint or source) ---
print(f"Using '{MAIN_TARGET}' as the main target.")
print(f"Auxiliary targets being considered: {AUX_TARGETS}")

# Display target columns
print("Target columns in training data (head):")
display(targets_df.head())

# Display target correlations
print(f"\nCorrelations of available targets with {MAIN_TARGET}:")
if MAIN_TARGET in targets_df.columns:
    target_corrs = (
        targets_df[target_cols]
        .corrwith(targets_df[MAIN_TARGET].astype(float)) # Ensure float for correlation
        .sort_values(ascending=False)
        .to_frame(f"corr_with_{MAIN_TARGET}")
    )
    display(target_corrs)

    # Plot correlation matrix heatmap
    plt.figure(figsize=(8, 6))
    sns.heatmap(
      targets_df[target_cols].astype(float).corr(), # Ensure float for correlation matrix
      cmap="coolwarm",
      xticklabels=False,
      yticklabels=False
    )
    plt.title("Target Correlation Matrix")
    plt.show()
else:
    print(f"Main target {MAIN_TARGET} not found in the loaded training data.")

gc.collect()
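
The era downsampling and embargo steps in the cell above can be illustrated on toy data. This is a sketch under stated assumptions: the era ranges are invented, eras are plain integers, and `N` plays the role of `DOWNSAMPLE_TRAIN_ERAS`. It also shows how the embargo window depends on whether the last training era is taken before or after downsampling.

```python
import numpy as np

N = 4  # keep every Nth era, as DOWNSAMPLE_TRAIN_ERAS does

train_eras = np.arange(1, 21)   # hypothetical training eras 1..20
valid_eras = np.arange(21, 41)  # hypothetical validation eras 21..40

# Every Nth unique era survives downsampling
kept_train_eras = train_eras[::N]

# Embargo window computed from the downsampled eras (the order used in the cell above)
last_kept = int(kept_train_eras.max())
embargo_after_downsampling = list(range(last_kept + 1, last_kept + 5))

# Embargo window computed from the full era list, for comparison
last_full = int(train_eras.max())
embargo_from_full = list(range(last_full + 1, last_full + 5))

# Integer-to-integer comparison, so matching eras are actually removed
kept_valid_eras = valid_eras[~np.isin(valid_eras, embargo_after_downsampling)]

print(kept_train_eras.tolist())
print(embargo_after_downsampling, embargo_from_full)
print(kept_valid_eras.tolist())
```

In this toy case the window computed after downsampling covers eras 18 to 21 and removes only one validation era, while the window computed from the full list covers eras 21 to 24. Whether that difference matters depends on how much separation between training and validation is wanted.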

## Part 2: Feature Engineering

Fit feature engineering models/transformers on training data and apply transformations to both train and validation sets.
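
The fit-on-train, transform-both pattern can be sketched with a small linear projection. `TrainFittedProjector` is a hypothetical stand-in (PCA via SVD) for the UMAP reducer and autoencoder used in this part; the point is only that the statistics come from the training rows and are then reused unchanged on the second dataframe.

```python
import numpy as np


class TrainFittedProjector:
    """Linear projection (PCA via SVD), a stand-in for UMAP or a trained encoder."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        # Centering vector and projection axes are learned from the fitting data only
        self.mean_ = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def transform(self, X):
        # Reuses the stored training statistics; nothing is re-estimated here
        return ((X - self.mean_) @ self.components_.T).astype(np.float32)


rng = np.random.default_rng(42)
X_train = rng.integers(0, 5, size=(200, 10)).astype(np.float32)  # toy training features
X_valid = rng.integers(0, 5, size=(50, 10)).astype(np.float32)   # toy validation features

projector = TrainFittedProjector(n_components=3).fit(X_train)
train_emb = projector.transform(X_train)
valid_emb = projector.transform(X_valid)
print(train_emb.shape, valid_emb.shape)
```

Keeping the fitted object around, as the notebook does in `fitted_transformers`, is what allows the same transformation to be applied to live data later.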

### Feature Engineering Functions Definition
(The functions are defined in their own cell so that the checkpoint cell that follows only has to call them.)
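
One of the functions trains a denoising autoencoder, which learns to reconstruct clean inputs from corrupted copies. The corruption step alone can be sketched as follows. The sketch assumes the 0-4 integer features have been rescaled to [0, 1], so that clipping to [0, 1] does not discard information; the array shape is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
noise_factor = 0.1

raw = rng.integers(0, 5, size=(1000, 8))   # hypothetical features with values 0-4
clean = raw.astype(np.float32) / 4.0       # rescaled to [0, 1]

# Additive Gaussian noise, clipped back into the valid input range
noisy = clean + noise_factor * rng.normal(0.0, 1.0, size=clean.shape)
noisy = np.clip(noisy, 0.0, 1.0).astype(np.float32)

# The autoencoder would be trained on (noisy -> clean) pairs
mean_abs_corruption = float(np.abs(noisy - clean).mean())
print(f"mean absolute corruption: {mean_abs_corruption:.4f}")
```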

In [None]:
import umap
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
from ctgan import CTGAN
from sklearn.preprocessing import QuantileTransformer

# --- UMAP Feature Creation ---
def umap_feature_creation(df_train, df_transform, feature_cols, n_components=UMAP_N_COMPONENTS, random_state=42):
    """Fits UMAP on df_train and transforms both df_train and df_transform."""
    print(f"Creating {n_components} UMAP features...")
    reducer = umap.UMAP(n_components=n_components, random_state=random_state, n_jobs=1)
    print(" Fitting UMAP on training data...")
    # Use float32 for UMAP input
    train_data = df_train[feature_cols].astype(np.float32).fillna(0.5)
    reducer.fit(train_data)
    print(" Transforming training data...")
    umap_features_train = reducer.transform(train_data).astype(np.float32) # Output as float32
    umap_feature_names = [f"umap_feat_{i}" for i in range(n_components)]
    df_train[umap_feature_names] = umap_features_train
    print(" Transforming second dataframe...")
    transform_data = df_transform[feature_cols].astype(np.float32).fillna(0.5)
    umap_features_transform = reducer.transform(transform_data).astype(np.float32)
    df_transform[umap_feature_names] = umap_features_transform
    print("UMAP features created and applied.")
    return df_train, df_transform, umap_feature_names, reducer

# --- Denoising Autoencoder Feature Creation ---
def denoising_autoencoder_features(df_train, df_transform, feature_cols, encoding_dim=AE_ENCODING_DIM, noise_factor=0.1, epochs=AE_EPOCHS, batch_size=1024):
    """Fits AE on df_train and transforms both df_train and df_transform."""
    print(f"Creating {encoding_dim} Denoising AE features...")
    input_dim = len(feature_cols)
    # Use float32 for TF input. Features are integers 0-4, so rescale them to [0, 1]:
    # the decoder ends in a sigmoid and the noisy copy is clipped to [0, 1].
    train_data = (df_train[feature_cols].astype(np.float32).fillna(2.0) / 4.0).values
    noisy_train_data = train_data + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=train_data.shape)
    noisy_train_data = np.clip(noisy_train_data, 0., 1.).astype(np.float32)
    train_data = train_data.astype(np.float32)

    input_layer = keras.Input(shape=(input_dim,))
    encoded = layers.Dense(encoding_dim * 2, activation='relu')(input_layer)
    encoded = layers.Dense(encoding_dim, activation='relu', name='encoder_output')(encoded)
    decoded = layers.Dense(encoding_dim * 2, activation='relu')(encoded)
    decoded = layers.Dense(input_dim, activation='sigmoid')(decoded)
    autoencoder = keras.Model(input_layer, decoded)
    encoder = keras.Model(input_layer, encoded)
    autoencoder.compile(optimizer='adam', loss='mse')
    print(" Training Denoising Autoencoder...")
    autoencoder.fit(noisy_train_data, train_data, epochs=epochs, batch_size=batch_size, shuffle=True, validation_split=0.1, verbose=0)
    print(" Encoding training data...")
    ae_features_train = encoder.predict(train_data).astype(np.float32) # Output as float32
    ae_feature_names = [f"ae_feat_{i}" for i in range(encoding_dim)]
    df_train[ae_feature_names] = ae_features_train
    print(" Encoding second dataframe...")
    # Apply the same [0, 1] rescaling as for the training data
    transform_data = (df_transform[feature_cols].astype(np.float32).fillna(2.0) / 4.0).values
    ae_features_transform = encoder.predict(transform_data).astype(np.float32)
    df_transform[ae_feature_names] = ae_features_transform
    print("AE features created and applied.")
    return df_train, df_transform, ae_feature_names, encoder

# --- Contrastive Learning Feature Creation (Placeholder) ---
def contrastive_feature_creation(df_train, df_transform, feature_cols, embedding_dim=CONTRASTIVE_EMB_DIM):
    """Placeholder: Generates random features for both dataframes."""
    print(f"Creating {embedding_dim} Contrastive (placeholder) features...")
    num_samples_train = len(df_train)
    num_samples_transform = len(df_transform)
    contrastive_features_train = np.random.rand(num_samples_train, embedding_dim).astype(np.float32)
    contrastive_features_transform = np.random.rand(num_samples_transform, embedding_dim).astype(np.float32)
    contrastive_feature_names = [f"contrastive_feat_{i}" for i in range(embedding_dim)]
    df_train[contrastive_feature_names] = contrastive_features_train
    df_transform[contrastive_feature_names] = contrastive_features_transform
    print("Contrastive (placeholder) features created.")
    return df_train, df_transform, contrastive_feature_names, None 

# --- CTGAN Feature Creation ---
def synthetic_data_ctgan(df_train, df_transform, feature_cols, target_col, n_synthetic_samples_ratio=CTGAN_SYNTH_RATIO, epochs=CTGAN_EPOCHS):
    """Fits CTGAN on df_train and generates a distance feature for both dataframes."""
    print(f"Creating synthetic features using CTGAN (target: {target_col})...")
    # Use only a subset for CTGAN fitting if memory is tight
    train_subset_for_ctgan = df_train.sample(frac=0.5, random_state=42) # Use 50% for fitting
    data_subset_train = train_subset_for_ctgan[feature_cols + [target_col]].copy().dropna(subset=[target_col])
    data_subset_train[feature_cols] = data_subset_train[feature_cols].fillna(0.5)
    qt = QuantileTransformer(output_distribution='uniform', random_state=42)
    print(" Fitting QuantileTransformer...")
    # Fit QT only on the subset used for CTGAN
    data_transformed_train = qt.fit_transform(data_subset_train[feature_cols])
    data_transformed_df_train = pd.DataFrame(data_transformed_train, columns=feature_cols, index=data_subset_train.index)
    data_transformed_df_train[target_col] = data_subset_train[target_col].values
    discrete_columns = []
    print(" Training CTGAN...")
    ctgan_model = CTGAN(epochs=epochs, verbose=False) # Pass epochs to the constructor; the fit() argument is deprecated
    try:
        ctgan_model.fit(data_transformed_df_train, discrete_columns)
    except Exception as e:
        print(f"CTGAN fitting failed: {e}. Skipping CTGAN features.")
        return df_train, df_transform, [], None, None, None
    print(" Generating synthetic data...")
    n_synthetic_samples = int(len(data_subset_train) * n_synthetic_samples_ratio)
    if n_synthetic_samples == 0:
        print("Warning: n_synthetic_samples_ratio is too low, generating 0 samples. Skipping CTGAN features.")
        return df_train, df_transform, [], None, None, None
    synthetic_data_transformed = ctgan_model.sample(n_synthetic_samples)
    synthetic_features_original_scale = qt.inverse_transform(synthetic_data_transformed[feature_cols])
    synthetic_df = pd.DataFrame(synthetic_features_original_scale, columns=feature_cols)
    synthetic_mean = synthetic_df.mean(axis=0).astype(np.float32)
    ctgan_feature_name = f"dist_to_synth_mean_{target_col}"
    print(" Calculating distance feature for training data...")
    original_features_train = df_train[feature_cols].fillna(0.5).values.astype(np.float32)
    distances_train = np.linalg.norm(original_features_train - synthetic_mean.values, axis=1).astype(np.float32)
    df_train[ctgan_feature_name] = distances_train
    print(" Calculating distance feature for second dataframe...")
    original_features_transform = df_transform[feature_cols].fillna(0.5).values.astype(np.float32)
    distances_transform = np.linalg.norm(original_features_transform - synthetic_mean.values, axis=1).astype(np.float32)
    df_transform[ctgan_feature_name] = distances_transform
    print("CTGAN-derived features created and applied.")
    del data_subset_train, data_transformed_train, data_transformed_df_train, synthetic_data_transformed, synthetic_features_original_scale, synthetic_df
    gc.collect()
    return df_train, df_transform, [ctgan_feature_name], ctgan_model, qt, synthetic_mean

print("Feature engineering functions defined.")
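
The CTGAN-derived feature reduces to a single number per row: the Euclidean distance between that row's features and the mean of the synthetic samples. A minimal sketch follows, with uniformly random rows standing in for CTGAN output (an assumption made so the example runs without training a GAN).

```python
import numpy as np

rng = np.random.default_rng(0)

real_rows = rng.integers(0, 5, size=(6, 4)).astype(np.float32)        # toy real features
synthetic_rows = rng.uniform(0, 4, size=(100, 4)).astype(np.float32)  # stand-in for CTGAN samples

# One mean vector summarizes all synthetic rows
synthetic_mean = synthetic_rows.mean(axis=0)

# Per-row Euclidean distance to that mean vector
dist_to_synth_mean = np.linalg.norm(real_rows - synthetic_mean, axis=1)
print(dist_to_synth_mean)
```

Since only the mean of the synthetic sample is used, the feature carries no information about the shape of the generated distribution beyond its center.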

In [None]:
# Check if Checkpoint 2 exists
if os.path.exists(CP2_TRAIN_PATH) and os.path.exists(CP2_VALID_PATH) and os.path.exists(CP2_FE_INFO_PATH):
    print("Loading data from Checkpoint 2...")
    train = pd.read_parquet(CP2_TRAIN_PATH).set_index(ID_COL)
    validation = pd.read_parquet(CP2_VALID_PATH).set_index(ID_COL)
    with open(CP2_FE_INFO_PATH, 'rb') as f:
        fe_info = cloudpickle.load(f)
        fitted_transformers = fe_info['transformers']
        original_feature_cols = fe_info['original_feature_cols']
        engineered_feature_cols = fe_info['engineered_feature_cols']
        feature_cols = fe_info['feature_cols']
        target_cols = fe_info['target_cols'] # Load target cols too
    
    # Ensure loaded dataframes have the correct columns (handle potential schema drift)
    required_cols = [ERA_COL] + target_cols + feature_cols
    train = train[[col for col in required_cols if col in train.columns]]
    validation = validation[[col for col in required_cols if col in validation.columns]]
    
    print(f"Data, {len(fitted_transformers)} fitted transformers, and feature lists loaded from Checkpoint 2.")
    print(f"Total features: {len(feature_cols)}")

else:
    print("Checkpoint 2 not found. Running Feature Engineering...")
    # --- Fit Feature Engineering on Training Data & Transform Both ---
    engineered_feature_cols = []
    fitted_transformers = {} # Dictionary to store fitted objects

    # UMAP
    train, validation, umap_feats, fitted_transformers['umap_reducer'] = umap_feature_creation(
        train, validation, original_feature_cols, n_components=UMAP_N_COMPONENTS
    )
    engineered_feature_cols.extend(umap_feats)
    gc.collect()

    # Denoising Autoencoder
    train, validation, ae_feats, fitted_transformers['ae_encoder'] = denoising_autoencoder_features(
        train, validation, original_feature_cols, encoding_dim=AE_ENCODING_DIM, epochs=AE_EPOCHS
    )
    engineered_feature_cols.extend(ae_feats)
    gc.collect()

    # Contrastive Learning (Placeholder)
    train, validation, contrastive_feats, _ = contrastive_feature_creation(
        train, validation, original_feature_cols, embedding_dim=CONTRASTIVE_EMB_DIM
    )
    engineered_feature_cols.extend(contrastive_feats)
    gc.collect()

    # CTGAN (using main target for demonstration)
    # WARNING: CTGAN can be very memory intensive. If this step fails,
    # consider reducing CTGAN_SYNTH_RATIO or commenting out this block.
    print("\n--- Starting CTGAN --- (This may take time and memory)")
    train, validation, ctgan_feats, fitted_transformers['ctgan_model'], fitted_transformers['qt'], fitted_transformers['synthetic_mean'] = synthetic_data_ctgan(
        train, validation, original_feature_cols, MAIN_TARGET, epochs=CTGAN_EPOCHS, n_synthetic_samples_ratio=CTGAN_SYNTH_RATIO
    )
    engineered_feature_cols.extend(ctgan_feats)
    print("--- Finished CTGAN ---")
    gc.collect()

    # Update the main feature list
    feature_cols = original_feature_cols + engineered_feature_cols
    print(f"\nTotal number of features after engineering: {len(feature_cols)}")

    # Optimize dtypes for engineered features (mostly floats)
    for df in [train, validation]:
        for col in engineered_feature_cols:
            if col in df.columns:
                df[col] = df[col].astype(np.float32)
    gc.collect()

    # Save Checkpoint 2
    print("Saving Checkpoint 2...")
    train.reset_index().to_parquet(CP2_TRAIN_PATH)
    validation.reset_index().to_parquet(CP2_VALID_PATH)
    
    # Save fitted transformers and feature lists
    fe_info_to_save = {
        'transformers': fitted_transformers,
        'original_feature_cols': original_feature_cols,
        'engineered_feature_cols': engineered_feature_cols,
        'feature_cols': feature_cols,
        'target_cols': target_cols
    }
    try:
        with open(CP2_FE_INFO_PATH, 'wb') as f:
            cloudpickle.dump(fe_info_to_save, f)
        print("Checkpoint 2 saved.")
    except Exception as e:
        # Without the FE info file the Checkpoint 2 check fails, so Part 2 is recomputed on the next run
        print(f"Could not pickle FE info: {e}. Checkpoint 2 is incomplete; feature engineering will re-run next time.")
        if os.path.exists(CP2_FE_INFO_PATH):
            os.remove(CP2_FE_INFO_PATH)

# Display head of dataframes after FE
print("\nTraining data with engineered features (head):")
display(train[feature_cols].head())
print("\nValidation data with engineered features (head):")
display(validation[feature_cols].head())
gc.collect()

## Part 3: Base Model Training (LightGBM)

Train LightGBM models for each selected target using the original and engineered features.

In [None]:
# Check if Checkpoint 3 exists
if os.path.exists(CP3_MODELS_PATH):
    print("Loading base models from Checkpoint 3...")
    with open(CP3_MODELS_PATH, 'rb') as f:
        models = cloudpickle.load(f)
    print(f"{len(models)} base models loaded.")
else:
    print("Checkpoint 3 not found. Training LightGBM models...")
    models = {}
    for target in tqdm(TARGET_CANDIDATES, desc="Training base models"):
        print(f"Training model for {target}...")
        train_target_filtered = train.dropna(subset=[target])
        
        lgbm_params = {
            'n_estimators': 2000,
            'learning_rate': 0.01,
            'max_depth': 5,
            'num_leaves': 2**4-1,
            'colsample_bytree': 0.1,
            'random_state': 42,
            'n_jobs': -1
        }
        
        model = lgb.LGBMRegressor(**lgbm_params)
        # Ensure feature columns exist before fitting
        train_features = train_target_filtered[[col for col in feature_cols if col in train_target_filtered.columns]].fillna(0.5)
        model.fit(train_features, train_target_filtered[target].astype(np.float32)) # Ensure target is float32
        models[target] = model
        gc.collect()

    # Save Checkpoint 3
    print("Saving Checkpoint 3...")
    with open(CP3_MODELS_PATH, 'wb') as f:
        cloudpickle.dump(models, f)
    print("Checkpoint 3 saved.")

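# --- Base Model Evaluation (always run after loading/training) ---
# numerai_corr is not defined or imported in the cells above. It is assumed here to be
# the scoring function from the numerai-tools package, which the setup cell does not install.
!pip install -q --no-deps numerai-tools
from numerai_tools.scoring import numerai_corr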
print("\nGenerating validation predictions for base models...")
validation_preds = pd.DataFrame(index=validation.index)
for target_name, model in models.items():
    pred_col_name = f"prediction_{target_name}"
    validation_features = validation[feature_cols].fillna(0.5)
    validation_preds[pred_col_name] = model.predict(validation_features)

# Join predictions, handling potential duplicate column names if cell is re-run
validation = validation.drop(columns=validation_preds.columns, errors='ignore').join(validation_preds)
prediction_cols = list(validation_preds.columns)

print("\nValidation predictions generated:")
display(validation[prediction_cols].head())

print("\nEvaluating base model correlations...")
validation_eval_base = validation.dropna(subset=[MAIN_TARGET] + prediction_cols)
if validation_eval_base.empty:
    print("Warning: No valid rows for base model correlation evaluation.")
    correlations = pd.DataFrame(columns=prediction_cols)
    cumsum_corrs = pd.DataFrame(columns=prediction_cols)
else:
    correlations = validation_eval_base.groupby(ERA_COL).apply(
        lambda d: numerai_corr(d[prediction_cols], d[MAIN_TARGET])
    )
    cumsum_corrs = correlations.cumsum()
    plt.figure(figsize=(10, 6))
    cumsum_corrs.plot(ax=plt.gca())
    plt.title("Cumulative Correlation of Base Model Validation Predictions")
    plt.xlabel("Era")
    plt.ylabel("Cumulative Correlation")
    plt.xticks([])
    plt.legend(title="Model Target")
    plt.grid(True, linestyle='--', alpha=0.5)
    plt.show()

print("\nSummary metrics for base models:")
def get_summary_metrics(scores, cumsum_scores):
    """Mean, standard deviation, Sharpe ratio, and maximum drawdown of per-era scores."""
    mean = scores.mean()
    std = scores.std()
    sharpe = mean / std if std != 0 else np.nan
    if not cumsum_scores.empty:
        rolling_max = cumsum_scores.expanding(min_periods=1).max()
        max_drawdown = (rolling_max - cumsum_scores).max()
    else:
        max_drawdown = np.nan
    return {
        "mean": mean,
        "std": std,
        "sharpe": sharpe,
        "max_drawdown": max_drawdown,
    }

base_model_summary = {}
for pred_col in prediction_cols:
    if pred_col in correlations.columns:
        base_model_summary[pred_col] = get_summary_metrics(correlations[pred_col], cumsum_corrs[pred_col])
    else:
        base_model_summary[pred_col] = {'mean': np.nan, 'std': np.nan, 'sharpe': np.nan, 'max_drawdown': np.nan}

summary_df_base = pd.DataFrame(base_model_summary).T
display(summary_df_base)
gc.collect()
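
The summary metrics can be checked by hand on a short series. The five per-era scores below are invented for the example; the arithmetic mirrors `get_summary_metrics`, using NumPy in place of pandas.

```python
import numpy as np

era_scores = np.array([0.02, -0.01, 0.03, -0.04, 0.01])  # hypothetical per-era correlations

mean = era_scores.mean()
std = era_scores.std(ddof=1)  # ddof=1 matches the default of pandas .std()
sharpe = mean / std

# Drawdown: how far the cumulative score sits below its running peak
cumulative = era_scores.cumsum()
running_max = np.maximum.accumulate(cumulative)
max_drawdown = (running_max - cumulative).max()

print(f"mean={mean:.4f} std={std:.4f} sharpe={sharpe:.4f} max_drawdown={max_drawdown:.4f}")
```

Here the cumulative score peaks at 0.04 after the third era and falls to 0.00 after the fourth, so the maximum drawdown is 0.04. The mean is 0.002 and the Sharpe ratio is about 0.072.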

## Part 4: Stacked Ensembling

Generate out-of-fold (OOF) predictions for each base model using era-grouped cross-validation, then train a meta-model on those predictions.
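
The two steps can be sketched end to end on synthetic data. Everything here is a simplified stand-in: eras are assigned to folds round-robin instead of with scikit-learn's `GroupKFold`, an ordinary least-squares fit replaces LightGBM, and there is a single base model, so the meta-model sees one OOF column rather than one per target.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features, n_eras, n_folds = 600, 5, 12, 3

# Synthetic features, target, and era labels
X = rng.normal(size=(n_rows, n_features))
y = X @ rng.normal(size=n_features) + rng.normal(scale=0.5, size=n_rows)
eras = rng.integers(0, n_eras, size=n_rows)

# Assign whole eras to folds so that no era is split between train and validation
fold_of_era = {int(era): i % n_folds for i, era in enumerate(np.unique(eras))}
fold_ids = np.array([fold_of_era[int(e)] for e in eras])


def fit_linear(X_fit, y_fit):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([X_fit, np.ones(len(X_fit))])
    coef, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    return coef


def predict_linear(coef, X_new):
    """Apply coefficients from fit_linear to new rows."""
    return np.column_stack([X_new, np.ones(len(X_new))]) @ coef


# Each row is predicted by a model that never saw its era
oof = np.full(n_rows, np.nan)
for fold in range(n_folds):
    val_mask = fold_ids == fold
    coef = fit_linear(X[~val_mask], y[~val_mask])
    oof[val_mask] = predict_linear(coef, X[val_mask])

# Ridge meta-model on the OOF column (closed form; the intercept is not penalized)
alpha = 1.0
Z = np.column_stack([oof, np.ones(n_rows)])
penalty = alpha * np.diag([1.0, 0.0])
meta_coef = np.linalg.solve(Z.T @ Z + penalty, Z.T @ y)
print(meta_coef)
```

Training the meta-model on OOF predictions rather than in-sample predictions keeps it from rewarding base models that have simply memorized their training rows.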

In [None]:
meta_model = None
scaler = None
oof_preds = None

# Check if Checkpoint 4 exists
if USE_STACKING and os.path.exists(CP4_OOF_PATH) and os.path.exists(CP4_META_MODEL_PATH):
    print("Loading OOF predictions and meta-model from Checkpoint 4...")
    oof_preds = pd.read_parquet(CP4_OOF_PATH).set_index(ID_COL)
    with open(CP4_META_MODEL_PATH, 'rb') as f:
        meta_model_data = cloudpickle.load(f)
        meta_model = meta_model_data['meta_model']
        if STACKING_MODEL_TYPE == 'Linear':
            scaler = meta_model_data.get('scaler') # Load scaler if it exists
            fitted_transformers['stacking_scaler'] = scaler
    fitted_transformers['meta_model'] = meta_model
    print("OOF predictions and meta-model loaded.")

elif USE_STACKING:
    print("Checkpoint 4 not found. Generating OOF predictions and training meta-model...")
    gkf = GroupKFold(n_splits=N_FOLDS)
    oof_preds = pd.DataFrame(index=train.index)

    for target_name, model in tqdm(models.items(), desc="Generating OOF preds"):
        print(f" Generating OOF for {target_name}...")
        oof_preds_target = pd.Series(index=train.index, dtype=np.float32)
        train_target_filtered = train.dropna(subset=[target_name])
        
        for fold, (train_idx_filtered, val_idx_filtered) in enumerate(gkf.split(train_target_filtered[feature_cols], train_target_filtered[target_name], groups=train_target_filtered[ERA_COL])):
            train_index_orig = train_target_filtered.iloc[train_idx_filtered].index
            val_index_orig = train_target_filtered.iloc[val_idx_filtered].index
            X_train_fold, X_val_fold = train.loc[train_index_orig, feature_cols], train.loc[val_index_orig, feature_cols]
            y_train_fold = train.loc[train_index_orig, target_name].astype(np.float32) # Same float32 target dtype as the base models
            fold_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.01, max_depth=5, num_leaves=2**4-1, colsample_bytree=0.1, random_state=fold, n_jobs=-1)
            fold_model.fit(X_train_fold, y_train_fold)
            oof_preds_target.loc[val_index_orig] = fold_model.predict(X_val_fold)
            
        oof_preds[f"oof_{target_name}"] = oof_preds_target
        gc.collect()
    
    print("OOF predictions generated.")
    display(oof_preds.head())

    # Prepare training data for the meta-model
    meta_train_features = oof_preds.copy()
    meta_train_features[ERA_COL] = train[ERA_COL]
    meta_train_target = train[MAIN_TARGET]
    valid_indices = meta_train_target.notna() & meta_train_features.notna().all(axis=1)
    meta_train_features = meta_train_features.loc[valid_indices].copy()
    meta_train_target = meta_train_target.loc[valid_indices].copy()
    oof_feature_cols = list(oof_preds.columns)

    # Train the meta-model
    print(f"\nTraining meta-model ({STACKING_MODEL_TYPE})...")
    meta_model_to_save = {}
    if STACKING_MODEL_TYPE == 'LGBM':
        meta_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.01, max_depth=3, num_leaves=2**3-1, colsample_bytree=0.8, random_state=42, n_jobs=-1)
        meta_model.fit(meta_train_features[oof_feature_cols], meta_train_target.astype(np.float32))
        meta_model_to_save['meta_model'] = meta_model
    elif STACKING_MODEL_TYPE == 'Linear':
        scaler = StandardScaler()
        meta_train_features_scaled = scaler.fit_transform(meta_train_features[oof_feature_cols])
        meta_model = Ridge(alpha=1.0, random_state=42)
        meta_model.fit(meta_train_features_scaled, meta_train_target.astype(np.float32))
        meta_model_to_save['meta_model'] = meta_model
        meta_model_to_save['scaler'] = scaler
        fitted_transformers['stacking_scaler'] = scaler
    else:
        raise ValueError("Invalid STACKING_MODEL_TYPE")
    
    fitted_transformers['meta_model'] = meta_model
    print("Meta-model trained.")

    # Save Checkpoint 4
    print("Saving Checkpoint 4...")
    oof_preds.reset_index().to_parquet(CP4_OOF_PATH)
    with open(CP4_META_MODEL_PATH, 'wb') as f:
        cloudpickle.dump(meta_model_to_save, f)
    print("Checkpoint 4 saved.")
    gc.collect()
else:
    print("Skipping Stacking Ensemble based on configuration (USE_STACKING=False).")

# --- Stacked Ensemble Evaluation (if run) ---
if USE_STACKING and meta_model is not None:
    print("\nGenerating stacked predictions on validation set for evaluation...")
    oof_feature_cols = [f"oof_{t}" for t in TARGET_CANDIDATES] # Define OOF columns based on candidates
    meta_val_features = validation[[f"prediction_{t}" for t in TARGET_CANDIDATES]].copy()
    meta_val_features.columns = oof_feature_cols # Rename columns
    meta_val_features = meta_val_features.fillna(meta_val_features.mean())

    if STACKING_MODEL_TYPE == 'Linear':
        scaler_val = fitted_transformers['stacking_scaler']
        meta_val_features_scaled = scaler_val.transform(meta_val_features)
        stacked_preds = meta_model.predict(meta_val_features_scaled)
    else: # LGBM
        stacked_preds = meta_model.predict(meta_val_features)

    validation["prediction_stacked"] = stacked_preds
    print("Stacked predictions generated for validation.")
    display(validation[["prediction_stacked"]].head())

    print("\nEvaluating stacked ensemble performance...")
    evaluation_cols_stacking = prediction_cols + ["prediction_stacked"]
    validation_eval_stacking = validation.dropna(subset=[MAIN_TARGET] + evaluation_cols_stacking)

    if validation_eval_stacking.empty:
        print("Warning: No valid rows for stacking evaluation.")
    else:
        stacked_correlations = validation_eval_stacking.groupby(ERA_COL).apply(
            lambda d: numerai_corr(d[evaluation_cols_stacking], d[MAIN_TARGET])
        )
        stacked_cumsum_corrs = stacked_correlations.cumsum()
        plt.figure(figsize=(10, 6))
        stacked_cumsum_corrs.plot(ax=plt.gca())
        plt.title("Cumulative Correlation including Stacked Ensemble")
        plt.xlabel("Era")
        plt.ylabel("Cumulative Correlation")
        plt.xticks([])
        plt.legend(title="Model")
        plt.grid(True, linestyle='--', alpha=0.5)
        plt.show()

        print("\nSummary metrics including Stacked Ensemble:")
        stacked_summary = {}
        for pred_col in evaluation_cols_stacking:
            if pred_col in stacked_correlations.columns:
                stacked_summary[pred_col] = get_summary_metrics(stacked_correlations[pred_col], stacked_cumsum_corrs[pred_col])
            else:
                stacked_summary[pred_col] = {'mean': np.nan, 'std': np.nan, 'sharpe': np.nan, 'max_drawdown': np.nan}
        stacked_summary_df = pd.DataFrame(stacked_summary).T
        display(stacked_summary_df)
    gc.collect()

## Part 5: Era-Invariant Training (PyTorch MLP Option)

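Define and train a PyTorch MLP whose loss combines negative Pearson correlation with two penalties: the variance of per-era correlations, and the mean squared correlation between predictions and the top original features (feature exposure).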
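
As a compact summary of the objective that `train_mlp` minimizes on each batch, the loss can be written as follows. The notation is illustrative and does not appear in the notebook itself:

$$
\mathcal{L} = -\rho(\hat{y}, y) + \lambda_1 \, \operatorname{Var}_{e}\left[\rho(\hat{y}_e, y_e)\right] + \lambda_2 \, \frac{1}{K} \sum_{k=1}^{K} \rho(\hat{y}, f_k)^2
$$

Here $\rho$ is the Pearson correlation, $\hat{y}_e$ and $y_e$ are the predictions and targets of era $e$ within the batch, $f_k$ is the $k$-th of the $K$ selected original features, $\lambda_1$ is `VARIANCE_PENALTY_WEIGHT`, and $\lambda_2$ is `FEATURE_EXPOSURE_WEIGHT`.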

### MLP and Custom Loss Function Definitions
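
To make the era-variance penalty concrete before the PyTorch definitions, the sketch below computes one Pearson correlation per era and then the sample variance of those correlations. It uses plain Python and made-up toy data instead of tensors; the helper names `pearson` and `era_corr_variance` are hypothetical and are not part of the notebook.

```python
import math
from statistics import variance

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def era_corr_variance(preds, target, eras):
    """Sample variance of per-era correlations; 0.0 with fewer than two usable eras."""
    by_era = {}
    for p, t, e in zip(preds, target, eras):
        by_era.setdefault(e, ([], []))
        by_era[e][0].append(p)
        by_era[e][1].append(t)
    corrs = [pearson(ps, ts) for ps, ts in by_era.values() if len(ps) > 1]
    return variance(corrs) if len(corrs) > 1 else 0.0

# Toy data: era 1 is ranked correctly, era 2 is ranked in reverse.
preds = [0.1, 0.2, 0.3, 0.1, 0.2, 0.3]
target = [0.0, 0.5, 1.0, 1.0, 0.5, 0.0]
eras = [1, 1, 1, 2, 2, 2]

# Correlations of roughly +1 and -1 across eras give a large penalty.
unstable = era_corr_variance(preds, target, eras)
# The same correlation in both eras gives a penalty of zero.
stable = era_corr_variance(preds, target[:3] * 2, eras)
print(unstable, stable)
```

A model whose per-era correlations swing between positive and negative is penalized, while one with the same correlation in every era is not, regardless of how high that correlation is.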

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
from scipy.stats import rankdata

# --- Define MLP Architecture ---
class SimpleMLP(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.BatchNorm1d(input_dim), 
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.BatchNorm1d(256),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.BatchNorm1d(128),
            nn.Linear(128, 1),
            nn.Sigmoid()
        )
    def forward(self, x):
        return self.layers(x)

# --- Define Custom Loss Functions ---
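def pearson_corr(preds, target):
    # Pearson correlation of two tensors; returns 0 where the result would be NaN
    preds = preds.squeeze()
    target = target.squeeze()
    preds_mean = torch.mean(preds)
    target_mean = torch.mean(target)
    cov = torch.mean((preds - preds_mean) * (target - target_mean))
    # Population std (unbiased=False) matches the population covariance computed above
    preds_std = torch.std(preds, unbiased=False)
    target_std = torch.std(target, unbiased=False)
    epsilon = 1e-6
    corr = cov / (preds_std * target_std + epsilon)
    return torch.nan_to_num(corr, nan=0.0)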

def era_correlation_variance_penalty(preds, target, eras):
    unique_eras = torch.unique(eras)
    era_corrs = []
    for era in unique_eras:
        era_mask = (eras == era)
        era_preds = preds[era_mask]
        era_target = target[era_mask]
        if len(era_preds) > 1:
            era_corrs.append(pearson_corr(era_preds, era_target))
    if len(era_corrs) > 1:
        era_corrs_tensor = torch.stack(era_corrs)
        valid_corrs = era_corrs_tensor[~torch.isnan(era_corrs_tensor)]
        if len(valid_corrs) > 1:
            return torch.var(valid_corrs)
    return torch.tensor(0.0, device=preds.device)

def feature_exposure_penalty(preds, features):
    num_features = features.shape[1]
    feature_corrs_sq = []
    preds_squeezed = preds.squeeze()
    for i in range(num_features):
        feature_col = features[:, i]
        if torch.std(feature_col) > 1e-6:
            corr = pearson_corr(preds_squeezed, feature_col)
            if not torch.isnan(corr):
                feature_corrs_sq.append(corr**2)
    if len(feature_corrs_sq) > 0:
        return torch.mean(torch.stack(feature_corrs_sq))
    return torch.tensor(0.0, device=preds.device)

# --- Training Loop Function ---
def train_mlp(train_df, feature_cols, target_col, era_col, original_feature_cols, top_n_features=TOP_N_FEATURES_FOR_EXPOSURE):
    print("\nTraining PyTorch MLP with custom loss...")
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")
    train_target_filtered = train_df.dropna(subset=[target_col])
    features = torch.tensor(train_target_filtered[feature_cols].fillna(0.5).values, dtype=torch.float32).to(device)
    target = torch.tensor(train_target_filtered[target_col].values, dtype=torch.float32).unsqueeze(1).to(device)
    eras = torch.tensor(train_target_filtered[era_col].astype(int).values, dtype=torch.long).to(device)
    feature_corrs = train_target_filtered[original_feature_cols].corrwith(train_target_filtered[target_col].astype(float))
    top_feature_names = feature_corrs.abs().nlargest(top_n_features).index
    top_features_tensor = torch.tensor(train_target_filtered[top_feature_names].fillna(0.5).values, dtype=torch.float32).to(device)
    print(f"Using top {len(top_feature_names)} original features for exposure penalty.")
    dataset = TensorDataset(features, target, eras, top_features_tensor)
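    # drop_last avoids a trailing single-sample batch, which BatchNorm1d rejects in training mode
    dataloader = DataLoader(dataset, batch_size=MLP_BATCH_SIZE, shuffle=True, drop_last=True)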
    input_dim = len(feature_cols)
    mlp_model_local = SimpleMLP(input_dim).to(device)
    optimizer = optim.Adam(mlp_model_local.parameters(), lr=MLP_LR)
    lambda1 = VARIANCE_PENALTY_WEIGHT
    lambda2 = FEATURE_EXPOSURE_WEIGHT
    for epoch in tqdm(range(MLP_EPOCHS), desc="Training MLP"):
        epoch_loss = 0.0
        mlp_model_local.train()
        for batch_features, batch_target, batch_eras, batch_top_features in dataloader:
            optimizer.zero_grad()
            preds = mlp_model_local(batch_features)
            corr_loss = -pearson_corr(preds, batch_target)
            var_penalty = era_correlation_variance_penalty(preds, batch_target, batch_eras)
            exposure_penalty = feature_exposure_penalty(preds, batch_top_features)
            total_loss = corr_loss + lambda1 * var_penalty + lambda2 * exposure_penalty
            if torch.isnan(total_loss):
                continue
            total_loss.backward()
            optimizer.step()
            epoch_loss += total_loss.item()
        avg_epoch_loss = epoch_loss / len(dataloader) if len(dataloader) > 0 else 0
        print(f"Epoch [{epoch+1}/{MLP_EPOCHS}], Loss: {avg_epoch_loss:.6f}")
    print("MLP training finished.")
    return mlp_model_local.to('cpu')

print("MLP and custom loss functions defined.")

In [None]:
mlp_model = None # Initialize

# Check if Checkpoint 5 exists
if USE_MLP and os.path.exists(CP5_MLP_MODEL_PATH):
    print("Loading MLP model from Checkpoint 5...")
    try:
        with open(CP5_MLP_MODEL_PATH, 'rb') as f:
            mlp_model = cloudpickle.load(f)
        fitted_transformers['mlp_model'] = mlp_model
        print("MLP model loaded.")
    except Exception as e:
        print(f"Error loading MLP model from checkpoint: {e}. Will retrain if USE_MLP is True.")
        mlp_model = None

# Train MLP if configured and not loaded from checkpoint
if USE_MLP and mlp_model is None:
    try:
        mlp_model = train_mlp(train, feature_cols, MAIN_TARGET, ERA_COL, original_feature_cols)
        # Save Checkpoint 5
        print("Saving Checkpoint 5...")
        with open(CP5_MLP_MODEL_PATH, 'wb') as f:
            cloudpickle.dump(mlp_model, f)
        fitted_transformers['mlp_model'] = mlp_model
        print("Checkpoint 5 saved.")
    except ImportError:
        print("PyTorch not found. Skipping MLP training. Set USE_MLP=False or install PyTorch.")
        USE_MLP = False
    except Exception as e:
        print(f"An error occurred during MLP training: {e}")
        mlp_model = None
        USE_MLP = False
elif not USE_MLP:
    print("Skipping MLP Training based on configuration (USE_MLP=False).")

# --- MLP Evaluation (if trained/loaded) ---
if USE_MLP and mlp_model is not None:
    print("\nGenerating MLP predictions on validation set for evaluation...")
    mlp_model.eval()
    with torch.no_grad():
        val_features_tensor = torch.tensor(validation[feature_cols].fillna(0.5).values, dtype=torch.float32)
        mlp_preds = mlp_model(val_features_tensor).numpy().squeeze()
    validation["prediction_mlp"] = mlp_preds
    print("MLP predictions generated.")
    display(validation[["prediction_mlp"]].head())
    
    print("\nEvaluating MLP performance...")
    evaluation_cols_mlp = [f"prediction_{MAIN_TARGET}", "prediction_mlp"]
    validation_eval_mlp = validation.dropna(subset=[MAIN_TARGET] + evaluation_cols_mlp)
    
    if validation_eval_mlp.empty:
        print("Warning: No valid rows for MLP evaluation.")
    else:
        mlp_correlations = validation_eval_mlp.groupby(ERA_COL).apply(
            lambda d: numerai_corr(d[evaluation_cols_mlp], d[MAIN_TARGET])
        )
        mlp_cumsum_corrs = mlp_correlations.cumsum()
        plt.figure(figsize=(10, 6))
        mlp_cumsum_corrs.plot(ax=plt.gca())
        plt.title("Cumulative Correlation of MLP vs Base Model")
        plt.xlabel("Era")
        plt.ylabel("Cumulative Correlation")
        plt.xticks([])
        plt.legend(title="Model")
        plt.grid(True, linestyle='--', alpha=0.5)
        plt.show()

        print("\nSummary metrics for MLP:")
        mlp_summary = {}
        for pred_col in evaluation_cols_mlp:
            if pred_col in mlp_correlations.columns:
                mlp_summary[pred_col] = get_summary_metrics(mlp_correlations[pred_col], mlp_cumsum_corrs[pred_col])
            else:
                mlp_summary[pred_col] = {'mean': np.nan, 'std': np.nan, 'sharpe': np.nan, 'max_drawdown': np.nan}
        mlp_summary_df = pd.DataFrame(mlp_summary).T
        display(mlp_summary_df)
    gc.collect()

## Part 6: Final Model Evaluation & Prediction Function

Determine the final model based on configuration flags and evaluate its performance.
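
The selection order used in the next cell (MLP first, then the stacked ensemble, then the base model) can be sketched as a small standalone function. The name `select_final_pred_col` and the toy column list are hypothetical, and the sketch checks only for the presence of prediction columns, whereas the notebook cell also checks that the model objects exist.

```python
def select_final_pred_col(use_mlp, use_stacking, available_cols, main_target):
    """Return the prediction column to evaluate, in MLP > stacked > base order."""
    if use_mlp and "prediction_mlp" in available_cols:
        return "prediction_mlp"
    if use_stacking and "prediction_stacked" in available_cols:
        return "prediction_stacked"
    base_col = f"prediction_{main_target}"
    if base_col not in available_cols:
        raise ValueError("No valid prediction column available.")
    return base_col

cols = ["prediction_target", "prediction_stacked", "prediction_mlp"]

# Both flags set: the MLP takes precedence over the stacked ensemble.
choice_both = select_final_pred_col(True, True, cols, "target")
# Only stacking enabled.
choice_stack = select_final_pred_col(False, True, cols, "target")
# MLP requested but no MLP predictions exist: fall back to the base model.
choice_fallback = select_final_pred_col(True, False, ["prediction_target"], "target")
print(choice_both, choice_stack, choice_fallback)
```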

In [None]:
# Determine the final prediction column based on flags and successful execution
if USE_MLP and mlp_model is not None and "prediction_mlp" in validation.columns:
    final_pred_col = "prediction_mlp"
    comparison_cols = [f"prediction_{MAIN_TARGET}", final_pred_col]
    print(f"Final model selected: MLP ({final_pred_col})")
elif USE_STACKING and meta_model is not None and "prediction_stacked" in validation.columns:
    final_pred_col = "prediction_stacked"
    comparison_cols = prediction_cols + [final_pred_col] # Compare stacker to base models
    print(f"Final model selected: Stacked Ensemble ({final_pred_col})")
else:
    final_pred_col = f"prediction_{MAIN_TARGET}"
    comparison_cols = [final_pred_col]
    print(f"Final model selected: Base model for {MAIN_TARGET} ({final_pred_col})")
    if final_pred_col not in validation.columns:
        raise ValueError("Could not find a valid prediction column for final evaluation.")

# Ensure all columns for comparison exist
existing_comparison_cols = [col for col in comparison_cols if col in validation.columns]
if not existing_comparison_cols:
    raise ValueError("No valid columns found for final comparison evaluation.")

print(f"\nEvaluating final model performance (comparing: {existing_comparison_cols})...")
validation_eval_final = validation.dropna(subset=[MAIN_TARGET] + existing_comparison_cols)

if validation_eval_final.empty:
    print("Warning: No valid rows for final evaluation.")
else:
    final_correlations = validation_eval_final.groupby(ERA_COL).apply(
        lambda d: numerai_corr(d[existing_comparison_cols], d[MAIN_TARGET])
    )
    final_cumsum_corrs = final_correlations.cumsum()
    plt.figure(figsize=(10, 6))
    final_cumsum_corrs.plot(ax=plt.gca())
    plt.title(f"Cumulative Correlation of Final Model ({final_pred_col}) vs Others")
    plt.xlabel("Era")
    plt.ylabel("Cumulative Correlation")
    plt.xticks([])
    plt.legend(title="Model")
    plt.grid(True, linestyle='--', alpha=0.5)
    plt.show()

    print("\nFinal Summary Metrics:")
    final_summary = {}
    for pred_col in existing_comparison_cols:
        if pred_col in final_correlations.columns:
            final_summary[pred_col] = get_summary_metrics(final_correlations[pred_col], final_cumsum_corrs[pred_col])
        else:
            final_summary[pred_col] = {'mean': np.nan, 'std': np.nan, 'sharpe': np.nan, 'max_drawdown': np.nan}
    final_summary_df = pd.DataFrame(final_summary).T
    display(final_summary_df)

# --- Define Final Prediction Function ---
# This function now relies on the 'dependencies' dict passed during pickling/loading
def predict_final(live_features: pd.DataFrame, dependencies: dict) -> pd.DataFrame:
    """Generates predictions using the chosen final model and pre-fitted FE objects."""
    print("Starting final prediction function...")
    # Load dependencies
    original_feature_cols = dependencies['original_feature_cols']
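    feature_cols = dependencies['feature_cols']
    # Read from dependencies rather than relying on a notebook-level global
    engineered_feature_cols = dependencies.get('engineered_feature_cols', [])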
    umap_reducer = dependencies.get('umap_reducer')
    ae_encoder = dependencies.get('ae_encoder')
    qt = dependencies.get('qt')
    synthetic_mean = dependencies.get('synthetic_mean')
    # Reconstruct feature names from dependencies if needed
    umap_feature_names = [f"umap_feat_{i}" for i in range(dependencies['UMAP_N_COMPONENTS'])] if umap_reducer else []
    ae_feature_names = [f"ae_feat_{i}" for i in range(dependencies['AE_ENCODING_DIM'])] if ae_encoder else []
    contrastive_feature_names = [f"contrastive_feat_{i}" for i in range(dependencies['CONTRASTIVE_EMB_DIM'])]
    ctgan_feature_names = [f"dist_to_synth_mean_{dependencies['MAIN_TARGET']}"] if dependencies.get('ctgan_model') else []

    models_dep = dependencies['models']
    meta_model_dep = dependencies.get('meta_model')
    scaler_dep = dependencies.get('stacking_scaler')
    mlp_model_dep = dependencies.get('mlp_model')
    USE_MLP_FLAG = dependencies['USE_MLP']
    USE_STACKING_FLAG = dependencies['USE_STACKING']
    STACKING_MODEL_TYPE_FLAG = dependencies['STACKING_MODEL_TYPE']
    MAIN_TARGET_NAME = dependencies['MAIN_TARGET']
    PREDICTION_COL_NAME = dependencies['PREDICTION_COL']

    # Apply feature engineering transformations
    print("Applying feature engineering to live data...")
    live_features_eng = live_features.copy()
    live_data_orig = live_features_eng[original_feature_cols].astype(np.float32).fillna(0.5)

    if umap_reducer and umap_feature_names:
        print(" Applying UMAP...")
        umap_features_live = umap_reducer.transform(live_data_orig).astype(np.float32)
        live_features_eng[umap_feature_names] = umap_features_live
    if ae_encoder and ae_feature_names:
        print(" Applying AE...")
        ae_features_live = ae_encoder.predict(live_data_orig.values).astype(np.float32)
        live_features_eng[ae_feature_names] = ae_features_live
    if contrastive_feature_names:
        print(" Applying Contrastive (placeholder)...")
        num_samples_live = len(live_features_eng)
        contrastive_features_live = np.random.rand(num_samples_live, len(contrastive_feature_names)).astype(np.float32)
        live_features_eng[contrastive_feature_names] = contrastive_features_live
    if ctgan_feature_names and synthetic_mean is not None:
        print(" Applying CTGAN distance feature...")
        distances_live = np.linalg.norm(live_data_orig.values - synthetic_mean.values, axis=1).astype(np.float32)
        live_features_eng[ctgan_feature_names[0]] = distances_live
    print("Feature engineering applied.")

    # Ensure all feature columns exist
    for col in feature_cols:
        if col not in live_features_eng.columns:
            live_features_eng[col] = 0.5 # Fill missing engineered features
    # Convert dtypes for consistency before prediction
    for col in original_feature_cols:
        if col in live_features_eng.columns:
            live_features_eng[col] = live_features_eng[col].astype(np.int8)
    for col in engineered_feature_cols:
        if col in live_features_eng.columns:
            live_features_eng[col] = live_features_eng[col].astype(np.float32)

    live_features_eng = live_features_eng[feature_cols].fillna(0.5)

    # Prediction Logic
    if USE_MLP_FLAG and mlp_model_dep:
        print("Generating predictions using MLP model...")
        mlp_model_dep.eval()
        with torch.no_grad():
            live_features_tensor = torch.tensor(live_features_eng.values, dtype=torch.float32)
            predictions = mlp_model_dep(live_features_tensor).numpy().squeeze()
        submission_df = pd.DataFrame({'prediction': predictions}, index=live_features.index)
    elif USE_STACKING_FLAG and meta_model_dep:
        print("Generating predictions using Stacked Ensemble...")
        base_preds_live = pd.DataFrame(index=live_features.index)
        oof_cols = [f"oof_{t}" for t in models_dep.keys()]
        for target_name, model in models_dep.items():
            base_preds_live[f"oof_{target_name}"] = model.predict(live_features_eng)
        base_preds_live = base_preds_live.fillna(base_preds_live.mean())
        if STACKING_MODEL_TYPE_FLAG == 'Linear' and scaler_dep:
            base_preds_live_scaled = scaler_dep.transform(base_preds_live[oof_cols])
            stacked_preds_live = meta_model_dep.predict(base_preds_live_scaled)
        else:
            stacked_preds_live = meta_model_dep.predict(base_preds_live[oof_cols])
        submission_df = pd.DataFrame({'prediction': stacked_preds_live}, index=live_features.index)
    else:
        print(f"Generating predictions using base model for {MAIN_TARGET_NAME}...")
        predictions = models_dep[MAIN_TARGET_NAME].predict(live_features_eng)
        submission_df = pd.DataFrame({'prediction': predictions}, index=live_features.index)

    ranked_submission = submission_df['prediction'].rank(pct=True, method="first")
    print("Final predictions generated and ranked.")
    return ranked_submission.to_frame(PREDICTION_COL_NAME)

print("Final prediction function defined.")

## Part 7: Model Pickling for Upload

Pickle the final prediction function and its dependencies. 
**Important:** When running this notebook in parts, ensure all necessary objects (from previous checkpoints) are loaded into the current session's `fitted_transformers` and `models` dictionaries before executing this cell.
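
The cell below stores `predict_final` and its dependency dictionary together in one pickled payload. If the consumer of the pickle instead expects a single callable that takes only the live features (an assumption worth checking against the current Numerai model-upload documentation), one option is to bind the dictionary to the function first. The sketch uses a hypothetical stand-in function and plain lists so that it runs without the notebook's objects.

```python
from functools import partial

def predict_with_deps(live_rows, dependencies):
    """Hypothetical stand-in for predict_final: weights each row's sum."""
    weight = dependencies["weight"]
    return [weight * sum(row) for row in live_rows]

# Hypothetical dependency dictionary; the notebook's version holds models and transformers.
deps = {"weight": 0.5}

# Bind the dependencies so callers only need to pass the live rows.
predict = partial(predict_with_deps, dependencies=deps)

result = predict([[1.0, 2.0], [3.0, 5.0]])
print(result)
```

The same pattern would apply to the notebook's function as `partial(predict_final, dependencies=test_dependencies)`, after which the bound callable could be passed to `cloudpickle.dump`.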

In [None]:
# --- Quick Test on Live Data (using the final function) ---
print("Downloading live features for testing...")
napi.download_dataset(f"{DATA_VERSION}/live.parquet")
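live_features = pd.read_parquet(f"{DATA_VERSION}/live.parquet", columns=original_feature_cols)
# The id is normally stored as the parquet index and restored by read_parquet;
# only call set_index if it came back as a regular column.
if ID_COL in live_features.columns:
    live_features = live_features.set_index(ID_COL)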

# Prepare dependencies dictionary for the test
# Ensure all required fitted objects are loaded into fitted_transformers from checkpoints
test_dependencies = {
    'models': models,
    'umap_reducer': fitted_transformers.get('umap_reducer'),
    'ae_encoder': fitted_transformers.get('ae_encoder'),
    'qt': fitted_transformers.get('qt'),
    'synthetic_mean': fitted_transformers.get('synthetic_mean'),
    'meta_model': fitted_transformers.get('meta_model'),
    'stacking_scaler': fitted_transformers.get('stacking_scaler'),
    'mlp_model': fitted_transformers.get('mlp_model'),
    'original_feature_cols': original_feature_cols,
    'engineered_feature_cols': engineered_feature_cols,
    'feature_cols': feature_cols,
    'ae_feats': [f"ae_feat_{i}" for i in range(AE_ENCODING_DIM)] if fitted_transformers.get('ae_encoder') else [], 
    'umap_feats': [f"umap_feat_{i}" for i in range(UMAP_N_COMPONENTS)] if fitted_transformers.get('umap_reducer') else [],
    'contrastive_feats': [f"contrastive_feat_{i}" for i in range(CONTRASTIVE_EMB_DIM)],
    'ctgan_feats': [f"dist_to_synth_mean_{MAIN_TARGET}"] if fitted_transformers.get('ctgan_model') else [],
    'UMAP_N_COMPONENTS': UMAP_N_COMPONENTS,
    'AE_ENCODING_DIM': AE_ENCODING_DIM,
    'CONTRASTIVE_EMB_DIM': CONTRASTIVE_EMB_DIM,
    'USE_MLP': USE_MLP,
    'USE_STACKING': USE_STACKING,
    'STACKING_MODEL_TYPE': STACKING_MODEL_TYPE,
    'MAIN_TARGET': MAIN_TARGET,
    'PREDICTION_COL': PREDICTION_COL
}

# Generate predictions using the final function
final_predictions = predict_final(live_features, test_dependencies)

print("\nSample of final predictions:")
display(final_predictions.head())

# --- Pickle the Prediction Function and Dependencies ---
print("\nPickling the prediction function...")
try:
    # Define the dictionary containing the function and its necessary dependencies
    # Use the same 'test_dependencies' dictionary structure
    pickle_payload = {
        'predict_fn': predict_final,
        'dependencies': test_dependencies # Pass the constructed dictionary
    }
    
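    # Ask cloudpickle to serialize these libraries' functions and classes by value rather than
    # by reference. Each module must already be imported under the name used here
    # (umap, tf, torch, ctgan); otherwise the NameError handler below reports the failure.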
    cloudpickle.register_pickle_by_value(umap)
    cloudpickle.register_pickle_by_value(tf)
    cloudpickle.register_pickle_by_value(torch)
    cloudpickle.register_pickle_by_value(ctgan)
    
    # Pickle the payload
    with open("predict_final_model.pkl", "wb") as f:
        cloudpickle.dump(pickle_payload, f)
    print("Prediction function and dependencies pickled successfully to predict_final_model.pkl")

except NameError as e:
    print(f"Pickling failed: A required object might not be defined. Error: {e}")
    print("Ensure all models and transformers used in 'predict_final' are trained/loaded and available in 'test_dependencies'.")
except Exception as e:
    print(f"An unexpected error occurred during pickling: {e}")

# --- Download Final Pickle File ---
try:
    from google.colab import files
    files.download('predict_final_model.pkl')
except ImportError:
    print("\nSkipping download (not in Colab environment).")
except Exception as e:
    print(f"\nFile download failed: {e}")

## Conclusion

This notebook demonstrated adding feature engineering, stacked ensembling, and an optional era-invariant MLP training pipeline, structured with checkpoints.

To use the checkpoints:
* Run the cells within each 'Part' sequentially.
* If you stop and restart the kernel/environment, re-run the 'Setup & Configuration' cell first.
* Then, run the cell for the 'Part' you want to start from. It will attempt to load from the previous part's checkpoint.
* **Crucially**, if you load from checkpoints, ensure you run all subsequent parts up to Part 7 before pickling, so all necessary models and transformers are loaded into memory for the `predict_final` function.
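
The load-or-compute pattern behind these checkpoints can be sketched in a few lines. This is a minimal illustration using the standard-library `pickle` module and a temporary directory; the helper `load_or_compute` and the toy `expensive_step` are hypothetical and are not the notebook's own checkpoint code, which uses parquet files and `cloudpickle`.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute_fn):
    """Return (object, loaded_flag): load from `path` if present, else compute and save."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f), True
    result = compute_fn()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result, False

calls = []

def expensive_step():
    """Toy stand-in for a slow training or feature-engineering step."""
    calls.append(1)
    return {"weights": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp:
    cp_path = os.path.join(tmp, "checkpoint_demo.pkl")
    first, first_loaded = load_or_compute(cp_path, expensive_step)    # computes and saves
    second, second_loaded = load_or_compute(cp_path, expensive_step)  # loads from disk

print(first_loaded, second_loaded, len(calls))
```

The second call reads the saved object instead of recomputing it, which is why re-running a part after a restart is cheap as long as its checkpoint file still exists.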

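Before pickling, set the `USE_MLP` and `USE_STACKING` flags to select the model you want to submit: `predict_final` uses the MLP if `USE_MLP` is set and an MLP is available, otherwise the stacked ensemble if `USE_STACKING` is set and a meta-model is available, and otherwise the base model for `MAIN_TARGET`. Then upload `predict_final_model.pkl` to [numer.ai](https://numer.ai).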