# Model: PyTorch MLP with All Features (No PCA)

This notebook trains a **PyTorch-based Multi-Layer Perceptron (MLP)** classifier on all available features (regular + all embeddings) with comprehensive preprocessing:
- ✅ All regular features
- ✅ All embedding families (no PCA compression)
- ✅ Feature scaling (StandardScaler)
- ✅ Fixed Hyperparameters with CV Validation
- ✅ Threshold Fine-tuning
- ✅ Model Saving
- ✅ Submission.csv Generation
- ✅ OOM Safe with aggressive memory management
- ✅ SMOTETomek for class imbalance
- ✅ GPU acceleration with CPU fallback

# 📑 PyTorch MLP - Code Navigation Index

## Quick Navigation
- **[Setup](#1-setup)** - Imports, paths, device configuration, robustness utilities
- **[Data Loading](#2-data-loading--feature-extraction)** - Load and split features (NO PCA)
- **[SMOTETomek](#3-class-imbalance-handling-smotetomek)** - Class imbalance resampling
- **[Feature Scaling](#4-feature-scaling)** - StandardScaler normalization
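- **[Model Definition](#5-pytorch-mlp-model-definition)** - MLP architecture with BatchNorm and dropout
- **[Hyperparameter Selection](#6-hyperparameter-selection-fixed-parameters)** - Fixed hyperparameters with CV validation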
- **[Threshold Tuning](#6-threshold-tuning--final-evaluation)** - Optimal threshold finding
- **[Model Saving](#7-save-model)** - Save model weights and metadata
- **[Submission](#8-generate-submission)** - Generate test predictions

## Model Type: PyTorch MLP (all features, no PCA)

## Key Features
✅ GPU-friendly with CPU fallback  
✅ Aggressive garbage collection  
✅ OOM resistant with chunked processing  
✅ Kernel panic resistant (signal handlers, checkpoints)  
✅ Polars-only (no pandas)  
✅ Fixed hyperparameters with CV validation  
✅ SMOTETomek for class imbalance  
✅ Feature scaling & normalization  
✅ Fine-grained threshold optimization  
✅ Model weights saved  
✅ Chunked/batched data processing  

## 1. Setup

In [1]:
import os
from pathlib import Path
import random
import gc
import numpy as np
import polars as pl
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, TensorDataset
from typing import Dict, Optional, Tuple
import sys
import time
import json
import pickle
import signal
import atexit
from functools import wraps
from datetime import datetime


In [2]:
# =========================
# STARTUP & REPRODUCIBILITY
# =========================

TOTAL_START_TIME = time.time()
START_TIME_STR = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
print(f"\n{'='*80}")
print("MODEL_PYTORCH_MLP EXECUTION STARTED")
print(f"Start Time: {START_TIME_STR}")
print(f"{'='*80}\n")

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)



MODEL_PYTORCH_MLP EXECUTION STARTED
Start Time: 2025-11-19 18:48:24



Using device: cuda


In [3]:
# ==============
# PATH MANAGEMENT
# ==============

current = Path(os.getcwd())
PROJECT_ROOT = current
for _ in range(5):
    if (PROJECT_ROOT / "data").exists():
        break
    PROJECT_ROOT = PROJECT_ROOT.parent
else:
    PROJECT_ROOT = current.parent.parent

MODEL_READY_DIR = PROJECT_ROOT / "data" / "model_ready"
MODEL_SAVE_DIR = PROJECT_ROOT / "models" / "saved_models"
SUBMISSION_DIR = PROJECT_ROOT / "data" / "submission_files"
MODEL_SAVE_DIR.mkdir(parents=True, exist_ok=True)
SUBMISSION_DIR.mkdir(parents=True, exist_ok=True)
utils_path = PROJECT_ROOT / "src" / "utils"
print("PROJECT_ROOT:", PROJECT_ROOT)
print("MODEL_READY_DIR:", MODEL_READY_DIR)


PROJECT_ROOT: /gpfs/accounts/si670f25_class_root/si670f25_class/santoshd/Kaggle_2
MODEL_READY_DIR: /gpfs/accounts/si670f25_class_root/si670f25_class/santoshd/Kaggle_2/data/model_ready


In [4]:
# ==========
# ML LIBRARIES
# ==========
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, roc_auc_score, classification_report, precision_recall_curve, roc_curve, confusion_matrix
from imblearn.combine import SMOTETomek
from tqdm.auto import tqdm

# ==========
# VISUALIZATION LIBRARIES
# ==========
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Image
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")


In [5]:
# ===============================
# MEMORY UTILITIES (FALLBACK DEFS)
# ===============================
try:
    from model_training_utils import cleanup_memory, memory_usage, check_memory_safe
    print("✅ Memory utilities imported from shared module")
except ImportError:
    def cleanup_memory():
        """Aggressive memory cleanup for both CPU and GPU."""
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.synchronize()
            torch.cuda.ipc_collect()
        gc.collect()
    
    def memory_usage():
        """Display current memory usage statistics."""
        try:
            import psutil
            process = psutil.Process(os.getpid())
            mem_gb = process.memory_info().rss / 1024**3
            print(f"💾 Memory: {mem_gb:.2f} GB (RAM)", end="")
            if torch.cuda.is_available():
                gpu_mem = torch.cuda.memory_allocated() / 1024**3
                gpu_reserved = torch.cuda.memory_reserved() / 1024**3
                print(f" | {gpu_mem:.2f}/{gpu_reserved:.2f} GB (GPU used/reserved)")
            else:
                print()
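        except Exception:
            # psutil may be missing; memory reporting is best-effort
            pass
    
    def check_memory_safe(ram_threshold_gb=0.85, gpu_threshold=0.80):
        """Check if memory usage is safe for operations.

        Despite its name, ram_threshold_gb is a fraction of total RAM (0-1), not gigabytes.
        """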
        try:
            import psutil
            process = psutil.Process(os.getpid())
            ram_gb = process.memory_info().rss / 1024**3
            total_ram = psutil.virtual_memory().total / 1024**3
            ram_ratio = ram_gb / total_ram if total_ram > 0 else 0
            gpu_ratio = 0
            if torch.cuda.is_available():
                gpu_used = torch.cuda.memory_allocated() / 1024**3
                gpu_total = torch.cuda.get_device_properties(0).total_memory / 1024**3
                gpu_ratio = gpu_used / gpu_total if gpu_total > 0 else 0
            is_safe = ram_ratio < ram_threshold_gb and gpu_ratio < gpu_threshold
            return is_safe, {"ram_gb": ram_gb, "ram_ratio": ram_ratio, "gpu_ratio": gpu_ratio}
        except Exception:
            # Fail open: if memory cannot be inspected, assume it is safe
            return True, {}
    
    print("⚠️ Using fallback memory utilities")

memory_usage()

⚠️ Using fallback memory utilities
💾 Memory: 0.62 GB (RAM) | 0.00/0.00 GB (GPU used/reserved)


In [6]:
# ===============================
# ROBUSTNESS/CHECKPOINT UTILITIES
# ===============================

_checkpoint_state = {
    "pca_complete": False,
    "scaling_complete": False,
    "cv_complete": False,
    "final_model_trained": False,
    "last_saved_checkpoint": None,
}

def save_checkpoint(state_name: str, data: dict, checkpoint_dir: Path = None):
    """Save checkpoint to resume from failures."""
    if checkpoint_dir is None:
        checkpoint_dir = PROJECT_ROOT / "data" / "checkpoints"
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    checkpoint_path = checkpoint_dir / f"model_pytorch_mlp_checkpoint_{state_name}.pkl"
    try:
        with open(checkpoint_path, "wb") as f:
            pickle.dump(data, f)
        _checkpoint_state["last_saved_checkpoint"] = checkpoint_path
        print(f"✅ Checkpoint saved: {checkpoint_path}")
    except Exception as e:
        print(f"⚠️ Failed to save checkpoint: {e}")

def load_checkpoint(state_name: str, checkpoint_dir: Path = None):
    """Load checkpoint to resume from failures."""
    if checkpoint_dir is None:
        checkpoint_dir = PROJECT_ROOT / "data" / "checkpoints"
    checkpoint_path = checkpoint_dir / f"model_pytorch_mlp_checkpoint_{state_name}.pkl"
    if checkpoint_path.exists():
        try:
            with open(checkpoint_path, "rb") as f:
                data = pickle.load(f)
            print(f"✅ Checkpoint loaded: {checkpoint_path}")
            return data
        except Exception as e:
            print(f"⚠️ Failed to load checkpoint: {e}")
    return None

def safe_operation(operation_name: str, max_retries: int = 3, checkpoint_on_success: bool = False):
    """Decorator for safe operations with retry and checkpoint support."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    is_safe, mem_info = check_memory_safe(ram_threshold_gb=0.80, gpu_threshold=0.75)
                    if not is_safe:
                        cleanup_memory()
                        if torch.cuda.is_available():
                            torch.cuda.empty_cache()
                        time.sleep(1)
                    result = func(*args, **kwargs)
                    cleanup_memory()
                    if checkpoint_on_success:
                        save_checkpoint(operation_name, {"status": "complete", "result": result})
                    return result
                except (MemoryError, RuntimeError) as e:
                    error_msg = str(e).lower()
                    if "out of memory" in error_msg or "oom" in error_msg:
                        if attempt < max_retries - 1:
                            cleanup_memory()
                            if torch.cuda.is_available():
                                torch.cuda.empty_cache()
                            time.sleep(2)
                            continue
                        else:
                            raise
                    else:
                        raise
                except Exception as e:
                    if attempt < max_retries - 1:
                        cleanup_memory()
                        time.sleep(1)
                        continue
                    else:
                        raise
            return None
        return wrapper
    return decorator

def chunked_operation(
    data,
    operation_func,
    chunk_size: int = 10000,
    progress_every: int = 10,
    operation_name: str = "operation",
):
    """Execute operation on data in chunks with progress tracking."""
    total_chunks = (len(data) + chunk_size - 1) // chunk_size
    results = []
    for i in range(0, len(data), chunk_size):
        chunk_num = i // chunk_size + 1
        chunk = data[i : i + chunk_size]
        try:
            is_safe, mem_info = check_memory_safe(ram_threshold_gb=0.85, gpu_threshold=0.80)
            if not is_safe:
                cleanup_memory()
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
                time.sleep(0.5)
            chunk_result = operation_func(chunk)
            results.append(chunk_result)
            if chunk_num % progress_every == 0 or chunk_num == total_chunks:
                print(f"  Progress: {chunk_num}/{total_chunks} chunks ({chunk_num*100//total_chunks}%)")
            del chunk
            if chunk_num % 5 == 0:
                cleanup_memory()
        except (MemoryError, RuntimeError) as e:
            error_msg = str(e).lower()
            if "out of memory" in error_msg or "oom" in error_msg:
                cleanup_memory()
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
                smaller_chunk_size = max(1000, chunk_size // 2)
                if smaller_chunk_size < chunk_size:
                    # Keep the results gathered so far and retry the remainder with smaller chunks
                    return results + chunked_operation(
                        data[i:],
                        operation_func,
                        chunk_size=smaller_chunk_size,
                        progress_every=progress_every,
                        operation_name=operation_name,
                    )
                else:
                    raise
            else:
                raise
    return results

def emergency_cleanup():
    """Emergency cleanup on exit."""
    try:
        cleanup_memory()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        print("✅ Emergency cleanup completed")
    except Exception:
        # Never let cleanup errors mask the original shutdown reason
        pass

atexit.register(emergency_cleanup)

def signal_handler(signum, frame):
    """Handle signals for graceful shutdown."""
    print(f"⚠️ Received signal {signum}, saving checkpoint...")
    save_checkpoint("emergency", {"status": "signal_received", "signal": signum})
    emergency_cleanup()
    raise KeyboardInterrupt

try:
    signal.signal(signal.SIGINT, signal_handler)
    signal.signal(signal.SIGTERM, signal_handler)
except Exception:
    # signal.signal raises outside the main thread; handlers are optional
    pass
print("✅ Enhanced robustness utilities loaded")

def safe_prediction(predict_func, *args, **kwargs):
    """Execute prediction, chunking the keyword argument X when it is large or on OOM."""
    # Keep a reference to the full input so that a retry never sees a single chunk
    X_full = kwargs.get("X")

    def _predict_in_chunks(chunk_size, cleanup_every=1):
        predictions = []
        for n, i in enumerate(range(0, len(X_full), chunk_size)):
            # Build fresh kwargs per chunk instead of overwriting kwargs["X"]
            chunk_kwargs = {**kwargs, "X": X_full[i : i + chunk_size]}
            predictions.append(predict_func(*args, **chunk_kwargs))
            if n % cleanup_every == 0:
                cleanup_memory()
        return np.concatenate(predictions)

    try:
        is_safe, mem_info = check_memory_safe(ram_threshold_gb=0.85, gpu_threshold=0.80)
        if not is_safe:
            cleanup_memory()
        if X_full is not None and len(X_full) > 50000:
            return _predict_in_chunks(10000, cleanup_every=5)
        return predict_func(*args, **kwargs)
    except (MemoryError, RuntimeError) as e:
        error_msg = str(e).lower()
        if "out of memory" in error_msg or "oom" in error_msg:
            cleanup_memory()
            if X_full is not None:
                # Retry from the start of the full input with smaller chunks
                return _predict_in_chunks(5000)
        raise

print("✅ Training robustness wrappers loaded")


✅ Enhanced robustness utilities loaded
✅ Training robustness wrappers loaded
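
The chunk-and-retry idea behind `chunked_operation` can be sketched without PyTorch. The example below is a minimal illustration, not the notebook's implementation: `process_in_chunks` and `flaky_sum` are hypothetical names, and the "out of memory" condition is simulated. It halves the chunk size when a `MemoryError` is raised and keeps the results collected before the failure.

```python
from typing import Callable, List, Sequence


def process_in_chunks(
    data: Sequence[int],
    func: Callable[[Sequence[int]], int],
    chunk_size: int = 4,
    min_chunk_size: int = 1,
) -> List[int]:
    """Apply func to consecutive chunks, halving the chunk size on MemoryError."""
    results: List[int] = []
    i = 0
    while i < len(data):
        chunk = data[i : i + chunk_size]
        try:
            results.append(func(chunk))
            i += len(chunk)  # advance only after the chunk succeeded
        except MemoryError:
            if chunk_size <= min_chunk_size:
                raise  # cannot shrink any further
            chunk_size = max(min_chunk_size, chunk_size // 2)
    return results


def flaky_sum(chunk: Sequence[int]) -> int:
    """Hypothetical operation that 'runs out of memory' on chunks longer than 2."""
    if len(chunk) > 2:
        raise MemoryError("simulated out of memory")
    return sum(chunk)


chunk_sums = process_in_chunks(list(range(10)), flaky_sum, chunk_size=4)
print(chunk_sums)
```

Because the loop index only advances after a successful call, earlier results are never discarded when the chunk size shrinks.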


## 2. Data Loading & Feature Extraction

In [7]:
def load_parquet_split(split: str) -> pl.DataFrame:
    """Load a model_ready parquet split with error handling."""
    try:
        path = MODEL_READY_DIR / f"{split}_model_ready.parquet"
        if not path.exists():
            alt = MODEL_READY_DIR / f"{split}_model_ready_reduced.parquet"
            if alt.exists():
                path = alt
            else:
                raise FileNotFoundError(f"Could not find {split} data")
        print(f"Loading {split} from {path}")
        return pl.read_parquet(path)
    except Exception as e:
        print(f"❌ Error loading {split}: {e}")
        raise


def split_features_reg_and_all_emb(df: pl.DataFrame):
    """Split features into regular and embedding families."""
    cols = df.columns
    dtypes = df.dtypes
    label = df["label"].to_numpy() if "label" in cols else None

    reg_cols = []
    EMBEDDING_FAMILY_PREFIXES = ["sent_transformer_", "scibert_", "specter_", "specter2_", "ner_"]
    emb_family_to_cols = {p: [] for p in EMBEDDING_FAMILY_PREFIXES}

    NUMERIC_DTYPES = {
        pl.Int8,
        pl.Int16,
        pl.Int32,
        pl.Int64,
        pl.UInt8,
        pl.UInt16,
        pl.UInt32,
        pl.UInt64,
        pl.Float32,
        pl.Float64,
    }

    for c, dt in zip(cols, dtypes):
        if c in ("id", "label"):
            continue
        matched = False
        for p in EMBEDDING_FAMILY_PREFIXES:
            if c.startswith(p):
                emb_family_to_cols[p].append(c)
                matched = True
                break
        if not matched and dt in NUMERIC_DTYPES:
            reg_cols.append(c)

    X_reg = df.select(reg_cols).to_numpy() if reg_cols else None
    X_emb_families = {}
    for p, clist in emb_family_to_cols.items():
        if clist:
            X_emb_families[p] = df.select(clist).to_numpy()

    return X_reg, X_emb_families, label, reg_cols, emb_family_to_cols


# Load data
try:
    print("\n" + "=" * 80)
    print("PHASE 1: Data Loading")
    print("=" * 80)
    phase_start = time.time()
    
    train_df = load_parquet_split("train")
    val_df = load_parquet_split("val")

    X_reg_train, X_emb_train_fams, y_train, reg_cols, emb_family_to_cols = (
        split_features_reg_and_all_emb(train_df)
    )
    X_reg_val, X_emb_val_fams, y_val, _, _ = split_features_reg_and_all_emb(val_df)

    # Combine regular + ALL embeddings (NO PCA)
    X_emb_train_list = []
    X_emb_val_list = []
    for fam in X_emb_train_fams.keys():
        X_emb_train_list.append(X_emb_train_fams[fam])
        X_emb_val_list.append(X_emb_val_fams[fam])
    
    X_emb_train = np.hstack(X_emb_train_list) if X_emb_train_list else None
    X_emb_val = np.hstack(X_emb_val_list) if X_emb_val_list else None

    if X_reg_train is not None:
        X_train = np.hstack([X_reg_train, X_emb_train]) if X_emb_train is not None else X_reg_train
        X_val = np.hstack([X_reg_val, X_emb_val]) if X_emb_val is not None else X_reg_val
    else:
        X_train = X_emb_train
        X_val = X_emb_val

    phase_time = time.time() - phase_start
    print(f"\n📊 Data Summary:")
    print(f"  Regular features: {len(reg_cols)}")
    print(f"  Total features: {X_train.shape[1]}")
    for fam, arr in X_emb_train_fams.items():
        # One line per embedding family (the duplicate print was removed)
        print(f"  Embedding {fam}: {arr.shape[1]} dims (NO PCA)")
    print(
        f"  Train samples: {len(y_train)}, Positive: {y_train.sum()}, Negative: {(y_train==0).sum()}"
    )
    print(f"  Val samples: {len(y_val)}, Positive: {y_val.sum()}, Negative: {(y_val==0).sum()}")
    print(f"\n⏱️  Data Loading Time: {phase_time:.2f} seconds ({phase_time/60:.2f} minutes)")

    del train_df, val_df, X_reg_train, X_reg_val, X_emb_train_fams, X_emb_val_fams
    del X_emb_train_list, X_emb_val_list, X_emb_train, X_emb_val
    cleanup_memory()
    memory_usage()
except Exception as e:
    print(f"❌ Error loading data: {e}")
    raise


PHASE 1: Data Loading
Loading train from /gpfs/accounts/si670f25_class_root/si670f25_class/santoshd/Kaggle_2/data/model_ready/train_model_ready.parquet


Loading val from /gpfs/accounts/si670f25_class_root/si670f25_class/santoshd/Kaggle_2/data/model_ready/val_model_ready.parquet



📊 Data Summary:
  Regular features: 54
  Total features: 1974
  Embedding sent_transformer_: 384 dims (NO PCA)
  Embedding scibert_: 768 dims (NO PCA)
  Embedding specter2_: 768 dims (NO PCA)
  Train samples: 960000, Positive: 65808, Negative: 894192
  Val samples: 120000, Positive: 8075, Negative: 111925

⏱️  Data Loading Time: 92.85 seconds (1.55 minutes)


💾 Memory: 48.62 GB (RAM) | 0.00/0.00 GB (GPU used/reserved)
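
The prefix-based routing in `split_features_reg_and_all_emb` can be tried on a toy column list. The sketch below copies the prefix list from the notebook, but the column names are made up and the numeric dtype check is left out, so treat it as an illustration of the matching rule only. `route_columns` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Prefixes copied from the notebook; the column names further down are made up.
EMBEDDING_FAMILY_PREFIXES = ["sent_transformer_", "scibert_", "specter_", "specter2_", "ner_"]


def route_columns(columns: List[str]) -> Tuple[List[str], Dict[str, List[str]]]:
    """Send each column to its embedding family, or to the regular list."""
    regular: List[str] = []
    families: Dict[str, List[str]] = {p: [] for p in EMBEDDING_FAMILY_PREFIXES}
    for col in columns:
        if col in ("id", "label"):
            continue  # identifiers and targets are never features
        for prefix in EMBEDDING_FAMILY_PREFIXES:
            if col.startswith(prefix):
                families[prefix].append(col)
                break
        else:
            regular.append(col)  # no prefix matched
    return regular, families


toy_columns = ["id", "label", "n_citations", "scibert_0", "scibert_1", "specter2_0", "specter_0"]
regular_cols, family_cols = route_columns(toy_columns)
print(regular_cols)
print({p: cols for p, cols in family_cols.items() if cols})
```

Note that `specter2_0` does not start with `specter_`, so the two SPECTER families stay separate even though `specter_` is checked first. The counts in the summary above are also consistent: 54 + 384 + 768 + 768 = 1974 total features.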


## 3. Class Imbalance Handling: SMOTETomek

In [None]:
from imblearn.combine import SMOTETomek

# SMOTETomek is always applied below; large datasets only get a lower sampling_strategy.
USE_CLASS_WEIGHT = False  # Class weighting is not used in this cell

print("\n" + "=" * 80)
print("PHASE 2: SMOTETomek Resampling")
print("=" * 80)
phase_start = time.time()

print("\n📊 Checking class imbalance and applying SMOTETomek resampling...")
print(f"  Before: {len(X_train)} samples, Positive: {y_train.sum()}, Negative: {(y_train == 0).sum()}")
print(f"  Imbalance ratio: {(y_train == 0).sum() / max(y_train.sum(), 1):.2f}:1")

try:
    # SMOTETomek is REQUIRED - use adaptive strategy for large datasets
    if len(X_train) > 500_000:
        print(f"  ⚠️ Large dataset detected ({len(X_train):,} samples), using sampling_strategy=0.2 for memory efficiency")
        smt = SMOTETomek(random_state=42, sampling_strategy=0.2, n_jobs=-1)
    else:
        smt = SMOTETomek(random_state=42, sampling_strategy=0.4, n_jobs=-1)
    
    # Fit and resample with memory cleanup
    print("  Fitting SMOTETomek...")
    cleanup_memory()
    X_train_resampled, y_train_resampled = smt.fit_resample(X_train, y_train)
    cleanup_memory()

    phase_time = time.time() - phase_start
    print(f"  After: {len(X_train_resampled)} samples, Positive: {y_train_resampled.sum()}, Negative: {(y_train_resampled == 0).sum()}")
    print(f"  Balance ratio: {(y_train_resampled == 0).sum() / max(y_train_resampled.sum(), 1):.2f}:1")
    print(f"\n⏱️  SMOTETomek Resampling Time: {phase_time:.2f} seconds ({phase_time/60:.2f} minutes)")

    X_train = X_train_resampled
    y_train = y_train_resampled

    del X_train_resampled, y_train_resampled
    cleanup_memory()
except Exception as e:
    phase_time = time.time() - phase_start
    print(f"  ⚠️ SMOTETomek failed: {e}")
    print("  Continuing with original training data...")
    print(f"\n⏱️  SMOTETomek Time (failed): {phase_time:.2f} seconds ({phase_time/60:.2f} minutes)")
    cleanup_memory()
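
To get a feel for what `sampling_strategy=0.2` asks of the SMOTE step, the arithmetic can be worked through with the class counts from the data summary. This is a rough sketch under one assumption: a float `sampling_strategy` is the target minority-to-majority ratio after over-sampling. The Tomek-link cleaning that follows removes a data-dependent number of samples, so the final counts cannot be predicted here. `smote_target_positives` is a hypothetical helper, not an imbalanced-learn function.

```python
# Class counts from the data summary printed in the data loading phase
n_negative = 894_192
n_positive = 65_808


def smote_target_positives(n_majority: int, sampling_strategy: float) -> int:
    """Minority count the SMOTE step aims for, given a float sampling_strategy."""
    return int(n_majority * sampling_strategy)


target_positive = smote_target_positives(n_negative, 0.2)
n_synthetic = target_positive - n_positive  # synthetic minority rows to generate

print(f"Imbalance before: {n_negative / n_positive:.2f}:1")
print(f"Target positives: {target_positive:,} ({n_synthetic:,} synthetic)")
print(f"Imbalance after SMOTE (before Tomek cleaning): {n_negative / target_positive:.2f}:1")
```

With these numbers, SMOTE would need to synthesize roughly 113,000 rows of 1,974 features each, which helps explain why the lower ratio is chosen for large datasets.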

## 4. Feature Scaling

In [None]:
print("\n" + "=" * 80)
print("PHASE 3: Feature Scaling")
print("=" * 80)

phase_start = time.time()
print("\n📊 Applying Feature Scaling to combined features...")

# Store raw (unscaled) data for CV Pipeline (scaler will be fit per fold)
X_train_raw = X_train.copy()
X_val_raw = X_val.copy()
y_train_raw = y_train.copy()
y_val_raw = y_val.copy()

# Use StandardScaler. It is fit on a random sample below; it also supports partial_fit
# for chunk-by-chunk fitting, which RobustScaler does not.
scaler = StandardScaler()

# For large datasets, fit on sample then transform in chunks
CHUNK_SIZE = 50000

if X_train.shape[0] > CHUNK_SIZE:
    print(f"  Fitting scaler on sample ({min(CHUNK_SIZE, X_train.shape[0])} samples) for OOM protection...")
    sample_indices = np.random.choice(X_train.shape[0], size=min(CHUNK_SIZE, X_train.shape[0]), replace=False)
    scaler.fit(X_train[sample_indices])
    del sample_indices
    cleanup_memory()

    # Transform train in chunks
    print(f"  Transforming train data in chunks (size={CHUNK_SIZE})...")
    X_train_chunks = []
    for i in range(0, X_train.shape[0], CHUNK_SIZE):
        chunk = scaler.transform(X_train[i:i + CHUNK_SIZE])
        X_train_chunks.append(chunk)
        del chunk
        if i % (CHUNK_SIZE * 5) == 0:
            cleanup_memory()
    X_train = np.vstack(X_train_chunks)
    del X_train_chunks
    cleanup_memory()

    # Transform val in chunks
    if X_val.shape[0] > CHUNK_SIZE:
        print(f"  Transforming val data in chunks (size={CHUNK_SIZE})...")
        X_val_chunks = []
        for i in range(0, X_val.shape[0], CHUNK_SIZE):
            chunk = scaler.transform(X_val[i:i + CHUNK_SIZE])
            X_val_chunks.append(chunk)
            del chunk
        X_val = np.vstack(X_val_chunks)
        del X_val_chunks
    else:
        X_val = scaler.transform(X_val)
else:
    # Small dataset - fit and transform normally
    X_train = scaler.fit_transform(X_train)
    X_val = scaler.transform(X_val)

cleanup_memory()
phase_time = time.time() - phase_start
print("  ✅ Scaling complete!")
print(f"\n⏱️  Feature Scaling Time: {phase_time:.2f} seconds ({phase_time/60:.2f} minutes)")
memory_usage()

# X_train_raw, X_val_raw, y_train_raw and y_val_raw (copied before scaling above) stay unscaled
# for the CV step, which applies its own scaler. Re-copying here would overwrite them with scaled data.
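
Fitting the scaler on a 50,000-row sample is an approximation. An alternative, not used in this notebook, is to accumulate the mean and variance chunk by chunk, which is what scikit-learn's `StandardScaler.partial_fit` is designed for. The sketch below shows the underlying running-statistics update in plain Python on one toy feature column. It is an illustration of the idea, not scikit-learn's implementation, and `merge_stats` is a hypothetical helper.

```python
import math
import random
import statistics
from typing import List, Tuple


def merge_stats(
    count: int, mean: float, m2: float, chunk: List[float]
) -> Tuple[int, float, float]:
    """Fold one chunk into running (count, mean, sum of squared deviations)."""
    n_chunk = len(chunk)
    chunk_mean = sum(chunk) / n_chunk
    chunk_m2 = sum((x - chunk_mean) ** 2 for x in chunk)
    total = count + n_chunk
    delta = chunk_mean - mean
    new_mean = mean + delta * n_chunk / total
    new_m2 = m2 + chunk_m2 + delta**2 * count * n_chunk / total
    return total, new_mean, new_m2


random.seed(42)
values = [random.gauss(5.0, 2.0) for _ in range(1_000)]  # one toy feature column

count, mean, m2 = 0, 0.0, 0.0
CHUNK = 128
for start in range(0, len(values), CHUNK):
    count, mean, m2 = merge_stats(count, mean, m2, values[start : start + CHUNK])

std = math.sqrt(m2 / count)  # population std, the convention StandardScaler uses
scaled = [(x - mean) / std for x in values]
print(f"mean={mean:.4f} std={std:.4f}")
```

The chunked statistics match a single pass over all rows, so no row has to be left out of the fit to stay within memory.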

## 5. PyTorch MLP Model Definition

In [None]:
class MLP(nn.Module):
    """Enhanced Multi-Layer Perceptron with BatchNorm and Residual Connections."""
    def __init__(self, input_dim: int, hidden_dims: Tuple[int, ...], dropout_rate: float = 0.0, 
                 activation: str = 'relu', use_batch_norm: bool = True, use_residual: bool = False):
        super(MLP, self).__init__()
        
        self.use_residual = use_residual
        self.layers = nn.ModuleList()
        prev_dim = input_dim
        
        for i, hidden_dim in enumerate(hidden_dims):
            # Linear layer
            self.layers.append(nn.Linear(prev_dim, hidden_dim))
            
            # Batch normalization
            if use_batch_norm:
                self.layers.append(nn.BatchNorm1d(hidden_dim))
            
            # Activation
            if activation == 'relu':
                self.layers.append(nn.ReLU())
            elif activation == 'tanh':
                self.layers.append(nn.Tanh())
            elif activation == 'gelu':
                self.layers.append(nn.GELU())
            elif activation == 'swish':
                self.layers.append(nn.SiLU())  # Swish/SiLU
            else:
                self.layers.append(nn.ReLU())
            
            # Dropout
            if dropout_rate > 0:
                self.layers.append(nn.Dropout(dropout_rate))
            
            prev_dim = hidden_dim
        
        # Output layer (with sigmoid)
        self.output = nn.Sequential(
            nn.Linear(prev_dim, 1),
            nn.Sigmoid()
        )
        
        # Initialize weights
        self._initialize_weights()
    
    def _initialize_weights(self):
        """Xavier/Glorot initialization for better convergence."""
        for module in self.modules():
            if isinstance(module, nn.Linear):
                nn.init.xavier_uniform_(module.weight)
                if module.bias is not None:
                    nn.init.constant_(module.bias, 0)
            elif isinstance(module, nn.BatchNorm1d):
                nn.init.constant_(module.weight, 1)
                nn.init.constant_(module.bias, 0)
    
    def forward(self, x):
        # Keep the input for the optional residual connection
        residual = x if self.use_residual else None
        
        # Forward through hidden layers
        for layer in self.layers:
            x = layer(x)
        
        # Residual connection: only applied when the last hidden width equals input_dim
        if residual is not None and x.shape == residual.shape:
            x = x + residual
        
        # Output layer; squeeze only the feature axis so a batch of one keeps shape (1,)
        return self.output(x).squeeze(-1)
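
As a sanity check on model size, the trainable parameters of this architecture can be counted by hand. The sketch below assumes the input width of 1974 from the data summary and the hidden widths (512, 256, 128, 64) that the fixed hyperparameters in the next section produce. `count_mlp_parameters` is a hypothetical helper that mirrors the layer order in `MLP.__init__`.

```python
from typing import Sequence


def count_mlp_parameters(
    input_dim: int, hidden_dims: Sequence[int], use_batch_norm: bool = True
) -> int:
    """Count trainable parameters: Linear weights and biases plus BatchNorm scale and shift."""
    total = 0
    prev = input_dim
    for hidden in hidden_dims:
        total += prev * hidden + hidden  # nn.Linear weight + bias
        if use_batch_norm:
            total += 2 * hidden  # nn.BatchNorm1d weight + bias
        prev = hidden
    total += prev * 1 + 1  # output nn.Linear(prev, 1)
    return total


n_params = count_mlp_parameters(1974, (512, 256, 128, 64))
print(f"{n_params:,} trainable parameters")
```

Most of the parameters sit in the first layer (1974 x 512 weights), which is the cost of feeding all embedding dimensions in without PCA. Activations and dropout add no parameters.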


## 6. Hyperparameter Selection (Fixed Parameters)

In [None]:
print("\n" + "=" * 80)
print("PHASE 4: Hyperparameter Selection (Fixed Parameters)")
print("=" * 80)

# Use fixed hyperparameters (no Optuna) - IMPROVED FOR BETTER PERFORMANCE
print("\n📊 Using improved fixed hyperparameters:")
best_params = {
    'n_layers': 4,  # Deeper network
    'hidden_dim_base': 512,  # Wider network
    'dim_strategy': 'decreasing',  # Start wide, get narrower
    'dropout_rate': 0.3,  # More regularization
    'activation': 'swish',  # Swish/SiLU activation (better than ReLU)
    'learning_rate': 0.0005,  # Lower learning rate for stability
    'batch_size': 256,  # Larger batch size
    'weight_decay': 1e-3,  # More weight decay
    'use_batch_norm': True,  # Batch normalization
    'use_residual': False,  # Can enable if needed
    'label_smoothing': 0.05,  # Label smoothing for better generalization
}

print("  Hyperparameters:")
for key, value in best_params.items():
    print(f"    {key}: {value}")

# Optional: Quick CV validation with fixed params
print("\n🔍 Running quick CV validation with fixed parameters...")
MAX_SAMPLES_FOR_CV = 50000
X_full = np.vstack([X_train_raw, X_val_raw])
y_full = np.hstack([y_train_raw, y_val_raw])

if len(X_full) > MAX_SAMPLES_FOR_CV:
    print(f"⚠️ Dataset too large ({len(X_full)} samples), using subset ({MAX_SAMPLES_FOR_CV} samples) for CV")
    from sklearn.model_selection import train_test_split
    X_full, _, y_full, _ = train_test_split(
        X_full, y_full,
        train_size=MAX_SAMPLES_FOR_CV,
        stratify=y_full,
        random_state=SEED
    )
    print(f"  Using {len(X_full)} samples for CV validation")
    cleanup_memory()

# Scale the CV data
scaler_cv = StandardScaler()
X_full_scaled = scaler_cv.fit_transform(X_full)

print(f"\n📊 CV dataset: {X_full_scaled.shape}, labels: {y_full.shape}")
print(f"  Positive samples: {y_full.sum()}, Negative: {(y_full == 0).sum()}")

# Setup Stratified K-Fold
N_FOLDS = 5
skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

# Build hidden_dims
n_layers = best_params['n_layers']
hidden_dim_base = best_params['hidden_dim_base']
dim_strategy = best_params['dim_strategy']

hidden_dims = []
for i in range(n_layers):
    if dim_strategy == 'decreasing':
        dim = hidden_dim_base // (2 ** i)
    else:
        dim = hidden_dim_base
    hidden_dims.append(max(32, dim))

print(f"\n  Network architecture: {X_full_scaled.shape[1]} -> {hidden_dims} -> 1")

# Quick CV validation
cv_scores = []
print(f"\n  Running {N_FOLDS}-fold CV...")

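for fold_idx, (train_idx, val_idx) in enumerate(skf.split(X_full, y_full)):
    # Fit the scaler on the training fold only, so validation statistics do not leak in
    fold_scaler = StandardScaler()
    X_fold_train = fold_scaler.fit_transform(X_full[train_idx])
    y_fold_train = y_full[train_idx]
    X_fold_val = fold_scaler.transform(X_full[val_idx])
    y_fold_val = y_full[val_idx]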
    
    # Convert to tensors
    X_train_tensor = torch.FloatTensor(X_fold_train).to(device)
    y_train_tensor = torch.FloatTensor(y_fold_train).to(device)
    X_val_tensor = torch.FloatTensor(X_fold_val).to(device)
    y_val_tensor = torch.FloatTensor(y_fold_val).to(device)
    
    # Create model with improved architecture
    model = MLP(
        input_dim=X_full_scaled.shape[1],
        hidden_dims=tuple(hidden_dims),
        dropout_rate=best_params['dropout_rate'],
        activation=best_params['activation'],
        use_batch_norm=best_params.get('use_batch_norm', True),
        use_residual=best_params.get('use_residual', False)
    ).to(device)
    
    # Use AdamW optimizer (better weight decay)
    optimizer = optim.AdamW(
        model.parameters(), 
        lr=best_params['learning_rate'], 
        weight_decay=best_params['weight_decay'],
        betas=(0.9, 0.999)
    )
    
    # Learning rate scheduler (cosine annealing with warm restarts)
    scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=10, T_mult=2, eta_min=1e-6
    )
    
    # Calculate class weights for imbalanced data
    pos_weight = (y_fold_train == 0).sum() / max((y_fold_train == 1).sum(), 1)
    pos_weight_tensor = torch.tensor(pos_weight, device=device)
    
    # Training with improved techniques
    model.train()
    n_epochs = 50  # More epochs for better convergence
    best_val_f1 = 0.0
    patience = 10  # More patience
    patience_counter = 0
    
    # Label smoothing
    label_smoothing = best_params.get('label_smoothing', 0.0)
    
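    # NOTE: this class-weighted criterion is defined but not used in the loop below.
    # The model's forward() already applies a sigmoid, so the loop calls nn.BCELoss;
    # BCEWithLogitsLoss would apply a second sigmoid to the outputs.
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight_tensor)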
    
    # Progress bar for epochs
    epoch_pbar = tqdm(range(n_epochs), desc=f"Fold {fold_idx+1}/{N_FOLDS}", leave=False)
    for epoch in epoch_pbar:
        # Mini-batch training
        indices = torch.randperm(len(X_train_tensor), device=device)
        batch_losses = []
        for i in range(0, len(X_train_tensor), best_params['batch_size']):
            batch_indices = indices[i:i + best_params['batch_size']]
            X_batch = X_train_tensor[batch_indices]
            y_batch = y_train_tensor[batch_indices]
            
            # Label smoothing
            if label_smoothing > 0:
                y_batch_smooth = y_batch * (1 - label_smoothing) + (1 - y_batch) * label_smoothing
            else:
                y_batch_smooth = y_batch
            
            optimizer.zero_grad()
            outputs = model(X_batch)
            # The model returns probabilities (sigmoid is applied in forward), so use
            # plain BCE with the smoothed labels; pos_weight is not applied here.
            loss = nn.BCELoss()(outputs, y_batch_smooth)
            loss.backward()
            
            # Gradient clipping for stability
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            
            optimizer.step()
            batch_losses.append(loss.item())
        
        # Learning rate scheduling
        scheduler.step()
        
        # Validation
        if (epoch + 1) % 3 == 0:  # Validate every 3 epochs; patience counts these checks, not epochs
            model.eval()
            with torch.no_grad():
                val_outputs = model(X_val_tensor)
                val_preds = (val_outputs.cpu().numpy() >= 0.5).astype(int)
                val_f1 = f1_score(y_fold_val, val_preds)
                
                if val_f1 > best_val_f1:
                    best_val_f1 = val_f1
                    patience_counter = 0
                else:
                    patience_counter += 1
                    if patience_counter >= patience:
                        epoch_pbar.close()
                        break
            model.train()
        
        # Update progress bar
        avg_loss = np.mean(batch_losses) if batch_losses else 0.0
        current_lr = scheduler.get_last_lr()[0]
        epoch_pbar.set_postfix({
            'loss': f'{avg_loss:.4f}',
            'val_f1': f'{best_val_f1:.4f}',
            'lr': f'{current_lr:.2e}',
            'patience': patience_counter
        })
    
    epoch_pbar.close()
    
    cv_scores.append(best_val_f1)
    print(f"    Fold {fold_idx + 1}/{N_FOLDS}: F1 = {best_val_f1:.4f}")
    
    # Cleanup
    del model, optimizer, X_train_tensor, y_train_tensor, X_val_tensor, y_val_tensor
    cleanup_memory()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

best_cv_score = np.mean(cv_scores)
print(f"\n✅ CV validation complete")
print(f"  Mean CV F1: {best_cv_score:.4f}")
print(f"  Std CV F1: {np.std(cv_scores):.4f}")

cleanup_memory()
memory_usage()

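The scheduler in the fold loop above restarts its cosine cycle on a fixed timetable. The sketch below reproduces that timetable in plain Python, assuming one `scheduler.step()` call per epoch and the standard cosine-with-warm-restarts formula. `cosine_warm_restart_lr` is a hypothetical helper written for illustration, not a PyTorch function, and the base learning rate of `5e-4` is only an example value.

```python
import math

def cosine_warm_restart_lr(epoch, base_lr, t_0=10, t_mult=2, eta_min=1e-6):
    """Learning rate after `epoch` scheduler steps (illustrative helper)."""
    t_i, t_cur = t_0, epoch
    # Walk through completed cycles; each cycle is t_mult times longer than the last
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))

base_lr = 5e-4  # example value only
schedule = [cosine_warm_restart_lr(e, base_lr) for e in range(50)]

# A restart shows up as an epoch whose learning rate is higher than the previous one
restart_epochs = [e for e in range(1, 50) if schedule[e] > schedule[e - 1]]
print("restarts at epochs:", restart_epochs)
```

With `T_0=10` and `T_mult=2` the cycles last 10, 20, and 40 epochs, so a 50-epoch fold sees restarts at epochs 10 and 30 and ends partway through the third cycle.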

In [None]:
# Run Optuna study (if enabled) or use default hyperparameters
USE_OPTUNA = False
if USE_OPTUNA:
    try:
        study = optuna.create_study(
            direction='maximize',
            sampler=TPESampler(seed=SEED)
        )

        print("\n🚀 Starting Optuna optimization...")
        start_time = time.time()

        study.optimize(
            objective,
            n_trials=N_TRIALS,
            timeout=TIMEOUT_SECONDS,
            show_progress_bar=True
        )

        elapsed_time = time.time() - start_time

        best_params = study.best_params
        best_cv_score = study.best_value

        print(f"\n✅ Optuna optimization complete ({elapsed_time/60:.1f} min)")
        print(f"  Best CV F1: {best_cv_score:.4f}")
        print(f"  Best parameters:")
        for key, value in best_params.items():
            print(f"    {key}: {value}")

        cleanup_memory()
        memory_usage()

    except Exception as e:
        print(f"❌ Error in Optuna optimization: {e}")
        import traceback
        traceback.print_exc()
        best_params = {}
        best_cv_score = 0.0
cleanup_memory()


## 7. Final Model Training & Threshold Tuning

In [None]:
# Train final model on full data
try:
    print("\n" + "=" * 80)
    print("PHASE 5: Final Model Training & Threshold Tuning")
    print("=" * 80)
    phase_start = time.time()
    print("Training Final Model on Full Dataset...")

    # Use best parameters from Optuna or improved defaults
    if 'best_params' not in locals() or not best_params:
        print("  ⚠️ best_params not found, using improved defaults")
        best_params = {
            'n_layers': 4,
            'hidden_dim_base': 512,
            'dim_strategy': 'decreasing',
            'dropout_rate': 0.3,
            'activation': 'swish',
            'learning_rate': 0.0005,
            'batch_size': 256,
            'weight_decay': 1e-3,
            'use_batch_norm': True,
            'use_residual': False,
            'label_smoothing': 0.05
        }
        best_cv_score = 0.0

    # Build hidden_dims from best_params
    n_layers = best_params.get('n_layers', 3)
    hidden_dim_base = best_params.get('hidden_dim_base', 256)
    dim_strategy = best_params.get('dim_strategy', 'decreasing')
    
    hidden_dims = []
    for i in range(n_layers):
        if dim_strategy == 'decreasing':
            dim = hidden_dim_base // (2 ** i)
        else:
            dim = hidden_dim_base
        hidden_dims.append(max(32, dim))

    # Create final model with improved architecture
    final_model = MLP(
        input_dim=X_train.shape[1],
        hidden_dims=tuple(hidden_dims),
        dropout_rate=best_params.get('dropout_rate', 0.3),
        activation=best_params.get('activation', 'swish'),
        use_batch_norm=best_params.get('use_batch_norm', True),
        use_residual=best_params.get('use_residual', False)
    ).to(device)

    # Use AdamW optimizer with improved settings
    optimizer = optim.AdamW(
        final_model.parameters(), 
        lr=best_params.get('learning_rate', 0.0005), 
        weight_decay=best_params.get('weight_decay', 1e-3),
        betas=(0.9, 0.999)
    )
    
    # Learning rate scheduler
    scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=15, T_mult=2, eta_min=1e-6
    )
    
    # Calculate class weights
    pos_weight = (y_train == 0).sum() / max((y_train == 1).sum(), 1)
    pos_weight_tensor = torch.tensor(pos_weight, device=device)
    
    batch_size = best_params.get('batch_size', 256)
    label_smoothing = best_params.get('label_smoothing', 0.05)
    X_train_tensor = torch.FloatTensor(X_train).to(device)
    y_train_tensor = torch.FloatTensor(y_train).to(device)
    X_val_tensor = torch.FloatTensor(X_val).to(device)
    y_val_tensor = torch.FloatTensor(y_val).to(device)

    final_model.train()
    n_epochs = 150  # More epochs
    best_val_f1 = 0.0
    patience = 20  # More patience
    patience_counter = 0

    print(f"  Training for up to {n_epochs} epochs with early stopping (patience={patience})...")
    print(f"  Class weight (computed, not applied to BCELoss): {pos_weight:.2f}, label smoothing: {label_smoothing}")

    # Progress bar for epochs
    epoch_pbar = tqdm(range(n_epochs), desc="Training", unit="epoch")
    for epoch in epoch_pbar:
        # Mini-batch training
        indices = torch.randperm(len(X_train_tensor), device=device)
        batch_losses = []
        for i in range(0, len(X_train_tensor), batch_size):
            batch_indices = indices[i:i + batch_size]
            X_batch = X_train_tensor[batch_indices]
            y_batch = y_train_tensor[batch_indices]
            
            # Label smoothing
            if label_smoothing > 0:
                y_batch_smooth = y_batch * (1 - label_smoothing) + (1 - y_batch) * label_smoothing
            else:
                y_batch_smooth = y_batch
            
            optimizer.zero_grad()
            outputs = final_model(X_batch)
            loss = nn.BCELoss()(outputs, y_batch_smooth)
            loss.backward()
            
            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(final_model.parameters(), max_norm=1.0)
            
            optimizer.step()
            batch_losses.append(loss.item())
        
        # Learning rate scheduling
        scheduler.step()
        
        # Validation every epoch
        final_model.eval()
        with torch.no_grad():
            val_outputs = final_model(X_val_tensor)
            val_proba = val_outputs.cpu().numpy()
            val_preds = (val_proba >= 0.5).astype(int)
            val_f1 = f1_score(y_val, val_preds)
            
            if val_f1 > best_val_f1:
                best_val_f1 = val_f1
                patience_counter = 0
                # Save best model state
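                # Clone the tensors: state_dict().copy() is shallow, so later
                # optimizer steps would silently overwrite the saved "best" weights
                best_model_state = {k: v.detach().clone() for k, v in final_model.state_dict().items()}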
            else:
                patience_counter += 1
                if patience_counter >= patience:
                    epoch_pbar.set_description(f"Early stopping at epoch {epoch + 1}")
                    epoch_pbar.close()
                    print(f"\n  Early stopping at epoch {epoch + 1}")
                    break
        
        final_model.train()
        
        # Update progress bar
        avg_loss = np.mean(batch_losses) if batch_losses else 0.0
        current_lr = scheduler.get_last_lr()[0]
        epoch_pbar.set_postfix({
            'loss': f'{avg_loss:.4f}',
            'val_f1': f'{val_f1:.4f}',
            'best_f1': f'{best_val_f1:.4f}',
            'lr': f'{current_lr:.2e}',
            'patience': f'{patience_counter}/{patience}'
        })
    
    epoch_pbar.close()

    # Load best model state
    if 'best_model_state' in locals():
        final_model.load_state_dict(best_model_state)

    # Get predictions on validation set
    final_model.eval()
    with torch.no_grad():
        y_val_proba = final_model(X_val_tensor).cpu().numpy()

    # Find optimal threshold
    precision, recall, pr_thresholds = precision_recall_curve(y_val, y_val_proba)
    f1_scores_pr = 2 * (precision * recall) / (precision + recall + 1e-10)
    best_pr_idx = np.argmax(f1_scores_pr)
    best_pr_threshold = pr_thresholds[best_pr_idx] if best_pr_idx < len(pr_thresholds) else 0.5
    best_pr_f1 = f1_scores_pr[best_pr_idx]

    # Manual fine-grained search
    thresholds = np.concatenate([
        np.linspace(0.01, 0.05, 20),
        np.linspace(0.05, 0.15, 50),
        np.linspace(0.15, 0.3, 30),
        np.linspace(0.3, 0.9, 20)
    ])

    best_threshold = best_pr_threshold
    best_f1 = best_pr_f1
    for thr in thresholds:
        y_pred = (y_val_proba >= thr).astype(int)
        f1 = f1_score(y_val, y_pred, pos_label=1, zero_division=0)
        if f1 > best_f1:
            best_f1 = f1
            best_threshold = thr

    phase_time = time.time() - phase_start
    print(f"\n✅ Final Optimal Threshold: {best_threshold:.4f}")
    print(f"✅ Final Validation F1: {best_f1:.4f}")
    print(f"\n⏱️  Final Model Training & Threshold Tuning Time: {phase_time:.2f} seconds ({phase_time/60:.2f} minutes)")

    # Classification report
    y_val_pred = (y_val_proba >= best_threshold).astype(int)
    print("\n📊 Classification Report:")
    print(classification_report(y_val, y_val_pred, digits=4, zero_division=0))

    cleanup_memory()
    memory_usage()

except Exception as e:
    print(f"❌ Error in final training: {e}")
    raise

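The label smoothing step in the training loops pulls each hard 0/1 target toward the opposite class by `label_smoothing`. A minimal plain-Python sketch of the same arithmetic follows; the helper name `smooth_labels` is illustrative only and does not exist in the notebook.

```python
def smooth_labels(labels, eps):
    """Apply y * (1 - eps) + (1 - y) * eps to each binary label."""
    return [y * (1 - eps) + (1 - y) * eps for y in labels]

smoothed = smooth_labels([1.0, 0.0, 1.0], eps=0.05)
print(smoothed)
```

With `label_smoothing = 0.05`, positives become roughly 0.95 and negatives roughly 0.05. Note that this moves each target by the full `eps`, whereas the other common binary variant moves it by `eps / 2` toward 0.5.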

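The threshold search at the end of the cell above can be illustrated on a toy example. The sketch below is a dependency-free approximation of the manual sweep, using made-up labels and probabilities; `f1_at_threshold` and `best_threshold_for` are hypothetical helpers, and the notebook itself relies on scikit-learn's `f1_score` and `precision_recall_curve`.

```python
def f1_at_threshold(y_true, y_proba, thr):
    """F1 for the positive class when predicting 1 wherever proba >= thr."""
    tp = sum(1 for t, p in zip(y_true, y_proba) if p >= thr and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_proba) if p >= thr and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_proba) if p < thr and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold_for(y_true, y_proba, candidates):
    """Return the first candidate threshold that achieves the highest F1."""
    best_thr, best_f1 = 0.5, -1.0
    for thr in candidates:
        f1 = f1_at_threshold(y_true, y_proba, thr)
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1

# Toy data (fabricated for illustration): positives score low, as with rare classes
y_true = [0, 0, 0, 1, 0, 1, 1]
y_proba = [0.02, 0.05, 0.10, 0.12, 0.20, 0.35, 0.80]
candidates = [i / 100 for i in range(1, 91)]

thr, f1 = best_threshold_for(y_true, y_proba, candidates)
default_f1 = f1_at_threshold(y_true, y_proba, 0.5)
print(f"tuned threshold={thr:.2f}, F1={f1:.3f}; default 0.5 gives F1={default_f1:.3f}")
```

On this toy data the default 0.5 cutoff misses two of the three positives, while a much lower cutoff recovers all of them at the cost of one false positive. This is why the sweep in the notebook samples the low-probability range most densely.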
## 8. Save Model

In [None]:
# Save model
try:
    model_save_path = MODEL_SAVE_DIR / "model_pytorch_mlp_all_features_best.pkl"

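    save_dict = {
        # Move weights to CPU so the file also loads on machines without a GPU
        "model_state_dict": {k: v.detach().cpu() for k, v in final_model.state_dict().items()},
        # Mirror every argument (and default) passed to MLP(...) during training
        # so the architecture can be rebuilt exactly before loading the weights
        "model_config": {
            "input_dim": X_train.shape[1],
            "hidden_dims": hidden_dims,
            "dropout_rate": best_params.get('dropout_rate', 0.3),
            "activation": best_params.get('activation', 'swish'),
            "use_batch_norm": best_params.get('use_batch_norm', True),
            "use_residual": best_params.get('use_residual', False)
        },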
        "scaler": scaler,
        "best_params": best_params,
        "best_cv_score": best_cv_score,
        "best_threshold": best_threshold,
        "best_f1": best_f1,
        "reg_cols": reg_cols,
        "emb_family_to_cols": emb_family_to_cols,
    }

    with open(model_save_path, "wb") as f:
        pickle.dump(save_dict, f)

    print(f"\n💾 Model saved to: {model_save_path}")

except Exception as e:
    print(f"❌ Error saving model: {e}")
    import traceback
    traceback.print_exc()


## 9. Generate Submission

In [None]:
import re

def extract_work_id(id_value: str) -> str:
    """Extract work_id from URL or return as is if already just ID."""
    id_str = str(id_value)
    # If it already looks like a work ID, just return it
    if id_str.startswith('W') and len(id_str) > 1 and '/' not in id_str:
        return id_str
    # Otherwise, extract from URL or string
    match = re.search(r'W\d+', id_str)
    if match:
        return match.group(0)
    return id_str

# Load test data and generate predictions
try:
    print("\n" + "=" * 80)
    print("PHASE 6: Test Predictions")
    print("=" * 80)
    phase_start = time.time()
    print("Generating Test Predictions...")

    test_df = load_parquet_split("test")
    test_ids = test_df["id"].to_numpy()

    # Process test data same as train
    X_reg_test, X_emb_test_fams, _, _, _ = split_features_reg_and_all_emb(test_df)
    del test_df

    # Combine embeddings (NO PCA)
    X_emb_test_list = []
    for fam in X_emb_test_fams.keys():
        X_emb_test_list.append(X_emb_test_fams[fam])
    X_emb_test = np.hstack(X_emb_test_list) if X_emb_test_list else None

    if X_reg_test is not None:
        X_test = np.hstack([X_reg_test, X_emb_test]) if X_emb_test is not None else X_reg_test
    else:
        X_test = X_emb_test

    del X_reg_test, X_emb_test_fams, X_emb_test_list, X_emb_test
    cleanup_memory()

    # Scale
    if "scaler" in locals():
        X_test = scaler.transform(X_test)

    # Predict in chunks
    chunk_size = 10000
    final_model.eval()
    y_test_proba_chunks = []
    
    for i in range(0, X_test.shape[0], chunk_size):
        X_test_chunk = torch.FloatTensor(X_test[i:i + chunk_size]).to(device)
        with torch.no_grad():
            chunk_proba = final_model(X_test_chunk).cpu().numpy()
        y_test_proba_chunks.append(chunk_proba)
        del X_test_chunk
        cleanup_memory()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    
    y_test_proba = np.concatenate(y_test_proba_chunks)
    del y_test_proba_chunks

    y_test_pred = (y_test_proba >= best_threshold).astype(int)

    # Create submission using Polars
    work_ids = np.array([extract_work_id(str(id_val)) for id_val in test_ids])
    submission_df = pl.DataFrame({"work_id": work_ids, "label": y_test_pred})

    submission_path = SUBMISSION_DIR / "submission_model_pytorch_mlp.csv"
    submission_df.write_csv(submission_path)

    phase_time = time.time() - phase_start
    print(f"\n✅ Submission saved to: {submission_path}")
    print(f"  Test predictions: {len(y_test_pred)}, Positive: {y_test_pred.sum()}, Negative: {(y_test_pred==0).sum()}")
    print(f"\n⏱️  Test Predictions Time: {phase_time:.2f} seconds ({phase_time/60:.2f} minutes)")

    cleanup_memory()
    memory_usage()
    
    # Print total execution time summary
    total_time = time.time() - TOTAL_START_TIME
    END_TIME_STR = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    print(f"\n{'='*80}")
    print(f"MODEL_PYTORCH_MLP EXECUTION COMPLETED")
    print(f"Start Time: {START_TIME_STR}")
    print(f"End Time: {END_TIME_STR}")
    print(f"Total Execution Time: {total_time:.2f} seconds ({total_time/60:.2f} minutes / {total_time/3600:.2f} hours)")
    print(f"Final Validation F1 Score: {best_f1:.4f}")
    print(f"{'='*80}\n")

except Exception as e:
    total_time = time.time() - TOTAL_START_TIME
    print(f"\n❌ Error generating submission: {e}")
    print(f"Execution failed after {total_time:.2f} seconds ({total_time/60:.2f} minutes)")
    raise

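As a quick check of the ID handling used for the submission file, the sketch below repeats `extract_work_id` so it runs on its own and feeds it a few inputs. The example URL and ID values are assumptions chosen for illustration; the real `id` column format is not shown in this notebook.

```python
import re

def extract_work_id(id_value: str) -> str:
    """Same logic as the notebook helper, repeated so this example is standalone."""
    id_str = str(id_value)
    if id_str.startswith('W') and len(id_str) > 1 and '/' not in id_str:
        return id_str
    match = re.search(r'W\d+', id_str)
    if match:
        return match.group(0)
    return id_str

# Hypothetical inputs: a full URL, a bare ID, and a value with no ID at all
from_url = extract_work_id("https://example.org/W2741809807")
from_bare = extract_work_id("W123")
no_match = extract_work_id("unknown-id")
print(from_url, from_bare, no_match)
```

Values that contain no `W<digits>` pattern pass through unchanged, so it may be worth checking the `work_id` column for such leftovers before submitting.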