# Multi-Classifier Pipeline: Leakage Test

This notebook combines:
- ✅ **XGBoost features**: All regular features + embeddings (PCA-compressed) from `data/results/`
- ✅ **Temporal features**: Date-derived features (days_since_publication, days_since_updated, num_years_after_publication)
- ✅ **Multiple classifiers**: BernoulliNB, LogisticRegression, DecisionTree, RandomForest, GradientBoosting, ExtraTrees, AdaBoost, LGBMClassifier, XGBClassifier, CatBoostClassifier, BaggingClassifier, Perceptron, QuadraticDiscriminantAnalysis, GaussianNB, LinearDiscriminantAnalysis, ExtraTreeClassifier, SGDClassifier, DummyClassifier
- ✅ **RandomUnderSampler**: For class imbalance handling
- ✅ **Selective scaling**: Numeric columns only, preserves binary features
- ✅ **Feature review**: Keeps both original and engineered features for comparison
- ✅ **Duplicate elimination**: Removes absolute duplicate features
- ✅ Threshold optimization and submission generation
- ✅ OOM-safe, with aggressive memory management
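
The selective-scaling item above can be sketched as follows. This is a minimal illustration, assuming a column counts as "binary" when every training value is 0 or 1; the helper name `scale_numeric_only` and the toy arrays are hypothetical and do not come from the notebook. The notebook imports `StandardScaler`, while this sketch uses plain NumPy so the train-only statistics are explicit.

```python
import numpy as np


def scale_numeric_only(X_train, X_other):
    """Standardize non-binary columns with train statistics; leave 0/1 columns as-is."""
    X_train = np.asarray(X_train, dtype=np.float64)
    X_other = np.asarray(X_other, dtype=np.float64)

    # Assumption: a column is binary if every training value is exactly 0 or 1.
    is_binary = np.all((X_train == 0) | (X_train == 1), axis=0)
    numeric = ~is_binary

    # Statistics come from the training split only, to avoid leaking val/test information.
    mean = X_train[:, numeric].mean(axis=0)
    std = X_train[:, numeric].std(axis=0)
    std[std == 0] = 1.0  # constant columns: avoid division by zero

    train_scaled = X_train.copy()
    other_scaled = X_other.copy()
    train_scaled[:, numeric] = (X_train[:, numeric] - mean) / std
    other_scaled[:, numeric] = (X_other[:, numeric] - mean) / std
    return train_scaled, other_scaled, is_binary


# Toy data: column 0 is a binary flag, column 1 is a numeric feature.
X_tr = np.array([[1.0, 10.0], [0.0, 20.0], [1.0, 30.0]])
X_va = np.array([[0.0, 20.0]])
X_tr_s, X_va_s, binary_mask = scale_numeric_only(X_tr, X_va)
```

Fitting the statistics on the training split and reusing them for validation and test keeps the scaling step from seeing held-out data, which matters in a notebook whose purpose is a leakage test.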


# 📑 Multi-Classifier Pipeline - Code Navigation Index

## Quick Navigation
- **[Setup](#1-setup)** - Imports, paths, device configuration, robustness utilities
- **[Data Loading](#2-data-loading)** - Load base features and embeddings from `data/results/`
- **[Temporal Feature Engineering](#3-temporal-feature-engineering)** - Add temporal date features
- **[Feature Combination](#4-feature-combination)** - Combine regular + embeddings (PCA) + temporal features
- **[Duplicate Elimination](#5-duplicate-elimination)** - Remove absolute duplicate features
- **[Feature Review](#6-feature-review)** - Display original vs engineered features
- **[Selective Scaling](#7-selective-scaling)** - Scale numeric columns only
- **[Class Imbalance](#8-class-imbalance)** - RandomUnderSampler
- **[Model Training](#9-model-training)** - Train multiple classifiers
- **[Model Comparison](#10-model-comparison)** - Evaluate and compare all models
- **[Threshold Tuning](#11-threshold-tuning)** - Optimal threshold finding
- **[Model Saving](#12-save-model)** - Save best model
- **[Submission](#13-generate-submission)** - Generate test predictions

## Model Types: Multiple Classifiers 
- BernoulliNB
- LogisticRegression
- DecisionTreeClassifier
- RandomForestClassifier
- GradientBoostingClassifier
- ExtraTreesClassifier
- AdaBoostClassifier
- LGBMClassifier
- XGBClassifier
- CatBoostClassifier
- BaggingClassifier
- Perceptron
- QuadraticDiscriminantAnalysis
- GaussianNB
- LinearDiscriminantAnalysis
- ExtraTreeClassifier
- SGDClassifier
- DummyClassifier

## Feature Sources
- **XGBoost features**: Regular features (54) + Embeddings (PCA-compressed: 32 per family)
- **Temporal**: days_since_publication, days_since_updated, num_years_after_publication
- **Combined**: All features merged, duplicates removed


## 1. Setup


In [1]:
import os
from pathlib import Path
import random
import gc
import numpy as np
import polars as pl
import torch
from typing import Dict, Optional, List, Tuple
import sys
import time
import json
import pickle
import signal
import atexit
from functools import wraps
from datetime import datetime
import ast


In [2]:
# =========================
# STARTUP & REPRODUCIBILITY
# =========================

TOTAL_START_TIME = time.time()
START_TIME_STR = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(f"\n{'='*80}")
print("MULTI-CLASSIFIER PIPELINE EXECUTION STARTED")
print(f"Start Time: {START_TIME_STR}")
print(f"{'='*80}\n")

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)



MULTI-CLASSIFIER PIPELINE EXECUTION STARTED
Start Time: 2025-11-20 04:25:50

Using device: cpu


In [3]:
# ==============
# PATH MANAGEMENT
# ==============

from pathlib import Path
import os

# Get project root by finding data/results directory
current = Path(os.getcwd())
PROJECT_ROOT = current

# Search up to 10 levels to find data/results
for _ in range(10):
    if (PROJECT_ROOT / "data" / "results").exists():
        break
    PROJECT_ROOT = PROJECT_ROOT.parent
else:
    # Fallback: go up two levels from current (assuming we're in src/notebooks)
    PROJECT_ROOT = current.parent.parent

RESULTS_DIR = PROJECT_ROOT / "data" / "results"
MODEL_SAVE_DIR = PROJECT_ROOT / "models" / "saved_models"
SUBMISSION_DIR = PROJECT_ROOT / "data" / "submission_files"
MODEL_SAVE_DIR.mkdir(parents=True, exist_ok=True)
SUBMISSION_DIR.mkdir(parents=True, exist_ok=True)

print("PROJECT_ROOT:", PROJECT_ROOT)
print("RESULTS_DIR:", RESULTS_DIR)
print("MODEL_SAVE_DIR:", MODEL_SAVE_DIR)
print("SUBMISSION_DIR:", SUBMISSION_DIR)


PROJECT_ROOT: /Users/santoshdesai/Documents/Desai_Projects/Kaggle2
RESULTS_DIR: /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/data/results
MODEL_SAVE_DIR: /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/models/saved_models
SUBMISSION_DIR: /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/data/submission_files


In [4]:
%matplotlib inline
# ==========
# ML LIBRARIES
# ==========
import warnings

warnings.filterwarnings("ignore")
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import (
  train_test_split,
  cross_val_score,
  StratifiedKFold,
)
from sklearn.metrics import (
  f1_score,
  roc_auc_score,
  classification_report,
  precision_recall_curve,
  roc_curve,
  confusion_matrix,
)
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
  RandomForestClassifier,
  GradientBoostingClassifier,
  ExtraTreesClassifier,
  AdaBoostClassifier,
  BaggingClassifier,
)
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from imblearn.under_sampling import RandomUnderSampler
from sklearn.decomposition import IncrementalPCA
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis, LinearDiscriminantAnalysis
from sklearn.dummy import DummyClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# ==========
# VISUALIZATION LIBRARIES
# ==========
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Image

plt.style.use("seaborn-v0_8-darkgrid")
sns.set_palette("husl")


In [5]:
# ===============================
# MEMORY UTILITIES (FALLBACK DEFS)
# ===============================
try:
  from model_training_utils import cleanup_memory, memory_usage, check_memory_safe

  print("✅ Memory utilities imported from shared module")
except ImportError:

  def cleanup_memory():
    """Aggressive memory cleanup for both CPU and GPU."""
    gc.collect()
    if torch.cuda.is_available():
      torch.cuda.empty_cache()
      torch.cuda.synchronize()
      torch.cuda.ipc_collect()
    gc.collect()

  def memory_usage():
    """Display current memory usage statistics."""
    try:
      import psutil

      process = psutil.Process(os.getpid())
      # memory_info is a method; it must be called before reading .rss
      mem_gb = process.memory_info().rss / 1024**3
      print(f"💾 Memory: {mem_gb:.2f} GB (RAM)", end="")
      if torch.cuda.is_available():
        gpu_mem = torch.cuda.memory_allocated() / 1024**3
        gpu_reserved = torch.cuda.memory_reserved() / 1024**3
        print(f" | {gpu_mem:.2f}/{gpu_reserved:.2f} GB (GPU used/reserved)")
      else:
        print()
    except Exception:
      # Memory reporting is best-effort; it must never break the pipeline
      pass

  def check_memory_safe(ram_threshold_gb=0.85, gpu_threshold=0.80):
    """Check if memory usage is safe for operations."""
    try:
      import psutil

      process = psutil.Process(os.getpid())
      ram_gb = process.memory_info().rss / 1024**3
      total_ram = psutil.virtual_memory().total / 1024**3
      ram_ratio = ram_gb / total_ram if total_ram > 0 else 0
      gpu_ratio = 0
      if torch.cuda.is_available():
        gpu_used = torch.cuda.memory_allocated() / 1024**3
        gpu_total = torch.cuda.get_device_properties(0).total_memory / 1024**3
        gpu_ratio = gpu_used / gpu_total if gpu_total > 0 else 0
      is_safe = ram_ratio < ram_threshold_gb and gpu_ratio < gpu_threshold
      return is_safe, {
        "ram_gb": ram_gb,
        "ram_ratio": ram_ratio,
        "gpu_ratio": gpu_ratio,
      }
    except:
      return True, {}

  print("⚠️ Using fallback memory utilities")

memory_usage()


⚠️ Using fallback memory utilities


## 2. Data Loading & Feature Combination


In [6]:
def load_base_features(split: str) -> pl.DataFrame:
  """Load base feature matrix from data/results/"""
  path = RESULTS_DIR / f"X_{split}.parquet"
  if not path.exists():
    raise FileNotFoundError(f"Could not find {split} base features at {path}")
  print(f"Loading {split} base features from {path}")
  return pl.read_parquet(path)


def load_embeddings(split: str, embedding_type: str) -> Optional[pl.DataFrame]:
  """Load embedding features from data/results/"""
  path = RESULTS_DIR / f"{embedding_type}_X_{split}.parquet"
  if not path.exists():
    print(f"⚠️ {embedding_type} embeddings not found for {split}")
    return None
  print(f"Loading {split} {embedding_type} embeddings from {path}")
  return pl.read_parquet(path)


def load_labels(split: str) -> np.ndarray:
  """Load labels from data/results/"""
  path = RESULTS_DIR / f"y_{split}.npy"
  if not path.exists():
    raise FileNotFoundError(f"Could not find {split} labels at {path}")
  print(f"Loading {split} labels from {path}")
  return np.load(path)


def split_features_reg_and_all_emb(df: pl.DataFrame):
  """Split features into regular and embedding families (from XGBoost notebook)."""
  cols = df.columns
  dtypes = df.dtypes
  label = df["label"].to_numpy() if "label" in cols else None

  reg_cols = []
  EMBEDDING_FAMILY_PREFIXES = [
    "sent_transformer_",
    "scibert_",
    "specter_",
    "specter2_",
    "ner_",
  ]
  emb_family_to_cols = {p: [] for p in EMBEDDING_FAMILY_PREFIXES}

  NUMERIC_DTYPES = {
    pl.Int8,
    pl.Int16,
    pl.Int32,
    pl.Int64,
    pl.UInt8,
    pl.UInt16,
    pl.UInt32,
    pl.UInt64,
    pl.Float32,
    pl.Float64,
  }

  for c, dt in zip(cols, dtypes):
    if c in ("id", "label"):
      continue
    matched = False
    for p in EMBEDDING_FAMILY_PREFIXES:
      if c.startswith(p):
        emb_family_to_cols[p].append(c)
        matched = True
        break
    if not matched and dt in NUMERIC_DTYPES:
      reg_cols.append(c)

  X_reg = df.select(reg_cols).to_numpy() if reg_cols else None
  X_emb_families = {}
  for p, clist in emb_family_to_cols.items():
    if clist:
      X_emb_families[p] = df.select(clist).to_numpy()

  return X_reg, X_emb_families, label, reg_cols, emb_family_to_cols


# Load data (XGBoost style: base features + embeddings)
try:
  print("\n" + "=" * 80)
  print("PHASE 1: Data Loading (XGBoost Style)")
  print("=" * 80)
  phase_start = time.time()

  # Load base features
  X_train_base = load_base_features("train")
  X_val_base = load_base_features("val")
  X_test_base = load_base_features("test")

  # Load labels
  y_train = load_labels("train")
  y_val = load_labels("val")

  # Load embeddings (if available)
  embedding_types = ["sent_transformer", "scibert", "specter2"]
  train_embeddings = {}
  val_embeddings = {}
  test_embeddings = {}

  for emb_type in embedding_types:
    train_emb = load_embeddings("train", emb_type)
    val_emb = load_embeddings("val", emb_type)
    test_emb = load_embeddings("test", emb_type)
    if train_emb is not None:
      train_embeddings[emb_type] = train_emb
      val_embeddings[emb_type] = val_emb
      test_embeddings[emb_type] = test_emb

  # Combine base features with embeddings (like XGBoost)
  print("\n📊 Combining base features with embeddings...")

  X_train_combined = X_train_base.clone()
  X_val_combined = X_val_base.clone()
  X_test_combined = X_test_base.clone()

  # Merge embeddings by 'id' column
  for emb_type, train_emb in train_embeddings.items():
    if "id" in train_emb.columns:
      X_train_combined = X_train_combined.join(train_emb, on="id", how="left")
      X_val_combined = X_val_combined.join(
        val_embeddings[emb_type], on="id", how="left"
      )
      X_test_combined = X_test_combined.join(
        test_embeddings[emb_type], on="id", how="left"
      )
      print(f" ✅ Merged {emb_type} embeddings")

  # Extract work_id for later use
  work_ids_train = None
  work_ids_val = None
  work_ids_test = None

  if "id" in X_train_combined.columns:
    work_ids_train = X_train_combined.select("id").to_series().to_list()
    work_ids_val = X_val_combined.select("id").to_series().to_list()
    work_ids_test = X_test_combined.select("id").to_series().to_list()

  # Store original dataframes for feature review
  X_train_original = X_train_combined.clone()
  X_val_original = X_val_combined.clone()
  X_test_original = X_test_combined.clone()

  phase_time = time.time() - phase_start
  print("\n📊 Data Summary:")
  print(
    f" Train samples: {len(X_train_combined)}, Features: {len(X_train_combined.columns)}"
  )
  print(f" Val samples: {len(X_val_combined)}")
  print(f" Test samples: {len(X_test_combined)}")
  print(f" Train Positive: {y_train.sum()}, Negative: {(y_train==0).sum()}")
  print(f" Val Positive: {y_val.sum()}, Negative: {(y_val==0).sum()}")
  print(
    f"\n⏱️ Data Loading Time: {phase_time:.2f} seconds ({phase_time/60:.2f} minutes)"
  )

  cleanup_memory()
  memory_usage()
except Exception as e:
  print(f"❌ Error loading data: {e}")
  import traceback

  traceback.print_exc()
  raise



PHASE 1: Data Loading (XGBoost Style)
Loading train base features from /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/data/results/X_train.parquet
Loading val base features from /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/data/results/X_val.parquet
Loading test base features from /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/data/results/X_test.parquet
Loading train labels from /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/data/results/y_train.npy
Loading val labels from /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/data/results/y_val.npy
Loading train sent_transformer embeddings from /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/data/results/sent_transformer_X_train.parquet
Loading val sent_transformer embeddings from /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/data/results/sent_transformer_X_val.parquet
Loading test sent_transformer embeddings from /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/data/results/sent_transformer_X

Loading val specter2 embeddings from /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/data/results/specter2_X_val.parquet
Loading test specter2 embeddings from /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/data/results/specter2_X_test.parquet

📊 Combining base features with embeddings...
 ✅ Merged sent_transformer embeddings
 ✅ Merged scibert embeddings
 ✅ Merged specter2 embeddings

📊 Data Summary:
 Train samples: 9600, Features: 1977
 Val samples: 1200
 Test samples: 1200
 Train Positive: 648, Negative: 8952
 Val Positive: 81, Negative: 1119

⏱️ Data Loading Time: 0.31 seconds (0.01 minutes)
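
The data summary above shows 648 positives among 9600 training rows (6.75%), which is why the pipeline lists `RandomUnderSampler` for class imbalance. The sketch below illustrates the idea of random undersampling with plain NumPy. It is a stand-in for the concept, not imblearn's implementation, and the function name `random_undersample` and the synthetic arrays are hypothetical.

```python
import numpy as np


def random_undersample(X, y, seed=42):
    """Keep all minority rows and an equally sized random subset of the majority rows."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_keep = min(len(pos_idx), len(neg_idx))

    keep = np.concatenate(
        [
            rng.choice(pos_idx, size=n_keep, replace=False),
            rng.choice(neg_idx, size=n_keep, replace=False),
        ]
    )
    rng.shuffle(keep)  # avoid an all-positives-first ordering
    return X[keep], y[keep]


# Synthetic labels with the same class counts as the training split logged above.
y_demo = np.array([1] * 648 + [0] * 8952)
X_demo = np.arange(len(y_demo), dtype=np.float32).reshape(-1, 1)
X_bal, y_bal = random_undersample(X_demo, y_demo)
print(X_bal.shape, int(y_bal.sum()))
```

Undersampling should be applied to the training split only; validation and test data keep their original class ratio so that reported metrics reflect the real distribution.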


## 3. Temporal Feature Engineering


In [7]:
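def safe_parse_json(value):
  """Parse a JSON or Python-literal string; return {} when parsing fails."""
  if value is None:
    return {}
  if isinstance(value, (dict, list)):
    return value
  if isinstance(value, str):
    if not value:
      return {}
    try:
      # Strict JSON first (handles true/false/null)
      return json.loads(value)
    except Exception:
      pass
    try:
      # Fall back to Python-literal syntax (single quotes, True/False/None)
      return ast.literal_eval(value)
    except Exception:
      return {}
  return {}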


def extract_temporal_features(df: pl.DataFrame) -> pl.DataFrame:
  """Extract temporal features from date columns."""
  print("\n🔧 Extracting temporal features...")

  df_processed = df.clone()

  # Drop abstract if present (temporal drops it)
  if "abstract" in df_processed.columns:
    df_processed = df_processed.drop("abstract")
    print(" ✅ Dropped 'abstract' column")

  # Drop id column (temporal extracts work_id separately)
  if "id" in df_processed.columns:
    df_processed = df_processed.drop("id")

  # Date features (matching temporal)
  if "publication_year" in df_processed.columns:
    current_year = datetime.now().year
    df_processed = df_processed.with_columns(
      (current_year - pl.col("publication_year")).alias(
        "num_years_after_publication"
      )
    )
    df_processed = df_processed.drop("publication_year")
    print(" ✅ Created 'num_years_after_publication'")

  if "updated_date" in df_processed.columns:
    try:
      # Convert to datetime and calculate days
      df_processed = df_processed.with_columns(
        pl.col("updated_date").str.strptime(
          pl.Datetime, format="%Y-%m-%dT%H:%M:%S.%f", strict=False
        )
      )
      today = datetime.now()
      df_processed = df_processed.with_columns(
        ((today - pl.col("updated_date")).dt.total_days()).alias(
          "days_since_updated"
        )
      )
      df_processed = df_processed.drop("updated_date")
      print(" ✅ Created 'days_since_updated'")
    except:
      df_processed = df_processed.drop("updated_date")

  if "publication_date" in df_processed.columns:
    try:
      df_processed = df_processed.with_columns(
        pl.col("publication_date").str.strptime(
          pl.Datetime, format="%Y-%m-%d", strict=False
        )
      )
      today = datetime.now()
      df_processed = df_processed.with_columns(
        ((today - pl.col("publication_date")).dt.total_days()).alias(
          "days_since_publication"
        )
      )
      df_processed = df_processed.drop("publication_date")
      print(" ✅ Created 'days_since_publication'")
    except:
      df_processed = df_processed.drop("publication_date")

  # Drop doi_url (temporal drops it)
  if "doi_url" in df_processed.columns:
    df_processed = df_processed.drop("doi_url")

  # Drop ids column (temporal drops it)
  if "ids" in df_processed.columns:
    df_processed = df_processed.drop("ids")

  # Open access normalization (matching temporal: extract is_oa, oa_status, any_repository_has_fulltext)
  if "open_access" in df_processed.columns:
    open_access_parsed = (
      df_processed.select("open_access")
      .to_series()
      .map_elements(safe_parse_json, return_dtype=pl.Object)
    )

    is_oa_values = []
    oa_status_values = []
    any_repository_has_fulltext_values = []

    for oa in open_access_parsed:
      if isinstance(oa, dict):
        is_oa_values.append(1.0 if oa.get("is_oa", False) else 0.0)
        oa_status_values.append(oa.get("oa_status", "closed"))
        any_repository_has_fulltext_values.append(
          1.0 if oa.get("any_repository_has_fulltext", False) else 0.0
        )
      else:
        is_oa_values.append(0.0)
        oa_status_values.append("closed")
        any_repository_has_fulltext_values.append(0.0)

    df_processed = df_processed.with_columns(
      [
        pl.Series("is_oa", is_oa_values, dtype=pl.Float32),
        pl.Series("oa_status", oa_status_values, dtype=pl.Utf8),
        pl.Series(
          "any_repository_has_fulltext",
          any_repository_has_fulltext_values,
          dtype=pl.Float32,
        ),
      ]
    )
    df_processed = df_processed.drop("open_access")
    print(
      " ✅ Extracted open_access features (is_oa, oa_status, any_repository_has_fulltext)"
    )

  # Authorships count (matching temporal: num_authorships)
  if "authorships" in df_processed.columns:
    authorships_parsed = (
      df_processed.select("authorships")
      .to_series()
      .map_elements(safe_parse_json, return_dtype=pl.Object)
    )

    num_authorships = []
    for auth in authorships_parsed:
      if isinstance(auth, list):
        # Count author positions (matching temporal logic)
        count = sum(
          1 for a in auth if isinstance(a, dict) and "author_position" in a
        )
        num_authorships.append(float(count))
      else:
        num_authorships.append(0.0)

    df_processed = df_processed.with_columns(
      pl.Series("num_authorships", num_authorships, dtype=pl.Float32)
    )
    df_processed = df_processed.drop("authorships")
    print(" ✅ Created 'num_authorships'")

  # Drop locations (temporal drops it)
  if "locations" in df_processed.columns:
    df_processed = df_processed.drop("locations")

  # Primary location normalization (matching temporal)
  if "primary_location" in df_processed.columns:
    primary_location_parsed = (
      df_processed.select("primary_location")
      .to_series()
      .map_elements(safe_parse_json, return_dtype=pl.Object)
    )

    # Extract source fields if available
    source_fields = {}
    for ploc in primary_location_parsed:
      if isinstance(ploc, dict) and "source" in ploc:
        source = ploc["source"]
        if isinstance(source, dict):
          for key, value in source.items():
            if key not in source_fields:
              source_fields[key] = []
            source_fields[key].append(value if value is not None else "")
          break

    # Add source fields as columns (simplified - temporal does json_normalize)
    # For now, we'll drop primary_location as the nested structure is complex
    df_processed = df_processed.drop("primary_location")
    print(" ✅ Processed 'primary_location' (dropped nested structure)")

  # Related works count (matching temporal: num_related_words)
  if "related_works" in df_processed.columns:
    related_works_parsed = (
      df_processed.select("related_works")
      .to_series()
      .map_elements(safe_parse_json, return_dtype=pl.Object)
    )

    num_related_words = []
    for rw in related_works_parsed:
      if isinstance(rw, list):
        num_related_words.append(float(len(rw)))
      else:
        num_related_words.append(0.0)

    df_processed = df_processed.with_columns(
      pl.Series("num_related_words", num_related_words, dtype=pl.Float32)
    )
    df_processed = df_processed.drop("related_works")
    print(" ✅ Created 'num_related_words'")

  # Grants count (matching temporal: num_grants)
  if "grants" in df_processed.columns:
    grants_parsed = (
      df_processed.select("grants")
      .to_series()
      .map_elements(safe_parse_json, return_dtype=pl.Object)
    )

    num_grants = []
    for g in grants_parsed:
      if isinstance(g, list):
        num_grants.append(float(len(g)))
      else:
        num_grants.append(0.0)

    df_processed = df_processed.with_columns(
      pl.Series("num_grants", num_grants, dtype=pl.Float32)
    )
    df_processed = df_processed.drop("grants")
    print(" ✅ Created 'num_grants'")

  # Drop title, concepts (temporal drops them)
  for col in ["title", "concepts"]:
    if col in df_processed.columns:
      df_processed = df_processed.drop(col)

  # Language normalization (matching temporal: new_language)
  if "language" in df_processed.columns:
    df_processed = df_processed.with_columns(
      pl.col("language").fill_null("unknown").alias("new_language")
    )
    df_processed = df_processed.drop("language")
    print(" ✅ Created 'new_language'")

  # Type and type_crossref handling (matching temporal)
  # Keep type_crossref if present, drop type
  if "type" in df_processed.columns:
    df_processed = df_processed.drop("type")

  # Categorical dummies for new_language, oa_status (matching temporal)
  # Note: encoded manually with boolean comparisons (Polars also offers DataFrame.to_dummies)
  if "new_language" in df_processed.columns:
    lang_values = df_processed.select("new_language").to_series().unique().to_list()
    for lang in lang_values:
      if lang is not None:
        col_name = f"new_language_{lang}"
        df_processed = df_processed.with_columns(
          (pl.col("new_language") == lang).cast(pl.Float32).alias(col_name)
        )
    df_processed = df_processed.drop("new_language")
    print(f" ✅ One-hot encoded 'new_language' ({len(lang_values)} categories)")

  if "oa_status" in df_processed.columns:
    oa_status_values = (
      df_processed.select("oa_status").to_series().unique().to_list()
    )
    for status in oa_status_values:
      if status is not None:
        col_name = f"oa_status_{status}"
        df_processed = df_processed.with_columns(
          (pl.col("oa_status") == status).cast(pl.Float32).alias(col_name)
        )
    df_processed = df_processed.drop("oa_status")
    print(f" ✅ One-hot encoded 'oa_status' ({len(oa_status_values)} categories)")

  # Fill nulls and ensure float types
  df_processed = df_processed.fill_null(0.0)

  # Convert all numeric columns to float32
  for col in df_processed.columns:
    if df_processed[col].dtype in [
      pl.Int8,
      pl.Int16,
      pl.Int32,
      pl.Int64,
      pl.UInt8,
      pl.UInt16,
      pl.UInt32,
      pl.UInt64,
    ]:
      df_processed = df_processed.with_columns(pl.col(col).cast(pl.Float32))

  print(f" ✅ Feature engineering complete. Final shape: {df_processed.shape}")
  return df_processed


# Apply feature engineering
try:
  print("\n" + "=" * 80)
  print("PHASE 2: Feature Engineering (Temporal)")
  print("=" * 80)
  phase_start = time.time()

  X_train_engineered = extract_temporal_features(X_train_combined)
  X_val_engineered = extract_temporal_features(X_val_combined)
  X_test_engineered = extract_temporal_features(X_test_combined)

  # Ensure all dataframes have the same columns (align test to train)
  train_cols = X_train_engineered.columns
  val_cols = X_val_engineered.columns
  test_cols = X_test_engineered.columns

  # Add missing columns to val and test (fill with 0)
  missing_val_cols = [c for c in train_cols if c not in val_cols]
  missing_test_cols = [c for c in train_cols if c not in test_cols]

  for col in missing_val_cols:
    X_val_engineered = X_val_engineered.with_columns(pl.lit(0.0).alias(col))
  for col in missing_test_cols:
    X_test_engineered = X_test_engineered.with_columns(pl.lit(0.0).alias(col))

  # Remove extra columns from val and test
  extra_val_cols = [c for c in val_cols if c not in train_cols]
  extra_test_cols = [c for c in test_cols if c not in train_cols]

  if extra_val_cols:
    X_val_engineered = X_val_engineered.drop(extra_val_cols)
  if extra_test_cols:
    X_test_engineered = X_test_engineered.drop(extra_test_cols)

  # Reorder columns to match train
  X_val_engineered = X_val_engineered.select(train_cols)
  X_test_engineered = X_test_engineered.select(train_cols)

  # Store column names for later use
  feature_column_names = train_cols

  # Convert to numpy arrays (keep as DataFrame for now to identify numeric columns)
  # Keep as DataFrame for feature combination
  # X_train_np = X_train_engineered.to_numpy().astype(np.float32)
  # X_val_np = X_val_engineered.to_numpy().astype(np.float32)
  # X_test_np = X_test_engineered.to_numpy().astype(np.float32)

  phase_time = time.time() - phase_start
  print("\n✅ Feature engineering complete")
  print(f" Final feature count: {len(train_cols)}")
  print(f" Train shape: {X_train_engineered.shape}")
  print(f" Val shape: {X_val_engineered.shape}")
  print(f" Test shape: {X_test_engineered.shape}")
  print(
    f"\n⏱️ Feature Engineering Time: {phase_time:.2f} seconds ({phase_time/60:.2f} minutes)"
  )

  del X_train_combined, X_val_combined, X_test_combined
  cleanup_memory()
  memory_usage()
except Exception as e:
  print(f"❌ Error in feature engineering: {e}")
  import traceback

  traceback.print_exc()
  raise



PHASE 2: Feature Engineering (Temporal)

🔧 Extracting temporal features...
 ✅ Created 'num_years_after_publication'
 ✅ Created 'new_language'
 ✅ One-hot encoded 'new_language' (42 categories)
 ✅ Feature engineering complete. Final shape: (9600, 2016)

🔧 Extracting temporal features...
 ✅ Created 'num_years_after_publication'
 ✅ Created 'new_language'
 ✅ One-hot encoded 'new_language' (32 categories)
 ✅ Feature engineering complete. Final shape: (1200, 2006)

🔧 Extracting temporal features...
 ✅ Created 'num_years_after_publication'
 ✅ Created 'new_language'
 ✅ One-hot encoded 'new_language' (32 categories)
 ✅ Feature engineering complete. Final shape: (1200, 2006)

✅ Feature engineering complete
 Final feature count: 2016
 Train shape: (9600, 2016)
 Val shape: (1200, 2016)
 Test shape: (1200, 2016)

⏱️ Feature Engineering Time: 0.11 seconds (0.00 minutes)
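
The temporal features in this cell are computed relative to `datetime.now()`, so their values shift whenever the notebook is rerun on a different day. One possible way to make them reproducible is to anchor them to a fixed reference date. The sketch below assumes a hypothetical constant `REFERENCE_DATE` (set here to the run date shown in the startup output) and hypothetical helper names; it uses only the standard library and is not the notebook's code.

```python
from datetime import date

# Hypothetical fixed anchor (an assumption, not defined in the notebook).
REFERENCE_DATE = date(2025, 11, 20)


def years_after_publication(publication_year, reference=REFERENCE_DATE):
    """Whole calendar years between the publication year and the reference year."""
    return reference.year - publication_year


def days_since(iso_date, reference=REFERENCE_DATE):
    """Days elapsed between an ISO 'YYYY-MM-DD' date string and the reference date."""
    return (reference - date.fromisoformat(iso_date)).days


demo_years = years_after_publication(2020)
demo_days = days_since("2025-11-10")
print(demo_years, demo_days)
```

With a fixed anchor, a model trained today and a submission generated next month would see identical feature values for the same record.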


## 4. Feature Combination (XGBoost Style: Regular + Embeddings with PCA)


In [8]:
# Split features into regular and embeddings (XGBoost style)
# Then apply PCA to embeddings and combine with regular features
try:
  print("\n" + "=" * 80)
  print("PHASE 3: Feature Combination (XGBoost Style)")
  print("=" * 80)
  phase_start = time.time()

  # Split features (XGBoost style)
  X_reg_train, X_emb_train_fams, _, reg_cols, emb_family_to_cols = (
    split_features_reg_and_all_emb(X_train_engineered)
  )
  X_reg_val, X_emb_val_fams, _, _, _ = split_features_reg_and_all_emb(
    X_val_engineered
  )
  X_reg_test, X_emb_test_fams, _, _, _ = split_features_reg_and_all_emb(
    X_test_engineered
  )

  print("\n📊 Feature Split:")
  print(f" Regular features: {len(reg_cols) if reg_cols else 0}")
  for fam, arr in X_emb_train_fams.items():
    print(f" Embedding {fam}: {arr.shape[1]} dims")

  # PCA configuration (matching XGBoost)
  PCA_COMPONENTS_PER_FAMILY = {
    "sent_transformer_": 32,
    "scibert_": 32,
    "specter_": 32,
    "specter2_": 32,
    "ner_": 16,
  }

  # Apply PCA to embeddings
  print("\n📊 Applying IncrementalPCA to embedding families...")
  X_emb_train_pca_list = []
  X_emb_val_pca_list = []
  X_emb_test_pca_list = []
  pca_models = {}

  for fam, X_emb_train in X_emb_train_fams.items():
    n_components = PCA_COMPONENTS_PER_FAMILY.get(fam, 32)
    print(f" {fam}: {X_emb_train.shape[1]} dims → {n_components} components")

    # Fit PCA on train
    ipca = IncrementalPCA(
      n_components=min(n_components, X_emb_train.shape[1]), batch_size=2000
    )

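    # Fit on a random 30% subset of the train rows. Because max_pca_rows is 30% of
    # the row count, the condition below holds for any non-empty input.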
    max_pca_rows = int(X_emb_train.shape[0] * 0.3)
    if X_emb_train.shape[0] > max_pca_rows:
      idx = np.random.choice(
        X_emb_train.shape[0], size=max_pca_rows, replace=False
      )
      ipca.fit(X_emb_train[idx])
      del idx
    else:
      ipca.fit(X_emb_train)

    pca_models[fam] = ipca

    # Transform train, val, test
    X_emb_train_pca = ipca.transform(X_emb_train)
    X_emb_val_pca = ipca.transform(X_emb_val_fams[fam])
    X_emb_test_pca = ipca.transform(X_emb_test_fams[fam])

    X_emb_train_pca_list.append(X_emb_train_pca)
    X_emb_val_pca_list.append(X_emb_val_pca)
    X_emb_test_pca_list.append(X_emb_test_pca)

    cleanup_memory()

  # Combine embeddings
  X_emb_train_combined = (
    np.hstack(X_emb_train_pca_list) if X_emb_train_pca_list else None
  )
  X_emb_val_combined = np.hstack(X_emb_val_pca_list) if X_emb_val_pca_list else None
  X_emb_test_combined = (
    np.hstack(X_emb_test_pca_list) if X_emb_test_pca_list else None
  )

  # Combine regular + embeddings
  if X_reg_train is not None and X_emb_train_combined is not None:
    X_train_combined_np = np.hstack([X_reg_train, X_emb_train_combined])
    X_val_combined_np = np.hstack([X_reg_val, X_emb_val_combined])
    X_test_combined_np = np.hstack([X_reg_test, X_emb_test_combined])
  elif X_reg_train is not None:
    X_train_combined_np = X_reg_train
    X_val_combined_np = X_reg_val
    X_test_combined_np = X_reg_test
  elif X_emb_train_combined is not None:
    X_train_combined_np = X_emb_train_combined
    X_val_combined_np = X_emb_val_combined
    X_test_combined_np = X_emb_test_combined
  else:
    raise ValueError("No features available!")

  phase_time = time.time() - phase_start
  print("\n✅ Feature combination complete")
  print(f" Combined train: {X_train_combined_np.shape}")
  print(f" Combined val: {X_val_combined_np.shape}")
  print(f" Combined test: {X_test_combined_np.shape}")
  print(
    f"\n⏱️ Feature Combination Time: {phase_time:.2f} seconds ({phase_time/60:.2f} minutes)"
  )

  # Store names for the combined matrix: regular columns first, then PCA components
  combined_feature_names = []
  if reg_cols:
    combined_feature_names.extend(reg_cols)
  for fam in X_emb_train_fams.keys():
    # Use the fitted component count, which can be below the configured value
    n_comp = pca_models[fam].n_components_
    combined_feature_names.extend([f"{fam}pca_{i}" for i in range(n_comp)])

  del X_reg_train, X_reg_val, X_reg_test
  del X_emb_train_fams, X_emb_val_fams, X_emb_test_fams
  del X_emb_train_pca_list, X_emb_val_pca_list, X_emb_test_pca_list
  cleanup_memory()
  memory_usage()
except Exception as e:
  print(f"❌ Error in feature combination: {e}")
  import traceback

  traceback.print_exc()
  raise



PHASE 3: Feature Combination (XGBoost Style)

📊 Feature Split:
 Regular features: 96
 Embedding sent_transformer_: 384 dims
 Embedding scibert_: 768 dims
 Embedding specter2_: 768 dims

📊 Applying IncrementalPCA to embedding families...
 sent_transformer_: 384 dims → 32 components
 scibert_: 768 dims → 32 components
 specter2_: 768 dims → 32 components

✅ Feature combination complete
 Combined train: (9600, 192)
 Combined val: (1200, 192)
 Combined test: (1200, 192)

⏱️ Feature Combination Time: 0.80 seconds (0.01 minutes)
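
The per-family flow in the cell above (fit on a row subsample of the training split, transform every split with the same fitted model, then stack the families side by side) can be sketched with plain NumPy. This is only an illustration under stated assumptions: it swaps `IncrementalPCA` for an SVD-based PCA, and `fit_pca`, `apply_pca`, and the toy family names `fam_a_` and `fam_b_` are hypothetical, not part of the notebook.

```python
import numpy as np


def fit_pca(X, n_components, max_rows=None, seed=0):
    """Fit PCA on X (optionally on a random row subsample); return (mean, components)."""
    rng = np.random.default_rng(seed)
    if max_rows is not None and X.shape[0] > max_rows:
        X = X[rng.choice(X.shape[0], size=max_rows, replace=False)]
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def apply_pca(X, mean, components):
    """Project X with statistics learned on the training split only."""
    return (X - mean) @ components.T


rng = np.random.default_rng(42)
train_fams = {"fam_a_": rng.normal(size=(200, 16)), "fam_b_": rng.normal(size=(200, 24))}
val_fams = {"fam_a_": rng.normal(size=(50, 16)), "fam_b_": rng.normal(size=(50, 24))}

train_parts, val_parts, names = [], [], []
for fam, X_tr in train_fams.items():
    mean, comps = fit_pca(X_tr, n_components=4, max_rows=100)
    train_parts.append(apply_pca(X_tr, mean, comps))
    val_parts.append(apply_pca(val_fams[fam], mean, comps))
    # Name columns from the fitted model, not from a separate config dict.
    names.extend(f"{fam}pca_{i}" for i in range(comps.shape[0]))

X_train_emb = np.hstack(train_parts)
X_val_emb = np.hstack(val_parts)
print(X_train_emb.shape, X_val_emb.shape, names[:2])
```

Deriving the column names from the fitted components keeps the name list the same length as the stacked matrix even if a family is skipped or ends up with fewer components than configured.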


## 5. Duplicate Feature Elimination


In [9]:
# Identify and remove absolute duplicate features
try:
  print("\n" + "=" * 80)
  print("PHASE 4: Duplicate Feature Elimination")
  print("=" * 80)
  phase_start = time.time()

  # Find duplicate columns (columns with identical values)
  print("\n🔍 Identifying duplicate features...")
  duplicate_groups = []
  checked_cols = set()

  for i in range(X_train_combined_np.shape[1]):
    if i in checked_cols:
      continue

    col_i_data = X_train_combined_np[:, i]

    duplicates = [i]
    for j in range(i + 1, X_train_combined_np.shape[1]):
      if j in checked_cols:
        continue

      col_j_data = X_train_combined_np[:, j]

      # Check if columns are identical (allowing for small floating point differences)
      if np.allclose(col_i_data, col_j_data, rtol=1e-5, atol=1e-8):
        duplicates.append(j)
        checked_cols.add(j)

    if len(duplicates) > 1:
      duplicate_groups.append(duplicates)
      checked_cols.add(i)

  # Keep first column from each duplicate group, remove others
  cols_to_keep = set(range(X_train_combined_np.shape[1]))
  cols_to_remove = []

  for group in duplicate_groups:
    # Keep first, remove rest
    cols_to_remove.extend(group[1:])
    print(
      f" Found duplicate group: columns {group} (keeping column {group[0]}, removing {len(group)-1} duplicates)"
    )

  cols_to_keep = sorted(cols_to_keep - set(cols_to_remove))

  if cols_to_remove:
    print(f"\n📊 Removing {len(cols_to_remove)} duplicate features")
    X_train_final = X_train_combined_np[:, cols_to_keep]
    X_val_final = X_val_combined_np[:, cols_to_keep]
    X_test_final = X_test_combined_np[:, cols_to_keep]
  else:
    print("\n✅ No duplicate features found")
    X_train_final = X_train_combined_np
    X_val_final = X_val_combined_np
    X_test_final = X_test_combined_np

  phase_time = time.time() - phase_start
  print(f"\n✅ Duplicate elimination complete")
  print(
    f" Final feature count: {X_train_final.shape[1]} (removed {len(cols_to_remove)} duplicates)"
  )
  print(f" Train shape: {X_train_final.shape}")
  print(f" Val shape: {X_val_final.shape}")
  print(f" Test shape: {X_test_final.shape}")
  print(
    f"\n⏱️ Duplicate Elimination Time: {phase_time:.2f} seconds ({phase_time/60:.2f} minutes)"
  )

  cleanup_memory()
except Exception as e:
  print(f"❌ Error in duplicate elimination: {e}")
  import traceback

  traceback.print_exc()
  # Fallback: use original data
  X_train_final = X_train_combined_np
  X_val_final = X_val_combined_np
  X_test_final = X_test_combined_np
  cols_to_remove = []
  print("⚠️ Continuing with original features (no deduplication)")


PHASE 4: Duplicate Feature Elimination

🔍 Identifying duplicate features...
 Found duplicate group: columns [2, 3, 5, 6, 7, 8, 12, 14, 18, 20, 34, 36, 37, 39, 40, 49, 50, 51, 52] (keeping column 2, removing 18 duplicates)

📊 Removing 18 duplicate features

✅ Duplicate elimination complete
 Final feature count: 174 (removed 18 duplicates)
 Train shape: (9600, 174)
 Val shape: (1200, 174)
 Test shape: (1200, 174)

⏱️ Duplicate Elimination Time: 0.92 seconds (0.02 minutes)
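
The pairwise loop above compares every column against every later column. A minimal sketch of an alternative follows, assuming exact (byte-level) equality is acceptable in place of `np.allclose`; `find_duplicate_columns` and the toy matrix are hypothetical. It also shows the name list being filtered with the same indices, a step the cells shown here do not apply to `combined_feature_names`.

```python
import numpy as np


def find_duplicate_columns(X):
    """Group column indices whose values are byte-for-byte identical."""
    groups = {}
    for j in range(X.shape[1]):
        key = np.ascontiguousarray(X[:, j]).tobytes()
        groups.setdefault(key, []).append(j)
    return [idx for idx in groups.values() if len(idx) > 1]


X = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [2.0, 0.0, 2.0, 0.0],
    [3.0, 5.0, 3.0, 0.0],
])
names = ["a", "b", "a_copy", "c"]

dup_groups = find_duplicate_columns(X)
# Keep the first column of each group, drop the rest.
drop = {j for group in dup_groups for j in group[1:]}
keep = [j for j in range(X.shape[1]) if j not in drop]

X_dedup = X[:, keep]
kept_names = [names[j] for j in keep]  # keep names aligned with columns
print(dup_groups, kept_names)
```

Because the comparison is on raw bytes, columns that differ only by floating-point noise are not merged, which is stricter than the tolerance-based check in the notebook.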


## 6. Feature Review: Original vs Engineered


In [10]:
# Display feature comparison for review
try:
  print("\n" + "=" * 80)
  print("PHASE 5: Feature Review (Original vs Engineered)")
  print("=" * 80)

  print("\n📊 Original Features (from data_exploration_organized.ipynb):")
  print(f" Columns: {len(X_train_original.columns)}")
  print(f" Sample columns: {list(X_train_original.columns[:10])}")

  print("\n📊 Engineered Features (after temporal + PCA + deduplication):")
  print(f" Total features: {X_train_final.shape[1]}")
  print(f" Regular features: {len(reg_cols) if reg_cols else 0}")
  print(
    f" Temporal features added: num_years_after_publication, days_since_updated, days_since_publication"
  )
  print(
    f" PCA-compressed embeddings: {sum(PCA_COMPONENTS_PER_FAMILY.values())} components"
  )

  # Show feature breakdown
  print("\n📋 Feature Breakdown:")
  print(f" - Original regular features: {len(reg_cols) if reg_cols else 0}")
  print(f" - Temporal features (temporal): 3")
  # NOTE: the embedding count is obtained by subtraction, so it is only correct
  # if the 3 temporal features are not already part of reg_cols and every
  # removed duplicate was an embedding column.
  print(
    f" - PCA-compressed embeddings: {X_train_final.shape[1] - (len(reg_cols) if reg_cols else 0) - 3}"
  )
  print(
    f" - Duplicates removed: {len(cols_to_remove) if 'cols_to_remove' in locals() and cols_to_remove else 0}"
  )

  # Store for the final summary
  feature_review = {
    "original_features": len(X_train_original.columns),
    "final_features": X_train_final.shape[1],
    "regular_features": len(reg_cols) if reg_cols else 0,
    "temporal_features": 3,
    "embedding_features": X_train_final.shape[1]
    - (len(reg_cols) if reg_cols else 0)
    - 3,
    "duplicates_removed": (
      len(cols_to_remove) if "cols_to_remove" in locals() else 0
    ),
  }

  print("\n✅ Feature review complete")

except Exception as e:
  print(f"⚠️ Error in feature review: {e}")
  import traceback

  traceback.print_exc()



PHASE 5: Feature Review (Original vs Engineered)

📊 Original Features (from data_exploration_organized.ipynb):
 Columns: 1977
 Sample columns: ['abstract_length', 'abstract_word_count', 'avg_author_citations', 'avg_author_h_index', 'avg_concept_score', 'avg_topic_score', 'first_author_citations', 'first_author_h_index', 'first_author_papers', 'has_abstract']

📊 Engineered Features (after temporal + PCA + deduplication):
 Total features: 174
 Regular features: 96
 Temporal features added: num_years_after_publication, days_since_updated, days_since_publication
 PCA-compressed embeddings: 144 components

📋 Feature Breakdown:
 - Original regular features: 96
 - Temporal features (temporal): 3
 - PCA-compressed embeddings: 75
 - Duplicates removed: 18

✅ Feature review complete
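
The two embedding counts printed above (144 and 75) disagree with each other and with the 3 × 32 = 96 PCA columns produced in Phase 3, because one comes from a configuration dict and the other from subtraction. A small sketch of counting by column position instead, assuming the combined matrix is laid out as the 96 regular columns followed by the PCA columns (the `np.hstack` order used in Phase 3) and using the duplicate indices printed in Phase 4:

```python
n_regular = 96   # regular feature count reported in Phase 3
n_pca = 3 * 32   # three embedding families, 32 components each

# Columns removed in Phase 4 (every group member except the kept column 2).
removed = [3, 5, 6, 7, 8, 12, 14, 18, 20, 34, 36, 37, 39, 40, 49, 50, 51, 52]
removed_set = set(removed)

kept = [i for i in range(n_regular + n_pca) if i not in removed_set]
regular_kept = sum(1 for i in kept if i < n_regular)
pca_kept = sum(1 for i in kept if i >= n_regular)

print(len(kept), regular_kept, pca_kept)  # 174 78 96
```

Under that layout assumption, all 18 removed columns fall in the regular block, so the 174 final features would split into 78 regular and 96 PCA columns.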


## 7. Selective Scaling


In [11]:
# Identify numeric columns for selective scaling (matching temporal)
# Only scale continuous numeric features, preserve binary/one-hot features
try:
  print("\n" + "=" * 80)
  print("PHASE 6: Selective Feature Scaling (Temporal)")
  print("=" * 80)
  phase_start = time.time()

  # Identify numeric columns that should be scaled
  # Binary/one-hot features should NOT be scaled
  numeric_indices = []
  binary_indices = []

  # Check each feature column
  for i in range(X_train_final.shape[1]):
    col_data = X_train_final[:, i]
    unique_vals = np.unique(col_data)

    # If column only has 0 and 1 (or 0.0 and 1.0), it's binary/one-hot
    if len(unique_vals) <= 2 and set(unique_vals).issubset({0, 1, 0.0, 1.0}):
      binary_indices.append(i)
    else:
      # It's a numeric column that should be scaled
      numeric_indices.append(i)

  print(f"\n📊 Column Analysis:")
  print(f" Numeric columns to scale: {len(numeric_indices)}")
  print(f" Binary/one-hot columns (preserved): {len(binary_indices)}")

  # Scale only numeric columns (matching temporal approach)
  scaler = StandardScaler()

  if numeric_indices:
    # Scale numeric columns only
    X_train_scaled = X_train_final.copy()
    X_val_scaled = X_val_final.copy()
    X_test_scaled = X_test_final.copy()

    X_train_scaled[:, numeric_indices] = scaler.fit_transform(
      X_train_final[:, numeric_indices]
    )
    X_val_scaled[:, numeric_indices] = scaler.transform(
      X_val_final[:, numeric_indices]
    )
    X_test_scaled[:, numeric_indices] = scaler.transform(
      X_test_final[:, numeric_indices]
    )

    print(f" ✅ Scaled {len(numeric_indices)} numeric columns")
    print(f" ✅ Preserved {len(binary_indices)} binary/one-hot columns")
  else:
    # Fallback: scale all if no numeric columns identified
    print(" ⚠️ No numeric columns identified, scaling all features")
    X_train_scaled = scaler.fit_transform(X_train_final)
    X_val_scaled = scaler.transform(X_val_final)
    X_test_scaled = scaler.transform(X_test_final)

  # Update variable names
  X_train = X_train_scaled
  X_val = X_val_scaled
  X_test = X_test_scaled

  phase_time = time.time() - phase_start
  print(f"\n✅ Selective feature scaling complete")
  print(f" Train shape: {X_train.shape}")
  print(f" Val shape: {X_val.shape}")
  print(f" Test shape: {X_test.shape}")
  print(
    f"\n⏱️ Feature Scaling Time: {phase_time:.2f} seconds ({phase_time/60:.2f} minutes)"
  )

  del X_train_final, X_val_final, X_test_final
  cleanup_memory()
  memory_usage()
except Exception as e:
  print(f"❌ Error in feature scaling: {e}")
  import traceback

  traceback.print_exc()
  raise



PHASE 6: Selective Feature Scaling (Temporal)

📊 Column Analysis:
 Numeric columns to scale: 118
 Binary/one-hot columns (preserved): 56
 ✅ Scaled 118 numeric columns
 ✅ Preserved 56 binary/one-hot columns

✅ Selective feature scaling complete
 Train shape: (9600, 174)
 Val shape: (1200, 174)
 Test shape: (1200, 174)

⏱️ Feature Scaling Time: 0.03 seconds (0.00 minutes)
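
The selective-scaling idea above (standardize continuous columns with training-split statistics, leave 0/1 columns untouched) can be illustrated without scikit-learn. This is a NumPy-only sketch rather than the notebook's `StandardScaler` call; the helper names and the toy arrays are hypothetical. It uses the population standard deviation, which is also what `StandardScaler` uses.

```python
import numpy as np


def split_binary_numeric(X):
    """Return (binary, numeric) column indices; binary means values in {0, 1} only."""
    binary, numeric = [], []
    for j in range(X.shape[1]):
        vals = np.unique(X[:, j])
        (binary if np.isin(vals, [0.0, 1.0]).all() else numeric).append(j)
    return binary, numeric


def fit_standardizer(X, cols):
    """Learn per-column mean and std from the training split."""
    mean = X[:, cols].mean(axis=0)
    std = X[:, cols].std(axis=0)
    std[std == 0] = 1.0  # constant columns: avoid division by zero
    return mean, std


def apply_standardizer(X, cols, mean, std):
    """Scale only the selected columns; all other columns pass through unchanged."""
    out = X.astype(float, copy=True)
    out[:, cols] = (out[:, cols] - mean) / std
    return out


X_tr = np.array([[0.0, 10.0, 1.0], [1.0, 20.0, 3.0], [1.0, 30.0, 5.0], [0.0, 40.0, 7.0]])
X_va = np.array([[1.0, 25.0, 4.0]])

binary_idx, numeric_idx = split_binary_numeric(X_tr)
mean, std = fit_standardizer(X_tr, numeric_idx)
X_tr_s = apply_standardizer(X_tr, numeric_idx, mean, std)
X_va_s = apply_standardizer(X_va, numeric_idx, mean, std)
print(binary_idx, numeric_idx, X_va_s)
```

Saving `numeric_idx` alongside the fitted statistics matters at inference time, since the scaler alone does not record which columns it was applied to.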


## 8. Class Imbalance Handling: RandomUnderSampler


In [12]:
# Apply RandomUnderSampler (matching temporal)
try:
  print("\n" + "=" * 80)
  print("PHASE 7: Class Imbalance Handling (RandomUnderSampler)")
  print("=" * 80)
  phase_start = time.time()

  print(f"\n📊 Before resampling:")
  print(f" Train samples: {len(X_train)}")
  print(f" Positive: {y_train.sum()}, Negative: {(y_train==0).sum()}")
  print(f" Imbalance ratio: {(y_train==0).sum() / max(y_train.sum(), 1):.2f}:1")

  # Apply RandomUnderSampler (matching temporal)
  rus = RandomUnderSampler(random_state=SEED)
  X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)

  phase_time = time.time() - phase_start
  print(f"\n✅ RandomUnderSampler complete")
  print(f" After resampling:")
  print(f" Train samples: {len(X_train_resampled)}")
  print(
    f" Positive: {y_train_resampled.sum()}, Negative: {(y_train_resampled==0).sum()}"
  )
  print(
    f" Balance ratio: {(y_train_resampled==0).sum() / max(y_train_resampled.sum(), 1):.2f}:1"
  )
  print(
    f"\n⏱️ Resampling Time: {phase_time:.2f} seconds ({phase_time/60:.2f} minutes)"
  )

  X_train = X_train_resampled
  y_train = y_train_resampled

  del X_train_resampled, y_train_resampled
  cleanup_memory()
  memory_usage()
except Exception as e:
  print(f"❌ Error in resampling: {e}")
  import traceback

  traceback.print_exc()
  print("⚠️ Continuing with original training data...")



PHASE 7: Class Imbalance Handling (RandomUnderSampler)

📊 Before resampling:
 Train samples: 9600
 Positive: 648, Negative: 8952
 Imbalance ratio: 13.81:1

✅ RandomUnderSampler complete
 After resampling:
 Train samples: 1296
 Positive: 648, Negative: 648
 Balance ratio: 1.00:1

⏱️ Resampling Time: 0.00 seconds (0.00 minutes)
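
What the resampling step does to the class counts can be reproduced with a few lines of NumPy. The sketch below is a stand-in for illustration, not imbalanced-learn's implementation; `random_undersample` is a hypothetical helper, and the labels are synthetic but use the class counts reported above.

```python
import numpy as np


def random_undersample(X, y, seed=0):
    """Downsample every class to the size of the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]


# Synthetic labels with the class counts from the output above.
y = np.array([0] * 8952 + [1] * 648)
X = np.arange(len(y), dtype=float).reshape(-1, 1)

X_bal, y_bal = random_undersample(X, y, seed=42)
print(len(y_bal), int(y_bal.sum()))  # 1296 648
```

Only the training split is resampled, so the validation and test splits keep their original class ratio. A model trained on a 1:1 sample tends to overestimate the positive-class probability on such data, which is one plausible reason a tuned decision threshold can land well above 0.5.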


## 9. Model Training Pipeline


In [13]:
# Define models to train (matching temporal: multiple classifiers)
# Including all models that are commonly used in temporal notebooks
models_to_train = {
  "BernoulliNB": BernoulliNB(),
  "LogisticRegression": LogisticRegression(
    random_state=SEED, max_iter=1000, n_jobs=2
  ),
  "DecisionTree": DecisionTreeClassifier(random_state=SEED, max_depth=10),
  "RandomForest": RandomForestClassifier(
    n_estimators=100, random_state=SEED, n_jobs=2, max_depth=10
  ),
  "GradientBoosting": GradientBoostingClassifier(
    n_estimators=100, random_state=SEED, max_depth=5
  ),
  "ExtraTrees": ExtraTreesClassifier(
    n_estimators=100, random_state=SEED, n_jobs=2, max_depth=10
  ),
  "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=SEED),
  "LGBMClassifier": LGBMClassifier(random_state=SEED, verbose=-1),
  # use_label_encoder is deprecated in recent XGBoost releases and is omitted here
  "XGBClassifier": XGBClassifier(random_state=SEED, eval_metric='logloss'),
  "CatBoostClassifier": CatBoostClassifier(random_state=SEED, verbose=False),
  "BaggingClassifier": BaggingClassifier(random_state=SEED, n_jobs=2),
  "Perceptron": Perceptron(random_state=SEED),
  "QuadraticDiscriminantAnalysis": QuadraticDiscriminantAnalysis(),
  "GaussianNB": GaussianNB(),
  "LinearDiscriminantAnalysis": LinearDiscriminantAnalysis(),
  "ExtraTreeClassifier": ExtraTreeClassifier(random_state=SEED, max_depth=10),
  "SGDClassifier": SGDClassifier(random_state=SEED, n_jobs=2),
  "DummyClassifier": DummyClassifier(strategy='most_frequent'),
}

# Train all models
trained_models = {}
model_scores = {}

try:
  print("\n" + "=" * 80)
  print("PHASE 8: Model Training Pipeline (Temporal)")
  print("=" * 80)
  phase_start = time.time()

  for model_name, model in models_to_train.items():
    print(f"\n📊 Training {model_name}...")
    model_start = time.time()

    try:
      # Train model
      model.fit(X_train, y_train)

      # Evaluate on validation set
      y_val_pred = model.predict(X_val)
      y_val_proba = (
        model.predict_proba(X_val)[:, 1]
        if hasattr(model, "predict_proba")
        else y_val_pred
      )

      # Calculate metrics
      f1 = f1_score(y_val, y_val_pred, zero_division=0)
      try:
        auc = roc_auc_score(y_val, y_val_proba)
      except ValueError:
        # roc_auc_score raises ValueError when y_val contains a single class
        auc = 0.0

      trained_models[model_name] = model
      model_scores[model_name] = {
        "f1_score": f1,
        "roc_auc": auc,
        "train_time": time.time() - model_start,
      }

      print(
        f" ✅ {model_name} - F1: {f1:.4f}, AUC: {auc:.4f}, Time: {model_scores[model_name]['train_time']:.2f}s"
      )

    except Exception as e:
      print(f" ❌ {model_name} failed: {e}")
      continue

    cleanup_memory()

  phase_time = time.time() - phase_start
  print(f"\n✅ Model training complete")
  print(f" Trained {len(trained_models)} models")
  print(
    f"\n⏱️ Total Training Time: {phase_time:.2f} seconds ({phase_time/60:.2f} minutes)"
  )

  cleanup_memory()
  memory_usage()
except Exception as e:
  print(f"❌ Error in model training: {e}")
  import traceback

  traceback.print_exc()
  raise

# Find best model based on F1 score.
# Kept at module level so it runs after training; indented under the
# `raise` in the except clause it would be unreachable.
if model_scores:
  best_model_name = max(model_scores, key=lambda k: model_scores[k].get('f1_score', 0))
  best_model = trained_models[best_model_name]
  best_f1_score = model_scores[best_model_name].get('f1_score', 0)
  print(f"\n🏆 Best Model: {best_model_name} (F1: {best_f1_score:.4f})")
else:
  # Fallback: use first model
  best_model_name = list(trained_models.keys())[0] if trained_models else "Unknown"
  best_model = list(trained_models.values())[0] if trained_models else None
  print(f"⚠️ No scores available, using first model: {best_model_name}")



PHASE 8: Model Training Pipeline (Temporal)

📊 Training BernoulliNB...
 ✅ BernoulliNB - F1: 0.2851, AUC: 0.8649, Time: 0.01s

📊 Training LogisticRegression...
 ✅ LogisticRegression - F1: 0.3469, AUC: 0.8832, Time: 1.56s

📊 Training DecisionTree...
 ✅ DecisionTree - F1: 0.2667, AUC: 0.6866, Time: 0.08s

📊 Training RandomForest...
 ✅ RandomForest - F1: 0.3411, AUC: 0.8992, Time: 0.37s

📊 Training GradientBoosting...
 ✅ GradientBoosting - F1: 0.3494, AUC: 0.8883, Time: 3.91s

📊 Training ExtraTrees...
 ✅ ExtraTrees - F1: 0.3571, AUC: 0.9055, Time: 0.10s

📊 Training AdaBoost...
 ✅ AdaBoost - F1: 0.3282, AUC: 0.8855, Time: 0.53s

📊 Training LGBMClassifier...
 ✅ LGBMClassifier - F1: 0.3535, AUC: 0.9015, Time: 0.60s

📊 Training XGBClassifier...
 ✅ XGBClassifier - F1: 0.3515, AUC: 0.9020, Time: 0.20s

📊 Training CatBoostClassifier...
 ✅ CatBoostClassifier - F1: 0.3706, AUC: 0.9106, Time: 2.90s

📊 Training BaggingClassifier...
 ✅ BaggingClassifier - F1: 0.3293, AUC: 0.8753, Time: 2.16s

📊 Training Perceptron...
 ✅ Perceptron - F1: 0.3065, AUC: 0.7547, Time: 0.01s

📊 Training QuadraticDiscriminantAnalysis...
 ✅ QuadraticDiscriminantAnalysis - F1: 0.1770, AUC: 0.6642, Time: 0.01s

📊 Training GaussianNB...
 ✅ GaussianNB - F1: 0.1660, AUC: 0.6381, Time: 0.01s

📊 Training LinearDiscriminantAnalysis...
 ✅ LinearDiscriminantAnalysis - F1: 0.3350, AUC: 0.8771, Time: 0.01s

📊 Training ExtraTreeClassifier...
 ✅ ExtraTreeClassifier - F1: 0.2922, AUC: 0.8241, Time: 0.00s

📊 Training SGDClassifier...
 ✅ SGDClassifier - F1: 0.3065, AUC: 0.7473, Time: 0.02s

📊 Training DummyClassifier...
 ✅ DummyClassifier - F1: 0.0000, AUC: 0.5000, Time: 0.00s

✅ Model training complete
 Trained 18 models

⏱️ Total Training Time: 14.92 seconds (0.25 minutes)
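
The scores above are F1 at the default 0.5 decision threshold, since they come from `model.predict`. A short sketch of turning a `model_scores`-shaped dict into a ranked table follows; the dict literal copies five rows from the output above, and the tie-break on AUC is an assumption rather than something the notebook does.

```python
# Subset of the scores printed above, in the same shape as `model_scores`.
model_scores = {
    "XGBClassifier": {"f1_score": 0.3515, "roc_auc": 0.9020},
    "CatBoostClassifier": {"f1_score": 0.3706, "roc_auc": 0.9106},
    "DummyClassifier": {"f1_score": 0.0000, "roc_auc": 0.5000},
    "ExtraTrees": {"f1_score": 0.3571, "roc_auc": 0.9055},
    "LGBMClassifier": {"f1_score": 0.3535, "roc_auc": 0.9015},
}

# Rank by F1 first, then by ROC AUC to break ties.
ranked = sorted(
    model_scores.items(),
    key=lambda kv: (kv[1]["f1_score"], kv[1]["roc_auc"]),
    reverse=True,
)

for rank, (name, scores) in enumerate(ranked, start=1):
    print(f"{rank:>2}. {name:<20} F1={scores['f1_score']:.4f}  AUC={scores['roc_auc']:.4f}")

best_model_name = ranked[0][0]
ranked_names = [name for name, _ in ranked]
```

Keeping the `DummyClassifier` row in the table is useful as a floor: any model whose F1 or AUC is close to it has learned little from the features.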


## 10. Model Comparison


In [14]:
# ==============
# PATH MANAGEMENT
# ==============

from pathlib import Path
import os

# Get project root by finding data/results directory
current = Path(os.getcwd())
PROJECT_ROOT = current

# Search up to 10 levels to find data/results
for _ in range(10):
    if (PROJECT_ROOT / "data" / "results").exists():
        break
    PROJECT_ROOT = PROJECT_ROOT.parent
else:
    # Fallback: go up two levels from current (assuming we're in src/notebooks)
    PROJECT_ROOT = current.parent.parent

RESULTS_DIR = PROJECT_ROOT / "data" / "results"
MODEL_SAVE_DIR = PROJECT_ROOT / "models" / "saved_models"
SUBMISSION_DIR = PROJECT_ROOT / "data" / "submission_files"
MODEL_SAVE_DIR.mkdir(parents=True, exist_ok=True)
SUBMISSION_DIR.mkdir(parents=True, exist_ok=True)

print("PROJECT_ROOT:", PROJECT_ROOT)
print("RESULTS_DIR:", RESULTS_DIR)
print("MODEL_SAVE_DIR:", MODEL_SAVE_DIR)
print("SUBMISSION_DIR:", SUBMISSION_DIR)


PROJECT_ROOT: /Users/santoshdesai/Documents/Desai_Projects/Kaggle2
RESULTS_DIR: /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/data/results
MODEL_SAVE_DIR: /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/models/saved_models
SUBMISSION_DIR: /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/data/submission_files
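
The upward search in the cell above can be wrapped in a function so that a missing marker directory is reported explicitly instead of silently falling back to two levels up. This is a minimal sketch; `find_project_root` and its parameters are hypothetical names, and the temporary directory only exists to make the example self-contained.

```python
from pathlib import Path
import tempfile


def find_project_root(start, marker=("data", "results"), max_levels=10):
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents][: max_levels + 1]:
        if candidate.joinpath(*marker).exists():
            return candidate
    return None  # caller decides how to handle a missing project root


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "data" / "results").mkdir(parents=True)
    nested = root / "src" / "notebooks"
    nested.mkdir(parents=True)

    found = find_project_root(nested)
    missing = find_project_root(nested, marker=("no_such_marker_dir",))
    is_root = found == root

print(is_root, missing)
```

Returning `None` lets the caller raise a clear error before any `mkdir` call creates output folders in an unintended location.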


## 11. Threshold Tuning


In [15]:
# Find optimal threshold for best model
try:
  print("\n" + "=" * 80)
  print("PHASE 6: Threshold Tuning")
  print("=" * 80)
  phase_start = time.time()

  # Get predictions from best model
  if 'best_model' not in locals() or best_model is None:
    if 'trained_models' in locals() and trained_models:
      if 'model_scores' in locals() and model_scores:
        best_model_name = max(model_scores, key=lambda k: model_scores[k].get('f1_score', 0))
        best_model = trained_models[best_model_name]
      else:
        best_model_name = list(trained_models.keys())[0]
        best_model = trained_models[best_model_name]
    else:
      raise ValueError("No models available for threshold tuning!")
  
  y_val_proba = best_model.predict_proba(X_val)[:, 1]

  # Find optimal threshold using precision-recall curve
  precision, recall, pr_thresholds = precision_recall_curve(y_val, y_val_proba)
  f1_scores_pr = 2 * (precision * recall) / (precision + recall + 1e-10)
  best_pr_idx = np.argmax(f1_scores_pr)
  best_pr_threshold = (
    pr_thresholds[best_pr_idx] if best_pr_idx < len(pr_thresholds) else 0.5
  )
  best_pr_f1 = f1_scores_pr[best_pr_idx]

  # Fine-grained search
  thresholds = np.concatenate(
    [
      np.linspace(0.01, 0.05, 20),
      np.linspace(0.05, 0.15, 50),
      np.linspace(0.15, 0.3, 30),
      np.linspace(0.3, 0.9, 20),
    ]
  )

  best_threshold = best_pr_threshold
  best_f1 = best_pr_f1

  for thr in thresholds:
    y_pred = (y_val_proba >= thr).astype(int)
    f1 = f1_score(y_val, y_pred, pos_label=1, zero_division=0)
    if f1 > best_f1:
      best_f1 = f1
      best_threshold = thr

  print(f"\n✅ Threshold tuning complete")
  print(f" Best threshold: {best_threshold:.4f}")
  print(f" Best F1 score: {best_f1:.4f}")

  # Evaluate with optimal threshold
  y_val_pred_optimal = (y_val_proba >= best_threshold).astype(int)
  final_f1 = f1_score(y_val, y_val_pred_optimal, zero_division=0)

  print(f"\n📊 Final Validation Performance:")
  print(f" F1 Score: {final_f1:.4f}")
  print(classification_report(y_val, y_val_pred_optimal, zero_division=0))

  phase_time = time.time() - phase_start
  print(
    f"\n⏱️ Threshold Tuning Time: {phase_time:.2f} seconds ({phase_time/60:.2f} minutes)"
  )

  cleanup_memory()
except Exception as e:
  print(f"❌ Error in threshold tuning: {e}")
  import traceback

  traceback.print_exc()
  best_threshold = 0.5
  print(f"⚠️ Using default threshold: {best_threshold}")



PHASE 6: Threshold Tuning

✅ Threshold tuning complete
 Best threshold: 0.8434
 Best F1 score: 0.4737

📊 Final Validation Performance:
 F1 Score: 0.4737
              precision    recall  f1-score   support

           0       0.97      0.94      0.95      1119
           1       0.41      0.56      0.47        81

    accuracy                           0.92      1200
   macro avg       0.69      0.75      0.71      1200
weighted avg       0.93      0.92      0.92      1200


⏱️ Threshold Tuning Time: 0.08 seconds (0.00 minutes)
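
The threshold search above boils down to computing F1 at each candidate cut-off and keeping the best one. A NumPy-only sketch of that sweep on toy data follows; `f1_at_threshold`, `best_threshold_on_grid`, and the arrays are hypothetical, and the grid is coarser than the one used in the notebook.

```python
import numpy as np


def f1_at_threshold(y_true, proba, thr):
    """F1 for the positive class when predicting 1 where proba >= thr."""
    pred = proba >= thr
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else float(2 * tp / denom)


def best_threshold_on_grid(y_true, proba, grid):
    """Return (threshold, f1) for the first grid value with the highest F1."""
    scores = [f1_at_threshold(y_true, proba, t) for t in grid]
    i = int(np.argmax(scores))
    return float(grid[i]), scores[i]


y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
proba = np.array([0.10, 0.20, 0.55, 0.60, 0.30, 0.65, 0.70, 0.90])
grid = np.round(np.linspace(0.05, 0.95, 19), 2)

thr, f1 = best_threshold_on_grid(y_true, proba, grid)
f1_default = f1_at_threshold(y_true, proba, 0.5)
print(thr, f1, round(f1_default, 4))
```

Because the threshold is selected and scored on the same validation split, the resulting F1 is an optimistic estimate. Holding out a separate split, or cross-validating the threshold, would give a less biased number.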



## 12. Save Model


In [17]:
# Save best model and scaler
try:
  print("\n" + "=" * 80)
  print("PHASE 7: Save Model")
  print("=" * 80)

  # Ensure best_model_name is set
  if 'best_model_name' not in locals() or best_model_name is None:
    if 'trained_models' in locals() and trained_models:
      if 'model_scores' in locals() and model_scores:
        best_model_name = max(model_scores, key=lambda k: model_scores[k].get('f1_score', 0))
      else:
        best_model_name = list(trained_models.keys())[0]
    else:
      best_model_name = "unknown"
  
  if 'best_model' not in locals() or best_model is None:
    if 'trained_models' in locals() and trained_models:
      best_model = trained_models.get(best_model_name, list(trained_models.values())[0])
    else:
      raise ValueError("No model available to save!")
  
  model_save_path = (
    MODEL_SAVE_DIR / f"model_{best_model_name.lower()}_leakage_test.pkl"
  )

  model_data = {
    "model": best_model,
    "scaler": scaler,
    "model_name": best_model_name,
    "best_threshold": best_threshold,
    "best_f1": best_f1,
    "feature_count": X_train.shape[1],
    "train_samples": len(X_train),
    "val_samples": len(X_val),
    "timestamp": START_TIME_STR,
  }

  with open(model_save_path, "wb") as f:
    pickle.dump(model_data, f)

  print(f"\n✅ Model saved to: {model_save_path}")
  print(f" Model: {best_model_name}")
  print(f" Threshold: {best_threshold:.4f}")
  print(f" F1 Score: {best_f1:.4f}")

except Exception as e:
  print(f"❌ Error saving model: {e}")
  import traceback

  traceback.print_exc()



PHASE 7: Save Model

✅ Model saved to: /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/models/saved_models/model_catboostclassifier_leakage_test.pkl
 Model: CatBoostClassifier
 Threshold: 0.8434
 F1 Score: 0.4737
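
The saved bundle pairs the model with its tuned threshold, so a consumer should apply `best_threshold` rather than call `predict` directly. The round trip below is a sketch using only the standard library: the "model" is a hypothetical dict of logistic coefficients standing in for a fitted estimator, and `predict_with_bundle` is not part of the notebook.

```python
import math
import pickle
import tempfile
from pathlib import Path

# Same key layout as the notebook's bundle, with a toy stand-in for the estimator.
bundle = {
    "model": {"weights": [2.0, -1.0], "bias": 0.5},
    "model_name": "toy_logistic",
    "best_threshold": 0.8434,
    "feature_count": 2,
}


def predict_proba(model, row):
    """Positive-class probability of the toy logistic model."""
    z = model["bias"] + sum(w * x for w, x in zip(model["weights"], row))
    return 1.0 / (1.0 + math.exp(-z))


def predict_with_bundle(loaded, rows):
    """Check the feature count, then apply the stored threshold."""
    for row in rows:
        if len(row) != loaded["feature_count"]:
            raise ValueError("unexpected feature count")
    thr = loaded["best_threshold"]
    return [int(predict_proba(loaded["model"], row) >= thr) for row in rows]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "bundle.pkl"
    path.write_bytes(pickle.dumps(bundle))
    loaded = pickle.loads(path.read_bytes())

labels = predict_with_bundle(loaded, [[2.0, 0.0], [0.0, 0.0]])
print(labels)  # [1, 0]
```

In the cell above the bundle stores the scaler but not the fitted PCA models, the scaled column indices, or the kept column list, so reproducing the preprocessing on new data would need those objects from elsewhere. Pickle files should only be loaded from trusted sources.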


## 13. Generate Submission


In [18]:
# Generate test predictions and submission file
try:
  print("\n" + "=" * 80)
  print("PHASE 8: Generate Submission")
  print("=" * 80)
  phase_start = time.time()

  # Get test predictions
  print("Generating test predictions...")
  # Get best model; fall back to the first trained model if it is missing
  best_model = locals().get("best_model", None)
  if best_model is None and "trained_models" in locals() and trained_models:
    best_model = list(trained_models.values())[0]
    print("⚠️ Using first trained model as best_model")
  if best_model is None:
    raise ValueError("No model available for prediction!")
  y_test_proba = best_model.predict_proba(X_test)[:, 1]
  best_threshold_val = locals().get("best_threshold", 0.5)
  y_test_pred = (y_test_proba >= best_threshold_val).astype(int)

  print(f" Predictions produced: {len(y_test_pred)}")
  print(f" Positive predictions: {y_test_pred.sum()}")
  print(f" Negative predictions: {(y_test_pred == 0).sum()}")

  # Extract work_id from test IDs
  def extract_work_id(id_value: str) -> str:
    """Extract work_id from OpenAlex ID format."""
    if isinstance(id_value, str):
      if id_value.startswith("https://openalex.org/"):
        return id_value.replace("https://openalex.org/", "")
      elif id_value.startswith("W"):
        return id_value
    return str(id_value)

  # Get work_ids for test set
  if work_ids_test is not None:
    test_work_ids = [extract_work_id(wid) for wid in work_ids_test]
  else:
    # Fallback: generate sequential IDs
    test_work_ids = [f"W{i:010d}" for i in range(len(y_test_pred))]
    print(" ⚠️ Using generated work_ids (original IDs not found)")

  # Create submission DataFrame
  submission_df = pl.DataFrame({"work_id": test_work_ids, "label": y_test_pred})

  # Save submission
  best_model_name_val = locals().get("best_model_name", "unknown")
  submission_path = (
    SUBMISSION_DIR / f"submission_{best_model_name_val.lower()}_leakage_test.csv"
  )
  submission_df.write_csv(submission_path)

  phase_time = time.time() - phase_start
  print(f"\n✅ Submission generated")
  print(f" File: {submission_path}")
  print(f" Samples: {len(submission_df)}")
  print(
    f"\n⏱️ Submission Generation Time: {phase_time:.2f} seconds ({phase_time/60:.2f} minutes)"
  )

  # Display sample
  print("\n📋 Submission Sample (first 10 rows):")
  print(submission_df.head(10))

except Exception as e:
  print(f"❌ Error generating submission: {e}")
  import traceback

  traceback.print_exc()
  raise



PHASE 8: Generate Submission
Generating test predictions...
 Predictions produced: 1200
 Positive predictions: 93
 Negative predictions: 1107

✅ Submission generated
 File: /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/data/submission_files/submission_catboostclassifier_leakage_test.csv
 Samples: 1200

⏱️ Submission Generation Time: 0.03 seconds (0.00 minutes)

📋 Submission Sample (first 10 rows):
shape: (10, 2)
┌─────────────┬───────┐
│ work_id     ┆ label │
│ ---         ┆ ---   │
│ str         ┆ i64   │
╞═════════════╪═══════╡
│ W3183595028 ┆ 1     │
│ W4281846540 ┆ 0     │
│ W3041108599 ┆ 0     │
│ W3177215330 ┆ 1     │
│ W4398934184 ┆ 0     │
│ W3043702115 ┆ 0     │
│ W3105126562 ┆ 1     │
│ W4212866133 ┆ 0     │
│ W3129363439 ┆ 0     │
│ W4256661485 ┆ 0     │
└─────────────┴───────┘
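
Before uploading, it can help to check the submission for the problems that are easy to introduce in the fallback paths above, such as duplicate or mismatched IDs. The sketch below uses plain Python lists instead of a Polars frame; `validate_submission` is a hypothetical helper, and `extract_work_id` is a simplified variant of the function in the cell above.

```python
def extract_work_id(id_value):
    """Strip the OpenAlex URL prefix if present; otherwise return the value as text."""
    prefix = "https://openalex.org/"
    text = str(id_value)
    return text[len(prefix):] if text.startswith(prefix) else text


def validate_submission(work_ids, labels):
    """Raise ValueError on length mismatch, duplicate IDs, or non-binary labels."""
    if len(work_ids) != len(labels):
        raise ValueError("work_ids and labels differ in length")
    if len(set(work_ids)) != len(work_ids):
        raise ValueError("duplicate work_id values")
    if any(label not in (0, 1) for label in labels):
        raise ValueError("labels must be 0 or 1")
    return True


raw_ids = ["https://openalex.org/W3183595028", "W4281846540"]
work_ids = [extract_work_id(i) for i in raw_ids]
ok = validate_submission(work_ids, [1, 0])
print(work_ids, ok)
```

The same checks could be applied to the columns of `submission_df` before `write_csv` is called.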

In [19]:
# Final summary
total_time = time.time() - TOTAL_START_TIME
print("\n" + "=" * 80)
print("🎉 PIPELINE EXECUTION COMPLETE!")
print("=" * 80)
print(
  f"\n⏱️ Total Execution Time: {total_time:.2f} seconds ({total_time/60:.2f} minutes)"
)
print(f"\n📊 Summary:")
# Safely get variables with defaults
best_model_name_val = locals().get("best_model_name", "N/A")
best_f1_val = locals().get("best_f1", 0.0)
best_threshold_val = locals().get("best_threshold", 0.5)
trained_models_val = locals().get("trained_models", {})
print(f" Best Model: {best_model_name_val}")
print(f" Best F1 Score: {best_f1_val:.4f}")
print(f" Optimal Threshold: {best_threshold_val:.4f}")
print(f" Models Trained: {len(trained_models_val)}")
print(f"\n📋 Feature Summary:")
if "feature_review" in locals():
  print(f" Original features: {feature_review.get('original_features', 'N/A')}")
  print(f" Final features: {feature_review.get('final_features', 'N/A')}")
  print(f" Regular features: {feature_review.get('regular_features', 'N/A')}")
  print(f" Temporal features: {feature_review.get('temporal_features', 'N/A')}")
  print(f" Embedding features (PCA): {feature_review.get('embedding_features', 'N/A')}")
  print(f" Duplicates removed: {feature_review.get('duplicates_removed', 'N/A')}")
print(f"\n💾 Outputs:")
if "model_save_path" in locals():
  print(f" Model: {model_save_path}")
if "submission_path" in locals():
  print(f" Submission: {submission_path}")
print("\n" + "=" * 80)



🎉 PIPELINE EXECUTION COMPLETE!

⏱️ Total Execution Time: 20.27 seconds (0.34 minutes)

📊 Summary:
 Best Model: CatBoostClassifier
 Best F1 Score: 0.4737
 Optimal Threshold: 0.8434
 Models Trained: 18

📋 Feature Summary:
 Original features: 1977
 Final features: 174
 Regular features: 96
 Temporal features: 3
 Embedding features (PCA): 75
 Duplicates removed: 18

💾 Outputs:
 Model: /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/models/saved_models/model_catboostclassifier_leakage_test.pkl
 Submission: /Users/santoshdesai/Documents/Desai_Projects/Kaggle2/data/submission_files/submission_catboostclassifier_leakage_test.csv

