# Competition Overview

The "Detecting Reversal Points in US Equities" Kaggle competition (September 13, 2025 – December 31, 2025) centers on identifying local and global reversal points—swing highs and lows—within anonymized US stock market data. Participants must classify timestamps into three distinct categories: H (High reversal), L (Low reversal), or None. 

The dataset encompasses six anonymized tickers with approximately 2,760 samples and a massive feature space consisting of 68,504 mostly boolean signal descriptors.

The primary evaluation metric is Macro F1-score (equal importance across all classes despite severe imbalance: approx. 94% None, approx. 3% H, approx. 3% L). Tie-breakers include Macro Balanced Accuracy, multi-class Matthews Correlation Coefficient (MCC), and inference runtime. Total prize pool: $4,000 ($2,000 – 1st, $1,000 – 2nd, $500 – 3rd). The challenge tests the ability to detect rare but highly actionable market turning points in high-dimensional, imbalanced financial data.
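To see why Macro F1 is preferred over plain accuracy here, consider a hypothetical classifier that always predicts None on data with the class shares quoted above. The following is a minimal sketch with made-up labels (100 samples, not competition data), using scikit-learn's metric functions:

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, matthews_corrcoef)

# Hypothetical labels mirroring the stated imbalance: 94 None, 3 H, 3 L
y_true = ["None"] * 94 + ["H"] * 3 + ["L"] * 3
# A degenerate model that never predicts a reversal
y_pred = ["None"] * 100

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 scores classes that are never predicted as F1 = 0
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
balanced_acc = balanced_accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)

print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} "
      f"balanced_acc={balanced_acc:.3f} mcc={mcc:.3f}")
```

In this toy case accuracy is 0.94, while Macro F1 is about 0.32, Balanced Accuracy is 1/3, and MCC is 0, so the metric and its tie-breakers all penalize a model that ignores the H and L classes.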

[**Competition Link**](https://www.kaggle.com/competitions/detecting-reversal-points-in-us-equities/leaderboard)

# Introduction

Reversal points, i.e. local highs (H) and lows (L), are among the most valuable yet most difficult signals to detect in financial markets. These swing points indicate potential trend changes and are foundational for trend analysis, chart pattern recognition, and algorithmic trading strategies. The "Detecting Reversal Points in US Equities" competition provides anonymized stock time-series data from 6 instruments (2023–2025) and challenges participants to classify each timestamp into one of three classes: H (High reversal), L (Low reversal), or None (no reversal). With 68,504 features (mostly boolean Signal Descriptors) and extreme class imbalance (~94% None), the task is both computationally and methodologically demanding.

The provided code implements a complete end-to-end solution: advanced preprocessing, imbalance handling (ADASYN + class weights), feature selection, PCA reduction, rigorous model benchmarking, Optuna hyperparameter tuning, and a final weighted soft-voting ensemble of top tree-based models (ExtraTrees, RandomForest, LightGBM, XGBoost).

# Objective

The primary objective of this notebook is to build a high-performing, production-grade classification pipeline that maximizes Macro F1-score on the hidden test set while respecting inference time constraints. Specifically, the project aims to:

- Perform robust time-aware preprocessing and feature engineering on high-dimensional, imbalanced financial time-series data.
- Effectively address the severe class imbalance (~94% None) using ADASYN oversampling with balanced class weights.
- Identify the best base learners through extensive 5-fold stratified OOF validation across 10 classifiers using Macro F1, Balanced Accuracy, MCC, and inference time.
- Optimize hyperparameters of the top 4 models (ExtraTrees, RandomForest, XGBoost, LightGBM) via Optuna.
- Train a weighted soft-voting ensemble that combines the strengths of the best models to achieve top-tier generalization and leaderboard performance.

This solution serves as a reference for handling real-world imbalanced time-series classification problems in finance.

# Pipeline Overview

The analytical pipeline is structured into several discrete stages to ensure data integrity and model performance:

- Environment Setup: Installation of essential frameworks including imbalanced-learn, Optuna, XGBoost, LightGBM, and CatBoost.
- Data Loading and Feature Engineering: Reading CSV files, ensuring chronological sorting by ticker, and generating datetime features (month, year, weekend indicators) along with ticker-specific one-hot encodings.
- Temporal Validation Splitting: Executing an 80/20 chronological split per ticker to maintain the temporal order and prevent look-ahead bias.
- Preprocessing and Sampling: Applying ticker-specific imputation (forward/backward filling) and scaling, followed by ADASYN oversampling to balance the training set.
- Dimensionality Reduction: Filtering the top 26,000 boolean features using Mutual Information and applying PCA on numeric indicators to retain 80% variance.
- Model Selection and Hyperparameter Tuning: Evaluating 10 classifiers through 5-fold stratified out-of-fold (OOF) validation and optimizing the top four models (ExtraTrees, RandomForest, XGBoost, and LightGBM) using Optuna.
- Ensemble Model and Final Submission: Training a weighted soft-voting ensemble and generating the final submission.csv.

# Approach

A comprehensive methodology is employed to address the complexities of financial time-series classification:

- Temporal Integrity: Chronological sorting and time-aware splitting are strictly enforced to prevent data leakage, ensuring the model reflects real-world trading conditions.
- Imbalance Handling: The Adaptive Synthetic (ADASYN) algorithm generates minority samples in regions where the reversal points are hardest to learn, while balanced class weights penalize errors on the H and L classes during the training phase.
- Feature Optimization: Mutual Information identifies the most relevant boolean signals among the 68,000 available features, significantly reducing noise. PCA further compresses numeric data to lower computational costs.
- Model Selection: Classifiers are ranked by Macro F1 to ensure the model does not ignore the minority reversal classes. The competition's tie-breakers (Balanced Accuracy, Matthews Correlation Coefficient (MCC), and inference speed) are tracked as well; inference speed is also vital for high-frequency financial applications.
- Ensemble Strategy: A soft-voting ensemble combines predictions using specific weights based on their individual validation performance to maximize predictive stability.
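The weighted soft-voting idea can be sketched in a few lines of NumPy. This is an illustration only: the function name `weighted_soft_vote`, the probabilities, and the weights are hypothetical and are not the values used in the notebook.

```python
import numpy as np

def weighted_soft_vote(probas, weights):
    """Blend per-model class probabilities with normalized weights.

    probas: array-like of shape (n_models, n_samples, n_classes)
    weights: array-like of shape (n_models,)
    """
    probas = np.asarray(probas, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                 # weights sum to 1
    blended = np.tensordot(weights, probas, axes=1)   # (n_samples, n_classes)
    return blended, blended.argmax(axis=1)

# Two hypothetical models scoring one sample over three classes
model_a = [[0.6, 0.3, 0.1]]
model_b = [[0.2, 0.7, 0.1]]
blended, labels = weighted_soft_vote([model_a, model_b], weights=[1.0, 3.0])
print(blended, labels)
```

Because the second model carries three times the weight, the blended probabilities are 0.25 × A + 0.75 × B, and the ensemble follows the second model's preferred class.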

# Environment and Configuration

## Environment and Package Installation

Specialized libraries are installed to handle the specific challenges of the dataset: imbalanced-learn (imported as `imblearn`) for class imbalance, `optuna` for Bayesian optimization, and `xgboost` for gradient boosting. `lightgbm` and `catboost` are imported later but are not installed in this cell, so they are assumed to be preinstalled in the notebook environment.

In [1]:
%%capture
!pip install imblearn
!pip install --upgrade imbalanced-learn
!pip install optuna
!pip install xgboost

## Library Imports

The environment is initialized by importing core data manipulation tools, a wide array of classifiers (from traditional trees to advanced boosting), and evaluation metrics like Macro F1 and Matthews Correlation Coefficient (MCC).

In [2]:
# Core Libraries
import os
import time
import shutil
import warnings
import tempfile
import numpy as np
import pandas as pd
import joblib
from joblib import dump, load

# Scikit-Learn (Preprocessing, Selection, Metrics)
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.decomposition import PCA
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import (f1_score, accuracy_score, balanced_accuracy_score, 
                             matthews_corrcoef)

# Models & Ensembles
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier, 
                              GradientBoostingClassifier, VotingClassifier, 
                              StackingClassifier)
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Gradient Boosting
import xgboost as xgb
from xgboost import XGBClassifier
import lightgbm as lgb
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Imbalanced Learning & Optimization
import optuna
import sklearn
import imblearn
from imblearn.over_sampling import SMOTE, ADASYN

# Suppression & Versions
warnings.filterwarnings("ignore")
print(f"scikit-learn: {sklearn.__version__} | imbalanced-learn: {imblearn.__version__}")

scikit-learn: 1.6.1 | imbalanced-learn: 0.14.1


## Config & Paths Setup

Project organization is established by defining data paths and creating dedicated directories for storing preprocessing artifacts and trained model objects.

In [3]:
# PATHS & SETUP
TRAIN_CSV = "/kaggle/input/detecting-reversal-points-in-us-equities/new_comptetition_data/train.csv"
TEST_CSV = "/kaggle/input/detecting-reversal-points-in-us-equities/new_comptetition_data/test.csv"
PREPROC_DIR = "preproc_models"
os.makedirs(PREPROC_DIR, exist_ok=True)

# Data Loading & Initial Preparation

Raw datasets are ingested and sorted chronologically by ticker and timestamp. This step enriches the data with temporal features—such as day of the week and weekend flags—and converts categorical ticker identifiers into numeric one-hot encoded columns.

In [4]:
# Load and prepare data
train = pd.read_csv(TRAIN_CSV, low_memory=False)
test = pd.read_csv(TEST_CSV, low_memory=False)

# Drop unexpected columns
if 'train_id' in train.columns:
    train = train.drop(columns=['train_id'])

# Convert timestamp first so that sorting is chronological rather than lexicographic
train['t'] = pd.to_datetime(train['t'])
test['t'] = pd.to_datetime(test['t'])

# Sort by ticker and time
train = train.sort_values(['ticker_id', 't']).reset_index(drop=True)
test = test.sort_values(['ticker_id', 't']).reset_index(drop=True)

# Extract datetime features (day of week, month, year, is_weekend)
train['day_of_week'] = train['t'].dt.dayofweek
train['month'] = train['t'].dt.month
train['year'] = train['t'].dt.year
train['is_weekend'] = train['day_of_week'].isin([5, 6]).astype(int)
test['day_of_week'] = test['t'].dt.dayofweek
test['month'] = test['t'].dt.month
test['year'] = test['t'].dt.year
test['is_weekend'] = test['day_of_week'].isin([5, 6]).astype(int)

# One-hot encode ticker_id
train = pd.concat([train, pd.get_dummies(train['ticker_id'], prefix='ticker')], axis=1)
test = pd.concat([test, pd.get_dummies(test['ticker_id'], prefix='ticker')], axis=1)

# Align dummy columns
for col in train.columns:
    if col.startswith('ticker_') and col not in test.columns:
        test[col] = 0
for col in test.columns:
    if col.startswith('ticker_') and col not in train.columns:
        train[col] = 0

# Save test ids
test_ids = test['id'].copy()

# Metadata columns (the datetime features created above are listed here, so they are excluded from `features`)
meta_cols = ['id', 't', 'class_label', 'day_of_week', 'month', 'year', 'is_weekend']
meta_cols = [col for col in meta_cols if col in train.columns]
features = [col for col in train.columns if col not in meta_cols]

# Time Series Validation Split

To simulate real-world trading and prevent data leakage, an 80/20 chronological split is applied to each ticker. The most recent 20% of data for every instrument is reserved strictly for validation.

In [5]:
# Time-aware Validation Split
train['is_val'] = False
for ticker in train['ticker_id'].unique():
    ticker_mask = train['ticker_id'] == ticker
    idx = train[ticker_mask].index
    split_point = int(len(idx) * 0.8)
    if split_point < len(idx):
        train.loc[idx[split_point:], 'is_val'] = True

train_set = train[~train['is_val']].copy()
val_set = train[train['is_val']].copy()

print(f"Train samples: {len(train_set)}, Val samples: {len(val_set)}")

Train samples: 2143, Val samples: 540


# Data Preprocessing and Feature Engineering

## Target Encoding & Feature Preparation

The target labels (`H, L, None`) are converted into numeric format using `LabelEncoder`. Features are isolated from metadata, and test IDs are preserved to ensure the final submission aligns with competition requirements.

In [6]:
# Target Encoding
y_train = train_set['class_label']
y_val = val_set['class_label']

print("Class distribution in train:\n", y_train.value_counts())

le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_val_encoded = le.transform(y_val)

joblib.dump(le, f"{PREPROC_DIR}/label_encoder.pkl")

# Extract ticker for grouping
ticker_train = train_set['ticker_id']
ticker_val = val_set['ticker_id']
ticker_test = test['ticker_id']

# Features without ticker_id
features_no_ticker = [f for f in features if f != 'ticker_id']
X_train = train_set[features_no_ticker].copy()
X_val = val_set[features_no_ticker].copy()
X_test = test[features_no_ticker].copy()

Class distribution in train:
 class_label
H    65
L    61
Name: count, dtype: int64


## Missing Value Imputation and Data Scaling

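Missing values are addressed using a ticker-grouped strategy: numeric columns are filled with the per-ticker median (other columns with the mode), and any remaining gaps are forward/backward filled with a final fallback of zero. Data is then normalized through a ticker-specific `StandardScaler` to account for differing market volatilities, followed by a global `StandardScaler` pass.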

### StandardScaler

StandardScaler is a fundamental preprocessing technique in machine learning used to standardize features by removing the mean and scaling to unit variance. It is part of a broader category of data preparation known as Feature Scaling.

The primary goal is to ensure that features with different magnitudes do not disproportionately influence a model. For instance, in financial data, a "Volume" feature (ranging in millions) would naturally overwhelm a "Daily Return" feature (ranging from -0.05 to 0.05) if left unscaled.

**Mathematical Theory**

StandardScaler follows the formula for calculating a Z-score. For every data point $x$ in a feature, the transformed value $z$ is calculated as:
$$z = \frac{x - \mu}{\sigma}$$

Where:
- $\mu$: The mean of the feature samples.
- $\sigma$: The standard deviation of the feature samples.

**The Transformation Result**

After applying StandardScaler:
- Mean ($\mu$) = 0: The distribution is "centered" around zero.
- Standard Deviation ($\sigma$) = 1: The data is "spread" such that roughly 68% of values fall within the range of -1 to 1 (for normally distributed data).
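As a quick check of the formula, the Z-score can be computed by hand with NumPy. The five values below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical feature values (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

mu = x.mean()
sigma = x.std()  # population standard deviation (ddof=0), the convention StandardScaler uses
z = (x - mu) / sigma

print(z.round(3))
print(z.mean(), z.std())
```

Here $\mu = 3$ and $\sigma = \sqrt{2}$, so the transformed values are symmetric around zero with unit standard deviation.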

**Underlying Theories and Assumptions of StandardScaler**

1. The Normal Distribution (Gaussian) Assumption

StandardScaler is most effective when the underlying data follows a Gaussian (Normal) Distribution. While it can be applied to non-normal data, the resulting mean of 0 and standard deviation of 1 may not represent the "center" as effectively if the data is heavily skewed or contains extreme outliers.

2. Sensitivity to Outliers

Unlike RobustScaler (which uses the median and interquartile range), StandardScaler is highly sensitive to outliers. Because the mean and standard deviation are calculated using all data points, a single extreme value can significantly inflate $\sigma$, "squashing" the remaining data points into a very small range of Z-scores.

3. Distance-Based Algorithms

Standardization is theoretically required for algorithms that calculate distances between data points. Without scaling, the "distance" on the axis with the largest numbers would dominate the calculation.
- K-Nearest Neighbors (KNN): Distance between neighbors.
- Principal Component Analysis (PCA): Capturing maximum variance (features with higher scales would appear to have higher variance).
- Support Vector Machines (SVM): Finding the optimal margin between classes.

4. Gradient Descent Convergence

In models like Logistic Regression or Neural Networks, features with vastly different scales create an elongated, "elliptical" loss function. This forces the gradient descent algorithm to oscillate or take a very long path to the minimum. Standardization makes the loss function more spherical, allowing for faster and more stable convergence.
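To make the outlier-sensitivity point above concrete, the sketch below scales the same hypothetical column with `StandardScaler` and `RobustScaler` and compares how much room the four ordinary values are left with. The data is invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Four ordinary values plus one extreme outlier (hypothetical data)
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

z_standard = StandardScaler().fit_transform(x).ravel()
z_robust = RobustScaler().fit_transform(x).ravel()  # centers on median, scales by IQR

# Range occupied by the four inliers after each transformation
inlier_spread_standard = z_standard[:4].max() - z_standard[:4].min()
inlier_spread_robust = z_robust[:4].max() - z_robust[:4].min()

print(f"StandardScaler inlier spread: {inlier_spread_standard:.4f}")
print(f"RobustScaler inlier spread:   {inlier_spread_robust:.4f}")
```

With the outlier inflating $\sigma$, the inliers are squeezed into a span of less than 0.01 Z-score units, whereas the median/IQR scaling keeps them 1.5 units apart.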

**Theoretical Implementation: Fit vs. Transform**

In a competition or production pipeline, it is critical to adhere to the theory of Data Leakage prevention:
- `.fit()`: Calculated only on the training set. This computes the $\mu$ and $\sigma$ of the training data.
- `.transform()`: Applied to both the training and test sets using the $\mu$ and $\sigma$ derived from the training set.

Using the mean or standard deviation of the test set to scale the test data is a theoretical error, as it "leaks" information from the future/unseen data into the model's environment.
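A minimal sketch of the fit-on-train, transform-everywhere pattern follows. The arrays and the names `X_train_demo`, `X_test_demo`, and `scaler_demo` are hypothetical and separate from the notebook's variables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
X_train_demo = np.array([[0.0], [10.0]])   # training mean = 5, std = 5
X_test_demo = np.array([[20.0]])

# Statistics are learned from the training data only
scaler_demo = StandardScaler().fit(X_train_demo)

# The same training statistics are reused for every split
X_train_scaled = scaler_demo.transform(X_train_demo)
X_test_scaled = scaler_demo.transform(X_test_demo)

print(X_train_scaled.ravel(), X_test_scaled.ravel())
```

The test value maps to (20 − 5) / 5 = 3.0, a number outside the training range, which is exactly the information a scaler refit on the test set would hide.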

In [7]:
# Imputation & Scaling 
for X, ticker_series in [(X_train, ticker_train), (X_val, ticker_val), (X_test, ticker_test)]:
    grouped = X.groupby(ticker_series)
    X[:] = grouped.transform(lambda x: x.fillna(x.median() if x.dtype.kind in 'biufc' else (x.mode()[0] if not x.mode().empty else 0)))
    X[:] = grouped.ffill().bfill().fillna(0)

# Ticker-specific scaling + global scaling for comparison
scaler = StandardScaler()
numeric_cols = X_train.select_dtypes(include=[np.number]).columns

for ticker in train['ticker_id'].unique():
    mask_tr = ticker_train == ticker
    mask_val = ticker_val == ticker
    mask_te = ticker_test == ticker
    if mask_tr.any():
        X_train.loc[mask_tr, numeric_cols] = scaler.fit_transform(X_train.loc[mask_tr, numeric_cols])
    if mask_val.any():
        X_val.loc[mask_val, numeric_cols] = scaler.transform(X_val.loc[mask_val, numeric_cols])
    if mask_te.any():
        X_test.loc[mask_te, numeric_cols] = scaler.transform(X_test.loc[mask_te, numeric_cols])

# Global scaling as additional technique
global_scaler = StandardScaler()
X_train[numeric_cols] = global_scaler.fit_transform(X_train[numeric_cols])
X_val[numeric_cols] = global_scaler.transform(X_val[numeric_cols])
X_test[numeric_cols] = global_scaler.transform(X_test[numeric_cols])

## Features Selection

Given the high dimensionality, boolean features are isolated. `SelectKBest` with mutual information is used to retain only the top 26,000 most informative signals, significantly reducing computational load while preserving predictive power.

### Mutual Information (MI)

Mutual Information is a measure from information theory that quantifies the amount of information obtained about one random variable through the observation of another. In the context of this competition, it measures how much knowing the state of a "Signal Descriptor" reduces the uncertainty about whether a timestamp is a reversal point (H or L). Unlike correlation, which only measures linear relationships, Mutual Information captures any kind of statistical dependence, including non-linear, periodic, or irregular patterns.

**Mathematical Foundation**

The theory of MI is built upon the concept of Shannon Entropy ($H$), which represents the average amount of "uncertainty" in a variable.

**The Entropy Formula**

For a discrete variable $X$, entropy is defined as:

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$

**The Mutual Information Formula**

The MI between two variables $X$ and $Y$ is the difference between the initial uncertainty and the remaining uncertainty after the other variable is known:

$$I(X; Y) = H(X) - H(X|Y)$$

Equivalently, it measures how far the joint probability distribution is from the product of the marginals that independence would imply (the Kullback-Leibler divergence between the two):

$$I(X; Y) = \sum_{x,y} p(x,y) \log \left( \frac{p(x,y)}{p(x)p(y)} \right)$$
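The double-sum formula can be evaluated directly from empirical frequencies. The sketch below is an illustration with hypothetical arrays; the helper name `mutual_information` is an assumption, and scikit-learn's `mutual_info_score` (which also uses the natural logarithm) is included only as a cross-check.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mutual_information(x, y):
    """Empirical I(X; Y) in nats for two discrete arrays of equal length."""
    x = np.asarray(x)
    y = np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))   # joint frequency
            if p_xy > 0:                            # 0 * log(0) is treated as 0
                p_x = np.mean(x == xv)
                p_y = np.mean(y == yv)
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

# Hypothetical boolean signal and label arrays
perfect = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])      # label fully determined
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])  # no shared information

sig = np.array([0, 0, 0, 1, 1, 1, 1, 0])
lab = np.array([0, 0, 1, 1, 1, 0, 1, 0])
ours = mutual_information(sig, lab)
reference = mutual_info_score(lab, sig)

print(perfect, independent, ours, reference)
```

For the perfectly dependent pair the result equals the entropy of a fair binary variable, $\ln 2 \approx 0.693$ nats, and for the independent pair it is 0.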

**Concepts**

**Reduction of Uncertainty**

MI is often called Information Gain. If $I(X; Y) = 0$, the variables are strictly independent; knowing $X$ tells you nothing about $Y$. A high MI indicates that $X$ and $Y$ share significant information, meaning $X$ is a strong predictor for $Y$.

**Model Agnosticism**

Because MI relies on probability distributions rather than functional forms, it is "model-agnostic." It can identify a relationship even if that relationship is a complex U-shape or a step function that a linear correlation coefficient (like Pearson's $r$) would miss entirely.

**Symmetry**

One of the core properties of MI is that it is symmetric:

$$I(X; Y) = I(Y; X)$$

This means the information that a feature provides about the target is exactly equal to the information the target provides about the feature.

**The "Information Bottleneck" Principle**

In deep learning and complex pipelines, MI is used to understand how much information from the input is preserved through different layers. In this pipeline, the SelectKBest step uses MI to ensure that only features with high "shared information" with the reversal labels are passed to the classifiers, effectively acting as a filter to remove noise.

**Application in High-Dimensional Data**

In this competition, with 68,504 features, scoring every feature by MI is a "filter method" of feature selection. It is computationally efficient because it evaluates features individually (univariate) before training expensive models like XGBoost. This ensures the models focus on the subset of data that actually contains predictive "signals" rather than "noise."
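One detail worth knowing when applying this filter to boolean columns: for dense input, scikit-learn's `mutual_info_classif` defaults to `discrete_features='auto'`, which treats the features as continuous and uses a nearest-neighbor estimator. The sketch below shows one possible way to request the discrete estimator instead via `functools.partial`. It is a variation for illustration on synthetic data, not the configuration used in this notebook.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic boolean data: column 0 copies the label, the rest are random
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.integers(0, 2, size=(200, 6))
X_demo[:, 0] = y_demo

# Treat every column as discrete so MI is computed from contingency counts
score_func = partial(mutual_info_classif, discrete_features=True, random_state=0)

selector_demo = SelectKBest(score_func, k=1).fit(X_demo, y_demo)
selected = np.flatnonzero(selector_demo.get_support())
print("Selected column indices:", selected)
```

With `k=1` the selector keeps only the column that carries the label information; whether the discrete estimator changes the ranking on the real 68,504 features would have to be checked empirically.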

In [8]:
# Feature Selection
bool_features = [f for f in X_train.columns if X_train[f].nunique() <= 2]
non_bool_features = [f for f in X_train.columns if f not in bool_features]

print(f"Boolean features: {len(bool_features)}, Non-boolean: {len(non_bool_features)}")

k_bool = 26000  # Can be increased for better coverage
selector = SelectKBest(mutual_info_classif, k=min(k_bool, len(bool_features)))
selector.fit(X_train[bool_features], y_train_encoded)
selected_bool = np.array(bool_features)[selector.get_support()]

joblib.dump(selector, f"{PREPROC_DIR}/bool_selector.pkl")

final_features = non_bool_features + list(selected_bool)
print(f"Final selected features: {len(final_features)}")
joblib.dump(final_features, f"{PREPROC_DIR}/final_features.pkl")

X_train = X_train[final_features]
X_val = X_val[final_features]
X_test = X_test[final_features]

Boolean features: 68505, Non-boolean: 4
Final selected features: 26004


## Oversampling (ADASYN) and Dimensionality Reduction (PCA)

**Class Weights & ADASYN Oversampling**: To combat the 94% "None" class dominance, the ADASYN (Adaptive Synthetic) algorithm generates synthetic samples for the minority "H" and "L" classes. Balanced class weights are also computed to further penalize misclassifications of rare events during training.

**PCA Dimensionality Reduction**: For numeric features, Principal Component Analysis (PCA) is applied to the resampled data. By retaining 80% of the variance (condensed into two components), noise is filtered out while the core signal remains.

**Adaptive Synthetic (ADASYN) Sampling**

ADASYN is an advanced oversampling technique designed to address class imbalance. It is an evolution of the SMOTE (Synthetic Minority Over-sampling Technique) algorithm, specifically focused on the "hard-to-learn" examples.

**Concepts**

- Density Distribution: Unlike standard oversampling, which generates samples uniformly, ADASYN uses a weighted distribution for different minority class examples. It calculates the "density" of majority class neighbors for each minority point.
- Adaptive Learning: The primary theory behind ADASYN is that the model should generate more synthetic data in regions where the minority class is most overwhelmed by the majority class. If a minority point is surrounded by many majority points, it is considered "hard to learn," and ADASYN assigns it a higher weight for sample generation.

**Mathematical Mechanism**

The number of synthetic samples to be generated for a specific minority point $x_i$ is determined as follows:
- Finding Neighbors: For each minority sample, find its $K$ nearest neighbors.
- Calculating the Ratio ($r_i$): Determine what fraction of those $K$ neighbors belong to the majority class, $r_i = \Delta_i / K$, where $\Delta_i$ is the count of majority neighbors.
- Density Distribution ($\hat{r}_i$): Normalize these ratios so they sum to 1, $\hat{r}_i = r_i / \sum_j r_j$.
- Generation: The number of synthetic points generated around $x_i$ is $g_i = \hat{r}_i \times G$, where $G$ is the total number of samples needed to reach balance.
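The allocation steps above can be sketched with plain NumPy for a two-class toy problem. This is a simplified illustration, not imbalanced-learn's implementation: the function name `adasyn_allocation`, the `beta` balance factor, and the data are assumptions, and only the counts $g_i$ are computed (no synthetic points are generated).

```python
import numpy as np

def adasyn_allocation(X, y, minority_label, k=5, beta=1.0):
    """Return (r, r_hat, g): majority ratio, normalized ratio, samples per minority point."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    min_idx = np.flatnonzero(y == minority_label)
    n_min = len(min_idx)
    n_maj = len(y) - n_min
    G = (n_maj - n_min) * beta                     # total synthetic samples needed

    # Distances from each minority point to every point, excluding itself
    d = np.linalg.norm(X[min_idx, None, :] - X[None, :, :], axis=2)
    d[np.arange(n_min), min_idx] = np.inf
    nn = np.argsort(d, axis=1)[:, :k]              # indices of the K nearest neighbors

    r = (y[nn] != minority_label).mean(axis=1)     # r_i = Delta_i / K
    # Fall back to a uniform split if no minority point has majority neighbors
    r_hat = r / r.sum() if r.sum() > 0 else np.full(n_min, 1.0 / n_min)
    g = np.rint(r_hat * G).astype(int)             # g_i = r_hat_i * G
    return r, r_hat, g

# Hypothetical 1-D data: two "easy" minority points and one surrounded by majority points
X_toy = [[0.0], [0.1], [5.0], [5.1], [5.2], [9.0], [10.0], [11.0]]
y_toy = [1, 1, 1, 0, 0, 0, 0, 0]
r, r_hat, g = adasyn_allocation(X_toy, y_toy, minority_label=1, k=2)
print(r, r_hat, g)
```

In this toy example the minority point at 5.0 has only majority neighbors, so it receives the entire budget of two synthetic samples, while the two isolated minority points receive none.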

**Key Advantage**

The theoretical benefit of ADASYN is that it shifts the decision boundary of the classifier toward the difficult-to-learn samples. By focusing on high-density majority areas, it helps the model distinguish subtle patterns in reversal points (H and L) that might otherwise be treated as noise (None).

**Dimensionality Reduction (PCA)**

Principal Component Analysis (PCA) is an unsupervised linear transformation technique used for feature extraction and dimensionality reduction.

**Concepts**

- Variance Maximization: PCA operates on the theory that the most important "information" in a dataset is contained in the features (or combinations of features) that show the highest variance.
- Orthogonality: PCA transforms the original correlated features into a new set of uncorrelated variables called Principal Components (PCs). These components are orthogonal (at 90-degree angles) to each other, meaning each captures variance that the others do not (they are linearly uncorrelated, which is weaker than full statistical independence).
- Information Compression: By keeping only the first few PCs that explain the majority of the variance (e.g., 80%), one can reduce the size of the data while losing very little predictive power.

**Mathematical Mechanism**

The transformation involves several linear algebra steps:

- Standardization: The data must be centered (mean = 0) to ensure the first PC captures the direction of maximum variance.
- Covariance Matrix: A matrix is constructed to show how all features vary together.
- Eigen-Decomposition: The algorithm calculates Eigenvectors (the directions of the new axes) and Eigenvalues (the magnitude/importance of those directions).
- Feature Projection: The original data is projected onto these new axes (PCs).
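The four steps above can be sketched with NumPy's eigen-decomposition. The function name `pca_eig` and the synthetic data are assumptions for illustration; scikit-learn's `PCA` is used only to cross-check the explained-variance ratios.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_eig(X, variance_to_keep=0.8):
    """PCA via eigen-decomposition of the covariance matrix."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                        # 1. center the data
    cov = np.cov(Xc, rowvar=False)                 # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # 3. eigen-decomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]              #    sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = eigvals / eigvals.sum()                #    share of variance per component
    # Smallest number of components whose cumulative share reaches the threshold
    n_keep = int(np.searchsorted(np.cumsum(ratio), variance_to_keep) + 1)
    scores = Xc @ eigvecs[:, :n_keep]              # 4. project onto the kept axes
    return scores, ratio, n_keep

# Synthetic data: two strongly correlated columns plus a small-noise column
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_demo = np.hstack([
    base,
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    0.1 * rng.normal(size=(200, 1)),
])

scores, ratio, n_keep = pca_eig(X_demo, variance_to_keep=0.8)
sk_ratio = PCA().fit(X_demo).explained_variance_ratio_
print(n_keep, ratio.round(4), sk_ratio.round(4))
```

Because the first two columns move together, a single component already exceeds the 80% threshold in this example, and the hand-computed ratios agree with scikit-learn's.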

**PCA in this Pipeline**

In this pipeline, PCA is applied to numeric features to retain 80% variance, which is condensed into just two components.

- Noise Reduction: By discarding components with low eigenvalues, the model ignores small fluctuations in financial data that are likely random noise.
- Multicollinearity: PCA removes the correlation between numeric features, which helps stabilize models such as Logistic Regression, whose coefficients become unstable under multicollinearity, and Naive Bayes, which assumes feature independence.

In [9]:
# Class Weights & ADASYN Oversampling
print("Original training class distribution:")
print(pd.Series(y_train_encoded).value_counts().sort_index())

class_weights = compute_class_weight('balanced', classes=np.unique(y_train_encoded), y=y_train_encoded)
class_weight_dict_full = dict(enumerate(class_weights))
print("\nComputed class weights (3-class):", class_weight_dict_full)

print("\nApplying ADASYN oversampling...")
adasyn = ADASYN(sampling_strategy='auto', random_state=42, n_neighbors=5)              
X_train_res, y_train_res = adasyn.fit_resample(X_train, y_train_encoded)

X_train_res = pd.DataFrame(X_train_res, columns=X_train.columns)
y_train_res = pd.Series(y_train_res)

print(f"After ADASYN: {len(X_train_res)} samples")
print(pd.Series(y_train_res).value_counts().sort_index())

# Identify numeric columns from original X_train (before oversampling)
pca_numeric_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()

# Keep enough components to explain 80% of the variance
pca = PCA(n_components=0.8, svd_solver='full', random_state=42)

# Fit PCA on oversampled training data (X_train_res)
X_train_res_pca = pd.DataFrame(
    pca.fit_transform(X_train_res[pca_numeric_cols]),
    index=X_train_res.index,
    columns=[f'pca_{i}' for i in range(pca.n_components_)]
)

# Transform validation and test sets using the same PCA
X_val_pca = pd.DataFrame(
    pca.transform(X_val[pca_numeric_cols]),
    index=X_val.index,
    columns=[f'pca_{i}' for i in range(pca.n_components_)]
)

X_test_pca = pd.DataFrame(
    pca.transform(X_test[pca_numeric_cols]),
    index=X_test.index,
    columns=[f'pca_{i}' for i in range(pca.n_components_)]
)

print(f"PCA reduced {len(pca_numeric_cols)} numeric features to {pca.n_components_} components (80% variance)")

# Replace numeric columns with PCA components
X_train_res = pd.concat([X_train_res.drop(columns=pca_numeric_cols), X_train_res_pca], axis=1)
X_val = pd.concat([X_val.drop(columns=pca_numeric_cols), X_val_pca], axis=1)
X_test = pd.concat([X_test.drop(columns=pca_numeric_cols), X_test_pca], axis=1)

# Standardize all column names to strings
X_train_res.columns = X_train_res.columns.astype(str)
X_val.columns = X_val.columns.astype(str)
X_test.columns = X_test.columns.astype(str)

print("\nPCA applied after oversampling — Finish...")

class_weights = compute_class_weight('balanced', classes=np.unique(y_train_res), y=y_train_res)
class_weight_dict_full = dict(enumerate(class_weights))
print("\nComputed class weights (3-class):", class_weight_dict_full)

Original training class distribution:
0      65
1      61
2    2017
Name: count, dtype: int64

Computed class weights (3-class): {0: np.float64(10.98974358974359), 1: np.float64(11.710382513661202), 2: np.float64(0.35415633779540573)}

Applying ADASYN oversampling...
After ADASYN: 6049 samples
0    2030
1    2002
2    2017
Name: count, dtype: int64
PCA reduced 4 numeric features to 2 components (80% variance)

PCA applied after oversampling — Finish...

Computed class weights (3-class): {0: np.float64(0.9932676518883415), 1: np.float64(1.0071595071595072), 2: np.float64(0.9996694761196496)}


# Model Selection

## Weak Learner Model Selection

A comprehensive evaluation of 10 diverse classifiers is conducted with 5-fold stratified out-of-fold (OOF) validation. Models are ranked by Macro F1-score, with Balanced Accuracy, MCC, and inference latency as tie-breakers.
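One caveat about this evaluation: the folds are drawn from `X_train_res`, which has already been oversampled, so synthetic neighbors of a validation point can end up in the training fold and the OOF scores are likely optimistic. A possible variant is to resample inside each fold and score on untouched validation rows. The sketch below illustrates the idea on synthetic data; it uses plain random oversampling as a simple stand-in for ADASYN, and the helper names are hypothetical. With imbalanced-learn, placing the sampler in an `imblearn.pipeline.Pipeline` achieves the same separation.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def oversample_to_balance(X, y, rng):
    """Random oversampling with replacement (a simple stand-in for ADASYN)."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=target, replace=True)
        for c in classes
    ])
    return X[idx], y[idx]

def cv_macro_f1(X, y, n_splits=5, seed=42):
    """Macro F1 per fold, oversampling the training part of each fold only."""
    rng = np.random.default_rng(seed)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for tr_idx, va_idx in skf.split(X, y):
        X_tr, y_tr = oversample_to_balance(X[tr_idx], y[tr_idx], rng)
        model = ExtraTreesClassifier(n_estimators=50, random_state=seed)
        model.fit(X_tr, y_tr)
        # Validation rows are original, never resampled
        scores.append(f1_score(y[va_idx], model.predict(X[va_idx]), average="macro"))
    return np.array(scores)

# Synthetic imbalanced 3-class data (roughly 10% / 10% / 80%)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = np.where(X_demo[:, 0] > 1.3, 0, np.where(X_demo[:, 0] < -1.3, 1, 2))

fold_scores = cv_macro_f1(X_demo, y_demo)
_, balanced_counts = np.unique(
    oversample_to_balance(X_demo, y_demo, np.random.default_rng(1))[1],
    return_counts=True,
)
print(fold_scores.round(3), balanced_counts)
```

Scores obtained this way are usually lower than those from folds built on resampled data, but they are closer to what the held-out chronological validation split would show.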

In [10]:
# Extended model list (appropriate for high-dimensional, imbalanced, boolean-heavy data)
ml_models = [
    ('Logistic Regression', LogisticRegression(max_iter=1000, random_state=42)),
    ('KNN', KNeighborsClassifier()),
    ('Decision Tree', DecisionTreeClassifier(random_state=42)),
    ('Random Forest', RandomForestClassifier(random_state=42)),
    ('Extra Trees', ExtraTreesClassifier(random_state=42)),
    ('Gradient Boosting', GradientBoostingClassifier(random_state=42)),
    ('XGBoost', xgb.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=42)),
    ('LightGBM', lgb.LGBMClassifier(verbose=-1, random_state=42)),
    ('CatBoost', CatBoostClassifier(verbose=False, random_state=42)),
    #('SVM (RBF)', SVC(probability=True, random_state=42)),  # Disabled due to high memory/time cost on large feature sets
    ('Naive Bayes', GaussianNB())
]

# OOF setup
n_folds = 5        # Can increase for better result
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
model_results = {}
test_oof_all = {}  # Store test predictions from each model

print("\nStarting model selection with 5-fold Stratified OOF (Macro F1, Balanced Accuracy, MCC, Inference Time)...\n")

for name, model in ml_models:
    print(f"Evaluating {name}...")
    
    # OOF arrays
    oof_proba = np.zeros((len(X_train_res), 3))
    oof_true = np.zeros(len(X_train_res), dtype=int)
    fold_f1_scores = []
    fold_bal_acc_scores = []
    fold_mcc_scores = []
    fold_inference_times = []
    
    # Test predictions accumulation
    test_proba = np.zeros((len(X_test), 3))
    
    for fold, (tr_idx, val_idx) in enumerate(skf.split(X_train_res, y_train_res)):
        X_tr = X_train_res.iloc[tr_idx]
        y_tr = y_train_res.iloc[tr_idx]
        X_va = X_train_res.iloc[val_idx]
        y_va = y_train_res.iloc[val_idx]
        
        # Fit model
        model.fit(X_tr, y_tr)
        
        # Validation prediction with inference time
        start_time = time.time()
        val_proba = model.predict_proba(X_va)
        inference_time = time.time() - start_time
        val_pred = np.argmax(val_proba, axis=1)
        
        # Metrics
        fold_f1 = f1_score(y_va, val_pred, average='macro')
        fold_bal_acc = balanced_accuracy_score(y_va, val_pred)
        fold_mcc = matthews_corrcoef(y_va, val_pred)
        
        fold_f1_scores.append(fold_f1)
        fold_bal_acc_scores.append(fold_bal_acc)
        fold_mcc_scores.append(fold_mcc)
        fold_inference_times.append(inference_time)
        
        # Store OOF
        oof_proba[val_idx] = val_proba
        oof_true[val_idx] = y_va
        
        # Accumulate test predictions, averaged over folds
        test_proba += model.predict_proba(X_test) / n_folds
    
    # Overall OOF performance
    oof_pred = np.argmax(oof_proba, axis=1)
    overall_f1 = f1_score(oof_true, oof_pred, average='macro')
    overall_bal_acc = balanced_accuracy_score(oof_true, oof_pred)
    overall_mcc = matthews_corrcoef(oof_true, oof_pred)
    
    model_results[name] = {
        'mean_f1': np.mean(fold_f1_scores),
        'std_f1': np.std(fold_f1_scores),
        'overall_f1': overall_f1,
        'mean_bal_acc': np.mean(fold_bal_acc_scores),
        'std_bal_acc': np.std(fold_bal_acc_scores),
        'overall_bal_acc': overall_bal_acc,
        'mean_mcc': np.mean(fold_mcc_scores),
        'std_mcc': np.std(fold_mcc_scores),
        'overall_mcc': overall_mcc,
        'mean_inference_time': np.mean(fold_inference_times),
        'std_inference_time': np.std(fold_inference_times),
        'oof_proba': oof_proba
    }
    test_oof_all[name] = test_proba
    
    print(f"{name}: Mean Macro F1 = {np.mean(fold_f1_scores):.5f} ± {np.std(fold_f1_scores):.4f} | OOF F1 = {overall_f1:.5f}")
    print(f"       Mean Balanced Acc = {np.mean(fold_bal_acc_scores):.5f} ± {np.std(fold_bal_acc_scores):.4f} | OOF Bal Acc = {overall_bal_acc:.5f}")
    print(f"       Mean MCC = {np.mean(fold_mcc_scores):.5f} ± {np.std(fold_mcc_scores):.4f} | OOF MCC = {overall_mcc:.5f}")
    print(f"       Mean Inference Time = {np.mean(fold_inference_times):.5f}s ± {np.std(fold_inference_times):.4f}s")


# Model Ranking & Selection
print("\n" + "="*90)
print("MODEL SELECTION RESULTS (Primary: Macro F1, Tie-breakers: Bal Acc, MCC, Inference Time)")
print("="*90)
ranking = sorted(model_results.items(), key=lambda x: (
    x[1]['overall_f1'],          # Primary metric
    x[1]['overall_bal_acc'],     # Tie-breaker 1
    x[1]['overall_mcc'],         # Tie-breaker 2
    -x[1]['mean_inference_time'] # Tie-breaker 3 (lower is better)
), reverse=True)

for i, (name, res) in enumerate(ranking):
    print(f"{i+1:2d}. {name:<25} | OOF Macro F1: {res['overall_f1']:.5f} | CV Mean F1: {res['mean_f1']:.5f} ± {res['std_f1']:.4f}")
    print(f"    OOF Bal Acc: {res['overall_bal_acc']:.5f} | CV Mean Bal Acc: {res['mean_bal_acc']:.5f} ± {res['std_bal_acc']:.4f}")
    print(f"    OOF MCC: {res['overall_mcc']:.5f} | CV Mean MCC: {res['mean_mcc']:.5f} ± {res['std_mcc']:.4f}")
    print(f"    Mean Inference Time: {res['mean_inference_time']:.5f}s ± {res['std_inference_time']:.4f}s")
    print("-"*90)

print("="*90)

# Select top 5 models for final ensemble
top_k = 5
selected_models = ranking[:top_k]
print(f"\nSelected top {top_k} models for final ensemble:")
for i, (name, _) in enumerate(selected_models):
    print(f"  {i+1}. {name}")


Starting model selection with 5-fold Stratified OOF (Macro F1, Balanced Accuracy, MCC, Inference Time)...

Evaluating Logistic Regression...
Logistic Regression: Mean Macro F1 = 0.96835 ± 0.0048 | OOF F1 = 0.96835
       Mean Balanced Acc = 0.96856 ± 0.0048 | OOF Bal Acc = 0.96856
       Mean MCC = 0.95310 ± 0.0071 | OOF MCC = 0.95308
       Mean Inference Time = 0.35559s ± 0.0103s
Evaluating KNN...
KNN: Mean Macro F1 = 0.94666 ± 0.0063 | OOF F1 = 0.94668
       Mean Balanced Acc = 0.94794 ± 0.0059 | OOF Bal Acc = 0.94794
       Mean MCC = 0.92456 ± 0.0084 | OOF MCC = 0.92451
       Mean Inference Time = 7.22975s ± 0.7185s
Evaluating Decision Tree...
Decision Tree: Mean Macro F1 = 0.94397 ± 0.0067 | OOF F1 = 0.94397
       Mean Balanced Acc = 0.94477 ± 0.0066 | OOF Bal Acc = 0.94476
       Mean MCC = 0.91825 ± 0.0097 | OOF MCC = 0.91823
       Mean Inference Time = 0.11559s ± 0.0019s
Evaluating Random Forest...
Random Forest: Mean Macro F1 = 0.98437 ± 0.0017 | OOF F1 = 0.98437
       

Based on these results, Extra Trees, XGBoost, LightGBM, and Random Forest were selected as the weak learners. CatBoost was not selected because its training and hyperparameter tuning take too long.
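
The ranking rule behind this selection sorts on a tuple, so each tie-breaker only matters when everything before it is equal. A minimal sketch of that rule in isolation; the model names and scores below are hypothetical and are not results from this notebook:

```python
# Hypothetical scores for illustration only; not results from this notebook.
results = {
    "model_a": {"f1": 0.90, "bal_acc": 0.91, "mcc": 0.85, "time_s": 2.0},
    "model_b": {"f1": 0.90, "bal_acc": 0.91, "mcc": 0.85, "time_s": 0.5},
    "model_c": {"f1": 0.92, "bal_acc": 0.89, "mcc": 0.84, "time_s": 9.0},
}


def rank_models(results):
    """Sort by Macro F1, then balanced accuracy, then MCC, then speed."""
    return sorted(
        results,
        key=lambda name: (
            results[name]["f1"],
            results[name]["bal_acc"],
            results[name]["mcc"],
            -results[name]["time_s"],  # negated so that faster ranks higher
        ),
        reverse=True,
    )


ranking = rank_models(results)
print(ranking)
```

Here `model_c` ranks first on Macro F1 despite being the slowest, while `model_a` and `model_b` tie on all three scores and are separated only by inference time.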

**Random Forest (RF)**

Random Forest is an ensemble learning method based on Bagging (Bootstrap Aggregating). It builds multiple decision trees independently using random subsets of the training data (drawn with replacement). To increase diversity, it also selects a random subset of features at each split. The final prediction is the majority vote (classification) or average (regression) of all trees.

**Important Hyperparameters:**

- `n_estimators`: Number of trees in the forest. More trees generally improve stability but increase training time.
- `max_depth`: The maximum depth of each tree. Controlling this helps prevent trees from becoming too complex and overfitting.
- `max_features`: The number of random features to consider at each split (e.g., sqrt or log2).
- `min_samples_leaf`: The minimum number of samples required to be at a leaf node; higher values smooth the model.
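
The two mechanisms described above, bootstrap sampling and majority voting, can be sketched without any library. This is a simplified illustration with hypothetical helper names, not scikit-learn's implementation:

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bag per tree)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def majority_vote(tree_predictions):
    """Return the label predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(42)
n = 10_000
bag = bootstrap_indices(n, rng)
# A bag of size n contains roughly 1 - 1/e (about 63%) of the distinct rows.
unique_fraction = len(set(bag)) / n
print(f"Share of distinct rows in one bag: {unique_fraction:.3f}")

# Hypothetical per-tree predictions for a single timestamp.
print(majority_vote(["None", "H", "H", "L", "H"]))
```

Because each tree sees a different bag, the trees disagree on noisy points, and the vote averages that disagreement away.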

**Extra Trees (Extremely Randomized Trees)**

Similar to Random Forest, but introduces a higher level of randomness. While RF searches for the optimum split threshold for each feature, Extra Trees chooses a random threshold. Additionally, it typically uses the entire dataset rather than bootstrap samples. This significantly reduces training time and often reduces variance further than RF.

**Important Hyperparameters:**

- `n_estimators`: Total number of randomized trees.
- `max_features`: Number of features to randomly sample for each split.
- `bootstrap`: Boolean (default False). Whether to use the whole dataset or samples with replacement.
- `min_samples_split`: The minimum number of samples required to split an internal node.
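
To make the difference in split selection concrete, the sketch below contrasts an exhaustive threshold search with a single random draw on one feature. It is a toy illustration with hypothetical data and helper names. In practice Extra Trees draws one random threshold per candidate feature and keeps the best of those draws.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def split_impurity(xs, ys, threshold):
    """Weighted Gini impurity after splitting at x <= threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    n = len(ys)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(xs, ys):
    """Random Forest style: scan the midpoints between sorted values."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda t: split_impurity(xs, ys, t))


def random_threshold(xs, rng):
    """Extra Trees style: draw one threshold uniformly in the feature range."""
    return rng.uniform(min(xs), max(xs))


# Hypothetical single feature with a clean class boundary between 0.4 and 0.6.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = ["None", "None", "None", "None", "H", "H", "H", "H"]
rng = random.Random(0)
t_best = best_threshold(xs, ys)
t_rand = random_threshold(xs, rng)
print("searched:", t_best, split_impurity(xs, ys, t_best))
print("random:  ", t_rand, split_impurity(xs, ys, t_rand))
```

The random draw skips the scan over candidate thresholds, which is where the training-time saving comes from.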

**XGBoost (eXtreme Gradient Boosting)**

A powerful implementation of Gradient Boosting that builds trees sequentially. Each new tree attempts to correct the errors (residuals) of the previous ones using gradient descent. It uses Level-Wise Growth (splitting level by level) and incorporates L1/L2 regularization directly into the objective function to handle overfitting.

**Important Hyperparameters:**

- `learning_rate (eta)`: Scales the contribution of each tree. Lower values (e.g., 0.01) require more trees but lead to better generalization.
- `max_depth`: Limits the complexity of individual trees.
- `gamma`: The minimum loss reduction required to make a further split; higher values make the model more conservative.
- `lambda (L2) and alpha (L1)`: Regularization terms on weights to penalize complexity.
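
The roles of `lambda` and `gamma` are easiest to see in the standard second-order split gain used by XGBoost-style boosting. The sketch below covers only the L2 term (the L1 term `alpha` is left out for brevity), and the gradient and Hessian sums are hypothetical values:

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0):
    """Optimal leaf value under L2 regularization: -G / (H + lambda)."""
    return -grad_sum / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from a split, minus the per-split penalty gamma."""

    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    children = score(g_left, h_left) + score(g_right, h_right)
    return 0.5 * (children - parent) - gamma


# Hypothetical gradient/Hessian sums for the two children of a candidate split.
gain = split_gain(g_left=-4.0, h_left=3.0, g_right=5.0, h_right=4.0)
print(f"gain with gamma=0: {gain:.4f}")
print(f"gain with gamma=5: {split_gain(-4.0, 3.0, 5.0, 4.0, gamma=5.0):.4f}")
print(f"left leaf weight:  {leaf_weight(-4.0, 3.0):.4f}")
```

A split is only worth making when its gain is positive, so raising `gamma` rejects weaker splits, and raising `lambda` shrinks both the gain and the leaf weights.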

**LightGBM (Light Gradient Boosting Machine)**

A high-speed boosting framework developed by Microsoft. Its primary differentiator is Leaf-Wise Growth, where it splits the node that results in the greatest loss reduction, regardless of depth. It also uses Histogram-based binning and GOSS (Gradient-based One-Side Sampling) to handle large datasets with significantly less memory and time.

**Important Hyperparameters:**

- `num_leaves`: The most important parameter for LightGBM; it controls tree complexity (should be less than $2^{\text{max\_depth}}$).
- `learning_rate`: Similar to XGBoost, dictates the step size of the optimization.
- `min_data_in_leaf`: Prevents overfitting by ensuring each leaf has enough supporting samples.
- `feature_fraction`: Randomly selects a subset of features on each iteration (similar to colsample_bytree).
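
Leaf-wise growth can be sketched with a priority queue: at every step the leaf with the largest available gain is split, wherever it sits in the tree. The gain table below is hypothetical, and the code is a toy model of the growth order, not LightGBM's implementation.

```python
import heapq

# Hypothetical split gains: node -> (gain if split, left child, right child).
# Nodes absent from the table cannot be split further.
SPLITS = {
    "root": (10.0, "A", "B"),
    "A": (8.0, "A1", "A2"),
    "B": (1.0, "B1", "B2"),
    "A1": (5.0, "A1a", "A1b"),
    "A2": (0.5, "A2a", "A2b"),
}


def grow_leaf_wise(num_leaves):
    """Split the leaf with the largest gain until num_leaves is reached."""
    leaves = {"root"}
    heap = [(-SPLITS["root"][0], "root")]  # max-heap via negated gains
    order = []
    while heap and len(leaves) < num_leaves:
        _, node = heapq.heappop(heap)
        _, left, right = SPLITS[node]
        leaves.remove(node)
        leaves.update((left, right))
        order.append(node)
        for child in (left, right):
            if child in SPLITS:
                heapq.heappush(heap, (-SPLITS[child][0], child))
    return order, leaves


order, leaves = grow_leaf_wise(num_leaves=4)
print("split order:", order)
print("final leaves:", sorted(leaves))
```

With `num_leaves=4` the tree reaches depth 3 along the `A` branch and never splits `B`, whereas a level-wise tree with four leaves would stop at depth 2. This is why `num_leaves` needs to be constrained together with `max_depth`.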

## Ensemble Model Selection

A Voting Ensemble is a meta-modeling approach that combines the predictions from multiple independent machine learning models to improve overall performance. The core theory is based on the Condorcet Jury Theorem, which suggests that a group of independent "voters" (models) is more likely to arrive at the correct decision than any single individual voter, provided each voter performs better than random chance.
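
The jury-theorem argument can be checked numerically. The sketch below assumes a binary decision and fully independent voters with equal accuracy, which is stricter than what holds for real models on a three-class task, so it shows the direction of the effect rather than an expected gain:

```python
from math import comb


def majority_correct_probability(n_voters, p_correct):
    """P(majority of n independent voters is right), each right w.p. p_correct."""
    if n_voters % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


for n in (1, 3, 5, 11):
    print(n, round(majority_correct_probability(n, 0.6), 4))
```

With voters that are right 60% of the time, the majority becomes more reliable as voters are added. With voters below 50%, the same formula shows the majority getting worse.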

**Concepts**

**Wisdom of the Crowd**

The primary motivation for using a Voting Ensemble is to reduce variance and, to a lesser extent, bias. Individual models may make errors on specific data points due to their unique biases or sensitivities to noise. By aggregating their outputs, the ensemble "smooths out" individual errors. The less correlated the models' errors are, the more the ensemble can gain over its components.

**Diversity is Key**

For a Voting Ensemble to be effective, the base models must be diverse. If you combine five identical models, the ensemble will simply repeat the same errors. Diversity is typically achieved by:
- Using different algorithms (e.g., combining a Tree-based model with a Linear model).
- Training on different subsets of data.
- Using different feature engineering approaches.
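
Why diversity matters can be quantified with the variance of an average of correlated errors. The sketch assumes every model has the same error variance and the same pairwise correlation, which is an idealization:

```python
def ensemble_error_variance(sigma2, rho, n_models):
    """Variance of the mean of n equally correlated errors.

    Each model error has variance sigma2 and pairwise correlation rho:
    Var(mean) = rho * sigma2 + (1 - rho) * sigma2 / n_models
    """
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_models


for rho in (0.0, 0.5, 0.9, 1.0):
    print(rho, ensemble_error_variance(sigma2=1.0, rho=rho, n_models=4))
```

With uncorrelated errors, four models cut the variance to a quarter. With perfectly correlated errors (identical models) averaging changes nothing, which is the "five identical models" case described above.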

**Types of Voting Mechanisms**

**Hard Voting (Majority Class Voting)**

In Hard Voting, the ensemble predicts the class that received the most "votes" from the individual models.
- Concept: It follows a simple democratic process.
- Calculation: If Model A predicts "H," Model B predicts "L," and Model C predicts "H," the ensemble output is "H."
- Use Case: Best used when models do not output well-calibrated probabilities.

**Soft Voting (Average Probability Voting)**

In Soft Voting, the ensemble calculates the average predicted probability (class membership) across all models for each class.
- Concept: It gives more weight to models that are highly "confident" in their prediction.
- Calculation: If Model A predicts an 80% chance of "H" and Model B predicts a 40% chance of "H," the average probability is 60%.
- Use Case: Generally superior to Hard Voting because it utilizes more information (confidence levels) rather than just the final label.
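
Hard and soft voting can disagree on the same inputs. The sketch below uses hypothetical probabilities for one timestamp, where one confident model outweighs two lukewarm ones under soft voting:

```python
from collections import Counter

CLASSES = ["H", "L", "None"]

# Hypothetical class probabilities from three models for one timestamp.
probas = [
    [0.90, 0.05, 0.05],  # model A: very confident in H
    [0.40, 0.00, 0.60],  # model B: leans None
    [0.45, 0.00, 0.55],  # model C: leans None
]


def hard_vote(probas):
    """Each model votes for its argmax class; the most common label wins."""
    votes = [CLASSES[row.index(max(row))] for row in probas]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(probas):
    """Average the probabilities per class, then take the argmax."""
    n = len(probas)
    mean = [sum(row[j] for row in probas) / n for j in range(len(CLASSES))]
    return CLASSES[mean.index(max(mean))], mean


print("hard:", hard_vote(probas))
print("soft:", soft_vote(probas))
```

Hard voting returns "None" by two votes to one, while soft voting returns "H" because the averaged probability of H is the largest.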

**Weighted Soft Voting**

Weighted voting refines the process further by assigning a weight ($w$) to each model based on its individual performance (e.g., its Macro F1-score on a validation set).
- Formula: $$P(\text{class}) = \frac{\sum_i w_i \, p_i(\text{class})}{\sum_i w_i}$$ where $p_i(\text{class})$ is the probability that model $i$ assigns to the class.

In the provided equities pipeline, ExtraTrees is given a higher weight (32%) than XGBoost (22%) because it demonstrated higher reliability during the benchmarking phase.
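
The weighted formula can be applied by hand to a single timestamp. The weights below are the ones used in the ensemble cell (Extra Trees, Random Forest, XGBoost, LightGBM), while the per-model probabilities are hypothetical:

```python
def weighted_soft_vote(probas, weights):
    """Weighted average of per-model class probabilities."""
    total = sum(weights)
    n_classes = len(probas[0])
    return [
        sum(w * row[j] for w, row in zip(weights, probas)) / total
        for j in range(n_classes)
    ]


# Weights from the ensemble cell: Extra Trees, Random Forest, XGBoost, LightGBM.
weights = [0.32, 0.25, 0.22, 0.21]

# Hypothetical probabilities for classes (H, L, None) at one timestamp.
probas = [
    [0.70, 0.10, 0.20],
    [0.40, 0.10, 0.50],
    [0.30, 0.10, 0.60],
    [0.35, 0.05, 0.60],
]

blended = weighted_soft_vote(probas, weights)
unweighted = weighted_soft_vote(probas, [1, 1, 1, 1])
print("weighted:  ", [round(p, 4) for p in blended])
print("unweighted:", [round(p, 4) for p in unweighted])
```

In this constructed case the extra weight on the first model tips the decision from "None" to "H", which shows how the weights change predictions only on borderline samples.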

**Why Ensembling Works?**

From a statistical perspective, ensembles work because they navigate the Bias-Variance Tradeoff more effectively:
- Averaging Out Errors: In high-dimensional financial data, one model might overfit to a specific noise pattern in Ticker A. Another model might overfit to a pattern in Ticker B. The ensemble averages these out, leaving only the "consensus" signal.
- Increased Stability: Financial markets are non-stationary (patterns change over time). An ensemble of diverse models is often more robust to these shifts than a single "tuned" model.

### Ensemble Model Selection

Two strategies are compared: simple soft voting and a weighted soft-voting approach. The weighted ensemble of the top four models is selected for its superior Macro F1 performance.

In [10]:
# Define ensemble models using the top 4 weak learners (removed CatBoost)
ensemble_models = [
    ('Voting Top 4 (Soft)', VotingClassifier(
        estimators=[
            ('Extra Trees', ExtraTreesClassifier()),
            ('Random Forest', RandomForestClassifier()),
            ('XGBoost', xgb.XGBClassifier()),
            ('LightGBM', lgb.LGBMClassifier(verbose=-1, random_state=42))
        ],
        voting='soft',
        n_jobs=-1
    )),
    ('Weighted Voting Top 4', VotingClassifier(
        estimators=[
            ('Extra Trees', ExtraTreesClassifier()),
            ('Random Forest', RandomForestClassifier()),
            ('XGBoost', xgb.XGBClassifier()),
            ('LightGBM', lgb.LGBMClassifier(verbose=-1, random_state=42))
        ],
        voting='soft',
        weights=[0.32, 0.25, 0.22, 0.21],  # Adjusted for highest F1: ET > RF > XGB > LGBM
        n_jobs=-1
    )),
    ('Blending Manual Top 4', 'manual_blend') # Placeholder for manual weighted average (implemented later)
]

# OOF setup
n_folds = 3
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
ensemble_results = {}
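# test_oof_all from the model-selection cell is kept as-is: the manual blend below
# reads the individual models' test predictions from it, so it must not be reset here.
print(f"\nStarting ensemble model selection with {n_folds}-fold Stratified OOF (Macro F1, Balanced Accuracy, MCC, Inference Time)...\n")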
for name, model in ensemble_models:
    if name == 'Blending Manual Top 4':
        # Skip in OOF loop — manual blending done separately after all models trained
        continue
   
    print(f"Evaluating {name}...")
   
    oof_proba = np.zeros((len(X_train_res), 3))
    oof_true = np.zeros(len(X_train_res), dtype=int)
    fold_f1_scores = []
    fold_bal_acc_scores = []
    fold_mcc_scores = []
    fold_inference_times = []
   
    test_proba = np.zeros((len(X_test), 3))
   
    for fold, (tr_idx, val_idx) in enumerate(skf.split(X_train_res, y_train_res)):
        X_tr = X_train_res.iloc[tr_idx]
        y_tr = y_train_res.iloc[tr_idx]
        X_va = X_train_res.iloc[val_idx]
        y_va = y_train_res.iloc[val_idx]
       
        model.fit(X_tr, y_tr)
       
        start_time = time.time()
        val_proba = model.predict_proba(X_va)
        inference_time = time.time() - start_time
        val_pred = np.argmax(val_proba, axis=1)
       
        fold_f1 = f1_score(y_va, val_pred, average='macro')
        fold_bal_acc = balanced_accuracy_score(y_va, val_pred)
        fold_mcc = matthews_corrcoef(y_va, val_pred)
       
        fold_f1_scores.append(fold_f1)
        fold_bal_acc_scores.append(fold_bal_acc)
        fold_mcc_scores.append(fold_mcc)
        fold_inference_times.append(inference_time)
       
        oof_proba[val_idx] = val_proba
        oof_true[val_idx] = y_va
       
        test_proba += model.predict_proba(X_test) / n_folds
   
    oof_pred = np.argmax(oof_proba, axis=1)
    overall_f1 = f1_score(oof_true, oof_pred, average='macro')
    overall_bal_acc = balanced_accuracy_score(oof_true, oof_pred)
    overall_mcc = matthews_corrcoef(oof_true, oof_pred)
   
    ensemble_results[name] = {
        'mean_f1': np.mean(fold_f1_scores),
        'std_f1': np.std(fold_f1_scores),
        'overall_f1': overall_f1,
        'mean_bal_acc': np.mean(fold_bal_acc_scores),
        'std_bal_acc': np.std(fold_bal_acc_scores),
        'overall_bal_acc': overall_bal_acc,
        'mean_mcc': np.mean(fold_mcc_scores),
        'std_mcc': np.std(fold_mcc_scores),
        'overall_mcc': overall_mcc,
        'mean_inference_time': np.mean(fold_inference_times),
        'std_inference_time': np.std(fold_inference_times),
        'oof_proba': oof_proba,
        'test_proba': test_proba
    }
   
    print(f"{name}: Mean Macro F1 = {np.mean(fold_f1_scores):.5f} ± {np.std(fold_f1_scores):.4f} | OOF F1 = {overall_f1:.5f}")
    print(f" Mean Balanced Acc = {np.mean(fold_bal_acc_scores):.5f} ± {np.std(fold_bal_acc_scores):.4f} | OOF Bal Acc = {overall_bal_acc:.5f}")
    print(f" Mean MCC = {np.mean(fold_mcc_scores):.5f} ± {np.std(fold_mcc_scores):.4f} | OOF MCC = {overall_mcc:.5f}")
    print(f" Mean Inference Time = {np.mean(fold_inference_times):.5f}s ± {np.std(fold_inference_times):.4f}s")

# Manual blending (top 4): weighted average of the individual models' test
# probabilities, looked up by name in test_oof_all.
# Weights follow the benchmark ordering: ET > RF > XGB > LGBM.
blend_names = ['Extra Trees', 'Random Forest', 'XGBoost', 'LightGBM']
blend_weights = [0.32, 0.25, 0.22, 0.21]

# A name missing from test_oof_all falls back to zeros, so check the keys
# if the blended probabilities look wrong.
manual_blend_test_proba = sum(
    w * test_oof_all.get(name, np.zeros((len(X_test), 3)))
    for name, w in zip(blend_names, blend_weights)
)
ensemble_results['Blending Manual Top 4'] = {
    'test_proba': manual_blend_test_proba
    # No OOF for manual blend unless computed separately
}

# ======================
# 8. Ensemble Model Ranking & Selection
# ======================
print("\n" + "="*90)
print("ENSEMBLE MODEL SELECTION RESULTS (Primary: Macro F1, Tie-breakers: Bal Acc, MCC, Inference Time)")
print("="*90)
ranking = sorted([ (name, res) for name, res in ensemble_results.items() if 'overall_f1' in res ],
                 key=lambda x: (
                     x[1]['overall_f1'],
                     x[1]['overall_bal_acc'],
                     x[1]['overall_mcc'],
                     -x[1]['mean_inference_time']
                 ), reverse=True)
for i, (name, res) in enumerate(ranking):
    print(f"{i+1:2d}. {name:<35} | OOF Macro F1: {res['overall_f1']:.5f} | CV Mean F1: {res['mean_f1']:.5f} ± {res['std_f1']:.4f}")
    print(f" OOF Bal Acc: {res['overall_bal_acc']:.5f} | CV Mean Bal Acc: {res['mean_bal_acc']:.5f} ± {res['std_bal_acc']:.4f}")
    print(f" OOF MCC: {res['overall_mcc']:.5f} | CV Mean MCC: {res['mean_mcc']:.5f} ± {res['std_mcc']:.4f}")
    print(f" Mean Inference Time: {res['mean_inference_time']:.5f}s ± {res['std_inference_time']:.4f}s")
    print("-"*90)
print("="*90)

# Best ensemble
best_ensemble_name = ranking[0][0]
best_test_proba = ensemble_results[best_ensemble_name]['test_proba']
test_pred_labels = np.argmax(best_test_proba, axis=1)
test_pred_str = le.inverse_transform(test_pred_labels)
submission = pd.DataFrame({'id': test_ids, 'class_label': test_pred_str})
submission['class_label'] = submission['class_label'].fillna('None')
submission.to_csv("submission.csv", index=False)
print(f"\nBest ensemble: {best_ensemble_name}")
print("Submission saved successfully!")


Starting ensemble model selection with 5-fold Stratified OOF (Macro F1, Balanced Accuracy, MCC, Inference Time)...

Evaluating Voting Top 4 (Soft)...
Voting Top 4 (Soft): Mean Macro F1 = 0.98689 ± 0.0020 | OOF F1 = 0.98689
 Mean Balanced Acc = 0.98694 ± 0.0020 | OOF Bal Acc = 0.98694
 Mean MCC = 0.98050 ± 0.0030 | OOF MCC = 0.98049
 Mean Inference Time = 5.48052s ± 0.3155s
Evaluating Weighted Voting Top 4...
Weighted Voting Top 4: Mean Macro F1 = 0.98722 ± 0.0022 | OOF F1 = 0.98723
 Mean Balanced Acc = 0.98727 ± 0.0022 | OOF Bal Acc = 0.98727
 Mean MCC = 0.98099 ± 0.0033 | OOF MCC = 0.98098
 Mean Inference Time = 5.32545s ± 0.0765s

ENSEMBLE MODEL SELECTION RESULTS (Primary: Macro F1, Tie-breakers: Bal Acc, MCC, Inference Time)
 1. Weighted Voting Top 4               | OOF Macro F1: 0.98723 | CV Mean F1: 0.98722 ± 0.0022
 OOF Bal Acc: 0.98727 | CV Mean Bal Acc: 0.98727 ± 0.0022
 OOF MCC: 0.98098 | CV Mean MCC: 0.98099 ± 0.0033
 Mean Inference Time: 5.32545s ± 0.0765s
-----------------

Weighted Voting was selected because it achieved a higher OOF Macro F1 (0.98723) than unweighted soft voting (0.98689).

# Hyperparameter Tuning

Hyperparameter Tuning (Optuna): The top four weak learner models (ExtraTrees, RandomForest, XGBoost, and LightGBM) are optimized via Optuna. This Bayesian search fine-tunes parameters like tree depth and learning rates to maximize the cross-validated Macro F1.

### Optuna

Optuna is an open-source hyperparameter optimization (HPO) framework that replaces traditional "static" grid searches with "dynamic" automated searches, in which the results of earlier trials guide the choice of the next hyperparameters. It is built on several probabilistic concepts.

**1. Core Structural Concepts**

Optuna organizes optimization into a clear hierarchy:
- Study: The overall optimization task (e.g., "Minimize the error of my XGBoost model").
- Trial: A single execution of the objective function with one specific set of hyperparameters.
- Objective Function: A user-defined Python function that Optuna calls repeatedly. It takes a trial object as input and returns a numerical value (the score) to be optimized.

**2. Theoretical Foundations: "Define-by-Run"**

Unlike other frameworks that require you to pre-define a static configuration dictionary, Optuna uses a Define-by-Run API.
- The Theory: The search space is constructed dynamically as the code executes.
- The Concept: This allows for Conditional Hyperparameters. For example, the search can suggest a "Kernel Type" for an SVM, and only if the kernel is "Polynomial" does it then suggest a "Degree" parameter. This mirrors the logic of a human researcher.
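
The conditional search space described above can be illustrated without Optuna itself. `ToyTrial` below is a hypothetical stand-in that only mimics the shape of the `trial.suggest_categorical` and `trial.suggest_int` calls used in this notebook's objective functions. It samples at random and is not Optuna's implementation.

```python
import random


class ToyTrial:
    """Hypothetical stand-in for a trial; records parameters as they are requested."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self.params = {}

    def suggest_categorical(self, name, choices):
        self.params[name] = self._rng.choice(choices)
        return self.params[name]

    def suggest_int(self, name, low, high):
        self.params[name] = self._rng.randint(low, high)
        return self.params[name]


def define_search_space(trial):
    """The search space is whatever the code asks for while it runs."""
    kernel = trial.suggest_categorical("kernel", ["linear", "rbf", "poly"])
    if kernel == "poly":
        # Only polynomial kernels have a degree, so it is only suggested here.
        trial.suggest_int("degree", 2, 5)
    return trial.params


for seed in range(5):
    print(define_search_space(ToyTrial(seed)))
```

The `degree` parameter appears only in trials that picked the polynomial kernel, so no static configuration has to list it up front.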

**3. Sampling Algorithms (Search Strategies)**

Optuna uses "Samplers" to decide which hyperparameter values to try next based on past results.

**Tree-structured Parzen Estimator (TPE, the default sampler)**
    
- Theory: Instead of modeling the objective function directly (like Gaussian Processes), TPE models the distribution of hyperparameters.
- Concept: It splits previous trials into two groups: "Good" (top performing) and "Bad." It then models two probability densities: $l(x)$ for the good group and $g(x)$ for the bad group. It samples new points that maximize the ratio $l(x)/g(x)$, essentially looking for values that are likely to be "good" and unlikely to be "bad."
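
The $l(x)/g(x)$ idea can be sketched for a single continuous hyperparameter. This is a heavily simplified illustration and not Optuna's sampler: it assumes a fixed kernel bandwidth, scores a fixed grid of candidates instead of sampling from $l(x)$, and uses a hypothetical objective whose optimum is at 0.7.

```python
import math


def kde(x, centers, bandwidth):
    """Average of Gaussian kernels centred on past observations."""
    norm = bandwidth * math.sqrt(2.0 * math.pi)
    return sum(
        math.exp(-0.5 * ((x - c) / bandwidth) ** 2) / norm for c in centers
    ) / len(centers)


def tpe_pick(trials, candidates, gamma=0.25, bandwidth=0.1):
    """Pick the candidate maximizing l(x) / g(x).

    trials: list of (x, score) pairs, where a higher score is better.
    """
    if len(trials) < 2:
        raise ValueError("need at least two past trials to form both groups")
    ranked = sorted(trials, key=lambda t: t[1], reverse=True)
    n_good = min(len(ranked) - 1, max(1, math.ceil(gamma * len(ranked))))
    good = [x for x, _ in ranked[:n_good]]  # models l(x)
    bad = [x for x, _ in ranked[n_good:]]   # models g(x)
    eps = 1e-12  # avoids division by zero where g(x) vanishes
    return max(
        candidates,
        key=lambda x: kde(x, good, bandwidth) / (kde(x, bad, bandwidth) + eps),
    )


# Hypothetical 1-D objective with its optimum at x = 0.7.
history = [(i / 10, -((i / 10) - 0.7) ** 2) for i in range(11)]
grid = [i / 100 for i in range(101)]
next_x = tpe_pick(history, grid)
print("next value to try:", next_x)
```

The proposed value falls inside the region where the good trials cluster and away from where the bad ones accumulate, which is the behaviour the ratio is designed to produce.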
    
**CMA-ES (Covariance Matrix Adaptation Evolution Strategy)**
    
- Theory: An evolutionary algorithm for continuous, non-linear optimization.
- Concept: It maintains a multivariate normal distribution over the search space. In each generation, it "evolves" by shifting the mean toward better results and adapting the covariance matrix to follow the "path" of steepest improvement.

**4. Pruning Theory (Automated Early Stopping)**

Pruning is the process of killing a "bad" trial before it finishes training to save time and compute.
- Median Pruner: Theoretically simple; it stops a trial if its intermediate result (e.g., validation loss at epoch 5) is worse than the median of previous trials at the same step.
- Hyperband / Successive Halving (SHA):
    - Theory: Budget-based allocation.
    - Concept: Start many trials with a tiny budget (e.g., 1 epoch). Keep the top 25%, give them more budget, and repeat. This ensures that only the most promising "survivors" reach the final training stage.
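
Both pruning rules can be sketched in a few lines. The functions below are simplified illustrations with hypothetical names and a made-up objective, not Optuna's pruner classes. `eta=4` corresponds to keeping the top 25% at each rung.

```python
import statistics


def should_prune_median(current_value, previous_values_at_step, higher_is_better=True):
    """Median rule: stop if worse than the median of earlier trials at this step."""
    if not previous_values_at_step:
        return False  # nothing to compare against yet
    median = statistics.median(previous_values_at_step)
    return current_value < median if higher_is_better else current_value > median


def successive_halving(configs, evaluate, min_budget=1, eta=4):
    """Keep the top 1/eta of configs at each rung and multiply the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    history = []
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = scored[: max(1, len(scored) // eta)]
        history.append((budget, len(survivors)))
        budget *= eta
    return survivors[0], history


# Hypothetical objective: best near 0.3, improving slightly with more budget.
def evaluate(config, budget):
    return -abs(config - 0.3) + 0.01 * budget


configs = [i / 16 for i in range(16)]
best, history = successive_halving(configs, evaluate)
print("best config:", best)
print("(budget, survivors) per rung:", history)
print("prune?", should_prune_median(0.61, [0.70, 0.72, 0.65]))
```

Sixteen configurations shrink to four and then to one, so most of the compute is spent on the few candidates that survived the cheap early rungs.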

**5. Hyperparameter Importance: fANOVA**

Optuna can calculate which hyperparameters actually matter using Functional Analysis of Variance (fANOVA).
- Theory: It decomposes the variance of the model's performance.
- Concept: If changing the "Learning Rate" causes a 50% change in accuracy, but changing "Batch Size" only causes a 1% change, fANOVA identifies the Learning Rate as the most critical parameter to focus on.
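
The variance-decomposition idea can be shown on a tiny full-factorial grid. This sketch computes only the main-effect share of each parameter for a hypothetical objective. Optuna's evaluator estimates importances from the trial history rather than from an exhaustive grid, so treat this as an illustration of the concept only.

```python
from itertools import product
from statistics import mean, pvariance


def main_effect_importance(grid, objective):
    """Share of total variance explained by each parameter's main effect.

    grid: dict mapping a parameter name to its list of values (full factorial).
    """
    names = list(grid)
    points = [dict(zip(names, values)) for values in product(*grid.values())]
    scores = [objective(**p) for p in points]
    total_var = pvariance(scores)
    importance = {}
    for name in names:
        # Average the score over all other parameters for each value of `name`.
        marginal = [
            mean(s for p, s in zip(points, scores) if p[name] == value)
            for value in grid[name]
        ]
        importance[name] = pvariance(marginal) / total_var
    return importance


# Hypothetical objective in which learning_rate matters far more than batch_size.
def objective(learning_rate, batch_size):
    return 10.0 * learning_rate + 1.0 * batch_size


grid = {"learning_rate": [0, 1, 2], "batch_size": [0, 1, 2]}
importance = main_effect_importance(grid, objective)
print(importance)
```

For this additive objective the two shares sum to one, with almost all of the variance attributed to `learning_rate`.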

**The Optimization Cycle**

The process follows a feedback loop:

- Suggest: The Sampler picks a value based on the Study's history.
- Evaluate: The Trial runs the objective function.
- Report: The trial sends intermediate results back for potential Pruning.
- Update: The Sampler updates its internal probability model (TPE) with the final result.

## ExtraTrees Classifier

**Extra Trees (Extremely Randomized Trees)**

Similar to Random Forest, but introduces a higher level of randomness. While RF searches for the optimum split threshold for each feature, Extra Trees chooses a random threshold. Additionally, it typically uses the entire dataset rather than bootstrap samples. This significantly reduces training time and often reduces variance further than RF.

**Important Hyperparameters:**

- `n_estimators`: Total number of randomized trees.
- `max_features`: Number of features to randomly sample for each split.
- `bootstrap`: Boolean (default False). Whether to use the whole dataset or samples with replacement.
- `min_samples_split`: The minimum number of samples required to split an internal node.

In [14]:
MODEL_NAME = "extra_trees"
TUNING_FILE = f"best_{MODEL_NAME}_params.pkl"

# Read-only input DB (if exists)
READONLY_DB = "/kaggle/input/optuna-extra-trees-1/scikitlearn/default/1/optuna_extra_trees (1).db"

# Writable DB in working directory
WRITABLE_DB = f"/kaggle/working/optuna_{MODEL_NAME}.db"
STUDY_NAME = f"reversal_{MODEL_NAME}"

# Copy readonly DB to writable location if it exists
if os.path.exists(READONLY_DB):
    print("Copying readonly Extra Trees Optuna database to writable location...")
    shutil.copy(READONLY_DB, WRITABLE_DB)
    print("Database copied successfully.")
else:
    print("No existing Extra Trees database found. Starting fresh.")

# Use writable DB
DB_FILE = WRITABLE_DB

# Load previous best if exists
best_score = 0.0
if os.path.exists(TUNING_FILE):
    loaded = load(TUNING_FILE)
    best_score = loaded.get('score', 0.0)
    print(f"Previous best {MODEL_NAME} OOF Macro F1: {best_score:.5f}")

n_folds = 3   # 3 folds for illustration; more folds give a less noisy validation score
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)

def objective(trial):
    print(f"\n=== {MODEL_NAME.upper()} - Trial {trial.number} ===")
    
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1500),
        'max_depth': trial.suggest_categorical('max_depth', [None, 20, 40, 60, 80, 100]),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', 0.6, 0.8, None]),
        'class_weight': class_weight_dict_full,
        'n_jobs': -1,
        'random_state': 42
    }
    
    print("Params:", {k: v for k, v in params.items() if k not in ['class_weight', 'n_jobs', 'random_state']})
    
    model = ExtraTreesClassifier(**params)
    scores = []
    for tr_idx, val_idx in skf.split(X_train_res, y_train_res):
        X_tr, X_va = X_train_res.iloc[tr_idx], X_train_res.iloc[val_idx]
        y_tr, y_va = y_train_res.iloc[tr_idx], y_train_res.iloc[val_idx]
        model.fit(X_tr, y_tr)
        pred = model.predict(X_va)
        scores.append(f1_score(y_va, pred, average='macro'))
    
    mean_score = np.mean(scores)
    print(f"Trial {trial.number} OOF Macro F1: {mean_score:.5f}")
    return mean_score

study = optuna.create_study(direction="maximize", study_name=STUDY_NAME, storage=f"sqlite:///{DB_FILE}", load_if_exists=True)
print(f"Starting/continuing {MODEL_NAME.upper()} tuning...")
study.optimize(objective, n_trials=3, timeout=None)  # 3 trials for illustration; increase for a better score

best_score = study.best_value
best_params = study.best_params
dump({'score': best_score, 'params': best_params}, TUNING_FILE)
print(f"\n{MODEL_NAME.upper()} tuning complete! Best OOF Macro F1: {best_score:.5f}")
print(f"Best params saved to {TUNING_FILE}")

No existing Extra Trees database found. Starting fresh.


[I 2026-01-09 18:59:42,961] A new study created in RDB with name: reversal_extra_trees


Starting/continuing EXTRA_TREES tuning...

=== EXTRA_TREES - Trial 0 ===
Params: {'n_estimators': 1102, 'max_depth': 100, 'min_samples_split': 15, 'min_samples_leaf': 1, 'max_features': 0.8}


[I 2026-01-09 19:41:23,059] Trial 0 finished with value: 0.9734808886356721 and parameters: {'n_estimators': 1102, 'max_depth': 100, 'min_samples_split': 15, 'min_samples_leaf': 1, 'max_features': 0.8}. Best is trial 0 with value: 0.9734808886356721.


Trial 0 OOF Macro F1: 0.97348

=== EXTRA_TREES - Trial 1 ===
Params: {'n_estimators': 1032, 'max_depth': 40, 'min_samples_split': 14, 'min_samples_leaf': 9, 'max_features': 0.6}


[I 2026-01-09 20:09:39,896] Trial 1 finished with value: 0.9663552346213896 and parameters: {'n_estimators': 1032, 'max_depth': 40, 'min_samples_split': 14, 'min_samples_leaf': 9, 'max_features': 0.6}. Best is trial 0 with value: 0.9734808886356721.


Trial 1 OOF Macro F1: 0.96636

=== EXTRA_TREES - Trial 2 ===
Params: {'n_estimators': 1181, 'max_depth': 40, 'min_samples_split': 17, 'min_samples_leaf': 7, 'max_features': 0.6}


[I 2026-01-09 20:42:37,336] Trial 2 finished with value: 0.9705761509917007 and parameters: {'n_estimators': 1181, 'max_depth': 40, 'min_samples_split': 17, 'min_samples_leaf': 7, 'max_features': 0.6}. Best is trial 0 with value: 0.9734808886356721.


Trial 2 OOF Macro F1: 0.97058

EXTRA_TREES tuning complete! Best OOF Macro F1: 0.97348
Best params saved to best_extra_trees_params.pkl


## RandomForest Classifier

**Random Forest (RF)**

Random Forest is an ensemble learning method based on Bagging (Bootstrap Aggregating). It builds multiple decision trees independently using random subsets of the training data (drawn with replacement). To increase diversity, it also selects a random subset of features at each split. The final prediction is the majority vote (classification) or average (regression) of all trees.

**Important Hyperparameters:**

- `n_estimators`: Number of trees in the forest. More trees generally improve stability but increase training time.
- `max_depth`: The maximum depth of each tree. Controlling this helps prevent trees from becoming too complex and overfitting.
- `max_features`: The number of random features to consider at each split (e.g., sqrt or log2).
- `min_samples_leaf`: The minimum number of samples required to be at a leaf node; higher values smooth the model.

In [15]:
MODEL_NAME = "random_forest"
TUNING_FILE = f"best_{MODEL_NAME}_params.pkl"

# Read-only input DB (if exists)
READONLY_DB = "/kaggle/input/optuna-random-forest-1/scikitlearn/default/1/optuna_random_forest (1).db"

# Writable DB in working directory
WRITABLE_DB = f"/kaggle/working/optuna_{MODEL_NAME}.db"
STUDY_NAME = f"reversal_{MODEL_NAME}"

# Copy readonly DB to writable location if it exists
if os.path.exists(READONLY_DB):
    print("Copying readonly Random Forest Optuna database to writable location...")
    shutil.copy(READONLY_DB, WRITABLE_DB)
    print("Database copied successfully.")
else:
    print("No existing Random Forest database found. Starting fresh.")

# Use writable DB
DB_FILE = WRITABLE_DB

# Load previous best if exists
best_score = 0.0
if os.path.exists(TUNING_FILE):
    loaded = load(TUNING_FILE)
    best_score = loaded.get('score', 0.0)
    print(f"Previous best {MODEL_NAME} OOF Macro F1: {best_score:.5f}")

n_folds = 3        # 3 folds for illustration; more folds give a less noisy validation score
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)

def objective(trial):
    print(f"\n=== {MODEL_NAME.upper()} - Trial {trial.number} ===")
    
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1500),
        'max_depth': trial.suggest_categorical('max_depth', [None, 20, 40, 60, 80, 100]),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', 0.6, 0.8, None]),
        'class_weight': class_weight_dict_full,
        'n_jobs': -1,
        'random_state': 42
    }
    
    print("Params:", {k: v for k, v in params.items() if k not in ['class_weight', 'n_jobs', 'random_state']})
    
    model = RandomForestClassifier(**params)
    scores = []
    for tr_idx, val_idx in skf.split(X_train_res, y_train_res):
        X_tr, X_va = X_train_res.iloc[tr_idx], X_train_res.iloc[val_idx]
        y_tr, y_va = y_train_res.iloc[tr_idx], y_train_res.iloc[val_idx]
        model.fit(X_tr, y_tr)
        pred = model.predict(X_va)
        scores.append(f1_score(y_va, pred, average='macro'))
    
    mean_score = np.mean(scores)
    print(f"Trial {trial.number} OOF Macro F1: {mean_score:.5f}")
    return mean_score

study = optuna.create_study(direction="maximize", study_name=STUDY_NAME, storage=f"sqlite:///{DB_FILE}", load_if_exists=True)
print(f"Starting/continuing {MODEL_NAME.upper()} tuning...")
study.optimize(objective, n_trials=3, timeout=None)  # 3 trials for illustration; increase for a better score

best_score = study.best_value
best_params = study.best_params
dump({'score': best_score, 'params': best_params}, TUNING_FILE)
print(f"\n{MODEL_NAME.upper()} tuning complete! Best OOF Macro F1: {best_score:.5f}")
print(f"Best params saved to {TUNING_FILE}")

No existing Random Forest database found. Starting fresh.


[I 2026-01-09 20:42:37,796] A new study created in RDB with name: reversal_random_forest


Starting/continuing RANDOM_FOREST tuning...

=== RANDOM_FOREST - Trial 0 ===
Params: {'n_estimators': 1489, 'max_depth': 20, 'min_samples_split': 13, 'min_samples_leaf': 1, 'max_features': None}


[I 2026-01-09 21:26:16,215] Trial 0 finished with value: 0.9467469982520003 and parameters: {'n_estimators': 1489, 'max_depth': 20, 'min_samples_split': 13, 'min_samples_leaf': 1, 'max_features': None}. Best is trial 0 with value: 0.9467469982520003.


Trial 0 OOF Macro F1: 0.94675

=== RANDOM_FOREST - Trial 1 ===
Params: {'n_estimators': 445, 'max_depth': 60, 'min_samples_split': 17, 'min_samples_leaf': 5, 'max_features': None}


[I 2026-01-09 21:37:46,247] Trial 1 finished with value: 0.9396990647871689 and parameters: {'n_estimators': 445, 'max_depth': 60, 'min_samples_split': 17, 'min_samples_leaf': 5, 'max_features': None}. Best is trial 0 with value: 0.9467469982520003.


Trial 1 OOF Macro F1: 0.93970

=== RANDOM_FOREST - Trial 2 ===
Params: {'n_estimators': 434, 'max_depth': 60, 'min_samples_split': 9, 'min_samples_leaf': 8, 'max_features': 'log2'}


[I 2026-01-09 21:37:54,599] Trial 2 finished with value: 0.6712486483227122 and parameters: {'n_estimators': 434, 'max_depth': 60, 'min_samples_split': 9, 'min_samples_leaf': 8, 'max_features': 'log2'}. Best is trial 0 with value: 0.9467469982520003.


Trial 2 OOF Macro F1: 0.67125

RANDOM_FOREST tuning complete! Best OOF Macro F1: 0.94675
Best params saved to best_random_forest_params.pkl


## XGBoost Classifier

**XGBoost (eXtreme Gradient Boosting)**

An optimized implementation of gradient boosting that builds trees sequentially. Each new tree is fit to the gradients (and second-order curvature) of the loss with respect to the current predictions, so it corrects the errors left by the trees before it. By default it grows trees level-wise (splitting level by level), and it incorporates L1/L2 regularization directly into the objective function to limit overfitting.

**Important Hyperparameters:**

- `learning_rate` (alias `eta`): Scales the contribution of each tree. Lower values (e.g., 0.01) require more trees but often generalize better.
- `max_depth`: Limits the depth, and therefore the complexity, of individual trees.
- `gamma`: The minimum loss reduction required to make a further split; higher values make the model more conservative.
- `reg_lambda` (L2) and `reg_alpha` (L1): Regularization terms on leaf weights that penalize complexity.
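
To make the roles of `gamma` and the L2 term concrete, the following is a minimal sketch of the split-gain formula from the XGBoost documentation, with the L1 term left out for simplicity. The function names and the gradient/Hessian sums are illustrative assumptions, not values from this notebook.

```python
def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Structure score G^2 / (H + lambda) for a single leaf."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of splitting one leaf into two children (L1 term omitted).

    A split is only worth keeping when the returned gain is positive.
    """
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


# Hypothetical gradient/Hessian sums for the two candidate children.
gain_plain = split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=1.0, gamma=0.0)
gain_l2 = split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=10.0, gamma=0.0)
gain_gamma = split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=1.0, gamma=6.0)
print(gain_plain, gain_l2, gain_gamma)
```

In this toy case a larger `reg_lambda` shrinks the gain, and a `gamma` above the raw gain turns it negative, which is why both settings make the model more conservative.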

In [16]:
MODEL_NAME = "xgboost"
TUNING_FILE = f"best_{MODEL_NAME}_params.pkl"
DB_FILE_ORIGINAL = "/kaggle/input/optuna-xgboost-1/scikitlearn/default/1/optuna_xgboost (1).db"
STUDY_NAME = f"reversal_{MODEL_NAME}"

# Copy the read-only DB to a writable location (Kaggle working dir)
WORKING_DIR = "/kaggle/working"
DB_FILE = os.path.join(WORKING_DIR, f"optuna_{MODEL_NAME}.db")

if os.path.exists(DB_FILE_ORIGINAL):
    shutil.copy(DB_FILE_ORIGINAL, DB_FILE)
    print(f"Copied input DB to writable location: {DB_FILE}")
else:
    print("No input DB found — starting fresh study.")

# Load previous best if exists
best_score = 0.0
if os.path.exists(TUNING_FILE):
    print("Loading previous best tuning parameters...")
    loaded = load(TUNING_FILE)
    best_score = loaded.get('score', 0.0)
    print(f"Previous best {MODEL_NAME.upper()} OOF Macro F1: {best_score:.5f}")

n_folds = 3  # 3 folds for illustration; increase for a more reliable (less noisy) validation estimate
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)

def objective(trial):
    print(f"\n=== {MODEL_NAME.upper()} - Trial {trial.number} ===")
    
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1500),
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.15, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 15),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'gamma': trial.suggest_float('gamma', 0.0, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.0, 5.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.0, 5.0),
        'objective': 'multi:softprob',
        'num_class': 3,
        'eval_metric': 'mlogloss',
        'random_state': 42,
        'n_jobs': -1
    }
    
    print("Params:", {k: v for k, v in params.items() if k not in ['objective', 'num_class', 'eval_metric', 'random_state', 'n_jobs']})
    
    model = xgb.XGBClassifier(**params)
    scores = []
    for tr_idx, val_idx in skf.split(X_train_res, y_train_res):
        X_tr, X_va = X_train_res.iloc[tr_idx], X_train_res.iloc[val_idx]
        y_tr, y_va = y_train_res.iloc[tr_idx], y_train_res.iloc[val_idx]
        model.fit(X_tr, y_tr)
        pred = model.predict(X_va)
        scores.append(f1_score(y_va, pred, average='macro'))
    
    mean_score = np.mean(scores)
    print(f"Trial {trial.number} OOF Macro F1: {mean_score:.5f}")
    return mean_score

# Create/load study with writable DB
study = optuna.create_study(
    direction="maximize",
    study_name=STUDY_NAME,
    storage=f"sqlite:///{DB_FILE}",
    load_if_exists=True
)

print(f"Starting/continuing {MODEL_NAME.upper()} tuning...")
print(f"Current trials in study: {len(study.trials)}")
study.optimize(objective, n_trials=3, timeout=None)  # 3 trials for illustration; increase for a more thorough search

best_score = study.best_value
best_params = study.best_params

dump({'score': best_score, 'params': best_params}, TUNING_FILE)
print(f"\n{MODEL_NAME.upper()} tuning complete! Best OOF Macro F1: {best_score:.5f}")
print(f"Best params saved to {TUNING_FILE}")

No input DB found — starting fresh study.


[I 2026-01-09 21:37:54,977] A new study created in RDB with name: reversal_xgboost


Starting/continuing XGBOOST tuning...
Current trials in study: 0

=== XGBOOST - Trial 0 ===
Params: {'n_estimators': 814, 'learning_rate': 0.07517667186111256, 'max_depth': 4, 'min_child_weight': 12, 'subsample': 0.9532396579084901, 'colsample_bytree': 0.5326153712517224, 'gamma': 0.6243597105110519, 'reg_alpha': 1.503164748807263, 'reg_lambda': 2.472592845853095}


[I 2026-01-09 21:50:20,122] Trial 0 finished with value: 0.9586530211610932 and parameters: {'n_estimators': 814, 'learning_rate': 0.07517667186111256, 'max_depth': 4, 'min_child_weight': 12, 'subsample': 0.9532396579084901, 'colsample_bytree': 0.5326153712517224, 'gamma': 0.6243597105110519, 'reg_alpha': 1.503164748807263, 'reg_lambda': 2.472592845853095}. Best is trial 0 with value: 0.9586530211610932.


Trial 0 OOF Macro F1: 0.95865

=== XGBOOST - Trial 1 ===
Params: {'n_estimators': 1167, 'learning_rate': 0.008344347437660785, 'max_depth': 4, 'min_child_weight': 5, 'subsample': 0.6173578373792047, 'colsample_bytree': 0.7628926564846448, 'gamma': 0.8412653599580224, 'reg_alpha': 3.3934546183303023, 'reg_lambda': 4.908400383489214}


[I 2026-01-09 22:12:51,607] Trial 1 finished with value: 0.9103231907462331 and parameters: {'n_estimators': 1167, 'learning_rate': 0.008344347437660785, 'max_depth': 4, 'min_child_weight': 5, 'subsample': 0.6173578373792047, 'colsample_bytree': 0.7628926564846448, 'gamma': 0.8412653599580224, 'reg_alpha': 3.3934546183303023, 'reg_lambda': 4.908400383489214}. Best is trial 0 with value: 0.9586530211610932.


Trial 1 OOF Macro F1: 0.91032

=== XGBOOST - Trial 2 ===
Params: {'n_estimators': 465, 'learning_rate': 0.08504668280427521, 'max_depth': 11, 'min_child_weight': 8, 'subsample': 0.6946881511498744, 'colsample_bytree': 0.678666926721506, 'gamma': 0.15628434491772247, 'reg_alpha': 2.365077890701018, 'reg_lambda': 0.4219208757506171}


[I 2026-01-09 22:22:42,344] Trial 2 finished with value: 0.9695124248619758 and parameters: {'n_estimators': 465, 'learning_rate': 0.08504668280427521, 'max_depth': 11, 'min_child_weight': 8, 'subsample': 0.6946881511498744, 'colsample_bytree': 0.678666926721506, 'gamma': 0.15628434491772247, 'reg_alpha': 2.365077890701018, 'reg_lambda': 0.4219208757506171}. Best is trial 2 with value: 0.9695124248619758.


Trial 2 OOF Macro F1: 0.96951

XGBOOST tuning complete! Best OOF Macro F1: 0.96951
Best params saved to best_xgboost_params.pkl


## LightGBM Classifier

**LightGBM (Light Gradient Boosting Machine)**

A high-speed gradient boosting framework developed by Microsoft. Its primary differentiator is leaf-wise growth: it splits the leaf that yields the greatest loss reduction, regardless of depth. It also uses histogram-based binning and can optionally use GOSS (Gradient-based One-Side Sampling) to handle large datasets with significantly less memory and time.
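
The GOSS idea can be sketched in a few lines of plain Python. This is an illustrative reimplementation of the sampling step only, not LightGBM's internal code; the function name `goss_sample` and the rates `top_rate` and `other_rate` are assumptions chosen for the example.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {row_index: weight} for one GOSS-style sampling round."""
    n = len(gradients)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = round(top_rate * n)
    n_other = round(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    # Keep every large-gradient row; randomly sample the small-gradient rows.
    sampled = random.Random(seed).sample(rest, n_other)
    # Up-weight the sampled rows so their overall contribution is roughly preserved.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights


# Toy gradients whose magnitude grows with the row index.
grads = [((-1) ** i) * i / 100 for i in range(100)]
weights = goss_sample(grads)
print(len(weights), "of", len(grads), "rows kept")
```

Only 30% of the rows are used to evaluate splits in this toy round, while the rows with the largest gradients (the ones the model currently fits worst) are always retained.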

**Important Hyperparameters:**

- `num_leaves`: The most important parameter for LightGBM; it controls tree complexity (should be less than $2^{\text{max\_depth}}$ when `max_depth` is set).
- `learning_rate`: As in XGBoost, scales each tree's contribution (the step size of the optimization).
- `min_data_in_leaf`: Prevents overfitting by ensuring each leaf has enough supporting samples.
- `feature_fraction`: Randomly selects a subset of features for each tree (called `colsample_bytree` in the scikit-learn API used below).
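
The `num_leaves` guideline can be checked before a model is fit. The helper below is a hypothetical sketch (it is not part of the notebook) that caps `num_leaves` at the number of leaves a tree of the given depth can hold, treating `max_depth <= 0` as "no depth limit". The example pairs are taken from the tuning logs in this section.

```python
def max_leaves_for_depth(max_depth):
    """Largest leaf count a binary tree of this depth can hold; None if unlimited."""
    return None if max_depth <= 0 else 2 ** max_depth


def cap_num_leaves(num_leaves, max_depth):
    """Clamp num_leaves so it does not exceed what max_depth allows."""
    limit = max_leaves_for_depth(max_depth)
    return num_leaves if limit is None else min(num_leaves, limit)


# (max_depth, num_leaves) pairs that appear in the logged LightGBM trials.
for depth, leaves in [(32, 362), (34, 67), (4, 114)]:
    print(depth, leaves, "->", cap_num_leaves(leaves, depth))
```

With `max_depth=4` a tree can hold at most 16 leaves, so a sampled `num_leaves=114` is not independently effective in that trial. Sampling `num_leaves` conditionally on `max_depth` is one way to avoid spending trials on such combinations.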

In [17]:
MODEL_NAME = "lightgbm"
TUNING_FILE = f"best_{MODEL_NAME}_params.pkl"
DB_FILE_ORIGINAL = "/kaggle/input/optuna-lightgbm-1/scikitlearn/default/1/optuna_lightgbm (1).db"
STUDY_NAME = f"reversal_{MODEL_NAME}"

# Copy the read-only DB to writable working directory
WORKING_DIR = "/kaggle/working"
DB_FILE = os.path.join(WORKING_DIR, f"optuna_{MODEL_NAME}.db")

if os.path.exists(DB_FILE_ORIGINAL):
    shutil.copy(DB_FILE_ORIGINAL, DB_FILE)
    print(f"Copied input DB to writable location: {DB_FILE}")
else:
    print("No input DB found — starting fresh study.")

# Load previous best if exists
best_score = 0.0
if os.path.exists(TUNING_FILE):
    print("Loading previous best tuning parameters...")
    loaded = load(TUNING_FILE)
    best_score = loaded.get('score', 0.0)
    print(f"Previous best {MODEL_NAME.upper()} OOF Macro F1: {best_score:.5f}")

n_folds = 3  # 3 folds for illustration; increase for a more reliable (less noisy) validation estimate
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)

def objective(trial):
    print(f"\n=== {MODEL_NAME.upper()} - Trial {trial.number} ===")
    
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 2000),
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.15, log=True),
        'max_depth': trial.suggest_int('max_depth', -1, 40),
        'num_leaves': trial.suggest_int('num_leaves', 31, 512),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 10, 200),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),  # note: LightGBM ignores this unless subsample_freq > 0
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.0, 10.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.0, 10.0),
        'objective': 'multiclass',
        'num_class': 3,
        'class_weight': 'balanced',  
        'random_state': 42,
        'verbose': -1,
        'n_jobs': -1
    }
    
    print("Params:", {k: v for k, v in params.items() if k not in ['objective', 'num_class', 'class_weight', 'random_state', 'verbose', 'n_jobs']})
    
    model = lgb.LGBMClassifier(**params)
    scores = []
    for tr_idx, val_idx in skf.split(X_train_res, y_train_res):
        X_tr, X_va = X_train_res.iloc[tr_idx], X_train_res.iloc[val_idx]
        y_tr, y_va = y_train_res.iloc[tr_idx], y_train_res.iloc[val_idx]
        model.fit(X_tr, y_tr)
        pred = model.predict(X_va)
        scores.append(f1_score(y_va, pred, average='macro'))
    
    mean_score = np.mean(scores)
    print(f"Trial {trial.number} OOF Macro F1: {mean_score:.5f}")
    return mean_score

# Create/load study with writable DB
study = optuna.create_study(
    direction="maximize",
    study_name=STUDY_NAME,
    storage=f"sqlite:///{DB_FILE}",
    load_if_exists=True
)

print(f"Starting/continuing {MODEL_NAME.upper()} tuning...")
print(f"Current trials in study: {len(study.trials)}")
study.optimize(objective, n_trials=3, timeout=None)  # 3 trials for illustration; increase for a more thorough search

best_score = study.best_value
best_params = study.best_params

dump({'score': best_score, 'params': best_params}, TUNING_FILE)
print(f"\n{MODEL_NAME.upper()} tuning complete! Best OOF Macro F1: {best_score:.5f}")
print(f"Best params saved to {TUNING_FILE}")

No input DB found — starting fresh study.


[I 2026-01-09 22:22:42,733] A new study created in RDB with name: reversal_lightgbm


Starting/continuing LIGHTGBM tuning...
Current trials in study: 0

=== LIGHTGBM - Trial 0 ===
Params: {'n_estimators': 546, 'learning_rate': 0.0357308764797109, 'max_depth': 32, 'num_leaves': 362, 'min_data_in_leaf': 154, 'subsample': 0.8763642830164716, 'colsample_bytree': 0.9543147498916158, 'reg_alpha': 4.824193382451044, 'reg_lambda': 7.768370479754275}


[I 2026-01-09 22:23:08,701] Trial 0 finished with value: 0.9278944556351507 and parameters: {'n_estimators': 546, 'learning_rate': 0.0357308764797109, 'max_depth': 32, 'num_leaves': 362, 'min_data_in_leaf': 154, 'subsample': 0.8763642830164716, 'colsample_bytree': 0.9543147498916158, 'reg_alpha': 4.824193382451044, 'reg_lambda': 7.768370479754275}. Best is trial 0 with value: 0.9278944556351507.


Trial 0 OOF Macro F1: 0.92789

=== LIGHTGBM - Trial 1 ===
Params: {'n_estimators': 1718, 'learning_rate': 0.006551927983229708, 'max_depth': 34, 'num_leaves': 67, 'min_data_in_leaf': 76, 'subsample': 0.8665643040782558, 'colsample_bytree': 0.927539087557216, 'reg_alpha': 2.3220177866872618, 'reg_lambda': 9.659262223966373}


[I 2026-01-09 22:24:09,342] Trial 1 finished with value: 0.9592855831057022 and parameters: {'n_estimators': 1718, 'learning_rate': 0.006551927983229708, 'max_depth': 34, 'num_leaves': 67, 'min_data_in_leaf': 76, 'subsample': 0.8665643040782558, 'colsample_bytree': 0.927539087557216, 'reg_alpha': 2.3220177866872618, 'reg_lambda': 9.659262223966373}. Best is trial 1 with value: 0.9592855831057022.


Trial 1 OOF Macro F1: 0.95929

=== LIGHTGBM - Trial 2 ===
Params: {'n_estimators': 1519, 'learning_rate': 0.014356225642718835, 'max_depth': 4, 'num_leaves': 114, 'min_data_in_leaf': 190, 'subsample': 0.7807899796819592, 'colsample_bytree': 0.9846796614113262, 'reg_alpha': 2.2864121751267663, 'reg_lambda': 1.3710453996270144}


[I 2026-01-09 22:24:31,607] Trial 2 finished with value: 0.9075776395083679 and parameters: {'n_estimators': 1519, 'learning_rate': 0.014356225642718835, 'max_depth': 4, 'num_leaves': 114, 'min_data_in_leaf': 190, 'subsample': 0.7807899796819592, 'colsample_bytree': 0.9846796614113262, 'reg_alpha': 2.2864121751267663, 'reg_lambda': 1.3710453996270144}. Best is trial 1 with value: 0.9592855831057022.


Trial 2 OOF Macro F1: 0.90758

LIGHTGBM tuning complete! Best OOF Macro F1: 0.95929
Best params saved to best_lightgbm_params.pkl


# Final Ensemble Training & Submission

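The tuned models are combined in a final weighted soft-voting ensemble: each model's predicted class probabilities are averaged using fixed weights, and the class with the highest blended probability is predicted. Weights are assigned according to individual model performance. The pipeline then generates the final predictions, maps them back to the original labels (H, L, None), and exports the results to submission.csv.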
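
As a rough illustration of what weighted soft voting computes for a single sample, the sketch below blends four probability vectors with the same weights used in the ensemble cell. The probabilities are made up for the example, and the class order (H, L, None) is an assumption about how the label encoder sorted the labels.

```python
def weighted_soft_vote(probas, weights):
    """Blend per-model class probabilities and return (blended, predicted_index)."""
    total = sum(weights)
    n_classes = len(probas[0])
    blended = [
        sum(w * p[c] for w, p in zip(weights, probas)) / total
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=blended.__getitem__)
    return blended, predicted


classes = ["H", "L", "None"]            # assumed encoder order
weights = [0.32, 0.28, 0.24, 0.16]      # ET, RF, LGBM, XGB
probas = [                              # hypothetical outputs for one sample
    [0.10, 0.05, 0.85],                 # ExtraTrees
    [0.20, 0.10, 0.70],                 # RandomForest
    [0.60, 0.10, 0.30],                 # LightGBM
    [0.70, 0.05, 0.25],                 # XGBoost
]
blended, predicted = weighted_soft_vote(probas, weights)
print([round(p, 3) for p in blended], classes[predicted])
```

Here the two boosting models favor H, but the two higher-weighted bagging models favor None, so the blended prediction is None. This shows how the weight vector, not a simple majority, decides close cases.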

In [11]:
# Extra Trees 
et_best = ExtraTreesClassifier(
    n_estimators=115,
    max_depth=100,
    min_samples_split=15,
    min_samples_leaf=1,
    max_features='sqrt',
    class_weight=class_weight_dict_full,
    n_jobs=-1,
    random_state=42
)

# Random Forest 
rf_best = RandomForestClassifier(
    n_estimators=853,
    max_depth=None,
    min_samples_split=8,
    min_samples_leaf=1,
    max_features='log2',
    class_weight=class_weight_dict_full,
    n_jobs=-1,
    random_state=42
)

# XGBoost 
xgb_best = xgb.XGBClassifier(
    n_estimators=5294,
    learning_rate=0.0507,
    max_depth=12,
    min_child_weight=5,
    subsample=0.7659,
    colsample_bytree=0.613,
    gamma=0.46586,
    reg_alpha=1.180167,
    reg_lambda=0.2799,
    objective='multi:softprob',
    num_class=3,
    eval_metric='mlogloss',
    random_state=42,
    n_jobs=-1
)

# LightGBM 
lgb_best = lgb.LGBMClassifier(
    n_estimators=1280,
    learning_rate=0.01533360874282631,
    max_depth=27,
    num_leaves=461,
    min_data_in_leaf=30,
    subsample=0.5876046117077005,  # note: LightGBM ignores this unless subsample_freq > 0
    colsample_bytree=0.5799546893007104,
    reg_alpha=0.0550092101268719,
    reg_lambda=8.675002657367935,
    objective='multiclass',
    num_class=3,
    class_weight=class_weight_dict_full,
    random_state=42,
    verbose=-1,
    n_jobs=-1
)

# Optimized Weights for Highest F1 (based on trial performance & blending experiments)
# Weights: ExtraTrees (highest) > RF > LGBM > XGB
blend_weights = [0.32, 0.28, 0.24, 0.16]  # ET, RF, LGBM, XGB

# Weighted soft-voting ensemble (scikit-learn VotingClassifier)
print("\nTraining final weighted soft voting ensemble (ET + RF + LGBM + XGB)...")
ensemble = VotingClassifier(
    estimators=[
        ('ExtraTrees', et_best),
        ('RandomForest', rf_best),
        ('LightGBM', lgb_best),
        ('XGBoost', xgb_best)
    ],
    voting='soft',
    weights=blend_weights,
    n_jobs=-1
)

# Train on full resampled data
ensemble.fit(X_train_res, y_train_res)

# Validation Performance
val_proba = ensemble.predict_proba(X_val)
val_pred = np.argmax(val_proba, axis=1)
val_pred_str = le.inverse_transform(val_pred)

macro_f1 = f1_score(y_val_encoded, val_pred, average='macro')
bal_acc = balanced_accuracy_score(y_val_encoded, val_pred)
mcc = matthews_corrcoef(y_val_encoded, val_pred)

print(f"\nValidation Macro F1: {macro_f1:.5f}")
print(f"Validation Balanced Acc: {bal_acc:.5f}")
print(f"Validation MCC: {mcc:.5f}")

# Test Prediction & Submission
test_proba = ensemble.predict_proba(X_test)
test_pred = np.argmax(test_proba, axis=1)
test_pred_str = le.inverse_transform(test_pred)

submission = pd.DataFrame({'id': test_ids, 'class_label': test_pred_str})
submission['class_label'] = submission['class_label'].fillna('None')
submission.to_csv("submission.csv", index=False)

print("\nSubmission saved successfully!")
print("Best ensemble: Weighted Voting (ET + RF + LGBM + XGB)")


Training final weighted soft voting ensemble (ET + RF + LGBM + XGB)...

Validation Macro F1: 0.32446
Validation Balanced Acc: 0.33011
Validation MCC: -0.01524

Submission saved successfully!
Best ensemble: Weighted Voting (ET + RF + LGBM + XGB)
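
For context on these validation numbers, it helps to compare them with a trivial baseline that always predicts None. The check below is a minimal sketch that assumes the validation split follows the roughly 94% / 3% / 3% class mix quoted in the competition overview; the counts are illustrative and not taken from the notebook.

```python
def f1(tp, fp, fn):
    """F1 from confusion counts; defined as 0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def macro_f1_all_majority(class_counts, majority):
    """Macro F1 of a classifier that predicts `majority` for every sample."""
    total = sum(class_counts.values())
    scores = []
    for label, count in class_counts.items():
        if label == majority:
            # Every sample of this class is found; all others are false positives.
            scores.append(f1(tp=count, fp=total - count, fn=0))
        else:
            # The class is never predicted, so every sample is a false negative.
            scores.append(f1(tp=0, fp=0, fn=count))
    return sum(scores) / len(scores)


counts = {"None": 940, "H": 30, "L": 30}   # assumed class mix
baseline = macro_f1_all_majority(counts, "None")
print(round(baseline, 5))
```

Under that assumption the always-None baseline scores a Macro F1 of about 0.323, with a balanced accuracy of 1/3 and an MCC of 0. The ensemble's validation figures are close to all three, which suggests it rarely identifies H or L correctly on unseen data.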


In [18]:
submission["class_label"].value_counts(normalize=True)

class_label
None    0.975673
L       0.013901
H       0.010426
Name: proportion, dtype: float64

# Conclusion

The developed solution provides a complete end-to-end framework for detecting US equity reversals, combining time-aware preprocessing, aggressive imbalance management, and Bayesian-tuned tree ensembles. In cross-validation on the ADASYN-resampled training data, the tuned models reach out-of-fold Macro F1 scores of roughly 0.95 to 0.97. On the held-out validation split, however, the final ensemble scores a Macro F1 of 0.324 with an MCC of -0.015, so the cross-validation figures are optimistic and do not yet carry over to unseen data. The pipeline addresses the challenges of high dimensionality and temporal dependencies and can serve as a starting blueprint for imbalanced time-series classification tasks in the financial domain.

# References

- [Standard Scaler sk-learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
- [Standard Scaler GeeksforGeeks](https://www.geeksforgeeks.org/machine-learning/standardscaler-minmaxscaler-and-robustscaler-techniques-ml/)
- [ADASYN Imblearn](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.ADASYN.html)
- [PCA Wikipedia](https://en.wikipedia.org/wiki/Principal_component_analysis)
- [PCA sk-learn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)
- [PCA GeeksforGeeks](https://www.geeksforgeeks.org/data-analysis/principal-component-analysis-pca/)
- [Voting Classifier sk-learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html)
- [Voting Classifier GeeksforGeeks](https://www.geeksforgeeks.org/machine-learning/voting-classifier/)
- [Optuna Hyperparameter Tuning](https://optuna.org/)
- [ExtraTrees Classifier sk-learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)
- [RandomForest Classifier sk-learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- [RandomForest Classifier GeeksforGeeks](https://www.geeksforgeeks.org/dsa/random-forest-classifier-using-scikit-learn/)
- [XGBoost Classifier](https://xgboost.readthedocs.io/en/stable/)
- [XGBoost Classifier GeeksforGeeks](https://www.geeksforgeeks.org/machine-learning/xgboost/)
- [LightGBM Classifier](https://lightgbm.readthedocs.io/en/stable/)
- [LightGBM GeeksforGeeks](https://www.geeksforgeeks.org/machine-learning/lightgbm-light-gradient-boosting-machine/)