# Plan

## Feature Engineering Pipeline
- **Nonlinear dimensionality reduction:** Implement autoencoder-based feature extraction to capture complex nonlinear relationships identified in EDA.
- **Collinearity block compression:** Transform identified collinear feature blocks into compact representations while preserving predictive information.

## Feature Selection & Validation
- **Model-based selection:** Use XGBoost with 6-fold time-series cross-validation combined with SHAP analysis to identify the most predictive features.
- **Robust feature ranking:** Aggregate feature importance scores across all CV folds to ensure stable feature selection that generalizes across time periods.
- **Correlation filtering:** Identify and remove highly correlated features to reduce redundancy in the final feature set.
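
The robust-ranking step above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: it assumes per-fold importance scores (for example mean absolute SHAP values) have already been collected into a table with one row per fold, and it aggregates them by mean rank, which is an assumed choice. The function name `aggregate_fold_importance`, the feature names, and the numbers are hypothetical; only three folds are shown for brevity.

```python
import pandas as pd


def aggregate_fold_importance(fold_importance: pd.DataFrame, top_k: int) -> list:
    """Return the top_k features by mean rank across folds.

    fold_importance: rows are CV folds, columns are features,
    values are importance scores (higher = more important).
    """
    # Rank within each fold (1 = most important) so that folds with
    # different importance scales contribute equally.
    ranks = fold_importance.rank(axis=1, ascending=False)
    # Average the ranks over folds; a low mean rank means consistently important.
    mean_rank = ranks.mean(axis=0).sort_values()
    return mean_rank.index[:top_k].tolist()


# Hypothetical importance scores, for illustration only.
fold_importance = pd.DataFrame(
    {
        "f1": [0.50, 0.40, 0.45],
        "f2": [0.10, 0.35, 0.05],
        "f3": [0.30, 0.20, 0.40],
        "f4": [0.05, 0.05, 0.10],
    },
    index=["fold_0", "fold_1", "fold_2"],
)

stable_features = aggregate_fold_importance(fold_importance, top_k=2)
print(stable_features)
```

Ranking before averaging keeps a single fold with unusually large importance values from dominating the selection; averaging raw scores is an equally plausible alternative.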

## Feature Interaction & Final Modeling
- **Regime identification features:** Apply a pre-trained regime classifier (trained on HMM-identified regimes) to generate regime features. **Critical:** Any data used to train the HMM or the regime classifier must be excluded when validating these features; otherwise the validation scores are contaminated by leakage.
- **Interaction engineering:** Create linear combinations and interaction features from the selected high-value features.
- **Model optimization:** Re-run tree-based model selection with the enhanced feature set to identify the optimal modeling approach.
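
As a rough illustration of the interaction-engineering step above, the sketch below builds pairwise product features from a list of selected columns. It is an assumption about what the interactions could look like, not the notebook's final method; the helper `add_pairwise_interactions`, the `_x_` naming scheme, and the toy data are all hypothetical.

```python
from itertools import combinations

import pandas as pd


def add_pairwise_interactions(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Return a copy of df with one product column per pair of cols."""
    out = df.copy()
    for a, b in combinations(cols, 2):
        # Element-wise product of the two features.
        out[f"{a}_x_{b}"] = df[a] * df[b]
    return out


# Toy data, for illustration only.
toy = pd.DataFrame(
    {
        "f1": [1.0, 2.0, 3.0],
        "f2": [2.0, 0.5, 1.0],
        "f3": [0.0, 1.0, -1.0],
    }
)

interactions = add_pairwise_interactions(toy, ["f1", "f2", "f3"])
print(interactions.columns.tolist())
```

The number of pairs grows quadratically with the number of input columns, so restricting `cols` to the selected high-value features keeps the result manageable.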

# 1. Libraries

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn.preprocessing import StandardScaler

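import sys
# Temporarily add the directory two levels up to the import path
# so the local helper module can be imported, then restore the path.
sys.path.append('../../')
from personal_utils import quick_notify
sys.path.remove('../../')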

# 2. Data Loading

In [2]:
filepath = "../data/"

train_df = pd.read_parquet(filepath+"train.parquet")
features = train_df.drop('label', axis = 1)
y = train_df['label']
# test_df = pd.read_parquet(filepath+"test.parquet")

# 3. Collinearity Block Compression
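- The idea comes from Tony271YnoT's solution; it is by no means novel on my end.
- Features whose Spearman correlation exceeds a threshold are grouped, and one representative (the medoid) is kept per group. This removes highly correlated and possibly uninformative features.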

In [5]:
corr_matrix = features.corr(method="spearman")

In [None]:
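threshold = 0.6  # Copied from Tony271YnoT's approach
groups = []
visited = set()

# Greedily build disjoint groups: each unvisited feature seeds a group of
# the still-unassigned features whose correlation with it exceeds the threshold.
for col in corr_matrix.columns:
    if col in visited:
        continue

    correlated = corr_matrix.columns[corr_matrix[col] > threshold]
    # Drop features already assigned to an earlier group; otherwise groups
    # overlap and the same feature can be selected more than once.
    group = [c for c in correlated if c not in visited]
    # A constant column has NaN correlations (even with itself),
    # so make sure the seed feature is always part of its own group.
    if col not in group:
        group.insert(0, col)
    groups.append(group)
    visited.update(group)

selected_features = []

for group in groups:
    if len(group) == 1:
        selected_features.extend(group)
        continue

    # Keep the medoid: the feature with the highest mean correlation
    # to the other members of its group.
    sub_corr = corr_matrix.loc[group, group]
    mean_corr = sub_corr.mean(axis=1)
    medoid = mean_corr.idxmax()

    selected_features.append(medoid)

reduced_df = train_df[selected_features]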
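
To sanity-check the grouping-and-medoid logic, the steps above can be wrapped in a function and run on a small synthetic frame where the expected outcome is obvious. This is a sketch under assumptions: the function name `compress_collinear_blocks` and the toy columns are hypothetical, and the groups are kept disjoint so that no feature can be selected twice.

```python
import numpy as np
import pandas as pd


def compress_collinear_blocks(df: pd.DataFrame, threshold: float = 0.6) -> list:
    """Return one representative (medoid) feature per correlated group."""
    corr = df.corr(method="spearman")
    visited = set()
    selected = []
    for col in corr.columns:
        if col in visited:
            continue
        # Unassigned features correlated with the seed above the threshold.
        group = [c for c in corr.columns[corr[col] > threshold] if c not in visited]
        if col not in group:
            # Constant columns have NaN correlations; keep them as singletons.
            group.insert(0, col)
        visited.update(group)
        # Medoid: highest mean correlation with the rest of its group.
        medoid = corr.loc[group, group].mean(axis=1).idxmax()
        selected.append(medoid)
    return selected


# Synthetic data: "b" is a noisy copy of "a"; "c" and "d" are independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "a": a,
        "b": a + rng.normal(scale=0.05, size=200),
        "c": rng.normal(size=200),
        "d": rng.normal(size=200),
    }
)

selected = compress_collinear_blocks(toy, threshold=0.6)
print(selected)
```

On this toy frame the near-duplicate pair `a`/`b` should collapse to a single representative while `c` and `d` survive unchanged.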

# 4. Autoencoder

# 5. Feature Selection Function Definition + Initial Run

# 6. Regime Features

# 7. Interaction Engineering

# 8. Model Optimization

# 9. Summary + Next Steps