# Feature Engineering & Target Definition

## Context & Previous Work
- In the previous version of the project, we used returns over a fixed horizon to predict trends. This was unsuccessful.
- We then compared several labeling methods (in the `target_testing` notebook) and found the Golden Cross rule to be the best way to classify the financial trend.
- With this target, regimes are stable and transitions are few.

## New Target: Transition Prediction

### Motivation
- The Golden Cross rule works well for classifying the financial trend, but using it as a target for ML models is pointless: one can simply compute it.
- Instead, we implement a "transition incoming" target, defined below.

### Target Definition
- Target: "transition_incoming". In other words, the target is no longer Bullish/Non-Bullish, but rather "a transition is coming" vs. "nothing is coming"
- Label = 1 during the **30-day window preceding** a transition
- Label = 0 otherwise
- This creates a prediction horizon: we want to detect transitions up to 30 days in advance

### Decision Rule (Rolling days confirmation)
- Identified issue: flagging a transition from a single positive label within the 30 days before it happens is prone to false positives (which are very costly in finance)
- So we implemented a rule: a transition is flagged as incoming only if at least 7 of the last 10 daily labels are equal to 1. This makes the prediction more robust than a single label, while being less strict than requiring 10 out of 10.
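- Trade-off: the rolling rule delays the alert. Even with perfect daily predictions, the 7/10 threshold is first met on the 7th day of the 30-day window, so the effective advance notice is **at most about 24 days** rather than the full 30 (and less if the positive predictions arrive late)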
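
The 7-of-10 confirmation rule can be sketched as a rolling sum over daily 0/1 predictions. This is only an illustration under assumptions: `confirm_signal`, `pred`, and the toy series are hypothetical and not part of the project code.

```python
import pandas as pd

def confirm_signal(pred: pd.Series, window: int = 10, min_hits: int = 7) -> pd.Series:
    """Return 1 on days where at least `min_hits` of the last `window` predictions are 1."""
    hits = pred.rolling(window, min_periods=window).sum()
    # Days without a full window have NaN hits, which compare as False (no alert)
    return (hits >= min_hits).astype(int)

# Toy daily predictions (hypothetical): a few isolated positives, then a sustained run
pred = pd.Series([0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1])
signal = confirm_signal(pred)

# Position of the first confirmed alert, or None if the rule never fires
first_alert = int(signal.idxmax()) if signal.any() else None
print(first_alert)
```

In this toy series the isolated positives at the start never trigger an alert; the rule only fires once the sustained run has accumulated 7 positives within a 10-day window.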

## Class Imbalance
- With 25 transitions over ~6100 trading days, and a 30-day labeling window, we expect approximately **12% positive labels**
- This class imbalance reflects market reality: regime transitions are rare events
- We will address this through:
  * Appropriate evaluation metrics (Precision, Recall, F1, AUC-PR)
  * Class weighting in models if needed
  * Focus on Precision over Recall (false positives are costly)
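
The expected positive rate and a possible class weighting can be worked out directly from the figures above. This is a rough sketch: the weights use the common "balanced" heuristic, n_samples / (n_classes * n_samples_in_class), which is an assumption here and not a choice made in this notebook.

```python
n_transitions = 25
n_days = 6100   # approximate number of trading days
window = 30     # labeling window, in days

# Upper bound on the positive rate: assumes the 30-day windows never overlap
# and are never truncated at the start of the sample
expected_rate = n_transitions * window / n_days

# "Balanced" class weights: n_samples / (n_classes * n_samples_in_class)
w_pos = 1 / (2 * expected_rate)
w_neg = 1 / (2 * (1 - expected_rate))

print(f"expected positive rate: {expected_rate:.1%}")
print(f"class weights: positive={w_pos:.2f}, negative={w_neg:.2f}")
```

Because the estimate ignores overlapping or truncated windows, the observed share of positive labels can only be equal to or lower than this figure.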

## Feature Selection

### Excluded Features (data leakage)

Since the financial regime is computed with the Golden Cross rule, any feature that directly or indirectly uses moving averages (MA, EMA, MACD, ...) is excluded to prevent data leakage.
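
One way to enforce this exclusion is to drop the moving-average-derived columns before training. The sketch below is an assumption about how this could be done: `drop_leaky_features` is a hypothetical helper, and the column list is taken from the feature table produced by `add_all_features` in this notebook.

```python
import pandas as pd

# Moving-average-derived columns, as named in this notebook's feature table
MA_DERIVED = ["MA10", "MA50", "EMA20", "Distance_MA50", "Distance_EMA20"]

def drop_leaky_features(df: pd.DataFrame, leaky=MA_DERIVED) -> pd.DataFrame:
    """Return a copy of `df` without the moving-average-derived columns that are present."""
    return df.drop(columns=[c for c in leaky if c in df.columns])

# Toy frame (hypothetical values) to show the effect
toy = pd.DataFrame({"Close": [1.0, 2.0], "MA10": [1.0, 1.5], "RSI14": [50.0, 60.0]})
clean = drop_leaky_features(toy)
print(list(clean.columns))
```

An explicit list is used instead of name matching (for example, any column containing "MA") so that an unrelated column is never dropped by accident.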

### Included Features
- Existing features: volatility, returns, cumulative returns, RSI
- New features: ATR, Volume ROC, Stochastic oscillator
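
The three new indicators could be computed along these lines. This is a hedged sketch, not the project's `src.features` code: the function name, the output column names (`ATR`, `Volume_ROC`, `Stoch_K`), and the 14-day default are assumptions, and ATR is simplified to a plain rolling mean of the true range (Wilder's original definition uses a smoothed average).

```python
import pandas as pd

def add_new_indicators(df: pd.DataFrame, n: int = 14) -> pd.DataFrame:
    """Add ATR, Volume ROC and Stochastic %K computed over `n` rows."""
    out = df.copy()
    prev_close = out["Close"].shift(1)

    # True range: largest of the three candidate ranges for each day
    true_range = pd.concat(
        [
            out["High"] - out["Low"],
            (out["High"] - prev_close).abs(),
            (out["Low"] - prev_close).abs(),
        ],
        axis=1,
    ).max(axis=1)
    out["ATR"] = true_range.rolling(n).mean()

    # Volume rate of change: relative change versus n rows earlier
    out["Volume_ROC"] = out["Volume"].pct_change(n)

    # Stochastic %K: position of the close within the n-row high/low range
    lowest = out["Low"].rolling(n).min()
    highest = out["High"].rolling(n).max()
    out["Stoch_K"] = 100 * (out["Close"] - lowest) / (highest - lowest)
    return out

# Toy OHLCV data (hypothetical values), with a short window for readability
toy = pd.DataFrame({
    "High": [10.0, 12.0, 11.0],
    "Low": [8.0, 9.0, 9.0],
    "Close": [9.0, 11.0, 10.0],
    "Volume": [100, 200, 150],
})
result = add_new_indicators(toy, n=2)
print(result[["ATR", "Volume_ROC", "Stoch_K"]])
```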

## Implementation Plan
1. Implement the new labeling function with 30-day window
2. Add new technical indicators (ATR, Volume ROC, Stochastic)
3. Check the features for strong correlation/redundancy
4. Train baseline models with proper temporal split
5. Evaluate with Precision-focused metrics
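
Steps 4 and 5 can be sketched as follows. Everything here is an assumption rather than the project's actual pipeline: `temporal_split` and `precision_recall` are hypothetical helpers, and the 80/20 split is arbitrary. The `gap` argument drops the last rows of the training set, because their labels look up to 30 days ahead and would otherwise depend on data from the test period.

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, train_frac: float = 0.8, gap: int = 30):
    """Chronological split: the last `gap` rows before the cut are removed from train."""
    split = int(len(df) * train_frac)
    train = df.iloc[: max(0, split - gap)]
    test = df.iloc[split:]
    return train, test

def precision_recall(y_true: pd.Series, y_pred: pd.Series):
    """Precision and recall for the positive class (label 1)."""
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy data (hypothetical): 100 rows in chronological order
toy = pd.DataFrame({"x": range(100)})
train, test = temporal_split(toy)
print(len(train), len(test))

y_true = pd.Series([1, 0, 1, 1, 0])
y_pred = pd.Series([1, 1, 0, 1, 0])
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

A random shuffle split would mix future rows into the training set, which is why the split is strictly chronological.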

In [1]:
import sys
from pathlib import Path
import pandas as pd

PROJECT_ROOT = Path.cwd().parent
sys.path.insert(0, str(PROJECT_ROOT))

from src.data_loader import load_data


spy = load_data()
spy.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Capital Gains
0,2000-01-03 00:00:00-05:00,93.388775,93.388775,90.632782,91.617065,8164300,0.0,0.0,0.0
1,2000-01-04 00:00:00-05:00,90.416236,90.750892,87.965371,88.034271,8089800,0.0,0.0,0.0
2,2000-01-05 00:00:00-05:00,88.152394,89.156362,86.459427,88.191765,12177900,0.0,0.0,0.0
3,2000-01-06 00:00:00-05:00,87.955477,89.136616,86.774338,86.774338,6227200,0.0,0.0,0.0
4,2000-01-07 00:00:00-05:00,88.388577,91.813881,88.231092,91.813881,8066500,0.0,0.0,0.0


In [2]:
from src.features import add_all_features, add_target_ma_cross

In [11]:
df = spy.copy()
df = add_all_features(df)
df = add_target_ma_cross(df, short_window=50, long_window=200)
df.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Capital Gains,MA10,MA50,EMA20,Return,Log Return,Volatility,Distance_MA50,Distance_EMA20,Cumulated_Return_5d,RSI14,Golden_Cross
0,2000-01-03 00:00:00-05:00,93.388775,93.388775,90.632782,91.617065,8164300,0.0,0.0,0.0,,,91.617065,,,,,0.0,,,Non-Bullish
1,2000-01-04 00:00:00-05:00,90.416236,90.750892,87.965371,88.034271,8089800,0.0,0.0,0.0,,,89.736098,-0.039106,-0.039891,,,-0.018965,,,Non-Bullish
2,2000-01-05 00:00:00-05:00,88.152394,89.156362,86.459427,88.191765,12177900,0.0,0.0,0.0,,,89.169028,0.001789,0.001787,,,-0.01096,,,Non-Bullish
3,2000-01-06 00:00:00-05:00,87.955477,89.136616,86.774338,86.774338,6227200,0.0,0.0,0.0,,,88.477718,-0.016072,-0.016203,,,-0.019252,,,Non-Bullish
4,2000-01-07 00:00:00-05:00,88.388577,91.813881,88.231092,91.813881,8066500,0.0,0.0,0.0,,,89.284708,0.058076,0.056453,,,0.028327,,,Non-Bullish


In [16]:
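def add_target(df: pd.DataFrame, period: int = 30):
    """Set Transition = 1 on the `period` rows preceding each Golden Cross regime change."""
    df1 = df.copy()
    # map() is used instead of replace() to avoid the downcasting FutureWarning
    # that recent pandas versions emit; it assumes only these two labels occur
    df1['Golden_Cross'] = df1['Golden_Cross'].map({'Non-Bullish': 0, 'Bullish': 1})
    df1['Transition'] = 0

    # Row positions where the regime differs from the previous row
    regime = df1['Golden_Cross'].to_numpy()
    indices = [i for i in range(1, len(regime)) if regime[i] != regime[i - 1]]

    # Positional slicing, so the labeling does not depend on the index being 0..n-1
    col = df1.columns.get_loc('Transition')
    for i in indices:
        start = max(0, i - period)
        df1.iloc[start:i, col] = 1

    return df1, indices

# Call once and unpack both results
df1, transition_indices = add_target(df)

print(f"Number of days with transition = 1: {df1['Transition'].sum()}")
print(f"Number of transitions: {len(transition_indices)}")
print(f"Proportion of transition: {df1['Transition'].mean():.2%}")
print('\n-----------------------------------------------------------------------\n')
df1.head()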

Number of days with transition = 1: 713
Number of transitions: 25
Proportion of transition: 11.34%

-----------------------------------------------------------------------





Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Capital Gains,MA10,...,EMA20,Return,Log Return,Volatility,Distance_MA50,Distance_EMA20,Cumulated_Return_5d,RSI14,Golden_Cross,Transition
0,2000-01-03 00:00:00-05:00,93.388775,93.388775,90.632782,91.617065,8164300,0.0,0.0,0.0,,...,91.617065,,,,,0.0,,,0,0
1,2000-01-04 00:00:00-05:00,90.416236,90.750892,87.965371,88.034271,8089800,0.0,0.0,0.0,,...,89.736098,-0.039106,-0.039891,,,-0.018965,,,0,0
2,2000-01-05 00:00:00-05:00,88.152394,89.156362,86.459427,88.191765,12177900,0.0,0.0,0.0,,...,89.169028,0.001789,0.001787,,,-0.01096,,,0,0
3,2000-01-06 00:00:00-05:00,87.955477,89.136616,86.774338,86.774338,6227200,0.0,0.0,0.0,,...,88.477718,-0.016072,-0.016203,,,-0.019252,,,0,0
4,2000-01-07 00:00:00-05:00,88.388577,91.813881,88.231092,91.813881,8066500,0.0,0.0,0.0,,...,89.284708,0.058076,0.056453,,,0.028327,,,0,0
