# Data Pipeline and Normalization (Current)

This notebook explains how data is transformed before entering TCN training.


## 1) Pipeline Stages

The active flow in `DataProcessor` (TCN pipeline) includes:

1. OHLCV load,
2. multi-horizon returns,
3. return statistics,
4. technical indicators,
5. covariance features,
6. fundamentals merge (if enabled),
7. macro merge (if enabled),
8. regime features,
9. quant alpha/cross-sectional features,
10. normalization,
11. split handling for train/test.
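
The ordering above, including the optional stages, can be pictured as a list of functions applied in sequence. The sketch below is illustrative only and is not the `DataProcessor` implementation: the helper names (`add_multi_horizon_returns`, `add_return_stats`, `run_stages`), the `close` column, the horizons, and the window length are all assumptions chosen for the example.

```python
import pandas as pd

def add_multi_horizon_returns(df, horizons=(1, 5)):
    # Simple returns over each horizon h: close[t] / close[t-h] - 1 (hypothetical helper).
    out = df.copy()
    for h in horizons:
        out[f"ret_{h}d"] = out["close"] / out["close"].shift(h) - 1.0
    return out

def add_return_stats(df, window=3):
    # Rolling volatility of the 1-day return (hypothetical helper).
    out = df.copy()
    out[f"ret_1d_vol_{window}"] = out["ret_1d"].rolling(window).std()
    return out

def run_stages(df, stages):
    # Apply each enabled stage in order; disabled stages are skipped.
    for _name, fn, enabled in stages:
        if enabled:
            df = fn(df)
    return df

prices = pd.DataFrame({"close": [100.0, 102.0, 101.0, 103.0, 104.0, 106.0, 105.0]})
stages = [
    ("returns", add_multi_horizon_returns, True),
    ("return_stats", add_return_stats, True),
    # Placeholder for an optional merge that is switched off in this run.
    ("macro_merge", lambda d: d.assign(macro=0.0), False),
]
features = run_stages(prices, stages)
print(list(features.columns))
```

Keeping each stage as a function with an explicit enabled flag makes it easy to see which optional merges (fundamentals, macro) actually contributed columns to a given run.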


## 2) No-Leak Normalization Rule

Scalers are fit on train-window data only, then applied unchanged to the whole dataset, including the test window.

Conceptually:

$$
z = \frac{x-\mu_{\text{train}}}{\sigma_{\text{train}}}.
$$

This avoids fitting normalization statistics on out-of-sample data.
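
A minimal NumPy sketch of this rule follows. It assumes rows are ordered in time and that the train window is the first `train_end` rows; the function names, the `eps` guard for constant columns, and the synthetic data are assumptions for illustration, not the project's scaler.

```python
import numpy as np

def fit_train_scaler(x, train_end, eps=1e-8):
    # Estimate mean and std from the train window only (rows [0, train_end)).
    train = x[:train_end]
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    # Guard against constant columns to avoid division by zero.
    sigma = np.where(sigma < eps, 1.0, sigma)
    return mu, sigma

def apply_scaler(x, mu, sigma):
    # Apply the train-window statistics to every row, train and test alike.
    return (x - mu) / sigma

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
train_end = 80

mu, sigma = fit_train_scaler(x, train_end)
z = apply_scaler(x, mu, sigma)

# Changing only the test rows leaves the fitted statistics untouched.
x_shifted = x.copy()
x_shifted[train_end:] += 100.0
mu_shifted, sigma_shifted = fit_train_scaler(x_shifted, train_end)
print(np.allclose(mu, mu_shifted), np.allclose(sigma, sigma_shifted))
```

The final check is the point of the rule: because the statistics depend only on the train rows, nothing that happens in the test window can influence how training features are scaled.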


## 3) Cross-Sectional Z-Score Rule

Cross-sectional z-score columns are excluded from re-normalization to preserve their cross-sectional meaning.
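
A cross-sectional z-score is computed across assets at a single date, so a value of zero means "equal to the cross-sectional average on that date". Standardizing the column again along the time axis would shift and rescale it and lose that reading. The sketch below shows both halves of the idea under stated assumptions: the `_cs_z` suffix, the helper names, and the toy values are hypothetical and may not match the project's column naming.

```python
import numpy as np
import pandas as pd

def cross_sectional_zscore(panel):
    # panel: rows are dates, columns are assets. Standardize across each row.
    mu = panel.mean(axis=1)
    sigma = panel.std(axis=1, ddof=0).replace(0.0, 1.0)  # guard flat rows
    return panel.sub(mu, axis=0).div(sigma, axis=0)

def columns_to_scale(columns, skip_suffix="_cs_z"):
    # Keep only columns that should go through the train-window scaler.
    return [c for c in columns if not c.endswith(skip_suffix)]

momentum = pd.DataFrame(
    {"AAA": [1.0, 2.0], "BBB": [2.0, 2.0], "CCC": [3.0, 5.0]},
    index=["2024-01-02", "2024-01-03"],
)
cs_z = cross_sectional_zscore(momentum)

scale_cols = columns_to_scale(["rsi_14", "momentum_cs_z", "ret_5d"])
print(scale_cols)
```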


## 4) State Dimension

If there are $N$ risky assets and $F$ active features per asset:

$$
\text{state\_dim} = N \times F.
$$
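
As a quick numeric illustration, the sketch below builds a per-asset feature matrix and flattens it into one state vector. The asset-major flattening order and the example sizes are assumptions; the source only states the product $N \times F$.

```python
import numpy as np

n_assets, n_features = 4, 6  # example values for N and F

# One row of F features per asset.
per_asset = np.arange(n_assets * n_features, dtype=np.float32).reshape(n_assets, n_features)

# Flatten asset by asset into a single observation vector (assumed layout).
state = per_asset.reshape(-1)
state_dim = state.shape[0]
print(state_dim)
```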


## 5) Robustness Against Missing Values

- preprocessing includes fill/clean steps,
- environment converts any remaining NaN/Inf with `nan_to_num` before rollout.
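
The second safeguard can be illustrated with NumPy's `nan_to_num`. The replacement values below are assumptions: the source names the function but not its arguments. With its defaults, `np.nan_to_num` maps NaN to `0.0` and maps infinities to the largest finite values of the dtype rather than to zero, so passing `posinf` and `neginf` explicitly is often the safer choice for model inputs.

```python
import numpy as np

obs = np.array([0.5, np.nan, np.inf, -np.inf], dtype=np.float32)

# Explicit replacements keep the observation bounded (assumed values).
clean = np.nan_to_num(obs, nan=0.0, posinf=0.0, neginf=0.0)

# Default behaviour for comparison: infinities become very large finite numbers.
default_clean = np.nan_to_num(obs)
print(clean, default_clean)
```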


```python
from src.data_utils import DataProcessor
from src.config import PHASE1_CONFIG

# Build the processor from the Phase 1 configuration.
processor = DataProcessor(PHASE1_CONFIG)
phase_key = 'phase' + '1'  # internal compatibility key

# List the feature columns that are active for this phase.
cols = processor.get_feature_columns(phase_key)
print('Feature count (active):', len(cols))
print('Sample columns:', cols[:15])
```
