# 01 - Data and feature sets (FS0/FS1/FS2)

Goal: load the dataset, run a quick audit, and formally define the feature sets for the **E0** ablation.

## Feature set definitions
- **FS0 (target-history only)**: lags + rolling stats of the target (per store).
- **FS1 (FS0 + calendar)**: FS0 + `weekofyear`, `month`, `year`.
- **FS2 (FS1 + exogenous)**: FS1 + exogenous variables (`Holiday_Flag`, `Temperature`, `Fuel_Price`, `CPI`, `Unemployment`).

## Anti-leakage
- **Lags** use `shift(k)` per store.
- **Rolling stats** are computed on the shifted target (`shift(1)`), so the current value is never used.
- Scaling (for deep models) will be fitted **on train only** (in notebook 02).
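
The first two rules can be illustrated with a minimal sketch. The actual implementation lives in `src.common.make_features` and is not shown here, so the function name `add_target_history`, its parameters, and the `min_periods` behaviour (a full window is required) are assumptions made for illustration only.

```python
import pandas as pd


def add_target_history(df: pd.DataFrame, lags=(1, 2), windows=(4,)) -> pd.DataFrame:
    """Add per-store lag and rolling features that only look at past weeks.

    Hypothetical helper: a sketch of the idea, not the project's make_features.
    """
    out = df.sort_values(["Store", "Date"]).copy()
    by_store = out.groupby("Store")["Weekly_Sales"]

    # Lags: shift(k) within each store, so week t only sees y[t-k].
    for k in lags:
        out[f"lag_{k}"] = by_store.shift(k)

    # Rolling stats: shift by one week first, then roll, so y[t] is excluded.
    y_prev = by_store.shift(1)
    for w in windows:
        out[f"roll_mean_{w}"] = y_prev.groupby(out["Store"]).transform(
            lambda s: s.rolling(w).mean()
        )
        out[f"roll_std_{w}"] = y_prev.groupby(out["Store"]).transform(
            lambda s: s.rolling(w).std()
        )
    return out
```

Grouping by `Store` before shifting is what keeps the last weeks of one store from bleeding into the first weeks of the next.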

**Important note**: FS2 assumes the exogenous variables are available at prediction time (_oracle exog_). If they are not in your real setting, this must be documented as a limitation.
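
If the oracle assumption does not hold, one possible mitigation (not part of this notebook's ablation) is to use only the last observed value of each exogenous variable. The sketch below assumes that `Holiday_Flag` is known in advance because holiday dates are fixed, and that the other four columns are not; the helper `lag_exogenous` and the `_lag{k}` naming are hypothetical.

```python
import pandas as pd

# Assumption: these exogenous columns are NOT known ahead of time.
# Holiday_Flag is left out because holiday dates are fixed in advance.
UNKNOWN_AHEAD = ["Temperature", "Fuel_Price", "CPI", "Unemployment"]


def lag_exogenous(df: pd.DataFrame, cols=UNKNOWN_AHEAD, k: int = 1) -> pd.DataFrame:
    """Replace each column in `cols` with its value from k weeks earlier, per store."""
    cols = list(cols)
    out = df.sort_values(["Store", "Date"]).copy()
    lagged = out.groupby("Store")[cols].shift(k)
    lagged.columns = [f"{c}_lag{k}" for c in lagged.columns]
    # Drop the contemporaneous values so they cannot leak into the model.
    return pd.concat([out.drop(columns=cols), lagged], axis=1)
```

Comparing FS2 built this way against the oracle version would give a rough sense of how much of the FS2 gain depends on the oracle assumption.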

In [1]:
from __future__ import annotations

import json
import sys
from pathlib import Path

# Ensure PROJECT_ROOT is on sys.path so `import src.*` works reliably
NOTEBOOK_DIR = Path.cwd()
PROJECT_ROOT = NOTEBOOK_DIR
if not (PROJECT_ROOT / 'src').exists() and (PROJECT_ROOT.parent / 'src').exists():
    PROJECT_ROOT = PROJECT_ROOT.parent
sys.path.insert(0, str(PROJECT_ROOT))

from src.e0_ablation_utils import get_project_paths

paths = get_project_paths(project_root=PROJECT_ROOT, output_dir='outputs/E0_ablation')
DATA_PATH = paths.data_path
OUTPUT_DIR = paths.output_dir

print('PROJECT_ROOT:', PROJECT_ROOT)
print('DATA_PATH:', DATA_PATH)
print('OUTPUT_DIR:', OUTPUT_DIR)


PROJECT_ROOT: /home/sagemaker-user/TFMAXEL
DATA_PATH: /home/sagemaker-user/TFMAXEL/data/Walmart_Sales.csv
OUTPUT_DIR: /home/sagemaker-user/TFMAXEL/outputs/E0_ablation


In [2]:
import pandas as pd

from src.common import EXOG_COLUMNS, TEST_WEEKS, load_data, make_features, validate_split_consistency

df = load_data(DATA_PATH)
display(df.head())

print('shape:', df.shape)
print('stores:', df['Store'].nunique())
print('date range:', df['Date'].min(), '→', df['Date'].max())
print('missing values (sum):')
print(df.isna().sum())

|   | Store | Date | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2010-02-05 | 1643690.9 | 0 | 42.31 | 2.572 | 211.096358 | 8.106 |
| 1 | 1 | 2010-02-12 | 1641957.44 | 1 | 38.51 | 2.548 | 211.24217 | 8.106 |
| 2 | 1 | 2010-02-19 | 1611968.17 | 0 | 39.93 | 2.514 | 211.289143 | 8.106 |
| 3 | 1 | 2010-02-26 | 1409727.59 | 0 | 46.63 | 2.561 | 211.319643 | 8.106 |
| 4 | 1 | 2010-03-05 | 1554806.68 | 0 | 46.5 | 2.625 | 211.350143 | 8.106 |


shape: (6435, 8)
stores: 45
date range: 2010-02-05 00:00:00 → 2012-10-26 00:00:00
missing values (sum):
Store           0
Date            0
Weekly_Sales    0
Holiday_Flag    0
Temperature     0
Fuel_Price      0
CPI             0
Unemployment    0
dtype: int64


## Feature construction (from `src.common.make_features`)
`make_features` builds a *leakage-safe* feature set and returns: (1) a dataframe with the new columns and (2) a `feature_cols` list with the recommended column order for ML models.

For the ablation, we will **filter** that list to obtain FS0/FS1/FS2 without reimplementing any logic.

In [3]:
def select_fs0(feature_cols: list[str]) -> list[str]:
    # Lags + rolling stats only (no exogenous or calendar columns)
    return [c for c in feature_cols if c.startswith('lag_') or c.startswith('roll_')]


def select_fs1(feature_cols: list[str]) -> list[str]:
    # FS0 + calendar
    calendar = {'weekofyear', 'month', 'year'}
    return [c for c in feature_cols if (c.startswith('lag_') or c.startswith('roll_') or c in calendar)]


def select_fs2(feature_cols: list[str]) -> list[str]:
    # Full set (FS1 + exogenous)
    return list(feature_cols)

In [4]:
# FS0: build features without calendar and filter out exogenous columns
df_feat0, feature_cols0_all = make_features(df, add_calendar=False)
FS0_FEATURES = select_fs0(feature_cols0_all)

print('feature_cols0_all:', len(feature_cols0_all))
print('FS0_FEATURES:', len(FS0_FEATURES))
print('Example FS0:', FS0_FEATURES[:10])

assert all(c not in EXOG_COLUMNS for c in FS0_FEATURES)
assert all(c not in {'weekofyear','month','year'} for c in FS0_FEATURES)

feature_cols0_all: 16
FS0_FEATURES: 11
Example FS0: ['lag_1', 'lag_2', 'lag_4', 'lag_8', 'lag_52', 'roll_mean_4', 'roll_std_4', 'roll_mean_8', 'roll_std_8', 'roll_mean_12']


In [5]:
# FS1/FS2: build with calendar features
df_feat12, feature_cols12_all = make_features(df, add_calendar=True)

FS1_FEATURES = select_fs1(feature_cols12_all)
FS2_FEATURES = select_fs2(feature_cols12_all)

print('feature_cols12_all:', len(feature_cols12_all))
print('FS1_FEATURES:', len(FS1_FEATURES))
print('FS2_FEATURES:', len(FS2_FEATURES))

assert all(c not in EXOG_COLUMNS for c in FS1_FEATURES)
assert all(c in FS2_FEATURES for c in EXOG_COLUMNS)
assert all(c in FS1_FEATURES for c in ['weekofyear','month','year'])
assert all(c in FS2_FEATURES for c in ['weekofyear','month','year'])

feature_cols12_all: 19
FS1_FEATURES: 14
FS2_FEATURES: 19
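
Because FS0 comes from a separate `make_features` call, it may be worth confirming that the three sets are strictly nested (FS0 ⊆ FS1 ⊆ FS2), which the ablation design relies on. A minimal sketch, with `assert_nested` being a hypothetical helper rather than something from `src`:

```python
def assert_nested(*feature_sets: list[str]) -> None:
    """Raise AssertionError if any feature set is not contained in the next one."""
    for smaller, larger in zip(feature_sets, feature_sets[1:]):
        missing = [c for c in smaller if c not in larger]
        if missing:
            raise AssertionError(f"Feature sets are not nested; missing: {missing}")


# Usage in this notebook would be:
# assert_nested(FS0_FEATURES, FS1_FEATURES, FS2_FEATURES)
```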


## Anti-leakage sanity checks (lags/rolling)
We validate (on one example) that:
- `lag_1[t] == y[t-1]`
- `roll_mean_4[t] == mean(y[t-4], ..., y[t-1])` (the Python slice `y[t-4:t]`)

In [6]:
import numpy as np

store = int(df_feat12['Store'].iloc[0])
g = df_feat12[df_feat12['Store'] == store].sort_values('Date').reset_index(drop=True)

# pick a row where rolling is defined
idx = 60
y = g['Weekly_Sales'].values

lag_ok = np.isclose(g.loc[idx, 'lag_1'], y[idx-1])
roll_mean_ok = np.isclose(g.loc[idx, 'roll_mean_4'], np.mean(y[idx-4:idx]))

print('store:', store)
print('lag_1 check:', lag_ok)
print('roll_mean_4 check:', roll_mean_ok)

# Fail fast if something is wrong
assert lag_ok
assert roll_mean_ok

store: 1
lag_1 check: True
roll_mean_4 check: True
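
The check above covers a single row of a single store. It could be extended to every row by recomputing the expected values from the raw data and joining on `Store` and `Date`. This is a sketch under the assumption that `roll_mean_4` requires a full window of four past weeks; `verify_history_features` is a hypothetical helper, and only rows where both the stored and the expected value are defined are compared.

```python
import numpy as np
import pandas as pd


def verify_history_features(raw: pd.DataFrame, feat: pd.DataFrame, window: int = 4) -> dict[str, bool]:
    """Compare lag_1 and roll_mean_{window} in `feat` against values recomputed from `raw`."""
    base = raw.sort_values(["Store", "Date"]).copy()
    y_prev = base.groupby("Store")["Weekly_Sales"].shift(1)
    base["exp_lag_1"] = y_prev
    base["exp_roll_mean"] = y_prev.groupby(base["Store"]).transform(
        lambda s: s.rolling(window).mean()
    )
    merged = feat.merge(
        base[["Store", "Date", "exp_lag_1", "exp_roll_mean"]],
        on=["Store", "Date"],
        how="left",
    )
    results = {}
    for actual, expected in [("lag_1", "exp_lag_1"), (f"roll_mean_{window}", "exp_roll_mean")]:
        both = merged[[actual, expected]].dropna()
        # An empty comparison is reported as a failure rather than a pass.
        results[actual] = bool(len(both) > 0 and np.allclose(both[actual], both[expected]))
    return results


# Usage in this notebook would be:
# print(verify_history_features(df, df_feat12))
```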


## Save the FS0/FS1/FS2 specification
A JSON file is saved so that notebook 02 can load it without manually redefining the lists.

In [7]:
feature_sets = {
    'FS0': {
        'add_calendar': False,
        'exog_cols': [],
        'feature_cols': FS0_FEATURES,
        'notes': 'Target-history only: lags + rolling stats',
    },
    'FS1': {
        'add_calendar': True,
        'exog_cols': [],
        'feature_cols': FS1_FEATURES,
        'notes': 'FS0 + calendar (weekofyear, month, year)',
    },
    'FS2': {
        'add_calendar': True,
        'exog_cols': list(EXOG_COLUMNS),
        'feature_cols': FS2_FEATURES,
        'notes': 'FS1 + exogenous variables (oracle exog)',
    },
    'COMPLETAR': {
        'oracle_exog_assumption': '[TO COMPLETE: describe the actual availability of exogenous variables at deployment]',
    },
}

out_path = OUTPUT_DIR / 'feature_sets.json'
out_path.write_text(json.dumps(feature_sets, indent=2, ensure_ascii=False), encoding='utf-8')
print('Saved:', out_path)

Saved: /home/sagemaker-user/TFMAXEL/outputs/E0_ablation/feature_sets.json
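
On the consuming side, notebook 02 could read the file back along these lines. This is a sketch: `load_feature_sets` is a hypothetical helper, and it assumes that entries without a `feature_cols` key (such as the `COMPLETAR` placeholder) should simply be skipped.

```python
import json
from pathlib import Path


def load_feature_sets(path: Path) -> dict[str, list[str]]:
    """Map each feature-set name (FS0, FS1, FS2) to its ordered column list."""
    spec = json.loads(Path(path).read_text(encoding="utf-8"))
    # Skip entries without 'feature_cols' (e.g. the 'COMPLETAR' placeholder).
    return {
        name: list(entry["feature_cols"])
        for name, entry in spec.items()
        if isinstance(entry, dict) and "feature_cols" in entry
    }


# Usage in notebook 02 would be:
# feature_sets = load_feature_sets(OUTPUT_DIR / 'feature_sets.json')
```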


In [8]:
# Optional split validation (if metadata.json exists)
metadata_path = PROJECT_ROOT / 'outputs' / 'metadata.json'
if metadata_path.exists():
    validation = validate_split_consistency(PROJECT_ROOT / 'outputs', test_weeks=TEST_WEEKS)
    print(validation)

    # In a "from scratch" run, it is normal for predictions not to exist yet.
    issues = list(validation.get("issues", []) or [])
    only_missing_preds = (issues == ["No test_predictions.csv files found under outputs_dir"])
    if not validation.get("ok", False) and not only_missing_preds:
        raise AssertionError(issues)
    elif only_missing_preds:
        print("[WARN] Split OK, but there is no test_predictions.csv yet (no models have been run).")
else:
    print('metadata.json not found; skip split validation')


{'outputs_dir': '/home/sagemaker-user/TFMAXEL/outputs', 'test_weeks': 39, 'metadata_path': '/home/sagemaker-user/TFMAXEL/outputs/metadata.json', 'n_prediction_files': 6, 'issues': [], 'ok': True}
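
How `src.common` defines the split is not shown in this notebook. As an illustration only, the sketch below assumes the test set is the last `test_weeks` distinct dates, and shows scaling statistics taken from train only, as the anti-leakage rules require. Both helper names are hypothetical.

```python
import pandas as pd


def temporal_split(df: pd.DataFrame, test_weeks: int) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Hold out the last `test_weeks` distinct dates as the test set (assumed split rule)."""
    dates = df["Date"].drop_duplicates().sort_values()
    if not 0 < test_weeks < len(dates):
        raise ValueError("test_weeks must be between 1 and the number of distinct dates - 1")
    cutoff = dates.iloc[-test_weeks]
    return df[df["Date"] < cutoff], df[df["Date"] >= cutoff]


def scale_with_train_stats(train: pd.DataFrame, test: pd.DataFrame, cols: list[str]):
    """Standardize `cols` using the mean and std of the train rows only."""
    mean = train[cols].mean()
    std = train[cols].std(ddof=0).replace(0, 1.0)  # avoid division by zero
    return (train[cols] - mean) / std, (test[cols] - mean) / std
```

Fitting the statistics on train alone means the test weeks never influence how their own inputs are transformed.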
