# 02 - Prophet with regressors (per Store)

**Objective:** forecast weekly `Weekly_Sales` for each `Store` using Prophet with exogenous regressors.

## Experimental assumption (oracle exog)
All exogenous covariates are assumed to be available over the whole forecast horizon (oracle scenario), so the reported errors do not include any error from forecasting the covariates themselves.
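
To make the oracle assumption concrete, the following minimal sketch contrasts an oracle future frame with a naive non-oracle alternative. The `Temperature` column and all values are illustrative assumptions, not taken from the dataset used in this notebook.

```python
import pandas as pd

# Toy weekly history for one store; "Temperature" is an illustrative column name.
history = pd.DataFrame({
    "Date": pd.date_range("2012-01-06", periods=4, freq="W-FRI"),
    "Temperature": [40.0, 42.0, 45.0, 47.0],
})

# Values that are only observed later, during the forecast horizon.
future_actual = pd.DataFrame({
    "Date": pd.date_range("2012-02-03", periods=3, freq="W-FRI"),
    "Temperature": [50.0, 52.0, 55.0],
})

# Oracle scenario: the regressor values in the horizon are the ones actually observed.
oracle_future = future_actual.copy()

# Non-oracle alternative (one of many): carry the last known value forward.
naive_future = future_actual[["Date"]].copy()
naive_future["Temperature"] = history["Temperature"].iloc[-1]

print(oracle_future)
print(naive_future)
```

In a deployed setting only something like `naive_future` (or a separate forecast of each covariate) would be available, so metrics obtained under the oracle scenario are best read as an optimistic bound.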

## Standard outputs
- `outputs/predictions/prophet_regressors_predictions.csv` with columns: `Store, Date, y_true, y_pred, model`
- `outputs/metrics/prophet_regressors_metrics_global.csv`
- `outputs/metrics/prophet_regressors_metrics_by_store.csv`
- `outputs/figures/prophet_regressors_plot_*.png`
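
A shared output schema is easy to verify before saving. The sketch below checks a predictions frame against the five columns listed above. The helper `check_prediction_schema` is hypothetical and is not part of `src.common`.

```python
import pandas as pd

# Columns every predictions file is expected to contain (from the list above).
EXPECTED_COLUMNS = ["Store", "Date", "y_true", "y_pred", "model"]


def check_prediction_schema(df: pd.DataFrame) -> list:
    """Return the expected columns missing from df (empty list if none)."""
    return [c for c in EXPECTED_COLUMNS if c not in df.columns]


# Toy predictions frame with made-up values.
example = pd.DataFrame({
    "Store": [1, 1],
    "Date": pd.to_datetime(["2012-02-03", "2012-02-10"]),
    "y_true": [100.0, 120.0],
    "y_pred": [95.0, 125.0],
    "model": "prophet_regressors",
})

missing = check_prediction_schema(example)
print("Missing columns:", missing)
```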

In [1]:
# 0) Imports and configuration
from __future__ import annotations

import json
import sys
from pathlib import Path

import numpy as np
import pandas as pd

PROJECT_ROOT = Path.cwd().parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from src.common import (
    compute_metrics,
    load_data,
    make_features,
    save_outputs,
    temporal_split,
)

MODEL_NAME = 'prophet_regressors'
SEED = 42
np.random.seed(SEED)

DATA_PATH = PROJECT_ROOT / 'data' / 'Walmart_Sales.csv'
METADATA_PATH = PROJECT_ROOT / 'outputs' / 'metadata.json'
OUTPUTS_DIR = PROJECT_ROOT / 'outputs'

## 1) Load metadata (split + features)
Loading the split and the feature list from a shared metadata file keeps them consistent across models.

In [2]:
metadata = json.loads(METADATA_PATH.read_text(encoding='utf-8'))
split = metadata['split']
feature_cols = metadata['features']
print('Split:', split)
print('N features:', len(feature_cols))

Split: {'train_start': '2010-02-05', 'train_end': '2011-12-02', 'val_start': '2011-12-09', 'val_end': '2012-01-27', 'test_start': '2012-02-03', 'test_end': '2012-10-26'}
N features: 19
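
As a quick sanity check on the split printed above, the sketch below parses the six dates and verifies that they are in chronological order, with one week between the end of each block and the start of the next. The dictionary is copied from the output above. The variable names are illustrative.

```python
import pandas as pd

# Split boundaries copied from the metadata printed above.
split = {
    "train_start": "2010-02-05", "train_end": "2011-12-02",
    "val_start": "2011-12-09", "val_end": "2012-01-27",
    "test_start": "2012-02-03", "test_end": "2012-10-26",
}

dates = {k: pd.Timestamp(v) for k, v in split.items()}
order = ["train_start", "train_end", "val_start", "val_end", "test_start", "test_end"]
ordered = [dates[k] for k in order]

# Each boundary must come strictly after the previous one.
is_chronological = all(a < b for a, b in zip(ordered, ordered[1:]))

# Gaps between consecutive blocks, in days.
gap_train_val = (dates["val_start"] - dates["train_end"]).days
gap_val_test = (dates["test_start"] - dates["val_end"]).days

print(is_chronological, gap_train_val, gap_val_test)
```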


## 2) Data loading + features
- Parsing/sorting
- Building lag/rolling features (without leakage)
- Exogenous variables aligned by date
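
The internals of `make_features` are not shown in this notebook. As a minimal sketch of what leakage-free lag and rolling features usually look like, the example below shifts the target within each store before computing the rolling window, so a row never sees its own value or another store's values. The feature names `lag_1` and `roll_mean_2` are assumptions chosen to match the `lag_`/`roll_` prefixes used later.

```python
import pandas as pd

# Toy panel: two stores, five weeks each.
df = pd.DataFrame({
    "Store": [1] * 5 + [2] * 5,
    "Date": list(pd.date_range("2010-02-05", periods=5, freq="W-FRI")) * 2,
    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0, 50.0, 1.0, 2.0, 3.0, 4.0, 5.0],
}).sort_values(["Store", "Date"]).reset_index(drop=True)

# Lag: previous week's sales within the same store.
df["lag_1"] = df.groupby("Store")["Weekly_Sales"].shift(1)

# Rolling mean over the two previous weeks; shift(1) excludes the current week.
df["roll_mean_2"] = (
    df.groupby("Store")["Weekly_Sales"]
    .transform(lambda s: s.shift(1).rolling(2).mean())
)

print(df)
```

The first rows of each store are NaN by construction, which is why the cell below drops rows with missing feature values.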

In [3]:
df = load_data(DATA_PATH)
df_feat, _ = make_features(df, add_calendar=True)

# Important: to train, you must decide how to handle the NaNs created by lags/rolling
# Typical option: drop rows with NaNs in the features (at the start of each store's series)
model_df = df_feat.dropna(subset=feature_cols + ['Weekly_Sales']).copy()
model_df.shape

(4095, 22)

## 3) Temporal split
Reuses exactly the split defined in notebook 00.

In [4]:
train_df, val_df, test_df, split_cfg = temporal_split(df)

# Apply the split to model_df (already free of lag-induced NaNs)
train = model_df[model_df['Date'].between(split_cfg.train_start, split_cfg.train_end)].copy()
val = model_df[model_df['Date'].between(split_cfg.val_start, split_cfg.val_end)].copy()
test = model_df[model_df['Date'].between(split_cfg.test_start, split_cfg.test_end)].copy()

print(len(train), len(val), len(test))

1980 360 1755
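
The row counts above can be related to the split boundaries. The sketch below counts the weekly dates in each block and divides the row counts by them. It assumes one observation per store per Friday, which the split dates suggest but this notebook does not state.

```python
import pandas as pd

# Split boundaries and row counts copied from the outputs above.
split = {
    "train_start": "2010-02-05", "train_end": "2011-12-02",
    "val_start": "2011-12-09", "val_end": "2012-01-27",
    "test_start": "2012-02-03", "test_end": "2012-10-26",
}
rows = {"train": 1980, "val": 360, "test": 1755}

# Number of Fridays in each block (both endpoints included).
weeks = {
    name: len(pd.date_range(split[f"{name}_start"], split[f"{name}_end"], freq="W-FRI"))
    for name in rows
}

# Average number of rows per weekly date in each block.
rows_per_week = {name: rows[name] / weeks[name] for name in rows}

print(weeks)
print(rows_per_week)
```

If the validation and test ratios reflect the number of stores, a lower training ratio would be consistent with `dropna` having removed the earliest weeks of each store, where the lag features are undefined.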


## 4) Model training
One Prophet model per store, with exogenous regressors.

In [5]:
from warnings import filterwarnings

filterwarnings("ignore")

try:
    from prophet import Prophet
except ImportError as exc:
    raise ImportError(
        "Prophet is not installed. Install it with: pip install prophet"
    ) from exc

# Prophet uses the columns ds (date) and y (target); exogenous variables are added as regressors
prophet_exog_cols = [c for c in feature_cols if not c.startswith("lag_") and not c.startswith("roll_")]

preds = pd.Series(index=test.index, dtype=float)
failed_stores = []

for store, g_train in train.groupby("Store"):
    g_test = test[test["Store"] == store]
    if g_test.empty or g_train.empty:
        continue

    # Prophet format
    train_p = g_train[["Date", "Weekly_Sales"] + prophet_exog_cols].copy()
    train_p = train_p.rename(columns={"Date": "ds", "Weekly_Sales": "y"})
    test_p = g_test[["Date"] + prophet_exog_cols].copy()
    test_p = test_p.rename(columns={"Date": "ds"})

    try:
        m = Prophet(
            yearly_seasonality=True,
            weekly_seasonality=False,
            daily_seasonality=False,
        )
        for col in prophet_exog_cols:
            m.add_regressor(col)
        m.fit(train_p)
        forecast = m.predict(test_p)
        preds.loc[g_test.index] = forecast["yhat"].values
    except Exception:
        # Fallback: record the store and predict its training mean when Prophet fails
        failed_stores.append(int(store))
        preds.loc[g_test.index] = g_train["Weekly_Sales"].mean()

# Safety fill in case any prediction is still NaN
if preds.isna().any():
    preds = preds.fillna(train["Weekly_Sales"].mean())

y_pred_test = preds.values
print("Failed stores:", len(failed_stores))

Failed stores: 45
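
The run above reports 45 failed stores. With 1755 test rows and a test window spanning 39 Fridays, that appears to be every store, in which case all `y_pred` values come from the training-mean fallback rather than from Prophet. Because the loop discards the exception, the cause is not visible. The sketch below shows one way to keep the error messages. The helper `collect_fit_errors` and the toy fit function are hypothetical and do not call Prophet.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, Iterable


def collect_fit_errors(
    keys: Iterable[Hashable],
    fit_one: Callable[[Hashable], object],
) -> Dict[Hashable, str]:
    """Run fit_one for every key and return {key: 'ErrorType: message'} for failures."""
    errors = {}
    for key in keys:
        try:
            fit_one(key)
        except Exception as exc:  # keep the reason instead of discarding it
            errors[key] = f"{type(exc).__name__}: {exc}"
    return errors


def toy_fit(store: int) -> str:
    """Stand-in for a per-store fit; fails on even store ids for illustration."""
    if store % 2 == 0:
        raise ValueError("simulated failure")
    return "ok"


errors = collect_fit_errors([1, 2, 3, 4], toy_fit)

# Group identical messages to see whether all failures share one cause.
summary = Counter(errors.values())
print(errors)
print(summary)
```

When every store fails with the same message, the problem is more likely in the shared input preparation than in any individual series.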


## 5) Metrics (MAE, RMSE, sMAPE, WAPE)
Reported:
- Globally
- Per store
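
The implementation of `compute_metrics` lives in `src.common` and is not shown here. The sketch below uses common definitions of the four metrics and is an assumption about that function, not a copy of it. In particular, it assumes sMAPE is expressed in percent, WAPE as a fraction, and that no denominator is zero.

```python
import numpy as np


def sketch_metrics(y_true, y_pred) -> dict:
    """Common definitions of MAE, RMSE, sMAPE (percent) and WAPE (fraction)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)

    mae = abs_err.mean()
    rmse = np.sqrt(((y_true - y_pred) ** 2).mean())
    # Symmetric MAPE: error relative to the mean magnitude of actual and forecast.
    smape = 100.0 * np.mean(abs_err / ((np.abs(y_true) + np.abs(y_pred)) / 2.0))
    # Weighted APE: total absolute error over total absolute actuals.
    wape = abs_err.sum() / np.abs(y_true).sum()

    return {"MAE": mae, "RMSE": rmse, "sMAPE": smape, "WAPE": wape}


result = sketch_metrics([100.0, 200.0], [110.0, 180.0])
print(result)
```

MAE and RMSE are in sales units, so they are dominated by high-volume stores, while sMAPE and WAPE are scale-free and easier to compare across stores.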

In [6]:
pred_df = pd.DataFrame({
    'Store': test['Store'].astype(int).values,
    'Date': test['Date'].values,
    'y_true': test['Weekly_Sales'].values,
    'y_pred': np.asarray(y_pred_test, dtype=float),
    'model': MODEL_NAME,
})

global_metrics = compute_metrics(pred_df['y_true'].values, pred_df['y_pred'].values)
metrics_global_df = pd.DataFrame([{'model': MODEL_NAME, **global_metrics}])

by_store = []
for store, g in pred_df.groupby('Store'):
    m = compute_metrics(g['y_true'].values, g['y_pred'].values)
    by_store.append({'model': MODEL_NAME, 'Store': int(store), **m})
metrics_by_store_df = pd.DataFrame(by_store).sort_values('Store')

metrics_global_df, metrics_by_store_df.head()

(                model           MAE          RMSE     sMAPE      WAPE
 0  prophet_regressors  64022.118873  93903.703841  6.494456  0.061357,
                 model  Store           MAE           RMSE     sMAPE      WAPE
 0  prophet_regressors      1  90027.212972  118384.510366  5.597127  0.056196
 1  prophet_regressors      2  82061.244703  109853.537016  4.279222  0.042912
 2  prophet_regressors      3  29584.004103   37698.428905  7.061835  0.069742
 3  prophet_regressors      4  81904.468881  115096.776479  3.753702  0.037647
 4  prophet_regressors      5  21024.426300   27393.712420  6.352647  0.063162)

## 6) Saving the standard outputs

In [7]:
paths = save_outputs(
    model_name=MODEL_NAME,
    predictions=pred_df,
    metrics_global=metrics_global_df,
    metrics_by_store=metrics_by_store_df,
    output_dir=OUTPUTS_DIR,
)
paths

{'predictions': '/home/sagemaker-user/TFMAXEL/outputs/predictions/prophet_regressors_predictions.csv',
 'metrics_global': '/home/sagemaker-user/TFMAXEL/outputs/metrics/prophet_regressors_metrics_global.csv',
 'metrics_by_store': '/home/sagemaker-user/TFMAXEL/outputs/metrics/prophet_regressors_metrics_by_store.csv'}

## 7) Figures
- 3 stores: actual vs. prediction on the test set
- Error distribution (`y_true - y_pred`)

PNGs are saved to `outputs/figures/`.

In [8]:
import matplotlib.pyplot as plt
import seaborn as sns

FIG_DIR = OUTPUTS_DIR / "figures"
FIG_DIR.mkdir(parents=True, exist_ok=True)

# Select 3 stores (highest mean sales in test)
top_stores = (
    pred_df.groupby("Store")["y_true"]
    .mean()
    .sort_values(ascending=False)
    .head(3)
    .index
    .tolist()
)

for store in top_stores:
    g = pred_df[pred_df["Store"] == store].sort_values("Date")
    plt.figure(figsize=(10, 4))
    plt.plot(g["Date"], g["y_true"], label="y_true")
    plt.plot(g["Date"], g["y_pred"], label="y_pred")
    plt.title(f"Store {store} — Prophet")
    plt.xlabel("Date")
    plt.ylabel("Weekly_Sales")
    plt.legend()
    plt.tight_layout()
    plt.savefig(FIG_DIR / f"{MODEL_NAME}_plot_store_{store}.png", dpi=150)
    plt.close()

# Error distribution
errors = pred_df["y_true"] - pred_df["y_pred"]
plt.figure(figsize=(8, 4))
sns.histplot(errors, bins=30, kde=True)
plt.title("Error distribution (y_true - y_pred)")
plt.xlabel("Error")
plt.tight_layout()
plt.savefig(FIG_DIR / f"{MODEL_NAME}_plot_error_dist.png", dpi=150)
plt.close()