# DFM (Factor Regression) – Real Data Tuning

This notebook uses **real data** from `data/processed/` and estimates a **DFM-like** model:
PCA factors (with a fixed number `r`) plus Ridge/OLS on the factors, fitted strictly **train-only** and evaluated walk-forward.

**Workflow**:
1. Load the data (`target.csv`, `cleaned_features.csv`) and align the indices.
2. Optional: hook in TSFresh/Chronos blocks from Parquet (if available).
3. Test several factor counts `r` (Stage A, each with its own `model_name`).
4. Select the best `r` by Block 3 RMSE and run **Stage B** for the winner only.
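
The train-only, walk-forward idea can be sketched with NumPy alone. This is a minimal illustration on synthetic data, not the implementation in `src.models.dfm`: the helper names (`fit_pca_ridge`, `predict_one`, `walk_forward`) are hypothetical, and it assumes the rows of `X` are already lagged so that row `t` is known when `y[t]` is forecast.

```python
import numpy as np

def fit_pca_ridge(X_train, y_train, r, alpha):
    """Fit r PCA factors and a ridge regression on them, using training rows only."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    sd[sd == 0] = 1.0                       # guard against constant columns
    Z = (X_train - mu) / sd
    # Principal axes from the SVD of the standardized training matrix
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    V = Vt[:r].T                            # loadings, shape (n_features, r)
    factors = Z @ V                         # training factors (centered)
    y_mean = y_train.mean()
    # Closed-form ridge on the factors; alpha = 0 reduces to OLS
    gram = factors.T @ factors + alpha * np.eye(r)
    beta = np.linalg.solve(gram, factors.T @ (y_train - y_mean))
    return mu, sd, V, beta, y_mean

def predict_one(x_new, params):
    """Project one new row with the training-time scaling and loadings."""
    mu, sd, V, beta, y_mean = params
    return float(((x_new - mu) / sd) @ V @ beta + y_mean)

def walk_forward(X, y, first_origin, r, alpha):
    """Refit at every origin t on rows [0, t) and predict row t."""
    preds = []
    for t in range(first_origin, len(y)):
        params = fit_pca_ridge(X[:t], y[:t], r, alpha)
        preds.append(predict_one(X[t], params))
    return np.array(preds)

# Synthetic demo data (assumption: stands in for the real feature matrix)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 10))
y_demo = X_demo[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=120)

preds = walk_forward(X_demo, y_demo, first_origin=100, r=3, alpha=0.1)
rmse = float(np.sqrt(np.mean((y_demo[100:] - preds) ** 2)))
print(len(preds), rmse)
```

Scaling, loadings, and coefficients are all re-estimated from rows before the forecast origin, so no information from the evaluated month enters the fit.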


In [11]:

import os, sys, json
from pathlib import Path
import numpy as np, pandas as pd

# Set the project path (the notebook lives in repo/notebooks)
PROJECT_ROOT = Path.cwd().resolve().parent
os.environ["PROJECT_ROOT"] = str(PROJECT_ROOT)
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from src.config import GlobalConfig, DEFAULT_CORR_SPEC, EWMA_CORR_SPEC, outputs_for_model, STAGEA_DIR, STAGEB_DIR
from src.tuning import run_stageA, run_stageB
from src.io_timesplits import load_target, load_ifo_features
import src.features as F

from src.models.dfm import ForecastModel

print("PROJECT_ROOT:", PROJECT_ROOT)


PROJECT_ROOT: /Users/jonasschernich/Documents/Masterarbeit/Code


In [12]:

# 1) Load data
y = load_target()          # repo/data/processed/target.csv
X = load_ifo_features()    # repo/data/processed/cleaned_features.csv

# Align on the common index
idx = y.index.intersection(X.index)
y = y.loc[idx]
X = X.loc[idx]

print("Shapes:", X.shape, y.shape)
print("Dates:", y.index.min(), "→", y.index.max())


Shapes: (408, 20) (408,)
Dates: 1991-01-01 00:00:00 → 2024-12-01 00:00:00
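
The alignment step can be checked on a toy example. The sketch below uses made-up monthly data (`y_demo`, `X_demo` are illustrative names) and adds two sanity checks that matter when later splits address rows by position: the shared index should be sorted and free of duplicates.

```python
import pandas as pd

# Two monthly series that overlap only partly
dates_y = pd.date_range("2020-01-01", periods=6, freq="MS")   # Jan..Jun 2020
dates_X = pd.date_range("2020-03-01", periods=6, freq="MS")   # Mar..Aug 2020
y_demo = pd.Series(range(6), index=dates_y, name="target", dtype=float)
X_demo = pd.DataFrame({"f1": range(6), "f2": range(10, 16)}, index=dates_X)

# Keep only the dates present in both objects
common = y_demo.index.intersection(X_demo.index)
y_al, X_al = y_demo.loc[common], X_demo.loc[common]

# Positional walk-forward splits rely on identical, sorted, unique indices
assert y_al.index.equals(X_al.index)
assert common.is_monotonic_increasing and common.is_unique

print(X_al.shape, y_al.shape)   # (4, 2) (4,)
```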


In [14]:

# 2) Optionally hook in TSFresh/Chronos blocks from Parquet (if available)
ts_path = PROJECT_ROOT / "data" / "processed" / "tsfresh_slim.parquet"
ch_path = PROJECT_ROOT / "data" / "processed" / "chronos_1step.parquet"

def _try_hook_target_blocks():
    hooked = False
    try:
        if ts_path.exists():
            tsfresh_df = pd.read_parquet(ts_path)
            if "date" in tsfresh_df.columns:
                tsfresh_df["date"] = pd.to_datetime(tsfresh_df["date"])
                tsfresh_df = tsfresh_df.set_index("date")
            tsfresh_df = tsfresh_df.reindex(y.index)
        else:
            tsfresh_df = None

        if ch_path.exists():
            chronos_df = pd.read_parquet(ch_path)
            if "date" in chronos_df.columns:
                chronos_df["date"] = pd.to_datetime(chronos_df["date"])
                chronos_df = chronos_df.set_index("date")
            chronos_df = chronos_df.reindex(y.index)
        else:
            chronos_df = None

        if tsfresh_df is not None:
            F.tsfresh_block  = lambda y_s, I_t, W=12: tsfresh_df.loc[[y_s.index[I_t-1]]]
            hooked = True
        if chronos_df is not None:
            F.chronos_block  = lambda y_s, I_t, W=12: chronos_df.loc[[y_s.index[I_t-1]]]
            hooked = True
        print("Target-only hooks active:", hooked)
    except Exception as e:
        print("Hooking TSFresh/Chronos failed:", e)

#_try_hook_target_blocks()
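
The hook lambdas look up a single row with `df.loc[[label]]`. A small sketch of that lookup, under the assumption that `I_t` is the number of observations available at the forecast origin, so that `y_s.index[I_t-1]` is the last training date (the toy frame `block_df` is illustrative):

```python
import pandas as pd

idx = pd.date_range("2021-01-01", periods=5, freq="MS")
y_s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)
block_df = pd.DataFrame({"feat_a": [10, 11, 12, 13, 14]}, index=idx)

I_t = 3  # assumed meaning: observations available at the forecast origin
last_train_date = y_s.index[I_t - 1]

# A list selector returns a one-row DataFrame; a scalar would return a Series
row = block_df.loc[[last_train_date]]
as_series = block_df.loc[last_train_date]

print(type(row).__name__, row.shape, type(as_series).__name__)
```

Because the Parquet frames are reindexed to `y.index` beforehand, a date without precomputed features yields a row of `NaN` rather than a `KeyError`.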


In [15]:

# 3) Shared base configuration
base_cfg = GlobalConfig()
# Correlation: expanding (or EWMA via dict(EWMA_CORR_SPEC))
base_cfg.corr_spec = dict(DEFAULT_CORR_SPEC)
base_cfg.nuisance_seasonal = "auto"

# Lag/smoother/screening (adjust to more realistic sizes as needed)
base_cfg.lag_candidates = tuple(range(1, 12+1))
base_cfg.top_k_lags_per_feature = 1
base_cfg.k1_topk = 600
base_cfg.redundancy_method = "greedy"
base_cfg.redundancy_param = 0.9

# Splits (as in the text)
base_cfg.W0_A = 180
base_cfg.BLOCKS_A = [(181,200), (201,220), (221,240)]
base_cfg.W0_B = 240

# Online policy
base_cfg.policy_window = 12
base_cfg.policy_gain_min = 0.03
base_cfg.policy_cooldown = 3

base_cfg.to_dict()


{'seed': 123,
 'refresh_cadence_months': 12,
 'nuisance_seasonal': 'auto',
 'corr_spec': {'mode': 'expanding', 'window': None, 'lam': None},
 'lag_candidates': (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
 'top_k_lags_per_feature': 1,
 'use_rm3': True,
 'k1_topk': 600,
 'screen_threshold': None,
 'redundancy_method': 'greedy',
 'redundancy_param': 0.9,
 'dr_method': 'none',
 'pca_var_target': 0.95,
 'pca_kmax': 25,
 'pls_components': 2,
 'W0_A': 180,
 'BLOCKS_A': [(181, 200), (201, 220), (221, 240)],
 'W0_B': 240,
 'policy_window': 12,
 'policy_gain_min': 0.03,
 'policy_cooldown': 3}
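
The split settings can be sanity-checked before any model is run. The sketch below assumes 1-based, inclusive month positions and that each block trains on everything before its first OOS month; the Stage A log confirms this only for Block 1 (`train_end=180, OOS=181-200`). `describe_blocks` is a hypothetical helper, not part of `src`.

```python
W0_A = 180
BLOCKS_A = [(181, 200), (201, 220), (221, 240)]
W0_B = 240
T = 408  # rows after index alignment, taken from the shapes printed earlier

def describe_blocks(w0, blocks):
    """Validate that the blocks are contiguous and summarize each one."""
    rows = []
    prev_end = w0
    for block_id, (start, end) in enumerate(blocks, start=1):
        if start != prev_end + 1 or end < start:
            raise ValueError(f"block {block_id} is not contiguous: {(start, end)}")
        rows.append({
            "block": block_id,
            "train_end": start - 1,      # assumption for blocks 2 and 3
            "oos": (start, end),
            "n_months": end - start + 1,
        })
        prev_end = end
    return rows

blocks = describe_blocks(W0_A, BLOCKS_A)
for b in blocks:
    print(b)

# Stage A should end exactly where Stage B starts
assert blocks[-1]["oos"][1] == W0_B
remaining = T - W0_B
print("observations after W0_B:", remaining)
```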

In [20]:

# 4) Candidate factor counts r
candidate_r = [2, 3, 4, 5, 6]

# Model grid for DFM (Ridge/OLS): alpha = 0.0, 0.1, ..., 1.0
model_grid = [{"alpha": a / 10} for a in range(11)]

# Run Stage A for each r (separate model_names keep the outputs apart)
shortlists_by_r = {}
for r in candidate_r:
    cfg = GlobalConfig(**base_cfg.to_dict())
    # PCA as the "factor" extractor: force exactly r components with var_target=1.0 and kmax=r
    cfg.dr_method = "pca"
    cfg.pca_var_target = 1.0
    cfg.pca_kmax = r

    model_name = f"dfm_r{r}"
    outputs_for_model(model_name)
    print(f"\n=== Stage A: {model_name} ===")
    shortlist = run_stageA(
        model_name=model_name,
        model_ctor=lambda hp: ForecastModel(hp),
        model_grid=model_grid,
        X=X, y=y, cfg=cfg
    )
    shortlists_by_r[r] = shortlist

shortlists_by_r



=== Stage A: dfm_r2 ===
[Stage A][Block 1] train_end=180, OOS=181-200 | configs=11
  - Config 1/11: {'alpha': 0.0}
    · Month 1/20 processed | running RMSE=7.0631
    · Month 2/20 processed | running RMSE=5.9129
    · Month 3/20 processed | running RMSE=6.4685
    · Month 4/20 processed | running RMSE=6.5816
    · Month 5/20 processed | running RMSE=6.5361
    · Month 6/20 processed | running RMSE=6.6352
    · Month 7/20 processed | running RMSE=6.7848
    · Month 8/20 processed | running RMSE=6.8315
    · Month 9/20 processed | running RMSE=6.8668
    · Month 10/20 processed | running RMSE=7.0193
    · Month 11/20 processed | running RMSE=7.1630
    · Month 12/20 processed | running RMSE=7.0635
    · Month 13/20 processed | running RMSE=6.9634
    · Month 14/20 processed | running RMSE=6.8288
    · Month 15/20 processed | running RMSE=6.6402
    · Month 16/20 processed | running RMSE=6.5136
    · Month 17/20 processed | running RMSE=6.3865
    · Month 18/20 processed | running RMSE=

{2: [{'alpha': 0.1}, {'alpha': 0.2}],
 3: [{'alpha': 0.0}, {'alpha': 0.1}],
 4: [{'alpha': 0.0}, {'alpha': 0.1}],
 5: [{'alpha': 0.3}, {'alpha': 0.4}],
 6: [{'alpha': 0.3}, {'alpha': 0.4}]}
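
The loop above sets `pca_var_target = 1.0` and `pca_kmax = r` to force exactly `r` components. A minimal sketch of why that works, assuming the component count is the smaller of `kmax` and the number of components needed to reach the variance target. The actual rule inside `src` is not shown in this notebook, and `n_components` is a hypothetical helper.

```python
import numpy as np

def n_components(X, var_target, kmax):
    """Smallest k whose cumulative explained variance reaches var_target, capped at kmax."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    ratio = np.cumsum(s**2) / np.sum(s**2)
    # Small tolerance so floating-point rounding cannot hide the last component
    reached = np.nonzero(ratio >= var_target - 1e-12)[0]
    k_var = int(reached[0]) + 1 if reached.size else len(s)
    return min(k_var, kmax)

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 10))   # synthetic, full-rank feature matrix

# A target of 1.0 is only reached with all components, so the cap decides
k_forced = [n_components(X_demo, 1.0, r) for r in (2, 3, 4, 5, 6)]
# A lower target can return fewer components than the cap
k_loose = n_components(X_demo, 0.5, 10)
print(k_forced, k_loose)
```

With full-rank data the cumulative share stays below 1.0 until the last component, so `kmax` is the binding constraint and the number of factors equals `r`.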

In [21]:

# 5) Select the best r by Block 3 RMSE
def _read_block_rmse(model_name: str, block_id: int = 3):
    path = STAGEA_DIR / model_name / f"block{block_id}" / "rmse.csv"
    if path.exists():
        return pd.read_csv(path)
    return None

scores = []
for r in candidate_r:
    df = _read_block_rmse(f"dfm_r{r}", block_id=3)
    if df is not None and not df.empty:
        best = df.sort_values("rmse").iloc[0]
        scores.append({"r": r, "rmse": best["rmse"], "config_id": int(best["config_id"])})
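# Fixing the columns keeps sort_values from failing when no results were found
sel_df = pd.DataFrame(scores, columns=["r", "rmse", "config_id"]).sort_values("rmse")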
sel_df


   r      rmse  config_id
4  6  3.974206          1
3  5  4.120386          1
2  4  5.211788          1
1  3  5.283027          1
0  2  5.643985          3


In [9]:

# 6) Stage B for the winning r
assert not sel_df.empty, "No RMSE results found; please check Stage A."
best_r = int(sel_df.iloc[0]["r"])
best_model_name = f"dfm_r{best_r}"

# Load the shortlist (read it from file if it is missing from shortlists_by_r)
if best_r in shortlists_by_r and shortlists_by_r[best_r]:
    shortlist = shortlists_by_r[best_r]
else:
    # json is already imported in the first cell
    path = STAGEA_DIR / best_model_name / "shortlist.json"
    with open(path, "r") as f:
        shortlist = json.load(f)

print(f"Selected number of factors r={best_r} → running Stage B for {best_model_name}")
cfg = GlobalConfig(**base_cfg.to_dict())
cfg.dr_method = "pca"
cfg.pca_var_target = 1.0
cfg.pca_kmax = best_r

run_stageB(
    model_name=best_model_name,
    model_ctor=lambda hp: ForecastModel(hp),
    shortlist=shortlist,
    X=X, y=y, cfg=cfg,
    max_months=None  # full Stage B
)
print("Stage B done →", STAGEB_DIR / best_model_name)


Selected number of factors r=6 → running Stage B for dfm_r6
[Stage B] Month origin t=240 | evaluating 2 configs | active=1
[Stage B] Month origin t=241 | evaluating 2 configs | active=1
[Stage B] Month origin t=242 | evaluating 2 configs | active=1
[Stage B] Month origin t=243 | evaluating 2 configs | active=1
[Stage B] Month origin t=244 | evaluating 2 configs | active=1
[Stage B] Month origin t=245 | evaluating 2 configs | active=1
[Stage B] Month origin t=246 | evaluating 2 configs | active=1
[Stage B] Month origin t=247 | evaluating 2 configs | active=1
[Stage B] Month origin t=248 | evaluating 2 configs | active=1
[Stage B] Month origin t=249 | evaluating 2 configs | active=1
[Stage B] Month origin t=250 | evaluating 2 configs | active=1
[Stage B] Month origin t=251 | evaluating 2 configs | active=1
[Stage B] Month origin t=252 | evaluating 2 configs | active=1
[Stage B] Month origin t=253 | evaluating 2 configs | active=1
[Stage B] Month origin t=254 | evaluating 2 configs | active=1
[St
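
The log reports one active config per month, and the base configuration carries `policy_window`, `policy_gain_min` and `policy_cooldown`. How `run_stageB` combines them is not shown in this notebook. The following is only a guess at one plausible rule: switch to a challenger when its trailing-window RMSE beats the active config by at least a relative gain, and never switch again within the cooldown. `choose_active` and its arguments are hypothetical.

```python
import numpy as np

def choose_active(sq_errors, active, last_switch, t,
                  window=12, gain_min=0.03, cooldown=3):
    """Return (active_config, last_switch_month) after evaluating month t.

    sq_errors maps config_id -> list of squared forecast errors up to month t.
    This rule is an assumption, not the documented Stage B policy.
    """
    if t - last_switch < cooldown:
        return active, last_switch                      # still cooling down
    rmse = {c: float(np.sqrt(np.mean(e[-window:])))
            for c, e in sq_errors.items() if len(e) >= window}
    if active not in rmse or rmse[active] == 0:
        return active, last_switch                      # nothing to compare or improve
    best = min(rmse, key=rmse.get)
    gain = 1.0 - rmse[best] / rmse[active]              # relative RMSE improvement
    if best != active and gain >= gain_min:
        return best, t
    return active, last_switch

# Challenger 2 is clearly better over the last 12 months: switch at t=12
clear = {1: [1.0] * 12, 2: [0.5] * 12}
switched = choose_active(clear, active=1, last_switch=0, t=12)

# One month later the cooldown blocks any further change
held = choose_active(clear, active=2, last_switch=12, t=13)

# A gain below 3 % is not enough to switch
marginal = {1: [1.0] * 12, 2: [0.99] * 12}
stayed = choose_active(marginal, active=1, last_switch=0, t=12)
print(switched, held, stayed)
```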

In [22]:
import pandas as pd

MODEL = "dfm_r6"  # or your model name
preds_path = PROJECT_ROOT / f"outputs/stageB/{MODEL}/monthly/preds.csv"
df = pd.read_csv(preds_path)

# (a) RMSE per config over the entire Stage B OOS period:
rmse_by_config = (
    df.assign(se=lambda d: (d["y_true"] - d["y_pred"])**2)
      .groupby("config_id")["se"].mean().pow(0.5)
      .sort_values()
)
print("RMSE per config:\n", rmse_by_config)

# (b) RMSE of the forecasts actually deployed (active) over the entire OOS period:
active = df[df["is_active"] == True].copy()
rmse_active = ((active["y_true"] - active["y_pred"])**2).mean() ** 0.5
print("\nRMSE of the active forecasts (policy track):", rmse_active)


FileNotFoundError: [Errno 2] No such file or directory: '/Users/jonasschernich/Documents/Masterarbeit/Code/outputs/stageB/dfm_r6/monthly/preds.csv'
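
Since `preds.csv` was not found in this run, the two metrics can be illustrated on a small made-up frame instead. The column names `config_id`, `y_true`, `y_pred` and `is_active` come from the cell above; the values are invented for the example.

```python
import pandas as pd

# Three months, two configs per month, exactly one active forecast per month
demo = pd.DataFrame({
    "config_id": [1, 2, 1, 2, 1, 2],
    "y_true":    [1.0, 1.0, 2.0, 2.0, 3.0, 3.0],
    "y_pred":    [1.5, 1.0, 2.5, 2.0, 2.0, 3.0],
    "is_active": [True, False, True, False, False, True],
})
demo["se"] = (demo["y_true"] - demo["y_pred"]) ** 2

# (a) RMSE per config: every forecast of a config counts, active or not
rmse_by_config = demo.groupby("config_id")["se"].mean().pow(0.5).sort_values()

# (b) RMSE of the policy track: only the forecast that was active each month
rmse_active = demo.loc[demo["is_active"], "se"].mean() ** 0.5

print(rmse_by_config)
print(rmse_active)
```

The policy-track RMSE mixes forecasts from different configs, so it can lie between the per-config values, as it does here.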