# Volatility Forecasting and Volatility Control: Walk-Forward HOLDOUT Results

This notebook runs the repository’s walk-forward volatility forecasting experiment and reports:

- **Forecast evaluation on HOLDOUT**: QLIKE (primary), RMSE on the volatility scale (supporting), and ranking diagnostics.
- **Diebold–Mariano (DM) tests vs baseline** (Newey–West HAC; HAC lag grid as sensitivity check).
- **Vol-control strategy results under transaction costs** (no leverage; risky weight capped at 1).

The notebook is intentionally thin: all compute logic lives in `src/vol_forecast/`. This file is a reproducible driver that loads data, runs the experiment, and displays the resulting tables/figures.


In [1]:
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from vol_forecast.wf_config import WalkForwardConfig
from vol_forecast.runner.experiment_pipeline import build_experiment_df
from vol_forecast.runner.experiment import compute_experiment_report
from vol_forecast.models_tuning.tuning import tune_xgb_params_pre_holdout

from vol_forecast.runner.report_io import console_report, plot_report, save_report_csvs

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)
pd.set_option("display.max_rows", 200)
pd.set_option("display.float_format", lambda x: f"{x:,.6g}")



## Overview

### Target
We forecast **forward realized variance** over a fixed horizon (here `horizon=20` trading days), annualized using `freq=252`.
Because forward windows overlap, loss differentials are serially dependent (relevant for DM/HAC later).
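
A minimal sketch of how such a target could be constructed from daily log returns. It assumes the target for day $t$ is the annualized mean of squared log returns over days $t+1,\dots,t+h$; the helper name `forward_realized_variance` and the exact window alignment are assumptions, and the repository's definition in `src/vol_forecast/` may differ (for example by one day of alignment).

```python
import numpy as np
import pandas as pd

def forward_realized_variance(log_ret: pd.Series, horizon: int = 20, freq: int = 252) -> pd.Series:
    """Annualized realized variance over days t+1 .. t+horizon (assumed definition)."""
    sq = log_ret.pow(2)
    # Trailing sum over `horizon` days ending at each date ...
    trailing_sum = sq.rolling(horizon).sum()
    # ... shifted back so that the value stored at t covers t+1 .. t+horizon.
    fwd_sum = trailing_sum.shift(-horizon)
    return fwd_sum * (freq / horizon)

# Synthetic returns, for illustration only.
rng = np.random.default_rng(0)
idx = pd.bdate_range("2020-01-01", periods=60)
log_ret = pd.Series(rng.normal(0.0, 0.01, size=len(idx)), index=idx)

rvar_fwd = forward_realized_variance(log_ret, horizon=20, freq=252)

# Under this alignment the last `horizon` rows are NaN (the window runs past the sample end).
n_missing_tail = int(rvar_fwd.tail(20).isna().sum())
```

Under this definition, adjacent targets share `horizon - 1` of their `horizon` squared returns, which is the source of the serial dependence noted above.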

### HOLDOUT (what is evaluated, and why)
Even though forecasts are produced via a leakage-safe walk-forward procedure, we still define a fixed **HOLDOUT evaluation window**:

- `holdout_start_date` is the first date included in the HOLDOUT evaluation set (`t >= holdout_start_date`).
- All headline tables, diagnostics, and DM tests are computed **only on HOLDOUT**.

Purpose: keep any model selection strictly **pre-HOLDOUT** (in this repo, this applies only to **XGB hyperparameter tuning**), so we are not iterating on the final evaluation window.

### Baseline and models
**Baseline:** `rw_forecast_VAR` (a random-walk-style variance forecast: trailing realized variance over the horizon, lagged by 1 day).
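
As a rough illustration of the baseline, the sketch below computes a trailing annualized realized variance over `horizon` days and lags it by one day. The function name `rw_variance_forecast` is hypothetical, and the annualization convention is an assumption rather than the repository's confirmed implementation.

```python
import numpy as np
import pandas as pd

def rw_variance_forecast(log_ret: pd.Series, horizon: int = 20, freq: int = 252) -> pd.Series:
    """Trailing annualized realized variance over `horizon` days, lagged by one day."""
    trailing = log_ret.pow(2).rolling(horizon).sum() * (freq / horizon)
    # Lag by one day so the forecast for day t uses only returns up to t-1.
    return trailing.shift(1)

# Constant 1% daily returns make the arithmetic easy to follow: 252 * 0.01**2 = 0.0252.
idx = pd.bdate_range("2020-01-01", periods=30)
log_ret = pd.Series(0.01, index=idx)
forecast = rw_variance_forecast(log_ret)
```

With a 20-day window plus the one-day lag, the first 20 rows carry no forecast, which is the kind of warmup missingness the sanity checks below look for.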

**Models compared:**
- HAR (1/5/22-day components)
- GARCH and GJR-GARCH (Student-t innovations)
- XGB on HAR features, and XGB on HAR+VIX (both tuned pre-HOLDOUT; see tuning section)

### Forecast evaluation metrics
- **Primary:** QLIKE on the variance scale (lower is better).
- **Supporting:** RMSE on the volatility scale (`sqrt(var)`), included for interpretability (not the ranking objective).
- **Ranking diagnostic:** Spearman correlation between forecast vol and realized vol (ordering of regimes).
- **DM tests vs baseline**: Newey–West HAC standard errors over a HAC lag grid (sensitivity check).

### Strategy layer (capped vol-control; no leverage)

The vol-control backtest allocates between the risky equity index and a cash proxy: risky weight $w_t$, remainder held in cash $1-w_t$.

We map forecast volatility $\hat\sigma_t = \sqrt{\widehat{\mathrm{var}}_t}$ into the risky weight via:

$$
w_t = \min\left(1,\ \sigma_{\text{target}}/\hat\sigma_t\right).
$$

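The cap binds whenever $\hat\sigma_t \le \sigma_{\text{target}}$ (low-vol regimes), where the strategy cannot lever up to reach the target, so achieved realized volatility can differ from $\sigma_{\text{target}}$; outcomes also depend on turnover and transaction costs.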
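
The mapping above can be sketched in a few lines. `risky_weight` is a hypothetical helper name, and the inputs are made-up annualized variance forecasts, not values from this run.

```python
import numpy as np

def risky_weight(forecast_var, sigma_target: float = 0.10) -> np.ndarray:
    """Capped vol-control weight: w = min(1, sigma_target / sigma_hat)."""
    sigma_hat = np.sqrt(np.asarray(forecast_var, dtype=float))
    return np.minimum(1.0, sigma_target / sigma_hat)

# Annualized variance forecasts corresponding to vols of 20%, 10% and 5%.
forecast_var = np.array([0.04, 0.01, 0.0025])
weights = risky_weight(forecast_var, sigma_target=0.10)
cash_weights = 1.0 - weights
```

At a 20% forecast vol the risky weight is one half; at 5% the uncapped ratio would be 2, so the cap holds the weight at 1 and the portfolio runs below target.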




## Experiment configuration

Key settings are defined in the next cell:

- **Data window:** `data_start_date`, `data_end_date`
- **Evaluation boundary:** `holdout_start_date`
- **Target definition:** `horizon` (forward window, trading days), `freq` (annualization factor)
- **Walk-forward protocol:** `wf_cfg` (window type/size, minimum training size, refit cadence)
- **XGB tuning blocks:** `xgb_tuning_block_dates` / `xgb_tuning_blocks` (pre-HOLDOUT blocks used to select `xgb_params_overrides`)
- **Strategy grid:** `sigma_target`, `strategy_variants`, `tcost_grid_bps` (vol-target level, execution rule, transaction-cost levels)
- **DM sensitivity grid (used later):** `hac_lag_grid` (HAC lag choices for DM standard errors)

In [None]:
data_start_date = "2004-01-01"
# data_end_date = None
data_end_date = "2026-02-06"
holdout_start_date = "2015-01-01"

horizon = 20
freq = 252

wf_cfg = WalkForwardConfig(
    window_type="rolling",
    rolling_window_size=1000,
    min_train_size=500,
    refit_every=60,
)


sigma_target = 0.10
tcost_grid_bps = [0.0, 5.0, 10.0, 25.0]
hac_lag_grid = [20, 30, 40, 60]
strategy_variants = ["daily_reset", "band_no_trade"]

xgb_tuning_block_dates = [
    ("2005-01-01", "2006-12-31"),
    ("2008-01-01", "2009-12-31"),
    ("2011-01-01", "2012-12-31"),
    ("2013-06-01", "2014-11-30"),
]
xgb_tuning_blocks = [(pd.Timestamp(s), pd.Timestamp(e)) for s, e in xgb_tuning_block_dates]


## Build canonical experiment DataFrame

We build the canonical experiment dataset:

- `base_df`: aligned returns, cash proxy, VIX, engineered predictors, forward variance target, and the baseline forecast.
- `build_meta`: build diagnostics used for quick integrity checks (index integrity, coverage, missingness).



In [None]:
base_df, build_meta = build_experiment_df(
    start_date=data_start_date,
    end_date=data_end_date,
    horizon=horizon,
    freq=freq,
)

base_df.shape, base_df.index.min(), base_df.index.max()

((5558, 16),
 Timestamp('2004-01-05 00:00:00'),
 Timestamp('2026-02-05 00:00:00'))

## Data sanity checks (index + coverage)

Before looking at model tables, we verify the dataset build isn’t broken:

- **Index integrity:** no duplicate timestamps; gaps look like trading-day breaks (weekends/holidays).
- **Interior coverage:** core predictors and the forward target should be almost fully populated on the interior, with missingness mostly confined to the **warmup** at the start (features need lookback history) and the **cooldown** at the end (the forward target needs the next `horizon` trading days).
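
A sketch of how index diagnostics of this kind could be computed. The key names mirror the `core_index` output shown below, but the computation itself is an assumption (the repository's implementation is not shown here), and `index_diagnostics` is a hypothetical helper.

```python
import pandas as pd

def index_diagnostics(index: pd.DatetimeIndex) -> dict:
    """Duplicate count and calendar-day gap statistics for a date index."""
    # Calendar days between consecutive timestamps (first entry has no predecessor).
    gaps = index.to_series().diff().dt.days.dropna()
    return {
        "n": len(index),
        "n_dups": int(index.duplicated().sum()),
        "gap_days_mean": float(gaps.mean()),
        "gap_days_p95": float(gaps.quantile(0.95)),
        "gap_days_max": int(gaps.max()),
    }

# Business days in January 2024 (weekends only, no holiday calendar).
idx = pd.bdate_range("2024-01-01", "2024-01-31")
diag = index_diagnostics(idx)
```

On a clean trading-day index the maximum gap stays small (a weekend gives 3 calendar days; a weekend plus holidays gives a little more), so a large `gap_days_max` is a quick signal of missing data.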

In [None]:
core = build_meta["data_diag_core"]["core_index"]
pd.Series(core)

n                               5558
start            2004-01-05 00:00:00
end              2026-02-05 00:00:00
n_dups                             0
gap_days_mean                1.45168
gap_days_p95                       3
gap_days_max                       5
dtype: object

In [None]:
build_meta["data_diag_core"]["core_coverage"]

Unnamed: 0,col,nonNaN_full,pct_nonNaN_full,pct_nonNaN_interior,n_interior,head_warmup,tail_cooldown
0,log_dvhar_22d,5536,0.996042,1,5517,22,19
1,rw_forecast_VAR,5538,0.996402,1,5517,22,19
2,rvar_trail,5539,0.996582,1,5517,22,19
3,rvar_fwd,5539,0.996582,1,5517,22,19
4,log_rvar_fwd,5539,0.996582,1,5517,22,19
5,dlog_vix_5,5552,0.99892,1,5517,22,19
6,log_dvhar_5d,5553,0.9991,1,5517,22,19
7,log_dvhar_1d,5557,0.99982,1,5517,22,19
8,log_vix_lag1,5557,0.99982,1,5517,22,19
9,log_ret,5558,1.0,1,5517,22,19


## XGB tuning (strictly pre-HOLDOUT; blocked)

XGBoost hyperparameters can materially affect results, so we tune XGB without looking into HOLDOUT.

Procedure:
- We evaluate a small, fixed set of candidate parameter overrides by QLIKE on several pre-HOLDOUT validation blocks using walk-forward forecasts.
- We embargo the boundary by `horizon - 1` trading days before `holdout_start_date` so forward target windows used for tuning do not overlap HOLDOUT.
- We select by **median block QLIKE** (tie-breaker: **worst-block QLIKE**).

**Why median block QLIKE?**
Volatility is regime dependent and a single pooled pre-HOLDOUT score can be dominated by one regime. Using multiple blocks that span different environments (including a recent pre-HOLDOUT slice to avoid tuning only on older regimes) and selecting by the median block QLIKE makes the choice robust to any single extreme block.

Output:
- We select `xgb_params_overrides` using XGB(HAR) on the pre-HOLDOUT blocks, then apply the selected overrides to both XGB(HAR) and XGB(HAR+VIX) in the experiment run below.
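
The selection rule (median block QLIKE, ties broken by worst-block QLIKE) can be sketched as follows. The block scores are three candidate rows copied from this run's tuning output; the DataFrame layout is an illustration and not necessarily how `tune_xgb_params_pre_holdout` organizes its results internally.

```python
import pandas as pd

# Per-block QLIKE for candidates 1, 8 and 0 (values from this run's tuning table).
block_qlike = pd.DataFrame(
    {
        "block_0": [0.161976, 0.159327, 0.162759],
        "block_1": [4.77884, 4.7355, 4.99911],
        "block_2": [0.358574, 0.36775, 0.365254],
        "block_3": [0.258711, 0.25909, 0.262277],
    },
    index=pd.Index([1, 8, 0], name="candidate_id"),
)

summary = pd.DataFrame(
    {
        "median_block_qlike": block_qlike.median(axis=1),
        "worst_block_qlike": block_qlike.max(axis=1),
    }
)

# Primary key: median across blocks; tie-breaker: worst block (both lower-is-better).
ranked = summary.sort_values(["median_block_qlike", "worst_block_qlike"])
best_candidate = int(ranked.index[0])
```

Note that candidate 8 has a slightly better worst block than candidate 1, but the worst-block score only matters when medians tie, so candidate 1 is selected.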


In [None]:
xgb_params_overrides = None
xgb_tuning = {"status": "not_run", "best": None, "table": None}

best_meta, tune_table = tune_xgb_params_pre_holdout(
    df=base_df,
    holdout_start=pd.Timestamp(holdout_start_date),
    horizon=horizon,
    cfg=wf_cfg,
    blocks=xgb_tuning_blocks,
)
xgb_params_overrides = best_meta["best_params_overrides"]
xgb_tuning = {"status": "ran", "best": best_meta, "table": tune_table}

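# Print the tuning status and the selected overrides (a bare expression mid-cell would not be displayed).
print(xgb_tuning["status"], xgb_params_overrides)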

if xgb_tuning["status"] == "ran":
    display(xgb_tuning["table"].head(30))


Unnamed: 0,candidate_id,median_block_qlike,worst_block_qlike,overrides,block_0_qlike,block_0_n,block_1_qlike,block_1_n,block_2_qlike,block_2_n,block_3_qlike,block_3_n
0,1,0.308643,4.77884,{'max_depth': 2},0.161976,213,4.77884,505,0.358574,502,0.258711,378
1,8,0.31342,4.7355,"{'min_child_weight': 10, 'subsample': 0.7, 'co...",0.159327,213,4.7355,505,0.36775,502,0.25909,378
2,5,0.313676,5.09397,"{'subsample': 0.7, 'colsample_bytree': 0.7}",0.161541,213,5.09397,505,0.366602,502,0.26075,378
3,10,0.31369,4.88481,{'reg_lambda': 5.0},0.161025,213,4.88481,505,0.3683,502,0.259081,378
4,0,0.313765,4.99911,{},0.162759,213,4.99911,505,0.365254,502,0.262277,378
5,9,0.314425,4.95806,{'reg_lambda': 2.0},0.16188,213,4.95806,505,0.368852,502,0.259999,378
6,4,0.315527,4.9881,{'min_child_weight': 10},0.16262,213,4.9881,505,0.36933,502,0.261723,378
7,3,0.315647,4.8723,{'min_child_weight': 5},0.163735,213,4.8723,505,0.368714,502,0.262579,378
8,7,0.316375,5.18618,"{'max_depth': 4, 'min_child_weight': 10}",0.161492,213,5.18618,505,0.36629,502,0.266461,378
9,2,0.317713,5.07666,{'max_depth': 4},0.165759,213,5.07666,505,0.369254,502,0.266171,378


The table shows a pronounced spread between median and worst-block scores for every candidate: the 2008–2009 crisis block (`block_1`) is the worst block in each row, with QLIKE of roughly 4.7–5.2 versus 0.16–0.37 on the other blocks. This suggests that the regime sensitivity is not an artifact of the median-based selection rule, and that achieving regime-stable XGB performance would likely require a much more thorough tuning/search setup (and/or different features or modeling choices), not just small override tweaks.

## Forecast results (HOLDOUT)

This section reports out-of-sample forecasting performance on the **HOLDOUT** window only.

We present:
1) a **sample check** (availability)
2) **headline performance** on the full HOLDOUT
3) a **split-half stability check** (same headline table on two time-ordered halves)

In [None]:
# This is the pure compute step: fit walk-forward forecasts, build the holdout panel, compute eval tables (headline, DM, calibration, etc.), and run the strategy grid.

report = compute_experiment_report(
    base_df,
    horizon=horizon,
    freq=freq,
    wf_cfg=wf_cfg,
    garch_dist='t',
    holdout_start_date=holdout_start_date,
    hac_lag_grid=hac_lag_grid,
    run_strategy=True,
    strategy_variants=strategy_variants,
    sigma_target=sigma_target,
    tcost_grid_bps=tcost_grid_bps,
    xgb_params_overrides=xgb_params_overrides,
)

# Attach build + tuning context (same spirit as run_experiment)
report["build_meta"] = build_meta
report["tuning"] = {"xgb": {
    "status": xgb_tuning["status"],
    "best": xgb_tuning["best"],
    "table": (xgb_tuning["table"].to_dict(orient="records") if xgb_tuning["status"] == "ran" else None),
}}

report["meta"]["final_xgb_params"] = xgb_params_overrides
report["meta"]["xgb_tuning_status"] = xgb_tuning["status"]


### 1) Availability (common sample check)

Before interpreting rankings, we verify the HOLDOUT date coverage per model.  
If the HOLDOUT `intersection_n(all_models+baseline)` matches each model’s `n(target+model+baseline)`, then differences in QLIKE/RMSE cannot be attributed to differing evaluation sets.

In this run, all models have complete HOLDOUT coverage and are evaluated on the same set of dates.
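
A small sketch of the common-sample check on a toy panel. The column names `model_a` and `model_b` are hypothetical, and the values are made up to show a case where the check fails.

```python
import numpy as np
import pandas as pd

panel = pd.DataFrame(
    {
        "rvar_fwd": [0.02, 0.03, 0.05, np.nan],
        "model_a": [0.02, np.nan, 0.04, 0.03],
        "model_b": [0.03, 0.03, 0.05, 0.04],
    },
    index=pd.bdate_range("2015-01-02", periods=4),
)
forecast_cols = ["model_a", "model_b"]

# Per-model count: dates where both the target and that model's forecast exist.
n_per_model = {
    col: int(panel[["rvar_fwd", col]].notna().all(axis=1).sum()) for col in forecast_cols
}

# Common sample: dates where the target and every forecast exist.
common = panel.dropna(subset=["rvar_fwd", *forecast_cols])

# Scores are comparable across models only if each per-model count equals the intersection size.
same_sample = all(n == len(common) for n in n_per_model.values())
```

Here `model_b` is available on one more date than the intersection, so `same_sample` is false and its average loss would be computed on a different set of dates unless the panel is restricted to `common`.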

In [None]:
display(report["availability"])

Unnamed: 0,model,n(target+model),n(target+model+baseline),missing_%_in_holdout,baseline_n(target+baseline),intersection_n(all_models+baseline)
0,rw_forecast_VAR,2771,2771,0,2771,2771
1,har_daily_wf_forecast_var,2771,2771,0,2771,2771
2,xgb_har_wf_mean_var,2771,2771,0,2771,2771
3,xgb_harvix_wf_mean_var,2771,2771,0,2771,2771
4,garch_wf_forecast_var,2771,2771,0,2771,2771
5,gjr_wf_forecast_var,2771,2771,0,2771,2771


### 2) Headline table (full HOLDOUT)

- **Ranking metric:** **QLIKE** on the variance scale. It was chosen because it is a proper scoring rule, scale-aware, and asymmetric (it penalizes under-forecasting more than over-forecasting).

Note:
- `delta_qlike_vs_baseline < 0` means the model improves on the RW baseline on the same HOLDOUT dates.
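
To make the asymmetry concrete, the sketch below uses one common normalized form of QLIKE, $y/h - \log(y/h) - 1$, with realized variance $y$ and forecast variance $h$. This form is an assumption: the repository may use an equivalent variant such as $\log h + y/h$, which differs only by a term that does not depend on the forecast and therefore leaves `delta_qlike_vs_baseline` unchanged.

```python
import numpy as np

def qlike(realized_var, forecast_var):
    """Normalized QLIKE: zero for a perfect forecast, positive otherwise (assumed form)."""
    ratio = np.asarray(realized_var, dtype=float) / np.asarray(forecast_var, dtype=float)
    return ratio - np.log(ratio) - 1.0

realized = 0.04
loss_exact = float(qlike(realized, 0.04))  # forecast equals realized variance
loss_under = float(qlike(realized, 0.02))  # forecast is half of realized variance
loss_over = float(qlike(realized, 0.08))   # forecast is double realized variance
```

Missing by a factor of two on the low side costs $1 - \log 2 \approx 0.31$, while the same factor on the high side costs $\log 2 - 0.5 \approx 0.19$, which is why under-forecasting in high-volatility episodes weighs so heavily on average QLIKE.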

In [None]:
display(report["headline_full"])


Unnamed: 0,segment,model,n,qlike,delta_qlike_vs_baseline,rmse_vol
0,HOLDOUT_full,gjr_wf_forecast_var,2771,0.453064,-0.209597,0.0917729
1,HOLDOUT_full,garch_wf_forecast_var,2771,0.458214,-0.204448,0.0895283
2,HOLDOUT_full,har_daily_wf_forecast_var,2771,0.509717,-0.152945,0.0879647
3,HOLDOUT_full,xgb_harvix_wf_mean_var,2771,0.638979,-0.0236829,0.0908496
4,HOLDOUT_full,rw_forecast_VAR,2771,0.662662,0.0,0.0988165
5,HOLDOUT_full,xgb_har_wf_mean_var,2771,0.747362,0.0846999,0.0939925


### 3) Split-half headline tables (stability check)

Because volatility dynamics can be regime-dependent, we split HOLDOUT at `split_mid` into two time-ordered halves and compute the same headline table on each. If a model’s ranking changes materially across halves, its full-HOLDOUT average may mask regime sensitivity.
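
One way such a split could be implemented is by row count on the time-ordered HOLDOUT panel. This is an assumption (the repository may split by date instead), and `split_halves` is a hypothetical helper; the row-count rule is at least consistent with the half sizes of 1386 and 1385 reported below for `n = 2771`.

```python
import numpy as np
import pandas as pd

def split_halves(df: pd.DataFrame):
    """Split a time-ordered frame into two halves; the first half gets the extra row if n is odd."""
    n_first = (len(df) + 1) // 2
    return df.iloc[:n_first], df.iloc[n_first:]

# Placeholder panel with the same number of rows as this run's HOLDOUT sample.
idx = pd.bdate_range("2015-01-02", periods=2771)
holdout = pd.DataFrame({"loss": np.zeros(len(idx))}, index=idx)

half1, half2 = split_halves(holdout)
split_mid = half2.index[0]  # first date of the second half
```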

In [None]:
print("Split mid:", report["split_mid"])
display(report["headline_half1"])
display(report["headline_half2"])

Split mid: 2020-07-06 00:00:00


Unnamed: 0,segment,model,n,qlike,delta_qlike_vs_baseline,rmse_vol
0,HOLDOUT_half1,garch_wf_forecast_var,1386,0.645425,-0.315718,0.10523
1,HOLDOUT_half1,gjr_wf_forecast_var,1386,0.652168,-0.308976,0.109526
2,HOLDOUT_half1,har_daily_wf_forecast_var,1386,0.755893,-0.20525,0.106987
3,HOLDOUT_half1,rw_forecast_VAR,1386,0.961143,0.0,0.121441
4,HOLDOUT_half1,xgb_harvix_wf_mean_var,1386,1.04402,0.0828731,0.114655
5,HOLDOUT_half1,xgb_har_wf_mean_var,1386,1.22027,0.259124,0.118793


Unnamed: 0,segment,model,n,qlike,delta_qlike_vs_baseline,rmse_vol
0,HOLDOUT_half2,xgb_harvix_wf_mean_var,1385,0.233649,-0.130316,0.0579485
1,HOLDOUT_half2,gjr_wf_forecast_var,1385,0.253818,-0.110147,0.0696131
2,HOLDOUT_half2,har_daily_wf_forecast_var,1385,0.263363,-0.100601,0.0634553
3,HOLDOUT_half2,garch_wf_forecast_var,1385,0.270867,-0.0930972,0.070392
4,HOLDOUT_half2,xgb_har_wf_mean_var,1385,0.274114,-0.0898505,0.0596122
5,HOLDOUT_half2,rw_forecast_VAR,1385,0.363965,0.0,0.0691225


### What the headline tables show in this run

**Full HOLDOUT (QLIKE).**
- GARCH-family (GARCH, GJR) delivers the largest average improvements versus the RW baseline.
- HAR improves versus baseline, but less than GARCH-family (though HAR has the lowest RMSE(vol) in this run).
- XGB(HAR) underperforms the baseline.
- XGB(HAR+VIX) shows only a small improvement relative to the baseline.

**Split-half stability.**
- GARCH/GJR and HAR beat the baseline in both halves (their edge is not concentrated in a single sub-period), with GARCH/GJR showing the smallest half-to-half changes in QLIKE and HAR just behind.
- XGB variants are the most strongly period-dependent: weak in HOLDOUT half 1 but materially stronger in half 2 (with XGB(HAR+VIX) also ranking best by QLIKE in half 2). The full-HOLDOUT average therefore hides a large within-HOLDOUT swing.


**Note on excessive XGB period dependence**

The period dependence of XGB variants we see here is consistent with the earlier blocked tuning results (material block-to-block dispersion).

Likely drivers of the observed period dependence:
- **Walk-forward regime coverage (adaptation lag):** with rolling refits, XGB is trained on a recent window that may not be representative of the subsequent period; after a regime shift, performance can lag until enough of the new regime enters the training window.
- **Rare/extreme-state behavior:** gradient-boosted trees make threshold-based, piecewise-constant predictions anchored to regions well covered in training, which can lead to under-forecasting in rare/extreme states, especially when those states are under-represented. Under QLIKE this is particularly costly because under-forecasting is penalized more than over-forecasting, so these episodes can disproportionately affect performance.

## Supporting diagnostics (HOLDOUT)

The tables below help interpret the headline QLIKE results (ranking and DM tests vs baseline).

### Ranking diagnostic (Spearman on vol scale)

Spearman correlation measures whether the forecast **orders** higher- vs lower-volatility states correctly on HOLDOUT (monotonic association), and is largely insensitive to exact variance/volatility levels.

This complements QLIKE:
- A model can rank regimes well (high Spearman) but still score poorly on QLIKE if it makes large level errors on the variance scale.
- Conversely, QLIKE can improve through better level fit without being the strongest ranker.
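
The toy example below illustrates the first point with made-up numbers: a forecast that orders states perfectly but sits at a quarter of the right variance level, next to a forecast with some ordering mistakes but smaller relative errors. Spearman is computed as the Pearson correlation of ranks, and the QLIKE form $y/h - \log(y/h) - 1$ is an assumed convention.

```python
import numpy as np
import pandas as pd

def spearman(a: pd.Series, b: pd.Series) -> float:
    """Spearman correlation as the Pearson correlation of ranks."""
    return float(a.rank().corr(b.rank()))

def mean_qlike(realized_var: pd.Series, forecast_var: pd.Series) -> float:
    ratio = realized_var / forecast_var
    return float((ratio - np.log(ratio) - 1.0).mean())

realized_var = pd.Series([0.01, 0.02, 0.04, 0.09, 0.16])
biased = realized_var / 4.0                         # perfect ordering, level 4x too low
noisy = pd.Series([0.02, 0.01, 0.04, 0.16, 0.09])   # two adjacent pairs swapped

# Ranks are unchanged by the square root, so variance and vol scales give the same Spearman.
rho_biased = spearman(np.sqrt(realized_var), np.sqrt(biased))
rho_noisy = spearman(np.sqrt(realized_var), np.sqrt(noisy))
qlike_biased = mean_qlike(realized_var, biased)
qlike_noisy = mean_qlike(realized_var, noisy)
```

The biased forecast has the higher rank correlation but the much larger average QLIKE, because every observation is under-forecast by the same large factor.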

In [None]:
display(report["spearman_rank_vol"])

Unnamed: 0,model,n,spearman_rank_vol
0,xgb_harvix_wf_mean_var,2771,0.578738
1,gjr_wf_forecast_var,2771,0.570625
2,garch_wf_forecast_var,2771,0.559153
3,rw_forecast_VAR,2771,0.55359
4,har_daily_wf_forecast_var,2771,0.541891
5,xgb_har_wf_mean_var,2771,0.453624


XGB(HAR+VIX) ranks regimes best (Spearman ≈ 0.58), but as seen in the prior section its QLIKE is modest overall and strongly **period-dependent** across the two HOLDOUT halves. This reinforces the point that the main weakness is **variance-level fit** and likely under-forecasting in extreme/high-volatility states. XGB(HAR) is also weak on regime ordering, which helps explain why it underperforms XGB(HAR+VIX).







### DM tests vs baseline (HAC)

We report Diebold–Mariano (DM) tests comparing each model’s QLIKE loss series to the RW baseline on HOLDOUT.

- We define the loss differential at time $t$ as  
  $d_t = \ell_{model,t} - \ell_{baseline,t}$, where $\ell$ is QLIKE on the variance scale.  
  Negative $\overline{d}$ implies the model has lower average QLIKE than the baseline.
- Because the target is a 20-trading-day forward window, adjacent $d_t$ are serially dependent (overlapping windows).  
  We therefore use Newey–West (HAC) to compute the standard error of $\overline{d}$, and report results over a HAC truncation-lag grid as a sensitivity check.
- **Interpretation:** `mean_d` conveys direction + effect size; `dm_stat` / `p_value` summarize evidence relative to HAC-estimated noise.  
  DM is treated as **supporting context** alongside headline QLIKE and split-half stability, not as a model-selection criterion. P-values are **unadjusted** and shown for context only. Given multiple model-vs-baseline comparisons (and the HAC lag grid used as a sensitivity check), they shouldn't be read against a hard 5% threshold; strict inference would require a multiple-testing correction such as **Holm**.
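
A minimal sketch of a DM statistic with a Newey–West (Bartlett-kernel) long-run variance. It assumes a two-sided standard-normal reference distribution, which appears consistent with the `dm_stat`/`p_value` pairs in the table below; any small-sample corrections in the repository are not reproduced, and `dm_test_hac` is a hypothetical name. The loss differentials are simulated, not taken from this run.

```python
import math
import numpy as np

def dm_test_hac(d, hac_lag: int) -> dict:
    """DM statistic for mean(d) = 0 with a Newey-West long-run variance."""
    d = np.asarray(d, dtype=float)
    n = d.size
    mean_d = d.mean()
    u = d - mean_d
    # Lag-0 autocovariance, then Bartlett-weighted autocovariances up to hac_lag.
    lrv = np.dot(u, u) / n
    for k in range(1, hac_lag + 1):
        gamma_k = np.dot(u[k:], u[:-k]) / n
        lrv += 2.0 * (1.0 - k / (hac_lag + 1.0)) * gamma_k
    dm_stat = mean_d / math.sqrt(lrv / n)
    # Two-sided p-value under a standard normal reference (assumption).
    p_value = math.erfc(abs(dm_stat) / math.sqrt(2.0))
    return {"mean_d": float(mean_d), "dm_stat": float(dm_stat), "p_value": p_value}

# Simulated differentials with overlap-style dependence: a 20-day moving average of noise.
rng = np.random.default_rng(0)
n, window = 2000, 20
noise = rng.normal(size=n + window - 1)
d = np.convolve(noise, np.ones(window) / window, mode="valid") - 0.05

naive = dm_test_hac(d, hac_lag=0)   # ignores serial dependence
hac = dm_test_hac(d, hac_lag=20)    # accounts for it
```

With positively autocorrelated differentials the HAC standard error is larger than the naive one, so the HAC statistic is smaller in magnitude; ignoring the overlap would overstate the evidence.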

In [None]:
display(report["dm"])

Unnamed: 0,model,hac_lag,n,mean_d,dm_stat,p_value,mean_better_than_baseline
0,garch_wf_forecast_var,20,2771,-0.204448,-3.03382,0.00241475,True
1,garch_wf_forecast_var,30,2771,-0.204448,-2.94197,0.00326136,True
2,garch_wf_forecast_var,40,2771,-0.204448,-2.92614,0.00343191,True
3,garch_wf_forecast_var,60,2771,-0.204448,-2.91401,0.0035682,True
4,gjr_wf_forecast_var,20,2771,-0.209597,-2.9594,0.00308243,True
5,gjr_wf_forecast_var,30,2771,-0.209597,-2.86981,0.0041072,True
6,gjr_wf_forecast_var,40,2771,-0.209597,-2.84884,0.00438783,True
7,gjr_wf_forecast_var,60,2771,-0.209597,-2.84983,0.00437422,True
8,har_daily_wf_forecast_var,20,2771,-0.152945,-2.06769,0.0386696,True
9,har_daily_wf_forecast_var,30,2771,-0.152945,-2.00292,0.0451861,True


- **GARCH-family is strongly supported vs baseline:** GARCH and GJR have materially negative `mean_d` and consistently small p-values across HAC lags. This matches the headline QLIKE ranking.
- **HAR shows a modest DM signal:** HAR’s `mean_d` is negative, but p-values are only moderately small and sensitive to the HAC lag choice; the DM evidence is weaker than for GARCH/GJR, even though HAR improves QLIKE in both HOLDOUT halves and shows a non-trivial average effect size.
- **XGB(HAR) is not supported:** `mean_d` is positive with large p-values, consistent with underperforming the baseline on pooled HOLDOUT QLIKE.
- **XGB(HAR+VIX) is regime-dependent:** `mean_d` is slightly negative but p-values are large on pooled HOLDOUT, consistent with split-half behavior (weak in half 1, strong in half 2). In this run, pooled DM does not provide evidence of a stable improvement versus baseline.

## Strategy evaluation (HOLDOUT): capped vol-control under transaction costs


We apply the same capped vol-control policy to each volatility forecast on HOLDOUT and evaluate realized performance under transaction costs. We compare two execution variants:

- `daily_reset`: rebalance to the target risky weight each day  
- `band_no_trade`: rebalance only when the new target weight lies outside a ±5% band from the current weight (lower turnover)
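
The difference between the two execution rules can be sketched as below. The band is assumed to be an absolute ±0.05 in weight units and a trade is assumed to rebalance fully to the new target; both are assumptions about the repository's rule, and `apply_no_trade_band` is a hypothetical helper.

```python
import numpy as np

def apply_no_trade_band(target_weights, band: float = 0.05) -> np.ndarray:
    """Hold the current weight until the target moves outside the band, then jump to the target."""
    target_weights = np.asarray(target_weights, dtype=float)
    held = np.empty_like(target_weights)
    current = target_weights[0]
    for i, target in enumerate(target_weights):
        if abs(target - current) > band:
            current = target  # rebalance fully to the new target
        held[i] = current
    return held

# Made-up target weights that wobble around 0.60 before a larger move.
targets = np.array([0.60, 0.63, 0.59, 0.64, 0.60, 0.50])
held = apply_no_trade_band(targets, band=0.05)

# Turnover proxy: sum of absolute weight changes under each rule.
turnover_daily_reset = float(np.abs(np.diff(targets)).sum())
turnover_band = float(np.abs(np.diff(held)).sum())
```

The band rule ignores the small oscillations and trades only on the final move, which is how it lowers turnover relative to `daily_reset`.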

### How to read the tables (key columns)

- `avg_trade`: turnover proxy; primary driver of transaction-cost sensitivity as `tcost_bps` increases.
- `vol_ratio`: achieved realized volatility / target volatility; (>1) means the strategy ran above target, (<1) below target.
- `pct_capped`: fraction of days with full exposure ($w_t = 1$); higher values imply a more buy-and-hold-like exposure profile.
- `max_drawdown`: maximum peak-to-trough drawdown over HOLDOUT.
- `avg_risky_weight`: average equity weight.
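
A sketch of how these columns could be computed from a strategy's daily log returns and weights. The conventions (sample standard deviation, drawdown measured on compounded wealth starting at 1) are assumptions, `strategy_diagnostics` is a hypothetical helper, and the inputs are made up.

```python
import numpy as np

def strategy_diagnostics(strategy_log_ret, weights, sigma_target: float = 0.10, freq: int = 252) -> dict:
    r = np.asarray(strategy_log_ret, dtype=float)
    w = np.asarray(weights, dtype=float)
    ann_vol = r.std(ddof=1) * np.sqrt(freq)
    # Wealth path starting at 1, and its drawdown relative to the running peak.
    wealth = np.concatenate([[1.0], np.exp(np.cumsum(r))])
    drawdown = wealth / np.maximum.accumulate(wealth) - 1.0
    return {
        "ann_vol": float(ann_vol),
        "vol_ratio": float(ann_vol / sigma_target),
        "max_drawdown": float(drawdown.min()),
        "pct_capped": float(np.mean(w >= 1.0)),
        "avg_risky_weight": float(w.mean()),
    }

diag = strategy_diagnostics(
    strategy_log_ret=[0.10, -0.20, 0.05, 0.10],
    weights=[1.0, 0.5, 1.0, 0.8],
)
```

In the tables of this run, `vol_ratio` equals `ann_vol / sigma_target` and `sharpe` matches `ann_log_ret / ann_vol` to the displayed precision, which suggests the reported Sharpe is computed from annualized log return and annualized volatility.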

### Results 

We show one table per transaction-cost level in the grid, `tcost_bps ∈ {0, 5, 10, 25}`, sorted by Sharpe within each cost level.

In [None]:
df = report["strategy"]
# All transaction-cost levels present in the strategy results, in ascending order.
chosen = sorted(df["tcost_bps"].dropna().unique())

for t in chosen:
    display(
        df[df["tcost_bps"] == t]
        .sort_values("sharpe", ascending=False)
        .reset_index(drop=True)
    )

Unnamed: 0,tcost_bps,strategy,n,ann_log_ret,ann_simple_ret,ann_vol,vol_ratio,sharpe,max_drawdown,avg_risky_weight,avg_trade,pct_capped
0,0,gjr_wf_forecast_var__band_no_trade,2771,0.0831494,0.0867042,0.0841901,0.841901,0.987639,-0.101916,0.642564,0.025224,0.0
1,0,gjr_wf_forecast_var__daily_reset,2771,0.0838985,0.0875185,0.08595,0.8595,0.976132,-0.103081,0.656449,0.0337402,0.0
2,0,garch_wf_forecast_var__band_no_trade,2771,0.0858512,0.0896442,0.088804,0.88804,0.96675,-0.110205,0.658568,0.0209618,0.00216528
3,0,garch_wf_forecast_var__daily_reset,2771,0.0869417,0.0908331,0.0901315,0.901315,0.96461,-0.111019,0.668034,0.0318415,0.00433057
4,0,rw_forecast_VAR__band_no_trade,2771,0.0967167,0.101548,0.101954,1.01954,0.948631,-0.125568,0.754819,0.0117551,0.243955
5,0,rw_forecast_VAR__daily_reset,2771,0.0946172,0.099238,0.101917,1.01917,0.928377,-0.127352,0.756011,0.0166855,0.313605
6,0,har_daily_wf_forecast_var__band_no_trade,2771,0.0876583,0.091615,0.0986965,0.986965,0.88816,-0.157414,0.686387,0.019083,0.0378925
7,0,xgb_harvix_wf_mean_var__daily_reset,2771,0.099858,0.105014,0.112533,1.12533,0.887367,-0.193492,0.759806,0.0345942,0.123421
8,0,har_daily_wf_forecast_var__daily_reset,2771,0.0871019,0.0910078,0.0982049,0.982049,0.88694,-0.156104,0.686488,0.0299144,0.0458318
9,0,xgb_har_wf_mean_var__daily_reset,2771,0.103799,0.109378,0.118059,1.18059,0.879215,-0.215806,0.763627,0.0319826,0.0840852


Unnamed: 0,tcost_bps,strategy,n,ann_log_ret,ann_simple_ret,ann_vol,vol_ratio,sharpe,max_drawdown,avg_risky_weight,avg_trade,pct_capped
0,5,gjr_wf_forecast_var__band_no_trade,2771,0.0799712,0.0832559,0.0842022,0.842022,0.949752,-0.103951,0.642564,0.025224,0.0
1,5,garch_wf_forecast_var__band_no_trade,2771,0.0832103,0.0867703,0.0888115,0.888115,0.936932,-0.111153,0.658568,0.0209618,0.00216528
2,5,rw_forecast_VAR__band_no_trade,2771,0.0952348,0.0999171,0.101967,1.01967,0.933975,-0.126248,0.754819,0.0117551,0.243955
3,5,gjr_wf_forecast_var__daily_reset,2771,0.0796479,0.0829057,0.085959,0.85959,0.92658,-0.105799,0.656449,0.0337402,0.0
4,5,garch_wf_forecast_var__daily_reset,2771,0.08293,0.0864658,0.0901441,0.901441,0.919972,-0.112321,0.668034,0.0318415,0.00433057
5,5,rw_forecast_VAR__daily_reset,2771,0.0925146,0.0969292,0.101927,1.01927,0.907657,-0.128679,0.756011,0.0166855,0.313605
6,5,har_daily_wf_forecast_var__band_no_trade,2771,0.0852536,0.0889932,0.0987077,0.987077,0.863697,-0.157637,0.686387,0.019083,0.0378925
7,5,xgb_harvix_wf_mean_var__band_no_trade,2771,0.0957944,0.100533,0.112874,1.12874,0.848681,-0.197125,0.759268,0.0267573,0.104655
8,5,xgb_harvix_wf_mean_var__daily_reset,2771,0.0955,0.100209,0.112541,1.12541,0.848577,-0.193744,0.759806,0.0345942,0.123421
9,5,har_daily_wf_forecast_var__daily_reset,2771,0.0833333,0.086904,0.0982125,0.982125,0.848499,-0.156392,0.686488,0.0299144,0.0458318


Unnamed: 0,tcost_bps,strategy,n,ann_log_ret,ann_simple_ret,ann_vol,vol_ratio,sharpe,max_drawdown,avg_risky_weight,avg_trade,pct_capped
0,10,rw_forecast_VAR__band_no_trade,2771,0.0937528,0.0982883,0.101981,1.01981,0.919315,-0.126928,0.754819,0.0117551,0.243955
1,10,gjr_wf_forecast_var__band_no_trade,2771,0.0767929,0.0798184,0.0842165,0.842165,0.911851,-0.105982,0.642564,0.025224,0.0
2,10,garch_wf_forecast_var__band_no_trade,2771,0.0805692,0.0839039,0.0888202,0.888202,0.907104,-0.112101,0.658568,0.0209618,0.00216528
3,10,rw_forecast_VAR__daily_reset,2771,0.090412,0.0946251,0.101937,1.01937,0.886936,-0.130005,0.756011,0.0166855,0.313605
4,10,gjr_wf_forecast_var__daily_reset,2771,0.0753971,0.0783122,0.0859698,0.859698,0.877019,-0.108541,0.656449,0.0337402,0.0
5,10,garch_wf_forecast_var__daily_reset,2771,0.0789181,0.0821157,0.0901576,0.901576,0.875335,-0.113622,0.668034,0.0318415,0.00433057
6,10,har_daily_wf_forecast_var__band_no_trade,2771,0.0828488,0.0863775,0.09872,0.9872,0.839229,-0.157859,0.686387,0.019083,0.0378925
7,10,xgb_harvix_wf_mean_var__band_no_trade,2771,0.0924232,0.0968289,0.112884,1.12884,0.818743,-0.197324,0.759268,0.0267573,0.104655
8,10,xgb_har_wf_mean_var__band_no_trade,2771,0.0964896,0.101298,0.117916,1.17916,0.818291,-0.218115,0.760565,0.0239415,0.0396969
9,10,xgb_har_wf_mean_var__daily_reset,2771,0.0957414,0.100474,0.118075,1.18075,0.810852,-0.2163,0.763627,0.0319826,0.0840852


Unnamed: 0,tcost_bps,strategy,n,ann_log_ret,ann_simple_ret,ann_vol,vol_ratio,sharpe,max_drawdown,avg_risky_weight,avg_trade,pct_capped
0,25,rw_forecast_VAR__band_no_trade,2771,0.0893064,0.0934157,0.102027,1.02027,0.87532,-0.128963,0.754819,0.0117551,0.243955
1,25,rw_forecast_VAR__daily_reset,2771,0.0841036,0.0877416,0.101973,1.01973,0.824765,-0.134044,0.756011,0.0166855,0.313605
2,25,garch_wf_forecast_var__band_no_trade,2771,0.0726452,0.0753489,0.0888541,0.888541,0.817578,-0.115227,0.658568,0.0209618,0.00216528
3,25,gjr_wf_forecast_var__band_no_trade,2771,0.0672564,0.0695697,0.0842716,0.842716,0.798092,-0.112047,0.642564,0.025224,0.0
4,25,har_daily_wf_forecast_var__band_no_trade,2771,0.0756335,0.0785672,0.0987633,0.987633,0.765806,-0.158528,0.686387,0.019083,0.0378925
5,25,garch_wf_forecast_var__daily_reset,2771,0.0668816,0.0691688,0.0902036,0.902036,0.741451,-0.123314,0.668034,0.0318415,0.00433057
6,25,xgb_har_wf_mean_var__band_no_trade,2771,0.0874404,0.0913773,0.11794,1.1794,0.741398,-0.21867,0.760565,0.0239415,0.0396969
7,25,xgb_harvix_wf_mean_var__band_no_trade,2771,0.0823083,0.0857905,0.112923,1.12923,0.728887,-0.197921,0.759268,0.0267573,0.104655
8,25,gjr_wf_forecast_var__daily_reset,2771,0.0626433,0.064647,0.0860123,0.860123,0.728306,-0.122386,0.656449,0.0337402,0.0
9,25,buy_and_hold,2771,0.127961,0.136509,0.179389,1.79389,0.713314,-0.337905,1.0,0.0,1.0


What the tables show:

- **Realized risk differs materially across forecasts under the same cap.**  
  - GARCH/GJR run below the 10% target (`vol_ratio` ≈ 0.84–0.90) with shallower drawdowns (≈ -10% to -12.3%).  
  - RW baseline stays near target (`vol_ratio` ≈ 1.02) but is fully invested a meaningful fraction of days (`pct_capped` ≈ 24%–31%), so part of its performance reflects higher average exposure in calm periods ($w_t=1$ more often).
  - XGB variants run above target (`vol_ratio` ≈ 1.13–1.18) with materially deeper drawdowns (≈ -19% to -22%), suggesting the forecast does not call for enough de-risking in high-volatility episodes. This is in line with the earlier note that XGB can under-forecast rare/extreme volatility and lag after regime shifts.

- **Transaction costs flip the ranking mainly through turnover.**  
   Transaction costs penalize turnover, so lower-turnover implementations retain more net performance as `tcost_bps` rises (favoring `band_no_trade`). GJR + `band_no_trade` leads at 0 and 5 bps, while RW + `band_no_trade`, which has the lowest `avg_trade` of the vol-control variants (≈ 0.012), takes the top spot from 10 bps upward in this run.
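
As a rough check on the turnover mechanism, the annual cost drag can be approximated as `avg_trade × freq × tcost_bps / 10,000`. This assumes a proportional cost on the absolute daily weight change (the repository's cost model is not shown here); `annual_cost_drag` is a hypothetical helper, and the `avg_trade` values are copied from the tables above.

```python
def annual_cost_drag(avg_trade: float, tcost_bps: float, freq: int = 252) -> float:
    """Approximate annual return lost to costs: average daily trade x days per year x cost rate."""
    return avg_trade * freq * tcost_bps / 1e4

# avg_trade per strategy, taken from the strategy tables of this run.
avg_trade = {
    "rw_forecast_VAR__band_no_trade": 0.0117551,
    "gjr_wf_forecast_var__band_no_trade": 0.025224,
    "gjr_wf_forecast_var__daily_reset": 0.0337402,
}

drag_25bps = {name: annual_cost_drag(trade, tcost_bps=25.0) for name, trade in avg_trade.items()}
```

At 25 bps this gives roughly 0.74% per year for RW + `band_no_trade` against roughly 1.59% for GJR + `band_no_trade`, close to the differences in `ann_log_ret` between the 0 bps and 25 bps tables (0.0967167 − 0.0893064 and 0.0831494 − 0.0672564). The gap in cost drag is larger than GJR's Sharpe advantage at zero cost can absorb, which is consistent with the ranking flip.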
   


### Conclusion

Across HOLDOUT in this run, GARCH/GJR are the strongest forecasters by QLIKE (HAR also improves), while XGB is less stable and does not deliver a robust QLIKE edge.

In the strategy layer, the main practical lesson is that transaction costs (via turnover) can overwhelm an incremental forecasting edge: better signals can lead at 0 bps, but as `tcost_bps` increases, rankings shift toward low-turnover implementations.

Strategy rankings here are protocol-dependent (horizon=20, rolling refits, cap, cash proxy, cost model).