# HAR-RV

---

## Introduction

In this notebook, realized volatility means the ex-post variability of returns: how much an asset's price actually moved over a past window. A standard measure is the realized variance $RV_t$, computed as the sum of squared intraday returns over day $t$. The HAR-RV (Heterogeneous Autoregressive Realized Volatility) model treats the daily $RV_t$ values as a time series and explains tomorrow's realized variance with a small set of lagged components at different horizons (daily, weekly, monthly). The goal is a simple, interpretable, and empirically strong forecasting model for $RV_{t+1}$ (or $\log RV_{t+1}$) that captures the persistence and apparent long-memory behavior seen in realized volatility.


## Intraday Returns to RV

Let $S_{t,i}$ denote the asset price at the end of intraday interval $i$ on day $t$ (with $S_{t,0}$ the day's opening price), and let $r_{t,i}$ be the corresponding intraday log-return:
$$
r_{t,i} = \ln\left(\frac{S_{t,i}}{S_{t,i-1}}\right),
$$
where $i = 1, \dots, n$ indexes the $n$ intraday intervals. The (daily) realized variance on day $t$ is
$$
RV_t = \sum_{i=1}^{n} r_{t,i}^2.
$$
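This $RV_t$ is a nonparametric estimator of the quadratic variation of the log price over day $t$: it requires no model for the return distribution, and in the absence of jumps and microstructure noise it converges to the integrated variance as the sampling interval shrinks.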
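As a minimal sketch of the definition above, the function below computes $RV_t$ from one day's price path. The function name `realized_variance` and the five prices are hypothetical values chosen for illustration, not market data.

```python
import numpy as np

def realized_variance(prices):
    """Sum of squared log-returns between consecutive intraday prices."""
    log_prices = np.log(np.asarray(prices, dtype=float))
    returns = np.diff(log_prices)        # r_{t,i} for i = 1, ..., n
    return float(np.sum(returns ** 2))

# Hypothetical day: an opening price followed by n = 4 interval closes.
day_prices = [100.0, 100.5, 100.2, 100.9, 100.6]
rv = realized_variance(day_prices)
print(f"RV_t = {rv:.3e}, daily volatility = {np.sqrt(rv):.4%}")
```

Taking the square root of $RV_t$ gives a daily volatility in return units, which is often easier to read than the variance itself.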


## HAR-RV Specification

To mimic the idea that different types of market participants operate at different horizons, HAR-RV constructs three overlapping averages of realized variance:

- Daily component:
$$
RV_t^d = RV_t.
$$

- Weekly component (5-day average):
$$
RV_t^w = \frac{1}{5} \sum_{i=0}^{4} RV_{t-i}.
$$

- Monthly component (22-day average):
$$
RV_t^m = \frac{1}{22} \sum_{i=0}^{21} RV_{t-i}.
$$
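The three components can be checked on a toy series. The sketch below assumes a pandas `Series` of daily RV values ordered oldest first; the helper name `har_components` and the values 1 to 30 are made up so that the window alignment is easy to verify by hand.

```python
import numpy as np
import pandas as pd

def har_components(rv: pd.Series) -> pd.DataFrame:
    """Daily, weekly (5-day) and monthly (22-day) trailing averages of RV."""
    out = pd.DataFrame({
        "RV_d": rv,
        "RV_w": rv.rolling(window=5).mean(),    # days t-4, ..., t
        "RV_m": rv.rolling(window=22).mean(),   # days t-21, ..., t
    })
    return out.dropna()   # the first 21 days have no 22-day average

# Toy input: RV equal to 1, 2, ..., 30 on consecutive days.
rv = pd.Series(np.arange(1.0, 31.0))
comps = har_components(rv)
print(comps.head(3))
```

Each rolling window ends at day $t$ and reaches backwards only, so every component uses information available at the close of day $t$ and nothing from day $t+1$.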

The canonical HAR-RV model on levels is
$$
RV_{t+1} = \beta_0 + \beta_d RV_t^d + \beta_w RV_t^w + \beta_m RV_t^m + u_{t+1},
$$
where $u_{t+1}$ is an innovation term with mean zero. A common empirical variant works on logs to stabilize variance and enforce positivity:
$$
\log RV_{t+1} = \beta_0 + \beta_d \log RV_t^d + \beta_w \log RV_t^w + \beta_m \log RV_t^m + u_{t+1}.
$$

The coefficients $\beta_d$, $\beta_w$, and $\beta_m$ measure how strongly the short-, medium-, and long-horizon components, respectively, feed into tomorrow's realized variance. Because the model is linear in these overlapping averages, it approximates the slow decay of autocorrelations (apparent "long memory") in realized volatility with a very parsimonious structure. In practice, HAR-RV provides a straightforward way to turn a high-frequency return history into volatility forecasts, which can then be used for risk management, position sizing, or as an input when comparing expected future realized volatility with implied volatility from options.
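To see why three slope coefficients are enough, one can substitute the definitions of $RV_t^d$, $RV_t^w$, and $RV_t^m$ into the levels equation. The result is an AR(22) whose lag coefficients are restricted to three steps. The symbol $\phi_i$ is introduced here only for illustration, and the identity holds for the levels specification, not for the log variant (which takes logs of the averages):
$$
RV_{t+1} = \beta_0 + \sum_{i=0}^{21} \phi_i \, RV_{t-i} + u_{t+1},
\qquad
\phi_i =
\begin{cases}
\beta_d + \dfrac{\beta_w}{5} + \dfrac{\beta_m}{22}, & i = 0, \\[6pt]
\dfrac{\beta_w}{5} + \dfrac{\beta_m}{22}, & 1 \le i \le 4, \\[6pt]
\dfrac{\beta_m}{22}, & 5 \le i \le 21.
\end{cases}
$$
An unrestricted AR(22) would need 22 slope parameters; HAR-RV spans the same lag range with three.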

## Fitting the HAR-RV Model

Once the realized variance series and the HAR components $RV_t^d, RV_t^w, RV_t^m$ are constructed, fitting HAR-RV reduces to a linear regression on realized (log) variance. In the log-HAR-RV specification, the model is
$$
\log RV_{t+1} = \beta_0
+ \beta_d \log RV_t^d
+ \beta_w \log RV_t^w
+ \beta_m \log RV_t^m
+ u_{t+1},
$$
where $u_{t+1}$ is an innovation term with mean zero. Defining the response and predictors as
$$
Y_t = \log RV_{t+1}, \quad
X_t = \big(1, \log RV_t^d, \log RV_t^w, \log RV_t^m\big),
$$
the parameter vector
$$
\boldsymbol{\beta} =
\begin{pmatrix}
\beta_0 \\
\beta_d \\
\beta_w \\
\beta_m
\end{pmatrix}
$$
is estimated by ordinary least squares:
$$
\hat{\boldsymbol{\beta}} =
\arg\min_{\boldsymbol{\beta}}
\sum_{t}
\big( Y_t - X_t \boldsymbol{\beta} \big)^2.
$$
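The minimization can be sketched with NumPy alone. The example below assumes synthetic Gaussian regressors standing in for the three log components and a hypothetical coefficient vector `beta_true`; it is not fitted to market data. It solves the least-squares problem with `np.linalg.lstsq` and cross-checks the answer against the normal equations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for (log RV^d, log RV^w, log RV^m); not market data.
n_obs = 500
Z = rng.normal(size=(n_obs, 3))
X = np.column_stack([np.ones(n_obs), Z])        # each row is X_t = (1, d, w, m)
beta_true = np.array([-1.0, 0.4, 0.3, 0.2])     # hypothetical coefficients
Y = X @ beta_true + 0.1 * rng.normal(size=n_obs)

# Least-squares solution of min_beta sum_t (Y_t - X_t beta)^2.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Same estimate from the normal equations (X'X) beta = X'Y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)

print("lstsq:           ", np.round(beta_hat, 3))
print("normal equations:", np.round(beta_normal, 3))
```

Both routes give the same estimate here; `lstsq` is generally the numerically safer choice when the regressors are strongly correlated, as overlapping averages tend to be.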

This corresponds to running an OLS regression with $Y_t = \log RV_{t+1}$ as the dependent variable and $(\log RV_t^d, \log RV_t^w, \log RV_t^m)$ plus an intercept as regressors. Given the estimated coefficients and the most recent components $(\log RV_t^d, \log RV_t^w, \log RV_t^m)$, the one-step-ahead forecast is
$$
\widehat{\log RV}_{t+1} =
\hat{\beta}_0
+ \hat{\beta}_d \log RV_t^d
+ \hat{\beta}_w \log RV_t^w
+ \hat{\beta}_m \log RV_t^m,
$$
and the corresponding forecast of realized variance is obtained by exponentiation:
$$
\widehat{RV}_{t+1} = \exp\big(\widehat{\log RV}_{t+1}\big).
$$
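The forecast step can be written as a small helper. This is a sketch under stated assumptions: the function name `forecast_rv` is hypothetical, the history is assumed to be ordered oldest first and to end at day $t$, and the coefficients below are illustrative values rather than estimates.

```python
import numpy as np

def forecast_rv(beta_hat, rv_history):
    """One-step-ahead RV forecast from log-HAR-RV coefficients.

    beta_hat   : (beta_0, beta_d, beta_w, beta_m)
    rv_history : daily realized variances, oldest first, ending at day t
    """
    rv = np.asarray(rv_history, dtype=float)
    if rv.size < 22:
        raise ValueError("need at least 22 daily RV observations")
    x_t = np.array([
        1.0,
        np.log(rv[-1]),            # log RV_t^d
        np.log(rv[-5:].mean()),    # log RV_t^w
        np.log(rv[-22:].mean()),   # log RV_t^m
    ])
    log_forecast = x_t @ np.asarray(beta_hat, dtype=float)
    return float(np.exp(log_forecast))   # back from logs to variance units

# Hypothetical coefficients and a flat RV history, chosen only so the
# arithmetic is easy to check by hand; they are not estimates from data.
beta_example = [0.0, 0.5, 0.3, 0.2]
rv_example = [1e-4] * 30
rv_next = forecast_rv(beta_example, rv_example)
print(rv_next)
```

With a flat history, a zero intercept, and slopes that sum to one, the forecast reproduces the flat level (up to floating-point rounding), which makes a convenient sanity check.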

For empirical work, it is standard to estimate $\hat{\boldsymbol{\beta}}$ on a training period and then generate out-of-sample forecasts on a separate test period to evaluate predictive accuracy and compare against simpler benchmarks (e.g., AR(1) on $\log RV_t$ or GARCH(1,1)).
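The train/test procedure can be sketched as follows. Everything here is an assumption made for illustration: the daily log RV series is simulated from a persistent AR(1), the 70/30 chronological split is arbitrary, and the helper names `trailing_mean` and `ols` are hypothetical. Because the simulated data are AR(1) by construction, the comparison only demonstrates the mechanics and says nothing about which model wins on real data.

```python
import numpy as np

def trailing_mean(x, w):
    """Mean of the w most recent values ending at each index (NaN until w exist)."""
    out = np.full(x.shape, np.nan)
    c = np.cumsum(np.insert(x, 0, 0.0))
    out[w - 1:] = (c[w:] - c[:-w]) / w
    return out

def ols(X, y):
    """Least-squares coefficients for y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Simulated daily log RV: a persistent AR(1) around -9 (illustrative only).
rng = np.random.default_rng(1)
T = 600
log_rv = np.empty(T)
log_rv[0] = -9.0
for t in range(1, T):
    log_rv[t] = -9.0 + 0.9 * (log_rv[t - 1] + 9.0) + 0.3 * rng.normal()
rv = np.exp(log_rv)

d = np.log(rv)
w = np.log(trailing_mean(rv, 5))
m = np.log(trailing_mean(rv, 22))

# Days t = 21, ..., T-2 have all three regressors and a next-day target.
idx = np.arange(21, T - 1)
y = d[idx + 1]
X_har = np.column_stack([np.ones(idx.size), d[idx], w[idx], m[idx]])
X_ar1 = X_har[:, :2]                   # benchmark: constant and log RV_t only

split = int(0.7 * idx.size)            # chronological split, no shuffling
b_har = ols(X_har[:split], y[:split])
b_ar1 = ols(X_ar1[:split], y[:split])

# Out-of-sample mean squared error of the log RV forecasts.
mse_har = np.mean((y[split:] - X_har[split:] @ b_har) ** 2)
mse_ar1 = np.mean((y[split:] - X_ar1[split:] @ b_ar1) ** 2)
print(f"test MSE  HAR: {mse_har:.4f}   AR(1): {mse_ar1:.4f}")
```

Splitting by time rather than at random keeps every test-period target strictly after the data used for estimation, which is what an out-of-sample forecast requires.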


In [2]:
import numpy as np
import statsmodels.api as sm
import yfinance as yf

ticker = "SPY"
data = yf.download(
    ticker,
    period="60d",      # Yahoo Finance serves 5-minute bars only for about the last 60 days
    interval="5m",
    progress=False
).dropna()


In [7]:
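# Log returns between consecutive 5-minute closes. Because .diff() runs over the
# whole series, the first return of each day spans the overnight gap.
data["log_ret"] = np.log(data["Close"]).diff()
data = data.dropna()
data["date"] = data.index.date

# Daily realized variance: sum of squared 5-minute log returns per date.
rv_daily = (
    data.groupby("date")["log_ret"]
    .apply(lambda x: np.sum(x**2))
    .to_frame(name="RV")
)

# HAR components: trailing means that include day t itself.
rv_daily["RV_d"] = rv_daily["RV"]
rv_daily["RV_w"] = rv_daily["RV"].rolling(window=5).mean()
rv_daily["RV_m"] = rv_daily["RV"].rolling(window=22).mean()
# Drop the first 21 days, for which the 22-day mean is undefined.
rv_daily = rv_daily.dropna()

# Log transform of the target series and of each component.
rv_daily["log_RV"]   = np.log(rv_daily["RV"])
rv_daily["log_RV_d"] = np.log(rv_daily["RV_d"])
rv_daily["log_RV_w"] = np.log(rv_daily["RV_w"])
rv_daily["log_RV_m"] = np.log(rv_daily["RV_m"])

# Target: next day's log RV. The final row has no target and is dropped.
rv_daily["log_RV_t_plus_1"] = rv_daily["log_RV"].shift(-1)
rv_model = rv_daily.dropna()

y = rv_model["log_RV_t_plus_1"]
X = rv_model[["log_RV_d", "log_RV_w", "log_RV_m"]]
X = sm.add_constant(X)

# OLS of log RV_{t+1} on a constant and the three log components.
model = sm.OLS(y, X).fit()
print(model.summary())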


                            OLS Regression Results                            
Dep. Variable:        log_RV_t_plus_1   R-squared:                       0.387
Model:                            OLS   Adj. R-squared:                  0.333
Method:                 Least Squares   F-statistic:                     7.160
Date:                Tue, 20 Jan 2026   Prob (F-statistic):           0.000748
Time:                        01:37:07   Log-Likelihood:                -41.695
No. Observations:                  38   AIC:                             91.39
Df Residuals:                      34   BIC:                             97.94
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -3.5694      2.600     -1.373      0.1