### Setting
Consider $X$ to be a set of $p$ inter- and intra-dependent time series with $T$ time steps. We are looking for a linear graphical model, that is, a prediction rule $f: \mathbb{R}^p \rightarrow \mathbb{R}^p$ that maps the current time step $X_t$ to the next time step $X_{t+1}$.

Let us consider the decomposition
$$R\left(\hat{f}_{n}\right)-R^{*}=\underbrace{\left(R\left(\hat{f}_{n}\right)-\inf _{f \in \mathcal{F}} R(f)\right)}_{\text {estimation error }}+\underbrace{\left(\inf _{f \in \mathcal{F}} R(f)-R^{*}\right)}_{\text {approximation error }} .$$

In our setting, $\mathcal{F}$ is the set of all weighted adjacency matrices that induce a directed acyclic graph. Since our data has been generated by such a matrix, the approximation error is zero. What remains is the estimation error, which can be controlled in the following ways:

1. Restrict the class $\mathcal{F}$ a priori. For example, we can
    - Restrict it to all the adjacency matrices with fewer than $k$ non-zero coefficients, where $k \in \mathbb{N}$. 
    - Require the matrix to be at most $kp$-sparse, where $p$ is the number of variables and $k \in \mathbb{N}$.
    - Require that each variable depends on at most $k$ variables, so that *each column* contains at most $k$ non-zero coefficients, where $k \in \mathbb{N}$.
2. Modify the empirical risk minimization criterion by penalizing "complicated" models. That is, solve $$\hat{f}_{n}=\underset{f \in \mathcal{F}}{\arg \min }\left\{\hat{R}_{n}(f)+C(f)\right\}$$ where $C(f)$ measures the complexity of $f$. This has the advantage that the data is used to decide how much complexity is adequate for a model. For example, with $W$ the weight matrix of $f$, we can
    - Penalize the number of non-zero edges: $C(f) = \lambda \|W\|_0$, where $\lambda \in \mathbb{R}_+$.
    - Penalize the absolute sum of coefficients: $C(f) = \lambda \|W\|_1$ (entrywise), where $\lambda \in \mathbb{R}_+$.
3. Use cross-validation to choose the model complexity (for example $k$ or $\lambda$) on held-out data.
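
As a minimal sketch of the constraints and penalties listed above, the NumPy snippet below evaluates them for a small matrix. The function names and the example matrix `W` are hypothetical and chosen only for illustration; columns are treated as the coefficients of one target variable, matching the per-column constraint above.

```python
import numpy as np

def l0_penalty(W, lam):
    """lam times the number of non-zero entries (edges) of W."""
    return lam * np.count_nonzero(W)

def l1_penalty(W, lam):
    """lam times the sum of absolute values of the entries of W."""
    return lam * np.abs(W).sum()

def max_column_support(W):
    """Largest number of non-zero coefficients in any single column of W."""
    return int(np.count_nonzero(W, axis=0).max())

def satisfies_column_constraint(W, k):
    """True if every column of W has at most k non-zero coefficients."""
    return max_column_support(W) <= k

# Hypothetical 3-variable example: strictly upper triangular, hence acyclic.
W = np.array([[0.0, 0.5,  0.0],
              [0.0, 0.0, -2.0],
              [0.0, 0.0,  0.0]])

print(l0_penalty(W, lam=0.1), l1_penalty(W, lam=0.1), max_column_support(W))
```

The $\ell_0$ penalty only counts edges, while the $\ell_1$ penalty also grows with the magnitude of the coefficients, which is why the two can rank candidate matrices differently.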

### Our Setting
We have time series data $X$, consisting of $p$ inter- and intra-dependent time series, all of equal length $T \in \mathbb{N}$ and sampled at equidistant time steps. We are looking for a graphical model that can predict the next timestep using the current timestep. So, using $X_t \in \mathbb{R}^p$, we want to predict $X_{t+1} \in \mathbb{R}^p$. Our graphical model is of the form $$\hat{X}_{t + 1} = X_t W,\qquad W \in \mathbb{R}^{p \times p}.$$ We are trying to learn a *prediction rule* $f$ that maps the current timestep $X_t$ to the next timestep $X_{t+1}$.
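
To make the model $\hat{X}_{t+1} = X_t W$ concrete, here is a small sketch that simulates such a series and recovers $W$ by ordinary least squares. The matrix `W_true`, the unit-variance Gaussian noise, and the helper names are assumptions made for illustration. Plain least squares is used here, without any acyclicity or sparsity constraint, so this is not the constrained estimator discussed in these notes.

```python
import numpy as np

def simulate(W, T, rng):
    """Generate X[t + 1] = X[t] @ W + noise, starting from X[0] = 0."""
    p = W.shape[0]
    X = np.zeros((T, p))
    for t in range(T - 1):
        X[t + 1] = X[t] @ W + rng.standard_normal(p)  # assumed N(0, I) noise
    return X

def fit_least_squares(X):
    """Regress X[1:] on X[:-1]; rows are time steps, columns are variables."""
    W_hat, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)
    return W_hat

rng = np.random.default_rng(0)

# Hypothetical ground truth; all eigenvalues lie inside the unit circle,
# so the simulated series does not blow up.
W_true = np.array([[0.5, 0.3, 0.0],
                   [0.0, 0.4, 0.2],
                   [0.0, 0.0, 0.6]])

X = simulate(W_true, T=5000, rng=rng)
W_hat = fit_least_squares(X)
print(np.round(W_hat, 2))
```

With row vectors $X_t$, column $j$ of $W$ holds the coefficients used to predict variable $j$, which is the convention the per-column sparsity constraint relies on.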

The true function is $f^*(X_t) = X_t W^*$, but $W^*$ is unknown and has to be estimated. Luckily, $f^*(\cdot)$ is linear and therefore Lipschitz continuous.

Let us consider the squared loss function, that is, $$l(\hat{y}, y) = \|\hat{y} - y\|_2^2.$$ Define the Bayes risk as $$R^* = \underset{f\ \text{measurable}}{\inf} R(f),$$ where $$R(f) = \mathbb{E}\left[l(f(X), Y)\right]$$ is the statistical risk of a prediction rule $f$, with $X = X_t$ the current timestep and $Y = X_{t+1}$ the next timestep.

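Now, we know that the Bayes predictor under the squared loss is the conditional mean, $$\begin{align*}f^*(X) &= \mathbb{E}\left[Y \mid X\right] \\ &= XW^*, \end{align*}$$ where the second equality assumes $Y = XW^* + \varepsilon$ with $\mathbb{E}\left[\varepsilon \mid X\right] = 0$ and noise covariance $\Sigma$. The Bayes risk for the squared loss function is then equal to $$\begin{align*}R^* = R(f^*) &= \mathbb{E}\left[\|f^*(X) - Y\|_2^2\right] \\ &= \text{Tr}(\Sigma).\end{align*}$$ For a linear rule $f(X) = XW$, the cross term vanishes because $\mathbb{E}\left[\varepsilon \mid X\right] = 0$, so the risk is $$R(f) = \text{Tr}(\Sigma) + \text{Tr}\left(\left(W - W^*\right)^T \mathbb{E}\left[X^T X\right] \left(W - W^*\right)\right),$$ and the excess risk is equal to $$\begin{align*}R(f) - R^* &= \text{Tr}\left(\left(W - W^*\right)^T \mathbb{E}\left[X^T X\right] \left(W - W^*\right)\right).\end{align*}$$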
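
As a numerical sanity check, the sketch below compares Monte Carlo estimates of the risk with the closed forms $R(f^*) = \text{Tr}(\Sigma)$ and $R(f) - R^* = \text{Tr}\left((W - W^*)^T \mathbb{E}[X^T X] (W - W^*)\right)$ for a linear rule $f(X) = XW$. It assumes $Y = XW^* + \varepsilon$ with zero-mean noise independent of $X$. To keep the second moment known exactly, it draws i.i.d. standard normal inputs (so $\mathbb{E}[X^T X] = I$) rather than a time series. All matrices and the diagonal noise covariance are hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200_000, 3

W_star = np.array([[0.5, 0.3, 0.0],
                   [0.0, 0.4, 0.2],
                   [0.0, 0.0, 0.6]])
W = W_star + 0.1                       # hypothetical, slightly wrong estimate
noise_var = np.array([0.5, 1.0, 1.5])  # assumed Sigma = diag(noise_var)

# Assumed i.i.d. standard normal inputs, so E[X^T X] is the identity.
X = rng.standard_normal((n, p))
Y = X @ W_star + rng.standard_normal((n, p)) * np.sqrt(noise_var)

def empirical_risk(W_candidate):
    """Average squared Euclidean error of the rule X -> X @ W_candidate."""
    return np.mean(np.sum((X @ W_candidate - Y) ** 2, axis=1))

D = W - W_star
second_moment = np.eye(p)
predicted_excess = np.trace(D.T @ second_moment @ D)
bayes_risk = noise_var.sum()           # Tr(Sigma)

observed_excess = empirical_risk(W) - empirical_risk(W_star)
print("Bayes risk: predicted", bayes_risk, "observed", empirical_risk(W_star))
print("Excess risk: predicted", predicted_excess, "observed", observed_excess)
```

The observed values agree with the predictions only up to Monte Carlo error, which shrinks as `n` grows.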