# Model Diagnostic Checking
## VIF

Assume we are doing multiple linear regression $y = X\beta +\epsilon$ where $X\in\mathbb R^{n\times (k+1)}$ has a first column of ones (the intercept) followed by $k$ feature columns.

<br>

We have assumed that the data matrix $X$ has full column rank. When $X$ is rank deficient, or has very small singular values, some features are (nearly) linear combinations of the others. This is called multicollinearity. It makes $X^TX$ ill-conditioned and the estimate of $\beta$ unstable, so we should detect it.


To test for multicollinearity, we take feature $j\ (1\leqslant j\leqslant k)$ and fit a linear model to the column $x_{*j}$ using the remaining columns $[x_{*0},x_{*1},\dotsc,x_{*(j-1)},x_{*(j+1)},\dotsc,x_{*k}]$. If the model has a high $R^2$, then feature $x_{*j}$ is likely collinear with the other features.

Explicitly, we can define VIF (variance inflation factor) to be
$${\rm VIF}_j = \frac{1}{1 - R_j^2} = \frac{1}{1 - \frac{{\rm SSR}_j}{{\rm SST}_j}}=\frac{{\rm SST}_j}{{\rm SSE}_j}$$
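where $R_j^2$ is the $R^2$ of the regression of $x_{*j}$ on the remaining features, and ${\rm SST}_j$, ${\rm SSR}_j$, ${\rm SSE}_j$ are the total, explained and residual sums of squares of that regression.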

<br>

When there is no multicollinearity, $R_j^2 = 0$ and ${\rm VIF}_j = 1$. When multicollinearity is pronounced, $R_j^2\rightarrow 1$ and ${\rm VIF}_j\rightarrow +\infty$.
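The definition can be tried out numerically. The following is a minimal sketch, assuming the features are passed without the intercept column; the helper name `vif` and the simulated data are illustrative choices, not part of the original notes.

```python
import numpy as np

def vif(features):
    """Return VIF_j = SST_j / SSE_j for each column of an n-by-k feature array."""
    n, k = features.shape
    out = np.empty(k)
    for j in range(k):
        target = features[:, j]
        # regress feature j on an intercept plus all the other features
        others = np.column_stack([np.ones(n), np.delete(features, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        sse = ((target - others @ coef) ** 2).sum()
        sst = ((target - target.mean()) ** 2).sum()
        out[j] = sst / sse
    return out

rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 3))
# a fourth feature that is almost a linear combination of the first two
collinear = Z[:, 0] + Z[:, 1] + 0.01 * rng.standard_normal(200)
features = np.column_stack([Z, collinear])

print(vif(Z))         # independent features: values close to 1
print(vif(features))  # features 0, 1 and 3 get very large values
```

Feature 2 is not involved in the near-linear relation, so its VIF stays close to 1 even in the second call.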


### Conditional Variance

**Theorem** Denote by $X_j\in\mathbb R^n$ the $(j+1)$-th column of $X$ (do not forget that the first column is full of ones). Then
$${\rm VIF}_j = \frac{{\rm Var}X_j}{{\rm Var}(X_j|X_0,X_1,\dotsc,X_{j-1},X_{j+1},\dotsc,X_k)}$$


**Proof** On the one hand, ${\rm Var}\, X_j = \frac{1}{n-1}\sum_{i=1}^n (x_{ij} - \bar x_{*j})^2=\frac{1}{n-1}{\rm SST}_j$. On the other hand, the denominator is the mean squared error of fitting $X_j$ linearly on the intercept and the remaining $k-1$ features, which with the same normalization is $\frac{1}{n-1}{\rm SSE}_j$. The ratio is therefore ${\rm SST}_j/{\rm SSE}_j = {\rm VIF}_j$.
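As a quick numerical check of this identity, the sketch below computes both sides independently for one feature: the left side from $R_j^2$ (taken as the squared correlation between $X_j$ and its fitted values, which agrees with $R^2$ when the fit includes an intercept), the right side as a ratio of variances. The data and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
A = rng.standard_normal((n, 3))
A[:, 2] += 0.8 * A[:, 0]          # make feature 2 correlated with feature 0

j = 2
xj = A[:, j]
others = np.column_stack([np.ones(n), np.delete(A, j, axis=1)])
coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
fitted = others @ coef

# left-hand side: 1 / (1 - R_j^2)
r2 = np.corrcoef(xj, fitted)[0, 1] ** 2
vif_from_r2 = 1.0 / (1.0 - r2)

# right-hand side: sample variance over residual variance, both normalized by n - 1
var_xj = np.var(xj, ddof=1)
resid_var = ((xj - fitted) ** 2).sum() / (n - 1)
vif_from_var = var_xj / resid_var

print(vif_from_r2, vif_from_var)  # the two should agree up to rounding
```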

### Inverse Entry

**Theorem** Denote by $X_j\in\mathbb R^n$ the $(j+1)$-th column of $X$ (do not forget that the first column is full of ones). Then $\frac{1}{{\rm SSE}_j}$ is the $(j+1,j+1)$ entry of the matrix $(X^TX)^{-1}$.


**Proof** Without loss of generality we may assume $X_j$ is the last column of $X$. Partition $X\in\mathbb R^{n\times (k+1)}$ as $[U,v]$ where $U\in\mathbb R^{n\times k}$ and $v\in\mathbb R^n$. Since ${\rm SSE}_j$ is the squared norm of the residual after projecting $v$ onto the column space of $U$, on the one hand, 
$${\rm SSE}_j= \Vert v - U(U^TU)^{-1}U^Tv\Vert^2= v^T(I - U(U^TU)^{-1}U^T)v. $$

On the other hand, $X^TX = \left[\begin{matrix} U^TU & U^Tv \\ v^TU & v^Tv\end{matrix}\right]$. Using the Schur complement of $U^TU$ we obtain the factorization

$$X^TX = \left[\begin{matrix} I &  \\ v^TU(U^TU)^{-1} & 1  \end{matrix}\right]
\left[\begin{matrix}  U^TU &   \\   & v^T(I - U(U^TU)^{-1}U^T)v\end{matrix}\right]
\left[\begin{matrix} I &   (U^TU)^{-1} U^Tv \\ & 1  \end{matrix}\right].$$

Hence, 

$$(X^TX)^{-1}= \left[\begin{matrix} I &   -(U^TU)^{-1} U^Tv \\ & 1  \end{matrix}\right] 
\left[\begin{matrix}  (U^TU)^{-1} &  \\   & \left[v^T(I - U(U^TU)^{-1}U^T)v\right]^{-1}\end{matrix}\right]\left[\begin{matrix} I &  \\ -v^TU(U^TU)^{-1} & 1  \end{matrix}\right]
.$$

Multiplying out the three factors, the bottom-right entry of $(X^TX)^{-1}$ is $\left[v^T(I - U(U^TU)^{-1}U^T)v\right]^{-1}$, which is $1/{\rm SSE}_j$.

```python
import numpy as np

X = np.random.randn(10, 5)
U, v = X[:, :-1], X[:, -1]

# SSE of regressing the last column on the other columns
# (X is used as is; no intercept column is added here)
coef = np.linalg.solve(U.T @ U, U.T @ v)
SSE = ((v - U @ coef) ** 2).sum()

# the two should be equal
print(1 / SSE, np.linalg.inv(X.T @ X)[-1, -1])
```

```
0.06239734207048804 0.06239734207048801
```
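Combining the two theorems gives ${\rm VIF}_j = {\rm SST}_j\cdot\left[(X^TX)^{-1}\right]_{j+1,j+1}$, so all VIFs can be read off a single matrix inverse. The sketch below checks this against explicit regressions; the simulated data and the names `vif_from_inverse` and `vif_direct` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 4
F = rng.standard_normal((n, k))
F[:, 3] += F[:, 0] - 0.5 * F[:, 1]      # introduce some collinearity
X = np.column_stack([np.ones(n), F])    # first column is the intercept

# VIFs from the diagonal of (X^T X)^{-1}, skipping the intercept entry
gram_inv_diag = np.diag(np.linalg.inv(X.T @ X))
sst = ((F - F.mean(axis=0)) ** 2).sum(axis=0)
vif_from_inverse = sst * gram_inv_diag[1:]

# reference: one explicit regression per feature
vif_direct = np.empty(k)
for j in range(1, k + 1):
    others = np.delete(X, j, axis=1)
    coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    sse = ((X[:, j] - others @ coef) ** 2).sum()
    vif_direct[j - 1] = sst[j - 1] / sse

print(vif_from_inverse)
print(vif_direct)
```

When $X^TX$ is badly conditioned, inverting it explicitly loses accuracy, so the regression-based computation may be preferable in that regime.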


## Heteroskedasticity

Sometimes the noise terms $\epsilon_i$ have different variances across observations. This is called heteroskedasticity.
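A small simulation can make this concrete. The sketch below is only an illustration under assumed parameters (noise standard deviation proportional to the feature), not a formal test: it fits ordinary least squares and compares the residual variance on the lower and upper halves of the feature range.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(1.0, 10.0, size=n)
eps = rng.standard_normal(n) * 0.5 * x   # noise standard deviation grows with x
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# split the residuals at the median of x and compare their spread
split = np.median(x)
var_low = resid[x < split].var(ddof=1)
var_high = resid[x >= split].var(ddof=1)
print(var_low, var_high)   # the second value should be clearly larger
```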

## Time Series 

Sometimes the noise terms $\epsilon_t$ are correlated with each other, e.g. in time series.


### Durbin-Watson Test 

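Assuming $\mathbb E(\epsilon_t) = 0$, the lag-one autocorrelation is $\rho = \frac{{\rm Cov}(\epsilon_t,\epsilon_{t-1})}{\sqrt{{\rm Var}(\epsilon_t) {\rm Var}(\epsilon_{t-1})}}= \frac{\mathbb E(\epsilon_t\epsilon_{t-1})}{\sqrt{\mathbb E(\epsilon_t^2)\mathbb E(\epsilon_{t-1}^2)}}$. Replacing the noise by the residuals $\hat\epsilon_t$, we can estimate it by
$$1 - \frac{\sum_{t=2}^n (\hat\epsilon_t - \hat\epsilon_{t-1})^2}{2\sum_{t=2}^n \hat\epsilon_t^2}
\approx\frac{\sum_{t=2}^n \hat\epsilon_t \hat\epsilon_{t-1}}{\sum_{t=2}^n\hat\epsilon_t^2}\approx \rho.$$
The first approximation expands the square and uses $\sum_{t=2}^n\hat\epsilon_{t-1}^2\approx\sum_{t=2}^n\hat\epsilon_t^2$. The Durbin-Watson statistic $d = \frac{\sum_{t=2}^n (\hat\epsilon_t - \hat\epsilon_{t-1})^2}{\sum_{t=1}^n \hat\epsilon_t^2}$ therefore satisfies $d\approx 2(1-\rho)$: values near 2 indicate no first-order autocorrelation, values near 0 positive autocorrelation, and values near 4 negative autocorrelation.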
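The estimate can be checked on simulated data. The sketch below assumes AR(1) noise with a chosen $\rho = 0.7$ and a simple one-feature regression; all parameter values and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, rho = 5000, 0.7

# AR(1) noise: eps_t = rho * eps_{t-1} + shock_t
shocks = rng.standard_normal(n)
eps = np.empty(n)
eps[0] = shocks[0]
for t in range(1, n):
    eps[t] = rho * eps[t - 1] + shocks[t]

x = rng.standard_normal(n)
y = 1.0 + 2.0 * x + eps
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# the two expressions from the formula, both using sums over t = 2..n
diff_sq = np.sum(np.diff(resid) ** 2)
rho_from_diff = 1.0 - diff_sq / (2.0 * np.sum(resid[1:] ** 2))
rho_from_products = np.sum(resid[1:] * resid[:-1]) / np.sum(resid[1:] ** 2)

# Durbin-Watson statistic, roughly 2 * (1 - rho)
dw = diff_sq / np.sum(resid ** 2)

print(rho_from_diff, rho_from_products, dw)
```

Both estimates should land near the chosen $\rho$, and they differ from each other only through the boundary terms of the sums.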