# 2 Multiple Linear Regression

Now we have multiple factors, and the linear regression has the form ($y_i,x_{ij}\in\mathbb R$)
$$y_i = \beta_0+\beta_1x_{i1}+\dotsc +\beta_k x_{ik}+\epsilon_i,$$
or in vector form, with $x_i = [1,x_{i1},\dotsc,x_{ik}]^T\in\mathbb R^{k+1}$ and $\beta = [\beta_0,\dotsc,\beta_k]^T\in\mathbb R^{k+1}$,
$$y_i = x_i^T\beta+\epsilon_i.$$

As before, we assume that the noise terms $\epsilon_i$ are independent with $\mathbb E(\epsilon_i )=0$ and ${\rm Var}(\epsilon_i)=\sigma^2$.

<br>

We can stack all $n$ observations into matrix form:
$y = [y_1,\dotsc,y_n]^T\in\mathbb R^n$, $X = [x_1,\dotsc,x_{n}]^T\in\mathbb R^{n\times (k+1)}$ and $\epsilon=[\epsilon_1,\dotsc,\epsilon_n]^T\in\mathbb R^n$, so that $y = X\beta+\epsilon$. As a consequence, $\mathbb E(y) =X\beta$ and ${\rm Cov}(y) = \sigma^2I_n$.

## Model

### Least Squares Estimator

$$\hat\beta\in{\rm argmin}_{b\in\mathbb R^{k+1}} \Vert y - Xb\Vert^2\quad\Leftrightarrow\quad X^TX\hat\beta = X^Ty $$

Proof: For arbitrary $b\in\mathbb R^{k+1}$,
$$\Vert y - Xb\Vert^2-\Vert y -X\hat\beta\Vert^2
=\Vert (y - X\hat\beta)+X(\hat\beta -b)\Vert^2 - \Vert y - X\hat \beta\Vert^2
=2(y-X\hat\beta)^TX(\hat \beta - b)+\Vert X(\hat\beta -b)\Vert^2.$$

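In particular, if $X^TX\hat\beta = X^Ty$, we have $2(y-X\hat\beta)^TX=0$ and thus, for every $b$,
$$\Vert y - Xb\Vert^2-\Vert y -X\hat\beta\Vert^2 = \Vert X(\hat\beta - b)\Vert^2\geqslant 0.$$

Conversely, if $g = X^T(y-X\hat\beta)\neq 0$, taking $b = \hat\beta + tg$ makes the difference equal to $-2t\Vert g\Vert^2+t^2\Vert Xg\Vert^2$, which is negative for small $t>0$, so $\hat\beta$ is not a minimizer.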

A solution $\hat \beta$ of these normal equations always exists, and one of the solutions is given by $\hat \beta = X^\dag y$, where $X^\dag$ is the pseudoinverse.

However, we shall further assume that $X^TX$ is nonsingular, so that the solution is unique and $\hat\beta =(X^TX)^{-1}X^Ty$.
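The normal equations can be checked numerically. The following is a minimal NumPy sketch on synthetic data; the sample size, coefficients, noise level, and seed are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Hypothetical synthetic data: n observations, k factors.
n, k = 100, 3
beta_true = np.array([1.0, 2.0, -0.5, 0.3])                  # [beta_0, ..., beta_k]
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # intercept column first
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Solve the normal equations X^T X beta = X^T y (X^T X assumed nonsingular).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The pseudoinverse solution coincides with it when X has full column rank.
beta_pinv = np.linalg.pinv(X) @ y

# The residual is orthogonal to every column of X.
residual = y - X @ beta_hat
normal_eq_gap = np.max(np.abs(X.T @ residual))

# Any other b gives a residual sum of squares at least as large.
b_other = beta_hat + rng.normal(size=k + 1)
rss_hat = np.sum(residual**2)
rss_other = np.sum((y - X @ b_other) ** 2)
```

Forming $X^TX$ explicitly mirrors the formula above; in practice a least squares routine such as `np.linalg.lstsq` is usually preferred for numerical stability.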

In this case, 

$${\rm Cov}(\hat \beta) = {\rm Cov}((X^TX)^{-1}X^T(X\beta + \epsilon))
= {\rm Cov}((X^TX)^{-1}X^T\epsilon)=\sigma^2(X^TX)^{-1}.$$

Here we have used the fact that ${\rm Cov}(Au) = A{\rm Cov}(u)A^T$.
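As a sanity check, the covariance formula can be compared with a Monte Carlo estimate. This is a sketch under assumed values (the design, $\beta$, $\sigma$, number of replications, and seed are all made up for illustration); the design matrix is held fixed and only the noise is redrawn.

```python
import numpy as np

rng = np.random.default_rng(1)      # arbitrary seed

n, k, sigma = 50, 2, 0.5            # illustrative values
beta = np.array([1.0, -2.0, 0.5])
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])

A = np.linalg.solve(X.T @ X, X.T)   # A = (X^T X)^{-1} X^T, shape (k+1, n)
theory = sigma**2 * np.linalg.inv(X.T @ X)

# Draw many noise vectors with X fixed; each row of eps is one replication.
reps = 20000
eps = rng.normal(scale=sigma, size=(reps, n))
beta_hats = beta + eps @ A.T        # row r is beta + A eps_r

# Sample covariance of the estimates (columns are the variables).
empirical = np.cov(beta_hats, rowvar=False)
max_abs_error = np.max(np.abs(empirical - theory))
```

With this many replications the entries of `empirical` should agree with `theory` up to sampling error of roughly one percent of the largest entry.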



### Hat Matrix

Note that $\hat y = X\hat \beta= X(X^TX)^{-1}X^Ty$. We denote $H = X(X^TX)^{-1}X^T$ and call it the hat matrix, since $\hat y = Hy$. Properties:

1. ${\rm tr}(H) = {\rm tr}((X^TX)^{-1}X^TX) = k+1$.
2. $H$ is symmetric.
3. $H$ is idempotent ($H^2=H$).
4. $(I - H)X = 0$.
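The four properties can be verified numerically on a random design. A minimal sketch, assuming an arbitrary full-rank $X$ (dimensions and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)      # arbitrary seed
n, k = 30, 4                        # illustrative sizes
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])

# H = X (X^T X)^{-1} X^T, computed with a linear solve instead of an explicit inverse.
H = X @ np.linalg.solve(X.T @ X, X.T)
I = np.eye(n)

trace_H = np.trace(H)                          # should equal k + 1
is_symmetric = np.allclose(H, H.T)
is_idempotent = np.allclose(H @ H, H)
annihilates_X = np.allclose((I - H) @ X, 0.0)  # (I - H) X = 0
```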


### Maximum Likelihood Estimator 

Now assume that the $\epsilon_i\sim N(0,\sigma^2)$ are independent normal samples. Up to an additive constant, the log-likelihood is $-\frac{1}{2\sigma^2}\Vert y - Xb \Vert^2 -\frac{n}{2}\log\sigma^2$, so maximizing over $b$ is the same as minimizing $\Vert y - Xb\Vert^2$: the least squares estimator is exactly the maximum likelihood estimator of $\beta$. For the MLE of $\sigma^2$, we have

$$\hat\sigma^2_{MLE} = {\rm argmax}_{\sigma^2} \left\{-\frac{1}{2\sigma^2}\Vert y - X\hat\beta \Vert^2 -\frac{n}{2}\log\sigma^2\right\}=\frac{1}{n} \Vert y - X\hat\beta \Vert^2=\frac{1}{n} \Vert y - Hy \Vert^2$$

Since $I-H$ is symmetric and idempotent, we obtain
$\hat\sigma^2_{MLE} =\frac{1}{n} y^T(I - H)y$. 

The MLE of $\sigma^2$ is biased. In fact, since $(I - H)X = 0$,
$$\hat\sigma^2_{MLE}=\frac{1}{n} \Vert (I - H)y\Vert^2
=\frac{1}{n} \Vert (I - H)(X\beta+\epsilon)\Vert^2=\frac{1}{n} \Vert (I - H)\epsilon \Vert^2.$$

Recall that $I - H$ being symmetric and idempotent implies that it has the spectral decomposition $I - H=Q^T\Lambda Q$ with $\Lambda = \left[\begin{matrix}I_r & 0 \\ 0 & 0\end{matrix}\right]$ and $Q$ orthogonal, where the rank is $r = {\rm tr}(I - H) = n - k - 1$. Since $Q\epsilon\sim\mathcal N(0,\sigma^2I_n)$, the quantity $\Vert (I - H)\epsilon\Vert^2 = \Vert \Lambda Q\epsilon\Vert^2$ is the sum of squares of $n-k-1$ independent $N(0,\sigma^2)$ variables. We conclude that
$$\hat\sigma^2_{MLE} \sim \frac{1}{n}\chi_{n-k-1}^2\sigma^2.$$

Then $\mathbb E(\hat\sigma^2_{MLE}) = \dfrac{n-k-1}{n}\sigma^2$, since $\mathbb E(\chi^2_{n-k-1}) = n-k-1$. To remove the bias, we can use the unbiased estimator 
$$s^2 = \frac{n}{n-k-1}\hat\sigma^2_{MLE}=\frac{y^T(I - H)y}{n-k-1} = \frac{\epsilon^T(I - H)\epsilon}{n-k-1}.$$
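The bias can be seen in a small simulation. The sketch below uses a hypothetical helper `variance_estimates` (not from the text) together with assumed values for the design, $\beta$, $\sigma$, and the number of replications, and averages both estimators over repeated noise draws.

```python
import numpy as np

def variance_estimates(X, y):
    """Return (sigma2_mle, s2) for the linear model y = X beta + eps."""
    n, p = X.shape                            # p = k + 1 columns, intercept included
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)     # equals y^T (I - H) y
    return rss / n, rss / (n - p)

rng = np.random.default_rng(3)                # arbitrary seed
n, k, sigma = 20, 3, 2.0                      # illustrative values
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([0.5, 1.0, -1.0, 2.0])

# Average both estimators over repeated noise draws with X held fixed.
reps = 5000
draws = np.array([
    variance_estimates(X, X @ beta + rng.normal(scale=sigma, size=n))
    for _ in range(reps)
])
mean_mle, mean_s2 = draws.mean(axis=0)

# Theoretical mean of the MLE: (n - k - 1) / n * sigma^2.
expected_mle = (n - k - 1) / n * sigma**2
```

Under these assumptions `mean_s2` should be close to $\sigma^2$ while `mean_mle` should be close to `expected_mle`, which is smaller by the factor $(n-k-1)/n$.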

## Distribution

From above, we know that 
$$\hat\beta = (X^TX)^{-1}X^Ty=  \beta+ (X^TX)^{-1}X^T\epsilon\sim \mathcal N(\beta, (X^TX)^{-1}\sigma^2)$$
and 
$$s^2 = \frac{\Vert (I - H)\epsilon\Vert^2}{n-k-1}\sim \frac{1}{n-k-1}\chi_{n-k-1}^2\sigma^2.$$

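Note that $(X^TX)^{-1}X^T\epsilon$ and $(I - H)\epsilon$ are jointly normal, being linear functions of the same normal vector $\epsilon$, and they are uncorrelated: ${\rm Cov}\big((X^TX)^{-1}X^T\epsilon,\,(I - H)\epsilon\big) = \sigma^2(X^TX)^{-1}X^T(I - H) = 0$ because $(I - H)X = 0$. For jointly normal vectors, being uncorrelated implies independence, so $\hat\beta$ and $s^2$ are independent.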
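The vanishing cross-covariance behind this independence claim can be confirmed numerically. A minimal sketch, assuming an arbitrary full-rank design (sizes and seed are illustrative); it computes the cross-covariance matrix divided by $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(4)              # arbitrary seed
n, k = 25, 2                                # illustrative sizes
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])

A = np.linalg.solve(X.T @ X, X.T)           # (X^T X)^{-1} X^T
M = np.eye(n) - X @ A                       # I - H

# Cov(A eps, M eps) = sigma^2 A M^T, which vanishes because (I - H) X = 0.
cross_cov_over_sigma2 = A @ M.T
max_entry = np.max(np.abs(cross_cov_over_sigma2))
```

Here `max_entry` should be zero up to floating-point rounding.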