# 2 Multiple Linear Regression

Now we have multiple factors, and the linear regression has the form ($y_i,x_{ij}\in\mathbb R$)
$$y_i = \beta_0+\beta_1x_{i1}+\dotsc +\beta_k x_{ik}+\epsilon_i,$$
or in matrix form (with $x_i = [1,x_{i1},\dotsc,x_{ik}]^T\in\mathbb R^{k+1}$ and $\beta = [\beta_0,\beta_1,\dotsc,\beta_k]^T$),
$$y_i = x_i^T\beta+\epsilon_i.$$

Still we assume that the noise is independent with $\mathbb E(\epsilon_i )=0$ and ${\rm Var}(\epsilon_i)=\sigma^2$.

<br>

We can stack all $n$ observations into matrices:
$y = [y_1,\dotsc,y_n]^T\in\mathbb R^n$, $X = [x_1,\dotsc,x_{n}]^T\in\mathbb R^{n\times (k+1)}$ and $\epsilon=[\epsilon_1,\dotsc,\epsilon_n]^T\in\mathbb R^n$, so that $y = X\beta+\epsilon$. As a consequence, $\mathbb E(y) =X\beta$ and ${\rm Cov}(y) = \sigma^2I_n$.

## Model

### Least Squares Estimator

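The least squares estimator minimizes the squared residual norm, and the minimizers are exactly the solutions of the normal equations:

$$\hat\beta\in{\rm argmin}_{b} \Vert y - Xb\Vert^2\quad\Leftrightarrow\quad X^TX\hat\beta = X^Ty. $$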

Proof: For arbitrary $b\in\mathbb R^{k+1}$,
$$\Vert y - Xb\Vert^2-\Vert y -X\hat\beta\Vert^2
=\Vert (y - X\hat\beta)+X(\hat\beta -b)\Vert^2 - \Vert y - X\hat \beta\Vert^2
=2(y-X\hat\beta)^TX(\hat \beta - b)+\Vert X(\hat\beta -b)\Vert^2.$$

In particular, if $X^TX\hat\beta = X^Ty$, we have $2(y-X\hat\beta)^TX=0$ and thus, 
$$\Vert y - Xb\Vert^2-\Vert y -X\hat\beta\Vert^2\geqslant 0.$$

Such $\hat \beta$ always exists, and one of the solutions is given by $\hat \beta = X^\dag y$ where $X^\dag$ is the pseudoinverse.

However, we shall further assume that $X^TX$ is nonsingular, so that the solution is unique: $\hat\beta =(X^TX)^{-1}X^Ty$.

In this case, 

$${\rm Cov}(\hat \beta) = {\rm Cov}((X^TX)^{-1}X^T(X\beta + \epsilon))
= {\rm Cov}((X^TX)^{-1}X^T\epsilon)=\sigma^2(X^TX)^{-1}.$$

Here we have used the fact that ${\rm Cov}(Au) = A{\rm Cov}(u)A^T$.
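The estimator above can be checked numerically. The following is a minimal NumPy sketch on simulated data; the sample size, the coefficients in `beta` and the noise level `sigma` are made-up values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, illustrative data only
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # first column is the intercept
beta = np.array([1.0, 2.0, -1.0, 0.5])   # hypothetical true coefficients
sigma = 0.3                              # hypothetical noise standard deviation
y = X @ beta + sigma * rng.normal(size=n)

# Solve the normal equations X^T X beta_hat = X^T y directly.
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)

# The same estimate from NumPy's least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The residual is orthogonal to every column of X.
residual = y - X @ beta_hat

# Covariance of beta_hat for known sigma: sigma^2 (X^T X)^{-1}.
cov_beta_hat = sigma**2 * np.linalg.inv(XtX)
```

Solving the normal equations with `np.linalg.solve` mirrors the formula in the text; for ill-conditioned $X$, `np.linalg.lstsq` is usually the safer choice.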



### Hat Matrix

Note that $\hat y = X\hat \beta= X(X^TX)^{-1}X^Ty$. We denote $H = X(X^TX)^{-1}X^T$ and call it the hat matrix. Properties:

1. ${\rm tr}(H) = {\rm tr}((X^TX)^{-1}X^TX) = k+1$.
2. $H$ is symmetric.
3. $H$ is idempotent ($H^2=H$) and $I - H$ is also idempotent.
4. $HX = X$.
5. $(I - H)X = 0$.
6. $\hat y = Hy$.
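The listed properties are easy to verify numerically. Below is a small NumPy sketch on a random design matrix (the data are arbitrary and only serve as an illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)    # H = X (X^T X)^{-1} X^T
I = np.eye(n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# One boolean per property of the hat matrix.
checks = {
    "tr(H) = k + 1": np.isclose(np.trace(H), k + 1),
    "H symmetric": np.allclose(H, H.T),
    "H idempotent": np.allclose(H @ H, H),
    "I - H idempotent": np.allclose((I - H) @ (I - H), I - H),
    "HX = X": np.allclose(H @ X, X),
    "(I - H)X = 0": np.allclose((I - H) @ X, 0.0),
    "y_hat = Hy": np.allclose(X @ beta_hat, H @ y),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```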


### Maximum Likelihood Estimator 

Assume that $\epsilon_i\sim N(0,\sigma^2)$ are independent samples from a normal distribution. Maximizing the likelihood over $\beta$ is then the same as minimizing $\Vert y - X\beta\Vert^2$, so the least squares estimator is exactly the maximum likelihood estimator of $\beta$. For the MLE of $\sigma^2$, we have

$$\hat\sigma^2_{MLE} = {\rm argmax}_{\sigma^2} \left\{-\frac{1}{2\sigma^2}\Vert y - X\hat\beta \Vert^2 -\frac{n}{2}\log\sigma^2\right\}=\frac{1}{n} \Vert y - X\hat\beta \Vert^2=\frac{1}{n} \Vert y - Hy \Vert^2$$

Since $I-H$ is symmetric and idempotent, we obtain
$\hat\sigma^2_{MLE} =\frac{1}{n} y^T(I - H)y$. 

The MLE for $\sigma^2$ is biased. In fact, 
$$\hat\sigma^2_{MLE}=\frac{1}{n} \Vert (I - H)y\Vert^2
=\frac{1}{n} \Vert (I - H)(X\beta+\epsilon)\Vert^2=\frac{1}{n} \Vert (I - H)\epsilon \Vert^2.$$

Recall that $I - H$ being symmetric and idempotent implies that it has spectral decomposition $I - H=Q^T\Lambda Q$ with $\Lambda = \left[\begin{matrix}I_r & 0 \\ 0 & 0\end{matrix}\right]$ and $Q$ orthogonal. Here the rank $r$ is given by $r = {\rm tr}(I - H) = n - k - 1$. Thus, with $z = Q\epsilon\sim N(0,\sigma^2I_n)$, $\Vert (I - H)\epsilon\Vert^2 = \epsilon^TQ^T\Lambda Q\epsilon=\sum_{i=1}^{r}z_i^2$ is the sum of squares of $n-k-1$ independent $N(0,\sigma^2)$ variables. And we conclude that 
$$\hat\sigma^2_{MLE} \sim \frac{1}{n}\chi_{n-k-1}^2\sigma^2.$$

Then, $\mathbb E(\hat\sigma^2_{MLE}) = \dfrac{n-k-1}{n}\sigma^2$. To remove the bias, we can use the unbiased estimator 
$$s^2 = \frac{n}{n-k-1}{\hat\sigma^2_{MLE}}=\frac{y^T(I - H)y}{n-k-1} = \frac{\epsilon^T(I - H)\epsilon}{n-k-1}.$$
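The bias of the MLE can be seen in a quick Monte Carlo experiment. The sketch below is only an illustration: the design matrix, `sigma`, and the number of replications `reps` are arbitrary choices, and the averages are simulation estimates rather than exact expectations.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, sigma = 15, 4, 2.0                 # small n makes the bias visible
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = rng.normal(size=k + 1)            # hypothetical true coefficients
H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - H                        # residual-forming matrix I - H

reps = 20000
eps = sigma * rng.normal(size=(reps, n))
Y = X @ beta + eps                       # each row is one simulated response vector

# Row-wise residuals: M is symmetric, so (M y)^T = y^T M.
sse = np.sum((Y @ M) ** 2, axis=1)

mle_mean = np.mean(sse / n)              # average of the MLE of sigma^2
s2_mean = np.mean(sse / (n - k - 1))     # average of the unbiased estimator

print("target sigma^2:        ", sigma**2)
print("mean of MLE:           ", mle_mean)
print("(n-k-1)/n * sigma^2:   ", (n - k - 1) / n * sigma**2)
print("mean of s^2:           ", s2_mean)
```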

## Distribution

From above, we know that 
$$\hat\beta = (X^TX)^{-1}X^Ty=  \beta+ (X^TX)^{-1}X^T\epsilon\sim \mathcal N(\beta, (X^TX)^{-1}\sigma^2)$$
and 
$$s^2 = \frac{\Vert (I - H)\epsilon\Vert^2}{n-k-1}\sim \frac{1}{n-k-1}\chi_{n-k-1}^2\sigma^2.$$

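Note that $(X^TX)^{-1}X^T\epsilon$ and $(I - H)\epsilon$ are jointly normal and uncorrelated, since ${\rm Cov}((X^TX)^{-1}X^T\epsilon,\ (I - H)\epsilon) = \sigma^2(X^TX)^{-1}X^T(I - H) = 0$. Uncorrelated jointly normal vectors are independent, so $\hat\beta $ and $s^2$ are independent.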

## Goodness of Fit

Denote $e = [1,\dotsc,1]^T\in\mathbb R^{n}$ and $n\bar y =  e^Ty$. Recall $H = X(X^TX)^{-1}X^T$ is the hat matrix and $\hat y = Hy$.

### Total Sum of Squares

The total sum of squares does not depend on the model: ${\rm SST} = \sum_{i=1}^n (y_i - \bar y )^2$. We can also write it as a quadratic form, 
$${\rm SST} = \Vert y - \frac 1n ee^Ty\Vert^2 = y^T(I - \frac 1n ee^T)y.$$

We shall note that $I - \frac 1nee^T$ is idempotent and ${\rm tr}(I - \frac 1n ee^T) = n-1$.



### Regression Sum of Squares

The regression sum of squares is

$${\rm SSR} = \Vert \hat y - \frac 1n ee^Ty\Vert^2 = \Vert Hy - \frac 1n ee^Ty\Vert^2 = y^T(H - \frac 1n ee^T)^2y.$$

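Recall that $HX = X$; since $e$ is the first column of $X$, this gives $He = e$ and, by symmetry, $e^TH = e^T$. Hence $(H - \frac 1n ee^T)^2 = H - \frac 1n Hee^T - \frac 1n ee^TH + \frac 1n ee^T = H - \frac 1n ee^T$, i.e. $H - \frac 1n ee^T$ is idempotent, and therefore 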
$${\rm SSR} = y^T(H - \frac 1n ee^T)y.$$

And ${\rm tr}(H - \frac 1n ee^T) = k$.

### Residual Sum of Squares

The residual sum of squares is
$${\rm SSE} = \Vert y - \hat y \Vert^2 = \Vert y - Hy\Vert^2 = y^T(I - H)y=(n-k-1)s^2,$$
where $s^2$ is the unbiased estimator of $\sigma^2$ defined above. Moreover, $I - H$ is idempotent with ${\rm tr}(I - H) = n - k - 1$.

### Relations 

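Still we have $${\rm SST} = {\rm SSE}+{\rm SSR},$$
because $I - \frac 1nee^T = (I - H) + (H - \frac 1nee^T)$ and the two parts are orthogonal: 
$(I - H)^T(H - \frac 1nee^T) = 0$.

Therefore, under the normality assumption, $(H - \frac 1n ee^T)y$ and $(I - H)y$ are jointly normal and uncorrelated, so ${\rm SSR} = \Vert (H - \frac 1n ee^T)y\Vert^2$ and ${\rm SSE} = \Vert (I - H)y\Vert^2$ are independent.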

### $R^2$

The coefficient of determination is given by $R^2 = 1 - \dfrac{\rm SSE}{\rm SST} = \dfrac{\rm SSR}{\rm SST}$. And we call $\rho = \sqrt {R^2}$ the multiple correlation coefficient.

Observe that 
$$\frac{{\rm Cov}(y,\hat y)^2}{{\rm Var}(y){\rm Var}(\hat y)}
=\frac{\left((y  - \frac 1nee^Ty)^T(Hy - \frac 1n ee^THy)\right)^2}{\Vert y - \frac 1n ee^Ty\Vert^2\Vert Hy - \frac 1n ee^THy\Vert^2}=
 \frac{\left(y^T(H - \frac 1nee^T)y\right)^2 }{\Vert y - \frac 1n ee^Ty\Vert^2y^T(H - \frac 1nee^T)y}=\frac{\rm SSR}{\rm SST} = R^2,$$
where ${\rm Cov}$ and ${\rm Var}$ denote the sample covariance and variance, so $R^2$ is the squared sample correlation between $y$ and $\hat y$.
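The three sums of squares and the identity for $R^2$ can be checked on simulated data. This is a minimal NumPy sketch; the coefficients and sample size are arbitrary, and the matrix `J` stands for $\frac 1n ee^T$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)  # hypothetical model

H = X @ np.linalg.solve(X.T @ X, X.T)    # hat matrix
J = np.ones((n, n)) / n                  # (1/n) e e^T, averages the entries of a vector
I = np.eye(n)

# Quadratic forms from the text.
sst = y @ (I - J) @ y
ssr = y @ (H - J) @ y
sse = y @ (I - H) @ y

r2 = ssr / sst
# Squared sample correlation between y and the fitted values.
corr2 = np.corrcoef(y, H @ y)[0, 1] ** 2

print("SST, SSR + SSE:", sst, ssr + sse)
print("R^2, corr^2:   ", r2, corr2)
```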

### Adjusted $R^2$

A larger $R^2$ is not always better: adding more factors, even irrelevant ones, never increases ${\rm SSE}$ (equivalently, never decreases ${\rm SSR}$), so $R^2$ never decreases. We can fix the problem by introducing the adjusted $R^2$, denoted $R_a^2$ and defined below. 

$$R_a^2 = 1 - \frac{{\rm SSE}/(n - k - 1)}{{\rm SST} / (n - 1)}.$$

Here a larger number of factors $k$ shrinks $n-k-1$, so an extra factor raises $R_a^2$ only if it reduces ${\rm SSE}$ enough to offset the lost degree of freedom.
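To see the difference between the two quantities, the following sketch fits a model with one useful factor and then appends five pure-noise columns. The helper `r2_and_adjusted` and all data are hypothetical and only meant as an illustration.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R^2, adjusted R^2); X must contain the intercept column."""
    n, p1 = X.shape                      # p1 = k + 1
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (sse / (n - p1)) / (sst / (n - 1))
    return r2, r2_adj

rng = np.random.default_rng(4)
n = 30
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)  # only x1 matters

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, rng.normal(size=(n, 5))])  # add 5 noise factors

r2_small, adj_small = r2_and_adjusted(X_small, y)
r2_big, adj_big = r2_and_adjusted(X_big, y)

print("small model: R^2 =", r2_small, " adjusted =", adj_small)
print("big model:   R^2 =", r2_big, " adjusted =", adj_big)
```

$R^2$ of the larger model is never smaller, while the adjusted value may go either way depending on how much the noise columns happen to reduce ${\rm SSE}$.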


### Extra Sum of Squares

As mentioned, introducing new factors, useful or not, can reduce ${\rm SSE}$ (or equivalently, increase ${\rm SSR}$), but useful ones often reduce it more. For the current $X_1\in\mathbb R^{n\times(k_1+1)}$ and a new set of factors $X_2\in\mathbb R^{n\times k_2}$, we can merge them into $X = [X_1,X_2]\in\mathbb R^{n\times (k_1+k_2+1)}$. 

Denote by ${\rm SSE}(X_1)$ and ${\rm SSE}([X_1,X_2])$ the residual sums of squares of the models fitted on $X_1$ and $X=[X_1,X_2]$ respectively. We define the extra sum of squares as the decrease in ${\rm SSE}$:

$${\rm SSR}(X_2|X_1) = {\rm SSE}(X_1) - {\rm SSE}([X_1,X_2]),$$
or equivalently the increase in regression sum of squares, 

$${\rm SSR}(X_2|X_1) =  {\rm SSR}([X_1,X_2])-{\rm SSR}(X_1).$$
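A short numerical check of the two equivalent definitions, as a sketch on simulated data (the helper `sse` and the coefficients are illustrative assumptions):

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares of the least squares fit of y on X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta_hat) ** 2))

rng = np.random.default_rng(5)
n = 60
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # current factors, k1 = 2
X2 = rng.normal(size=(n, 2))                                 # new factors, k2 = 2
y = X1 @ np.array([1.0, 0.5, -0.5]) + X2 @ np.array([1.5, 0.0]) + rng.normal(size=n)
X = np.hstack([X1, X2])

sst = float(np.sum((y - y.mean()) ** 2))

# Decrease in SSE when X2 is added ...
extra_from_sse = sse(X1, y) - sse(X, y)
# ... equals the increase in SSR, because SST is the same for both models.
extra_from_ssr = (sst - sse(X, y)) - (sst - sse(X1, y))

print("SSR(X2 | X1) =", extra_from_sse)
```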

## Hypothesis Testing

It is important in practice to test whether a factor is indeed influential to the response.


### Test All Coefficients

When we want to test $H_0:\ \beta_1=\dotsc =\beta_k = 0$ (the intercept $\beta_0$ may be nonzero) against the alternative that some $\beta_j\neq 0$, we can use the statistic 
$$F=\frac{\rm MSR}{\rm MSE} =\frac{{\rm SSR}/k}{{\rm SSE}/(n-k-1)}\sim\frac{\sigma^2\chi_k^2/k}{\sigma^2\chi_{n-k-1}^2/(n-k-1)}\sim F_{k,n-k-1}\quad\text{under } H_0.$$

We reject $H_0$ when $F$ is large enough, since a large $F$ means that the regression explains far more variation than noise alone would produce, i.e. the factors are not all irrelevant.
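The overall $F$ statistic can be computed directly from the sums of squares. The sketch below uses simulated data with made-up coefficients; it only computes the statistic, and a p-value could then be obtained from the $F_{k,n-k-1}$ distribution, for example with `scipy.stats.f.sf(F, k, n - k - 1)` if SciPy is available.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.8, 0.0, -0.6]) + rng.normal(size=n)  # hypothetical model

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

sse = np.sum((y - y_hat) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
sst = np.sum((y - y.mean()) ** 2)

msr = ssr / k                            # mean square due to regression
mse = sse / (n - k - 1)                  # mean squared error, equals s^2
F = msr / mse

# Equivalent expression in terms of R^2.
r2 = 1.0 - sse / sst
F_from_r2 = (r2 / k) / ((1.0 - r2) / (n - k - 1))

print("F =", F)
```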


### Test Single Coefficient

When we want to test whether one of the factors contributes, say, $H_0:\beta_1 = 0$ against $H_1:\beta_1\neq 0$, recall that $\hat\beta_1 \sim N(\beta_1, \sigma^2e_1^T(X^TX)^{-1}e_1)$ where $e_1 = [0,1,0,\dotsc,0]^T\in\mathbb R^{k+1}$ is the unit vector that selects $\beta_1$ from $\beta = [\beta_0,\beta_1,\dotsc,\beta_k]^T$. Since $\hat\beta_1$ and $s^2$ are independent, under the assumption $\beta_1 = 0$ we have
$$t = \frac{\hat\beta_1 }{\sqrt{e_1^T(X^TX)^{-1}e_1 s^2}}\sim \frac{\sigma N(0,1)}{\sigma\sqrt{\frac{\chi_{n-k-1}^2}{n-k-1}}}
\sim t_{n-k-1}.
$$

We reject $H_0$ when $t$ is far from the origin, i.e. an "abnormally" large $|t|$ indicates that $\beta_1$ does not vanish.
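As an illustration, the $t$ statistic for $\beta_1$ can be computed as follows. The data are simulated with arbitrary coefficients. The sketch also compares $t^2$ with the partial $F$ statistic obtained by dropping the factor, a standard identity that serves here as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.4, -1.0, 0.0]) + rng.normal(size=n)  # hypothetical model

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sse_full = resid @ resid
s2 = sse_full / (n - k - 1)              # unbiased estimate of sigma^2

j = 1                                    # column index of beta_1 (index 0 is the intercept)
se = np.sqrt(s2 * XtX_inv[j, j])         # standard error of beta_hat[j]
t = beta_hat[j] / se

# Partial F statistic: refit without column j and compare the SSEs.
X_red = np.delete(X, j, axis=1)
b_red, *_ = np.linalg.lstsq(X_red, y, rcond=None)
sse_red = np.sum((y - X_red @ b_red) ** 2)
F_partial = (sse_red - sse_full) / s2

print("t =", t, " t^2 =", t**2, " partial F =", F_partial)
```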

### Test Linearity of Coefficients

More generally, if we want to test whether $C\beta = 0$ where $C\in\mathbb  R^{m\times (k+1)}$ has full row rank $m$, we can use the generalized extra sum of squares: fit the best reduced model under the constraint $H_0:C\beta = 0$ and compare it with the full model. 

**Theorem** Let ${\rm SSE}_R$ and ${\rm SSE}_F$ be the SSE of the model with and without constraint $C\beta = 0$, then under assumption $H_0$ we have 
$${\rm SSE}_R - {\rm SSE}_F \sim \sigma^2 \chi_m^2.$$

Since $\sigma^2$ is unknown in practice, we estimate it by $s^2=\frac{1}{n-k-1}{\rm SSE}_F$, which is independent of ${\rm SSE}_R - {\rm SSE}_F$, so that under $H_0$ we test 
$$\frac{({\rm SSE}_R - {\rm SSE}_F) / m}{{\rm SSE}_F / (n - k - 1)} \sim F_{m,n-k-1}.$$

**Proof** First we derive the best reduced model ${\rm argmin}_\beta\left\{\Vert y - X\beta\Vert^2:\ C\beta = 0\right\}$. Lagrangian duality shows that this is equivalent to solving
$$\begin{aligned}&\sup_{\hat \mu} \inf_{\hat \beta} \left\{\Vert y - X\hat\beta\Vert^2+2 \hat\mu^T C\hat\beta \right\}
=\sup_{\hat \mu} \left\{ y^Ty - (\hat\mu^T C - y^TX)(X^TX)^{-1}( C^T\hat\mu -X^Ty)\right\}.
\end{aligned}$$
The extremum is reached when $\hat\mu = (C(X^TX)^{-1}C^T)^{-1}C(X^TX)^{-1}X^Ty$ and $\hat \beta_R = (X^TX)^{-1}(X^Ty-C^T\hat\mu)$.

Thus, the SSE of the reduced model is given by (note that $(I - H)X = 0$)
$$\begin{aligned}{\rm SSE}_R&=\Vert y - X\hat \beta_R\Vert^2
=\Vert \left(I - X(X^TX)^{-1}X^T + X(X^TX)^{-1}C^T(C(X^TX)^{-1}C^T)^{-1}C(X^TX)^{-1}X^T\right)y\Vert^2
\\ &= y^T (I - H + 2(I - H)X(X^TX)^{-1}C^T(C(X^TX)^{-1}C^T)^{-1}C(X^TX)^{-1}X^T
\\ &\quad\quad\quad \quad\quad\quad \quad\quad\quad + X(X^TX)^{-1}C^T(C(X^TX)^{-1}C^T)^{-1}C(X^TX)^{-1}X^T )y
\\ &= y^T\left(I - H + X(X^TX)^{-1}C^T(C(X^TX)^{-1}C^T)^{-1}C(X^TX)^{-1}X^T \right)y
\end{aligned}
$$

Therefore, the extra sum of squares is given by
$$\begin{aligned}{\rm SSE}_R - {\rm SSE}_F &= y^T  X(X^TX)^{-1}C^T(C(X^TX)^{-1}C^T)^{-1}C(X^TX)^{-1}X^T y
\\ &= \Vert (C(X^TX)^{-1}C^T)^{-\frac 12}C(X^TX)^{-1}X^T (X\beta +\epsilon)\Vert^2
\\ &= \Vert (C(X^TX)^{-1}C^T)^{-\frac 12} (C\beta +C(X^TX)^{-1}X^T\epsilon)\Vert^2
\end{aligned}
$$

Under $H_0:\ C\beta = 0$, we can see that 
$${\rm SSE}_R - {\rm SSE}_F=\epsilon^T X(X^TX)^{-1}C^T(C(X^TX)^{-1}C^T)^{-1}C(X^TX)^{-1}X^T \epsilon=\epsilon^TW\epsilon.
$$

Note that $W$ is symmetric and idempotent with ${\rm rank}(W) = {\rm tr}(W) =m $, so by Cochran's theorem ${\rm SSE}_R - {\rm SSE}_F\sim \sigma^2\chi_m^2$.

Lastly, ${\rm SSE}_R - {\rm SSE}_F$ and ${\rm SSE}_F$ are independent (and hence ${\rm SSE}_R - {\rm SSE}_F$ is independent of $s^2$) since 
$$ (I - H)X(X^TX)^{-1}C^T(C(X^TX)^{-1}C^T)^{-1}C(X^TX)^{-1}X^T  = 0.$$
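The constrained estimator and the closed form of ${\rm SSE}_R - {\rm SSE}_F$ from the proof can be verified numerically. The sketch below uses a made-up constraint matrix `C` (testing $\beta_1=\beta_2$ and $\beta_3=0$) on simulated data for which $H_0$ holds; all specific numbers are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, 0.5, 0.0]) + rng.normal(size=n)  # satisfies C beta = 0

# H0: beta_1 - beta_2 = 0 and beta_3 = 0, so m = 2.
C = np.array([[0.0, 1.0, -1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
m = C.shape[0]

A = np.linalg.inv(X.T @ X)               # (X^T X)^{-1}
beta_F = A @ X.T @ y                     # unconstrained (full model) estimate
mu_hat = np.linalg.solve(C @ A @ C.T, C @ beta_F)   # Lagrange multiplier
beta_R = beta_F - A @ C.T @ mu_hat       # constrained (reduced model) estimate

sse_F = np.sum((y - X @ beta_F) ** 2)
sse_R = np.sum((y - X @ beta_R) ** 2)

extra = sse_R - sse_F
# Closed form: (C beta_F)^T (C A C^T)^{-1} (C beta_F) = y^T W y.
closed_form = (C @ beta_F) @ mu_hat

F = (extra / m) / (sse_F / (n - k - 1))
print("SSE_R - SSE_F =", extra, " F =", F)
```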

## Model Selection

Among the $k$ factors in the covariates $X\in\mathbb R^{n\times (k+1)}$, some might be redundant. If $k$ is very large ($k=O(n)$) and we use all $k$ factors, the model might overfit and have a large variance. On the other hand, if we use too few factors, the model might have a large bias. It is therefore important to extract the essential factors.

### All-Possible-Regression Selection

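Enumerate all subsets of the $k$ factors and select the subset with the best value of a criterion. The criterion could be $R_a^2$, $C_p$ or ${\rm AIC}$; plain $R^2$ only compares subsets of the same size fairly, since it never decreases when factors are added.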

### $C_p$ Criterion

Let $p$ be the number of factors in the reduced model (the intercept term is not counted). We define
$$C_p = \frac{{\rm SSE}_R}{{\rm SSE}_F / (n - k - 1)}- (n - 2(p+1))$$
where ${\rm SSE}_R$ and ${\rm SSE}_F$ are the residual sums of squares of the reduced and the full model respectively. 

**Theorem** Let $\hat y_R = H_py$ be the prediction with the reduced model and $H_p\in\mathbb R^{n\times n}$ be the corresponding hat matrix (of rank $p+1$), then 
$$\mathbb E\left\{\frac{{\rm SSE}_R}{\sigma^2}- (n - 2(p+1))\right\} =   \frac{\mathbb E\Vert  \hat y_R-\mathbb E(y)\Vert^2 }{\sigma^2}$$

**Proof** 
$$\mathbb E\Vert  \hat y_R-\mathbb E(y)\Vert^2
=\mathbb E \Vert H_p(X\beta+\epsilon) - X\beta\Vert^2=\beta^TX^T(I -H_p)X\beta+\sigma^2(p+1)$$
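Similarly, ${\rm SSE}_R = \Vert (I - H_p)(X\beta+\epsilon)\Vert^2$ and ${\rm tr}(I - H_p) = n-p-1$ give $\mathbb E({\rm SSE}_R)=\beta^TX^T(I -H_p)X\beta+\sigma^2(n-p-1)$. Combining this with $n-p-1-(n-2(p+1))=p+1$ yields the result.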
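All-possible-regression selection with the $C_p$ criterion can be sketched as follows. The data are simulated so that only the first two of four factors matter; the helper `sse`, the coefficients, and the dictionary `cp` are illustrative choices, not part of the notes.

```python
import numpy as np
from itertools import combinations

def sse(X, y):
    """Residual sum of squares of the least squares fit of y on X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta_hat) ** 2))

rng = np.random.default_rng(9)
n, k = 80, 4
Z = rng.normal(size=(n, k))              # the k candidate factors
y = 1.0 + 2.0 * Z[:, 0] - 1.5 * Z[:, 1] + rng.normal(size=n)

ones = np.ones((n, 1))
X_full = np.hstack([ones, Z])
mse_full = sse(X_full, y) / (n - k - 1)  # SSE_F / (n - k - 1)

# C_p for every subset of factors; the intercept is always included.
cp = {}
for p in range(k + 1):
    for subset in combinations(range(k), p):
        cols = np.array(subset, dtype=int)
        X_red = np.hstack([ones, Z[:, cols]])
        cp[subset] = sse(X_red, y) / mse_full - (n - 2 * (p + 1))

best = min(cp, key=cp.get)
print("subset with the smallest C_p:", best, " C_p =", cp[best])
```

By construction the full model always has $C_p = k+1$, so the criterion is informative only for proper subsets. Enumeration costs $2^k$ fits and is practical only for small $k$.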


### AIC Criterion

Let $L$ be the log-likelihood of the maximum likelihood estimator, then we define
$${\rm AIC}= -\frac 2n\log L + \frac {2p}{n}.$$
The smaller the ${\rm AIC}$, the better.


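**Theorem** When assuming $\epsilon\sim N(0,\sigma^2I_n)$, minimizing ${\rm AIC}$ is equivalent to minimizing $n\log {\rm SSE}+2p$.

**Proof** Recall that $\hat\sigma_{MLE}^2= \frac 1n {\rm SSE}$. Evaluating the log-likelihood at the MLE gives
$$\log L = -\frac {1}{2\hat\sigma_{MLE}^2}  \Vert y - \hat y \Vert^2+n\log\frac{1}{\sqrt{2\pi\hat\sigma_{MLE}^2}}
=-\frac n2-\frac n2\log {\rm SSE} +{\rm Const}.$$
Hence $n\,{\rm AIC} = -2\log L + 2p = n\log {\rm SSE}+2p+{\rm Const}$, where the constant depends only on $n$.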
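The equivalence can be checked numerically: for a fixed $n$, $n\,{\rm AIC}$ and $n\log{\rm SSE}+2p$ differ by the same constant for every model. The sketch below uses a hypothetical helper `gaussian_aic` and simulated data.

```python
import numpy as np

def gaussian_aic(X, y):
    """Return (AIC, n*log(SSE) + 2p) for a Gaussian linear model.

    X must contain the intercept column; p counts the remaining factors.
    """
    n = len(y)
    p = X.shape[1] - 1
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta_hat) ** 2)
    sigma2_mle = sse / n
    # Log-likelihood evaluated at the MLE of beta and sigma^2.
    log_l = -0.5 * n * (np.log(2.0 * np.pi * sigma2_mle) + 1.0)
    aic = -2.0 / n * log_l + 2.0 * p / n
    simplified = n * np.log(sse) + 2.0 * p
    return aic, simplified

rng = np.random.default_rng(10)
n = 60
Z = rng.normal(size=(n, 3))
y = 0.5 + 1.2 * Z[:, 0] + rng.normal(size=n)    # only the first factor matters

X_a = np.column_stack([np.ones(n), Z[:, 0]])    # one factor
X_b = np.column_stack([np.ones(n), Z])          # all three factors

aic_a, simp_a = gaussian_aic(X_a, y)
aic_b, simp_b = gaussian_aic(X_b, y)

# The gap n*AIC - (n log SSE + 2p) depends on n only.
gap_a = n * aic_a - simp_a
gap_b = n * aic_b - simp_b
print("gaps:", gap_a, gap_b)
```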