# Module 20: GLM Estimation


We continue with the model: $$Y = X\beta+\epsilon$$ with $\epsilon \sim N(0, I\sigma^2)$. The matrices $X$ and $Y$ are assumed to be known, and the noise is assumed to be uncorrelated.

Our goal is to find the value of $\beta$ that minimizes $$(Y-X\beta)^T(Y-X\beta)$$

which is the sum of squared errors (SSE).

In the case where we have one explanatory variable, we are looking for the line that minimizes the sum of squared distances between the data points and the line, measured in the y direction. With two explanatory variables we look for a plane, and with p explanatory variables we look for a p-dimensional hyperplane.
<img src="onedim.png">

The least squares criterion is:$$Q = (Y - X\beta)'(Y-X\beta)$$

To minimize this, we can take the derivative with respect to $\beta$ and set it equal to 0, which produces the normal equations:$$X'X\hat{\beta}=X'Y$$
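For reference, the intermediate step can be written out as follows. This is the standard derivation, using only the symbols already defined above; solving the resulting equations for $\hat{\beta}$ additionally assumes that $X'X$ is invertible.

$$Q = Y'Y - 2\beta'X'Y + \beta'X'X\beta$$
$$\frac{\partial Q}{\partial \beta} = -2X'Y + 2X'X\beta$$

Setting the derivative to zero at $\beta = \hat{\beta}$ gives $X'X\hat{\beta} = X'Y$.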

Then the ordinary least squares (OLS) estimators are given by:$$\hat{\beta} = (X'X)^{-1}X'Y$$
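As a rough illustration, the OLS estimate can be computed with NumPy. This is a minimal sketch on simulated data: the sample size, the true coefficients, and the noise level are all made-up assumptions, not values from this module.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: N observations, an intercept plus two predictors.
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta_true = np.array([1.0, 2.0, -0.5])          # assumed "true" coefficients
Y = X @ beta_true + rng.normal(scale=0.5, size=N)

# Solve the normal equations X'X beta = X'Y without forming an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's built-in least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_hat)
print(beta_lstsq)
```

Solving the linear system is generally preferred over computing $(X'X)^{-1}$ explicitly, since it is numerically more stable and gives the same estimate.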

### The Ordinary Least Squares Solution

The OLS estimator has a few interesting properties. First of all, the expected value of $\hat{\beta}$ is equal to $\beta$, so it is an <b>unbiased estimator</b>. Also, among all linear unbiased estimators of $\beta$, it has the smallest variance (the Gauss-Markov theorem). It is therefore the Best Linear Unbiased Estimator (BLUE).

In other words, if $\epsilon$ is independent and identically distributed (i.i.d.), then the OLS estimate is optimal:$$\hat{\beta} = (X'X)^{-1}X'Y$$

However, if $Var(\epsilon) = V\sigma^2\neq I\sigma^2$, then the Generalized Least Squares (GLS) estimate is optimal. This means that we have to include the variance-covariance matrix in our estimate of $\beta$:$$\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}Y$$

<b>Note</b>: The OLS solution is a special case of the GLS solution: substituting the identity matrix for $V$ in the GLS formula gives the OLS solution. So when $\epsilon$ is autocorrelated, we have to use GLS instead of OLS.
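The relationship between the two estimators can be checked numerically. The sketch below assumes that $V$ is known and, purely for illustration, gives it an AR(1)-style structure with an arbitrary correlation `rho`; the helper name `gls` and the simulated data are hypothetical.

```python
import numpy as np

def gls(X, Y, V):
    """GLS estimate (X'V^{-1}X)^{-1} X'V^{-1}Y, for a symmetric, known V."""
    Vinv_X = np.linalg.solve(V, X)   # V^{-1} X
    Vinv_Y = np.linalg.solve(V, Y)   # V^{-1} Y
    return np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_Y)

rng = np.random.default_rng(1)
N = 50
X = np.column_stack([np.ones(N), rng.normal(size=N)])

# Assumed correlation structure: V[i, j] = rho ** |i - j|
rho = 0.5
idx = np.arange(N)
V = rho ** np.abs(idx[:, None] - idx[None, :])

# Simulate noise with covariance V using the Cholesky factor of V.
L = np.linalg.cholesky(V)
Y = X @ np.array([1.0, 2.0]) + L @ rng.normal(size=N)

beta_gls = gls(X, Y, V)
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# With V = I the GLS formula reduces to the OLS formula.
beta_gls_identity = gls(X, Y, np.eye(N))
print(np.allclose(beta_gls_identity, beta_ols))
```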

So we use our <b>model</b>$$Y = X\beta + \epsilon$$

to get our <b>estimate</b>$$\hat{\beta}=(X'V^{-1}X)^{-1}X'V^{-1}Y$$

which we solve and use to find our <b>fitted values</b>$$\hat{Y}=X\hat{\beta}$$

from which we can calculate our <b>residuals</b>:$$r=Y-\hat{Y}$$
$$ = (I-X(X'V^{-1}X)^{-1}X'V^{-1})Y$$
$$ = RY$$
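The chain from estimate to fitted values to residuals can be sketched as follows. The data and the covariance structure `V` are made-up assumptions; the point is only to show that $RY$ reproduces $Y - \hat{Y}$.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 30
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Y = X @ np.array([0.5, 1.5]) + rng.normal(size=N)

# Assumed known, symmetric covariance structure (illustrative only).
idx = np.arange(N)
V = 0.3 ** np.abs(idx[:, None] - idx[None, :])

Vinv_X = np.linalg.solve(V, X)                          # V^{-1} X
beta_hat = np.linalg.solve(X.T @ Vinv_X, Vinv_X.T @ Y)  # estimate
Y_hat = X @ beta_hat                                    # fitted values
r = Y - Y_hat                                           # residuals

# Residual-inducing matrix R = I - X (X'V^{-1}X)^{-1} X'V^{-1}
R = np.eye(N) - X @ np.linalg.solve(X.T @ Vinv_X, Vinv_X.T)

print(np.allclose(R @ Y, r))     # RY gives the same residuals
print(np.allclose(R @ X, 0.0))   # R removes everything the model can explain
```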

Even if we assume that $\epsilon$ is i.i.d., we still need to estimate the residual variance, $\sigma^2$. Our estimate is:$$\hat{\sigma}^2 = \frac{r^Tr}{tr(RV)}$$

which is the transpose of the residuals times the residuals, divided by the trace of the product of the residual-inducing matrix $R$ and $V$. Recall that the trace of an $n \times n$ matrix is the sum of the elements on the main diagonal.

In OLS:$$\hat{\sigma}^2 = \frac{r^Tr}{N - p}$$

where N is the length of the vector $Y$ (the number of observations) and p is the number of columns in the design matrix, assuming those columns are linearly independent.
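To see why the two formulas agree in the OLS case, note that with $V = I$ the trace of $R$ equals $N - p$. A minimal numerical check, using simulated data with an assumed noise standard deviation of 2 (so the true $\sigma^2$ is 4):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])
sigma_true = 2.0                                  # assumed noise level
Y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=sigma_true, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
r = Y - X @ beta_hat

# With V = I: R = I - X (X'X)^{-1} X', and tr(RV) = tr(R) = N - p.
R = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)

sigma2_trace = (r @ r) / np.trace(R)
sigma2_dof = (r @ r) / (N - p)

print(np.trace(R), N - p)
print(sigma2_trace, sigma2_dof)
```

The estimate should land near the assumed true value of 4, though the exact number depends on the random draw.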

<b>Note</b>: Estimating $V$ when $V\neq I$ is more difficult and is covered in the next module.

### A Geometric Interpretation of the GLM

Think of Y as a vector in a space with one dimension per observation. Imagine a study with 3 subjects and 2 predictor variables, X1 and X2. Y exists in a 3D space, and the model exists in a 2D space. All possible linear combinations of X1 and X2 span a 2D plane, which is a subspace of the 3D space. The plane is the subspace spanned by the model: <img src='3dspace.png'>

What are we doing when we fit the linear model?

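We are minimizing the sum of squared errors (Q). In solving for $\beta$ we are <b>projecting</b> the data (Y) onto the subspace spanned by the columns of X. Substituting $\hat{\beta}$ into $\hat{Y} = X\hat{\beta}$ gives $\hat{Y} = X(X'X)^{-1}X'Y$, so the matrix $X(X'X)^{-1}X'$ that maps $Y$ to the fitted values is a projection matrix: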
<img src='projectionmatrix.png'>
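The projection view can be checked on a toy example that mirrors the setup above: 3 observations and 2 predictors. The numbers are made up for illustration, and the name `P` for the projection matrix is an assumption of this sketch rather than notation from the module.

```python
import numpy as np

# Toy data: 3 observations, 2 predictors (values are illustrative only).
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 2.0])

# Projection matrix onto the plane spanned by the columns of X (OLS case).
P = X @ np.linalg.solve(X.T @ X, X.T)

y_hat = P @ y   # fitted values: the point in the plane closest to y
r = y - y_hat   # residual: the part of y lying outside the plane

print(np.allclose(P @ P, P))       # projecting twice changes nothing
print(np.allclose(P, P.T))         # an orthogonal projection is symmetric
print(np.allclose(X.T @ r, 0.0))   # the residual is orthogonal to the plane
```

The last check is the geometric content of the normal equations: $X'(Y - X\hat{\beta}) = 0$.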