# Linear Methods for Regression

### Linear Regression Models and Least Squares

In matrix form, the residual sum of squares (RSS), the coefficients that minimize it, and the resulting fitted values are:
$$RSS(\beta) = (y - X\beta)^T(y - X\beta)$$
$$\hat \beta = (X^TX)^{-1}X^Ty$$
$$\hat y = X\hat \beta$$

The hat matrix is given by:
$$H = X(X^TX)^{-1}X^T$$

and is called the hat matrix because it puts the hat on $y$: applying it to the response gives the predictions, $\hat y = Hy$. It computes the orthogonal projection of $y$ onto the column space of $X$. This assumes the columns of $X$ are linearly independent, so that $X$ has full column rank and $X^TX$ is invertible.

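If some inputs are redundant (the columns of $X$ are linearly dependent), then $X^TX$ is singular and $\hat \beta$ is not unique. The projection $\hat y$ of $y$ onto the subspace formed by the inputs is still well defined, but it can be expressed in terms of the inputs in more than one way (more than one solution).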

The columns of $X$ span a subspace of $\mathbb{R}^N$ (a hyperplane), and the response $y$ is projected orthogonally onto it. This projection is $\hat y$, the vector of least squares predictions. Minimizing $RSS(\beta)$ chooses $\hat \beta$ so that the residual vector $y - \hat y$ is orthogonal to the subspace spanned by the columns of $X$ (the constant vector $1$ and the inputs $x_1, ..., x_p$).
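The projection view can be checked numerically. The following is a minimal sketch on synthetic data (the data, seed, and variable names are assumptions made for illustration): it solves the normal equations, builds the hat matrix, and computes the residual vector.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3

# Design matrix: a column of ones for the intercept plus p random inputs
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=N)

# Solve the normal equations (X^T X) beta = X^T y without an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T, which maps y to y_hat
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

# The residual should be orthogonal to every column of X
residual = y - y_hat
print(np.abs(X.T @ residual).max())
```

The printed value should be zero up to floating point error, and `H` should be symmetric and idempotent, as expected of an orthogonal projection.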

The variance $\sigma^2$ can be estimated by:
$$\hat \sigma^2 = \frac{1}{N - p - 1}\ \sum_{i = 1}^N\ (y_i - \hat y_i)^2$$

To test if a coefficient $\beta_j = 0$, use a Z-score:
$$z_j = \frac{\hat \beta_j}{\hat \sigma \sqrt{v_j}}$$

where $v_j$ is the $j$th diagonal element of $(X^TX)^{-1}$. A large absolute value of $z_j$ leads to rejection of the null hypothesis $\beta_j = 0$. A categorical variable with $k$ levels enters the model through several coefficients, so to exclude it from a model we need to test whether all of those coefficients are zero simultaneously. The F-statistic is used for this:
$$F = \frac{(RSS_0 - RSS_1) / (p_1 - p_0)}{RSS_1 / (N - p_1 - 1)}$$

where $RSS_1$ is the residual sum of squares of the least squares fit for the bigger model with $p_1 + 1$ parameters, and $RSS_0$ is the same for the smaller, nested model with $p_0 + 1$ parameters. Because the models are nested, $p_1 > p_0$ and $RSS_0 \ge RSS_1$.

The F-statistic measures the change in RSS per additional parameter in the bigger model.
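The Z-scores and the F-statistic can be computed directly from two nested fits. This is a minimal sketch on synthetic data; the helper `fit_rss`, the data, and the choice of nested model are assumptions made for illustration.

```python
import numpy as np

def fit_rss(X, y):
    """Return the least squares coefficients and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)

rng = np.random.default_rng(1)
N = 100
inputs = rng.normal(size=(N, 3))
# In this synthetic example only the first input truly affects y
y = 1.0 + 2.0 * inputs[:, 0] + rng.normal(size=N)

X1 = np.column_stack([np.ones(N), inputs])  # bigger model: intercept + 3 inputs
X0 = X1[:, :2]                              # nested model: intercept + first input
p1, p0 = X1.shape[1] - 1, X0.shape[1] - 1

beta1, rss1 = fit_rss(X1, y)
_, rss0 = fit_rss(X0, y)

# Z-score for each coefficient of the bigger model
sigma_hat = np.sqrt(rss1 / (N - p1 - 1))
v = np.diag(np.linalg.inv(X1.T @ X1))
z = beta1 / (sigma_hat * np.sqrt(v))

# F-statistic for dropping the last two inputs at once
F = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (N - p1 - 1))
print(z, F)
```

When only a single coefficient is dropped, the F-statistic equals the square of that coefficient's Z-score, which is a useful sanity check on an implementation.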

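###### Gauss-Markov Theorem
This states that the least squares estimates of $\beta$ have the smallest variance among all linear unbiased estimates. Unbiasedness is not always the best goal, though: accepting a little bias can give a larger reduction in variance, and so a lower mean squared error.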

###### Multiple Regression from Simple Univariate Regression
When the inputs are orthogonal, they have no effect on each other's parameter estimates in the model: each multiple regression coefficient equals the corresponding univariate estimate. Orthogonal inputs almost never occur in observational data.
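A small numerical check makes this concrete. The sketch below builds inputs with exactly orthogonal columns (an artificial construction via QR, assumed here purely for illustration) and compares the multiple regression coefficients with the univariate estimates $\langle x_j, y \rangle / \langle x_j, x_j \rangle$. The intercept is left out to keep the example minimal.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 40

# Build inputs with exactly orthogonal columns via a QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(N, 3)))
X = Q * np.array([2.0, 0.5, 3.0])  # rescaling columns keeps them orthogonal
y = rng.normal(size=N)

# Multiple regression on all three inputs at once (no intercept)
beta_multi, *_ = np.linalg.lstsq(X, y, rcond=None)

# One univariate regression per input: <x_j, y> / <x_j, x_j>
beta_uni = (X.T @ y) / np.sum(X * X, axis=0)

print(np.allclose(beta_multi, beta_uni))
```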

###### Gram-Schmidt Process
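This is used to get the coefficient estimates in multiple linear regression by successive orthogonalization: each input is regressed on the previously orthogonalized inputs, and $y$ is then regressed on the final residual to get the last coefficient.  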
https://www.khanacademy.org/math/linear-algebra/alternate-bases/orthonormal-basis/v/linear-algebra-gram-schmidt-process-example
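As a rough illustration of how Gram-Schmidt yields regression coefficients, the sketch below orthogonalizes the columns of a design matrix one at a time and then regresses $y$ on the last residual vector. The function name `successive_orthogonalization` and the synthetic data are assumptions, not part of the original notes.

```python
import numpy as np

def successive_orthogonalization(X):
    """Return Z whose j-th column is the residual of column j of X
    after regressing it on the previously orthogonalized columns."""
    X = np.asarray(X, dtype=float)
    Z = np.empty_like(X)
    for j in range(X.shape[1]):
        z = X[:, j].copy()
        for k in range(j):
            zk = Z[:, k]
            # Remove the component of x_j that lies along z_k
            z -= (zk @ X[:, j]) / (zk @ zk) * zk
        Z[:, j] = z
    return Z

rng = np.random.default_rng(3)
N = 60
X = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])
y = rng.normal(size=N)

Z = successive_orthogonalization(X)

# Regressing y on the last residual gives the last multiple regression coefficient
z_p = Z[:, -1]
beta_p = (z_p @ y) / (z_p @ z_p)

beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_p, beta_full[-1])
```

The two printed numbers should agree. Reordering the columns so that a different input comes last gives that input's coefficient in the same way.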

### Subset Selection

Reasons why least squares estimates are often not satisfactory:
- Prediction accuracy: the least squares estimates often have low bias but large variance. Prediction accuracy can sometimes be improved by shrinking some coefficients or setting them to zero.
- Interpretation: with a large number of predictors, it is better to have a smaller subset that exhibits the strongest effects. Sacrifice small details to understand the bigger picture.

Forward-stepwise selection is computationally cheaper than best subset selection and, because it is a more constrained search, tends to have lower variance, but perhaps higher bias.
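A greedy forward-stepwise search can be sketched in a few lines. This is an illustrative, unoptimized version (it refits from scratch at every step); the function names `rss` and `forward_stepwise` and the synthetic data are assumptions.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def forward_stepwise(X, y, k):
    """Greedily select k columns of X; the intercept is always included."""
    N, p = X.shape
    ones = np.ones((N, 1))
    selected = []
    for _ in range(k):
        remaining = [j for j in range(p) if j not in selected]
        # Add the variable that gives the lowest RSS given those already chosen
        best = min(
            remaining,
            key=lambda j: rss(np.hstack([ones, X[:, selected + [j]]]), y),
        )
        selected.append(best)
    return selected

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 6))
# Only columns 4 and 1 carry signal in this synthetic example
y = 3.0 * X[:, 4] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

chosen = forward_stepwise(X, y, 2)
print(chosen)
```

Each step fits at most $p$ candidate models, so choosing $k$ variables costs on the order of $pk$ fits, compared with the $2^p$ subsets that best subset selection has to consider.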

### Shrinkage Methods

Subset selection methods can often exhibit high variance, but shrinkage methods do not suffer as much from high variability. 

###### Ridge Regression
$\lambda$ is a shrinkage parameter: the larger its value, the more the coefficients are shrunk toward zero. When many variables are correlated, their coefficients are poorly determined and have high variance: a wildly large positive coefficient on one variable can be cancelled by a similarly large negative coefficient on a correlated variable. One way to prevent this is to impose a size constraint on the coefficients:
$$\hat \beta^{ridge} = \operatorname{argmin}_{\beta}\ \sum_{i = 1}^N\ \left(y_i - \beta_0 - \sum_{j = 1}^p\ x_{ij}\beta_j\right)^2$$

subject to:
$$\sum_{j = 1}^p\ \beta_j^2 \le t$$

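It is normal to standardize the inputs before solving, as the ridge solutions are not equivariant under scaling of the inputs. It is also important NOT to penalize the intercept, as doing so would make the procedure depend on the origin chosen for $Y$. So center the inputs by subtracting each input's mean, estimate $\beta_0$ by $\bar y$, and fit the remaining coefficients by ridge regression without an intercept. In matrix notation, with $X$ the centered $N \times p$ input matrix, the criterion is: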
$$RSS(\lambda) = (y - X\beta)^T(y - X\beta) + \lambda \beta^T \beta$$

so the ridge regression solution is:
$$\hat \beta^{ridge} = (X^TX + \lambda I)^{-1}X^Ty$$

where $I$ is the $p \times p$ identity matrix.
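The closed-form ridge solution is short to implement. The sketch below is a minimal version under stated assumptions: it centers the inputs but does not standardize them (so it assumes they are already on comparable scales), leaves the intercept unpenalized, and uses a hypothetical helper name `ridge_fit` with synthetic, deliberately correlated data.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients from centered inputs, with an unpenalized intercept.

    The returned intercept is expressed for the original (uncentered) inputs.
    """
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    p = Xc.shape[1]
    # Solve (Xc^T Xc + lam * I) beta = Xc^T (y - y_bar)
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - y.mean()))
    intercept = y.mean() - x_mean @ beta
    return intercept, beta

rng = np.random.default_rng(5)
N = 80
x1 = rng.normal(size=N)
x2 = x1 + 0.05 * rng.normal(size=N)  # strongly correlated with x1
X = np.column_stack([x1, x2, rng.normal(size=N)])
y = 1.0 + x1 + x2 + rng.normal(size=N)

b0_ols, beta_ols = ridge_fit(X, y, 0.0)   # lam = 0 reduces to least squares
b0_r, beta_r = ridge_fit(X, y, 10.0)
print(beta_ols, beta_r)
```

With $\lambda = 0$ the function reproduces the least squares fit, and increasing $\lambda$ reduces the norm of the coefficient vector.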

The Singular Value Decomposition of the centered $N \times p$ input matrix $X$ is:
$$X = UDV^T$$

where $U$ is an $N \times p$ matrix and $V$ a $p \times p$ matrix, both with orthonormal columns. The columns of $U$ span the column space of $X$ and the columns of $V$ span the row space. $D$ is a $p \times p$ diagonal matrix whose diagonal elements $d_1 \ge d_2 \ge ... \ge d_p \ge 0$ decrease from top-left to bottom-right. Using the SVD, the least squares fitted vector can be represented as:
$$X \hat \beta^{ls} = UU^Ty$$

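With the same decomposition, the ridge fit becomes:
$$X \hat \beta^{ridge} = \sum_{j = 1}^p\ u_j\ \frac{d_j^2}{d_j^2 + \lambda}\ u_j^Ty$$

where $u_j$ are the columns of $U$. Also note that, since $\lambda \ge 0$, the factor $\frac{d_j^2}{d_j^2 + \lambda}$ is less than or equal to $1$, so ridge shrinks the coordinates of $y$ most along the directions $u_j$ with the smallest $d_j$.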
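The SVD form of the ridge fit can be verified against the direct closed-form solution. This is a minimal sketch on synthetic centered data; the dimensions, seed, and value of $\lambda$ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
N, p, lam = 50, 4, 5.0
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)  # center the inputs
y = rng.normal(size=N)

# Thin SVD: U is N x p, d holds the singular values in decreasing order
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Shrinkage factor applied to each coordinate u_j^T y
shrink = d**2 / (d**2 + lam)
fit_svd = U @ (shrink * (U.T @ y))

# Same fit from the closed-form ridge solution
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
fit_direct = X @ beta_ridge

print(shrink)
print(np.allclose(fit_svd, fit_direct))
```

The shrinkage factors come out in decreasing order, matching the decreasing singular values: directions with small $d_j$ are shrunk the most.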

###### Orthonormal Basis
Imagine having a matrix $B$ whose columns are vectors $v$ that all have length $1$ (length being the square root of the sum of a vector's squared elements); in other words, they have all been normalized and are unit vectors. Also imagine that all the vectors $v$ in $B$ are orthogonal to each other. This means the dot product of any vector $v$ with itself is $1$, and its dot product with any other vector in $B$ is $0$. The vectors form a basis for the subspace they span, because they span it and are linearly independent.

So $B$ is an orthonormal set, and this set is linearly independent. Each vector has length $1$ and the dot product of any two distinct vectors is zero, since all are orthogonal to each other.
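These properties are easy to check numerically. The sketch below obtains an orthonormal basis from a QR decomposition of a random matrix (an assumption made for illustration; any orthonormal set would do) and verifies that $B^TB = I$ and that coordinates in the basis are plain dot products.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(5, 3))

# Columns of B are orthonormal and span the column space of A
B, _ = np.linalg.qr(A)

# Gram matrix: ones on the diagonal (unit length), zeros elsewhere (orthogonal)
gram = B.T @ B

# For a vector in the subspace, the coordinates are just dot products with the basis
x = A @ np.array([1.0, -2.0, 0.5])
coords = B.T @ x
x_rebuilt = B @ coords

print(np.round(gram, 10))
```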