# Least Squares

Linear Models and Least Squares.

Given a vector of inputs $X^T = (X_1, X_2, \ldots, X_p)$, we predict the output $Y$ via the model:

$\hat{Y} = \hat{\beta}_0 + \sum\limits_{j=1}^{p}X_j\hat{\beta}_j$

$\hat{\beta}_0$ is the intercept, also known as the _bias_ in machine learning. 

Often it is convenient to include the constant variable 1 in $X$, include $\hat{\beta}_0$ in the vector of coefficients $\hat{\beta}$, and then write the linear model in vector form as an inner product.


$\hat{Y} = X^{T}\hat{\beta}$,

where $X^T$ denotes vector or matrix transpose ($X$ being a column vector). Here we are modeling a single output, so $\hat{Y}$ is a scalar; in general $\hat{Y}$ can be a $K$-vector, in which case $\beta$ would be a $p \times K$ matrix of coefficients. In the $(p+1)$-dimensional input-output space, $(X,\hat{Y})$ represents a hyperplane. If the constant is included in $X$, then the hyperplane includes the origin and is a subspace; if not, it is an affine set cutting the $Y$-axis at the point $(0,\hat{\beta}_0)$. From now on we assume that the intercept is included in $\hat{\beta}$.
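To make the "include the constant variable 1" convention concrete, here is a minimal NumPy sketch. The helper names `add_intercept_column` and `predict`, and all numeric values, are illustrative assumptions rather than anything from the text; the point is only that the explicit-intercept form and the inner-product form give the same predictions.

```python
import numpy as np

def add_intercept_column(X):
    """Prepend a column of ones so the intercept becomes part of beta.

    Hypothetical helper for illustration; X has one input vector per row.
    """
    X = np.asarray(X, dtype=float)
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X])

def predict(X_aug, beta_hat):
    """Vector form: each prediction is the inner product of a row with beta_hat."""
    return X_aug @ beta_hat

# Illustrative values only (assumed, not from the text).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
beta_0 = 0.5                      # intercept
beta = np.array([2.0, -1.0])      # slopes

# Explicit form: intercept plus weighted sum of inputs.
explicit = beta_0 + X @ beta

# Vector form: constant 1 folded into X, intercept folded into beta_hat.
X_aug = add_intercept_column(X)
beta_hat = np.concatenate([[beta_0], beta])
vector_form = predict(X_aug, beta_hat)
```

Both `explicit` and `vector_form` hold the same three predictions, which is why the intercept can be absorbed into $\hat{\beta}$ without loss of generality.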

Viewed as a function over the $p$-dimensional input space, $f(X) = X^T\beta$ is linear, and the gradient $f^\prime(X) = \beta$ is a vector in input space that points in the steepest uphill direction.

There are many different methods to fit a linear model to training data. The method of _least squares_ is the most popular. In this approach, we pick the coefficients $\beta$ to minimize the residual sum of squares

$RSS(\beta) = \sum\limits_{i=1}^{N}(y_i - x_i^T\beta)^2$
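As a rough illustration of this criterion, the sketch below evaluates the residual sum of squares for a candidate $\beta$ in two ways: a literal loop over training pairs, and a vectorized version that stacks the $x_i$ as rows of an array. The function names and the toy data are assumptions made for the example.

```python
import numpy as np

def rss_sum(X, y, beta):
    """RSS as a literal sum over the N training pairs (x_i, y_i)."""
    return sum((y_i - x_i @ beta) ** 2 for x_i, y_i in zip(X, y))

def rss_vectorized(X, y, beta):
    """Same quantity, computed from the residual vector y - X beta."""
    residuals = y - X @ beta
    return residuals @ residuals

# Toy data (assumed): the first column is the constant 1 for the intercept.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0, 4.0])
beta = np.array([1.0, 1.0])   # candidate coefficients: predictions 1, 2, 3

rss_a = rss_sum(X, y, beta)
rss_b = rss_vectorized(X, y, beta)
```

With these values the residuals are 0, 0, and 1, so both functions return an RSS of 1.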

$RSS(\beta)$ is a quadratic function of the parameters, and hence its minimum always exists, but may not be unique. The solution is easiest to characterize in matrix notation. We can write

$RSS(\beta) = (y - X\beta)^T(y-X\beta)$,

where $X$ is an $N \times p$ matrix with each row an input vector, and $y$ is an $N$-vector of the outputs in the training set. Differentiating w.r.t. $\beta$ we get the _normal equations_

$X^T(y-X\beta) = 0$
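The differentiation step can be spelled out as follows. This is a short sketch using standard matrix calculus; the intermediate expansion is added here for clarity. Expanding the quadratic form gives

$RSS(\beta) = y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta$

and differentiating with respect to $\beta$ gives

$\frac{\partial RSS}{\partial \beta} = -2X^Ty + 2X^TX\beta = -2X^T(y - X\beta)$

$\frac{\partial^2 RSS}{\partial \beta \, \partial \beta^T} = 2X^TX$

Setting the first derivative to zero and dropping the factor $-2$ yields the normal equations. Because $2X^TX$ is positive semidefinite, any solution of the normal equations is a minimizer of $RSS(\beta)$.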

If $X^TX$ is nonsingular, then the unique solution is given by

$\hat{\beta} = (X^TX)^{-1}X^Ty$,

and the fitted value at the $i$th input $x_i$ is $\hat{y}_i = \hat{y}(x_i) = x_i^T\hat{\beta}$. At an arbitrary input $x_0$ the prediction is $\hat{y}(x_0) = x_0^T\hat{\beta}$. The entire fitted surface is characterized by the $p$ parameters $\hat{\beta}$. Intuitively, it seems that we do not need a very large data set to fit such a model.
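The closed-form solution can be sketched in NumPy as below. The function name `fit_least_squares` and the toy data are assumptions for illustration. The sketch solves the linear system $X^TX\beta = X^Ty$ with `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical choice, not something the text prescribes; it assumes $X^TX$ is nonsingular.

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve the normal equations X^T X beta = X^T y for beta.

    Assumes X^T X is nonsingular; np.linalg.solve raises LinAlgError otherwise.
    """
    XtX = X.T @ X
    Xty = X.T @ y
    return np.linalg.solve(XtX, Xty)

# Toy training set (assumed): y = 1 + 2 * x exactly, first column is the constant 1.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

beta_hat = fit_least_squares(X, y)

# Fitted values at the training inputs.
y_hat = X @ beta_hat

# Prediction at a new input x_0 (constant 1 included).
x_0 = np.array([1.0, 4.0])
y_hat_0 = x_0 @ beta_hat

# The normal equations should hold at the solution (up to rounding).
normal_eq_residual = X.T @ (y - X @ beta_hat)
```

Because the toy outputs lie exactly on a line, `beta_hat` recovers intercept 1 and slope 2 up to floating-point rounding. When $X^TX$ is singular or ill-conditioned, `np.linalg.lstsq(X, y, rcond=None)` is one alternative that returns a minimum-norm solution.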

# Nearest-Neighbor Methods

Nearest-neighbor methods use those observations in the training set $\tau$ closest in input space to $x$ to form $\hat{Y}$. Specifically, the $k$-nearest-neighbor fit for $\hat{Y}$ is as follows:

$\hat{Y}(x) = \frac{1}{k}\sum\limits_{x_i\in{}N_k(x)}y_i$
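Here $N_k(x)$ is the neighborhood of $x$ made up of the $k$ training points $x_i$ closest to it. The following is a minimal sketch of that average, assuming Euclidean distance as the notion of closeness (the formula itself does not fix a metric). The function name `knn_predict` and the toy data are illustrative.

```python
import numpy as np

def knn_predict(x, X_train, y_train, k):
    """Average the outputs of the k training points closest to x.

    Closeness is measured by Euclidean distance (an assumption of this sketch).
    """
    distances = np.linalg.norm(X_train - x, axis=1)
    neighbor_idx = np.argsort(distances)[:k]   # indices of N_k(x)
    return y_train[neighbor_idx].mean()

# Toy one-dimensional training set (assumed values).
X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])

x = np.array([1.2])
y_hat_k1 = knn_predict(x, X_train, y_train, k=1)   # nearest point is x_i = 1
y_hat_k2 = knn_predict(x, X_train, y_train, k=2)   # neighbors are x_i = 1 and 2
y_hat_k4 = knn_predict(x, X_train, y_train, k=4)   # all points: the global mean
```

With $k = 1$ the fit copies the output of the single closest point, and with $k = N$ it collapses to the mean of all training outputs, so $k$ controls how local the average is.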