# Supervised Learning: An Introduction

## Least-squares

The assumption is that $f(X) = E(Y|X)$ is linear.

We assume that $ Y = E(Y|X) + \epsilon$

$$\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^p X_j \hat{\beta}_j = X^T \hat{\beta}$$

where the $\beta_0$ term is the model *bias* (the intercept); the compact form $X^T \hat{\beta}$ assumes a constant 1 is included in $X$ so that $\hat{\beta}_0$ is part of $\hat{\beta}$. The gradient $f^\prime(X) = \beta$ is a vector in input space that points in the steepest uphill direction. A simple method to fit the model is *least squares*: we pick the coefficients $\beta$ that minimize the residual sum of squares

$$RSS(\beta) = \sum_{i=1}^N (y_i - x_i^T \beta)^2$$

which is a quadratic function of the parameters. Therefore, a minimum always exists but may not be unique. In matrix notation,

$$RSS(\beta) = (y - X \beta)^T (y - X \beta)$$

where $X$ is an $N \times p$ matrix with each row a sample, and $y$ is an $N$-vector of the outputs in the training set. Differentiating w.r.t. $\beta$ and setting the derivative to zero, we get the normal equations

$$X^T (y - X \beta) = 0$$

If $X^T X$ is nonsingular (i.e. invertible: there is a matrix $B$ with $(X^T X) B = B (X^T X) = I$), then the unique solution is given by

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

Therefore, a solution for the best $\beta$ can be found without iteration.
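As a minimal sketch of the closed-form solution, the NumPy snippet below fits synthetic data. The data, the seed, and the coefficients in `beta_true` are made up for illustration, and a column of ones is prepended to `X` so that the intercept is part of $\beta$.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; the data below are synthetic

N, p = 200, 3
features = rng.normal(size=(N, p))
X = np.column_stack([np.ones(N), features])   # column of ones carries the intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])   # hypothetical coefficients
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Solve the normal equations X^T X beta = X^T y without forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq minimizes the same RSS and also copes with a singular X^T X.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

residual = y - X @ beta_hat
rss = residual @ residual   # RSS evaluated at the fitted coefficients
```

Solving the linear system is usually preferred over computing $(X^T X)^{-1}$ explicitly, since it is cheaper and numerically more stable.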

A "poor man's" classifier can use linear regression and predict $1(\hat{Y} > 0.5)$. Ideally, we would like to estimate $P(Y=1|X=x)$

## Nearest neighbors

For regression, kNN predicts the average of the target values of the $k$ nearest neighbors. For classification, it takes a majority vote among them.

$$\hat{y} = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$$

With a large number of variables, the neighborhood has to change (the **curse of dimensionality**). If the neighborhood is kept the same size, fewer neighbors fall inside it; to keep the same number $k$ of neighbors, the neighborhood has to expand, so it is no longer local and the bias increases. This is because each added feature inherently pushes the points further apart.

Also, as you increase $k$, a smoother surface will be formed (i.e. reduced variance).

The best $k$ can be found empirically.
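The regression rule can be sketched in a few lines of NumPy. The helper name `knn_predict`, the use of Euclidean distance, and the toy data are assumptions made for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Average the targets of the k training points closest to x (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]   # indices of the k smallest distances
    return y_train[nearest].mean()

# Toy one-dimensional training set (made up for this example).
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0, 10.0])

x_query = np.array([2.6])
pred_k1 = knn_predict(X_train, y_train, x_query, k=1)   # 3.0: copies the closest point
pred_k3 = knn_predict(X_train, y_train, x_query, k=3)   # 2.0: averages targets 3, 2, 1
```

One common way to choose $k$ empirically is to evaluate a few candidate values on held-out data and keep the one with the lowest error.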

## Bias-variance tradeoff

For a fixed $x_0$, 

$$E [\hat{f}(x_0) - f(x_0)]^2 = E[ \hat{f}(x_0) - E\hat{f}(x_0) + E\hat{f}(x_0) - f(x_0)]^2$$

$$= E[\hat{f}(x_0) - E\hat{f}(x_0)]^2 + [E\hat{f}(x_0) - f(x_0)]^2 + 2 [E\hat{f}(x_0) - f(x_0)] E[\hat{f}(x_0) - E\hat{f}(x_0)]$$

We know that $E[\hat{f}(x_0) - E\hat{f}(x_0)] = 0$, so the cross term vanishes. Therefore,

$$E [\hat{f}(x_0) - f(x_0)]^2 = Var(\hat{f}(x_0)) + bias(\hat{f}(x_0))^2$$

With $k=1$ in nearest neighbors, the bias is essentially zero. Small $k$ gives small bias but high variance. Large $k$ averages over more points, so the variance is low (as with the variance of a sample mean, which carries a $\frac{1}{n}$ term), but the bias is high.
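The decomposition can be checked numerically with a small simulation. Everything here is an assumption chosen for illustration: the true function `f`, the fixed design `x_train`, the noise level `sigma`, and the number of repetitions `n_rep`.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)   # hypothetical true regression function

x_train = np.linspace(0.0, 1.0, 50)   # fixed design, reused in every repetition
x0, sigma, n_rep = 0.25, 0.3, 2000

def knn_at_x0(y_train, k):
    """kNN regression estimate at x0 from one noisy training sample."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

results = {}
for k in (1, 5, 25):
    # Redraw the noise n_rep times to approximate expectations over training sets.
    preds = np.array([
        knn_at_x0(f(x_train) + sigma * rng.normal(size=x_train.size), k)
        for _ in range(n_rep)
    ])
    mse = np.mean((preds - f(x0)) ** 2)
    variance = preds.var()
    bias_sq = (preds.mean() - f(x0)) ** 2
    results[k] = (mse, variance, bias_sq)   # mse equals variance + bias_sq
```

In this setup the variance shrinks as $k$ grows while the squared bias grows, which is the tradeoff described above.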

## Linear regression vs. kNN

Linear regression has high bias (linear assumption can be violated) but only needs to estimate p+1 parameters.

kNN uses roughly $\frac{n}{k}$ effective parameters but is flexible and adaptive. It has small bias but large variance.


# Linear Algebra Review

Matrix transpose: $(A^T)_{ij} = A_{ji}$ and $(AB)^T = B^T A^T$

Matrix (dot) product: $C = AB$, with entries $C_{ij} = \sum_k A_{ik} B_{kj}$

Identity matrix $I$ has a diagonal of ones and the rest zero.

Matrix inversion: $A^{-1} A = A A^{-1} =  I_n$

$$Ax = b$$

$$A^{-1} A x = A^{-1} b$$

$$I_n x = A^{-1} b$$
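The steps above can be mirrored in NumPy. The matrix `A` and vector `b` are arbitrary values chosen for this sketch.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x_via_inverse = np.linalg.inv(A) @ b   # mirrors x = A^{-1} b
x_via_solve = np.linalg.solve(A, b)    # solves A x = b without forming the inverse
```

Both give $x = (0.8, 1.4)$ here; in practice `np.linalg.solve` is usually preferred because it avoids computing $A^{-1}$ explicitly.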

Invertibility. We cannot invert a matrix if it has 1) more rows than columns, 2) more columns than rows, or 3) redundant rows ("linear dependence", "low rank").

Norms. $L^p$ norm: $\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$

The $L^2$ norm ($p=2$) is most often used. It is the Euclidean distance from the origin.
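As a small sketch, the helper below (the name `lp_norm` is made up) implements the usual definition $\|x\|_p = (\sum_i |x_i|^p)^{1/p}$ and compares the $p=2$ case with NumPy's built-in norm.

```python
import numpy as np

def lp_norm(x, p):
    """L^p norm of a vector for p >= 1."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0])
l1 = lp_norm(x, 1)          # |3| + |-4| = 7
l2 = lp_norm(x, 2)          # sqrt(9 + 16) = 5
l2_numpy = np.linalg.norm(x)   # NumPy's default vector norm is the L2 norm
```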

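Eigendecomposition: $A v = \lambda v$

If $\lambda$ is an eigenvalue of the matrix $A$, there exists a nonzero eigenvector $v$ satisfying this equation. When $A$ has $n$ linearly independent eigenvectors, stacking them as the columns of $V$ and collecting the eigenvalues in the vector $\lambda$ gives

$$A = V diag(\lambda) V^{-1}$$

Given an eigenvector $v$, we can find its eigenvalue by $\lambda = \frac{v^T A v}{v^T v}$.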

Every real symmetric matrix has a real, orthogonal eigendecomposition $A = Q \Lambda Q^T$.
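A quick numerical check of the symmetric case, using `np.linalg.eigh` on a small matrix chosen for illustration:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # real symmetric example matrix

# eigh is meant for symmetric/Hermitian input: it returns real eigenvalues in
# ascending order and orthonormal eigenvectors as the columns of Q.
eigenvalues, Q = np.linalg.eigh(A)

A_rebuilt = Q @ np.diag(eigenvalues) @ Q.T   # A = Q Lambda Q^T

# Recover the eigenvalue belonging to the first eigenvector via v^T A v / v^T v.
v1 = Q[:, 0]
rayleigh = (v1 @ A @ v1) / (v1 @ v1)
```

For this matrix the eigenvalues are 1 and 3, and `Q.T @ Q` is the identity up to rounding.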

Geometrically, take the eigenvectors $v_1, v_2$ in an $x_1, x_2$ space. Multiplying by the matrix scales the direction of $v_1$ by $\lambda_1$ (and the direction of $v_2$ by $\lambda_2$). This stretches the space.


Trace: $Tr(A) = \sum_i A_{i,i}$

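The trace is invariant under cyclic permutations of a product (provided the shapes allow each product), but not under arbitrary reorderings.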

$$Tr(ABC) = Tr(CAB) = Tr(BCA)$$
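The identity can be checked numerically. The random rectangular matrices and the three small square matrices below are arbitrary choices for this sketch; the second part shows that a non-cyclic swap can change the trace.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 2))

# Cyclic permutations: all three products are square and share the same trace.
t_abc = np.trace(A @ B @ C)
t_cab = np.trace(C @ A @ B)
t_bca = np.trace(B @ C @ A)

# A non-cyclic swap is not covered by the identity.
S1 = np.array([[1.0, 0.0], [0.0, 0.0]])
S2 = np.array([[0.0, 1.0], [0.0, 0.0]])
S3 = np.array([[0.0, 0.0], [1.0, 0.0]])
t_123 = np.trace(S1 @ S2 @ S3)   # 1.0
t_132 = np.trace(S1 @ S3 @ S2)   # 0.0
```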

# Probability and Information Theory

A probability mass or density function must satisfy $\forall x \in \mathrm{x}, p(x) \geq 0$. Additionally, $\sum_{x \in \mathrm{x}} p(x) = 1$ (discrete case) or $\int p(x) dx = 1$ (continuous case).

Computing a marginal probability with the **sum rule**

$$p(x) = \int p(x,y) dy$$


Conditional probability: $P(\mathrm{y}=y \mid \mathrm{x}=x) = \frac{P(\mathrm{y}=y, \mathrm{x}=x)}{P(\mathrm{x}=x)}$

Chain rule of probability: $P(x_1, ..., x_n) = P(x_1) \prod_{i=2}^n P(x_i | x_1, ..., x_{i-1})$

$P(x_1, x_2, x_3) = P(x_1) P(x_2, x_3 | x_1) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2)$
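For a discrete case, the sum rule, conditional probability, and the two-variable chain rule can be sketched with a small table. The joint distribution below is made up for illustration.

```python
import numpy as np

# Hypothetical joint distribution P(x, y) over two binary variables;
# rows index x, columns index y.
joint = np.array([[0.1, 0.3],
                  [0.2, 0.4]])

p_x = joint.sum(axis=1)                # sum rule: P(x) = sum_y P(x, y)
p_y_given_x = joint / p_x[:, None]     # conditional: P(y | x) = P(x, y) / P(x)
rebuilt = p_x[:, None] * p_y_given_x   # chain rule: P(x, y) = P(x) P(y | x)
```

Each row of `p_y_given_x` sums to one, and `rebuilt` reproduces `joint`.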


Independence: $p(\mathrm{x}=x, \mathrm{y}=y) = p(\mathrm{x}=x)p(\mathrm{y}=y)$

Expectation: $E_{x\sim P} [f(x)] = \sum_x P(x) f(x)$

Variance and covariance: $Var(f(x)) = E[(f(x) - E f(x))^2]$. For any random variable $Z$, $E(Z^2) = Var(Z) + (E Z)^2$.

$Cov(X,Y) = E(XY) - EX EY$
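These identities also hold for sample moments, which gives a simple numerical check. The samples below are synthetic; the slope of 2 and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(size=10_000)   # y depends on x, so Cov(X, Y) is about 2

# Cov(X, Y) = E(XY) - E(X) E(Y), using sample means as expectations.
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)

# Var(X) = E(X^2) - (E X)^2, the same identity with Y = X.
var_x = np.mean(x ** 2) - np.mean(x) ** 2

# NumPy equivalents with the same 1/n normalization.
cov_numpy = np.cov(x, y, bias=True)[0, 1]
var_numpy = np.var(x)
```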



The F distribution is the distribution of a ratio of two independent chi-squared variables, each divided by its degrees of freedom.
