# 2 Statistical Learning

## 2.1 Statistical Learning

#### Why Estimate f

###### Prediction
$$E(Y - \hat Y) ^ 2 = [f(X) - \hat f(X)] ^ 2 + Var(\epsilon)$$

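Where $E(Y - \hat Y) ^ 2$ is the expected squared difference between the actual and predicted values of $Y$ (treating $\hat f$ and $X$ as fixed), $[f(X) - \hat f(X)] ^ 2$ is the reducible error, and $Var(\epsilon)$ is the irreducible error. The irreducible error places an upper bound on the accuracy of any prediction of $Y$.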
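
A minimal numerical sketch of this decomposition. The true function `f`, the estimate `f_hat`, the value of `x`, and the noise level `sigma` are all assumptions chosen for illustration, not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical true relationship (an assumption for illustration).
    return 2.0 * x

def f_hat(x):
    # Hypothetical fitted model, deliberately a bit off from f.
    return 1.8 * x

x = 1.5                      # fixed predictor value
sigma = 1.0                  # standard deviation of the noise term
eps = rng.normal(0.0, sigma, size=200_000)

y = f(x) + eps               # simulated responses at this x
y_hat = f_hat(x)             # prediction at this x (a fixed number)

expected_sq_error = np.mean((y - y_hat) ** 2)   # Monte Carlo estimate of E(Y - Y_hat)^2
reducible = (f(x) - f_hat(x)) ** 2              # [f(X) - f_hat(X)]^2
irreducible = sigma ** 2                        # Var(eps)

print(expected_sq_error, reducible + irreducible)
```

The two printed numbers should be close. Improving `f_hat` can shrink only the reducible term; the `sigma ** 2` part remains no matter how good the estimate is.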

###### Inference
Less focus on predicting $Y$ and more focus on understanding how $Y$ changes as a function of the predictors $X_1, \dots, X_p$.

#### How do we Estimate f

###### Parametric Models
Two-step approach:
- Make an assumption about the functional form (shape) of $f$. An example could be the linear model $f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$.
- After a model has been selected, we need a procedure that uses the training data to fit (train) the model. For the example above, this means estimating the coefficients $\beta_0, \beta_1, \beta_2$ so that the fitted function predicts $Y$ as closely as possible. The most common fitting method is (ordinary) least squares.

The issue with the parametric approach is that the chosen functional form will usually not match the true, unknown form of $f$. If the chosen form is too far from the truth, the estimate will be poor.
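
The two-step parametric approach can be sketched with NumPy as follows. The synthetic data and the coefficients in `true_beta` are assumptions made up for this example:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training data (coefficients are assumptions for illustration).
n = 200
X = rng.normal(size=(n, 2))
true_beta = np.array([1.0, 2.0, -3.0])          # beta_0, beta_1, beta_2
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=n)

# Step 1: assume the linear form f(X) = beta_0 + beta_1 X_1 + beta_2 X_2.
design = np.column_stack([np.ones(n), X])       # column of ones for beta_0

# Step 2: estimate the coefficients by least squares.
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

print(beta_hat)
```

Because the data here really are linear, `beta_hat` lands close to `true_beta`. With a non-linear truth the same code would still run, but the fitted line would be a biased approximation.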

###### Non-parametric Models
Do not make an explicit assumption about the functional form of $f$. These models try to find an estimate of $f$ that fits the data as closely as possible without being too rough or wiggly. The drawback is that non-parametric models do not reduce the problem of estimating $f$ to a small number of parameters, so a very large number of observations is required to obtain an accurate estimate.

## 2.2 Assessing Model Accuracy

#### Measuring Quality of Fit

Mean-Squared Error (MSE): $MSE = \frac{1}{n} \sum_{i = 1}^n (y_i - \hat f(x_i))^2$
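
As a small sketch, the MSE can be computed directly from observed and predicted responses. The function name `mean_squared_error` and the toy numbers are illustrative choices:

```python
import numpy as np

def mean_squared_error(y, y_pred):
    """Average squared difference between observed and predicted responses."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y - y_pred) ** 2)

y_obs = [3.0, 5.0, 7.0]      # observed responses y_i
y_fit = [2.0, 5.0, 9.0]      # predictions f_hat(x_i)
mse = mean_squared_error(y_obs, y_fit)
print(mse)                   # (1 + 0 + 4) / 3
```

Computed on the training observations this is the training MSE; computed on observations not used for fitting it is the test MSE.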

#### Bias-Variance Trade-Off

The expected MSE can be decomposed into three quantities:
- Variance
- Squared Bias
- Variance of Error Terms

$$E(y_0 - \hat f(x_0)) ^ 2 = Var(\hat f(x_0)) + [Bias(\hat f(x_0))] ^ 2 + Var(\epsilon)$$
- $E(y_0 - \hat f(x_0)) ^ 2$ is the expected test MSE at $x_0$: the average test MSE we would obtain if we repeatedly estimated $f$ using a large number of training sets and tested each estimate at $x_0$.
- The best method achieves low variance and low squared bias at the same time. The expected test MSE can never lie below $Var(\epsilon)$, the irreducible error.

###### Variance
Refers to the amount by which $\hat f$ would change if we estimated it using a different training data set. If a method has high variance, then small changes in the training data can result in large changes in $\hat f$. In general, more flexible models have higher variance.

###### Bias
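Error introduced by approximating a real-life problem, which may be extremely complicated, with a much simpler model. Linear regression, for example, assumes a linear relationship, which introduces bias whenever the true relationship is not linear. In general, more flexible models have less bias.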

As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases, so the expected test MSE declines (fixing underfitting). Eventually, increasing flexibility has little impact on the bias but starts to significantly increase the variance, which causes the test MSE to increase (overfitting).
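
The trade-off can be illustrated with a small simulation. This is a sketch under assumed settings: the true function `f`, the noise level `sigma`, the fixed training inputs, and the choice of polynomial degrees 1 and 5 as "inflexible" and "flexible" models are all made up for the example:

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # Hypothetical true function (an assumption for illustration).
    return np.sin(2 * np.pi * x)

n, sigma, n_sets = 30, 0.3, 500
x_train = np.linspace(0.0, 1.0, n)   # fixed inputs; only the noise changes per set
x0 = 0.25                            # test point

def bias2_and_variance(degree):
    """Estimate squared bias and variance of a polynomial fit at x0."""
    preds = np.empty(n_sets)
    for s in range(n_sets):
        # Draw a fresh training set and refit the model.
        y_train = f(x_train) + rng.normal(0.0, sigma, size=n)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[s] = np.polyval(coefs, x0)
    bias2 = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    return bias2, variance

results = {d: bias2_and_variance(d) for d in (1, 5)}
for d, (bias2, variance) in results.items():
    print(f"degree {d}: bias^2 = {bias2:.4f}, variance = {variance:.4f}, "
          f"expected test MSE = {bias2 + variance + sigma**2:.4f}")
```

In this setup the straight line has large squared bias and small variance, while the degree-5 polynomial has much smaller bias and larger variance. Both expected test MSE values stay above `sigma**2`.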

#### The Classification Setting

###### Training Error Rate
Proportion of mistakes that are made if we apply our estimate $\hat f$ to the training observations:
$$\frac{1}{n} \sum_{i = 1}^n I(y_i \neq \hat y_i)$$

Where $I(y_i \neq \hat y_i)$ is an indicator variable that equals 1 if the actual and predicted labels are not equal (misclassified) and 0 if they are equal (classified correctly).
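
A short sketch of this computation; the helper name `error_rate` and the toy labels are illustrative:

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted label differs from the true one."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # (y_true != y_pred) plays the role of the indicator I(y_i != y_hat_i).
    return np.mean(y_true != y_pred)

labels = ["a", "b", "a", "a", "b"]       # actual classes
predicted = ["a", "a", "a", "b", "b"]    # classes assigned by the classifier
rate = error_rate(labels, predicted)
print(rate)                              # 2 mistakes out of 5
```

The same function gives the test error rate when it is applied to observations that were not used for training.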

###### Test Error Rate
$$Ave(I(y_i \neq \hat y_i))$$

Here the average is taken over test observations $(x_i, y_i)$ that were not used to fit $\hat f$.

###### Bayes Classifier
The test error rate is minimized, on average, by a classifier that assigns each observation to its most likely class given its predictor values. That is, an observation with predictor vector $x_0$ receives the label $j$ for which $Pr(Y = j | X = x_0)$ is largest. The Bayes decision boundary is the set of points where the classes are equally likely; in a two-class problem it is where $Pr(Y = 1 | X = x_0) = 0.5$.

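The Bayes error rate at $X = x_0$ is $1 - \max_j Pr(Y = j | X = x_0)$. The overall Bayes error rate is $1 - E( \max_j Pr(Y = j | X ) )$, where the expectation averages over all possible values of $X$. It is analogous to the irreducible error.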
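
A worked sketch of the Bayes classifier and its error rate, assuming a made-up discrete setting in which the conditional probabilities are known exactly (in practice they are not):

```python
import numpy as np

# Hypothetical setup: X takes three values, Y is one of two classes.
p_x = np.array([0.5, 0.3, 0.2])             # Pr(X = x) for each value of x
# Row i holds [Pr(Y = 0 | X = x_i), Pr(Y = 1 | X = x_i)].
p_y_given_x = np.array([
    [0.9, 0.1],
    [0.4, 0.6],
    [0.5, 0.5],
])

# Bayes classifier: pick the class with the largest conditional probability.
# (argmax breaks the tie in the last row by taking the first class.)
bayes_labels = p_y_given_x.argmax(axis=1)

# Error rate at each x, then the overall rate as an expectation over X.
error_at_x = 1.0 - p_y_given_x.max(axis=1)
bayes_error = np.sum(p_x * error_at_x)

print(bayes_labels, error_at_x, bayes_error)
```

The third value of $X$ sits on the decision boundary (both classes have probability 0.5), so any classifier is wrong there half of the time.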

###### K-Nearest Neighbors
$$Pr(Y = j | X = x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j)$$

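Here $N_0$ is the set of the $K$ training observations closest to $x_0$. KNN estimates the conditional probability this way, then applies the Bayes rule and classifies the test observation $x_0$ to the class with the largest estimated probability. As $\frac{1}{K}$ increases (that is, as $K$ decreases), the method becomes more flexible, which lowers the training error rate but can raise the test error rate through overfitting.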
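
A minimal KNN sketch in NumPy. The function name `knn_predict`, the use of Euclidean distance, and the toy training data are assumptions made for this example:

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k):
    """Classify x0 by the largest estimated class probability among k neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from x0 to every training observation.
    dists = np.linalg.norm(X_train - np.asarray(x0, dtype=float), axis=1)
    neighbors = np.argsort(dists)[:k]          # indices of the k nearest points (N_0)
    classes, counts = np.unique(y_train[neighbors], return_counts=True)
    probs = counts / k                         # estimated Pr(Y = j | X = x0)
    return classes[np.argmax(probs)], dict(zip(classes.tolist(), probs.tolist()))

X_train = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
y_train = [0, 0, 0, 1, 1, 1]

label3, probs3 = knn_predict(X_train, y_train, x0=[0.6, 0.6], k=3)
label5, probs5 = knn_predict(X_train, y_train, x0=[0.6, 0.6], k=5)
print(label3, probs3)
print(label5, probs5)
```

With `k=1` every training point is its own nearest neighbor, so the training error rate is zero. That is the most flexible setting and the one most prone to overfitting.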