# How to estimate $f$

**Parametric methods**
- Make an assumption about the functional form, or shape, of $f$, like a linear assumption.
- After a model has been selected, we use the training data to fit or train the model. For linear regression, the most common approach is (ordinary) least squares.

- **Advantages**
 - Reduces the problem of estimating $f$ down to estimating a set of parameters.
 - Simple parametric models, such as linear models, allow for straightforward and interpretable inference.
- **Disadvantage** 
 - We must assume a form for $f(X)$. If the chosen form is too far from the true $f$, the estimate will be poor, which matters most when prediction accuracy is our goal.
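
As a rough illustration of the two parametric steps above (assume a linear form, then fit it by least squares), here is a minimal sketch in plain Python. The function name `fit_simple_ols` and the toy data are made up for this example, and only the one-predictor case is covered.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Closed-form least-squares solution for y = intercept + slope * x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Toy training data (hypothetical): generated from y = 1 + 2x with no noise.
x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [1.0, 3.0, 5.0, 7.0]

intercept, slope = fit_simple_ols(x_train, y_train)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

Estimating $f$ has been reduced to estimating two numbers, which is what makes the fitted model easy to interpret.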

**Non-parametric methods**
- **Advantages**
 - Don't explicitly assume a form for $f$, so they are flexible enough to fit a wider range of possible shapes for $f$.
- **Disadvantage** 
 - Require a large number of observations to obtain an accurate estimate of $f$, since the problem is not reduced to estimating a small set of parameters.
 
The parametric approach will outperform the non-parametric approach if the parametric form that has been selected is close to the true form of $f$.
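
The claim above can be checked on synthetic data. The sketch below assumes a truly linear $f$ and compares a least-squares line with a k-nearest-neighbours average, used here as a simple stand-in for a non-parametric method. The sample sizes, noise level, seed and `k=3` are arbitrary choices for illustration.

```python
import random


def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def knn_predict(x0, xs, ys, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def true_f(x):
    # Assumed true function for this experiment: exactly linear.
    return 2.0 + 3.0 * x


rng = random.Random(0)


def sample(n):
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


x_train, y_train = sample(100)
x_test, y_test = sample(2000)

b0, b1 = fit_simple_ols(x_train, y_train)
ols_mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(x_test, y_test)) / len(x_test)
knn_mse = sum(
    (y - knn_predict(x, x_train, y_train, 3)) ** 2 for x, y in zip(x_test, y_test)
) / len(x_test)

print(f"test MSE, linear fit: {ols_mse:.3f}")
print(f"test MSE, 3-NN:       {knn_mse:.3f}")
```

If `true_f` were replaced by a strongly non-linear function, the comparison could go the other way.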

# Supervised vs unsupervised
- Supervised: for each observation of the predictor measurements $x_i$, $i = 1, \dots, n$, there is an associated response measurement $y_i$.
- Unsupervised: for each observation $i = 1, \dots, n$, we observe a vector of measurements $x_i$ but no associated response $y_i$.

# The trade-off between prediction accuracy and model interpretability
<img src="images/1.png" width="500">

- If we are mainly interested in inference, we can choose restrictive models, as they are more interpretable.
- Even if interpretability is not a concern, the most flexible model is not necessarily the best because of potential overfitting. We can start with a less flexible method as a baseline.

# MSE

In the regression setting, the most commonly used measure of how well predictions match the observed data is the *mean squared error* (MSE):

\begin{align}
MSE&=\frac{1}{n}\sum_{i=1}^n(y_i-\hat{f}(x_i))^2
\end{align}
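
The formula translates directly into code. This is a minimal sketch; the helper name `mse` and the example numbers are made up for illustration.

```python
def mse(y_true, y_pred):
    """Mean squared error between observed responses and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical observed responses y_i and predictions f_hat(x_i).
y_obs = [3.0, 5.0, 7.0]
y_hat = [2.0, 5.0, 9.0]

# Squared errors are 1, 0 and 4, so the MSE is (1 + 0 + 4) / 3.
print(mse(y_obs, y_hat))
```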

As model flexibility increases, the training MSE will decrease, but the test MSE may not. When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data. This happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up patterns that are caused by random chance rather than by true properties of the unknown function $f$. 

However, we almost always expect the training MSE to be smaller than the test MSE, because most statistical learning methods either directly or indirectly seek to minimize the training MSE. Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller test MSE.
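
The gap between training and test MSE can be seen in a small simulation. The sketch below uses k-nearest-neighbours averaging, where a smaller `k` means a more flexible fit. The true function, noise level, sample sizes and seed are all assumptions made for illustration.

```python
import math
import random


def knn_predict(x0, xs, ys, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def true_f(x):
    # Assumed true function for this experiment.
    return math.sin(2 * math.pi * x)


rng = random.Random(1)


def sample(n):
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


x_train, y_train = sample(40)
x_test, y_test = sample(500)

results = {}
for k in (40, 10, 1):  # flexibility increases as k shrinks
    train_pred = [knn_predict(x, x_train, y_train, k) for x in x_train]
    test_pred = [knn_predict(x, x_train, y_train, k) for x in x_test]
    results[k] = (mse(y_train, train_pred), mse(y_test, test_pred))
    print(f"k={k:2d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With `k=1` every training point is its own nearest neighbour, so the training MSE is zero even though the test MSE is not. That is the most extreme version of the gap described above.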

# The Bias-Variance Trade-off

The expected test MSE, for a given value $x_0$, can always be decomposed into the sum of three quantities: the *variance* of $\hat{f}(x_0)$, the squared *bias* of $\hat{f}(x_0)$, and the variance of the error terms $\epsilon$, which is the irreducible error. That is,

\begin{align}
E(y_0-\hat{f}(x_0))^2&= Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)
\end{align}

Therefore, in order to minimize the expected test MSE, we need to select a model that simultaneously achieves low variance and low bias.
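
A sketch of where the three terms come from, under the usual assumptions (not restated in these notes) that $y_0 = f(x_0) + \epsilon$, $E(\epsilon) = 0$, and $\epsilon$ is independent of $\hat{f}(x_0)$. The shorthand $\bar{f} = E[\hat{f}(x_0)]$ is introduced here only for brevity.

\begin{align}
E(y_0-\hat{f}(x_0))^2 &= E\big(f(x_0) - \hat{f}(x_0) + \epsilon\big)^2 \\
&= E\big(f(x_0) - \hat{f}(x_0)\big)^2 + 2E\big[\epsilon\,(f(x_0) - \hat{f}(x_0))\big] + E(\epsilon^2) \\
&= E\big(f(x_0) - \hat{f}(x_0)\big)^2 + Var(\epsilon) \\
E\big(f(x_0) - \hat{f}(x_0)\big)^2 &= E\big(f(x_0) - \bar{f} + \bar{f} - \hat{f}(x_0)\big)^2 \\
&= (f(x_0) - \bar{f})^2 + E\big(\hat{f}(x_0) - \bar{f}\big)^2 \\
&= [Bias(\hat{f}(x_0))]^2 + Var(\hat{f}(x_0))
\end{align}

The cross term in the second line vanishes because $\epsilon$ has mean zero and is independent of $\hat{f}(x_0)$. The cross term in the fifth line vanishes because $E(\bar{f} - \hat{f}(x_0)) = 0$.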

- Variance

 - Variance refers to the amount by which the predicted value would change if we estimated it using a different training data set. It's the degree to which these predictions vary between model iterations.
 - Ideally, the predicted value should not vary too much between training sets. If a method has high variance, then small changes in the training data can result in large changes in the predicted value. In general, more flexible statistical methods have higher variance.
 > Mathematically, variance is the expected squared difference between the model's long-term average prediction over many datasets $D$, $h = E_D[h_D(x)]$, and its prediction in a representative run on a single dataset $D$, $\hat{y} = h_D(x)$.
 
 \begin{align}
 Variance = E[(h-\hat{y})^2]
 \end{align}



- Bias

 - Bias arises from the simplifying assumptions our models make when approximating complex real-world problems. It measures how far off, on average, these models' predictions are from the correct value.
 - Generally, more flexible methods result in less bias.
  > Mathematically, bias-squared is the squared difference between the true target variable $y$ (or the best possible prediction for $x$, $f(x)$) and the model's long-term average prediction over many datasets $D$, $h = E_D[h_D(x)]$.
  
 \begin{align}
 Bias^2 = E[(f-h)^2]
 \end{align}
 
 
 
For more flexible methods, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test MSE increases or decreases. In general, the bias tends to decrease faster than the variance increases at the beginning. But at some point, increasing the flexibility of the model begins to have little impact on the bias while significantly increasing the variance.

<img src="images/2.png" width="500">
<center>As a result, the test MSE curve is typically U-shaped.</center>
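
The trade-off can also be estimated numerically by refitting a model on many simulated training sets and looking at its predictions at a single point $x_0$. The sketch below again uses k-nearest-neighbours averaging, with smaller `k` standing in for more flexibility. The true function, noise level, `X0`, sample sizes and seed are assumptions chosen for illustration.

```python
import math
import random


def knn_predict(x0, xs, ys, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def true_f(x):
    # Assumed true function for this experiment.
    return math.sin(2 * math.pi * x)


rng = random.Random(2)
NOISE_SD = 0.3   # standard deviation of the error term epsilon
N_TRAIN = 40     # observations per simulated training set
N_REPS = 500     # number of simulated training sets
X0 = 0.25        # point at which bias and variance are evaluated


def sample_training_set():
    xs = [rng.uniform(0.0, 1.0) for _ in range(N_TRAIN)]
    ys = [true_f(x) + rng.gauss(0.0, NOISE_SD) for x in xs]
    return xs, ys


def bias_variance_at_x0(k):
    """Estimate (squared bias, variance) of the k-NN prediction at X0."""
    preds = []
    for _ in range(N_REPS):
        xs, ys = sample_training_set()
        preds.append(knn_predict(X0, xs, ys, k))
    mean_pred = sum(preds) / len(preds)
    variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
    bias_sq = (true_f(X0) - mean_pred) ** 2
    return bias_sq, variance


summary = {k: bias_variance_at_x0(k) for k in (40, 10, 1)}
for k, (bias_sq, variance) in summary.items():
    # Sum of the three terms of the decomposition at X0.
    expected_test_mse = bias_sq + variance + NOISE_SD ** 2
    print(f"k={k:2d}  bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"sum={expected_test_mse:.4f}")
```

In this setup the least flexible fit (`k=40`, a global average) is dominated by bias, while the most flexible fit (`k=1`) is dominated by variance.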