## 2 Statistical Learning
### 2.1 What Is Statistical Learning?
- $X$ is introduced as a symbol for the input variables, but it is also used for the whole data fed into a model (p. 15)
    - comment: so I see that $X$ is all the columns, and when they talk about "input variables" they mean the numerical values that make them up (those in the column) and not their meaning (or name as a string)

- synonyms for $X$: "features", "predictors", "independent variables", "variables"
- synonyms for $Y$: "target", "response", "dependent variable"

- we assume a relationship between $X$ and $Y$ which we express like this: $Y = f(X) + \epsilon$
    - $f$ is some fixed but unknown function
        - we want to estimate $f$ based on the observed points
    - $\epsilon$ is a random error term, which is independent of $X$ and has mean zero
        - comment: $\epsilon$ having a mean of zero is an assumption in predictive modeling and calculations are based on it. In reality, I would think, we might miss an important feature that is not part of the data but influences $Y$; then, I believe, $\epsilon$ most probably doesn't really have mean zero.
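
The model $Y = f(X) + \epsilon$ can be made concrete with a small simulation. This is only a sketch under stated assumptions: the linear `f`, the uniform range of `X`, and the normal noise with standard deviation 1.5 are all made up for illustration; in a real problem $f$ is unknown and $\epsilon$ is never observed directly.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable


def f(x):
    """Hypothetical 'true' function (an assumption; in practice f is unknown)."""
    return 2.0 + 3.0 * x


n = 10_000
X = rng.uniform(0.0, 10.0, size=n)             # input variable
eps = rng.normal(loc=0.0, scale=1.5, size=n)   # error term: mean zero, drawn independently of X
Y = f(X) + eps                                 # observed response

# Both values should be close to zero, matching the two assumptions on eps.
print("sample mean of eps:", eps.mean())
print("sample correlation of X and eps:", np.corrcoef(X, eps)[0, 1])
```

On the mean-zero assumption: if the noise had a nonzero mean $c$, the same data could equally be described by the function $f(X) + c$ together with a zero-mean error $\epsilon - c$, so a constant offset can always be absorbed into $f$.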

#### 2.1.1 Why Estimate f ?
##### Prediction
- we can predict a probable output using $\hat{Y} = \hat{f}(X)$, since the error term averages to zero
    - $\hat{Y}$ is our prediction and $\hat{f}$ is our estimate for $f$
    
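- "irreducible error" is the error introduced by the term $\epsilon$; even a perfect estimate of $f$ could not predict it
- "reducible error" comes from $\hat{f}$ being an imperfect estimate of $f$; it can be reduced by choosing a more suitable model and tuning its hyperparameters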

- "Why is the irreducible error larger than zero? The quantity $\epsilon$ may contain unmeasured variables that are useful in predicting Y: since we don’t measure them, $f$ cannot use them for its prediction." (p. 17/18)
    - comment: yes, but in those cases $\epsilon$'s mean might not be zero, I think (and if so: does this affect the estimate's reliability in some way?)

- Prediction Error Decomposition: $\begin{aligned} \mathbb{E}\left[(Y - \hat{Y})^2\right] &= \mathbb{E}\left[(f(X) + \epsilon - \hat{f}(X))^2\right] \\ &= \left[f(X) - \hat{f}(X)\right]^2 + \text{Var}(\epsilon) \end{aligned}$

    - $\mathbb{E}\left[(Y - \hat{Y})^2\right]$ is the expected squared difference (prediction error) between the true value $Y$ and the predicted value $\hat{Y}$
        - researched: [Expected value](https://en.wikipedia.org/wiki/Expected_value) is the (weighted) mean of a random variable; conventionally it is written with square brackets. We can think of it as a long-run average outcome.
        - comment: We CANNOT rewrite this as $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ (assuming equal weights), because the expected value is a population-level concept. The sum, in contrast, is a sample-level quantity: the mean squared error (MSE). Since the whole population is usually not known, in practical applications people use the sample MSE as a proxy.
    
    - comment: The first line of the formula simply substitutes $Y = f(X) + \epsilon$ and $\hat{Y} = \hat{f}(X)$ into the expected squared prediction error.

    - comment: To get from the first line of the formula to the second, we expand $(f(X) + \epsilon - \hat{f}(X))^2$ algebraically and then take the expectation of each resulting term. We will then see that the error splits into two components.
        1. Algebraic expansion: $(f(X) + \epsilon - \hat{f}(X))^2$ becomes:<br>
        $(f(X) - \hat{f}(X))^2 + 2(f(X) - \hat{f}(X))\epsilon + \epsilon^2$, <br>
        because $(a + b - c)^2$ can be expanded to $(a - c)^2 + 2(a - c)b + b^2$.<br>
        This is done by expanding fully (multiplying every term by every term) and then regrouping: $a^2 - 2ac + c^2$ is the perfect square $(a - c)^2$, and $2ab - 2bc$ factors to $2(a - c)b$. Now the error is broken into three parts.
        
        2. Now we take the expectation of each of the three terms in $(f(X) - \hat{f}(X))^2 + 2(f(X) - \hat{f}(X))\epsilon + \epsilon^2$
            - $\mathbb{E}\left[(f(X) - \hat{f}(X))^2 \right] = (f(X) - \hat{f}(X))^2$ <br>
            Here $X$ and $\hat{f}$ are treated as fixed (and $f$ is fixed by assumption), so this quantity is a constant and its expectation is just itself. (The expected value of a random variable is the average of all its possible values, weighted by their probabilities. Here there is only one possible value, and its probability is 1.)

            - $\mathbb{E}\left[2(f(X) - \hat{f}(X))\epsilon \right] = 0$ <br>
            With $X$ and $\hat{f}$ fixed, the factor $2(f(X) - \hat{f}(X))$ is a constant and can be pulled out of the expectation, leaving $2(f(X) - \hat{f}(X))\,\mathbb{E}\left[\epsilon\right]$. The entire term vanishes, since $\mathbb{E}\left[\epsilon\right] = 0$.

            - $\mathbb{E}\left[\epsilon^2\right] = \text{Var}(\epsilon)$ <br>
            By definition, $\text{Var}(\epsilon) = \mathbb{E}\left[\epsilon^2\right] - (\mathbb{E}\left[\epsilon\right])^2$. Because we assume $\mathbb{E}\left[\epsilon\right] = 0$ (the error term has a mean of zero), the second part drops out and the expectation of $\epsilon^2$ is the variance of the noise.
            - researched: [Variance](https://en.wikipedia.org/wiki/Variance#Definition) is the average squared deviation from the mean.

    - $\left[f(X) - \hat{f}(X)\right]^2$ represents the reducible error
        - comment: It is written with square brackets (instead of round ones) even though the $\mathbb{E}$ (which conventionally comes with square brackets) has vanished; the two uses of square brackets are unrelated.

    - $\text{Var}(\epsilon)$ represents the irreducible error 

    - comment: the formula shows how the expected squared prediction error of an estimate can be decomposed into a reducible and an irreducible part
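
The decomposition above can be checked numerically. The following is a rough sketch, not a proof: `f`, `f_hat`, the point `x0`, and the noise level `sigma` are all hypothetical choices made for illustration. It holds $X$ and $\hat{f}$ fixed, draws many values of $\epsilon$, and compares the simulated average of $(Y - \hat{Y})^2$ with the sum of the reducible and irreducible parts.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed so the run is repeatable


def f(x):
    """Hypothetical true function (assumption for illustration)."""
    return 2.0 + 3.0 * x


def f_hat(x):
    """Hypothetical fixed estimate of f, deliberately a bit off."""
    return 2.5 + 2.8 * x


x0 = 4.0       # X is held fixed at a single point
sigma = 1.5    # standard deviation of eps, so Var(eps) = sigma**2
n = 200_000    # number of simulated draws of eps

eps = rng.normal(loc=0.0, scale=sigma, size=n)
Y = f(x0) + eps          # many realisations of Y at the same x0
Y_hat = f_hat(x0)        # the prediction is a single fixed number

# Sample average standing in for the population-level E[(Y - Y_hat)^2]
lhs = np.mean((Y - Y_hat) ** 2)

reducible = (f(x0) - f_hat(x0)) ** 2   # [f(X) - f_hat(X)]^2
irreducible = sigma ** 2               # Var(eps)

print("simulated E[(Y - Y_hat)^2]:", lhs)
print("reducible + irreducible:   ", reducible + irreducible)
```

With these made-up numbers the reducible part is about 0.09 and the irreducible part is 2.25, so the two printed values should agree up to simulation noise. Improving `f_hat` can shrink only the first part; the second stays at `sigma ** 2` regardless.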

##### Inference

#### doubts and open questions
- "The input variables are typically denoted using the symbol $X$, with a subscript to distinguish them. So $X_1$ might be the TV budget, $X_2$ the radio budget, and $X_3$ the newspaper budget." (p. 15)
    - this contradicts the notation conventions from p. 9, where they write "At other times we will instead be interested in the columns of $X$, which we write as $\boldsymbol{x}_1$, $\boldsymbol{x}_2$, ..., $\boldsymbol{x}_p$. Each is a vector of length n."
    - how can I know or trust a meaning of $X_1$ or $\boldsymbol{x}_1$ in a formula if it is so ambiguous?