## 2 Statistical Learning
### 2.1 What Is Statistical Learning?
- $X$ is introduced as a symbol for the input variables, but it is also used for the whole data fed into a model (p. 15)
    - comment: so I see that $X$ is all the columns and when they talk about "input variables" they mean the numerical values that make them (those in the column) and not their meaning (or name as a string)

- synonyms for $X$: "features" "predictors", "independent variables", "variables"
- synonyms for $Y$: "target", "response", "dependent variable"

- we assume a relationship between $X$ and $Y$ which we express like this: $Y = f(X) + \epsilon$
    - $f$ is some fixed but unknown function
        - we want to estimate f based on the observed points
    - $\epsilon$  is a random error term, which is independent of $X$ and has mean zero
        - $\epsilon$ having a mean of zero is an assumption in predictive modeling and calculations are based on it. In reality, I would think, we might miss an important feature that is not part of the data but influences $Y$; then, I believe, $\epsilon$ most probably doesn't really have mean zero.

#### 2.1.1 Why Estimate f ?
##### Prediction
- we can predict a probable output using $\hat{Y} = \hat{f}(X)$, since the error term averages to zero
    - $\hat{Y}$ is our prediction and $\hat{f}$ is our estimate for $f$
    
- "irreducible error" is the error term $\epsilon$
- "reducible error" lies in the choice of our estimate ($\hat{f}$) and its hyperparams

- "Why is the irreducible error larger than zero? The quantity $\epsilon$ may contain unmeasured variables that are useful in predicting Y: since we don’t measure them, $f$ cannot use them for its prediction." (p. 17/18)
    - question: but in those cases $\epsilon$'s mean might not be zero, I think (and if so: does this affect the estimate's reliability in some way?)

- Prediction Error Decomposition: $\begin{aligned} \mathbb{E}\left[(Y - \hat{Y})^2\right] &= \mathbb{E}\left[(f(X) + \epsilon - \hat{f}(X))^2\right] \\ &= \left[f(X) - \hat{f}(X)\right]^2 + \text{Var}(\epsilon) \end{aligned}$

    - $\mathbb{E}\left[(Y - \hat{Y})^2\right]$ is the expected squared difference (prediction error) between the true value $Y$ and the predicted value $\hat{Y}$
        - [Expected value](https://en.wikipedia.org/wiki/Expected_value) is the (weighted) mean of a random variable; conventionally it is written with square brackets. We can think of it as a long-run average outcome.
        - We CANNOT rewrite this as $\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ (assuming equal weights), because the expected value is a population-level concept. That average, in contrast, is a sample-level quantity and is the same as the mean squared error (MSE). Since the whole population is usually not known, in practical applications people use the sample MSE as a proxy.
    
    - The first line of the formula simply substitutes $Y = f(X) + \epsilon$ and $\hat{Y} = \hat{f}(X)$ into the expected squared prediction error.

    - For stepping from the first line of the formula to the second line, we need to expand $(f(X) + \epsilon - \hat{f}(X))^2$ algebraically and then calculate the expectation of all its elements. We will then see that we can single out two components of the error.
        1. Algebraic expansion: $(f(X) + \epsilon - \hat{f}(X))^2$ becomes:</br>
        $(f(X) - \hat{f}(X))^2 + 2(f(X) - \hat{f}(X))\epsilon + \epsilon^2$, </br>
        because $(a + b - c)^2$ can be expanded to $(a - c)^2 + 2(a - c)b + b^2$.</br>
        This is done by expanding totally (multiplying each term with each term) and then grouping together anew by recognising a perfect square and factoring. Now the error is broken into three parts.
        
        2. Now we take the expectation of all the three terms in $(f(X) - \hat{f}(X))^2 + 2(f(X) - \hat{f}(X))\epsilon + \epsilon^2$
            - $\mathbb{E}\left[(f(X) - \hat{f}(X))^2 \right] = (f(X) - \hat{f}(X))^2$ </br>
            Since $f(X)$ and $\hat{f}(X)$ are both fixed (for a given $X$ and a given estimate $\hat{f}$), the expectation of this term is just the term itself. (When we take the expectation of a fixed quantity, it remains itself, because the expected value of a random variable is the weighted average of all its possible values, based on their probabilities. But here, there is only one possible value (the fixed one) and its probability is 1.)

            - $\mathbb{E}\left[2(f(X) - \hat{f}(X))\epsilon \right] = 0$ </br>
            the middle term vanishes, since $2(f(X) - \hat{f}(X))$ is a fixed factor that can be pulled out of the expectation, and the expectation of $\epsilon$ is zero
            
            - $\mathbb{E}\left[\epsilon^2\right] = \text{Var}(\epsilon)$ </br>
            This is a basic property from statistics, which holds when $\mathbb{E}\left[\epsilon\right] = 0$ (meaning we assume that the error term $\epsilon$ has a mean of zero). The expectation of $\epsilon^2$ is the variance of the noise.
            - [Variance](https://en.wikipedia.org/wiki/Variance#Definition) is the average squared deviation from its mean, for instance the variance of $X$ would be: $\text{Var}(X) = \mathbb{E}\left[(X - \mu)^2 \right]$, where $\mu$ is the population mean. Here we are talking about the variance of the random variable $\epsilon$, so if the mean of $\epsilon$ is assumed to be zero, we are left with $\mathbb{E}\left[\epsilon^2\right]$ as a value for the variance and this is what we find here in this formula. 

    - $\left[f(X) - \hat{f}(X)\right]^2$  represents the reducible error
        - It has square brackets (instead of round ones) even though the $\mathbb{E}$ (which always comes with square brackets) has vanished; the two uses of brackets are unrelated.

    - $\text{Var}(\epsilon)$ represents the irreducible error 

    - the formula shows how the average prediction error of an estimate can be decomposed into a reducible and an irreducible part
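
The decomposition can be checked numerically. The following is a minimal sketch, not part of ISL: the "true" `f`, the imperfect estimate `f_hat`, the input value `x` and the noise level `sigma` are all made-up assumptions. It simulates many responses at one fixed `x` and compares the average squared prediction error with the sum of the reducible and irreducible parts.

```python
import random
import statistics

random.seed(0)

def f(x):
    """True (normally unknown) function; chosen arbitrarily for this demo."""
    return 2.0 * x + 1.0

def f_hat(x):
    """A deliberately imperfect, fixed estimate of f (hypothetical)."""
    return 1.8 * x + 1.3

x = 3.0            # a fixed input value
sigma = 0.5        # standard deviation of the noise, so Var(eps) = 0.25
n_draws = 200_000

# Draw many noisy responses Y = f(x) + eps and record the squared prediction error.
squared_errors = []
for _ in range(n_draws):
    eps = random.gauss(0.0, sigma)
    y = f(x) + eps
    squared_errors.append((y - f_hat(x)) ** 2)

expected_sq_error = statistics.fmean(squared_errors)   # approximates E[(Y - Y_hat)^2]
reducible = (f(x) - f_hat(x)) ** 2                     # [f(x) - f_hat(x)]^2
irreducible = sigma ** 2                               # Var(eps)

print(f"simulated E[(Y - Y_hat)^2]: {expected_sq_error:.4f}")
print(f"reducible + irreducible:    {reducible + irreducible:.4f}")
```

The two printed numbers should agree up to simulation noise. Improving `f_hat` would shrink only the reducible part; the `sigma ** 2` part stays no matter how good the estimate is.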

##### Inference
- understanding how $Y$ is associated with the predictors (rather than only predicting $Y$)
- questions like
    - Which predictors are associated with the response? 
    - What is the relationship between the response and each predictor?
    - Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
- in these scenarios, $\hat{f}$ cannot be treated like a black box

#### 2.1.2 How Do We Estimate f ?
- we want to find a function $\hat{f}$ such that $Y \approx \hat{f}(X)$ for any observation $(X, Y)$

##### Parametric Methods
- reduces the problem of estimating f down to one of estimating a set of parameters
- two step approach:
    1. make an assumption about the functional form of $f$
        - simplifies the search
        - in case we make the assumption of a linear form, we now only need to estimate the coefficients and, if included, the intercept (the "bias" term)
    2. use a procedure that fits (trains) the model on the training data
        - in case of linear model for instance using OLS (Ordinary Least Squares)
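
The two steps can be sketched for the simplest case, one predictor and a linear form. This is only an illustration under stated assumptions: the data is synthetic, the "true" intercept and slope are invented, and the closed-form least squares formulas for a single predictor are used instead of a library call.

```python
import random

random.seed(1)

# Synthetic data from a linear "truth" plus noise (all values are made up).
true_intercept, true_slope = 4.0, 2.5
xs = [random.uniform(0.0, 10.0) for _ in range(200)]
ys = [true_intercept + true_slope * x + random.gauss(0.0, 1.0) for x in xs]

# Step 1: assume the functional form f(X) = beta_0 + beta_1 * X.
# Step 2: estimate the two parameters with ordinary least squares (closed form).
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)
beta_1 = s_xy / s_xx                  # estimated slope (coefficient)
beta_0 = y_mean - beta_1 * x_mean     # estimated intercept

def f_hat(x):
    """The fitted parametric estimate of f."""
    return beta_0 + beta_1 * x

print(f"estimated intercept: {beta_0:.3f}, estimated slope: {beta_1:.3f}")
```

The whole problem of estimating $f$ has been reduced to estimating the two numbers `beta_0` and `beta_1`, which is what makes the approach parametric.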

##### Non-Parametric Methods
- no explicit assumptions about the functional form of $f$
-  since they do not reduce the problem of estimating $f$ to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for $f$
- examples: thin-plate spline, tree-based models, KNN

#### 2.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability
- we might choose a more restrictive method over a more flexible approach
    - for interpretability
    - for inference
    - to avoid overfitting

#### 2.1.4 Supervised Versus Unsupervised Learning
- supervised: for each observation of the predictor measurement(s) $x_i$, $i = 1, \dots, n$, there is an associated response measurement $y_i$
- unsupervised: for every observation $i = 1, \dots, n$, we observe a vector of measurements $x_i$ but no associated response $y_i$
- semi-supervised: we have targets only available for a subset of the observations

#### 2.1.5 Regression Versus Classification Problems
- we select statistical learning methods on the basis of whether the response is quantitative or qualitative; whether the predictors are quantitative or qualitative is less important

### 2.2 Assessing Model Accuracy
#### 2.2.1 Measuring the Quality of Fit
- quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation
- Mean Squared Error: $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{f}(x_i) \right)^2$
    - $\hat{f}(x_i)$ is the prediction that $\hat{f}$ gives for the $i$ th observation

    - often used in regression
- the MSE above is computed using the training data that was used to fit the model, so it is more precisely the training MSE
- we are also interested in the accuracy of the predictions when we apply our method to previously unseen test data
    - the term "accuracy" here is used as a general term to describe the performance of a model
- average squared prediction error on test data (test MSE): $\text{Ave}\left((y_0 - \hat{f}(x_0))^2\right)$
- we want to choose the method that gives the lowest test MSE, as opposed to the lowest training MSE (otherwise we risk overfitting)
    - This happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up some patterns that are just caused by random chance rather than by true properties of the unknown function $f$.
- As the flexibility of a statistical learning method increases, the training MSE always decreases, while the test MSE follows a U-shape, initially decreasing and then increasing. This pattern is consistent across different datasets and methods.
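- cross-validation can be used to estimate the test MSE using only the training data: the model is repeatedly fit on one part of the training data and evaluated on the held-out part (so every score still comes from the training data, but never from observations the model was fit on)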
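
The gap between training and test MSE can be made visible with a small simulation. This is a sketch under assumptions, not the book's setup: the true function `f`, the noise level and the sample sizes are invented, and a simple 1-D nearest-neighbor average is swapped in as a convenient method whose flexibility is controlled by `k` (smaller `k` means more flexible).

```python
import math
import random

random.seed(2)

def f(x):
    """True function for the simulation (an arbitrary choice)."""
    return math.sin(x)

def make_data(n):
    xs = [random.uniform(0.0, 6.0) for _ in range(n)]
    ys = [f(x) + random.gauss(0.0, 0.3) for x in xs]
    return xs, ys

def knn_predict(x0, train_x, train_y, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(xs, ys, train_x, train_y, k):
    """Mean squared error of the k-neighbor fit, evaluated on (xs, ys)."""
    return sum((y - knn_predict(x, train_x, train_y, k)) ** 2
               for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(100)
test_x, test_y = make_data(100)

# Go from inflexible (k=50) to very flexible (k=1).
results = {}
for k in (50, 10, 3, 1):
    results[k] = (mse(train_x, train_y, train_x, train_y, k),
                  mse(test_x, test_y, train_x, train_y, k))
    print(f"k={k:>2}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With `k=1` every training point is its own nearest neighbor, so the training MSE drops to zero, while the test MSE does not. That is the overfitting described above: a low training MSE alone says little about performance on unseen data.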

#### 2.2.2 The Bias-Variance Trade-Off
- the U-shape observed in the test MSE curves is the result of two competing properties of statistical learning methods: **bias** and **variance**

- expected test MSE decomposition: $\mathbb{E}\left[(y_0 - \hat{f}(x_0))^2\right] = \text{Var}(\hat{f}(x_0)) + \left[\text{Bias}(\hat{f}(x_0))\right]^2 + \text{Var}(\epsilon)$

    - the expected test MSE, for a given value $x_0$ , can always be decomposed into the sum of three fundamental quantities: </br>
    the variance of $\hat{f}(x_0)$, the squared bias of $\hat{f}(x_0)$ and the variance of the error term $\epsilon$
    
    - assumption: $x_0$ is a single test observation; the overall expected test MSE is obtained by averaging this quantity over all possible values of $x_0$ in the test set
    
    - $\text{Var}(\hat{f}(x_0))$ is how much the prediction of $\hat{f}(x_0)$ varies if we trained the model on different training sets

        - to estimate this, we calculate the standard deviation on the test scores after cross validation: </br>
        `cv_results[["test_score"]].std()`

            - (however in reality we would have to have completely new datasets, instead of re-sampling the first one)
    
    - $\text{Bias}(\hat{f}(x_0))$ is how far the model's average prediction at $x_0$ is from the true value $f(x_0)$

        - if we calculate the difference between the train and the test scores after cross validation, we can get an indication of the bias: </br>
        `cv_results["train_score"].mean() - cv_results["test_score"].mean()`

            - (however this only captures generalization ability of the model, not the deviation between the model's expected prediction and the true underlying function; meaning: when the model has train scores as bad as the test scores, then we can hardly see the bias, because it's the wrong type of model)

- we need to select a statistical learning method that simultaneously achieves low variance and low bias

- [StatQuest video on Variance and Bias](https://www.youtube.com/watch?v=EuBBz3bI-aA): 
 
    - Bias is how much the model's predictions deviate from the true underlying function across multiple training sets on average.
 
    - Variance is how much the predictions are different over several data sets.
 
    - We can think of bias as error due to incorrect model assumptions (underfitting), and variance as error due to sensitivity to training data (overfitting). The goal is to find a model that minimizes both.

- bias and variance are theoretical quantities: to actually compute them, we would need to train our model multiple times on different training sets. For the variance $\text{Var}(\hat{f}(x_0))$, we would check how the predictions $\hat{f}(x_0)$ for the same test point $x_0$ vary across these models; for the bias, we would average the predictions from these models and compare that average to the true value, using $\text{Bias}(\hat{f}(x_0)) = \mathbb{E}\left[\hat{f}(x_0) \right] - f(x_0)$

- As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease.
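
In a simulation, where the true $f$ is known and fresh training sets can be drawn at will, the procedure described above can actually be carried out. The following is a rough sketch under assumptions: `f`, the noise level `SIGMA`, the test point `X0` and the two toy models (`fit_mean`, `fit_1nn`) are all invented for illustration.

```python
import math
import random
import statistics

random.seed(3)
SIGMA = 0.3   # noise standard deviation, so Var(eps) = 0.09
X0 = 1.5      # the fixed test point x_0

def f(x):
    """True function for the simulation (an arbitrary choice)."""
    return math.sin(x)

def draw_training_set(n=50):
    xs = [random.uniform(0.0, 3.0) for _ in range(n)]
    ys = [f(x) + random.gauss(0.0, SIGMA) for x in xs]
    return xs, ys

def fit_mean(xs, ys):
    """Inflexible model: ignore x and always predict the mean training response."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y

def fit_1nn(xs, ys):
    """Flexible model: predict the response of the single nearest training point."""
    return lambda x: ys[min(range(len(xs)), key=lambda i: abs(xs[i] - x))]

def bias_variance_at_x0(fit, n_repeats=2000):
    # Refit on many independent training sets and predict at X0 each time.
    preds = [fit(*draw_training_set())(X0) for _ in range(n_repeats)]
    bias_sq = (statistics.fmean(preds) - f(X0)) ** 2   # [E[f_hat(x0)] - f(x0)]^2
    variance = statistics.pvariance(preds)             # Var(f_hat(x0))
    return bias_sq, variance

summary = {}
for name, fit in (("mean model", fit_mean), ("1-NN", fit_1nn)):
    bias_sq, variance = bias_variance_at_x0(fit)
    summary[name] = (bias_sq, variance)
    total = bias_sq + variance + SIGMA ** 2
    print(f"{name:>10}: bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"sum with Var(eps)={total:.4f}")
```

The inflexible mean model should show the larger squared bias and the smaller variance, the flexible 1-NN model the reverse, which is the trade-off in miniature. With real data this is not possible, because neither $f$ nor an endless supply of training sets is available.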


#### 2.2.3 The Classification Setting
- instead of MSE (as in regression) we would use the training error rate, the proportion of mistakes that are made if we apply our estimate $\hat{f}$ to the training observations

- training error rate: $\frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$

    - $I(y_i \neq \hat{y}_i)$ is an indicator variable that equals 1 if $y_i \neq \hat{y}_i$ and 0 if $y_i = \hat{y}_i$

    - computes the fraction of <b>in</b>correct classifications

    - is the complement of the accuracy score, which measures the fraction of correct classifications
        - $\text{Accuracy} = 1 - \text{Training Error Rate}$

- test error rate: $\text{Ave}(I(y_0 \neq \hat{y}_0))$
    -  $\hat{y}_0$ is the predicted class label that results from applying the classifier to the test observation with predictor $x_0$

    -  $y_0$ and $\hat{y}_0$ are a general placeholder for any test sample, and the averaging happens over all such test points in the test set

    - the shorthand notation $\text{Ave}()$ is used here, because the number of tested samples is not fixed (and can vary across different test sets)
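
As a tiny worked example of the error rate formula, here is a sketch with made-up labels (`y_true` and `y_pred` are hypothetical, not from any dataset):

```python
# Hypothetical true labels and predictions for eight observations.
y_true = ["spam", "ham", "ham", "spam", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham", "ham", "ham", "ham"]

# I(y_i != y_hat_i) is 1 for a mistake and 0 otherwise; its mean is the error rate.
indicators = [int(y != y_hat) for y, y_hat in zip(y_true, y_pred)]
error_rate = sum(indicators) / len(indicators)
accuracy = 1 - error_rate

print(indicators)   # [0, 0, 1, 0, 0, 1, 0, 0]
print(error_rate)   # 0.25
print(accuracy)     # 0.75
```

Whether this number is a training or a test error rate depends only on which observations the labels come from; the computation is the same.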

##### The Bayes Classifier
- is some kind of a theoretical super-concept for all(?) classification tasks
- it delivers optimal classification results, because it is based on the assumption that we are somehow (magically) able to know the most likely class for any sample
- in practice, we can never use the Bayes Classifier because we don't know the exact conditional probabilities for every class and feature combination
- many classification methods are in use to approximate the Bayes Classifier
    - for instance, the Naive Bayes Classifier assumes that, within each class, the features are independent of each other

- Bayes Classifier minimizes the test error rate by assigning each observation its most likely class given by </br> 
$\Pr(Y = j | X = x_0)$

    - $Y$ is the class, $j$ represents a specific class label (not an index over the predictors/columns, as $j$ is used elsewhere in the book)

    - this is a [conditional probability](https://en.wikipedia.org/wiki/Conditional_probability) and it means: 

        - given the features $x_0$, we calculate the probability that the response $Y$ belongs to class $j$ or, expressed differently, the probability that $Y$ (the class label) is equal to class $j$, given that the predictor values are $X = x_0$
        - (also this is a very simple form of a conditional probability, because the condition is always true and observable)
        - we use a conditional probability to express that our prediction depends on the features from the test sample (because we magically know how they play into the prediction)

    - this formula gives a conceptual definition of the Bayes Classifier

        - it is central for any classification task
        - for direct calculation of the probabilities, we would need a more precise model that does not (like Bayes Classifier) assume that we magically know the real likelihood of the targets to occur given the data
        - could be Naive Bayes (directly subordinate) or Logistic Regression or Decision Trees (which are only indirectly related, as they implicitly approximate the decision boundary that minimizes the classification error)

- "decision boundary": when we have only two predictors, $X_1$ and $X_2$, we can plot the data in two dimensions; if the data is simulated, we know the probability that a data point belongs to a certain class and can plot it using colors ("hue"); for a two-class problem, the line where this probability is exactly 50% is called the "Bayes decision boundary" (p. 35, 36)

- the Bayes classifier produces the lowest possible test error rate, called the "Bayes error rate", which is analogous to the irreducible error

- since the Bayes classifier will always choose the class for which the probability $\Pr(Y = j | X = x_0)$ is largest, the test error rate for a particular test sample will be: </br>
$1 - \max_j \Pr(Y = j \mid X = x_0)$

    - we calculate the prediction $\Pr(Y = j \mid X = x_0)$ for each possible class $j$ (these predictions are the true(!) probabilities for the sample’s membership in each class)
    - we pick the probability for the most likely class
    - we subtract it from 1 to receive the error

- more generally for all the test samples, the test error rate for the Bayes Classifier is: </br>
$1 - \mathbb{E} \left( \max_j \Pr(Y = j \mid X) \right)$

    - where the expectation $\mathbb{E}$ averages the probability over all possible values of $X$
    - $X$ are the features from all the test samples
    - $\max_j$ is the maximum conditional probability among all possible classes $j$, meaning: picking the probability for the most likely class
        - We use conditional probability here and not regular probability, because in classification, we want to predict the probability of each class given the observed predictors $X$. This requires conditional probabilities $\Pr(Y = j \mid X)$ because the likelihood of each class can change based on feature values. Regular (unconditional) probabilities $\Pr(Y = j)$ would ignore feature information and be less accurate for classification.
    - This is the lowest possible test error and the model cannot overfit, here is why: The Bayes Classifier assumes that we (magically) KNOW the real likelihoods of the classes given the features; here we don't rely on the training observations like we do in actual machine learning algorithms.

- Interpretation: the Bayes error rate measures how often even the best possible classifier, the Bayes classifier, will make an error because the classes overlap in the true population
    - overlapping classes mean that one feature (for instance sex), or a combination of features, is not able to cleanly divide the classes (for instance height): no decision boundary, regardless of its complexity, would be able to separate the classes without error
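
If the population really were known, the Bayes error rate could be computed directly. Here is a minimal sketch with an invented toy population: one predictor with three possible values, two classes, and made-up probabilities in `p_x` and `p_y_given_x`.

```python
# Hypothetical, fully known population (all probabilities are made up).
p_x = {"low": 0.5, "mid": 0.3, "high": 0.2}     # Pr(X = x)
p_y_given_x = {                                 # Pr(Y = j | X = x)
    "low":  {"A": 0.9, "B": 0.1},
    "mid":  {"A": 0.5, "B": 0.5},
    "high": {"A": 0.2, "B": 0.8},
}

def bayes_classify(x0):
    """Assign the class j with the largest Pr(Y = j | X = x0)."""
    probs = p_y_given_x[x0]
    return max(probs, key=probs.get)

# Error rate at a single point x0: 1 - max_j Pr(Y = j | X = x0)
pointwise_error = {x: 1 - max(probs.values()) for x, probs in p_y_given_x.items()}

# Overall Bayes error rate: 1 - E[max_j Pr(Y = j | X)], averaging over X
bayes_error_rate = 1 - sum(p_x[x] * max(p_y_given_x[x].values()) for x in p_x)

print(bayes_classify("low"), bayes_classify("high"))   # A B
print(f"{bayes_error_rate:.2f}")                       # 0.24
```

The error is largest where the classes overlap most (`"mid"`, where both classes are equally likely) and no classifier could do better there, because the uncertainty sits in the population itself.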

##### Supplementary Insights: How would we use Bayes Classifier if we had all the necessary information?
- we would need both "prior" and "likelihood" and we could use the Bayes' theorem for classification:

- $\Pr(Y = j \mid X = x_0) = \frac{\Pr(X = x_0 \mid Y = j) \cdot \Pr(Y = j)}{\Pr(X = x_0)}$

    - Prior: $\Pr(Y = j)$: how likely class $j$ is overall in the population

    - Likelihood: $\Pr(X = x_0 \mid Y = j)$: probability of observing the specific feature values $X=x_0$ given that we are in class $Y=j$

    - denominator $\Pr(X = x_0)$ normalizes the posterior probabilities across all classes

    - Posterior: $\Pr(Y = j \mid X = x_0)$: probability that the class is $Y=j$, given that we observed features $X=x_0$

- in real-world scenarios and actually usable ML algorithms, the priors and likelihoods are estimated from the training data, and the posteriors are then computed for the test observations
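
A small numeric example of Bayes' theorem for one observation $x_0$ and two classes. The prior and likelihood values below are made up purely for illustration:

```python
# Hypothetical numbers for a single observation x0 and two classes.
prior = {"A": 0.7, "B": 0.3}         # Pr(Y = j)
likelihood = {"A": 0.2, "B": 0.6}    # Pr(X = x0 | Y = j)

# Denominator Pr(X = x0): sum of likelihood * prior over all classes.
evidence = sum(likelihood[j] * prior[j] for j in prior)

# Posterior Pr(Y = j | X = x0) for each class.
posterior = {j: likelihood[j] * prior[j] / evidence for j in prior}
predicted_class = max(posterior, key=posterior.get)

print(posterior)         # approximately {'A': 0.4375, 'B': 0.5625}
print(predicted_class)   # B
```

Class A is more common overall (higher prior), but the observed features are much more typical for class B (higher likelihood), so the posterior favors B. The denominator only rescales the values so that the posteriors sum to 1.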

##### Supplementary Insights: Difference between Probability and Likelihood:
- "Probability" is forward-looking, used to predict the chance of a future event based on a known model: "Given a model (e.g., a fair die), what is the probability of this outcome?"

- "Likelihood" is backward-looking, used to evaluate how well a model explains observed data: "Given this outcome, how likely is this model (e.g., fair die or biased die) to explain this outcome?"

- but it seems that often these two terms are used interchangeably

##### K-Nearest Neighbors
- is one of the models that can be used instead of the gold standard Bayes Classifier, estimating the highest probability (instead of magically knowing it)

- we give it a test observation $x_0$ and a number of neighbors $K$, then the KNN classifier first identifies the $K$ points in the training data that are closest to $x_0$, represented by $N_0$
- this way, it estimates the conditional probability for class $j$ as the fraction of points in $N_0$ whose response values equal $j$: </br>
$\Pr(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j)$

    - here for each class $j$ we sum up how often the closest surrounding data points to $x_0$ have the same class and then with $\frac{1}{K}$ we calculate the fraction of how often this class is present nearby compared to the other surrounding classes of the k-nearest neighbors

    - $K$ represents the number of neighbors we are considering (hyperparameter fed into model) and $N_0$ is the set of the $K$ nearest neighbors in the training data that are closest to the test observation $x_0$
        - notation $N_0$: $N$ stands for neighbors and the $0$ makes clear that these neighbors belong to the sample $x_0$

- we can draw a decision boundary for a KNN task if our data has only two features and we predict the most probable class for all possible values of $X_1$ and $X_2$ (p. 38)
    - on p. 37-38 ISL shows how close the decision boundary of KNN can be to Bayes Classifier

- the choice of $K$ has a drastic effect on the KNN classifier obtained and its predictions
    - $K=1$ is very flexible and corresponds to a classifier that has low bias but very high variance
    - $K=100$ produces a decision boundary that is close to linear; this corresponds to a low-variance but high-bias classifier
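
The KNN estimate of the conditional probabilities can be written in a few lines. This is a minimal sketch, assuming Euclidean distance and a tiny made-up training set; the function names `knn_probabilities` and `knn_classify` are my own, not from a library.

```python
import math
from collections import Counter

# Tiny made-up training set with two features (X1, X2) and a class label.
train = [
    ((1.0, 1.0), "blue"), ((1.5, 2.0), "blue"), ((2.0, 1.0), "blue"),
    ((5.0, 5.0), "orange"), ((6.0, 5.5), "orange"), ((5.5, 4.0), "orange"),
]

def knn_probabilities(x0, train, k):
    """Estimate Pr(Y = j | X = x0) as the fraction of the k nearest neighbors in class j."""
    # N_0: the k training points closest to x0 (Euclidean distance).
    neighbors = sorted(train, key=lambda item: math.dist(item[0], x0))[:k]
    counts = Counter(label for _, label in neighbors)
    return {label: count / k for label, count in counts.items()}

def knn_classify(x0, train, k):
    """Predict the class with the highest estimated probability."""
    probs = knn_probabilities(x0, train, k)
    return max(probs, key=probs.get)

x0 = (2.5, 2.5)
probs_k3 = knn_probabilities(x0, train, k=3)
probs_k5 = knn_probabilities(x0, train, k=5)
print(probs_k3)                       # {'blue': 1.0}
print(probs_k5)                       # {'blue': 0.6, 'orange': 0.4}
print(knn_classify(x0, train, k=5))   # blue
```

With `k=3` only blue points are among the neighbors, so the estimate is very confident; with `k=5` two orange points enter $N_0$ and the estimated probabilities become smoother. This is the same effect as the bias-variance behavior of $K$ described above, on a very small scale.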

#### doubts and open questions
- "The input variables are typically denoted using the symbol $X$, with a subscript to distinguish them. So $X_1$ might be the TV budget, $X_2$ the radio budget, and $X_3$ the newspaper budget." (p. 15)
    - this contradicts the notation conventions from p. 9, where they write "At other times we will instead be interested
    in the columns of $X$, which we write as $\boldsymbol{x}_1$, $\boldsymbol{x}_2$, ..., $\boldsymbol{x}_p$. Each is a vector of length n."
    - how can I know or trust a meaning of $X_1$ or $\boldsymbol{x}_1$ in a formula if it is so ambiguous?