# Introduction to Statistical Learning
- [1. Statistical Learning](#1.-Statistical-Learning)
    * [1.1. Basics of Statistical Learning](#1.1.-Basics-of-Statistical-Learning)
        + [1.1.1. Why estimate $f$](#1.1.1.-Why-estimate-$f$)
        + [1.1.2. How do estimate $f$](#1.1.2.-How-do-estimate-$f$)
        + [1.1.3. Trade-Off Between Prediction Accuracy and Model Interpretability](#1.1.3.-Trade-Off-Between-Prediction-Accuracy-and-Model-Interpretability)
        + [1.1.4. Supervised vs Unsupervised Learning](#1.1.4.-Supervised-vs-Unsupervised-Learning)
        + [1.1.5. Regression vs Classification Problems](#1.1.5.-Regression-vs-Classification-Problems)
    * [1.2 Assessing Model Accuracy](#1.2-Assessing-Model-Accuracy)
        + [1.2.1. Measuring the Quality of Fit](#1.2.1.-Measuring-the-Quality-of-Fit)
        + [1.2.2 The Bias-Variance Trade-Off](#1.2.2-The-Bias-Variance-Trade-Off)
        + [1.2.3 The Classification Setting](#1.2.3-The-Classification-Setting)

# 1. Statistical Learning

## 1.1. Basics of Statistical Learning

### 1.1.1. Why estimate $f$

**Prediction**
- The accuracy of $\hat{Y}$ as a prediction of $Y$ depends on two quantities, the `reducible error` and the `irreducible error`.
- In general, $\hat{f}$ will not be a perfect estimate for $f$, and this inaccuracy will introduce some error.
    * The key is to estimate $f$ with the aim of minimizing the `reducible error`; the `irreducible error`, which comes from the error term $\epsilon$, cannot be reduced no matter how well we estimate $f$
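
The split between the two error sources can be written out explicitly. As a sketch under the usual assumptions that $Y = f(X) + \epsilon$, that $\epsilon$ has mean zero and is independent of $X$, and that $\hat{f}$ and $X$ are treated as fixed:

$$E(Y-\hat{Y})^{2} = E[f(X)+\epsilon-\hat{f}(X)]^{2} = \underbrace{[f(X)-\hat{f}(X)]^{2}}_{\text{reducible}} + \underbrace{Var(\epsilon)}_{\text{irreducible}}$$

The cross term vanishes because $E(\epsilon)=0$, so only the first term can be made smaller by choosing a better $\hat{f}$.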
    
**Inference**
- Interested in understanding the relationship between $X$ and $Y$:
    * Which predictors are associated with the response?
    * What is the relationship between the response and each predictor?
    * Can the relationship between $Y$ and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

### 1.1.2. How do estimate $f$

**Parametric Methods**
- Parametric methods involve a two-step model-based approach
    * (1) Make an assumption about the functional form or shape of $f$
    * (2) Use the training data to fit (train) the model
- Reduces the problem of estimating $f$ down to one of estimating a set of parameters
    * Easier to estimate a set of parameters such as $\beta_{0}, \beta_{1},...,\beta_{p}$ than $f$
- Disadvantage:
    * The model we choose will usually not match the true unknown form of $f$; if it is too far from the true $f$, the estimate will be poor
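
The two steps above can be sketched on simulated data. This is a minimal illustration, not a prescribed workflow: the data, the seed, and the names `true_beta`, `design`, and `beta_hat` are all hypothetical, and the true $f$ is made linear only so that the estimates can be compared with known values.

```python
import numpy as np

# Hypothetical simulated data: the true f is linear by construction.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                 # two predictors
true_beta = np.array([1.0, 2.0, -3.0])      # beta_0, beta_1, beta_2
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.5, size=n)

# Step 1: assume f(X) = beta_0 + beta_1 * X_1 + beta_2 * X_2.
# Step 2: estimate the parameters by least squares on the training data.
design = np.column_stack([np.ones(n), X])   # column of ones for the intercept
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

print(beta_hat)
```

Estimating $f$ has been reduced to estimating three numbers. If the true $f$ were not linear, the same code would still run but `beta_hat` would describe a poor approximation.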

**Non-parametric Methods**
- Do not make explicit assumptions about the functional form of $f$
- Seek an estimate of $f$ that gets as close to the data points as possible without being too rough or wiggly
- Disadvantage:
    * As non-parametric approaches do not reduce the problem of estimating $f$ to a small number of parameters, a very large number of observations is required to obtain an accurate estimate of $f$
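
As one possible illustration of a non-parametric estimate, the sketch below uses K-nearest-neighbours regression: the prediction at a point is the average response of the closest training points, with no functional form assumed for $f$. The simulated data and the helper name `knn_regress` are hypothetical.

```python
import numpy as np

def knn_regress(x_train, y_train, x0, k):
    """Predict the response at x0 as the mean response of the k nearest training points."""
    distances = np.abs(x_train - x0)          # 1-D predictor, so absolute difference
    nearest = np.argsort(distances)[:k]       # indices of the k closest points
    return y_train[nearest].mean()

# Hypothetical simulated data with a non-linear true f.
rng = np.random.default_rng(1)
x_train = rng.uniform(0, 2 * np.pi, size=500)
y_train = np.sin(x_train) + rng.normal(scale=0.1, size=500)

prediction = knn_regress(x_train, y_train, x0=np.pi / 2, k=15)
print(prediction)
```

The estimate relies entirely on having enough training points near `x0`, which is why such methods tend to need many observations.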

### 1.1.3. Trade-Off Between Prediction Accuracy and Model Interpretability

**Trade-Off Between Flexibility and Interpretability**
![Model_flex_Inter.JPG](attachment:Model_flex_Inter.JPG)
- Generalized Additive Models are an extension of the linear model which allow for certain non-linear relationships

### 1.1.4. Supervised vs Unsupervised Learning

**Supervised Learning**
- For each observation $x_{i}$, there is an associated response measurement $y_{i}$
- Fit a model that relates the response to the predictors, with the aim of accurately predicting the response (`prediction`) or better understanding the relationship between the response and the predictors (`inference`)
- Methods include:
    * Linear regression, logistic regression, generalized additive models, boosting and support vector machines
    
**Unsupervised Learning**
- For each observation, there is a vector of measurements $x_{i}$ but no associated response $y_{i}$
- Understand the relationships between the variables or between the observations
- Methods include:
    * Cluster analysis:
        Ascertain, on the basis of $x_{1}, x_{2},...,x_{n}$, whether the observations fall into relatively distinct groups
        
**Semisupervised Learning**
- Use a statistical learning method that can incorporate the $m$ observations for which response measurements are available as well as the $n-m$ observations for which they are not

### 1.1.5. Regression vs Classification Problems

**Regression**
- Used for data with a `quantitative` response
- Methods:
    - Least squares regression, K-nearest neighbours, boosting, random forest

**Classification**
- Used for data with a `qualitative` response
- Methods:
    - Logistic regression, K-nearest neighbours, boosting, random forest

## 1.2 Assessing Model Accuracy

### 1.2.1. Measuring the Quality of Fit

- Need to quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation
    * Mean squared error (MSE)
$$MSE = \frac{1}{n}\sum^{n}_{i=1}(y_{i}-\hat{f}(x_{i}))^{2}$$
where $\hat{f}(x_{i})$ is the prediction that $\hat{f}$ gives for the ith observation
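
The MSE formula translates directly into code. A minimal sketch, with the function name `mean_squared_error` and the toy values chosen here purely for illustration:

```python
import numpy as np

def mean_squared_error(y, y_pred):
    """Average squared difference between observed and predicted responses."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y - y_pred) ** 2)

y = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(mean_squared_error(y, y_pred))  # (1 + 0 + 4) / 3
```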

**Training vs Test MSE**
- We are interested in the accuracy of predictions on data that we have not seen (test data)
    * Minimize the test MSE rather than the training MSE, as there is no guarantee that the method with the lowest training MSE will also have the lowest test MSE

**Overfitting vs Underfitting**
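- Overfitting:
    * Low training MSE but high test MSE
        + The supposed patterns that the method found in the training data do not apply to the test data
- Underfitting:
    * High training MSE and high test MSE
        + The method is not flexible enough to capture the patterns in the data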
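
The gap between training and test MSE can be explored with a small simulation. The sketch below fits polynomials of increasing degree to hypothetical data (the function `true_f`, the noise level, the seed, and the degrees are all assumptions made for illustration). Training MSE cannot increase as the degree grows, because each polynomial family contains the previous one; test MSE carries no such guarantee and will typically rise once the fit becomes too flexible.

```python
import numpy as np

rng = np.random.default_rng(2)

def true_f(x):
    # Hypothetical true relationship, unknown in practice.
    return np.sin(2 * x)

x_train = rng.uniform(-1.5, 1.5, size=30)
y_train = true_f(x_train) + rng.normal(scale=0.3, size=30)
x_test = rng.uniform(-1.5, 1.5, size=1000)
y_test = true_f(x_test) + rng.normal(scale=0.3, size=1000)

results = {}
for degree in (1, 3, 12):
    # Least squares polynomial fit; higher degree means more flexibility.
    poly = np.polynomial.Polynomial.fit(x_train, y_train, deg=degree)
    train_mse = np.mean((y_train - poly(x_train)) ** 2)
    test_mse = np.mean((y_test - poly(x_test)) ** 2)
    results[degree] = (train_mse, test_mse)
    print(degree, train_mse, test_mse)
```

Comparing the two columns printed for each degree shows whether a low training MSE is matched by a low test MSE.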

### 1.2.2 The Bias-Variance Trade-Off

- To minimize the expected test error, we need to select a statistical learning method that simultaneously achieves `low variance` and `low bias`
    * **Variance** refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set
    * **Bias** refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model
- When using more flexible models, the variance will increase and the bias will decrease
    * The relative rate of change of these two quantities determines whether the test MSE increases or decreases
        + Bias tends to initially decrease faster than variance increases, but at some point, increasing flexibility has little impact on the bias while significantly increasing the variance.
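
The trade-off follows from the standard decomposition of the expected test MSE at a point $x_{0}$, stated here under the usual assumption that $Y = f(X) + \epsilon$ with mean-zero $\epsilon$:

$$E\left(y_{0}-\hat{f}(x_{0})\right)^{2} = Var(\hat{f}(x_{0})) + [Bias(\hat{f}(x_{0}))]^{2} + Var(\epsilon)$$

Since the variance and the squared bias are both non-negative, the expected test MSE can never fall below $Var(\epsilon)$, the irreducible error.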

### 1.2.3 The Classification Setting

The accuracy of the estimate $\hat{f}$ for a `qualitative` response is commonly quantified by the training error rate, the proportion of misclassified observations:

$$\frac{1}{n}\sum^{n}_{i=1}I(y_{i}\neq\hat{y}_{i})$$

where $\hat{y}_{i}$ is the predicted class label for the ith observation using $\hat{f}$, and $I(y_{i}\neq\hat{y}_{i})$ is an indicator variable that equals **1** if $y_{i}\neq\hat{y}_{i}$ and **0** if $y_{i}=\hat{y}_{i}$
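
The error rate is the mean of the indicator values, which makes it a one-liner in code. A minimal sketch, with the function name `error_rate` and the toy labels chosen for illustration:

```python
import numpy as np

def error_rate(y, y_pred):
    """Fraction of observations whose predicted label differs from the true label."""
    y = np.asarray(y)
    y_pred = np.asarray(y_pred)
    return np.mean(y != y_pred)

y = ["a", "b", "a", "b"]
y_pred = ["a", "a", "a", "b"]
print(error_rate(y, y_pred))  # one of four labels is wrong
```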

**Bayes Classifier**
- The unattainable gold standard against which to compare other models
- A simple classifier that assigns each observation to the most likely class, given its predictor values
- Observations are assigned to groups based on the Bayes decision boundary
    * This method produces the lowest possible test error rate, known as the Bayes error rate
        + However, for real data the conditional distribution of $Y$ given $X$ is unknown, so the Bayes classifier cannot be computed
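
The Bayes classifier can only be computed when the data-generating distribution is known, as in a simulation. The sketch below assumes a hypothetical setup with two equally likely classes and a single normally distributed predictor; the constants `MEANS` and `PRIORS` and the helper names are illustrative assumptions.

```python
import math

# Hypothetical setup: two equally likely classes whose single predictor is
# normally distributed with means -1 and +1 and a common standard deviation of 1.
MEANS = {0: -1.0, 1: 1.0}
PRIORS = {0: 0.5, 1: 0.5}

def normal_pdf(x, mean, sd=1.0):
    """Density of a normal distribution evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior(x):
    """Pr(Y = j | X = x) for each class j, via Bayes' theorem."""
    joint = {j: PRIORS[j] * normal_pdf(x, MEANS[j]) for j in MEANS}
    total = sum(joint.values())
    return {j: joint[j] / total for j in joint}

def bayes_classify(x):
    """Assign x to the class with the largest posterior probability."""
    probs = posterior(x)
    return max(probs, key=probs.get)

print(bayes_classify(-0.3), bayes_classify(0.7))
```

In this symmetric setup the Bayes decision boundary is at $x = 0$: points to the left go to class 0 and points to the right go to class 1.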
        
**K-Nearest Neighbours**
- A classifier that identifies the K points in the training data that are closest to $x_{0}$, represented by $N_{0}$ and estimates the conditional probability for class j as the fraction of points in $N_{0}$ whose response values equal j

$$Pr(Y=j|X=x_{0}) =\frac{1}{K}\sum_{i\in N_{0}}I(y_{i}=j)$$

- **K** is the number of neighbours used to classify the point $x_{0}$
- Choice of K has a drastic effect on the KNN classifier obtained
    * As K decreases, the method becomes more flexible and the training error rate declines
        + However, once the method becomes excessively flexible and overfits, the test error rate increases
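
The KNN rule can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function names and the toy data are hypothetical, and Euclidean distance is assumed as the measure of closeness.

```python
import numpy as np

def knn_class_probabilities(X_train, y_train, x0, k):
    """Estimate Pr(Y = j | X = x0) as the fraction of the k nearest neighbours in class j."""
    distances = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distance (an assumption)
    neighbours = np.argsort(distances)[:k]             # indices forming N_0
    labels, counts = np.unique(y_train[neighbours], return_counts=True)
    return dict(zip(labels.tolist(), (counts / k).tolist()))

def knn_classify(X_train, y_train, x0, k):
    """Assign x0 to the class with the largest estimated probability."""
    probs = knn_class_probabilities(X_train, y_train, x0, k)
    return max(probs, key=probs.get)

# Hypothetical toy data: two well-separated groups of three points each.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
x0 = np.array([0.3, 0.3])

print(knn_class_probabilities(X_train, y_train, x0, k=3))
print(knn_classify(X_train, y_train, x0, k=5))
```

With `k=3` all neighbours of `x0` belong to class 0; with `k=5` two points from class 1 enter $N_{0}$, so the estimated probability for class 0 drops, although the predicted class stays the same.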