# Statistical Learning

## What is Statistical Learning
- Imagine you're hired to provide advice on how to improve sales of a product
- The **Advertising** data set contains sales of that product in different markets along with advertising budget of the product in 3 different media
    - TV
    - Radio
    - Newspaper
- It is not possible to increase sales directly, but it is possible to do so indirectly by adjusting the advertising budgets
- In this setting sales is considered the _output variable_ and advertising budgets are _input variables_
- Input variables can also be named
    - Independent variables
    - Features
    - Variables
- Output variable can also be named
    - Response
    - Dependent Variable
<img src="./Figures/Chapter2/2.1.png" width="600" height="600">

- We can assume there is a relationship between output and input
- This can be written in the general form of:<br>    
$$Y = f(X) +\epsilon $$

- $f(X)$ is a fixed but unknown function of $X$
- $\epsilon$ is a random _error term_, independent of $X$ and with mean 0

<img src="./Figures/Chapter2/2.2.png" width="600" height="600">

- The plot suggests that income can be predicted from **years of education**
- The function $f(X)$ may involve more than 1 feature
<img src="./Figures/Chapter2/2.3.png" width="600" height="600">

- Statistical learning refers to a set of approaches for estimating $f$

### Why Estimate $f$ ?
- There are 2 main reasons to estimate $f$
    - Prediction
    - Inference

#### Prediction
- A set of inputs $X$ is readily available, but the output $Y$ cannot be easily obtained
- We can predict $Y$ as:
$$\hat Y = \hat f(X)$$

- $\hat f$ represents our estimate for $f$ 
- $\hat Y$ represents our prediction for $Y$
- $\hat f$ is often treated as a black box: one is not concerned with the exact form of $\hat f$ as long as it yields accurate predictions for $Y$
- Imagine $X$ are characteristics of a patient's blood sample
- $Y$, in this case, is a variable encoding the patient's risk of a severe adverse reaction to a particular drug
- Predict $Y$ using $X$ to avoid giving the drug to patients whose estimated $Y$ is high
- Accuracy of $\hat Y$ depends on 2 quantities:
    - Reducible error
    - Irreducible error
- $\hat f$ will generally not be a perfect estimate of $f$, and this inaccuracy introduces error
- This is a _reducible error_, because the accuracy of $\hat f$ can potentially be improved by using a more appropriate statistical learning technique
- Even if $\hat f$ were a perfect estimate of $f$, so that $\hat Y = f(X)$, our prediction would still have some error
- This is because $Y$ is also a function of $\epsilon$, which cannot be predicted using $X$
- We cannot reduce the error introduced by $\epsilon$, which makes it _irreducible_
- $\epsilon$ may contain unmeasured variables that are useful for predicting $Y$, but since they are not measured, $f$ cannot use them
- In the scenario above, the risk of an adverse reaction may vary for a given patient on a given day, for reasons that $X$ does not capture
- Consider a given estimate $\hat f$ and a given set of predictors $X$, both held fixed, so that $\hat Y = \hat f(X)$; then:
$$E(Y-\hat Y)^2 = E[ f(X)+ \epsilon -\hat f(X) ]^2$$
$$= [f(X) - \hat f(X)]^2+Var( \epsilon )$$
- $E(Y-\hat Y)^2$ represents the average, or _expected value_, of the squared difference between the predicted and actual value of $Y$
- The $Var(\epsilon)$ represents the variance associated with the error term $\epsilon$
- The goal is to estimate $f$ in a way that minimizes the reducible error
- The irreducible error always provides an upper bound on the accuracy of our prediction for $Y$, and this bound is almost always unknown in practice
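The decomposition above can be checked numerically. The following is a minimal sketch, not part of the original notes: the values chosen for $f(X)$, $\hat f(X)$ and the standard deviation of $\epsilon$ are arbitrary assumptions, and $\epsilon$ is assumed to be normally distributed purely for convenience.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values at one fixed X (assumptions for illustration only).
f_x = 2.0        # true f(X)
f_hat_x = 1.5    # our estimate f_hat(X)
sigma = 0.7      # standard deviation of epsilon

# Draw many realizations of Y = f(X) + epsilon at this fixed X.
eps = rng.normal(loc=0.0, scale=sigma, size=1_000_000)
y = f_x + eps

# Left-hand side: average squared prediction error.
expected_sq_error = np.mean((y - f_hat_x) ** 2)

# Right-hand side: reducible part plus irreducible part.
reducible = (f_x - f_hat_x) ** 2
irreducible = sigma ** 2

print(expected_sq_error, reducible + irreducible)
```

The two printed numbers should agree up to simulation noise. Improving $\hat f$ shrinks only the `reducible` term; `irreducible` stays the same no matter how good the estimate is.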

#### Inference

- We are often interested in how $Y$ is affected as $X_1, \dots, X_p$ change
- We want to understand the relationship between $X$ and $Y$
- $\hat f$ cannot be treated as a black box, since we need to know its exact form
- Usually interested in answering the questions:
    - Which predictors are associated with the response?
    - What is the relationship between the response and each predictor?
    - Can the relationship between $Y$ and each predictor be adequately summarized using a linear equation or is it more complicated?
- With the **Advertising** data set, one might be interested in answering questions such as:
    - Which media contribute to sales?
    - Which media generate the biggest boost in sales?
    - How large an increase in sales is associated with a given increase in spending on a particular medium?
- This is an example of the inference paradigm
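- Some modeling is conducted for both prediction and inference
- Which method is appropriate for estimating $f$ depends on whether the goal is prediction, inference, or both
- _Linear models_ allow relatively simple and interpretable inference, but may not yield predictions as accurate as other approaches
- Conversely, highly non-linear approaches can give more accurate predictions, at the cost of a less interpretable model that makes inference harder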

### How Do We Estimate $f$ ?

- Explore linear and non-linear approaches for estimating $f$
- These methods generally share certain characteristics

<img src="./Figures/Chapter2/2.2.png" width="600" height="600">


- The observed data points (such as those plotted above) are called the _training data_, because we use them to train our method to estimate $f$
- The goal is to apply a statistical learning method to the training data to estimate $f$
- Find a function $\hat f$ such that $Y \approx \hat f(X)$ for any observation $(X,Y)$
- Most statistical learning methods can be characterized as:
    - Parametric
    - Non-parametric


#### Parametric Methods

- Parametric methods involve a two-step, model-based approach

- Assume the functional form of $f$
    - E.g. $f$ is linear in $X$:
$$f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$

- Once the assumption is made, estimating $f$ is simplified
- Instead of estimating an entirely arbitrary _p_-dimensional function $f(X)$, we only need to estimate _p_+1 coefficients

- After the model has been selected, we _fit_ or _train_ the model using the training data
- In the case of the linear model, we estimate the parameters $\beta_0, \beta_1, \beta_2, \dots, \beta_p$
- We want values for the parameters such that:
$$Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$

- The most common approach to fitting the model is _(Ordinary) least squares_
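As a rough illustration of fitting the linear model by least squares, the sketch below uses NumPy's `np.linalg.lstsq`. The data are simulated, and the sample size, the coefficient values in `true_beta` and the noise level are all assumptions made for this example rather than anything taken from the Advertising or Income data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: n observations, p = 2 predictors.
n = 200
X = rng.uniform(0, 10, size=(n, 2))
true_beta = np.array([3.0, 1.5, -2.0])   # beta_0, beta_1, beta_2 (assumed)
y = true_beta[0] + X @ true_beta[1:] + rng.normal(0, 0.5, size=n)

# Prepend a column of ones so that beta_0 is estimated as the intercept.
design = np.column_stack([np.ones(n), X])

# Least squares: choose beta to minimize sum((y - design @ beta) ** 2).
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

# Fitted values from the estimated p + 1 = 3 coefficients.
y_hat = design @ beta_hat
print(beta_hat)
```

Here the whole of $\hat f$ is summarized by the three numbers in `beta_hat`, which is what makes the approach parametric.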

<img src="./Figures/Chapter2/2.4.png" width="600" height="600">

- This approach is referred to as parametric: it reduces the problem of estimating $f$ down to estimating a set of parameters
- The disadvantage of the parametric approach is that the chosen model will usually not match the true unknown form of $f$
- If the model is far from the true $f$, our estimate will be poor
- We can fit a more flexible model, but a flexible model requires estimating a greater number of parameters
- These more complex models can lead to _overfitting_ the data, which means they follow the errors, or _noise_, too closely

<img src="./Figures/Chapter2/2.3.png" width="600" height="600">

- A parametric (linear) model for this data is:
$$\text{income} \approx \beta_0 + \beta_1 \times \text{education} + \beta_2 \times \text{seniority}$$

<img src="./Figures/Chapter2/2.4.png" width="600" height="600">

- We assume a linear relationship between response and 2 predictors

#### Non-parametric Methods

- Unlike parametric methods, non-parametric methods do not make explicit assumptions about the functional form of $f$
- They seek an estimate of $f$ that gets as close to the data points as possible without being too rough or wiggly
- They can accurately fit a wider range of possible shapes for $f$
- The disadvantage is that a very large number of observations is required to obtain an accurate estimate of $f$
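To make the idea concrete, here is a small sketch of one simple non-parametric estimator, k-nearest-neighbor averaging. It is chosen only for brevity and is not the method shown in the figures. The helper `knn_regress`, the true function (`sin`) and the noise level are assumptions made for illustration.

```python
import numpy as np

def knn_regress(x_train, y_train, x_query, k):
    """Predict at each query point by averaging the k nearest training responses."""
    # Pairwise absolute distances, shape (n_query, n_train).
    dist = np.abs(x_query[:, None] - x_train[None, :])
    # Indices of the k closest training points for each query point.
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(2)
x_train = rng.uniform(0, 2 * np.pi, size=300)
y_train = np.sin(x_train) + rng.normal(0, 0.2, size=300)   # assumed true f = sin

# A moderate k averages out noise and tracks f without assuming its form.
x_query = np.linspace(0.5, 5.5, 50)
y_smooth = knn_regress(x_train, y_train, x_query, k=15)

# k = 1 evaluated at the training points simply reproduces the training data.
y_rough = knn_regress(x_train, y_train, x_train, k=1)
```

No functional form for $f$ is specified anywhere. The price is that the estimate is only as good as the local density of observations, and a very small `k` follows the noise exactly.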

<img src="./Figures/Chapter2/2.3.png" width="600" height="600">

<img src="./Figures/Chapter2/2.5.png" width="600" height="600">

- A _thin-plate spline_ is used to produce an estimate of $f$ that is as close as possible to the observed data while remaining _smooth_
- To fit a thin-plate spline, the data analyst must select a level of smoothness

<img src="./Figures/Chapter2/2.5.png" width="600" height="600">

<img src="./Figures/Chapter2/2.6.png" width="600" height="600">

- The rough thin-plate spline is far more variable than the true $f$, which is an example of overfitting


### Trade-Off between Prediction Accuracy and Model Interpretability

- Different models have different flexibility

- Why would we choose to use a more restrictive method instead of a very flexible approach?
- If mainly interested in inference, restrictive models are more interpretable
- Flexible models can lead to complicated estimates of $f$

<img src="./Figures/Chapter2/2.7.png" width="600" height="600">

- When inference is the goal, simple and relatively inflexible methods are preferable
- If we are only interested in prediction, the interpretability of the predictive model is not of interest

### Supervised V. Unsupervised Learning

- Statistical learning problems fall into 2 categories
    - Supervised
    - Unsupervised
- Previous examples are in the supervised domain
- For each $x_i$ there is a response $y_i$
- Majority of problems will be supervised

- Unsupervised learning is more challenging
- For each $x_i$, there is no $y_i$
- It is used to understand the relationships between the variables or between the observations
- One method is _cluster analysis_, whose goal is to determine whether the observations fall into distinct groups

<img src="./Figures/Chapter2/2.8.png" width="600" height="600">

- When the groups overlap, clustering methods cannot be expected to assign every observation to its correct group
- We cannot easily plot the observations when there are more than 2 variables
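As a hedged illustration of cluster analysis, the sketch below runs k-means (one common clustering method, not named in the notes above) on simulated two-dimensional data. The `kmeans` helper, the group centers and the spreads are all assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Basic k-means: alternate between assigning points and updating centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to its nearest center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each center to the mean of its assigned observations
        # (keep the old center if a cluster happens to be empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(6)
group_a = rng.normal([0, 0], 0.5, size=(50, 2))   # hypothetical group 1
group_b = rng.normal([5, 5], 0.5, size=(50, 2))   # hypothetical group 2
X = np.vstack([group_a, group_b])

# No response variable is used: only the x_i are passed in.
labels, centers = kmeans(X, k=2)
```

The two simulated groups here are deliberately well separated. With overlapping groups the recovered labels would not match the true groups exactly.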

- Sometimes the question of whether an analysis is supervised or unsupervised is less clear-cut
- Suppose we have $n$ observations, of which $m < n$ have both predictor and response measurements, while the remaining $n - m$ have predictors but no response; this setting is called _semi-supervised learning_

### Regression V. Classification Problems

- Variables can be categorized as either quantitative or qualitative
- Problems with a quantitative response are referred to as regression problems, while those with a qualitative response are referred to as classification problems
- The distinction is not always clear: logistic regression is used with a qualitative response, so it is a classification method, but because it estimates class probabilities it can also be viewed as a regression method
- Whether the features are qualitative or quantitative is generally considered less important

## Assessing Model Accuracy

- It is necessary to introduce many different statistical learning methods, since no single method is best across all data sets
- Several important concepts arise in selecting a statistical learning procedure for a specific data set

### Measuring the Quality of Fit

- Evaluate the performance of statistical learning method on a given data set
- The most commonly used measure in the regression setting is the _mean squared error_ (MSE)

$$ MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat f(x_i))^2 $$

- MSE will be small if the predicted responses are close to the true responses; otherwise it will be large
- When computed on the training data that was used to fit the model, it is called the _training MSE_
- We are more interested in the accuracy of the predictions on previously unseen data
- We therefore want the method with the lowest _test MSE_, i.e. the one that minimizes:

$$Ave(y_0 - \hat f(x_0))^2 $$

- This is the average squared prediction error over test observations $(x_0, y_0)$, which were not used to train the model
- If no test observations are available, it is tempting to choose the method with the lowest training MSE, since the two quantities seem closely related
- The problem is that the lowest training MSE does not guarantee the lowest test MSE
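The gap between training and test MSE can be seen in a small simulation. This is only a sketch under stated assumptions: the true function `sin(3x)`, the noise level, the sample sizes and the use of polynomial fits (via `np.polyfit`) as the family of increasingly flexible models are all choices made for illustration.

```python
import numpy as np

def mse(y, y_pred):
    """Mean squared error between observed and predicted responses."""
    return np.mean((y - y_pred) ** 2)

rng = np.random.default_rng(3)

def simulate(n):
    # Hypothetical data-generating process: Y = sin(3X) + epsilon.
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, size=n)
    return x, y

x_train, y_train = simulate(30)      # used to fit the models
x_test, y_test = simulate(1000)      # unseen data, used only for evaluation

train_mse, test_mse = [], []
for degree in range(1, 9):           # higher degree = more flexible model
    coefs = np.polyfit(x_train, y_train, deg=degree)
    train_mse.append(mse(y_train, np.polyval(coefs, x_train)))
    test_mse.append(mse(y_test, np.polyval(coefs, x_test)))

for degree, (tr, te) in enumerate(zip(train_mse, test_mse), start=1):
    print(degree, round(tr, 4), round(te, 4))
```

Because each polynomial model contains the lower-degree ones, the training MSE cannot go up as the degree grows. The test MSE carries no such guarantee, so the degree with the lowest training MSE need not be the one with the lowest test MSE.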

<img src="./Figures/Chapter2/2.9.png" width="600" height="600">


- The _degrees of freedom_ is a quantity that summarizes the flexibility of a curve, here of the smoothing splines
- A fundamental property of statistical learning is that, as model flexibility increases, the training MSE decreases monotonically while the test MSE typically follows a U-shape
- When a method yields a small training MSE but a large test MSE, we are _overfitting_ the data
- The training MSE is almost always expected to be lower than the test MSE, since most statistical learning methods either directly or indirectly seek to minimize the training MSE

<img src="./Figures/Chapter2/2.10.png" width="600" height="600">

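- In practice, the test MSE is much harder to compute than the training MSE, because usually no test data are available
- A variety of approaches can be used to estimate the flexibility level at which the test MSE is minimized
- One method is _cross-validation_, a method for estimating the test MSE using the training data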
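A minimal sketch of the idea follows, using k-fold cross-validation (one common variant) to compare polynomial fits of different degree. The helper `kfold_cv_mse`, the simulated data and the choice of 5 folds are assumptions made for illustration.

```python
import numpy as np

def kfold_cv_mse(x, y, degree, k=5, seed=0):
    """Estimate the test MSE of a polynomial fit using k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    fold_mse = []
    for held_out in folds:
        # Fit on everything except the held-out fold.
        train = np.setdiff1d(idx, held_out)
        coefs = np.polyfit(x[train], y[train], deg=degree)
        # Evaluate on the held-out fold, which the fit never saw.
        pred = np.polyval(coefs, x[held_out])
        fold_mse.append(np.mean((y[held_out] - pred) ** 2))
    return float(np.mean(fold_mse))

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=100)
y = np.sin(3 * x) + rng.normal(0, 0.3, size=100)   # hypothetical training data

# Estimated test MSE for each candidate flexibility level.
cv_scores = {d: kfold_cv_mse(x, y, d) for d in range(1, 9)}
best_degree = min(cv_scores, key=cv_scores.get)
print(best_degree, cv_scores[best_degree])
```

Only training data are used throughout: each observation serves as a stand-in test point exactly once, for a model fitted without it.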


### Bias-Variance Trade-Off

- The U-shape observed in the test MSE curves turns out to be the result of 2 competing properties of statistical learning methods
- The expected test MSE for a given value $x_0$ can be decomposed into the sum of 3 fundamental quantities
    - Variance of $\hat f(x_0)$
    - Squared Bias of $\hat f(x_0)$
    - Variance of the error terms $\epsilon$
$$ E(y_0 - \hat f(x_0))^2 = Var( \hat f(x_0)) + [Bias(\hat f(x_0))]^2 + Var(\epsilon)$$

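- The notation $E(y_0 - \hat f(x_0))^2$ defines the _expected test MSE_ at $x_0$: the average test MSE we would obtain if we repeatedly estimated $f$ using a large number of training sets and tested each at $x_0$
- To minimize the expected test error, we need a method that simultaneously achieves low variance and low bias; since both terms are non-negative, the expected test MSE can never fall below $Var(\epsilon)$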
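The decomposition can be approximated by simulation: fit the same method on many independent training sets and look at the spread and average of $\hat f(x_0)$. The sketch below assumes a true function `sin(3x)`, normally distributed noise and a deliberately inflexible linear fit. None of these choices come from the notes.

```python
import numpy as np

rng = np.random.default_rng(5)

def f(x):
    return np.sin(3 * x)           # assumed true f

sigma = 0.3                         # sd of epsilon (assumed)
x0 = 0.5                            # the test point
n_train, n_sets, degree = 30, 2000, 1

# Fit the same method on many independent training sets; record f_hat(x0).
preds = np.empty(n_sets)
for s in range(n_sets):
    x = rng.uniform(-1, 1, size=n_train)
    y = f(x) + rng.normal(0, sigma, size=n_train)
    preds[s] = np.polyval(np.polyfit(x, y, deg=degree), x0)

variance = preds.var()                       # Var(f_hat(x0))
bias_sq = (preds.mean() - f(x0)) ** 2        # [Bias(f_hat(x0))]^2

# One independent test response at x0 per training set.
y0 = f(x0) + rng.normal(0, sigma, size=n_sets)
expected_test_mse = np.mean((y0 - preds) ** 2)

print(expected_test_mse, variance + bias_sq + sigma ** 2)
```

The two printed values should be close. Raising `degree` typically lowers `bias_sq` and raises `variance`, which is the trade-off behind the U-shaped test MSE curve.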

### The Classification Setting

#### Bayes Classifier
#### K-Nearest Neighbors
