<br>

# An Introduction to Statistical Learning
with Applications in `R`  
Second Edition  
Gareth James $\cdot$ Daniela Witten $\cdot$ Trevor Hastie $\cdot$ Robert Tibshirani  

Notes by Bonnie Cooper and working out the examples in `Python`  
for the course, DATA 622: Machine Learning and Big Data  


<img src="https://images-na.ssl-images-amazon.com/images/I/41pP5+SAv-L._SX330_BO1,204,203,200_.jpg" width="20%" style="margin-left:auto; margin-right:auto">

<br>

**statistical learning** - making sense of complex data sets  
**supervised statistical learning** - building a statistical model for predicting or estimating an output based on one or more inputs  
**unsupervised statistical learning** - there are inputs, but no supervising outputs; nevertheless, we can learn about relationships and structure in the data.  

The goal of this book: become informed users. For technical detail, work through the [ESL](https://web.stanford.edu/~hastie/Papers/ESLII.pdf)  
*While it is important to know what job is performed by each cog, it is not necessary to have the skills to construct the machine inside the box!*  

<br>

## Introduction

### Notation

* $n$ - number of distinct data points
* $p$ - number of feature variables
* $\mathbf{X}$ - an $n \times p$ matrix whose $(i,j)$th element is represented as $x_{ij}$
* $\mathbf{y}$ - the set of all $n$ observations in vector form
* $a \in \mathbb{R}$ - a scalar
* $a \in \mathbb{R}^k$ - a vector of length $k$
* $a \in \mathbb{R}^{k\times d}$ - a matrix
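
To make the notation concrete, here is a small sketch of how these objects might be held in NumPy. The numbers are made up purely for illustration, and the variable names (`X`, `y`, `x_23`) are my own choices rather than anything prescribed by the book.

```python
import numpy as np

# Hypothetical data: n = 4 observations, p = 3 features (values are invented)
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.2, 3.4, 5.4],
              [5.9, 3.0, 5.1]])

# One response value per observation
y = np.array([0.2, 0.2, 2.3, 1.8])

n, p = X.shape  # n rows (observations), p columns (features)

# The book's x_{ij} is 1-indexed; NumPy is 0-indexed, so x_{23} is X[1, 2]
x_23 = X[1, 2]

print(f"n = {n}, p = {p}, x_23 = {x_23}")
```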

<br>

## Chapter 2 Statistical Learning

$\mathbf{Y}$ - the response or dependent variable  
$\mathbf{X}$ - the predictor(s), independent variable(s) or feature(s)  

We assume that there is a relationship between $\mathbf{Y}$ and $\mathbf{X}$ such that:  
$$\mathbf{Y} = f(\mathbf{X}) + \epsilon $$

where:  
$f$ - some fixed function (undetermined)  
$\epsilon$ - a random error term  

Our goal: estimate $f$ based on the given observations  

### What is Statistical Learning?

#### Why Estimate $f$?

**Prediction** - predict $\mathbf{Y}$ using $\hat{ \mathbf{Y} } = \hat{f} (\mathbf{X})$  
The accuracy of $\hat{\mathbf{Y}}$ depends on two quantities: the reducible error and the irreducible error  
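**reducible error** - error that arises because $\hat{f}$ is an imperfect estimate of $f$; it can be reduced by choosing a more appropriate statistical learning method  
**irreducible error** - variability in $\mathbf{Y}$ that comes from $\epsilon$; since $\epsilon$ cannot be predicted from $\mathbf{X}$, no estimate of $f$ can remove it  
**expected squared prediction error** - the expected value (average) of the squared difference between the actual and the predicted value of $\mathbf{Y}$. Holding $\hat{f}$ and $\mathbf{X}$ fixed, it splits into a reducible and an irreducible part:  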

$$\mathbf{E}(\mathbf{Y}- \hat{\mathbf{Y}})^2 = \mathbf{E}[f(\mathbf{X})+\epsilon -\hat{f}(\mathbf{X})]^2 = [f(\mathbf{X})-\hat{f}(\mathbf{X})]^2 + \mbox{Var}(\epsilon)$$
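
The split into reducible and irreducible error can be checked numerically. The sketch below is only an illustration under invented assumptions: a "true" `f`, an imperfect `f_hat`, and a noise standard deviation `sigma` that I made up. In real problems $f$ is unknown, so this comparison is only possible in a simulation.

```python
import random

random.seed(0)

def f(x):
    # Hypothetical "true" function (unknown in practice)
    return 2.0 + 3.0 * x

def f_hat(x):
    # Hypothetical imperfect estimate of f
    return 2.5 + 3.5 * x

x0 = 1.0          # hold X fixed at a single value
sigma = 0.5       # assumed standard deviation of the error term epsilon
n_draws = 200_000

# Average (Y - Y_hat)^2 over many draws of epsilon
total = 0.0
for _ in range(n_draws):
    y = f(x0) + random.gauss(0.0, sigma)
    total += (y - f_hat(x0)) ** 2
expected_sq_error = total / n_draws

reducible = (f(x0) - f_hat(x0)) ** 2   # [f(X) - f_hat(X)]^2
irreducible = sigma ** 2               # Var(epsilon)

print(f"simulated E(Y - Y_hat)^2 : {expected_sq_error:.3f}")
print(f"reducible + irreducible  : {reducible + irreducible:.3f}")
```

Improving `f_hat` shrinks only the first term; the `sigma ** 2` term stays no matter how good the estimate is.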


**Inference** - understand the association between $\mathbf{Y}$ and $\mathbf{X}_1,\dots,\mathbf{X}_p$  

#### How Do We Estimate $f$?

Our goal is to apply statistical learning to our training data to estimate the unknown function $f$. In other words, we want to find an estimate of $f$, which we will call $\hat{f}$, such that $\mathbf{Y}  \approx \hat{f}( \mathbf{X})$ for any observation $(\mathbf{X}, \mathbf{Y})$.  

* **Parametric approach** - involves a two-step model-based approach  
    * make an assumption about the functional form of $f$ (ex: assume $f$ is linear in $\mathbf{X}$)
    * use a procedure that fits (trains) the assumed model on the training data
    * problem: the model we choose will usually not match the true form of $f$
* **Nonparametric methods** - make no assumptions about the form of $f$   
    * estimate $f$ by getting as close to the data points as possible without being too rough or wiggly
    * problem: since nonparametric methods do not reduce the problem of estimating $f$ to a small number of parameters, a comparatively larger number of observations are required in order to get an accurate estimate of $f$.
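
The two approaches above can be contrasted on simulated data. This is a rough sketch under my own assumptions: the truth is taken to be quadratic (`true_f`, hypothetical), the parametric model wrongly assumes a straight line, and the nonparametric estimate is a simple average of the `k` nearest training responses (a nearest-neighbor average, chosen here only because it is easy to write down).

```python
import random

random.seed(1)

def true_f(x):
    # Hypothetical truth; unknown in a real problem
    return x ** 2

xs = [random.uniform(-2.0, 2.0) for _ in range(300)]
ys = [true_f(x) + random.gauss(0.0, 0.2) for x in xs]

# Parametric: assume f(x) = b0 + b1 * x, then estimate just two parameters
# with the closed-form least squares solution.
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
b1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
      / sum((x - x_mean) ** 2 for x in xs))
b0 = y_mean - b1 * x_mean

def parametric_fit(x):
    return b0 + b1 * x

# Nonparametric: no assumed form; average the responses of the k training
# points closest to x.
def nonparametric_fit(x, k=10):
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

x0 = 1.5
print(f"true f(x0)        : {true_f(x0):.3f}")
print(f"parametric fit    : {parametric_fit(x0):.3f}")
print(f"nonparametric fit : {nonparametric_fit(x0):.3f}")
```

Because the assumed linear form does not match the quadratic truth, the parametric estimate is off no matter how much data is collected, while the nonparametric estimate relies on having enough observations near `x0`.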
    
#### The Trade off Between Prediction Accuracy and Model Interpretability

*Why would we ever choose to use a more restrictive method instead of a very flexible approach?*   
Often, more restrictive models (e.g. linear regression) are more easily interpretable. However, in some settings we are only interested in the prediction, and the interpretability of the predictive model is simply not of interest; here we might expect that it will be best to use the more flexible model provided the model does not overfit.  

#### Supervised Vs Unsupervised Learning 

**Supervised Learning** - for each observation of the predictor measurement, there is an associated response measure. Goal: fit a model that relates the response to the predictors so that we may (1) accurately predict future responses from the observations and (2) better understand the relationship between the response and the predictors.  
**Unsupervised Learning** - for every observation, we observe a vector of features, but no associated response. This situation is called unsupervised, because we lack a response variable that can supervise our analysis.  

#### Regression vs Classification Problems

**quantitative variables** - take on numeric values  
**qualitative variables** - take on categorical values  
**regression problems** - the response is a quantitative variable  
**classification problems** - the response is a qualitative variable  
Most of the statistical learning methods discussed in this book can be applied regardless of the predictor variable type, provided that any qualitative predictors are properly coded before the analysis is performed.  
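
One common way to code a qualitative predictor is with 0/1 dummy variables, leaving one level out as the baseline. The helper below, `dummy_code`, is a hypothetical function written for these notes (not a library call), and the `regions` values are invented.

```python
def dummy_code(values):
    """Code a qualitative variable as 0/1 dummy columns.

    The first level in sorted order is treated as the baseline and gets no
    column of its own, so a variable with L levels yields L - 1 columns.
    """
    levels = sorted(set(values))
    columns = levels[1:]  # every level except the baseline
    rows = [[1 if value == level else 0 for level in columns]
            for value in values]
    return columns, rows

regions = ["east", "west", "south", "west"]  # hypothetical qualitative predictor
columns, coded = dummy_code(regions)

print(columns)
for value, row in zip(regions, coded):
    print(value, row)
```

In practice a library routine such as `pandas.get_dummies` does the same job; the hand-rolled version is only meant to show what "properly coded" means.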

<br>

### Assessing Model Accuracy

*There is no free lunch in statistics* - no one method dominates all others over all possible data sets  

#### Measuring Quality of Fit

**mean squared error** - a measure of how well a model's predictions actually match the observed data. The MSE will be small if the predicted responses are very close to the true responses, and will be large if, for some of the observations, the predicted and true responses differ substantially.   
$$\mathbf{MSE} = \frac{1}{n} \sum_{i=1}^{n}( y_i - \hat{f} (x_i))^2$$
We are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data. We want to choose the method which gives the lowest *test* MSE, as opposed to the lowest *training* MSE. In other words, we'd like to select a model that minimizes:  
$$\mathbf{Ave}(y_0 - \hat{f}(x_0))^2$$
where $(x_0,y_0)$ is a previously unseen test observation not used to train the model.  

**overfitting** - when a given method yields a small training $\mathbf{MSE}$ but a large test $\mathbf{MSE}$; specifically, when a less flexible model would have yielded a smaller test $\mathbf{MSE}$.  
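
The gap between training and test MSE is easy to see in a simulation. The sketch below uses assumptions of my own: a made-up truth (`true_f`), a noise level of 0.3, and polynomial fits of increasing degree standing in for "more flexible" methods. It is one possible illustration, not the book's own example.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical truth; unknown in a real problem
    return np.sin(2 * x)

def simulate(n):
    x = rng.uniform(0, 3, size=n)
    y = true_f(x) + rng.normal(0, 0.3, size=n)
    return x, y

def mse(y, y_pred):
    # Mean squared error: average of the squared prediction errors
    return float(np.mean((y - y_pred) ** 2))

x_train, y_train = simulate(30)     # data used to fit each model
x_test, y_test = simulate(1000)     # unseen data, used only for evaluation

results = {}
for degree in (1, 3, 8):            # higher degree = more flexible
    coefs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = mse(y_train, np.polyval(coefs, x_train))
    test_mse = mse(y_test, np.polyval(coefs, x_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree}: training MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

The training MSE can only go down as the degree increases, because each polynomial contains the lower-degree ones as special cases. The test MSE carries no such guarantee, which is why it is the quantity to compare when choosing a method.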

#### The Bias-Variance Trade-Off  

The expected test $\mathbf{MSE}$, for a given value of $x_0$, can always be decomposed into the sum of three fundamental quantities: the variance of $\hat{f} (x_0)$, the squared bias of $\hat{f} (x_0)$ and the variance of the error terms $\epsilon$:  
$$\mathbf{E}\left(y_0 - \hat{f}(x_0)\right)^2 = \mathbf{Var}(\hat{f}(x_0)) + [\mathbf{Bias}(\hat{f}(x_0))]^2 + \mathbf{Var}(\epsilon)$$

What this tells us is that, in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves **low variance** and **low bias**. Since variance and squared bias are both nonnegative, the expected test $\mathbf{MSE}$ can never fall below $\mathbf{Var}(\epsilon)$, the irreducible error. Here:  
**variance** - the amount that $\hat{f}$ would change if we estimated it using a different training set. If a method has high variance, then a small change to the training data would lead to a large change in $\hat{f}$  
**bias** - the error that is introduced by approximating a real-life problem, which may be extremely complicated, with a much simpler model (for example, assuming $f$ is linear when it is not)  

In general, as we use more flexible methods, the variance will increase and the bias will decrease.  
**bias-variance trade-off** - the relationship between bias, variance, and test $\mathbf{MSE}$. The challenge lies in finding a method for which both the variance and the squared bias are low.   
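
The decomposition can be checked by simulation when the truth is known. In the sketch below, everything specific is an assumption made for illustration: the truth `true_f`, the noise level `sigma`, the point `x0`, and the use of polynomial degree as the flexibility knob. To keep things simple the training inputs are held fixed and only the noisy responses are redrawn, so "a different training set" here means different responses at the same inputs.

```python
import numpy as np

rng = np.random.default_rng(42)

def true_f(x):
    # Hypothetical truth; unknown in a real problem
    return np.sin(2 * x)

sigma = 0.3                         # assumed sd of the error term
x_train = np.linspace(0, 3, 30)     # fixed training inputs (a simplification)
x0 = 1.0                            # test point at which we decompose the error
n_reps = 2000                       # number of simulated training sets

def decompose(degree):
    preds = np.empty(n_reps)
    sq_errors = np.empty(n_reps)
    for r in range(n_reps):
        # Draw a fresh training set and refit the model
        y_train = true_f(x_train) + rng.normal(0, sigma, size=x_train.size)
        coefs = np.polyfit(x_train, y_train, deg=degree)
        preds[r] = np.polyval(coefs, x0)
        # Draw an independent test response at x0
        y0 = true_f(x0) + rng.normal(0, sigma)
        sq_errors[r] = (y0 - preds[r]) ** 2
    return {
        "variance": float(preds.var()),
        "bias_sq": float((preds.mean() - true_f(x0)) ** 2),
        "noise": sigma ** 2,
        "expected_test_mse": float(sq_errors.mean()),
    }

rigid = decompose(degree=1)
flexible = decompose(degree=5)

for name, parts in (("degree 1", rigid), ("degree 5", flexible)):
    total = parts["variance"] + parts["bias_sq"] + parts["noise"]
    print(f"{name}: variance = {parts['variance']:.4f}, "
          f"bias^2 = {parts['bias_sq']:.4f}, "
          f"sum of three terms = {total:.4f}, "
          f"simulated test MSE = {parts['expected_test_mse']:.4f}")
```

In this setup the straight line has the larger squared bias and the degree-5 polynomial has the larger variance, and in both cases the three terms add up to roughly the simulated expected test MSE.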

#### The Classification Setting  

Quantifying accuracy of a classification problem: the training error rate  
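$$\frac{1}{n}\sum_{i=1}^n \mathbf{I}(y_i \neq \hat{y}_i)$$
**error rate**  - proportion of mistakes that are made if we apply our estimate $\hat{f}$ to the training observations. $\mathbf{I}(y_i \neq \hat{y}_i)$ is an indicator that equals 1 if observation $i$ is misclassified and 0 otherwise.  
**test error rate** - $\mathbf{Ave}(\mathbf{I}(y_0 \neq \hat{y}_0))$, the error rate on test observations $(x_0, y_0)$. A good classifier is one for which the test error rate is smallest.  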
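
The error rate is just the average of the 0/1 indicators. A minimal sketch, where `error_rate` is a helper name of my own and the labels are invented:

```python
def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted class differs from the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Each mismatch contributes 1 (the indicator), each match contributes 0
    mistakes = sum(1 for actual, predicted in zip(y_true, y_pred)
                   if actual != predicted)
    return mistakes / len(y_true)

# Hypothetical labels: one of the four predictions is wrong
y_true = ["blue", "orange", "blue", "orange"]
y_pred = ["blue", "blue", "blue", "orange"]

print(error_rate(y_true, y_pred))
```

The same function gives the training error rate when applied to training observations and the test error rate when applied to held-out observations.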

##### The Bayes Classifier 

**Bayes Classifier** - assign each observation to the most likely class, given its predictor values; that is, assign a test observation with predictor vector $x_0$ to the class $j$ for which the following conditional probability is largest:  
$$\mathbf{Pr}( Y= j | X = x_0 )$$
**Bayes Decision Boundary** - the boundary in feature space where the conditional probabilities of the two classes are equal. An observation that falls on one side of the boundary is assigned to one class; an observation on the other side is assigned to the other class.  
A Bayes Classifier produces the lowest possible test error rate, the **Bayes Error Rate**. The Bayes error rate is analogous to the irreducible error.  
$$1 - \mathbf{E}\left(\max_j \mathbf{Pr}( Y=j|X )\right)$$
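
The Bayes classifier can only be computed when the conditional distribution of $Y$ given $X$ is known, which is the case in a simulation. The sketch below assumes a toy setup of my own choosing: two equally likely classes whose single predictor is normally distributed with means $-1$ and $+1$ and standard deviation 1. The names `MEANS`, `PRIOR`, `posterior` and `bayes_classify` are hypothetical.

```python
import math
import random

random.seed(0)

MEANS = {0: -1.0, 1: 1.0}   # assumed class-conditional means
PRIOR = {0: 0.5, 1: 0.5}    # assumed class probabilities

def normal_pdf(x, mean, sd=1.0):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def posterior(x):
    # Pr(Y = j | X = x) via Bayes' theorem
    joint = {j: PRIOR[j] * normal_pdf(x, MEANS[j]) for j in MEANS}
    total = sum(joint.values())
    return {j: joint[j] / total for j in joint}

def bayes_classify(x):
    # Assign x to the class with the largest conditional probability
    post = posterior(x)
    return max(post, key=post.get)

# Monte Carlo estimate of the Bayes error rate, 1 - E[max_j Pr(Y = j | X)]
n_draws = 100_000
total_max = 0.0
for _ in range(n_draws):
    y = random.choice([0, 1])
    x = random.gauss(MEANS[y], 1.0)
    total_max += max(posterior(x).values())
bayes_error = 1 - total_max / n_draws

# For this particular setup the Bayes error rate also has a closed form:
# the standard normal CDF evaluated at -1
analytic_error = 0.5 * math.erfc(1 / math.sqrt(2))

print(f"Monte Carlo Bayes error rate : {bayes_error:.4f}")
print(f"closed-form Bayes error rate : {analytic_error:.4f}")
```

In this setup the Bayes decision boundary sits at $x = 0$, where both classes have conditional probability 0.5. The error rate is above zero because the two classes overlap.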

##### K-Nearest Neighbors

In theory, we would always like to predict qualitative responses using the Bayes classifier. But in the real world, we might not be able to accurately estimate, let alone know, the conditional distribution of Y given X, so computing the Bayes classifier might be unattainable. 
**K-Nearest Neighbors (KNN)** - given a positive integer $K$ and a test observation $x_0$, the KNN classifier first identifies the $K$ training data points that are closest to $x_0$, represented by $\mathcal{N}_0$. It then estimates the conditional probability for class $j$ as the fraction of points in $\mathcal{N}_0$ whose response values equal $j$:
$$\mathbf{Pr}( Y=j|X=x_0) = \frac{1}{K}\sum_{i \in \mathcal{N}_0} I(y_i=j)$$
KNN then classifies the test observation $x_0$ to the class with the largest probability.  
The choice of $K$ has a drastic effect on the KNN classifier. A very small $K$ is overly flexible and will overfit the data by finding patterns in the training data that don't exist in the test data (high variance, low bias). As $K$ grows, the method becomes less flexible and produces a decision boundary that is close to linear (low variance, high bias).
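
The KNN procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not an optimized implementation: `knn_predict` is a name of my own, Euclidean distance is an assumed choice of "closest", ties in the vote go to whichever tied class appears first among the nearest neighbors, and the training points are invented.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x0, k):
    """Classify x0 by majority vote among its k nearest training points.

    Returns the predicted class and the estimated conditional probabilities,
    i.e. the fraction of the k neighbors that belong to each class.
    """
    # Sort training points by Euclidean distance to x0 (assumed metric)
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: math.dist(pair[0], x0))
    neighbor_labels = [label for _, label in by_distance[:k]]
    probs = {label: count / k
             for label, count in Counter(neighbor_labels).items()}
    predicted = max(probs, key=probs.get)
    return predicted, probs

# Hypothetical training data: two well-separated groups in two dimensions
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
           (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
train_y = ["blue", "blue", "blue", "orange", "orange", "orange"]

x0 = (2.5, 2.0)
for k in (1, 3, 5):
    label, probs = knn_predict(train_X, train_y, x0, k)
    print(f"K = {k}: predicted class = {label}, estimated probabilities = {probs}")
```

With $K = 5$ the neighborhood reaches into the second group, so the estimated probabilities are no longer all-or-nothing. This is the smoothing effect of a larger $K$.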