<br>

# An Introduction to Statistical Learning
with Applications in `R`  
Second Edition  
Gareth James $\cdot$ Daniela Witten $\cdot$ Trevor Hastie $\cdot$ Robert Tibshirani  

Notes by Bonnie Cooper, working through the examples in `Python`  
for the course, DATA 622: Machine Learning and Big Data  


<img src="https://images-na.ssl-images-amazon.com/images/I/41pP5+SAv-L._SX330_BO1,204,203,200_.jpg" width="20%" style="margin-left:auto; margin-right:auto">

<br>

**statistical learning** - making sense of complex data sets  
**supervised statistical learning** - building a statistical model for predicting or estimating an output based on one or more inputs  
**unsupervised statistical learning** - there are inputs, but no supervising outputs; nevertheless, we can learn about relationships and structure in the data.  

The goal of this book: become informed users. For technical detail, work through the [ESL](https://web.stanford.edu/~hastie/Papers/ESLII.pdf)  
*While it is important to know what job is performed by each cog, it is not necessary to have the skills to construct the machine inside the box!*  

<br>

## Introduction

### Notation

* $n$ - number of distinct data points
* $p$ - number of feature variables
* $\mathbf{X}$ - an $n \times p$ matrix whose $(i,j)$th element is represented as $x_{ij}$
* $\mathbf{y}$ - the set of all $n$ observations in vector form
* $a \in \mathbb{R}$ - a scalar
* $a \in \mathbb{R}^k$ - a vector of length $k$
* $a \in \mathbb{R}^{k\times d}$ - a matrix with $k$ rows and $d$ columns

<br>

## Chapter 2 Statistical Learning

$\mathbf{Y}$ - the response or dependent variable  
$\mathbf{X}$ - the predictor(s), independent variable(s) or feature(s)  

We assume that there is a relationship between $\mathbf{Y}$ and $\mathbf{X}$ such that:  
$$\mathbf{Y} = f(\mathbf{X}) + \epsilon $$

where:  
$f$ - some fixed function (undetermined)  
$\epsilon$ - a random error term  

Our goal: estimate $f$ based on the given observations  

### What is Statistical Learning?

#### Why Estimate $f$?

**Prediction** - predict $\mathbf{Y}$ using $\hat{ \mathbf{Y} } = \hat{f} (\mathbf{X})$  
The accuracy of $\hat{\mathbf{Y}}$ depends on two quantities: the reducible error and the irreducible error  
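**reducible error** - error in $\hat{f}$ as an estimate of $f$; it can be reduced by choosing a more appropriate statistical learning technique  
**irreducible error** - variability in $\mathbf{Y}$ that comes from $\epsilon$, which cannot be predicted from $\mathbf{X}$ no matter how well we estimate $f$  
**expected squared error** - the expected value of the squared difference between the actual and the predicted value of $\mathbf{Y}$, for a fixed $\hat{f}$ and $\mathbf{X}$:  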

$$\mathbf{E}(\mathbf{Y}- \hat{\mathbf{Y}})^2 = \mathbf{E}[f(\mathbf{X})+\epsilon -\hat{f}(\mathbf{X})]^2 = \underbrace{[f(\mathbf{X})-\hat{f}(\mathbf{X})]^2}_{\mbox{reducible}} + \underbrace{\mbox{Var}(\epsilon)}_{\mbox{irreducible}}$$
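
The decomposition above can be checked numerically. The following is a minimal simulation sketch, not something from the book: the "true" function `f`, the deliberately biased estimate `f_hat`, the noise level `sigma`, and the point `x0` are all hypothetical choices made only for illustration.

```python
import random

random.seed(0)

def f(x):
    return 2.0 * x          # hypothetical "true" f (assumption)

def f_hat(x):
    return 2.0 * x + 0.5    # hypothetical estimate, off by 0.5 everywhere

sigma = 1.0       # standard deviation of the error term epsilon (assumption)
x0 = 3.0          # fixed predictor value at which we evaluate the error
n_draws = 200_000

# Draw many responses Y = f(x0) + epsilon and record the squared prediction error
squared_errors = []
for _ in range(n_draws):
    y = f(x0) + random.gauss(0.0, sigma)
    squared_errors.append((y - f_hat(x0)) ** 2)

expected_sq_error = sum(squared_errors) / n_draws
reducible = (f(x0) - f_hat(x0)) ** 2   # [f(X) - f_hat(X)]^2
irreducible = sigma ** 2               # Var(epsilon)

print("simulated E(Y - Y_hat)^2 :", expected_sq_error)
print("reducible + irreducible  :", reducible + irreducible)
```

The two printed numbers should agree up to simulation noise. Improving `f_hat` can shrink only the first term; the `sigma ** 2` term stays no matter what.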


**Inference** - understand the association between $\mathbf{Y}$ and $\mathbf{X}_1,\dots,\mathbf{X}_p$  

#### How Do We Estimate $f$?

Our goal is to apply statistical learning to our training data to estimate the unknown function $f$. Basically, we want to find an estimate of $f$, which we will call $\hat{f}$, such that $\mathbf{Y}  \approx \hat{f}( \mathbf{X})$ for any observation $(\mathbf{X}, \mathbf{Y})$  

* **Parametric approach** - involves a two-step, model-based approach  
    * make an assumption about the functional form of $f$ (e.g. assume linearity)
    * use a procedure to fit (train) the assumed model on the training data
    * problem: the model we choose will usually not match the true form of $f$
* **Nonparametric methods** - make no explicit assumptions about the functional form of $f$   
    * estimate $f$ by getting as close to the data points as possible without being too rough or wiggly
    * problem: since nonparametric methods do not reduce the problem of estimating $f$ to a small number of parameters, a comparatively larger number of observations is required in order to get an accurate estimate of $f$.
    
#### The Trade off Between Prediction Accuracy and Model Interpretability

*Why would we ever choose to use a more restrictive method instead of a very flexible approach?*   
Often, more restrictive models (e.g. linear regression) are more easily interpretable. However, in some settings we are only interested in the prediction, and the interpretability of the predictive model is simply not of interest; here we might expect that it will be best to use the more flexible model provided the model does not overfit.  

#### Supervised Vs Unsupervised Learning 

**Supervised Learning** - for each observation of the predictor measurement, there is an associated response measure. Goal: fit a model that relates the response to the predictors so that we may (1) accurately predict future responses from the observations and (2) better understand the relationship between the response and the predictors.  
**Unsupervised Learning** - for every observation, we observe a vector of features, but no associated response. This situation is called unsupervised, because we lack a response variable that can supervise our analysis.  

#### Regression vs Classification Problems

**quantitative variables** - take on numeric values  
**qualitative variables** - take on categorical values  
**regression problems** - the response is a quantitative variable  
**classification problems** - the response is a qualitative variable  
Most of the statistical learning methods discussed in this book can be applied regardless of the predictor variable type, provided that any qualitative predictors are properly coded before the analysis is performed.  

<br>

### Assessing Model Accuracy

*There is no free lunch in statistics* - no one method dominates all others over all possible data sets  

#### Measuring Quality of Fit

**mean squared error** - a measure of how well a model's predictions actually match the observed data. The MSE will be small if the predicted responses are very close to the true responses and will be large if, for some of the observations, the predicted and true responses differ substantially.   
$$\mathbf{MSE} = \frac{1}{n} \sum_{i=1}^{n}( y_i - \hat{f} (x_i))^2$$
We are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data. We want to choose the method which gives the lowest *test* MSE, as opposed to the lowest *training* MSE. In other words, we'd like to select a model that minimizes:  
$$\mathbf{Ave}(y_0 - \hat{f}(x_0))^2$$
where $(x_0,y_0)$ is a previously unseen test observation not used to train the model.  

**overfitting** - when a given method yields a small training $\mathbf{MSE}$ but a large test $\mathbf{MSE}$, in a situation where a less flexible model would have given a lower test $\mathbf{MSE}$   
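
To make the gap between training and test MSE concrete, here is a small sketch under stated assumptions. The data-generating function `true_f`, the noise level, and the nearest-neighbour averaging smoother `knn_predict` are hypothetical stand-ins for "a method with adjustable flexibility"; they are not taken from the book. A smaller `k` means a more flexible fit.

```python
import math
import random

random.seed(1)

def true_f(x):
    return math.sin(x)   # hypothetical true f (assumption)

def make_data(n, sigma=0.5):
    # simulate n observations y = f(x) + epsilon
    xs = [random.uniform(0.0, 6.0) for _ in range(n)]
    ys = [true_f(x) + random.gauss(0.0, sigma) for x in xs]
    return xs, ys

def knn_predict(x0, train_x, train_y, k):
    # average the responses of the k training points closest to x0
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(xs, ys, train_x, train_y, k):
    # mean squared error of the fitted smoother on the points (xs, ys)
    errors = [(y - knn_predict(x, train_x, train_y, k)) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

train_x, train_y = make_data(100)
test_x, test_y = make_data(100)

results = {}
for k in (1, 10, 50):
    results[k] = (
        mse(train_x, train_y, train_x, train_y, k),  # training MSE
        mse(test_x, test_y, train_x, train_y, k),    # test MSE
    )
    print(f"k={k:2d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With `k=1` every training point is its own nearest neighbour, so the training MSE is exactly zero while the test MSE is not. That is the signature of overfitting described above.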

#### The Bias-Variance Trade-Off  

The expected test $\mathbf{MSE}$ for a given value of $x_0$ can always be decomposed into the sum of three fundamental quantities: the variance of $\hat{f} (x_0)$, the squared bias of $\hat{f} (x_0)$, and the variance of the error terms $\epsilon$:  
$$\mathbf{E}(y_0 - \hat{f}(x_0))^2 = \mathbf{Var}(\hat{f}(x_0)) + [\mathbf{Bias}(\hat{f}(x_0))]^2 + \mathbf{Var}(\epsilon)$$

What this tells us is that, in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves **low variance** and **low bias**, where:  
**variance** - the amount that $\hat{f}$ would change if we estimated it using a different training set. If a method has high variance, then a small change to the data set would lead to a large change in $\hat{f}$  
**bias** - the error that is introduced by approximating a real-life problem, which may be extremely complicated, with a much simpler model  

In general, as we use more flexible methods, the variance will increase and the bias will decrease.  
**bias-variance trade-off** - the relationship between bias, variance, and test $\mathbf{MSE}$. The challenge lies in finding a method for which both the variance and the squared bias are low.   
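
The variance and squared bias of $\hat{f}(x_0)$ can be estimated by refitting on many simulated training sets. The sketch below is illustrative only: `true_f`, the sample size, the noise level, and the nearest-neighbour averaging used as the learning method are all assumptions, chosen so that `k=1` plays the role of a flexible method and `k=25` a rigid one.

```python
import math
import random

random.seed(2)

def true_f(x):
    return math.sin(x)   # hypothetical true f (assumption)

def knn_fit_predict(x0, k, n=50, sigma=0.5):
    # draw a fresh training set, then average the k responses nearest to x0
    xs = [random.uniform(0.0, 6.0) for _ in range(n)]
    ys = [true_f(x) + random.gauss(0.0, sigma) for x in xs]
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

def bias_variance(x0, k, n_repeats=500):
    # predictions of f_hat(x0) across many independent training sets
    preds = [knn_fit_predict(x0, k) for _ in range(n_repeats)]
    mean_pred = sum(preds) / n_repeats
    variance = sum((p - mean_pred) ** 2 for p in preds) / n_repeats
    squared_bias = (mean_pred - true_f(x0)) ** 2
    return variance, squared_bias

x0 = math.pi / 2   # evaluate at the peak of the sine curve
var_flexible, bias2_flexible = bias_variance(x0, k=1)
var_rigid, bias2_rigid = bias_variance(x0, k=25)

print(f"k=1  (flexible): variance={var_flexible:.3f}  bias^2={bias2_flexible:.3f}")
print(f"k=25 (rigid)   : variance={var_rigid:.3f}  bias^2={bias2_rigid:.3f}")
```

In this setup the flexible fit should show the larger variance and the rigid fit the larger squared bias, which is the trade-off in miniature.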

#### The Classification Setting  

Quantifying the accuracy of a classifier: the training error rate  
$$\frac{1}{n}\sum_{i=1}^n \mathbf{I}(y_i \neq \hat{y}_i)$$
**error rate**  - proportion of mistakes that are made if we apply our estimate $\hat{f}$ to the training observations. $\mathbf{I}(y_i \neq \hat{y}_i)$ is an indicator that equals 1 if the $i$th observation is misclassified and 0 otherwise  
**test error rate** - $\mathbf{Ave}(\mathbf{I}(y_0 \neq \hat{y}_0))$, computed over test observations $(x_0, y_0)$. A good classifier is one for which the test error rate is smallest.  

##### The Bayes Classifier 

**Bayes Classifier** - assign each observation to the most likely class given its predictor values, i.e. to the class $j$ for which the following conditional probability is largest:
$$\mathbf{Pr}( Y= j | X = x_0 )$$
**Bayes Decision Boundary** - the boundary in feature space where the conditional probabilities of the classes are equal (50% each in the two-class case). An observation that falls on one side of the boundary is assigned to one class, and an observation on the other side is assigned to the other.  
The Bayes Classifier produces the lowest possible test error rate, the **Bayes Error Rate**. The Bayes error rate is analogous to the irreducible error.  
$$1 - \mathbf{E}\left(\max_j \mathbf{Pr}( Y=j|X )\right)$$

##### K-Nearest Neighbors

In theory, we would always like to predict qualitative responses using the Bayes classifier. But in the real world we do not know the conditional distribution of Y given X, so computing the Bayes classifier is impossible and we must estimate that distribution instead.  
**K-Nearest Neighbors (KNN)** - given a positive integer $K$ and a test observation $x_0$, the KNN classifier first identifies the $K$ training data points that are closest to $x_0$, represented by $\mathcal{N}_0$. It then estimates the conditional probability for class $j$ as the fraction of points in $\mathcal{N}_0$ whose response values equal $j$:
$$\mathbf{Pr}( Y=j|X=x_0) = \frac{1}{K}\sum_{i \in \mathcal{N}_0} I(y_i=j)$$
KNN then classifies the test observation $x_0$ to the class with the largest probability.  
The choice of $K$ has a drastic effect on the KNN classifier. Very small $K$ is overly flexible and will overfit the data by finding patterns in the training data that don't exist in the test data (high variance, low bias). As $K$ grows, the method becomes less flexible and produces a decision boundary that approaches linear (low variance, high bias).
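
The KNN rule described above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a tiny hand-made training set; the function name `knn_classify` and the data are hypothetical.

```python
from collections import Counter

def knn_classify(x0, train_X, train_y, k):
    # squared Euclidean distance from x0 to every training point
    dists = [sum((a - b) ** 2 for a, b in zip(x, x0)) for x in train_X]
    # indices of the k closest training points (the neighbourhood N_0)
    nearest = sorted(range(len(train_X)), key=lambda i: dists[i])[:k]
    # estimated Pr(Y = j | X = x0): fraction of neighbours in class j
    votes = Counter(train_y[i] for i in nearest)
    probs = {label: count / k for label, count in votes.items()}
    # classify to the class with the largest estimated probability
    return max(probs, key=probs.get), probs

# Hypothetical toy training data: two well-separated groups
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
train_y = ["blue", "blue", "blue", "orange", "orange", "orange"]

label, probs = knn_classify((2.5, 2.0), train_X, train_y, k=3)
print(label, probs)    # all three nearest neighbours are "blue"

label5, probs5 = knn_classify((2.5, 2.0), train_X, train_y, k=5)
print(label5, probs5)  # a larger K pulls in two "orange" points
```

Increasing `k` from 3 to 5 changes the estimated probabilities even though the predicted class stays the same, which hints at how the choice of $K$ smooths the decision boundary.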

<br>

## Chapter 3 Linear Regression

Key ideas underlying the linear regression model, as well as the least squares approach that is most commonly used to fit the model.  

<br>

### Simple Linear Regression  

**simple linear regression** - predicting a quantitative response $Y$ from a single predictor variable $X$, assuming the relationship between them is approximately linear. We are regressing Y onto X.

$$Y \approx \beta_0 + \beta_1X$$

where the coefficients $\beta_0$ and $\beta_1$ are unknown constants that represent the intercept and slope (respectively)  

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

where the hat symbol denotes an estimated or predicted value  

#### Estimating the Coefficients  

We want to find  an intercept and slope such that the resulting line is as close as possible to the data points.  
Measuring closeness:  minimizing the least squares criterion - minimize the sum of the squares of the residuals.  

**Least Squares Coefficient Estimates**  

$$\hat{\beta_1} = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2}$$
$$\hat{\beta_0} = \bar{y}-\hat{\beta_1}\bar{x}$$
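
The two formulas translate directly into code. A minimal sketch follows; the function name `least_squares` and the five data points are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def least_squares(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # numerator and denominator of the slope estimate
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Hypothetical toy data, invented for illustration
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

beta0, beta1 = least_squares(xs, ys)
print(f"intercept = {beta0:.2f}, slope = {beta1:.2f}")
```

For this data $\bar{x} = 3$ and $\bar{y} = 6.1$, the numerator is 19.6 and the denominator is 10, so the slope is 1.96 and the intercept is $6.1 - 1.96 \cdot 3 = 0.22$.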

#### Assessing the Accuracy of the Coefficient Estimates

$$Y = \beta_0 + \beta_1X + \epsilon$$

In real applications, we have access to a set of observations from which we can compute the least squares line; however, the population regression line is unobserved.  
If we estimate $\beta_0$ and $\beta_1$ on the basis of a particular data set, then our estimates won't be exactly equal to the true coefficients. But if we could average the estimates obtained over a huge number of data sets, then the average of these estimates would be spot on (the least squares estimates are unbiased).  

The **standard error** of $\hat{\mu}$ - tells us the average amount that the estimate $\hat{\mu}$ differs from the actual value of $\mu$. This value shrinks with $n$: the more observations we have, the smaller the standard error of $\hat{\mu}$  
$$\mbox{Var}(\hat{\mu}) = \mbox{SE}(\hat{\mu})^2 = \frac{\sigma^2}{n}$$
Usually, $\sigma$ is unknown. Therefore, we estimate $\sigma$ from the data with the **residual standard error**: $\mbox{RSE} = \sqrt{\mbox{RSS}/(n-2)}$  
Standard errors can be used to compute confidence intervals. For linear regression, the approximate 95% confidence intervals for the coefficients take the form  
$$\hat{\beta}_1 \pm 2 \cdot \mbox{SE}(\hat{\beta}_1)$$
$$\hat{\beta}_0 \pm 2 \cdot \mbox{SE}(\hat{\beta}_0)$$
Standard errors can also be used to perform *hypothesis tests*:  

* **The Null Hypothesis $H_0$**: there is no relationship between X and Y
* **The Alternative Hypothesis $H_1$**: there is some relationship between X and Y

We compute the *t-statistic*: $t = \frac{\hat{\beta}_1-0}{\mbox{SE}(\hat{\beta}_1)}$  
Then we compute the **p-value**: the probability of observing any value equal to $|t|$ or larger in absolute value, assuming $\beta_1 = 0$. A small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance. If the p-value is below a criterion value, we can reject the null hypothesis and declare that a relationship exists between X and Y.  
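
As a rough sketch of how these quantities fit together, the code below computes the residual standard error, the coefficient standard errors, an approximate 95% confidence interval, and the t-statistic for the slope. The data are hypothetical. The standard-error formulas used are the standard ones for simple linear regression, $\mbox{SE}(\hat{\beta}_1)^2 = \sigma^2 / \sum_i (x_i - \bar{x})^2$ and $\mbox{SE}(\hat{\beta}_0)^2 = \sigma^2 [1/n + \bar{x}^2 / \sum_i (x_i - \bar{x})^2]$, with $\sigma$ replaced by the RSE.

```python
import math

# Hypothetical toy data, invented for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
n = len(xs)

# Least squares fit
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
beta0 = y_bar - beta1 * x_bar

# Residual sum of squares and residual standard error (estimate of sigma)
rss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
rse = math.sqrt(rss / (n - 2))

# Standard errors of the coefficient estimates
se_beta1 = rse / math.sqrt(sxx)
se_beta0 = rse * math.sqrt(1.0 / n + x_bar ** 2 / sxx)

# Approximate 95% confidence interval and t-statistic for the slope
ci_beta1 = (beta1 - 2 * se_beta1, beta1 + 2 * se_beta1)
t_stat = (beta1 - 0.0) / se_beta1

print(f"RSE = {rse:.4f}, SE(beta1) = {se_beta1:.4f}, SE(beta0) = {se_beta0:.4f}")
print(f"approx. 95% CI for beta1: ({ci_beta1[0]:.3f}, {ci_beta1[1]:.3f})")
print(f"t-statistic for beta1: {t_stat:.1f}")
```

Turning the t-statistic into a p-value requires the t-distribution with $n-2$ degrees of freedom, which the Python standard library does not provide. A statistics package would normally be used for that step.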

#### Assessing the Accuracy of the Model

The quality of the regression fit is typically assessed using the **residual standard error (RSE)** and the **$R^2$** statistic:

* the RSE is an estimate of the standard deviation of the error $\epsilon$. Roughly speaking, it is the average amount that the response will deviate from the true regression line. The RSE can be considered a measure of the lack of fit of the model to the data (large RSE values indicate a poor fit). 
* the $R^2$ statistic summarizes the proportion of variability in Y that can be explained using X 
    - $R^2 = 1 - \frac{\mbox{RSS}}{\mbox{TSS}} = 1 - \frac{\mbox{variability left unexplained by the model}}{\mbox{total variance in the response Y}}$
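
A short sketch of both measures on hypothetical toy data. It also checks a fact specific to simple linear regression: $R^2$ equals the squared correlation between $X$ and $Y$.

```python
import math

# Hypothetical toy data, invented for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
n = len(xs)

# Least squares fit
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
beta1 = sxy / sxx
beta0 = y_bar - beta1 * x_bar

# RSS: variability left unexplained by the model
rss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
# TSS: total variability in the response
tss = sum((y - y_bar) ** 2 for y in ys)

rse = math.sqrt(rss / (n - 2))
r_squared = 1.0 - rss / tss

# Sample correlation between X and Y
cor_xy = sxy / math.sqrt(sxx * tss)

print(f"RSE = {rse:.4f}")
print(f"R^2 = {r_squared:.6f}, Cor(X, Y)^2 = {cor_xy ** 2:.6f}")
```

The RSE is in the units of $Y$, so whether it counts as "small" depends on the problem. $R^2$ is unit-free and always lies between 0 and 1.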

<br>

### Multiple Linear Regression

Extending the simple linear regression model so that it can accommodate multiple predictors. We interpret a coefficient $\beta_j$ as the average effect on Y of a one unit increase in $X_j$, holding all other predictors fixed. 
$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p + \epsilon$$

<br>

#### Estimating the Regression Coefficients

use the same least squares approach: choose coefficients to minimize the sum of the squared residuals

<br>

#### Some Important Questions

for MLR:  

* Is at least one of the predictors useful in predicting the response?
    - test the null hypothesis $H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$ (if $H_0$ is true, we expect an F-statistic close to 1)
    - vs the alternative hypothesis that at least one $\beta_j \neq 0$ (we expect an F-statistic greater than 1)
* Do all predictors help to explain Y, or is only a subset of the predictors useful?
    - variable selection, judged by e.g. the Akaike information criterion (AIC), the Bayesian information criterion (BIC), or adjusted $R^2$
        - consider all possible. might not be feasible
        - forward selection. start with the null model
        - backward selection. start with the full model
        - mixed selection
* How well does the model fit the data?
    - RSE - models with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in p.
    - $R^2 = \mbox{Cor}(Y, \hat{Y})^2$
    - plotting the data can reveal problems with the model that are not visible from numerical statistics
* Given a set of predictor values, what response value should we predict and how accurate is the prediction?
    1. the inaccuracy in the coefficient estimates is related to the reducible error. We can compute a confidence interval to find how close $\hat{Y}$ will be to $f(X)$
    2. approximating linearity has model bias
    3. because of random error, we cannot perfectly predict even if we have the right model and coefficients
    
**confidence interval** - quantify the uncertainty surrounding the estimate of the average  
**prediction interval** - quantify the uncertainty surrounding the estimate of a particular observation. is typically substantially wider than the confidence interval.  

<br>

### Other Considerations in the Regression Model

#### Qualitative Predictors

**Predictors with Only Two Levels**  
Predictors with only two values can be incorporated into the regression model framework by recoding the variable as a binary *dummy variable* (a scheme closely related to **one-hot encoding**). The polarity of the coding is completely arbitrary and does not affect the regression fit; however, it does affect the interpretation of the model coefficients.  

**Predictors with More than Two Levels**  
Add more dummy variables. There will always be one fewer dummy variable than the number of levels. The level with no dummy variable is known as the baseline.  
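
A minimal sketch of dummy coding with a baseline level. The helper `dummy_encode` and the `region` values are hypothetical; in practice a data-frame library would usually handle this step.

```python
def dummy_encode(values, baseline):
    # one dummy column per level, except the baseline level
    levels = sorted(set(values) - {baseline})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical qualitative predictor with three levels
region = ["East", "West", "South", "East", "South"]

levels, rows = dummy_encode(region, baseline="East")
print(levels)          # the two non-baseline levels, one column each
for value, row in zip(region, rows):
    print(value, row)  # the baseline level is coded as all zeros
```

Three levels produce two dummy columns. An observation at the baseline level has zeros in both, so each coefficient is read as a difference from the baseline.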

#### Extensions of the Linear Model  

**additive** - association between predictor and response does not depend on the values of any other predictors  
**linear** - the change in the response Y associated with a one-unit change in $X_j$ is constant, regardless of the value of $X_j$.  
the Linear Model can be extended with several methods that relax both of these assumptions.  

**Removing the Additive Assumption** - synergy or interaction effects. One way of extending the linear model to account for interaction effects is to include, in addition to the main effect terms, an extra predictor called an interaction term, for example the product of two variables $X_1X_2$ with its own coefficient.  
**Hierarchy principle** - if we include an interaction term in a model, we should also include the main effects, even if the p-values associated with the coefficients are not significant.  
**Extending the Linearity Assumption** - include transformed versions of the predictors, e.g. polynomial regression.  

#### Potential Problems

1. Non-linearity of the response-predictor relationships
    - Residual plots are a useful graphic for assessing non-linearity
2. Correlation of error terms
    - look for tracking in the residuals as a function of observation order (e.g. time): adjacent residuals that take similar values, or residuals that show periodic structure, suggest correlated errors
3. Non-constant variance of error terms
    - e.g. **heteroscedasticity** - variance of residuals increases with value of the response
    - a cone-like pattern in the residuals
    - can counteract by transforming with a concave function
4. Outliers
    - make educated decisions about removing outliers
5. High-Leverage Points
    - points with an unusual value for a given x or in terms of the full set of predictors.
    - can find the 'leverage statistic'
6. Collinearity - when two or more predictor variables are closely related to one another. It can be difficult to separate out the individual effects of collinear predictors on the response. The power of a hypothesis test is diminished when collinearity is present in the data.
    - solution: either drop the offending features or combine them into a single measure.


<br>

## Chapter 4 Classification

**classification** - approaches for predicting qualitative responses  

### Why Not a Linear Model?

Consider a binary qualitative response variable.  
It becomes problematic if we fit this data with a linear regressor. For instance, some of the predictions might fall outside the 0-1 range, making them hard to interpret as probabilities. Additionally, a linear regressor cannot naturally accommodate a response with more than two levels, because any numeric coding of the classes imposes an ordering and spacing on them.

<br>

### Logistic Regression

#### The Logistic Model

To yield sensible probability predictions, we model $p(X) = \mbox{Pr}(Y=1|X)$ using the **logistic function** and fit the model using **maximum likelihood**  

$$p(X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X} }$$

The logistic function will always produce an S-shaped curve, and so, regardless of the value of $X$, we get a sensible prediction for $p(X)$ between 0 and 1.  

We can also find that:  
$$\mbox{Odds} = \frac{p(X)}{1-p(X)} = e^{\beta_0 + \beta_1X}$$

Which leads to the **logit**, or log odds function:
$$\mbox{log}\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1X$$

Interpreting the logistic model: increasing X by one unit changes the log odds by $\beta_1$, or equivalently multiplies the odds by $e^{\beta_1}$. $p(X)$ does not have a straight line relationship with X, and the rate of change of $p(X)$ depends on the value of X.  
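
The relationship between $p(X)$, the odds, and the log odds can be verified numerically. In this sketch the coefficient values `beta0 = -3.0` and `beta1 = 0.5` are hypothetical, picked only to make the numbers easy to read.

```python
import math

def logistic(x, beta0, beta1):
    # p(X) = e^(b0 + b1 x) / (1 + e^(b0 + b1 x))
    z = beta0 + beta1 * x
    return math.exp(z) / (1.0 + math.exp(z))

def odds(p):
    return p / (1.0 - p)

def logit(p):
    # log odds
    return math.log(p / (1.0 - p))

beta0, beta1 = -3.0, 0.5   # hypothetical coefficients (assumption)

# A one-unit increase in x always adds beta1 to the log odds ...
log_odds_change = logit(logistic(3.0, beta0, beta1)) - logit(logistic(2.0, beta0, beta1))

# ... and multiplies the odds by e^beta1 ...
odds_ratio = odds(logistic(3.0, beta0, beta1)) / odds(logistic(2.0, beta0, beta1))

# ... but the change in p(x) itself depends on where x is.
delta_mid = logistic(7.0, beta0, beta1) - logistic(6.0, beta0, beta1)   # near p = 0.5
delta_tail = logistic(1.0, beta0, beta1) - logistic(0.0, beta0, beta1)  # in the lower tail

print(f"change in log odds: {log_odds_change:.3f}")
print(f"odds ratio: {odds_ratio:.3f}  (e^beta1 = {math.exp(beta1):.3f})")
print(f"change in p near the middle: {delta_mid:.3f}, in the tail: {delta_tail:.3f}")
```

The change in log odds is the same at every `x`, while the change in probability is largest near $p(X) = 0.5$ and small in the tails.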

#### Estimating the Regression Coefficients

**maximum likelihood** - seek estimates for the coefficients such that the predicted probability for each individual (in the book's example, the probability of default) corresponds as closely as possible to the individual's observed status.

$$\ell(\beta_0,\beta_1)= \prod_{i:y_i=1}p(x_i) \prod_{i':y_{i'}=0}(1-p(x_{i'}))$$

Choose estimates of the coefficients that maximize the likelihood function.  
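
One simple way to maximize the likelihood numerically is gradient ascent on its logarithm. The sketch below is only an illustration under stated assumptions: the eight data points, the step size, and the iteration count are hypothetical, and standard software typically uses faster Newton-type iterations instead.

```python
import math

# Hypothetical toy data: x is a single predictor, y is a 0/1 response.
# The classes overlap, so the maximum likelihood estimate is finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

def p(x, beta0, beta1):
    # logistic function p(x)
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def log_likelihood(beta0, beta1):
    # log of the likelihood: sum of log p(x_i) over y_i = 1
    # plus sum of log(1 - p(x_i)) over y_i = 0
    return sum(
        math.log(p(x, beta0, beta1)) if y == 1 else math.log(1.0 - p(x, beta0, beta1))
        for x, y in zip(xs, ys)
    )

def fit(step=0.05, n_iter=5000):
    beta0, beta1 = 0.0, 0.0
    for _ in range(n_iter):
        # gradient of the log-likelihood with respect to each coefficient
        g0 = sum(y - p(x, beta0, beta1) for x, y in zip(xs, ys))
        g1 = sum((y - p(x, beta0, beta1)) * x for x, y in zip(xs, ys))
        beta0 += step * g0
        beta1 += step * g1
    return beta0, beta1

beta0_hat, beta1_hat = fit()
print(f"beta0_hat = {beta0_hat:.3f}, beta1_hat = {beta1_hat:.3f}")
print(f"log-likelihood at start: {log_likelihood(0.0, 0.0):.3f}")
print(f"log-likelihood at fit  : {log_likelihood(beta0_hat, beta1_hat):.3f}")
```

Maximizing the log-likelihood gives the same coefficients as maximizing the likelihood itself, because the logarithm is monotone. Working on the log scale also avoids multiplying many small probabilities together.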

#### Multiple Logistic Regression

$$\mbox{log}\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1X_1 + \dots + \beta_pX_p$$

#### Multinomial Logistic Regression

Classify a response variable that has more than two classes.  
First, select a class to serve as the baseline, e.g. the $K$th class. Then, for $k = 1, \dots, K-1$:

$$\mbox{Pr}(Y=k|X=x) = \frac{e^{\beta_{k0} + \beta_{k1}x_1 + \dots + \beta_{kp}x_p}}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0}+ \beta_{l1}x_1 + \dots + \beta_{lp}x_p} }$$

$$\mbox{log}\left(\frac{\mbox{Pr}(Y=k|X=x)}{\mbox{Pr}(Y=K|X=x)}\right) = \beta_{k0} + \beta_{k1}x_1 + \dots + \beta_{kp}x_p$$

The log odds between any pair of classes are linear in the features.  
Interpretation of the coefficients in a multinomial logistic regression model must be done with care, since it is tied to the choice of baseline.  

Alternative: **softmax coding** - rather than selecting a baseline class, we treat all K classes symmetrically:  

$$\mbox{Pr}(Y=k|X=x) = \frac{e^{\beta_{k0} + \beta_{k1}x_1 + \dots + \beta_{kp}x_p}}{\sum_{l=1}^K e^{\beta_{l0}+ \beta_{l1}x_1 + \dots + \beta_{lp}x_p} }$$
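
A minimal sketch of the softmax coding for $K = 3$ classes and $p = 2$ predictors. The coefficient values and the function name `softmax_probs` are hypothetical. Subtracting the largest score before exponentiating is a common numerical-stability trick and does not change the resulting probabilities.

```python
import math

def linear_scores(x, coefs):
    # coefs[k] = (beta_k0, beta_k1, ..., beta_kp); one linear score per class
    return [b[0] + sum(bj * xj for bj, xj in zip(b[1:], x)) for b in coefs]

def softmax_probs(x, coefs):
    scores = linear_scores(x, coefs)
    m = max(scores)                          # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical coefficients for K = 3 classes and p = 2 predictors
coefs = [(0.0, 1.0, -1.0), (0.5, 0.0, 0.0), (-0.5, -1.0, 1.0)]
x = (1.0, 2.0)

scores = linear_scores(x, coefs)
probs = softmax_probs(x, coefs)
print("class probabilities:", [round(pk, 3) for pk in probs])

# The log odds between two classes is the difference of their linear scores
log_odds_01 = math.log(probs[0] / probs[1])
print("log odds, class 0 vs class 1:", round(log_odds_01, 3))

# Adding the same constant to every class's intercept leaves the probabilities unchanged
shifted = [(b[0] + 10.0,) + b[1:] for b in coefs]
probs_shifted = softmax_probs(x, shifted)
```

The last check shows why softmax coefficients are not uniquely determined: only differences between classes matter, which is what fixing a baseline class pins down in the other coding.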