# Study Notes for *The Elements of Statistical Learning*

### Edited by Emma Teng

## Chapter 2 Overview of Supervised Learning 
### 2.1 Introduction

1. Terminology for X and Y:
    
    - X: inputs, predictors, independent variables, features;
    - Y: outputs, responses, dependent variables.

### 2.2 Variable Types and Terminology
1. The distinction in output type has led to a naming convention for the prediction tasks:
    
    - regression: when we predict quantitative outputs;
    - classification: when we predict qualitative outputs.
    
    
2. Variable types include: quantitative, qualitative and ordered categorical.


3. The most useful and commonly used coding for *qualitative* variables is via **dummy variables**: a K-level qualitative variable is represented by a vector of K binary variables or bits, only one of which is “on” at a time.


4. Naming Convention: 
   - Inputs: X;
   - Quantitative outputs: Y;
   - Qualitative outputs: G;
   - Use uppercase letters such as X, Y or G when referring to the generic aspects of a variable. Observed values are written in lowercase; hence the *i*th observed value of X is written as $x_i$ (where $x_i$ is again a scalar or vector).
   - Matrices are represented by bold uppercase letters; for example, a set of N input p-vectors $x_i, i = 1, \ldots, N$ would be represented by the $N \times p$ matrix **X**.
   - The $i$th row of **X** is $x_i^T$, the vector transpose of $x_i$, because we assume all vectors are column vectors.
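
The dummy-variable coding from item 3 can be sketched in a few lines of NumPy. This is only an illustration: the helper name `dummy_code` and the colour labels are made up for the example and do not come from the book.

```python
import numpy as np

def dummy_code(labels):
    """Return (levels, indicator matrix) for a 1-D sequence of qualitative labels."""
    # levels: sorted distinct labels; codes: index of each label within levels
    levels, codes = np.unique(np.asarray(labels), return_inverse=True)
    indicators = np.zeros((codes.size, levels.size), dtype=int)
    # Switch on exactly one bit per observation (row)
    indicators[np.arange(codes.size), codes] = 1
    return levels, indicators

# Hypothetical 3-level variable observed 4 times
levels, D = dummy_code(["red", "green", "blue", "green"])
print(levels)
print(D)
```

Each row of `D` corresponds to one observation and each column to one level, so every row sums to 1.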

### 2.3 Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors
#### 2.3.1 Linear Models and Least Squares

1. The linear model is:
$$
\begin{align}
\hat{Y} & = \hat{\beta_0} + \sum_{j=1}^p X_j \hat{\beta_j} \\
& = X^T \hat{\beta}  \\
\end{align}
$$
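where $X^T = (1, X_1, X_2, \ldots, X_p)$ is the input vector with the constant 1 prepended, so that the intercept $\hat{\beta}_0$ is included in the coefficient vector $\hat{\beta}$.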


2. Least squares picks the coefficients $\beta$ that minimize the residual sum of squares:
$$
\begin{align}
RSS(\beta) & = \sum_{i = 1}^N (y_i - x_i^T\beta)^2 \\
&=(\boldsymbol{y} - X\beta)^T(\boldsymbol{y}-X\beta)
\end{align}
$$

where $X$ is an $N\times p$ matrix with each row an input vector, and $\boldsymbol{y}$ is an $N$-vector of the outputs in the training set.

$RSS(\beta)$ is a quadratic function of the parameters, and hence its minimum always exists, but may not be unique.

Differentiating w.r.t. $\beta$ we get the normal equations:
$$
\begin{align}
X^T(\boldsymbol{y} - X\beta) = 0
\end{align}
$$

If $X^TX$ is nonsingular, then the unique solution is given by
$$
\begin{align}
\hat{\beta} = (X^T X)^{-1} X^T \boldsymbol{y}
\end{align}
$$

3. Using a linear model for classification:
    
    Code the two classes as a binary variable $Y \in \{0, 1\}$, then fit by linear regression. The fitted values are converted to a class by thresholding at 0.5, so the decision boundary is $\{x : x^T \hat{\beta} = 0.5\}$.


4. Two possible scenarios for the performance of linear regression classification:

    <span style="color:blue">Scenario 1:</span> The training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means.
    
    Linear decision boundary is almost optimal.
    
    <span style="color:blue">Scenario 2:</span> The training data in each class came from a mixture of 10 low-variance Gaussian distributions, with individual means themselves distributed as Gaussian.
    
    Linear decision boundary is unlikely to be optimal.
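
The least squares fit and the 0.5 classification rule above can be sketched as follows. The data are simulated in the spirit of Scenario 1; the class means, sample size and seed are arbitrary choices for illustration, not values from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-class data: bivariate Gaussians with different means
n = 100
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n, 2))  # class 0
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(n, 2))  # class 1
inputs = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n), np.ones(n)])  # classes coded as 0/1

# Prepend a column of ones so the intercept is part of beta
X = np.column_stack([np.ones(len(inputs)), inputs])

# Solve the normal equations X^T X beta = X^T y (assumes X^T X is nonsingular)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# At the solution the residual is orthogonal to every column of X
residual = y - X @ beta_hat
print("X^T residual:", X.T @ residual)

# Classify by thresholding the fitted values at 0.5
g_hat = (X @ beta_hat > 0.5).astype(int)
train_error = np.mean(g_hat != y)
print("training error:", train_error)
```

Solving the linear system with `np.linalg.solve` avoids forming $(X^TX)^{-1}$ explicitly; `np.linalg.lstsq` is a common alternative that also handles a singular $X^TX$.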

#### 2.3.2 Nearest-Neighbor Methods

1. For classification, a majority vote among the k nearest neighbors determines the class. With a 0/1 coded response, $\hat{Y}$ is the proportion of class-1 neighbors, so $\hat{Y} > 0.5$ assigns class 1.


2. For 1-nearest-neighbor classification, each point has an associated tile bounding the region for which it is the closest input point.


3. For k-nearest-neighbor fits, the error on the training data should be approximately an increasing function of k, and will always be 0 for $k=1$. (OVERFITTING)


4. The effective number of parameters of k-nearest neighbors is $N/k$, which is generally larger than the number of least squares parameters $p$ and decreases as $k$ increases.
    
    For example, if the neighborhoods were nonoverlapping, there would be $N/k$ neighborhoods and we would fit one parameter (a mean) in each neighborhood.
    

5. We shouldn't use sum-of-squared errors on the training set as a criterion for picking $k$, since we would always pick $k=1$ (COMPLETELY OVERFITTING).

6. k-nearest neighbors can find a non-linear decision boundary, which is good for the Gaussian mixture model in <span style="color:blue">Scenario 2</span>.
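
A minimal k-nearest-neighbor classifier illustrates items 1, 3 and 5. The function name `knn_predict`, the simulated data and the choices of $k$ are assumptions made for this sketch.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Average the 0/1 responses of the k nearest training points, then threshold at 0.5."""
    # Squared Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dist2 = np.sum(diffs ** 2, axis=2)
    # Indices of the k closest training points for each query point
    nearest = np.argsort(dist2, axis=1)[:, :k]
    y_hat = y_train[nearest].mean(axis=1)  # proportion of class-1 neighbors
    return (y_hat > 0.5).astype(int)

# Hypothetical overlapping two-class data
rng = np.random.default_rng(1)
n = 100
X_train = np.vstack([rng.normal(0.0, 1.0, size=(n, 2)),
                     rng.normal(1.5, 1.0, size=(n, 2))])
y_train = np.concatenate([np.zeros(n), np.ones(n)])

# Training error for a small and a larger neighborhood size
train_error = {k: np.mean(knn_predict(X_train, y_train, X_train, k) != y_train)
               for k in (1, 15)}
print(train_error)
```

With $k=1$ every training point is its own nearest neighbor, so the training error is zero regardless of how noisy the data are. That is why training error cannot be used to choose $k$.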

#### 2.3.3 From Least Squares to Nearest Neighbors
1. Least squares classification relies heavily on the assumption that a linear decision boundary is appropriate; it has low variance but potentially high bias.


2. k-nearest neighbors makes no stringent assumptions about the underlying data; it has low bias but high variance (the decision boundary is wiggly and unstable).


3. A large subset of the most popular techniques in use today are variants of these two simple procedures. For example:
    
    - Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0 or 1 weights used by k-nearest neighbors.
    - In high-dimensional spaces the distance kernels are modified to emphasize some variable more than others, like PCA.
    - Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.
    - Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models, such as splines.
    - Projection pursuit and neural network models consist of sums of non-linearly transformed linear models.
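
To make the first variant concrete, the sketch below compares the 0/1 weights of k-nearest neighbors with smoothly decaying kernel weights for a one-dimensional regression fit at a single target point. The Gaussian kernel, the bandwidth, and the simulated data are assumptions chosen for illustration only.

```python
import numpy as np

def knn_average(x_train, y_train, x0, k):
    """k-NN fit at x0: weight 1 for the k nearest points, 0 for all others."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

def kernel_average(x_train, y_train, x0, bandwidth):
    """Kernel-weighted average at x0; weights decay smoothly with distance from x0."""
    w = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)  # Gaussian kernel weights
    return np.sum(w * y_train) / np.sum(w)

# Hypothetical data: noisy observations of sin(4x) on [0, 1]
rng = np.random.default_rng(2)
x_train = np.sort(rng.uniform(0.0, 1.0, size=200))
y_train = np.sin(4.0 * x_train) + rng.normal(0.0, 0.3, size=200)

x0 = 0.5
fit_knn = knn_average(x_train, y_train, x0, k=20)
fit_kernel = kernel_average(x_train, y_train, x0, bandwidth=0.05)
print(fit_knn, fit_kernel, np.sin(4.0 * x0))
```

Both estimates are local averages; they differ only in how the weight assigned to a training point depends on its distance from `x0`.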
    

### 2.4 Statistical Decision Theory

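1. Under squared error loss $L(Y, f(X)) = (Y-f(X))^2$, the function that minimizes the expected prediction error $E(Y - f(X))^2$ is the conditional expectation, also known as the regression function:
$$f(x) = E(Y|X=x) $$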

    Both k-nearest neighbors and least squares end up approximating conditional expectations by averages.

    - k-nearest neighbor: $\hat{f}(x) = Ave(y_i|x_i \in N_k(x))$.
    - least squares: $\beta = [E(XX^T)] ^{-1}E(XY)$
    
    But they differ dramatically in terms of model assumptions:
    
    - Least squares assumes $f(x)$ is well approximated by a globally linear function.
    - k-nearest neighbors assumes $f(x)$ is well approximated by a locally constant function.
    

2. The dummy-variable regression procedure, followed by classification to the largest fitted value, is another way of representing the Bayes classifier.
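
The result in item 1 follows from conditioning on $X$. A short sketch of the argument, writing $EPE(f)$ for the expected prediction error:

$$
\begin{align}
EPE(f) & = E(Y - f(X))^2 \\
& = E_X \, E_{Y|X}\left([Y - f(X)]^2 \mid X\right)
\end{align}
$$

The inner expectation can be minimized separately for each value $x$, which gives

$$
f(x) = \arg\min_c E_{Y|X}\left([Y - c]^2 \mid X = x\right) = E(Y|X=x),
$$

since the constant $c$ that minimizes an expected squared deviation is the mean.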



### 2.5 Local Methods in High Dimensions

1. Mean squared error (bias-variance decomposition):

$$
\begin{align}
MSE(x) & = E[(f(x) - \hat{y})^2] \\
& = Var(\hat{y}) + Bias^2(\hat{y})
\end{align}
$$

2. Relationship between the empirical MSE and SSE over $N$ points:
$MSE = \frac{1}{N} SSE = \frac{1}{N} \sum_{i=1}^N (f_i -y_i)^2$


3. If the relationship between Y and X is linear (and the errors are uncorrelated with constant variance), then the least squares estimates are BLUE (Best Linear Unbiased Estimator): they are unbiased and have the smallest variance among all linear unbiased estimators.
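
The decomposition in item 1 can be derived by adding and subtracting $E\hat{y}$. This sketch assumes $f(x)$ is a fixed (deterministic) quantity and that the expectation is taken over training sets:

$$
\begin{align}
MSE(x) & = E[(\hat{y} - f(x))^2] \\
& = E[(\hat{y} - E\hat{y})^2] + 2\,(E\hat{y} - f(x))\,E[\hat{y} - E\hat{y}] + (E\hat{y} - f(x))^2 \\
& = Var(\hat{y}) + Bias^2(\hat{y})
\end{align}
$$

The cross term vanishes because $E[\hat{y} - E\hat{y}] = 0$, and $Bias(\hat{y}) = E\hat{y} - f(x)$.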

### 2.6 Statistical Models, Supervised Learning and Function Approximation
#### 2.6.1 A Statistical Model for the Joint Distribution Pr(X, Y)

1. The additive error model:
$$Y = f(X) + \epsilon $$
where the random error $\epsilon$ has $E(\epsilon) = 0$ and is independent of X. 

    Generally there will be other unmeasured variables that also contribute to Y , including measurement error. The additive model assumes that we can capture all these departures from a deterministic relationship via the error $\epsilon$.
    
    
2. The assumption that the errors in the additive error model are i.i.d. (independent and identically distributed) is not strictly necessary. Simple modifications can be made to avoid the independence assumption.


3. Additive error models are typically not used for qualitative outputs $G$; in this case the target function $p(X)$ is the conditional density $Pr(G|X)$, and this is modeled directly.
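
The additive error model in item 1 can be simulated to check that $f(x) = E(Y|X=x)$ when $E(\epsilon) = 0$ and $\epsilon$ is independent of $X$. The function `f`, the noise level and the window width below are hypothetical choices for this sketch.

```python
import numpy as np

def f(x):
    """Hypothetical regression function, chosen only for illustration."""
    return 2.0 * x + 1.0

rng = np.random.default_rng(3)
n = 50_000
x = rng.uniform(-1.0, 1.0, size=n)
eps = rng.normal(0.0, 0.5, size=n)  # E(eps) = 0, drawn independently of x
y = f(x) + eps                      # additive error model Y = f(X) + eps

# Average y over a narrow window around x0 to approximate E(Y | X = x0)
x0 = 0.3
window = np.abs(x - x0) < 0.02
cond_mean = y[window].mean()
print(cond_mean, f(x0))
```

The local average of `y` near `x0` should be close to `f(x0)`, which is the same locally constant approximation that k-nearest neighbors relies on.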