# Introduction

+ Inputs are often called *predictors* or *independent variables*; in the pattern recognition literature the term *features* is preferred.

+ Outputs are called *responses* or *dependent variables*.

# Variable Types and Terminology

The distinction in output type has led to a naming convention for prediction tasks:

+ *regression* when we predict quantitative outputs

+ *classification* when we predict qualitative outputs

A third type is *ordered categorical*, such as small, medium and large, where there is an ordering between the values but no metric notion is appropriate (the difference between medium and small need not be the same as that between large and medium).

# Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors

The linear model makes huge assumptions about structure and yields stable
but possibly inaccurate predictions. The method of k-nearest neighbors
makes very mild structural assumptions: its predictions are often accurate
but can be unstable.

## Linear Models and Least Squares

Given a vector of inputs $X ^ { T } = \left( X _ { 1 } , X _ { 2 } , \ldots , X _ { p } \right)$, we predict the output $Y$ via the model

$$
\hat { Y } = \hat { \beta } _ { 0 } + \sum _ { j = 1 } ^ { p } X _ { j } \hat { \beta } _ { j }
$$

The term $\hat{\beta}_0$ is the **intercept**, also known as the *bias* in machine learning.

How do we fit the linear model to a set of training data? There are
many different methods, but by far the most popular is the method of
least squares. In this approach, we pick the coefficients $\beta$ to minimize the
residual sum of squares.

$$
\operatorname { RSS } ( \beta ) = ( \mathbf { y } - \mathbf { X } \beta ) ^ { T } ( \mathbf { y } - \mathbf { X } \beta )
$$

If $\mathbf { X } ^ { T } \mathbf { X }$ is nonsingular, then the unique solution is given by

$$
\hat { \beta } = \left( \mathbf { X } ^ { T } \mathbf { X } \right) ^ { - 1 } \mathbf { X } ^ { T } \mathbf { y }
$$
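
The closed-form solution can be sketched in a few lines of NumPy. This is a minimal illustration, not a production fit: the helper names `fit_least_squares` and `predict_linear` and the synthetic data are assumptions made for the example, and the normal equations are solved directly rather than by forming the inverse explicitly.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return beta_hat minimizing RSS(beta) = (y - X beta)^T (y - X beta).

    X is an N x p input matrix; a column of ones is prepended so that
    beta_hat[0] plays the role of the intercept.
    """
    X1 = np.column_stack([np.ones(len(X)), X])
    # Solve (X^T X) beta = X^T y; this requires X^T X to be nonsingular.
    return np.linalg.solve(X1.T @ X1, X1.T @ y)

def predict_linear(beta_hat, X):
    """Y_hat = beta_0 + sum_j X_j * beta_j for each row of X."""
    return beta_hat[0] + X @ beta_hat[1:]

# Synthetic data (assumed for illustration): y = 1 + 2*x1 - 3*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.1 * rng.normal(size=200)

beta_hat = fit_least_squares(X, y)
y_fit = predict_linear(beta_hat, X)
print(beta_hat)
```

With this little noise the estimated coefficients should land close to the generating values (1, 2, -3).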

## Nearest-Neighbor Methods

Nearest-neighbor methods use those observations in the training set $T$ closest in input space to $x$ to form $\hat{Y}$. Specifically, the $k$-nearest-neighbor fit
for $\hat{Y}$ is defined as follows:

$$
\hat { Y } ( x ) = \frac { 1 } { k } \sum _ { x _ { i } \in N _ { k } ( x ) } y _ { i }
$$

where $N_k(x)$ is the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample. Closeness implies a metric, which for the moment we assume is Euclidean distance. So, in words, we find the $k$ observations with $x_i$ closest to $x$ in input space, and average their responses.
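
The averaging rule can be written directly. The sketch below assumes Euclidean distance as in the text; the function name `knn_predict` and the toy data are made up for illustration, and the brute-force sort is chosen for clarity rather than speed.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Average the responses of the k training points closest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every x_i
    nearest = np.argsort(dists)[:k]               # indices of the neighborhood N_k(x)
    return y_train[nearest].mean()

# Toy one-dimensional training set (assumed for illustration)
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0, 10.0])

# The three closest inputs to 1.2 are 1.0, 2.0 and 0.0, so the fit is (1 + 2 + 0) / 3.
y_hat = knn_predict(X_train, y_train, np.array([1.2]), k=3)
print(y_hat)
```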

## From Least Squares to Nearest Neighbors

The linear decision boundary from least squares is very smooth, and apparently stable to fit. It does appear to rely heavily on the assumption that a linear decision boundary is appropriate. In language we will develop
shortly, it has **low variance and potentially high bias**.

On the other hand, the k-nearest-neighbor procedures do not appear to
rely on any stringent assumptions about the underlying data, and can adapt
to any situation. However, any particular subregion of the decision boundary depends on a handful of input points and their particular positions, and is thus wiggly and unstable—**high variance and low bias**.

# Statistical Decision Theory

Let $X \in \mathbb { R } ^ { p }$ denote a real-valued random input vector, and $Y \in \mathbb { R }$ a real-valued random output variable, with joint distribution $\operatorname{Pr}(X, Y)$. We seek a function $f(X)$ for predicting $Y$ given values of the input $X$. This theory requires a **loss function** $L(Y, f(X))$ for penalizing errors in prediction, and by far the most common and convenient is *squared error loss*: $L ( Y , f ( X ) ) = ( Y - f ( X ) ) ^ { 2 }$. This leads us to a criterion for choosing $f$:

$$
\begin{aligned} \operatorname { EPE } ( f ) & = \mathrm { E } ( Y - f ( X ) ) ^ { 2 } \\ & = \int [ y - f ( x ) ] ^ { 2 } \operatorname { Pr } ( d x , d y ) \end{aligned}
$$
the expected (squared) prediction error. By conditioning on $X$, we can write EPE as

$$
\operatorname { EPE } ( f ) = \mathrm { E } _ { X } \mathrm { E } _ { Y | X } \left( [ Y - f ( X ) ] ^ { 2 } | X \right)
$$

and we see that it suffices to minimize EPE pointwise:

$$
f ( x ) = \operatorname { argmin } _ { c } \mathrm { E } _ { Y | X } \left( [ Y - c ] ^ { 2 } | X = x \right)
$$

The solution is 

$$
f ( x ) = \mathrm { E } ( Y | X = x )
$$

the conditional expectation, also known as the regression function. Thus
the best prediction of Y at any point X = x is the conditional mean, when
best is measured by average squared error.
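
To see why the conditional mean is the minimizer, one possible short derivation writes $m(x) = \mathrm{E}(Y | X = x)$ (a shorthand introduced here only for this step) and expands the square around it:

$$
\begin{aligned}
\mathrm{E}\left( [Y - c]^2 | X = x \right)
&= \mathrm{E}\left( [Y - m(x)]^2 | X = x \right) + 2 \, [m(x) - c] \, \mathrm{E}\left( Y - m(x) | X = x \right) + [m(x) - c]^2 \\
&= \operatorname{Var}(Y | X = x) + [m(x) - c]^2
\end{aligned}
$$

The cross term vanishes because $\mathrm{E}(Y - m(x) | X = x) = 0$. The first term does not depend on $c$, so the expression is smallest when $c = m(x)$.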

We conclude that both k-nearest neighbors and least squares end up approximating
conditional expectations by averages. But they differ dramatically in terms
of model assumptions:

+ Least squares assumes $f(x)$ is well approximated by a globally linear
function.

+ k-nearest neighbors assumes $f(x)$ is well approximated by a locally
constant function.

Many of the more modern techniques described in this book are model
based, although far more flexible than the rigid linear model. For example,
additive models assume that

$$
f ( X ) = \sum _ { j = 1 } ^ { p } f _ { j } \left( X _ { j } \right)
$$

This retains the additivity of the linear model, but each coordinate function
$f_j$ is arbitrary. It turns out that the optimal estimate for the additive model
uses techniques such as k-nearest neighbors to approximate univariate conditional expectations simultaneously for each of the coordinate functions.
Thus the problems of estimating a conditional expectation in high dimensions are swept away in this case by imposing some (often unrealistic) model
assumptions, in this case additivity.

If we replace the $L_2$ loss function with the $L_1$ loss $\mathrm { E } | Y - f ( X ) |$, the solution is the conditional median

$$
\hat { f } ( x ) = \operatorname { median } ( Y | X = x )
$$

which is a different measure of location, and its estimates are more robust
than those for the conditional mean. $L_1$ criteria have **discontinuities in
their derivatives**, which have hindered their widespread use.
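
A small numerical check can make the difference between the two criteria concrete. The sketch below ignores the conditioning on $X$ and simply searches a grid of constants $c$ for the minimizer of the average squared and absolute error on a made-up sample with one outlier; the data and the grid are assumptions for illustration.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # sample with one gross outlier
grid = np.linspace(0.0, 100.0, 10001)        # candidate constants c, spaced 0.01 apart

# Average loss of every candidate c under each criterion
l2_risk = ((y[:, None] - grid[None, :]) ** 2).mean(axis=0)
l1_risk = np.abs(y[:, None] - grid[None, :]).mean(axis=0)

c_l2 = grid[l2_risk.argmin()]   # lands at the sample mean, pulled toward the outlier
c_l1 = grid[l1_risk.argmin()]   # lands at the sample median, unaffected by the outlier
print(c_l2, np.mean(y))
print(c_l1, np.median(y))
```

The absolute-error curve is piecewise linear in $c$ with a kink at each data point, which is the discontinuity in the derivative mentioned above.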

Next, what do we do when the output is a categorical variable $G$? The same
paradigm works here, except we need a different loss function for penalizing
prediction errors. An estimate $\hat{G}$ will assume values in $\mathcal{G}$, the set of possible classes. Our loss function can be represented by a $K \times K$ matrix $\mathbf{L}$, where $K = \operatorname{card}(\mathcal{G})$. $\mathbf{L}$ will be zero on the diagonal and nonnegative elsewhere, where $L(k, \ell)$ is the price paid for classifying an observation belonging to class $\mathcal{G}_k$ as $\mathcal{G}_\ell$. Most often we use the zero–one loss function, where all
misclassifications are charged a single unit. The expected prediction error
is

$$
\mathrm { EPE } = \mathrm { E } [ L ( G , \hat { G } ( X ) ) ]
$$

Again we condition on $X$, and can write EPE as

$$
\mathrm { EPE } = \mathrm { E } _ { X } \sum _ { k = 1 } ^ { K } L \left[ \mathcal { G } _ { k } , \hat { G } ( X ) \right] \operatorname { Pr } \left( \mathcal { G } _ { k } | X \right)
$$

and again it suffices to minimize EPE pointwise:

$$
\hat { G } ( x ) = \operatorname { argmin } _ { g \in \mathcal { G } } \sum _ { k = 1 } ^ { K } L \left( \mathcal { G } _ { k } , g \right) \operatorname { Pr } \left( \mathcal { G } _ { k } | X = x \right)
$$

With the 0–1 loss function this simplifies to

$$
\hat { G } ( x ) = \operatorname { argmin } _ { g \in \mathcal { G } } [ 1 - \operatorname { Pr } ( g | X = x ) ]
$$

or simply

$$
\hat { G } ( x ) = \mathcal { G } _ { k } \text { if } \operatorname { Pr } \left( \mathcal { G } _ { k } | X = x \right) = \max _ { g \in \mathcal { G } } \operatorname { Pr } ( g | X = x )
$$

This reasonable solution is known as the *Bayes classifier*, and says that
we classify to the most probable class, using the conditional (discrete) distribution $Pr(G|X)$.
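
The pointwise rule is easy to evaluate once the class posteriors at $x$ are known. The sketch below assumes they are given as a vector; the function name `bayes_classify`, the example probabilities and the asymmetric loss matrix are all hypothetical. Rows of `loss` index the true class and columns the predicted class.

```python
import numpy as np

def bayes_classify(posterior, loss):
    """Return the class index g minimizing sum_k L(k, g) * Pr(G_k | X = x).

    posterior: length-K vector of Pr(G_k | X = x)
    loss:      K x K matrix, loss[k, l] = cost of predicting class l when the truth is k
    """
    expected_loss = posterior @ loss      # entry l equals sum_k posterior[k] * loss[k, l]
    return int(np.argmin(expected_loss))

posterior = np.array([0.2, 0.5, 0.3])     # assumed posteriors at some point x

# Zero-one loss: the rule reduces to picking the most probable class.
zero_one = 1.0 - np.eye(3)
g_01 = bayes_classify(posterior, zero_one)

# Hypothetical asymmetric loss: misclassifying a true class 0 costs ten units.
asym = zero_one.copy()
asym[0, 1:] = 10.0
g_asym = bayes_classify(posterior, asym)
print(g_01, g_asym)
```

Under zero-one loss the expected losses are one minus the posteriors, so the rule picks class 1. With the asymmetric matrix the costly error dominates and the rule switches to class 0 even though it is the least probable.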

Again we see that the k-nearest neighbor classifier directly approximates
this solution: a majority vote in a nearest neighborhood amounts to exactly this, except that conditional probability at a point is relaxed to conditional probability within a neighborhood of a point, and probabilities are estimated by training-sample proportions.

# Local Methods in High Dimensions

It would seem that with a reasonably large set of training data, we could always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging, since we should
be able to find a fairly large neighborhood of observations close to any x
and average them. This approach and our intuition break down in high
dimensions, and the phenomenon is commonly referred to as the **curse
of dimensionality**.
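
One way to get a feel for the effect is a quick numerical sketch, under the assumption that inputs are uniformly distributed in the unit cube $[0, 1]^p$. A sub-cube holding a fraction $r$ of the volume must have edge length $r^{1/p}$, and the nearest of $N$ sample points drifts away from any query point as $p$ grows. The function names and the choice of $N = 500$ are illustrative only.

```python
import numpy as np

def edge_length(fraction, p):
    """Edge length of a sub-cube of the unit cube in R^p holding `fraction` of its volume."""
    return fraction ** (1.0 / p)

def nearest_neighbor_distance(p, n=500, seed=0):
    """Distance from the cube's center to the nearest of n uniform points in [0, 1]^p."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n, p))
    center = np.full(p, 0.5)
    return np.linalg.norm(points - center, axis=1).min()

for p in (1, 2, 10, 100):
    print(p, round(edge_length(0.01, p), 3), round(nearest_neighbor_distance(p), 3))
```

In ten dimensions, capturing 1% of the data already requires covering about 63% of the range of each input, so the neighborhood is no longer local.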

# Statistical Models, Supervised Learning and Function Approximation

Our goal is to find a useful approximation $\hat{f}(x)$ to the function $f(x)$, but the nearest-neighbor methods described above can fail in at least two ways:

+ if the dimension of the input space is high, the nearest neighbors need
not be close to the target point, and can result in large errors;

+ if special structure is known to exist, this can be used to reduce both
the bias and the variance of the estimates.

## A Statistical Model for the Joint Distribution Pr(X, Y )

Suppose in fact that our data arose from a statistical model

$$
Y = f ( X ) + \varepsilon
$$

where the random error $\varepsilon$ has $E(\varepsilon) = 0$ and is independent of X. The additive error model is a useful approximation to the truth. For
most systems the input–output pairs (X, Y ) will not have a deterministic
relationship Y = f(X). Generally there will be other unmeasured variables
that also contribute to Y , including measurement error. The additive model
assumes that we can capture all these departures from a deterministic relationship via the error $\varepsilon$.

## Supervised Learning

Supervised learning attempts to learn f by
example through a teacher. One observes the system under study, both
the inputs and outputs, and assembles a training set of observations $T = (x_i, y_i),\ i = 1, \ldots, N$. The observed input values to the system $x_i$ are also
fed into an artificial system, known as a learning algorithm (usually a computer program), which also produces outputs $\hat{f}(x_i)$ in response to the inputs. The learning algorithm has the property that it can modify its input/output relationship $\hat{f}$ in response to differences $y_i - \hat{f}(x_i)$ between the
original and generated outputs. This process is known as learning by example. Upon completion of the learning process the hope is that the artificial
and real outputs will be close enough to be useful for all sets of inputs likely
to be encountered in practice.

## Function Approximation

The approach taken in applied mathematics and statistics has been from the perspective of function approximation and estimation. Here the data pairs $(x_i, y_i)$ are viewed as points in a $(p + 1)$-dimensional Euclidean space. The function $f(x)$ has domain equal to the $p$-dimensional input subspace, and is related to the data via a model.

The goal is to obtain a useful approximation to $f(x)$ for all $x$ in some region of $\mathbb { R } ^ { p }$, given the representations in $T$. Although somewhat less glamorous than the learning paradigm, treating
supervised learning as a problem in function approximation encourages the
geometrical concepts of Euclidean spaces and mathematical concepts of
probabilistic inference to be applied to the problem. 

Many of the approximations we will encounter have an associated set of
parameters $\theta$ that can be modified to suit the data at hand. For example, the linear model $f ( x ) = x ^ { T } \beta$ has $\theta = \beta$. Another class of useful approximators can be expressed as *linear basis expansions*

$$
f _ { \theta } ( x ) = \sum _ { k = 1 } ^ { K } h _ { k } ( x ) \theta _ { k }
$$

where the $h_k$ are a suitable set of functions or transformations of the input
vector $x$. Traditional examples are polynomial and trigonometric expansions, where for example $h_k$ might be $x_1^2$, $x_1x_2^2$, $\cos(x_1)$ and so on. We
also encounter nonlinear expansions, such as the sigmoid transformation
common to neural network models

$$
h _ { k } ( x ) = \frac { 1 } { 1 + \exp \left( - x ^ { T } \beta _ { k } \right) }
$$

We can use least squares to estimate the parameters $\theta$ in $f_\theta$ as we did
for the linear model, by minimizing the residual sum-of-squares

$$
\operatorname { RSS } ( \theta ) = \sum _ { i = 1 } ^ { N } \left( y _ { i } - f _ { \theta } \left( x _ { i } \right) \right) ^ { 2 }
$$

For the linear model we get a simple closed form solution to the minimization problem. This is also true for the basis function methods, if the
basis functions themselves do not have any hidden parameters. Otherwise
the solution requires either iterative methods or numerical optimization.
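
When the basis functions are fixed, the fit is an ordinary least squares problem in the transformed inputs. The sketch below assumes a one-dimensional input and a polynomial basis $h_k(x) = x^k$; the helper name `polynomial_basis`, the degree and the synthetic target are choices made only for illustration.

```python
import numpy as np

def polynomial_basis(x, degree):
    """Matrix with columns h_k(x) = x**k for k = 0, ..., degree."""
    return np.vander(x, degree + 1, increasing=True)

# Synthetic data (assumed): a smooth nonlinear target plus a little noise
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=100)
y = np.cos(3.0 * x) + 0.05 * rng.normal(size=100)

H = polynomial_basis(x, degree=6)                     # N x K matrix of basis values
theta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)     # minimizes RSS(theta)
rss = np.sum((y - H @ theta_hat) ** 2)
print(theta_hat.round(3), rss)
```

The model is nonlinear in $x$ but linear in $\theta$, which is why the closed-form machinery still applies. A basis with its own inner parameters, such as the sigmoid above, would need an iterative optimizer instead.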

While least squares is generally very convenient, it is not the only criterion used and in some cases would not make much sense. A more general principle for estimation is **maximum likelihood estimation**.

Suppose we have a random sample $y _ { i } , i = 1 , \dots , N$ from a density $\operatorname { Pr } _ { \theta } ( y )$ indexed by some parameters $\theta$. The log-probability of the observed sample is

$$
L ( \theta ) = \sum _ { i = 1 } ^ { N } \log \operatorname { Pr } _ { \theta } \left( y _ { i } \right)
$$

**The principle of maximum likelihood assumes that the most reasonable
values for** $\theta$ **are those for which the probability of the observed sample is
largest.**

Least squares for the additive error model $Y = f _ { \theta } ( X ) + \varepsilon$, with $\varepsilon \sim N \left( 0 , \sigma ^ { 2 } \right)$, is equivalent to maximum likelihood using the conditional
likelihood

$$
\operatorname { Pr } ( Y | X , \theta ) = N \left( f _ { \theta } ( X ) , \sigma ^ { 2 } \right)
$$

So although the additional assumption of normality seems more restrictive,
the results are the same. The log-likelihood of the data is

$$
L ( \theta ) = - \frac { N } { 2 } \log ( 2 \pi ) - N \log \sigma - \frac { 1 } { 2 \sigma ^ { 2 } } \sum _ { i = 1 } ^ { N } \left( y _ { i } - f _ { \theta } \left( x _ { i } \right) \right) ^ { 2 }
$$
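
The only term in this log-likelihood that involves $\theta$ is the last one, which is $\operatorname{RSS}(\theta)$ scaled by a negative constant, so maximizing one is the same as minimizing the other. The check below is a sketch under simple assumptions: a one-parameter model $f_\theta(x) = \theta x$, a known $\sigma$, synthetic data, and a grid search in place of a real optimizer.

```python
import numpy as np

def gaussian_log_likelihood(y, fitted, sigma):
    """L(theta) for the additive Gaussian error model, given fitted values f_theta(x_i)."""
    n = len(y)
    rss = np.sum((y - fitted) ** 2)
    return -0.5 * n * np.log(2 * np.pi) - n * np.log(sigma) - rss / (2 * sigma ** 2)

# Synthetic data (assumed): y = 1.5 * x + Gaussian noise with sigma = 0.3
rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 1.5 * x + 0.3 * rng.normal(size=50)

slopes = np.linspace(0.0, 3.0, 301)   # candidate values of theta
rss = np.array([np.sum((y - s * x) ** 2) for s in slopes])
loglik = np.array([gaussian_log_likelihood(y, s * x, sigma=0.3) for s in slopes])

best_by_rss = slopes[rss.argmin()]
best_by_loglik = slopes[loglik.argmax()]
print(best_by_rss, best_by_loglik)
```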

A more interesting example is the multinomial likelihood for the regression function $\operatorname { Pr } ( G | X )$ for a qualitative output G. Suppose we have a model $\operatorname { Pr } \left( G = \mathcal { G } _ { k } | X = x \right) = p _ { k , \theta } ( x ) , k = 1 , \ldots , K$ for the conditional probability of each class given X, indexed by the parameter vector $\theta$. Then the log-likelihood (also referred to as the **cross-entropy**) is

$$
L ( \theta ) = \sum _ { i = 1 } ^ { N } \log p _ { g _ { i } , \theta } \left( x _ { i } \right)
$$

and when maximized it delivers values of $\theta$ that best conform with the data in this likelihood sense.
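
Evaluating this log-likelihood only requires picking out, for each observation, the modelled probability of the class that was actually observed. The sketch below assumes the model's probabilities are already available as a matrix; the function name and the numbers are illustrative.

```python
import numpy as np

def multinomial_log_likelihood(probs, labels):
    """Sum over i of log p_{g_i}(x_i).

    probs:  N x K matrix, probs[i, k] = modelled Pr(G = G_k | X = x_i)
    labels: length-N integer array of observed classes g_i in {0, ..., K-1}
    """
    picked = probs[np.arange(len(labels)), labels]   # probability given to each observed class
    return np.sum(np.log(picked))

# Assumed model outputs for three observations and three classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])

ll = multinomial_log_likelihood(probs, labels)   # log(0.7) + log(0.8) + log(0.4)
print(ll)
```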

# Structured Regression Models

## Difficulty of the Problem

Any method that attempts to produce locally varying functions in small isotropic neighborhoods will run
into problems in high dimensions—again the curse of dimensionality. And
conversely, all methods that overcome the dimensionality problems have an
associated—and often implicit or adaptive—metric for measuring neighborhoods, which basically does not allow the neighborhood to be simultaneously small in all directions.

# Classes of Restricted Estimators

The variety of nonparametric regression techniques or learning methods fall
into a number of different classes depending on the nature of the restrictions
imposed. These classes are not distinct, and indeed some methods fall in
several classes.

Each of the classes has associated with it one
or more parameters, sometimes appropriately called smoothing parameters,
that control the effective size of the local neighborhood. Here we describe
three broad classes.

## Roughness Penalty and Bayesian Methods

To be added.

## Kernel Methods and Local Regression

To be added.

## Basis Functions and Dictionary Methods

To be added.

# Model Selection and the Bias–Variance Tradeoff

Suppose the data arise from a model $Y = f ( X ) + \varepsilon$, with $\mathrm { E } ( \varepsilon ) = 0$ and
$\operatorname { Var } ( \varepsilon ) = \sigma ^ { 2 }$. For simplicity here we assume that the values of $x_i$ in the sample are fixed in advance (nonrandom). The expected prediction error of the $k$-nearest-neighbor fit $\hat{f}_k$
at $x_0$, also known as test or generalization error, can be decomposed:

$$
\begin{aligned} \operatorname { EPE } _ { k } \left( x _ { 0 } \right) & = \mathrm { E } \left[ \left( Y - \hat { f } _ { k } \left( x _ { 0 } \right) \right) ^ { 2 } | X = x _ { 0 } \right] \\ & = \sigma ^ { 2 } + \left[ \operatorname { Bias } ^ { 2 } \left( \hat { f } _ { k } \left( x _ { 0 } \right) \right) + \operatorname { Var } _ { \mathcal { T } } \left( \hat { f } _ { k } \left( x _ { 0 } \right) \right) \right] \\ & = \sigma ^ { 2 } + \left[ f \left( x _ { 0 } \right) - \frac { 1 } { k } \sum _ { \ell = 1 } ^ { k } f \left( x _ { ( \ell ) } \right) \right] ^ { 2 } + \frac { \sigma ^ { 2 } } { k } \end{aligned}
$$


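Here the subscripts in parentheses $(\ell)$ indicate the sequence of nearest neighbors to $x_0$. There are three terms in this expression. The first term $\sigma^2$ is the irreducible error (the variance of the new test target) and is beyond our control, even if we know the true $f(x_0)$.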

The second and third terms are under our control, and make up the
mean squared error of $\hat{f}_k(x_0)$ in estimating $f(x_0)$, which is broken down into a bias component and a variance component. The bias term is the
squared difference between the true mean $f(x_0)$ and the expected value of
the estimate, $\left[ \mathrm { E } _ { \mathcal { T } } \left( \hat { f } _ { k } \left( x _ { 0 } \right) \right) - f \left( x _ { 0 } \right) \right] ^ { 2 }$, where the expectation averages the
randomness in the training data. This term will most likely increase with
$k$, if the true function is reasonably smooth. For small $k$ the few closest
neighbors will have values $f \left( x _ { ( \ell ) } \right)$ close to $f(x_0)$, so their average should be close to $f(x_0)$. As $k$ grows, the neighbors are further away, and then
anything can happen.

The variance term is simply the variance of an average here, and decreases as the inverse of k. So as k varies, there is a **bias–variance tradeoff**.
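
Because the inputs are treated as fixed, both terms of the decomposition can be computed exactly once a true function is assumed. The sketch below uses a made-up setup: a sine curve as $f$, an evenly spaced one-dimensional design, and $\sigma = 0.5$. The function name `knn_bias_variance` is hypothetical.

```python
import numpy as np

def knn_bias_variance(f, x_train, x0, k, sigma):
    """Squared bias and variance of the k-NN fit at x0 for fixed inputs x_train."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]           # the k nearest neighbors of x0
    bias_sq = (f(x0) - f(x_train[nearest]).mean()) ** 2      # [f(x0) - (1/k) sum f(x_(l))]^2
    variance = sigma ** 2 / k                                # variance of an average of k responses
    return bias_sq, variance

def f(x):
    """Assumed true regression function."""
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0.0, 1.0, 101)

results = {}
for k in (1, 5, 25, 75):
    results[k] = knn_bias_variance(f, x_train, x0=0.25, k=k, sigma=0.5)
    print(k, results[k])
```

In this setup the squared bias is zero at $k = 1$ (the query point is itself a training input) and grows as the neighborhood widens, while the variance shrinks as $1/k$.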

More generally, as the model complexity of our procedure is increased, the
variance tends to increase and the squared bias tends to decrease.

Typically we would like to choose our model complexity to trade bias
off with variance in such a way as to minimize the test error. An obvious
estimate of test error is the *training error* $\frac { 1 } { N } \sum _ { i } \left( y _ { i } - \hat { y } _ { i } \right) ^ { 2 }$. Unfortunately
training error is not a good estimate of test error, as it does not properly
account for model complexity.

![](https://i.loli.net/2019/02/26/5c7546f5f1a78.png)

The figure shows the typical behavior of the test and training error, as
model complexity is varied. The training error tends to decrease whenever
we increase the model complexity, that is, whenever we fit the data harder.
However with too much fitting, the model adapts itself too closely to the
training data, and will not generalize well (i.e., have large test error). In
that case the predictions $\hat{f}(x_0)$ will have large variance, as reflected in the
last term of the decomposition above. In contrast, if the model is not complex
enough, it will underfit and may have large bias, again resulting in poor
generalization. 
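
The qualitative behavior of the two curves can be reproduced with a small simulation. This is only a sketch under assumptions not taken from the figure: polynomial degree stands in for model complexity, the true function is a sine curve, and the sample sizes and noise level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    """Assumed true regression function."""
    return np.sin(np.pi * x)

# Small training sample and a large independent test sample
x_train = rng.uniform(-1, 1, size=30)
y_train = f(x_train) + 0.3 * rng.normal(size=30)
x_test = rng.uniform(-1, 1, size=1000)
y_test = f(x_test) + 0.3 * rng.normal(size=1000)

degrees = range(0, 13)
train_err, test_err = [], []
for d in degrees:
    H_train = np.vander(x_train, d + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(H_train, y_train, rcond=None)
    train_err.append(np.mean((y_train - H_train @ theta) ** 2))
    H_test = np.vander(x_test, d + 1, increasing=True)
    test_err.append(np.mean((y_test - H_test @ theta) ** 2))

for d, tr, te in zip(degrees, train_err, test_err):
    print(d, round(tr, 4), round(te, 4))
```

Training error cannot increase as the degree grows, since each model contains the previous one. Test error is not bound by this and typically rises again once the fit starts chasing noise.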


 