$$ Sales \approx f(TV,Radio,Newspaper) $$

## Notation

- **Sales** ---- $Y$
- **TV** ---- $X_1$
- **Radio** ---- $X_2$
- **Newspaper** ---- $X_3$


We can write the $input$ $vector$ as

$$X = \begin{pmatrix}
X_1 \\
X_2 \\
X_3
\end{pmatrix}$$

Now we can write our model as 

$$Y = f(X) + \epsilon$$

where $\epsilon$ captures measurement errors and other discrepancies.

## What is $f(X)$ good for?
- With a good $f$ we can make predictions of $Y$ at new points $X = x$.
- We can understand which components of $X = (X_1,X_2,\dots,X_p)$ are important in explaining $Y$, and which are irrelevant. For example, **Seniority** and **Years of Education** have a big impact on Income, but **Marital Status** typically does not.
- Depending on the complexity of $f$, we may be able to understand how each component $X_j$ of $X$ affects $Y$.

Is there an ideal $f(X)$? In particular, what is a good value for $f(X)$ at any selected value of $X$, say $X=4$? There can be many $Y$ values at $X=4$. A good value is

$$f(4) = E(Y|X=4)$$

$E(Y|X=4)$ means the $expected$ $value$ (average) of $Y$ given $X=4$.

This ideal $f(x)=E(Y|X=x)$ is called the $regression$ $function$.
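
As a minimal sketch of what "average of $Y$ given $X=4$" means in a sample, the snippet below simulates data in which $X$ takes only integer values, so many observations share $X=4$. The function `true_f`, the noise level, and the sample size are hypothetical choices made only for illustration; they are not part of the advertising or income examples.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical regression function, chosen only for illustration.
    return 2.0 + 0.5 * x

# X takes integer values 1..7, so many observations share X = 4.
x = rng.integers(1, 8, size=20_000)
y = true_f(x) + rng.normal(0.0, 1.0, size=x.size)

# Sample analogue of E(Y | X = 4): average Y over observations with X == 4.
f_hat_4 = y[x == 4].mean()
print(f_hat_4, true_f(4))
```

With this many observations at $X=4$, the sample average should land close to `true_f(4)`; with continuous $X$ there would typically be no exact matches to average over.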

## The regression function $f(x)$
- Is also defined for vector $X$; e.g.
    $$f(x)=f(x_1,x_2,x_3) = E(Y|X_1=x_1,X_2=x_2,X_3=x_3)$$

- Is the $ideal$ or $optimal$ predictor of $Y$ with regard to mean-squared prediction error: $f(x)=E(Y|X=x)$ is the function that minimizes $E[(Y-g(X))^2 | X=x]$ over all functions $g$ at all points $X=x$.

- $\epsilon = Y - f(x)$ is the $irreducible$ error $-$ i.e. even if we knew $f(x)$, we would still make errors in prediction, since at each $X=x$ there is typically a distribution of possible $Y$ values.

- For any estimate $\hat{f}(x)$ of $f(x)$, we have
    $$E[(Y-\hat{f}(x))^2|X=x] = [f(x)-\hat{f}(x)]^2 + \text{Var}(\epsilon)$$
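
The decomposition into a reducible and an irreducible part can be checked in a few lines. This is a sketch under two assumptions that the notes do not state explicitly: $\hat{f}$ is treated as fixed (not random), and $\text{Var}(\epsilon \mid X=x) = \text{Var}(\epsilon)$. The fact that $E(\epsilon \mid X=x)=0$ follows from $f(x)=E(Y|X=x)$.

$$
\begin{aligned}
E[(Y-\hat{f}(x))^2 \mid X=x]
&= E[(f(x) + \epsilon - \hat{f}(x))^2 \mid X=x] \\
&= [f(x)-\hat{f}(x)]^2 + 2\,[f(x)-\hat{f}(x)]\,E(\epsilon \mid X=x) + E(\epsilon^2 \mid X=x) \\
&= [f(x)-\hat{f}(x)]^2 + \text{Var}(\epsilon)
\end{aligned}
$$

The first term is reducible: it shrinks as $\hat{f}(x)$ approaches $f(x)$. The second term does not depend on $\hat{f}$ at all.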

## How to estimate $f$
- Typically we have few if any data points with $X = 4$ exactly.
- So we cannot compute $E(Y|X=x)$ directly!
- Relax the definition and let
$$\hat{f}(x) = \text{Ave}(Y|X \in \mathcal{N}(x))$$
where $\mathcal{N}(x)$ is some $neighborhood$ of $x$.
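
A minimal sketch of neighborhood averaging in one dimension follows. It assumes the neighborhood $\mathcal{N}(x)$ is taken to be the $k$ training points nearest to $x$; the helper name `neighborhood_average`, the truth $\sin(x)$, and the noise level are hypothetical and used only for illustration.

```python
import numpy as np

def neighborhood_average(x_train, y_train, x0, k):
    """Estimate f(x0) by averaging y over the k training points nearest to x0."""
    distances = np.abs(x_train - x0)
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

rng = np.random.default_rng(1)
x_train = rng.uniform(0.0, 10.0, size=500)
# Hypothetical truth f(x) = sin(x) plus noise, used only for illustration.
y_train = np.sin(x_train) + rng.normal(0.0, 0.3, size=x_train.size)

f_hat_4 = neighborhood_average(x_train, y_train, x0=4.0, k=25)
print(f_hat_4, np.sin(4.0))
```

The choice of `k` sets the width of the neighborhood: a larger `k` averages more points (lower variance) but reaches further from `x0` (more bias).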

- Nearest neighbor averaging can be pretty good for small $p$ (i.e. $p \le 4$) and large-ish $N$.
- We will discuss smoother versions, such as kernel and spline smoothing later in the course.
- Nearest neighbor methods can be $lousy$ when $p$ is large. Reason: the $curse$ $of$ $dimensionality$. Nearest neighbors tend to be far away in high dimensions.

    - We need to get a reasonable fraction of the $N$ values of $y_i$ to average to bring the variance down -- e.g. 10%.
    - A 10% neighborhood in high dimensions need no longer be local, so we lose the spirit of estimating $E(Y|X=x)$ by local averaging.
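
One standard way to see why a 10% neighborhood stops being local is sketched below. It assumes, purely for illustration, that the inputs are uniformly distributed on the unit cube $[0,1]^p$ and that the neighborhood is itself a cube; the helper `edge_length` is a hypothetical name. Under those assumptions, a sub-cube holding a fraction $r$ of the data needs side length $r^{1/p}$.

```python
def edge_length(fraction, p):
    """Side of a sub-cube of [0, 1]^p expected to hold `fraction` of uniform data."""
    return fraction ** (1.0 / p)

for p in (1, 2, 5, 10, 50):
    print(p, round(edge_length(0.10, p), 3))
```

For $p=1$ the side is 0.1, but for $p=10$ it is already about 0.79: the "neighborhood" spans most of the range of every coordinate.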

## Parametric and structured models
The $linear$ model is an important example of a parametric model:
$$ f_L(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p.$$

- A linear model is specified in terms of $p+1$ parameters $\beta_0,\beta_1,\dots,\beta_p.$

- Although it is $almost$ $never$ $correct$, a linear model often serves as a good and interpretable approximation to the unknown true function $f(X).$

A linear model $\hat{f}_L(X) = \hat{\beta}_0 + \hat{\beta}_1X$ gives a reasonable fit here. {linear scatter}

A quadratic model $\hat{f}_Q(X) = \hat{\beta}_0 + \hat{\beta}_1X + \hat{\beta}_2X^2$ fits slightly better. {quadratic scatter}
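
The comparison between a linear and a quadratic fit can be sketched with least squares on simulated data. The data-generating curve, noise level, and helper `fit_polynomial` below are hypothetical stand-ins, not the data behind the scatter plots; `np.polyfit` is used here as one convenient least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, size=200)
# Hypothetical mildly curved truth, for illustration only.
y = 1.0 + 2.0 * x - 0.1 * x**2 + rng.normal(0.0, 1.0, size=x.size)

def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit; returns coefficients and training MSE."""
    coefs = np.polyfit(x, y, deg=degree)      # highest power first
    residuals = y - np.polyval(coefs, x)
    return coefs, np.mean(residuals**2)

coefs_linear, mse_linear = fit_polynomial(x, y, 1)
coefs_quadratic, mse_quadratic = fit_polynomial(x, y, 2)
print(mse_linear, mse_quadratic)
```

Because the linear model is a special case of the quadratic one, the quadratic training error can never be larger; whether the improvement is worth the extra parameter is a separate question.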

Simulated example. Red points are simulated values for **income** from the model
$$income = f(education,seniority) + \epsilon$$
$f$ is the blue surface.

Linear regression model fit to the simulated data.

$$\hat{f}_L(education,seniority) = \hat{\beta}_0 + \hat{\beta}_1 \times education + \hat{\beta}_2 \times seniority$$

More flexible regression model $\hat{f}_S(education,seniority)$ fit to the simulated data. Here we use a technique called a $thin$-$plate$ $spline$ to fit a flexible surface. We control the roughness of the fit (chapter 7).

Even more flexible spline regression model $\hat{f}_S(education,seniority)$ fit to the simulated data. Here the fitted model makes no errors on the training data! This is known as $overfitting$.

## Some trade-offs
- Prediction accuracy versus interpretability.
    - Linear models are easy to interpret; thin-plate splines are not.
- Good fit versus over-fit or under-fit.
    - How do we know when the fit is just right?
- Parsimony versus black-box.
    - We often prefer a simpler model involving fewer variables over a black-box predictor involving them all.

models flexibility vs interpretability chart

## Assessing Model Accuracy

Suppose we fit a model $\hat{f}(x)$ to some training data $\text{Tr} = \{x_i,y_i\}_{1}^N,$ and we wish to see how well it performs.
- We could compute the average squared prediction error over $\text{Tr}:$
$$\text{MSE}_{\text{Tr}} = \text{Ave}_{i \in \text{Tr}}[y_i - \hat{f}(x_i)]^2$$
This may be biased toward more overfit models.
- Instead we should, if possible, compute it using fresh $test$ data $\text{Te} = \{x_i,y_i\}_{1}^M:$
$$\text{MSE}_{\text{Te}} = \text{Ave}_{i \in \text{Te}}[y_i - \hat{f}(x_i)]^2$$
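
The gap between $\text{MSE}_{\text{Tr}}$ and $\text{MSE}_{\text{Te}}$ can be sketched by fitting polynomials of increasing degree to one training set and scoring them on a separate test set. The truth $\sin(x)$, the noise level, the sample sizes, and the degrees are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(n):
    """Draw n observations from a hypothetical model Y = sin(X) + noise."""
    x = rng.uniform(-3.0, 3.0, size=n)
    y = np.sin(x) + rng.normal(0.0, 0.5, size=n)
    return x, y

def mse(coefs, x, y):
    """Average squared prediction error of a fitted polynomial on (x, y)."""
    return np.mean((y - np.polyval(coefs, x)) ** 2)

x_tr, y_tr = simulate(50)      # training data Tr
x_te, y_te = simulate(5000)    # fresh test data Te

results = {}
for degree in (1, 3, 9):
    coefs = np.polyfit(x_tr, y_tr, deg=degree)   # fit on Tr only
    results[degree] = (mse(coefs, x_tr, y_tr), mse(coefs, x_te, y_te))
    print(degree, results[degree])
```

Training MSE can only go down as the degree increases, since each model contains the previous one. Test MSE has no such guarantee, which is why it is the more honest measure.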

Charts; Black curve is truth. Red curve on right is $\text{MSE}_\text{Te},$ grey curve is $\text{MSE}_\text{Tr}.$ Orange, blue and green curves/squares correspond to fits of different flexibility.

Charts; Here the truth is smoother, so the smoother fit and linear model do really well.

Charts; Here the truth is wiggly and the noise is low, so the more flexible fits do the best.

## Bias-Variance Trade-off

Suppose we have fit a model $\hat{f}(x)$ to some training data $\text{Tr},$ and let $(x_0,y_0)$ be a test observation drawn from the population. If the true model is $Y = f(X) + \epsilon$ $(\text{with } f(x) = E(Y|X=x))$, then

$$ E(y_0 - \hat{f}(x_0))^2 = \text{Var}(\hat{f}(x_0)) + [\text{Bias}(\hat{f}(x_0))]^2 + \text{Var}(\epsilon). $$

The expectation averages over the variability of $y_0$ as well as the variability in $\text{Tr}$. Note that $\text{Bias}(\hat{f}(x_0)) = E[\hat{f}(x_0)] - f(x_0)$.

Typically as the $flexibility$ of $\hat{f}$ increases, its variance increases, and its bias decreases. So choosing the flexibility based on average test error amounts to a $bias-variance$ $trade-off$.
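
The decomposition can be checked numerically by repeatedly drawing training sets, refitting, and recording the prediction at a fixed $x_0$. The sketch below assumes a hypothetical truth $\sin(x)$, polynomial fits as the family of models, and polynomial degree as the measure of flexibility; none of these come from the examples in the charts.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 0.5   # noise standard deviation, so Var(eps) = 0.25
x0 = 1.0      # fixed test point

def true_f(x):
    return np.sin(x)   # hypothetical truth, for illustration only

def fit_and_predict(degree, n=50):
    """Draw a fresh training set, fit a polynomial, and predict at x0."""
    x = rng.uniform(-3.0, 3.0, size=n)
    y = true_f(x) + rng.normal(0.0, sigma, size=n)
    return np.polyval(np.polyfit(x, y, deg=degree), x0)

def decompose(degree, n_reps=2000):
    """Monte Carlo estimates of variance, squared bias, and test error at x0."""
    preds = np.array([fit_and_predict(degree) for _ in range(n_reps)])
    variance = preds.var()
    bias_sq = (preds.mean() - true_f(x0)) ** 2
    # One independent test response y0 per simulated training set.
    y0 = true_f(x0) + rng.normal(0.0, sigma, size=n_reps)
    test_error = np.mean((y0 - preds) ** 2)
    return variance, bias_sq, test_error

results = {degree: decompose(degree) for degree in (1, 3, 8)}
for degree, (variance, bias_sq, test_error) in results.items():
    print(degree, variance + bias_sq + sigma**2, test_error)
```

The two printed columns should roughly agree for each degree, up to simulation noise. In this setup the squared bias should dominate for the linear fit and the variance should grow with the degree.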



Chart; Bias-variance trade-off for the three examples

## Classification Problems

Here the response variable $Y$ is $qualitative$ $-$ e.g. email is one of $C = (spam,ham)$ (**ham** = good email), digit class is one of $C = \{0,1,\dots,9\}$. Our goals are to:
- Build a classifier $C(X)$ that assigns a class label from $C$ to a future unlabeled observation $X$.
- Assess the uncertainty in each classification.
- Understand the roles of the different predictors among $X = (X_1, X_2,\dots,X_p)$.

Is there an ideal $C(X)$? Suppose the $K$ elements in $C$ are numbered $1,2,\dots,K$. Let
$$ p_k(x) = \text{Pr}(Y=k|X=x), \quad k=1,2,\dots,K.$$

These are the $conditional$ $class$ $probabilities$ at $x$; e.g. see the little barplot at $x=5$. Then the $Bayes$ $optimal$ classifier at $x$ is
$$C(x) = j\text{ if } p_j(x)=\max\{p_1(x),p_2(x),\dots,p_K(x)\}$$
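
As a small sketch, the Bayes optimal rule is just an arg-max over the conditional class probabilities at $x$. The probabilities below are made-up numbers for a hypothetical three-class problem, and `bayes_classifier` is an illustrative helper name; in practice the true $p_k(x)$ are unknown and have to be estimated.

```python
import numpy as np

def bayes_classifier(class_probs):
    """Return the 1-based label j maximizing p_j(x), given [p_1(x), ..., p_K(x)]."""
    class_probs = np.asarray(class_probs, dtype=float)
    # np.argmax returns the first maximizer, so ties go to the lowest label.
    return int(np.argmax(class_probs)) + 1

# Hypothetical conditional class probabilities at some point x, with K = 3.
p_at_x = [0.2, 0.5, 0.3]
label = bayes_classifier(p_at_x)
# Probability that the Bayes classifier is wrong at this x.
bayes_error_at_x = 1.0 - max(p_at_x)
print(label, bayes_error_at_x)
```

Even the optimal rule errs with probability $1 - \max_k p_k(x)$ at each $x$, which is the classification analogue of irreducible error.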

Nearest-neighbor averaging can be used as before.
It also breaks down as the dimension grows. However, the impact on $\hat{C}(x)$ is less than on $\hat{p}_k(x)$, $k=1,\dots,K.$

## Classification: some details
- Typically we measure the performance of $\hat{C}(x)$ using the misclassification error rate:
$$\text{Err}_{\text{Te}} = \text{Ave}_{i \in \text{Te}}I[y_i \ne \hat{C}(x_i)]$$

- The Bayes classifier (using the true $p_k(x)$) has smallest error (in the population).

- Support-vector machines build structured models for $C(x)$.
- We will also build structured models for representing the $p_k(x)$, e.g. logistic regression and generalized additive models.

Example: K-nearest neighbors in two dimensions.
$K = 1$ (varies too much); $K = 10$ (just right); $K = 100$ (close to linear).

train, test error vs 1/K chart
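
A minimal K-nearest-neighbors classifier in two dimensions, with training and test misclassification rates for several values of $K$, might look like the sketch below. The two-Gaussian data, the sample sizes, and the helper names `simulate`, `knn_predict`, and `error_rate` are hypothetical and are not the data behind the charts.

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate(n):
    """Two classes in 2D; hypothetical Gaussian clouds for illustration only."""
    y = rng.integers(0, 2, size=n)
    centers = np.array([[0.0, 0.0], [1.5, 1.5]])
    x = centers[y] + rng.normal(0.0, 1.0, size=(n, 2))
    return x, y

def knn_predict(x_train, y_train, x_new, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    # Pairwise squared distances, shape (n_new, n_train).
    d2 = ((x_new[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[nearest].mean(axis=1)   # fraction of neighbors in class 1
    return (votes > 0.5).astype(int)        # ties go to class 0

def error_rate(y_true, y_pred):
    """Misclassification rate: average of I[y_i != C_hat(x_i)]."""
    return np.mean(y_true != y_pred)

x_tr, y_tr = simulate(200)
x_te, y_te = simulate(2000)

errors = {}
for k in (1, 10, 100):
    train_err = error_rate(y_tr, knn_predict(x_tr, y_tr, x_tr, k))
    test_err = error_rate(y_te, knn_predict(x_tr, y_tr, x_te, k))
    errors[k] = (train_err, test_err)
    print(k, train_err, test_err)
```

With $K=1$ each training point is its own nearest neighbor, so the training error is zero even though the test error is not; this is why training error alone is a poor guide for choosing $K$.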