# 2.1 What Is Statistical Learning?

A response Y can be described in terms of its predictors X = (X1, X2, . . . , Xp) as Y = f(X) + e, where f is a fixed but unknown function and e is a random error term.
In essence, statistical learning refers to a set of approaches for estimating f.

Why do we want to estimate f?

### a) Prediction

Prediction is useful when a set of inputs X is readily available but the output Y cannot be easily obtained. In that case we predict Y with $\hat{Y} = \hat{f}(X)$, where $\hat{f}$ is our estimate of f.

The accuracy of $\hat{Y}$ as a prediction for Y depends on two quantities, the reducible error and the irreducible error. In general, $\hat{f}$ will not be a perfect estimate of f, and this inaccuracy introduces some error. This error is reducible because we can potentially improve the accuracy of $\hat{f}$ by using the most appropriate statistical learning technique to estimate f.

However, variability associated with e also affects the accuracy of our predictions. This is known as the irreducible error: no matter how well we estimate f, we cannot reduce the error introduced by e. The irreducible error therefore always places an upper bound on the accuracy of our prediction for Y. This bound is almost always unknown in practice.
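
The split into reducible and irreducible error can be checked numerically. For a fixed X and a fixed estimate, the expected squared prediction error is $E(Y - \hat{Y})^2 = [f(X) - \hat{f}(X)]^2 + \mathrm{Var}(e)$. The sketch below assumes a hypothetical true function `f`, a deliberately imperfect estimate `f_hat`, and a noise standard deviation of 1; none of these come from the text, they only serve to illustrate the two error components.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical true relationship, assumed for illustration only
    return 2.0 + 3.0 * x

def f_hat(x):
    # A deliberately imperfect estimate of f (the slope is off by 0.5)
    return 2.0 + 2.5 * x

n = 100_000
sigma = 1.0                                   # standard deviation of the error term e
x = rng.uniform(0.0, 1.0, size=n)
y = f(x) + rng.normal(0.0, sigma, size=n)

mse_true_f = np.mean((y - f(x)) ** 2)         # only the irreducible error remains
mse_f_hat = np.mean((y - f_hat(x)) ** 2)      # reducible + irreducible error
reducible = np.mean((f(x) - f_hat(x)) ** 2)   # average squared gap between f and f_hat

print(f"MSE using the true f : {mse_true_f:.3f}")
print(f"MSE using f_hat      : {mse_f_hat:.3f}")
print(f"Reducible part       : {reducible:.3f}")
```

Even with the true f, the error does not drop below roughly Var(e); a better estimate of f can only remove the reducible part.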

### b) Inference

Here our goal is not necessarily to make predictions for Y. Instead, we want to understand the relationship between X and Y, or more specifically, how Y changes as a function of X1, . . . , Xp. We might ask questions like:

- Which predictors are associated with the response?
- What is the relationship between the response and each predictor?
- Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

Depending on whether our ultimate goal is prediction, inference, or a combination of the two, different methods for estimating f may be appropriate.
For example, linear models allow for relatively simple and interpretable inference, but may not yield as accurate predictions as some other approaches. In contrast, some of the highly non-linear approaches discussed in the later chapters of this book can potentially provide quite accurate predictions for Y, but this comes at the expense of a less interpretable model for which inference is more challenging.

### How Do We Estimate f?

Broadly speaking, most statistical learning methods for this task can be characterized as either parametric or non-parametric.

### Parametric Methods

Parametric methods involve a two-step model-based approach.
1) First, we make an assumption about the functional form, or shape, of f. For example, one very simple assumption is that f is linear in X:
f(X) = β0 + β1X1 + β2X2 + . . . + βpXp.
2) After a model has been selected, we need a procedure that uses the training data to fit or train the model. In the case of the linear model, we need to estimate the parameters β0, β1, . . . , βp.
The most common approach to fitting this model is (ordinary) least squares, although other fitting procedures exist.
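
The two steps above can be sketched with NumPy. This is a minimal illustration on simulated data, not a prescribed implementation: the coefficients in `beta_true`, the sample size, and the noise level are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated training data; the coefficients below are assumptions for illustration
n, p = 200, 2
beta_true = np.array([1.0, 2.0, -0.5])        # beta_0, beta_1, beta_2
X = rng.normal(size=(n, p))
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

# Step 1: assume f is linear in X, so add a column of ones for the intercept beta_0
X_design = np.column_stack([np.ones(n), X])

# Step 2: estimate the parameters by ordinary least squares
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("estimated coefficients:", np.round(beta_hat, 2))
```

Because the simulated data really are linear, the estimates land close to `beta_true`; with a non-linear true f the same fit would still run but would describe the data poorly.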

### Non-parametric Methods

Non-parametric methods do not make explicit assumptions about the functional form of f. Instead, they seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly.

Any parametric approach brings with it the possibility that the functional form used to estimate f is very different from the true f, in which case the resulting model will not fit the data well. In contrast, non-parametric approaches completely avoid this danger, since essentially no assumption about the form of f is made. But non-parametric approaches do suffer from a major disadvantage: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for f.
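
As one concrete example of a non-parametric estimate, the sketch below uses K-nearest-neighbours averaging. This technique is chosen here only because it is short to write down; it is not a method named in this section. The helper `knn_regress`, the sine-shaped true function, and the choice of `k` are all assumptions for illustration.

```python
import numpy as np

def knn_regress(x_train, y_train, x_query, k=5):
    """Predict at each query point by averaging the k nearest training responses."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    # Pairwise distances between query points and training points (one predictor)
    dist = np.abs(x_query[:, None] - x_train[None, :])
    # Indices of the k closest training points for every query point
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(2)
x_train = rng.uniform(0.0, 2.0 * np.pi, size=500)
y_train = np.sin(x_train) + rng.normal(scale=0.2, size=500)  # assumed true f: sin

x_query = np.array([np.pi / 2, np.pi, 3 * np.pi / 2])
pred = knn_regress(x_train, y_train, x_query, k=15)
print(np.round(pred, 2))
```

No functional form is assumed anywhere in `knn_regress`, which is why it can follow a sine curve without being told about it. The price is the one described above: with few training points, the local averages become noisy or biased.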

### The Trade-Off Between Prediction Accuracy and Model Interpretability

![image.png](attachment:image.png)

For inference, we are mostly interested in interpretable models, which are in general more restrictive and hence often less accurate. For prediction, where interpretability matters less, we might prefer more flexible methods. However, flexible methods have a high potential for overfitting and may therefore turn out to be less accurate.

### Supervised Versus Unsupervised Learning

- Supervised: for each observation of the features, we also observe an associated response.
- Unsupervised: there is no response variable; we are interested in finding structure in the data, e.g. grouping observations by clustering.

Many problems fall naturally into the supervised or unsupervised learning paradigms. However, sometimes the question of whether an analysis should be considered supervised or unsupervised is less clear-cut. For instance, suppose that we have a set of n observations. For m of the observations, where m < n, we have both predictor measurements and a response measurement. For the remaining n − m observations, we have predictor measurements but no response measurement. Such a scenario can arise if the predictors can be measured relatively cheaply but the corresponding responses are much more expensive to collect. We refer to this setting as a semi-supervised learning problem. Methods exist that use both the m observations with a response and the n − m observations without one, but they are not covered by this book.

### Regression Versus Classification Problems

We tend to refer to problems with a quantitative response as regression problems, while those involving a qualitative response are often referred to as classification problems. We tend to select statistical learning methods on the basis of whether the response is quantitative or qualitative; for example, we might use linear regression for a quantitative response and logistic regression for a qualitative one. However, whether the predictors are qualitative or quantitative is generally considered less important. Most of the statistical learning methods discussed in this book can be applied regardless of the predictor variable type, provided that any qualitative predictors are properly coded before the analysis is performed.
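
One common way to code a qualitative predictor is with 0/1 dummy variables. The sketch below is an assumption about what "properly coded" can look like in practice; the helper `dummy_code` and the `region` data are hypothetical.

```python
import numpy as np

def dummy_code(values):
    """Encode a qualitative predictor as 0/1 columns, dropping the first level as baseline."""
    levels = sorted(set(values))
    columns = levels[1:]                       # the first level becomes the baseline
    coded = np.array([[1.0 if v == lvl else 0.0 for lvl in columns] for v in values])
    return coded, columns

# Hypothetical qualitative predictor with three levels
region = ["east", "west", "south", "east", "south"]
coded, columns = dummy_code(region)
print(columns)
print(coded)
```

A predictor with three levels becomes two numeric columns; the baseline level ("east" here) is represented by zeros in both.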

## Assessing Model Accuracy

### Measuring the Quality of Fit

In regression, the quality of fit is most commonly measured by the mean squared error (MSE):

![image.png](attachment:image.png)

![image.png](attachment:image.png)
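
The MSE is the average squared difference between the observed responses and the predictions, $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{f}(x_i))^2$. The sketch below computes it on training data and on separate test data for polynomial fits of increasing flexibility. The data-generating process in `simulate`, the sample sizes, and the polynomial degrees are assumptions for illustration.

```python
import numpy as np

def mse(y, y_pred):
    """Mean squared error: average of (y_i - f_hat(x_i))^2."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y - y_pred) ** 2))

rng = np.random.default_rng(3)

def simulate(n):
    # Assumed data-generating process, for illustration only
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = simulate(30)
x_test, y_test = simulate(1000)

results = {}
for degree in (1, 3, 8):
    coefs = np.polyfit(x_train, y_train, deg=degree)   # flexibility grows with degree
    train_mse = mse(y_train, np.polyval(coefs, x_train))
    test_mse = mse(y_test, np.polyval(coefs, x_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The training MSE can only go down as the degree increases, because each polynomial contains the lower-degree ones as special cases. The test MSE carries no such guarantee, which is why it is the quantity to watch when choosing a method.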

### The Bias-Variance Trade-Off

As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test MSE increases or decreases. As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases. Consequently, the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens, the test MSE increases.

The trade-off refers to the fact that it is easy to obtain a method with extremely low bias but high variance (for instance, by drawing a curve that passes through every single training observation) or a method with very low variance but high bias (by fitting a horizontal line to the data). The challenge lies in finding a method for which both the variance and the squared bias are low.
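
The trade-off can be made concrete with a small simulation. The expected test MSE at a point x0 decomposes as $E(y_0 - \hat{f}(x_0))^2 = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(e)$. The sketch below estimates the first two terms by refitting polynomials of several degrees to many simulated training sets. The true function, the fixed grid of inputs, the noise level, and the point `x0` are all assumptions for illustration; degree 0 corresponds to the horizontal line mentioned above.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    # Assumed true function, for illustration only
    return np.sin(3.0 * x)

def bias_variance_at(x0, degree, n_train=30, n_sims=500, sigma=0.3):
    """Estimate squared bias and variance of a polynomial fit's prediction at x0."""
    x = np.linspace(-1.0, 1.0, n_train)       # fixed inputs; only the noise is redrawn
    preds = np.empty(n_sims)
    for s in range(n_sims):
        y = f(x) + rng.normal(scale=sigma, size=n_train)
        coefs = np.polyfit(x, y, deg=degree)
        preds[s] = np.polyval(coefs, x0)
    sq_bias = (preds.mean() - f(x0)) ** 2     # gap between average prediction and truth
    variance = preds.var()                    # spread of predictions across training sets
    return sq_bias, variance

x0 = 0.5
summary = {}
for degree in (0, 1, 3, 6):
    sq_bias, variance = bias_variance_at(x0, degree)
    summary[degree] = (sq_bias, variance)
    print(f"degree {degree}: squared bias {sq_bias:.4f}, variance {variance:.4f}")
```

In this setup the horizontal line has a large squared bias and a small variance, while the degree-6 fit reverses the picture, in line with the general rule stated above.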

In classification, accuracy is most commonly quantified by the error rate, the proportion of observations that are misclassified:

![image.png](attachment:image.png)
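
The error rate is the fraction of observations for which the predicted class differs from the observed class. A minimal sketch, with a hypothetical helper `error_rate` and made-up labels:

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted class differs from the true class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

y_true = ["yes", "no", "no", "yes", "no"]     # hypothetical observed classes
y_pred = ["yes", "no", "yes", "yes", "yes"]   # hypothetical predicted classes
rate = error_rate(y_true, y_pred)
print(f"error rate: {rate:.2f}")
```

As with the MSE, the same function can be applied to training data or to held-out test data; the test error rate is the one that reflects how well the classifier generalizes.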

Overfitting is a concern in classification as well:

![image.png](attachment:image.png)

In both the regression and classification settings, choosing the correct level of flexibility is critical to the success of any statistical learning method. The bias-variance tradeoff, and the resulting U-shape in the test error, can make this a difficult task.