# Chap 2: Start Learning

## Prediction
In a prediction (regression) problem, the inputs, denoted by `X`, are called `predictors`, `independent variables`, or `features`. The predicted variable, denoted by `Y`, is called the `response` or `dependent variable`.

The relationship between the inputs and the response is represented as

$$
Y = f(X) + \epsilon
$$

where $f$ is some fixed but unknown function that is to be determined, and $\epsilon$ is a **random error** term that is independent of `X` and has **zero mean**.

In practice, $f$ may depend on more than one input variable. With two inputs, for instance, $f$ is a two-dimensional surface that must be fit to the data. In general, **statistical learning** refers to the set of approaches for **estimating** $f$.
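
To make the model concrete, here is a minimal simulation sketch. The linear form of `f`, the noise level, and the helper name `simulate` are assumptions chosen purely for illustration; in a real problem $f$ is unknown.

```python
import random
import statistics

def f(x):
    # Hypothetical "true" function; in practice f is unknown.
    return 2.0 + 3.0 * x

def simulate(n, noise_sd=1.0, seed=0):
    """Draw n observations from Y = f(X) + eps with zero-mean Gaussian eps."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    eps = [rng.gauss(0.0, noise_sd) for _ in range(n)]  # drawn independently of xs
    ys = [f(x) + e for x, e in zip(xs, eps)]
    return xs, ys, eps

xs, ys, eps = simulate(10_000)
print(f"mean of eps: {statistics.fmean(eps):.3f}")  # should be close to 0
```

Only `xs` and `ys` would be observed in practice; `eps` is returned here just to check the zero-mean assumption.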

### Reducible and Irreducible errors
Since $f$ and $Y$ cannot be **calculated** exactly, the best we can do is **estimate** them. The estimates are denoted $\hat f$ and $\hat Y$:

$$
\hat Y = \hat f(X)
$$

The accuracy of $\hat Y$ as a prediction of $Y$ depends on two kinds of error: **reducible** and **irreducible**. The error that comes from $\hat f$ being an imperfect estimate of $f$ is **reducible**: it can be lowered with more data and better models. However, $Y$ is also a function of $\epsilon$, which cannot be predicted from $X$; this error is **irreducible**. Thus, the best our predictions can get is 

$$
\hat Y = f(X)
$$

The focus of statistical learning is to estimate $f$ as $\hat f$ with the least **reducible** error. However, the accuracy of $\hat Y$ will always be limited by the **irreducible** and **unknown** error $\epsilon$.

In **prediction problems**, $\hat f$ can be treated as a **black box**, since we are only interested in predicting $Y$.
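
For a fixed $X$ and a fixed estimate $\hat f$, the expected squared prediction error splits into the two parts: $E(Y - \hat Y)^2 = [f(X) - \hat f(X)]^2 + \mathrm{Var}(\epsilon)$. The sketch below checks this numerically. Both `f` and `f_hat` are hypothetical functions picked for illustration, and the Gaussian noise with standard deviation 1 is an assumption.

```python
import random
import statistics

def f(x):
    # Hypothetical true function (unknown in practice).
    return 2.0 + 3.0 * x

def f_hat(x):
    # Hypothetical imperfect estimate of f: the slope is slightly off.
    return 2.0 + 2.8 * x

NOISE_SD = 1.0
x0 = 5.0
rng = random.Random(42)

# Repeatedly observe Y at the fixed input x0 and record the squared prediction error.
sq_errors = []
for _ in range(200_000):
    y = f(x0) + rng.gauss(0.0, NOISE_SD)
    sq_errors.append((y - f_hat(x0)) ** 2)

mse = statistics.fmean(sq_errors)
reducible = (f(x0) - f_hat(x0)) ** 2  # shrinks as f_hat approaches f
irreducible = NOISE_SD ** 2           # Var(eps); no choice of f_hat removes it

print(f"simulated MSE:           {mse:.3f}")
print(f"reducible + irreducible: {reducible + irreducible:.3f}")
```

Even with `f_hat` identical to `f`, the simulated error would stay near `NOISE_SD ** 2`, which is the sense in which $\epsilon$ bounds the accuracy of $\hat Y$.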

## Inference
In inference, we are interested in understanding how each of the predictors $X_{1}, \dots, X_{p}$ affects the dependent variable $Y$. Here, $\hat f$ **cannot be treated as a black box**; we need to know its exact form. Some questions that inference seeks to answer:
 - which predictor variables are associated with the response?
 - what is the relationship between the response and each predictor?
 - is the relationship linear, or is it more complicated?
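
As a rough illustration of the first question, the sketch below computes the sample Pearson correlation between each predictor and the response. The data-generating process, in which $Y$ depends only on the first predictor, is an assumption made up for this example. Correlation captures only linear association, so this is a first look rather than a full inference procedure.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(1)
n = 2_000
x1 = [rng.uniform(0.0, 1.0) for _ in range(n)]
x2 = [rng.uniform(0.0, 1.0) for _ in range(n)]
# Hypothetical data-generating process: Y depends on x1 only.
y = [4.0 * a + rng.gauss(0.0, 0.5) for a in x1]

for name, col in (("X1", x1), ("X2", x2)):
    print(f"corr({name}, Y) = {pearson(col, y):+.2f}")
```

A correlation near zero for `X2` is consistent with it having no linear association with the response in this simulated data.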

## Parametric and Non-parametric methods for Estimating f
The observations for X and Y can be written as $\{(x_{1}, y_{1}),(x_{2}, y_{2}),\dots,(x_{n}, y_{n})\}$, where each $x_{i}$ holds the values of $p$ predictor variables and can be written as $x_{i} = (x_{i1},x_{i2},\dots,x_{ip})^{T}$. The goal is to find $\hat f$ such that $Y \approx \hat f (X)$.

### Parametric methods for estimating f
Parametric methods take a **model based** approach (**deterministic**).
We make an assumption about the functional form of *f* (whether it is linear, nonlinear, higher order, logistic, etc.). For instance, if we assume that *f* is linear, then 

$$
Y \approx f(X) = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \dots + \beta_{p}X_{p}
$$
and we only need to estimate the $p+1$ coefficients $\beta_{0}, \beta_{1}, \dots, \beta_{p}$. These are estimated by training (fitting) the model on the observed data, using methods such as *ordinary least squares*.
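
The fitting step can be sketched for the simplest case of a single predictor ($p = 1$), where ordinary least squares has a closed-form solution. The helper name `fit_simple_ols` and the simulated data (true coefficients 2 and 3, Gaussian noise) are assumptions for illustration; with several predictors one would normally rely on a linear algebra library instead.

```python
import random

def fit_simple_ols(xs, ys):
    """Least-squares estimates (beta0, beta1) for the model y = beta0 + beta1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(500)]
# Hypothetical truth: beta0 = 2, beta1 = 3, plus zero-mean noise.
ys = [2.0 + 3.0 * x + rng.gauss(0.0, 1.0) for x in xs]

beta0_hat, beta1_hat = fit_simple_ols(xs, ys)
print(f"beta0_hat = {beta0_hat:.2f}, beta1_hat = {beta1_hat:.2f}")
```

The estimates should land near the assumed true values, and the whole problem of estimating $f$ has been reduced to estimating two numbers.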

**Notes**
 - a parametric model is only an approximation of the true functional form of *f*.
 - simpler (lower order, less flexible) models may lead to poorer estimates of *f*.
 - more flexible (higher order, complex) models may lead to **overfitting**.
 - Since the model is trained on a limited sample of values, it might be very different from the true nature of *f*. Hence the fitted model should only be trusted within the range of data it was trained on.
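
The points about inflexible models and about the training range can be illustrated with a small sketch. It assumes a hypothetical quadratic `f`, fits a straight line to noise-free data on $[0, 1]$, and compares the error inside that range with the error at a point far outside it.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a straight-line model."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

def f(x):
    # Hypothetical true function: quadratic, so a straight line is too inflexible.
    return x ** 2

# Noise-free training data restricted to the range [0, 1].
train_x = [i / 100 for i in range(101)]
train_y = [f(x) for x in train_x]
intercept, slope = fit_line(train_x, train_y)

def f_hat(x):
    return intercept + slope * x

worst_inside = max(abs(f_hat(x) - f(x)) for x in train_x)
error_outside = abs(f_hat(3.0) - f(3.0))  # x = 3 lies far outside the training range

print(f"largest error inside [0, 1]: {worst_inside:.3f}")
print(f"error at x = 3:              {error_outside:.3f}")
```

The line is a tolerable approximation inside the training range but badly wrong at `x = 3`, even though the data contain no noise at all.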

### Non-parametric methods for estimating f
Non-parametric methods **avoid assuming the functional form of f**. However, these methods require **a very large** number of observations, since they do not reduce the problem of estimating *f* to a small number of parameters.
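
One simple non-parametric method, used here only as an illustrative example, is k-nearest-neighbours regression: predict at a point by averaging the responses of the closest training inputs. In the sketch below, the sine-shaped `f`, the noise level, `k = 10`, and the helper names are all assumptions. The method never uses the form of `f`, but its error depends strongly on how many observations are available.

```python
import math
import random

def knn_predict(x0, xs, ys, k=10):
    """Predict at x0 by averaging the responses of the k nearest training inputs."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def f(x):
    # Hypothetical true function; the method never uses its form.
    return math.sin(x)

def mean_abs_error(n, seed=0):
    """Average |f_hat - f| over a fixed grid, using n simulated training points."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 2 * math.pi) for _ in range(n)]
    ys = [f(x) + rng.gauss(0.0, 0.3) for x in xs]
    grid = [0.5 + 0.25 * i for i in range(21)]  # evaluation points in [0.5, 5.5]
    return sum(abs(knn_predict(g, xs, ys) - f(g)) for g in grid) / len(grid)

for n in (20, 200, 2000):
    print(f"n = {n:5d}: mean |f_hat - f| = {mean_abs_error(n):.3f}")
```

With few observations the ten nearest neighbours are spread over a wide interval and the average blurs the shape of `f`; with many observations they sit close to the query point and the estimate tightens.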