# Statistical Learning
---

## Contents
- [Linear Regression](#linreg)
- [Simple Linear Regression](#simlinreg)
- [Estimating the coefficients](#estcoef)

<a name="linreg"></a>
## Linear Regression

Consider the **Advertising** dataset. It displays *sales* of a particular product
as a function of advertising budgets for *TV*, *radio* and *newspaper* media.
Here are a few important questions that we might seek to address:
- Is there a relationship between advertising budget and sales?
- How strong is the relationship between advertising budget and sales?
- Which media contribute to sales?
- How accurately can we estimate the effect of each medium on sales?
- How accurately can we predict future sales?
- Is the relationship linear?
- Is there synergy/interaction among the advertising media?

<a name="simlinreg"></a>
## Simple Linear Regression

Simple linear regression is a very straightforward approach for predicting a quantitative response $Y$ on the basis of a single predictor variable $X$. It assumes that there is approximately a linear relationship between $X$ and $Y$. Mathematically, we can write this linear relationship as
$$Y \approx \beta_{0} + \beta_{1}X$$
where $\beta_{0}$ is the intercept and $\beta_{1}$ is the slope term in the linear model.
They are known as model *coefficients* or *parameters*.
For example, regressing *sales* onto *TV* can be written as
$$sales \approx \beta_{0} + \beta_{1} \times TV$$
Once we have used our training data to produce estimates $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ for the model coefficients,
we can predict future sales on the basis of a particular value of TV advertising by computing
$$\hat{y} = \hat{\beta_{0}} + \hat{\beta_{1}}x$$
where $\hat{y}$ is the prediction of $Y$ on the basis of $X = x$.
We use a hat symbol, $\hat{}$, to denote the estimated value of an unknown parameter or coefficient, or the predicted value of the response.
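
As a minimal sketch of the prediction step, the snippet below evaluates $\hat{y} = \hat{\beta_{0}} + \hat{\beta_{1}}x$ for one TV budget. The function name `predict_sales` and the coefficient values are hypothetical, chosen only for illustration; they are not estimates from the Advertising dataset.

```python
def predict_sales(tv_budget, beta0_hat, beta1_hat):
    """Return the predicted response y_hat = beta0_hat + beta1_hat * x."""
    return beta0_hat + beta1_hat * tv_budget


# Hypothetical coefficient estimates (illustration only, not fitted values).
beta0_hat = 7.0
beta1_hat = 0.05

# Predicted sales for a TV budget of 100 units.
prediction = predict_sales(100.0, beta0_hat, beta1_hat)
print(prediction)
```

With these made-up values, the intercept contributes 7.0 and the slope term contributes 0.05 for each unit of TV budget.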

<a name="estcoef"></a>
### Estimating the coefficients

In practice, $\beta_{0}$ and $\beta_{1}$ are unknown.
Our goal is to obtain coefficient estimates $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ such that the linear model
$y_{i} \approx \hat{\beta_{0}} + \hat{\beta_{1}}x_{i}$ for $i = 1, \cdots, n$ fits the available data well.
In other words, we want to find an intercept $\hat{\beta_{0}}$ and slope $\hat{\beta_{1}}$ such that the resulting line is as close as possible to the $n$ data points.

Let $\hat{y_{i}} = \hat{\beta_{0}} + \hat{\beta_{1}}x_{i}$ be the prediction for $Y$ based on the $i^{th}$ value of $X$.
Then $e_{i} = y_{i} - \hat{y_{i}}$ represents the $i^{th}$ residual. Thus the *residual sum of squares (RSS)* is
$$RSS = {e_{1}}^{2} + {e_{2}}^{2} + \cdots + {e_{n}}^{2}$$
This is equivalent to
$$RSS = \left(y_{1} - \hat{\beta_{0}} - \hat{\beta_{1}}x_{1}\right)^{2} +
        \left(y_{2} - \hat{\beta_{0}} - \hat{\beta_{1}}x_{2}\right)^{2} +
        \cdots +
        \left(y_{n} - \hat{\beta_{0}} - \hat{\beta_{1}}x_{n}\right)^{2}$$
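
The RSS can be computed directly from its definition. The sketch below is a plain-Python illustration on a tiny made-up dataset; the function name `residual_sum_of_squares` and the data are assumptions for the example. It compares two candidate lines, and the one with the smaller RSS lies closer to the points in the least squares sense.

```python
def residual_sum_of_squares(x, y, beta0_hat, beta1_hat):
    """Sum the squared residuals e_i = y_i - (beta0_hat + beta1_hat * x_i)."""
    residuals = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]
    return sum(e ** 2 for e in residuals)


# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 7.0]

rss_a = residual_sum_of_squares(x, y, 0.0, 2.0)  # candidate line y = 2x
rss_b = residual_sum_of_squares(x, y, 1.0, 1.0)  # candidate line y = 1 + x
print(rss_a, rss_b)
```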
The least squares approach chooses $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ to minimize the RSS. The minimizers are given by
$$\hat{\beta_{1}} = \frac{\sum_{i = 1}^{n}\left(x_{i} - \bar{x}\right)\left(y_{i} - \bar{y}\right)}
                        {\sum_{i = 1}^{n}\left(x_{i} - \bar{x}\right)^{2}}$$

$$\hat{\beta_{0}} = \bar{y} - \hat{\beta_{1}}\bar{x}$$
where $\bar{y} = \frac{1}{n}\sum_{i = 1}^{n}y_{i}$ and $\bar{x} = \frac{1}{n}\sum_{i = 1}^{n}x_{i}$ are the sample means.
This defines the *least squares coefficient estimates* for simple linear regression.
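
The closed-form estimates translate almost line for line into code. The following is a minimal sketch in plain Python; the function name `least_squares_fit` and the toy data are assumptions for illustration. It assumes the $x_{i}$ are not all equal, since otherwise the denominator of $\hat{\beta_{1}}$ is zero.

```python
def least_squares_fit(x, y):
    """Return (beta0_hat, beta1_hat) for simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n  # sample mean of the predictor
    y_bar = sum(y) / n  # sample mean of the response

    # Numerator and denominator of the slope estimate.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)

    beta1_hat = sxy / sxx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


# Toy data lying exactly on the line y = 1 + 2x (illustration only).
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

beta0_hat, beta1_hat = least_squares_fit(x, y)
print(beta0_hat, beta1_hat)
```

Because the toy points lie exactly on a line, the fit recovers that line's intercept and slope and every residual is zero; with real data the residuals would generally be nonzero.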