## 3.1 Simple Linear Regression

Simple linear regression lives up to its name: it is a very straightforward approach for predicting a quantitative response Y on the basis of a single predictor variable X.

It assumes that there is an approximately linear relationship between X and Y. Mathematically, we can write this linear relationship as:

$ Y \approx \beta_0 + \beta_1X $

For example, we can regress sales onto TV by fitting the model:

$ \text{sales} \approx \beta_0 + \beta_1 \times \text{TV} $

Once we have used our training data to produce estimates $\hat{\beta_0}$ and $\hat{\beta_1}$ for the model coefficients, we can predict future sales on the basis of a particular value of TV advertising by computing:

$\hat{y} = \hat{\beta_0} + \hat{\beta_1}x$

where $\hat{y}$ indicates a prediction of Y on the basis of X = x. Here we use a hat symbol to denote the estimated value of an unknown parameter or coefficient, or the predicted value of the response.

## 3.1.1 Estimating the Coefficients

In practice, $\beta_0$ and $\beta_1$ are unknown. So before we can use the equations above to make predictions, we must use the data to estimate the coefficients. Let:

$ (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)$  

represent n observation pairs, each of which consists of a measurement of X and a measurement of Y. In the `Advertising` example, this data set consists of the TV advertising budget and product sales in n = 200 different markets. 

Our goal is to obtain coefficient estimates $\hat{\beta_0}$ and $\hat{\beta_1}$ such that the linear model fits the available data well, that is, so that $y_i \approx \hat{\beta_0} + \hat{\beta_1}x_i$ for $i = 1, ..., n$.

In other words, we want to find an intercept $\hat{\beta_0}$ and a slope $\hat{\beta_1}$ such that the resulting line is as close as possible to the n = 200 data points.

There are a number of ways of measuring _closeness_. However, by far the most common approach involves minimising the least squares criterion and we will take that approach in this chapter.

![title](input_figures/Fig 3-1.png)

For the Advertising data, the least squares fit for the regression of sales onto TV is shown in the figure above.

If we let:  
$ \hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i$  
Then:  
$ e_i = y_i - \hat{y_i} $  
represents the $i^{th}$ residual.

We define the residual sum of squares (RSS) as

$RSS = e^2_1 + e^2_2 + ... + e^2_n $

Or equivalently as:

$ RSS = (y_1 - \hat{\beta_0} - \hat{\beta_1}x_1)^2 + (y_2 - \hat{\beta_0} - \hat{\beta_1}x_2)^2 + ... + (y_n - \hat{\beta_0} - \hat{\beta_1}x_n)^2 $
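As a minimal sketch of this definition, the following Python computes the RSS of a candidate line. The function name `rss` and the five data points are invented for illustration; they are not the Advertising data.

```python
def rss(beta0, beta1, xs, ys):
    """Residual sum of squares of the line y = beta0 + beta1 * x."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


# Toy data, invented for illustration only (not the Advertising data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

rss_flat = rss(6.1, 0.0, xs, ys)    # horizontal line through the mean of ys
rss_slope = rss(0.0, 2.0, xs, ys)   # a line that follows the upward trend
print(rss_flat, rss_slope)
```

The sloped line has a much smaller RSS than the flat one, so it is the closer of the two to these points in the least squares sense.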

The least squares approach chooses $\hat{\beta_0}$ and $\hat{\beta_1}$ to minimise the RSS. Using some calculus, one can show that the minimisers are:

$ \hat{\beta_1} = \dfrac{\Sigma^n_{i=1} (x_i - \bar{x})(y_i - \bar{y})}{\Sigma^n_{i=1} (x_i - \bar{x})^2} $  

$ \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x} $  
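The two formulas above translate almost directly into code. The sketch below is one possible implementation in plain Python; the name `least_squares_fit` and the toy data are assumptions made for illustration, not part of the Advertising example.

```python
def least_squares_fit(xs, ys):
    """Return (beta0_hat, beta1_hat) minimising the RSS for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


# Toy data, invented for illustration only (not the Advertising data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

beta0_hat, beta1_hat = least_squares_fit(xs, ys)
print(beta0_hat, beta1_hat)  # approximately 0.22 and 1.96
```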

The figure above displays the simple linear regression fit to the Advertising data, where:  
$ \hat{\beta_0} = 7.03 $  
$ \hat{\beta_1} = 0.0475 $  

In other words, an additional \$1000 spent on TV advertising is associated with selling approximately 47.5 additional units of this product.
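To make the fitted line concrete, here is a small sketch that plugs the reported estimates into the prediction equation. It assumes, as the interpretation above implies, that TV is measured in thousands of dollars and sales in thousands of units; the function name `predict_sales` and the budget of 100 are hypothetical.

```python
# Coefficient estimates reported in the text for the Advertising data.
BETA0_HAT = 7.03
BETA1_HAT = 0.0475


def predict_sales(tv_budget):
    """Predicted sales (thousands of units) for a TV budget (thousands of dollars)."""
    return BETA0_HAT + BETA1_HAT * tv_budget


# Hypothetical budget of 100 (thousand dollars), chosen only for illustration.
print(predict_sales(100.0))                          # about 11.78
print(predict_sales(101.0) - predict_sales(100.0))   # about 0.0475, i.e. 47.5 units
```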

## 3.1.2 Assessing the Accuracy of the Coefficient Estimates

Assume that the true relationship between X and Y takes the form $Y = f(X) + \epsilon$ for some unknown function f, where $\epsilon$ is a mean-zero random error term.
  
If f is to be approximated by a linear function then we can write this relationship as:

$ Y = \beta_0 + \beta_1X + \epsilon$

Because we only ever observe a sample rather than the whole population, the true coefficients $\beta_0$ and $\beta_1$ remain unknown, and we can only estimate them from the data.

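Consider the analogous problem of estimating the population mean $\mu$ of a random variable Y with the sample mean $\hat{\mu}$. To work out how accurate $\hat{\mu}$ is as an estimate of $\mu$, we use the _standard error_ of $\hat{\mu}$, written as $SE(\hat{\mu})$. We have the well-known formula: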

$Var(\hat{\mu}) = SE(\hat{\mu})^2 = \dfrac{\sigma^2}{n}$

where $\sigma$ is the standard deviation of each of the realisations $y_i$ of Y.

Roughly speaking, the standard error tells us the average amount by which the estimate $\hat{\mu}$ differs from the actual value of $\mu$.
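The formula for $SE(\hat{\mu})$ can be sketched in a few lines of Python. Since $\sigma$ is normally unknown, this sketch substitutes the sample standard deviation for it, which is an assumption added here; the function name and the sample values are also invented for illustration.

```python
import math
import statistics


def standard_error_of_mean(sample):
    """Estimate SE(mu_hat) = sigma / sqrt(n), using the sample standard deviation for sigma."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Toy sample, invented for illustration only.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
se = standard_error_of_mean(sample)
print(statistics.mean(sample), se)
```

Because n appears under a square root in the denominator, quadrupling the sample size roughly halves the standard error.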

In a similar vein, we can wonder how close $\hat{\beta_0}$ and $\hat{\beta_1}$ are to the true values $\beta_0$ and $\beta_1$. To compute the standard errors associated with $\hat{\beta_0}$ and $\hat{\beta_1}$ we use the following formulae:

$ SE(\hat{\beta_0})^2 = \sigma^2\Big[\dfrac{1}{n} + \dfrac{\bar{x}^2}{\Sigma^n_{i=1}(x_i - \bar{x})^2}\Big] $  

$ SE(\hat{\beta_1})^2 = \dfrac{\sigma^2}{\Sigma^n_{i=1}(x_i - \bar{x})^2} $  

where $\sigma^2 = Var(\epsilon)$.

In general $\sigma^2$ is not known but it can be estimated from the data. The estimate of $\sigma$ is known as the residual standard error, and is given by the formula:

$ RSE = \sqrt{RSS/(n-2)} $
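Putting the pieces together, the sketch below estimates $\sigma$ by the RSE and plugs it into the two standard error formulas. It is an illustrative implementation under the assumptions of this section; the function name and the toy data are hypothetical and unrelated to the Advertising data.

```python
import math


def coefficient_standard_errors(xs, ys):
    """Return (RSE, SE(beta0_hat), SE(beta1_hat)) for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    # Least squares estimates of the slope and intercept.
    beta1_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    rss = sum((y - beta0_hat - beta1_hat * x) ** 2 for x, y in zip(xs, ys))
    rse = math.sqrt(rss / (n - 2))  # estimate of sigma
    se_beta0 = rse * math.sqrt(1 / n + x_bar ** 2 / s_xx)
    se_beta1 = rse / math.sqrt(s_xx)
    return rse, se_beta0, se_beta1


# Toy data, invented for illustration only (not the Advertising data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

rse, se_beta0, se_beta1 = coefficient_standard_errors(xs, ys)
print(rse, se_beta0, se_beta1)
```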

For linear regression, the 95% confidence interval for $\beta_1$ approximately takes the form:

$ \hat{\beta_1} \pm 2 \cdot SE(\hat{\beta_1}) $

That is, there is approximately a 95% chance that the interval:

$\Big[ \hat{\beta_1} - 2 \cdot SE(\hat{\beta_1}), \hat{\beta_1} + 2 \cdot SE(\hat{\beta_1})\Big] $

will contain the true value of $\beta_1$.

An analogous interval, $ \hat{\beta_0} \pm 2 \cdot SE(\hat{\beta_0}) $, holds for $\beta_0$.

In the case of the Advertising data, the 95% confidence interval for $\beta_0$ is [6.130, 7.935] and the 95% confidence interval for $\beta_1$ is [0.042, 0.053]. Therefore we can conclude that in the absence of any advertising, sales will, on average, fall somewhere between 6,130 and 7,935 units. We can also conclude that for each \$1000 increase in television advertising, there will be an average increase in sales of between 42 and 53 units.
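As a rough consistency check, the interval for $\beta_1$ can be reproduced from the "estimate plus or minus two standard errors" rule. The text does not report $SE(\hat{\beta_1})$, so the sketch below backs it out from the width of the rounded interval; that value is an implied quantity, not a reported one, and the function name is hypothetical.

```python
def approx_95_ci(estimate, standard_error):
    """Approximate 95% confidence interval: estimate +/- 2 standard errors."""
    return estimate - 2 * standard_error, estimate + 2 * standard_error


beta1_hat = 0.0475  # slope estimate reported for the Advertising data

# The interval [0.042, 0.053] is 4 standard errors wide, so the implied
# standard error is a quarter of its width (assumption: derived, not reported).
se_beta1 = (0.053 - 0.042) / 4

low, high = approx_95_ci(beta1_hat, se_beta1)
print(low, high)                 # about 0.042 and 0.053
print(low * 1000, high * 1000)   # about 42 and 53 units per extra 1000 dollars
```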

### Using Standard Errors to perform hypothesis tests

The most common hypothesis test involves testing the null hypothesis of:

$H_0$: There is no relationship between X and Y  

versus the alternative hypothesis

$H_a$: There is some relationship between X and Y.

Mathematically, this corresponds to testing:

$H_0: \beta_1 = 0 $  
$H_a: \beta_1 \neq 0 $  

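Whether an estimate $\hat{\beta_1}$ is far enough from zero depends on its accuracy, that is, on $SE(\hat{\beta_1})$. If $SE(\hat{\beta_1})$ is small, even a relatively small $\hat{\beta_1}$ may provide strong evidence that $\beta_1 \neq 0$.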

In practice we compute a t-statistic, given by:

$ t = \dfrac{\hat{\beta_1} - 0}{SE(\hat{\beta_1})}$

This measures the number of standard deviations that $\hat{\beta_1}$ is away from zero.

The t-distribution is similar to the normal distribution, and it is a simple matter to compute the probability of observing any value equal to $|t|$ or larger in absolute value, assuming $\beta_1 = 0$. We call this probability the p-value.

A small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association. Therefore we can infer that there is an association between the predictor and the response.

When n = 30, p-values of 5% and 1% correspond to t-statistics of around 2 and 2.75, respectively.
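These figures can be checked numerically. The sketch below computes a two-sided p-value by integrating the density of Student's t distribution with Simpson's rule, using only the Python standard library. It assumes n - 2 degrees of freedom, the usual choice for simple linear regression; the function names are hypothetical, and in practice a statistics library would normally be used instead.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t, df, steps=2000):
    """P(|T| >= |t|) via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        # Simpson weights: 4 for odd interior points, 2 for even ones.
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = total * h / 3  # P(0 <= T <= |t|)
    return 1 - 2 * central_mass


df = 30 - 2  # assumed degrees of freedom (n - 2) for n = 30
print(two_sided_p_value(2.0, df))   # roughly 0.05
print(two_sided_p_value(2.75, df))  # roughly 0.01
```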

## 3.1.3 Assessing the Accuracy of the Model

Once we have rejected the null hypothesis in favour of the alternative hypothesis, it is natural to want to quantify the extent to which the model fits the data.

The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the $R^2$ statistic.

