# Introduction to Statistical Learning - Chapter 2

- [2. Linear Regression](#2.-Linear-Regression)
    * [2.1 Simple Linear Regression](#2.1-Simple-Linear-Regression)
        + [2.1.1 Estimating the Coefficients](#2.1.1-Estimating-the-Coefficients)
        + [2.1.2 Assessing the Accuracy of the Coefficient Estimates](#2.1.2-Assessing-the-Accuracy-of-the-Coefficient-Estimates)
        + [2.1.3 Assessing the Accuracy of the Model](#2.1.3-Assessing-the-Accuracy-of-the-Model)
    * [2.2 Multiple Linear Regression](#2.2-Multiple-Linear-Regression)
        + [2.2.1 Estimating the Regression Coefficients](#2.2.1-Estimating-the-Regression-Coefficients)

# 2. Linear Regression

## 2.1 Simple Linear Regression

- Very straightforward approach for predicting a quantitative response $Y$

$$ Y \approx \beta_{0} + \beta_{1}X$$
where $\beta_{0}$ and $\beta_{1}$ represent the `intercept` and `slope` respectively

### 2.1.1 Estimating the Coefficients

- For a linear model, the goal is to obtain coefficient estimates $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ such that the resulting line is as close as possible to the observed data points $(x_{i}, y_{i})$
    * Most common approach is the `least squares criterion`
- The least squares approach chooses $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ to minimize the residual sum of squares `RSS`

$$RSS = e^{2}_{1} + e^{2}_{2}+...+ e^{2}_{n}$$
where $e_{i} = y_{i} - \hat{y_{i}}$ represents the $i$th residual, which is the difference between the $i$th observed response value and the value predicted by the linear model
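
The minimization above has a standard closed-form solution for simple linear regression: $\hat{\beta_{1}} = \sum(x_{i}-\bar{x})(y_{i}-\bar{y}) / \sum(x_{i}-\bar{x})^{2}$ and $\hat{\beta_{0}} = \bar{y} - \hat{\beta_{1}}\bar{x}$. A minimal sketch of it follows; the function names and the toy data are made up for illustration and are not from the notes.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimize the RSS for y ≈ b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residual_sum_of_squares(x, y, intercept, slope):
    """RSS = sum of squared residuals e_i = y_i - y_hat_i."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Toy data, invented purely for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_simple_ols(x, y)
rss = residual_sum_of_squares(x, y, b0, b1)
print(f"intercept={b0:.3f}, slope={b1:.3f}, RSS={rss:.3f}")
```

Moving either coefficient away from the fitted value can only increase the RSS, which is a quick way to sanity-check the fit.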

### 2.1.2 Assessing the Accuracy of the Coefficient Estimates

If $f$ is approximated by a linear function, the true relationship would be:

$$ Y = \beta_{0} + \beta_{1}X + \epsilon $$
where $\epsilon$ is the mean-zero random error term, which captures the variation in **Y** that the linear model does not explain

However, the true relationship is generally not known for real data, so the least squares line is computed from the observations instead:

$$ \hat{y} = \hat{\beta_{0}} + \hat{\beta_{1}}x $$

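- This is analogous to estimating a population mean: the actual population mean $\mu$ and variance $\sigma^{2}$ are unknown
    * A reasonable approach is to estimate $\mu$ with the sample mean $\hat{\mu} = \bar{y}$ and to quantify the accuracy of that estimate with the standard error **(SE)**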

$$ \bar{y} = \frac{1}{n}\sum^{n}_{i=1}y_{i} $$
$$ Var(\hat{\mu}) = SE(\hat{\mu})^{2} = \frac{\sigma^{2}}{n} $$ 
$$ SE = \frac{\sigma}{\sqrt{n}}$$
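
As a small numeric illustration of these formulas, the sketch below computes the sample mean and its standard error. Since $\sigma$ is unknown in practice, it is replaced here by the sample standard deviation (an assumption of this sketch); the sample values are made up.

```python
import math
import statistics

# Toy sample, invented purely for illustration
sample = [2, 4, 5, 4, 5]
n = len(sample)

y_bar = statistics.mean(sample)   # sample mean, the estimate of mu
s = statistics.stdev(sample)      # sample standard deviation (n - 1 denominator)
se_mean = s / math.sqrt(n)        # estimated SE of the sample mean

print(f"mean={y_bar}, SE={se_mean:.4f}")
```

The standard error shrinks at rate $1/\sqrt{n}$, so quadrupling the sample size halves it.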

- Standard errors can be used to perform hypothesis testing on the coefficients
$$ H_{0} : \beta_{1} = 0 $$
$$ H_{1} : \beta_{1} \neq 0 $$
    * If $SE(\hat{\beta_{1}})$ is large, then $\hat{\beta_{1}}$ must be large in absolute value in order to reject the null hypothesis $H_{0}$
        
        + The `t-statistic` measures the number of standard deviations that $\hat{\beta_{1}}$ is away from 0
        
$$ t = \frac{\hat{\beta_{1}}-0}{SE(\hat{\beta_{1}})} $$
If there is no relationship between X and Y, the statistic $t$ follows a t-distribution with $n-2$ degrees of freedom
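
The t-statistic can be computed by hand for a small data set. The sketch below assumes the standard formula $SE(\hat{\beta_{1}})^{2} = \sigma^{2} / \sum(x_{i}-\bar{x})^{2}$, which is not stated in these notes, and estimates $\sigma$ with the residual standard error. The function name and toy data are illustrative only.

```python
import math


def slope_t_statistic(x, y):
    """Return (t, se_slope) for testing H0: beta_1 = 0 in simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(rss / (n - 2))     # estimate of sigma (the RSE)
    se_slope = sigma_hat / math.sqrt(s_xx)   # assumed standard formula for SE(beta_1_hat)
    return (slope - 0) / se_slope, se_slope


# Toy data, invented purely for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

t, se = slope_t_statistic(x, y)
print(f"SE(slope)={se:.4f}, t={t:.4f}")
```

The resulting $t$ would then be compared against a t-distribution with $n-2$ degrees of freedom to obtain a p-value.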

### 2.1.3 Assessing the Accuracy of the Model

- After rejecting the null hypothesis in favour of the alternative hypothesis, there is a need to quantify the extent to which the model fits the data. For linear regression, the residual standard error $RSE$ and the $R^{2}$ are typically assessed

**Residual Standard Error (RSE)**
- The RSE is an estimate of the standard deviation of $\epsilon$, i.e. roughly the average amount by which the response deviates from the true regression line
- RSE is also considered a measure of the lack of fit of the model to the data
$$ RSE = \sqrt{\frac{1}{n-2}RSS} = \sqrt{\frac{1}{n-2}\sum^{n}_{i=1}(y_{i}-\hat{y_{i}})^{2}} $$
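
A minimal sketch of the RSE formula, assuming the fitted values are already available. The toy data are made up, and the line $2.2 + 0.6x$ used for the fitted values is the least squares fit for that particular toy data.

```python
import math


def residual_standard_error(y, y_hat):
    """RSE = sqrt(RSS / (n - 2)) for simple linear regression."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return math.sqrt(rss / (n - 2))


# Toy data, invented purely for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_hat = [2.2 + 0.6 * xi for xi in x]  # least squares fit for this toy data

rse = residual_standard_error(y, y_hat)
print(f"RSE={rse:.4f}")
```

Because the RSE is measured in the units of Y, whether a given value is "small" depends on the scale of the response.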

**$R^{2}$ Statistic**
- $R^{2}$ provides an alternative measure of fit
    * It measures the proportion of variance in Y explained by the model, takes a value between 0 and 1, and is independent of the scale of Y

$$ R^{2} = \frac{TSS-RSS}{TSS} = 1 - \frac{RSS}{TSS} $$

where $TSS = \sum(y_{i} - \bar{y})^{2}$ is the total sum of squares and $RSS =\sum(y_{i}-\hat{y_{i}})^{2}$ is the residual sum of squares
- TSS measures the total variability in the response Y before the regression is performed, while RSS measures the amount of variability that is left unexplained after performing the regression
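
The $R^{2}$ formula translates directly into code. This is a minimal sketch with made-up toy data; the fitted values come from the line $2.2 + 0.6x$, which is the least squares fit for that toy data.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
    return 1 - rss / tss


# Toy data, invented purely for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_hat = [2.2 + 0.6 * xi for xi in x]  # least squares fit for this toy data

r2 = r_squared(y, y_hat)
print(f"R^2={r2:.3f}")
```

Predicting $\bar{y}$ for every observation gives $RSS = TSS$ and hence $R^{2} = 0$, while a perfect fit gives $R^{2} = 1$.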

## 2.2 Multiple Linear Regression

- An extension of the linear regression model that directly accommodates multiple predictors
- Each predictor will have a separate slope coefficient in a single model

$$ Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} +...+ \beta_{p}X_{p} + \epsilon $$

### 2.2.1 Estimating the Regression Coefficients

$$ RSS = \sum^{n}_{i=1}(y_{i}-\hat{y_{i}})^{2} = \sum^{n}_{i=1}(y_{i}-\hat{\beta_{0}}-\hat{\beta_{1}}x_{i1}-\hat{\beta_{2}}x_{i2}-...-\hat{\beta_{p}}x_{ip})^{2} $$

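- In a multiple linear regression, several predictors are included in one model, which makes it possible to assess the effect of each predictor on the response while accounting for the others.
    * As in simple linear regression, the coefficient estimates $\hat{\beta_{0}}, \hat{\beta_{1}}, ..., \hat{\beta_{p}}$ are chosen to minimize the sum of squared residuals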
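
With more than one predictor the minimization is usually done with linear algebra. The sketch below uses NumPy's `np.linalg.lstsq` as one possible way to do this (the notes do not prescribe a library); the helper name `fit_ols` and the toy data are illustrative assumptions.

```python
import numpy as np


def fit_ols(X, y):
    """Least squares fit with an intercept.

    X: (n, p) array of predictors without an intercept column.
    Returns (coef, rss), where coef = [b0, b1, ..., bp].
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # prepend the intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return coef, float(residuals @ residuals)


# Toy data, invented purely for illustration (roughly y = 1 + x1 + 2 * x2)
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [6.1, 4.9, 12.2, 10.8, 18.1, 16.9]

coef, rss = fit_ols(np.column_stack([x1, x2]), y)
print("coefficients:", coef)
print("RSS:", rss)
```

At the least squares solution the residuals are orthogonal to every column of the design matrix, which is a convenient check that the RSS has actually been minimized.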

**Determining the relationship between Response and Predictors**
- Check if all the regression coefficients are zero

$$ H_{0} : \beta_{1} = \beta_{2} = ... = \beta_{p} = 0 $$
$$ H_{1} : \text{at least one } \beta_{j} \text{ is non-zero} $$

This hypothesis test is performed by computing the `F-statistic`

$$ F= \frac{(TSS-RSS)/p}{RSS/(n-p-1)} $$

where $TSS = \sum(y_{i} - \bar{y})^{2}$ and $RSS =\sum(y_{i}-\hat{y_{i}})^{2}$.
- If there is no relationship between the response and the predictors, the F-statistic is expected to be close to **1**; otherwise, the F-statistic is expected to be **> 1**
- For each individual predictor, a t-statistic and a p-value are reported, which reflect the partial effect of adding that variable to the model
- When there are many predictors, some predictors that are not associated with the response will appear to be associated with it by chance alone
    * The F-statistic adjusts for the number of predictors: if $H_{0}$ is true, there is only a 5% chance that the F-statistic yields a p-value below 0.05, regardless of the number of predictors.
        + However, the F-statistic can only be used when p is relatively small compared to n (n>p); when p > n, the model cannot even be fit by least squares
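
The F-statistic formula can be evaluated directly once the model has been fit. The sketch below is one possible implementation using NumPy (an assumption; the notes name no library), with a hypothetical helper `f_statistic` and made-up toy data.

```python
import numpy as np


def f_statistic(X, y):
    """F-statistic for H0: beta_1 = ... = beta_p = 0.

    X: (n, p) array of predictors without an intercept column; requires n > p + 1.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])  # prepend the intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    rss = float(residuals @ residuals)
    tss = float(np.sum((y - y.mean()) ** 2))
    return ((tss - rss) / p) / (rss / (n - p - 1))


# Toy data, invented purely for illustration (roughly y = 1 + x1 + 2 * x2)
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [6.1, 4.9, 12.2, 10.8, 18.1, 16.9]

f_value = f_statistic(np.column_stack([x1, x2]), y)
print("F-statistic:", f_value)
```

With a single predictor ($p = 1$) the F-statistic equals the square of the t-statistic for the slope, so the two tests agree in that case. Turning the F-statistic into a p-value requires the F-distribution with $p$ and $n-p-1$ degrees of freedom, which is left out of this sketch.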

### 2.2.2 Deciding on Important Variables