# Introduction to Statistical Learning - Chapter 2

- [2. Linear Regression](#2.-Linear-Regression)
    * [2.1 Simple Linear Regression](#2.1-Simple-Linear-Regression)
        + [2.1.1 Estimating the Coefficients](#2.1.1-Estimating-the-Coefficients)
        + [2.1.2 Assessing the Accuracy of the Coefficient Estimates](#2.1.2-Assessing-the-Accuracy-of-the-Coefficient-Estimates)
        + [2.1.3 Assessing the Accuracy of the Model](#2.1.3-Assessing-the-Accuracy-of-the-Model)
    * [2.2. Multiple Linear Regression](#2.2.-Multiple-Linear-Regression)
        + [2.2.1 Estimating the Regression Coefficients](#2.2.1-Estimating-the-Regression-Coefficients)
        + [2.2.2 Important Considerations for MLR](#2.2.2-Important-Considerations-for-MLR)
    * [2.3. Other Considerations in the Regression Model](#2.3.-Other-Considerations-in-the-Regression-Model)
        + [2.3.1. Qualitative Predictors](#2.3.1.-Qualitative-Predictors)
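        + [2.3.2. Extension of the Linear Model](#2.3.2.-Extension-of-the-Linear-Model)
        + [2.3.3. Potential Problems of Linear Regression](#2.3.3.-Potential-Problems-of-Linear-Regression)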

# 2. Linear Regression

## 2.1 Simple Linear Regression

- Very straightforward approach for predicting a quantitative response $Y$

$$ Y \approx \beta_{0} + \beta_{1}X$$
where $\beta_{0}$ and $\beta_{1}$ represent the `intercept` and `slope` respectively

### 2.1.1 Estimating the Coefficients

- For a linear model, the goal is to obtain coefficient estimates $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ such that the resulting line is as close as possible to the observed data points $(x_{i}, y_{i})$
    * Most common approach is the `least squares criterion`
- The least squares approach chooses $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ to minimize the residual sum of squares `RSS`

$$RSS = e^{2}_{1} + e^{2}_{2}+...+ e^{2}_{n}$$
where $e_{i} = y_{i} - \hat{y_{i}}$ represents the $i$th residual, which is the difference between the $i$th observed response value and the value predicted by the linear model
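
For simple linear regression the least squares minimizers have a closed form: $\hat{\beta_{1}} = \sum(x_{i}-\bar{x})(y_{i}-\bar{y}) / \sum(x_{i}-\bar{x})^{2}$ and $\hat{\beta_{0}} = \bar{y} - \hat{\beta_{1}}\bar{x}$. The following is a minimal Python sketch of that calculation. The function names `fit_simple_ols` and `rss` and the toy data are made up for illustration and do not come from these notes.

```python
# Closed-form least squares estimates for simple linear regression.
# The data below are invented purely for illustration.

def fit_simple_ols(x, y):
    """Return (b0, b1), the intercept and slope that minimize the RSS."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def rss(x, y, b0, b1):
    """Residual sum of squares for the line y = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0_hat, b1_hat = fit_simple_ols(x, y)
print(f"intercept = {b0_hat:.3f}, slope = {b1_hat:.3f}")
print(f"RSS at the least squares fit = {rss(x, y, b0_hat, b1_hat):.4f}")
```

Moving either coefficient away from the fitted values should only increase the RSS, which is a quick way to sanity-check the fit.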

### 2.1.2 Assessing the Accuracy of the Coefficient Estimates

If $f$ is approximated by a linear function, the true relationship would be:

$$ Y = \beta_{0} + \beta_{1}X + \epsilon $$
where $\epsilon$ is the mean-zero random error term, which captures the variation in **Y** that the linear model does not explain

However, the true relationship is generally not known for real data, so the least squares line is computed from the observations as an estimate of it:

$$ \hat{y} = \hat{\beta_{0}} + \hat{\beta_{1}}x $$

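- This is analogous to estimating a population mean: the actual population mean $\mu$ and variance $\sigma^{2}$ are unknown
    * A reasonable estimate of $\mu$ is the sample mean $\hat{\mu} = \bar{y}$, and its standard error **(SE)** indicates how far that estimate tends to be from $\mu$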

$$ \bar{y} = \frac{1}{n}\sum^{n}_{i=1}y_{i} $$
$$ Var(\hat{\mu}) = SE(\hat{\mu})^{2} = \frac{\sigma^{2}}{n} $$ 
$$ SE = \frac{\sigma}{\sqrt{n}}$$

- Standard errors can be used to perform hypothesis testing on the coefficients
$$ H_{0} : \beta_{1} = 0 $$
$$ H_{1} : \beta_{1} \neq 0 $$
    * If $SE(\hat{\beta_{1}})$ is large, then $\hat{\beta_{1}}$ must be large in absolute value in order to reject the null hypothesis $H_{0}$
        
        + The `t-statistic` measures the number of standard deviations that $\hat{\beta_{1}}$ is away from 0
        
$$ t = \frac{\hat{\beta_{1}}-0}{SE(\hat{\beta_{1}})} $$
If there is no relationship between X and Y, the t-statistic follows a t-distribution with $n-2$ degrees of freedom
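
To make the t-statistic concrete, here is a small Python sketch for simple linear regression. It assumes the standard formula $SE(\hat{\beta_{1}})^{2} = \sigma^{2} / \sum(x_{i}-\bar{x})^{2}$, with $\sigma$ estimated by the residual standard error. That formula is not written out in these notes, and the function name `slope_inference` and the data are illustrative only.

```python
import math

def slope_inference(x, y):
    """Return (b1, se_b1, t) for the slope of a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(rss / (n - 2))   # estimate of sigma (the RSE)
    se_b1 = sigma_hat / math.sqrt(sxx)     # standard error of the slope
    t = (b1 - 0) / se_b1                   # test of H0: beta_1 = 0
    return b1, se_b1, t


# Invented data with an obvious linear trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b1, se_b1, t_stat = slope_inference(x, y)
print(f"slope = {b1:.3f}, SE = {se_b1:.4f}, t = {t_stat:.1f}")
```

The resulting `t_stat` would then be compared with a t-distribution with $n-2$ degrees of freedom to obtain a p-value.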

### 2.1.3 Assessing the Accuracy of the Model

- After rejecting the null hypothesis in favour of the alternative hypothesis, there is a need to quantify the extent to which the model fits the data. For linear regression, the residual standard error $RSE$ and the $R^{2}$ are typically assessed

**Residual Standard Error (RSE)**
- The RSE is an estimate of the standard deviation of $\epsilon$, i.e. roughly the average amount by which the response deviates from the true regression line
- RSE is also considered a measure of the lack of fit of the model to the data
$$ RSE = \sqrt{\frac{1}{n-2}RSS} = \sqrt{\frac{1}{n-2}\sum^{n}_{i=1}(y_{i}-\hat{y_{i}})^{2}} $$

**$R^{2}$ Statistic**
- $R^{2}$ provides an alternative measure of fit
    * It measures the proportion of variance explained with value between 0 and 1 and is independent of the scale of Y

$$ R^{2} = \frac{TSS-RSS}{TSS} = 1 - \frac{RSS}{TSS} $$

where $TSS = \sum(y_{i} - \bar{y})^{2}$ is the total sum of squares and $RSS =\sum(y_{i}-\hat{y_{i}})^{2}$ is the residual sum of squares
- TSS measures the total variance in the response Y, before the regression is performed, while RSS measures the amount of variability that is left unexplained after performing the regression
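
Both fit measures follow directly from RSS and TSS. The sketch below computes them for a simple linear regression on invented data. The helper name `rse_and_r2` is hypothetical.

```python
import math

def rse_and_r2(x, y):
    """Return (RSE, R^2) for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rse = math.sqrt(rss / (n - 2))   # in the units of y
    r2 = 1 - rss / tss               # unitless
    return rse, r2


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

rse, r2 = rse_and_r2(x, y)
print(f"RSE = {rse:.4f}, R^2 = {r2:.4f}")
```

Rescaling `y` (for example, multiplying every value by 100) changes the RSE by the same factor but leaves $R^{2}$ unchanged, which is why $R^{2}$ is easier to interpret across problems.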

## 2.2. Multiple Linear Regression

- An extension of the linear regression model that directly accommodates multiple predictors
- Each predictor will have a separate slope coefficient in a single model

$$ Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} +...+ \beta_{p}X_{p} + \epsilon $$

### 2.2.1 Estimating the Regression Coefficients

$$ RSS = \sum^{n}_{i=1}(y_{i}-\hat{y_{i}})^{2} = \sum^{n}_{i=1}(y_{i}-\hat{\beta_{0}}-\hat{\beta_{1}}x_{i1}-\hat{\beta_{2}}x_{i2}-...-\hat{\beta_{p}}x_{ip})^{2} $$

- In multiple linear regression, several predictors are included in one model, so each coefficient $\beta_{j}$ describes the association between $X_{j}$ and the response while holding the other predictors fixed
    * As in simple linear regression, the coefficient estimates are chosen to minimize the residual sum of squares
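
With several predictors the minimization is usually done with matrix algebra. As a rough sketch, assuming NumPy is available, `numpy.linalg.lstsq` can be applied to a design matrix whose first column is all ones (for the intercept). The simulated data and the coefficients in `beta_true` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3

# Simulated predictors and response; beta_true[0] is the intercept.
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so the intercept is estimated too.
X_design = np.column_stack([np.ones(n), X])

# Least squares solution: minimizes the RSS over all coefficient vectors.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

residuals = y - X_design @ beta_hat
rss = float(residuals @ residuals)

print("estimated coefficients:", np.round(beta_hat, 3))
print("RSS:", round(rss, 4))
```

At the minimum, the residuals are orthogonal to every column of the design matrix, which is the matrix form of setting the derivatives of the RSS to zero.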

### 2.2.2 Important Considerations for MLR

**Determining the relationship between Response and Predictors**
- Check if all the regression coefficients are zero

$$ H_{0} : \beta_{1} = \beta_{2} = ... = \beta_{p} = 0 $$
$$ H_{1} : \text{at least one } \beta_{j} \text{ is non-zero} $$

This hypothesis test is performed by computing the `F-statistic`

$$ F= \frac{(TSS-RSS)/p}{RSS/(n-p-1)} $$

where $TSS = \sum(y_{i} - \bar{y})^{2}$ and $RSS =\sum(y_{i}-\hat{y_{i}})^{2}$.
- If there is no relationship between the response and the predictors, the F-statistic is expected to be close to **1**; otherwise, the F-statistic is expected to be **> 1**
- For each individual predictor, a t-statistic and a p-value are reported, which give the partial effect of adding that variable to a model that already contains the others
- When there are many predictors, some that are not associated with the response will show small p-values purely by chance (about 5% of them at the 0.05 level)
    * The F-statistic adjusts for the number of predictors: if $H_{0}$ is true, there is only a 5% chance that the F-statistic yields a p-value below 0.05, regardless of the number of predictors
        + However, the F-statistic can only be used when $p$ is small relative to $n$; if $p > n$, the model cannot even be fit by least squares
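
The F-statistic formula above can be evaluated directly from a fitted model. The sketch below, assuming NumPy, compares a response that is pure noise with one that depends on the first predictor. The function name `f_statistic` and the simulated data are illustrative assumptions.

```python
import numpy as np

def f_statistic(X, y):
    """F-statistic for H0: all p slope coefficients are zero."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return float(((tss - rss) / p) / (rss / (n - p - 1)))


rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))

y_null = rng.normal(size=n)                      # unrelated to X
y_signal = 3.0 * X[:, 0] + rng.normal(size=n)    # depends on X[:, 0]

f_null = f_statistic(X, y_null)
f_signal = f_statistic(X, y_signal)
print(f"F (no relationship)   = {f_null:.2f}")
print(f"F (real relationship) = {f_signal:.2f}")
```

Turning an F-statistic into a p-value requires the F-distribution with $p$ and $n-p-1$ degrees of freedom, which is left out here to keep the example dependency-free beyond NumPy.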
        
**Deciding on Important Variables**
- Once we have concluded that at least one predictor is related to the response (i.e. rejected $H_{0}$), we need to identify which predictors are associated with the response
    * Perform variable selection by trying different models containing different subsets of predictors
        + Model selection can be performed using various statistics: ***Mallows' $C_{p}$, Akaike information criterion (AIC), Bayesian information criterion (BIC) and adjusted $R^{2}$***
- Methods for variable selection include:
    * Forward selection
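        + (1) Begin with a null model (intercept only, no predictors)
        + (2) Fit $p$ simple linear regressions and add to the null model the variable that results in the lowest RSS
        + (3) Add to that model the variable that results in the lowest RSS for the new two-variable model, and repeat until some stopping rule is satisfied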
    * Backward selection
        + (1) Start with all variables in the model
        + (2) Remove the variable with the largest p-value (least statistically significant)
        + (3) Refit the model with the remaining $(p-1)$ variables and repeat until a stopping rule is reached, e.g. all remaining variables have a p-value below some threshold
    * Mixed selection
        + Combination of forward and backward selection
        + (1) Start with no variables in the model
        + (2) Add the variable that provides the best fit as with forward selection
        + (3) If at any point the p-value for one of the variables in the model rises above a certain threshold, remove that variable from the model
        + (4) Repeat these forward and backward steps until all variables in the model have a sufficiently low p-value and all variables outside the model would have a large p-value if added
- Backward selection cannot be used if $p > n$, while forward selection can always be used
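
The forward selection steps can be sketched in a few lines of Python with NumPy. The notes do not fix a stopping rule, so this sketch simply stops after a caller-supplied number of variables. The `max_vars` parameter and the function names `rss_of_fit` and `forward_selection` are assumptions made for illustration.

```python
import numpy as np

def rss_of_fit(X, y, columns):
    """RSS of the least squares fit of y on an intercept plus X[:, columns]."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in columns])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ beta_hat) ** 2))


def forward_selection(X, y, max_vars):
    """Greedily add the variable that gives the lowest RSS at each step."""
    selected = []                          # start from the null model
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        best = min(remaining, key=lambda j: rss_of_fit(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Simulated data: only columns 2 and 4 actually drive the response.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 6))
y = 4.0 * X[:, 2] + 2.0 * X[:, 4] + rng.normal(size=n)

selected = forward_selection(X, y, max_vars=2)
print("selected columns, in order:", selected)
```

In practice the stopping rule would more likely be based on a criterion such as adjusted $R^{2}$, AIC, or BIC rather than a fixed count.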

**Model Fit**
Common numerical measures of model fit are **RSE** and **$R^{2}$**
- $R^{2}$
    * Is the square of the correlation between the response and the fitted values, $Cor(Y, \hat{Y})^{2}$
    * $R^{2}$ computed on the training data never decreases when more variables are added to the model
        + Even variables only weakly associated with the response increase $R^{2}$ slightly, so a small increase is not evidence that a variable should be included
        + Variables that greatly improve the $R^{2}$ should be included
- $RSE$
    * Residual standard error (RSE) can be used to compare models
        + A lower RSE after adding a variable suggests that the variable helps predict the response
    * Models with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in $p$

$$ RSE = \sqrt{\frac{1}{n-p-1}RSS} $$
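
A quick simulation illustrates how the two measures react to an irrelevant variable. In this sketch (NumPy assumed; `fit_summary` and the simulated data are hypothetical), a pure-noise predictor is added to a model that already contains the one real predictor.

```python
import numpy as np

def fit_summary(X, y):
    """Return (R^2, RSE) for the least squares fit of y on X plus an intercept."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    rse = np.sqrt(rss / (n - p - 1))
    return float(r2), float(rse)


rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
junk = rng.normal(size=n)            # unrelated to the response
y = 2.0 * x1 + rng.normal(size=n)

r2_small, rse_small = fit_summary(x1.reshape(-1, 1), y)
r2_big, rse_big = fit_summary(np.column_stack([x1, junk]), y)

print(f"one predictor : R^2 = {r2_small:.4f}, RSE = {rse_small:.4f}")
print(f"plus junk     : R^2 = {r2_big:.4f}, RSE = {rse_big:.4f}")
```

The training $R^{2}$ cannot go down when `junk` is added, whereas the RSE may go up or down depending on whether the small drop in RSS outweighs the loss of one degree of freedom.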

**Predictions**
- Uncertainty in coefficient estimates
    * Inaccuracy in the coefficient estimates is related to the reducible error
        + Can compute the confidence interval in order to determine how close $\hat{Y}$ will be to $f(X)$
- Model bias
    * Additional source of potentially reducible error
        + A linear model for $f(X)$ is almost always an approximation of reality; least squares estimates the best linear approximation to the true surface
        + Can try different types of model
- Irreducible error $\epsilon$
    * Prediction intervals are always wider than confidence intervals because they incorporate both the error in the estimate of $f(X)$ (the reducible error) and the uncertainty as to how much an individual point will differ from the population regression plane (the irreducible error $\epsilon$)
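
For simple linear regression the difference between the two intervals can be seen in their standard errors. This sketch assumes the usual textbook formulas, which are not given in these notes: the standard error of the estimated mean response at $x_{0}$ is $RSE\sqrt{1/n + (x_{0}-\bar{x})^{2}/\sum(x_{i}-\bar{x})^{2}}$, and the prediction standard error adds a $1$ under the square root. The critical value `t_crit` is supplied by the caller; 2.048 is the tabulated 97.5% quantile of a t-distribution with 28 degrees of freedom. The function name and data are illustrative.

```python
import math
import random

def interval_half_widths(x, y, x0, t_crit):
    """Half-widths of the confidence and prediction intervals at x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    rse = math.sqrt(rss / (n - 2))
    spread = 1 / n + (x0 - x_bar) ** 2 / sxx
    se_mean = rse * math.sqrt(spread)       # uncertainty in the fitted line
    se_pred = rse * math.sqrt(1 + spread)   # plus the irreducible error
    return t_crit * se_mean, t_crit * se_pred


# Simulated data: 30 points, so n - 2 = 28 degrees of freedom.
gen = random.Random(0)
x = [i / 3 for i in range(30)]
y = [1.0 + 2.0 * xi + gen.gauss(0, 1) for xi in x]

conf_half, pred_half = interval_half_widths(x, y, x0=5.0, t_crit=2.048)
print(f"confidence interval half-width = {conf_half:.3f}")
print(f"prediction interval half-width = {pred_half:.3f}")
```

As $n$ grows, the confidence interval shrinks towards zero width, while the prediction interval can never become narrower than roughly $\pm t \cdot RSE$.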

## 2.3. Other Considerations in the Regression Model

### 2.3.1. Qualitative Predictors

**Predictors with only two levels**
- To incorporate the qualitative predictor with 2 levels, we can create an indicator or dummy variable that takes on two possible numerical values
    * Binary variables: **1** and **0**, e.g.
    * $y_{i} = \beta_{0} + \beta_{1}x_{i1} + \epsilon_{i} $
        + Male (**0**): $\beta_{0} + \epsilon_{i}$
        + Female (**1**): $\beta_{0} + \beta_{1} + \epsilon_{i}$
        
**Qualitative predictors with more than two levels**
- A single dummy variable cannot represent all possible values
    * Can create additional dummy variables, e.g.
    * $y_{i} = \beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \epsilon_{i} $
        + Asian: $\beta_{0} + \beta_{1} + \epsilon_{i}$
        + Caucasian: $\beta_{0} + \beta_{2} + \epsilon_{i}$
        + African American: $\beta_{0} + \epsilon_{i}$
- There will always be one fewer dummy variable than the number of levels
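
The dummy-variable coding can be sketched as follows, assuming NumPy. The level names `"a"`, `"b"`, `"c"`, the response values, and the helper `dummy_encode` are all invented for illustration. The level chosen as `baseline` gets no dummy variable of its own.

```python
import numpy as np

def dummy_encode(labels, baseline):
    """One 0/1 column per non-baseline level (levels sorted alphabetically)."""
    levels = [lvl for lvl in sorted(set(labels)) if lvl != baseline]
    columns = np.array(
        [[1.0 if lab == lvl else 0.0 for lvl in levels] for lab in labels]
    )
    return levels, columns


labels = ["a", "a", "b", "b", "c", "c"]
y = np.array([10.0, 12.0, 20.0, 22.0, 5.0, 7.0])

levels, D = dummy_encode(labels, baseline="c")
design = np.column_stack([np.ones(len(y)), D])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

print("dummy columns for levels:", levels)
print("coefficients (intercept first):", np.round(beta_hat, 3))
```

With only dummy variables in the model, the intercept equals the mean response of the baseline level and each coefficient equals the difference between that level's mean and the baseline mean.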

### 2.3.2. Extension of the Linear Model

- Two of the most important assumptions of the linear model are that the relationship between the predictors and the response is `additive` and `linear`
    * **Additive** means the effect of changes in a predictor $X_{j}$ on the response $Y$ is independent of the values of the other predictors
    * **Linear** states that the change in the response $Y$ due to a one-unit change in $X_{j}$ is constant, regardless of the value of $X_{j}$
    
**Removing the Additive Assumption**
- Predictors might not act independently and can instead have a synergistic effect **(interaction effect)** on the response

$$ Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{3}X_{1}X_{2} + \epsilon $$

where a one-unit increase in $X_{1}$ is associated with an average increase in $Y$ of $\beta_{1} + \beta_{3}X_{2}$, so the effect of $X_{1}$ depends on the value of $X_{2}$
- The hierarchical principle states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant
- In the absence of an interaction term, the fitted lines of $Y$ against $X_{1}$ for different values of $X_{2}$ are parallel, meaning the effect of one predictor on the response does not depend on the value of the other
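
An interaction model is still linear in its coefficients, so it can be fit by ordinary least squares after adding the product $X_{1}X_{2}$ as an extra column. The following sketch assumes NumPy; the simulated coefficients and the helper `slope_in_x1` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x1 = rng.uniform(0, 10, size=n)
x2 = rng.uniform(0, 10, size=n)

# Simulated response with a genuine interaction (coefficient 0.3).
y = 1.0 + 2.0 * x1 + 0.5 * x2 + 0.3 * x1 * x2 + rng.normal(scale=0.5, size=n)

# Main effects are kept alongside the interaction (hierarchical principle).
design = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)


def slope_in_x1(x2_value):
    """Estimated change in Y per unit of X1, at a given value of X2."""
    return beta_hat[1] + beta_hat[3] * x2_value


print("estimated coefficients:", np.round(beta_hat, 3))
print("slope in X1 when X2 = 0 :", round(float(slope_in_x1(0.0)), 3))
print("slope in X1 when X2 = 10:", round(float(slope_in_x1(10.0)), 3))
```

If the interaction coefficient were zero, `slope_in_x1` would return the same value for every `x2_value`, which corresponds to the parallel-lines case.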

**Non-linear Relationships**
- Polynomial regression can be used to directly extend the linear model to accommodate non-linear relationships
- Assuming the model has a quadratic shape, we can extend the linear model via the following equation:

$$ Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{1}^{2} + \epsilon $$
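
Because the quadratic model is linear in $\beta_{0}, \beta_{1}, \beta_{2}$, it can be fit with the same least squares machinery by treating $X_{1}^{2}$ as an additional predictor. A minimal sketch with NumPy and simulated data (all values invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-3, 3, 100)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.2, size=x.size)


def fit_rss(design, y):
    """Least squares coefficients and RSS for a given design matrix."""
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta_hat) ** 2))
    return beta_hat, rss


ones = np.ones_like(x)
beta_lin, rss_lin = fit_rss(np.column_stack([ones, x]), y)
beta_quad, rss_quad = fit_rss(np.column_stack([ones, x, x**2]), y)

print("linear fit    RSS:", round(rss_lin, 3))
print("quadratic fit RSS:", round(rss_quad, 3))
print("quadratic coefficients:", np.round(beta_quad, 3))
```

Here the straight-line fit leaves a much larger RSS because it cannot capture the curvature that was built into the simulated data.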

### 2.3.3. Potential Problems of Linear Regression

**Non-linearity of the Data**
- Linear regression model assumes a straight-line relationship between predictors and the response
    * If the true relationship is far from linear, the prediction accuracy of the model can be significantly reduced
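        + Plot the residuals $e_{i} = y_{i} - \hat{y_{i}}$ against the predicted values $\hat{y_{i}}$ to identify non-linearity; a clear pattern in the residuals suggests a problem with the linear model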

**Correlation of Error Terms**
- An important assumption of the linear regression model is that the error terms, $\epsilon_{1}, \epsilon_{2}, ..., \epsilon_{n}$, are uncorrelated
    * If the error terms are correlated, the estimated standard errors will tend to underestimate the true standard errors
        + Confidence intervals and prediction intervals will be narrower than they should be
        + We may erroneously conclude that a parameter is statistically significant
        
**Non-constant Variance of Error Terms**
- An important assumption of the linear regression model is that the error terms have a constant variance
    * This is often not the case; non-constant variance of the error terms is called `heteroscedasticity`
        + Possible solution is to transform the response $Y$ using a concave function such as $\log Y$ or $\sqrt{Y}$
        
**Outliers**
- A point for which $y_{i}$ is far from the value predicted by the model
    * Outliers can potentially be influential points that cause poor fitting of the model

**High Leverage Points**
- Observations with high leverage have an unusual value of $x_{i}$ and tend to have a sizable impact on the estimated regression line
- In multiple linear regression with many predictors, it is possible that an observation is well within the range of each individual predictor's values but is unusual in terms of the **full set of predictors**
- To quantify an observation's leverage, we can compute the leverage statistic, which for simple linear regression is:

$$ h_{i} = \frac{1}{n} + \frac{(x_{i}-\bar{x})^{2}}{\sum^{n}_{i'=1}(x_{i'}-\bar{x})^{2}} $$

where a large value indicates an observation with high leverage
- The leverage statistic $h_{i}$ is always between $\frac{1}{n}$ and **1**, and the average leverage for all the observations is always equal to $\frac{p+1}{n}$
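
The properties of the leverage statistic can be checked numerically. This sketch assumes the standard generalization to several predictors, in which $h_{i}$ is the $i$th diagonal entry of the hat matrix $X(X^{T}X)^{-1}X^{T}$; that generalization is not stated in these notes. The function name `leverage` and the data are illustrative, and NumPy is assumed.

```python
import numpy as np

def leverage(X):
    """Leverage h_i of each row: diagonal of the hat matrix (with intercept)."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    # With a reduced QR factorization design = Q R, the hat matrix is Q Q^T,
    # so its diagonal is the row-wise sum of squares of Q.
    Q, _ = np.linalg.qr(design)
    return np.sum(Q**2, axis=1)


# One predictor; the last observation has an unusual x value.
x = np.array([1.0, 2.0, 3.0, 4.0, 20.0])
n, p = x.size, 1

h = leverage(x.reshape(-1, 1))

# Simple linear regression formula for comparison.
h_simple = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

print("leverage values:", np.round(h, 3))
print("average leverage:", round(float(h.mean()), 3), "vs (p+1)/n =", (p + 1) / n)
```

A common rule of thumb is to look more closely at observations whose leverage greatly exceeds the average $(p+1)/n$.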

**Collinearity**
- Refers to the situation in which two or more predictor variables are closely related to one another
    * Presence of collinearity can pose problems in the regression context as it can be difficult to separate out the individual effects of collinear variables on the response
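
One visible symptom of collinearity is that the standard errors of the affected coefficients grow. The simulation below is a rough sketch of that effect, assuming NumPy and the standard formula $\hat{\sigma}^{2}(X^{T}X)^{-1}$ for the covariance of the coefficient estimates, which is not given in these notes. The function name `coef_standard_errors` and all data are hypothetical.

```python
import numpy as np

def coef_standard_errors(X, y):
    """Standard errors of the least squares coefficients (intercept first)."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta_hat) ** 2)
    sigma2_hat = rss / (n - p - 1)
    cov = sigma2_hat * np.linalg.inv(design.T @ design)
    return np.sqrt(np.diag(cov))


rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)                # unrelated to x1
x2_coll = x1 + 0.05 * rng.normal(size=n)     # almost a copy of x1
noise = rng.normal(size=n)

y_indep = x1 + x2_indep + noise
y_coll = x1 + x2_coll + noise

se_indep = coef_standard_errors(np.column_stack([x1, x2_indep]), y_indep)
se_coll = coef_standard_errors(np.column_stack([x1, x2_coll]), y_coll)

print("SE of x1 coefficient, independent predictors:", round(float(se_indep[1]), 3))
print("SE of x1 coefficient, collinear predictors  :", round(float(se_coll[1]), 3))
```

A larger standard error shrinks the t-statistic, so a collinear predictor can appear insignificant even when it is genuinely related to the response.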