# 03 Linear Regression

## 3.1 Simple Linear Regression

###### Population Regression Line
This is the best linear approximation to the true relationship between $X$ and $Y$.
$$Y = \beta_0 + \beta_1 X + \epsilon$$

#### Estimating the Coefficients

- $\beta_0$ is the intercept
- $\beta_1$ is the slope

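The residual for the $i$th observation is $e_i = y_i - \hat y_i$, and the residual sum of squares is
$$RSS = \sum_{i = 1}^n\ (y_i - \hat y_i)^2$$

or equivalently the sum of the squared residuals, $e_1^2 + e_2^2 + \dots + e_n^2$.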

And the minimizers (least squares coefficients) are:
- $\hat \beta_1 = \frac{\sum_{i = 1}^n\ (x_i - \bar x)(y_i - \bar y)}{\sum_{i = 1}^n\ (x_i - \bar x)^2}$
- $\hat \beta_0 = \bar y - \hat \beta_1 \bar x$
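
The closed-form minimizers above translate almost directly into code. The following is a minimal sketch using NumPy; the function names `least_squares_fit` and `rss` and the toy data are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def least_squares_fit(x, y):
    """Return (beta0_hat, beta1_hat) for simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

def rss(x, y, beta0, beta1):
    """Residual sum of squares for the line beta0 + beta1 * x."""
    residuals = np.asarray(y, dtype=float) - (beta0 + beta1 * np.asarray(x, dtype=float))
    return float(np.sum(residuals ** 2))

# Made-up data lying exactly on the line y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
b0, b1 = least_squares_fit(x, y)
```

Because the toy points lie exactly on a line, the fitted coefficients recover that line and the RSS is zero; with noisy data the RSS would be positive but still the smallest achievable by any straight line.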

A single sample cannot tell us the true population mean exactly, but the sample mean is an unbiased estimate of it. One particular set of observations may give a sample mean $\bar \mu$ that underestimates the population mean $\mu$, and another set may give a $\bar \mu$ that overestimates it. If we average the sample means from a large number of such sets, the result will be very close to $\mu$.

The same reasoning applies to estimating the coefficient values $\beta_0$ and $\beta_1$.
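
The averaging argument can be checked with a small simulation. This sketch assumes a normal population with made-up parameters (`mu`, `sigma`) and an arbitrary sample size; none of these values come from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 5.0, 2.0, 30   # assumed population mean, std dev, and sample size

# Draw 10,000 independent samples of size n and record each sample mean.
sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

single_error = abs(sample_means[0] - mu)        # error of one particular sample mean
average_error = abs(sample_means.mean() - mu)   # error after averaging all sample means
spread = sample_means.std(ddof=1)               # empirical spread of the sample means
```

Individual sample means scatter around `mu`, while their average sits very close to it. The empirical `spread` should also be near `sigma / sqrt(n)`, which is the standard error.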

#### Estimating Accuracy of the Coefficient Estimates

To estimate how close $\bar \mu$ is to $\mu$, we compute the standard error:
$$Var(\bar \mu) = SE(\bar \mu)^2 = \frac{\sigma^2}{n}$$

where $\sigma$ is the standard deviation of each of the observations $y_i$. Roughly speaking, the standard error tells us the average amount that the estimate $\bar \mu$ differs from the actual value $\mu$. Notice that the standard error shrinks as $n$ grows: with more observations, the estimate becomes more precise. To get the standard errors for the coefficients, with $\sigma^2 = Var(\epsilon)$, we use:
- $SE(\hat \beta_0)^2 = \sigma^2\ [\frac{1}{n} + \frac{\bar x^2}{\sum_{i = 1}^n\ (x_i - \bar x)^2}]$
- $SE(\hat \beta_1)^2 = \frac{\sigma^2}{\sum_{i = 1}^n\ (x_i - \bar x)^2}$

Notice that the $SE(\hat \beta_1)$ is smaller when the $x_i$ are more spread out. We have more leverage to estimate a slope when this is the case.

The estimate of $\sigma$ is known as the residual standard error:
$$RSE = \sqrt{\frac{RSS}{n - 2}}$$
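
Putting the standard error formulas and the RSE together, a rough sketch of the computation might look like this. The function name `coefficient_standard_errors` and the data are hypothetical; $\sigma$ is replaced by its estimate, the RSE.

```python
import numpy as np

def coefficient_standard_errors(x, y):
    """Return (beta0, beta1, se_beta0, se_beta1, rse) for simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sxx = np.sum((x - x.mean()) ** 2)
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    beta0 = y.mean() - beta1 * x.mean()
    rss = np.sum((y - beta0 - beta1 * x) ** 2)
    rse = np.sqrt(rss / (n - 2))                        # estimate of sigma
    se_beta0 = rse * np.sqrt(1.0 / n + x.mean() ** 2 / sxx)
    se_beta1 = rse / np.sqrt(sxx)
    return beta0, beta1, se_beta0, se_beta1, rse

# Made-up data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b0, b1, se0, se1, rse = coefficient_standard_errors(x, y)
```

The denominator `sxx` grows when the $x_i$ are more spread out, which is why a wider range of predictor values gives a smaller $SE(\hat \beta_1)$.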

Confidence intervals: a 95% confidence interval is defined as a range of values such that with 95% probability the range will contain the true unknown value of the parameter. The range is defined in terms of lower and upper limits computed from the sample data. For linear regression, the 95% confidence intervals take the approximate form:
- $\hat \beta_1 \pm 2 \cdot SE(\hat \beta_1)$
- $\hat \beta_0 \pm 2 \cdot SE(\hat \beta_0)$

which means there is approximately a 95% chance that the first interval will contain the true value of $\beta_1$, and likewise the second for $\beta_0$.
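
One way to see what the 95% figure means is to simulate many data sets from a known model and count how often the interval captures the true slope. The sketch below uses the common approximation of two standard errors on either side of the estimate; the true coefficients, noise level, and sample size are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma, n = 1.0, 2.0, 1.5, 50   # assumed true model and sample size
x = rng.uniform(0, 10, size=n)               # predictors held fixed across replications
sxx = np.sum((x - x.mean()) ** 2)

replications = 2000
covered = 0
for _ in range(replications):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    rse = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
    se_b1 = rse / np.sqrt(sxx)
    # Does the interval b1 +/- 2 * SE contain the true slope?
    if b1 - 2 * se_b1 <= beta1 <= b1 + 2 * se_b1:
        covered += 1

coverage = covered / replications   # fraction of intervals containing beta1
```

Under these assumptions `coverage` should land close to 0.95, though the exact value depends on the random draws.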

###### Hypothesis Testing
- $H_0: \beta_1 = 0$ (no relationship)
- $H_a: \beta_1 \ne 0$ (has relationship)

If $SE(\hat \beta_1)$ is small, then even relatively small values of $\hat \beta_1$ may provide strong evidence that $\beta_1 \ne 0$ and hence that there is a relationship between $X$ and $Y$. If $SE(\hat \beta_1)$ is large, then $\hat \beta_1$ must be large in absolute value in order for us to reject the null hypothesis.

T-Statistic
$$t = \frac{\hat \beta_1 - 0}{SE(\hat \beta_1)}$$

which measures the number of standard deviations that $\hat \beta_1$ is away from 0. If there is no relationship between $X$ and $Y$, then we expect $t$ to follow a t-distribution with $n - 2$ degrees of freedom. The t-distribution has a bell shape and for values of $n$ greater than around 30 it is quite similar to a normal distribution.
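
As a rough illustration, the t-statistic can be computed from the slope estimate and its standard error. The helper name `slope_t_statistic` and the simulated data are assumptions made for this example.

```python
import numpy as np

def slope_t_statistic(x, y):
    """Return (t, degrees_of_freedom) for testing H0: beta1 = 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sxx = np.sum((x - x.mean()) ** 2)
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    beta0 = y.mean() - beta1 * x.mean()
    rse = np.sqrt(np.sum((y - beta0 - beta1 * x) ** 2) / (n - 2))
    se_beta1 = rse / np.sqrt(sxx)
    return beta1 / se_beta1, n - 2

# Simulated data with a genuine linear relationship (assumed slope 0.5)
rng = np.random.default_rng(2)
x = np.arange(20, dtype=float)
y = 3.0 + 0.5 * x + rng.normal(0, 1.0, size=20)
t, dof = slope_t_statistic(x, y)
```

A value of `t` far from zero, judged against a t-distribution with `dof` degrees of freedom, is evidence against $H_0$.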

###### P-Value
The p-value is the probability of observing a t-statistic at least as large in absolute value as the one computed, assuming $H_0$ is true. A small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance alone, in the absence of any real association between them. Therefore if we see a small p-value, we can infer that there is an association between the predictor and the response, and we reject the null hypothesis.

#### Assessing the Accuracy of a Model

###### Residual Standard Error (RSE)
$$RSE = \sqrt{\frac{1}{n - 2}\ \sum_{i = 1}^n\ (y_i - \hat y_i)^2}$$

$$R^2 = 1 - \frac{RSS}{TSS}$$
- $RSS = \sum_{i = 1}^n\ (y_i - \hat y_i)^2$
- $TSS = \sum(y_i - \bar y)^2$

TSS measures the total variance in the response $Y$, which is the amount of variability inherent in the response before the regression is performed. RSS measures the amount of variability in the response left unexplained after performing the regression. Hence TSS - RSS measures the amount of variability in the response that is explained by performing the regression, and $R^2$ measures the proportion of variability in $Y$ that can be explained by using $X$. An $R^2$ statistic close to one means that a large proportion of the variability in the response is explained by the regression.

###### Correlation
$$Corr(X, Y) = \frac{\sum_{i = 1}^n\ (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i = 1}^n\ (x_i - \bar x)^2} \sqrt{\sum_{i = 1}^n\ (y_i - \bar y)^2}}$$

and is a measure of the linear relationship between $X$ and $Y$. In simple linear regression, $R^2 = r^2$ where $r = Corr(X, Y)$ is the sample correlation between $X$ and $Y$. In other words, $R^2$ is the square of the correlation between the response and the predictor.
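
The identity between $R^2$ and the squared correlation in simple linear regression can be verified numerically. The function names and the simulated data below are assumptions for illustration.

```python
import numpy as np

def r_squared(x, y):
    """R^2 = 1 - RSS/TSS for a simple linear regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0 = y.mean() - beta1 * x.mean()
    rss = np.sum((y - beta0 - beta1 * x) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss

def correlation(x, y):
    """Sample correlation between x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2)))

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(size=100)
r2, r = r_squared(x, y), correlation(x, y)
```

Up to floating-point error, `r2` and `r ** 2` agree. This equality is specific to the one-predictor case.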

## 3.2 Multiple Linear Regression

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$$

where $X_j$ is the $j$th predictor and $\beta_j$ quantifies the $j$th predictor's association with $Y$. View $\beta_j$ as the average effect on $Y$ of a one unit increase in $X_j$, holding all other predictors fixed.

#### Questions

###### One: Relationship Between Response and Predictors?
The first step in multiple linear regression is to compute the F-Statistic and examine the associated p-value.

- $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$
- $H_a:$ at least one coefficient does not equal 0

Compute the F-Statistic:
$$F = \frac{(TSS - RSS) / p}{RSS / (n - p - 1)}$$

- $E[RSS / (n - p - 1)] = \sigma^2$ (when the linear model assumptions hold)
- $E[(TSS - RSS) / p] = \sigma^2$ (provided $H_0$ is true)

such that if there is no relationship between the response and the predictors, one would expect the F-Statistic to take on a value close to 1. Otherwise, if $H_a$ is true, then $E[(TSS - RSS) / p] > \sigma^2$, so we expect $F$ to be greater than 1. When the value of $n$ is large, an F-Statistic that is just a little larger than 1 might still provide evidence against the null. In contrast, a larger F-Statistic is needed to reject the null if $n$ is small. When the null is true and the errors $\epsilon_i$ have a normal distribution, the F-Statistic follows an F-Distribution.
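
To make the behaviour of the F-Statistic concrete, the sketch below computes it for one response that depends on a predictor and one that does not. The helper names `fit_rss` and `f_statistic`, the use of `np.linalg.lstsq` for the fit, and the simulated data are all choices made for this example.

```python
import numpy as np

def fit_rss(X, y):
    """RSS of a least squares fit of y on the columns of X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ coef) ** 2))

def f_statistic(X, y):
    """Overall F-statistic for H0: all p slope coefficients are zero."""
    n, p = X.shape
    rss = fit_rss(X, y)
    tss = float(np.sum((y - y.mean()) ** 2))
    return ((tss - rss) / p) / (rss / (n - p - 1))

rng = np.random.default_rng(7)
n, p = 100, 3
X = rng.normal(size=(n, p))
y_related = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)   # depends on the first predictor
y_unrelated = rng.normal(size=n)                        # H0 is true by construction
f_related = f_statistic(X, y_related)
f_unrelated = f_statistic(X, y_unrelated)
```

In this setup `f_related` should be far above 1, while `f_unrelated` should be of order 1. The exact values depend on the random draws.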

Sometimes we want to test whether a particular subset of $q$ of the coefficients are zero:
$$H_0: \beta_{p - q + 1} = \beta_{p - q + 2} = \cdots = \beta_p = 0$$

For this we fit a second model that uses all variables except the last $q$. Calling the residual sum of squares of that smaller model $RSS_0$, the F-Statistic is:
$$F = \frac{(RSS_0 - RSS) / q}{RSS / (n - p - 1)}$$

When $q = 1$, this evaluates the partial effect of adding that single variable to a model that already contains the others.
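
The subset test can be sketched by fitting the full and reduced models and comparing their RSS values. The helper names and the simulated data are assumptions; the reduced model simply drops the last `q` columns.

```python
import numpy as np

def fit_rss(X, y):
    """RSS of a least squares fit of y on the columns of X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ coef) ** 2))

def partial_f_statistic(X, y, q):
    """F-statistic for H0: the last q coefficients are all zero."""
    n, p = X.shape
    rss_full = fit_rss(X, y)
    rss_0 = fit_rss(X[:, : p - q], y)   # model without the last q predictors
    return ((rss_0 - rss_full) / q) / (rss_full / (n - p - 1))

rng = np.random.default_rng(9)
X = rng.normal(size=(120, 4))
y = 0.5 + 1.5 * X[:, 0] + rng.normal(size=120)   # only the first predictor matters
f_last_two = partial_f_statistic(X, y, q=2)      # tests the last two predictors
f_all = partial_f_statistic(X, y, q=4)           # dropping everything: the overall test
```

With `q` equal to `p`, the reduced model contains only the intercept, so $RSS_0 = TSS$ and the statistic coincides with the overall F-Statistic.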

###### Two: Deciding on Important Variables
Once we have confirmed that at least one variable is related to the response, the next step is to determine which predictors those are.

It is usually not viable to test all subsets of the predictors, as there are $2^p$ models to consider. Instead, there are three classical approaches:
- Forward Selection: begin with the null model, which contains only an intercept. Then fit $p$ simple linear regressions and add to the null model the variable that results in the lowest RSS. Repeat, each time adding the variable that gives the lowest RSS for the enlarged model, until some stopping rule is satisfied.
- Backward Selection: start with all variables in the model, and remove the variable with the largest p-value (the least statistically significant). Refit and repeat until a stopping rule is reached, for example when all remaining variables have a p-value below some threshold. (Cannot be used if $p > n$.)
- Mixed Selection: a combination of forward and backward selection. Start with the null model and add variables one at a time as in forward selection. If at any point the p-value of a variable already in the model rises above a certain threshold, remove that variable. Continue these forward and backward steps until all variables in the model have a sufficiently low p-value and all variables outside the model would have a large p-value if added.
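
Forward selection in particular is easy to sketch as a greedy loop. The version below stops after a fixed number of variables (`max_vars`), which is an assumed stopping rule chosen for simplicity; the notes do not prescribe one, and the helper names are hypothetical.

```python
import numpy as np

def fit_rss(X, y):
    """RSS of a least squares fit of y on the columns of X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ coef) ** 2))

def forward_selection(X, y, max_vars):
    """Greedy forward selection by RSS; returns chosen column indices in order."""
    p = X.shape[1]
    selected, remaining = [], list(range(p))
    while remaining and len(selected) < max_vars:
        # Try each remaining predictor and keep the one giving the lowest RSS.
        best = min(remaining, key=lambda j: fit_rss(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 4))
# Assumed truth: column 2 has the strongest effect, column 0 a weaker one.
y = 3.0 * X[:, 2] + 1.0 * X[:, 0] + rng.normal(0, 0.1, size=200)
chosen = forward_selection(X, y, max_vars=2)
```

Each pass fits one model per remaining predictor, so the total number of fits grows roughly quadratically in $p$ rather than as $2^p$.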

###### Three: Model Fit
In multiple linear regression, $R^2$ equals $Corr(Y, \hat Y)^2$, the square of the correlation between the response and the fitted values. One property of the fitted linear model is that it maximizes this correlation among all possible linear models. One question to ask is how RSE can increase when a variable is added, given that RSS must decrease. Models with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in $p$. This follows from the equation:
$$RSE = \sqrt{\frac{1}{n - p - 1} RSS}$$

###### Four: Predictions
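Even when the coefficient estimates are close to the true values, predictions for $Y$ remain uncertain. A confidence interval quantifies the uncertainty around the average response for given predictor values, while a prediction interval is wider because it also accounts for the irreducible error term $\epsilon$.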

## 3.3 Other Considerations in the Regression Model

#### Qualitative Predictors

Create a dummy variable to represent a qualitative variable (factor). If the factor has only two levels, for example male or female, a single dummy variable $x_i$ suffices:
- $x_i = 1$ if the $i$th person is female
- $x_i = 0$ if the $i$th person is male

This results in two regression models:
- $y_i = \beta_0 + \beta_1 + \epsilon_i$ if the $i$th person is female
- $y_i = \beta_0 + \epsilon_i$ if the $i$th person is male

With this coding, $\beta_0$ is the average response for the baseline category (male), $\beta_0 + \beta_1$ is the average response for the other category (female), and $\beta_1$ is the average difference between the two categories. If the p-value for the dummy variable's coefficient is high, then we conclude there is no statistical evidence of a difference in the average response (credit card balance in this example) between the genders.

We can also code the dummy variable as 1 and -1 rather than 1 and 0. With this coding, $\beta_0$ is the overall average response ignoring the category, and $\beta_1$ is the amount by which one category lies above that average and the other lies below it. The predictions are the same under either coding; only the interpretation of the coefficients changes.

###### Qualitative Predictors with More than Two Levels
When a qualitative predictor has more than two levels, a single dummy variable cannot represent all possible values. Instead we create several dummy variables $x_{i1}, x_{i2}$, etc., each taking the value 1 or 0. This is similar to a one-hot encoding in which the variable turns on, like a switch, one of the possible values so that this value gets 1 while all the others get 0, except that one level is left out as the baseline. An example would be a variable representing ethnicity, with $x_{i1} = 1$ if the $i$th person is Asian and $x_{i2} = 1$ if the $i$th person is Caucasian. So:
- $y_i = \beta_0 + \beta_1 + \epsilon_i$ if the $i$th person is Asian
- $y_i = \beta_0 + \beta_2 + \epsilon_i$ if the $i$th person is Caucasian
- $y_i = \beta_0 + \epsilon_i$ if the $i$th person is African American

Here $\beta_0$ is the average response (balance) for African Americans, $\beta_1$ is the difference in the average balance between the Asian and African American categories, and $\beta_2$ is the difference in the average balance between the Caucasian and African American categories. There will always be one fewer dummy variable than the number of levels, because the level with no dummy variable serves as the baseline.
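
A small sketch of this baseline coding follows. The helper `dummy_encode` is hypothetical, and the response values are made-up numbers chosen so the group averages are easy to read off.

```python
import numpy as np

def dummy_encode(labels, baseline):
    """Return (matrix, levels): one 0/1 column per non-baseline level."""
    levels = [lvl for lvl in sorted(set(labels)) if lvl != baseline]
    matrix = np.array([[1.0 if lab == lvl else 0.0 for lvl in levels] for lab in labels])
    return matrix, levels

labels = ["Asian", "Caucasian", "African American", "Asian"]
D, levels = dummy_encode(labels, baseline="African American")

# Made-up responses: Asian mean 12, Caucasian mean 20, African American mean 5
y = np.array([10.0, 20.0, 5.0, 14.0])
design = np.column_stack([np.ones(len(y)), D])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
```

Here `coef[0]` is the baseline group's average and the remaining coefficients are each group's difference from that baseline, matching the interpretation of $\beta_0$, $\beta_1$, and $\beta_2$.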

#### Extensions of the Linear Model

Two important assumptions about the relationship between the predictors and the response are:
- Additive: the effect of a change in a predictor $X_j$ on $Y$ is independent of the values of the other predictors.
- Linear: the change in $Y$ due to a one unit change in $X_j$ is constant regardless of the value of $X_j$.

###### Removing Additive Assumptions
Add in interaction terms:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon$$
becomes
$$Y = \beta_0 + (\beta_1 + \beta_3 X_2)X_1 + \beta_2 X_2 + \epsilon$$

where $(\beta_1 + \beta_3 X_2)X_1$ becomes $\tilde \beta_1 X_1$ and changes with $X_2$. Changing the value of $X_2$ will change the impact of $X_1$ on $Y$.

Dummy variables on their own allow for parallel lines with different intercepts: the average response shifts depending on the category. Interaction terms additionally allow the slope to differ between categories. With the interaction term we have:
- $Y_i = \beta_0 + \beta_1 X_i +$:
    - $\beta_2 + \beta_3 X_i$ if male
    - $0$ if female
    
Which then becomes:
- $(\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_i$ if male
- $\beta_0 + \beta_1 X_i$ if female

where $\beta_2$ is the difference in intercepts between the two groups (the dummy variable effect) and $\beta_3$ is the difference in slopes (the interaction effect: how the category changes the effect of $X$ on $Y$).
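
The two lines can be recovered from a single fit that includes the dummy variable and its product with $X$. The coefficients in `true_beta`, the noise level, and the simulated data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0, 10, size=n)
male = rng.integers(0, 2, size=n).astype(float)   # 1 if male, 0 if female

# Assumed true values of beta0, beta1, beta2 (intercept shift), beta3 (slope shift)
true_beta = np.array([1.0, 0.5, 2.0, 1.5])
design = np.column_stack([np.ones(n), x, male, male * x])
y = design @ true_beta + rng.normal(0, 0.3, size=n)

beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
female_line = (beta_hat[0], beta_hat[1])                              # (intercept, slope)
male_line = (beta_hat[0] + beta_hat[2], beta_hat[1] + beta_hat[3])    # (intercept, slope)
```

Dropping the `male * x` column from `design` would force both groups to share one slope, leaving only the intercept shift.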

###### Non-Linear Relationships
Adding polynomial terms such as $X^2$ can help when the relationship is not strictly linear. To decide when to stop adding higher-degree terms, use the p-value of the next higher power of the predictor. If it is not significant, stop; if it is significant, include the term and check the next higher power.

#### Potential Problems

The following problems can occur when fitting a linear regression:

###### Non-linearity of the response-predictor relationships
If a residual plot indicates that there are non-linear associations in the data, then use non-linear transformations of the predictors, such as $\log X$, $\sqrt X$, or $X^2$.

###### Correlation of error terms 
If there is correlation between the error terms $\epsilon_i$, then the estimated standard errors will tend to underestimate the true standard errors. This results in confidence and prediction intervals that are narrower than they should be, so correlated error terms lead to unwarranted confidence in our models. Positively correlated errors lead to tracking in the residuals (adjacent residuals tend to have similar values).

###### Non-constant variance of error terms:
Heteroscedasticity (non-constant variance of the errors) shows up as a funnel shape in the residual plot. To fix this issue, transform the response $Y$ with a concave function such as $\log Y$ or $\sqrt Y$. This leads to a greater amount of shrinkage for larger responses.

###### Outliers: 
To identify outliers it is helpful to plot the studentized residuals, calculated by dividing each residual $e_i$ by its estimated standard error. Observations whose studentized residuals are greater than 3 in absolute value are possible outliers. It is common to remove outliers, but care should be taken to make sure an outlier is not due to a deficiency in the model, such as a missing predictor.
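
As a sketch, studentized residuals for simple linear regression can be computed as follows. This uses the internally studentized form, where the standard error of $e_i$ is estimated as $RSE \sqrt{1 - h_i}$ with $h_i$ the leverage; that choice is an assumption, since the notes do not say which variant is intended, and the data are simulated with one planted outlier.

```python
import numpy as np

def studentized_residuals(x, y):
    """Internally studentized residuals for simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sxx = np.sum((x - x.mean()) ** 2)
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    beta0 = y.mean() - beta1 * x.mean()
    residuals = y - (beta0 + beta1 * x)
    rse = np.sqrt(np.sum(residuals ** 2) / (n - 2))
    leverage = 1.0 / n + (x - x.mean()) ** 2 / sxx
    # Estimated standard error of the i-th residual is rse * sqrt(1 - h_i).
    return residuals / (rse * np.sqrt(1.0 - leverage))

rng = np.random.default_rng(6)
x = np.arange(20, dtype=float)
y = 2.0 * x + rng.normal(0, 0.5, size=20)
y[10] += 10.0                                   # plant one outlier
r = studentized_residuals(x, y)
flagged = np.flatnonzero(np.abs(r) > 3)         # indices of possible outliers
```

The planted point at index 10 should stand out clearly from the rest, which is the pattern one would look for in a plot of these values.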

###### High-leverage points:
High Leverage observations tend to have a sizeable impact on the estimated regression line; sometimes greater than outliers. To quantify an observation's leverage, we compute the leverage statistic. A large value of this statistic indicates an observation with high leverage. For simple linear regression:
$$h_i = \frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum_{i' = 1}^n\ (x_{i'} - \bar x)^2}$$

with this equation $h_i$ increases with the distance between $x_i$ and $\bar x$. In regard to multiple linear regression, if an observation has a leverage statistic that greatly exceeds $\frac{p + 1}{n}$, then we suspect the corresponding point has high leverage. 
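
The leverage formula for simple linear regression is short enough to compute directly. The function name `leverage_simple` and the data, with one point placed far from the others, are made up for illustration.

```python
import numpy as np

def leverage_simple(x):
    """Leverage statistic h_i for each observation in simple linear regression."""
    x = np.asarray(x, dtype=float)
    return 1.0 / x.size + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0, 20.0])   # last point is far from the rest
h = leverage_simple(x)
average_leverage = 2.0 / x.size            # (p + 1) / n with p = 1
high = np.flatnonzero(h > 2 * average_leverage)
```

The leverages always sum to $p + 1$, so their average is $(p + 1)/n$. The cutoff of twice the average used for `high` is only an illustrative threshold, not one given in the notes.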

###### Collinearity:
If two variables are collinear (closely related to one another), it can be difficult to separate out their individual effects on the response, that is, to tell which one is responsible. A simple way to detect collinearity is to look at the correlation matrix of the predictors. An element of this matrix that is large in absolute value flags a pair of collinear variables. However, this detects only pairwise collinearity, not multicollinearity: collinearity among three or more variables, which can exist even when no pair of variables has a particularly high correlation.

Rather than a correlation matrix, a better method to detect collinearity is the Variance Inflation Factor (VIF). The VIF is the ratio of the variance of $\hat \beta_j$ when fitting the full model divided by the variance of $\hat \beta_j$ if fit on its own. The smallest possible value for VIF is 1, which indicates the complete absence of collinearity. As a rule of thumb, a VIF greater than 5 or 10 indicates a problematic amount of collinearity.

The VIF can be computed as:
$$VIF(\hat \beta_j) = \frac{1}{1 - R^2_{X_j | X_{-j}}}$$

where $R^2_{X_j | X_{-j}}$ is the $R^2$ from a regression of $X_j$ onto all of the other predictors. If that value is close to one, then collinearity is present and so the VIF will be large. Two ways to deal with collinearity are:
- Drop problematic variables from the regression. 
- Combine the collinear variables together into one predictor. An example is taking the average of the two predictors to create a new predictor. 
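
The VIF formula can be sketched by regressing each predictor on all the others. The helper names and the simulated predictors, two of which are deliberately near-duplicates, are assumptions for this example.

```python
import numpy as np

def r_squared(X, y):
    """R^2 from a least squares fit of y on the columns of X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coef) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss

def vif(X):
    """VIF for each column of X, regressing it on all the other columns."""
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        factors.append(1.0 / (1.0 - r_squared(others, X[:, j])))
    return np.array(factors)

rng = np.random.default_rng(5)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)   # nearly a copy of x1
x3 = rng.normal(size=500)              # unrelated to the other two
v = vif(np.column_stack([x1, x2, x3]))
```

In this setup the first two entries of `v` should be far above the rule-of-thumb range of 5 to 10, while the third stays near 1.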