# Regression and Correlation Methods

# 11.2 General Concepts

Population regression line is given as $y = \alpha + \beta x + e$
- $\beta$ represents the change in $y$ for a 1 unit increase in $x$, where $y$ represents the average value for the "dependent" variable for the population. 
- $e \sim N(0, \sigma^2)$

Sample regression line is given as $\hat{y} = a + bx$
- $\hat{y}$ is the **predicted (or average)** value of y for a given value of x 
- $a$ and $b$ are estimates of $\alpha$ and $\beta$, respectively. 

# 11.3 Fitting Regression Lines - Ordinary Least Squares

Strengths & Weaknesses:

**Assumptions:**
- Appropriate whenever the average residual for each given value of x is 0.
    - Normality is not strictly required 
    - However, the normality assumption is necessary to perform hypothesis tests concerning regression parameters 
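The least-squares fit itself can be sketched in plain Python. This is a minimal illustration, assuming the usual closed-form estimates $b = L_{xy}/L_{xx}$ and $a = \bar{y} - b\bar{x}$; the function name `fit_ols` and the data are made up for the example.

```python
from statistics import mean

def fit_ols(x, y):
    """Return (a, b) minimizing the sum of squared residuals for y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    # Corrected sum of squares of x and corrected sum of cross products
    l_xx = sum((xi - x_bar) ** 2 for xi in x)
    l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = l_xy / l_xx          # slope estimate
    a = y_bar - b * x_bar    # intercept estimate
    return a, b

# Illustrative data (not from the text)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_ols(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(a, b, sum(residuals))
```

With an intercept in the model, the residuals sum to zero up to floating-point error, which gives a quick sanity check on the fit.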
    
Drawbacks:

What if assumptions are violated:

# 11.4 Inferences About Parameters from Regression Lines

Criteria for regression lines that fit the data well vs those that do not.

### F Test for Simple Linear Regression

Test for a non-zero relationship between x and y (that some of the variance in y is captured by x).

**Hypothesis**:
- H0: $\beta = 0$
- Ha: $\beta \neq 0$<br>

**Test Statistic**:<br>
- $F$ = Reg MS / Res MS, which under H0 follows an $F$ distribution with 1 and $n - 2$ $df$. 

**Mechanics**:
1. Python
2. R<br>
    `model <- lm(y ~ x)`<br>
    `anova(model)`

$R^2$ is the proportion of the variance of y that is explained by x. 
   - if $R^2$ = 1, then if x is known, y can be predicted exactly without error. 
   - if $R^2$ = 0, then x provides no information about y: knowing x does not reduce the variance of y at all. 
   - For small $n$, a better measure of % variation of y explained by x is given by the adjusted $R^2$
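As a rough sketch of what `anova(model)` reports, the F statistic and $R^2$ can be computed by hand from the sums of squares. The function name `anova_simple` and the data are illustrative assumptions, and the p-value is left out because it needs an $F$ distribution CDF that the Python standard library does not provide.

```python
from statistics import mean

def anova_simple(x, y):
    """ANOVA quantities for the simple linear regression of y on x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    l_xx = sum((xi - x_bar) ** 2 for xi in x)
    l_yy = sum((yi - y_bar) ** 2 for yi in y)          # total SS
    l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    reg_ss = l_xy ** 2 / l_xx                          # regression SS, 1 df
    res_ss = l_yy - reg_ss                             # residual SS, n - 2 df
    f_stat = (reg_ss / 1) / (res_ss / (n - 2))         # Reg MS / Res MS
    r2 = reg_ss / l_yy                                 # proportion of variance explained
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)          # adjusted R^2, one predictor
    return {"total_ss": l_yy, "reg_ss": reg_ss, "res_ss": res_ss,
            "F": f_stat, "r2": r2, "adj_r2": adj_r2}

# Illustrative data (not from the text)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

table = anova_simple(x, y)
print(table)
```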

### t Test for Simple Linear Regression

Gives the same p-value as an $F$ test, but also provides interval estimates for $\beta$.

**Hypothesis**:
- H0: $\beta = 0$
- Ha: $\beta \neq 0$<br>

**Test**:
- two-sided t test of $t = b / se(b)$ with $n - 2$ $df$, assuming your Ha is that $\beta \neq 0$<br>

**Mechanics**:
1. Python
2. R<br>
    `model <- lm(y ~ x)`<br>
    `summary(model)`
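A hand computation of the slope's t statistic and confidence interval might look like the sketch below. It assumes $se(b) = \sqrt{s_{y \cdot x}^{2} / L_{xx}}$ and takes the critical value $t_{n-2,\,1-\alpha/2}$ as an input looked up from a t table; `slope_t_test` and the data are hypothetical.

```python
from math import sqrt
from statistics import mean

def slope_t_test(x, y, t_crit):
    """Return (b, se_b, t, ci) for the slope of the regression of y on x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    l_xx = sum((xi - x_bar) ** 2 for xi in x)
    l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = l_xy / l_xx
    a = y_bar - b * x_bar
    # Residual variance s^2_{y.x} = Res SS / (n - 2)
    res_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = res_ss / (n - 2)
    se_b = sqrt(s2 / l_xx)
    t = b / se_b
    ci = (b - t_crit * se_b, b + t_crit * se_b)
    return b, se_b, t, ci

# Illustrative data (not from the text)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

T_CRIT = 3.182  # t_{3, 0.975} from a t table (n - 2 = 3 df)
b, se_b, t, ci = slope_t_test(x, y, T_CRIT)
print(b, se_b, t, ci)
```

Squaring `t` reproduces the F statistic of the F test on the same data, which is why the two tests give the same p-value.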

# 11.5 Interval Estimation for Linear Regression

### Interval Estimates for Regression Parameters
Standard errors and confidence intervals for slope and intercept of regression lines. 

*See section 11.4 t-tests for generating standard errors and constructing interval estimates for regression parameters via computer*

### Interval Estimates for Predictions Made from Regression Lines

How accurate is the estimate $\hat{y} = a + bx$?<br>
*It depends on whether we are making predictions for one specific observation with value x (one specific boy with a given height), or for the mean value of all observations with value x (all boys of a given height).*

So, we can calculate a standard error and a confidence interval for:
- the ***observed*** values of y for a given x
    - The distribution of observed $y$ values for the subset of individuals with independent variable $x$ is normal with mean $\hat{y}=a+bx$ and standard error given by $se_{1}(\hat{y})=\sqrt{s_{y \cdot x}^{2}\left[1+\frac{1}{n}+\frac{(x-\bar{x})^{2}}{L_{x x}}\right]}$
    - $100\% \times (1 - \alpha)$ of the observed values will fall within the ***prediction interval*** given by: $\hat{y} \pm t_{n-2,1-\alpha / 2}\, se_{1}(\hat{y})$
- the ***average*** value of y for a given x
    - The best estimate of the average value of $y$ for a given $x$ is $\hat{y}=a+b x .$ Its standard error, denoted by $se_{2}(\hat{y}),$ is given by $se_{2}(\hat{y})=\sqrt{s_{y \cdot x}^{2}\left[\frac{1}{n}+\frac{(x-\bar{x})^{2}}{L_{x x}}\right]}$
    - A two-sided $100\% \times (1 - \alpha)$ confidence interval for the ***average value of $y$*** is given by: $\hat{y} \pm t_{n-2,1-\alpha / 2}\, se_{2}(\hat{y})$

**Mechanics**:
- R<br>
    `model <- lm(y~x)`<br>
    `newdata <- data.frame(x = c(x1, x2, ..., xk))`<br>
    `predict(model, newdata, interval = "prediction")`: for the prediction interval for a single observation<br>
    `predict(model, newdata, interval = "confidence", se.fit = TRUE)`: for the confidence interval for the average prediction $\hat{y}$ for a given value of $x$<br>
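To see how the two standard errors differ, here is a small sketch that evaluates $se_1$ and $se_2$ directly from the formulas above. The function name `interval_estimates` and the data are assumptions for illustration, and the t critical value is supplied by hand from a table rather than computed.

```python
from math import sqrt
from statistics import mean

def interval_estimates(x, y, x_new, t_crit):
    """Prediction and confidence intervals for y at x_new."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    l_xx = sum((xi - x_bar) ** 2 for xi in x)
    l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = l_xy / l_xx
    a = y_bar - b * x_bar
    # Residual variance s^2_{y.x} = Res SS / (n - 2)
    res_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = res_ss / (n - 2)
    y_hat = a + b * x_new
    h = 1 / n + (x_new - x_bar) ** 2 / l_xx
    se1 = sqrt(s2 * (1 + h))   # single observation: includes the extra "1"
    se2 = sqrt(s2 * h)         # average value of y at x_new
    return {
        "y_hat": y_hat, "s2": s2, "se1": se1, "se2": se2,
        "prediction": (y_hat - t_crit * se1, y_hat + t_crit * se1),
        "confidence": (y_hat - t_crit * se2, y_hat + t_crit * se2),
    }

# Illustrative data (not from the text)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

T_CRIT = 3.182  # t_{3, 0.975} from a t table (n - 2 = 3 df)
out = interval_estimates(x, y, x_new=4, t_crit=T_CRIT)
print(out["prediction"], out["confidence"])
```

The prediction interval is always wider than the confidence interval at the same $x$, because $se_1^2$ exceeds $se_2^2$ by exactly $s_{y \cdot x}^{2}$.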


# 11.6 Assessing the Goodness of Fit of Regression Lines

What assumptions are made in fitting a linear regression, what situations make those assumptions untenable, and what can be done when they are not met?

## Assumptions of linear regression 

*An alternative way to describe all four assumptions is that the errors, $e$, are independent normal random variables with mean zero and constant variance, $\sigma^2$.*

1. **Linearity**:
    - for any given value of x, the corresponding value of y has an average value $\alpha + \beta x$, which is a linear function of x.<br><br>
2. **Equal-Variance** & **Normality** of residuals:
    - for any given value of x, the corresponding value of y is normally distributed about $\alpha + \beta x$ with the same variance $\sigma^2$ for any x. The normality assumption matters most in small samples. <br><br>
3. **Independence of error terms**:
    - There is no special pattern in the residuals that would indicate hidden variables that are affecting the observations.
    - for any two data points, the error terms $e_1$, $e_2$ are independent of each other. It's important that the samples be independent for accurate interpretation of p-values.<br><br>
    
4. **No Bad outliers** (from "Statistics with R")

## Assessing the Validity of the Linear Regression Assumptions

1. **Assessing the linearity assumption:** can be checked with a scatter plot<br><br>
2. **Assessing the equal variance and normality assumption:** The variance of the residual at an observation $x$ changes with how far $x$ is from $\bar{x}$, so raw residuals are not directly comparable. **Studentized residuals** are scaled to have unit variance, so they can be compared one-to-one across the entire range of x (or y) values when checking normality and constant variance.

## Adjustments to satisfy the assumptions of linear regression

2. **adjusting input to satisfy equal variance & normality of residuals assumption**:
    - One commonly used strategy that can be employed if unequal residual variances are present is to transform the dependent variable (y) to a different scale. This type of transformation is called a **variance stabilizing transformation**. 
        - The **square-root transformation** is useful when the residual variance is proportional to the average value of y (e.g., if the average value goes up by a factor of 2, then the residual variance goes up by a factor of 2 also). 
        - The **ln transformation** is useful when the residual variance is proportional to the square of the average values (e.g., if the average value goes up by a factor of 2, then the residual variance goes up by a factor of 4).
    - *If you do employ a transformation, make sure that the linearity assumption still holds.*

**Standard Deviation of Residuals About the Fitted Regression line**

Let $\left(x_{i}, y_{i}\right)$ be a sample point used in estimating the regression line, $y=\alpha+\beta x$.
If $y=a+b x$ is the estimated regression line, and
$\hat{e}_{i}=$ residual for the point $\left(x_{i}, y_{i}\right)$ about the estimated regression line, then
$$
\begin{array}{l}
\hat{e}_{i}=y_{i}-\left(a+b x_{i}\right) \text { and } \\
\qquad s d\left(\hat{e}_{i}\right)=\sqrt{\hat{\sigma}^{2}\left[1-\frac{1}{n}-\frac{\left(x_{i}-\bar{x}\right)^{2}}{L_{x x}}\right]}
\end{array}
$$
The **Studentized residual** corresponding to the point $\left(x_{i}, y_{i}\right)$ is $\hat{e}_{i} / s d\left(\hat{e}_{i}\right)$
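The definition above translates fairly directly into code. The sketch below assumes $\hat{\sigma}^{2}$ is the residual mean square, Res SS $/ (n - 2)$; `studentized_residuals` and the data are illustrative.

```python
from math import sqrt
from statistics import mean

def studentized_residuals(x, y):
    """Studentized residuals e_i / sd(e_i) for the regression of y on x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    l_xx = sum((xi - x_bar) ** 2 for xi in x)
    l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = l_xy / l_xx
    a = y_bar - b * x_bar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sigma2_hat = sum(e ** 2 for e in resid) / (n - 2)   # assumed estimate of sigma^2
    result = []
    for xi, e in zip(x, resid):
        # sd(e_i) shrinks as x_i moves away from x_bar
        sd_e = sqrt(sigma2_hat * (1 - 1 / n - (xi - x_bar) ** 2 / l_xx))
        result.append(e / sd_e)
    return result

# Illustrative data (not from the text)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

stud = studentized_residuals(x, y)
print(stud)
```

Plotting these values against $x$ or $\hat{y}$ is one way to look for non-constant variance; values far outside roughly $\pm 2$ deserve a closer look.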

# 11.7 The Correlation Coefficient

(from "Statistics with R", Navarro): "*Pearson correlations are basically the same thing as linear regressions with only a single predictor added to the model.*"

**covariance:** un-normalized metric representing the relationship between two variables X and Y. 

**correlation coefficient** ($\rho$): scaled (normalized) metric representing the relationship between X and Y. Ranges between -1 and 1. 

**sample (Pearson) correlation coefficient** ($r$): sample estimate of the population correlation coefficient $\rho$:
$$r = \frac{\text{sample covariance between x and y}}{\text{(sample std. dev. of x)(sample std. dev. of y)}}$$
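A minimal sketch of this ratio in plain Python, using $n - 1$ denominators for both the sample covariance and the sample standard deviations (they cancel in the ratio). The name `pearson_r` and the data are made up for the example.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Sample correlation = sample covariance / (sd of x * sd of y)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return cov_xy / (s_x * s_y)

# Illustrative data (not from the text)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
r_flipped = pearson_r(x, [-yi for yi in y])   # reversing the sign of y flips r
print(r, r_flipped)
```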



**Assumptions:**
- $x$ and $y$ are normally distributed. 

**Hypothesis Test**:
- H0: correlation between two variables = 0
- Ha: correlation between two variables ≠ 0

**Mechanics**:
- R<br>
    `cor.test( x = parenthood$dan.sleep, y = parenthood$dan.grump )`

### Relationship between Sample Regression Coefficient (b) and Sample Correlation Coefficient (r)

The relationship between the sample regression coefficient $(b)$ and the sample correlation coefficient $(r)$:

$b=\frac{r s_{y}}{s_{x}}$
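This identity is easy to check numerically. The sketch below computes the slope two ways on made-up data: once from $L_{xy}/L_{xx}$ and once from $r s_y / s_x$; all variable names are illustrative.

```python
from statistics import mean, stdev

# Illustrative data (not from the text)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
l_xx = sum((xi - x_bar) ** 2 for xi in x)
l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Slope from least squares
b_ols = l_xy / l_xx

# Slope from the correlation coefficient and the sample standard deviations
s_x, s_y = stdev(x), stdev(y)
r = (l_xy / (n - 1)) / (s_x * s_y)
b_from_r = r * s_y / s_x

print(b_ols, b_from_r)
```

The two slopes agree up to floating-point error, so $b$ and $r$ always share the same sign, and $b$ is simply $r$ rescaled into the units of $y$ per unit of $x$.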

**When should the regression coefficient be used, and when should the correlation coefficient be used?**
- $b$ is for prediction. 
- $r$ is for quantifying the linear relationship.

# 11.8 Statistical Inference for Correlation Coefficients

### One-Sample t Test for Correlation Coefficient

**mechanics**:
- R<br>
    `cor.test(x, y)`
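For H0: $\rho = 0$, the statistic behind this test can be sketched by hand, assuming the standard form $t = r\sqrt{n-2}/\sqrt{1-r^{2}}$ with $n - 2$ $df$. The function name `corr_t_statistic` and the inputs are hypothetical; the p-value would then come from a t table or from software.

```python
from math import sqrt

def corr_t_statistic(r, n):
    """t statistic for H0: rho = 0, given sample correlation r and sample size n."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

# Hypothetical inputs: r = 0.5 observed in a sample of 27 pairs
t = corr_t_statistic(0.5, 27)
df = 27 - 2
print(t, df)
```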

### One sample z Test for a Correlation Coefficient

*The problem with using the t test formulation in Equation 11.20 is that the sample correlation coefficient r has a skewed distribution for nonzero ρ that cannot be easily approximated by a normal distribution.*

**hypotheses**:<br>
H0: ρ = ρ0<br>
Ha: ρ ≠ ρ0<br>
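One common way to carry out this test is Fisher's z transformation, sketched below under the assumption that $z = \frac{1}{2}\ln\frac{1+r}{1-r}$ is approximately normal with variance $1/(n-3)$, so that $\lambda = \sqrt{n-3}\,(z - z_0)$ is approximately $N(0, 1)$ under H0. The function name `fisher_z_test` and the inputs are illustrative.

```python
from math import atanh, sqrt
from statistics import NormalDist

def fisher_z_test(r, rho0, n):
    """Two-sided z test of H0: rho = rho0 using Fisher's z transformation."""
    z = atanh(r)         # atanh(r) = 0.5 * ln((1 + r) / (1 - r))
    z0 = atanh(rho0)
    lam = sqrt(n - 3) * (z - z0)
    p_value = 2 * (1 - NormalDist().cdf(abs(lam)))
    return lam, p_value

# Hypothetical inputs: r = 0.5 from n = 103 pairs, tested against rho0 = 0
lam, p_value = fisher_z_test(0.5, 0.0, 103)
print(lam, p_value)

# When the sample value equals the hypothesized value, lambda is 0 and p is 1
lam_null, p_null = fisher_z_test(0.3, 0.3, 50)
```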

### Interval Estimation for Correlation Coefficient

### Sample Size Estimation for Correlation Coefficients

### Two-Sample Test for Correlations

# 11.9 Multiple Regression

population relationship: $$y=\alpha+\beta_{1} x_{1}+\beta_{2} x_{2}+e$$

The βj, j = 1, 2,..., k are referred to as partial-regression coefficients. βj represents the **average increase in y per unit increase in xj, with all other variables held constant** (or stated another way, after adjusting for all other variables in the model), and **is estimated by the parameter bj**.

sample estimate: $$\hat{y}=a+b_{1} x_{1}+b_{2} x_{2}$$

The relationship between y and a specific independent variable $x_{\ell}$ is characterized as follows:
\begin{aligned}
&y \text { is normally distributed with expected value }=\alpha_{\ell}+\beta_{\ell} x_{\ell} \text { and variance } \sigma^{2} \text { where }\\
&\alpha_{\ell}=\alpha+\beta_{1} x_{1}+\cdots+\beta_{\ell-1} x_{\ell-1}+\beta_{\ell+1} x_{\ell+1}+\cdots+\beta_{k} x_{k}
\end{aligned}


**Assumptions**:
1. **All of the assumptions from single-variable linear regression** PLUS
1. **Uncorrelated Predictors** - no collinearity. 

A **partial-residual plot** is a good way to check the validity of the assumptions. The **partial-residual plot** reflects the relationship between y and $x_{\ell}$ after each variable is adjusted for all other predictors in the multiple-regression model, which is a primary goal of performing a multiple-regression analysis. It is constructed as follows:
1. Regress y on all predictors except $x_{\ell}$ and save the residuals.
2. Regress $x_{\ell}$ on all the other predictors and save the residuals.
3. Plot the residuals from step 1 against the residuals from step 2.

It can be shown that if the multiple-regression model in Equation 11.29 holds, then the residuals in step 1 should be linearly related to the residuals in step 2 with slope = $\beta_{\ell}$ (i.e., the partial-regression coefficient pertaining to $x_{\ell}$ in the multiple-regression model in Equation 11.29) and constant residual variance $\sigma^2$. A separate partial-residual plot can be constructed relating y to each predictor $x_1, \ldots, x_k$.
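The slope property can be checked numerically for two predictors. In the sketch below, step 1 takes the residuals of y regressed on the other predictor, step 2 takes the residuals of $x_1$ regressed on the other predictor, and the slope between the two sets of residuals is compared with $b_1$ from the full model. The helper names, the closed-form two-predictor fit, and the data are all illustrative assumptions.

```python
from statistics import mean

def simple_fit(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    l_xx = sum((xi - x_bar) ** 2 for xi in x)
    l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = l_xy / l_xx
    return y_bar - b * x_bar, b

def residuals(x, y):
    """Residuals of y after a simple regression on x."""
    a, b = simple_fit(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def fit_two_predictors(x1, x2, y):
    """Closed-form least squares for y = a + b1*x1 + b2*x2."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

# Illustrative data (not from the text)
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 11.9]

res_y = residuals(x2, y)       # step 1: y adjusted for x2
res_x1 = residuals(x2, x1)     # step 2: x1 adjusted for x2
_, partial_slope = simple_fit(res_x1, res_y)
_, b1, _ = fit_two_predictors(x1, x2, y)
print(partial_slope, b1)
```

A scatter plot of `res_y` against `res_x1` would be the partial-residual plot for $x_1$ in this two-predictor setting.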

*If there are strong relationships among the independent variables in a multiple-regression model, then the partial-regression coefficients may differ considerably from the simple linear-regression coefficients obtained from considering each independent variable separately.*

**Mechanics**:
1. R<br>
    `lm(y ~ x1 + x2 + x3, data = data.frame)`

## Comparing regression coefficients in multiple linear regression

**If you want to compare multiple regression coefficients to each other directly, you must standardize the coefficients.**

The **standardized regression coefficient** 
$$b_s = b \times (s_x / s_y)$$

It represents the estimated average increase in y (expressed in standard deviation units of y) per standard deviation increase in x, after adjusting for all other variables in the model.
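As a small illustration of why standardizing helps, the sketch below compares two coefficients measured in very different units. All numbers and the function name `standardized_coefficient` are hypothetical.

```python
def standardized_coefficient(b, s_x, s_y):
    """b_s = b * (s_x / s_y): change in y (in SDs of y) per SD increase in x."""
    return b * s_x / s_y

# Hypothetical fitted coefficients and sample standard deviations
s_y = 10.0
b1, s_x1 = 2.0, 2.0     # x1 has a large raw coefficient but a small spread
b2, s_x2 = 0.1, 20.0    # x2 has a small raw coefficient but a large spread

bs1 = standardized_coefficient(b1, s_x1, s_y)
bs2 = standardized_coefficient(b2, s_x2, s_y)
print(bs1, bs2)
```

The raw coefficients differ by a factor of 20, while the standardized ones differ only by a factor of 2, which is the comparison that is meaningful across predictors.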

## Hypothesis testing

Test the overall hypothesis that age and birthweight when considered together are significant predictors of blood pressure. How can this be done?

$F$ Test:<br>
H0: β1 = β2 = . . . = βk = 0<br>
Ha: at least one of β1,..., βk ≠ 0<br>

The significant p-value for this test could be attributed to either variable. We would like to perform significance tests to identify the independent contributions of each variable. How can this be done?

In general, if we have k independent variables, then to assess the specific effect of the lth independent variable (xl), on y after controlling for the effects of all other variables:<br>
H0: βl = 0, all other βj ≠ 0<br>
H1: all βj ≠ 0<br>

**confounding variables**:<br>
It is possible that an independent variable (x1) will seem to have an important effect on a dependent variable (y) when considered by itself but will not be significant after adjusting for another independent variable (x2). This usually occurs when x1 and x2 are strongly related to each other and when x2 is also related to y. We refer to x2 as a confounder of the relationship between y and x1. 

**collinear variables**:<br>
In some instances, two strongly related variables are entered into the same multiple-regression model and, after controlling for the effect of the other variable, neither variable is significant. Such variables are referred to as collinear. It is best to avoid using highly collinear variables in the same multiple-regression model because their simultaneous presence can make it impossible to identify the specific effects of each variable.

## Goodness of Fit / Model Evaluation

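**Externally Studentized residual**: the residual for the *i*th data point, scaled using a fit in which the *i*th data point was not used to estimate the regression parameters. Useful when outliers are present, because the outlier's own effect is removed from the regression it is judged against. 

**Internally Studentized residual**: the residual for the *i*th data point, scaled using the full fit, i.e. the *i*th data point was used in estimating the regression parameters.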

**R<sup>2</sup>** (i.e. the Coefficient of Determination): the proportion of the variance in the outcome variable that can be accounted for by the predictors. 
- 1 - R<sup>2</sup> quantifies the amount of variance left over in the dependent variable after the model is taken into account
    - R<sup>2</sup> quantifies the reduction in variance after the model is taken into account.  

**Adjusted R<sup>2</sup>**: Adding more predictors to the model never decreases R<sup>2</sup>, so R<sup>2</sup> alone favors larger models. The adjusted R<sup>2</sup> penalizes the number of parameters and increases only when a new predictor improves the fit by more than would be expected by chance. 
- $\text{Adjusted } R^{2}=1-\left(\frac{n-1}{n-p}\right)\left(1-R^{2}\right)$, where $n$ is the number of observations and $p$ is the number of estimated coefficients including the intercept ($p = k + 1$ for $k$ predictors)
- This is useful when comparing different multiple linear regression models, as the number of terms in the model won't inflate the adjusted R<sup>2</sup> like it will for the un-adjusted R<sup>2</sup>. However, the adjusted R<sup>2</sup> does not have the same interpretation as the un-adjusted R<sup>2</sup> (proportion of the variance explained in the dependent variable by the predictors).
- Also useful for determining which predictors add to the model, since the adjusted R<sup>2</sup> won't increase unless a predictor actually increases the explained variance in the dependent variable by enough to offset the penalty. 
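The adjustment is a one-line calculation. The sketch below assumes $p$ counts every estimated coefficient including the intercept; the function name `adjusted_r2` and the inputs are hypothetical.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - ((n - 1) / (n - p)) * (1 - R^2).

    Assumption: p counts all estimated coefficients, intercept included.
    """
    return 1 - (n - 1) / (n - p) * (1 - r2)

# Hypothetical case: R^2 = 0.50 with 11 observations
one_predictor = adjusted_r2(0.50, n=11, p=2)       # intercept + 1 slope
three_predictors = adjusted_r2(0.50, n=11, p=4)    # same R^2, more terms
print(one_predictor, three_predictors)
```

For the same R<sup>2</sup>, the model with more terms gets the lower adjusted value, which is the penalty at work.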

**Partial R<sup>2</sup>** (i.e. the Coefficient of Partial Determination): the proportion of the variation in y left unexplained by model A that is explained by adding the predictors in B, i.e. how much more variance can be accounted for by including the additional predictors (with respect to model A).
\begin{aligned}
R_{y, B \mid A}^{2} &=\frac{\operatorname{SSR}(B \mid A)}{\operatorname{SSE}(A)} \\
&=\frac{\operatorname{SSE}(A)-\operatorname{SSE}(A, B)}{\operatorname{SSE}(A)}
\end{aligned}
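A small numerical sketch of this ratio, taking model A to contain $x_1$ only and B to add $x_2$. The helper names, the closed-form two-predictor fit, and the data are assumptions made for the example.

```python
from statistics import mean

def sse_one_predictor(x, y):
    """Residual sum of squares for the simple regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    l_xx = sum((xi - x_bar) ** 2 for xi in x)
    l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = l_xy / l_xx
    a = y_bar - b * x_bar
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def sse_two_predictors(x1, x2, y):
    """Residual sum of squares for y = a + b1*x1 + b2*x2 (closed-form fit)."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = my - b1 * m1 - b2 * m2
    return sum((w - (a + b1 * u + b2 * v)) ** 2 for u, v, w in zip(x1, x2, y))

# Illustrative data (not from the text)
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 11.9]

sse_a = sse_one_predictor(x1, y)           # SSE(A): x1 only
sse_ab = sse_two_predictors(x1, x2, y)     # SSE(A, B): x1 and x2
partial_r2 = (sse_a - sse_ab) / sse_a      # R^2 of B given A
print(sse_a, sse_ab, partial_r2)
```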