# Multiple Linear Regression II

When we perform multiple linear regression, we are usually interested in answering a few important questions.

## 1. Is at least one of the predictors $X_1, X_2, \dots, X_p$ helpful in predicting the response $y$?



In simple linear regression, we check if there’s a relationship between the response and predictor by testing if the coefficient $\beta_1$ is zero. For multiple regression with $p$ predictors, we test if all coefficients are zero ($\beta_1 = \beta_2 = \cdots = \beta_p = 0$). This is done using an __F-statistic__. It compares the explained variance (due to the model) to the unexplained variance (due to residuals):

$$
F = \frac{(TSS - RSS) / p}{RSS / (n - p - 1)}
$$

Here $n$ is the number of observations, $p$ is the number of predictors, $TSS$ is the total sum of squares, and $RSS$ is the residual sum of squares.


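When there is no relationship between the response and the predictors, the F-statistic is expected to be close to 1. A value well above 1 indicates that at least one predictor is related to the response variable. But what counts as "large" depends on the number of observations $n$ and predictors $p$.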

- When $n$ is large: Even an F-statistic slightly greater than 1 might be significant.
- When $n$ is small: A larger F-statistic is needed to be significant.

The F-statistic follows an F-distribution, and its p-value tells us if the predictors are collectively significant.
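
As a rough illustration of the formula above, the following sketch computes the F-statistic with NumPy on simulated data. The helper name `f_statistic`, the simulated coefficients, and the sample size are all assumptions made for this example; none of them come from the Advertising data.

```python
import numpy as np

def f_statistic(X, y):
    """Overall F-statistic for regressing y on the columns of X (intercept added here)."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])        # prepend the intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta) ** 2)           # unexplained variation
    tss = np.sum((y - y.mean()) ** 2)                # total variation
    return ((tss - rss) / p) / (rss / (n - p - 1))

# Simulated data (assumption): 200 observations, 3 predictors.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
noise = rng.normal(size=n)

y_related = 1.0 + 2.0 * X[:, 0] + noise   # the first predictor truly matters
y_unrelated = 1.0 + noise                 # no predictor matters (H0 is true)

f_related = f_statistic(X, y_related)
f_unrelated = f_statistic(X, y_unrelated)
print(f_related, f_unrelated)
```

With one truly related predictor the statistic should come out far above 1, while under the null hypothesis it should stay close to 1. Converting the statistic into a p-value requires the F-distribution with $p$ and $n - p - 1$ degrees of freedom, which this sketch leaves out.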

### Example: Multiple Linear Regression with Sales and Advertising Data

Let’s consider the Advertising data again. Assume we performed a multiple linear regression of sales on TV, radio, and newspaper spending and computed the F-statistic to test if at least one of the predictors is related to sales.


| Coefficient | Estimate | Std. Error | p-Value |
|-------------|----------|------------|---------|
| Intercept    | 2.939    | 0.3119     | <0.0001 |
| TV           | 0.046    | 0.0014     | <0.0001 |
| Radio        | 0.189    | 0.0086     | <0.0001 |
| Newspaper    | -0.001   | 0.0059     | 0.8599  |

| Quantity                 | Value  |
|--------------------------|--------|
| F-statistic              | 570    |
| Residual standard error  | 1.69   |
| R²                       | 0.897  |


__Interpretation__

- F-statistic = 570: Since this is far larger than 1, it provides compelling evidence against the null hypothesis $H_0$ that all coefficients are zero. This suggests that at least one of the predictors is related to sales.
- p-value: When the p-value associated with the F-statistic is very small (close to 0), it provides strong evidence against $H_0$, indicating that the predictors collectively have a significant relationship with the response.
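
As a sanity check on the table, the F-statistic can also be written in terms of $R^2$ by dividing the numerator and denominator of the formula by $TSS$, which gives $F = \frac{R^2 / p}{(1 - R^2) / (n - p - 1)}$. The sketch below plugs in the reported $R^2$ and $p = 3$. The sample size $n = 200$ is an assumption about the Advertising data, since it is not stated above, and `f_from_r2` is a name made up for this example.

```python
def f_from_r2(r2, n, p):
    """F-statistic in terms of R^2: (R^2 / p) / ((1 - R^2) / (n - p - 1))."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))

# R^2 and p come from the tables above; n = 200 is an assumption about the data set.
f_approx = f_from_r2(r2=0.897, n=200, p=3)
print(f_approx)  # close to the reported 570; the small gap comes from rounding R^2
```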


## 2. Feature Selection



Sometimes, we want to test whether only a subset of the coefficients is zero. Suppose we want to test whether a particular subset of $q$ coefficients is zero:

$$
H_0: \beta_{p-q+1} = \beta_{p-q+2} = \cdots = \beta_p = 0
$$

We then fit a reduced model excluding these predictors and calculate the F-statistic to assess whether omitting these predictors significantly worsens the model fit.
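
For reference, the statistic for this partial test has the same form as the overall F-statistic. Writing $RSS_0$ for the residual sum of squares of the reduced model (a symbol introduced here, not used elsewhere in these notes), it is usually written as:

$$
F = \frac{(RSS_0 - RSS) / q}{RSS / (n - p - 1)}
$$

Setting $q = p$ makes the reduced model the intercept-only model, so $RSS_0 = TSS$ and the expression reduces to the overall F-statistic from Section 1.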

Tests of this kind help decide which predictors are important. The most direct approach is called __all subsets__ or __best subsets regression__: compute the least squares fit for every subset of the predictors and choose the one that best balances training error against model size.

However, if $p$ is large, the number of subsets, $2^p$, can be huge. For example, if $p = 40$, there are over a trillion subsets. We discuss two approaches that address this issue:
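
To make the growth concrete, here is a small sketch that enumerates every subset for three predictors and counts the subsets for $p = 40$. The function name `all_subsets` is only illustrative.

```python
from itertools import combinations

def all_subsets(predictors):
    """Yield every subset of the predictors, from the empty set up to the full set."""
    for size in range(len(predictors) + 1):
        yield from combinations(predictors, size)

subsets = list(all_subsets(["TV", "radio", "newspaper"]))
print(len(subsets))   # 2**3 = 8 candidate models
print(2 ** 40)        # 1099511627776 candidate models for p = 40
```

Enumerating the subsets is cheap for three predictors, but fitting a separate regression for each of the $2^{40}$ subsets is not practical.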




### Forward Selection

- Begin with the null model, a model that contains an intercept but no predictors.

- Fit $p$ simple linear regressions and add to the null model the variable that results in the lowest RSS.

- Add to that model the variable that results in the lowest RSS for the new two-variable model.

- Continue until some stopping rule is satisfied, for example, when all remaining variables have a p-value above a threshold.
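
The steps above can be sketched as follows. This is a minimal version under stated assumptions: it uses simulated data, the helper names `rss` and `forward_selection` are made up for the example, and it stops after a fixed number of variables (`max_vars`) rather than using the p-value threshold mentioned above.

```python
import numpy as np

def rss(X, y, cols):
    """RSS of the least squares fit of y on an intercept plus the chosen columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ beta) ** 2))

def forward_selection(X, y, max_vars):
    """Greedily add the predictor that gives the lowest RSS, up to max_vars predictors."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Simulated data (assumption): only columns 2 and 0 influence the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 2] + 1.5 * X[:, 0] + rng.normal(size=200)

chosen = forward_selection(X, y, max_vars=2)
print(chosen)
```

Each round fits one regression per remaining variable, so the total number of fits grows roughly like $p^2$ instead of $2^p$.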

### Backward Selection

- Start with all variables in the model.

- Remove the variable with the largest p-value, that is, the variable that is the least statistically significant.

- Fit the new $(p-1)$-variable model and again remove the variable with the largest p-value.

- This procedure continues until a stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some threshold.
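
A minimal sketch of this procedure is shown below, with one substitution that should be stated plainly: instead of p-values it uses the absolute t-statistic of each coefficient. Within a single fit all coefficients share the same degrees of freedom, so the variable with the smallest $|t|$ is also the one with the largest p-value. The cutoff `t_threshold=2.0`, the helper names, and the simulated data are assumptions for illustration.

```python
import numpy as np

def t_statistics(X, y, cols):
    """Absolute t-statistic of each chosen column in a least squares fit with intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - len(cols) - 1)          # squared residual standard error
    cov = sigma2 * np.linalg.inv(design.T @ design)       # covariance of the estimates
    return np.abs(beta[1:] / np.sqrt(np.diag(cov)[1:]))   # skip the intercept

def backward_selection(X, y, t_threshold=2.0):
    """Drop the least significant predictor until every |t| exceeds t_threshold."""
    cols = list(range(X.shape[1]))
    while cols:
        t = t_statistics(X, y, cols)
        weakest = int(np.argmin(t))
        if t[weakest] > t_threshold:
            break
        cols.pop(weakest)
    return cols

# Simulated data (assumption): only columns 2 and 0 influence the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 2] + 1.5 * X[:, 0] + rng.normal(size=200)

kept = backward_selection(X, y)
print(kept)
```

Backward selection needs the full model to be estimable, so it cannot be used as written when $p > n$.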

## 3. Model Fit



Two common numerical measures of model fit are:

- **Residual Standard Error (RSE)**: An estimate of the standard deviation of the error term $\epsilon$.
- **$ R^2 $**: The fraction of variance explained by the model.


- In simple regression, $ R^2 $ is the square of the correlation between the response and the predictor.
- In multiple regression, $ R^2 $ is the square of the correlation between the response and the fitted values:

$$
R^2 = \mathrm{Cor}(Y, \hat{Y})^2
$$
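
This identity can be checked numerically. The sketch below fits a least squares model with an intercept on simulated data (the coefficients and sample size are arbitrary assumptions) and compares $1 - RSS/TSS$ with the squared correlation between the response and the fitted values.

```python
import numpy as np

# Simulated data (assumption): two predictors, 100 observations.
rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
y_hat = design @ beta

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - rss / tss                       # fraction of variance explained
cor2 = np.corrcoef(y, y_hat)[0, 1] ** 2    # squared correlation of Y and fitted values

print(r2, cor2)  # the two values agree up to floating-point error
```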

A high $ R^2 $ value (close to 1) indicates that the model explains a large portion of the variance in the response variable.

**Example**:
- **Full Model**: For the Advertising data, regressing sales on TV, radio, and newspaper gives an $ R^2 $ of 0.8972.
- **Reduced Model**: Using only TV and radio gives an $ R^2 $ of 0.89719.

Including newspaper barely increases $ R^2 $, suggesting it adds little to the model. This is consistent with the large p-value for newspaper advertising in the coefficient table.

**Adding Predictors**:
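- $ R^2 $ never decreases when predictors are added, and in practice almost always increases, even if they are only weakly associated with the response. This is because adding a variable can only lower the training RSS.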
- A small increase in $ R^2 $ indicates that the new predictor doesn't add much value.

**Comparisons**:
- **TV Only**: $ R^2 = 0.61 $
- **TV and Radio**: $ R^2 = 0.89719 $

Adding radio substantially improves $ R^2 $, showing that radio is an important predictor.

**Residual Standard Error (RSE)**:
- **TV Only**: $ RSE = 3.26 $
- **TV and Radio**: $ RSE = 1.681 $
- **TV, Radio, and Newspaper**: $ RSE = 1.686 $

Including newspaper doesn't reduce the RSE (it rises slightly, from 1.681 to 1.686), reinforcing that it's not a useful predictor.
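
The RSE can rise even though adding a predictor never raises the RSS, because it is usually computed as $\sqrt{RSS / (n - p - 1)}$ and the denominator shrinks as $p$ grows. The sketch below illustrates this with made-up RSS values and an assumed sample size of 200; the numbers are hypothetical and are not the RSS values of the Advertising fits.

```python
import math

def rse(rss, n, p):
    """Residual standard error: sqrt(RSS / (n - p - 1))."""
    return math.sqrt(rss / (n - p - 1))

# Hypothetical RSS values chosen for illustration, not taken from the text.
n = 200
rse_two = rse(rss=556.9, n=n, p=2)     # smaller model
rse_three = rse(rss=556.8, n=n, p=3)   # larger model with a slightly lower RSS
print(rse_two, rse_three)              # the larger model has the higher RSE
```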

**Graphical Analysis**:
- Plotting data can reveal problems not visible in numerical statistics.
- A 3D plot of TV and radio vs. sales shows non-linear patterns, suggesting interaction effects between TV and radio.

**Conclusion**:
- TV and radio are better predictors of sales than newspaper.
- The model should focus on TV and radio spending to predict sales accurately.

## 4. Predictions




Once we have fit the multiple regression model, predicting the response $Y$ based on the values of the predictors $X_1, X_2, \ldots, X_p$ is straightforward using the equation:

$$ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p $$
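
As a small worked example, the sketch below evaluates this equation with the rounded estimates from the coefficient table in Section 1. The budget values and the assumption that budgets are recorded in thousands of dollars are illustrative, and `predict` is a hypothetical helper name.

```python
def predict(coefficients, x):
    """Return beta_0 + beta_1 * x_1 + ... + beta_p * x_p."""
    intercept, *slopes = coefficients
    return intercept + sum(b * xi for b, xi in zip(slopes, x))

# Rounded estimates from the coefficient table: intercept, TV, radio, newspaper.
beta_hat = [2.939, 0.046, 0.189, -0.001]
# Hypothetical budgets, assumed to be in thousands of dollars: TV, radio, newspaper.
budget = [100.0, 20.0, 0.0]

y_pred = predict(beta_hat, budget)
print(y_pred)
```

Because the table's estimates are rounded, the result only approximates what the fitted model would predict.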

However, there are three types of uncertainty in these predictions:

1. **Inaccuracy in Coefficient Estimates**:
   - The estimated coefficients $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ are not the true population coefficients $\beta_0, \beta_1, \ldots, \beta_p$.
   - This inaccuracy is related to the **reducible error**.
   - We can compute a **confidence interval** to determine how close $\hat{Y}$ is to the true regression function $f(X)$.

2. **Model Bias**:
   - The linear model we assume for $f(X)$ is an approximation of reality, which introduces **model bias**.
   - We estimate the best linear approximation to the true function; here we ignore this discrepancy and proceed as if the linear model were correct.

3. **Random Error ($\epsilon$)**:
   - Even if we knew the true coefficients, predictions cannot be perfect due to random error ($\epsilon$) in the model.
   - This is the **irreducible error**.
   - **Prediction intervals** help quantify this uncertainty and are always wider than confidence intervals because they account for both reducible and irreducible errors.

**Example**:

For the Advertising data:

- **Confidence Interval**:
  - Quantifies uncertainty about the average sales over many cities.
  - Given \$100,000 spent on TV and \$20,000 on radio, the 95% confidence interval for the average sales is $[10,985, 11,528]$.
  - This means 95% of such intervals will contain the true average value of $f(X)$.

- **Prediction Interval**:
  - Quantifies uncertainty about sales for a particular city.
  - Given \$100,000 spent on TV and \$20,000 on radio in a specific city, the 95% prediction interval for sales is $[7,930, 14,580]$.
  - This means 95% of such intervals will contain the true sales value for that city.

**Key Points**:

- Both intervals are centered at approximately 11,256, but the prediction interval is wider due to increased uncertainty about individual sales compared to average sales.
- Confidence intervals are narrower because they only account for the reducible error, while prediction intervals account for both reducible and irreducible errors.
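
The difference between the two intervals can be sketched with the standard least squares formulas. This is an illustration under several assumptions: the data are simulated rather than the Advertising data, the variable names are made up, and the multiplier 1.96 is a normal approximation to the exact t quantile, which is reasonable only when $n$ is large.

```python
import numpy as np

# Simulated data (assumption): two predictors, 200 observations.
rng = np.random.default_rng(3)
n = 200
X = rng.uniform(0, 50, size=(n, 2))
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=1.5, size=n)

design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ beta
sigma2 = resid @ resid / (n - design.shape[1])    # squared RSE, n - p - 1 degrees of freedom

x0 = np.array([1.0, 25.0, 10.0])                  # intercept term plus two predictor values
y0 = x0 @ beta                                    # point prediction, shared by both intervals
leverage = x0 @ np.linalg.inv(design.T @ design) @ x0

z = 1.96                                          # normal approximation to the t quantile
se_mean = np.sqrt(sigma2 * leverage)              # uncertainty in the estimated mean response
se_new = np.sqrt(sigma2 * (1.0 + leverage))       # adds the irreducible error variance

conf_int = (y0 - z * se_mean, y0 + z * se_mean)
pred_int = (y0 - z * se_new, y0 + z * se_new)
print(conf_int, pred_int)
```

The only difference between the two standard errors is the extra `1.0` inside the square root, which represents the variance of $\epsilon$ for a single new observation.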