# Performance Metrics and Feature Selection in Multiple Linear Regression



In this notebook, we'll explore key performance metrics used to evaluate the fit of multiple linear regression models. Although you may be familiar with these concepts from simple linear regression, we'll now extend them to the multiple regression setting, where we have more than one predictor variable.


## 1. **F-statistic**

The F-statistic tests whether **at least one predictor** is significantly related to the response variable in multiple linear regression.

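While simple linear regression tests the significance of a single predictor, in multiple regression we check whether **all predictors combined** explain a significant portion of the variation in the response. Formally, the test compares the null hypothesis $H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$ (no predictor is related to the response) against the alternative that at least one $\beta_j$ is nonzero.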

### Formula:
$$
F = \frac{(TSS - RSS) / p}{RSS / (n - p - 1)}
$$
Where:
- $n$ = number of observations
- $p$ = number of predictors
- $TSS$ = total sum of squares
- $RSS$ = residual sum of squares

The **p-value** of the F-statistic shows whether the predictors, as a group, are significant:
- **Small p-value** → The predictors are significant as a group: at least one of them is related to the response.
- **Large p-value** → There is no evidence that any of the predictors is related to the response.
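
The formula above translates directly into a few lines of Python. The sketch below is illustrative only: the function name `f_statistic` and the input values are hypothetical and are not taken from the Advertising data.

```python
def f_statistic(tss, rss, n, p):
    """Overall F-statistic for a linear model with p predictors and n observations."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 observations")
    explained_per_predictor = (tss - rss) / p   # numerator: (TSS - RSS) / p
    residual_variance = rss / (n - p - 1)       # denominator: RSS / (n - p - 1)
    return explained_per_predictor / residual_variance


# Hypothetical sums of squares, chosen only to make the arithmetic easy to follow.
f_value = f_statistic(tss=100.0, rss=20.0, n=23, p=2)
print(f_value)  # (80 / 2) / (20 / 20) = 40.0
```

To turn an F value into a p-value, it is compared with an F distribution with $p$ and $n - p - 1$ degrees of freedom, for example with `scipy.stats.f.sf`.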


### Key Points:
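- A **large F-statistic** (well above 1) suggests that at least one predictor is significant. When no predictor is related to the response, the F-statistic is expected to be close to 1.
- How large is "large enough" depends on $n$ and $p$. For larger datasets, an F-statistic only slightly above 1 can already be significant. In smaller datasets, a larger F-statistic is needed to indicate significance.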
  

### Usage:

Although it is mainly used to test the overall significance of the model, an F-test can also check whether adding features significantly improves the model. To do this, compare two nested models (one uses a subset of the other's predictors) with a partial F-test on the reduction in RSS. Comparing the overall F-statistics of two models directly is less reliable, because the overall F-statistic also depends on the number of predictors in each model.
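
One common way to test whether extra features help is a partial F-test between two nested models. The sketch below assumes the smaller model's predictors are a subset of the larger model's. The function name `partial_f_statistic` and the RSS values are hypothetical.

```python
def partial_f_statistic(rss_reduced, rss_full, n, p_full, q):
    """F-statistic for testing whether q extra predictors improve a nested model.

    rss_reduced: RSS of the model without the q extra predictors
    rss_full:    RSS of the model with all p_full predictors
    """
    if rss_full > rss_reduced:
        raise ValueError("the full model cannot have a larger RSS than a nested model")
    gain_per_predictor = (rss_reduced - rss_full) / q
    residual_variance = rss_full / (n - p_full - 1)
    return gain_per_predictor / residual_variance


# Hypothetical example: adding 1 predictor to a 2-predictor model, n = 104.
f_partial = partial_f_statistic(rss_reduced=220.0, rss_full=200.0, n=104, p_full=3, q=1)
print(f_partial)  # (20 / 1) / (200 / 100) = 10.0
```

The resulting value is compared with an F distribution with $q$ and $n - p - 1$ degrees of freedom, where $p$ is the number of predictors in the full model.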



__Example 1:__

Let’s consider the advertising data again. We perform a multiple linear regression of sales on TV, radio, and newspaper budgets and compute the F-statistic to test whether at least one of the predictors is related to sales.


| Coefficient | Estimate | Std. Error | p-Value |
|-------------|----------|------------|---------|
| Intercept    | 2.939    | 0.3119     | <0.0001 |
| TV           | 0.046    | 0.0014     | <0.0001 |
| Radio        | 0.189    | 0.0086     | <0.0001 |
| Newspaper    | -0.001   | 0.0059     | 0.8599  |

| Quantity                 | Value  |
|--------------------------|--------|
| F-statistic              | 570    |
| Residual standard error  | 1.69   |
| R²                       | 0.897  |


__Interpretation__

- F-statistic = 570: Since this is far larger than 1, it provides compelling evidence against the null hypothesis $H_0$ that none of the predictors is related to sales. This suggests that at least one of the predictors is significantly related to sales. Which one do you think, and why?

- p-value of the F-statistic: If it is very small (close to 0), it provides strong evidence against the null hypothesis, indicating that the predictors collectively have a significant relationship with the response. Note that this is a single p-value for the whole model, distinct from the per-coefficient p-values in the table above.
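
As a sanity check, the reported F-statistic can be recovered from the reported $R^2$: since $TSS - RSS = R^2 \cdot TSS$ and $RSS = (1 - R^2) \cdot TSS$, the TSS cancels in the formula. The sketch below assumes $n = 200$ observations, which is the size of the widely used Advertising dataset. The text above does not state $n$, so treat it as an assumption. It uses the more precise $R^2 = 0.8972$ quoted in Example 2.

```python
def f_from_r2(r2, n, p):
    """Overall F-statistic written in terms of R^2 (TSS cancels out)."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))


n_obs = 200  # assumption: number of rows in the Advertising data
f_example = f_from_r2(r2=0.8972, n=n_obs, p=3)
print(round(f_example))  # prints 570, matching the table
```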


## 2. **Adjusted $R^2$ (R-squared)**

In linear regression, $R^2$ quantifies how well **all your features combined** explain the variability in the response variable. In multiple regression, $R^2$ is the square of the correlation between the response and the predicted values:

$$
R^2 = \text{Cor}(Y, \hat{Y})^2
$$

- A high $ R^2 $ value (close to 1) indicates that the model explains a large portion of the variance in the response variable.

- $R^2 = 1$ means that your predictors explain 100% of the variance in the response variable (a perfect fit).
- $R^2 = 0$ means that the predictors explain none of the variance, implying the model has no predictive power.
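
The identity between $R^2$ and the squared correlation of $Y$ and $\hat{Y}$ can be checked numerically. The sketch below uses synthetic data (not the Advertising data), NumPy's least-squares solver, and the standard definition $R^2 = 1 - RSS/TSS$. All variable names and coefficients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 3

# Synthetic predictors and response; the true coefficients are arbitrary.
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=n)

# Fit by ordinary least squares with an intercept column.
X_design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ beta

rss = np.sum((y - y_hat) ** 2)     # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2_from_sums = 1 - rss / tss
r2_from_corr = np.corrcoef(y, y_hat)[0, 1] ** 2

print(r2_from_sums, r2_from_corr)  # the two values agree up to rounding error
```

The agreement relies on the model including an intercept; without one, the two quantities can differ.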


Note that **$R^2$ never decreases** when you add more predictors (in practice it almost always increases), even if those predictors don’t actually improve the model. For that reason, we often also look at **Adjusted $R^2$**, which penalizes models that add too many predictors without improving fit.


### Formula:
$$
\text{Adjusted } R^2 = 1 - \left( \frac{(1 - R^2)(n - 1)}{n - p - 1} \right)
$$
Where:
- $n$ = number of observations
- $p$ = number of predictors
- $R^2$ = regular R² value

### Key Points:
- **Adjusted R²** is never higher than R², and unlike R² it can decrease when an irrelevant predictor is added.
- It is useful for comparing models with different numbers of predictors.
- A higher Adjusted R² indicates a model with better performance, accounting for model complexity.


### Usage:

This metric is well suited to comparing models with different numbers of predictors. Unlike regular $R^2$, it accounts for model complexity, so a higher Adjusted $R^2$ indicates a model that explains more variability after penalizing unnecessary predictors. When choosing between two candidate models, compare their Adjusted $R^2$ values: the model with the higher value is likely better at balancing fit and complexity.

**Example 2**:
- **Full Model**: For the Advertising data, regressing sales on TV, radio, and newspaper gives an $R^2$ of 0.8972.
- **Reduced Model**: Using only TV and radio gives an $R^2$ of 0.89719.

Including newspaper barely increases $R^2$ (from 0.89719 to 0.8972), suggesting it doesn't meaningfully improve the model. This is consistent with the non-significant p-value for newspaper advertising.
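
Applying the Adjusted $R^2$ formula to these two models makes the comparison concrete. The sketch below assumes $n = 200$ observations (the size of the widely used Advertising dataset, which is not stated above) and uses the rounded $R^2$ values from this example, so the results are approximate.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors fitted on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


n_obs = 200  # assumption: number of rows in the Advertising data

adj_full = adjusted_r2(r2=0.8972, n=n_obs, p=3)      # TV, radio, newspaper
adj_reduced = adjusted_r2(r2=0.89719, n=n_obs, p=2)  # TV, radio

print(f"full:    {adj_full:.4f}")
print(f"reduced: {adj_reduced:.4f}")
```

Under this assumption the reduced model comes out slightly ahead (roughly 0.8961 versus 0.8956), even though its plain $R^2$ is marginally lower.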

**Adding Predictors**:
- $R^2$ never decreases when you add more predictors, even if they're only weakly associated with the response.
- A small increase in $R^2$ indicates that the new predictor doesn't add much value.

**Comparisons**:
- **TV Only**: $R^2 = 0.61$
- **TV and Radio**: $R^2 = 0.89719$

Adding radio substantially improves $R^2$, showing that radio is an important predictor.

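**Residual Standard Error (RSE)**:
- **TV Only**: $RSE = 3.26$
- **TV and Radio**: $RSE = 1.681$
- **TV, Radio, and Newspaper**: $RSE = 1.686$

Including newspaper doesn't reduce RSE; it rises slightly, from 1.681 to 1.686. This can happen because $RSE = \sqrt{RSS / (n - p - 1)}$: an extra predictor shrinks the denominator, so a marginal drop in RSS is not enough to lower the RSE. This reinforces that newspaper is not a useful predictor.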
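
The direction of this RSE change can be reproduced from the $R^2$ values quoted in Example 2. Using the standard definitions $RSE = \sqrt{RSS / (n - p - 1)}$ and $RSS = (1 - R^2) \cdot TSS$, the ratio of two RSE values does not depend on TSS. The sketch below assumes $n = 200$ observations (not stated above), and the helper name `rse_ratio` is hypothetical.

```python
import math


def rse_ratio(r2_small, p_small, r2_big, p_big, n):
    """Return RSE(bigger model) / RSE(smaller model).

    Uses RSS = (1 - R^2) * TSS, so TSS cancels in the ratio.
    """
    var_small = (1 - r2_small) / (n - p_small - 1)
    var_big = (1 - r2_big) / (n - p_big - 1)
    return math.sqrt(var_big / var_small)


n_obs = 200  # assumption: number of rows in the Advertising data
ratio = rse_ratio(r2_small=0.89719, p_small=2, r2_big=0.8972, p_big=3, n=n_obs)
print(f"{ratio:.4f}")  # slightly above 1: adding newspaper raises the RSE
```

A ratio just above 1 is in line with the reported move from 1.681 to 1.686.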

**Conclusion**:
- TV and radio are better predictors of sales than newspaper.
- The model should focus on TV and radio spending to predict sales.

### Exercise: Analyzing the Impact of Advertising Budgets on Sales

**Objective**: You will perform regression analysis and interpret the impact of different advertising budgets (TV, Newspaper, and Radio) on Sales. __Your task is to understand how different types of advertising contribute to overall sales and to interpret the results of the regression analysis.__




Use the provided dataset to answer the following questions.

#### Questions:

1. **Interpreting R-Squared**:
   - What is the **R-squared** value from your regression analysis?
   - What does this value tell you about how well the advertising budgets (TV, Newspaper, Radio) explain the variability in Sales? Is it a strong or weak fit?

2. **Significance of TV Budget**:
   - What is the coefficient for the **TV budget** in your regression output?
   - Is this coefficient statistically significant? Use the **p-value** to support your answer.
   - Interpret the coefficient: If the TV budget increases by 1 unit, what is the expected change in Sales?

3. **Impact of Newspaper Budget**:
   - What is the coefficient for the **Newspaper budget**?
   - Is the coefficient statistically significant? What does the p-value indicate?
   - How would you interpret the effect of the Newspaper budget on Sales? Does increasing Newspaper spending have a positive or negative effect?

4. **Impact of Radio Budget**:
   - What is the coefficient for the **Radio budget**?
   - Is the effect of Radio on Sales statistically significant? What does the p-value suggest?
   - Does increasing Radio advertising lead to higher Sales based on this data?


5. **Interpretation of the Intercept**:
   - How do you read and interpret the intercept?

6. **Model Summary**:
   - Based on your analysis, which advertising medium (TV, Newspaper, or Radio) has the **strongest influence** on Sales? Which has the weakest?
   - Given these results, if you were managing an advertising budget, where would you recommend focusing more resources to maximize Sales? Explain your reasoning.


7. **Adjusted R-Squared**:
   - How does the **adjusted R-squared** differ from the **R-squared** value? Why is this adjustment important in the context of multiple regression with multiple independent variables?

8. **Further Analysis**:
   - Can you think of any external factors not included in this dataset that could also influence Sales?
   - How could you improve this model to better predict Sales?


#### Instructions:

1. Start by exploring and preprocessing your dataset.


(__Note:__ That means you should briefly describe what each row and each column represents, give an overview of your dataset, check for missing values and decide how to handle them, and identify your features and target variable.)


2. Perform a **multiple linear regression**, print the regression summary, and interpret it.


(__Note:__ That means you should identify your features and target variable, fit the regression hyperplane, and print the regression summary (coefficients, R-squared values, p-values, etc.).)
