# Performance Metrics: R Squared and Adjusted R Squared

## Formulas

* `R² = 1 - (SS_residual / SS_total)`
* `R² = 1 - sum((y_i - y_hat_i)^2) / sum((y_i - y_avg)^2)`
* `R²_adjusted = 1 - (1 - R²) * ((N - 1) / (N - P - 1))`

## Summary

* **Performance metrics** such as **R Squared ($R^2$)** and **Adjusted R Squared** are essential for determining the quality of a linear regression model.
* **R Squared** measures the proportion of variance in the target that the model explains, by comparing the sum of squared residuals (prediction errors) against the total sum of squares (squared deviations from the mean).
* A major limitation of **R Squared** is that, on the training data, it never decreases when a new feature is added, even if that feature is **uncorrelated** with or irrelevant to the output.
* **Adjusted R Squared** addresses this by penalizing every added feature: the score drops unless the new feature raises $R^2$ enough to offset the penalty.

---

## R Squared ($R^2$)

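**R Squared** (the coefficient of determination) is a statistical metric that measures how closely the data follow the fitted regression line, expressed as the proportion of the variance in the output that the model explains. It is defined by the following formula: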

$$
R^2 = 1 - \frac{SS_{residual}}{SS_{total}}
$$

### Understanding the Components

1. **Sum of Squares Residual ($SS_{residual}$)**:
   * This represents the sum of the squared differences between the **true output** ($y_i$) and the **predicted output** ($\hat{y}_i$).
   * Visually, each term is the vertical distance between an actual data point and the **best fit line**.
   * Formula:
     $$
     \sum (y_i - \hat{y}_i)^2
     $$

2. **Sum of Squares Total ($SS_{total}$)**:
   * This represents the sum of the squared differences between the **true output** ($y_i$) and the **average of all true outputs** ($\bar{y}$).
   * Visually, this compares the data points to a simple average line (mean) rather than the best fit line.
   * Formula:
     $$
     \sum (y_i - \bar{y})^2
     $$
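
The two sums above can be computed directly. The following is a minimal sketch using NumPy; the function name `r_squared` and the toy values are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return R^2 = 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_residual = np.sum((y_true - y_pred) ** 2)      # squared distances to the fitted line
    ss_total = np.sum((y_true - y_true.mean()) ** 2)  # squared distances to the mean line
    return 1.0 - ss_residual / ss_total

# Hypothetical toy data: true outputs and predictions from some fitted line.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.5, 8.5]

# SS_residual = 4 * 0.25 = 1 and SS_total = 9 + 1 + 1 + 9 = 20, so R^2 is about 0.95.
print(r_squared(y_true, y_pred))
```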

### Interpretation

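* Since the **best fit line** minimizes squared error and the mean line is one of the lines it could have chosen, $SS_{residual}$ is never larger than $SS_{total}$ on the training data (for least squares with an intercept).
* The ratio $SS_{residual} / SS_{total}$ therefore lies between 0 and 1, and so does $R^2$; the smaller the residuals relative to the total variation, the closer $R^2$ is to 1.
* **Variance explained**: An $R^2$ value of 0.70 means the model explains **70% of the variance** in the output, while 0.90 means **90%**. It is not a percentage of correct predictions.
* The closer the value is to 1, the **better the fit** to the data.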
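
To make the scale concrete, the sketch below evaluates three reference cases on hypothetical data: perfect predictions, predicting the mean everywhere, and predictions that are worse than the mean. The helper `r_squared` and the values are assumptions for illustration. The last case shows that $R^2$ can go negative when predictions do not come from a least-squares fit on the same data.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return R^2 = 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_residual = np.sum((y_true - y_pred) ** 2)
    ss_total = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_residual / ss_total

y = np.array([3.0, 5.0, 7.0, 9.0])  # hypothetical true outputs

r2_perfect = r_squared(y, y)                       # SS_residual = 0, so R^2 = 1
r2_mean = r_squared(y, np.full_like(y, y.mean()))  # SS_residual = SS_total, so R^2 = 0
r2_bad = r_squared(y, y[::-1])                     # reversed predictions: SS_residual = 80, SS_total = 20

print(r2_perfect, r2_mean, r2_bad)
```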

## The Problem with R Squared

While **R Squared** is a useful metric, it has a significant flaw when dealing with **multiple linear regression**.

### Scenario: Adding Correlated Features

* A feature like **Size of House** is positively correlated with **Price**, so a model that uses it alone might yield an $R^2$ of **75%**.
* Adding another relevant feature, such as **Number of Bedrooms**, typically increases the $R^2$ (e.g., to **80%**) because it provides more predictive power.

### Scenario: Adding Uncorrelated Features

* Adding an irrelevant feature, such as **Gender**, which has **no real relationship** with house price, still leaves the training $R^2$ unchanged or higher (e.g., **87%**).
* **Mathematical Reason**: Least squares can always assign the new feature a coefficient of zero and reproduce the previous fit, so $SS_{residual}$ cannot go up. In a finite sample, a noise feature almost always has some chance correlation with the output, which lowers $SS_{residual}$ slightly.
* **Conclusion**: $R^2$ on its own is unreliable for feature selection because it does not penalize useless variables.
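
This behavior can be checked numerically. The sketch below fits ordinary least squares with `numpy.linalg.lstsq` on synthetic data, once with a single informative feature and once with an added pure-noise column. The data-generating numbers, the helper `fit_r_squared`, and the variable names are all assumptions made for illustration.

```python
import numpy as np

def fit_r_squared(X, y):
    """Fit ordinary least squares with an intercept and return the training R^2."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    residuals = y - A @ coef
    ss_residual = np.sum(residuals ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_residual / ss_total

rng = np.random.default_rng(0)
n = 50
size = rng.uniform(50, 250, n)                  # hypothetical house sizes
price = 1000 * size + rng.normal(0, 20000, n)   # price depends only on size
noise_feature = rng.normal(size=n)              # unrelated to price by construction

r2_one = fit_r_squared(size.reshape(-1, 1), price)
r2_two = fit_r_squared(np.column_stack([size, noise_feature]), price)

# The second value is never below the first (up to floating-point rounding),
# even though the added column carries no information about price.
print(r2_one, r2_two)
```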

## Adjusted R Squared

**Adjusted R Squared** is an improved metric that accounts for the number of predictors in the model. It penalizes the model for adding features that do not improve performance meaningfully.

### Formula

$$
R^2_{adjusted} = 1 - (1 - R^2)\frac{N - 1}{N - P - 1}
$$

### Variables

* **$R^2$**: The R Squared value of the model
* **$N$**: Number of data points
* **$P$**: Number of independent features (predictors)
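
As a minimal sketch, the formula translates into a few lines of Python. The function name `adjusted_r_squared` and the sample numbers are hypothetical; the guard reflects that the formula is undefined unless $N > P + 1$.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n data points and p predictors (intercept not counted)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for adjusted R^2 to be defined")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical model: R^2 = 0.80 with 50 data points and 2 predictors.
# 1 - 0.20 * 49 / 47 is roughly 0.7915, slightly below the raw R^2.
print(adjusted_r_squared(0.80, n=50, p=2))
```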

### Behavior Comparison

1. **Adding an Irrelevant Feature**:
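   * Increasing $P$ decreases the denominator $(N - P - 1)$, which makes the factor $\frac{N - 1}{N - P - 1}$ larger.
   * If $R^2$ barely increases, $(1 - R^2)$ stays almost the same, the subtracted term grows, and Adjusted $R^2$ **decreases**.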

2. **Adding a Relevant Feature**:
   * A strong increase in $R^2$ shrinks $(1 - R^2)$ by more than the factor $\frac{N - 1}{N - P - 1}$ grows, outweighing the penalty for increasing $P$.
   * Adjusted $R^2$ **increases**, correctly reflecting improved model quality.

This mechanism ensures that **Adjusted R Squared** increases only when a new feature raises $R^2$ by more than the penalty for the extra predictor.
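
A worked example with hypothetical numbers illustrates both cases. Assume $N = 50$ data points and a starting model with $P = 2$ and $R^2 = 0.80$:

$$
R^2_{adjusted} = 1 - (1 - 0.80)\frac{49}{47} \approx 0.7915
$$

Suppose a third, irrelevant feature nudges $R^2$ up to $0.801$. The penalty dominates and the adjusted score falls:

$$
R^2_{adjusted} = 1 - (1 - 0.801)\frac{49}{46} \approx 0.7880
$$

Suppose instead the third feature is relevant and lifts $R^2$ to $0.85$. The gain outweighs the penalty and the adjusted score rises:

$$
R^2_{adjusted} = 1 - (1 - 0.85)\frac{49}{46} \approx 0.8402
$$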
