# $R^2$ and Adjusted $R^2$

## Prerequisites


- Performance metrics of Linear Regression.

## Learning Objectives

After reading this notebook, students should be able to:

- Understand $R^2$ and compute it on a real-world dataset.

- Understand the pitfalls of $R^2$ and motivation for adjusted $R^2$ denoted by $\bar{R^2}$.

- Compare and contrast the results from $R^2$ and $\bar{R^2}$.

##  $R^2$ or Coefficient of Determination

We have already covered several performance metrics in the unit 0 contents: Residual Sum of Squares (RSS), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). **All of these metrics share one weakness: it is unclear what counts as a good value.** What is a good value for RSS? There is no definitive answer, and the same holds for the other metrics, because none of them is confined to a fixed range: their values depend on the scale and units of the target variable. Metrics like RSS, MSE, RMSE and MAE are therefore not quite enough on their own to judge how well a model is performing. We need a metric with a fixed reference point. This is where $R^2$ comes into play.
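To make the scale problem concrete, the sketch below scores the same predictions twice, once in thousands of units and once in single units. The numbers are made up for illustration and are not taken from any dataset used in this notebook.

```python
import numpy as np

# Hypothetical targets and predictions, expressed in thousands of units.
y_true_k = np.array([22.1, 10.4, 9.3, 18.5, 12.9])
y_pred_k = np.array([20.0, 11.0, 10.0, 17.0, 14.0])

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

# The very same data, now expressed in single units.
y_true_u = y_true_k * 1000
y_pred_u = y_pred_k * 1000

mse_k, mse_u = mse(y_true_k, y_pred_k), mse(y_true_u, y_pred_u)
rmse_k, rmse_u = rmse(y_true_k, y_pred_k), rmse(y_true_u, y_pred_u)
mae_k, mae_u = mae(y_true_k, y_pred_k), mae(y_true_u, y_pred_u)

print(f"MSE : {mse_k:.3f} (thousands) vs {mse_u:.1f} (units)")
print(f"RMSE: {rmse_k:.3f} (thousands) vs {rmse_u:.1f} (units)")
print(f"MAE : {mae_k:.3f} (thousands) vs {mae_u:.1f} (units)")
```

The model has not changed, yet MSE grows by a factor of a million and RMSE and MAE by a factor of a thousand, so none of these numbers can be called good or bad without knowing the units.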


The $R^{2}$ value measures how much of the variation in one variable can be explained by variation in another. For instance, a person's height can be useful for explaining their weight. It indicates how strong a linear relationship is. For a least-squares fit with an intercept, evaluated on the data it was fitted to, it always lies in the range 0.0 to 1.0. Note that the $R^{2}$ value is not an error measure; larger $R^{2}$ values are generally better.

Let's be more specific in terms of linear regression. $R^{2}$ is the proportion of the variance in the dependent variable $y$ that is explainable by the independent input variables, the $x$'s. We calculate $R^2$ as follows:

Suppose we have $n$ observed values of the output variable, $\mathbf{y} = (y_1, y_2, \dots, y_n)$, and the corresponding predicted values, $\hat{\mathbf{y}} = (\hat{y_1}, \hat{y_2}, \dots, \hat{y_n})$. The errors, or residuals, are denoted by $\mathbf{\epsilon} = (\epsilon_1, \epsilon_2, \dots, \epsilon_n)$, where $\epsilon_i = y_i - \hat{y_i}$.


The mean of the observed output values is:

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

Now, we describe the variability in the data using three _sums of squares_:

- The total variation in the observed data (total sum of squares):

$$SS_{tot} = \sum_{i=1}^{n} (y_i-\bar{y})^2 \tag{1}$$

- The variation captured by the fitted regression line (regression sum of squares):

$$SS_{reg} = \sum_{i=1}^{n} (\hat{y_i}-\bar{y})^2 \tag{2}$$

- The Residual Sum of Squares, or Sum of Squared Errors:

$$SS_{res} = \sum_{i=1}^{n} ({y_i}-\hat{y_i})^2 \tag{3}$$


Now, the coefficient of determination, $R^2$, is calculated as:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \tag{4}$$
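Equations (1), (3) and (4) translate almost line by line into NumPy. The following is a minimal sketch; the function name `r2_from_sums` and the four data points are hypothetical and chosen so that the arithmetic is easy to follow by hand.

```python
import numpy as np

def r2_from_sums(y, y_hat):
    """Compute R^2 from observed values y and predictions y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    y_bar = y.mean()                      # mean of the observations
    ss_tot = np.sum((y - y_bar) ** 2)     # eq. (1): total sum of squares
    ss_res = np.sum((y - y_hat) ** 2)     # eq. (3): residual sum of squares
    return 1.0 - ss_res / ss_tot          # eq. (4)

# Hypothetical observations and predictions.
y_obs = [3.0, 5.0, 7.0, 9.0]     # mean is 6, so SS_tot = 9 + 1 + 1 + 9 = 20
y_fit = [2.5, 5.5, 7.5, 8.5]     # every residual is +/-0.5, so SS_res = 1

r2_value = r2_from_sums(y_obs, y_fit)
print(r2_value)   # approximately 0.95, i.e. 1 - 1/20
```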


If you take a look at eq. $(4)$, you can see that $R^2$ actually compares two models: the regression model and a naive model that always predicts the average value of $y$. In eq. $(4)$, $SS_{res}$ is the sum of squared residuals, which measures the squared distance between the regression model's predictions and the actual values. Similarly, $SS_{tot}$ is the total sum of squares, which measures the squared distance between the actual values and their average and is proportional to the variance of $y$. This is equivalent to the sum of squared residuals of a model that always predicts the average.

If the regression model’s predictions are better than the average predictions, that is, $SS_{res} < SS_{tot}$, then $R^2 > 0$, and $R^2 \rightarrow 1$ as $SS_{res} \rightarrow 0$. If the regression model’s predictions are worse than the average predictions, that is, $SS_{res} > SS_{tot}$, then $R^2 < 0$.






For a least-squares fit that includes an intercept, equations ($1$), ($2$) and ($3$) are related by:

$$SS_{tot} = SS_{reg} + SS_{res}$$

So, in that case, we can also write $R^2$ as:

$$R^2 = \frac{SS_{reg}}{SS_{tot}}$$
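The decomposition can be checked numerically. The sketch below uses synthetic data (an assumption made purely for illustration) and `np.polyfit` to obtain an ordinary least-squares line with an intercept, then repeats the check for an arbitrary line that was not fitted by least squares.

```python
import numpy as np

def sums_of_squares(y, y_hat):
    """Return (SS_tot, SS_reg, SS_res) as defined in eqs. (1)-(3)."""
    y_bar = y.mean()
    ss_tot = np.sum((y - y_bar) ** 2)
    ss_reg = np.sum((y_hat - y_bar) ** 2)
    ss_res = np.sum((y - y_hat) ** 2)
    return ss_tot, ss_reg, ss_res

# Synthetic data: a noisy straight line.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=50)

# Ordinary least squares with an intercept (polyfit returns [slope, intercept]).
slope, intercept = np.polyfit(x, y, deg=1)
tot, reg, res = sums_of_squares(y, intercept + slope * x)
holds_for_ols = np.isclose(tot, reg + res)

# An arbitrary line that was NOT obtained by least squares.
tot_o, reg_o, res_o = sums_of_squares(y, 1.0 * x)
holds_for_other = np.isclose(tot_o, reg_o + res_o)

print("OLS with intercept:", holds_for_ols)
print("Arbitrary line:    ", holds_for_other)
```

This is why equation (4) is the safer general definition: it remains meaningful for any set of predictions, whereas $SS_{reg}/SS_{tot}$ agrees with it only when the decomposition holds.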

But we will use equation ($4$) as the standard formula for $R^2$. From this formula, it is clear that if every prediction equals the corresponding observation, $SS_{res}$ is $0$ and $R^2$ is $1$. This is the best possible case.

Similarly, if every prediction equals the mean of the observations ($\bar{y}$), then $SS_{res}$ is equal to $SS_{tot}$ and $R^2 = 0$. Negative values of $R^2$ are also possible, when the model fits the data worse than always predicting the mean.

Next, we will compute the $R^2$ score on a real-world dataset and interpret the results.
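The three cases (perfect predictions, mean predictions, and predictions worse than the mean) can be reproduced on a tiny hand-made example. The data and the helper name `r2` below are hypothetical.

```python
import numpy as np

def r2(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot, as in eq. (4)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # SS_tot = 4 + 1 + 0 + 1 + 4 = 10

r2_perfect = r2(y, y.copy())                  # SS_res = 0   -> R^2 = 1
r2_mean = r2(y, np.full_like(y, y.mean()))    # SS_res = 10  -> R^2 = 0
r2_reversed = r2(y, y[::-1])                  # SS_res = 40  -> R^2 = -3

print(r2_perfect, r2_mean, r2_reversed)
```

The reversed predictions are systematically wrong, so their squared error is four times that of the mean-only baseline and the score drops well below zero.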

### Implementation

Scikit-Learn's [metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) package has several metrics for both [classification](https://scikit-learn.org/stable/modules/classes.html#classification-metrics) and [regression](https://scikit-learn.org/stable/modules/classes.html#regression-metrics). The [`r2_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score) function computes the $R^{2}$ value.
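Before applying `r2_score` to real data, here is a small usage sketch on made-up numbers. It also shows that the argument order matters: the first argument is treated as the ground truth, so swapping the two changes the result.

```python
from sklearn.metrics import r2_score

# Hypothetical observations and predictions.
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.5, 7.5, 8.5]

# Correct order: r2_score(y_true, y_pred).
score = r2_score(y_true_demo, y_pred_demo)

# Swapped order: SS_tot is now computed from the predictions instead.
swapped = r2_score(y_pred_demo, y_true_demo)

print(f"r2_score(y_true, y_pred) = {score:.5f}")
print(f"r2_score(y_pred, y_true) = {swapped:.5f}")
```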

### Imports

In [None]:
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, median_absolute_error, r2_score

from IPython.display import display, HTML

### Advertising Dataset

A popular introductory statistics book, [An Introduction to Statistical Learning](https://www.statlearning.com), provides several [datasets](https://www.statlearning.com/resources-second-edition) on their website. We are going to be using the advertising dataset for our next example. This dataset can be downloaded from the following address:
  * https://www.statlearning.com/s/Advertising.csv

In [None]:
data_path = "https://www.statlearning.com/s/Advertising.csv"

# Read the CSV data from the link
data_df = pd.read_csv(data_path,index_col=0)

# Print out first 5 samples from the DataFrame
data_df.head()

| | TV | radio | newspaper | sales |
|---|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 5 | 180.8 | 10.8 | 58.4 | 12.9 |


In [None]:
results = {

    "R Squared":list()
}
linear_regression = LinearRegression()
y_true = data_df[["sales"]]

############
# TV
############
linear_regression.fit(data_df[["TV"]], data_df[["sales"]])

y_pred = linear_regression.predict( data_df[["TV"]] )
results["R Squared"].append( r2_score(y_true, y_pred) )

############
# RADIO
############
linear_regression.fit(data_df[["radio"]], data_df[["sales"]])

y_pred = linear_regression.predict( data_df[["radio"]] )
results["R Squared"].append( r2_score(y_true, y_pred) )

############
# newspaper
############
linear_regression.fit(data_df[["newspaper"]], data_df[["sales"]])

y_pred = linear_regression.predict( data_df[["newspaper"]] )
results["R Squared"].append( r2_score(y_true, y_pred) )

############
# TV, radio
############
linear_regression.fit(data_df[["TV", "radio"]], data_df[["sales"]])

y_pred = linear_regression.predict( data_df[["TV", "radio"]] )
results["R Squared"].append( r2_score(y_true, y_pred) )

############
# TV, radio, newspaper
############
linear_regression.fit(data_df[["TV", "radio", "newspaper"]], data_df[["sales"]])

y_pred = linear_regression.predict( data_df[["TV", "radio", "newspaper"]] )
results["R Squared"].append( r2_score(y_true, y_pred) )
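The five fit-and-score steps above follow one pattern, so they could also be written as a loop. The sketch below is one possible way to do that; the helper name `r2_for_feature_sets` is hypothetical, and it is demonstrated on a small synthetic DataFrame so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def r2_for_feature_sets(df, feature_sets, target):
    """Fit one linear regression per feature set and return its training R^2."""
    scores = {}
    for features in feature_sets:
        model = LinearRegression().fit(df[features], df[target])
        y_pred = model.predict(df[features])
        scores[" + ".join(features)] = r2_score(df[target], y_pred)
    return scores

# Synthetic stand-in data (not the Advertising dataset).
rng = np.random.default_rng(42)
demo_df = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
demo_df["target"] = (
    2.0 * demo_df["a"] + 0.5 * demo_df["b"] + rng.normal(scale=0.1, size=100)
)

demo_scores = r2_for_feature_sets(demo_df, [["a"], ["b"], ["a", "b"]], "target")
print(pd.DataFrame({"R Squared": demo_scores}).transpose())
```

With the Advertising data, the same helper could be called with `data_df`, the five feature lists used above, and `"sales"` as the target.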

### Interpreting Results


In [None]:
index = ["TV", "radio", "newspaper", "TV + radio", "TV + radio + newspaper"]
r2_df = pd.DataFrame(results, index=index).transpose()
display(r2_df)

| | TV | radio | newspaper | TV + radio | TV + radio + newspaper |
|---|---|---|---|---|---|
| R Squared | 0.611875 | 0.332032 | 0.05212 | 0.897194 | 0.897211 |



Among the three simple linear regression models, the $R^{2}$ values show that the TV model fits best. The multiple linear regression model that uses the TV and radio features performs considerably better than any of the three simple models. The model that uses the TV, radio, and newspaper features has the highest $R^2$ score of all, but only by a negligible margin (0.897211 versus 0.897194). On its own, the newspaper feature explains very little of the target variable sales ($R^2 \approx 0.05$), so this tiny gain more likely reflects the model fitting noise (overfitting) than a real improvement, and $R^2$ alone does not justify preferring this model.

From the above results, we can see that $R^2$ does not decrease as features are added. For a least-squares fit scored on its own training data, adding a feature can only keep $R^2$ the same or raise it, so the model with the most features will always have the largest (or equal largest) $R^2$. But adding more features to a linear regression model is not always good: it raises the risk of overfitting and increases the computational cost. So selecting the model with the highest $R^2$ is not always a good idea. We need a measure that accounts for the number of features. For that, we have adjusted $R^2$.
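This behaviour is easy to reproduce. The sketch below uses synthetic data (an assumption for illustration only): the target depends on a single feature, and we then append columns of pure random noise one at a time, refitting by least squares each time.

```python
import numpy as np

def ols_r2(X, y):
    """Training R^2 of a least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])      # prepend intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)   # least-squares solution
    y_hat = X1 @ coef
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n_samples = 200
X = rng.normal(size=(n_samples, 1))                  # one informative feature
y = 1.5 * X[:, 0] + rng.normal(size=n_samples)

r2_values = [ols_r2(X, y)]
for _ in range(5):
    # Append a feature that has no relationship with y at all.
    X = np.column_stack([X, rng.normal(size=n_samples)])
    r2_values.append(ols_r2(X, y))

print(np.round(r2_values, 4))
```

Each printed value is at least as large as the one before it, even though none of the added columns carries any information about the target.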

## Adjusted $R^2$ or Adjusted Coefficient of Determination

Adjusted $R^2$ is interpreted in much the same way as $R^2$, with one difference: it penalises the score for every feature added to the model. Adjusted $R^2$ is denoted by $\bar{R^2}$, which is pronounced "R bar squared". It takes into account the number of explanatory variables, denoted by $d$, relative to the number of observations or data points, $n$. Adjusted $R^2$ can be calculated as:

$$\bar{R^2} = 1-\frac{SS_{res} / {(n-d-1)}}{SS_{tot} / {(n-1)}}$$

Equivalently, we can write $\bar{R^2}$ in terms of $R^2$ as:

$$\bar{R^2} = 1- (1-R^2)\frac{n-1}{n-d-1}$$
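Both forms can be checked against each other in a few lines. The function names and the small numbers below are hypothetical and chosen so that the result can be verified by hand.

```python
def adjusted_r2_from_sums(ss_res, ss_tot, n, d):
    """Adjusted R^2 from the sums of squares (first form)."""
    return 1.0 - (ss_res / (n - d - 1)) / (ss_tot / (n - 1))

def adjusted_r2_from_r2(r2, n, d):
    """Adjusted R^2 from an existing R^2 score (second form)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - d - 1)

# Hypothetical fit: n = 4 observations, d = 1 feature.
ss_res_demo, ss_tot_demo, n_demo, d_demo = 1.0, 20.0, 4, 1
r2_demo = 1.0 - ss_res_demo / ss_tot_demo            # 0.95

adj_a = adjusted_r2_from_sums(ss_res_demo, ss_tot_demo, n_demo, d_demo)
adj_b = adjusted_r2_from_r2(r2_demo, n_demo, d_demo)

# By hand: 1 - 0.05 * 3 / 2 = 0.925
print(adj_a, adj_b)
```

With only four observations the penalty is visible: the score drops from 0.95 to about 0.925. With many observations relative to the number of features, the ratio $(n-1)/(n-d-1)$ is close to 1 and the two scores nearly coincide.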


Here, ($n-1$) is the degrees of freedom of the estimate of the population variance of the dependent output variable, whereas ($n-d-1$) is the degrees of freedom of the estimate of the population error variance. In contrast to $R^2$, adjusted $R^2$ ($\bar{R^2}$) increases only if the new independent input variable raises $R^2$ by more than one would expect to see by chance.
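This condition can be made explicit with a little algebra. Writing $R^2_d$ for the score of a model with $d$ features (a subscript introduced here only for illustration) and comparing $\bar{R^2}$ before and after adding one feature, the common factor $(n-1)$ cancels, and $\bar{R^2}$ increases exactly when:

$$\frac{1-R^2_{d+1}}{n-d-2} < \frac{1-R^2_{d}}{n-d-1}$$

As a worked check with the $R^2$ scores reported earlier for the Advertising data ($n = 200$), going from _TV + radio_ ($d = 2$) to _TV + radio + newspaper_ ($d = 3$) gives:

$$\frac{1-0.897211}{196} \approx 0.0005244 \qquad \text{and} \qquad \frac{1-0.897194}{197} \approx 0.0005219$$

The left-hand side is larger than the right-hand side, so the inequality fails and $\bar{R^2}$ goes down when _newspaper_ is added, even though $R^2$ itself goes up slightly.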

Now, we will see $\bar{R^2}$ score on the same Advertising dataset and interpret the results.

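### Implementation

As before, we use the Advertising dataset and reuse the $R^2$ scores computed above. The dataset has $n = 200$ observations, which is the value passed as `n` in the calls that follow.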

In [None]:
results = {

    "Adjusted R Squared":list()
}

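# Compute adjusted R^2 from an R^2 score and record it in `results`
def adjusted_r2(r2, n, d):
    """r2: R^2 score, n: number of observations, d: number of features."""
    adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - d - 1))
    results["Adjusted R Squared"].append(adj_r2)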


In [None]:

############
# TV
############
adjusted_r2(r2_df.iloc[0, :]['TV'], 200, 1)


############
# radio
############
adjusted_r2(r2_df.iloc[0, :]['radio'], 200, 1)

############
# newspaper
############
adjusted_r2(r2_df.iloc[0, :]['newspaper'], 200, 1)


############
# TV, radio
############
adjusted_r2(r2_df.iloc[0, :]['TV + radio'], 200, 2)


############
# TV, radio, newspaper
############
adjusted_r2(r2_df.iloc[0, :]['TV + radio + newspaper'], 200, 3)



### Interpreting Results

In [None]:
index = ["TV", "radio", "newspaper", "TV + radio", "TV + radio + newspaper"]
adjusted_r2_df = pd.DataFrame(results, index=index).transpose()
display(adjusted_r2_df)

| | TV | radio | newspaper | TV + radio | TV + radio + newspaper |
|---|---|---|---|---|---|
| Adjusted R Squared | 0.609915 | 0.328659 | 0.047333 | 0.896151 | 0.895637 |


Here are the adjusted $R^2$ values for each set of inputs. Among the simple linear regressions, _TV_ has the highest adjusted $R^2$, so it is the one to choose if we want a single-feature model. With _TV_ and _radio_ as inputs, the adjusted $R^2$ rises sharply (from about 0.61 to about 0.90), so the second feature contributes substantially to the fit. When we add a third input variable, _newspaper_, the adjusted $R^2$ decreases slightly (from 0.896151 to 0.895637). Unlike $R^2$, adjusted $R^2$ compensates for the number of variables and increases only if the new predictor improves the model by more than would be expected by chance. Hence, if we have to select the best model, it would be the one with _TV_ and _radio_ as inputs, which has the highest adjusted $R^2$.
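The comparison can be summarised in one small table. The sketch below simply re-enters the scores printed in the two result tables of this notebook and asks which model each metric would select; the variable names are hypothetical.

```python
import pandas as pd

models = ["TV", "radio", "newspaper", "TV + radio", "TV + radio + newspaper"]

# Scores copied from the two result tables shown in this notebook.
comparison = pd.DataFrame(
    {
        "R Squared": [0.611875, 0.332032, 0.05212, 0.897194, 0.897211],
        "Adjusted R Squared": [0.609915, 0.328659, 0.047333, 0.896151, 0.895637],
    },
    index=models,
)

best_by_r2 = comparison["R Squared"].idxmax()
best_by_adjusted_r2 = comparison["Adjusted R Squared"].idxmax()

print("Best by R^2:         ", best_by_r2)
print("Best by adjusted R^2:", best_by_adjusted_r2)
```

The two metrics disagree only about the largest model: plain $R^2$ picks the three-feature model, while adjusted $R^2$ picks _TV + radio_.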

## Key Takeaways

* With metrics like RSS, MSE, RMSE and MAE, there is no definite value that can be considered good, because their values depend on the scale of the target variable.

* R-squared measures how well a model performs relative to always predicting the mean. The closer R-squared is to 1, the better the model’s performance.


* Unlike R-squared, adjusted R-squared takes the number of features into account.

* Adjusted R-squared increases only if the increase in R-squared from the new independent input variable is more than one would expect to see by chance.

