<a href="https://colab.research.google.com/github/Shuaib11-Github/Important-concepts/blob/main/R2_and_Adjusted_R2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## R2 and Adjusted R2

In [1]:
#Import necessary packages and libraries
import numpy as np  
import pandas as pd
from sklearn.linear_model import LinearRegression
#Create input data as a dictionary
input_dict = {
    "PersonId": [1,2,3,4,5,6,7,8,9,10],
    "Father's height": [136.5,149.8,174.07,168.05,185.8,170.45,180.75,148.15,154.46,158.11],
    "Mother's height" : [126.5,143.8,167.07,165.05,182.8,160.45,170.75,140.25,148.46,147.11],
    "Weight" : [50,60,79,85,60,65,75,55,62,67],
    "Person's Height":  [116.5,139.8,184.07,198.05,145.8,160.45,180.75,128.15,144.46,156.11]
}
#Convert dictionary into a pandas dataframe
data = pd.DataFrame(input_dict)
data

| | PersonId | Father's height | Mother's height | Weight | Person's Height |
|---|---|---|---|---|---|
| 0 | 1 | 136.5 | 126.5 | 50 | 116.5 |
| 1 | 2 | 149.8 | 143.8 | 60 | 139.8 |
| 2 | 3 | 174.07 | 167.07 | 79 | 184.07 |
| 3 | 4 | 168.05 | 165.05 | 85 | 198.05 |
| 4 | 5 | 185.8 | 182.8 | 60 | 145.8 |
| 5 | 6 | 170.45 | 160.45 | 65 | 160.45 |
| 6 | 7 | 180.75 | 170.75 | 75 | 180.75 |
| 7 | 8 | 148.15 | 140.25 | 55 | 128.15 |
| 8 | 9 | 154.46 | 148.46 | 62 | 144.46 |
| 9 | 10 | 158.11 | 147.11 | 67 | 156.11 |


In [2]:
#Split the data into train data (first 7 rows) and test data (last 3 rows)
train = data.head(7)
test = data.tail(3)
#Predictors: every column except the unique id and the target variable
feature_cols = ["Father's height", "Mother's height", "Weight"]
X_train = train[feature_cols]
X_test = test[feature_cols]
#Target: kept as a one-column DataFrame, so predictions come back as a 2-D array
y_train = train[["Person's Height"]]
y_test = test[["Person's Height"]]
#Perform linear regression using sklearn library
regressor = LinearRegression()
regressor.fit(X_train,y_train)        
predictions = regressor.predict(X_test)
#sklearn's inbuilt method for computing the RSquared of the model
rsquared = regressor.score(X_test, y_test)
#Predictions of testdata
print(predictions)

[[129.60466076]
 [145.16054727]
 [159.51764397]]


In [3]:
#R Squared of the model
print(rsquared)

0.9639573144505528


Here, R Squared = 0.964 (rounded). Taken at face value, this looks like a very good score.


But is it enough to be confident in the predictive ability of this model?
No.


Let's also check Adjusted R Squared for this model:

In [4]:
#Adjusted RSquared of the model
n=len(data) #number of records (note: this counts all 10 rows, while rsquared above was scored on the 3 test rows)
p=len(data.columns)-2 #number of features, i.e. columns excluding the unique id and the target variable
adjr= 1-(1-rsquared)*(n-1)/(n-p-1)
print(adjr)

0.9459359716758291


It is lower than R Squared: the score drops by about 0.018, or roughly 2%, from R Squared (0.964) to Adjusted R Squared (0.946).

- Why did Adjusted R Squared drop?
- What does this difference mean intuitively?
- How does it show up in real-world use cases?
- Is R Squared always between 0 and 1, or are there exceptional cases that we often overlook?

Let's find the answers.

### Limitations of R Squared
R Squared = 1 - (SSR/SST)

Here, SST stands for Sum of Squares Total. It measures how much the observed data points vary around the mean of the target variable. The mean acts as the simplest possible model here: a baseline that predicts the same value for every observation.
SST = Sum (Square (Each data point - Mean of the target variable))

Mathematically,

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

where,

n = Number of observations.

y = Observed value of the target variable.

y̅ = Mean value of the target variable.
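As a small illustration of the SST definition, the sketch below computes it with NumPy for the "Person's Height" column of the example data. The variable names (`y`, `y_mean`, `sst`) are chosen here for illustration only.

```python
import numpy as np

# Target values from the example data set ("Person's Height" column)
y = np.array([116.5, 139.8, 184.07, 198.05, 145.8,
              160.45, 180.75, 128.15, 144.46, 156.11])

y_mean = y.mean()                # the naive "always predict the mean" baseline
sst = np.sum((y - y_mean) ** 2)  # total sum of squares around that mean

print(f"mean = {y_mean:.3f}, SST = {sst:.3f}")
```

SST depends only on the target values, so it is the same no matter which model is evaluated against it.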

For example,

If we want to build a regression model to predict a person's height with weight as the independent variable, one effortless prediction is to calculate the mean height of everyone in our sample and use it as the prediction for every person. The red line in the following diagram shows the mean height of all the persons in our sample.

![image-2.png](attachment:image-2.png)

Now for SSR.

SSR stands for Sum of Squared Residuals. The residuals come from the model we built with our chosen mathematical approach (linear regression, Bayesian regression, polynomial regression, or any other approach). If we use a more sophisticated approach than the naive mean, we expect the predictions to sit closer to the data.
SSR = Sum (Square (Each data point - Each corresponding data point in the regression line))

Mathematically,

$$SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where,

n = Number of observations

y = Observed value of the target variable

ŷ = Predicted value of the target variable

![image-2.png](attachment:image-2.png)

In the above diagram, suppose the blue line shows the predictions of a more sophisticated model. We can see that it follows the data more closely than the red line.
Now back to the formula,

R Squared = 1 - (SSR/SST)

Here,

SST will be a large number because it comes from a very poor model (the red mean line).

SSR will be a small number because it comes from the best model we could develop (the blue line).

So, SSR/SST will be a small number, and it gets smaller whenever SSR decreases.

So, 1 - (SSR/SST) will be close to 1.

So we can infer that a higher R Squared means the model explains more of the variation in the target variable.
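The reasoning above can be sketched in code. This example fits a one-variable line (height from weight) on the example data with `np.polyfit`, then builds R Squared from SSR and SST by hand. The choice of weight as the single predictor and all variable names are assumptions made for illustration.

```python
import numpy as np

# Weight and height columns from the example data set
weight = np.array([50, 60, 79, 85, 60, 65, 75, 55, 62, 67], dtype=float)
height = np.array([116.5, 139.8, 184.07, 198.05, 145.8,
                   160.45, 180.75, 128.15, 144.46, 156.11])

# Least-squares straight line: height ~ slope * weight + intercept
slope, intercept = np.polyfit(weight, height, deg=1)
predicted = slope * weight + intercept

sst = np.sum((height - height.mean()) ** 2)  # spread around the mean (red line)
ssr = np.sum((height - predicted) ** 2)      # spread around the fitted line (blue line)
r2 = 1 - ssr / sst

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, R Squared = {r2:.4f}")
```

For a straight-line fit with an intercept, this value equals the squared correlation between weight and height, which is a handy cross-check.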

This reasoning holds for the simple case, but it breaks down in many cases where multiple independent variables are present. In the example, we had only one independent variable and one target variable, but a real problem can have hundreds of independent variables for a single dependent variable. The actual problem is that, out of hundreds of independent variables:

Some variables will have a very high correlation with the target variable.

Some variables will have a very small correlation with the target variable.

Also, some independent variables will not correlate at all.

If there is no correlation, the model has no way of knowing it: **"The model will still try to establish a relationship between the dependent and independent variables, and proceed with its calculations on the assumption that the researcher has already eliminated the unwanted independent variables."**

For example,

For predicting the height of a person, suppose we have the following independent variables:

- Weight ( High correlation )

- Phone number( No correlation )

- Location ( Low correlation )

- Age ( High correlation )

- Gender ( Low correlation )

Here, weight and age alone are enough to build an accurate model, but the model will assume that the phone number also influences height and will represent it as one more dimension. When a regression plane is fitted through these 5 independent variables, **its coefficients and intercept adjust to minimize the residuals on the training data.** For least squares with an intercept, adding a variable can never increase the training SSR, so R Squared rises (or at best stays the same) even when the variable is meaningless.


In such scenarios, the regression plane ends up fitting the noise in the training points. That makes the SSR very small and R Squared very high, but when test data is introduced, such models tend to perform poorly.


That is why a high R Squared value does not guarantee an accurate model.
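A minimal sketch of this effect, using synthetic data rather than the example data set: the heights are generated from weight plus noise, and a purely random column stands in for an irrelevant variable such as a phone number. The helper `train_r2`, the seed, and the data-generating numbers are all assumptions made for illustration.

```python
import numpy as np

def train_r2(X, y):
    """R Squared on the training data for least squares with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])     # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # ordinary least squares fit
    residuals = y - A @ coef
    return 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n = 30
weight = rng.normal(70, 10, n)                     # relevant predictor
height = 100 + 0.9 * weight + rng.normal(0, 5, n)  # synthetic target
noise = rng.normal(size=n)                         # irrelevant predictor

r2_base = train_r2(weight.reshape(-1, 1), height)
r2_noise = train_r2(np.column_stack([weight, noise]), height)

print(f"weight only:          {r2_base:.6f}")
print(f"weight + noise column: {r2_noise:.6f}")
```

The second value is never lower than the first, even though the extra column carries no information about height.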

### Importance of Adjusted R Squared

- To overcome the challenge mentioned above, we have an additional metric called Adjusted R Squared.

- Adjusted R Squared = 1 - [ ( (1 - R Squared) * (n-1) ) / (n-p-1) ]

where,

p = number of independent variables.

n = number of records in the data set.

For a simple representation, we can rewrite the above formula like this:

- Adjusted R Squared = 1 - (A * B)

where,

A = 1 - R Squared

B = (n-1) / (n-p-1)

From the above formula, we can draw the following inferences:

When the number of predictor variables p increases, the denominator (n-p-1) shrinks, so the value of B increases.

When the value of R Squared increases, the value of A decreases.

Hence the two factors pull in opposite directions: a new predictor is rewarded through A only to the extent that it raises R Squared, and it is always penalized through B.

Since B is greater than 1 whenever p is at least 1, the product of A and B is at least as large as A alone.

If we subtract the product of A and B from 1, the result is therefore never greater than R Squared, and it is strictly smaller unless R Squared = 1.

Not only the difference between R Squared and Adjusted R Squared, but also the value of Adjusted R Squared itself, can be used as a goodness-of-fit metric that addresses the limitations of R Squared when evaluating the expected consistency of the model.

In summary, whenever the number of independent variables increases, the formula applies a penalty, so the value comes down unless R Squared improves enough to compensate. Adjusted R Squared is therefore less easily inflated by irrelevant variables, and it indicates the performance of the model more reliably than R Squared.
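To make the penalty concrete, here is a small sketch that holds R Squared fixed and only varies the number of predictors. The helper name `adjusted_r2` and the values n = 100 and R Squared = 0.9 are illustrative assumptions, not part of the model fitted earlier.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R Squared = 1 - (1 - r2) * (n - 1) / (n - p - 1)."""
    if n - p - 1 <= 0:
        raise ValueError("need more records than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n, r2 = 100, 0.9  # illustrative values
for p in (1, 2, 5, 10, 20):
    b = (n - 1) / (n - p - 1)  # the factor B from the formula above
    print(f"p = {p:2d}   B = {b:.4f}   adjusted R Squared = {adjusted_r2(r2, n, p):.4f}")
```

With R Squared unchanged, B grows with every added predictor and the adjusted value falls, which is the behaviour described in the summary.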

### A numerical example of r2 and adj_r2

In [5]:
# (1 - r2) * ((n - 1)/ (n - k - 1)), where n = 100, k = 2, r2 = 0.9
0.1 * 1.0206185567010309

0.10206185567010309

In [6]:
# adj_r2 = 1 - (1 - r2) * ((n - 1)/ (n - k - 1)), where n = 100, k = 2, r2 = 0.9
adj_r2_2f = 1 - 0.10206185567010309
adj_r2_2f

0.8979381443298969

### After adding one more, irrelevant feature, compare the r2 and adj_r2 values

In [7]:
# (1 - r2) * ((n - 1)/ (n - k - 1)), where n = 100, k = 3, r2 = 0.9001
0.09989999999999999 * 1.03125

0.10302187499999998

In [8]:
# adj_r2 = 1 - (1 - r2) * ((n - 1)/ (n - k - 1)), where n = 100, k = 3, r2 = 0.9001
adj_r2_3f = 1 - 0.10302187499999998
adj_r2_3f

0.896978125
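The two results above can be tied together by asking how much R Squared would have to rise for the third feature to leave adj_r2 unchanged. Rearranging the formula gives a break-even value; the sketch below computes it for the same n, k and r2. The helper names `adjusted_r2` and `break_even_r2` are hypothetical, introduced only for this illustration.

```python
def adjusted_r2(r2, n, k):
    """adj_r2 = 1 - (1 - r2) * ((n - 1) / (n - k - 1))"""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def break_even_r2(r2_old, n, k_old):
    """R Squared needed after adding ONE feature for adj_r2 to stay the same."""
    return 1 - (1 - r2_old) * (n - k_old - 2) / (n - k_old - 1)

n, k, r2_old, r2_new = 100, 2, 0.9, 0.9001  # values from the cells above

threshold = break_even_r2(r2_old, n, k)
print(f"break-even r2 with {k + 1} features: {threshold:.6f}")
print(f"observed r2 with {k + 1} features:   {r2_new:.6f}")
print(f"adj_r2 before: {adjusted_r2(r2_old, n, k):.6f}")
print(f"adj_r2 after:  {adjusted_r2(r2_new, n, k + 1):.6f}")
```

Because the observed 0.9001 is below the break-even value, adj_r2 drops even though r2 went up slightly.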

### Can R Squared be negative?
Yes. It can also be negative in some scenarios.

Since R Squared = 1 - ( SSR / SST )

The formula treats the mean line of the target, a horizontal line at the mean of y, as the baseline that any useful model should beat. SST is the sum of squared differences between this mean line and the original data points. Similarly, SSR is the sum of squared differences between the predicted data points (from the model plane) and the original data points.

SSR/SST gives a ratio that answers, **"How large is SSR relative to SST?"** If your model builds a plane that is even somewhat better than this baseline, then SSR < SST, which is the usual case. Substituting into the equation then gives a positive R Squared.

But what if SSR > SST? This means that your regression plane is worse than the mean line. In this case, R Squared will be negative. For a least-squares fit with an intercept this cannot happen on the training data; it shows up when a model is scored on unseen test data or is fitted without an intercept.
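A short sketch of the negative case, using the three test heights from the example data. The "bad" predictions are simply the true values in reverse order, a deliberately poor guess invented for illustration; `r_squared` is a hypothetical helper that applies the formula above.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R Squared = 1 - SSR / SST, with SST taken around the mean of y_true."""
    ssr = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ssr / sst

y_true = np.array([128.15, 144.46, 156.11])   # test heights from the example
y_mean_pred = np.full(3, y_true.mean())       # baseline: always predict the mean
y_bad_pred = y_true[::-1]                     # deliberately poor predictions

print("mean baseline:", r_squared(y_true, y_mean_pred))
print("bad model:    ", r_squared(y_true, y_bad_pred))
```

The mean baseline scores exactly 0, and any prediction that misses by more than the baseline does, like the reversed values here, scores below 0.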