
R-squared ($R^2$) and Adjusted R-squared are two statistical measures used to evaluate the performance of regression models, especially to quantify how well the independent variables explain the variability of the dependent variable.

Both are useful when evaluating a regression model, but they serve slightly different purposes, and which one to rely on depends on the model and the goal of the analysis.

### $R^2$ (Coefficient of Determination)
- **Definition**: $R^2$ measures the proportion of the variance in the dependent variable that is predictable from the independent variables. For a least-squares fit with an intercept, evaluated on the data used to fit it, $R^2$ lies between 0 and 1: 1 indicates perfect prediction, and 0 means the model performs no better than always predicting the mean of the dependent variable, regardless of the input features. On held-out data it can be negative, which means the model does worse than predicting the mean.
- **Usefulness**: $R^2$ is very intuitive and widely used to gauge how well a model captures the variance of the data. It's especially useful for simple linear regression models.
- **Limitation**: Its main limitation is that it can be misleadingly high for models with many predictors, or for models that are too complex for the data. On the training data, adding a predictor to a least-squares model never lowers $R^2$, so the value can be inflated even when the new predictors do not improve the model's true predictive capability.
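
The limitation above can be illustrated with a small experiment. This is a minimal sketch on synthetic data, not part of the original analysis: the helper name `in_sample_r2`, the seed, and the data-generating process are all assumptions chosen for illustration. It fits ordinary least squares with NumPy and appends predictors of pure noise one at a time.

```python
import numpy as np

def in_sample_r2(X, y):
    """Fit least squares with an intercept and return R^2 on the same data."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    residuals = y - A @ coef
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)          # fixed seed so the run is repeatable
n = 100
x = rng.normal(size=(n, 1))             # one genuinely useful predictor
y = 2.0 * x[:, 0] + rng.normal(size=n)  # synthetic target (assumed for the demo)

X = x
r2_values = [in_sample_r2(X, y)]
for _ in range(5):
    noise = rng.normal(size=(n, 1))     # predictor unrelated to y
    X = np.hstack([X, noise])
    r2_values.append(in_sample_r2(X, y))

for k, r2 in enumerate(r2_values):
    print(f"{k} noise predictors: R^2 = {r2:.4f}")
```

Each printed value should be at least as large as the one before it, even though the added columns carry no information about `y`.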

### Adjusted $R^2$
- **Definition**: Adjusted $R^2$ adjusts the $R^2$ statistic based on the number of predictors in the model. Unlike $R^2$, it can decrease if a predictor improves the model less than what would be expected by chance. It provides a more accurate measure of the goodness of fit, especially for multiple regression models.
- **Usefulness**: It is particularly useful when comparing models with a different number of independent variables. By penalizing the addition of irrelevant predictors, adjusted $R^2$ helps to mitigate the problem of overfitting and provides a more honest assessment of the model's explanatory power.
- **Consideration**: While adjusted $R^2$ is a better indicator for the comparative analysis of models with different numbers of predictors, it is still not a definitive measure of model quality. Other model evaluation metrics and techniques, such as cross-validation, should also be considered.
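
To see the penalty at work, the sketch below applies the standard formula $1 - (1-R^2)\frac{n-1}{n-p-1}$ to two hypothetical models. The function name `adjusted_r2` and the figures (50 observations, $R^2$ rising from 0.800 to 0.801 when a fourth predictor is added) are made up for illustration.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n - p - 1 <= 0:
        raise ValueError("need more than p + 1 observations")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical figures: a 4th predictor lifts R^2 only from 0.800 to 0.801.
n = 50
before = adjusted_r2(0.800, n, p=3)
after = adjusted_r2(0.801, n, p=4)

print(f"p=3: R^2=0.800, adjusted={before:.4f}")  # adjusted=0.7870
print(f"p=4: R^2=0.801, adjusted={after:.4f}")   # adjusted=0.7833
```

Plain $R^2$ went up, but adjusted $R^2$ went down: the tiny gain in fit did not cover the cost of the extra predictor.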

### Which is Best?
- **Model Comparison**: For simple linear regression (with a single predictor), $R^2$ and adjusted $R^2$ will provide similar insights. However, for multiple regression models (with more than one predictor), adjusted $R^2$ is generally a better choice as it accounts for the model complexity.
- **General Guideline**: There is no absolute "best" between $R^2$ and adjusted $R^2$; the choice depends on your specific needs. If you are evaluating models with varying numbers of predictors or concerned about overfitting, adjusted $R^2$ is more appropriate. If you are working with a simple model or just need a quick, intuitive measure of fit, $R^2$ may suffice.

In summary, adjusted $R^2$ is often preferred in more complex models and model comparison scenarios, as it provides a more nuanced view of model performance by accounting for the number of predictors used.

## **Comparison of R-squared ($R^2$) and Adjusted R-squared**

| Feature | R-squared $(R^2)$ | Adjusted R-squared |
|---------|---------------------|--------------------|
| **Definition** | Measures the proportion of variance in the dependent variable that is predictable from the independent variables. | Adjusts $R^2$ for the number of predictors in the model, providing a more accurate measure when multiple predictors are used. |
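| **Calculation** | $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$, where $SS_{res}$ is the residual sum of squares and $SS_{tot}$ is the total sum of squares | Adjusted $R^2$ = $1 - (1-R^2)\frac{n-1}{n-p-1}$, where $n$ is the number of observations and $p$ is the number of predictors |
| **Interpretation** | For a least-squares fit evaluated on its training data, values range from 0 to 1; on held-out data the value can be negative. A higher value indicates that the model explains a greater proportion of variance. | Similar to $R^2$, but never larger than it, and it can decrease if unnecessary predictors are added, thus penalizing overfitting. |
| **Sensitivity to Predictors** | On the training data, increases or stays the same with the addition of predictors, which can mask overfitting. | Decreases if the addition of predictors does not sufficiently improve the model, providing a penalty for unnecessary complexity. |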
| **Use in Model Selection** | Can be misleading when comparing models with a different number of predictors due to its sensitivity to model complexity. | More reliable for comparing models with different numbers of predictors, as it accounts for model complexity. |
| **Advantage** | Simple and widely understood measure of model fit. | Provides a more accurate assessment of model performance by adjusting for model complexity. |
| **Limitation** | Can encourage overfitting by rewarding models for including more variables, regardless of their relevance. | Requires a larger sample size to be effective, as the adjustment can be overly penalizing in small samples. |
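
The small-sample limitation in the last row can be made concrete with a little algebra. Writing $n$ for the number of observations, $p$ for the number of predictors, and $R^2_{\text{adj}}$ for adjusted $R^2$, the gap between the two statistics follows directly from the formula in the table:

$$
R^2 - R^2_{\text{adj}} = (1 - R^2)\,\frac{n-1}{n-p-1} - (1 - R^2) = (1 - R^2)\,\frac{p}{n-p-1}
$$

As a purely illustrative case (the numbers are hypothetical), with $R^2 = 0.6$ and $p = 8$ the gap is $0.4 \cdot 8/11 \approx 0.29$ when $n = 20$, but only $0.4 \cdot 8/991 \approx 0.003$ when $n = 1000$. The penalty grows without bound as $n$ approaches $p + 1$, where the statistic is undefined.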


In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn.datasets import fetch_california_housing

In [3]:
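# Load the California housing data; as_frame=True returns pandas objects
calHousing = fetch_california_housing(as_frame=True)
data = pd.DataFrame(calHousing.data, columns=calHousing.feature_names)
# Target: median house value, in units of $100,000
data['Price($100K)'] = calHousing.target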
data

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | Price($100K) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |


In [4]:
data1 = data.copy(deep = True)
data1



In [6]:
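# User-defined R-squared
def r_squared(y_true, y_pred):
  """Return R^2 = 1 - SS_res / SS_tot for paired true and predicted values."""
  y = list(y_true)
  y_hat = list(y_pred)

  # Mean of the observed values (the baseline prediction)
  y_mean = sum(y)/len(y)

  # Residual sum of squares: squared errors of the model's predictions
  ss_res = sum((yi - yi_hat)**2 for yi, yi_hat in zip(y, y_hat))

  # Total sum of squares: squared deviations from the mean
  ss_tot = sum((yi - y_mean)**2 for yi in y)

  r2 = 1 - ss_res/ss_tot

  return r2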
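
A quick sanity check of this formula on toy inputs may help before it is applied to real predictions. The sketch below restates the function so that it runs on its own; the four-element target list and the three prediction lists are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Same formula as the cell above, restated so this snippet runs alone."""
    y = list(y_true)
    y_hat = list(y_pred)
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - yi_hat) ** 2 for yi, yi_hat in zip(y, y_hat))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]                          # toy targets, mean = 2.5

perfect = r_squared(y, [1.0, 2.0, 3.0, 4.0])      # predictions equal the targets
mean_only = r_squared(y, [2.5, 2.5, 2.5, 2.5])    # always predict the mean
reversed_ = r_squared(y, [4.0, 3.0, 2.0, 1.0])    # systematically wrong

print(perfect, mean_only, reversed_)  # 1.0 0.0 -3.0
```

The third case shows that the formula can return a negative value when predictions are worse than simply guessing the mean.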




In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [8]:
# Fit a linear regression and return the built-in R-squared, the user-defined R-squared, and adjusted R-squared
def accuracyOfModel(data):
  X = data.drop(columns = ['Price($100K)'])
  y = data['Price($100K)']

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3, random_state= 42)

  lr = LinearRegression()
  lr.fit(X_train, y_train)

  y_pred = lr.predict(X_test)

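  ## Built-in R-squared score
  r2 = r2_score(y_test, y_pred)

  ## Our user-defined R-squared score
  r2_user_defined = r_squared(y_test, y_pred)

  # Adjusted R-squared = 1 - (1 - r2)*(n - 1)/(n - p - 1),
  # where p = number of independent features and n = number of data points.
  # Here n is taken from the test set; the statistic is more commonly reported on the data used to fit the model.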

  n = X_test.shape[0]
  p = X_test.shape[1]

  adj_r2 = 1 - (1-r2)*(n-1)/(n-p-1)

  return r2, r2_user_defined, adj_r2






In [9]:
# for actual dataset: data1
r2, r2_user_defined, adj_r2 = accuracyOfModel(data1)
print(f"r2 score is {r2}, user defined r2 score is {r2_user_defined} and adjusted r2 score is {adj_r2}")

r2 score is 0.595770232606166, user defined r2 score is 0.5957702326061689 and adjusted r2 score is 0.5952472117200023
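
The adjusted value printed above can be cross-checked by hand. The sketch below plugs the reported $R^2$ into the formula, assuming the full dataset has 20,640 rows (so the 30% test split holds 6,192 of them) and that $p = 8$ feature columns were used. Both figures are assumptions inferred from the cells above rather than values the notebook prints.

```python
# Values copied from the printed output above.
r2 = 0.595770232606166
reported_adj_r2 = 0.5952472117200023

# Assumptions: 20,640 rows in total, so a 30% test split holds 6,192 rows;
# p = 8 is the number of feature columns.
n, p = 6192, 8

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adj_r2)
print(abs(adj_r2 - reported_adj_r2) < 1e-9)
```

With a test set this large relative to the number of predictors, the adjustment is tiny: the two scores differ only in the fourth decimal place.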
