# Problem with R² Score

The **R² score** measures how well a regression model explains the variability of the target variable.

$$
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}
$$
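To make the formula concrete, here is a minimal sketch that computes R² straight from the definition. The helper name `r_squared` and the sample values are made up for illustration; in the notebook itself, `sklearn.metrics.r2_score` does this job.

```python
def r_squared(y_true, y_pred):
    """Compute R^2 = 1 - SS_res / SS_tot directly from the definition."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)                     # total variation
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]        # made-up targets, mean is 6.0
perfect = list(y_true)               # predictions that match exactly
baseline = [6.0] * 4                 # always predict the mean

print(r_squared(y_true, perfect))    # 1.0
print(r_squared(y_true, baseline))   # 0.0
```

A perfect fit scores 1, and a model that only ever predicts the mean scores 0.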

However, one major problem with R² is that **it never decreases when new independent variables are added to the model**, even if those variables have **no real relationship** with the target. (This holds for a least-squares fit with R² measured on the training data.)

This happens because R² only looks at how much variation in the dependent variable is explained, without checking whether the new features genuinely improve the model. Least squares can always give a new feature a coefficient of zero, so the training fit cannot get worse, and any chance correlation with the target makes it slightly better.  
As a result, adding **irrelevant or random columns** can make the R² value appear higher, giving a **false impression of a better model**.

In short:

> **More features ≠ Better model**, but R² may misleadingly suggest so.
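The claim can also be checked without any dataset file. The sketch below is a stand-in under stated assumptions: the data is synthetic, the seed is arbitrary, and `r2_ols` is a hypothetical helper built on `np.linalg.lstsq` rather than the notebook's `LinearRegression`. It appends pure-noise columns one at a time and records the training R² after each.

```python
import numpy as np

def r2_ols(X, y):
    """Fit least squares with an intercept and return the training R^2."""
    A = np.column_stack([np.ones(len(y)), X])          # prepend intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)                         # arbitrary fixed seed
n = 30
experience = rng.uniform(5, 40, size=n)                # synthetic feature
salary = 0.8 * experience + rng.normal(0, 5, size=n)   # synthetic target

X = experience.reshape(-1, 1)
scores = [r2_ols(X, salary)]
for _ in range(10):
    noise = rng.normal(size=(n, 1))                    # column unrelated to salary
    X = np.hstack([X, noise])
    scores.append(r2_ols(X, salary))

for k, score in enumerate(scores, start=1):
    print(f"{k:2d} features: R^2 = {score:.4f}")
```

Each printed value should be at least as large as the one before it, even though every added column is random noise.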


In [21]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import numpy as np

In [40]:
# Demonstration of R² Score Problem
data = pd.read_csv('Experience-Salary.csv').sample(30)
df = data  # same object as data (not a copy), so columns added to df later also appear in data
X = data['exp(in months)'].values
Y = data['salary(in thousands)'].values
X_trans = X.reshape(-1, 1)


In [47]:
# In this example, we have a dataset containing Experience (in months) and Salary (in thousands).
model1 = LinearRegression()
model1.fit(X_trans, Y)
print('DataSet', data.head())
print(" \n R2-Score:-", r2_score(Y, model1.predict(X_trans)))
# Normally, we expect salary to increase with experience,
# so R² should capture that relationship.

DataSet      exp(in months)  salary(in thousands)
679       36.437997             31.147866
15        36.925318             32.533003
716       20.678132             16.999354
90        37.682982             34.457312
883       16.586150             25.682085
 
 R2-Score:- 0.6736043894230678


In [49]:
# Now, we intentionally add irrelevant columns such as
# day (values such as 'mon', 'tue', etc.),
# gender (male and female), color, animal, etc.
# These columns have no logical relation to salary.
days =  pd.Series(['sun','mon', 'tue', 'wen', 'thu', 'fri', 'sat'])
gender = pd.Series(['male', 'female'])
df.loc[:, 'days'] = [days[i % 7] for i in range(len(df))]
df.loc[:, 'gender'] = [gender[i % 2] for i in range(len(df))]
df['color'] = np.random.choice(['Red', 'Green', 'Blue', 'Yellow'], len(df))
df['animal'] = np.random.choice(['Dog', 'Cat', 'Cow', 'Goat'], len(df))
df = pd.get_dummies(df ,columns=['days', 'gender', 'color', 'animal'],drop_first=True, dtype=int)
df

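| | exp(in months) | salary(in thousands) | days_mon | days_sat | days_sun | days_thu | days_tue | days_wen | gender_male | color_Green | ... | days_thu.1 | days_tue.1 | days_wen.1 | gender_male.1 | color_Green.1 | color_Red | color_Yellow | animal_Cow | animal_Dog | animal_Goat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 679 | 36.437997 | 31.147866 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 15 | 36.925318 | 32.533003 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 716 | 20.678132 | 16.999354 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 90 | 37.682982 | 34.457312 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 883 | 16.586150 | 25.682085 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 709 | 14.433476 | 10.566924 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 872 | 18.081597 | 21.247311 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 665 | 13.217402 | 15.836764 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 941 | 21.756378 | 24.141823 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 116 | 30.548236 | 35.972568 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |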


In [None]:
# We then train a regression model and calculate the R² score again.
y = df['salary(in thousands)'].values
x = df.drop(columns='salary(in thousands)').values
model2 = LinearRegression()
model2.fit(x, y)
r2_score(y, model2.predict(x))
## 🧠 Expected Outcome
# Even though the new days, gender, color, and animal columns are completely useless,
# the R² score increases (or at best stays the same) instead of decreasing.
# This happens because R² never penalizes a model for adding more independent variables;
# it only checks whether the model fits the data better numerically, even by chance.
# Hence, R² can give a false impression that the model improved.

0.9780476822469013

# Adjusted R²: Fixing the Problem with R²

The **Adjusted R²** score is an improved version of R² that penalizes the model for adding **irrelevant or unnecessary features**.

While R¬≤ is calculated as:

$$
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}
$$

The **Adjusted R²** is given by:

$$
R^2_{adj} = 1 - \left( \frac{(1 - R^2)(n - 1)}{n - k - 1} \right)
$$

Where:  
- \( n \) = number of data points  
- \( k \) = number of independent variables (features)
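
As a worked illustration with made-up numbers (not taken from the notebook), take \( n = 30 \). A model with \( k = 1 \) feature and \( R^2 = 0.80 \) gives:

$$
R^2_{adj} = 1 - \frac{(1 - 0.80)(30 - 1)}{30 - 1 - 1} = 1 - \frac{5.8}{28} \approx 0.793
$$

Suppose nine more features raise \( R^2 \) only to 0.81, so that \( k = 10 \):

$$
R^2_{adj} = 1 - \frac{(1 - 0.81)(30 - 1)}{30 - 10 - 1} = 1 - \frac{5.51}{19} = 0.71
$$

R² went up by 0.01, yet the adjusted score fell from about 0.793 to 0.71, because the denominator \( n - k - 1 \) shrank from 28 to 19.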

---

## 🧠 Why It’s Better

- When you **add a new feature**, Adjusted R² checks whether it **improves the fit enough** to justify the extra parameter.  
- If the new feature **does not** help enough, Adjusted R² will **decrease**, unlike regular R², which always stays the same or increases.  
- Thus, it provides a **fairer and more realistic** measure of model performance.
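
How much help is "enough" follows from the Adjusted R² formula: going from \( k \) to \( k + 1 \) features keeps the adjusted score from dropping only if R² rises by at least \( (1 - R^2)/(n - k - 1) \). The sketch below illustrates that threshold. The helper names and the numbers are hypothetical and chosen only for the example.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n data points and k features."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def min_gain_for_new_feature(r2, n, k):
    """Smallest rise in R^2 that keeps Adjusted R^2 from dropping
    when the feature count goes from k to k + 1."""
    return (1 - r2) / (n - k - 1)

r2, n, k = 0.80, 30, 5                       # illustrative values only
threshold = min_gain_for_new_feature(r2, n, k)

before = adjusted_r2(r2, n, k)
small = adjusted_r2(r2 + 0.001, n, k + 1)    # gain below the threshold
large = adjusted_r2(r2 + 0.02, n, k + 1)     # gain above the threshold

print(f"required gain in R^2: {threshold:.4f}")
print(f"before: {before:.4f}, small gain: {small:.4f}, large gain: {large:.4f}")
```

With these numbers the required gain is about 0.0083, so a rise of 0.001 lowers the adjusted score while a rise of 0.02 raises it.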

---

### 🔍 Intuitive Understanding

> R² says: “More variables? Great! I’m happier now!” 😃  
>  
> Adjusted R² says: “Wait, did that new variable really help?  
> If not, I’ll reduce your score.” 😏

---

### ✅ Key Takeaway

- R² can **mislead** by increasing with irrelevant features.  
- **Adjusted R² fixes this** by adding a penalty term that depends on the number of predictors.  
- A **higher Adjusted R²** is a more trustworthy sign of a **better, more efficient model** than a higher R² alone.


In [51]:
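def Adjusted_R2(R2, num_columns, num_rows):
    # num_columns is k (number of features), num_rows is n (number of data points)
    num = (1 - R2) * (num_rows - 1)
    deno = (num_rows - num_columns - 1)  # must be positive, i.e. n > k + 1
    return 1 - (num / deno)

Adjusted_R2(r2_score(y, model2.predict(x)), x.shape[1], x.shape[0])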

0.6816913925800689

In [54]:
pd.DataFrame({
    "R2-Score":[r2_score(y, model2.predict(x))],
    "Adjusted R2-Score":[Adjusted_R2(r2_score(y, model2.predict(x)), x.shape[1], x.shape[0])]})

| | R2-Score | Adjusted R2-Score |
|---|---|---|
| 0 | 0.978048 | 0.681691 |
