
In [12]:
# import the ML algorithm
from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np
import statsmodels.api as sm



In [3]:
location="https://raw.githubusercontent.com/Santanukolkata/Data_Science/master/Classification_Metrics/sales.csv"

In [4]:
df = pd.read_csv(location, delimiter=r"\s+")

In [5]:
df.head()

Unnamed: 0,AverageNumberofTickets,NumberofEmployees,ValueofContract,Industry
0,1,51,25750,Retail
1,9,68,25000,Services
2,20,67,40000,Services
3,1,124,35000,Retail
4,8,124,25000,Manufacturing


In [6]:
X, y = df[['NumberofEmployees','ValueofContract']], df.AverageNumberofTickets

In [7]:
model = LinearRegression()
model.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [10]:
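# compute R-squared and adjusted R-squared from their definitions
yhat        = model.predict(X)
SS_Residual = sum((y - yhat)**2)          # SSE: variation left unexplained by the model

SS_Total  = sum((y - np.mean(y))**2)      # SST: total variation of y around its mean
r_squared = 1 - SS_Residual/SS_Total

n, p = X.shape                            # number of observations, number of predictors
adjusted_r_squared = 1 - (1 - r_squared)*(n - 1)/(n - p - 1)
print(r_squared, adjusted_r_squared)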


0.8776433713234001 0.8632484738320354


In [11]:
# model.score() returns R-squared; adjusted R-squared is derived from it
r2 = model.score(X, y)
print(r2, 1 - (1 - r2)*(len(y) - 1)/(len(y) - X.shape[1] - 1))


0.8776433713234001 0.8632484738320354


In [13]:
# compute with statsmodels, by adding intercept manually

X1 = sm.add_constant(X)
result = sm.OLS(y, X1).fit()

print (result.rsquared, result.rsquared_adj)

0.8776433713234001 0.8632484738320354


In [14]:
# compute with statsmodels, another way, using the formula interface
import statsmodels.formula.api as smf
result = smf.ols(formula="AverageNumberofTickets ~ NumberofEmployees + ValueofContract", data=df).fit()

print(result.rsquared, result.rsquared_adj)


0.8776433713234001 0.8632484738320354


#### Issues with R-squared
R-squared can be inflated artificially: adding more independent variables to the model increases it, or at worst leaves it unchanged, whether or not those variables are useful.

In other words, for a least-squares fit evaluated on the data it was trained on, R-squared never decreases when another independent variable is added.

__Why does this happen?__

$$ R^2 = 1 - \frac{SSE}{SST} $$

- Here SSE is the sum of squared residuals and SST is the total sum of squares of `y` around its mean.
- R-squared is largest when SSE/SST is smallest.
- SST depends only on the data, not on the model, so SSE/SST is minimized by minimizing SSE.
- SSE cannot increase when an explanatory variable is added: least squares can always set the new coefficient to zero and recover the previous fit, so any other coefficient it chooses can only lower SSE.
- Hence R-squared increases, or stays the same, even when the added variable is not significant to the model.
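
The argument above can be checked numerically. The following is a minimal sketch on synthetic data, not on the sales dataset: the sample size, the seed, and the helper name `r_squared` are assumptions chosen for illustration. It fits ordinary least squares with NumPy and appends pure-noise columns one at a time.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed so the run is repeatable

n = 30                                    # assumed sample size for the illustration
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)          # synthetic target: one real predictor plus noise


def r_squared(X, y):
    """R-squared of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])       # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # least-squares coefficients
    sse = np.sum((y - A @ beta) ** 2)               # sum of squared residuals
    sst = np.sum((y - np.mean(y)) ** 2)             # total sum of squares
    return 1 - sse / sst


X = x.reshape(-1, 1)
scores = [r_squared(X, y)]
for _ in range(5):
    X = np.column_stack([X, rng.normal(size=n)])    # append a column unrelated to y
    scores.append(r_squared(X, y))

for k, r2 in enumerate(scores):
    print(f"noise columns added: {k}  R2 = {r2:.4f}")
```

Each printed value should be at least as large as the one before it, up to floating-point rounding, even though the added columns carry no information about `y`.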

#### Adjusted R-squared
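Adjusted R-squared penalizes the model for every additional variable. It increases only when a new variable improves the fit enough to offset that penalty, and decreases otherwise.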
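
For reference, this is the formula used in the code cells of this notebook, where $n$ is the number of observations and $p$ is the number of independent variables:

$$ R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} $$

Since $\frac{n - 1}{n - p - 1} > 1$ whenever $p \geq 1$, the adjusted value never exceeds $R^2$, and the gap widens as $p$ grows for a fixed $n$.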

In [15]:
y    = np.array([21,   21,    22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2])
yhat = np.array([21.5, 21.14, 26.1, 20.2, 17.5, 19.7, 14.9, 22.5, 25.1, 18])

In [16]:
R2 = 1 - sum((y - yhat)**2)/sum((y - np.mean(y))**2)
R2

0.6410828151089257

In [17]:
# Let's assume you have three independent variables in this case.
n = 10
p = 3
adjR2 = 1 - (1 - R2) * ((n - 1)/(n-p-1))
print(adjR2)

0.46162422266338865
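
To see how the penalty scales, the calculation above can be wrapped in a small helper and evaluated for several values of `p`. This is a sketch only: the function name `adjusted_r2` is hypothetical, and holding `R2` fixed while `p` varies is an assumption made to isolate the penalty term (in practice `R2` would also change as variables are added).

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("adjusted R-squared requires n > p + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


r2 = 0.6410828151089257   # R2 value computed in the cell above
n = 10

# Same R2, increasing number of predictors: the adjusted value shrinks.
for p in range(1, 6):
    print(p, adjusted_r2(r2, n, p))
```

With `p = 3` this reproduces the value printed above; larger `p` gives a smaller adjusted R-squared for the same `R2`.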
