# Multiple Linear Regression with sklearn - Exercise Solution

You are given a real estate dataset.

Real estate is an example that nearly every regression course goes through, because it is easy to understand and there is almost always a clear causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year.csv'.

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data.

Apart from that, please:

- Display the intercept and coefficient(s)
- Find the R-squared and Adjusted R-squared
- Compare the R-squared and the Adjusted R-squared
- Compare the R-squared of this regression and the simple linear regression where only 'size' was used
- Using the model, make a prediction about an apartment with size 750 sq.ft. from 2009
- Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
- Create a summary table with your findings

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

Good luck!

# Import the relevant libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

In [2]:
x=pd.read_csv("real_estate_price_size_year.csv")

In [5]:
x.head(5)

,price,size,year
0,234314.144,643.09,2015
1,228581.528,656.22,2009
2,281626.336,487.29,2018
3,401255.608,1504.75,2015
4,458674.256,1275.46,2009


In [6]:
x.tail(2)

,price,size,year
98,225145.248,648.29,2015
99,274922.856,705.29,2006


In [7]:
x.describe()

,price,size,year
count,100.0,100.0,100.0
mean,292289.47016,853.0242,2012.6
std,77051.727525,297.941951,4.729021
min,154282.128,479.75,2006.0
25%,234280.148,643.33,2009.0
50%,280590.716,696.405,2015.0
75%,335723.696,1029.3225,2018.0
max,500681.128,1842.51,2018.0


In [8]:
x.isna().sum()

price    0
size     0
year     0
dtype: int64

# Create the regression

## Declare the dependent and the independent variables

In [9]:
X=x.drop("price",axis=1)

In [10]:
Y=x[["price"]]

In [13]:
lm=LinearRegression()
lm.fit(X,Y)


LinearRegression()

In [18]:
for i,j in enumerate(X.columns):
    print("{}== {} ".format(j, lm.coef_[0][i]))

size== 227.7008540074765 
year== 2916.7853268380386 


In [20]:
lm.intercept_[0]

-5772267.017463281

In [22]:
lm.score(X,Y)

0.7764803683276793
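The exercise also asks how this R-squared compares with that of the simple regression on 'size' alone. The sketch below shows one way to set up that comparison. It runs on synthetic data, because the CSV is not reproduced here; the generated numbers and the names `rng`, `df`, `simple`, and `multiple` are illustrative assumptions, not values from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the real estate data (assumed values, not the real file)
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "size": rng.uniform(480, 1850, n),
    "year": rng.choice([2006, 2009, 2015, 2018], n),
})
df["price"] = (
    100000
    + 200 * df["size"]
    + 3000 * (df["year"] - 2006)
    + rng.normal(0, 30000, n)
)

y = df["price"]

# Simple regression: 'size' only
simple = LinearRegression().fit(df[["size"]], y)
r2_simple = simple.score(df[["size"]], y)

# Multiple regression: 'size' and 'year'
multiple = LinearRegression().fit(df[["size", "year"]], y)
r2_multiple = multiple.score(df[["size", "year"]], y)

print(f"R-squared, size only:     {r2_simple:.3f}")
print(f"R-squared, size and year: {r2_multiple:.3f}")
```

On the real data, the same two fits on the 'size' column alone and on 'size' plus 'year' would give the numbers to compare. In-sample R-squared cannot decrease when a predictor is added to an ordinary least squares model, which is why the Adjusted R-squared is usually the fairer yardstick for this comparison.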

In [25]:
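def adj_r2(x,y):
    # Adjusted R-squared penalizes R-squared for the number of predictors p
    r2 = lm.score(x,y)  # R-squared of the already fitted model lm
    n = x.shape[0]      # number of observations
    p = x.shape[1]      # number of predictors
    adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
    return adjusted_r2
adj_r2(X,Y)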

0.77187171612825
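To compare the two measures, the adjustment can be reproduced by hand from the numbers already shown: the R-squared returned by `lm.score`, the 100 observations reported by `describe()`, and the 2 predictors. This is only a worked check of the same formula, not a different method.

```python
# Values taken from the outputs above
r2 = 0.7764803683276793  # R-squared from lm.score(X, Y)
n = 100                  # number of observations
p = 2                    # number of predictors ('size' and 'year')

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adjusted_r2, 6))  # 0.771872
```

The Adjusted R-squared (about 0.772) is only slightly below the R-squared (about 0.776), so the penalty for carrying two predictors is small with 100 observations.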

In [26]:
X_const=sm.add_constant(X)  # add the intercept column without overwriting the original data frame x
result=sm.OLS(Y,X_const).fit()
result.summary()

Dep. Variable:,price,R-squared:,0.776
Model:,OLS,Adj. R-squared:,0.772
Method:,Least Squares,F-statistic:,168.5
Date:,"Tue, 17 Aug 2021",Prob (F-statistic):,2.77e-32
Time:,21:22:57,Log-Likelihood:,-1191.7
No. Observations:,100,AIC:,2389.0
Df Residuals:,97,BIC:,2397.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-5.772e+06,1.58e+06,-3.647,0.000,-8.91e+06,-2.63e+06
size,227.7009,12.474,18.254,0.000,202.943,252.458
year,2916.7853,785.896,3.711,0.000,1357.000,4476.571

Omnibus:,10.083,Durbin-Watson:,2.25
Prob(Omnibus):,0.006,Jarque-Bera (JB):,3.678
Skew:,0.095,Prob(JB):,0.159
Kurtosis:,2.08,Cond. No.,941000.0


In [30]:
values=pd.DataFrame([[750,2009]],columns=X.columns)  # size 750 sq.ft., year 2009, as the exercise asks
lm.predict(values).round(2)

array([[258330.34]])
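As a cross-check, a prediction can be reproduced by hand by plugging values into the fitted regression equation, price = intercept + b_size * size + b_year * year. The sketch below does this for the apartment the exercise asks about (750 sq.ft., 2009), using the intercept and coefficients printed earlier; `predict_price` is a hypothetical helper name introduced only for this illustration.

```python
# Intercept and coefficients copied from the outputs above
intercept = -5772267.017463281
coef_size = 227.7008540074765
coef_year = 2916.7853268380386

def predict_price(size, year):
    """Evaluate the fitted regression equation by hand (hypothetical helper)."""
    return intercept + coef_size * size + coef_year * year

price_750_2009 = predict_price(750, 2009)
print(round(price_750_2009, 2))  # 258330.34
```

Because the helper only evaluates the linear equation, it should agree with `lm.predict` for any size and year passed in the same column order.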

In [31]:
from sklearn.feature_selection import f_regression

In [34]:
pvalues=f_regression(X,Y.values.ravel())[1].round(3)  # pass y as a 1-D array to avoid the DataConversionWarning


In [43]:
summary=pd.DataFrame(data=X.columns.values,columns=["Features"])
summary["Coefficient"]=lm.coef_[0]
summary["p_values"]=pvalues

In [44]:
summary

,Features,Coefficient,p_values
0,size,227.700854,0.0
1,year,2916.785327,0.357
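The p-values from `f_regression` are univariate: each feature is tested in its own simple regression against the target, ignoring the other feature. That is why they can differ from the `P>|t|` column of a statsmodels summary, which tests each coefficient with the other predictor held fixed. The sketch below checks this reading on synthetic data (an assumption, since the CSV is not reproduced here) by comparing `f_regression` with one simple regression per feature via `scipy.stats.linregress`.

```python
import numpy as np
from scipy.stats import linregress
from sklearn.feature_selection import f_regression

# Synthetic data (assumed values, for illustration only)
rng = np.random.default_rng(1)
n = 100
size = rng.uniform(480, 1850, n)
year = rng.choice([2006, 2009, 2015, 2018], n).astype(float)
price = 200 * size + 3000 * (year - 2006) + rng.normal(0, 30000, n)

features = np.column_stack([size, year])

# Univariate p-values, computed the same way as in the notebook
p_f_regression = f_regression(features, price)[1]

# One simple regression per feature, each ignoring the other feature
p_simple = np.array(
    [linregress(features[:, j], price).pvalue for j in range(features.shape[1])]
)

print(p_f_regression)
print(p_simple)
```

The two arrays should match up to floating-point error, which suggests that a large univariate p-value says only that the feature explains little of the target on its own, not that it is useless alongside other predictors.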


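The univariate p-value of 'year' (0.357) indicates that 'year' on its own is not a significant predictor of 'price', which would argue for eliminating it. Note, however, that in the statsmodels summary above its multivariate p-value is 0.000, so 'year' is significant once 'size' is in the model; on that evidence, eliminating it is not clearly justified.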