# Multiple Linear Regression with sklearn - Exercise Solution

You are given a real estate dataset. 

Real estate is an example that almost every regression course covers, because it is easy to understand and there is almost always a clear causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. 

Apart from that, please:
-  Display the intercept and coefficient(s)
-  Find the R-squared and Adjusted R-squared
-  Compare the R-squared and the Adjusted R-squared
-  Compare the R-squared of this regression and the simple linear regression where only 'size' was used
-  Using the model, make a prediction about an apartment with a size of 750 sq.ft. from 2009
-  Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
-  Create a summary table with your findings

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

Good luck!

## Import the relevant libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression


## Load the data

In [3]:
df = pd.read_csv('real_estate_price_size_year.csv', sep=',')

In [4]:
df.head()

| | price | size | year |
|---|---|---|---|
| 0 | 234314.144 | 643.09 | 2015 |
| 1 | 228581.528 | 656.22 | 2009 |
| 2 | 281626.336 | 487.29 | 2018 |
| 3 | 401255.608 | 1504.75 | 2015 |
| 4 | 458674.256 | 1275.46 | 2009 |


In [5]:
df.describe()

| | price | size | year |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 292289.47016 | 853.0242 | 2012.6 |
| std | 77051.727525 | 297.941951 | 4.729021 |
| min | 154282.128 | 479.75 | 2006.0 |
| 25% | 234280.148 | 643.33 | 2009.0 |
| 50% | 280590.716 | 696.405 | 2015.0 |
| 75% | 335723.696 | 1029.3225 | 2018.0 |
| max | 500681.128 | 1842.51 | 2018.0 |


## Create the regression

### Declare the dependent and the independent variables

In [6]:
x = df[['size','year']]
y = df.price

### Regression

In [7]:
reg = LinearRegression()
reg.fit(x,y)

LinearRegression()

### Find the intercept

In [8]:
reg.intercept_

-5772267.017463281

### Find the coefficients

In [9]:
reg.coef_

array([ 227.70085401, 2916.78532684])
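
As a cross-check on what `reg.fit` computes, the same kind of ordinary least squares fit can be sketched directly in NumPy. The data below is synthetic (it is not the real estate file) and is built from known coefficients so the result can be verified; the names `X_demo`, `design` and `beta` are illustrative assumptions.

```python
import numpy as np

# Synthetic data, NOT the real estate dataset: y = 10 + 2*x1 + 3*x2 exactly
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(50, 2))
y_demo = 10 + 2 * X_demo[:, 0] + 3 * X_demo[:, 1]

# Prepend a column of ones so the first fitted parameter is the intercept
design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Solve the least-squares problem: minimise ||design @ beta - y_demo||^2
beta, *_ = np.linalg.lstsq(design, y_demo, rcond=None)

demo_intercept, demo_coefs = beta[0], beta[1:]
print(demo_intercept, demo_coefs)
```

Because the synthetic target has no noise, the recovered intercept and coefficients should match 10, 2 and 3 up to floating-point error.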

### Calculate the R-squared

In [10]:
r2 = reg.score(x,y)
r2

0.7764803683276791

### Calculate the Adjusted R-squared

In [11]:
x.shape

(100, 2)

In [12]:
n = x.shape[0]
p = x.shape[1]

In [13]:
r2_adj = 1- (1-r2)*(n-1)/(n-p-1)
r2_adj

0.7718717161282498

### Compare the R-squared and the Adjusted R-squared

The R-squared (0.776) is only slightly larger than the Adjusted R-squared (0.772), implying that we were not penalized much for including 2 independent variables.

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

In [14]:
import statsmodels.api as sm
x1 = df['size']
y1 = df.price

In [15]:
x1 = sm.add_constant(x1)
results = sm.OLS(y1,x1).fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.745 |
| Model: | OLS | Adj. R-squared: | 0.742 |
| Method: | Least Squares | F-statistic: | 285.9 |
| Date: | Sat, 11 Nov 2023 | Prob (F-statistic): | 8.13e-31 |
| Time: | 07:56:50 | Log-Likelihood: | -1198.3 |
| No. Observations: | 100 | AIC: | 2401.0 |
| Df Residuals: | 98 | BIC: | 2406.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.019e+05 | 1.19e+04 | 8.550 | 0.000 | 7.83e+04 | 1.26e+05 |
| size | 223.1787 | 13.199 | 16.909 | 0.000 | 196.986 | 249.371 |

| | | | |
|---|---|---|---|
| Omnibus: | 6.262 | Durbin-Watson: | 2.267 |
| Prob(Omnibus): | 0.044 | Jarque-Bera (JB): | 2.938 |
| Skew: | 0.117 | Prob(JB): | 0.23 |
| Kurtosis: | 2.194 | Cond. No. | 2750.0 |


The variable 'year' doesn't add much value to the multiple linear regression model. Compared to the simple regression model without 'year', the R-squared rises only from 0.745 to 0.776, and the Adjusted R-squared from 0.742 to 0.772.

### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [16]:
np.array([750,2009])


array([ 750, 2009])

In [17]:
reg.predict([[750,2009]])



array([258330.34465995])
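
The prediction is just the fitted linear equation evaluated at the new point: intercept + coef_size × size + coef_year × year. A minimal sketch that reproduces it by hand, using the intercept and coefficients printed earlier (the function name `predict_price` is hypothetical):

```python
# Values copied from reg.intercept_ and reg.coef_ as printed above
intercept = -5772267.017463281
coef_size, coef_year = 227.70085401, 2916.78532684

def predict_price(size, year):
    """Evaluate the fitted linear equation for one apartment."""
    return intercept + coef_size * size + coef_year * year

manual_prediction = predict_price(750, 2009)
print(manual_prediction)
```

The result agrees with `reg.predict` up to the rounding of the printed coefficients.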

### Calculate the univariate p-values of the variables

In [18]:
from sklearn.feature_selection import f_regression

In [19]:
f_regression(x,y)

(array([285.92105192,   0.85525799]), array([8.12763222e-31, 3.57340758e-01]))

In [20]:
p_values = f_regression(x,y)[1]
p_values

array([8.12763222e-31, 3.57340758e-01])

In [21]:
p_values.round(3)

array([0.   , 0.357])
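
These are univariate tests: each feature is evaluated on its own, as if in a separate simple regression. With its default settings, `f_regression` is documented as deriving each F-statistic from the correlation r between the feature and the target, F = r² / (1 - r²) × (n - 2). As a rough consistency check, plugging the rounded R-squared of 0.745 from the 'size'-only regression into this formula gives about 286, in line with the 285.92 above. The sketch below checks the identity on synthetic data (not the real estate file) against the classical regression F-statistic; all variable names are illustrative.

```python
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(42)
n_obs = 100
x_toy = rng.normal(size=n_obs)
y_toy = 3.0 * x_toy + rng.normal(size=n_obs)

# F-statistic from the Pearson correlation
r = np.corrcoef(x_toy, y_toy)[0, 1]
f_from_corr = r**2 / (1 - r**2) * (n_obs - 2)

# Cross-check: F from the sums of squares of a simple linear regression
slope, const = np.polyfit(x_toy, y_toy, 1)
y_hat = slope * x_toy + const
ss_reg = np.sum((y_hat - y_toy.mean()) ** 2)   # explained variation
ss_res = np.sum((y_toy - y_hat) ** 2)          # unexplained variation
f_from_anova = ss_reg / (ss_res / (n_obs - 2))

print(f_from_corr, f_from_anova)
```

The two numbers should agree up to floating-point error, which is why a feature's univariate p-value matches the p-value of the F-test in its own simple regression.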

### Create a summary table with your findings

In [23]:
reg_summary = pd.DataFrame(data = x.columns.values, columns = ['Feature'])
reg_summary

| | Feature |
|---|---|
| 0 | size |
| 1 | year |


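The univariate p-value of 'size' is practically 0, so 'size' is a significant predictor of price. The p-value of 'year' is 0.357, well above the usual 0.05 threshold, so on its own 'year' is not a significant predictor.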

In [24]:
reg_summary['coef'] = reg.coef_
reg_summary['p_value'] = p_values.round(3)
reg_summary

| | Feature | coef | p_value |
|---|---|---|---|
| 0 | size | 227.700854 | 0.0 |
| 1 | year | 2916.785327 | 0.357 |
