# Multiple Linear Regression with sklearn - Exercise Solution

You are given a real estate dataset. 

Real estate is an example that almost every regression course covers, because it is easy to understand and there is almost always a clear causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. 

Apart from that, please:
- Display the intercept and coefficient(s)
- Find the R-squared and the Adjusted R-squared
- Compare the R-squared and the Adjusted R-squared
- Compare the R-squared of this regression with that of the simple linear regression where only 'size' was used
- Using the model, make a prediction about an apartment with a size of 750 sq.ft. from 2009
- Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
- Create a summary table with your findings

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

Good luck!

## Import the relevant libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

sns.set_theme()

## Load the data

In [2]:
df = pd.read_csv("data/raw/real_estate_price_size_year.csv")
df.describe()

| | price | size | year |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 292289.47016 | 853.0242 | 2012.6 |
| std | 77051.727525 | 297.941951 | 4.729021 |
| min | 154282.128 | 479.75 | 2006.0 |
| 25% | 234280.148 | 643.33 | 2009.0 |
| 50% | 280590.716 | 696.405 | 2015.0 |
| 75% | 335723.696 | 1029.3225 | 2018.0 |
| max | 500681.128 | 1842.51 | 2018.0 |


## Create the regression

### Declare the dependent and the independent variables

In [3]:
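# Select columns by name so the selection stays correct
# even if more columns (e.g. predictions) are added to df later
y = df["price"]
X = df[["size", "year"]]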

### Regression

In [5]:
reg = LinearRegression()
reg.fit(X=X, y=y)

| Parameter | Value |
|---|---|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |


### Find the intercept

In [6]:
reg.intercept_

np.float64(-5772267.017463277)

### Find the coefficients

In [7]:
reg.coef_

array([ 227.70085401, 2916.78532684])
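`reg.coef_` returns the coefficients in the same order as the columns of `X`, so it helps to label them. The sketch below copies the two values from the output above and assumes the column order is `size`, then `year`; in the notebook itself, `pd.Series(reg.coef_, index=X.columns)` would give the same result without hard-coding anything.

```python
import pandas as pd

# Values copied from the reg.coef_ output above.
# Assumption: X has the columns "size" and "year", in that order.
feature_names = ["size", "year"]
coefficients = [227.70085401, 2916.78532684]

# Label each coefficient with the feature it belongs to.
coef_table = pd.Series(coefficients, index=feature_names, name="coefficient")
print(coef_table)
```

Read this way, each additional square foot is associated with a predicted price that is about 227.70 higher, holding the year fixed, and each additional year with about 2916.79 more, holding the size fixed.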

### Calculate the R-squared

In [21]:
r_squared = reg.score(X, y)
r_squared

0.7764803683276793

### Calculate the Adjusted R-squared

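$$R^2_{adj.} = 1 - (1-R^2) \cdot \frac{n-1}{n-p-1}$$

Here $n$ is the number of observations and $p$ is the number of predictors (in this exercise, $n = 100$ and $p = 2$).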

In [24]:
r_adjusted = 1 - (1 - r_squared) * (X.shape[0] - 1)/ (X.shape[0] - X.shape[1] - 1)
r_adjusted

0.77187171612825
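The same calculation can be wrapped in a small reusable function. This is a minimal sketch: the name `adjusted_r2` is a hypothetical helper, not part of sklearn, and the inputs are the R-squared printed above together with `n = 100` observations and `p = 2` predictors.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for a model with n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# R-squared taken from the output above; n and p as in this exercise.
r2_adj = adjusted_r2(0.7764803683276793, n=100, p=2)
print(r2_adj)
```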

### Compare the R-squared and the Adjusted R-squared

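R-squared: 0.7764803683276793 vs. Adjusted R-squared: 0.77187171612825. The adjusted value is only about 0.005 lower, so the penalty for using two predictors on 100 observations is small.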

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

Adjusted R-squared (computed above): 0.77187171612825 vs. Adj. R-squared: 0.772
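To compare against the simple regression directly, one option is to fit a second model on `size` alone and score both on the same data. The sketch below uses a made-up synthetic dataset (`demo`) purely so that it runs on its own; the generated numbers are an assumption and say nothing about the real file. In the notebook, `demo` would be replaced by `df`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the real data (assumption, for illustration only).
rng = np.random.default_rng(0)
n = 100
demo = pd.DataFrame({
    "size": rng.uniform(480, 1850, n),
    "year": rng.integers(2006, 2019, n),
})
demo["price"] = (
    200 * demo["size"]
    + 3000 * (demo["year"] - 2006)
    + rng.normal(0, 30000, n)
)


def r2_for(features):
    """Fit an OLS model on the given feature columns and return its R-squared."""
    X = demo[features]
    y = demo["price"]
    return LinearRegression().fit(X, y).score(X, y)


r2_simple = r2_for(["size"])            # simple regression: size only
r2_multiple = r2_for(["size", "year"])  # multiple regression: size and year
print(f"simple: {r2_simple:.3f}, multiple: {r2_multiple:.3f}")
```

Because the simple model is nested in the multiple one, the in-sample R-squared of the multiple regression can never be lower. That is why the Adjusted R-squared is the fairer number to compare when the models have different numbers of predictors.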

### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [36]:
df["prediction"] = reg.predict(X)
print(df.head(10))
reg.predict(pd.DataFrame(columns=["size", "year"], data=[[750, 2009]]))

        price     size  year     prediction
0  234314.144   643.09  2015  251487.558319
1  228581.528   656.22  2009  236976.558571
2  281626.336   487.29  2018  224762.121245
3  401255.608  1504.75  2015  447688.276183
4  458674.256  1275.46  2009  377978.035407
5  245050.280   575.19  2006  209775.602390
6  265129.064   570.89  2015  235047.556660
7  175716.480   620.82  2006  220165.592359
8  331101.344   682.26  2018  269156.956751
9  218630.608   694.52  2009  245697.501280


array([258330.34465995])
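As a sanity check, the prediction can be reproduced by hand from the fitted equation, price = intercept + b_size * size + b_year * year. The sketch below plugs in the intercept and coefficients printed earlier; the variable names are chosen here for illustration.

```python
# Intercept and coefficients copied from the outputs above.
intercept = -5772267.017463277
coef_size = 227.70085401
coef_year = 2916.78532684

# Apartment to predict: 750 sq.ft., built in 2009.
size, year = 750, 2009

predicted_price = intercept + coef_size * size + coef_year * year
print(round(predicted_price, 2))
```

The result agrees with `reg.predict` up to the rounding of the printed coefficients.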

### Calculate the univariate p-values of the variables
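One way to obtain univariate p-values with sklearn is `sklearn.feature_selection.f_regression`, which tests each feature against the target separately, as if it were the only predictor. The sketch below runs on made-up synthetic data (an assumption, so that it is self-contained); in the notebook, the call would be `f_regression(X, y)` on the real `X` and `y`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

# Synthetic stand-in for the real data (assumption, for illustration only).
rng = np.random.default_rng(0)
n = 100
X_demo = pd.DataFrame({
    "size": rng.uniform(480, 1850, n),
    "year": rng.integers(2006, 2019, n),
})
y_demo = (
    200 * X_demo["size"]
    + 3000 * (X_demo["year"] - 2006)
    + rng.normal(0, 30000, n)
)

# f_regression returns one F-statistic and one p-value per feature.
f_stats, p_values = f_regression(X_demo, y_demo)

p_table = pd.DataFrame({
    "feature": X_demo.columns,
    "F": f_stats,
    "p_value": p_values.round(3),
})
print(p_table)
```

These are univariate p-values, so they can differ from the multivariate p-values that an OLS summary (for example from statsmodels) reports for the joint model.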

### Create a summary table with your findings

Answer...
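A summary table could be assembled along the following lines. This is only a sketch: it collects the numbers already printed in this notebook and leaves out the p-values, since they are not computed above. In the notebook, a `p_value` column could be added once the p-values are available.

```python
import pandas as pd

# Intercept and coefficients copied from the outputs above.
summary = pd.DataFrame({
    "feature": ["intercept", "size", "year"],
    "coefficient": [-5772267.017463277, 227.70085401, 2916.78532684],
})

# Model-level figures copied from the outputs above.
metrics = pd.Series({
    "R-squared": 0.7764803683276793,
    "Adjusted R-squared": 0.77187171612825,
    "Prediction (750 sq.ft., 2009)": 258330.34465995,
})

print(summary)
print(metrics)
```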