# Multiple Linear Regression with sklearn - Exercise Solution

You are given a real estate dataset. 

Real estate is a staple example in regression courses because it is easy to understand and there is (almost always) a causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. 

Apart from that, please:
-  Display the intercept and coefficient(s)
-  Find the R-squared and Adjusted R-squared
-  Compare the R-squared and the Adjusted R-squared
-  Compare the R-squared of this regression and the simple linear regression where only 'size' was used
-  Using the model, make a prediction about an apartment with a size of 750 sq.ft. from 2009
-  Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
-  Create a summary table with your findings

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

Good luck!

## Import the relevant libraries

In [9]:
# For these lessons we will need NumPy, pandas, matplotlib and seaborn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# LinearRegression is the scikit-learn estimator used to fit the model
from sklearn.linear_model import LinearRegression

## Load the data

In [3]:
data = pd.read_csv('real_estate_price_size_year.csv')

In [4]:
# Display the DataFrame; call data.describe() (with parentheses) for summary statistics
data

         price     size  year
0   234314.144   643.09  2015
1   228581.528   656.22  2009
2   281626.336   487.29  2018
3   401255.608  1504.75  2015
4   458674.256  1275.46  2009
..         ...      ...   ...
95  252460.400   549.80  2009
96  310522.592  1037.44  2009
97  383635.568  1504.75  2006
98  225145.248   648.29  2015
99  274922.856   705.29  2006

[100 rows x 3 columns]

## Create the regression

### Declare the dependent and the independent variables

In [7]:
x = data[['size','year']]
y = data['price']

### Regression

In [10]:
reg = LinearRegression()
reg.fit(x,y)

### Find the intercept

In [11]:
reg.intercept_

-5772267.017463279

### Find the coefficients

In [12]:
reg.coef_

array([ 227.70085401, 2916.78532684])
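
Read together, the intercept and the two coefficients above describe the fitted plane. Written out with the values rounded to two decimals (the rounding is only for readability), the model is roughly:

```latex
\widehat{\text{price}} \approx -5772267.02 + 227.70 \cdot \text{size} + 2916.79 \cdot \text{year}
```

Each coefficient is the change in predicted price for a one-unit increase in that feature with the other held fixed: about 227.70 per additional sq.ft. and about 2916.79 per later year.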

### Calculate the R-squared

In [13]:
reg.score(x,y)

0.7764803683276795

### Calculate the Adjusted R-squared

In [17]:
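r2 = reg.score(x,y)
# n is the number of observations, p the number of predictors
n = x.shape[0]
p = x.shape[1]

# Adjusted R-squared penalizes R-squared for each additional predictor
adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adjusted_r2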

0.7718717161282502
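
The same formula can be wrapped in a small reusable function. This is a minimal sketch: the name `adjusted_r_squared` and the guard against too few observations are illustrative choices, not part of the original solution. The inputs are the R-squared, row count and feature count reported in this notebook.

```python
def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n observations and p predictors."""
    if n - p - 1 <= 0:
        # The formula divides by (n - p - 1), so it is undefined here
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Values reported in this notebook: R-squared, 100 rows, 2 features
result = adjusted_r_squared(0.7764803683276795, n=100, p=2)
print(round(result, 4))
```

Because it takes plain numbers, the helper works with any fitted model: pass `reg.score(x, y)`, `x.shape[0]` and `x.shape[1]`.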

### Compare the R-squared and the Adjusted R-squared

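Answer: 0.776 > 0.772, so the adjusted R-squared is slightly smaller than the R-squared. The gap is small, which indicates that the model is only lightly penalized for its number of predictors.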

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

Answer... 
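
One way to approach this comparison without refitting is to recover the simple regression's R-squared from the univariate F-statistic for 'size' that `f_regression` reports in the p-values section of this notebook (285.92). The sketch below assumes `f_regression`'s default centering, in which case the F-statistic corresponds to a one-feature regression with an intercept. The more direct route would be to fit `LinearRegression` on `data[['size']]` and call `score`.

```python
# Univariate F-statistic for 'size' and the sample size, both taken from this notebook
f_size = 285.92105192
n = 100

# For a one-feature regression with an intercept, F = r2 / (1 - r2) * (n - 2),
# so r2 can be recovered as F / (F + n - 2)
r2_simple = f_size / (f_size + n - 2)

# Adjusted R-squared of the multiple regression, as computed earlier in this notebook
adjusted_r2_multiple = 0.7718717161282502

print(round(r2_simple, 3))
print(adjusted_r2_multiple > r2_simple)
```

Under that assumption the simple model's R-squared comes out near 0.745, below the adjusted R-squared of 0.772. This suggests that adding 'year' improves the fit even after the penalty for the extra predictor.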

### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [20]:
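# Pass a DataFrame with the training column names so the input matches the fitted features
reg.predict(pd.DataFrame([[750, 2009]], columns=['size', 'year']))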



array([258330.34465995])
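
As a sanity check, the prediction can be reproduced by hand from the intercept and coefficients reported earlier. The numbers below are copied from the notebook output, so the last few digits may differ slightly from `reg.predict` because the displayed coefficients are rounded.

```python
# Intercept and coefficients as printed by reg.intercept_ and reg.coef_
intercept = -5772267.017463279
coef_size, coef_year = 227.70085401, 2916.78532684

# The apartment from the exercise: 750 sq.ft., year 2009
size, year = 750, 2009

# A linear model's prediction is the intercept plus the weighted sum of the features
predicted_price = intercept + coef_size * size + coef_year * year
print(round(predicted_price, 2))
```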

### Calculate the univariate p-values of the variables

In [21]:
from sklearn.feature_selection import f_regression

In [24]:
f_regression(x,y)

(array([285.92105192,   0.85525799]), array([8.12763222e-31, 3.57340758e-01]))

In [29]:
p_values = f_regression(x,y)[1]
p_values

array([8.12763222e-31, 3.57340758e-01])

In [30]:
p_values.round(3)

array([0.   , 0.357])
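
To read these p-values, a common convention is to compare each one against a significance level such as 0.05. The exercise does not fix a threshold, so the `alpha` below is an assumption, and the dictionary simply restates the values printed above.

```python
alpha = 0.05  # assumed significance level, not stated in the exercise

# Univariate p-values copied from the f_regression output in this notebook
univariate_p = {"size": 8.12763222e-31, "year": 3.57340758e-01}

# A feature is flagged as significant when its p-value falls below alpha
verdicts = {name: p < alpha for name, p in univariate_p.items()}

for name, p in univariate_p.items():
    label = "significant" if verdicts[name] else "not significant"
    print(f"{name}: p = {p:.3g} -> {label} at alpha = {alpha}")
```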

### Create a summary table with your findings

In [33]:
reg_summary=pd.DataFrame(data = x.columns.values, columns = ['Features'])
reg_summary['Coefficients']= reg.coef_
reg_summary['p-values']= p_values.round(3)
reg_summary

  Features  Coefficients  p-values
0     size    227.700854     0.000
1     year   2916.785327     0.357


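Answer: 'size' has a univariate p-value of 0.000 and is a significant predictor of price. 'year' has a univariate p-value of 0.357, so on its own it is not significant at conventional levels. Because these p-values consider each feature in isolation, they do not reflect the contribution of 'year' alongside 'size' in the multiple regression.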