# Feature scaling with sklearn - Exercise

You are given a real estate dataset. 

Real estate is a staple example in regression courses: it is easy to understand, and there is almost always a clear causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. This exercise is very similar to a previous one. This time, however, **please standardize the data**.

Apart from that, please:
-  Display the intercept and coefficient(s)
-  Find the R-squared and Adjusted R-squared
-  Compare the R-squared and the Adjusted R-squared
-  Compare the R-squared of this regression and the simple linear regression where only 'size' was used
-  Using the model, predict the price of an apartment with a size of 750 sq.ft. from 2009
-  Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
-  Create a summary table with your findings

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

Good luck!

## Import the relevant libraries

In [1]:
import pandas as pd 
import numpy as np 
from sklearn.linear_model import LinearRegression as lr
from sklearn.feature_selection import f_regression as fs
from sklearn.preprocessing import StandardScaler as ss

## Load the data

In [2]:
data = pd.read_csv('real_estate_price_size_year.csv')

In [3]:
data

| | price | size | year |
|---|---|---|---|
| 0 | 234314.144 | 643.09 | 2015 |
| 1 | 228581.528 | 656.22 | 2009 |
| 2 | 281626.336 | 487.29 | 2018 |
| 3 | 401255.608 | 1504.75 | 2015 |
| 4 | 458674.256 | 1275.46 | 2009 |
| ... | ... | ... | ... |
| 95 | 252460.400 | 549.80 | 2009 |
| 96 | 310522.592 | 1037.44 | 2009 |
| 97 | 383635.568 | 1504.75 | 2006 |
| 98 | 225145.248 | 648.29 | 2015 |


## Create the regression

### Declare the dependent and the independent variables

In [4]:
y = data['price']
x = data[['size', 'year']]

### Scale the inputs

In [5]:
scaler = ss()
scaler.fit(x)
x_scalled = scaler.transform(x)
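
As a rough illustration of what the scaling step does, the sketch below standardizes the first five rows shown above by hand with NumPy. It assumes the usual definition (subtract the column mean, divide by the population standard deviation, i.e. `ddof=0`), which is also what `StandardScaler` uses by default. The names `X`, `mean`, `std` and `X_std` are illustrative only.

```python
import numpy as np

# First five ('size', 'year') rows from the table above
X = np.array([
    [643.09, 2015.0],
    [656.22, 2009.0],
    [487.29, 2018.0],
    [1504.75, 2015.0],
    [1275.46, 2009.0],
])

mean = X.mean(axis=0)   # per-column mean
std = X.std(axis=0)     # per-column population standard deviation (ddof=0)

# Standardize: every column ends up with mean 0 and standard deviation 1
X_std = (X - mean) / std

print(X_std.mean(axis=0))
print(X_std.std(axis=0))
```

After this transformation both features are on the same scale, so their regression coefficients can be compared directly.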

### Regression

In [6]:
reg = lr()
reg.fit(x_scalled, y)

LinearRegression()

### Find the intercept

In [7]:
reg.intercept_

292289.4701599997

### Find the coefficients

In [8]:
reg.coef_

array([67501.57614152, 13724.39708231])
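
Because the inputs were standardized, the coefficients above are expressed per standard deviation of each feature, and the intercept of an ordinary least squares fit equals the mean of the target. The sketch below shows one way to convert such coefficients back to the original units. It uses synthetic stand-in data (not the real dataset) and plain NumPy least squares; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for 'size', 'year' and 'price' (NOT the real dataset)
size = rng.uniform(450, 1500, 100)
year = rng.choice([2006.0, 2009.0, 2012.0, 2015.0, 2018.0], 100)
X = np.column_stack([size, year])
y = 100_000 + 220 * size + 2_000 * (year - 2006) + rng.normal(0, 20_000, 100)

# Standardize the inputs
mean, scale = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / scale

# Ordinary least squares with an intercept on the standardized inputs
A = np.column_stack([np.ones(len(y)), X_scaled])
beta = np.linalg.lstsq(A, y, rcond=None)[0]
intercept_scaled, coef_scaled = beta[0], beta[1:]

# Convert back to the original units of each feature
coef_original = coef_scaled / scale
intercept_original = intercept_scaled - np.sum(coef_scaled * mean / scale)

# Both parameterizations describe the same fitted model
pred_scaled = intercept_scaled + X_scaled @ coef_scaled
pred_original = intercept_original + X @ coef_original
```

Read this way, the intercept of 292289.47 reported above would be the average price in the data, and each coefficient the change in price for a one-standard-deviation increase in that feature.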

### Calculate the R-squared

In [9]:
reg.score(x_scalled, y)

0.7764803683276793

### Calculate the Adjusted R-squared

In [10]:
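r = reg.score(x_scalled, y)  # R-squared
n = x_scalled.shape[0]       # number of observations
p = x_scalled.shape[1]       # number of predictors

In [11]:
# Adjusted R-squared penalizes the R-squared for the number of predictors
adj_r = 1 - (1 - r) * ((n - 1) / (n - p - 1))
adj_r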

0.77187171612825
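
The formula used above can be wrapped in a small helper for reuse. The sketch below is a plain-Python version. The function name `adjusted_r2` is hypothetical, and `n = 100` is an assumption: it is the sample size that reproduces the value reported above from the reported R-squared with two predictors.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


r2 = 0.7764803683276793           # R-squared reported above
adj = adjusted_r2(r2, n=100, p=2)  # n=100 is assumed, p=2 ('size', 'year')
print(adj)
```

The adjustment factor `(n - 1) / (n - p - 1)` is always greater than 1 when `p > 0`, so the adjusted value is never above the plain R-squared for a model with predictors.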

### Compare the R-squared and the Adjusted R-squared

Answer...

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

Answer...

### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [12]:
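# Use the same column names as the training inputs ('size', 'year')
# so that the fitted scaler accepts the new observation
new_data = pd.DataFrame(data = [750], columns = ['size'])
new_data['year'] = 2009
new_data

| | size | year |
|---|---|---|
| 0 | 750 | 2009 |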


In [13]:
# Reuse the scaler already fitted on the training inputs; do not refit it here
scaled_data = scaler.transform(new_data)
reg.predict(scaled_data)

array([258330.34465995])
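
New observations must be scaled with the mean and standard deviation learned from the training inputs. One way to make this automatic is scikit-learn's `make_pipeline`, sketched below on synthetic stand-in data (not the real dataset). The variable names are illustrative, and this is an alternative arrangement, not the approach the exercise prescribes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for ('size', 'year') and 'price' (NOT the real dataset)
X = np.column_stack([
    rng.uniform(450, 1500, 100),
    rng.choice([2006.0, 2009.0, 2012.0, 2015.0, 2018.0], 100),
])
y = 100_000 + 220 * X[:, 0] + 2_000 * (X[:, 1] - 2006) + rng.normal(0, 20_000, 100)

# The pipeline standardizes the inputs, then fits the regression
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X, y)

# Raw (unscaled) observation: 750 sq.ft., year 2009
new_obs = np.array([[750.0, 2009.0]])
pipeline_pred = model.predict(new_obs)

# Equivalent manual route: scale with the training statistics, then predict
scaler = StandardScaler().fit(X)
reg = LinearRegression().fit(scaler.transform(X), y)
manual_pred = reg.predict(scaler.transform(new_obs))
```

With the pipeline, the raw observation is passed directly to `predict`, which removes the risk of forgetting the scaling step or refitting the scaler on new data.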

### Calculate the univariate p-values of the variables

In [14]:
p_val = fs(x_scalled,y)[1].round(3)
p_val

array([0.   , 0.357])
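
"Univariate" here means that each feature is tested on its own, as if it were the only predictor. As a minimal sketch of the idea, the F-statistic of a one-feature regression can be computed from the correlation between the feature and the target. The example below uses synthetic data and hypothetical helper names, and only computes the F-statistics; the p-value would then come from the F distribution with 1 and n - 2 degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)             # informative feature (synthetic)
x2 = rng.normal(size=n)             # unrelated feature (synthetic)
y = 3.0 * x1 + rng.normal(size=n)   # target depends on x1 only


def univariate_f(x, y):
    """F-statistic of regressing y on x alone, via the correlation."""
    r = np.corrcoef(x, y)[0, 1]
    return r**2 / (1 - r**2) * (len(y) - 2)


def ols_f(x, y):
    """The same statistic from the sums of squares of a one-feature fit."""
    slope, intercept = np.polyfit(x, y, 1)
    sse = np.sum((y - (intercept + slope * x)) ** 2)   # residual sum of squares
    ssr = np.sum((y - y.mean()) ** 2) - sse            # explained sum of squares
    return ssr / (sse / (len(y) - 2))


f_stats = [univariate_f(x1, y), univariate_f(x2, y)]
print(f_stats)
```

Since each test ignores the other features, a univariate p-value can differ from the one the same variable would get inside the multiple regression.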

### Create a summary table with your findings

In [15]:
findings = pd.DataFrame(data = ['size', 'year'], columns = ['Features'])
findings['Coefficient'] = reg.coef_
findings['P Values'] = p_val
findings

| | Features | Coefficient | P Values |
|---|---|---|---|
| 0 | size | 67501.576142 | 0.0 |
| 1 | year | 13724.397082 | 0.357 |


Answer...