# Feature scaling with sklearn - Exercise

You are given a real estate dataset. 

Real estate is an example that almost every regression course covers, because it is easy to understand and there is almost always a clear causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. This exercise is very similar to a previous one. This time, however, **please standardize the data**.

Apart from that, please:
- Display the intercept and coefficient(s)
- Find the R-squared and Adjusted R-squared
- Compare the R-squared and the Adjusted R-squared
- Compare the R-squared of this regression with that of the simple linear regression where only 'size' was used
- Using the model, make a prediction about an apartment with size 750 sq.ft. from 2009
- Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
- Create a summary table with your findings

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

Good luck!

## Import the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import seaborn as sns
sns.set()

## Load the data

In [2]:
data = pd.read_csv('real_estate_price_size_year.csv')
data.head()

| | price | size | year |
|---|---|---|---|
| 0 | 234314.144 | 643.09 | 2015 |
| 1 | 228581.528 | 656.22 | 2009 |
| 2 | 281626.336 | 487.29 | 2018 |
| 3 | 401255.608 | 1504.75 | 2015 |
| 4 | 458674.256 | 1275.46 | 2009 |


In [3]:
data.describe()

| | price | size | year |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 292289.47016 | 853.0242 | 2012.6 |
| std | 77051.727525 | 297.941951 | 4.729021 |
| min | 154282.128 | 479.75 | 2006.0 |
| 25% | 234280.148 | 643.33 | 2009.0 |
| 50% | 280590.716 | 696.405 | 2015.0 |
| 75% | 335723.696 | 1029.3225 | 2018.0 |
| max | 500681.128 | 1842.51 | 2018.0 |


## Create the regression

### Declare the dependent and the independent variables

In [4]:
x = data[['size', 'year']]
y = data['price']

### Scale the inputs

In [9]:
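from sklearn.preprocessing import StandardScaler

# StandardScaler is unsupervised: fit() only needs the inputs, so y is not passed
scaler = StandardScaler()
scaler.fit(x)
# std_x is a NumPy array holding the standardized 'size' and 'year' columns
std_x = scaler.transform(x)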
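
To make the scaling step concrete, here is a minimal sketch of what `StandardScaler` computes. The array `toy_x` is a hypothetical stand-in for the real inputs, not data from the CSV file. Each column has its mean subtracted and is divided by its population standard deviation (`ddof=0`), which is slightly smaller than the `std` reported by `data.describe()` (pandas uses `ddof=1`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy inputs (size in sq.ft., year); not taken from the CSV file.
toy_x = np.array([[650.0, 2009.0],
                  [700.0, 2015.0],
                  [1200.0, 2018.0],
                  [1500.0, 2006.0]])

toy_scaler = StandardScaler().fit(toy_x)
scaled = toy_scaler.transform(toy_x)

# Manual standardization: subtract the column mean and divide by the
# population standard deviation (ddof=0).
manual = (toy_x - toy_x.mean(axis=0)) / toy_x.std(axis=0, ddof=0)

print(np.allclose(scaled, manual))   # both routes give the same array
print(toy_scaler.mean_, toy_scaler.scale_)
```

The fitted `mean_` and `scale_` attributes are what later allow new observations to be transformed in exactly the same way as the training inputs.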

### Regression

In [10]:
lm = LinearRegression()
lm.fit(std_x, y)

### Find the intercept

In [11]:
lm.intercept_

292289.4701599997

### Find the coefficients

In [12]:
lm.coef_

array([67501.57614152, 13724.39708231])
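
Because the inputs are standardized, each coefficient is the change in price per one standard deviation of the feature, and the intercept equals the mean price (292289.47, matching `data.describe()`). If per-unit effects are wanted, the weights can be converted back. The sketch below assumes synthetic data (`toy_x`, `toy_y` are hypothetical, generated only for illustration) and checks the conversion against a regression on the unscaled inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): size in sq.ft. and year, plus a noisy price.
rng = np.random.default_rng(0)
toy_size = rng.uniform(450, 1900, size=50)
toy_year = rng.integers(2006, 2019, size=50).astype(float)
toy_x = np.column_stack([toy_size, toy_year])
toy_y = 200.0 * toy_size + 3000.0 * toy_year - 5_900_000.0 + rng.normal(0, 1000, size=50)

toy_scaler = StandardScaler().fit(toy_x)
toy_lm = LinearRegression().fit(toy_scaler.transform(toy_x), toy_y)

# Per-unit slopes: divide each standardized weight by the feature's scale.
raw_coef = toy_lm.coef_ / toy_scaler.scale_
# Per-unit intercept: undo the centering of every feature.
raw_intercept = toy_lm.intercept_ - np.sum(toy_lm.coef_ * toy_scaler.mean_ / toy_scaler.scale_)

# Cross-check against a model fitted directly on the unscaled inputs.
direct = LinearRegression().fit(toy_x, toy_y)
print(np.allclose(raw_coef, direct.coef_), np.isclose(raw_intercept, direct.intercept_))
```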

### Calculate the R-squared

In [14]:
r2 = lm.score(std_x, y)
r2

0.7764803683276793

### Calculate the Adjusted R-squared

In [15]:
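# n = number of observations, p = number of predictors
n = x.shape[0]
p = x.shape[1]
# Adjusted R-squared penalizes R-squared for the number of predictors used
adjr2 = 1-(1-r2)*((n-1)/(n-p-1))
adjr2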

0.77187171612825
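
The same calculation can be wrapped in a small helper so it is easy to reuse for other models. This is a sketch; the function name `adjusted_r2` and its parameter names are illustrative choices, and the inputs below are the R-squared, sample size, and number of predictors reported above.

```python
def adjusted_r2(r2: float, n_obs: int, n_features: int) -> float:
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)


# R-squared from lm.score above, with n = 100 observations and p = 2 predictors.
value = adjusted_r2(0.7764803683276793, n_obs=100, n_features=2)
print(round(value, 6))  # 0.771872
```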

### Compare the R-squared and the Adjusted R-squared

Answer...

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

Answer...

### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [47]:
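# For the exercise question itself, the query would be [[750, 2009]].
# The rows below try the years 2018 and 1990 instead; 1990 lies outside the
# 2006-2018 range seen in data.describe(), so that prediction is an extrapolation.
new_data = pd.DataFrame(data=[[750, 2018], [750, 1990]], columns=['size', 'year'])
print(new_data.head())
# Apply the scaler fitted on the training inputs before predicting
std_new_data = scaler.transform(new_data)
lm.predict(std_new_data).round(3)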

   size  year
0   750  2018
1   750  1990


array([284581.413, 202911.423])
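
New observations must always pass through the same fitted scaler before prediction. One way to make that hard to forget is to chain the scaler and the regression in a pipeline. The sketch below uses synthetic data (the `toy` frame and its coefficients are assumptions made for illustration, not the CSV file) and queries a 750 sq.ft. apartment from 2009; the number it prints is not the answer for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real estate data (assumption, for illustration only).
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'size': rng.uniform(450, 1900, size=60),
    'year': rng.choice([2006, 2009, 2015, 2018], size=60),
})
toy['price'] = (220.0 * toy['size'] + 2900.0 * toy['year']
                - 5_700_000.0 + rng.normal(0, 5000, size=60))

# The pipeline standardizes the inputs and then fits the regression.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(toy[['size', 'year']], toy['price'])

# The query uses the same column names as the training inputs.
query = pd.DataFrame({'size': [750], 'year': [2009]})
pipeline_pred = model.predict(query)

# Same prediction done by hand: scale with the fitted scaler, then predict.
manual_scaler = StandardScaler().fit(toy[['size', 'year']])
manual_lm = LinearRegression().fit(manual_scaler.transform(toy[['size', 'year']]), toy['price'])
manual_pred = manual_lm.predict(manual_scaler.transform(query))

print(np.allclose(pipeline_pred, manual_pred))
```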

### Calculate the univariate p-values of the variables

In [33]:
from sklearn.feature_selection import f_regression

In [35]:
fst = f_regression(x, y)
fst

(array([285.92105192,   0.85525799]), array([8.12763222e-31, 3.57340758e-01]))
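
Note that `f_regression` was called on the unscaled `x`. That is harmless here, because the univariate F-test depends only on the correlation between each feature and the target, which does not change under standardization. A minimal sketch on synthetic data (the arrays below are hypothetical) illustrates this:

```python
import numpy as np
from sklearn.feature_selection import f_regression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): price depends on size, not on year.
rng = np.random.default_rng(2)
toy_x = np.column_stack([rng.uniform(450, 1900, 80),
                         rng.integers(2006, 2019, 80).astype(float)])
toy_y = 220.0 * toy_x[:, 0] + rng.normal(0, 40000, 80)

# F-statistics and p-values on raw and on standardized inputs.
f_raw, p_raw = f_regression(toy_x, toy_y)
f_std, p_std = f_regression(StandardScaler().fit_transform(toy_x), toy_y)

print(np.allclose(f_raw, f_std), np.allclose(p_raw, p_std))
```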

In [38]:
pvals = fst[1]
pvals.round(3)

array([0.   , 0.357])
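
One possible layout for the summary table requested in the task list is sketched below. The numbers are copied from the outputs above; the column names `feature`, `weight`, and `p_value` are illustrative choices. The weights refer to the standardized inputs, and the p-values are the univariate ones from `f_regression`.

```python
import pandas as pd

# Values copied from lm.intercept_, lm.coef_ and the f_regression output above.
summary = pd.DataFrame({
    'feature': ['intercept', 'size', 'year'],
    'weight': [292289.4701599997, 67501.57614152, 13724.39708231],
    # No p-value is computed for the intercept, hence NaN.
    'p_value': [float('nan'), 8.12763222e-31, 3.57340758e-01],
})
summary['p_value'] = summary['p_value'].round(3)
print(summary)
```

Read this way, 'size' is clearly significant on its own, while the univariate p-value of 'year' (0.357) is well above the usual 0.05 threshold.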

### Put the standardized values back into a DataFrame

In [68]:
formula_data = std_x.copy()
formula_data = np.hstack((y.values.reshape(-1,1), formula_data))
formula_df = pd.DataFrame(data=formula_data, columns=['price', 'size', 'year'])
formula_df

| | price | size | year |
|---|---|---|---|
| 0 | 234314.144 | -0.708164 | 0.510061 |
| 1 | 228581.528 | -0.663873 | -0.765092 |
| 2 | 281626.336 | -1.233719 | 1.147638 |
| 3 | 401255.608 | 2.198445 | 0.510061 |
| 4 | 458674.256 | 1.424989 | -0.765092 |
| ... | ... | ... | ... |
| 95 | 252460.400 | -1.022856 | -0.765092 |
| 96 | 310522.592 | 0.622084 | -0.765092 |
| 97 | 383635.568 | 2.198445 | -1.402669 |
| 98 | 225145.248 | -0.690623 | 0.510061 |


### Use _patsy_ to perform formula-based linear regression

In [76]:
import patsy as pt

new_y, new_x = pt.dmatrices('price ~ size + year + size:year', data=formula_df)

In [71]:
# sklearn expects the design matrix first and the target second.
# new_x already contains patsy's Intercept column, which is redundant here
# because LinearRegression fits its own intercept.
new_lm = LinearRegression()
new_lm.fit(new_x, np.ravel(new_y))

In [77]:
new_lm.coef_
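
For numeric columns, the `size:year` term in the formula is the element-wise product of the two columns. The sketch below builds that design matrix by hand, without patsy, on synthetic standardized inputs (all names and the generating coefficients are assumptions made for illustration). It also shows the argument order sklearn expects: features first, target second.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic standardized inputs and a price with a known interaction effect (assumption).
rng = np.random.default_rng(3)
z_size = rng.normal(size=100)
z_year = rng.normal(size=100)
toy_price = (290000 + 67000 * z_size + 13000 * z_year
             - 5000 * z_size * z_year + rng.normal(0, 1000, 100))

# Columns: size, year, and the size:year interaction (element-wise product).
design = np.column_stack([z_size, z_year, z_size * z_year])

# fit(X, y): the design matrix comes first, the target second.
inter_lm = LinearRegression().fit(design, toy_price)
print(inter_lm.intercept_.round(0), inter_lm.coef_.round(0))
```

With the arguments in this order, `coef_` has one entry per design column, so each weight can be read off against its feature directly.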