# Feature scaling with sklearn - Exercise Solution

You are given a real estate dataset.

The data is located in the file: 'real_estate_price_size_year.csv'. 

Create a multiple linear regression using **standardized data**, then:
-  Display the intercept and coefficient(s)
-  Find the R-squared and Adjusted R-squared
-  Compare the R-squared and the Adjusted R-squared
-  Compare the R-squared in Linear Regression with Multiple Linear Regression 
-  Using the model, predict the price of an apartment with a size of 800 sq.ft. from 2009
-  Find the univariate p-values of the two variables. What can you say about them?
-  Create a summary table with your findings

The dependent variable is 'price', while the independent variables are 'size' and 'year'.


## Import the relevant libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.linear_model import LinearRegression

## Load the data

In [None]:
data = pd.read_csv('real_estate_price_size_year.csv')
data.head()

| | price | size | year |
|---|---|---|---|
| 0 | 234314.144 | 643.09 | 2015 |
| 1 | 228581.528 | 656.22 | 2009 |
| 2 | 281626.336 | 487.29 | 2018 |
| 3 | 401255.608 | 1504.75 | 2015 |
| 4 | 458674.256 | 1275.46 | 2009 |


In [None]:
data.describe()

| | price | size | year |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 292289.47016 | 853.0242 | 2012.6 |
| std | 77051.727525 | 297.941951 | 4.729021 |
| min | 154282.128 | 479.75 | 2006.0 |
| 25% | 234280.148 | 643.33 | 2009.0 |
| 50% | 280590.716 | 696.405 | 2015.0 |
| 75% | 335723.696 | 1029.3225 | 2018.0 |
| max | 500681.128 | 1842.51 | 2018.0 |


## Create the regression

### Declare the dependent and the independent variables

In [None]:
x = data[['size','year']]
y = data['price']

### Scale the inputs

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(x)
x_scaled = scaler.transform(x)
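
As a rough illustration of what the scaling step does, the sketch below compares `StandardScaler` with a manual computation: subtract each column's mean and divide by its population standard deviation (`ddof=0`). The array `x_demo` is hypothetical toy data standing in for the 'size' and 'year' columns, not the real dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy inputs standing in for the 'size' and 'year' columns
x_demo = np.array([[650.0, 2009.0],
                   [800.0, 2015.0],
                   [1200.0, 2018.0],
                   [1500.0, 2006.0]])

demo_scaler = StandardScaler().fit(x_demo)
x_demo_scaled = demo_scaler.transform(x_demo)

# Manual standardization: (value - column mean) / column std, with ddof=0
x_manual = (x_demo - x_demo.mean(axis=0)) / x_demo.std(axis=0)

# Both approaches should agree up to floating-point error
print(np.allclose(x_demo_scaled, x_manual))

# After scaling, each column has mean ~0 and standard deviation ~1
print(x_demo_scaled.mean(axis=0).round(6), x_demo_scaled.std(axis=0).round(6))
```

Note that `data.describe()` reports the sample standard deviation (`ddof=1`), so its `std` row will differ slightly from the scaler's `scale_` attribute.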

### Regression

In [None]:
reg = LinearRegression()
reg.fit(x_scaled,y)

LinearRegression()

### Find the intercept

In [None]:
reg.intercept_

292289.4701599997

### Find the coefficients

In [None]:
reg.coef_

array([67501.57614152, 13724.39708231])
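
Two things are worth noting about these numbers. The intercept matches the mean of 'price' reported by `data.describe()`, which is expected when every input has been centered. The coefficients are expressed per standard deviation of each feature, not per sq.ft. or per year. The sketch below checks both points on synthetic data; the generated `x_demo` and `y_demo` values are assumptions made up for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the real estate dataset)
rng = np.random.default_rng(0)
size_demo = rng.uniform(500, 1800, 50)
year_demo = rng.integers(2006, 2019, 50).astype(float)
x_demo = np.column_stack([size_demo, year_demo])
y_demo = 100000 + 220 * size_demo + 2500 * (year_demo - 2006) + rng.normal(0, 20000, 50)

demo_scaler = StandardScaler().fit(x_demo)
model_scaled = LinearRegression().fit(demo_scaler.transform(x_demo), y_demo)
model_raw = LinearRegression().fit(x_demo, y_demo)

# With centered inputs, the intercept equals the mean of the target
print(np.isclose(model_scaled.intercept_, y_demo.mean()))

# Dividing by the scaler's per-column std recovers the per-unit coefficients
coef_per_unit = model_scaled.coef_ / demo_scaler.scale_
print(np.allclose(coef_per_unit, model_raw.coef_))
```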

### Calculate the R-squared

In [None]:
reg.score(x_scaled,y)

0.7764803683276793

### Calculate the Adjusted R-squared

In [None]:
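# Adjusted R-squared penalizes the R-squared for the number of predictors
# (this function relies on the fitted 'reg' model defined above)
def adj_r2(x,y):
    r2 = reg.score(x,y)
    n = x.shape[0]  # number of observations
    p = x.shape[1]  # number of predictors
    adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
    return adjusted_r2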

In [None]:
adj_r2(x_scaled,y)

0.77187171612825
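
The same number can be checked by hand from the values already shown: R-squared of 0.7764803683276793, n = 100 observations and p = 2 predictors. The helper below is a hypothetical standalone variant of the formula that takes the R-squared directly and does not rely on a fitted model.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared from a plain R-squared, n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values taken from the outputs above
r2 = 0.7764803683276793
value = adjusted_r2(r2, n=100, p=2)

# The penalty factor is (n - 1) / (n - p - 1) = 99 / 97, which is close to 1
print(round(value, 6))  # 0.771872
```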

### Compare the R-squared and the Adjusted R-squared

The R-squared is only slightly larger than the Adjusted R-squared, implying that there was not much of a penalty for including 2 independent variables.

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

Comparing the Adjusted R-squared with the R-squared of the simple linear regression (when only 'size' was used), we see that 'year' does not add much value to the result.
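
This kind of comparison can be sketched as follows. The data here is synthetic and only assumed to resemble the exercise (price driven mainly by size, weakly by year), so the printed numbers will not match the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: price depends strongly on size, weakly on year
rng = np.random.default_rng(42)
n_obs = 100
size_demo = rng.uniform(480, 1850, n_obs)
year_demo = rng.integers(2006, 2019, n_obs).astype(float)
price_demo = (100000 + 225 * size_demo + 1000 * (year_demo - 2006)
              + rng.normal(0, 35000, n_obs))

def adjusted_r2(r2, n, p):
    # Penalize R-squared for the number of predictors p
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

x_simple = size_demo.reshape(-1, 1)                  # 'size' only
x_multi = np.column_stack([size_demo, year_demo])    # 'size' and 'year'

r2_simple = LinearRegression().fit(x_simple, price_demo).score(x_simple, price_demo)
r2_multi = LinearRegression().fit(x_multi, price_demo).score(x_multi, price_demo)

# In-sample R-squared never decreases when a predictor is added,
# so the adjusted value is the fairer one to compare against the simple model
print('simple R2:        ', round(r2_simple, 4))
print('multiple R2:      ', round(r2_multi, 4))
print('multiple adj. R2: ', round(adjusted_r2(r2_multi, n_obs, 2), 4))
```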

### Making predictions

Find the predicted price of an apartment that has a size of 800 sq.ft. from 2009.

In [None]:
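# Use the same column names the scaler was fitted on to avoid a feature-name warning
new_data = pd.DataFrame([[800,2009]], columns=['size','year'])
new_data_scaled = scaler.transform(new_data)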



In [None]:
reg.predict(new_data_scaled)

array([269715.38736032])
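
New observations must go through the same fitted scaler before prediction. One way to make this harder to forget is to bundle the scaler and the regression with scikit-learn's `make_pipeline`. The sketch below uses synthetic data as a stand-in (an assumption, not the real dataset) and checks that the pipeline gives the same prediction as the two manual steps.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the real estate dataset)
rng = np.random.default_rng(1)
x_demo = np.column_stack([rng.uniform(500, 1800, 60),
                          rng.integers(2006, 2019, 60).astype(float)])
y_demo = 100000 + 220 * x_demo[:, 0] + 2500 * (x_demo[:, 1] - 2006) + rng.normal(0, 20000, 60)

new_obs = np.array([[800.0, 2009.0]])

# Manual route: scale, fit, then scale the new observation with the same scaler
demo_scaler = StandardScaler().fit(x_demo)
demo_reg = LinearRegression().fit(demo_scaler.transform(x_demo), y_demo)
pred_manual = demo_reg.predict(demo_scaler.transform(new_obs))

# Pipeline route: scaling is applied automatically inside fit and predict
pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(x_demo, y_demo)
pred_pipe = pipe.predict(new_obs)

print(np.allclose(pred_manual, pred_pipe))
```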

### Calculate the univariate p-values of the variables

In [None]:
from sklearn.feature_selection import f_regression

In [None]:
f_regression(x_scaled,y)

(array([285.92105192,   0.85525799]), array([8.12763222e-31, 3.57340758e-01]))

In [None]:
# Scaling does not change the univariate p-values, so passing x would give the same result
p_values = f_regression(x_scaled,y)[1]
p_values

array([8.12763222e-31, 3.57340758e-01])

In [None]:
p_values.round(3)

array([0.   , 0.357])
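
Standardizing a feature does not change its correlation with the target, so the univariate F-statistics and p-values from `f_regression` should be the same for raw and scaled inputs. The sketch below checks this on synthetic data, which is assumed for illustration and is not the real dataset.

```python
import numpy as np
from sklearn.feature_selection import f_regression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: first feature is informative, second is mostly noise
rng = np.random.default_rng(7)
x_demo = np.column_stack([rng.uniform(500, 1800, 80),
                          rng.integers(2006, 2019, 80).astype(float)])
y_demo = 100000 + 220 * x_demo[:, 0] + rng.normal(0, 30000, 80)

f_raw, p_raw = f_regression(x_demo, y_demo)
f_scaled, p_scaled = f_regression(StandardScaler().fit_transform(x_demo), y_demo)

# Each feature is tested on its own against the target, ignoring the other feature
print(np.allclose(f_raw, f_scaled), np.allclose(p_raw, p_scaled))
```

Because each p-value here comes from a separate one-feature regression, it can differ from the p-value the same feature would get inside the full multiple regression.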

### Create a summary table with your findings

In [None]:
reg_summary = pd.DataFrame(data = x.columns.values, columns=['Features'])
reg_summary['Coefficients'] = reg.coef_
reg_summary['p-values'] = p_values.round(3)
reg_summary

| | Features | Coefficients | p-values |
|---|---|---|---|
| 0 | size | 67501.576142 | 0.0 |
| 1 | year | 13724.397082 | 0.357 |


It seems that 'year' is not even significant (univariate p-value of 0.357), therefore we should remove it from the model.

Note that this dataset is extremely clean and probably artificially created, therefore standardization does not really bring any value to it.