# Feature scaling with sklearn - Exercise

You are given a real estate dataset. 

Real estate is one of those examples that every regression course goes through: it is extremely easy to understand, and there is (almost always) a certain causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. This exercise is very similar to a previous one. This time, however, **please standardize the data**.

Apart from that, please:
-  Display the intercept and coefficient(s)
-  Find the R-squared and Adjusted R-squared
-  Compare the R-squared and the Adjusted R-squared
-  Compare the R-squared of this regression and the simple linear regression where only 'size' was used
-  Using the model, make a prediction about an apartment with a size of 750 sq.ft. from 2009
-  Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
-  Create a summary table with your findings

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

Good luck!

## Import the relevant libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import f_regression


## Load the data

In [2]:
data=pd.read_csv("real_estate_price_size_year.csv")
data.head()

Unnamed: 0,price,size,year
0,234314.144,643.09,2015
1,228581.528,656.22,2009
2,281626.336,487.29,2018
3,401255.608,1504.75,2015
4,458674.256,1275.46,2009


## Create the regression

### Declare the dependent and the independent variables

In [3]:
x=data[["size","year"]]
y=data["price"]

### Scale the inputs

In [4]:
scaler=StandardScaler()
scaler.fit(x)
x_scaled=scaler.transform(x)
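
To make the effect of this step concrete, here is a minimal sketch on a small made-up array (the values and the names `x_toy`, `toy_scaler` are illustrative assumptions, not part of the dataset). It checks that `StandardScaler` agrees with the manual formula (x - mean) / std, where std is the population standard deviation (`ddof=0`), and that each scaled column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy inputs: the columns play the role of 'size' and 'year'
x_toy = np.array([[650.0, 2009.0],
                  [1500.0, 2015.0],
                  [490.0, 2018.0],
                  [1275.0, 2006.0]])

toy_scaler = StandardScaler()
x_toy_scaled = toy_scaler.fit_transform(x_toy)

# Manual standardization with the population standard deviation (ddof=0)
x_toy_manual = (x_toy - x_toy.mean(axis=0)) / x_toy.std(axis=0)

# Both checks should print True
print(np.allclose(x_toy_scaled, x_toy_manual))
print(np.allclose(x_toy_scaled.mean(axis=0), 0), np.allclose(x_toy_scaled.std(axis=0), 1))
```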

### Regression

In [5]:
reg=LinearRegression()
reg.fit(x_scaled,y)

LinearRegression()

### Find the intercept

In [6]:
reg.intercept_

292289.4701599997

### Find the coefficients

In [7]:
reg.coef_

array([67501.57614152, 13724.39708231])
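
Because the inputs were standardized, each coefficient is the change in price per one standard deviation of the feature, and the intercept equals the mean of `price`. If coefficients in the original units are needed, they can be recovered from the scaler's `scale_` and `mean_` attributes. The sketch below illustrates this on synthetic data (the data and the names `x_demo`, `y_demo`, `demo_scaler` are assumptions for illustration only) and compares the result with a regression fitted directly on the raw inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data (assumption): two features on very different scales
x_demo = np.column_stack([rng.uniform(400, 1600, 50), rng.integers(2006, 2019, 50)])
y_demo = 100_000 + 200 * x_demo[:, 0] + 3_000 * (x_demo[:, 1] - 2006) + rng.normal(0, 10_000, 50)

demo_scaler = StandardScaler().fit(x_demo)
reg_scaled = LinearRegression().fit(demo_scaler.transform(x_demo), y_demo)

# Undo the scaling: divide each coefficient by its feature's standard deviation
coef_original = reg_scaled.coef_ / demo_scaler.scale_
# Shift the intercept back by the contribution of the feature means
intercept_original = reg_scaled.intercept_ - np.sum(coef_original * demo_scaler.mean_)

# Reference fit on the raw, unscaled inputs
reg_raw = LinearRegression().fit(x_demo, y_demo)
print(np.allclose(coef_original, reg_raw.coef_))
print(np.isclose(intercept_original, reg_raw.intercept_))
```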

### Calculate the R-squared

In [9]:
reg.score(x_scaled,y)

0.7764803683276793

### Calculate the Adjusted R-squared

In [15]:
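r2=reg.score(x_scaled,y)
n=x_scaled.shape[0]  # number of observations
p=x_scaled.shape[1]  # number of predictors
# Note: despite its name, r2_squared holds the Adjusted R-squared
r2_squared=1-(1-r2)*(n-1)/(n-p-1)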


In [16]:
r2_squared

0.77187171612825
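
The calculation follows the formula $R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}$, where $n$ is the number of observations and $p$ the number of predictors. As a small sketch, the same computation can be wrapped in a reusable function (the name `adjusted_r2` is hypothetical). The value `n=100` is an assumption inferred from the outputs above, since the exercise text does not state the number of rows.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n=100 is inferred from the outputs above (assumption); p=2 for 'size' and 'year'
adj = adjusted_r2(0.7764803683276793, n=100, p=2)
print(round(adj, 6))
```

The penalty grows with the number of predictors, so the Adjusted R-squared is always at or below the R-squared when `p >= 1` and `n > p + 1`.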

### Compare the R-squared and the Adjusted R-squared

In [17]:

r2-r2_squared

0.004608652199429297

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

Answer...

### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [27]:
new_data=pd.DataFrame([[750,2009]],columns=["size","year"])
new_data

Unnamed: 0,size,year
0,750,2009


In [28]:
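# Caution: the model was fit on standardized inputs, so predicting on raw values gives a meaningless result
reg.predict(new_data)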



array([78490785.31465863])

In [29]:
# Standardize the new observation with the scaler that was fitted on the training inputs
new_data_scaled=scaler.transform(new_data)

In [30]:
reg.predict(new_data_scaled)

array([258330.34465995])
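
Forgetting to scale new inputs before predicting is an easy mistake to make. One way to guard against it, sketched here as an alternative rather than as the approach used in this exercise, is to chain the scaler and the regression with `make_pipeline` from `sklearn.pipeline`, so that raw inputs are standardized automatically at prediction time. The data below is synthetic (an assumption standing in for the CSV), and the names `demo`, `model`, and `apartment` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-in for the real estate data (assumption, not the CSV)
demo = pd.DataFrame({"size": rng.uniform(400, 1600, 60),
                     "year": rng.integers(2006, 2019, 60)})
demo["price"] = 100_000 + 200 * demo["size"] + rng.normal(0, 20_000, 60)

features = demo[["size", "year"]]
target = demo["price"]

# The pipeline standardizes the inputs and then fits the regression
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(features, target)

# Raw, unscaled inputs can be passed directly; scaling happens inside the pipeline
apartment = pd.DataFrame([[750, 2009]], columns=["size", "year"])
pipeline_prediction = model.predict(apartment)

# Same result as scaling by hand before predicting
manual_scaler = StandardScaler().fit(features)
manual_reg = LinearRegression().fit(manual_scaler.transform(features), target)
manual_prediction = manual_reg.predict(manual_scaler.transform(apartment))
print(np.allclose(pipeline_prediction, manual_prediction))
```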

### Calculate the univariate p-values of the variables

In [22]:
f_regression(x_scaled,y)

(array([285.92105192,   0.85525799]), array([8.12763222e-31, 3.57340758e-01]))

In [32]:
p_values=f_regression(x_scaled,y)[1]
p_values.round(3)

array([0.   , 0.357])
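
`f_regression` tests each feature on its own, as if it were the only predictor in a simple linear regression, so these p-values ignore the other variable. They also do not depend on standardization, because the underlying correlation is unaffected by rescaling a feature. The sketch below, on synthetic data (an assumption, with illustrative names such as `x_sim` and `y_sim`), reproduces the F-statistics from each feature's correlation with the target using F = r² / (1 - r²) · (n - 2).

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(2)
n_obs = 100
# Synthetic data (assumption): the first feature drives y, the second is pure noise
x_sim = rng.normal(size=(n_obs, 2))
y_sim = 3.0 * x_sim[:, 0] + rng.normal(scale=2.0, size=n_obs)

f_stats, p_vals = f_regression(x_sim, y_sim)

# Correlation of each feature with the target, computed one feature at a time
r = np.array([np.corrcoef(x_sim[:, j], y_sim)[0, 1] for j in range(x_sim.shape[1])])

# F-statistic of a simple regression with 1 and n - 2 degrees of freedom
f_manual = r**2 / (1 - r**2) * (n_obs - 2)
p_manual = stats.f.sf(f_manual, 1, n_obs - 2)

print(np.allclose(f_stats, f_manual), np.allclose(p_vals, p_manual))
```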

### Create a summary table with your findings

Answer...

In [63]:
columns=["value","bias","coefficient","p_value"]
index=["size","year"]
val1=[750,reg.intercept_.round(2),reg.coef_[0].round(2),p_values.round(3)[0]]
val2=[2009,reg.intercept_.round(2),reg.coef_[1].round(2),p_values.round(3)[1]]
summary_table=pd.DataFrame(data=[val1,val2],columns=columns,index=index)
print(summary_table)
print()
print(f"The predicted price for the given size and year is {reg.predict(new_data_scaled).round(2)} Dollars")
print("Year is not a significant variable due to its p_value")


      value       bias  coefficient  p_value
size    750  292289.47     67501.58    0.000
year   2009  292289.47     13724.40    0.357

The predicted price for the given size and year is [258330.34] Dollars
Year is not a significant variable due to its p_value
