# Feature scaling with sklearn - Exercise

You are given a real estate dataset. 

Real estate is a staple example in regression courses because it is easy to understand and there is almost always a clear causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. This exercise is very similar to a previous one. This time, however, **please standardize the data**.

Apart from that, please:
-  Display the intercept and coefficient(s)
-  Find the R-squared and Adjusted R-squared
-  Compare the R-squared and the Adjusted R-squared
-  Compare the R-squared of this regression and the simple linear regression where only 'size' was used
-  Using the model, make a prediction about an apartment with size 750 sq.ft. from 2009
-  Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
-  Create a summary table with your findings

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

Good luck!

## Import the relevant libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

## Load the data

In [2]:
df=pd.read_csv('datasets/real_estate_price_size_year.csv')

In [3]:
df.sample(5)

| | price | size | year |
|---|---|---|---|
| 34 | 285223.176 | 857.54 | 2018 |
| 85 | 376253.808 | 1009.25 | 2018 |
| 20 | 268125.08 | 620.71 | 2015 |
| 11 | 494778.992 | 1842.51 | 2009 |
| 40 | 310045.712 | 1021.95 | 2015 |


## Create the regression

### Declare the dependent and the independent variables

In [4]:
x=df[['size','year']]
y=df['price']

### Scale the inputs

In [5]:
scaler=StandardScaler()
scaler.fit(x)
x_scaled=scaler.transform(x)

In [7]:
x_scaled[0:5]

array([[-0.70816415,  0.51006137],
       [-0.66387316, -0.76509206],
       [-1.23371919,  1.14763808],
       [ 2.19844528,  0.51006137],
       [ 1.42498884, -0.76509206]])
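
To make the scaling step concrete, here is a minimal sketch of what `StandardScaler` computes: each column has its mean subtracted and is divided by its population standard deviation (`ddof=0`). The array `x_demo` and the other `*_demo` names are made-up illustration values, not rows from the exercise dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up sizes and years, used only to illustrate the arithmetic.
x_demo = np.array([[650.0, 2006.0],
                   [800.0, 2009.0],
                   [1100.0, 2015.0],
                   [1500.0, 2018.0]])

# Fit the scaler and transform, as in the cell above.
demo_scaler = StandardScaler().fit(x_demo)
x_demo_scaled = demo_scaler.transform(x_demo)

# The same result by hand: subtract the column mean, divide by the
# population standard deviation (NumPy's default, ddof=0).
x_demo_manual = (x_demo - x_demo.mean(axis=0)) / x_demo.std(axis=0)

print(np.allclose(x_demo_scaled, x_demo_manual))
```

After scaling, every column has mean 0 and standard deviation 1, which is why the two inputs end up on comparable scales.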

### Regression

In [8]:
reg=LinearRegression()
reg=reg.fit(x_scaled,y)

In [9]:
reg

### Find the intercept

In [10]:
reg.intercept_

292289.4701599997

### Find the coefficients

In [11]:
reg.coef_

array([67501.57614152, 13724.39708231])
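
Because the inputs were standardized, each coefficient is the expected change in price per one standard deviation of that feature, and the intercept equals the mean of the target. The sketch below, on synthetic data (all `*_demo` names and the data-generating numbers are assumptions for illustration), shows one way to convert standardized coefficients back to original units using the scaler's `scale_` and `mean_` attributes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data that only mimics the shape of the exercise dataset.
rng = np.random.default_rng(0)
size = rng.uniform(500, 2000, 50)
year = rng.integers(2006, 2019, 50).astype(float)
x_demo = np.column_stack([size, year])
y_demo = 100000 + 200 * size + 3000 * (year - 2006) + rng.normal(0, 20000, 50)

# Regression on standardized inputs, as in the notebook.
demo_scaler = StandardScaler().fit(x_demo)
demo_reg = LinearRegression().fit(demo_scaler.transform(x_demo), y_demo)

# Per-unit coefficient = standardized coefficient / feature standard deviation.
coef_original = demo_reg.coef_ / demo_scaler.scale_
# Undo the centering to recover the intercept in original units.
intercept_original = demo_reg.intercept_ - np.sum(coef_original * demo_scaler.mean_)

# Reference: the same regression fitted directly on the unscaled inputs.
raw_reg = LinearRegression().fit(x_demo, y_demo)
print(np.allclose(coef_original, raw_reg.coef_))
```

Standardizing changes how the coefficients are expressed, not the fitted model, so predictions and R-squared are the same either way.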

### Calculate the R-squared

In [12]:
reg.score(x_scaled,y)

0.7764803683276793

### Calculate the Adjusted R-squared

In [13]:
def Adjusted_RSquared(x,y,reg):
    n=x.shape[0]  # number of observations
    p=x.shape[1]  # number of predictors
    r2=reg.score(x,y)
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    penalty=(n-1)/(n-p-1)

    return 1-(1-r2)*penalty

In [14]:
print(Adjusted_RSquared(x_scaled,y,reg))

0.77187171612825
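
As a quick check of the formula, the arithmetic can be worked on small round numbers. The helper `adjusted_r2` and the inputs below are hypothetical and chosen only so the result is easy to verify by hand; they are not values from the exercise.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Made-up inputs: R^2 = 0.75 with 11 observations and 2 predictors.
# (1 - 0.75) * (11 - 1) / (11 - 2 - 1) = 0.25 * 10 / 8 = 0.3125
example = adjusted_r2(r2=0.75, n=11, p=2)
print(example)  # 0.6875
```

The factor `(n - 1) / (n - p - 1)` exceeds 1 whenever there is at least one predictor, so the adjusted value is never above the plain R-squared, and the gap shrinks as the number of observations grows.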


### Compare the R-squared and the Adjusted R-squared

Answer...

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

Answer...

### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [33]:
new_dataframe=pd.DataFrame([[750,2009]],columns=['size','year'])
new_dataframe

| | size | year |
|---|---|---|
| 0 | 750 | 2009 |


In [34]:
new_dataframe['Predicted_Price']=reg.predict(scaler.transform(new_dataframe))

In [35]:
new_dataframe

| | size | year | Predicted_Price |
|---|---|---|---|
| 0 | 750 | 2009 | 258330.34466 |
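
New observations must be transformed with the scaler that was fitted on the training inputs before they are passed to `predict`. One optional way to avoid forgetting this step is scikit-learn's `make_pipeline`, which chains the scaler and the regression. The sketch below uses synthetic training data (an assumption for illustration, not the exercise dataset) and checks that both approaches give the same prediction.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data that only mimics the exercise columns.
rng = np.random.default_rng(1)
train = pd.DataFrame({'size': rng.uniform(500, 2000, 40),
                      'year': rng.integers(2006, 2019, 40)})
price = (100000 + 200 * train['size'] + 3000 * (train['year'] - 2006)
         + rng.normal(0, 20000, 40))

# Two-step approach, as in the notebook: scale, then regress.
demo_scaler = StandardScaler().fit(train)
demo_reg = LinearRegression().fit(demo_scaler.transform(train), price)

# Pipeline approach: scaling is applied automatically inside fit and predict.
model = make_pipeline(StandardScaler(), LinearRegression()).fit(train, price)

new_home = pd.DataFrame([[750, 2009]], columns=['size', 'year'])
pred_two_step = demo_reg.predict(demo_scaler.transform(new_home))
pred_pipeline = model.predict(new_home)
print(np.allclose(pred_two_step, pred_pipeline))
```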


### Calculate the univariate p-values of the variables

In [36]:
from sklearn.feature_selection import f_regression

In [38]:
f_reg=f_regression(x_scaled,y)
f_reg

(array([285.92105192,   0.85525799]), array([8.12763222e-31, 3.57340758e-01]))

In [40]:
p_values=f_reg[1]

In [41]:
x_features=['size','year']
for feature,p_value in zip(x_features,p_values.round(4)):
    print(f'{feature} : {p_value}')

size : 0.0
year : 0.3573
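
The p-values from `f_regression` are univariate: each feature is tested against the target on its own, ignoring the other feature. A minimal sketch on synthetic data (the data and all names are assumptions for illustration) shows that they match the p-values of the Pearson correlation between each single feature and the target, computed here with `scipy.stats.pearsonr`.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import f_regression

# Synthetic data: the target depends on size but not on year.
rng = np.random.default_rng(2)
n = 60
size = rng.uniform(500, 2000, n)
year = rng.integers(2006, 2019, n).astype(float)
x_demo = np.column_stack([size, year])
y_demo = 100000 + 200 * size + rng.normal(0, 30000, n)

# Univariate F-test for each column, as in the notebook.
f_stats, p_univariate = f_regression(x_demo, y_demo)

# The same test expressed as a correlation test, one feature at a time.
p_pearson = []
for j in range(x_demo.shape[1]):
    _, p_value = pearsonr(x_demo[:, j], y_demo)
    p_pearson.append(p_value)
p_pearson = np.array(p_pearson)

print(np.allclose(p_univariate, p_pearson))
```

Since a correlation does not change when a feature is shifted or rescaled, these univariate p-values are the same for scaled and unscaled inputs.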


### Create a summary table with your findings

In [42]:
reg_summary = pd.DataFrame(data=x.columns.values, columns=['Features'])
reg_summary['Coefficients'] = reg.coef_
reg_summary['p-values'] = p_values.round(3)
reg_summary

| | Features | Coefficients | p-values |
|---|---|---|---|
| 0 | size | 67501.576142 | 0.0 |
| 1 | year | 13724.397082 | 0.357 |


Answer...