# Supervised learning - regression

A real estate agent currently has only single-family houses in his portfolio. He wants to expand his business to apartments, but he lacks the experience to give reliable appraisals. Gaining that experience would take a lot of time, and he has no colleagues to fall back on. He knows we are following a machine learning course and has a brilliant idea: he gives us a dataset with a lot of information on real estate, including the known selling price (tx_price), and asks us to build a real-estate pricing model for apartments.

We already cleaned this dataset in the first class (the data hasn't been standardized yet). Perform a simple linear regression and a polynomial regression to predict the price of apartments.

# 0. Loading packages and dataset

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns



In [4]:
df = pd.read_csv('real_estate_cleaned.csv')

The dataset is already cleaned, but not standardized yet. We will take a quick look at the data to get to know the dataset.

# 1. Take a look at the data
1. Look at the dimensions (number of features and observations)
2. Look at the first 5 rows
3. Look at the different features and their data types
    + What do you notice with regard to the datatypes of the one-hot encoded features?
    + Fix this




1. Dimensions

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 803 entries, 0 to 802
Data columns (total 39 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   tx_price                            803 non-null    float64
 1   beds                                803 non-null    int64  
 2   baths                               803 non-null    int64  
 3   sqft                                803 non-null    float64
 4   year_built                          803 non-null    int64  
 5   lot_size                            803 non-null    float64
 6   restaurants                         803 non-null    float64
 7   groceries                           803 non-null    float64
 8   nightlife                           803 non-null    float64
 9   cafes                               803 non-null    float64
 10  shopping                            803 non-null    float64
 11  arts_entertainment                  803 non-n

2. First rows

In [6]:
df.head(5)

Unnamed: 0,tx_price,beds,baths,sqft,year_built,lot_size,restaurants,groceries,nightlife,cafes,...,exterior_walls_Siding (Alum/Vinyl),exterior_walls_Wood,roof_Asphalt,roof_Composition,roof_Gravel/rock,roof_Missing,roof_Other,roof_Shake shingle,basement_1.0,basement_Missing
0,12.597611,1,1,6.371612,2013,8.388054,4.682131,2.302585,3.433987,2.995732,...,0,1,0,0,0,1,0,0,0,1
1,12.28535,1,1,6.418365,1965,8.388054,4.663439,2.772589,1.94591,2.639057,...,0,0,0,1,0,0,0,0,1,0
2,12.542191,1,1,6.423247,1963,8.388054,5.214936,2.639057,3.465736,3.433987,...,0,1,0,0,0,1,0,0,0,1
3,12.847666,1,1,6.428105,2000,10.420554,5.293305,2.302585,3.663562,3.258097,...,0,1,0,0,0,1,0,0,0,1
4,12.736704,1,1,6.453625,1992,8.388054,5.010635,2.079442,3.135494,3.044522,...,0,0,0,0,0,1,0,0,0,1


3. Data types

In [None]:
...

You should have noticed that the one-hot encoded features do not have the correct data type. Convert them to uint8, using '.astype(np.uint8)'.

In [None]:
onehot_cols = df.columns[25:39]  # assign by label: an .iloc assignment may keep the old dtype
df[onehot_cols] = df[onehot_cols]. ...(np.uint8)

In [None]:
df.info()
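
To see what the conversion does, here is a minimal sketch on a made-up frame. The column names `roof_A` and `roof_B` and all values are hypothetical, and the sketch assumes the one-hot columns start out as 64-bit integers, which may not match the real dataset exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one numeric feature and two one-hot columns
toy = pd.DataFrame({
    "sqft": [6.37, 6.42, 6.45],
    "roof_A": np.array([0, 1, 0], dtype=np.int64),
    "roof_B": np.array([1, 0, 1], dtype=np.int64),
})

onehot_cols = ["roof_A", "roof_B"]

# Assigning by column label replaces the columns, so the new dtype is kept
toy[onehot_cols] = toy[onehot_cols].astype(np.uint8)

# The one-hot columns are now uint8; the numeric feature is untouched
print(toy.dtypes)
```

Storing the dummies as uint8 also makes it easy to separate them from the numerical features later on, for example with `select_dtypes`.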

# 2. Train/test split and standardisation

1. Shuffle your data

2. Make a train/test split
    - Use random_state=123 whenever needed
    - Use a test size of 20%
    
3. Standardize both datasets  
    + Make sure you only standardise the numerical features


1. Shuffle

In [None]:
df_shuffle = df. ...(frac=1, random_state=123)


2. Train/test-split

In [None]:
# Import the function
from sklearn.model_selection import train_test_split

# Split off features and outcome
X = df_shuffle.drop(columns='...')
y = df_shuffle['...']

# Perform train/test-split
X_train, X_test, y_train, y_test = ...(X, y, test_size=..., random_state=...)


3. Standardize

In [None]:
from sklearn.preprocessing import StandardScaler

num_feat = X_train.select_dtypes(include=['int64', 'float64']).columns

scaler = StandardScaler()
scaler. ...(X_train[num_feat])

X_train_stan = X_train.copy()
X_test_stan = X_test.copy()

X_train_stan[num_feat] = scaler. ...(X_train[...])
X_test_stan[...] = ...
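
The scaler is fitted on the training set only, so that no information from the test set leaks into the preprocessing. The following is a minimal sketch on invented numbers (the arrays `train` and `test` are hypothetical stand-ins for one numerical feature), not the solution for the real dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for a numerical feature in the train and test set
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[2.5], [10.0]])

scaler = StandardScaler()
scaler.fit(train)  # learns the mean and standard deviation from the training data only

train_stan = scaler.transform(train)
test_stan = scaler.transform(test)  # reuses the training mean and standard deviation

print(scaler.mean_, scaler.scale_)
print(train_stan.mean(), train_stan.std())
print(test_stan.ravel())
```

The standardized training data has mean 0 and standard deviation 1 by construction. The test data generally does not, and that is expected.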

# 3. Linear regression
1. Train a linear regression model, using the standardized data.
2. Test the trained model on the train and the test set.
    + Predict the price of the apartments
    + Calculate the coefficient of determination
    + Calculate the Mean Absolute Error (MAE)
    + Calculate the Mean Squared Error (MSE)
    + Would you say this model is overfitted, underfitted or neither?



1. Train

In [None]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(..., ...)

2. Evaluating the model
    + make predictions

In [None]:
predictions_train = reg.predict(...)
predictions_test = ...

   + Coefficient of determination

In [None]:
print(reg.score(..., ...))
print(...)

In [None]:
# Alternative: use r2_score
from sklearn.metrics import r2_score
print(...(y_train, predictions_train))
print(...)


+ MAE


In [None]:
from sklearn.metrics import mean_absolute_error
...
...

+ MSE

In [None]:
from sklearn.metrics import ...

...
...
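
The three metrics can be checked against their definitions. Below is a small sketch with invented values (`y_true` and `y_pred` are hypothetical, not output of the model above) that computes each metric by hand and compares it with the scikit-learn function.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical outcomes and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

residuals = y_true - y_pred

# MAE: average absolute residual
mae_manual = np.mean(np.abs(residuals))
# MSE: average squared residual
mse_manual = np.mean(residuals ** 2)
# R^2: 1 - (residual sum of squares / total sum of squares)
r2_manual = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae_manual, mean_absolute_error(y_true, y_pred))
print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

MAE and MSE are expressed in the units of the outcome (and its square), while R^2 is unitless, which makes it easier to compare between train and test set.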

# 4. Polynomial regression
## 4.1 Quadratic model

We will fit a quadratic polynomial regression to see whether it predicts better than the linear model.
1. Design polynomial features with degree 2
    + Don't forget to also transform the test data
    + Check the number of features of the new datasets
2. Fit a linear regression to the polynomial features
    + Use the 'fit_intercept=False' argument (by default, the polynomial features already contain a bias column)
3. Evaluate on the train and test set, using R^2
    + Would you say this model is overfitted, underfitted or neither?



1. Design the features

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly = ...(degree=...)
X_train_poly = ...
X_test_poly = ...


In [None]:
#check the number of features
...

2. Fit the linear regression

In [None]:
# Define the model
reg_quad = LinearRegression(fit_intercept=...)

#Fit the model
...

In [None]:
# Evaluate
...
...

## 4.2 Higher-order polynomial model

1. Do a cross-validation to find the optimal order for the polynomial.
    + Use a pipeline that entails two steps: engineering the polynomial features and fitting the regression
    + Let the degree of the polynomial range from 1 to 5.
    + Print the R^2 for each degree
    + What do you expect to happen? Will increasing the degree of the polynomial solve the overfitting or just make it worse?
    
2. Make a plot of the cross-validation results
    + Did your expectation come true?
    + Overfitted or underfitted?

1. Cross-validation

In [None]:
from sklearn.pipeline import Pipeline

from sklearn.model_selection import cross_val_score

avg_scores = [None] * 5

for i in np.arange(1,6):

    reg_poly = Pipeline(... ,
                        ...)

    scores = cross_val_score(...)

    avg_scores[i-1] = scores.mean()

    print(f"Order {i}: avg R^2 = {avg_scores[i-1]}")
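
For reference, the same idea can be sketched on synthetic data. Everything here is an assumption made for illustration: the data is generated from a noisy quadratic, the step names `poly` and `reg` are arbitrary, and 5-fold cross-validation is one possible choice. It is not the solution for the real estate dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a quadratic relation plus noise
rng = np.random.default_rng(123)
X_syn = rng.uniform(-3, 3, size=(200, 1))
y_syn = 0.5 * X_syn[:, 0] ** 2 + X_syn[:, 0] + rng.normal(scale=0.5, size=200)

cv_scores = {}
for degree in range(1, 5):
    # Step 1 engineers the polynomial features, step 2 fits the regression
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("reg", LinearRegression(fit_intercept=False)),
    ])
    # The features are re-engineered inside every fold
    scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring="r2")
    cv_scores[degree] = scores.mean()
    print(f"Order {degree}: avg R^2 = {cv_scores[degree]:.3f}")
```

On this toy problem the cross-validated R^2 should improve clearly from degree 1 to degree 2, because the data was generated from a quadratic. On a dataset with many features the picture can look very different.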

2. Plot

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))
plt.scatter(np.arange(1,6), avg_scores, c='b', label='data')
plt.axis('tight')
plt.title("Cross-validation polynomials")
ax.set_xlabel("Order");
ax.set_ylabel("CV R^2");
plt.tight_layout()
plt.show()