# Neural Networks: Regression on House Pricing Dataset
We consider a reduced version of a dataset containing house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

https://www.kaggle.com/harlfoxem/housesalesprediction

For each house we know 18 features (e.g., number of bedrooms, number of bathrooms, etc.) plus its price, which is what we would like to predict.

## Insert your ID number ("numero di matricola") below

In [1]:
#put your "numero di matricola" (student ID number) here
numero_di_matricola = 1 # COMPLETE

In [2]:
#import all packages needed
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Load the data, remove the samples with missing values (NaN), and take a look at the summary statistics.

In [3]:
#load the data
df = pd.read_csv('kc_house_data.csv', sep = ',')

#remove the data samples with missing values (NaN)
df = df.dropna() 

df.describe()

statistic,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0
mean,4645240000.0,535435.8,3.381163,2.071903,2070.027813,15250.54,1.434893,0.009798,0.244311,3.459229,7.615676,1761.252212,308.775601,1967.489254,94.668774,98077.125158,47.557868,-122.212337,1982.544564,13176.302465
std,2854203000.0,380900.4,0.895472,0.768212,920.251879,42544.57,0.507792,0.098513,0.776298,0.682592,1.166324,815.934864,458.977904,28.095275,424.439427,54.172937,0.140789,0.139577,686.25667,25413.180755
min,1000102.0,75000.0,0.0,0.0,380.0,649.0,1.0,0.0,0.0,1.0,3.0,380.0,0.0,1900.0,0.0,98001.0,47.1775,-122.514,620.0,660.0
25%,2199775000.0,315000.0,3.0,1.5,1430.0,5453.75,1.0,0.0,0.0,3.0,7.0,1190.0,0.0,1950.0,0.0,98032.0,47.459575,-122.32425,1480.0,5429.5
50%,4027701000.0,445000.0,3.0,2.0,1910.0,8000.0,1.0,0.0,0.0,3.0,7.0,1545.0,0.0,1969.0,0.0,98059.0,47.5725,-122.226,1830.0,7873.0
75%,7358175000.0,640250.0,4.0,2.5,2500.0,11222.5,2.0,0.0,0.0,4.0,8.0,2150.0,600.0,1990.0,0.0,98117.0,47.68025,-122.124,2360.0,10408.25
max,9839301000.0,5350000.0,8.0,6.0,8010.0,1651359.0,3.5,1.0,4.0,5.0,12.0,6720.0,2620.0,2015.0,2015.0,98199.0,47.7776,-121.315,5790.0,425581.0


Extract input and output data. We want to predict the price by using features other than id as input.

In [4]:
Data = df.values
# m = number of input samples
m = Data.shape[0]
print("Amount of data:",m)
# output: the price (column 2)
Y = Data[:m,2]
# input: the 18 features (columns 3 onward), which leaves out id
X = Data[:m,3:]

Amount of data: 3164
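
Positional indexing works, but it silently depends on the column order of the CSV. Below is a minimal sketch of the same extraction by column name, on a tiny made-up frame. It assumes that a non-numeric `date` column sits between `id` and `price` (which would explain why `price` is at index 2 although `df.describe()` lists it right after `id`); the rows and the names `toy`, `X_toy`, `Y_toy` are invented for illustration.

```python
import pandas as pd

# Tiny invented frame. ASSUMPTION: the CSV has a non-numeric "date" column
# between "id" and "price", so "price" ends up at positional index 2.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "date": ["2014-05-02", "2014-09-15", "2015-01-20"],
    "price": [300000.0, 450000.0, 250000.0],
    "bedrooms": [3, 4, 2],
    "sqft_living": [1200, 2500, 800],
})

# Select by name instead of by position.
Y_toy = toy["price"].to_numpy()
X_toy = toy.drop(columns=["id", "date", "price"]).to_numpy(dtype=float)

# Positional equivalent, as in the cell above.
Y_pos = toy.values[:, 2]
X_pos = toy.values[:, 3:]

print(X_toy.shape, Y_toy.shape)
```

Selecting by name also yields a float array directly, whereas `df.values` on a frame with mixed column types gives an array of generic Python objects.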


## Data Pre-Processing

We split the data into 3 parts: one will be used for training and choosing the parameters, one for choosing among different models (validation), and one for testing. The training part consists of $2/3$ of all samples, the validation part of $1/6$ of all samples, and the test part of the remaining $1/6$ of all samples.

In [5]:
from sklearn.model_selection import train_test_split

# Split data into train (2/3 of samples), validation (1/6 of samples), and test data (the rest)
m_train = int(2./3.*m)
m_val = int((m-m_train)/2.)
m_test = m - m_train - m_val
print("Amount of data for training and deciding parameters:",m_train)
print("Amount of data for validation (choosing among different models):",m_val)
print("Amount of data for test:",m_test)

Xtrain_and_val, Xtest, Ytrain_and_val, Ytest = train_test_split(X, Y, test_size=m_test/m, random_state=numero_di_matricola)
Xtrain, Xval, Ytrain, Yval = train_test_split(Xtrain_and_val, Ytrain_and_val, test_size=m_val/(m_train+m_val), random_state=numero_di_matricola)

Amount of data for training and deciding parameters: 2109
Amount of data for validation (choosing among different models): 527
Amount of data for test: 528
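
The three sizes follow from integer truncation, and the test set simply takes whatever is left. A small sketch of that arithmetic (the helper name `split_sizes` is ours, not part of the notebook):

```python
def split_sizes(m):
    """Return (m_train, m_val, m_test) for the 2/3, 1/6, 1/6 split used above."""
    m_train = int(2.0 / 3.0 * m)        # int() truncates toward zero
    m_val = int((m - m_train) / 2.0)    # half of what is left, truncated
    m_test = m - m_train - m_val        # the remainder, so the sizes always sum to m
    return m_train, m_val, m_test

sizes = split_sizes(3164)
print(sizes)  # (2109, 527, 528)
```

Note that `train_test_split` also accepts an integer `test_size` (a number of samples), which would avoid any floating-point rounding when the sizes are passed as fractions.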


Let's standardize the data.

In [6]:
# Data pre-processing
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(Xtrain)
Xtrain_scaled = scaler.transform(Xtrain)
Xtrain_and_val_scaled = scaler.transform(Xtrain_and_val)
Xval_scaled = scaler.transform(Xval)
Xtest_scaled = scaler.transform(Xtest)
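
The scaler is fitted on the training data only and then applied unchanged to the other splits. A minimal sketch of what that means, on made-up matrices (`A_train`, `A_val` and `toy_scaler` are illustrative names, and the data are random numbers, not the house data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up "train" and "validation" matrices: 3 features on very different scales.
A_train = rng.normal(loc=[0.0, 100.0, -5.0], scale=[1.0, 50.0, 0.1], size=(200, 3))
A_val = rng.normal(loc=[0.0, 100.0, -5.0], scale=[1.0, 50.0, 0.1], size=(50, 3))

toy_scaler = StandardScaler().fit(A_train)

# The scaler stores the per-feature mean and (population) standard deviation of
# the data it was fitted on, and applies (x - mean) / std to any later input.
manual_val = (A_val - A_train.mean(axis=0)) / A_train.std(axis=0)

print(np.allclose(toy_scaler.transform(A_val), manual_val))
print(toy_scaler.transform(A_train).std(axis=0))
```

Because the statistics come from the training split, the scaled validation and test data have mean and standard deviation only approximately equal to 0 and 1.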

## Neural Networks
Let's start by learning a simple neural network with 1 hidden node.
Note: we are going to use the input parameters solver='lbfgs' and random_state=numero_di_matricola; the latter fixes the random seed (so results are reproducible).

In [8]:
#let's load the MLPRegressor

from sklearn.neural_network import MLPRegressor #COMPLETE

#let's define the model
mlp = MLPRegressor(hidden_layer_sizes=(1, ), solver="lbfgs", random_state = numero_di_matricola) #COMPLETE

#let's learn the model on training data
mlp.fit(Xtrain_scaled, Ytrain)#COMPLETE

#let's print the error (1 - R^2) on training data
print("Training error: ", 1. - mlp.score(Xtrain_scaled, Ytrain)) #COMPLETE

#let's print the error (1 - R^2) on validation data
print("Validation error: ", 1. - mlp.score(Xval_scaled, Yval)) #COMPLETE

#let's print the weights of the model, one array per layer (biases not included)
print(mlp.coefs_) #COMPLETE

#let's print the biases, one array per layer
print(mlp.intercepts_) #COMPLETE

Training error:  0.2639480781015997
Validation error:  0.3040462946282121
[array([[-214.2706902 ],
       [ 268.83014126],
       [ 523.15531199],
       [ -60.57620702],
       [   4.17629812],
       [ 709.84738204],
       [ 293.97851894],
       [ 136.39751309],
       [ 814.60963176],
       [ 492.64335933],
       [ 163.33472905],
       [-581.27391639],
       [  38.01155997],
       [-203.23453672],
       [ 599.2678294 ],
       [-141.80411624],
       [ 146.94340034],
       [ -26.95580774]]), array([[141.32340859]])]
[array([3789.16189829]), array([-37.51218273])]
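
The `score` method of a regressor returns $R^2 = 1 - SS_{res}/SS_{tot}$, so the "error" printed above is the ratio $SS_{res}/SS_{tot}$: 0 for perfect predictions, 1 for a model that always predicts the mean of the targets. A small sketch in plain Python (the function name `one_minus_r2` is ours):

```python
def one_minus_r2(y_true, y_pred):
    """Ratio of residual to total sum of squares, i.e. 1 - R^2."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
print(one_minus_r2(y, y))                     # perfect predictions -> 0.0
print(one_minus_r2(y, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean -> 1.0
print(one_minus_r2(y, [4.0, 3.0, 2.0, 1.0]))  # worse than the mean -> 4.0
```

The last case shows that this error is not bounded by 1: a model can do worse than the constant mean predictor.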


## Neural Networks vs Linear Models

Let's learn a linear model on the same data and compare the results with the simple NN above.

In [12]:
from sklearn import linear_model #COMPLETE

LR = linear_model.LinearRegression()

LR.fit(Xtrain_scaled, Ytrain)#COMPLETE

#let's print the error (1 - R^2) on training data
print("Training error: ", 1. - LR.score(Xtrain_scaled, Ytrain)) #COMPLETE

#let's print the error (1 - R^2) on validation data
print("Validation error: ", 1. - LR.score(Xval_scaled, Yval)) #COMPLETE

#let's print the coefficients of the model for the input nodes (but not the bias)
print(LR.coef_) #COMPLETE

#let's print the coefficient for the bias (i.e., the bias)
print(LR.intercept_) #COMPLETE

Training error:  0.2653594216072852
Validation error:  0.3115400506517969
[-31303.71909156  35848.45081517  74506.78099995  -8012.41104949
    671.23713588 100205.53195594  41671.19028923  19507.84532115
 111331.50566184  69959.22677526  23468.73219785 -78236.93092911
   6535.34729956 -28197.21476235  83701.76486765 -21647.26671149
  22056.22833416  -2002.69401407]
536831.9203413766


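Is there a way to make a NN learn a linear model?

Let's first check what loss MLPRegressor uses: it minimizes the squared error (plus a regularization term), which is the same criterion used by least-squares linear regression.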

In [15]:
#let's learn a linear model with a NN: with the identity activation the output is a linear function of the input

#let's define the model
mlp_lr = MLPRegressor(hidden_layer_sizes=(1, ), solver="lbfgs", random_state = numero_di_matricola, activation = "identity") #COMPLETE

#let's learn the model on training data
mlp_lr.fit(Xtrain_scaled, Ytrain)#COMPLETE

#let's print the error (1 - R^2) on training data
print("Training error: ", 1. - mlp_lr.score(Xtrain_scaled, Ytrain)) #COMPLETE

#let's print the error (1 - R^2) on validation data
print("Validation error: ", 1. - mlp_lr.score(Xval_scaled, Yval)) #COMPLETE

#let's print the weights of the model, one array per layer (biases not included)
print(mlp_lr.coefs_) #COMPLETE

#let's print the biases, one array per layer
print(mlp_lr.intercepts_) #COMPLETE

Training error:  0.26535942166590454
Validation error:  0.3115390658284922
[array([[  51.55070235],
       [ -59.02846704],
       [-122.96939596],
       [  13.19466306],
       [  -1.10694448],
       [-165.016906  ],
       [ -68.62491747],
       [ -32.12596414],
       [-183.34020234],
       [-114.96558392],
       [ -38.51551396],
       [ 128.83877577],
       [ -10.76237595],
       [  46.43749972],
       [-137.83874569],
       [  35.64844107],
       [ -36.32214078],
       [   3.29740834]]), array([[-607.24280969]])]
[array([-883.44724722]), array([365.29153421])]
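
The weights look nothing like the linear regression coefficients, but with the identity activation the two layers collapse into one linear map: $(x \cdot W_1 + b_1) w_2 + b_2 = x \cdot (W_1 w_2) + (b_1 w_2 + b_2)$. The sketch below checks this on numbers copied from the two printed outputs above (only the first two features are used, to keep it short); the variable names are ours.

```python
# Values copied from the printed output of the identity-activation network above.
W1 = [51.55070235, -59.02846704]   # input-to-hidden weights of the first two features
b1 = -883.44724722                 # hidden bias
w2 = -607.24280969                 # hidden-to-output weight
b2 = 365.29153421                  # output bias

# Collapse the two layers into a single linear map.
effective_coef = [w * w2 for w in W1]
effective_intercept = b1 * w2 + b2

# Values copied from the printed output of LinearRegression above.
lr_coef = [-31303.71909156, 35848.45081517]
lr_intercept = 536831.9203413766

for eff, ref in zip(effective_coef, lr_coef):
    print(round(eff, 2), round(ref, 2))
print(round(effective_intercept, 2), round(lr_intercept, 2))
```

The agreement is close but not exact, since lbfgs stops at a numerical tolerance. Weights of strongly correlated features need not match as closely, because different weight combinations can give nearly the same predictions.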


Note that there is an $\ell_2$ regularization term in MLPRegressor (controlled by the parameter alpha). What about making it smaller?

In [16]:
#COMPLETE

#let's define the model
mlp_lr_noreg = MLPRegressor(hidden_layer_sizes=(1, ), solver="lbfgs", random_state = numero_di_matricola, activation = "identity", alpha=1e-20) #COMPLETE

#let's learn the model on training data
mlp_lr_noreg.fit(Xtrain_scaled, Ytrain)#COMPLETE

#let's print the error (1 - R^2) on training data
print("Training error: ", 1. - mlp_lr_noreg.score(Xtrain_scaled, Ytrain)) #COMPLETE

#let's print the error (1 - R^2) on validation data
print("Validation error: ", 1. - mlp_lr_noreg.score(Xval_scaled, Yval)) #COMPLETE

#let's print the weights of the model, one array per layer (biases not included)
print(mlp_lr_noreg.coefs_) #COMPLETE

#let's print the biases, one array per layer
print(mlp_lr_noreg.intercepts_) #COMPLETE

Training error:  0.26535942166590454
Validation error:  0.31153906582851487
[array([[  51.55070235],
       [ -59.02846704],
       [-122.96939596],
       [  13.19466306],
       [  -1.10694448],
       [-165.016906  ],
       [ -68.62491747],
       [ -32.12596414],
       [-183.34020234],
       [-114.96558392],
       [ -38.51551396],
       [ 128.83877577],
       [ -10.76237595],
       [  46.43749972],
       [-137.83874569],
       [  35.64844107],
       [ -36.32214078],
       [   3.29740834]]), array([[-607.24280969]])]
[array([-883.44724722]), array([365.29153421])]
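
The results are practically the same as with the default alpha (1e-4). A rough back-of-the-envelope sketch of why, under two loudly stated assumptions: that the penalty has the form $\frac{1}{2}\alpha\|W\|^2/n$ and the data term the form $\frac{1}{2}\text{MSE}$, and that the standard deviation of price from `df.describe()` is a fair proxy for the one of the training targets. The weights are copied from the output above; everything else is our own illustration, not a statement about scikit-learn internals.

```python
# Input-to-hidden weights copied from the printed output above.
W1 = [51.55070235, -59.02846704, -122.96939596, 13.19466306, -1.10694448,
      -165.016906, -68.62491747, -32.12596414, -183.34020234, -114.96558392,
      -38.51551396, 128.83877577, -10.76237595, 46.43749972, -137.83874569,
      35.64844107, -36.32214078, 3.29740834]
w2 = -607.24280969                   # hidden-to-output weight

n_samples = 2109                     # training-set size printed earlier
price_std = 380900.4                 # std of price from df.describe(), used as a proxy
train_error = 0.26535942166590454    # printed 1 - R^2 on training data

sum_sq_weights = sum(w * w for w in W1) + w2 * w2

def l2_penalty(alpha):
    # ASSUMPTION: penalty of the form 0.5 * alpha * ||W||^2 / n_samples.
    return 0.5 * alpha * sum_sq_weights / n_samples

# ASSUMPTION: data term of the form 0.5 * MSE, with MSE ~ (1 - R^2) * Var(y).
data_term = 0.5 * train_error * price_std ** 2

for alpha in (1e-4, 1e-20):
    print(alpha, l2_penalty(alpha), l2_penalty(alpha) / data_term)
```

Under these assumptions the penalty is many orders of magnitude smaller than the data term already at the default alpha, so shrinking it further cannot change much. To see an effect of the regularization one would have to increase alpha instead.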


## More Complex NNs

Let's try more complex NNs, for example by increasing the number of nodes in the only hidden layer, or by increasing the number of hidden layers.

Let's build a NN with 2 nodes in the only hidden layer

In [17]:
#let's build a NN with 2 nodes in the only hidden layer

from sklearn.neural_network import MLPRegressor #COMPLETE

#let's define the model
mlp_1h2n = MLPRegressor(hidden_layer_sizes=(2, ), solver="lbfgs", random_state = numero_di_matricola) #COMPLETE

#let's learn the model on training data
mlp_1h2n.fit(Xtrain_scaled, Ytrain)#COMPLETE

#let's print the error (1 - R^2) on training data
print("Training error: ", 1. - mlp_1h2n.score(Xtrain_scaled, Ytrain)) #COMPLETE

#let's print the error (1 - R^2) on validation data
print("Validation error: ", 1. - mlp_1h2n.score(Xval_scaled, Yval)) #COMPLETE

#let's print the weights of the model, one array per layer (biases not included)
print(mlp_1h2n.coefs_) #COMPLETE

#let's print the biases, one array per layer
print(mlp_1h2n.intercepts_) #COMPLETE

Training error:  0.18062204762767708
Validation error:  0.20740387174470365
[array([[  91.02730275,  -33.27495621],
       [ 120.426258  ,   39.34029925],
       [  85.92033098,   72.9854881 ],
       [-271.76978265,   28.29864088],
       [ -30.7062934 ,   17.82631683],
       [ 197.57765784,   25.88717597],
       [  34.91748275,   37.55666698],
       [  96.9807779 ,   25.68593065],
       [ 312.8005049 ,  132.75156317],
       [  85.17191276,   68.79246682],
       [  19.29957673,   23.32901307],
       [-217.58250538,  -81.10854351],
       [  -3.45286622,   20.17072215],
       [-300.9651875 ,  -26.39908198],
       [ 305.19643214,  144.62963693],
       [-463.25823136,  -16.43003545],
       [ 193.67218628,   53.02206533],
       [-241.36794669,  -11.18297964]]), array([[615.21480933],
       [548.87640907]])]
[array([-1049.80655366,   897.57644076]), array([725.96713241])]


Let's build a NN with 5 nodes in the only hidden layer

In [22]:
#let's build a NN with 5 nodes in the only hidden layer

from sklearn.neural_network import MLPRegressor #COMPLETE

#let's define the model
mlp_1h5n = MLPRegressor(hidden_layer_sizes=(5, ), solver="lbfgs", random_state = numero_di_matricola) #COMPLETE

#let's learn the model on training data
mlp_1h5n.fit(Xtrain_scaled, Ytrain)#COMPLETE

#let's print the error (1 - R^2) on training data
print("Training error: ", 1. - mlp_1h5n.score(Xtrain_scaled, Ytrain)) #COMPLETE

#let's print the error (1 - R^2) on validation data
print("Validation error: ", 1. - mlp_1h5n.score(Xval_scaled, Yval)) #COMPLETE

#let's print the weights of the model, one array per layer (biases not included)
print(mlp_1h5n.coefs_) #COMPLETE

#let's print the biases, one array per layer
print(mlp_1h5n.intercepts_) #COMPLETE

Training error:  0.16319548399504324
Validation error:  0.2188084488823152
[array([[ -212.42052182,   178.56878224,   -13.93417767,   178.16704452,
          -31.9253716 ],
       [  297.38251127,   417.20249223,   163.34881072,   155.2919727 ,
            1.60478425],
       [ -415.59091332,   350.22957967,   286.09538753,  -317.74580448,
          292.84391406],
       [ -170.44645464,   -25.1820697 ,  -761.12709977,   117.47032116,
           77.06653586],
       [  920.42730448,  -987.67682931,   348.2484192 ,   -45.05053924,
         -135.20538027],
       [ -391.11436992,   742.22169513,   536.3526787 ,  -526.54088295,
          -81.9788458 ],
       [ -581.59916475,   706.32604182,   -16.51241429,  -467.61467351,
          202.91782666],
       [  104.34600555,   798.43989716,   184.71685778,   -28.91483413,
           21.98868105],
       [ -387.67229429,   700.39784078,   555.40802976,    85.22505566,
          437.05095339],
       [ -523.15224841,   584.84780028,   313.81717

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Note that with a smaller number of iterations we had a larger error on the training set but a smaller error on the validation data: early stopping is a form of regularization.
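
One way to explore this is to refit the same architecture with several values of `max_iter` and compare the two errors. The sketch below does so on synthetic data (random features and a made-up nonlinear target, not the house data), so the numbers it prints say nothing about this dataset; all names are illustrative.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled data: 300 samples, 5 features, noisy nonlinear target.
X_syn = rng.normal(size=(300, 5))
y_syn = np.sin(X_syn[:, 0]) + 0.5 * X_syn[:, 1] ** 2 + 0.3 * rng.normal(size=300)
X_tr, y_tr, X_va, y_va = X_syn[:200], y_syn[:200], X_syn[200:], y_syn[200:]

errors = {}
for max_iter in (10, 50, 200, 1000):
    model = MLPRegressor(hidden_layer_sizes=(5,), solver="lbfgs",
                         max_iter=max_iter, random_state=1)
    with warnings.catch_warnings():
        # Hitting max_iter raises a ConvergenceWarning; silence it for this comparison.
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        model.fit(X_tr, y_tr)
    # Store (training error, validation error), both as 1 - R^2.
    errors[max_iter] = (1.0 - model.score(X_tr, y_tr), 1.0 - model.score(X_va, y_va))

for max_iter, (err_tr, err_va) in errors.items():
    print(max_iter, round(err_tr, 4), round(err_va, 4))
```

In the notebook the same loop would use Xtrain_scaled, Ytrain, Xval_scaled, Yval and random_state=numero_di_matricola.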

Let's build a NN with 10 nodes in the only hidden layer

In [23]:
#let's build a NN with 10 nodes in the only hidden layer

from sklearn.neural_network import MLPRegressor #COMPLETE

#let's define the model
mlp_1h10n = MLPRegressor(hidden_layer_sizes=(10, ), solver="lbfgs", random_state = numero_di_matricola) #COMPLETE

#let's learn the model on training data
mlp_1h10n.fit(Xtrain_scaled, Ytrain)#COMPLETE

#let's print the error (1 - R^2) on training data
print("Training error: ", 1. - mlp_1h10n.score(Xtrain_scaled, Ytrain)) #COMPLETE

#let's print the error (1 - R^2) on validation data
print("Validation error: ", 1. - mlp_1h10n.score(Xval_scaled, Yval)) #COMPLETE

#let's print the weights of the model, one array per layer (biases not included)
print(mlp_1h10n.coefs_) #COMPLETE

#let's print the biases, one array per layer
print(mlp_1h10n.intercepts_) #COMPLETE

Training error:  0.12166041459011512
Validation error:  0.30474906458367634
[array([[  142.90157349,    -5.64252683,   -78.24670384,    81.80264724,
           54.4690754 ,   -67.57084438,    30.49260416,    -7.62352087,
          -49.2408123 ,   194.26623298],
       [   19.96658921,   -70.14013944,   126.59977883,    81.77260576,
         -126.33147599,   426.93649595,   308.06387783,   -17.12276879,
          105.32066135,  -173.93899674],
       [ -137.75067599,    61.86268552,   252.34475712,    32.68670536,
         -181.87255964,   184.01698488,  -200.59881485,    -9.95255862,
           92.27566777,  -178.51658159],
       [  201.84001755,    71.9096388 ,    74.31277841,  -256.86885231,
          -86.24887685,   -16.90169854,  -157.19191873,  -158.9461769 ,
          -46.48850682,  -545.78031232],
       [ -557.62996402,   610.55530946,  -202.15336765,    74.06301243,
          204.20692671,   -11.14157963,  -177.38321874,   182.73816545,
          196.76444621,  -654.77006564]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Let's build a NN with 100 nodes in the only hidden layer. Note that this is the default!

In [24]:
#let's build a NN with 100 nodes in the only hidden layer

from sklearn.neural_network import MLPRegressor #COMPLETE

#let's define the model
mlp_1h100n = MLPRegressor(hidden_layer_sizes=(100, ), solver="lbfgs", random_state = numero_di_matricola) #COMPLETE

#let's learn the model on training data
mlp_1h100n.fit(Xtrain_scaled, Ytrain)#COMPLETE

#let's print the error (1 - R^2) on training data
print("Training error: ", 1. - mlp_1h100n.score(Xtrain_scaled, Ytrain)) #COMPLETE

#let's print the error (1 - R^2) on validation data
print("Validation error: ", 1. - mlp_1h100n.score(Xval_scaled, Yval)) #COMPLETE

#let's print the weights of the model, one array per layer (biases not included)
print(mlp_1h100n.coefs_) #COMPLETE

#let's print the biases, one array per layer
print(mlp_1h100n.intercepts_) #COMPLETE

Training error:  0.03138645935685236
Validation error:  0.38673398065108233
[array([[  88.01564147,  -20.88757248,   74.32197474, ..., -121.56426507,
         168.69298174,  -87.42104112],
       [   9.53756696,   67.87362961,   86.99794504, ...,   36.22614267,
         -22.49776064,  -15.50747908],
       [ -76.74482921,  -75.15169921,  100.97982527, ...,   31.46635555,
         238.65599168,   45.00100989],
       ...,
       [  50.33743981,  -61.71113935,   69.47203088, ..., -379.91516011,
          62.53446728,  -35.46205508],
       [ -49.60164363,   83.15128489,  152.20787873, ..., -110.89090786,
         -73.5251784 ,  -35.58799736],
       [   8.47989996, -170.30551729,   43.98564458, ...,   86.62785586,
          -2.34972701,   -3.53734511]]), array([[ 1.17116030e+02],
       [ 3.95089139e+01],
       [-2.32557360e+01],
       [-4.51931511e+01],
       [ 2.09076048e+02],
       [ 5.97541549e+01],
       [ 5.93123578e+01],
       [ 2.79226561e+02],
       [-2.56311971e+02],
   

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Let's try 2 layers, 1 node each

In [None]:
#let's build a NN with 2 hidden layers, 1 node each

#COMPLETE

Let's try 2 layers, 2 nodes each

In [25]:
#let's build a NN with 2 layers, 2 nodes each

from sklearn.neural_network import MLPRegressor #COMPLETE

#let's define the model
mlp_2h2n = MLPRegressor(hidden_layer_sizes=(2, 2, ), solver="lbfgs", random_state = numero_di_matricola) #COMPLETE

#let's learn the model on training data
mlp_2h2n.fit(Xtrain_scaled, Ytrain)#COMPLETE

#let's print the error (1 - R^2) on training data
print("Training error: ", 1. - mlp_2h2n.score(Xtrain_scaled, Ytrain)) #COMPLETE

#let's print the error (1 - R^2) on validation data
print("Validation error: ", 1. - mlp_2h2n.score(Xval_scaled, Yval)) #COMPLETE

#let's print the weights of the model, one array per layer (biases not included)
print(mlp_2h2n.coefs_) #COMPLETE

#let's print the biases, one array per layer
print(mlp_2h2n.intercepts_) #COMPLETE

Training error:  0.21279156721086012
Validation error:  0.26992402416860495
[array([[ -4.22577476,   2.63027224],
       [  9.99342087,  -7.80910368],
       [ 21.93698998, -27.03330468],
       [ -8.24945212,  13.1650583 ],
       [ -5.28469992,  17.17721391],
       [ 21.98142398, -17.36619318],
       [  9.62471103, -11.78924743],
       [  6.02574135,  -3.58730912],
       [ 38.40219038, -41.71203757],
       [ 21.63601478, -27.37195716],
       [  5.42173258,  -4.0912218 ],
       [-28.8798438 ,  37.2404525 ],
       [  0.31230183,   3.83151851],
       [-15.55835082,  21.00288097],
       [ 25.40006534,  -9.93940831],
       [-18.41527883,  23.81319944],
       [  5.82974355,   7.42355739],
       [ -4.20219205,   8.94717134]]), array([[ -2.22792878,  65.44419175],
       [-10.63576921,  26.55855252]]), array([[ 1.13997534],
       [72.03579216]])]
[array([82.77659745, 70.78953727]), array([ -10.55097674, -484.62292654]), array([147.77520455])]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Let's try 2 layers, 10 nodes each

In [None]:
#let's build a NN with 2 layers, 10 nodes each

#COMPLETE

Let's try 2 layers, 100 nodes each

In [None]:
#let's build a NN with 2 layers, 100 nodes each

#COMPLETE
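
The cells left to complete all follow the pattern of the 2-layer, 2-node cell, with only `hidden_layer_sizes` changing. A possible sketch is a small helper (the name `fit_and_report` is hypothetical) that fits a model and returns both errors; here it is tried on synthetic data, so the printed numbers are not results for the house dataset.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

def fit_and_report(hidden_layer_sizes, X_tr, y_tr, X_va, y_va, seed):
    """Fit an lbfgs MLPRegressor; return (model, training error, validation error), error = 1 - R^2."""
    model = MLPRegressor(hidden_layer_sizes=hidden_layer_sizes, solver="lbfgs",
                         random_state=seed)
    with warnings.catch_warnings():
        # lbfgs may stop at the iteration limit; silence the warning for this sketch.
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        model.fit(X_tr, y_tr)
    return model, 1.0 - model.score(X_tr, y_tr), 1.0 - model.score(X_va, y_va)

# Synthetic stand-in data: 300 samples, 4 features, noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=300)

report = {}
for layer_sizes in [(1, 1), (2, 2), (10, 10)]:
    _, err_tr, err_va = fit_and_report(layer_sizes, X_demo[:200], y_demo[:200],
                                       X_demo[200:], y_demo[200:], seed=1)
    report[layer_sizes] = (err_tr, err_va)
    print(layer_sizes, round(err_tr, 4), round(err_va, 4))
```

In the notebook one would call, for example, `fit_and_report((10, 10), Xtrain_scaled, Ytrain, Xval_scaled, Yval, numero_di_matricola)`, and likewise with `(1, 1)` and `(100, 100)`.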

So it seems that 1 layer (and default number of iterations) works best for this dataset. Let's try 5-fold cross-validation with number of nodes in the hidden layer between 1 and 20.
Note that we use train and validation data together, since we are doing cross-validation.

In [None]:
from sklearn.model_selection import GridSearchCV

#COMPLETE
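
A possible sketch of such a grid search, again on synthetic data so that it runs on its own (the data and the names `X_cv`, `y_cv`, `search` are made up). With the default scoring, GridSearchCV uses the estimator's own `score`, i.e. $R^2$, so the cross-validated error is 1 minus the best score.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the scaled train + validation data.
rng = np.random.default_rng(0)
X_cv = rng.normal(size=(150, 4))
y_cv = np.tanh(X_cv[:, 0]) + 0.5 * X_cv[:, 1] + 0.1 * rng.normal(size=150)

# One hidden layer with 1, 2, ..., 20 nodes.
param_grid = {"hidden_layer_sizes": [(n,) for n in range(1, 21)]}
search = GridSearchCV(MLPRegressor(solver="lbfgs", random_state=1), param_grid, cv=5)

with warnings.catch_warnings():
    # Some fits may hit the iteration limit; silence the warnings for this sketch.
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    search.fit(X_cv, y_cv)

print(search.best_params_)
print("CV error (1 - R^2) of the best model:", 1.0 - search.best_score_)
```

In the notebook the fit would be on Xtrain_and_val_scaled and Ytrain_and_val, with random_state=numero_di_matricola.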

Now let's check which parameter is best, and compare the best NN with the linear model (learned on train and validation) on test data.

In [None]:
#let's print the best model according to grid search
#COMPLETE

#let's print the error 1-R^2 for the best model
#COMPLETE

Let's compare the error of the best NN on train and validation and on test data.

In [None]:
#COMPLETE

Now let's learn the linear model on train and validation, and get its error (1-R^2) on train and validation and on test data.

In [None]:
#COMPLETE
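
The final comparison could be sketched as follows: both models are fitted on the train + validation data and evaluated on that same data and on the held-out test set. The data are synthetic and `best_n` is a hypothetical value standing in for the outcome of the grid search, so the output is only an illustration of the procedure.

```python
import warnings
import numpy as np
from sklearn import linear_model
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_all = rng.normal(size=(360, 4))
y_all = np.tanh(X_all[:, 0]) + 0.5 * X_all[:, 1] + 0.1 * rng.normal(size=360)
X_trval, y_trval = X_all[:300], y_all[:300]   # plays the role of train + validation
X_te, y_te = X_all[300:], y_all[300:]         # plays the role of the test set

best_n = 5  # HYPOTHETICAL: in the notebook this would come from the grid search

candidates = {
    "NN": MLPRegressor(hidden_layer_sizes=(best_n,), solver="lbfgs", random_state=1),
    "linear": linear_model.LinearRegression(),
}

results = {}
for name, model in candidates.items():
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        model.fit(X_trval, y_trval)
    # (error on train + validation, error on test), both as 1 - R^2
    results[name] = (1.0 - model.score(X_trval, y_trval), 1.0 - model.score(X_te, y_te))
    print(name, round(results[name][0], 4), round(results[name][1], 4))
```

The test error is computed only once, after the model has been chosen, so that it remains an unbiased estimate of the error on new data.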

Note: MLPRegressor has several other parameters!
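
A quick way to see them, together with their default values, is `get_params()`. The sketch below prints the ones used in this notebook:

```python
from sklearn.neural_network import MLPRegressor

# get_params() returns a dict with every constructor parameter and its current value.
defaults = MLPRegressor().get_params()

for name in ("hidden_layer_sizes", "activation", "solver", "alpha", "max_iter"):
    print(name, "=", defaults[name])

print(len(defaults), "parameters in total")
```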