# Data Loading and Preprocessing

We consider the same dataset used in the labs: house sale prices for King County (which includes Seattle), for homes sold between May 2014 and May 2015.

https://www.kaggle.com/harlfoxem/housesalesprediction

For each house we know 18 features (e.g., number of bedrooms, number of bathrooms) plus its price, which is what we would like to predict.

## TO DO: Insert your ID number ("numero di matricola") below

In [32]:
#put your "numero di matricola" here
numero_di_matricola = 2016515

Load the required packages

In [33]:
#import all packages needed
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Read the data, remove data samples/points with missing values (NaN), and print some statistics.

In [34]:
#load the data
df = pd.read_csv('kc_house_data.csv', sep = ',')

#remove the data samples with missing values (NaN)
df = df.dropna() 

df.describe()

statistic,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0
mean,4645240000.0,535435.8,3.381163,2.071903,2070.027813,15250.54,1.434893,0.009798,0.244311,3.459229,7.615676,1761.252212,308.775601,1967.489254,94.668774,98077.125158,47.557868,-122.212337,1982.544564,13176.302465
std,2854203000.0,380900.4,0.895472,0.768212,920.251879,42544.57,0.507792,0.098513,0.776298,0.682592,1.166324,815.934864,458.977904,28.095275,424.439427,54.172937,0.140789,0.139577,686.25667,25413.180755
min,1000102.0,75000.0,0.0,0.0,380.0,649.0,1.0,0.0,0.0,1.0,3.0,380.0,0.0,1900.0,0.0,98001.0,47.1775,-122.514,620.0,660.0
25%,2199775000.0,315000.0,3.0,1.5,1430.0,5453.75,1.0,0.0,0.0,3.0,7.0,1190.0,0.0,1950.0,0.0,98032.0,47.459575,-122.32425,1480.0,5429.5
50%,4027701000.0,445000.0,3.0,2.0,1910.0,8000.0,1.0,0.0,0.0,3.0,7.0,1545.0,0.0,1969.0,0.0,98059.0,47.5725,-122.226,1830.0,7873.0
75%,7358175000.0,640250.0,4.0,2.5,2500.0,11222.5,2.0,0.0,0.0,4.0,8.0,2150.0,600.0,1990.0,0.0,98117.0,47.68025,-122.124,2360.0,10408.25
max,9839301000.0,5350000.0,8.0,6.0,8010.0,1651359.0,3.5,1.0,4.0,5.0,12.0,6720.0,2620.0,2015.0,2015.0,98199.0,47.7776,-121.315,5790.0,425581.0


Get the feature matrix and the vector of target values. We want to predict the price by using features other than id as input.

In [35]:
Data = df.values
# m = number of input samples
m = Data.shape[0]
print("Amount of data:",m)
Y = Data[:m,2]
X = Data[:m,3:]

feature_names = df.columns[3:]

Amount of data: 3164


We split the $m$ samples into 3 parts: one for training and choosing the parameters, one for choosing among different models (validation), and one for testing. The training part consists of $m_{train} = \lfloor 2m/3 \rfloor$ samples, the validation part of $m_{val} = \lfloor (m - m_{train})/2 \rfloor$ samples, and the test part of the remaining $m_{test} = m - m_{train} - m_{val}$ samples.

In [36]:
# Split data into train (2/3 of samples), validation (1/6 of samples), and test data (the rest)
m_train = int(2./3.*m)
m_val = int((m-m_train)/2.)
m_test = m - m_train - m_val
print("Amount of data for training and deciding parameters:",m_train)
print("Amount of data for validation (choosing among different models):",m_val)
print("Amount of data for test:",m_test)
from sklearn.model_selection import train_test_split

#Xtrain_and_val, Ytrain_and_val is the part of data for training and validation
#Xtest, Ytest is the part of data for testing
Xtrain_and_val, Xtest, Ytrain_and_val, Ytest = train_test_split(X, Y, test_size=m_test/m, random_state=numero_di_matricola)

#if you need to consider a specific training and validation split, use
#Xtrain, Ytrain for training and Xval, Yval for validation
Xtrain, Xval, Ytrain, Yval = train_test_split(Xtrain_and_val, Ytrain_and_val, test_size=m_val/(m_train+m_val), random_state=numero_di_matricola)

Amount of data for training and deciding parameters: 2109
Amount of data for validation (choosing among different models): 527
Amount of data for test: 528
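
The sizes printed above follow from integer truncation, and the second call to `train_test_split` receives a fraction relative to the samples left after the test set is removed. The short check below is a sketch of that arithmetic only; the names `test_fraction` and `val_fraction` are introduced here for illustration.

```python
# Worked check of the split sizes for m = 3164 (the value printed by the notebook)
m = 3164
m_train = int(2. / 3. * m)        # int() truncates 2109.33... to 2109
m_val = int((m - m_train) / 2.)   # int() truncates 527.5 to 527
m_test = m - m_train - m_val      # the remaining samples
print(m_train, m_val, m_test)

# Fractions used in the two successive splits (hypothetical helper names)
test_fraction = m_test / m                  # share of all m samples
val_fraction = m_val / (m_train + m_val)    # share of the train+validation part
print(test_fraction, val_fraction)
```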


Let's scale the data.

In [37]:
# Data pre-processing
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(Xtrain)
Xtrain_scaled = scaler.transform(Xtrain)
Xtrain_and_val_scaled = scaler.transform(Xtrain_and_val)
Xval_scaled = scaler.transform(Xval)
Xtest_scaled = scaler.transform(Xtest)
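
As a sketch of what the scaling step does: `StandardScaler` stores the per-feature mean and standard deviation of the data it was fitted on, and `transform` reuses those stored statistics on any other array. The example below uses synthetic data (an assumption for illustration, not the housing data), so only the fitted array ends up with exactly zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration)
rng = np.random.default_rng(0)
X_fit = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_new = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler = StandardScaler().fit(X_fit)    # statistics come from X_fit only
X_fit_scaled = scaler.transform(X_fit)
X_new_scaled = scaler.transform(X_new)

# transform computes (x - mean_) / scale_ with the stored statistics
manual = (X_new - scaler.mean_) / scaler.scale_
print(np.allclose(manual, X_new_scaled))

print(X_fit_scaled.mean(axis=0), X_fit_scaled.std(axis=0))  # ~0 and ~1
print(X_new_scaled.mean(axis=0))                            # close to 0, not exactly 0
```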

# Neural Networks

Let's learn the best neural network with 1 hidden layer and between 1 and 9 hidden nodes, choosing the best number of hidden nodes with cross-validation.

In [38]:
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

mlp_cv = MLPRegressor()
param_grid = {'hidden_layer_sizes': [i for i in range(1,10)],
              'activation': ['relu'],
              'solver': ['lbfgs'], 
              'random_state': [numero_di_matricola],
              'max_iter': [10000]
             }
mlp_GS = GridSearchCV(mlp_cv, param_grid=param_grid, 
                   cv=5,verbose=True)
mlp_GS.fit(Xtrain_and_val_scaled, Ytrain_and_val)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:   41.1s finished


GridSearchCV(cv=5, estimator=MLPRegressor(),
             param_grid={'activation': ['relu'],
                         'hidden_layer_sizes': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                         'max_iter': [10000], 'random_state': [2016515],
                         'solver': ['lbfgs']},
             verbose=True)

Now let's check which parameter value is best according to cross-validation and print the corresponding error. The comparison with the linear model (learned on train and validation) on test data follows in the next sections.

In [39]:
#let's print the best model according to grid search
print("Best model: ",mlp_GS.best_estimator_)
#let's print the error 1-R^2 (averaged over the cross-validation folds) for the best model
print("Error (1-R^2) of best model: ",1. - mlp_GS.best_score_)

Best model:  MLPRegressor(hidden_layer_sizes=7, max_iter=10000, random_state=2016515,
             solver='lbfgs')
Error (1-R^2) of best model:  0.1787776603202762


Let's learn the best NN using all of training and validation, and then compare the error of the best NN on train and validation and on test data.

In [40]:
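# note: the grid search above reported hidden_layer_sizes=7, while this cell hard-codes 6
# (the errors printed below come from 6); mlp_GS.best_params_['hidden_layer_sizes'] gives the selected value
best_mlp = MLPRegressor(hidden_layer_sizes=(6,), activation='relu', solver='lbfgs', random_state=numero_di_matricola, max_iter=10000)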
best_mlp.fit(Xtrain_and_val_scaled,Ytrain_and_val)

print("Error best model on train and validation: ",1. - best_mlp.score(Xtrain_and_val_scaled,Ytrain_and_val))
print("Error best model on test data: ",1. - best_mlp.score(Xtest_scaled,Ytest))

Error best model on train and validation:  0.1290153025459786
Error best model on test data:  0.2758398041468335


# Linear Regression

Now let's learn the linear model on train and validation, and get error (1-R^2) on train and validation and on test data.

In [41]:
from sklearn import linear_model
#LR the linear regression model
LR = linear_model.LinearRegression()

#fit the model on training and validation data
LR.fit(Xtrain_and_val_scaled, Ytrain_and_val)

print("1 - coefficient of determination on training data:"+str(1 - LR.score(Xtrain_and_val_scaled,Ytrain_and_val)))
print("1 - coefficient of determination on test data:"+str(1 - LR.score(Xtest_scaled,Ytest)))

1 - coefficient of determination on training data:0.27731047267675213
1 - coefficient of determination on test data:0.32671507627694774
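
The error used throughout is $1 - R^2$, where $R^2$ is what the `score` method of scikit-learn regressors returns. As a minimal sketch on made-up numbers, it equals the residual sum of squares divided by the total sum of squares around the mean of the targets, so 0 means a perfect fit and 1 means no better than always predicting the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy targets and predictions (made up for illustration)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
error = ss_res / ss_tot                          # equals 1 - R^2

print(error, 1 - r2_score(y_true, y_pred))
```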


# k-Nearest Neighbours

You will now explore the k-Nearest Neighbours (kNN) method for regression. In order to do this, you will need to load the scikit-learn class *neighbors.KNeighborsRegressor*.

k-Nearest Neighbours for regression works as follows: the predicted value $h(\textbf{x})$ for an instance $\textbf{x}$ is obtained by first finding the $\ell$ instances *in the training set* that are closest to $\textbf{x}$; the predicted value $h(\textbf{x})$ is then the mean of the targets of these $\ell$ instances. $\ell$ is a parameter of the method (`n_neighbors` in scikit-learn). The targets of the $\ell$ instances used for prediction can be weighted by the inverse of their distance to $\textbf{x}$.
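
The rule above can be sketched directly in NumPy. The helper `knn_predict` below is a hypothetical name introduced for illustration, the data is synthetic, and Euclidean distance is assumed (the scikit-learn default); the scikit-learn predictions are printed next to it for comparison.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def knn_predict(X_train, y_train, x, n_neighbors=5, weighted=False):
    """Predict the target of a single instance x from its nearest neighbours."""
    dist = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dist)[:n_neighbors]     # indices of the closest points
    if not weighted:
        return y_train[nearest].mean()           # plain mean of the neighbours' targets
    # assumes no neighbour is at distance exactly 0 (that would divide by zero)
    w = 1.0 / dist[nearest]
    return np.sum(w * y_train[nearest]) / np.sum(w)

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))
y_train = X_train @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100)
x = rng.normal(size=4)

for weighted, weights in [(False, 'uniform'), (True, 'distance')]:
    ref = KNeighborsRegressor(n_neighbors=5, weights=weights).fit(X_train, y_train)
    print(weights, knn_predict(X_train, y_train, x, 5, weighted),
          ref.predict(x.reshape(1, -1))[0])
```

One consequence of distance weighting: when predicting on the training points themselves, each point is its own nearest neighbour at distance zero, and scikit-learn then gives it all the weight, so its own target is reproduced unless exact duplicates with different targets exist.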

## TO DO: load the package for kNN regression, learn the model with default parameters using the training and validation scaled data, and print the error (1-R^2) on the data used to train the model and on the test data.

In [42]:
from sklearn.neighbors import KNeighborsRegressor

n = KNeighborsRegressor(n_neighbors=5, weights='uniform', algorithm='auto')
n.fit(Xtrain_and_val_scaled, Ytrain_and_val)

print("Error on training and validation data:"+str(1 - n.score(Xtrain_and_val_scaled,Ytrain_and_val)))
print("Error on test data:"+str(1 - n.score(Xtest_scaled,Ytest)))

Error on training and validation data:0.1619317909989555
Error on test data:0.2513359212537556


## TO DO: repeat the step above (including the printing instructions) using the kNN version where points are weighted by the inverse of their distance

In [43]:
from sklearn.neighbors import KNeighborsRegressor

n = KNeighborsRegressor(n_neighbors=5, weights='distance', algorithm='auto')
n.fit(Xtrain_and_val_scaled, Ytrain_and_val)

print("Error on training and validation data:"+str(1 - n.score(Xtrain_and_val_scaled,Ytrain_and_val)))
print("Error on test data:"+str(1 - n.score(Xtest_scaled,Ytest)))

Error on training and validation data:0.0006293883379925314
Error on test data:0.2513827909308828


## TO DO: use cross validation to choose the best number of neighbours between 2 and 20

In [44]:
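knn = KNeighborsRegressor()
# note: range(2,20) yields 2..19; use range(2,21) to also try 20 neighbours
param_grid = {'n_neighbors': [i for i in range(2,20)]}
clf = GridSearchCV(knn, param_grid, cv=5)
clf.fit(Xtrain_and_val_scaled, Ytrain_and_val)
# cv_results_ holds the per-candidate timing and score arrays, not a single model
print("Best model: ",clf.cv_results_)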

Best model:  {'mean_fit_time': array([0.0095746 , 0.00998812, 0.01018114, 0.00937338, 0.00917454,
       0.0097733 , 0.00977955, 0.00938034, 0.0087594 , 0.008571  ,
       0.00818291, 0.0083694 , 0.00797696, 0.00837035, 0.00818095,
       0.00837531, 0.0083704 , 0.00837502]), 'std_fit_time': array([1.01177774e-03, 6.22296998e-04, 9.84372140e-04, 7.97470087e-04,
       7.43958889e-04, 7.46365657e-04, 7.41215327e-04, 5.06447509e-04,
       7.41772895e-04, 4.83814513e-04, 4.10367111e-04, 4.80809484e-04,
       2.87450159e-06, 4.82612855e-04, 4.10513997e-04, 4.88115996e-04,
       4.84910521e-04, 4.90391481e-04]), 'mean_score_time': array([0.03909483, 0.04147968, 0.04148059, 0.04358721, 0.04248033,
       0.04288616, 0.0440762 , 0.04387684, 0.04090233, 0.04009895,
       0.04089112, 0.04009314, 0.04169035, 0.04148903, 0.04268756,
       0.04188852, 0.04268551, 0.04248672]), 'std_score_time': array([0.00291588, 0.00120343, 0.00101321, 0.00366982, 0.00285481,
       0.00166904, 0.00342386, 0

## TO DO: print the best model according to cross validation above, and print the score of the best model 

In [45]:
#let's print the best model according to grid search
print("Best model: ", clf.best_estimator_)
#let's print the score (mean cross-validated R^2, not the error) of the best model
print("Score of best model: ", clf.best_score_)

Best model:  KNeighborsRegressor(n_neighbors=6)
Score of best model:  0.7589786583414776
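
A sketch of how the selected value can be read back from the search object instead of being retyped by hand. It runs on synthetic data (an assumption for illustration), and relies on the default `refit=True`, under which `best_estimator_` is already refitted on all the data passed to `fit`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

search = GridSearchCV(KNeighborsRegressor(),
                      {'n_neighbors': list(range(2, 21))},  # 2..20 inclusive
                      cv=5)
search.fit(X, y)

best_k = search.best_params_['n_neighbors']
print("selected n_neighbors:", best_k)
print("mean cross-validated R^2:", search.best_score_)

# Refitting by hand with the selected value gives the same model as best_estimator_
refit_by_hand = KNeighborsRegressor(n_neighbors=best_k).fit(X, y)
print(np.allclose(refit_by_hand.predict(X), search.best_estimator_.predict(X)))
```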


## TO DO: learn the best model on all of the training and validation scaled data, and print the error on training and validation scaled data, and on test scaled data

In [46]:
from sklearn.neighbors import KNeighborsRegressor

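# note: the cross validation above selected n_neighbors=6, while this cell hard-codes 7
# (the errors printed below come from 7); clf.best_params_['n_neighbors'] gives the selected value
n = KNeighborsRegressor(n_neighbors=7)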
n.fit(Xtrain_and_val_scaled, Ytrain_and_val)

print("Error best model on train and validation: ", 1 - n.score(Xtrain_and_val_scaled,Ytrain_and_val))
print("Error best model on test data: ", 1 - n.score(Xtest_scaled,Ytest)) 

Error best model on train and validation:  0.17314070666865855
Error best model on test data:  0.2458093661707873


## TO DO: compare the error on test data of the best kNN model with the error on test data of linear regression and of NNs. Describe what you observe and give a potential explanation.
## [USE MAX 10 LINES]

# Clustering and "Local" Linear Models

You are now going to explore the use of clustering to identify groups of *similar* instances, and then learning models that are specific to each group.

Once you have clustered the data, and then learned a model for each cluster, the prediction for a new instance is obtained by using the model of the cluster that is the closest to the instance, where the distance of a cluster to the instance is defined as the distance of the *center* of the cluster to the instance.

**Note**: in this part you are not explicitly told which part of the data to use; deciding which one is correct is part of the homework!

## TO DO: use k-means in sklearn to learn a clustering with 5 clusters.

In [53]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5,random_state=numero_di_matricola)
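# fit_predict fits k-means and returns the cluster label of each sample
# (k-means is unsupervised, so the targets are not needed and one fit is enough)
Clust_train_val = kmeans.fit_predict(Xtrain_and_val_scaled)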
print(Clust_train_val)

[0 4 4 ... 2 2 2]


## TO DO: for each cluster, learn a linear model using the elements of the cluster. For each model, print the error on the data used to learn it.

In [73]:
# c[label] holds the row indices of the samples assigned to that cluster
c_names=np.unique(Clust_train_val)
c={label: np.where(Clust_train_val == label)[0] for label in c_names}
            
models=[]
for i in c_names:
    models.append(linear_model.LinearRegression().fit(Xtrain_and_val_scaled[c[i],:],Ytrain_and_val[c[i]]))
    print("error:"+str(1-models[i].score(Xtrain_and_val_scaled[c[i],:],Ytrain_and_val[c[i]])))

error:0.33248409821626024
error:0.05237794589214195
error:0.3299433227478712
error:0.08376675201732664
error:0.3434995018025111


## TO DO: *compute* the error (1 - R^2) on the data not used to learn the models.
For each instance not used to learn the model, the prediction is done by:
- finding the cluster C whose center is the closest to the instance
- using the model learned for cluster C to make the prediction

In [49]:
def find_center(instance,centers,num_centers):
    dist=np.zeros(num_centers)
    for i in range(num_centers):
        dist[i]=np.linalg.norm(instance-centers[i])
    return int(np.argmin(dist))

centers=kmeans.cluster_centers_
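# size the prediction vector from the test matrix itself
Ypred=np.zeros(Xtest_scaled.shape[0])
for row in range(Xtest_scaled.shape[0]):
    index = find_center(Xtest_scaled[row,:],centers,centers.shape[0])
    # predict returns a length-1 array; take its single element
    Ypred[row]= models[index].predict(Xtest_scaled[row,:].reshape(1,-1))[0]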
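
The nearest-center lookup can also be written without an explicit loop. The sketch below, on synthetic data (an assumption for illustration), computes all instance-to-center distances at once and compares the result with `KMeans.predict`, which likewise assigns each instance to its closest center.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(200, 3))
X_new = rng.normal(size=(40, 3))

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_fit)

# Distances between every new instance and every center: shape (40, 5)
dist = np.linalg.norm(X_new[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
nearest = dist.argmin(axis=1)   # index of the closest center for each instance

print(np.array_equal(nearest, km.predict(X_new)))
```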

## TO DO: *print* the error (1-R^2) on the data not used to learn the models

In [50]:
from sklearn.metrics import r2_score
print("error on test data:",str(1-r2_score(Ytest,Ypred)))

error on test data: 0.2963820743395511


## TO DO: compare the error of the model "clustering + linear models" and of the linear model (see the beginning of the HW). Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]

## TO DO: compare the error of the model "clustering + linear models" and of kNN. Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]

# Clustering and "Local" NNs

Repeat the same as above, but using neural networks instead of linear models.

**Note**: we are not telling you which parameters to use for NNs. You have to decide how to select the parameters.

## TO DO: clearly explain how you decided to set the parameters, motivating the choice of your strategy.

## TO DO: repeat the analysis in part "Clustering and "Local" Linear Models" using NNs instead of linear models.

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5,random_state=numero_di_matricola)
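# fit_predict fits k-means and returns the cluster label of each sample
# (k-means is unsupervised, so the targets are not needed and one fit is enough)
Clust_train_val = kmeans.fit_predict(Xtrain_and_val_scaled)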
print(Clust_train_val)

# c[label] holds the row indices of the samples assigned to that cluster
c_names=np.unique(Clust_train_val)
c={label: np.where(Clust_train_val == label)[0] for label in c_names}
            

models=[]
mlp_cv = MLPRegressor()
param_grid = {'hidden_layer_sizes': [i for i in range(1,10)],
              'activation': ['relu'],
              'solver': ['lbfgs'], 
              'random_state': [numero_di_matricola],
              'max_iter': [10000]
             }
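for i in c_names:
    # build a new GridSearchCV for each cluster: fit() returns the object itself,
    # so reusing a single instance would make every list entry the last fitted search
    mlp_GS = GridSearchCV(mlp_cv, param_grid=param_grid, 
                       cv=5,verbose=True)
    models.append(mlp_GS.fit(Xtrain_and_val_scaled[c[i],:],Ytrain_and_val[c[i]]))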
    
def find_center(instance,centers,num_centers):
    dist=np.zeros(num_centers)
    for i in range(num_centers):
        dist[i]=np.linalg.norm(instance-centers[i])
    return int(np.argmin(dist))

centers=kmeans.cluster_centers_
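# size the prediction vector from the test matrix itself
Ypred=np.zeros(Xtest_scaled.shape[0])
for row in range(Xtest_scaled.shape[0]):
    index = find_center(Xtest_scaled[row,:],centers,centers.shape[0])
    # predict returns a length-1 array; take its single element
    Ypred[row]= models[index].predict(Xtest_scaled[row,:].reshape(1,-1))[0]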
    
from sklearn.metrics import r2_score
print("error on test data:",str(1-r2_score(Ytest,Ypred)))

[0 4 4 ... 2 2 2]
Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:   12.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:   10.6s finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
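
One thing to keep in mind when fitted models are collected in a list, one per cluster: in scikit-learn, `fit` returns the estimator itself, so fitting the same object repeatedly and appending the result stores the same object several times. The sketch below illustrates this with `LinearRegression` on made-up data for speed; the same holds for a `GridSearchCV` object, which is why a separate object is needed per cluster (for instance via `sklearn.base.clone`).

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

# Two synthetic groups with different slopes (made up for illustration)
X_a = np.arange(10, dtype=float).reshape(-1, 1)
y_a = 2.0 * X_a.ravel()
X_b = np.arange(10, dtype=float).reshape(-1, 1)
y_b = -3.0 * X_b.ravel()

# Reusing one estimator: both list entries are the same object, fitted last on group b
shared = LinearRegression()
reused = [shared.fit(X_a, y_a), shared.fit(X_b, y_b)]
print(reused[0] is reused[1], reused[0].coef_, reused[1].coef_)

# One unfitted copy per group: each entry keeps its own fit
template = LinearRegression()
separate = [clone(template).fit(X_a, y_a), clone(template).fit(X_b, y_b)]
print(separate[0] is separate[1], separate[0].coef_, separate[1].coef_)
```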


## TO DO: compare the error of the model "clustering + NNs" and of NNs (see the beginning of the HW). Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]

## TO DO: compare the error of the model "clustering + NNs" and of kNN. Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]

## TO DO: compare the error of the model "clustering + NNs" and of "clustering + Linear Models". Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]