# Data Loading and Preprocessing

We use the same data as in the lab notebook: house sale prices for King County, which includes Seattle, for homes sold between May 2014 and May 2015.

https://www.kaggle.com/harlfoxem/housesalesprediction

For each house we know 18 features (e.g., number of bedrooms, number of bathrooms, etc.) plus its price, which is what we would like to predict.

## TO DO: Insert your ID number ("numero di matricola") below

In [None]:
#put here your ``numero di matricola''
numero_di_matricola = 2019157

Load the required packages

In [None]:
#import all packages needed
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Read the data, remove data samples/points with missing values (NaN), and print some statistics.

In [None]:
#load the data
df = pd.read_csv('kc_house_data.csv', sep = ',')

#remove the data samples with missing values (NaN)
df = df.dropna() 

df.describe()

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0
mean,4645240000.0,535435.8,3.381163,2.071903,2070.027813,15250.54,1.434893,0.009798,0.244311,3.459229,7.615676,1761.252212,308.775601,1967.489254,94.668774,98077.125158,47.557868,-122.212337,1982.544564,13176.302465
std,2854203000.0,380900.4,0.895472,0.768212,920.251879,42544.57,0.507792,0.098513,0.776298,0.682592,1.166324,815.934864,458.977904,28.095275,424.439427,54.172937,0.140789,0.139577,686.25667,25413.180755
min,1000102.0,75000.0,0.0,0.0,380.0,649.0,1.0,0.0,0.0,1.0,3.0,380.0,0.0,1900.0,0.0,98001.0,47.1775,-122.514,620.0,660.0
25%,2199775000.0,315000.0,3.0,1.5,1430.0,5453.75,1.0,0.0,0.0,3.0,7.0,1190.0,0.0,1950.0,0.0,98032.0,47.459575,-122.32425,1480.0,5429.5
50%,4027701000.0,445000.0,3.0,2.0,1910.0,8000.0,1.0,0.0,0.0,3.0,7.0,1545.0,0.0,1969.0,0.0,98059.0,47.5725,-122.226,1830.0,7873.0
75%,7358175000.0,640250.0,4.0,2.5,2500.0,11222.5,2.0,0.0,0.0,4.0,8.0,2150.0,600.0,1990.0,0.0,98117.0,47.68025,-122.124,2360.0,10408.25
max,9839301000.0,5350000.0,8.0,6.0,8010.0,1651359.0,3.5,1.0,4.0,5.0,12.0,6720.0,2620.0,2015.0,2015.0,98199.0,47.7776,-121.315,5790.0,425581.0


Get the feature matrix and the vector of target values. We want to predict the price by using features other than id as input.

In [None]:
Data = df.values
# m = number of input samples
m = Data.shape[0]
print("Amount of data:",m)
Y = Data[:m,2]
X = Data[:m,3:]

feature_names = df.columns[3:]

Amount of data: 3164


We split the $m$ samples of the data into 3 parts: one will be used for training and choosing the parameters, one for choosing among different models, and one for testing. The part for training and choosing the parameters will consist of $m_{train}=2/3 m$ samples, the one for choosing among different models will consist of $m_{val}= (m - m_{train})/2$ samples, while the other part consists of $m_{test}=m - m_{train} - m_{val}$ samples.

In [None]:
# Split data into train (2/3 of samples), validation (1/6 of samples), and test data (the rest)
m_train = int(2./3.*m)
m_val = int((m-m_train)/2.)
m_test = m - m_train - m_val
print("Amount of data for training and deciding parameters:",m_train)
print("Amount of data for validation (choosing among different models):",m_val)
print("Amount of data for test:",m_test)
from sklearn.model_selection import train_test_split

#Xtrain_and_val, Ytrain_and_val is the part of data for training and validation
#Xtest, Ytest is the part of data for testing
Xtrain_and_val, Xtest, Ytrain_and_val, Ytest = train_test_split(X, Y, test_size=m_test/m, random_state=numero_di_matricola)

#if you need to consider a specific training and validation split, use
#Xtrain, Ytrain for training and Xval, Yval for validation
Xtrain, Xval, Ytrain, Yval = train_test_split(Xtrain_and_val, Ytrain_and_val, test_size=m_val/(m_train+m_val), random_state=numero_di_matricola)

Amount of data for training and deciding parameters: 2109
Amount of data for validation (choosing among different models): 527
Amount of data for test: 528


Let's scale the data.

In [None]:
# Data pre-processing
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(Xtrain)
Xtrain_scaled = scaler.transform(Xtrain)
Xtrain_and_val_scaled = scaler.transform(Xtrain_and_val)
Xval_scaled = scaler.transform(Xval)
Xtest_scaled = scaler.transform(Xtest)
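
As a rough illustration of what the scaling step does, here is a minimal sketch in plain Python (not the scikit-learn implementation). It assumes standardization with the per-column mean and population standard deviation, estimated on the training rows only and then reused for every other split. The helper names `fit_standardizer` and `standardize` and the toy data are made up for this example.

```python
from statistics import fmean, pstdev

def fit_standardizer(rows):
    """Return per-column means and standard deviations computed from rows."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Population standard deviation; constant columns get 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def standardize(rows, means, stds):
    """Apply (x - mean) / std column by column, using already fitted statistics."""
    return [[(x - mu) / sd for x, mu, sd in zip(row, means, stds)] for row in rows]

# Toy data (hypothetical): statistics come from the training rows only.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[4.0, 40.0]]
means, stds = fit_standardizer(train)
train_scaled = standardize(train, means, stds)
test_scaled = standardize(test, means, stds)  # reuses the training statistics
```

The test rows are transformed with the training statistics, so their scaled columns do not necessarily have zero mean and unit variance.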



# Neural Networks

Let's learn the best neural network with 1 hidden layer and between 1 and 9 hidden nodes, choosing the best number of hidden nodes with cross-validation.

In [None]:
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

mlp_cv = MLPRegressor(max_iter=2000)
param_grid = {'hidden_layer_sizes': [i for i in range(1,10)],
              'activation': ['relu'],
              'solver': ['lbfgs'], 
              'random_state': [numero_di_matricola],
             }
mlp_GS = GridSearchCV( mlp_cv, param_grid=param_grid, cv=5, verbose=True)
mlp_GS.fit(Xtrain_and_val_scaled, Ytrain_and_val)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:   32.3s finished


GridSearchCV(cv=5, estimator=MLPRegressor(max_iter=2000),
             param_grid={'activation': ['relu'],
                         'hidden_layer_sizes': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                         'random_state': [2019157], 'solver': ['lbfgs']},
             verbose=True)
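
To make the model selection step above more concrete, here is a minimal sketch of 5-fold cross-validation in plain Python. It is not the scikit-learn code: it assumes contiguous, unshuffled folds, and `score_fn` is a hypothetical callback that fits a candidate on the training indices and returns its score (e.g. R^2) on the validation indices.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Split range(n_samples) into n_folds contiguous validation folds."""
    # The first n_samples % n_folds folds get one extra sample.
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, val_idx))
        start += size
    return folds

def cross_val_select(candidates, score_fn, n_samples, n_folds=5):
    """Return the candidate with the highest mean validation score, and that score."""
    folds = k_fold_indices(n_samples, n_folds)
    mean_scores = {}
    for cand in candidates:
        scores = [score_fn(cand, tr, va) for tr, va in folds]
        mean_scores[cand] = sum(scores) / len(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores[best]

# 2109 + 527 = 2636 training and validation samples, as in the split above.
folds = k_fold_indices(2636, 5)
fold_sizes = [len(val_idx) for _, val_idx in folds]

# Made-up score that peaks at 5 hidden nodes, only to exercise the selection logic.
best, best_score = cross_val_select(range(1, 10), lambda c, tr, va: -(c - 5) ** 2, 2636)
```

Each candidate is scored on data it was not fitted on in that fold, and the mean over the folds plays the role of `best_score_` in the grid search.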

Now let's check which parameter is best, and compare the best NN with the linear model (learned on train and validation) on test data.

In [None]:
#let's print the best model according to grid search
print("Best model: ",mlp_GS.best_estimator_)
#let's print the error 1-R^2 for the best model
print("Error (1-R^2) of best model: ",1. - mlp_GS.best_score_)

Best model:  MLPRegressor(hidden_layer_sizes=5, max_iter=2000, random_state=2019157,
             solver='lbfgs')
Error (1-R^2) of best model:  0.1953286353963405


Let's learn the best NN using all of training and validation, and then compare the error of the best NN on train and validation and on test data.

In [None]:
#best_mlp = MLPRegressor(max_iter=2000, hidden_layer_sizes=(5,), activation='relu', solver='lbfgs', random_state = numero_di_matricola)
best_mlp = mlp_GS.best_estimator_
best_mlp.fit(Xtrain_and_val_scaled,Ytrain_and_val)

print("Error best model on train and validation: ",1. - best_mlp.score(Xtrain_and_val_scaled,Ytrain_and_val))
print("Error best model on test data: ",1. - best_mlp.score(Xtest_scaled,Ytest))

Error best model on train and validation:  0.14124494388261788
Error best model on test data:  0.18499202712189133
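
The error used throughout is $1 - R^2 = \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $\hat{y}_i$ are the predictions and $\bar{y}$ is the mean of the true targets. A minimal plain-Python sketch of this quantity follows; the function name `one_minus_r2` and the toy values are chosen for this example only.

```python
def one_minus_r2(y_true, y_pred):
    """Residual sum of squares divided by the total sum of squares around the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
err_perfect = one_minus_r2(y_true, [1.0, 2.0, 3.0, 4.0])  # perfect predictions
err_mean = one_minus_r2(y_true, [2.5, 2.5, 2.5, 2.5])     # always predicting the mean
```

An error of 0 means perfect predictions, an error of 1 means the model is no better than always predicting the mean target, and values above 1 are possible for models that do worse than that.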


# Linear Regression

Now let's learn the linear model on train and validation, and get error (1-R^2) on train and validation and on test data.

In [None]:
from sklearn import linear_model
#LR the linear regression model
LR = linear_model.LinearRegression()

#fit the model on training data
LR.fit(Xtrain_and_val_scaled, Ytrain_and_val)

print("1 - coefficient of determination on training data:"+str(1 - LR.score(Xtrain_and_val_scaled,Ytrain_and_val)))
print("1 - coefficient of determination on test data:"+str(1 - LR.score(Xtest_scaled,Ytest)))

1 - coefficient of determination on training data:0.27013072490104273
1 - coefficient of determination on test data:0.326693824565925


# k-Nearest Neighbours

You will now explore the k-Nearest Neighbours (kNN) method for regression. In order to do this, you will need to load the scikit-learn package *neighbors.KNeighborsRegressor*.

k-Nearest Neighbours for regression works as follows: the predicted value $h(\textbf{x})$ for an instance $\textbf{x}$ is obtained by first finding the $\ell$ instances *in the training set* that are closest to $\textbf{x}$; the predicted value $h(\textbf{x})$ is then the mean of the targets of these $\ell$ instances. $\ell$ is a parameter of the method. The targets of the $\ell$ instances used for prediction can be weighted by the (inverse of) their distance to $\textbf{x}$.
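
The description above can be sketched in a few lines of plain Python. This is an illustrative brute-force version, not the scikit-learn implementation: it assumes Euclidean distance, and the parameter names `n_neighbors` (the $\ell$ above) and `weights` are borrowed from `KNeighborsRegressor` only for readability. The toy data is made up.

```python
import math

def knn_predict(x, X_train, y_train, n_neighbors=5, weights="uniform"):
    """Predict the target of x from its n_neighbors closest training instances."""
    # Euclidean distance from x to every training instance.
    dists = [math.dist(x, xi) for xi in X_train]
    nearest = sorted(range(len(X_train)), key=lambda i: dists[i])[:n_neighbors]
    if weights == "uniform":
        # Plain mean of the neighbours' targets.
        return sum(y_train[i] for i in nearest) / len(nearest)
    # Inverse-distance weighting: neighbours at distance 0 decide the prediction alone.
    exact = [y_train[i] for i in nearest if dists[i] == 0.0]
    if exact:
        return sum(exact) / len(exact)
    w = [1.0 / dists[i] for i in nearest]
    return sum(wi * y_train[i] for wi, i in zip(w, nearest)) / sum(w)

# Toy one-dimensional data (hypothetical).
X_train = [[0.0], [1.0], [2.0], [10.0]]
y_train = [0.0, 1.0, 2.0, 10.0]
pred_uniform = knn_predict([0.5], X_train, y_train, n_neighbors=3)
pred_distance = knn_predict([0.5], X_train, y_train, n_neighbors=3, weights="distance")
pred_on_train = knn_predict([1.0], X_train, y_train, n_neighbors=3, weights="distance")
```

With inverse-distance weights, a query that coincides with a training instance gets that instance's own target back, which is why a distance-weighted model is expected to show an error close to zero on the data it was fitted on.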

## TO DO: load the package for kNN regression, learn the model with default parameters using the training and validation scaled data, and print the error (1-R^2) on the data used to train the model and on the test data.

In [None]:
#import package
from sklearn.neighbors import KNeighborsRegressor
neigh = KNeighborsRegressor()
#TO DO: learn model
#fit the model on training and validation scaled data
neigh.fit(Xtrain_and_val_scaled, Ytrain_and_val)


print("Error on train and validation: ",1. - neigh.score(Xtrain_and_val_scaled,Ytrain_and_val))
print("Error on test data: ",1. - neigh.score(Xtest_scaled,Ytest))

Error on train and validation:  0.1397947148596227
Error on test data:  0.31539170780529113


## TO DO: repeat the point (including the printing instructions) above using the kNN version where points are weighted by the inverse of their distance 

In [None]:
neigh_dist = KNeighborsRegressor(weights='distance')

#fit the model on training and validation scaled data
neigh_dist.fit(Xtrain_and_val_scaled, Ytrain_and_val)

print("Error on train and validation: ",1. - neigh_dist.score(Xtrain_and_val_scaled,Ytrain_and_val))
print("Error on test data: ",1. - neigh_dist.score(Xtest_scaled,Ytest))

Error on train and validation:  0.00036709356405661975
Error on test data:  0.31311736735448314


## TO DO: use cross validation to choose the best number of neighbours between 2 and 20

In [None]:
neigh_cv = KNeighborsRegressor()
#note: range(2,20) stops at 19; use range(2,21) to also try 20 neighbours
param_grid = {'n_neighbors': [i for i in range(2,20)],
              'weights':['uniform','distance']
             }
neigh_GS = GridSearchCV( neigh_cv, param_grid=param_grid, cv=5, verbose=True)
neigh_GS.fit(Xtrain_and_val_scaled, Ytrain_and_val)

Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed:    9.2s finished


GridSearchCV(cv=5, estimator=KNeighborsRegressor(),
             param_grid={'n_neighbors': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
                                         14, 15, 16, 17, 18, 19],
                         'weights': ['uniform', 'distance']},
             verbose=True)

## TO DO: print the best model according to cross validation above, and print the score of the best model 

In [None]:
#let's print the best model according to grid search
print("Best model: ",neigh_GS.best_estimator_)
#let's print the score for the best model
print("Score of best model: ", neigh_GS.best_score_)

Best model:  KNeighborsRegressor(n_neighbors=6, weights='distance')
Score of best model:  0.7917623409789576


## TO DO: learn the best model on all of the training and validation scaled data, and print the error on training and validation scaled data, and on test scaled data

In [None]:
best_neigh = KNeighborsRegressor(n_neighbors=6, weights='distance')
best_neigh.fit(Xtrain_and_val_scaled,Ytrain_and_val)

print("Error best model on train and validation: ",1. - best_neigh.score(Xtrain_and_val_scaled,Ytrain_and_val))
print("Error best model on test data: ",1. - best_neigh.score(Xtest_scaled,Ytest))

Error best model on train and validation:  0.00036709356405661975
Error best model on test data:  0.3216866455997077


## TO DO: compare the error on test data of the best kNN model with the error on test data of linear regression and of NNs. Describe what you observe and give a potential explanation.
## [USE MAX 10 LINES]

Error best model on test data

kNN: 0.3216866455997077

Linear Regression:0.326693824565925

NN: 0.18499202712189133

The best NN performs better than both kNN and linear regression.
The neural network probably outperforms linear regression because it can represent non-linear functions.
The large gap in R^2 between the NN and the linear model suggests that the relation between features and price is non-linear.
Comparing kNN (a non-parametric method) with the NN and the linear model (parametric methods), we see that
kNN overfits the training and validation set (its error there is almost zero) and does not generalize well to the test set.
This could also be caused by the small number of training and validation samples or by their particular distribution.

# Clustering and "Local" Linear Models

You are now going to explore the use of clustering to identify groups of *similar* instances, and then learning models that are specific to each group.

Once you have clustered the data, and then learned a model for each cluster, the prediction for a new instance is obtained by using the model of the cluster that is the closest to the instance, where the distance of a cluster to the instance is defined as the distance of the *center* of the cluster to the instance.
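
The prediction rule described above can be sketched as follows in plain Python. This is a minimal illustration under assumptions: Euclidean distance to the centers, and per-cluster models represented as hypothetical callables with made-up coefficients. With scikit-learn's `KMeans`, the centers are available as `cluster_centers_` and `predict` returns the index of the closest center for each sample.

```python
import math

def nearest_center(x, centers):
    """Index of the cluster center closest to x (Euclidean distance)."""
    return min(range(len(centers)), key=lambda c: math.dist(x, centers[c]))

def predict_local(x, centers, models):
    """Route x to the model of its nearest cluster and return that model's prediction."""
    return models[nearest_center(x, centers)](x)

# Toy setup: two hypothetical clusters, each with its own linear model.
centers = [[0.0, 0.0], [10.0, 10.0]]
models = [
    lambda x: 1.0 + 2.0 * x[0],   # model fitted on cluster 0 (made-up coefficients)
    lambda x: 50.0 - 1.0 * x[1],  # model fitted on cluster 1 (made-up coefficients)
]
pred_a = predict_local([1.0, 0.5], centers, models)   # routed to cluster 0
pred_b = predict_local([9.0, 11.0], centers, models)  # routed to cluster 1
```

Each instance is predicted by exactly one local model, so the overall error has to be computed on the pooled predictions of all clusters rather than by averaging per-cluster scores.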

**Note**: in this part you are not explicitly told which part of the data to use; deciding which one is correct is part of the homework!

## TO DO: use k-means in sklearn to learn a clustering with 5 clusters.

In [None]:
#load the required packages
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.metrics import r2_score

kmeans = KMeans(n_clusters=5, random_state=numero_di_matricola).fit(Xtrain_and_val_scaled)
Ytrain_and_val_predicted=kmeans.predict(Xtrain_and_val_scaled)
Ytest_predicted=kmeans.predict(Xtest_scaled)

#I print the number of samples
label,counter=np.unique(Ytrain_and_val_predicted, return_counts = True)
print("TRAIN AND VAL:\nlabel: ",label,"\ncounter: ",counter)
label,counter=np.unique(Ytest_predicted, return_counts = True)
print("TEST:\nlabel: ",label,"\ncounter: ",counter)

TRAIN AND VAL:
label:  [0 1 2 3 4] 
counter:  [270 863 735  36 732]
TEST:
label:  [0 1 2 3 4] 
counter:  [ 52 172 144  11 149]


## TO DO: for each cluster, learn a linear model using the elements of the cluster. For each model, print the error on the data used to learn it.

In [None]:
#LRcluster array of the linear regression models
LRcluster=[]
for i in range(0,5):
    #let's construct the training and validation set of cluster i
    X_train_cluster=Xtrain_and_val_scaled[np.where(Ytrain_and_val_predicted==i)]
    Y_train_cluster=Ytrain_and_val[np.where(Ytrain_and_val_predicted==i)]
    #LRi linear regression model of cluster i
    LRi = linear_model.LinearRegression()
    #fit the model on training data of cluster i
    LRi.fit(X_train_cluster, Y_train_cluster)
    print("Cluster :",i, "  Error on train and validation: ",1. - LRi.score(X_train_cluster,Y_train_cluster))
    LRcluster.append(LRi)


Cluster : 0   Error on train and validation:  0.23703594108805937
Cluster : 1   Error on train and validation:  0.35052480765457983
Cluster : 2   Error on train and validation:  0.34723044510453116
Cluster : 3   Error on train and validation:  0.03503035380115194
Cluster : 4   Error on train and validation:  0.3431290290135154


## TO DO: *compute* the error (1 - R^2) on the data not used to learn the models.
For each instance not used to learn the model, the prediction is done by:
- finding the cluster C whose center is the closest to the instance
- use the model learned for cluster C to make the prediction

In [None]:
#Y_predicted_LR contains the prediction of the model on test data
Y_predicted_LR=np.array([])
#Y_test_LR will be Ytest reordered to let me calculate R^2
Y_test_LR=np.array([])

for i in range(0,5):
    #let's construct the test set of cluster i
    X_test_cluster=Xtest_scaled[np.where(Ytest_predicted==i)]
    Y_test_cluster=Ytest[np.where(Ytest_predicted==i)]
    #predictions of the model on test data of cluster i
    Y_predicted_LR=np.append(Y_predicted_LR,LRcluster[i].predict(X_test_cluster))
    Y_test_LR=np.append(Y_test_LR,Y_test_cluster)

#calculate R^2 using the function r2_score
measure_test=r2_score(Y_test_LR, Y_predicted_LR)
#calculate R^2 explicitly
#measure_test = 1-np.linalg.norm(Y_test_LR-Y_predicted_LR)**2/np.linalg.norm(Y_test_LR-Y_test_LR.mean())**2

## TO DO: *print* the error (1-R^2) on the data not used to learn the models

In [None]:
print("Error on Test Data (1-R^2):", 1-measure_test)

Error on Test Data (1-R^2): 0.28057902375638644


## TO DO: compare the error of the model "clustering + linear models" and of the linear model (see the beginning of the HW). Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]

Values of 1-R² for models:

CLUSTERING + LINEAR MODELS:   0.2805790237563843


BEST LINEAR MODEL:    0.326693824565925

"Clustering + linear models" performs better than the single linear model.
A possible explanation is that clustering groups similar samples together, so that
within each cluster the relation between features and price is closer to linear than the one a single linear model has to fit over the entire training and validation data.

## TO DO: compare the error of the model "clustering + linear models" and of kNN. Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]

CLUSTERING + LINEAR MODELS: 

-Test Error (1-R^2):    0.2805790237563843

BEST kNN:   

-Test Error (1-R^2):    0.3216866455997077

As we can see, "clustering + linear models" performs better than kNN.

kNN seems to overfit the training set and does not generalize well to the test set.

# Clustering and "Local" NNs

Repeat the same as above, but using neural networks instead of linear models.

**Note**: we are not telling you which parameters to use for the NNs. You have to decide how to select them.

## TO DO: clearly explain how you decided to set the parameters, motivating the choice of your strategy.

I decided to use cross validation to find the best NN architecture for each cluster.
As before, I tried different sizes for the hidden layer.
I decided not to use more than one hidden layer, to keep the running time of the code manageable.

## TO DO: repeat the analysis in part "Clustering and "Local" Linear Models" using NNs instead of linear models.

In [None]:
#I divide the code in 3 pieces to simplify the running and checking during changes on the code
NNcluster=[]
for i in range(0,5):
    #let's construct the training and validation set of cluster i
    X_train_val_cluster=Xtrain_and_val_scaled[np.where(Ytrain_and_val_predicted==i)]
    Y_train_val_cluster=Ytrain_and_val[np.where(Ytrain_and_val_predicted==i)]
    print('Cluster :',i, '   Samples: ',X_train_val_cluster.shape[0])
    mlp_cv_c = MLPRegressor(max_iter=3000)
    #candidate sizes for the single hidden layer (1 to 9 nodes)
    param_grid = {'hidden_layer_sizes': list(range(1,10)),
                  'activation': ['relu'],
                  'solver': ['lbfgs'],
                  'random_state': [numero_di_matricola],
                 }
    mlp_GS_c = GridSearchCV( mlp_cv_c, param_grid=param_grid, cv=5)
    mlp_GS_c.fit(X_train_val_cluster, Y_train_val_cluster)
    #let's print the best model according to grid search
    print("Best model: ",mlp_GS_c.best_estimator_)
    NNcluster.append(mlp_GS_c.best_estimator_)
    #let's fit the best model on training and validation set of cluster i
    NNcluster[i].fit(X_train_val_cluster, Y_train_val_cluster)


Cluster : 0    Samples:  270
Best model:  MLPRegressor(hidden_layer_sizes=3, max_iter=3000, random_state=2019157,
             solver='lbfgs')
Cluster : 1    Samples:  863
Best model:  MLPRegressor(hidden_layer_sizes=6, max_iter=3000, random_state=2019157,
             solver='lbfgs')
Cluster : 2    Samples:  735


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Best model:  MLPRegressor(hidden_layer_sizes=8, max_iter=3000, random_state=2019157,
             solver='lbfgs')
Cluster : 3    Samples:  36


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Best model:  MLPRegressor(hidden_layer_sizes=7, max_iter=3000, random_state=2019157,
             solver='lbfgs')
Cluster : 4    Samples:  732
Best model:  MLPRegressor(hidden_layer_sizes=8, max_iter=3000, random_state=2019157,
             solver='lbfgs')


In [None]:
#Y_predicted_NN contains the prediction of the model on test data
Y_predicted_NN=np.array([])
#Y_test_NN will be Ytest reordered to let me calculate R^2
Y_test_NN=np.array([])

for i in range(0,5):
    #let's construct the test set of cluster i
    X_test_cluster_nn=Xtest_scaled[np.where(Ytest_predicted==i)]
    Y_test_cluster_nn=Ytest[np.where(Ytest_predicted==i)]
    #predictions of the model on test data of cluster i
    Y_predicted_NN=np.append(Y_predicted_NN,NNcluster[i].predict(X_test_cluster_nn))
    Y_test_NN=np.append(Y_test_NN,Y_test_cluster_nn)
    
#calculate R^2 using the function r2_score
measure_test_NN=r2_score(Y_test_NN, Y_predicted_NN)
#calculate R^2 explicitly
#measure_test_NN = 1-np.linalg.norm(Y_test_NN-Y_predicted_NN)**2/np.linalg.norm(Y_test_NN-Y_test_NN.mean())**2

In [None]:
print("Error on Test Data(1-R^2):", 1-measure_test_NN)

Error on Test Data(1-R^2): 0.19603322161505854


## TO DO: compare the error of the model "clustering + NNs" and of NNs (see the beginning of the HW). Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]

CLUSTERING + NNs

Test Error (1-R^2) :  0.19603322161505843


BEST NN
Test Error (1-R^2) :  0.18499202712189133

We can see that a single NN trained on all the training and validation data performs better than the combination of clustering and NNs.
A possible explanation is that some clusters contain very few training samples (for example, cluster 3 has 36 samples in the training and validation set and 11 in the test set), so the NN trained on such a cluster cannot fit its parameters well.

## TO DO: compare the error of the model "clustering + NNs" and of kNN. Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]

CLUSTERING + NNs

Test Error (1-R^2) :  0.19603322161505843

BEST kNN

-Test Error (1-R^2):    0.3216866455997077

Clustering + NNs performs better than kNN.
kNN overfits the training and validation set and does not generalize well to the test set.
This could also be caused by the small number of training and validation samples or by their particular distribution.

## TO DO: compare the error of the model "clustering + NNs" and of "clustering + Linear Models". Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]

CLUSTERING + NNs

Test Error (1-R^2) :  0.19603322161505843

CLUSTERING + LINEAR MODELS: 

-Test Error (1-R^2):    0.2805790237563843


For the same reason as before, when we compared NNs with linear models, "clustering + NNs" outperforms "clustering + linear models", probably because we are dealing with non-linearities.
So the NN fitted on a cluster performs better than the linear model fitted on the same cluster.