#  Regression on House Pricing Dataset with SVM
We consider a reduced version of a dataset containing house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

https://www.kaggle.com/harlfoxem/housesalesprediction

For each house we know 18 features (e.g., number of bedrooms, number of bathrooms), plus its price, which is the value we would like to predict.

# Overview

In the notebook you will first:
- split the data into training, validation, and test
- standardize the data

You will then be asked to learn various SVM models, in particular:
- for each of the kernels ‘linear’, ‘poly’, ‘rbf’, and ‘sigmoid’, you will learn the best model having to choose among various values of some hyperparameters; the choice of hyperparameters must be done with 5-fold cross-validation
- choose the best kernel, using a validation approach (not cross-validation)
- learn the best SVM model overall

You will then be asked to estimate the generalization error of the best SVM model you report. 

At the end, just for comparison, you will also be asked to learn a standard linear regression model (with squared loss), and estimate its generalization error.

### IMPORTANT
- Note that in each of the above steps you will have to choose the appropriate split of the data (see the first bullet point above)
- The code should run without requiring modifications even if the best choice of some parameters changes; for example, you should not pass the best values of the hyperparameters "manually" (i.e., by hard-coding them as input parameters to the models). The only exception is in the TO DO titled 'ANSWER THE FOLLOWING'
- For SVM, since the values to be predicted are all in the thousands of dollars, you will need to always set epsilon=1000
- Do not change the printing instructions (other than adding the correct variable name for your code), and do not add printing instructions!
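To see why `epsilon` matters at this scale, the following is a minimal sketch of the epsilon-insensitive loss that SVR minimizes: residuals smaller than `epsilon` cost nothing. The helper name `epsilon_insensitive_loss` and the prices are illustrative assumptions, not part of the assignment or the dataset. scikit-learn's default for `SVR` is `epsilon=0.1`, which is negligible when targets are measured in dollars.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample epsilon-insensitive loss: max(0, |y - y_hat| - epsilon)."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)

# Hypothetical prices in dollars (illustrative values, not taken from the dataset)
y_true = np.array([450_000.0, 300_000.0, 720_000.0])
y_pred = np.array([450_600.0, 310_000.0, 719_000.0])

# Errors of 600 and 1000 dollars fall inside the tube; the 10000 dollar error does not
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=1000)
print(losses)
```

Only the second prediction is penalized, and only for the part of its error that exceeds the 1000 dollar tolerance.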

## TO DO - INSERT YOUR NUMERO DI MATRICOLA BELOW

In [1]:
#put here your "numero di matricola" (student ID number)
numero_di_matricola = 2120933

The following code loads all required packages

In [2]:
#import all packages needed
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import svm
from sklearn import model_selection
from sklearn import linear_model

The code below loads the data and removes samples with missing values. It also prints the number of samples in the dataset.

In [3]:
#load the data - do not change the path below!
df = pd.read_csv('kc_house_data.csv', sep = ',')

#remove the data samples with missing values (NaN)
df = df.dropna() 

Data = df.values
m = Data.shape[0]
Y = Data[:m,2]
X = Data[:m,3:]

print("Total number of samples:",m)

Total number of samples: 3164


# Data preprocessing

## TO DO - SPLIT DATA INTO TRAINING, VALIDATION, AND TESTING, WITH THE FOLLOWING PERCENTAGES: 60%, 20%, 20%

Use the train_test_split function from sklearn.model_selection to do it; in every call fix random_state to your numero_di_matricola. At the end, you should store the data in the following variables:
- Xtrain, Ytrain: training data
- Xval, Yval: validation data
- Xtrain_val, Ytrain_val: training and validation data
- Xtest, Ytest: test data

The code then prints the number of samples in Xtrain, Xval, Xtrain_val, and Xtest

IMPORTANT:
- first split the data into training+validation and test; the first part of the output of train_test_split must correspond to training+validation
- then split training+validation into training and validation; the first part of the output of train_test_split must correspond to training


In [4]:
m_train = int((3/5) * m)
m_val = int((m-m_train)/2.)
m_test = m - m_train - m_val

Xtrain_val, Xtest, Ytrain_val, Ytest = train_test_split(X, Y, test_size = m_test/m, random_state = numero_di_matricola)
Xtrain, Xval, Ytrain, Yval = train_test_split(Xtrain_val, Ytrain_val, test_size = m_val/(m_train+m_val), random_state = numero_di_matricola)

print("Training size: ", Xtrain.shape[0])
print("Validation size: ", Xval.shape[0])
print("Training and validation size",Xtrain_val.shape[0])
print("Test size",Xtest.shape[0])

Training size:  1898
Validation size:  633
Training and validation size 2531
Test size 633
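As a possible variation on the split above, `train_test_split` also accepts an integer `test_size`, which is interpreted as an absolute number of samples. The sketch below reproduces the same sizes on synthetic stand-in arrays (`X_demo`, `Y_demo`, and `random_state=0` are assumptions for illustration), without converting the counts to fractions first.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; only the shapes matter here
m = 3164
X_demo = np.arange(m * 2).reshape(m, 2)
Y_demo = np.arange(m)

# Same size arithmetic as in the notebook: 60% / 20% / 20%
m_train = int((3 / 5) * m)
m_val = int((m - m_train) / 2.0)
m_test = m - m_train - m_val

# An integer test_size is taken as an absolute number of samples
X_tv, X_te, Y_tv, Y_te = train_test_split(X_demo, Y_demo, test_size=m_test, random_state=0)
X_tr, X_va, Y_tr, Y_va = train_test_split(X_tv, Y_tv, test_size=m_val, random_state=0)

print(X_tr.shape[0], X_va.shape[0], X_te.shape[0])
```

Passing counts directly avoids any dependence on how a fractional size is rounded to a whole number of samples.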


## TO DO - STANDARDIZE THE DATA

Standardize the data using the preprocessing.StandardScaler from scikit learn.

If V is the name of the variable storing part of the data, the corresponding standardized version should be stored in V_scaled. For example, the scaled version of Xtrain should be stored in Xtrain_scaled

In [5]:
scaler = preprocessing.StandardScaler().fit(Xtrain)

Xtrain_val_scaled = scaler.transform(Xtrain_val)
Xtrain_scaled = scaler.transform(Xtrain)
Xval_scaled = scaler.transform(Xval)
Xtest_scaled = scaler.transform(Xtest)
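The effect of fitting the scaler on one split and reusing it on the others can be checked with a small sketch on synthetic data (the arrays `X_fit` and `X_new` and the name `demo_scaler` are assumptions for illustration). The data used for fitting ends up with zero mean and unit standard deviation per feature, while other data transformed with the same scaler is only approximately standardized.

```python
import numpy as np
from sklearn import preprocessing

# Synthetic features drawn from the same distribution (illustrative only)
rng = np.random.default_rng(0)
X_fit = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_new = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Learn mean and standard deviation from X_fit only, then apply them to both sets
demo_scaler = preprocessing.StandardScaler().fit(X_fit)
X_fit_scaled = demo_scaler.transform(X_fit)
X_new_scaled = demo_scaler.transform(X_new)

print("fit data: ", X_fit_scaled.mean(axis=0), X_fit_scaled.std(axis=0))
print("new data: ", X_new_scaled.mean(axis=0), X_new_scaled.std(axis=0))
```

Fitting on the training portion only keeps information about the validation and test sets out of the preprocessing step.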

# SVM models: learning the best model for each kernel

## TO DO - CHOOSE THE BEST HYPERPARAMETERS FOR LINEAR KERNEL

Consider svm.SVR and linear kernel. Consider the following hyperparameters and their values:
- C: 0.1, 1, 10, 100, 1000

Leave all other input parameters to default. 

Find the best value of the hyperparameters using 5-fold cross validation. Use model_selection.GridSearchCV to perform the cross-validation.

Print the best value of the hyperparameters (they are in the attribute best_params_ from GridSearchCV)

In [6]:
print("\nLinear SVM")
param_grid = {'C': [0.1, 1, 10, 100, 1000]}
linear_svr = svm.SVR(epsilon = 1000, kernel='linear')

grid_search = model_selection.GridSearchCV(linear_svr, param_grid, cv=5)
grid_search.fit(Xtrain_scaled, Ytrain)
best_params_linear = grid_search.best_params_

print("Best value for hyperparameters: ", best_params_linear)


Linear SVM
Best value for hyperparameters:  {'C': 1000}


## TO DO - LEARN A MODEL WITH LINEAR KERNEL AND BEST CHOICE OF HYPERPARAMETERS

This model will be compared with the best models with other kernels using validation (not cross validation).

DO NOT PASS PARAMETERS BY HARD-CODING THEM IN THE CODE.

Print the training score of the best model.

In [7]:
linear_svr = svm.SVR(epsilon = 1000, kernel='linear', C = best_params_linear['C'])
linear_svr.fit(Xtrain_scaled, Ytrain)

training_score = linear_svr.score(Xtrain_scaled,Ytrain)
print("Training score: ", training_score)

Training score:  0.61747068721157
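One way to avoid hard-coding the selected values, sketched here on synthetic data from `make_regression` (the data, the reduced grid, and the names `search` and `model` are assumptions for illustration), is to unpack the `best_params_` dictionary into the constructor. This keeps working if the grid gains or loses hyperparameters.

```python
from sklearn import svm, model_selection
from sklearn.datasets import make_regression

# Synthetic regression problem (illustrative stand-in for the housing data)
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)

param_grid = {'C': [0.1, 1, 10]}
search = model_selection.GridSearchCV(svm.SVR(kernel='linear'), param_grid, cv=5)
search.fit(X_demo, y_demo)

# Unpack whatever the search selected instead of naming each hyperparameter
model = svm.SVR(kernel='linear', **search.best_params_)
model.fit(X_demo, y_demo)

print(search.best_params_, model.score(X_demo, y_demo))
```

With the default `refit=True`, `search.best_estimator_` is also already refitted on all the data passed to `fit`, so it could be used directly instead of constructing a new model.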


## TO DO - CHOOSE THE BEST HYPERPARAMETERS FOR POLY KERNEL

Consider svm.SVR and polynomial kernel. Consider the following hyperparameters and their values:
- C: 0.1, 1, 10, 100, 1000
- degree: 2, 3, 4

Leave all other input parameters to default. 

Find the best value of the hyperparameters using 5-fold cross validation. Use model_selection.GridSearchCV to perform the cross-validation.

Print the best value of the hyperparameters.

In [8]:
print("\nPoly SVM")
param_grid = {'C': [0.1, 1, 10, 100, 1000], 'degree': [2, 3, 4]}
poly_svr = svm.SVR(epsilon = 1000, kernel = 'poly')

grid_search = model_selection.GridSearchCV(poly_svr, param_grid, cv=5)
grid_search.fit(Xtrain_scaled, Ytrain)
best_params_poly = grid_search.best_params_

print("Best value for hyperparameters: ", best_params_poly)


Poly SVM
Best value for hyperparameters:  {'C': 1000, 'degree': 3}


## TO DO - LEARN A MODEL WITH POLY KERNEL AND BEST CHOICE OF HYPERPARAMETERS

This model will be compared with the best models with other kernels using validation (not cross validation).

DO NOT PASS PARAMETERS BY HARD-CODING THEM IN THE CODE.

Print the training score of the best model.

In [9]:
poly_svr = svm.SVR(epsilon = 1000, kernel='poly', C = best_params_poly['C'], degree = best_params_poly['degree'])
poly_svr.fit(Xtrain_scaled, Ytrain)

training_score = poly_svr.score(Xtrain_scaled,Ytrain)
print("Training score: ", training_score)

Training score:  0.5742261335302029


## TO DO - CHOOSE THE BEST HYPERPARAMETERS FOR RBF KERNEL

Consider svm.SVR and RBF kernel. Consider the following hyperparameters and their values:
- C: 0.1, 1, 10, 100, 1000
- gamma: 0.01

Leave all other input parameters to default. 

Find the best value of the hyperparameters using 5-fold cross validation. Use model_selection.GridSearchCV to perform the cross-validation.

Print the best value of the hyperparameters.

In [10]:
print("\nRBF SVM")
param_grid = {'C' : [0.1, 1, 10, 100, 1000], 'gamma' : [0.01]}
rbf_svr = svm.SVR(epsilon = 1000, kernel = 'rbf')

grid_search = model_selection.GridSearchCV(rbf_svr, param_grid, cv=5)
grid_search.fit(Xtrain_scaled, Ytrain)
best_params_rbf = grid_search.best_params_

print("Best value for hyperparameters: ", best_params_rbf)


RBF SVM
Best value for hyperparameters:  {'C': 1000, 'gamma': 0.01}


## TO DO - LEARN A MODEL WITH RBF KERNEL AND BEST CHOICE OF HYPERPARAMETERS

This model will be compared with the best models with other kernels using validation (not cross validation).

DO NOT PASS PARAMETERS BY HARD-CODING THEM IN THE CODE.

Print the training score of the best model.

In [11]:
rbf_svr = svm.SVR(epsilon = 1000, kernel='rbf', C = best_params_rbf['C'], gamma = best_params_rbf['gamma'])
rbf_svr.fit(Xtrain_scaled, Ytrain)

training_score = rbf_svr.score(Xtrain_scaled,Ytrain)
print("Training score: ", training_score)

Training score:  0.11298544463417481


## TO DO - CHOOSE THE BEST HYPERPARAMETERS FOR SIGMOID KERNEL

Consider svm.SVR and sigmoid kernel. Consider the following hyperparameters and their values:
- C: 0.1, 1, 10, 100, 1000
- gamma: 0.01
- coef0: 0, 1

Leave all other input parameters to default. 

Find the best value of the hyperparameters using 5-fold cross validation. Use model_selection.GridSearchCV to perform the cross-validation.

Print the best value of the hyperparameters.

In [12]:
print("\nSigmoid SVM")
param_grid = {'C' : [0.1, 1, 10, 100, 1000], 'gamma' : [0.01], 'coef0' : [0,1]}
sigmoid_svr = svm.SVR(epsilon = 1000, kernel = 'sigmoid')

grid_search = model_selection.GridSearchCV(sigmoid_svr, param_grid, cv=5)
grid_search.fit(Xtrain_scaled, Ytrain)
best_params_sigmoid = grid_search.best_params_

print("Best value for hyperparameters: ", best_params_sigmoid)


Sigmoid SVM
Best value for hyperparameters:  {'C': 1000, 'coef0': 0, 'gamma': 0.01}


## TO DO - LEARN A MODEL WITH SIGMOID KERNEL AND BEST CHOICE OF HYPERPARAMETERS

This model will be compared with the best models with other kernels using validation (not cross validation).

DO NOT PASS PARAMETERS BY HARD-CODING THEM IN THE CODE.

Print the training score of the best model.

In [13]:
sigmoid_svr = svm.SVR(epsilon = 1000, kernel='sigmoid', C = best_params_sigmoid['C'], coef0 = best_params_sigmoid['coef0'], gamma = best_params_sigmoid['gamma'])
sigmoid_svr.fit(Xtrain_scaled, Ytrain)

training_score = sigmoid_svr.score(Xtrain_scaled,Ytrain)

print("Training score: ", training_score)

Training score:  0.10280646238642355


## TO DO - USE VALIDATION TO CHOOSE THE BEST MODEL AMONG THE ONES LEARNED FOR THE VARIOUS KERNELS

Use validation to choose the best model among the four models (one for each kernel) you have learned above.

Print, following exactly the order described here, with 1 value for each line:
- the validation score of SVM with linear kernel (the template below does not include such print)
- the validation score of SVM with polynomial kernel (the template below does not include such print)
- the validation score of SVM with rbf kernel (the template below does not include such print)
- the validation score of SVM with sigmoid kernel (the template below does not include such print)
- the best kernel (e.g., sigmoid) 
- the validation score of the best kernel 

For the first 4 prints, use the format: "kernel validation score: ". For example, for linear kernel "Linear validation score: ", for rbf "rbf validation score: "

In [14]:
print("\nVALIDATION TO CHOOSE SVM KERNEL")

scores = {}

linear_validation_score = linear_svr.score(Xval_scaled, Yval)
scores['linear'] = linear_validation_score
print("linear validation score: ", linear_validation_score)

poly_validation_score = poly_svr.score(Xval_scaled, Yval)
scores['poly'] = poly_validation_score
print("poly validation score: ", poly_validation_score)

rbf_validation_score = rbf_svr.score(Xval_scaled, Yval)
scores['rbf'] = rbf_validation_score
print("rbf validation score: ", rbf_validation_score)

sigmoid_validation_score = sigmoid_svr.score(Xval_scaled, Yval)
scores['sigmoid'] = sigmoid_validation_score
print("sigmoid validation score: ", sigmoid_validation_score)

best_kernel = max(scores, key = scores.get)
best_score = max(scores.values())

print("\nBest kernel: ", best_kernel)
print("Validation score of best kernel: ", best_score)


VALIDATION TO CHOOSE SVM KERNEL
linear validation score:  0.6861993140856179
poly validation score:  -1.6850249655273508
rbf validation score:  0.15473989510939312
sigmoid validation score:  0.1319244871233307

Best kernel:  linear
Validation score of best kernel:  0.6861993140856179
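The negative validation score of the polynomial kernel is possible because the `score` method of scikit-learn regressors returns the coefficient of determination $R^2 = 1 - SS_{res}/SS_{tot}$, which is 0 for a model that always predicts the mean of the targets and negative for anything worse. A minimal sketch with made-up numbers (the helper `r2_manual` and the arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
mean_pred = np.full_like(y_true, y_true.mean())  # always predict the mean
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])        # predictions in reversed order

print(r2_manual(y_true, mean_pred))  # baseline: 0.0
print(r2_manual(y_true, bad_pred))   # worse than the baseline: -3.0
print(r2_score(y_true, bad_pred))    # scikit-learn's implementation, for comparison
```

A score of about -1.69 therefore means that the squared error of the model on the validation set is roughly 2.7 times that of a constant prediction equal to the validation mean.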


## TO DO - LEARN THE FINAL MODEL FOR WHICH YOU WANT TO ESTIMATE THE GENERALIZATION ERROR

Learn the final model (i.e., the one you would use to make predictions about future data).

Print the score of the model on the data used to learn it.

In [15]:
print("\nTRAINING SCORE BEST MODEL")
# Refit the scaler, since the training set is now Xtrain_val.
# Reusing the scaler fitted on Xtrain would probably change the
# result only slightly, since the mean and variance should not
# vary much, but refitting is more precise.
# Note: Xtest_scaled is not recomputed here, so it still uses
# the scaler fitted on Xtrain.

scaler = preprocessing.StandardScaler().fit(Xtrain_val)
Xtrain_val_scaled = scaler.transform(Xtrain_val)

linear_svr.fit(Xtrain_val_scaled, Ytrain_val)

score = linear_svr.score(Xtrain_val_scaled, Ytrain_val)
print("Score of the best model on the data used to learn it: ", score)


TRAINING SCORE BEST MODEL
Score of the best model on the data used to learn it:  0.6533490008340843


## TO DO - PRINT THE ESTIMATE OF THE GENERALIZATION ERROR FOR THE FINAL MODEL

Print the estimate of the generalization "score" for the final model. The generalization "score" is the score computed on the data used to estimate the generalization error.

In [16]:
print("\nGENERALIZATION SCORE BEST MODEL")

gen_score = linear_svr.score(Xtest_scaled, Ytest)

print("Estimate of the generalization score for best SVM model: ", gen_score)


GENERALIZATION SCORE BEST MODEL
Estimate of the generalization score for best SVM model:  0.6661265180107738


## TO DO - ANSWER THE FOLLOWING

Print the training score (score on data used to train the model) and the generalization score (score on data used to estimate the generalization error) of the final SVM model THAT YOU OBTAIN WHEN YOU RUN THE CODE, one per line, printing the smallest one first. NOTE: THE VALUES HERE SHOULD BE HARDCODED

Print your answer (yes/no) to the following question: does the relation (i.e., smaller, larger) between the training score and the generalization score agree with the theory?

Print your motivation for the yes/no answer above, using at most 500 characters.

In [17]:
print("\nANSWER")

#note that you may have to invert the order of the following 2 lines; print the smallest one first. THE VALUES HERE SHOULD BE HARD CODED!
print("Training score: ", 0.6533490008340843)
print("Generalization score: ", 0.6661265180107738)

#the following is a string with your answer
motivation = "MY ANSWER: no\nIn theory the training score should be greater than the generalization score, because the model should make more precise predictions on data it has already seen and learned from. The reversal may be caused by many factors, but the relatively low scores suggest that the model isn't correctly capturing the patterns in the dataset."

print(motivation)


ANSWER
Training score:  0.6533490008340843
Generalization score:  0.6661265180107738
MY ANSWER: no
In theory the training score should be greater than the generalization score, because the model should make more precise predictions on data it has already seen and learned from. The reversal may be caused by many factors, but the relatively low scores suggest that the model isn't correctly capturing the patterns in the dataset.


## TO DO: LEARN A STANDARD LINEAR MODEL
Learn a standard linear model using scikit learn.

Print the score of the model on the data used to learn it.

Print the generalization "score" of the model.

In [18]:
print("\nLR MODEL")
LR = linear_model.LinearRegression()
LR.fit(Xtrain_val_scaled, Ytrain_val)

training_score_linear = LR.score(Xtrain_val_scaled, Ytrain_val)
gen_score_linear = LR.score(Xtest_scaled, Ytest)
print("Score of LR model on data used to learn it: ", training_score_linear)
print("Generalization score of LR model: ", gen_score_linear)


LR MODEL
Score of LR model on data used to learn it:  0.7160808929396446
Generalization score of LR model:  0.7237059255139111
