# Regression on House Pricing Dataset with SVM
We consider a reduced version of a dataset containing house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

https://www.kaggle.com/harlfoxem/housesalesprediction

For each house we know 18 features (e.g., number of bedrooms, number of bathrooms) plus its price, which is the value we would like to predict.

# Overview

In the notebook you will first:
- split the data into training, validation, and test
- standardize the data

You will then be asked to learn various SVM models, in particular:
- for each of the kernels ‘linear’, ‘poly’, ‘rbf’, and ‘sigmoid’, you will learn the best model having to choose among various values of some hyperparameters; the choice of hyperparameters must be done with 5-fold cross-validation
- choose the best kernel, using a validation approach (not cross-validation)
- learn the best SVM model overall

You will then be asked to estimate the generalization error of the best SVM model you report. 

At the end, just for comparison, you will also be asked to learn a standard linear regression model (with squared loss) and estimate its generalization error.

### IMPORTANT
- Note that in each of the above steps you will have to choose the appropriate split of the data (see the first bullet point above)
- The code should run without modification even if the best choice of hyperparameters changes; for example, you should not pass the best hyperparameter values "manually" (i.e., hard-coding them as input parameters to the models). The only exception is the TO DO titled 'ANSWER THE FOLLOWING'
- For SVM, since the values to be predicted are all in the thousands of dollars, you will need to always set epsilon=1000
- Do not change the printing instructions (other than adding the correct variable name for your code), and do not add printing instructions!
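
To see why the scale of epsilon matters, recall that SVR is trained with the epsilon-insensitive loss, which ignores any error smaller than epsilon. The default in svm.SVR is epsilon=0.1, which is negligible when the targets are prices in dollars. The following is a minimal sketch of that loss; the function name and the example prices are hypothetical and only serve as an illustration.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=1000.0):
    """Return max(0, |y_true - y_pred| - epsilon)."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical prices in dollars, used only for illustration.
inside_tube = epsilon_insensitive_loss(500_000, 500_800)   # error of 800, below epsilon
outside_tube = epsilon_insensitive_loss(500_000, 502_500)  # error of 2500, above epsilon

print(inside_tube, outside_tube)
```

With epsilon=1000, a prediction that is off by 800 dollars costs nothing, while one that is off by 2500 dollars is penalized only for the 1500 dollars that exceed the tolerance.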

## TO DO - INSERT YOUR NUMERO DI MATRICOLA (STUDENT ID NUMBER) BELOW

In [1]:
#put here your "numero di matricola" (student ID number)
numero_di_matricola = 2095664

The following code loads all required packages

In [2]:
#import all packages needed
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import svm
from sklearn import model_selection
from sklearn import linear_model

The code below loads the data and removes samples with missing values. It also prints the number of samples in the dataset.

In [3]:
#load the data - do not change the path below!
df = pd.read_csv('kc_house_data.csv', sep = ',')

#remove the data samples with missing values (NaN)
df = df.dropna() 

Data = df.values
m = Data.shape[0]
Y = Data[:m,2]
X = Data[:m,3:]

print("Total number of samples:",m)

Total number of samples: 3164


# Data preprocessing

## TO DO - SPLIT DATA INTO TRAINING, VALIDATION, AND TESTING, WITH THE FOLLOWING PERCENTAGES: 60%, 20%, 20%

Use the train_test_split function from sklearn.model_selection to do it; in every call fix random_state to your numero_di_matricola. At the end, you should store the data in the following variables:
- Xtrain, Ytrain: training data
- Xval, Yval: validation data
- Xtrain_val, Ytrain_val: training and validation data
- Xtest, Ytest: test data

The code then prints the number of samples in Xtrain, Xval, Xtrain_val, and Xtest

IMPORTANT:
- first split the data into training+validation and test; the first part of the output of train_test_split must correspond to training+validation
- then split training+validation into training and validation; the first part of the output of train_test_split must correspond to training


In [4]:
Xtrain_val, Xtest, Ytrain_val, Ytest = train_test_split(X, Y, test_size=0.2, random_state=numero_di_matricola)
Xtrain, Xval, Ytrain, Yval = train_test_split(Xtrain_val,Ytrain_val,test_size=0.25,random_state=numero_di_matricola)

print("Training size: ", Xtrain.shape[0])
print("Validation size: ", Xval.shape[0])
print("Training and validation size:",Xtrain_val.shape[0])
print("Test size:",Xtest.shape[0])


Training size:  1898
Validation size:  633
Training and validation size: 2531
Test size: 633
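
The second call uses test_size=0.25 because the validation set must be 20% of the full dataset, and 0.25 × 0.8 = 0.2. The sketch below reproduces the sizes printed above with plain arithmetic, under the assumption that train_test_split rounds the second (test) part up and gives the remainder to the first part; the helper name split_sizes is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_first, n_second), assuming the second part is rounded up."""
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second


# 3164 is the total number of samples printed when loading the data.
n_train_val, n_test = split_sizes(3164, 0.2)
n_train, n_val = split_sizes(n_train_val, 0.25)

print(n_train, n_val, n_train_val, n_test)
```

Under this assumption the four numbers match the recorded output: 1898, 633, 2531, and 633.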


## TO DO - STANDARDIZE THE DATA

Standardize the data using the preprocessing.StandardScaler from scikit learn.

If V is the name of the variable storing part of the data, the corresponding standardized version should be stored in V_scaled. For example, the scaled version of Xtrain should be stored in Xtrain_scaled

In [5]:
scaler = preprocessing.StandardScaler()
scaler.fit(Xtrain)

Xtrain_scaled = scaler.transform(Xtrain)
Xval_scaled = scaler.transform(Xval)
Xtest_scaled = scaler.transform(Xtest)
Xtrain_val_scaled = scaler.transform(Xtrain_val)
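
The scaler is fit on the training data only, and the same mean and standard deviation are then applied to every other split. A minimal sketch of this idea on a single feature is shown below, assuming the population standard deviation (division by n) as StandardScaler uses; the helper names and the square-footage values are hypothetical.

```python
import statistics


def fit_scaler(column):
    """Compute the mean and population standard deviation of one feature."""
    return statistics.fmean(column), statistics.pstdev(column)


def transform(column, mean, std):
    """Apply (value - mean) / std with statistics computed elsewhere."""
    return [(value - mean) / std for value in column]


# Hypothetical values of a single feature, for illustration only.
train_sqft = [1000.0, 1500.0, 2000.0]
val_sqft = [1500.0, 2500.0]

mean, std = fit_scaler(train_sqft)          # statistics come from training data only
train_scaled = transform(train_sqft, mean, std)
val_scaled = transform(val_sqft, mean, std)  # reuse the training statistics

print(train_scaled, val_scaled)
```

The scaled training feature has mean 0 and standard deviation 1, while the validation feature in general does not, since it is scaled with statistics it did not contribute to.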

# SVM models: learning the best model for each kernel

## TO DO - CHOOSE THE BEST HYPERPARAMETERS FOR LINEAR KERNEL

Consider svm.SVR and linear kernel. Consider the following hyperparameters and their values:
- C: 0.1, 1, 10, 100, 1000

Leave all other input parameters to default. 

Find the best value of the hyperparameters using 5-fold cross validation. Use model_selection.GridSearchCV to perform the cross-validation.

Print the best value of the hyperparameters (they are in the attribute best_params_ from GridSearchCV)

In [6]:
parameter_C = {'C':[0.1, 1, 10, 100, 1000]}
svr = svm.SVR(kernel='linear', epsilon=1000)

gridSearch = model_selection.GridSearchCV(svr,parameter_C, cv=5)
gridSearch.fit(Xtrain_scaled, Ytrain)

print("\nLinear SVM")
print("Best value for hyperparameters: ",gridSearch.best_params_)


Linear SVM
Best value for hyperparameters:  {'C': 1000}
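
With cv=5, each candidate value of C is trained five times, each time holding out one fifth of the training data for scoring, and the candidate with the best mean score is kept. The sketch below shows how the 1898 training samples could be partitioned, assuming contiguous, unshuffled folds where the first folds absorb the remainder; the function name kfold_indices is hypothetical.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# 1898 is the training size printed after the split.
fold_sizes = [len(val_idx) for _, val_idx in kfold_indices(1898)]
print(fold_sizes)
```

Every sample is held out exactly once, so the mean over the five folds uses all of the training data for scoring without ever scoring a model on samples it was fit on.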


## TO DO - LEARN A MODEL WITH LINEAR KERNEL AND BEST CHOICE OF HYPERPARAMETERS

This model will be compared with the best models with other kernels using validation (not cross validation).

DO NOT PASS PARAMETERS BY HARD-CODING THEM IN THE CODE.

Print the training score of the best model.

In [7]:
#read the best value of C from the grid search instead of hard-coding it
C_best_linear = gridSearch.best_params_['C']
best_model_linear = svm.SVR(kernel='linear', C=C_best_linear, epsilon=1000)
best_model_linear.fit(Xtrain_scaled, Ytrain)
score_linear = best_model_linear.score(Xtrain_scaled, Ytrain)

print("Training score: ", score_linear)

Training score:  0.6397700877886906
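
The score reported by svm.SVR is the coefficient of determination R^2, i.e., 1 minus the ratio between the residual sum of squares and the total sum of squares. A minimal sketch of the computation follows; the function name r2_score and the example prices are chosen here for illustration.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical prices, for illustration only.
prices = [300_000.0, 450_000.0, 600_000.0]

perfect = r2_score(prices, prices)                             # exact predictions
baseline = r2_score(prices, [450_000.0] * 3)                   # always predict the mean
poor = r2_score(prices, [600_000.0, 450_000.0, 300_000.0])     # worse than the mean

print(perfect, baseline, poor)
```

A score of 1 means perfect predictions, 0 means the model is no better than always predicting the mean price, and the score becomes negative when the model is worse than that baseline.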


## TO DO - CHOOSE THE BEST HYPERPARAMETERS FOR POLY KERNEL

Consider svm.SVR and polynomial kernel. Consider the following hyperparameters and their values:
- C: 0.1, 1, 10, 100, 1000
- degree: 2, 3, 4

Leave all other input parameters to default. 

Find the best value of the hyperparameters using 5-fold cross validation. Use model_selection.GridSearchCV to perform the cross-validation.

Print the best value of the hyperparameters.

In [8]:
parameters = {'C':[0.1, 1, 10, 100, 1000], 'degree':[2, 3, 4]}
              
svr_poly = svm.SVR(kernel='poly',epsilon=1000)
gridSearch_poly = model_selection.GridSearchCV(svr_poly,parameters,cv=5)
gridSearch_poly.fit(Xtrain_scaled,Ytrain)

print("\nPoly SVM")
print("Best value for hyperparameters: ", gridSearch_poly.best_params_)


Poly SVM
Best value for hyperparameters:  {'C': 1000, 'degree': 3}


## TO DO - LEARN A MODEL WITH POLY KERNEL AND BEST CHOICE OF HYPERPARAMETERS

This model will be compared with the best models with other kernels using validation (not cross validation).

DO NOT PASS PARAMETERS BY HARD-CODING THEM IN THE CODE.

Print the training score of the best model.

In [9]:
C_best_poly = gridSearch_poly.best_params_['C']
degree_best_poly = gridSearch_poly.best_params_['degree']

best_model_poly = svm.SVR(kernel='poly',C=C_best_poly,degree=degree_best_poly,epsilon=1000)
best_model_poly.fit(Xtrain_scaled,Ytrain)
score_poly = best_model_poly.score(Xtrain_scaled,Ytrain)

print("Training score: ", score_poly)

Training score:  0.5612777885208244


## TO DO - CHOOSE THE BEST HYPERPARAMETERS FOR RBF KERNEL

Consider svm.SVR and RBF kernel. Consider the following hyperparameters and their values:
- C: 0.1, 1, 10, 100, 1000
- gamma: 0.01

Leave all other input parameters to default. 

Find the best value of the hyperparameters using 5-fold cross validation. Use model_selection.GridSearchCV to perform the cross-validation.

Print the best value of the hyperparameters.

In [10]:
parameters_rbf = {'C':[0.1, 1, 10, 100, 1000], 'gamma':[0.01]}

svr_rbf = svm.SVR(kernel='rbf',epsilon=1000)
gridSearch_rbf = model_selection.GridSearchCV(svr_rbf,parameters_rbf,cv=5)
gridSearch_rbf.fit(Xtrain_scaled,Ytrain)

print("\nRBF SVM")
print("Best value for hyperparameters: ", gridSearch_rbf.best_params_)


RBF SVM
Best value for hyperparameters:  {'C': 1000, 'gamma': 0.01}


## TO DO - LEARN A MODEL WITH RBF KERNEL AND BEST CHOICE OF HYPERPARAMETERS

This model will be compared with the best models with other kernels using validation (not cross validation).

DO NOT PASS PARAMETERS BY HARD-CODING THEM IN THE CODE.

Print the training score of the best model.

In [11]:
C_best_rbf = gridSearch_rbf.best_params_['C']
gamma_best_rbf = gridSearch_rbf.best_params_['gamma']

best_model_rbf = svm.SVR(kernel='rbf',C=C_best_rbf,gamma=gamma_best_rbf,epsilon=1000)
best_model_rbf.fit(Xtrain_scaled,Ytrain)
score_rbf = best_model_rbf.score(Xtrain_scaled,Ytrain)

print("Training score: ", score_rbf)


Training score:  0.12101760121263583


## TO DO - CHOOSE THE BEST HYPERPARAMETERS FOR SIGMOID KERNEL

Consider svm.SVR and sigmoid kernel. Consider the following hyperparameters and their values:
- C: 0.1, 1, 10, 100, 1000
- gamma: 0.01
- coef0: 0, 1

Leave all other input parameters to default. 

Find the best value of the hyperparameters using 5-fold cross validation. Use model_selection.GridSearchCV to perform the cross-validation.

Print the best value of the hyperparameters.

In [12]:
parameters_sig = {'C':[0.1, 1, 10, 100, 1000], 'gamma':[0.01], 'coef0':[0,1]}

svr_sig = svm.SVR(kernel='sigmoid',epsilon=1000)
gridSearch_sig = model_selection.GridSearchCV(svr_sig,parameters_sig,cv=5)
gridSearch_sig.fit(Xtrain_scaled,Ytrain)

print("\nSigmoid SVM")
print("Best value for hyperparameters: ", gridSearch_sig.best_params_)


Sigmoid SVM
Best value for hyperparameters:  {'C': 1000, 'coef0': 0, 'gamma': 0.01}
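
Apart from C, the hyperparameters searched above (degree, gamma, coef0) enter directly into the kernel function. The sketch below writes out the four kernels in the form documented for scikit-learn's SVMs; the function names and the two example vectors are hypothetical, while the parameter names match the arguments of svm.SVR.

```python
import math


def dot(x, z):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def poly_kernel(x, z, gamma, degree=3, coef0=0.0):
    return (gamma * dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, z, gamma, coef0=0.0):
    return math.tanh(gamma * dot(x, z) + coef0)


# Two hypothetical standardized feature vectors.
x = [1.0, 2.0]
z = [3.0, -1.0]

k_linear = linear_kernel(x, z)
k_poly = poly_kernel(x, z, gamma=0.5, degree=2, coef0=1.0)
k_rbf_same = rbf_kernel(x, x, gamma=0.01)
k_sigmoid = sigmoid_kernel(x, z, gamma=0.01, coef0=0.0)

print(k_linear, k_poly, k_rbf_same, k_sigmoid)
```

Note that the RBF and sigmoid kernels are bounded in absolute value by 1, whereas the linear and polynomial kernels are not.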


## TO DO - LEARN A MODEL WITH SIGMOID KERNEL AND BEST CHOICE OF HYPERPARAMETERS

This model will be compared with the best models with other kernels using validation (not cross validation).

DO NOT PASS PARAMETERS BY HARD-CODING THEM IN THE CODE.

Print the training score of the best model.

In [13]:
C_best_sig = gridSearch_sig.best_params_['C']
gamma_best_sig = gridSearch_sig.best_params_['gamma']
coef0_best_sig = gridSearch_sig.best_params_['coef0']

best_model_sig = svm.SVR(kernel='sigmoid',C=C_best_sig,gamma=gamma_best_sig,coef0=coef0_best_sig,epsilon=1000)
best_model_sig.fit(Xtrain_scaled,Ytrain)
score_sig = best_model_sig.score(Xtrain_scaled,Ytrain)

print("Training score: ", score_sig)

Training score:  0.11515261586785563


## TO DO - USE VALIDATION TO CHOOSE THE BEST MODEL AMONG THE ONES LEARNED FOR THE VARIOUS KERNELS

Use validation to choose the best model among the four ones (one for each kernel) you have learned above.

Print, following exactly the order described here, with 1 value for each line:
- the validation score of SVM with linear kernel (the template below does not include such print)
- the validation score of SVM with polynomial kernel (the template below does not include such print)
- the validation score of SVM with rbf kernel (the template below does not include such print)
- the validation score of SVM with sigmoid kernel (the template below does not include such print)
- the best kernel (e.g., sigmoid) 
- the validation score of the best kernel 

For the first 4 prints, use the format: "kernel validation score: ". For example, for linear kernel "Linear validation score: ", for rbf "rbf validation score: "

In [14]:
print("\nVALIDATION TO CHOOSE SVM KERNEL")

models = [best_model_linear,best_model_poly,best_model_rbf,best_model_sig]
best_score = float('-inf') #the score (R^2) can be negative, so do not start from 0
best_model = None

for model in models:
    score = model.score(Xval_scaled,Yval)
    print(model.get_params()['kernel'], "validation score: ", score )
    if score > best_score:
        best_score = score
        best_model = model

print("Best kernel: ", best_model.get_params()['kernel'])
print("Validation score of best kernel: ", best_score)


VALIDATION TO CHOOSE SVM KERNEL
linear validation score:  0.5905139730757103
poly validation score:  0.3902725685208822
rbf validation score:  0.12225458384488064
sigmoid validation score:  0.11673452026233799
Best kernel:  linear
Validation score of best kernel:  0.5905139730757103
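
The selection step amounts to taking the kernel with the maximum validation score. A compact sketch using the scores printed above is given below; the dictionary and variable names are hypothetical. Taking the maximum directly does not depend on any initial threshold, so it also works when every score is negative.

```python
# Validation scores copied from the output above.
validation_scores = {
    "linear": 0.5905139730757103,
    "poly": 0.3902725685208822,
    "rbf": 0.12225458384488064,
    "sigmoid": 0.11673452026233799,
}

# Pick the key whose value is largest.
best_kernel = max(validation_scores, key=validation_scores.get)
print(best_kernel, validation_scores[best_kernel])

# Hypothetical all-negative scores: the maximum is still well defined.
negative_scores = {"rbf": -0.30, "sigmoid": -0.05}
best_of_negative = max(negative_scores, key=negative_scores.get)
print(best_of_negative)
```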


## TO DO - LEARN THE FINAL MODEL FOR WHICH YOU WANT TO ESTIMATE THE GENERALIZATION ERROR

Learn the final model (i.e., the one you would use to make predictions about future data).

Print the score of the model on the data used to learn it.

In [15]:
print("\nTRAINING SCORE BEST MODEL")
final_model = svm.SVR(**best_model.get_params())
final_model.fit(Xtrain_val_scaled,Ytrain_val)

final_score = final_model.score(Xtrain_val_scaled,Ytrain_val)
print("Score of the best model on the data used to learn it: ", final_score)


TRAINING SCORE BEST MODEL
Score of the best model on the data used to learn it:  0.6276647451725295


## TO DO - PRINT THE ESTIMATE OF THE GENERALIZATION ERROR FOR THE FINAL MODEL

Print the estimate of the generalization "score" for the final model. The generalization "score" is the score computed on the data used to estimate the generalization error.

In [16]:
generalization_score = final_model.score(Xtest_scaled,Ytest)

print("\nGENERALIZATION SCORE BEST MODEL")
print("Estimate of the generalization score for best SVM model: ",generalization_score)


GENERALIZATION SCORE BEST MODEL
Estimate of the generalization score for best SVM model:  0.6271892242280019


## TO DO - ANSWER THE FOLLOWING

Print the training score (score on data used to train the model) and the generalization score (score on data used to estimate the generalization error) of the final SVM model THAT YOU OBTAIN WHEN YOU RUN THE CODE, one per line, printing the smallest one first. NOTE: THE VALUES HERE SHOULD BE HARDCODED

Print your answer (yes/no) to the following question: does the relation (i.e., smaller, larger) between the training score and the generalization score agree with the theory?

Print your motivation for the yes/no answer above, using at most 500 characters.

In [17]:
print("\nANSWER")

train_score = 0.6276647451725295
gen_score = 0.6271892242280019 


#note that you may have to invert the order of the following 2 lines, print the smallest 1 first. THE VALUES HERE SHOULD BE HARD CODED!
print("Generalization score: ",gen_score)
print("Training score: ", train_score)

#the following is a string with your answer
motivation = "Yes.\nThe training score is higher than the generalization score, as the theory predicts: the model is fit on the training data, so it is expected to perform slightly better there than on unseen data.\nThe gap should be as small as possible; here it is about 0.0005.\nA training score much higher than the generalization score would indicate overfitting, which is not the case here."

print(motivation)


ANSWER
Generalization score:  0.6271892242280019
Training score:  0.6276647451725295
Yes.
The training score is higher than the generalization score, as the theory predicts: the model is fit on the training data, so it is expected to perform slightly better there than on unseen data.
The gap should be as small as possible; here it is about 0.0005.
A training score much higher than the generalization score would indicate overfitting, which is not the case here.


## TO DO - LEARN A STANDARD LINEAR MODEL
Learn a standard linear model using scikit learn.

Print the score of the model on the data used to learn it.

Print the generalization "score" of the model.

In [18]:
print("\nLR MODEL")

standard_linear_model = linear_model.LinearRegression()
standard_linear_model.fit(Xtrain_val_scaled,Ytrain_val)

score_linear = standard_linear_model.score(Xtrain_val_scaled,Ytrain_val)
score_linear_gen = standard_linear_model.score(Xtest_scaled,Ytest)

print("Score of LR model on data used to learn it: ", score_linear)
print("Generalization score of LR model: ", score_linear_gen)


LR MODEL
Score of LR model on data used to learn it:  0.7062874991650314
Generalization score of LR model:  0.746048866078462
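
For reference, a standard linear model with squared loss chooses the coefficients that minimize the sum of squared residuals. With a single feature this has a simple closed form, sketched below; the function name fit_simple_ols and the toy data are hypothetical, and the notebook's model uses all features rather than one.

```python
def fit_simple_ols(x, y):
    """Least-squares slope and intercept for a single feature."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data generated from y = 2x + 1, for illustration only.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_simple_ols(x, y)
print(slope, intercept)
```

Unlike the epsilon-insensitive loss used by SVR, the squared loss penalizes every residual, however small, and grows quadratically with the size of the error.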
