In [None]:
from IPython import display
from IPython.display import Image

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  

In [None]:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Degree in Data Science and Engineering, group 96
## Machine Learning 2
### Fall 2022

&nbsp;
&nbsp;
&nbsp;
# Lab 2. Support Vector Machines for Classification

&nbsp;
&nbsp;
&nbsp;

**Emilio Parrado Hernández**

Dept. of Signal Processing and Communications

&nbsp;
&nbsp;
&nbsp;




<img src='http://www.tsc.uc3m.es/~emipar/BBVA/INTRO/img/logo_uc3m_foot.jpg' width=400 />

# Loading the dataset

In this assignment you will work with an adaptation of the UCI repository dataset [segmentation](https://archive.ics.uci.edu/ml/datasets/image+segmentation). It is a binary classification task (the labels are $+1$ and $-1$).

The following cell loads the data and constructs the corresponding training and testing sets.

In [None]:
train_data = np.loadtxt('segmentation.data', delimiter=',')
X_train = train_data[:,1:]
y_train = train_data[:,0]
test_data = np.loadtxt('segmentation.test', delimiter=',')
X_test = test_data[:,1:]
y_test = test_data[:,0]

# 1.- Support Vector Machines with default parameters

As a first approach, train three SVMs, each instantiated with one of the following configurations:
- linear kernel, $C$=1
- polynomial kernel, $C$ = 1, `degree` =2
- RBF kernel, $C$=1, $\gamma$ = 1

**Complete the code in the cell below to evaluate these three classifiers on the test set, then run the following cell to display the results. Comment on the accuracy achieved by each method and discuss your results.**

Note: In this part, do not apply any preprocessing (such as scaling) to the data.

In [None]:
# Don't forget to import needed classes!

# Set values for parameters

# Instantiate models

# Fit models with training data

# Evaluate models with test data

test_error_default_linear_svc = # accuracy in the test set of the linear svc
test_error_default_poly_svc = # accuracy in the test set of the svc with polynomial kernel
test_error_default_rbf_svc = # accuracy in the test set of the svc with RBF kernel

In [None]:
# Just run this cell, do not edit it
print("Results of default models")
print("")
model_name = ['Linear SVC', 'Polynomial SVC', 'RBF SVC']
results_lists = [test_error_default_linear_svc, test_error_default_poly_svc, test_error_default_rbf_svc]
results = []
for name, accuracy in zip(model_name, results_lists):
    results.append({'method':name,'acc':np.round(accuracy,3)})
pd.DataFrame(results)

# 2.- Dependence on the parameters
## 2.1.- Linear SVC: $C$

Study how the SVC with a linear kernel depends on the value of $C$. To this end, implement a loop that explores the given range of values for this parameter and plot the test accuracy vs. the value of $C$.




In [None]:
v_C = [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000, 1e4]
lsvc_acc = np.empty(len(v_C)) # to store results

# implement for loop that runs for every value in the range of v_C
# at each iteration:
#   instantiate  the corresponding model
#   fit it with training data
#   evaluate the model with the test set and store it in lsvc_acc


In [None]:
# do not edit this cell, just run it
plt.plot(np.array(v_C), lsvc_acc)
plt.ylabel('accuracy')
plt.xlabel('C')
plt.grid()
plt.xscale('log')

### Your comments:
**What is the best value for parameter $C$?**

**What is the performance of the classifier for that parameter?**

**Comment on the sensitivity of the accuracy with respect to $C$.**

## 2.2.- SVC with a polynomial kernel: $C$ and $d$

In this case you have to study the combination of two parameters. For this purpose scikit-learn implements a grid search with cross validation (`GridSearchCV`).

### Grids of hyperparameters

This method consists of forming a **grid** with as many dimensions as hyperparameters to optimize. The size of each dimension of the grid equals the number of values in the range of the corresponding hyperparameter. Notice that this method explores **discrete** ranges for each hyperparameter.

For the SVC with polynomial kernel we will initially explore the following ranges:

- `degree` $ \in [2, 3, 5]$
- `C` $ \in [0.1, 10, 100]$

Notice these ranges determine a $3\times 3$ grid: 9 different combinations of pairs (`degree`, `C`)  in the grid. 

In models that depend on a larger number of hyperparameters one has to be careful with the granularity of the ranges as the combinatorial explosion of the size of the grid can be hard to manage.
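As a small illustration of how the grid size grows, the sketch below enumerates the nodes of the $3\times 3$ grid described above. It assumes scikit-learn's `ParameterGrid` helper is available; the names `param_grid` and `combinations` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Ranges taken from the text above
param_grid = {"degree": [2, 3, 5], "C": [0.1, 10, 100]}

# ParameterGrid enumerates every (degree, C) combination in the grid
combinations = list(ParameterGrid(param_grid))

print(len(combinations))  # 3 x 3 = 9 nodes
for params in combinations:
    print(params)
```

Adding a third hyperparameter with 5 candidate values would already give 45 nodes, each of which has to be cross validated.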

### Cross validation 

Cross validation is a commonly used procedure in machine learning to simulate the effect of training a model on one set of data and evaluating its generalization capability as its performance on a **separate dataset**.

The cross validation process involves the following steps:

- Randomly partition the training dataset into $N$ disjoint subsets of similar sizes. Each of these subsets is called a **fold** in machine learning jargon. Hence the term **N-fold cross validation**.

- Let us suppose we have chosen $N=3$ folds. This means the training data has been split into three subsets: $(X_1, Y_1)$, $(X_2, Y_2)$ and $(X_3, Y_3)$.

- Create an instance of the model with the corresponding hyperparameters. Cross validation then runs the following loop:

    For $n=1,2,\dots,N$ iterations:  
    1. Choose $(X_n,Y_n)$ as **validation set** for iteration $n$
    2. Prepare a **training set** for iteration $n$ joining the rest of the subsets (excluding the validation set)
    3. Fit the model instance with the training set of step 2
    4. Evaluate  the model instance (method `score`) with the validation set of step 1
    5. Keep the *score* achieved in the $n$-th iteration.

- Once the loop is finished, we have $N$ scores, each corresponding to the evaluation of the model fitted in each iteration with the corresponding validation set.
- Estimate the **real score** that an instance of the model fitted with all the training data would achieve on a separate dataset by computing the **mean** and **standard deviation** of the $N$ validation scores.

Typical values for the number of folds include $N\in \{3, 5, 10\}$
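The loop described above can be sketched by hand as follows. This is a minimal illustration under stated assumptions, not the required solution: it uses synthetic data from `make_classification` instead of the segmentation dataset, a linear SVC with $C=1$ chosen arbitrarily, and scikit-learn's `KFold` to build the partitions. A fresh model instance is created at each iteration, which is equivalent to refitting the same instance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the segmentation dataset)
X, y = make_classification(n_samples=150, n_features=5, random_state=0)

n_folds = 3
kfold = KFold(n_splits=n_folds, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in kfold.split(X):
    # Steps 1-2: val_idx is the validation fold, train_idx joins the other folds
    model = SVC(kernel="linear", C=1.0)
    # Step 3: fit with the training part
    model.fit(X[train_idx], y[train_idx])
    # Steps 4-5: evaluate with the validation fold and keep the score
    scores.append(model.score(X[val_idx], y[val_idx]))

scores = np.array(scores)
print("mean: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```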


###  Cross validation to explore the grid

The grid is explored by a loop that visits all its nodes and runs a **cross validation** at each one to estimate the test performance that the model would yield if it were fitted using the hyperparameter values that correspond to that node.

Once all the nodes of the grid have been cross validated, the procedure outputs the combination of hyperparameters corresponding to the node with the best cross validation performance. 
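A hand-written version of this exploration might look like the sketch below. It assumes synthetic data from `make_classification` (not the segmentation dataset) and uses scikit-learn's `cross_val_score` to run the cross validation at each node; the variable names are illustrative only.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the segmentation dataset)
X, y = make_classification(n_samples=150, n_features=5, random_state=0)

degrees = [2, 3, 5]
Cs = [0.1, 10, 100]

best_score, best_node = -1.0, None
for degree, C in product(degrees, Cs):
    # One node of the grid: a polynomial SVC with this (degree, C) pair
    model = SVC(kernel="poly", degree=degree, C=C)
    # Mean validation accuracy over 3 folds for this node
    mean_score = cross_val_score(model, X, y, cv=3).mean()
    if mean_score > best_score:
        best_score, best_node = mean_score, (degree, C)

print("best (degree, C):", best_node)
print("best cross-validation score:", round(best_score, 3))
```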

###  Grid search in sklearn

sklearn provides a class that implements this algorithm for exploring a grid of hyperparameters with cross validation: [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV)
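As a rough orientation on how the class is typically used, here is a minimal sketch on synthetic data. The dataset, the RBF kernel, the parameter ranges and `cv=3` are assumptions made for illustration, not the configuration requested in the exercises below.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the segmentation dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# The keys of the grid are the names of the estimator's constructor arguments
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X_tr, y_tr)  # cross validates every node, then refits the best one

print(search.best_params_)                  # node with the best mean CV score
print(round(search.best_score_, 3))         # that mean CV score
print(round(search.score(X_te, y_te), 3))   # accuracy of the refitted model on held-out data
```

With the default `refit=True`, the fitted object behaves like the best estimator retrained on all the data passed to `fit`, so `score` and `predict` can be called on it directly.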


### 2.2.1- Initial guess for the values of $C$ and $d$

Use the following ranges to get initial values for $C$ and $d$:
- `degree` $ \in [2, 3, 5]$
- `C` $ \in [0.1, 10, 100]$



In [None]:
v_C_1 = [0.1, 10, 100]  # matches the range for C stated in the text above
v_d_1 = [2,3,5]

# Check GridSearchCV documentation to construct a GridSearchCV 
# object with a SVC with polynomial kernel as estimator and a parameter grid 
# that contains the two ranges given above

grid_psvc_1 = # Your code to create the GridSearchCV

# fit the GridSearchCV object

# evaluate with the test set. Check in the GridSearchCV documentation how to perform this
test_error_grid1_psvc = # the accuracy

In [None]:
# do not edit this cell
print("Best parameters")
print(grid_psvc_1.best_params_)
print("Best cross-validation score")
print(np.round(grid_psvc_1.best_score_,3))
print("Test Set Score with best parameters")
print(np.round(test_error_grid1_psvc,3))

psvc_initial_best_degree = grid_psvc_1.best_params_['degree']
psvc_initial_best_C = grid_psvc_1.best_params_['C']

### Your comments:
**Were the default values a good choice? Compare the test set accuracy that you get with the hyperparameters found by `GridSearchCV` with the test set accuracy obtained with the default values.**

**Compare the test set accuracy that you get with the hyperparameters found by `GridSearchCV` with the value of the attribute `best_score_` of the already fitted GridSearchCV object, and discuss the generalization capability of the model.**

### 2.2.2- Study of the dependence on the value of $C$ of the accuracy of the SVC endowed with a polynomial kernel.

Now run a loop that explores the given range of values for parameter $C$ in an SVC with a polynomial kernel, using the best value for $d$ found in 2.2.1 (`psvc_initial_best_degree`), and plot the test set accuracy of the classifier vs. the value of $C$.



In [None]:
v_C_2 = [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000, 1e4, 1e5, 1e6, 1e7, 1e8]
psvc_acc_C = np.empty(len(v_C_2)) # to store results, one entry per value in v_C_2
# your code with the for loop
# in each iteration the SVC must have the corresponding value of C and the degree equal to psvc_initial_best_degree

In [None]:
plt.plot(np.array(v_C_2), psvc_acc_C)
plt.xscale('log')
plt.grid()
plt.xlabel('C')
plt.ylabel('accuracy')


### Your comments:

**What is the best value of $C$ in this situation? What is the performance of the classifier for that value of $C$?**

**Discuss the sensitivity of the accuracy with respect to the value of $C$.**

### 2.2.3- Study of the dependence on the value of $d$ of the accuracy of the SVC endowed with a polynomial kernel.

Now run a loop that explores the given range of values for parameter $d$ in a SVC with a polynomial kernel with the best value for $C$ found in 2.2.1. (`psvc_initial_best_C`) and plot the test set accuracy of the classifier vs. the value of $d$.

**Discuss the sensitivity of the accuracy with respect to the value of $d$.**

In [None]:
v_d_2 = [2,3,4,5,6,7,8,9,10]
psvc_acc_d = np.empty(len(v_d_2))

# your code with the for loop
# in each iteration the SVC must have the corresponding value of degree and the C equal to psvc_initial_best_C

In [None]:
plt.plot(np.array(v_d_2), psvc_acc_d)
plt.grid()
plt.xlabel('d')
plt.ylabel('accuracy')

### Your comments:

**What is the best value of $d$ in this situation? What is the performance of the classifier for that value of $d$?**

**Discuss the sensitivity of the accuracy with respect to the value of $d$.**

### 2.2.4- GridSearch joint tuning of the values of $d$ and $C$ for the SVC endowed with a polynomial kernel.

The values of $d$ and $C$ interact in the fitting of the SVC. For this reason, a greedy sequential tuning can be suboptimal, and one usually conducts a grid search over both parameters.

### Your comments:

**Building on the results of sections 2.2.2 and 2.2.3, design a grid for $d$ and $C$ with a maximum size of 25 nodes (that is, the product of the lengths of the ranges for $d$ and $C$ must be less than or equal to 25). Justify the selected grid configuration.**



In [None]:
v_d_3 = # your range
v_C_3 = # your range

# construct the parameters for the grid
grid_psvc_3 = # your code

# instantiate the GridSearchCV object

# fit the GridSearchCV object

test_error_grid3_psvc = # accuracy in the test set

In [None]:
print("Best parameters")
print(grid_psvc_3.best_params_)
print("Best cross-validation score")
print(np.round(grid_psvc_3.best_score_,3))
print("Test Set Score with best parameters")
print(np.round(test_error_grid3_psvc,3))

### Your comments:

**Compare the test set accuracy that you obtain with the best hyperparameters found by the grid search with the test set accuracies obtained in sections 2.2.2 and 2.2.3.**

**Compare the test set accuracy that you obtain with the best hyperparameters found by the grid search with the value of the attribute `best_score_` of the already fitted GridSearchCV object, and discuss the generalization capability of the model.**

## 2.3.- SVC with a RBF kernel: $C$ and $\gamma$

### 2.3.1- Initial guess for the values of $C$ and $\gamma$

Use the following ranges to get initial values for $C$ and $\gamma$:
- `gamma` $ \in [.001, .1, 1]$
- `C` $ \in [0.1, 10, 100]$



In [None]:
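v_C_4 = [0.1, 10, 100]   # matches the range for C stated in the text above
v_g_4 = [1e-3, 1e-1, 1]  # matches the range for gamma stated in the text above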

# Construct the GridSearchCV object with a SVC with RBF kernel as estimator and 
# the parameter grid defined above
grid_rsvc_4 = # your code

# fit the GridSearchCV object

test_error_grid4_rsvc = # accuracy of the classifier with the best set of parameters in the test set

In [None]:
print("Best parameters")
print(grid_rsvc_4.best_params_)
print("Best cross-validation score")
print(np.round(grid_rsvc_4.best_score_,3))
print("Test Set Score with best parameters")
print(np.round(test_error_grid4_rsvc,3))

rsvc_initial_best_gamma = grid_rsvc_4.best_params_['gamma']
rsvc_initial_best_C = grid_rsvc_4.best_params_['C']


### Your comments:
**Were the default values a good choice? Compare the test set accuracy that you get with the hyperparameters found by `GridSearchCV` with the test set accuracy obtained with the default values.**

**Compare the test set accuracy that you get with the hyperparameters found by `GridSearchCV` with the value of the attribute `best_score_` of the already fitted GridSearchCV object, and discuss the generalization capability of the model.**

### 2.3.2- Study of the dependence on the value of $C$ of the accuracy of the SVC endowed with a RBF kernel.

Now run a loop that explores about 10 values of parameter $C$ in an SVC with an RBF kernel, using the best value for $\gamma$ found in 2.3.1 (`rsvc_initial_best_gamma`), and plot the test set accuracy of the classifier vs. the value of $C$.

**Discuss the sensitivity of the accuracy with respect to the value of $C$.**

In [None]:
v_C_5 = # your range
rsvc_acc_C = np.empty(len(v_C_5)) # to store results, one entry per value in v_C_5

# your code for the loop

In [None]:
plt.plot(np.array(v_C_5), rsvc_acc_C)
plt.xscale('log')
plt.xlabel('C')
plt.ylabel('accuracy')
plt.grid()


### Your comments:

**What is the best value of $C$ in this situation? What is the performance of the classifier for that value of $C$?**

**Discuss the sensitivity of the accuracy with respect to the value of $C$.**

### 2.3.3- Study of the dependence on the value of $\gamma$ of the accuracy of the SVC endowed with a RBF kernel.

Now run a loop that explores the given range of values for parameter $\gamma$ in an SVC with an RBF kernel, using the best value for $C$ found in 2.3.1 (`rsvc_initial_best_C`), and plot the test set accuracy of the classifier vs. the value of $\gamma$.



In [None]:
v_g_5 = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10]
rsvc_acc_g = np.empty(len(v_g_5)) #to store the results

# Your code for the loop

In [None]:
plt.plot(np.array(v_g_5), rsvc_acc_g)
plt.grid()
plt.xscale('log')
plt.xlabel('gamma')
plt.ylabel('accuracy')


### Your comments:

**What is the best value of $\gamma$ in this situation? What is the performance of the classifier for that value of $\gamma$?**

**Discuss the sensitivity of the accuracy with respect to the value of $\gamma$.**

### 2.3.4- GridSearch joint tuning of the values of $\gamma$ and $C$ for the SVC endowed with an RBF kernel.

The values of $\gamma$ and $C$ interact in the fitting of the SVC. For this reason, a greedy sequential tuning can be suboptimal, and one usually conducts a grid search over both parameters.

**Building on the results of sections 2.3.2 and 2.3.3, design a grid for $\gamma$ and $C$ with a maximum size of 25 nodes (that is, the product of the lengths of the ranges for $\gamma$ and $C$ must be less than or equal to 25). Justify the selected grid configuration.**



In [None]:
v_g_6 = # your range
v_C_6 = # your range

# Create GridSearchCV object
grid_rsvc_6 = # your code

# fit GridSearchCV object

test_error_grid6_rsvc = # evaluate with the test set

In [None]:
print("Best parameters")
print(grid_rsvc_6.best_params_)
print("Best cross-validation score")
print(np.round(grid_rsvc_6.best_score_,3))
print("Test Set Score with best parameters")
print(np.round(test_error_grid6_rsvc,3))

### Your comments:

**Compare the test set accuracy that you obtain with the best hyperparameters found by the grid search with the test set accuracies obtained in sections 2.3.2 and 2.3.3.**

**Compare the test set accuracy that you obtain with the best hyperparameters found by the grid search with the value of the attribute `best_score_` of the already fitted GridSearchCV object, and discuss the generalization capability of the model.**

# 3.- Scaling the data

In this section you will evaluate the impact of scaling the data. Remember that some machine learning models benefit from scaling the input features beforehand.
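To make the effect of scaling concrete, the sketch below standardizes two synthetic features with very different ranges. The data are random numbers generated for illustration only; the point is that `StandardScaler` learns the mean and standard deviation from the training set and applies the same transformation to the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic features with very different scales (assumption: illustrative data)
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=[0.0, 500.0], scale=[1.0, 100.0], size=(100, 2))
X_te = rng.normal(loc=[0.0, 500.0], scale=[1.0, 100.0], size=(20, 2))

# Learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)  # same transformation, no refitting

print(X_tr_s.mean(axis=0).round(3))  # approximately 0 for every feature
print(X_tr_s.std(axis=0).round(3))   # approximately 1 for every feature
```

Kernels such as the RBF depend on distances between samples, so a feature with a much larger range can otherwise dominate the kernel value.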

## 3.1.- Scaling and linear SVC

Repeat the loop of section 2.1, but replace the SVC with a `Pipeline` object that has a `StandardScaler` as its first stage and a linear SVC as its second stage.

In [None]:
s_lsvc_acc = np.empty(len(v_C)) # to store data inside the loop
# YOUR LOOP


In [None]:
# do not edit this cell, just run it
plt.plot(np.array(v_C), s_lsvc_acc,label='scaled')
plt.plot(np.array(v_C), lsvc_acc, label='not-scaled')
plt.xscale('log')
plt.ylabel('accuracy')
plt.xlabel('C')
plt.grid()
plt.legend()


### Your comments:

**Comment on the sensitivity of the accuracy with respect to $C$ if the data is previously scaled.**

## 3.2.- Scaling an SVC with polynomial kernel

Repeat the analysis of section 2.2, but scaling the data. Notice that you can combine the `Pipeline` and the `GridSearchCV` in steps 3.2.1 and 3.2.4 to get cleaner code.
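One detail worth knowing when combining both objects is how the grid refers to the parameters of a pipeline stage: scikit-learn uses the pattern `<step name>__<parameter>`. The sketch below shows this on synthetic data; the step names `scaler` and `svc`, the ranges and `cv=3` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the segmentation dataset)
X, y = make_classification(n_samples=150, n_features=5, random_state=0)

# Step names are arbitrary labels; they become prefixes in the parameter grid
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="poly"))])

# "svc__degree" addresses the `degree` argument of the step named "svc"
param_grid = {"svc__degree": [2, 3], "svc__C": [0.1, 10]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
```

Because the scaler is part of the estimator, it is refitted on the training folds of each cross-validation split, so the validation fold never influences the scaling statistics.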

### 3.2.1- Initial guess for parameters

In [None]:
# Create the pipeline object

# Create initial ranges for parameters of the GridSearchCV object

# Create a GridSearchCV but with the Pipeline object as estimator
grid_sc_psvc = # your code 

# Fit the GridSearchCV object

test_error_grid7_scaled_lsvc = # evaluate the classifier with the test set

In [None]:
print("Best parameters")
print(grid_sc_psvc.best_params_)
print("Best cross-validation score")
print(np.round(grid_sc_psvc.best_score_,3))
print("Test Set Score with best parameters")
print(np.round(test_error_grid7_scaled_lsvc,3))

sc_psvc_initial_best_degree = grid_sc_psvc.best_params_#complete
sc_psvc_initial_best_C = grid_sc_psvc.best_params_#complete

### Your comments:

### 3.2.2- Study of the dependence on the value of $C$ of the accuracy of the SVC endowed with a polynomial kernel with the data scaled

In [None]:
v_C_2 = [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000, 1e4, 1e5, 1e6, 1e7, 1e8]
grid_sc_psvc_C = np.empty(len(v_C_2)) # to store results

# your code with the for loop

In [None]:
plt.plot(np.array(v_C_2), grid_sc_psvc_C,label='scaled')
plt.plot(np.array(v_C_2), psvc_acc_C, label='not-scaled')
plt.xscale('log')
plt.xlabel('C')
plt.ylabel('accuracy')
plt.grid()

plt.legend()

### Your comments:

### 3.2.3- Study of the dependence on the value of $d$ of the accuracy of the SVC endowed with a polynomial kernel with the data scaled

In [None]:
v_d_2 = [2,3,4,5,6,7,8,9,10]
grid_sc_psvc_d = np.empty(len(v_d_2))

# your code for the for loop

In [None]:
plt.plot(np.array(v_d_2), grid_sc_psvc_d,label='scaled')
plt.plot(np.array(v_d_2), psvc_acc_d, label='not-scaled')
plt.grid()
plt.xlabel('degree')
plt.ylabel('accuracy')
plt.legend()


### Your comments:

### 3.2.4- GridSearch joint tuning of the values of $d$ and $C$ for the SVC endowed with a polynomial kernel and scaled data

The values of $d$ and $C$ interact in the fitting of the SVC. For this reason, a greedy sequential tuning can be suboptimal, and one usually conducts a grid search over both parameters.

### Your comments:

**Building on the results of sections 3.2.2 and 3.2.3, design a grid for $d$ and $C$ with a maximum size of 25 nodes (that is, the product of the lengths of the ranges for $d$ and $C$ must be less than or equal to 25). Justify the selected grid configuration.**


In [None]:
v_d_7 = # your range
v_C_7 = # your range

# create parameters for grid

# create GridSearchCV object with a pipeline as estimator
grid_sc_psvc_2 = # your code

test_error_scaled_psvc = # evaluation with test set data

In [None]:
print("Best parameters")
print(grid_sc_psvc_2.best_params_)
print("Best cross-validation score")
print(np.round(grid_sc_psvc_2.best_score_,3))
print("Test Set Score with best parameters")
print(np.round(test_error_scaled_psvc,3))

### Your comments:

## 3.3.- Scaling an SVC with RBF kernel

Repeat the analysis of section 2.3, but scaling the data. Notice that you can combine the `Pipeline` and the `GridSearchCV` in steps 3.3.1 and 3.3.4 to get cleaner code.

### 3.3.1.- Initial guesses for $C$ and $\gamma$
Repeat the study of section 2.3.1, but replace the classifier with a pipeline that combines scaling of the data with the classification.

In [None]:
# Create the pipeline object

# Create initial ranges for parameters of the GridSearchCV object

# Create a GridSearchCV but with the Pipeline object as estimator
grid_sc_rsvc = # your code 

# Fit the GridSearchCV object

test_error_grid9_scaled_rsvc = # evaluate the classifier with the test set

In [None]:
print("Best parameters")
print(grid_sc_rsvc.best_params_)
print("Best cross-validation score")
print(np.round(grid_sc_rsvc.best_score_,3))
print("Test Set Score with best parameters")
print(np.round(test_error_grid9_scaled_rsvc,3))

sc_rsvc_initial_best_gamma = grid_sc_rsvc.best_params_#complete
sc_rsvc_initial_best_C = grid_sc_rsvc.best_params_#complete

### Your comments:

### 3.3.2- Study of the dependence on the value of $C$ of the accuracy of the SVC endowed with an RBF kernel with the data scaled

In [None]:
v_C_2 = [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000, 1e4, 1e5, 1e6, 1e7, 1e8]
pipe_rsvc_acc_C = np.empty(len(v_C_2)) # to store results

# for loop

In [None]:
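plt.plot(np.array(v_C_2), pipe_rsvc_acc_C, label='scaled')
# rsvc_acc_C was computed over the range v_C_5 in section 2.3.2, so plot it against that range
plt.plot(np.array(v_C_5), rsvc_acc_C, label='not-scaled')
plt.xscale('log')
plt.xlabel('C')
plt.ylabel('accuracy')
plt.grid()
plt.legend()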


### Your comments:

### 3.3.3- Study of the dependence on the value of $\gamma$ of the accuracy of the SVC endowed with an RBF kernel with the data scaled

In [None]:
v_g_5 = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10]
pipe_sc_rsvc_g = np.empty(len(v_g_5))

# your code for the for loop

In [None]:
plt.plot(np.array(v_g_5), pipe_sc_rsvc_g, label='scaled')
plt.plot(np.array(v_g_5), rsvc_acc_g, label='not-scaled')
plt.grid()
plt.xscale('log')
plt.xlabel('gamma')
plt.ylabel('accuracy')
plt.legend()


### Your comments:

### 3.3.4- GridSearch joint tuning of the values of $\gamma$ and $C$ for the SVC endowed with an RBF kernel and scaled data
The values of $\gamma$ and $C$ interact in the fitting of the SVC. For this reason, a greedy sequential tuning can be suboptimal, and one usually conducts a grid search over both parameters.

**Building on the results of sections 3.3.2 and 3.3.3, design a grid for $\gamma$ and $C$ with a maximum size of 25 nodes (that is, the product of the lengths of the ranges for $\gamma$ and $C$ must be less than or equal to 25). Justify the selected grid configuration.**

In [None]:
v_g_8 = # your range
v_C_8 = # your range


# create parameters for grid

# create GridSearchCV object with a pipeline as estimator
grid_sc_rsvc_2 = # your code

test_error_grid_scaled_rsvc = # evaluate with the test set

In [None]:
print("Best parameters")
print(grid_sc_rsvc_2.best_params_)
print("Best cross-validation score")
print(np.round(grid_sc_rsvc_2.best_score_,3))
print("Test Set Score with best parameters")
print(np.round(test_error_grid_scaled_rsvc,3))

### Your comments:

# Items for discussion
- Is there any kernel that performs significantly better than the others?
- Is it reasonably easy to find wide ranges of the hyperparameters over which the models perform well?
- Does scaling significantly change the performance of the models?