# SLU18 - Support Vector Machines (SVM): Exercise notebook

New tools in this unit

* [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
* [SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html)

In [None]:
import pandas as pd
import numpy as np
from hashlib import sha256
import json

import sklearn
# These will be needed to prepare the dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Seed for reproducibility
np.random.seed(42)

**Let the Music Play**

The year is 2020 and due to the Covid-19 pandemic you spend a lot more time inside than you used to. You realize that one of the few things that people can still do (almost) the same way as before is listen to music. Thus, you decide to use your data skills to surprise one of your friends. To do so, you use data about your friend's listening habits and try to make a classifier that predicts whether your friend will like a song based on some attributes. 

In [None]:
songs_df = pd.read_csv("data/song_data.csv", index_col="id")
print(songs_df.shape)
songs_df.head()

The data contains information about which songs your friend liked or not in the `target` column. It also contains several attributes about each song that you suspect will be useful to infer your friend's musical taste. In this case, you decide to drop the song title and artist as you are more interested in the musical attributes. 

In [None]:
songs_df = songs_df.drop(columns=["song_title", "artist"])

In [None]:
songs_df.head()

In [None]:
songs_df.target.value_counts(normalize=True)

Since the target variable is binary, you are faced with a binary classification problem. You remember those really cool support vector machines and decide to give them a shot.

To properly train and evaluate your models, you split your dataset into a train set and a test set and scale the data.

In [None]:
def get_X_y_train_test(df, target_col):
    """
    Convert the input dataframe df into the
    train and test features and targets
    """
    X = df.drop(target_col, axis=1)
    y = df[target_col]
    # train test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # SVMs are not scale invariant, so you scale your data beforehand
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    print("X_train of shape ", X_train.shape)
    print("y_train of shape ", y_train.shape)
    print("X_test of shape  ", X_test.shape)
    print("y_test of shape  ", y_test.shape)
    
    return X_train, X_test, y_train, y_test 

In [None]:
X_train, X_test, y_train, y_test = get_X_y_train_test(songs_df, target_col="target")

## Exercise 1: Support Vector Classifier

### Exercise 1.1: Train the classifier 
Use a support vector classifier (an `SVC` with a linear kernel) to predict which songs your friend will like. Keep all the other arguments at their default values. Instantiate the classifier and fit it to the training data. Calculate the score of the classifier.
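
As a rough illustration of the instantiate, fit, and score workflow, the sketch below trains a linear-kernel `SVC` on a small synthetic dataset. The two Gaussian blobs and every `toy_*` name are made up for this example and are not part of the song data, so the printed accuracy says nothing about the result expected in the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated Gaussian blobs in 2D
rng = np.random.default_rng(0)
toy_X = np.vstack([
    rng.normal(loc=-2.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=2.0, scale=1.0, size=(100, 2)),
])
toy_y = np.array([0] * 100 + [1] * 100)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42
)

# Scale using statistics from the training split only
toy_scaler = StandardScaler()
toy_X_train = toy_scaler.fit_transform(toy_X_train)
toy_X_test = toy_scaler.transform(toy_X_test)

# Linear kernel, everything else left at its default value
toy_svc = SVC(kernel="linear")
toy_svc.fit(toy_X_train, toy_y_train)

# For classifiers, score() returns the mean accuracy
toy_score = toy_svc.score(toy_X_test, toy_y_test)
print(f"Toy test accuracy: {toy_score:.2f}")
```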

In [None]:
# svc_linear = ...
# svc_linear_score = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(svc_linear, sklearn.svm.SVC), 'The classifier is not an SVC.'
assert sha256(svc_linear.kernel.encode()).hexdigest() == '7f2fe580edb35154041fa3d4b41dd6d3adaef0c85d2ff6309f1d4b520eeecda3', 'The kernel is not correct.'
np.testing.assert_almost_equal(svc_linear_score, 0.65, decimal = 2, err_msg='The score is not correct.')

### Exercise 1.2: Number of support vectors
Obtain the number of support vectors for each class. You will need to check the documentation.
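
For orientation, a fitted scikit-learn `SVC` reports how many support vectors belong to each class through its `n_support_` attribute, ordered like `classes_`. The sketch below shows this on made-up data; the blobs and the `toy_*` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two overlapping Gaussian blobs
rng = np.random.default_rng(0)
toy_X = np.vstack([
    rng.normal(loc=-1.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=1.0, scale=1.0, size=(50, 2)),
])
toy_y = np.array([0] * 50 + [1] * 50)

toy_svc = SVC(kernel="linear").fit(toy_X, toy_y)

# n_support_ holds one count per class, in the same order as classes_
for label, count in zip(toy_svc.classes_, toy_svc.n_support_):
    print(f"class {label}: {count} support vectors")
```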

In [None]:
# sv_class_1 = ...
# sv_class_2 = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert sha256(str(sv_class_1).encode()).hexdigest() == 'fc47b34e36f4032acd1ca2192a7b9b097011ccbfe3d8e27b04bb6999e000578d', 'Number of support vectors of class 1 is not correct.'
assert sha256(str(sv_class_2).encode()).hexdigest() == '524148f24802f8c68974c2e1ecc8b8f47d0d60b7a0d1948951c050a25b5a8e59', 'Number of support vectors of class 2 is not correct.'

### Exercise 1.3: Support vectors
Obtain the array of the support vectors for the classifier from exercise 1.1.
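
A minimal sketch of what that array looks like, again on hypothetical toy data: `support_vectors_` has one row per support vector and one column per feature, and `support_` gives the row indices of those vectors in the training set.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two overlapping Gaussian blobs with 3 features
rng = np.random.default_rng(1)
toy_X = np.vstack([
    rng.normal(loc=-1.0, scale=1.0, size=(60, 3)),
    rng.normal(loc=1.0, scale=1.0, size=(60, 3)),
])
toy_y = np.array([0] * 60 + [1] * 60)

toy_svc = SVC(kernel="linear").fit(toy_X, toy_y)

# Shape is (number of support vectors, number of features)
toy_vectors = toy_svc.support_vectors_
print(toy_vectors.shape)

# support_ indexes the training rows that became support vectors
same_rows = np.allclose(toy_vectors, toy_X[toy_svc.support_])
print(same_rows)
```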

In [None]:
# s_vectors = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert s_vectors.shape == (1242,13), 'The shape of the array is not correct.'
np.testing.assert_almost_equal(s_vectors.mean(), -0.009, decimal = 3, err_msg='The support vectors are not correct.')
np.testing.assert_almost_equal(s_vectors.var(), 0.93, decimal = 2, err_msg='The support vectors are not correct.')
np.testing.assert_almost_equal(s_vectors.max(), 9.03, decimal = 2, err_msg='The support vectors are not correct.')
np.testing.assert_almost_equal(s_vectors.min(), -6.958, decimal = 3, err_msg='The support vectors are not correct.')

### Exercise 1.4: Tuning parameter
Create an SVC classifier with a linear kernel and set the tuning parameter (`C` in scikit-learn) to 100. Calculate the score of the classifier.
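
To get a feel for the tuning parameter, the sketch below fits linear `SVC` models with a few values of `C` on hypothetical overlapping blobs and counts the support vectors. Smaller `C` tolerates more margin violations, which usually (though not always) leaves more points as support vectors. The data and the `toy_*` names are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two overlapping Gaussian blobs
rng = np.random.default_rng(0)
toy_X = np.vstack([
    rng.normal(loc=-1.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=1.0, scale=1.0, size=(100, 2)),
])
toy_y = np.array([0] * 100 + [1] * 100)

# Total number of support vectors for each value of C
toy_sv_counts = {}
for c_value in (0.01, 1.0, 100.0):
    toy_model = SVC(kernel="linear", C=c_value).fit(toy_X, toy_y)
    toy_sv_counts[c_value] = int(toy_model.n_support_.sum())
    print(f"C={c_value}: {toy_sv_counts[c_value]} support vectors")
```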

In [None]:
# svc_linear_100 = ...
# svc_linear_100_score = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(svc_linear_100, sklearn.svm.SVC), 'The classifier is not an SVC.'
assert sha256(svc_linear_100.kernel.encode()).hexdigest() == '7f2fe580edb35154041fa3d4b41dd6d3adaef0c85d2ff6309f1d4b520eeecda3', 'The kernel is not correct.'
np.testing.assert_almost_equal(svc_linear_100_score, 0.65, decimal = 2, err_msg='The score is not correct.')

## Exercise 2: Support Vector Machine
Having tried the Support Vector Classifier, you turn to Support Vector Machines to see if they perform better. You wonder which kernel you should use, and decide to start with the polynomial kernel.

### Exercise 2.1: Polynomial SVM
Create an SVM with a polynomial kernel of degree 2. Fit the model to the data and create the predictions.
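
The sketch below illustrates why a degree-2 polynomial kernel can help, using scikit-learn's `make_circles` as a hypothetical stand-in dataset: concentric circles cannot be separated by a straight line, but they can be by a quadratic boundary. The `toy_*` names are made up for this example.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical toy data: two concentric circles
toy_X, toy_y = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0
)

# Linear kernel as a baseline, degree-2 polynomial kernel for comparison
toy_linear = SVC(kernel="linear").fit(toy_X_train, toy_y_train)
toy_poly = SVC(kernel="poly", degree=2).fit(toy_X_train, toy_y_train)

toy_poly_preds = toy_poly.predict(toy_X_test)
toy_linear_score = toy_linear.score(toy_X_test, toy_y_test)
toy_poly_score = toy_poly.score(toy_X_test, toy_y_test)
print(f"linear: {toy_linear_score:.2f}, poly (degree 2): {toy_poly_score:.2f}")
```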

In [None]:
# Use these variables for the classifier and the predictions.
# svm_poly = ...
# svm_poly_preds = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(svm_poly, sklearn.svm.SVC), 'The classifier is not of the correct type.'
assert sha256(svm_poly.kernel.encode()).hexdigest() == '2e68a8e49fe8b9e6e94c2fbdec0f227acaecc5cdc3c5e2e411e2bbe49b440ae2', 'The kernel is not correct.'
assert svm_poly.degree == 2, 'Incorrect polynomial degree.'
assert sha256(''.join([str(i) for i in svm_poly_preds]).encode()).hexdigest() == '775aa4cf134658977edc34aefe6ea47b86962f4fccb9d44116f6a7520e53dcea', 'The predictions are not correct.'
print(f'The score of the polynomial SVM is {svm_poly.score(X_test,y_test)}.')

### Exercise 2.2: Radial SVM
Create an SVM with a radial kernel. Fit it to the data and create the predictions.
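
As a hedged illustration, scikit-learn names the radial basis function kernel `"rbf"`, and it is also the default kernel of `SVC`. The sketch below fits such a model on `make_moons`, a hypothetical stand-in for the song data; the `toy_*` names are made up for this example.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical toy data: two interleaving half moons
toy_X, toy_y = make_moons(n_samples=400, noise=0.2, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0
)

# The kernel is set explicitly here, although "rbf" is already the default
toy_rbf = SVC(kernel="rbf").fit(toy_X_train, toy_y_train)
toy_default_kernel = SVC().kernel

toy_rbf_preds = toy_rbf.predict(toy_X_test)
print(toy_default_kernel)
print(f"Toy test accuracy: {toy_rbf.score(toy_X_test, toy_y_test):.2f}")
```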

In [None]:
# svm_radial = ...
# svm_radial_preds = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(svm_radial, sklearn.svm.SVC), 'The classifier is not of the correct type.'
assert sha256(svm_radial.kernel.encode()).hexdigest() == '01a0d0e53b2345784a5f47788c5187e466a5fba8310cd265267ad7ed5810bd51', 'The kernel is not correct.'
assert sha256(''.join([str(i) for i in svm_radial_preds]).encode()).hexdigest() == '391e22acee70ee903a8ae5324dea544e27ed02c72ab8b6b0fa6b3a8cdfecf138', 'The predictions are not correct.'
print(f'The score of the radial SVM is {svm_radial.score(X_test,y_test)}.')

## Exercise 3: Support Vector Regression

You also wonder whether the energy of a song can be predicted by the remaining attributes. Let's prepare the dataset:

In [None]:
X_train, X_test, y_train, y_test = get_X_y_train_test(songs_df.drop("target", axis=1), target_col="energy")

### Exercise 3.1: Energy of a song
Use an SVR estimator with a radial kernel to predict the energy of a song. Create and fit an SVR estimator, then calculate the score and the predictions.
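
A minimal sketch of the same workflow for regression, on a made-up noisy sine curve rather than the song data. Note that for regressors such as `SVR`, `score()` returns the coefficient of determination (R²), not an accuracy. The `toy_*` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical toy data: a noisy sine curve with one feature
rng = np.random.default_rng(0)
toy_X = rng.uniform(0.0, 6.0, size=(200, 1))
toy_y = np.sin(toy_X[:, 0]) + rng.normal(scale=0.1, size=200)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0
)

# Scale the features using statistics from the training split only
toy_scaler = StandardScaler()
toy_X_train = toy_scaler.fit_transform(toy_X_train)
toy_X_test = toy_scaler.transform(toy_X_test)

toy_svr = SVR(kernel="rbf").fit(toy_X_train, toy_y_train)

# score() is R^2 for regressors; predict() returns continuous values
toy_svr_score = toy_svr.score(toy_X_test, toy_y_test)
toy_svr_preds = toy_svr.predict(toy_X_test)
print(f"Toy R^2: {toy_svr_score:.2f}")
```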

In [None]:
# svr = ...
# svr_score = ...
# svr_preds = ...

# YOUR CODE HERE
raise NotImplementedError()


In [None]:
assert isinstance(svr, sklearn.svm.SVR), 'The estimator is not the correct type.'
assert sha256(svr.kernel.encode()).hexdigest() == '01a0d0e53b2345784a5f47788c5187e466a5fba8310cd265267ad7ed5810bd51', 'The kernel is not correct.'
np.testing.assert_almost_equal(svr_preds.mean(), 0.67, decimal = 2, err_msg='The predictions are not correct.')
np.testing.assert_almost_equal(svr_preds.var(), 0.028, decimal = 3, err_msg='The predictions are not correct.')
np.testing.assert_almost_equal(svr_preds.max(), 0.996, decimal = 3, err_msg='The predictions are not correct.')
np.testing.assert_almost_equal(svr_preds.min(), 0.076, decimal = 3, err_msg='The predictions are not correct.')
np.testing.assert_almost_equal(svr.score(X_test, y_test), 0.73, decimal = 2, err_msg = 'The score is not correct.')