# SLU18 - Support Vector Machines (SVM): Exercise notebook

In [None]:
import pandas as pd
import numpy as np
import hashlib
import json

# These will be needed to prepare the dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, SVR

# Seed for reproducibility
np.random.seed(42)

**Let the Music Play**

The year is 2020 and due to the Covid-19 pandemic you spend a lot more time inside than you used to. You realize that one of the few things that people can still do (almost) the same way as before is listen to music. Thus, you decide to use your data skills to surprise one of your friends. To do so, you use data about your friend's listening habits and try to make a classifier that predicts whether your friend will like a song based on some attributes. 

In [None]:
songs_df = pd.read_csv("data/song_data.csv", index_col="id")
print(songs_df.shape)
songs_df.head()

The `target` column records whether your friend liked each song. The data also contains several attributes of each song that you suspect will be useful for inferring your friend's musical taste. You decide to drop the song title and artist, since you are more interested in the musical attributes.

In [None]:
songs_df = songs_df.drop(columns=["song_title", "artist"])
songs_df.head()

The dataset is balanced:

In [None]:
songs_df.target.value_counts(normalize=True)

Since the target variable is binary, you are faced with a binary classification problem. You remember those really cool support vector machines and decide to give them a shot.

To train and evaluate your models properly, you split your dataset into a train set and a test set and scale the data.

In [None]:
def get_X_y_train_test(df, target_col):
    """
    Convert the input dataframe df into the
    train and test features and targets
    """
    X = df.drop(target_col, axis=1)
    y = df[target_col]
    # train test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # SVMs are not scale invariant, so you scale your data beforehand
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    print("X_train of shape ", X_train.shape)
    print("y_train of shape ", y_train.shape)
    print("X_test of shape  ", X_test.shape)
    print("y_test of shape  ", y_test.shape)
    
    return X_train, X_test, y_train, y_test 

In [None]:
X_train, X_test, y_train, y_test = get_X_y_train_test(songs_df, target_col="target")

## Exercise 1: Support Vector Classifier

### Exercise 1.1: Train the classifier 
Use a support vector classifier to predict which songs your friend will like. 

Assign the classifier to the `svc_linear` variable and fit it to the data. Use the appropriate kernel. Calculate the score of the classifier on the test data and assign it to the variable `svc_linear_score`.
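
As a reminder of the general workflow, here is a minimal sketch of fitting and scoring an `SVC` on synthetic data. The data and the `toy_` variable names are hypothetical stand-ins and are not part of the song dataset. For a classifier, `score` returns the mean accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical, not the song dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# Fit the scaler on the train split only, then reuse it on the test split
toy_scaler = StandardScaler()
X_toy_train = toy_scaler.fit_transform(X_toy_train)
X_toy_test = toy_scaler.transform(X_toy_test)

# Fit the classifier and compute the mean accuracy on the held-out split
toy_clf = SVC(kernel="linear")
toy_clf.fit(X_toy_train, y_toy_train)
toy_score = toy_clf.score(X_toy_test, y_toy_test)
print(toy_score)
```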

In [None]:
# svc_linear = ...
# svc_linear_score = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(svc_linear, SVC), 'The classifier is not an SVC.'
assert hashlib.sha256(json.dumps(svc_linear.kernel).encode()).hexdigest() == \
'4dd109ad15a2481c6cd88948a95b3116061fd1b966bef5c92481ec848e167829', 'The kernel is not correct.'
np.testing.assert_almost_equal(svc_linear_score, 0.65, decimal = 2, err_msg='The score is not correct.')

### Exercise 1.2: Number of support vectors
Obtain the number of support vectors for each class. You will need to check the documentation.
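
A fitted `SVC` exposes the per-class counts through its `n_support_` attribute, ordered like `classes_`. The sketch below uses hypothetical synthetic data, so the numbers it prints are unrelated to the expected answers of this exercise.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two synthetic clusters (hypothetical stand-in data)
X_toy, y_toy = make_blobs(n_samples=100, centers=2, random_state=0)
toy_clf = SVC(kernel="linear").fit(X_toy, y_toy)

print(toy_clf.classes_)    # class labels, in the order used by n_support_
print(toy_clf.n_support_)  # number of support vectors for each class
```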

In [None]:
# sv_class_1 = ...
# sv_class_2 = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert hashlib.sha256(json.dumps(str(sv_class_1)).encode()).hexdigest() == \
'a7ef5f997803c55d6b5b03ac733bb52e045d9f7c0a60fb066b3d2d32c877baa2', 'The number of support vectors of class 1 is not correct.'
assert  hashlib.sha256(json.dumps(str(sv_class_2)).encode()).hexdigest() == \
'edc90998c425418d30e05bb7e28a3036933695a18eaae0badee98aaa66bddddb', 'The number of support vectors of class 2 is not correct.'

### Exercise 1.3: Support vectors
Obtain the array of the support vectors for the classifier from exercise 1.1.
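
For reference, a fitted `SVC` stores the support vectors themselves in `support_vectors_` and their row indices in the training data in `support_`. A small sketch on hypothetical synthetic data shows how the two relate:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two synthetic clusters (hypothetical stand-in data)
X_toy, y_toy = make_blobs(n_samples=100, centers=2, random_state=0)
toy_clf = SVC(kernel="linear").fit(X_toy, y_toy)

toy_vectors = toy_clf.support_vectors_  # shape: (n_support_vectors, n_features)
toy_indices = toy_clf.support_          # positions of those rows in X_toy
print(toy_vectors.shape)

# The support vectors are rows of the training data
print(np.allclose(X_toy[toy_indices], toy_vectors))
```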

In [None]:
# s_vectors = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert s_vectors.shape == (1242,13), 'The shape of the array is not correct.'
np.testing.assert_almost_equal(s_vectors.mean(), -0.009, decimal = 3, err_msg='The support vectors are not correct.')
np.testing.assert_almost_equal(s_vectors.var(), 0.93, decimal = 2, err_msg='The support vectors are not correct.')
np.testing.assert_almost_equal(s_vectors.max(), 9.03, decimal = 2, err_msg='The support vectors are not correct.')
np.testing.assert_almost_equal(s_vectors.min(), -6.958, decimal = 3, err_msg='The support vectors are not correct.')

### Exercise 1.4: Tuning parameter
Create an SVC classifier with a linear kernel and set the tuning parameter `C` to 10. Calculate the score of the classifier on the test data.
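
In scikit-learn's `SVC`, the tuning parameter is exposed as `C` and multiplies the penalty on margin violations, so larger values tolerate fewer violations. The sketch below, on hypothetical noisy synthetic data, counts the support vectors for a few values of `C`; the chosen values are illustrative assumptions only.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Noisy synthetic data (hypothetical): flip_y randomly flips 10% of the labels
X_toy, y_toy = make_classification(
    n_samples=200, n_features=4, flip_y=0.1, random_state=0
)

# Record the total number of support vectors for each value of C
sv_counts = {}
for c_value in (0.01, 1, 10):
    toy_clf = SVC(kernel="linear", C=c_value).fit(X_toy, y_toy)
    sv_counts[c_value] = int(toy_clf.n_support_.sum())

print(sv_counts)
```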

In [None]:
# svc_linear_10 = ...
# svc_linear_10_score = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(svc_linear_10, SVC), 'The classifier is not an SVC.'
assert hashlib.sha256(json.dumps(svc_linear_10.kernel).encode()).hexdigest() == \
'4dd109ad15a2481c6cd88948a95b3116061fd1b966bef5c92481ec848e167829', 'The kernel is not correct.'
assert hashlib.sha256(json.dumps(svc_linear_10.C).encode()).hexdigest() == \
'4a44dc15364204a80fe80e9039455cc1608281820fe2b24f1e5233ade6af1dd5', 'The value of C is not correct.'
np.testing.assert_almost_equal(svc_linear_10_score, 0.65, decimal = 2, err_msg='The score is not correct.')

## Exercise 2: Support Vector Machine
Having tried the Support Vector Classifier, you turn to the Support Vector Machine to see if it performs better. You wonder which kernel you should use, and decide to start with the polynomial kernel.

### Exercise 2.1: Polynomial SVM
Create an SVM with a polynomial kernel of degree 2. Fit the model to the train data and create the predictions on the test data.
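
To see why a non-linear kernel can help, consider two concentric circles: no straight line separates them, while a degree-2 polynomial can describe a circular boundary. The sketch below compares both kernels on hypothetical synthetic data; it illustrates the idea and is not the expected solution for the song data.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles (hypothetical stand-in data)
X_toy, y_toy = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

toy_linear = SVC(kernel="linear").fit(X_toy, y_toy)
toy_poly = SVC(kernel="poly", degree=2).fit(X_toy, y_toy)
toy_poly_preds = toy_poly.predict(X_toy)

# Accuracy on the same data the models were fitted on, for illustration only
print("linear:", toy_linear.score(X_toy, y_toy))
print("poly (degree 2):", toy_poly.score(X_toy, y_toy))
```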

In [None]:
# Use these variables for the classifier and the predictions.
# svm_poly = ...
# svm_poly_preds = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(svm_poly, SVC), 'The classifier is not of the correct type.'
assert hashlib.sha256(json.dumps(svm_poly.kernel).encode()).hexdigest() == \
'725a22fa9523486de6898066949541f8c8a3a994612812068f844a48f6e2acd4', 'The kernel is not correct.'
assert svm_poly.degree == 2, 'Incorrect polynomial degree.'
assert  hashlib.sha256(json.dumps(''.join([str(i) for i in svm_poly_preds])).encode()).hexdigest() == \
'40c6929ff4cb7fd3187181f555c87ee06f64c6f0e18952fb62892de7f0015303', 'The predictions are not correct.'
print(f'The score of the polynomial SVM is {svm_poly.score(X_test,y_test)}.')

### Exercise 2.2: Radial SVM
Create an SVM with a radial kernel. Fit the model to the train data and create the predictions on the test data.
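
In scikit-learn the radial kernel is named `"rbf"`, and it is the default kernel of `SVC`. A minimal sketch on hypothetical synthetic data, showing how predictions are obtained from a fitted model:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons (hypothetical stand-in data)
X_toy, y_toy = make_moons(n_samples=200, noise=0.2, random_state=0)

toy_rbf = SVC(kernel="rbf")
toy_rbf.fit(X_toy, y_toy)

# predict returns one class label per input row
toy_rbf_preds = toy_rbf.predict(X_toy[:5])
print(toy_rbf_preds)
```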

In [None]:
# svm_radial = ...
# svm_radial_preds = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(svm_radial, SVC), 'The classifier is not of the correct type.'
assert hashlib.sha256(json.dumps(svm_radial.kernel).encode()).hexdigest() == \
'215cfefb3870cb23e2b12025254dfbc170f86d077a0ffb9ed442c90a36a64f51', 'The kernel is not correct.'
assert  hashlib.sha256(json.dumps(''.join([str(i) for i in svm_radial_preds])).encode()).hexdigest() == \
'2b5f86435b20021d1ade9acfc1cc7334533b71a90274cb8a5e73550cf029ec68', 'The predictions are not correct.'
print(f'The score of the radial SVM is {svm_radial.score(X_test,y_test)}.')

## Exercise 3: Support Vector Regression

You also wonder whether the energy of a song can be predicted from the remaining attributes. Let's prepare the dataset:

In [None]:
X_train, X_test, y_train, y_test = get_X_y_train_test(songs_df.drop("target", axis=1), target_col="energy")

### Exercise 3.1: Energy of a song
Use an SVR estimator with a radial kernel to predict the energy of a song. Create and fit an SVR estimator with the train data, then calculate the score and the predictions on the test data.
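
`SVR` follows the same fit/predict/score interface as `SVC`, except that `score` returns the coefficient of determination (R²) instead of the accuracy. The sketch below fits an `SVR` to a hypothetical noisy sine curve; the data and the `toy_` names are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine curve (hypothetical stand-in data)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 5, size=(80, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.1, size=80)

toy_svr = SVR(kernel="rbf").fit(X_toy, y_toy)
toy_preds = toy_svr.predict(X_toy)    # continuous predictions, one per row
toy_r2 = toy_svr.score(X_toy, y_toy)  # R^2 on the data used for fitting
print(toy_r2)
```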

In [None]:
# svr = ...
# svr_score = ...
# svr_preds = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(svr, SVR), 'The estimator is not the correct type.'
assert hashlib.sha256(json.dumps(svr.kernel).encode()).hexdigest() == \
'215cfefb3870cb23e2b12025254dfbc170f86d077a0ffb9ed442c90a36a64f51', 'The kernel is not correct.'
np.testing.assert_almost_equal(svr_preds.mean(), 0.67, decimal = 2, err_msg='The predictions are not correct.')
np.testing.assert_almost_equal(svr_preds.var(), 0.028, decimal = 3, err_msg='The predictions are not correct.')
np.testing.assert_almost_equal(svr_preds.max(), 0.996, decimal = 3, err_msg='The predictions are not correct.')
np.testing.assert_almost_equal(svr_preds.min(), 0.076, decimal = 3, err_msg='The predictions are not correct.')
np.testing.assert_almost_equal(svr.score(X_test, y_test), 0.73, decimal = 2, err_msg = 'The score is not correct.')
print(f'The score of the SVR is {svr_score}.')