# 0 Support Vector Machine (SVM) algorithm

Given a data set $D=\left\{(x_i, y_i)\right\}_{i=1}^n$ of $n$ points, $\varepsilon$-Support Vector Regression (denoted SVR) fits to $D$ a function $f$ of the following form:


$$
f(x)=w^T \phi(x)+b
$$


We aim to minimize
$$
J(w)=\frac{1}{2}||w||^2+C \sum_{i=1}^n\left|\xi_i\right|
$$
subject to the constraints
$$
\left|y_i-w^T \phi(x_i)-b\right| \leq \varepsilon+\left|\xi_i\right|, \quad i=1,\dots,n
$$


where:

-   $w, b$ are the coefficients to be estimated

-   $\phi(x)$ is a mapping from the lower-dimensional $x$-space to a higher-dimensional feature space

-   $\xi_i$ is the slack variable of the point $x_i$, which keeps the constraints feasible: for any data point $(x_i, y_i)$ that falls outside the $\varepsilon$-margin, $\left|\xi_i\right|$ is its deviation from the margin

-   $\varepsilon$ is the distance from the margins to the hyperplane: absolute errors up to $\varepsilon$ incur no penalty, and only the excess beyond $\varepsilon$ is charged through $\xi_i$

-   $C$ is a positive constant that controls the penalty imposed on observations that lie outside the margin specified by $\varepsilon$; as $C$ increases, the tolerance for points outside the margins decreases

One great advantage of SVMs is that solving for the optimal parameters is a convex optimization problem, so every local optimum is also a global optimum.
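The objective and slack variables defined above can be evaluated directly for a candidate $(w, b)$. The following is a minimal sketch, assuming the identity feature map $\phi(x)=x$ and a tiny made-up data set; the function name `svr_primal_objective` is hypothetical and is not part of this notebook or of scikit-learn.

```python
import numpy as np

def svr_primal_objective(w, b, X, y, C=1.0, epsilon=0.1):
    """Evaluate J(w) for a linear model, i.e. assuming phi(x) = x."""
    residuals = y - (X @ w + b)
    # Slack is the amount by which |residual| exceeds the epsilon tube;
    # points inside the tube contribute zero.
    slack = np.maximum(0.0, np.abs(residuals) - epsilon)
    objective = 0.5 * np.dot(w, w) + C * slack.sum()
    return objective, slack

# Made-up toy data: only the third point lies outside the tube.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.05, 1.0, 2.5, 3.0])
w = np.array([1.0])
b = 0.0

objective, slack = svr_primal_objective(w, b, X, y, C=10.0, epsilon=0.1)
print(slack)
print(objective)
```

Raising `C` in this sketch scales the cost of the single violating point, which is why a larger `C` pushes the fit to tolerate fewer points outside the margin.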

In [1]:
import os
import pathlib
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.svm import SVR
import matplotlib.pyplot as plt

from preprocess import *

from sklearn.model_selection import StratifiedKFold, GridSearchCV, RandomizedSearchCV, cross_val_score
from hyperopt import tpe, STATUS_OK, Trials, hp, fmin, space_eval

# 1 Data

In [13]:
# if using google colab
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/data (2).csv')
df.head()

Mounted at /content/drive


In [2]:
parent_path = str(pathlib.Path(os.getcwd()).parent)
df = pd.read_csv(os.path.join(parent_path, 'data/data.csv'))
small_ds = df.sample(frac=0.2, random_state=42)
df.head()

| Unnamed: 0.1 | Unnamed: 0 | optionid | securityid | strike | callput | date_traded | contract_price | market_price | underlyings_price | contract_volume | days_to_maturity | moneyness | rate | volatility |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 150034236.0 | 504569.0 | 0.42 | C | 2006-10-18 | 0.0715 | 0.07025 | 0.4885 | 5.0 | 2.0 | 1.163095 | 0.053646 | 0.022956 |
| 1 | 1 | 150247468.0 | 504880.0 | 40.0 | C | 2006-10-18 | 0.124 | 0.1225 | 39.913799 | 56137.0 | 2.0 | 0.997845 | 0.053646 | 0.114784 |
| 2 | 2 | 150255000.0 | 506496.0 | 62.0 | C | 2006-10-18 | 0.172 | 0.174 | 61.827798 | 27369.0 | 2.0 | 0.997223 | 0.053646 | 0.106823 |
| 3 | 3 | 150255496.0 | 506497.0 | 53.5 | C | 2006-10-18 | 0.296 | 0.2655 | 53.6129 | 1224.0 | 2.0 | 1.00211 | 0.053646 | 0.110336 |
| 4 | 4 | 150255498.0 | 506497.0 | 54.0 | C | 2006-10-18 | 0.075 | 0.0645 | 53.6129 | 963.0 | 2.0 | 0.992831 | 0.053646 | 0.110336 |


In [3]:
dataframe_BS = makeBS(df)
small_dataframe_BS = makeBS(small_ds)

Split the data into training and test sets, each a tuple of features and targets, and print their dimensions to check that the shapes are as expected.

In [4]:
(x_train, y_train) , (x_test, y_test)= propocessed(dataframe_BS)
print(np.shape(x_train), np.shape(y_train), np.shape(x_test), np.shape(y_test))
(small_x_train, small_y_train) , (small_x_test, small_y_test)= propocessed(small_dataframe_BS)
print(np.shape(small_x_train), np.shape(small_y_train), np.shape(small_x_test), np.shape(small_y_test))

(85999, 5) (85999,) (21500, 5) (21500,)
(17200, 5) (17200,) (4300, 5) (4300,)


# 2 Model

In this approach, we use the radial basis function (RBF) as a transformation kernel for our data.    

For a pair of data points $x_i, x_j$, the RBF kernel is defined as

$$
k(x_i, x_j) = \exp(-\gamma ||x_i-x_j||^2)
$$

where $\gamma$ is a positive hyperparameter.
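As a quick check of the kernel definition, the sketch below computes the RBF kernel matrix by hand, using the squared Euclidean distance as scikit-learn does, and compares it with `sklearn.metrics.pairwise.rbf_kernel`. The helper `rbf_manual`, the random inputs, and the value of `gamma` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf_manual(X, Z, gamma):
    # Squared Euclidean distance between every row of X and every row of Z.
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # made-up inputs: 5 points, 3 features
gamma = 0.01                  # illustrative value only

K_manual = rbf_manual(X, X, gamma)
K_sklearn = rbf_kernel(X, X, gamma=gamma)
print(np.allclose(K_manual, K_sklearn))
```

A small `gamma` makes the kernel decay slowly with distance, so distant points still influence each other; a large `gamma` makes the fit more local.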

Fit the model:

In [10]:
regressor = SVR(kernel = 'rbf')

In [None]:
regressor.fit(x_train, y_train)

Evaluate the model:

In [8]:
y_pred = regressor.predict(x_test)
rmse = np.sqrt(np.mean((y_test-y_pred)**2))
rmse

0.08411201112962334

# 3 Tuning hyperparameters

The hyperparameter ranges are chosen based on the experiments in [Practical Option Pricing with Support Vector Regression and MART by Ian I-En Choo, Stanford University](http://cs229.stanford.edu/proj2009/Choo.pdf).

In [5]:
C_range = np.logspace(1,3,3)
print(f'The list of values for C are {C_range}')

epsilon_range = np.logspace(-1,-3,3)
print(f'The list of values for epsilon are {epsilon_range}')

gamma_range = np.logspace(-5, -2, 4)
print(f'The list of values for gamma are {gamma_range}')

The list of values for C are [  10.  100. 1000.]
The list of values for epsilon are [0.1   0.01  0.001]
The list of values for gamma are [1.e-05 1.e-04 1.e-03 1.e-02]


In [6]:
param_grid = { 
    # Regularization parameter
    "C": C_range,
    # Kernel type
    "kernel": ['rbf', 'poly'],
    # margin parameter
    "epsilon":epsilon_range,
    # Gamma is the Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’
    "gamma": gamma_range
    }

# Set up score
scoring = ['accuracy']
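To get a feel for the cost of an exhaustive search over this grid, the number of candidate settings can be counted with `ParameterGrid`. This is a small sketch that rebuilds the same grid so it runs on its own; the fold count of 5 is an assumption based on scikit-learn's default `cv` for the search classes.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({
    "C": np.logspace(1, 3, 3),
    "kernel": ["rbf", "poly"],
    "epsilon": np.logspace(-1, -3, 3),
    "gamma": np.logspace(-5, -2, 4),
})

n_candidates = len(grid)   # product of the list lengths: 3 * 2 * 3 * 4
n_folds = 5                # assumed: default cross-validation folds
print(n_candidates, n_candidates * n_folds)
```

Because every candidate is refit once per fold, the number of SVR fits grows multiplicatively with each added hyperparameter value, which is one reason to tune on a subsample such as `small_x_train`.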

## Hyperparameter Tuning Using Grid Search

In [7]:
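# Define grid search
# No `scoring` is passed, so candidates are ranked by SVR's default score (R^2);
# with a single metric, the string given to `refit` only acts as a truthy flag.
regressor = SVR()
grid_search = GridSearchCV(estimator=regressor, 
                           param_grid=param_grid, 
                           refit= 'neg_root_mean_squared_error', 
                           verbose=0)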

In [8]:
# Fit grid search
grid_result = grid_search.fit(small_x_train, small_y_train)
# Print grid search summary
grid_result

In [9]:
# Print the best mean cross-validated score (R^2, the default score of SVR)
print(f'The best cross-validated R^2 score on the training dataset is {grid_result.best_score_:.4f}')
# Print the hyperparameters for the best score
print(f'The best hyperparameters are {grid_result.best_params_}')
# Print the R^2 score for the testing dataset
print(f'The R^2 score for the testing dataset is {grid_search.score(x_test, y_test):.4f}')

The best cross-validated R^2 score on the training dataset is 0.0604
The best hyperparameters are {'C': 1000.0, 'epsilon': 0.1, 'gamma': 0.01, 'kernel': 'rbf'}
The R^2 score for the testing dataset is -0.0504


In [10]:
best_regressor = grid_result.best_estimator_
y_pred = best_regressor.predict(x_test)
rmse = np.sqrt(np.mean((y_test-y_pred)**2))
rmse

0.09844271167951188
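When no `scoring` argument is given, `GridSearchCV` falls back on the estimator's own `score` method, which for `SVR` is the coefficient of determination $R^2$ rather than a classification accuracy. The sketch below, on made-up numbers, shows how $R^2$ relates to the RMSE computed above and why a negative value means the model does worse than predicting the mean of the targets.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([0.10, 0.25, 0.05, 0.40, 0.30])
y_hat = np.array([0.12, 0.20, 0.10, 0.35, 0.33])

rmse_demo = np.sqrt(np.mean((y_true - y_hat) ** 2))
r2 = r2_score(y_true, y_hat)

# R^2 = 1 - MSE / Var(y): the squared RMSE rescaled by the target variance.
r2_from_rmse = 1.0 - rmse_demo ** 2 / np.var(y_true)

# Predicting the mean everywhere gives R^2 = 0; anything worse is negative.
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline)

print(rmse_demo, r2, r2_from_rmse, r2_baseline)
```

Read this way, an RMSE is only small or large relative to the spread of the targets, so it helps to report both numbers together.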

## Hyperparameter Tuning Using Random Search

In [15]:
# Define random search
random_search = RandomizedSearchCV(estimator=regressor, 
                           param_distributions=param_grid, 
                           n_iter=100,
                           scoring=scoring, 
                           refit='accuracy', 
                           n_jobs=-1, # use all cores
                           cv=5, # default is 5-fold cross validation 
                           verbose=0)
# Fit random search
random_result = random_search.fit(small_x_train, small_y_train)

Traceback (most recent call last):
  File "/Users/customer/miniforge3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/Users/customer/miniforge3/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 106, in __call__
    score = scorer._score(cached_call, estimator, *args, **kwargs)
  File "/Users/customer/miniforge3/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 267, in _score
    return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
  File "/Users/customer/miniforge3/lib/python3.9/site-packages/sklearn/metrics/_classification.py", line 211, in accuracy_score
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "/Users/customer/miniforge3/lib/python3.9/site-packages/sklearn/metrics/_classification.py", line 104, in _check_targets
    raise ValueError("{0} is not supported".format(y_type))
ValueError: continuous is not supported


In [17]:
# Print random search summary
random_result.best_score_

nan
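The `ValueError: continuous is not supported` in the traceback comes from `accuracy_score`, a classification metric that cannot score continuous targets. Every fold therefore fails to score and `best_score_` ends up as `nan`. A minimal sketch of the same search with a regression scorer follows, assuming synthetic data (`X_demo`, `y_demo`) and a reduced `n_iter` so that it runs quickly. It is not a rerun of the experiment above and its numbers say nothing about the option data.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem, for illustration only.
rng = np.random.default_rng(42)
X_demo = rng.uniform(-1.0, 1.0, size=(120, 2))
y_demo = np.sin(3.0 * X_demo[:, 0]) + 0.1 * rng.normal(size=120)

search = RandomizedSearchCV(
    estimator=SVR(kernel="rbf"),
    param_distributions={
        "C": np.logspace(1, 3, 3),
        "epsilon": np.logspace(-1, -3, 3),
        "gamma": np.logspace(-5, -2, 4),
    },
    n_iter=10,                              # sample 10 of the 36 settings
    scoring="neg_root_mean_squared_error",  # a regression metric
    refit=True,
    cv=5,
    random_state=0,
)
search.fit(X_demo, y_demo)

# The scorer is negated so that higher is better; flip the sign to get RMSE.
best_rmse = -search.best_score_
print(search.best_params_, best_rmse)
```

With a single string passed to `scoring`, `refit=True` is enough to retrain the best candidate on the full data, so no metric name is needed in `refit`.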

# 4 Results and Conclusion