# Third Assignment - FINTECH 540 - Machine Learning for FinTech - Volatility Forecasting with Neural Networks

In this assignment, you will implement a neural network architecture for volatility forecasting. The primary objective of this regression task is to achieve a satisfactory performance on the test set (out-of-sample).

## Some context about volatility
In finance, volatility is one of the most important factors to consider when making decisions. Volatility measures the variation in the price of a given stock or market index over a period of time. Low volatility indicates relatively stable prices, while high volatility is associated with wild price fluctuations and a riskier market. Volatility is particularly significant in several financial activities, such as risk and portfolio management or derivative pricing.

Volatility refers to the degree of fluctuation of the price of an asset. It is not directly observable, and different definitions in the financial domain are used to measure it.

We can distinguish between the following types of volatility:

- Historical volatility
- Implied volatility
- Volatility index
- Intraday volatility
- Realized volatility

**Historical Volatility**

Historical volatility measures the past price changes of an underlying financial instrument over a given period. It is usually defined as the standard deviation of returns, that is, the square root of their variance:

$$\sigma=\left(E\left[|r-\mu|^2\right]\right)^{1 / 2}$$


Here, $r$ represents the returns over a given period and $\mu$ stands for their expected value. Historical volatility may be used to predict future price movements based on previous behavior, but it does not provide insights regarding the future trend or direction of prices.
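As a minimal sketch of the definition above, the snippet below estimates historical volatility from a short series of closing prices. The helper name `historical_volatility`, the made-up prices, the use of log returns, the sample estimator (`ddof=1`), and the 252-day annualization factor are all assumptions for illustration, not part of the assignment.

```python
import numpy as np

def historical_volatility(prices, periods_per_year=252):
    """Annualized standard deviation of log returns (hypothetical helper)."""
    prices = np.asarray(prices, dtype=float)
    log_returns = np.diff(np.log(prices))      # r_t = ln(P_t / P_{t-1})
    per_period_vol = log_returns.std(ddof=1)   # sample standard deviation of returns
    return per_period_vol * np.sqrt(periods_per_year)

prices = [100.0, 101.0, 99.5, 100.5, 102.0, 101.0]  # made-up closing prices
hist_vol = historical_volatility(prices)
print(f"annualized historical volatility: {hist_vol:.4f}")
```

Passing `periods_per_year=1` returns the unscaled per-period standard deviation.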

**Implied Volatility**

Implied volatility isn't calculated from the historical prices of the stock but instead estimates the future volatility by looking at the market price of the options. Whereas historical volatility is static for a fixed given period, implied volatility varies for a stock across different option strike prices.

Implied volatility is calculated by inverting the Black-Scholes option pricing model. The Black-Scholes partial differential equation describes the price of an option over time; implied volatility is the value of volatility for which the model price matches the observed market price. There are several approaches to calculating the implied volatility, with iterative search being the most straightforward.

Implied volatility is a critical parameter in option pricing since it provides a forward-looking estimation of possible future price fluctuations.
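The iterative search mentioned above can be sketched with a simple bisection. This example assumes a European call on a non-dividend-paying stock priced with the Black-Scholes closed-form formula. The function names, the search bounds, and all numerical inputs are hypothetical.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call_price(spot, strike, rate, maturity, sigma):
    """Black-Scholes price of a European call without dividends."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * sigma ** 2) * maturity) / (
        sigma * math.sqrt(maturity)
    )
    d2 = d1 - sigma * math.sqrt(maturity)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)

def implied_volatility(price, spot, strike, rate, maturity, lo=1e-6, hi=5.0, tol=1e-8):
    """Bisection search; relies on the call price increasing with sigma."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call_price(spot, strike, rate, maturity, mid) > price:
            hi = mid   # model price too high, so volatility must be lower
        else:
            lo = mid   # model price too low, so volatility must be higher
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Round trip: price an option at a known volatility, then recover it
true_sigma = 0.25
quote = bs_call_price(100.0, 105.0, 0.02, 0.5, true_sigma)
recovered = implied_volatility(quote, 100.0, 105.0, 0.02, 0.5)
print(round(recovered, 6))
```

Bisection is slower than Newton-type methods but needs no derivative and cannot diverge as long as the true value lies inside the bracket.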

**Volatility Index**

The volatility index is a measure of volatility applied to a market index or its exchange-traded fund equivalent. There are several volatility indexes quoted in financial markets, with the Chicago Board Options Exchange (CBOE) Volatility Index (VIX) being the most popular one.

The volatility index can be calculated as the weighted average of the implied volatilities for several options related to a specific index. Traders use this index as an indicator of investor sentiment to identify if there is too much optimism or fear in the market and, hence, possible reversals.

**Intraday Volatility**

Intraday volatility uses high-frequency asset prices to capture the market fluctuations within a trading day. Because it relies on many more observations, it generally provides more accurate variance estimates.


**Realized Volatility**

Realized volatility also uses intraday information. It's based on the realized variance introduced by Barndorff-Nielsen and Shephard (2001). For day $n$, where each day has length $\Delta$, the realized variance can be defined as the sum of the $M$ squared intraday changes over that day:


$$
\{y\}_n=\sum_{j=1}^M\left\{y^*\left((n-1) \Delta+\frac{\Delta j}{M}\right)-y^*\left((n-1) \Delta+\frac{\Delta(j-1)}{M}\right)\right\}^2, \quad n=1,2, \ldots
$$

Here, $y^*$ represents the stock log price, and $M$ is the number of intraday returns observed during a day.


The realized volatility is the square root of the realized variance. It provides an efficient measure of volatility since it considers all transactions in a given day.
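A minimal sketch of this computation for a single day, assuming a vector of equally spaced intraday log prices. The simulated five-minute returns, the seed, and the function names are made up for illustration.

```python
import numpy as np

def realized_variance(log_prices):
    """Sum of squared intraday log-price changes within one day."""
    increments = np.diff(np.asarray(log_prices, dtype=float))
    return float(np.sum(increments ** 2))

def realized_volatility(log_prices):
    """Square root of the realized variance."""
    return float(np.sqrt(realized_variance(log_prices)))

# Made-up day with M = 78 five-minute returns, i.e. M + 1 price observations
rng = np.random.default_rng(0)
intraday_returns = rng.normal(0.0, 0.001, size=78)
log_prices = np.log(100.0) + np.concatenate([[0.0], np.cumsum(intraday_returns)])

rv = realized_variance(log_prices)
rvol = realized_volatility(log_prices)
print(f"realized variance: {rv:.3e}, realized volatility: {rvol:.3e}")
```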

## Dataset Overview

The realized measure of volatility in the dataset is given by the realized kernel introduced by Barndorff-Nielsen et al. (2008). The realized kernel yields a more robust volatility estimate, even in the presence of noise.

- **Assets**: 28 stocks + 1 ETF (SPY).
- **Features**: Daily returns (close-to-close) and realized volatility estimates. Each feature follows the naming convention "SYMBOL_ret" for the return and "SYMBOL_vol" for the volatility.
- **Format**: Divided into train and test sets (2 files are provided).
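The naming convention makes it straightforward to separate return and volatility columns. The sketch below uses a toy frame with a hypothetical ticker `AAA` and made-up numbers; only the `_ret` / `_vol` suffixes and the target name `SPY_vol_t+1` come from the assignment.

```python
import pandas as pd

# Toy frame mimicking the naming convention (values are made up)
df = pd.DataFrame({
    "date": ["2020-01-02", "2020-01-03"],
    "AAA_ret": [0.010, -0.020],
    "AAA_vol": [0.10, 0.12],
    "SPY_ret": [0.005, -0.010],
    "SPY_vol": [0.08, 0.09],
    "SPY_vol_t+1": [0.09, 0.11],
})

ret_cols = [c for c in df.columns if c.endswith("_ret")]
vol_cols = [c for c in df.columns if c.endswith("_vol")]   # excludes the target
symbols = sorted({c.rsplit("_", 1)[0] for c in ret_cols})

print(ret_cols, vol_cols, symbols)
```

Because the target ends in `_t+1`, a suffix match on `_vol` does not pick it up as a feature.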


## Task and General Hints

In this assignment, you are tasked with building a predictive regression model on equity data. Your primary goal is to ensure accurate out-of-sample predictions and evaluate them with the below-mentioned metric.

**THE TARGET VARIABLE YOU HAVE TO PREDICT**

This is a univariate prediction task. The target variable is already provided in both the train and the test file. The name of the target variable is **SPY_vol_t+1**. That is, you are predicting the next-day volatility of the SPY ETF from the information available today. You can therefore assume that all the other features in the dataset are observed at time $t$; only the target is at time $t+1$, since you have to solve a forecasting problem. Do not create another target variable.

To guide you through this process, consider breaking down your tasks into the following three phases:

**Preprocessing**
The dataset is already free of inconsistencies, missing values, or outliers. 
- **Feature Engineering**: You might want to create additional variables (lags, etc.) or perform transformations. Ensure that all the variables you want to use for modeling are correctly preprocessed. You don't necessarily need to use all the variables. You will eventually refine your choices while modeling.
- **Data Splitting**: The dataset is partitioned, and two files are provided to you.
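As one possible way to build lagged features, the sketch below shifts a column by one and two days with pandas. The helper `add_lags`, the `_lagK` column names, and the toy values are hypothetical. Rows without a full history are dropped here, so the same rows would also have to be dropped from the target.

```python
import pandas as pd

def add_lags(df, columns, lags=(1, 2)):
    """Return a copy of df with lagged versions of the given columns."""
    out = df.copy()
    for col in columns:
        for k in lags:
            out[f"{col}_lag{k}"] = out[col].shift(k)   # value observed k rows earlier
    return out

toy = pd.DataFrame({"SPY_vol": [0.10, 0.12, 0.11, 0.15, 0.13]})  # made-up values
lagged = add_lags(toy, ["SPY_vol"], lags=(1, 2))
lagged = lagged.dropna().reset_index(drop=True)   # first rows lack a full history
print(lagged)
```

Shifting only looks backward in time, so the lagged columns do not leak information from time $t+1$ into the features.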

**Model Selection**
- This notebook focuses on using neural networks for regression. You can experiment with the different neural network architectures we have seen in class. Feel free to compare the performance against a linear regression model. 

**Model Tuning and Evaluation**
- Once you've selected a model, you'll want to fine-tune its parameters to achieve the best out-of-sample performance.
- You may adjust parameters manually, or you can construct a routine to fit several models with different hyperparameters. 
- Evaluate your final model using the Root Mean Squared Error (RMSE) metric. The last cell of this notebook will also take care of it, so follow the naming convention stated at the bottom of the notebook.
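A tuning routine of this kind could look like the sketch below, which holds out the most recent part of the training data as a validation set and picks the hyperparameter with the lowest validation RMSE. To keep the example light, a closed-form ridge regression stands in for the neural network. The synthetic data, the helper names, and the `alpha` grid are assumptions, and the same loop would apply to network hyperparameters such as layer sizes or learning rate.

```python
import numpy as np

def chronological_split(X, y, val_fraction=0.2):
    """Hold out the most recent observations for validation (no shuffling)."""
    cut = int(len(X) * (1.0 - val_fraction))
    return X[:cut], y[:cut], X[cut:], y[cut:]

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression; a stand-in for training a network."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append intercept column
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[-1, -1] = 0.0                       # do not penalize the intercept
    coef = np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)
    return lambda X_new: np.hstack([X_new, np.ones((len(X_new), 1))]) @ coef

# Synthetic data standing in for the training file
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X @ np.array([0.5, -0.2, 0.0, 0.1, 0.0]) + rng.normal(scale=0.1, size=500)

X_tr, y_tr, X_val, y_val = chronological_split(X, y)
results = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    predict = fit_ridge(X_tr, y_tr, alpha)
    results[alpha] = rmse(y_val, predict(X_val))

best_alpha = min(results, key=results.get)
print(best_alpha, results[best_alpha])
```

Selecting on a validation slice of the training file keeps the test file untouched until the final evaluation.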

**Note**: Parameter choices and tuning are up to you, but they should be made thoughtfully. Carefully study the documentation of the neural network models and refer to the Jupyter Notebooks we used in class to see the possible parameters you can fine-tune.

**IMPORTANT REMARK**: 
You must treat the test set solely as data the model has never seen before: do not use it for training or tuning. Your grade is determined by the results on that part of the dataset.

Remember to set the seed when instantiating and training your model. You can use either Keras (TensorFlow) or PyTorch for this task, and you must make your results fully reproducible for grading. Double-check that you have correctly set the seed before diving into the coding part.

- [Setting the seed in Keras](https://www.tensorflow.org/api_docs/python/tf/keras/utils/set_random_seed)
- [Setting the seed in PyTorch](https://pytorch.org/docs/stable/notes/randomness.html)

## Grading Rubric

Your grade will be determined by the **normalized Root Mean Squared Error (RMSE)** your model achieves on the test set. Specifically, your grade will be calculated as:

$$ \text{Grade} = \text{Normalized RMSE} \times 100 $$

which will be a number between 0 and 100. Grades may be curved before being released.

The normalization for RMSE is defined as:

$$ \text{Normalized RMSE} = 1 - \left( \frac{\text{RMSE}}{\text{MAX_POSSIBLE_RMSE}} \right) $$

Where `MAX_POSSIBLE_RMSE` is a reference value for a poor RMSE on your dataset. In the grading cell it is set to the standard deviation of the test target, which is the RMSE obtained by always predicting the test-set mean. The normalized score therefore lies between 0 (no better than that constant forecast) and 1 (perfect forecast) whenever the model's RMSE does not exceed this reference value.
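To make the scale of this score concrete, the sketch below evaluates three forecasts on made-up target values: a constant forecast at the mean, a perfect forecast, and a deliberately poor one. The function name `normalized_rmse` and the numbers are illustrative. `np.std` is used with its default `ddof=0`, which matches calling `.std()` on a NumPy array.

```python
import numpy as np

def normalized_rmse(y_true, y_pred):
    """1 - RMSE / std(y_true), with the population standard deviation."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return 1.0 - rmse / np.std(y_true)

y_true = np.array([0.10, 0.12, 0.09, 0.15, 0.11])       # made-up target values

mean_forecast = np.full_like(y_true, y_true.mean())      # always predict the mean
score_mean = normalized_rmse(y_true, mean_forecast)      # approximately 0
score_perfect = normalized_rmse(y_true, y_true.copy())   # exactly 1
score_bad = normalized_rmse(y_true, y_true + 1.0)        # negative: worse than the mean

print(score_mean, score_perfect, score_bad)
```

A forecast has to beat the constant mean prediction to earn a positive score.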

The prediction quality assessed by this metric results from all the choices you make: preprocessing the features, including them in a model, selecting and evaluating a proper regression model, and eventually optimizing the hyperparameters.

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import StandardScaler

# Fix the Python, NumPy, and TensorFlow seeds for reproducibility
tf.keras.utils.set_random_seed(42)

# Load the provided train and test files
train_data = pd.read_csv('train_retvol_dataset.csv')
test_data = pd.read_csv('test_retvol_dataset.csv')

# Features: every column except the date and the target
train_data_features = train_data.drop(columns=['date', 'SPY_vol_t+1'])
test_data_features = test_data.drop(columns=['date', 'SPY_vol_t+1'])

# Log-transform the target; y_test, and hence the RMSE, is on the log1p scale
train_target_log = np.log1p(train_data['SPY_vol_t+1'])
test_target_log = np.log1p(test_data['SPY_vol_t+1'])

# Standardize the features with statistics estimated on the training set only
scaler = StandardScaler()
train_data_normalized = scaler.fit_transform(train_data_features)
test_data_normalized = scaler.transform(test_data_features)

X_train = train_data_normalized.astype(np.float32)
y_train = train_target_log.values.astype(np.float32)
X_test = test_data_normalized.astype(np.float32)
y_test = test_target_log.values.astype(np.float32)

# Fully connected network with a single linear output unit for regression
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)
])

# Train with Adam on the mean squared error
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_train, y_train, epochs=500, batch_size=64, verbose=1)

# Out-of-sample predictions (log1p scale), flattened to a 1-D array
y_test_pred = model.predict(X_test).squeeze()


**Instructions to let the next code cell run:**

Before running the cell below, ensure the following:
1. The target variable of your problem has to be named exactly `y_test`, while the out-of-sample prediction variable has to be named `y_test_pred`. The calculation of `MAX_POSSIBLE_RMSE` also relies on this naming convention, since it uses the standard deviation of the test target values.

By adhering to these naming conventions, the grading cell can compute the final score without any issues.

In [2]:
import math
import numpy as np

def evaluate(y_test, y_test_pred):
    """
    Function to calculate RMSE for one-dimensional arrays
    """
    mse = np.mean((y_test - y_test_pred) ** 2)
    rmse = np.sqrt(mse)
    return rmse

rmse = evaluate(y_test, y_test_pred)

MAX_POSSIBLE_RMSE = y_test.std()
normalized_rmse = (1 - rmse / MAX_POSSIBLE_RMSE)

Grade = normalized_rmse
print('The grade for this assignment is ', math.ceil(Grade * 100))

The grade for this assignment is  36
