In [12]:
import pandas as pd
test = pd.read_csv('input/test.csv')
sample_submission = pd.read_csv('input/sample_submission.csv')
book_test = pd.read_parquet('input/book_test.parquet/stock_id=0')
trade_test = pd.read_parquet('input/trade_test.parquet/stock_id=0')

# Optiver Realized Volatility Prediction: 91st place solution

* Competition website: [https://www.kaggle.com/competitions/optiver-realized-volatility-prediction/overview](https://www.kaggle.com/competitions/optiver-realized-volatility-prediction/overview)
* Evaluation details: [https://www.kaggle.com/competitions/optiver-realized-volatility-prediction/overview/evaluation](https://www.kaggle.com/competitions/optiver-realized-volatility-prediction/overview/evaluation)
* Data details: [https://www.kaggle.com/competitions/optiver-realized-volatility-prediction/data](https://www.kaggle.com/competitions/optiver-realized-volatility-prediction/data)
* My submission: [https://www.kaggle.com/code/chrisrichardmiles/opt-inf-ensemble-final-1](https://www.kaggle.com/code/chrisrichardmiles/opt-inf-ensemble-final-1)

## What does the host of this competition want?
Predictions of the volatility of stock prices over the next 10-minute window, given trade data and order book data.
## Why volatility? 
Volatility is important because it is used in calculating the value of a stock option. We can trade more profitably if we are better at determining value. 

$$ \text{option value} = \text{intrinsic value} + \text{time value} $$

Intrinsic value is the amount the option would pay if exercised immediately. For a call option, that is the current stock price minus the strike price, floored at zero, so it is known at the time of sale. The time value is harder to calculate: one would need to know the probability distribution of the stock price at the option's expiration time. Volatility affects that distribution, since a stock with high volatility will have larger price changes over time. Therefore, if we can better predict the volatility of a stock, we can trade options more profitably.
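To make the link between volatility and time value concrete, here is a minimal sketch that prices a European call at several volatility levels. It uses the Black-Scholes formula, which is an assumption chosen for illustration and is not part of the competition. The spot, strike, expiry, and zero interest rate are made-up numbers, and `bs_call_price` is a hypothetical helper name.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call_price(spot: float, strike: float, vol: float, t: float, rate: float = 0.0) -> float:
    """Black-Scholes price of a European call (illustrative model, not the host's method)."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * t) / (vol * math.sqrt(t))
    d2 = d1 - vol * math.sqrt(t)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * t) * norm_cdf(d2)


# Made-up inputs: stock at 105, strike 100, three months to expiry.
spot, strike, t = 105.0, 100.0, 0.25
intrinsic = max(spot - strike, 0.0)  # intrinsic value of the call

time_values = []
for vol in (0.1, 0.2, 0.4):
    time_value = bs_call_price(spot, strike, vol, t) - intrinsic
    time_values.append(time_value)
    print(f"vol={vol:.1f}  time value={time_value:.4f}")
```

Under these assumptions the intrinsic value stays fixed at 5 while the time value grows with volatility, which is why a better volatility forecast translates into a better option valuation.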

## Precisely, what are we being asked to predict?
### realized volatility
We compute the log returns over all consecutive book updates and define the **realized volatility, $\sigma$,** as the square root of the sum of squared log returns.
$$
\sigma = \sqrt{\sum_{t}r_{t-1, t}^2}
$$
### Log returns 
Calling $S_t$ the price of the stock $S$ at time $t$, we can define the log return between $t_1$ and $t_2$ as:
$$
r_{t_1, t_2} = \log \left( \frac{S_{t_2}}{S_{t_1}} \right)
$$
where we use the **WAP** (weighted average price) as the price of the stock when computing log returns:
$$ WAP = \frac{BidPrice_{1}*AskSize_{1} + AskPrice_{1}*BidSize_{1}}{BidSize_{1} + AskSize_{1}} $$
where an order book looks like this: 
![order_book_1](https://www.optiver.com/wp-content/uploads/2021/05/OrderBook3.png)
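The three definitions above (WAP, log return, realized volatility) can be sketched in pandas as follows. This is a minimal illustration, not the code from the submission: the function names `wap1` and `realized_volatility_by_time_id` are hypothetical, and the sketch assumes the book rows are already sorted by `seconds_in_bucket` within each `time_id`. The column names follow the book data shown later in this notebook.

```python
import numpy as np
import pandas as pd


def wap1(book: pd.DataFrame) -> pd.Series:
    """Weighted average price from the first level of the order book."""
    return (
        book["bid_price1"] * book["ask_size1"] + book["ask_price1"] * book["bid_size1"]
    ) / (book["bid_size1"] + book["ask_size1"])


def realized_volatility_by_time_id(book: pd.DataFrame) -> pd.Series:
    """Square root of the sum of squared log returns, per time_id.

    Assumes rows are sorted by seconds_in_bucket within each time_id.
    """
    log_wap = np.log(wap1(book))
    # Log return between consecutive book updates inside the same bucket;
    # the first row of each bucket has no previous update and becomes NaN.
    log_ret = log_wap.groupby(book["time_id"]).diff()
    # sum() skips the NaN from the first row of each bucket.
    return np.sqrt((log_ret ** 2).groupby(book["time_id"]).sum())


# First-level columns of the three sample book rows for stock_id 0.
book = pd.DataFrame(
    {
        "time_id": [4, 4, 4],
        "seconds_in_bucket": [0, 1, 5],
        "bid_price1": [1.000049, 1.000049, 1.000049],
        "ask_price1": [1.00059, 1.00059, 1.000639],
        "bid_size1": [91, 91, 290],
        "ask_size1": [100, 100, 20],
    }
)
rv = realized_volatility_by_time_id(book)
print(rv)
```

Note that each price is weighted by the size on the opposite side of the book, so a large bid size pulls the WAP toward the ask price.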

## How will our predictions be scored? 
Submissions are evaluated using the root mean square percentage error, defined as:

$$\text{RMSPE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} ((y_i - \hat{y}_i)/y_i)^2}$$
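As a quick check of the formula, here is a minimal NumPy sketch of the metric. The function name `rmspe` and the example values are made up for illustration.

```python
import numpy as np


def rmspe(y_true, y_pred) -> float:
    """Root mean square percentage error; assumes no true value is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2)))


# Made-up targets and predictions: relative errors are 25% and -50%.
score = rmspe([0.004, 0.002], [0.003, 0.003])
print(score)  # about 0.395
```

Because each error is divided by the true value, the same absolute miss costs more on a low-volatility row than on a high-volatility one.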

There will be around 100 stock ids in the test set and around 150,000 rows to predict.
With `row_id` referring to "stock_id"-"time_id", the submission file looks like: 

In [8]:
sample_submission

| | row_id | target |
|---|---|---|
| 0 | 0-4 | 0.003048 |
| 1 | 0-32 | 0.003048 |
| 2 | 0-34 | 0.003048 |


## What does the input data look like at the time of prediction? 

### book_test.parquet
A parquet file partitioned by stock_id. Provides order book data on the most competitive buy and sell orders entered into the market. The top two levels of the book are shared. The first level of the book is more competitive in price terms, so it receives execution priority over the second level.

Here are the first few rows of the book data for stock_id 0:

<!-- stock_id - ID code for the stock. Not all stock IDs exist in every time bucket. Parquet coerces this column to the categorical data type when loaded; you may wish to convert it to int8.
time_id - ID code for the time bucket. Time IDs are not necessarily sequential but are consistent across all stocks.
seconds_in_bucket - Number of seconds from the start of the bucket, always starting from 0.
bid_price[1/2] - Normalized prices of the most/second most competitive buy level.
ask_price[1/2] - Normalized prices of the most/second most competitive sell level.
bid_size[1/2] - The number of shares on the most/second most competitive buy level.
ask_size[1/2] - The number of shares on the most/second most competitive sell level. -->

In [10]:
book_test

| | time_id | seconds_in_bucket | bid_price1 | ask_price1 | bid_price2 | ask_price2 | bid_size1 | ask_size1 | bid_size2 | ask_size2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 0 | 1.000049 | 1.00059 | 0.999656 | 1.000639 | 91 | 100 | 100 | 24 |
| 1 | 4 | 1 | 1.000049 | 1.00059 | 0.999656 | 1.000639 | 91 | 100 | 100 | 20 |
| 2 | 4 | 5 | 1.000049 | 1.000639 | 0.999656 | 1.000885 | 290 | 20 | 101 | 15 |


### trade_test.parquet 
A parquet file partitioned by stock_id. Contains data on trades that actually executed. The market usually sees more passive buy/sell intention updates (book updates) than actual trades, so one may expect this file to be sparser than the order book.

Here are the first few rows of the trade data for stock_id 0:

In [13]:
trade_test

| | time_id | seconds_in_bucket | price | size | order_count |
|---|---|---|---|---|---|
| 0 | 4 | 7 | 1.000344 | 1 | 1 |
| 1 | 4 | 24 | 1.000049 | 100 | 7 |
| 2 | 4 | 27 | 1.000059 | 100 | 3 |


### test.csv 
Provides the mapping between the other data files and the submission file. As with the other test files, most of the data is available to your notebook only upon submission; just the first few rows are available for download.

In [14]:
test

| | stock_id | time_id | row_id |
|---|---|---|---|
| 0 | 0 | 4 | 0-4 |
| 1 | 0 | 32 | 0-32 |
| 2 | 0 | 34 | 0-34 |
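The mapping from `stock_id` and `time_id` to `row_id` can be sketched as below. This is an illustration only: the frame is rebuilt by hand from the three rows shown above, and the constant `target` is a placeholder copied from the sample submission, not a real prediction.

```python
import pandas as pd

# The three downloadable rows of test.csv, rebuilt by hand for illustration.
test = pd.DataFrame({"stock_id": [0, 0, 0], "time_id": [4, 32, 34]})

# row_id is "<stock_id>-<time_id>".
row_id = test["stock_id"].astype(str) + "-" + test["time_id"].astype(str)

# Placeholder target taken from the sample submission; a model's
# per-row volatility predictions would go here instead.
submission = pd.DataFrame({"row_id": row_id, "target": 0.003048})
print(submission)
```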
