# Optiver Realized Volatility Prediction - Train

**This notebook loads the competition data and engineers features for training a realized volatility model.**
---------

## Files
**book_[train/test].parquet** - A [parquet](https://arrow.apache.org/docs/python/parquet.html) file partitioned by `stock_id`. Provides order book data on the most competitive buy and sell orders entered into the market. The top two levels of the book are shared. The first level of the book is more competitive in price terms and therefore receives execution priority over the second level.

 - `stock_id` - ID code for the stock. Not all `stock_id`s exist in every time bucket. Parquet coerces this column to the categorical data type when loaded; you may wish to convert it to int8.
 - `time_id` - ID code for the time bucket. `time_id`s are not necessarily sequential but are consistent across all stocks.
 - `seconds_in_bucket` - Number of seconds from the start of the bucket, always starting from 0.
 - `bid_price[1/2]` - Normalized prices of the most/second most competitive buy level.
 - `ask_price[1/2]` - Normalized prices of the most/second most competitive sell level.
 - `bid_size[1/2]` - The number of shares on the most/second most competitive buy level.
 - `ask_size[1/2]` - The number of shares on the most/second most competitive sell level.
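
The categorical-to-integer conversion suggested for `stock_id` can be sketched as follows. The toy frame is made up for illustration and only mimics the dtype that a partitioned parquet read produces; going through `pd.to_numeric` is one possible route, not necessarily the one used later in this notebook.

```python
import pandas as pd

# Toy stand-in for a loaded book frame (values are illustrative only)
book = pd.DataFrame({
    "stock_id": pd.Categorical(["0", "0", "1"]),
    "time_id": [5, 5, 5],
})

# Categorical -> string -> number -> int8 (int8 holds values up to 127)
book["stock_id"] = pd.to_numeric(book["stock_id"].astype(str)).astype("int8")
print(book.dtypes)
```

If the ids could exceed 127, a wider type such as `int16` would be the safer choice.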
 
**trade_[train/test].parquet** - A [parquet](https://arrow.apache.org/docs/python/parquet.html) file partitioned by `stock_id`. Contains data on trades that actually executed. Usually, in the market, there are more passive buy/sell intention updates (book updates) than actual trades, so this file can be expected to be sparser than the order book.

 - `stock_id` - Same as above.
 - `time_id` - Same as above.
 - `seconds_in_bucket` - Same as above. Note that since trade and book data are taken from the same time window and trade data is generally sparser, this field does not necessarily start from 0.
 - `price` - The average price of executed transactions happening in one second. Prices have been normalized and the average has been weighted by the number of shares traded in each transaction.
 - `size` - The total number of shares traded.
 - `order_count` - The number of unique trade orders taking place.
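
To make the `price` definition concrete, here is a minimal sketch of a share-weighted average price per second. The `ticks` frame of individual transactions is hypothetical (the competition files only contain the aggregated rows), and counting transactions as `order_count` is an assumption made for the example.

```python
import pandas as pd

# Hypothetical raw transactions (not part of the competition files)
ticks = pd.DataFrame({
    "seconds_in_bucket": [21, 21, 46],
    "price": [1.0020, 1.0030, 1.0028],
    "size": [100, 300, 128],
})

# Weight each transaction price by the number of shares traded
ticks["notional"] = ticks["price"] * ticks["size"]
per_second = ticks.groupby("seconds_in_bucket").agg(
    notional=("notional", "sum"),
    size=("size", "sum"),
    order_count=("price", "count"),  # assumption: one order per transaction
)
per_second["price"] = per_second["notional"] / per_second["size"]
per_second = per_second[["price", "size", "order_count"]].reset_index()
print(per_second)
```

For second 21 the weighted price is (1.0020 × 100 + 1.0030 × 300) / 400 = 1.00275, which sits closer to the larger transaction.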
 
**train.csv** - The ground truth values for the training set.

 - `stock_id` - Same as above, but since this is a csv the column will load as an integer instead of categorical.
 - `time_id` - Same as above.
 - `target` - The realized volatility computed over the 10 minute window following the feature data under the same `stock_id`/`time_id`. There is no overlap between feature and target data. 
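
For reference, realized volatility is commonly defined as the square root of the sum of squared log returns over the window. The sketch below assumes that standard definition and uses illustrative prices; `realized_volatility` is a name chosen for this example.

```python
import numpy as np

def realized_volatility(prices):
    """Square root of the sum of squared log returns of a price series."""
    log_returns = np.diff(np.log(np.asarray(prices, dtype=float)))
    return float(np.sqrt(np.sum(log_returns ** 2)))

prices = [1.000, 1.001, 0.999, 1.002]  # illustrative normalized prices
print(realized_volatility(prices))
```

A constant price series gives a volatility of exactly 0, and larger price swings within the window increase the value.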
 
**test.csv** - Provides the mapping between the other data files and the submission file. As with other test files, most of the data is only available to your notebook upon submission with just the first few rows available for download.

 - `stock_id` - Same as above.
 - `time_id` - Same as above.
 - `row_id` - Unique identifier for the submission row. There is one row for each existing `stock_id`/`time_id` pair. Not every time window necessarily contains every stock.
 
**sample_submission.csv** - A sample submission file in the correct format.

 - `row_id` - Same as in test.csv.
 - `target` - Same definition as in **train.csv**. The benchmark uses the median target value from **train.csv**.
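
A median benchmark of this kind could be built roughly as below. The frames are toy stand-ins for **train.csv** and **test.csv**, and the `row_id` values shown are assumed for illustration rather than taken from the real files.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are illustrative only)
train = pd.DataFrame({
    "stock_id": [0, 0, 1],
    "time_id": [5, 11, 5],
    "target": [0.004, 0.002, 0.003],
})
test = pd.DataFrame({
    "stock_id": [0, 1],
    "time_id": [4, 4],
    "row_id": ["0-4", "1-4"],  # assumed format, for illustration
})

# Predict the same value, the training median, for every submission row
submission = test[["row_id"]].copy()
submission["target"] = train["target"].median()
print(submission)
```

Such a constant prediction ignores the book and trade data entirely, so it mainly serves as a floor that any trained model should beat.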
 
## Prepare Environment
### Import Packages

In [14]:
# General packages
import pandas as pd
import numpy as np
import pyarrow.parquet as pq # To handle parquet files
import os
import gc
import random
from tqdm import tqdm, tqdm_notebook
from pathlib import Path
import multiprocessing

import time
import warnings
warnings.filterwarnings('ignore')

# Data vis packages
import matplotlib.pyplot as plt
%matplotlib inline

# Data prep
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.decomposition import PCA

# Modelling packages
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as k  # public API; tensorflow.python.* is private
# Key layers
from tensorflow.keras.models import Model, Sequential, load_model
from tensorflow.keras.layers import Input, Add, Dense, Flatten
# Activation layers
from tensorflow.keras.layers import ReLU, LeakyReLU, ELU, ThresholdedReLU
# Dropout layers
from tensorflow.keras.layers import Dropout, AlphaDropout, GaussianDropout
# Normalisation layers
from tensorflow.keras.layers import BatchNormalization
# Embedding layers
from tensorflow.keras.layers import Embedding, Concatenate, Reshape
# Callbacks
from tensorflow.keras.callbacks import Callback, EarlyStopping, LearningRateScheduler, ModelCheckpoint
# Optimisers
from tensorflow.keras.optimizers import SGD, RMSprop, Adam, Adadelta, Adagrad, Adamax, Nadam, Ftrl
# Model cross validation and evaluation
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.losses import binary_crossentropy

# For Bayesian hyperparameter searching
from skopt import gbrt_minimize, gp_minimize
from skopt.utils import use_named_args
from skopt.space import Real, Categorical, Integer

In [15]:
strategy = tf.distribute.get_strategy()
REPLICAS = strategy.num_replicas_in_sync

# GPU options: allocate GPU memory on demand rather than all at once
gpu_options = tf.compat.v1.GPUOptions(allow_growth=True)

# Get number of cpu cores for multiprocessing
try:
    cpus = int(multiprocessing.cpu_count() / 2)
except NotImplementedError:
    cpus = 1 # Default number of cores
    
print(f"Num GPUs Available: {len(tf.config.experimental.list_physical_devices('GPU'))}")
print(f"Num CPU Threads Available: {cpus}")
print(f'REPLICAS: {REPLICAS}')

Num GPUs Available: 1
Num CPU Threads Available: 64
REPLICAS: 1


### Read in Data

In [16]:
# Data paths
comp_dir_path = Path("../input/optiver-realized-volatility-prediction")

# Train paths
train_book_path   = comp_dir_path/"book_train.parquet"
train_trade_path  = comp_dir_path/"trade_train.parquet"
train_labels_path = comp_dir_path/"train.csv"

# Test paths
test_book_path   = comp_dir_path/"book_test.parquet"
test_trade_path  = comp_dir_path/"trade_test.parquet"
test_labels_path = comp_dir_path/"test.csv"

# Sample submission path
sample_sub_path = comp_dir_path/"sample_submission.csv"

In [17]:
def get_stock_ids_list(data_dir_path):
    data_dir = os.listdir(data_dir_path)
    # Get list of stock ids in directory
    stock_ids = list(map(lambda x: x.split("=")[1], data_dir))
    return stock_ids
    
    
def load_book_stock_id_data(stock_id):
    # Get stock id extension
    stock_id_ext = f"stock_id={stock_id}"
    
    # Read individual stock parquet file
    if is_train_test == "train":
        book_stock_id_path = os.path.join(train_book_path, stock_id_ext)
    elif is_train_test == "test":
        book_stock_id_path = os.path.join(test_book_path, stock_id_ext)
    book_stock_id = pd.read_parquet(book_stock_id_path)
    
    # Add stock id feature from filename
    book_stock_id["stock_id"] = stock_id
            
    return book_stock_id

def load_trade_stock_id_data(stock_id):
    # Get stock id extension
    stock_id_ext = f"stock_id={stock_id}"
    
    # Read individual stock parquet file
    if is_train_test == "train":
        trade_stock_id_path = os.path.join(train_trade_path, stock_id_ext)
    elif is_train_test == "test":
        trade_stock_id_path = os.path.join(test_trade_path, stock_id_ext)
    trade_stock_id = pd.read_parquet(trade_stock_id_path)
    
    # Add stock id feature from filename
    trade_stock_id["stock_id"] = stock_id
            
    return trade_stock_id
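
The loaders above rely on partition directories named `stock_id=<id>`. As a hedged alternative to `get_stock_ids_list`, the sketch below skips entries that do not match that pattern and sorts the ids numerically; `list_partition_ids` is a hypothetical helper and is demonstrated on a throwaway directory rather than the competition data.

```python
import os
import tempfile

def list_partition_ids(data_dir_path, key="stock_id"):
    """Return the sorted integer ids of `<key>=<id>` subdirectories."""
    ids = []
    for name in os.listdir(data_dir_path):
        prefix, sep, value = name.partition("=")
        # Ignore anything that is not of the form key=<digits>
        if prefix == key and sep and value.isdigit():
            ids.append(int(value))
    return sorted(ids)

# Demonstrate on a temporary directory laid out like a partitioned dataset
with tempfile.TemporaryDirectory() as tmp:
    for name in ["stock_id=10", "stock_id=2", "stock_id=0", ".ipynb_checkpoints"]:
        os.mkdir(os.path.join(tmp, name))
    found = list_partition_ids(tmp)

print(found)  # [0, 2, 10]
```

Unlike the original helper, this returns integers rather than strings, which still works with the `f"stock_id={stock_id}"` path construction used by the loaders.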

In [18]:
%%time
# Get list of stock ids
train_stock_ids = get_stock_ids_list(train_book_path)
test_stock_ids = get_stock_ids_list(test_book_path)

# Read train data
is_train_test = "train"
# Create worker pool and read
pool         = multiprocessing.Pool(processes=cpus)
train_book   = pd.concat(pool.map(load_book_stock_id_data, train_stock_ids[0:2]))
train_trade  = pd.concat(pool.map(load_trade_stock_id_data, train_stock_ids[0:2]))
train_labels = pd.read_csv(train_labels_path)
# Close worker pool
pool.close()
pool.join()

# Read test data
is_train_test = "test"
# Create worker pool and read
pool        = multiprocessing.Pool(processes=cpus)
test_book   = pd.concat(pool.map(load_book_stock_id_data, test_stock_ids))
test_trade  = pd.concat(pool.map(load_trade_stock_id_data, test_stock_ids))
test_labels = pd.read_csv(test_labels_path)
# Close worker pool
pool.close()
pool.join()

# Read sample submission
sample_sub = pd.read_csv(sample_sub_path)

# Print data dimensions
print("TRAIN DATA DIMENSIONS")
print(f"train_book shape: {train_book.shape}")
print(f"train_trade shape: {train_trade.shape}")
print(f"train_labels shape: {train_labels.shape}")

print("\nTEST DATA DIMENSIONS")
print(f"test_book shape: {test_book.shape}")
print(f"test_trade shape: {test_trade.shape}")
print(f"test_labels shape: {test_labels.shape}\n")

TRAIN DATA DIMENSIONS
train_book shape: (2425085, 11)
train_trade shape: (419653, 6)
train_labels shape: (428932, 3)

TEST DATA DIMENSIONS
test_book shape: (3, 11)
test_trade shape: (3, 6)
test_labels shape: (3, 3)

CPU times: user 241 ms, sys: 1.97 s, total: 2.21 s
Wall time: 2.71 s


## Data Preparation
### Feature Engineering

In [41]:
# Define helper functions for data manipulation
def get_log_return(list_stock_prices):
    return np.log(list_stock_prices).diff()


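def get_wap(df_book, col_bid_price, col_ask_price, col_bid_size=None, col_ask_size=None):
    """
    Returns Weighted Average Price: each side's price is weighted by the
    size resting on the opposite side of the book.
    """
    # Assumption: size columns mirror the price column names (bid_price1 -> bid_size1)
    col_bid_size = col_bid_size or col_bid_price.replace("price", "size")
    col_ask_size = col_ask_size or col_ask_price.replace("price", "size")
    wap_numerator   = (df_book[col_bid_price] * df_book[col_ask_size] +
                       df_book[col_ask_price] * df_book[col_bid_size])
    wap_denominator = df_book[col_bid_size] + df_book[col_ask_size]
    return wap_numerator / wap_denominator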


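def get_vol_wap(df_book, col_stock_id, col_time_id, col_wap):
    """
    Returns the log returns of the Weighted Average Price within each
    stock ID / time ID group. Missing returns, such as the first of each
    group, are set to 0.
    """
    # group_keys=False keeps the original row index so the result aligns with df_book
    vol_wap = df_book.groupby(by=[col_stock_id, col_time_id], group_keys=False)[col_wap].apply(get_log_return)
    vol_wap = vol_wap.fillna(0)
    return vol_wap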


def get_trade_log_return(df_trade, col_stock_id, col_time_id, col_price):
    """
    Returns the Log Return at each time ID
    """
    trade_log_return = df_trade.groupby([col_stock_id, col_time_id], group_keys=False)[col_price].apply(get_log_return)
    trade_log_return = trade_log_return.fillna(0)
    return trade_log_return


def rmspe(y_true, y_pred):
    """
    Returns the Root Mean Squared Percentage Error
    """
    # Errors are scaled by the true value, so y_true must not contain zeros
    return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))
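
As a quick sanity check of the metric, the sketch below evaluates it on two made-up targets. The function is redefined here only so the example is self-contained (it uses the same formula as the helper above), and the numbers are illustrative.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean squared percentage error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))

y_true = [0.002, 0.004]
y_pred = [0.001, 0.005]

# Relative errors are 0.5 and -0.25, so the score is sqrt((0.25 + 0.0625) / 2)
score = rmspe(y_true, y_pred)
print(round(score, 4))  # 0.3953
```

Both predictions are off by the same absolute amount, but the error on the smaller target counts for more because each error is divided by the true value.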

In [45]:
train_trade

Unnamed: 0,time_id,seconds_in_bucket,price,size,order_count,stock_id
0,5,21,1.002301,326,12,0
1,5,46,1.002778,128,4,0
2,5,50,1.002818,55,1,0
3,5,57,1.003155,121,5,0
4,5,68,1.003646,4,1,0
...,...,...,...,...,...,...
296205,32767,579,0.999010,81,3,1
296206,32767,587,0.999109,50,1,1
296207,32767,588,0.999010,126,2,1
296208,32767,592,0.999109,1,1,1
