This notebook is intended for beginners who want to understand the competition and what a basic submission requires.
<br>
<b>I myself am a beginner and hope to learn from this exercise.</b>

 # **Importing the required libraries**

In [None]:
import os
import pandas as pd
import numpy as np

# Loading data and understanding the data.

In [None]:
train = pd.read_csv('../input/optiver-realized-volatility-prediction/train.csv')

In [None]:
len(train['stock_id'].unique())

This CSV contains the stock_id (112 unique stocks in total) and the time_id, along with the target, i.e. the realized volatility to be predicted for that stock and time_id.

# Loading Parquet Files

As an example, we load the book data of a single stock for now to inspect the format and draw some initial insights.

In [None]:
book_train_stock_id = pd.read_parquet(f'../input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=0/')

# len(train['time_id'].unique())
book_train_stock_id.head()

As different stocks trade at different price levels on the market, we take the ratio of the best offer price to the best bid price to calculate the bid-ask spread.

The formula for the bid-ask spread can be written as:
$$BidAskSpread = \frac{BestOffer}{BestBid} - 1$$

Source: https://www.kaggle.com/jiashenliu/introduction-to-financial-concepts-and-data
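
To make the ratio concrete, here is a minimal sketch on a made-up two-row order book. The numbers and the `toy_book` name are assumptions for illustration only; the column names mirror the book data loaded above.

```python
import pandas as pd

# Hypothetical order book snapshot with two price levels on each side.
toy_book = pd.DataFrame({
    'bid_price1': [0.99, 1.00],
    'bid_price2': [0.98, 0.99],
    'ask_price1': [1.01, 1.02],
    'ask_price2': [1.02, 1.03],
})

# Best offer = lowest ask price, best bid = highest bid price.
best_offer = toy_book[['ask_price1', 'ask_price2']].min(axis=1)
best_bid = toy_book[['bid_price1', 'bid_price2']].max(axis=1)

toy_book['bid_ask_spread'] = best_offer / best_bid - 1
print(toy_book['bid_ask_spread'])
```

Because the spread is a ratio, a stock quoted around 1.00 and a stock quoted around 100.00 can be compared on the same scale.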

In [None]:
book_train_stock_id['bid_ask_spread'] = book_train_stock_id[['ask_price1', 'ask_price2']].min(axis=1)/book_train_stock_id[['bid_price1', 'bid_price2']].max(axis=1) - 1
book_train_stock_id.head()

The definition and formula for the weighted average price (WAP) can be found at the link below.
https://www.kaggle.com/jiashenliu/introduction-to-financial-concepts-and-data?scriptVersionId=67183666&cellId=18

The WAP is the metric used to estimate the actual value of the stock under consideration.

It is calculated from the top-level (level 1) price and size information.
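
As a small worked example, the sketch below applies the same level 1 formula to a single made-up quote. `wap_level1` is a hypothetical helper written only for this illustration; it is not part of the competition data or any library.

```python
def wap_level1(bid_price, bid_size, ask_price, ask_size):
    """Weighted average price from the best bid and ask (hypothetical helper)."""
    # Each price is weighted by the size on the opposite side of the book.
    return (bid_price * ask_size + ask_price * bid_size) / (bid_size + ask_size)

# Made-up quote: 100 shares bid at 0.99, 300 shares offered at 1.01.
toy_wap = wap_level1(bid_price=0.99, bid_size=100, ask_price=1.01, ask_size=300)
print(round(toy_wap, 4))  # 0.995
```

The mid price of this quote is 1.00, but the larger ask size pulls the WAP towards the bid, reflecting the heavier selling interest.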

In [None]:
book_train_stock_id['wap'] = (book_train_stock_id['bid_price1'] * book_train_stock_id['ask_size1'] +
                            book_train_stock_id['ask_price1'] * book_train_stock_id['bid_size1']) / (
                            book_train_stock_id['bid_size1']+ book_train_stock_id['ask_size1'])

In [None]:
book_train_stock_id.groupby(by = ['time_id']).mean().head()

# Utility Functions

In [None]:
# Compute log returns from a series of prices.
# A log return measures the relative change in the stock value between two consecutive time steps.
# pandas' diff() returns the difference between each value and the value in the previous row,
# so the first element of the result is NaN.
def log_return(list_stock_prices):
    return np.log(list_stock_prices).diff()

# Compute the realized volatility from a series of log returns.
# Volatility measures how much the stock price has changed over a short period of time.
# If the log returns over consecutive time steps are large in magnitude, the stock is more volatile, and vice versa.
def realized_volatility(series_log_return):
    return np.sqrt(np.sum(series_log_return**2))

<b>The code below is used to run the 'log_return' function on each group of time_id.</b>
<br>
i.e. for every time_id, the 'wap' values are taken and the differences between consecutive log 'wap' values are returned as a series.
<br>
The NaN values (the first row of each time_id) are filled with 0 in the process.
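
The grouping step can be checked on a tiny made-up frame. The sketch below is an assumption-laden variant: it uses `groupby(...).transform(...)` rather than `apply` followed by `reset_index`, because `transform` returns a result aligned to the original rows. The `toy` frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up WAP values for two time_ids.
toy = pd.DataFrame({
    'time_id': [5, 5, 5, 11, 11],
    'wap': [1.00, 1.01, 1.00, 2.00, 2.02],
})

# Log returns are computed within each time_id, so the first row of every
# group has no previous value (NaN) and is filled with 0.
toy['log_return'] = (toy.groupby('time_id')['wap']
                        .transform(lambda s: np.log(s).diff())
                        .fillna(0))
print(toy)
```

Note that the first row of time_id 11 gets 0 and not log(2.00 / 1.00): returns are never computed across the boundary between two time_ids.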

In [None]:
book_train_stock_id['log_return'] = (book_train_stock_id.groupby(by = ['time_id'])['wap'].apply(log_return).
                                       reset_index(drop = True).
                                       fillna(0))

In [None]:
book_train_stock_id.head()

After getting the 'log_return' column for 'wap' of each 'time_id', we need to get the realized volatility using the formula provided in the link.
<br>
https://www.kaggle.com/jiashenliu/introduction-to-financial-concepts-and-data
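
In other words, the realized volatility is the square root of the sum of squared log returns, $\sigma = \sqrt{\sum_{t} r_{t-1,t}^2}$. A minimal sketch with made-up log returns (the values and variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up log returns for one time_id; the leading 0 stands for the filled NaN.
toy_log_returns = pd.Series([0.0, 0.01, -0.02, 0.02])

# Squares: 0 + 0.0001 + 0.0004 + 0.0004 = 0.0009, and sqrt(0.0009) = 0.03.
toy_realized_vol = np.sqrt(np.sum(toy_log_returns ** 2))
print(round(float(toy_realized_vol), 6))  # 0.03
```

The filled 0 in the first position contributes nothing to the sum, so filling the NaN values with 0 does not change the result.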

In [None]:
df_realized_vol = book_train_stock_id.groupby(by = ['time_id'])['log_return'].agg(realized_volatility)

The cell below computes the mean 'bid_ask_spread' for each 'time_id' group.

In [None]:
pd_ba_spread = book_train_stock_id.groupby(by = ['time_id'], as_index = False)['bid_ask_spread'].mean()

Merging the two results above on 'time_id' to get the feature columns.
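
The realized volatility result is a Series indexed by time_id, while the spread result is a DataFrame with a time_id column. A minimal sketch of such a merge on made-up values, with an explicit `reset_index()` added here for clarity (the names `realized_vol`, `ba_spread` and `features` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical per-time_id aggregates shaped like the two results above.
realized_vol = pd.Series([0.003, 0.001],
                         index=pd.Index([5, 11], name='time_id'),
                         name='log_return')
ba_spread = pd.DataFrame({'time_id': [5, 11],
                          'bid_ask_spread': [0.0008, 0.0004]})

# reset_index() turns the time_id index into a regular column before merging.
features = pd.merge(realized_vol.reset_index(), ba_spread,
                    on='time_id', how='left')
print(features.columns.tolist())  # ['time_id', 'log_return', 'bid_ask_spread']
```

The merged column keeps the Series name 'log_return', although at this point it holds the realized volatility of each time_id.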

In [None]:
merged_dataframe = pd.merge(df_realized_vol, pd_ba_spread, on=['time_id'], how='left')

In [None]:
merged_dataframe.head()

Assigning the stock_id value to the new column. We need to repeat the above process for all the stock_ids and create the final dataframe.

In [None]:
merged_dataframe['stock_id'] = 0

In [None]:
merged_dataframe.head()

In [None]:
train_dataSet = pd.merge(train, merged_dataframe, on = ['stock_id', 'time_id'], how = 'left')

In [None]:
train_dataSet.head()

In [None]:
y_train = train_dataSet['target']
X_train = train_dataSet.drop(['stock_id', 'time_id', 'target'], axis = 1)

After getting the train features and targets, we need to train the model on this data and run the prediction.

To do the training we need to put all the above steps in a function and loop over the available stock-ids and create the training dataset.

In [None]:
# Each step in this function was run above as a separate cell, so the intermediate outputs can be checked there.
# This helps in looking at the dataframes and understanding exactly what happens at each step of the code.

def get_stock_data(stock_id, train='train'):
    parquet_file_path = f'../input/optiver-realized-volatility-prediction/book_'+train+'.parquet/stock_id='+str(stock_id)+'/'
    book_train_stock_id = pd.read_parquet(parquet_file_path)
    
    book_train_stock_id['bid_ask_spread'] = book_train_stock_id[['ask_price1', 'ask_price2']].min(axis=1)/book_train_stock_id[['bid_price1', 'bid_price2']].max(axis=1) - 1
    book_train_stock_id['wap'] = (book_train_stock_id['bid_price1'] * book_train_stock_id['ask_size1'] +
                            book_train_stock_id['ask_price1'] * book_train_stock_id['bid_size1']) / (
                            book_train_stock_id['bid_size1']+ book_train_stock_id['ask_size1'])
    book_train_stock_id['log_return'] = (book_train_stock_id.groupby(by = ['time_id'])['wap'].apply(log_return).
                                       reset_index(drop = True).
                                       fillna(0))
    df_realized_vol = book_train_stock_id.groupby(by = ['time_id'])['log_return'].agg(realized_volatility)
    pd_ba_spread = book_train_stock_id.groupby(by = ['time_id'], as_index = False)['bid_ask_spread'].mean()
    
    merged_dataframe = pd.merge(df_realized_vol, pd_ba_spread, on=['time_id'], how='left')
    merged_dataframe['stock_id'] = stock_id
    
    return merged_dataframe

In [None]:
# The list of stock ids can be fetched from the train df.

stock_ids = train['stock_id'].unique()

# Calling the get_stock_data function for each stock_id to create a list of dataframes.
# These dataframes are then concatenated to form a single dataframe.
df_list = [get_stock_data(i) for i in stock_ids]
final_concat_list = pd.concat(df_list, ignore_index = True)

In [None]:
# Joining the train df with the above df on stock_id and time_id
train_dataSet = pd.merge(train, final_concat_list, on = ['stock_id', 'time_id'], how = 'left')

In [None]:
y_train = train_dataSet['target']
X_train = train_dataSet.drop(['stock_id', 'time_id', 'target'], axis = 1)

In [None]:
# Importing the required packages for training the ML model
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold

In [None]:
model = XGBRegressor(random_state = 0
                   #,n_estimators = 200
                   #,learning_rate = 0.1
                   #,subsample = 0.8
                   #,colsample_bytree = 0.8
                   ,n_jobs= - 1)

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
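# Evaluate the model. The 'neg_mean_absolute_error' scorer returns negated MAE values,
# so the sign is flipped to report the MAE itself.
scores = cross_val_score(model, X_train, y_train, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
print('Mean MAE: %.3f (%.3f)' % (-scores.mean(), scores.std()))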

In [None]:
test = pd.read_csv('../input/optiver-realized-volatility-prediction/test.csv')

# Build the same features for the test stocks, reading from the book_test parquet files.
test_stock_ids = test['stock_id'].unique()
test_df_list = [get_stock_data(i, train='test') for i in test_stock_ids]
test_final_concat_list = pd.concat(test_df_list, ignore_index = True)

test_dataSet = pd.merge(test, test_final_concat_list, on = ['stock_id', 'time_id'], how = 'left')
test_dataSet = test_dataSet.drop(['stock_id', 'time_id'], axis = 1)

y_pred = test_dataSet[['row_id']]
X_test = test_dataSet.drop(['row_id'], axis = 1).fillna(0)

In [None]:
model.fit(X_train, y_train)

In [None]:
y_pred = y_pred.assign(target = model.predict(X_test))
y_pred.to_csv('submission.csv',index = False)

In [None]:
y_pred.head()