In this document, we collect the data processing functions. The functions themselves are stored in ../proj_mod/data_processing.py, where proj_mod can be imported as a Python module when needed. The Python cells below serve as an example of importing proj_mod.

In [1]:
import sys
sys.path.append("../")

In [2]:
import pandas as pd

In [3]:
import numpy as np

In [4]:
from proj_mod import data_processing

## Book data harvesting function by stock and time id

We will first demonstrate the function book_for_stock(). 

In [5]:
df_book_0_5=data_processing.book_for_stock(str_file_path="../raw_data/kaggle_ORVP/book_train.parquet",stock_id=0,time_id=5)

In [6]:
df_book_0_5

Unnamed: 0,index,time_id,seconds_in_bucket,bid_price1,ask_price1,bid_price2,ask_price2,bid_size1,ask_size1,bid_size2,ask_size2,stock_id,wap,log_return
0,1,5,1,1.001422,1.002301,1.001370,1.002353,3,100,2,100,0,1.001448,0.000014
1,2,5,5,1.001422,1.002301,1.001370,1.002405,3,100,2,100,0,1.001448,0.000000
2,3,5,6,1.001422,1.002301,1.001370,1.002405,3,126,2,100,0,1.001443,-0.000005
3,4,5,7,1.001422,1.002301,1.001370,1.002405,3,126,2,100,0,1.001443,0.000000
4,5,5,11,1.001422,1.002301,1.001370,1.002405,3,100,2,100,0,1.001448,0.000005
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,297,5,585,1.003129,1.003749,1.003025,1.003801,100,3,26,3,0,1.003731,0.000245
297,298,5,586,1.003129,1.003749,1.002612,1.003801,100,3,2,3,0,1.003731,0.000000
298,299,5,587,1.003129,1.003749,1.003025,1.003801,100,3,26,3,0,1.003731,0.000000
299,300,5,588,1.003129,1.003749,1.002612,1.003801,100,3,2,3,0,1.003731,0.000000
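
The wap and log_return columns in the output above appear consistent with the usual weighted average price of the best bid and ask, followed by a log difference between consecutive snapshots. The sketch below illustrates that assumption only; it is not the actual body of book_for_stock(). The helper name add_wap_and_log_return is hypothetical, and the toy rows are made up for the example.

```python
import numpy as np
import pandas as pd


def add_wap_and_log_return(df_book: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add wap and log_return to one (stock_id, time_id) bucket."""
    df = df_book.sort_values("seconds_in_bucket").reset_index(drop=True)
    # Weighted average price: each best price is weighted by the size on the opposite side.
    df["wap"] = (
        df["bid_price1"] * df["ask_size1"] + df["ask_price1"] * df["bid_size1"]
    ) / (df["bid_size1"] + df["ask_size1"])
    # Log return between consecutive snapshots; the first snapshot has no predecessor.
    df["log_return"] = np.log(df["wap"]).diff()
    # Drop the first row (NaN log return) and keep the old position in an "index" column.
    return df.dropna(subset=["log_return"]).reset_index()


# Toy input (assumed values, not read from book_train.parquet).
toy_book = pd.DataFrame(
    {
        "seconds_in_bucket": [5, 6, 7],
        "bid_price1": [1.001422, 1.001422, 1.001422],
        "ask_price1": [1.002301, 1.002301, 1.002301],
        "bid_size1": [3, 3, 3],
        "ask_size1": [100, 126, 126],
    }
)
toy_out = add_wap_and_log_return(toy_book)
print(toy_out[["index", "seconds_in_bucket", "wap", "log_return"]])
```

Dropping the first snapshot would explain why the index column in the output above starts at 1 rather than 0, although that is an inference from the displayed frame.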


## Trade data harvesting function by stock and time id

The function trade_for_stock() is analogous: it takes the same arguments and returns the trade data for one stock id and time id.

In [7]:
df_trade_0_5=data_processing.trade_for_stock(str_file_path="../raw_data/kaggle_ORVP/trade_train.parquet",stock_id=0,time_id=5)

In [8]:
df_trade_0_5.head()

Unnamed: 0,time_id,seconds_in_bucket,price,size,order_count,stock_id
0,5,21,1.002301,326,12,0
1,5,46,1.002778,128,4,0
2,5,50,1.002818,55,1,0
3,5,57,1.003155,121,5,0
4,5,68,1.003646,4,1,0


In the following, we show an example of the realized_vol function, which returns the realized volatility together with the row id.

In [9]:
rv, row= data_processing.realized_vol(df_book_0_5)

In [10]:
rv

np.float64(0.004499364172786558)

In [11]:
row

'0-5'
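
Realized volatility is commonly defined as the square root of the sum of squared log returns over the bucket. The sketch below assumes that definition and assumes the row id is built as "stock_id-time_id", as the '0-5' output suggests. The name realized_vol_sketch is hypothetical and this is not necessarily how data_processing.realized_vol is written.

```python
import numpy as np
import pandas as pd


def realized_vol_sketch(df_book: pd.DataFrame):
    """Hypothetical sketch: return (realized volatility, row id) for one bucket."""
    log_returns = df_book["log_return"].to_numpy()
    # Square root of the sum of squared log returns.
    rv = np.sqrt(np.sum(log_returns ** 2))
    # Assumed row id format: "<stock_id>-<time_id>".
    row_id = f"{int(df_book['stock_id'].iloc[0])}-{int(df_book['time_id'].iloc[0])}"
    return rv, row_id


# Toy input with made-up log returns.
toy_returns = pd.DataFrame(
    {"stock_id": [0, 0], "time_id": [5, 5], "log_return": [0.003, -0.004]}
)
toy_rv, toy_row = realized_vol_sketch(toy_returns)
```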

## Time series creation function

For each row id "a-b" (stock id a and time id b), we have the book data. 
We compute the RV on sub-intervals of seconds_in_bucket (e.g. the interval [0,10]) for all disjoint sub-intervals within [0,600] (e.g. [0,10], [11,21], ...). 
This lets us work around the fact that the number of seconds_in_bucket values differs from one row_id to the next. 
The resulting sequence of RVs can serve as time series data. 
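
The idea above can be sketched as follows. This is a minimal illustration, not the actual create_RV_timeseries: the name rv_timeseries_sketch, the parameters width and horizon, and the half-open bins [0,10), [10,20), ... are all assumptions, and the real function may place the interval boundaries differently.

```python
import numpy as np
import pandas as pd


def rv_timeseries_sketch(df_book: pd.DataFrame, width: int = 10, horizon: int = 600) -> np.ndarray:
    """Hypothetical sketch: one RV value per fixed-width sub-interval of the bucket."""
    seconds = df_book["seconds_in_bucket"].to_numpy().astype(int)
    squared = df_book["log_return"].to_numpy() ** 2
    n_bins = int(np.ceil(horizon / width))
    # Map each snapshot to its sub-interval; clip so a value at the horizon lands in the last bin.
    bins = np.minimum(seconds // width, n_bins - 1)
    # Sum squared log returns per sub-interval; sub-intervals without snapshots get 0.
    sums = np.bincount(bins, weights=squared, minlength=n_bins)
    return np.sqrt(sums)


# Toy input with made-up log returns.
toy_series_in = pd.DataFrame(
    {
        "seconds_in_bucket": [1, 5, 12, 25],
        "log_return": [0.003, 0.004, 0.001, -0.002],
    }
)
toy_series = rv_timeseries_sketch(toy_series_in, width=10, horizon=30)
```

Because the number of sub-intervals depends only on width and horizon, every row id yields an array of the same length, regardless of how many snapshots it contains.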

We wrote a function that builds this time series feature. 

In [18]:
arr_RV_0_5=data_processing.create_RV_timeseries(df_in=df_book_0_5)

In [19]:
arr_RV_0_5

array([1.49832934e-05, 1.03072451e-05, 1.05685058e-03, 0.00000000e+00,
       8.98304774e-04, 8.92053424e-04, 4.54066889e-04, 9.91255076e-04,
       3.65205270e-05, 3.70139248e-05, 2.99336299e-05, 3.62542885e-04,
       9.60843370e-04, 3.78928830e-04, 8.51440932e-04, 1.09884521e-03,
       5.05653081e-04, 8.49980121e-04, 6.40812165e-04, 4.85923409e-04,
       7.97586512e-04, 4.25454588e-04, 7.69266746e-04, 5.20608145e-04,
       2.68669240e-06, 6.78738335e-04, 6.79358412e-04, 1.47598809e-04,
       4.78524245e-04, 1.65369514e-05, 6.23763457e-04, 7.32826186e-04,
       1.08703048e-03, 6.87840067e-04, 3.12315713e-04, 6.45995291e-06,
       2.41529705e-04, 4.29347021e-04, 3.58229596e-06, 7.15967292e-04,
       1.38924412e-03, 9.61347656e-06, 4.77481691e-04, 3.43513251e-04,
       2.23079881e-04, 0.00000000e+00, 6.32441530e-04, 6.41041735e-04,
       1.42769361e-04, 6.56544153e-05, 5.48786596e-06, 3.36101519e-04,
       5.06175206e-04, 5.92941392e-04, 6.63459680e-04, 6.13845227e-04,
      