# Data Preparation

We would like to alter our LOB data for our machine learning purposes.

## Incorporation of trade book in data selection from Limit Order Book
We import the trade book of January 2017

In [1]:
import pandas as pd
import string
from prepare_data import *
project_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
filedates, filenames = read('GARAN')

with open(project_dir + '/DATA/GARAN_TRADE' + '/GARAN_2017-01.csv', 'rb') as input:
    trade_book = pd.read_csv(input,index_col =False,names =list(string.ascii_uppercase)[:-5])
trade_book.head(5)

Unnamed: 0,A,B,C,D,E,F,G,H,I,J,...,L,M,N,O,P,Q,R,S,T,U
0,2017-01-17,GARAN.E,9705841000018D0,,A,2,7.83,15.66,5E6372810005E9F6,09:55:01,...,P_ESLESTIRME,1,,P,244489262,122059551,2017-01-17 06:55:10,H,1,2017-01-17 09:55:11
1,2017-01-19,GARAN.E,970A04100001792,,S,6127,7.9,48403.3,5E66428100054DE1,09:55:01,...,P_ESLESTIRME,1,,P,246906706,123267358,2017-01-19 06:55:08,H,1,2017-01-19 09:55:09
2,2017-01-17,GARAN.E,9705841000018AF,,S,51928,7.83,406596.24,5E63728100057FD6,09:55:01,...,P_ESLESTIRME,1,,P,244489131,122059486,2017-01-17 06:55:10,H,1,2017-01-17 09:55:10
3,2017-01-17,GARAN.E,9705841000018A1,,A,13742,7.83,107599.86,5E63728100054B72,09:55:01,...,P_ESLESTIRME,1,,P,244489076,122059458,2017-01-17 06:55:10,H,1,2017-01-17 09:55:10
4,2017-01-17,GARAN.E,9705841000018ED,,S,25,7.83,195.75,5E63728100055496,09:55:01,...,P_ESLESTIRME,1,,P,244489341,122059591,2017-01-17 06:55:10,H,1,2017-01-17 09:55:11


### Session types
We are interested in the 'L' column, where we get information on the type of session at a given time. The times are given in the column 'J'. We first look at the types of sessions. We have matching, continuous and closing sessions:

In [2]:
trade_book = trade_book[['A','J','L']]
trade_book['L'].unique()

array(['P_ESLESTIRME', 'P_SUREKLI_ISLEM', 'P_KAPANIS_FIY_ISLEM'],
      dtype=object)

P_SUREKLI_ISLEM stands for 'continuous session'. We would like to include times in the order book which fall into continuous session times. To do that we determine the continuous session intervals and take the times inside these intervals only.

Let's take the day Jan 2nd as an example:
First, we take the corresponding entries in the trade book for Jan 2nd and make sure that the times are ordered. We sort the entries by the time column 'J' to do that.

In [3]:
date = '2017-01-02'

if trade_book['J'].apply(lambda x: True if len(x)==15 else False).sum() == len(trade_book): #for November
    trade_book['J'] = trade_book['J'].apply(make_clock)
    
trade_book_df = trade_book[trade_book['A']==date].sort_values(by=['J'])

print('Checking if trade book times are in ascending order...')
for i in range(len(trade_book_df)-1):
    if get_time_inbetween(make_clock(trade_book_df.iloc[i]['J']),make_clock(trade_book_df.iloc[i+1]['J']),'ms') < 0:
        raise Exception('Given trade book times are not correctly ordered.')
trade_book_df

Checking if trade book times are in ascending order...


Unnamed: 0,A,J,L
996,2017-01-02,09:55:08,P_ESLESTIRME
1071,2017-01-02,09:55:08,P_ESLESTIRME
1070,2017-01-02,09:55:08,P_ESLESTIRME
1069,2017-01-02,09:55:08,P_ESLESTIRME
1068,2017-01-02,09:55:08,P_ESLESTIRME
...,...,...,...
718717,2017-01-02,18:08:22,P_KAPANIS_FIY_ISLEM
719168,2017-01-02,18:09:54,P_KAPANIS_FIY_ISLEM
719171,2017-01-02,18:09:54,P_KAPANIS_FIY_ISLEM
719205,2017-01-02,18:09:59,P_KAPANIS_FIY_ISLEM


#### Simplifying trade book info
The trade books of GARAN stock for 2017 have seconds precision except for November. In November we have miliseconds precision.

As it can be seen in the above data frame, the trade book has multiple entries with the same time and session type. To simplify things, we look at each unique time entry in the trade book whether the session type changes for that time point. Because if not, we could simply pass on one entry for each unique time entry with its corresponding session type:

In [4]:
print('Looking if trade book has one session type per second/ms...')
trade_book_prec = get_prec_of_clock(trade_book_df.iloc[0]['J'])
unique_seconds = trade_book_df['J'].unique()
for time in unique_seconds:
    if len(trade_book_df[trade_book_df['J']==time]['L'].unique())!=1:
        raise Exception('Session type changes inside one unit of precision: ', time)
        
trade_book_simplified = trade_book_df['L'].to_frame()
trade_book_simplified.index=pd.Index(trade_book_df['J'].apply(make_clock).values)
trade_book_simplified = get_only_last_entry(trade_book_simplified,trade_book_prec)
trade_book_simplified

Looking if trade book has one session type per second/ms...


100%|██████████| 1912/1912 [00:10<00:00, 190.04it/s]


Unnamed: 0_level_0,L
time,Unnamed: 1_level_1
09:55:08,P_ESLESTIRME
10:00:00,P_SUREKLI_ISLEM
10:00:01,P_SUREKLI_ISLEM
10:00:02,P_SUREKLI_ISLEM
10:00:03,P_SUREKLI_ISLEM
...,...
18:05:02,P_ESLESTIRME
18:08:00,P_KAPANIS_FIY_ISLEM
18:08:22,P_KAPANIS_FIY_ISLEM
18:09:54,P_KAPANIS_FIY_ISLEM


Now, for instance, we have one time entry 09:55:08 with its session type instead of multiple identical ones.

#### LOB with continuous session times

We use the simplified trade book to get continuous session time blocks. We take the time entries from our LOB, which fall into these time blocks. Then, from the LOB with continuous session times, we only take the last entry of each minute. Note that while taking the last entry of each minute from the LOB with continuous session times, the missing minutes that fall into a continuous session time block are forward filled.

In [5]:
filename = filenames[filedates.index(date)]
df = get_df(filename).astype(float) #Importing LOB Data
df = df[df.columns[df.columns.isin([i for i in df.columns if not i.count('ord')])]] #eliminating order column
df.index = pd.Index(get_clocks(df)[-1], name='time')

times = get_continuous_trading_times(trade_book_simplified)
LOB_times = [] ; LOB_conti_ffilled = pd.DataFrame([])
print('Continuous session blocks: ',times)
for time_pair in times:
    t1 = make_clock(time_pair[0]); t2 = make_clock(time_pair[1])
    LOB_times_to_add = df.index[[True if get_time_inbetween(t1,i,'ms')>=0  and get_time_inbetween(i,t2,'ms')>=0 else False for i in df.index]].to_list()
    LOB_times += LOB_times_to_add
    LOB_conti_ffilled = pd.concat([LOB_conti_ffilled,get_only_last_entry(df.loc[LOB_times_to_add],'m',ffill=True)],axis=0)

LOB_conti = df.loc[LOB_times]

Continuous session blocks:  [['10:00:00', '13:00:00'], ['14:00:00', '17:59:59']]


100%|██████████| 180/180 [00:00<00:00, 565.32it/s]
100%|██████████| 237/237 [00:00<00:00, 443.39it/s]


Now we have the LOB with only the continuous session times:

In [6]:
LOB_conti

Unnamed: 0_level_0,bid1,bsize1,bid2,bsize2,bid3,bsize3,bid4,bsize4,bid5,bsize5,ask1,asize1,ask2,asize2,ask3,asize3,ask4,asize4,ask5,asize5
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
10:00:00.123,7.60,242105.0,7.59,43187.0,7.58,97797.0,7.57,152626.0,7.56,54095.0,7.61,266342.0,7.62,121087.0,7.63,161225.0,7.64,71092.0,7.65,443326.0
10:00:00.227,7.60,132105.0,7.59,43187.0,7.58,97797.0,7.57,152626.0,7.56,54095.0,7.61,266342.0,7.62,121087.0,7.63,161225.0,7.64,71092.0,7.65,443326.0
10:00:00.334,7.60,132105.0,7.59,48187.0,7.58,107797.0,7.57,152626.0,7.56,64095.0,7.61,266342.0,7.62,121087.0,7.63,161225.0,7.64,71092.0,7.65,443326.0
10:00:00.441,7.60,132105.0,7.59,48187.0,7.58,112797.0,7.57,157626.0,7.56,114095.0,7.61,280562.0,7.62,121087.0,7.63,161225.0,7.64,71092.0,7.65,443326.0
10:00:00.550,7.60,132105.0,7.59,48187.0,7.58,112797.0,7.57,157626.0,7.56,116595.0,7.61,282562.0,7.62,121087.0,7.63,161225.0,7.64,71092.0,7.65,443326.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17:59:57.855,7.59,1085081.0,7.58,1047275.0,7.57,825530.0,7.56,455417.0,7.55,470547.0,7.60,880323.0,7.61,711568.0,7.62,416412.0,7.63,956605.0,7.64,721347.0
17:59:58.055,7.59,1085078.0,7.58,1047275.0,7.57,825530.0,7.56,455417.0,7.55,470547.0,7.60,880323.0,7.61,711568.0,7.62,416412.0,7.63,956605.0,7.64,721347.0
17:59:58.357,7.59,1085075.0,7.58,1047275.0,7.57,825530.0,7.56,455417.0,7.55,470547.0,7.60,880323.0,7.61,711568.0,7.62,416412.0,7.63,956605.0,7.64,721347.0
17:59:58.760,7.59,1085072.0,7.58,1047275.0,7.57,825530.0,7.56,455417.0,7.55,470547.0,7.60,880323.0,7.61,711568.0,7.62,416412.0,7.63,956605.0,7.64,721347.0


And the LOB with the continuous session minutes, which is also forward filled:

In [7]:
LOB_conti_ffilled

Unnamed: 0,bid1,bsize1,bid2,bsize2,bid3,bsize3,bid4,bsize4,bid5,bsize5,ask1,asize1,ask2,asize2,ask3,asize3,ask4,asize4,ask5,asize5
10:00,7.60,271744.0,7.59,304569.0,7.58,205405.0,7.57,182835.0,7.56,200495.0,7.61,128832.0,7.62,154054.0,7.63,177426.0,7.64,236042.0,7.65,542626.0
10:01,7.61,115256.0,7.60,303769.0,7.59,304569.0,7.58,206405.0,7.57,182835.0,7.62,230906.0,7.63,180890.0,7.64,253542.0,7.65,527628.0,7.66,550249.0
10:02,7.60,291328.0,7.59,304770.0,7.58,206405.0,7.57,189835.0,7.56,200495.0,7.61,191558.0,7.62,230142.0,7.63,192452.0,7.64,253567.0,7.65,527628.0
10:03,7.61,114014.0,7.60,347112.0,7.59,204770.0,7.58,306805.0,7.57,190735.0,7.62,489862.0,7.63,198885.0,7.64,253817.0,7.65,530878.0,7.66,550999.0
10:04,7.60,364947.0,7.59,207470.0,7.58,307372.0,7.57,191235.0,7.56,200995.0,7.61,328789.0,7.62,340852.0,7.63,192454.0,7.64,303817.0,7.65,530878.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17:55,7.59,474125.0,7.58,1124599.0,7.57,1015430.0,7.56,356726.0,7.55,390977.0,7.60,723798.0,7.61,882065.0,7.62,564761.0,7.63,816739.0,7.64,721347.0
17:56,7.59,949475.0,7.58,1124599.0,7.57,1015430.0,7.56,356726.0,7.55,390978.0,7.60,739756.0,7.61,842590.0,7.62,564761.0,7.63,816739.0,7.64,721347.0
17:57,7.59,954050.0,7.58,1124749.0,7.57,1005430.0,7.56,356726.0,7.55,390977.0,7.60,778348.0,7.61,842590.0,7.62,515761.0,7.63,816739.0,7.64,721347.0
17:58,7.59,990706.0,7.58,785753.0,7.57,1114321.0,7.56,455617.0,7.55,470547.0,7.60,771361.0,7.61,842390.0,7.62,624652.0,7.63,1006605.0,7.64,721347.0


### Liquidity measures 

At this point we can also calculate the Liquidity Measures. We use here the LOB with continuous session entries to calculate them. We then forward fill the missing minutes.

In [8]:
df_liq = get_all(LOB_conti,LOB_conti.index[0])
data = []
for t in LOB_conti_ffilled.index:
    if df_liq.index.to_list().count(t):
        data.append(df_liq.loc[t].to_list())
    else:
        data.append([None]*len(df_liq.columns))
df_liq_ffilled = pd.DataFrame(data,index = LOB_conti_ffilled.index,columns=df_liq.columns).ffill()
df_liq_ffilled

100%|██████████| 20/20 [00:08<00:00,  2.32it/s]
100%|██████████| 417/417 [00:22<00:00, 18.29it/s]


Unnamed: 0,ABD1,ABD2,ABD5,WABD1,WABD2,WABD5,AAD1,AAD2,AAD5,WAAD1,...,ATD2,ATD5,WATD1,WATD2,WATD5,AS,WAS,bid slope,ask slope,order slope
10:00,2.506354e+05,5.169619e+05,1.071270e+06,1.904829e+06,3.926247e+06,8.122535e+06,194588.134021,3.384470e+05,1.235212e+06,1.480816e+06,...,8.554089e+05,2.306482e+06,3.385644e+06,6.503268e+06,1.755421e+07,0.01,0.001315,2.331006e-08,2.464418e-08,2.369486e-08
10:01,2.160129e+05,5.204134e+05,1.148534e+06,1.642010e+06,3.953536e+06,8.711464e+06,160540.481481,3.518645e+05,1.447369e+06,1.222582e+06,...,8.722778e+05,2.595903e+06,2.864592e+06,6.634675e+06,1.977052e+07,0.01,0.001314,1.979427e-08,2.073803e-08,2.220880e-08
10:02,1.950150e+05,4.917780e+05,1.146727e+06,1.482786e+06,3.736854e+06,8.699438e+06,198375.057471,4.005659e+05,1.573251e+06,1.510941e+06,...,8.923439e+05,2.719978e+06,2.993727e+06,6.789505e+06,2.072206e+07,0.01,0.001314,1.986821e-08,1.880670e-08,2.107064e-08
10:03,2.288847e+05,5.394822e+05,1.243707e+06,1.739753e+06,4.098344e+06,9.432642e+06,228880.020408,5.428345e+05,1.643651e+06,1.743353e+06,...,1.082317e+06,2.887359e+06,3.483107e+06,8.234715e+06,2.198709e+07,0.01,0.001314,1.845517e-08,1.732490e-08,1.907946e-08
10:04,3.143170e+05,5.365475e+05,1.235334e+06,2.388876e+06,4.076001e+06,9.367577e+06,187883.696203,5.275725e+05,1.696028e+06,1.430201e+06,...,1.064120e+06,2.931363e+06,3.819077e+06,8.095009e+06,2.231730e+07,0.01,0.001315,1.946275e-08,1.584689e-08,1.817554e-08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17:55,5.440809e+05,1.665096e+06,3.430408e+06,4.129574e+06,1.262687e+07,2.597874e+07,697725.714286,1.579934e+06,3.684960e+06,5.302715e+06,...,3.245030e+06,7.115367e+06,9.432289e+06,2.464319e+07,5.405795e+07,0.01,0.001317,5.696135e-09,6.709726e-09,6.585935e-09
17:56,8.142430e+05,1.938842e+06,3.701976e+06,6.180105e+06,1.470457e+07,2.804010e+07,724950.931034,1.588985e+06,3.691832e+06,5.509627e+06,...,3.527827e+06,7.393808e+06,1.168973e+07,2.678949e+07,5.617131e+07,0.01,0.001317,5.283904e-09,6.727365e-09,6.059816e-09
17:57,9.475161e+05,2.072160e+06,3.832627e+06,7.191647e+06,1.571645e+07,2.903180e+07,749267.966667,1.591858e+06,3.678969e+06,5.694437e+06,...,3.664018e+06,7.511596e+06,1.288608e+07,2.782300e+07,5.706472e+07,0.01,0.001317,5.098364e-09,6.795103e-09,5.815857e-09
17:58,1.011729e+06,1.938367e+06,3.765863e+06,7.679020e+06,1.470294e+07,2.852484e+07,776181.700000,1.618702e+06,3.747099e+06,5.898981e+06,...,3.557068e+06,7.512962e+06,1.357800e+07,2.701349e+07,5.707691e+07,0.01,0.001317,5.488918e-09,6.713209e-09,5.851751e-09


### Normalization of Data

As the last step of data preprocessing, we apply normalization on prices, sizes and times. Prices are normalized by the mid price of the first
continuous session entry, in this case the entry at 10:00:00.123 in the above data frame. Bid and ask side volumes are 
normalized by the total volume on their respective side at each row. The time index is replaced by equidistant points between 0 and 1. The data frame below represents how each day's preprocessed data looks like:

In [9]:
midprice = (LOB_conti.iloc[0]['ask1'] + LOB_conti.iloc[0]['bid1'])*0.5
result_df = normalize(pd.concat([LOB_conti_ffilled,df_liq_ffilled], axis=1), midprice)
result_df = result_df[[i for i in result_df.columns if i!='mid']]
result_df

Unnamed: 0,bid1,bsize1,bid2,bsize2,bid3,bsize3,bid4,bsize4,bid5,bsize5,...,ATD2,ATD5,WATD1,WATD2,WATD5,AS,WAS,bid slope,ask slope,order slope
0.000000,0.999343,0.233247,0.998028,0.261422,0.996713,0.176306,0.995398,0.156933,0.994083,0.172092,...,8.554089e+05,2.306482e+06,3.385644e+06,6.503268e+06,1.755421e+07,0.01,0.001315,2.331006e-08,2.464418e-08,2.369486e-08
0.002387,1.000657,0.103570,0.999343,0.272969,0.998028,0.273688,0.996713,0.185477,0.995398,0.164297,...,8.722778e+05,2.595903e+06,2.864592e+06,6.634675e+06,1.977052e+07,0.01,0.001314,1.979427e-08,2.073803e-08,2.220880e-08
0.004773,0.999343,0.244232,0.998028,0.255501,0.996713,0.173038,0.995398,0.159146,0.994083,0.168083,...,8.923439e+05,2.719978e+06,2.993727e+06,6.789505e+06,2.072206e+07,0.01,0.001314,1.986821e-08,1.880670e-08,2.107064e-08
0.007160,1.000657,0.097998,0.999343,0.298351,0.998028,0.176005,0.996713,0.263706,0.995398,0.163941,...,1.082317e+06,2.887359e+06,3.483107e+06,8.234715e+06,2.198709e+07,0.01,0.001314,1.845517e-08,1.732490e-08,1.907946e-08
0.009547,0.999343,0.286904,0.998028,0.163103,0.996713,0.241641,0.995398,0.150340,0.994083,0.158013,...,1.064120e+06,2.931363e+06,3.819077e+06,8.095009e+06,2.231730e+07,0.01,0.001315,1.946275e-08,1.584689e-08,1.817554e-08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0.990453,0.998028,0.141031,0.996713,0.334517,0.995398,0.302044,0.994083,0.106110,0.992768,0.116298,...,3.245030e+06,7.115367e+06,9.432289e+06,2.464319e+07,5.405795e+07,0.01,0.001317,5.696135e-09,6.709726e-09,6.585935e-09
0.992840,0.998028,0.247439,0.996713,0.293077,0.995398,0.264627,0.994083,0.092965,0.992768,0.101891,...,3.527827e+06,7.393808e+06,1.168973e+07,2.678949e+07,5.617131e+07,0.01,0.001317,5.283904e-09,6.727365e-09,6.059816e-09
0.995227,0.998028,0.248974,0.996713,0.293520,0.995398,0.262382,0.994083,0.093093,0.992768,0.102031,...,3.664018e+06,7.511596e+06,1.288608e+07,2.782300e+07,5.706472e+07,0.01,0.001317,5.098364e-09,6.795103e-09,5.815857e-09
0.997613,0.998028,0.259555,0.996713,0.205859,0.995398,0.291941,0.994083,0.119367,0.992768,0.123278,...,3.557068e+06,7.512962e+06,1.357800e+07,2.701349e+07,5.707691e+07,0.01,0.001317,5.488918e-09,6.713209e-09,5.851751e-09


As a side note, after processing each day's data in this way results in entries of 420 minutes for each day, indicating 7 hours of continuous session time per day, except for the days shown below:

In [10]:
filenames = os.listdir(project_dir + '/CODES/dataset/LOB_LIQ_VARS/GARAN')
[filenames.pop(i) for i,k in enumerate(filenames) if k.split('.')[-1]!='npy']
filenames.sort()
for filename in filenames:
    with open(project_dir + '/CODES/dataset/LOB_LIQ_VARS/GARAN/' + f'{filename}', 'rb') as input:
        data = np.load(input,allow_pickle='TRUE').item()
        assert data['X'].shape[0] == data['y'].shape[0]
        no_of_entries = data['y'].shape[0] + 60
        if no_of_entries != 420:
            print(f'{filename[:-4]}: ',no_of_entries)

2017-03-29:  419
2017-04-14:  419
2017-06-23:  419
2017-08-02:  419
2017-08-03:  419
2017-08-10:  419
2017-08-18:  419
2017-08-23:  416
2017-08-29:  418
2017-08-31:  150


# Input Data

From the preprocessed data of each day, we would like to create 60 minute rolling windows and use it as input to the neural network. Each day's set of rolling windows is used isolated from another, i.e. forecasting takes place within one day's set and does not involve another day's rolling windows.

The quantities we would like to forecast are the midprice and expectation and variance of prices of the following minute of each rolling window. This would result in vector with a length of 5, since expectation and variance are calculated for both ask and bid side seperately.

Note that a day with 420 minute entries would result in a set of 360 rolling windows and not 361, since the last entry of the day is to be forecast.

## Batches

Our model accepts a rolling window of 60 minutes as input, which corresponds to a tensor with shape (60, no. of features), where the features could be normalized time of entry, prices, volumes and the liquidity measures.

We can decide how many days one batch will consist of, for both training and validation batches. Our batch size would be then the total number of rolling windows of all days. This is demonstrated below, where we have chosen the number of training and validation days to be 10 and 2 respectively:

In [11]:
from torchsummary import summary
from train import *

train_dataloader , val_dataloader = load_data()

print(f'No of days in training batch: {train_dataloader.batch_size}\n', \
      f'No of days in validation batch: {val_dataloader.batch_size}\n')

for i in train_dataloader:
    train = i
    break
    
for i in val_dataloader:
    val = i
    break
    
print('Shapes for one batch: \n\n',
    'TRAIN:\n',
    f'training data input shape: {train.input.numpy().shape}', \
    f'training data output shape: {train.target.numpy().shape}', \
    '\n VALIDATION:\n',
        f'validation data input shape: {val.input.numpy().shape}', \
        f'validation data output shape: {val.target.numpy().shape}')

print('\nLoading model with one hidden layer.\n')
model = load_model()

print(f'Model output shape for input shape {train.input.numpy().shape}: \
      \n {model(train.input).detach().numpy().shape}')
summary(model.float(), input_size=(60,44))

No of days in training batch: 1
 No of days in validation batch: 1

Shapes for one batch: 

 TRAIN:
 training data input shape: (360, 60, 44) training data output shape: (360, 5) 
 VALIDATION:
 validation data input shape: (360, 60, 44) validation data output shape: (360, 5)

Loading model with one hidden layer.

Model output shape for input shape (360, 60, 44):       
 (360, 5)
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Linear-1                   [-1, 64]         169,024
         LeakyReLU-2                   [-1, 64]               0
            Linear-3                    [-1, 5]             325
         LeakyReLU-4                    [-1, 5]               0
Total params: 169,349
Trainable params: 169,349
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.01
Forward/backward pass size (MB): 0.00
Params size (MB): 0.65
Estimated Tot

# Target Data

As mentioned in the previous section, the target data consists of the mid price, expectation and variance of bid and ask prices for each minute entry in the LOB, starting from the 61st trading minute for each day.

We will now explain how expectation and variance are calculated. We take the 61st minute of Jan 2nd, 11:00 as example.

We will concentrate only on the bid side, since the calculations are analogous for the ask side. At 11:00 we have the following bid prices and volumes:

In [12]:
_61st_entry = LOB_conti_ffilled.iloc[360]
bid_prices = _61st_entry[['bid'+ str(i) for i in range(1,cfg.LOB_LVL+1)]].to_list()
bid_vols = _61st_entry[['bsize'+ str(i) for i in range(1,cfg.LOB_LVL+1)]].apply(int).to_list()
print('Prices :', bid_prices)
print('Volumes :', bid_vols)

Prices : [7.6, 7.59, 7.58, 7.57, 7.56]
Volumes : [761940, 518339, 432843, 730054, 426676]


For each minute we would like to create a probability measure by normalizing the volumes and calculate the expectation and variance of bid price with it. We show it for 11:00:

In [13]:
bid_vols_norm = [i/sum(bid_vols) for i in bid_vols]; print('Normalized Volumes :', bid_vols_norm)
bid_expectation = sum([i*k for i,k in zip(bid_prices,bid_vols_norm)]) ; print('Expected price: ',bid_expectation)
bid_var = sum([i*k for i,k in zip([(l-bid_expectation)**2 for l in bid_prices],bid_vols_norm)])
print('Variance of price: ',bid_var)

Normalized Volumes : [0.26549801174415966, 0.18061523730143575, 0.15082415399818527, 0.2543873342597458, 0.14867526269647355]
Expected price:  7.58159873401137
Variance of price:  0.00020661361649325515
