## Data description

X: Each observation is a sequence of 100 consecutive order-book events. 20 observations are drawn at random per stock and per day. The dataset covers 504 days (approximately 2 years) and 24 stocks, so there are 100 x 20 x 504 x 24 = 24,192,000 rows of data. The columns correspond to the following items:

- obs_id: uniquely identifies a sequence of 100 order book events drawn from a random stock inside a random day;

- venue: The exchange on which the event occurred, for example NASDAQ or BATY. Venues are encoded in the data as integers;

- action: The type of order-book event that occurred. It can be ‘A’, ‘D’ or ‘U’. ‘A’ means volume was added to the book in the form of a new order, ‘D’ means an order was deleted from the book, and ‘U’ means an order was updated;

- order_id: The exchange data is ‘Level 3’, or Market-by-Order, which means that each update carries a unique identifier for the specific order that was affected. This lets us track the lifetime of an individual order: if an order was placed with an ‘A’ and the same order id later appears with a ‘D’, the same market participant has deleted it. Note, however, that the order ids have been obfuscated. Within each observation, the first order referenced is given id=0. If order_id 0 appears again in that observation, it is the same order being affected again;

- side: The side of the order book on which the event took place: ‘A’ (ask) or ‘B’ (bid);

- price: The price of the order that was affected;

- bid: The price of the best bid;

- ask: The price of the best ask;

- bid_size: The volume of orders at the best bid of the aggregated book;

- ask_size: The volume of orders at the best ask of the aggregated book;

- flux: The change to the order book caused by the event, i.e. whether the volume at a level increased or decreased and by how much;

- trade: A boolean true or false to indicate whether a deletion or update event was due to a trade or due to a cancellation.
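
The order_id description above implies that an order's lifetime can be reconstructed inside a single observation. The following is a minimal sketch of that idea on a toy frame; the values are invented for illustration, and the helper columns `step`, `added_at`, `deleted_at` and `n_events_alive` are hypothetical names that do not exist in the dataset.

```python
import pandas as pd

# Toy rows mimicking the documented columns; values are illustrative only.
events = pd.DataFrame({
    "obs_id":   [0, 0, 0, 0, 1, 1],
    "order_id": [0, 1, 0, 2, 0, 0],
    "action":   ["A", "A", "D", "A", "A", "U"],
})

# Position of each event inside its own sequence (0-based).
events["step"] = events.groupby("obs_id").cumcount()

# order_id restarts at 0 in every observation, so key on (obs_id, order_id).
keys = ["obs_id", "order_id"]
adds = events[events["action"] == "A"].groupby(keys)["step"].min()
dels = events[events["action"] == "D"].groupby(keys)["step"].min()

# The inner join keeps only orders whose addition AND deletion are both
# visible in the window; orders placed before the window starts drop out.
lifetimes = pd.concat(
    [adds.rename("added_at"), dels.rename("deleted_at")],
    axis=1,
    join="inner",
)
lifetimes["n_events_alive"] = lifetimes["deleted_at"] - lifetimes["added_at"]
print(lifetimes)
```

Lifetimes are measured here in number of events rather than in clock time, since no timestamp column is listed among the fields.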

Because the raw price level would be a strong clue to the identity of the stock, we subtract the best bid price of the first event in each sequence of 100 from the ‘price’, ‘bid’ and ‘ask’ fields of that sequence.
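
The normalisation described here can be sketched as follows. This is only an illustration under stated assumptions: the raw prices in `raw` are hypothetical (the dataset ships only the already-normalised values), and rounding to two decimals is an assumption made here to hide floating-point noise.

```python
import pandas as pd

# Hypothetical raw prices for two observations; not taken from the dataset.
raw = pd.DataFrame({
    "obs_id": [0, 0, 1, 1],
    "price":  [100.30, 99.83, 50.05, 50.00],
    "bid":    [100.00, 100.00, 50.00, 50.01],
    "ask":    [100.01, 100.01, 50.02, 50.03],
})

price_cols = ["price", "bid", "ask"]

# Best bid of the first event of each observation, broadcast to every row.
first_bid = raw.groupby("obs_id")["bid"].transform("first")

# Subtract that reference from all three price columns, row by row.
normalised = raw.copy()
normalised[price_cols] = raw[price_cols].sub(first_bid, axis=0).round(2)
print(normalised)
```

After this step the first row of every observation has bid equal to 0, which is consistent with the `bid` column shown in the `X_train.head(100)` output.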

Y: The target is eqt_code_cat. In the training set it is an integer between 0 and 23 that identifies the stock the observation was drawn from.

The training set is drawn from one period of time. The same stocks are used again in the test period, but the observations of the market are drawn from a different future period.
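
Since the label is defined per observation while X has one row per event, a natural first step is to attach the label to every event and confirm that each observation has the expected number of rows. The sketch below uses toy frames with two events per observation instead of 100. It assumes that `y_train` carries an `obs_id` column next to `eqt_code_cat`; that layout is not confirmed by the description above.

```python
import pandas as pd

# Toy stand-ins for X_train and y_train; the real frames come from the CSVs.
X = pd.DataFrame({"obs_id": [0, 0, 1, 1], "venue": [4, 4, 2, 2]})
y = pd.DataFrame({"obs_id": [0, 1], "eqt_code_cat": [10, 3]})  # assumed layout

# validate="many_to_one" raises if an obs_id appears more than once in y.
labelled = X.merge(y, on="obs_id", how="left", validate="many_to_one")

# Number of events per observation; on the real data this should be 100.
events_per_obs = labelled.groupby("obs_id").size()
print(labelled)
print(events_per_obs)
```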

In [22]:
import pandas as pd
import seaborn as sns

In [3]:
X_train = pd.read_csv('data/X_train.csv')
X_test = pd.read_csv('data/X_test.csv')
y_train = pd.read_csv('data/y_train.csv')

In [26]:
X_train.head(100)

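| Unnamed: 0 | obs_id | venue | order_id | action | side | price | bid | ask | bid_size | ask_size | trade | flux | WAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 4 | 0 | A | A | 0.30 | 0.0 | 0.01 | 100 | 1 | False | 100 | 0.0 |
| 1 | 0 | 4 | 1 | A | B | -0.17 | 0.0 | 0.01 | 100 | 1 | False | 100 | 0.0 |
| 2 | 0 | 4 | 2 | D | A | 0.28 | 0.0 | 0.01 | 100 | 1 | False | -100 | 0.0 |
| 3 | 0 | 4 | 3 | A | A | 0.30 | 0.0 | 0.01 | 100 | 1 | False | 100 | 0.0 |
| 4 | 0 | 4 | 4 | D | A | 0.37 | 0.0 | 0.01 | 100 | 1 | False | -100 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 0 | 4 | 62 | A | A | 0.38 | 0.0 | 0.02 | 100 | 310 | False | 100 | 0.0 |
| 96 | 0 | 4 | 63 | A | A | 0.38 | 0.0 | 0.02 | 100 | 310 | False | 100 | 0.0 |
| 97 | 0 | 4 | 64 | A | A | 0.41 | 0.0 | 0.02 | 100 | 310 | False | 100 | 0.0 |
| 98 | 0 | 4 | 63 | D | A | 0.38 | 0.0 | 0.02 | 100 | 310 | False | -100 | 0.0 |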
