# MSCF 46982 Market Microstructure and Algorithmic Trading

Fall 2025 Mini 2

Limit Order Book Models

Copyright &copy; 2025 Nick Psaris. All Rights Reserved

# TOC
- [Attribution](#Attribution)
- [Initialize](#Initialize)
- [Orders](#Orders)
- [Building an Order Book](#Building-an-Order-Book)
- [Hand-crafted Features](#Hand-crafted-Features)
- [Constructing Features](#Constructing-Features)
- [LOB Benchmark Dataset](#LOB-Benchmark-Dataset)
- [LOB Prediction Survey](#LOB-Prediction-Survey)
- [DeepLOB Model](#DeepLOB-Model)
- [SoTA Models](#SoTA-Models)
- [TLOB Model](#TLOB-Model)


# Attribution
- Market data provided by [ICE Data Services][]
- Data feed sourced from [Blockstream][]
- **Any saved or downloaded data must be immediately removed from
  personal devices after completion of the course**

[ICE Data Services]: https://www.theice.com/market-data/cryptocurrencies "ICE Data Services"
[Blockstream]: https://blockstream.com/cryptofeed "Blockstream"

# Initialize
- We start by initializing the number of rows and columns displayed
- Then create a connection to the crypto kdb+ server

In [None]:
import os
os.environ['PYKX_JUPYTERQ'] = 'true'
os.environ['PYKX_4_1_ENABLED'] = 'true'
import pykx as kx


In [None]:
\c 15 120
/ windows and mac/linux use different environment variables
home:`HOME`USERPROFILE "w"=first string .z.o
upf:0N!` sv (hsym`$getenv home),`cmu_userpass.txt
h:`$":tcps://tpr-mscf-kx.tepper.cmu.edu:5003:",first read0 upf


# Orders
- First we create a query to return all data for a specific date and a
  subset of symbols
- Adding the `g` attribute to `orderid` will speed up future look-ups

In [None]:
orders:{[dt;s]
 o:select `g#orderid,time,sym,bid,ask,bsize,asize,tbm from order where date = last dt, sym like s;
 o}

In [None]:
tm:2023.12.01D20
show o:h (orders;"d"$tm;"X:SXBTUSD@OKC")

# Building an Order Book

- By starting with an empty bid and ask book we can build a market
  depth table by applying each order event -- one by one
- Even though we need to keep the whole book in memory as we progress,
  we only need to store changes in the top N levels
- Kdb+ has two extra arguments to the `select` operator that let us
  extract the top n elements after performing a custom sort
- Making an efficient book update function is crucial for this
  exercise.  Improvements are welcome!


## Book update function

- The book update function accepts the name of the table that is
  appended to on each iteration, but only when the top N levels have
  changed
- The function first grabs the last bid and ask order book records
  from `tn` (the table name)
- If the bid (ask) is not null, the new order update is then appended
  to the bid (ask) book, the book is re-sorted and the top `n` records
  are retained
- If there are any changes (to the top `n` records) the results are
  appended to `tn`


In [None]:
upd:{[tn;n;ba;x]
 B:OB:last'[tn`bid`bsize];A:OA:last'[tn`ask`asize]; / grab current OB state
 if[not null x`bid;B:value flip 0!select[n;>bid] sum bsize by bid from ba[0]:select from ba[0],:x`orderid`bid`bsize where bsize>0];
 if[not null x`ask;A:value flip 0!select[n;<ask] sum asize by ask from ba[1]:select from ba[1],:x`orderid`ask`asize where asize>0];
 if[not (OB;OA)~(B;A);tn upsert x[`time],B,A;if[not count[get tn] mod 10000;1"."]];
 ba}
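
The same update logic can be sketched in Python, the notebook's host language. This is only an illustration of the steps listed above, not a replacement for the q function: `update_book` and its arguments are hypothetical names, and the sketch handles one side of the book at a time.

```python
def update_book(orders, order_id, price, size, n, descending):
    """Apply one order event and return the top-n aggregated price levels."""
    if size > 0:
        orders[order_id] = (price, size)   # new or amended order
    else:
        orders.pop(order_id, None)         # a zero size removes the order
    levels = {}
    for p, s in orders.values():           # total size resting at each price
        levels[p] = levels.get(p, 0.0) + s
    # bids sort high-to-low (descending=True), asks low-to-high
    return sorted(levels.items(), reverse=descending)[:n]

bids = {}
update_book(bids, 1, 100.0, 2.0, n=2, descending=True)
update_book(bids, 2, 101.0, 1.0, n=2, descending=True)
update_book(bids, 3, 100.0, 3.0, n=2, descending=True)
top = update_book(bids, 2, 101.0, 0.0, n=2, descending=True)  # cancel order 2
print(top)  # [(100.0, 5.0)]
```

Re-aggregating the whole side on every event mirrors the q code and is simple but not fast; a sorted price-level structure would avoid the full re-sort.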

## Build book

- Initialize empty **b**id, **a**sk and **depth** tables
- Delete any data from depth (in case we rerun the code), then iterate
  over **o**rder events


In [None]:
b:1!flip `orderid`bid`bsize!"jff"$\:()
a:1!flip `orderid`ask`asize!"jff"$\:()
depth:flip `time`bid`bsize`ask`asize!"p****"$\:()

In [None]:
n:3
\ts delete from `depth; ba:(b;a) upd[`depth;n]/o; -1"";

In [None]:
depth

## Clean book

- Each LOB state vector must be of uniform length
- We can either pad with null values, or throw away the book events
  with insufficient levels


In [None]:
/ pad book with nulls
pad:{[n;x]@[x;where n>count each x;n#,[;n#0n]::]};
padt:{[n;t]@[t;`bid`ask`bsize`asize;pad[n]]}
/padt `depth
/@[`depth;`bid`ask`bsize`asize;pad[n]]
/ delete records with insufficient levels
show depth:select from depth where n=count each ask,n=count each bid
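
The two cleaning options can be sketched in Python as well. The helper names `pad_levels` and `has_full_depth` are hypothetical, and NaN is assumed to stand in for the q null `0n`.

```python
import math

def pad_levels(levels, n):
    """Right-pad a list of level values with NaN until it has n entries."""
    return list(levels) + [math.nan] * max(0, n - len(levels))

def has_full_depth(bid, ask, n):
    """True when both sides of the book have exactly n levels."""
    return len(bid) == n and len(ask) == n

padded = pad_levels([100.0, 99.5], 3)            # option 1: pad with nulls
keep = has_full_depth([100.0, 99.5], [101.0], 3)  # option 2: filter, False here
```

Padding keeps every event but pushes nulls into the feature vectors; filtering keeps the vectors clean at the cost of dropping thin-book events.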

# Hand-crafted Features

A. N. Kercheval and Y. Zhang, “[Modelling high-frequency limit order
book dynamics with support vector machines][SVM],” Quantitative Finance,
vol. 15, no. 8, pp. 1315–1329, Aug. 2015, doi:
10.1080/14697688.2015.1032546.

[SVM]:
https://www.semanticscholar.org/paper/Modelling-high-frequency-limit-order-book-dynamics-Kercheval-Zhang/0007e2e93ea0174252b515685f74405f1fc97ef3


## Basic Features
- Bid/Ask prices/sizes for the first 10 levels (4*10=40)

## Time-Insensitive Features

- Bid-Ask spreads and mid prices for the first 10 levels (2*10=20)
- Range of bid/ask lob price levels and absolute price difference
  between lob levels (2+2*9=20)
- Average price and size for the first 10 levels (4)
- Total price and size per-level imbalance (2)

## Time-Sensitive Features

- Bid/Ask price/size time derivatives (4*10=40)
- Short-term (1 second) arrival rate (intensity) of bid/ask
  limit/market/cancel updates (2*3=6)
- Relative short-term (10 seconds) vs long-term (900 seconds)
  indicator bid/ask limit/market (2*2=4)
- Short-term (1 second) arrival rate acceleration -- bid/ask
  limit/market change in intensity (2*2=4)


Total features: 140

Note that the relative indicator and arrival-rate acceleration features
don't include cancel rates.  The next paper, which includes an LOB
dataset, does include them, resulting in 144 total features.

# Constructing Features

- We can't distinguish between market orders and top of book cancels,
  so we leave market orders out of the time-sensitive features
- We create a function `featurize` that will transform the LOB
  structure into feature vectors

In [None]:
featurize:flip (raze flip::) each flip::

## Basic Features

- Bid/Ask prices/sizes for N levels

In [None]:
count X:featurize depth`ask`asize`bid`bsize

## Time-insensitive features

- Bid-Ask spreads and mid prices for the first N levels
- Range of bid/ask lob price levels and absolute price difference
  between lob levels
- Average price and size for the first N levels
- Total price and size per-level imbalance

In [None]:
X,:featurize exec (ask-bid;.5*ask+bid) from depth / (spread;mid)
X,:(depth[`ask][;n-1]-depth[`ask][;0];depth[`bid][;0]-depth[`bid][;n-1]) / (ask;bid) ranges
X,:featurize (1_'deltas each depth[`ask];neg 1_'deltas each depth[`bid]) / (ask;bid) differences
X,:(avg'') depth`ask`bid`asize`bsize / (price;size) means
X,:(sum'') (depth[`ask]-depth[`bid];depth[`asize]-depth[`bsize]) / (price;size) accumulated differences
count X
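
For a single book state, the time-insensitive features can be written out in plain Python. This is a sketch under the assumption that each argument is a list of N level values ordered from the best level outward; `time_insensitive` is a hypothetical name. It yields 4N+6 values per state (46 when N=10, matching the counts listed earlier).

```python
def time_insensitive(ask, asize, bid, bsize):
    """Time-insensitive features for one LOB state with n levels per side."""
    n = len(ask)
    spread = [a - b for a, b in zip(ask, bid)]              # per-level spread
    mid = [0.5 * (a + b) for a, b in zip(ask, bid)]         # per-level mid
    ranges = [ask[-1] - ask[0], bid[0] - bid[-1]]           # (ask;bid) ranges
    ask_diff = [ask[i + 1] - ask[i] for i in range(n - 1)]  # adjacent ask gaps
    bid_diff = [bid[i] - bid[i + 1] for i in range(n - 1)]  # adjacent bid gaps
    means = [sum(v) / n for v in (ask, bid, asize, bsize)]  # price and size means
    imbalance = [sum(spread), sum(a - b for a, b in zip(asize, bsize))]
    return spread + mid + ranges + ask_diff + bid_diff + means + imbalance

features = time_insensitive(
    ask=[101.0, 102.0, 104.0], asize=[1.0, 2.0, 3.0],
    bid=[100.0, 99.0, 97.0], bsize=[2.0, 2.0, 2.0])
print(len(features))  # 18
```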

## Time-sensitive features

- Bid/Ask price/size time derivatives
- Short-term (1 second) arrival rate (intensity) of bid/ask
  limit/market/cancel updates
- Relative short-term (10 seconds) vs long-term (900 seconds)
  indicator bid/ask limit/market
- Short-term (1 second) arrival rate acceleration -- bid/ask
  limit/market change in intensity


## Order flow imbalance

- Computing the order flow imbalance allows us to flag limit order and
  cancel rates


In [None]:
ofi:{[b;bs;a;as]
 e:               bs *b>=pb:prev[first b;b];
 e-:prev[first bs;bs]*b<=pb;
 e-:              as *a<=pa:prev[first a;a];
 e+:prev[first as;as]*a>=pa;
 e}
depth:update bimb:sum each ofi[bid;bsize;0f;0f],aimb:sum each ofi[0f;0f;ask;asize] from depth
depth:(delete aimb,bimb from depth),'"f"$select la:0>aimb,lb:0<bimb,ca:0<aimb,cb:0>bimb from depth

## Intensities

- Computing intensities requires the use of `wj1` to total the
  number of new events per unit of time
- Relative intensities require the comparison of two intensities
- 900 seconds is the same as 15 minutes
- Analysis is on a single symbol, so we don't need to worry about the
  existence of a `` `p`` attribute on the `` `sym`` column


In [None]:
depth:depth,'.1*select la10:la,lb10:lb,ca10:ca,cb10:cb from depth
depth:depth,'(1f%900f)*select la900:la,lb900:lb,ca900:ca,cb900:cb from depth
depth:wj1[-0D00:00:01 0D00:00:00+\:depth`time;`time;depth;`depth,flip (sum;`la`lb`ca`cb)]
depth:wj1[-0D00:00:10 0D00:00:00+\:depth`time;`time;depth;`depth,flip (sum;c10:`la10`lb10`ca10`cb10)]
depth:wj1[-0D00:15:00 0D00:00:00+\:depth`time;`time;depth;`depth,flip (sum;c900:`la900`lb900`ca900`cb900)]
depth:((c10,c900)_depth),'flip `rla`rlb`rca`rcb!"f"$(>/)depth(c10;c900) / relative intensities
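
A trailing-window total like the one `wj1` produces can be sketched in Python with a prefix sum and binary search. The sketch assumes sorted timestamps and an inclusive window `[t - window, t]`; `trailing_sum` and the sample data are hypothetical.

```python
from bisect import bisect_left, bisect_right

def trailing_sum(times, flags, window):
    """For each event, total the flags of all events in [t - window, t]."""
    prefix = [0.0]
    for f in flags:
        prefix.append(prefix[-1] + f)        # running total of flags
    out = []
    for t in times:
        lo = bisect_left(times, t - window)  # first event inside the window
        hi = bisect_right(times, t)          # one past the last event at time t
        out.append(prefix[hi] - prefix[lo])
    return out

times = [0, 500, 1200, 2000, 11500]          # hypothetical timestamps (ms)
limit_bid = [1, 0, 1, 1, 1]                  # hypothetical lb flags
rate_1s = trailing_sum(times, limit_bid, 1_000)
rate_10s = [x / 10 for x in trailing_sum(times, limit_bid, 10_000)]
rate_900s = [x / 900 for x in trailing_sum(times, limit_bid, 900_000)]
relative = [float(s > l) for s, l in zip(rate_10s, rate_900s)]  # short vs long
print(rate_1s)  # [1.0, 1.0, 1.0, 2.0, 1.0]
```

Dividing the 10 and 900 second totals by their window lengths puts both on a per-second scale before they are compared, which is what the `.1` and `1f%900f` scalings do in the q cell.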

## Time-sensitive feature extraction

- Using `asof` allows us to look back to a precise time and compute
  derivatives
- Should we throw away the first/10/900 seconds?

In [None]:
/ price/size derivatives 
X,:0f^featurize depth[`ask`bid`asize`bsize] - value flip padt[n] (`time`ask`bid`asize`bsize#depth) asof select time-0D00:00:01 from depth
/ intensity (effect of market orders included in cancel intensity)
X,:depth`la`lb`ca`cb`rla`rlb`rca`rcb / average and relative intensities
X,:0f^depth[`la`lb`ca`cb] - value flip (`time`la`lb`ca`cb#depth) asof select time-0D00:00:01 from depth
count X

# LOB Benchmark Dataset

A. Ntakaris, M. Magris, J. Kanniainen, M. Gabbouj, and A. Iosifidis,
“[Benchmark Dataset for Mid-Price Forecasting of Limit Order Book Data
with Machine Learning Methods][Benchmark],” Journal of Forecasting, vol. 37,
no. 8, pp. 852–866, Dec. 2018, doi: 10.1002/for.2543.

- Ntakaris, et al. published LOB dataset [FI-2010][] to promote the
  comparison of model performance on predicting future mid price
  movements
- The dataset comes from NASDAQ Nordic and includes LOB updates for 5
  securities over the course of 10 days
- Two versions of the data are provided: with and without auctions

[Benchmark]: https://arxiv.org/abs/1705.03233v5
[FI-2010]: https://etsin.fairdata.fi/dataset/73eb48d7-4dbc-4a10-a52a-da745b47a649

## LOB Benchmark Dataset features
- The FI-2010 dataset provides the 140 features from Kercheval and
  Zhang plus the additional 4 features resulting from relative cancel
  intensity and intensity acceleration -- totaling 144
- Additionally the dataset provides 5 labels indicating upward (>
  .002%), flat (between -.00199% and .00199%) and downward (<-.002%)
  price movements (1, 2 and 3 respectively) over the following 1, 2,
  3, 5 and 10 events (not seconds or minutes)
- Future mid prices are computed as moving averages to reduce stochastic noise
$$
l_i^{(k)} = \frac{\frac{1}{k}\sum\limits_{j=i+1}^{i+k}m_j-m_i}{m_i},
$$
where $m_j$ is the mid price of future event $j$, $m_i$ is the current mid price and $k$ is the prediction horizon (in events).
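
As a worked example of the labeling rule, the Python sketch below computes the smoothed relative change and maps it to the 1/2/3 labels. The function name is hypothetical and the threshold `alpha` is left as a parameter; the value used in the example (0.00002, i.e. .002%) is taken from the bullet above and should be checked against the dataset documentation.

```python
def fi2010_label(mid, i, k, alpha):
    """Label event i as 1 (up), 2 (flat) or 3 (down) over the next k events."""
    future = mid[i + 1 : i + k + 1]
    if len(future) < k:
        raise ValueError("not enough future events for this horizon")
    change = (sum(future) / k - mid[i]) / mid[i]   # smoothed relative change
    if change > alpha:
        return 1
    if change < -alpha:
        return 3
    return 2

up = fi2010_label([100.0, 100.0, 101.0, 102.0], i=0, k=3, alpha=0.00002)
flat = fi2010_label([100.0, 100.0, 100.0], i=0, k=2, alpha=0.00002)
down = fi2010_label([100.0, 99.0, 98.0], i=0, k=2, alpha=0.00002)
print(up, flat, down)  # 1 2 3
```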



## Feature Normalization

- The data is provided with three methods of normalization: z-score
  ($Z_{score}$), min-max ($MM$) and decimal precision ($DP$)
$$
\begin{align*}
x_i^{(Z_{score})} &= \frac{x_i - \frac{1}{N}\sum\limits_{j=1}^N x_j}{\sqrt{\frac{1}{N}\sum\limits_{j=1}^N(x_j-\bar{x})^2}} \\
x_i^{(MM)} &= \frac{x_i - x_{min}}{x_{max} - x_{min}} \\
x_i^{(DP)} &= \frac{x_i}{10^k}
\end{align*}
$$
where $k$ is the smallest integer such that $\max|x^{(DP)}|<1$

In [None]:
zscore:{(x-avg x)%dev x} / dev is the population (1/N) deviation used in the formula
minmax:{(x-m)%max[x]-m:min x}
decprec:{x%10 xexp ceiling max 10 xlog abs x}

In [None]:
x:10+til 11
/ align in a matrix for comparison
(::;zscore;minmax;decprec) @\: x

## Time series k-fold cross validation

- Each set is decomposed into training and testing datasets that
  follow a [time-series-split][] methodology -- thus allowing k-fold
  cross validation without being exposed to look-ahead biases
  
[time-series-split]: https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split "time-series-split"


In [None]:
tsfold:{flip ((raze #[;x]::) each 1_til count x;tst:1_x)}
([]split:1+til 9),'flip `train`test!flip tsfold 1+til 10

## Dataset

- Data is partitioned by split (1-9) (Kdb+ allows partitions by
  `date`, `month`, `year` or `int`)
- Always filter on the split (`int` column) first
- There are two tables: Auction and NoAuction
- Each split has the `` `p#`` attribute applied to the `norm` column
  (instead of `sym`), which takes the values **DecPre, MinMax and Zscore**
- Each split has both test and train data -- filter on the `train`
  column to differentiate
- The last 5 columns are the labels 1, 2 and 3 representing up, flat
  and down: label1, label2, label3, label4, label5


In [None]:
h:`$":tcps://tpr-mscf-kx.tepper.cmu.edu:5004:",first read0 upf
h"5#meta Auction"


In [None]:
\c 20 10000
h "select int,norm,train,label1,label2,label3,label4,label5 from Auction where int=1,norm = `DecPre,not train"

## Obtaining data to match DeepLOB Jupyter notebooks

In [None]:
fi2010:{[tn;ints;nrm;trn;lbls]
 t:select from tn where int in ints, norm=nrm, train=trn;
 t:(lbls,-5_3_cols t)#t; / only keep requested labels (and move to front)
 t}
norm:`MinMax
`Y`X set' 0 1 cut value flip h (fi2010;`NoAuction;7;norm;1b;`label5)
`Yt`Xt set' 0 1 cut value flip h (fi2010;`NoAuction;7 8 9;norm;0b;`label5)

# LOB Prediction Survey

I. Zaznov, J. Kunkel, A. Dufour, and A. Badii, “[Predicting Stock
Price Changes Based on the Limit Order Book: A Survey][Survey],”
Mathematics, vol. 10, no. 8, Art. no. 8, Jan. 2022, doi:
10.3390/math10081234.

[Survey]: https://www.mdpi.com/2227-7390/10/8/1234

- Provides a review of Stock Price Trend Prediction (SPTP) research
- Focuses on papers that publish results on the LOB Benchmark dataset
- Identifies that DeepLOB (CNN+LSTM) and TransLOB (Transformer) models
  have the best performance


## Identified Problems with the LOB Benchmark Dataset and published research

- Dated (created in 2017)
- Unbalanced classes (substantially more "flat" labels than "up" or
  "down")
- Merged (data for 5 different stocks are merged and therefore
  indistinguishable)
- Overfitting (more data -- longer time series -- is required to
  reduce overfitting of complex models)
- Unrealistic (assumed mid-price execution would only be useful for a
  market maker setting their quotes; assessing profitability would
  require executing on bid/ask prices, as well as the inclusion of
  transaction and impact costs)
- Benchmarks (performance metrics include accuracy, precision, recall
  and F1 -- rather than profitability)

# DeepLOB Model

Z. Zhang, S. Zohren and S. Roberts, "[DeepLOB: Deep Convolutional
Neural Networks for Limit Order Books][DeepLOB]," in IEEE Transactions on
Signal Processing, vol. 67, no. 11, pp. 3001-3012, 1 June 2019, doi:
10.1109/TSP.2019.2907260.

[DeepLOB]: https://arxiv.org/abs/1808.03668

- Instead of hand designing features, DeepLOB uses a combination of
  CNN (for feature extraction) and LSTM (for temporal analysis)
- Uses [Inception Module][] to "infer local interactions over
  different time horizons"
- CNNs (convolutional neural networks) are useful when the "spatial"
  structure of the data is meaningful
- Adjacent prices and sizes in a LOB, as well as proximity in the time
  domain, have meaning
- DeepLOB only uses the Basic features from the FI-2010 dataset
  (bid/ask price/size x 10 levels) and 100 most recent LOB states (for
  a total of 4000 points per observation)
- Quotes research that shows 80% of price discovery is in the best bid
  and ask (L1-Bid and L1-Ask)
- Performs analysis over longer horizon on LSE securities and
  normalizes the data with a z-score based on 5-day trailing values --
  which is resilient to regime shifts
- The [DeepLOB implementation][] is available in Jupyter notebooks
  using both TensorFlow (version 1 and 2) and PyTorch.
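
The 100 x 40 input described above can be built with a simple sliding window. This Python sketch assumes `states` is a list of 40-value LOB snapshots and `labels` holds one label per snapshot; `make_windows` is a hypothetical helper, not code from the DeepLOB repository.

```python
def make_windows(states, labels, T=100):
    """Pair each run of T consecutive LOB states with the label of its last state."""
    X, y = [], []
    for end in range(T, len(states) + 1):
        X.append(states[end - T:end])   # T most recent states, oldest first
        y.append(labels[end - 1])       # label aligned with the newest state
    return X, y

states = [[float(i)] * 40 for i in range(105)]  # 105 dummy snapshots, 40 values each
labels = list(range(105))                        # dummy labels
X, y = make_windows(states, labels)
print(len(X), len(X[0]), len(X[0][0]), y[0])     # 6 100 40 99
```

Each observation therefore holds 100 x 40 = 4000 values, and consecutive observations overlap in all but one state.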


[Inception Module]: https://paperswithcode.com/method/inception-module "Inception Module"
[DeepLOB implementation]:  https://github.com/zcakhaa/DeepLOB-Deep-Convolutional-Neural-Networks-for-Limit-Order-Books "DeepLOB implementation"


## Spatial features
- Order books are 2D structures (at each point in time)
- When prices are ordered, we have a matrix whose adjacent elements
  have meaning -- just like an image

Without manually engineering features, CNN models can find patterns
such as:
- Sharp increases (decreases) across adjacent prices
- Upward (downward) price pressure due to large quantities at the best
  bid (offer)
- Convexity (concavity) of cumulative volumes

| Feature       | Concave | Convex |
|---------------|---------|--------|
| Depth Decay   | Quick   | Slow   |
| Market Impact | High    | Low    |
| Volatility    | High    | Low    |
| Confidence    | Low     | High   |
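
One way to make the convex/concave distinction concrete is the sign of the second difference of cumulative volume across levels. The Python sketch below uses that discrete test; the function name and the test itself are assumptions for illustration, not a definition taken from the DeepLOB paper.

```python
from itertools import accumulate

def depth_shape(sizes):
    """Classify cumulative volume across levels as convex, concave, linear or mixed."""
    cum = list(accumulate(sizes))   # cumulative volume from the best level outward
    curv = [cum[i + 1] - 2 * cum[i] + cum[i - 1] for i in range(1, len(cum) - 1)]
    if all(c == 0 for c in curv):
        return "linear"
    if all(c >= 0 for c in curv):
        return "convex"
    if all(c <= 0 for c in curv):
        return "concave"
    return "mixed"

print(depth_shape([5, 3, 1]))  # concave
print(depth_shape([1, 3, 5]))  # convex
```

Sizes that shrink away from the best price give a concave cumulative curve (quick depth decay), while sizes that grow with depth give a convex one.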



# SoTA Models

Matteo Prata, Giuseppe Masi, Leonardo Berti, Viviana Arrigoni, Andrea
Coletta, Irene Cannistraci, Svitlana Vyetrenko, Paola Velardi, Novella
Bartolini "[LOB-Based Deep Learning Models for Stock Price Trend
Prediction: A Benchmark Study][Benchmark Study]"

**Acknowledgements**: This research was funded by JPMorgan Chase AI
Research Faculty award "*Understanding interdependent market
dynamics: vulnerabilities and opportunities*". We also thank Poste
Italiane for funding a Ph.D. scholarship on Financial applications of
Artificial Intelligence.

- Authors test multiple **State-of-The-Art** (SoTA) models
- Test models on the FI-2010 dataset and proprietary data downloaded from [LOBSTER][]
- Developed [LOBCAST][], "an open-source framework that incorporates
  data preprocessing, DL model training, evaluation and profit
  analysis"
- Find that Deep Learning (DL) models don't generalize well (tend to
  overfit)
- Find that the best models **employ attention mechanisms**

[Benchmark Study]: https://arxiv.org/abs/2308.01915
[LOBSTER]: https://lobsterdata.com/
[LOBCAST]: https://github.com/matteoprata/LOBCAST


# TLOB Model
L Berti, G Kasneci, "[TLOB: A Novel Transformer Model with Dual
Attention for Price Trend Prediction with Limit Order Book
Data][TLOB]"

- Introduces a new labeling technique that smooths noisy prices to
  detect persistent directional moves and reduce signal jitter.
- Introduces two new deep learning models:
  + MLPLOB (Multi-Layer Perceptron Limit Order Book)
  + TLOB (Transformer Limit Order Book)

[TLOB]: https://arxiv.org/html/2502.15757 "TLOB"


## Return labeling

- To reduce jitter, returns are calculated on mid prices averaged
  across both the historical and future windows
- To prevent over- and under-smoothing, the paper introduces an extra
  parameter to dissociate the smoothing horizon $k$ from the prediction
  horizon $h$.
- The window prices $w$ and returns $l$ are then:

$$
\begin{align*}
w_+(t,h,k) &= \frac{1}{k+1} \sum_{i=0}^k p(t+h-i) \\
w_-(t,h,k) &= \frac{1}{k+1} \sum_{i=0}^k p(t-i) \\
l(t,h,k) &= \frac{w_+(t,h,k) - w_-(t,h,k)}{w_-(t,h,k)}
\end{align*}
$$

- The paper also argues that the return threshold $\theta$ to
  determine up, down and stable should be a function of the spread,
  rather than chosen to balance the occurrence of each label.
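
The window formulas translate directly into Python. In this sketch `p` is a list of mid prices indexed by event, and `theta` is passed in by the caller; how to derive it from the spread is left open here, and the names `tlob_return` and `tlob_label` are hypothetical.

```python
def tlob_return(p, t, h, k):
    """Smoothed return l(t,h,k) between the windows ending at t and at t+h."""
    if t - k < 0 or t + h >= len(p):
        raise ValueError("windows fall outside the price series")
    w_plus = sum(p[t + h - i] for i in range(k + 1)) / (k + 1)   # future window
    w_minus = sum(p[t - i] for i in range(k + 1)) / (k + 1)      # historical window
    return (w_plus - w_minus) / w_minus

def tlob_label(l, theta):
    """Map a smoothed return to up, down or stable using threshold theta."""
    if l > theta:
        return "up"
    if l < -theta:
        return "down"
    return "stable"

prices = [100.0, 100.0, 100.0, 102.0, 102.0, 102.0]
l = tlob_return(prices, t=2, h=3, k=2)   # windows: events 3..5 vs events 0..2
print(tlob_label(l, theta=0.001))         # up
```

Because `h` and `k` are separate arguments, the prediction horizon can be lengthened without widening the averaging windows, which is the decoupling the paper argues for.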


## MLPLOB

- Includes multiple blocks of spatial and temporal mixing MLPs
- Then combines all features into a single vector and reduces the
  results into a single trend prediction: up, down or stable



## TLOB

- Applies Bilinear input Normalization ([BiN][]) to automatically
  normalize the data to shifts in prices, volatility, etc.  Instead of
  using Batch Normalization, BiN normalizes each record across both
  feature and time axes.
- Applies Self-Attention to both temporal and spatial features
  (Dual-Attention)
- Passes the Dual-Attention result through MLPLOB to extract even more
  value from combining the spatial and temporal signals
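
The idea of normalizing each record along both axes can be sketched in Python as follows. This is a deliberate simplification: the window is z-scored over time and over features and the two results are blended with fixed weights, whereas BiN as published uses learned parameters. The names `dual_axis_normalize`, `w_time` and `w_feat` are assumptions made for this illustration.

```python
from statistics import fmean, pstdev

def zscore(v):
    """Z-score a sequence; a constant sequence maps to zeros."""
    m, s = fmean(v), pstdev(v)
    return [(x - m) / s if s > 0 else 0.0 for x in v]

def dual_axis_normalize(window, w_time=0.5, w_feat=0.5):
    """Blend per-feature (over time) and per-step (over features) z-scores.

    window: T rows (time steps) by D columns (features).
    """
    by_time = list(zip(*[zscore(col) for col in zip(*window)]))  # along time axis
    by_feat = [zscore(row) for row in window]                    # along feature axis
    return [[w_time * a + w_feat * b for a, b in zip(ra, rb)]
            for ra, rb in zip(by_time, by_feat)]

normed = dual_axis_normalize([[1.0, 3.0], [3.0, 1.0]])
print(normed)  # [[-1.0, 1.0], [1.0, -1.0]]
```

Because the statistics come from the window itself rather than from a training batch, a shift in price level or volatility is absorbed record by record.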

> By blending two distinct self-attention operations (temporal first,
> then spatial) with an MLPLOB feed-forward component, TLOB is
> designed to capture the complex market microstructure present in LOB
> data. Its Transformer foundation enables effective scaling for large
> datasets, while the dual-attention mechanism better handles the
> fine-grained feature interactions and sequence dependencies
> characteristic of financial time series.
  

![mlp](https://arxiv.org/html/2502.15757/x2.png)

[BiN]: https://arxiv.org/abs/2109.00983 "Bilinear input Normalization"


## Results

- The [table of results][] shows that both MLPLOB and TLOB have higher
  F1 scores than the other SoTA models on the FI-2010 benchmark dataset
- MLPLOB outperforms TLOB on 3 of the 4 prediction horizons
- The lack of benefit from TLOB's extra complexity is attributed to the
  low complexity of the FI-2010 dataset
- The TLOB model does perform better than the other SoTA models
  when used on other, more varied, datasets such as the [BTC LOB
  dataset][]
- The number of model parameters (10s of millions) and inference times
  (microseconds) are much larger than those of other models

> Although TLOB and MLPLOB have a higher number of parameters compared
> to SoTA LOB-based models, they still have significantly fewer
> parameters than state-of-the-art deep learning models commonly used
> in standard machine learning tasks. Furthermore, they do result in
> slightly higher inference times, but their speed remains adequate
> for application in high-frequency trading scenarios.


[table of results]: https://arxiv.org/html/2502.15757#S7
[BTC LOB dataset]: https://www.kaggle.com/datasets/siavashraz/bitcoin-perpetualbtcusdtp-limit-order-book-data/data