# MSCF 46982 Market Microstructure and Algorithmic Trading

Fall 2025 Mini 2

Introduction to NYSE Daily TAQ

Copyright &copy; 2025 Nick Psaris. All Rights Reserved

# TOC
- [Initialize](#Initialize)
- [Daily TAQ Database](#Daily-TAQ-Database)
- [Connecting to a Database](#Connecting-to-a-Database)
- [Database Storage Paradigm](#Database-Storage-Paradigm)
- [Exchange Table](#Exchange-Table)
- [Trade Table](#Trade-Table)
- [Trade Conditions](#Trade-Conditions)
- [Passing Lambdas](#Passing-Lambdas)
- [First Generation Features](#First-Generation-Features)
- [Richard Roll's Effective Spread](#Richard-Roll's-Effective-Spread)
- [Corwin-Schultz Spread Estimate](#Corwin-Schultz-Spread-Estimate)
- [Volume-Weighted Average Price (VWAP)](#Volume-Weighted-Average-Price-(VWAP))
- [Participation-Weighted Price (PWP)](#Participation-Weighted-Price-(PWP))
- [Quote Table](#Quote-Table)
- [NBBO Table](#NBBO-Table)
- [Mid-Price and Spread](#Mid-Price-and-Spread)
- [Imbalance](#Imbalance)
- [Weighted Mid-Price](#Weighted-Mid-Price)
- [Time-Weighted Average Price (TWAP)](#Time-Weighted-Average-Price-(TWAP))


# Initialize
- We start by initializing the number of rows and columns displayed

In [1]:
import os
os.environ['PYKX_JUPYTERQ'] = 'true'
os.environ['PYKX_4_1_ENABLED'] = 'true'
import pykx as kx


PyKX now running in 'jupyter_qfirst' mode. All cells by default will be run as q code. 
Include '%%py' at the beginning of each cell to run as python code. 


In [2]:
\c 25 100

In [3]:
%%py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import datetime
plt.style.use('default')
mpl.rcParams["figure.figsize"] = [15,5]

# Daily TAQ Database
- Includes all trades and quotes from all US equity exchanges and
  Alternate Trading Systems ([ATS][])
- Kdb+'s Daily TAQ loader was one of KX's signature products
- It is now available on [github][]
- We have downloaded 2020.02m to demonstrate an actual Kdb+
  environment

[ats]: https://www.sec.gov/foia/docs/atslist.htm
[github]: https://github.com/KxSystems/kdb-taq

# Connecting to a Database
## Open and close the connection yourself
- Kdb+ connection handles take the form `:host:port:user:password`
- For security, we use [TLS][] to encrypt communication with the Kdb+
  server and prefix the host with "tcps://"
- To make the notebook generic and hide the password, we store both
  user and password in a file "cmu_userpass.txt"
- The `hopen` operator takes a symbol with the host, port and
  optionally the username and password
- The `hopen` operator returns an integer (aka handle) which is used
  to communicate with the server

[tls]: https://en.wikipedia.org/wiki/Transport_Layer_Security
"Transport Layer Security"


In [5]:
/ user:password file
/ windows and mac/linux use different environment variables
home:`HOME`USERPROFILE "w"=first string .z.o
upf:0N!` sv (hsym`$getenv home),`cmu_userpass.txt
/ user:password
up:first read0 upf

`:/Users/nick/cmu_userpass.txt


In [None]:
h:0N!hopen `$":tcps://tpr-mscf-kx.tepper.cmu.edu:5000:",up


- The `hclose` operator closes the connection

In [None]:
hclose h

## Let Q open/close the connection each time
- We can alternatively store the connection string to `h` directly.  Q
  will then automatically open and close the connection on each query.
  This provides two benefits:
  + Our notebooks won't be affected by the server getting restarted
  + The server won't be at risk of running out of file descriptors
    because users rerun their notebooks without closing the connection
- The syntax for sending queries remains unchanged

In [None]:
h:`$":tcps://tpr-mscf-kx.tepper.cmu.edu:5000:",up


# Sending Queries to the Database
- We can enclose our commands in quotes and pass them to the handle
- The `tables` operator lists the tables in a q process

In [None]:
h "tables[]"

# Database Storage Paradigm

## Small tables are stored as a single file

```
[npsaris@tpr-mscf-kx taq]$ ls -lr | grep -v .q | head
total 252
-rw-rw-r-- 1 npsaris npsaris  20547 Oct 20  2018 tspilot
-rw-rw-r-- 1 npsaris npsaris  50962 Sep 24  2020 sym
-rw-rw-r-- 1 npsaris npsaris 159794 Sep 24  2020 mas
drwxrwxr-x 2 npsaris npsaris     20 Sep 14  2019 html
-rw-r--r-- 1 npsaris npsaris    743 Oct  9  2018 ex
drwxrwxr-x 2 npsaris npsaris   4096 Nov 15  2021 daily
-rw-rw-r-- 1 npsaris npsaris   1807 Nov 11  2018 cond
drwxrwxr-x 5 npsaris npsaris     41 Sep 25  2020 2020.02.28
drwxrwxr-x 5 npsaris npsaris     41 Sep 24  2020 2020.02.27
```

## Large tables are stored across date partitions in 'splayed' table directories
- Each column of a splayed table is stored in its own file
- These files can be compressed -- notice which file is the largest

```
[npsaris@tpr-mscf-kx taq]$ tree -h --du --charset=ascii taq | head
taq
|-- [8.5G]  2020.02.03
|   |-- [3.0G]  nbbo
|   |   |-- [351M]  asize
|   |   |-- [ 38M]  ask
|   |   |-- [ 38M]  bid
|   |   |-- [347M]  bsize
|   |   |-- [8.6M]  sym
|   |   `-- [2.2G]  time
|   |-- [5.2G]  quote
```

# Exchange Table

In [None]:
show e:h"ex"

## New Exchanges

4 new exchanges have recently been added

- Long Term Stock Exchange ([LTSE][])
  - Founded on September 9, 2020
  - Only lists companies that are organized to sustain long-term thinking
- Members Exchange ([MEMX][])
  - Founded on September 4, 2020
  - Founded by 9 banks: BofA Securities, Charles Schwab Corporation,
    Citadel LLC, E-Trade, Fidelity Investments, Morgan Stanley, TD
    Ameritrade, UBS, and Virtu Financial
  - Other institutions have since invested, including BlackRock,
    Citigroup, J.P. Morgan, Goldman Sachs, Wells Fargo, and Jane Street.
- MIAX Pearl Equities Exchange ([MIAX][])
  - Founded on September 25, 2020
  - Delivers "unmatched technology with ultra-low latency, high
    throughput and deterministic trading platforms"
- The Texas Stock Exchange ([TXSE][])
  - Approved by the SEC on September 30, 2025
  - Will begin operation in late 2025
  - "Over the long run, TXSE's mission is to reverse the decades-long
    decline in the number of U.S. public companies by reducing the
    burden of going and staying public while maintaining some of the
    highest quantitative standards in the industry."
      
[ltse]: https://ltse.com/ "Long Term Stock Exchange"
[memx]: https://memx.com/ "Members Exchange"
[miax]: https://www.miaxglobal.com/markets/us-options/pearl-options "MIAX Pearl Equities Exchange"
[txse]: https://www.txse.com/ "Texas Stock Exchange"
    

# Daily Table
- The `daily` table lists a daily summary for each security
- The `size` column includes the total traded volume for each security
  $\sum \text{size}$
- The `price` column includes the total notional traded $\sum
  \text{price} \times \text{size}$
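
The two aggregates above can be illustrated with a minimal plain-Python sketch. The function name `daily_summary` and the sample trades are hypothetical and are not part of the database; only the two sums come from the description above.

```python
from collections import defaultdict

def daily_summary(trades):
    """Aggregate (sym, price, size) trades into per-symbol totals.

    Returns {sym: (total_size, total_notional)}, mirroring the `size`
    and `price` columns described for the daily table.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for sym, price, size in trades:
        totals[sym][0] += size            # sum of size
        totals[sym][1] += price * size    # sum of price * size
    return {sym: (size, notional) for sym, (size, notional) in totals.items()}

# Hypothetical trades, for illustration only
trades = [("BAC", 35.00, 100), ("BAC", 35.02, 300), ("TSLA", 780.00, 10)]
summary = daily_summary(trades)
```

Because `price` holds notional rather than a price, dividing it by `size` recovers each security's volume-weighted average price for the day.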


In [None]:
h"select from daily where date = last date"

## Daily plot of traded notional leading into COVID-19 outbreak

In [None]:
.pykx.set[`df] .pykx.topd update "p"$date from h "select notional:sum price by date from daily where date.month=2020.02m"


In [None]:
%%py
plt.stem(df.index,df.notional)
plt.title('Covid-19 doubles traded notional')


# Trade Table
- The `trade` table lists all the trades across each exchange and ATS
- Averages >30M rows per day, but spikes during volatile markets (COVID-19)
- The table is so big we force users to specify both the `date` and
  `sym` in their queries
- The limitation is not on the server - but on the ability to transfer
  the data back to the user
- Filtering and/or summarizing the data on the server is the best solution

## Daily plot of trade counts leading into COVID-19 outbreak

In [None]:
.pykx.set[`df] .pykx.topd update "p"$date from h "select n:count sym by date from trade where date.month=2020.02m"


In [None]:
%%py
plt.stem(df.index,df.n)
plt.title('Covid-19 doubled trade counts')


# Trade Conditions

- The `cond` and `ex` column definitions can be found in the [NYSE
  Daily TAQ Client Spec][]
- Consolidated Tape Association (CTA) contains quotes and trades from
  NYSE/AMEX/etc listed securities
- Unlisted Trading Privileges (UTP) contains quotes and trades from
  NASDAQ listed securities

[nyse daily taq client spec]:
https://www.nyse.com/publicdocs/nyse/data/Daily_TAQ_Client_Spec_v3.0.pdf
"NYSE Daily TAQ Client Spec"

## Trade conditions sorted by frequency
- What is trade condition "F"?

In [None]:
h"`n xdesc select n:count sym by cond from trade where date = last date"

In [None]:
show c:h"cond"

## Nanosecond precision
- Note how NASDAQ (and NYSE as of 2020) names have nanosecond precision

In [None]:
dt:2020.02.03
h"select from trade where date = ",string[dt],",sym=`BAC,null cond"

# Passing Lambdas

- We can bypass converting everything to strings by passing an
  anonymous function ([lambda][]) and its arguments to the connection handle
- This only works when the client is another Kdb+ process
- Note how NYSE and NASDAQ names both have nanosecond precision

[lambda]: https://en.wikipedia.org/wiki/Anonymous_function "Anonymous Function"

In [None]:
trades:{[dt;syms]select from trade where date=dt,sym in syms, null cond}
sym:`TSLA
show -10#t:h (trades;dt;sym)

# FINRA Alternative Display Facility (ADF)
- The FINRA [ADF][] records trades from ATSs, including ECNs and dark
  pools

[adf]: http://www.finra.org/industry/adf "Alternative Display Facility"

In [None]:
(select[>size] sum size by ex from t) lj e

# Pivot Tables

In [None]:
/ pivot table
pivot:{[t]
 u:`$string asc distinct last f:flip key t;
 pf:{x#(`$string y)!z};
 p:?[t;();g!g:-1_ k;(pf;`u;last k:key f;last key flip value t)];
 p}
pivot select sum size by price,ex from t where ex="D"

# First Generation Features

The Roll effective spread and the Corwin-Schultz spread estimate are
examples of "first generation" market features.

- Developed at a time when the only information available was daily
  high, low and close prices.
- Attempt to estimate the bid-ask spread without observing the market
  quotes.
- Can be applied to intraday data, but take care not to use too short
  a time interval.



# Richard Roll's Effective Spread

[Roll, R. (1984)][] A Simple Implicit Measure of the Effective Bid-Ask
Spread in an Efficient Market. The Journal of Finance, 39, 1127-1139.

[Roll, R. (1984)]: https://doi.org/10.1111/j.1540-6261.1984.tb03897.x
"Roll, R. (1984)"

> In an efficient market, the fundamental value of a security
> fluctuates randomly. However, trading costs induce negative serial
> dependence in successive observed market price changes. In fact,
> given market efficiency, the effective bid-ask spread can be
> measured by

$$
\text{spread} = 2\sqrt{-cov}
$$
- Where $cov$ is the first-order serial covariance of price changes
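
As a cross-check of the formula, here is a minimal plain-Python sketch of the same estimator. The name `roll_spread`, the simulated prices and the choice to return NaN when the covariance is not negative are assumptions made for illustration; the simulation assumes a constant fundamental value and independent buys and sells.

```python
import math
import random

def roll_spread(prices):
    """Roll (1984) estimate: 2*sqrt(-cov) of successive price changes."""
    dp = [b - a for a, b in zip(prices, prices[1:])]   # price changes
    x, y = dp[1:], dp[:-1]                             # change and lagged change
    n = len(x)
    if n < 2:
        return float("nan")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    # the estimator is undefined when the serial covariance is not negative
    return 2.0 * math.sqrt(-cov) if cov < 0 else float("nan")

# Simulated bid-ask bounce: fixed value of 100, trades hit bid or ask at random
rng = random.Random(0)
true_spread = 0.02
prices = [100.0 + 0.5 * true_spread * rng.choice((-1, 1)) for _ in range(100_000)]
estimate = roll_spread(prices)   # should land close to true_spread
```

A steadily trending price series has non-negative serial covariance, so the sketch returns NaN for it rather than a spread.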

In [None]:
roll:{2f*sqrt neg x scov prev x-:prev x}

## When $P_{t-1}$ is at the bid

![Image of Roll Effective Spread](https://onlinelibrary.wiley.com/cms/asset/4d294f2f-f87a-4bf8-ae43-aec2b2005cd9/jofi3897-fig-0001-m.jpg)


| $\Delta P_t$ | | 0 | +s |
|------------------|----|-----|-----|
|                  | -s | 0 | 1/4 |
| $\Delta P_{t+1}$ | 0 | 1/4 | 1/4 |
|                  | +s | 1/4 | 0 |


## When $P_{t-1}$ is at the ask

| $\Delta P_t$ | | -s| 0 |
|-----------------|--|---|---|
|                 |-s| 0 |1/4|
|$\Delta P_{t+1}$ | 0|1/4|1/4|
|                 |+s|1/4|0  |

## Combining the two probabilities

| $\Delta P_t$    |  | -s| 0 | +s|
|-----------------|--|---|---|---|
|                 |-s| 0 |1/8|1/8|
|$\Delta P_{t+1}$ | 0|1/8|1/4|1/8|
|                 |+s|1/8|1/8|0  |

## Autocovariance as a function of the spread
$$
\text{Cov}(\Delta P_t,\Delta P_{t+1}) = \frac{1}{8}(-s^2-s^2) = -\frac{s^2}{4}
$$

$$
\text{spread} = 2\sqrt{-cov}
$$
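
The arithmetic can be verified directly from the combined probability table. The sketch below uses exact fractions; the function name `roll_table_cov` is hypothetical and the probabilities are copied from the combined table.

```python
import math
from fractions import Fraction

def roll_table_cov(s):
    """Serial covariance of price changes implied by the combined table."""
    e = Fraction(1, 8)
    moves = (-s, 0, s)
    # joint[i][j] = P(next change = moves[i], current change = moves[j])
    joint = [
        [0, e, e],
        [e, 2 * e, e],
        [e, e, 0],
    ]
    assert sum(map(sum, joint)) == 1          # probabilities sum to one
    cells = [(p, moves[i], moves[j])
             for i, row in enumerate(joint) for j, p in enumerate(row)]
    mean_next = sum(p * nxt for p, nxt, cur in cells)
    mean_cur = sum(p * cur for p, nxt, cur in cells)
    cross = sum(p * nxt * cur for p, nxt, cur in cells)
    return cross - mean_next * mean_cur

s = Fraction(1, 50)                  # a 2 cent spread
cov = roll_table_cov(s)              # equals -s**2 / 4
spread = 2 * math.sqrt(-cov)         # recovers s as a float
```

Only the two corner cells with opposite signs contribute to the cross term, which is where the $\frac{1}{8}(-s^2-s^2)$ comes from.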

> First, note that s is not necessarily the quoted spread. Successive
> price changes are recorded from actual transactions -- so the s in the
> probability table above and in Equation (1) is the effective spread,
> i.e., the spread faced by the dollar-weighted average investor who
> actually trades at the observed prices.


## Assumptions, Advantages and Disadvantages
- Assumptions:
  - Security trades in an efficient market
  - Probability distribution of price changes is stationary
- Advantages:
  - Efficient to compute because only needs trade prices
  - Independent of time interval of successive prices
- Disadvantages:
  - Empirical evidence shows that the choice of time interval affects
    spread calculation
  - What happens when serial covariance of price changes is positive?
    See [Harris (1990)][]
  
[Harris (1990)]: https://doi.org/10.2307/2328671 "Harris (1990)"

# Corwin-Schultz Spread Estimate

[Corwin and Schultz (2012)][] Corwin, Shane A. and Schultz, Paul H., A
Simple Way to Estimate Bid-Ask Spreads from Daily High and Low Prices
(March 30, 2011). Journal of Finance, Forthcoming, Available at SSRN:
https://ssrn.com/abstract=1106193

[Corwin and Schultz (2012)]: https://doi.org/10.1111/j.1540-6261.2012.01729.x
"Corwin and Schultz (2012)"

> We develop a bid-ask spread estimator from daily high and low
> prices. Daily high (low) prices are almost always buy (sell)
> trades. Hence, the high–low ratio reflects both the stock's variance
> and its bid-ask spread. Although the variance component of the
> high–low ratio is proportional to the return interval, the spread
> component is not. This allows us to derive a spread estimator as a
> function of high–low ratios over 1-day and 2-day intervals. The
> estimator is easy to calculate, can be applied in a variety of
> research areas, and generally outperforms other low-frequency
> estimators.


## Reasonable Assumptions
- The true value of a security follows a diffusion process (geometric
  Brownian motion)
- A spread $S\%$ is constant over the two-day estimation period
- This spread causes buys to be higher than sells by $S\%$
- The daily high prices are buyer-initiated and the daily low prices are
  seller-initiated -- deviating from the true price by $+(S/2)\%$
  and $-(S/2)\%$ respectively

## Unreasonable Assumptions
- The security trades continuously while the market is open
- The security value does not change while the market is closed

## Benefits
- The Corwin-Schultz spread estimator explicitly models the price
  volatility in addition to the spread -- a clear improvement over the
  Roll measure.
- The model captures effects of temporary illiquidity that result in
  high and low prices.

> It is important to note that the high-low spread estimator captures
> liquidity more broadly than just the bid-ask spread. Price pressure
> from large orders will often lead to execution at daily high or low
> prices. Likewise, a succession of buy or sell orders in a shallow
> market may result in executions at daily high or low prices. The
> high-low spread estimator captures these transitory price effects in
> addition to the bid-ask spread.


## Simplified Derivation
Observing that the difference in high and low prices is a function of
the stock's volatility and spread, we can derive these values from the
data if we can combine two equations with these two unknowns.

Equating the observed and actual prices for a single day gives us the
following equation

$$
\left[\ln\left(\frac{H_t^O}{L_t^O}\right)\right]^2 = 
\left[\ln\left(\frac{H_t^A(1+S/2)}{L_t^A(1-S/2)}\right)\right]^2 ,
$$


where $H_t^O$ and $L_t^O$ are the observed high and low prices and
$H_t^A$ and $L_t^A$ are the actual high and low prices respectively.


Expanding this gives an equation where the first term is proportional to
the stock's variance.

$$
\left[\ln\left(\frac{H_t^O}{L_t^O}\right)\right]^2 = 
\left[\ln\left(\frac{H_t^A}{L_t^A}\right)\right]^2 + 
2\left[\ln\left(\frac{H_t^A}{L_t^A}\right)\right]\left[\ln\left(\frac{2+S}{2-S}\right)\right] +
\left[\ln\left(\frac{2+S}{2-S}\right)\right]^2
$$


We can also produce a second equation for the two day period
$$
\left[\ln\left(\frac{H_{t,t+1}^O}{L_{t,t+1}^O}\right)\right]^2 = 
\left[\ln\left(\frac{H_{t,t+1}^A}{L_{t,t+1}^A}\right)\right]^2 + 
2\left[\ln\left(\frac{H_{t,t+1}^A}{L_{t,t+1}^A}\right)\right]\left[\ln\left(\frac{2+S}{2-S}\right)\right] +
\left[\ln\left(\frac{2+S}{2-S}\right)\right]^2 ,
$$


where $H_{t,t+1}^O$ and $L_{t,t+1}^O$ are the observed two-day high
and low prices and $H_{t,t+1}^A$ and $L_{t,t+1}^A$ are the actual
two-day high and low prices respectively.


Corwin and Schultz refer to Parkinson (1980) and Garman and Klass
(1980) to refactor the equation in terms of the high-low volatility
$\sigma_{HL}$.

We now have two equations with two unknowns: $S$ and $\sigma_{HL}$.

Using simplifying assumptions, the solutions for these values become

$$
S = \frac{2\left(e^\alpha - 1\right)}{1+e^\alpha}
$$

and

$$
\sigma_{HL} = 
\frac{\sqrt{\beta/2} - \sqrt{\beta}}{k_2(3-2\sqrt{2})} +
\sqrt{\frac{\gamma}{k_2^2(3-2\sqrt{2})}}
$$

where 

$$
\alpha =
\frac{\sqrt{2\beta} - \sqrt{\beta}}{3-2\sqrt{2}} -
\sqrt{\frac{\gamma}{3-2\sqrt{2}}}.
$$

$$
\gamma =  \left[\ln\left(\frac{H_{t,t+1}^O}{L_{t,t+1}^O}\right)\right]^2
$$

$$
\beta = E\left\{\sum_{j=0}^1\left[\ln\left(\frac{H_{t+j}^O}{L_{t+j}^O}\right)\right]^2\right\}
$$

which is the sum of variances of $t$ and $t+1$

$$
\beta = E\left\{\left[\ln\left(\frac{H_t^O}{L_t^O}\right)\right]^2 + \left[\ln\left(\frac{H_{t+1}^O}{L_{t+1}^O}\right)\right]^2\right\}
$$

and the $k_2$ constant required to compute the volatility is

$$
k_2 = \sqrt{\frac{8}{\pi}}
$$
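
The closed-form expressions above can be evaluated for a single pair of periods with a short plain-Python sketch. The function name `corwin_schultz` is hypothetical, and replacing the expectation in $\beta$ with a single two-period sum is an assumption made to keep the example small.

```python
import math

K2 = math.sqrt(8.0 / math.pi)            # k_2 constant
DENOM = 3.0 - 2.0 * math.sqrt(2.0)       # 3 - 2*sqrt(2)

def corwin_schultz(h0, l0, h1, l1):
    """Spread S and high-low volatility from two consecutive periods.

    (h0, l0) and (h1, l1) are the observed high and low of each period.
    """
    beta = math.log(h0 / l0) ** 2 + math.log(h1 / l1) ** 2
    gamma = math.log(max(h0, h1) / min(l0, l1)) ** 2
    alpha = ((math.sqrt(2.0 * beta) - math.sqrt(beta)) / DENOM
             - math.sqrt(gamma / DENOM))
    spread = 2.0 * (math.exp(alpha) - 1.0) / (1.0 + math.exp(alpha))
    sigma = ((math.sqrt(beta / 2.0) - math.sqrt(beta)) / (K2 * DENOM)
             + math.sqrt(gamma / (K2 ** 2 * DENOM)))
    return spread, sigma

# Identical ranges on both days: the two-day range does not grow at all
s_same, v_same = corwin_schultz(101.0, 100.0, 101.0, 100.0)
# All of the two-day range comes from the first day
s_zero, v_zero = corwin_schultz(101.0, 100.0, 100.5, 100.5)
```

In the first case the range does not widen over two days, so the formulas attribute all of it to the spread and none to volatility. In the second case $\beta = \gamma$, which makes $\alpha$ and therefore the spread zero.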

## Handling overnight price jumps
- News released between market close and the next day's market open
  can result in large price gaps.
- Price gaps may cause today's low to be higher than tomorrow's high or
  today's high to be lower than tomorrow's low.
- This causes the intraday volatility estimated from the two-day high
  and low prices to be **overestimated** and the spread, therefore, to
  be **underestimated**.
- Overnight gaps can be taken into account by adjusting the second day
  high and low prices down (up) by the difference between $t$'s close
  and $t+1$'s low (high) -- only when the close is less (greater) than
  $t+1$'s low (high).
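
The adjustment rule can be sketched in plain Python for a single pair of days. The function name `adjust_for_gap` is hypothetical; the logic follows the rule stated in the last bullet above.

```python
def adjust_for_gap(prev_close, high, low):
    """Shift the second day's high and low to remove an overnight gap."""
    up = max(0.0, prev_close - high)    # > 0 when the close is above the next high
    down = min(0.0, prev_close - low)   # < 0 when the close is below the next low
    shift = up + down                   # at most one of the two is non-zero
    return high + shift, low + shift

gap_up = adjust_for_gap(100.0, 105.0, 102.0)     # opened above the close: shift down
gap_down = adjust_for_gap(100.0, 98.0, 95.0)     # opened below the close: shift up
no_gap = adjust_for_gap(100.0, 101.0, 99.0)      # close inside the range: unchanged
```

After the shift, the previous close sits exactly at the adjusted low (gap up) or adjusted high (gap down), so the two-day range no longer includes the overnight jump.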



## Corwin-Schultz Implementation

### High/Low vectors
- The Corwin-Schultz volatility and spread are implemented with
  composable functions -- allowing analysis, customization, and reuse
  of intermediate state.
- Instead of computing $t$ and $t+1$, we use $t-1$ and $t$ to ensure
  we have not created a look-ahead bias.
- We can vectorize the implementation if we first generate `(prev
  x;x)` vector pairs for the high and low prices.
- The `::` identity operator can be used to return its argument
  unchanged.


In [None]:
sym:`BAC 
t:h (trades;dt;sym)


In [None]:
show ohlc:0!select o:first price,h:max price, l:min price, c:last price, sum price*size, sum size, nt:count i, durn:sum 1e-9*deltas[first time;time] by 0D00:00:10 xbar time from t where not ex="D"


In [None]:
cshl:{[h;l]
 HL:(prev;::)@\:/:(h;l);        / (prev x;x) pairs
 HL}

show HL:cshl . ohlc`h`l

### Adjusting for overnight gaps

- Adjust the $t+1$ high and low prices down (up) by any overnight gap
  found between $t$'s close and $t+1$'s low (high).
- Up and down gaps are exclusive and their adjustments can be stored
  in a single variable `d`.
- The `.` operator can be used to pass the elements of a list as each
  of a function's arguments.  If the variable `a` was a two-element
  list, `f . a` would be equivalent to `f[a 0;a 1]`.  In Python this
  would be `f(*a)`.
  

In [None]:
csadjhl:{[c;H;L]
 c:prev c;
 d: 0f|c-H[1];                  / prev close > today's high
 d+:0f&c-L[1];                  / prev close < today's low
 H[1]+:d;                       / adjust today's high for overnight jump
 L[1]+:d;                       / adjust today's low for overnight jump
 (H;L)}

show HL:csadjhl[ohlc`c] . cshl . ohlc`h`l


### Precomputing the `f` factor and `beta` and `gamma` estimates

- The spread $S$ and high-low volatility $\sigma_{HL}$ estimates can
  both be calculated if we precompute the $\frac{1}{3-2\sqrt{2}}$
  scaling factor as well as the $\beta$ and $\gamma$ estimates (which
  are computed over a historical window `w`).


In [None]:
csfbg:{[w;H;L]                  / Corwin-Schultz f,b,g coefficients
 f:1f%3f-2f*sqrt 2f;            / factor
 b:w mavg sum x*x:log H%L;      / beta
 g:x*x:log max[H]%min[L];       / gamma
 (f;b;g)}

w:3                             / window for estimating beta and gamma
show fbg:csfbg[w] . HL


### Corwin-Schultz volatility
- Using the `f` factor and $\beta$ and $\gamma$ estimates we can
  estimate the period volatilities.
- $\sqrt{b}$ has been factored out for performance.


In [None]:
cshlv:{[f;b;g]                  / Corwin-Schultz high-low volatility
 k:sqrt .25*acos 0f;            / k:1%k2:sqrt 8%pi
 v:k*f*sqrt[b]*-1f+sqrt .5;
 v+:k*sqrt f*g;
 v}

sqrt[252]*cshlv . fbg

### Corwin-Schultz spread

- Similarly, we can use the `f` factor and $\beta$ and $\gamma$
  coefficients to estimate the period spreads.
- Again, $\sqrt{b}$ has been factored out for performance.
- Spread estimates are computed as a **percentage** of the security
  price.
- Spread estimates can be floored to zero.


In [None]:
cssprd:{[f;b;g]                 / Corwin-Schultz spread
 a:f*sqrt[b]*-1+sqrt 2f;        / alpha
 a-:sqrt f*g;
 s:2f*(x-1f)%1f+x:exp a;        / spread
 s}

0f|cssprd . fbg                    / floor negative spreads to 0

### Corwin-Schultz Spread and Volatility

- Combining these pieces, we can create a `cs` function that returns
  the average spread, floored at zero and scaled by the close to
  convert it from a percentage into price units.



In [None]:
cs:{[w;h;l;c]avg 0f|c*cssprd . csfbg[w] . csadjhl[c] . cshl[h;l]}
cs[10] . ohlc`h`l`c

# Volume-Weighted Average Price (VWAP)
- Can be computed from the trade data

$$
\frac{\sum \text{price} \times \text{size}}{\sum \text{size}}
$$
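
A minimal plain-Python sketch of the formula follows. The function name `vwap` and the sample numbers are hypothetical.

```python
def vwap(prices, sizes):
    """Volume-weighted average price: sum(price*size) / sum(size)."""
    total = sum(sizes)
    if total == 0:
        return float("nan")          # no volume, no VWAP
    return sum(p * s for p, s in zip(prices, sizes)) / total

# 100 shares at 10 and 300 shares at 11
example = vwap([10.0, 11.0], [100, 300])
```

The larger trade pulls the average toward its price, which is why the VWAP here sits above the simple average of the two prices.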

In [None]:
show vwap:select vwap:sum[price]%sum size, sum size, rsprd:roll c,cssprd:cs[1;h;l;c],sum nt, sum[durn]%sum nt by 0D00:10 xbar time from ohlc


## Intraday plot of Roll effective spread, Corwin-Schultz spread estimate and size traded

In [None]:
.pykx.set[`df] .pykx.topd vwap


In [None]:
%%py
ax = df[['rsprd','cssprd','size']].plot(drawstyle='steps-post',secondary_y=['size'],title='Effective Spread vs. Traded Size')
ax.set_ylim(0)
ax.set_ylabel('spread')
ax.right_ax.set_ylabel('size')


## Intraday plot of trade count and time between trades

In [None]:
%%py
ax = df[['nt','durn']].plot(drawstyle='steps-post',secondary_y=['durn'],title='Trade Count vs. Time Between Trades')
ax.set_ylabel('trades')
ax.right_ax.set_ylabel('duration')


# Participation-Weighted Price (PWP)
- The PWP is a benchmark assuming the trader executes in line with a
  percentage-of-volume (POV) algorithm.
- Given a security, quantity, beginning time, and desired
  participation rate, the PWP is computed by looking ahead until
  enough volume is executed on the exchange, such that our order would
  have been filled given our desired participation rate. Once this
  ending time is determined, the VWAP over that interval is computed
  and declared the PWP.
- A PWP is specific to a single transaction and has a moving ending
  time - depending on the participation rate.
- The higher the participation rate, the shorter the time needed to
  fill the order. But the higher the participation rate, the less
  realistic the PWP becomes because it ignores market impact.
- The lower the participation rate, the longer it takes to complete,
  and the higher the likelihood that the execution price deviates from
  the arrival price.
- The computation is made difficult by an open-ended completion time -
  which may require loading multiple days' worth of trades.
- We must choose which types of trades to include in our computation
  (block/odd-lot trades and open/close crosses?)


## Implementation
- The `binr` operator is similar to `bin` but picks the "right" bucket
- We `sums` the size so we can find the last trade that takes the
  total volume above our threshold
- We `sums` price*size so we can compute the vwap over the execution
  period
- We pass the list of desired condition codes in
- What happens when the volume for one date is not enough?
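
The steps above can be sketched in plain Python, with `bisect_left` playing the role of `binr` (first cumulative volume at or above the target). The function name `pwp`, the sample trades and the choice to return `None` when there is not enough volume are assumptions of this sketch, not the behavior of the q implementation.

```python
from bisect import bisect_left
from itertools import accumulate

def pwp(prices, sizes, qty, pov):
    """PWP for an order of `qty` shares at participation rate `pov`.

    `prices` and `sizes` are the market trades after the order's start
    time, already filtered to the desired condition codes.
    """
    cum_size = list(accumulate(sizes))
    cum_notional = list(accumulate(p * s for p, s in zip(prices, sizes)))
    target = qty / pov                      # market volume needed to fill qty
    i = bisect_left(cum_size, target)       # first trade reaching the target
    if i == len(cum_size):
        return None                         # not enough volume in this data
    return cum_notional[i] / cum_size[i]    # VWAP up to and including trade i

prices = [10.0, 11.0, 12.0, 13.0]
sizes = [100, 200, 300, 400]
half = pwp(prices, sizes, 100, 0.5)     # needs 200 shares of market volume
full = pwp(prices, sizes, 100, 1.0)     # needs only the first trade
short = pwp(prices, sizes, 1000, 0.5)   # needs 2000 shares, only 1000 traded
```

Lowering the participation rate raises the required market volume, so the interval extends further into the trade list and the benchmark moves with it.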

In [None]:
pwp:{[s;qty;pov;c;tm]
 t:select time+ date,sums size,sums price*size from trade where date="d"$tm,sym=s,time>=tm,cond in c;
 t:update price%size from t t[`size] binr "i"$qty%pov;
 t}
h (pwp;sym;100000;.6;" O";2020.02.03D09:00)
h (pwp;sym;100000;1 .5 .2 .1 .05 .01 .005;" O";2020.02.03D09:00)

# Quote Table
- The `quote` table lists the best bid and offer for each exchange
  (EBBO)
- The bid and ask prices should be ignored when their respective sizes
  are 0
- The number of quotes per day is typically above 600M, but has
  spiked above 2B


In [None]:
h "select from quote where date = last date, sym = last sym"

## Daily plot of quote count leading into COVID-19 outbreak

In [None]:
.pykx.set[`df] .pykx.topd update "p"$date from h "select n:count sym by date from quote where date.month=2020.02m"


In [None]:
%%py
plt.stem(df.index,df.n)
plt.title('Covid-19 Tripled Quote Counts')


# The `asof`  Operator
- The `asof` operator searches a table to find the record matching all
  values of the supplied dictionary (or table) - except the last,
  which is used with the `bin` operator to find the 'most recent'
  record
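
The 'most recent record' lookup can be sketched in plain Python with `bisect_right`, which finds the last entry at or before the requested time. The function name `asof_lookup` and the sample data are hypothetical, and the sketch assumes the rows have already been filtered on the other key columns and sorted by time.

```python
from bisect import bisect_right

def asof_lookup(times, rows, t):
    """Return the most recent row at or before time t, or None if none exists."""
    i = bisect_right(times, t) - 1      # index of last time <= t
    return rows[i] if i >= 0 else None

# Times in seconds since midnight (09:30:00, 09:31:00, 10:00:00, 10:00:10)
times = [34200, 34260, 36000, 36010]
rows = ["q0", "q1", "q2", "q3"]

exact = asof_lookup(times, rows, 36000)     # a quote stamped exactly at 10:00:00
before = asof_lookup(times, rows, 35999)    # falls back to the 09:31:00 quote
too_early = asof_lookup(times, rows, 0)     # nothing has been quoted yet
```

A binary search keeps each lookup logarithmic in the number of quotes, which matters when a single symbol has millions of updates per day.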


## Calling `asof` with a dictionary

In [None]:
quotes:{[dt;syms]select from quote where date=dt,sym in syms}
q:h (quotes;dt;sym)

In [None]:
q asof `date`sym`ex`time!(dt;sym;"N";0D10:00)

## Calling `asof` with a table
- The `asof` operator will also accept a list of dictionaries (i.e.: a
  table)
- The `flip` operator generates a table when supplied a dictionary with
  at least one vector column


In [None]:
show a:k,'q asof k:flip `date`sym`ex`time!(dt;sym;asc distinct q`ex;0D10:00)

# Book Building
- We can build a book of each exchange's bbo
- But this is by no means the full book

In [None]:
d:(select bex:ex,bsize by price:bid from a where bsize>0)
d:d,'(select asize,aex:ex by price:ask from a where asize>0)
`price xdesc d

# NBBO Table
- The `nbbo` table lists the national best bid and offer across all
  exchanges (NBBO)
- The `nbbo` table has fewer records than the `quote` table


In [None]:
.pykx.set[`df] .pykx.topd update "p"$date from h "select n:count sym by date from nbbo where date.month=2020.02m"


## Daily plot of nbbo count leading into COVID-19 outbreak

In [None]:
%%py
plt.stem(df.index,df.n)
plt.title('Covid-19 Tripled NBBO Counts')


# Combining the nbbo quotes with trade prices

In [None]:
nbbos:{[dt;syms]select from nbbo where date=dt,sym in syms}
n:h(nbbos;dt;`BAC)

In [None]:
win:00:00:00 00:20 + 12:00:00
p:select time+date,bid,price:0n,ask from n where time within win,bsize>0,asize>0
p,:select time+date,bid:0n,price,ask:0n from t where time within win
show p:fills `time xasc p
.pykx.set[`df] .pykx.topd p


## Graphing the bid/ask/mid? bounce

In [None]:
%%py
plt.title('Bid/Ask (Mid?) Bounce')
plt.plot(df.time,df.ask,"r-",drawstyle="steps-post",label="ask")
plt.plot(df.time,df.bid,"b-",drawstyle="steps-post",label="bid")
plt.plot(df.time,df.price,"g.",label="trade")
plt.legend(loc='upper center')
plt.xlabel('time')
plt.grid()


# Mid-Price and Spread

$$
P^{\text{mid}} = \frac{P^b + P^a}{2}
$$

$$
S = P^a - P^b
$$

$$
S^{\text{bps}} = 10^4 \times \frac{P^a - P^b}{P^{\text{mid}}}
$$

- Spread quoted in basis points can be compared across securities and
  is not affected by splits
- Ignore quotes with zero bid or ask size
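
These definitions can be sketched for a single quote in plain Python. The function name `quote_stats` is hypothetical; the relative spread is scaled by 10,000 to express it in basis points, and quotes with a zero size on either side are skipped as described above.

```python
def quote_stats(bid, ask, bsize, asize):
    """Return (mid, spread, spread_bps), or None for a one-sided quote."""
    if bsize <= 0 or asize <= 0:
        return None                       # ignore quotes with zero size
    mid = 0.5 * (bid + ask)
    spread = ask - bid
    spread_bps = 1e4 * spread / mid       # one basis point is 1/10,000
    return mid, spread, spread_bps

two_sided = quote_stats(99.5, 100.5, 200, 300)
one_sided = quote_stats(99.5, 100.5, 0, 300)
```

A one dollar spread on a 100 dollar mid is 1% of the price, i.e. 100 basis points, which makes it directly comparable to the spread of a security trading at a very different price level.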

## Computing the mid-price and spread

In [None]:
show n:update mid:.5*bid+ask,sprd:ask-bid from n where bsize>0,asize>0

## Problems with mid-price
- Changes in mid-price are highly autocorrelated (and the mid-price is
  therefore not a martingale).  Trades at the bid are often followed
  by trades at the offer (and vice versa).  This is referred to as the
  **bid-ask bounce** and is caused by impatient traders demanding
  immediate liquidity.
- Changes much slower than the rate of quote updates -- making it a
  low-frequency signal.
- Does not take available liquidity (quote sizes) at the best bid and
  ask into account.


# Imbalance

$$
I = \frac{Q^b}{Q^b+Q^a}
$$

- Find all the changes in mid-price and determine if they are up-ticks
  or down-ticks
- Fill this **backwards** for all imbalance computations
- Compute the probability of an uptick for 'bin'ed values of imbalance
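
The three steps above can be sketched in plain Python. This is an illustration of the idea rather than a line-by-line port of the q cell below; the function names, the bin width and the handling of quotes after the final mid-price change (left unlabeled) are assumptions.

```python
def imbalance(bsize, asize):
    """Bid share of the displayed size at the inside quote."""
    return bsize / (bsize + asize)

def next_moves(mids):
    """Label each quote +1/-1 with the mid-price change at that quote or,
    if the mid did not change there, with the next change that follows."""
    if not mids:
        return []
    moves = [None]                                   # nothing known at the first quote
    for prev, cur in zip(mids, mids[1:]):
        moves.append(1 if cur > prev else -1 if cur < prev else None)
    upcoming = None
    for i in reversed(range(len(moves))):            # fill backwards
        if moves[i] is None:
            moves[i] = upcoming
        else:
            upcoming = moves[i]
    return moves

def uptick_probability(imbalances, moves, width=0.1):
    """Fraction of up-moves per imbalance bin, keyed by bin index."""
    ups, counts = {}, {}
    for imb, move in zip(imbalances, moves):
        if move is None:
            continue                                 # no later change to label with
        b = int(imb // width)
        counts[b] = counts.get(b, 0) + 1
        ups[b] = ups.get(b, 0) + (move == 1)
    return {b: ups[b] / counts[b] for b in counts}

moves = next_moves([10.0, 10.0, 10.5, 10.5, 10.0])
probs = uptick_probability([0.75, 0.75, 0.75, 0.25, 0.25], moves)
```

Filling backwards is what pairs each imbalance observation with a move that had not yet happened when the quote was posted, which is the relationship the probability plot is meant to show.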


## Bid/Ask imbalance

In [None]:
ud:select imbalance:bsize%bsize+asize,ud:reverse fills reverse -1 0N 1@1+signum deltas mid from n

In [None]:
.pykx.set[`ifreq] .pykx.topd -1_update n%sum n from select n:count i by .01 xbar imbalance from ud

In [None]:
%%py
ifreq.plot(drawstyle='steps-mid',title='Imbalance Probability Distribution',ylim=0)
#plt.stem(ifreq.index,ifreq.n)


In [None]:
.pykx.set[`ud] .pykx.topd -1_update down:1-up from select up:avg ud=1 by .01 xbar imbalance from  ud

In [None]:
%%py
ud.plot(drawstyle='steps-mid',title='Probability of an Up/Down Move as a Function of Imbalance',ylim=[0,1])


# Weighted Mid-Price

- The weighted mid-price weights the bid (ask) price by the ask (bid)
  size
- The weighted mid-price incorporates book pressure when tick sizes
  are meaningful

$$
P^{\text{wmid}} = \frac{Q^bP^a + Q^aP^b}{Q^a+Q^b}
$$

- Can also be computed as a function of the imbalance

$$
P^{\text{wmid}} = IP^a + (1-I)P^b
$$
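
The two expressions are algebraically identical, which a small plain-Python sketch can confirm. The function names and the sample quote are hypothetical.

```python
def weighted_mid(bid, ask, bsize, asize):
    """Bid weighted by ask size, ask weighted by bid size."""
    return (bsize * ask + asize * bid) / (asize + bsize)

def weighted_mid_from_imbalance(bid, ask, bsize, asize):
    """Same price, written in terms of the imbalance I = Qb / (Qb + Qa)."""
    i = bsize / (bsize + asize)
    return i * ask + (1.0 - i) * bid

# Heavy bid (300 vs 100): the weighted mid leans toward the ask
direct = weighted_mid(10.0, 11.0, 300, 100)
via_imbalance = weighted_mid_from_imbalance(10.0, 11.0, 300, 100)
balanced = weighted_mid(10.0, 11.0, 200, 200)    # equal sizes give the plain mid
```

With three times as much size on the bid, the imbalance is 0.75 and the weighted mid sits three quarters of the way from the bid to the ask.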


## Computing the weighted mid-price

In [None]:
n:update wmid:(bsize;asize) wavg (ask;bid) from n where bsize>0,asize>0
.pykx.set[`n] .pykx.topd 1!select time+date,ask,wmid,bid from n where time within 10:00 10:03


In [None]:
%%py
n.plot(drawstyle='steps-post',title='Bid/Ask and Weighted Mid-Price')


## Problems with Weighted Mid-Price
- Do you trust mid-prices generated from wide spreads?
- Do you trust trade prices if the bid/ask moves away from last trade
  price?
- How can you incorporate the trade price into the weighted mid-price?
- What good is a weighted mid-price when quotes are 1 lot and a penny
  wide?

- Frequency of updates (any time bid/ask size changes) introduces
  noise in high-frequency volatility estimates ([Gatheral and Oomen
  (2010)][]).
- Unlike the Sasha Stoikov Micro-Price ([Stoikov (2017)][]), it is not
  a martingale

[Gatheral and Oomen (2010)]: https://ssrn.com/abstract=970358 "Gatheral and Oomen (2010)"
[Stoikov (2017)]: https://dx.doi.org/10.2139/ssrn.2970694 "Stoikov (2017)"

## Counter-intuitive examples
- Weighted mid-price changes can be counter-intuitive when a new quote
  narrows the spread or when a cancel widens the spread.



In [None]:
/ changes in spread can result in mid price and wmid price moving in opposite directions
10#select from n where {x|next x} all (wmid;prev mid)>(prev wmid;mid),time> 09:30 


# Time-Weighted Average Price (TWAP)
- Ignores volume and weights each instant of the trading day equally
- Good for indices and other products that don't trade
- To calculate, we need to know how long each quote lasted


## Computing the quote duration

In [None]:
show n:update durn:deltas[first time;time] from n

## Duration weighted averages
- `twap` and `sprd` are duration weighted averages of the *previous*
  values
- Always look back -- never forward -- to prevent look-ahead biases
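
The look-back weighting can be sketched in plain Python: each duration measures how long the *previous* value was in force, so that value is the one it weights. The function name `twap` and the sample series are hypothetical.

```python
def twap(times, values):
    """Duration-weighted average of the value in force before each update."""
    num = den = 0.0
    for i in range(1, len(times)):
        durn = times[i] - times[i - 1]    # how long values[i-1] was in force
        num += durn * values[i - 1]
        den += durn
    return num / den if den > 0 else float("nan")

# Updates at t = 0, 10, 30 and 60 seconds
times = [0, 10, 30, 60]
values = [1.0, 2.0, 3.0, 4.0]
result = twap(times, values)    # (10*1 + 20*2 + 30*3) / 60
```

The final value (4.0) never enters the average because nothing is known yet about how long it will last, which is exactly the look-ahead the bullets warn against.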


In [None]:
show twap:select twap:durn wavg prev wmid,durn wavg prev sprd,nq:count i by 0D00:10 xbar time from n where time within 09:30 16:00

## Intraday plot of spread and quote count


In [None]:
.pykx.set[`df] .pykx.topd twap


In [None]:
%%py
ax = df[['sprd','nq']].plot(drawstyle='steps-post',secondary_y=['nq'],title='Intraday Plot of Spread vs. # of Quotes')
ax.set_ylim(0)
ax.set_ylabel('spread')
ax.right_ax.set_ylabel('quotes')
