In [None]:
# === Environment Setup ===
import os, sys, math, time, random, json, textwrap, warnings
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns

# --- Configuration ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'figure.dpi': 130, 'font.size': 12, 'axes.titlesize': 'x-large',
    'axes.labelsize': 'large', 'xtick.labelsize': 'medium', 'ytick.labelsize': 'medium'})

# --- Utility Functions ---
def note(msg, **kwargs):
    from IPython.display import display, Markdown
    display(Markdown(f"<div class='alert alert-info'>📝 {textwrap.fill(msg, width=100)}</div>"))
def sec(title):
    print(f"\n{100*'='}\n| {title.upper()} |\n{100*'='}")

note("Environment initialized.")

# Part 9: Finance
## Chapter 9.7: High-Frequency Data and Market Microstructure

### Introduction: The World of Ticks and Trades

This notebook introduces **high-frequency data** and **market microstructure**: the study of how transaction costs, information asymmetry, and the detailed mechanics of trading affect asset prices. We move from the daily or monthly data used in traditional asset pricing to the level of individual trades and quotes, which arrive milliseconds or even microseconds apart.

### 1. Handling Tick Data
High-frequency data, often called **tick data**, records every single event in the market. This requires specialized data handling techniques due to its large volume and irregular time intervals.
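
One common way to handle irregular spacing is to map ticks onto a regular time grid by keeping the last observation in each interval and carrying it forward through empty intervals (often called previous-tick sampling). The sketch below illustrates the idea on a handful of made-up mid-prices; the values, the `ticks` and `grid` names, and the 1-second grid are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical irregularly spaced mid-prices (illustrative values only).
ticks = pd.Series(
    [100.00, 100.02, 100.01, 100.05, 100.04],
    index=pd.to_datetime([
        "2023-10-27 09:30:00.120",
        "2023-10-27 09:30:00.450",
        "2023-10-27 09:30:02.010",
        "2023-10-27 09:30:02.900",
        "2023-10-27 09:30:05.300",
    ]),
    name="midprice",
)

# Previous-tick sampling: take the last observation in each 1-second bin,
# then carry it forward through bins in which no tick arrived.
grid = ticks.resample("1s").last().ffill()
print(grid)
```

The result has one row per second with no gaps, which is the form most time-series tools expect. The cost is that several ticks inside one bin collapse to a single value.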

In [None]:
sec("Creating Synthetic Tick Data")
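n_ticks = 10000
np.random.seed(42)  # fixed seed so the synthetic data is reproducible
base_time = pd.to_datetime('2023-10-27 09:30:00')
# Exponential inter-arrival times with a 2-second mean: 10,000 ticks span roughly
# 5.5 hours, long enough for the 5- to 30-minute resampling used later.
time_deltas = np.random.exponential(scale=2.0, size=n_ticks).cumsum()
timestamps = base_time + pd.to_timedelta(time_deltas, unit='s')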

mid_price = 100 + np.random.randn(n_ticks).cumsum() * 0.01
spread = np.random.uniform(0.01, 0.05, n_ticks)
bid = mid_price - spread / 2
ask = mid_price + spread / 2
volume = np.random.randint(100, 1000, n_ticks)

tick_df = pd.DataFrame({'bid': bid, 'ask': ask, 'volume': volume}, index=timestamps)
note("Synthetic tick data created.")
display(tick_df.head())

### 2. The Limit Order Book (LOB) and Key Concepts

Modern electronic markets are organized around a **Limit Order Book (LOB)**. This is a centralized ledger of all outstanding buy (bid) and sell (ask) orders for a given asset, organized by price level. The diagram below shows a snapshot of a typical LOB.

![Limit Order Book](../images/png/limit_order_book.png)

From the LOB, we can derive several key microstructure concepts:

- **Best Bid and Best Ask:** The highest bid price and the lowest ask price available in the market at any given moment.
- **Bid-Ask Spread:** The difference between the best ask and the best bid ($Spread = Ask_{best} - Bid_{best}$). It is a primary measure of market liquidity and a key cost for traders who demand immediate execution by crossing the spread.
- **Market Depth:** The volume of orders available at the best bid and ask, and at other price levels. Deeper markets are more liquid as they can absorb larger trades without a significant price impact.
- **Volume-Weighted Average Price (VWAP):** The average price of a stock over a given time period, weighted by the volume of trades at each price. It is often used as a benchmark for execution quality: a buy filled below the period's VWAP (or a sell filled above it) is generally regarded as a favorable execution.
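
As a small worked example of the definitions above, the following sketch computes the best bid and ask, the spread, the depth at the best prices, and a VWAP. The book levels and trades are hypothetical values invented for illustration, and the column names (`price`, `qty`) are assumptions.

```python
import pandas as pd

# Hypothetical book snapshot: one row per price level (made-up values).
bids = pd.DataFrame({"price": [99.98, 99.97, 99.96], "qty": [500, 300, 800]})
asks = pd.DataFrame({"price": [100.01, 100.02, 100.03], "qty": [400, 600, 200]})

best_bid = bids["price"].max()   # highest price a buyer will pay
best_ask = asks["price"].min()   # lowest price a seller will accept
spread = best_ask - best_bid
mid = (best_ask + best_bid) / 2

# Depth at the touch: resting quantity at the best bid plus the best ask.
depth_at_touch = (
    bids.loc[bids["price"] == best_bid, "qty"].sum()
    + asks.loc[asks["price"] == best_ask, "qty"].sum()
)

# Hypothetical executed trades used for the VWAP benchmark.
trades = pd.DataFrame({"price": [100.00, 100.01, 99.99], "qty": [200, 100, 700]})
vwap = (trades["price"] * trades["qty"]).sum() / trades["qty"].sum()

print(f"spread={spread:.2f}  mid={mid:.3f}  depth={depth_at_touch}  vwap={vwap:.4f}")
```

Because the largest trade printed at 99.99, the VWAP sits below the simple average of the three trade prices.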

In [None]:
sec("Calculating Microstructure Metrics")
tick_df['spread'] = tick_df['ask'] - tick_df['bid']
tick_df['midprice'] = (tick_df['ask'] + tick_df['bid']) / 2

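# Calculate VWAP over 5-minute intervals. The synthetic data has no trade prices,
# so the mid-price stands in for the trade price.
# Group ticks into 5-minute buckets and broadcast each bucket's VWAP back to its ticks.
# Assigning a resampled series directly would not align with the tick index and would yield NaN.
bucket = tick_df.index.floor('5min')
notional = (tick_df['midprice'] * tick_df['volume']).groupby(bucket).transform('sum')
tick_df['vwap'] = notional / tick_df['volume'].groupby(bucket).transform('sum')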

note("Calculated spread and VWAP.")
display(tick_df[['midprice', 'vwap', 'spread']].head())

sec("Visualizing Midprice vs. VWAP")
plt.figure(figsize=(14, 7))
plt.plot(tick_df.index, tick_df['midprice'], label='Mid-Price (Tick-by-Tick)', alpha=0.6)
plt.plot(tick_df.index, tick_df['vwap'], label='VWAP (5-Minute)', color='red', linestyle='--', lw=2.5)
plt.title('High-Frequency Mid-Price vs. 5-Minute VWAP')
plt.xlabel('Time')
plt.ylabel('Price')
plt.legend()
plt.show()

### 3. Realized Volatility and Microstructure Noise

One of the most powerful applications of high-frequency data is the ability to compute a precise, model-free measure of volatility. While GARCH models *estimate* conditional volatility based on daily data, **realized volatility** *measures* the actual volatility that occurred over a period (e.g., a day) by summing the squared high-frequency returns.

If we have intraday returns $r_{t,j}$ (e.g., from 5-minute intervals), the realized variance for day $t$ is:
$$ RV_t = \sum_{j=1}^{M} r_{t,j}^2 $$
The annualized realized volatility is then $\sqrt{252 \cdot RV_t}$. This provides a much more accurate measure of daily volatility than can be obtained from daily data alone.
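
The arithmetic can be sketched in a few lines. The returns below are simulated i.i.d. normal draws, and the per-bar standard deviation `bar_vol` is an assumed value chosen only so that the result can be compared with a known population figure.

```python
import numpy as np

# Hypothetical 5-minute returns for one trading day: 78 bars in a 6.5-hour
# session, drawn i.i.d. normal purely for illustration.
rng = np.random.default_rng(0)
bar_vol = 0.001                       # assumed per-bar standard deviation
r = rng.normal(0.0, bar_vol, size=78)

rv = np.sum(r ** 2)                   # realized variance RV_t
ann_vol = np.sqrt(252 * rv)           # annualized realized volatility

# Under these assumptions the population value is sqrt(252 * 78) * bar_vol.
population = np.sqrt(252 * 78) * bar_vol
print(f"RV_t = {rv:.3e}, annualized vol = {ann_vol:.4f}, population = {population:.4f}")
```

With only 78 returns the estimate fluctuates around the population value from one simulated day to the next, which is why finer sampling is attractive in the first place.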

**The Challenge of Microstructure Noise:**
In theory, the more frequently we sample returns (e.g., every second), the more accurate our realized volatility measure should be. In practice, this is not true. At very high frequencies, the observed prices are contaminated by **market microstructure noise**, such as the price bouncing between the bid and ask prices. This noise adds variance that is not part of the true price process.

This leads to a trade-off: sampling too infrequently discards most of the data and leaves a noisy estimate of the true price variation, while sampling too frequently contaminates the measure with microstructure noise. The **volatility signature plot** is the standard diagnostic tool for visualizing this trade-off and choosing an appropriate sampling frequency. It plots the average realized volatility against the sampling interval.
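
The noise effect can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the efficient log price is a Gaussian random walk, the observed price adds a bid-ask bounce whose sign is drawn independently at each observation, and the parameter values (`sigma_sec`, `half_spread`) are invented for illustration rather than calibrated to any market.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 23_400             # one observation per second over a 6.5-hour session
sigma_sec = 0.0001     # assumed per-second volatility of the efficient log price
half_spread = 0.0002   # assumed half-spread in log-price units

# Efficient log price: a random walk. Observed price: efficient price plus
# a bounce of +/- half_spread with a randomly drawn sign at each observation.
efficient = np.cumsum(rng.normal(0.0, sigma_sec, n))
observed = efficient + half_spread * rng.choice([-1.0, 1.0], size=n)

def realized_var(log_price, step):
    """Sum of squared log returns sampled every `step` observations."""
    return np.sum(np.diff(log_price[::step]) ** 2)

true_var = n * sigma_sec ** 2   # variance of the efficient price over the session
for step in (1, 5, 30, 300, 1800):
    ratio = realized_var(observed, step) / true_var
    print(f"{step:>5d}s  RV / true variance = {ratio:6.2f}")
```

In this setup each sampled return carries the bounce variance regardless of the interval, so the noise contribution grows with the number of returns. The ratio should therefore be far above 1 at 1-second sampling and approach 1 as the interval lengthens, at the cost of a more erratic estimate.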

In [None]:
sec("Calculating Realized Volatility and Signature Plot")

def calculate_realized_vol(data, freq):
    """Average annualized realized volatility for a given sampling frequency."""
    # Previous-tick sampling: last price per bin, carried forward through empty bins
    prices = data['midprice'].resample(freq).last().ffill()
    returns = prices.pct_change().dropna()
    rv_daily = (returns**2).resample('D').sum()
    # Average daily realized variance, annualized and converted to volatility
    return np.sqrt(rv_daily.mean() * 252)

# Calculate RV for a range of frequencies (10-60 seconds, then 2-30 minutes).
# Lowercase 's' and 'min' replace the deprecated 'S' and 'T' aliases.
frequencies = [f'{s}s' for s in range(10, 61, 5)] + [f'{m}min' for m in range(2, 31, 2)]
volatilities = [calculate_realized_vol(tick_df, freq) for freq in frequencies]

note("Realized volatility calculated across the sampling frequencies.")
plt.figure(figsize=(14, 7))
plt.plot([pd.to_timedelta(f).total_seconds() for f in frequencies], volatilities, marker='o')
plt.title('Volatility Signature Plot')
plt.xlabel('Sampling Interval (seconds)')
plt.ylabel('Average Annualized Realized Volatility')
plt.xscale('log')
plt.show()

note("In real data the signature plot typically shows measured volatility rising sharply at the highest sampling frequencies (left) because of microstructure noise, then flattening as the interval lengthens and the noise averages out. At long intervals (right) few returns enter each estimate, so it becomes erratic. A common rule of thumb is to pick the shortest interval at which the curve has flattened, often around 5 minutes for liquid stocks. The synthetic mid-price here is a pure random walk with no bid-ask bounce, so this particular plot should look roughly flat.")

### 4. Order Flow Imbalance (OFI)

A more modern microstructure concept is **Order Flow Imbalance (OFI)**. It measures the net pressure on the bid and ask sides of the book, capturing the intensity of buying versus selling interest. A simplified version can be calculated based on changes in the best bid and ask prices and sizes.

$$ OFI_t = I_{B,t} - I_{A,t} $$
where $I_{B,t}$ captures buying pressure at the best bid and $I_{A,t}$ captures selling pressure at the best ask. With $p$ denoting the best price on each side and $q$ the size quoted there:
- $I_{B,t} = \Delta q_{B,t}$ if $\Delta p_{B,t} \ge 0$, and $0$ otherwise (the bid size changes while the bid price does not fall)
- $I_{A,t} = \Delta q_{A,t}$ if $\Delta p_{A,t} \le 0$, and $0$ otherwise (the ask size changes while the ask price does not rise)

A positive OFI indicates net buying pressure, and empirically it tends to coincide with short-term price increases.
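
The simplified rule above can be traced by hand on a few quote updates. The sketch below uses made-up best quotes with separate bid and ask sizes; the column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical best-quote updates with separate bid and ask sizes (made-up values).
q = pd.DataFrame({
    "bid":      [99.98, 99.98, 99.98, 99.97, 99.97],
    "bid_size": [500, 700, 700, 300, 450],
    "ask":      [100.01, 100.01, 100.01, 100.01, 100.02],
    "ask_size": [400, 400, 650, 650, 200],
})

d = q.diff()  # the first row is NaN, so it contributes zero below

# I_B: change in bid size when the bid price did not fall, otherwise 0.
i_b = np.where(d["bid"] >= 0, d["bid_size"], 0.0)
# I_A: change in ask size when the ask price did not rise, otherwise 0.
i_a = np.where(d["ask"] <= 0, d["ask_size"], 0.0)

q["OFI"] = i_b - i_a
print(q)
```

Row by row: size added at an unchanged bid gives a positive OFI, size added at an unchanged ask gives a negative OFI, and a falling bid or rising ask switches the corresponding term off. Note that this simplified rule compares sizes across different price levels when a quote moves, which is one reason fuller OFI definitions treat price changes separately.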

In [None]:
sec("Calculating Order Flow Imbalance")
df = tick_df.copy()
# The synthetic data has no separate bid and ask sizes, so total tick volume
# stands in for both. A real OFI needs the sizes quoted at the best bid and ask.
df['delta_bid_price'] = df['bid'].diff()
df['delta_ask_price'] = df['ask'].diff()
df['delta_bid_size'] = df['volume'].diff()  # proxy for the change in bid size
df['delta_ask_size'] = df['volume'].diff()  # proxy for the change in ask size

I_B = np.where(df['delta_bid_price'] >= 0, df['delta_bid_size'], 0)
I_A = np.where(df['delta_ask_price'] <= 0, df['delta_ask_size'], 0)

df['OFI'] = I_B - I_A

# Plot cumulative OFI against price
df['cumulative_OFI'] = df['OFI'].cumsum().fillna(0)

fig, ax1 = plt.subplots(figsize=(14, 8))
ax1.plot(df.index, df['midprice'], 'b-', label='Mid-Price')
ax1.set_xlabel('Time')
ax1.set_ylabel('Price', color='b')
ax1.tick_params('y', colors='b')

ax2 = ax1.twinx()
ax2.plot(df.index, df['cumulative_OFI'], 'r-', alpha=0.6, label='Cumulative OFI')
ax2.set_ylabel('Cumulative Order Flow Imbalance', color='r')
ax2.tick_params('y', colors='r')

fig.legend(loc='upper left', bbox_to_anchor=(0.1, 0.9))
plt.title('Mid-Price vs. Cumulative Order Flow Imbalance')
plt.show()
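note("With real order book data, cumulative OFI tends to track the mid-price, reflecting how net buying pressure moves prices. In this synthetic example the sizes are a random volume proxy generated independently of the price, so any co-movement in the plot is coincidental.")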