# Working with microsecond time-series

This notebook covers the key concepts you will need for your home assignment on high-frequency time series from financial markets.

For this, we are going to work with the following dataset.

## TRD Data: Structure, Location, and Interpretation

This section summarizes **only the trade (TRD)** files from the dataset, how they are organized on disk, how they should be interpreted, and the key rules for cleaning and processing them.

---

## 1. Where the TRD Files Live (Directory Layout)

Your working directory contains a folder called `DATA/`.
Inside it, there is **one subfolder per venue per trading date**, for example:

- `BME_2025-11-07/`
- `TURQUOISE_2025-11-07/`
- `CBOE_2025-11-07/`
- `AQUIS_2025-11-07/`

Each of these venue–date folders contains gzip-compressed, semicolon-separated CSV files of three types:

- `QTE_*.csv.gz` — order book snapshots
- `TRD_*.csv.gz` — **trades**
- `STS_*.csv.gz` — market trading status

For the purposes of this document, only the `TRD_*.csv.gz` files matter.

### File Naming Pattern

All files follow:

```
<type>_<session>_<isin>_<ticker>_<mic>_<part>.csv.gz
```

Where:

- **type**: `TRD`
- **session**: trading date (`YYYY-MM-DD`)
- **isin**: cross-venue ISIN
- **ticker**: venue-specific symbol (multiple books for the same ISIN may exist)
- **mic**: segment MIC (e.g., `XMAD`)
- **part**: integer; if multiple `part` values exist for the same identity, **only the highest part number must be processed**, and you should emit a warning.
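
The naming rule above can be sketched in code. This is a minimal illustration, not a reference implementation: the regular expression `TRD_NAME` and the helper `select_latest_parts` are hypothetical names introduced here, the pattern assumes a 12-character ISIN and a 4-character MIC, and the `_2` file in the sample list is made up to trigger the "highest part wins" branch.

```python
import re
import warnings
from pathlib import Path

# Assumed pattern, derived from <type>_<session>_<isin>_<ticker>_<mic>_<part>.csv.gz
TRD_NAME = re.compile(
    r"^(?P<type>[A-Z]+)_(?P<session>\d{4}-\d{2}-\d{2})_(?P<isin>[A-Z0-9]{12})_"
    r"(?P<ticker>.+)_(?P<mic>[A-Z0-9]{4})_(?P<part>\d+)\.csv\.gz$"
)

def select_latest_parts(paths):
    """Keep one TRD file per (session, isin, mic, ticker): the highest part."""
    best = {}
    for p in paths:
        m = TRD_NAME.match(Path(p).name)
        if m is None or m["type"] != "TRD":
            continue  # skip QTE/STS files and anything that does not parse
        key = (m["session"], m["isin"], m["mic"], m["ticker"])
        part = int(m["part"])
        if key in best:
            warnings.warn(f"Multiple parts found for {key}; keeping the highest.")
            if part <= best[key][0]:
                continue
        best[key] = (part, p)
    return [p for _, p in best.values()]

files = [
    "TRD_2025-11-07_ES0113900J37_SAN_XMAD_1.csv.gz",
    "TRD_2025-11-07_ES0113900J37_SAN_XMAD_2.csv.gz",   # hypothetical second part
    "TRD_2025-11-07_ES0113900J37_SANe_CEUX_1.csv.gz",
    "QTE_2025-11-07_ES0113900J37_SAN_XMAD_1.csv.gz",
]
selected = select_latest_parts(files)
print(selected)
```

The ticker group is deliberately permissive (`.+`) because tickers are venue-specific; the fixed-width MIC and numeric part at the end are what anchor the match.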

---

## 2. Identity & Join Keys (Relevant for Trades)

A single order book identity is defined as:

```
(session, isin, mic, ticker)
```

All TRD files belonging to this identity correspond to trades occurring on the same order book.

---

## 3. Time Semantics for TRD

All timestamps use:

- **epoch** = microseconds since Unix epoch (UTC)
- Feeds are **perfectly synchronized** across file types (no skew adjustments)

When multiple trades share the same epoch:

- TRD rows are ordered by **ascending sequence** (lowest sequence = earliest event at that timestamp)
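
As a small sketch of these time semantics, the snippet below orders a few made-up trades by `(epoch, sequence)` and converts the microsecond epoch into a timezone-aware UTC timestamp. The values in the frame are invented for illustration only.

```python
import pandas as pd

# Made-up rows: two trades share the same epoch and differ only by sequence
trades = pd.DataFrame({
    "epoch": [1762510800000002, 1762510800000001, 1762510800000001],
    "sequence": [7, 6, 5],
    "px": [9.10, 9.11, 9.12],
})

# Order by time first, then by sequence to break ties within a microsecond
ordered = trades.sort_values(["epoch", "sequence"]).reset_index(drop=True)

# epoch is microseconds since the Unix epoch, so unit="us"; utc=True keeps the zone explicit
ordered["ts_utc"] = pd.to_datetime(ordered["epoch"], unit="us", utc=True)
print(ordered)
```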

---

## 4. TRD (Trade) Files

### Columns

- session
- inst_id
- sequence
- isin
- ticker
- mic
- currency
- epoch
- auction_on_demand_mic
- aggressor_side
- trade_id
- mmt_flags
- mmt_flags_version
- px
- qty
- securityTradingId
- bloombergTicker
- lastTradePriceCurrency

### Column Semantics

- **aggressor_side**
  - `1` = buy-initiated trade
  - `2` = sell-initiated
  - empty = unknown

- **px, qty**: Trade price and quantity.

- **mmt_flags**: The first two characters encode the most important trade attributes:
  - **First character** (Market Model) examples: `1=CLOB`, `3=Dark`, `4=Off Book`, `5=Periodic Auction`, etc.
  - **Second character** (Trade Mode) examples: `2=Continuous`, `K=Opening Auction`, `I=Closing Auction`, `4=Out of Main Session`, etc.

- **Off-book trades**: Valid TRD entries may include off-book prints. They **do not influence** addressability (that comes from QTE and STS only).
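
To make the `mmt_flags` encoding concrete, here is a hedged sketch that decodes the first two characters. The lookup tables contain only the example codes listed above (they are not the full MMT standard), and the flag strings in the sample frame are made up. When loading real files, it is probably safer to read `mmt_flags` as text (for example with `dtype={'mmt_flags': str}`) so that purely numeric codes are not parsed as numbers.

```python
import pandas as pd

# Partial lookup tables: only the example codes listed in this section
MARKET_MODEL = {"1": "CLOB", "3": "Dark", "4": "Off Book", "5": "Periodic Auction"}
TRADE_MODE = {
    "2": "Continuous",
    "K": "Opening Auction",
    "I": "Closing Auction",
    "4": "Out of Main Session",
}

# Made-up flag strings, including a short one and a missing one
trades = pd.DataFrame({"mmt_flags": ["12-------", "1K", "4", None]})

# .str[i] returns NaN when the string is missing or too short
trades["market_model"] = trades["mmt_flags"].str[0].map(MARKET_MODEL)
trades["trade_mode"] = trades["mmt_flags"].str[1].map(TRADE_MODE)
print(trades)
```

Codes that are not in the tables decode to NaN, which makes unknown values easy to spot before filtering (for example, keeping only `CLOB` / `Continuous` trades).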

In [None]:
import pandas as pd
import numpy as np

# --- 1. Load the Data (Using your existing files) ---
# We assume the files are in the folder structure you provided
path_xmad = 'DATA_SMALL/BME_2025-11-07/TRD_2025-11-07_ES0113900J37_SAN_XMAD_1.csv.gz'
path_ceux = 'DATA_SMALL/CBOE_2025-11-07/TRD_2025-11-07_ES0113900J37_SANe_CEUX_1.csv.gz'

# Load BME (Madrid)
df_xmad = pd.read_csv(path_xmad, sep=';', compression='gzip')

# Load CBOE (Europe)
df_ceux = pd.read_csv(path_ceux, sep=';', compression='gzip')

print(f"Loaded {len(df_xmad)} rows from XMAD")
print(f"Loaded {len(df_ceux)} rows from CEUX")

## Preparing the data

Let's prepare the data step by step; later we will encapsulate the steps in a function.

In [None]:
df = df_xmad.copy()
display(df.head())

In [None]:
# 1. Sort strictly
df = df.sort_values(by=['epoch', 'sequence'], ascending=[True, True])

# We can see that many trades happen at exactly the same epoch!
# This is common on a stock exchange: when an aggressive order executes against many passive orders in the order book,
# at the same or different price levels, the exchange reports one transaction at a time, all with the same timestamp.
display(df.head())

In [None]:
# It is a good idea to have an index without duplicates. Otherwise, many later transformations will raise an error
# because they require unique index values (e.g. pivot or reindex)
duplicates = df.duplicated(subset='epoch', keep=False).sum()
print(f"Found {duplicates} trades sharing the same microsecond.")

### How to treat a timeseries with events happening at exactly the same time

We need a strategy for handling these collisions. The simplest strategy is often the best, but depending on the requirements it is not always acceptable.

In [None]:
# Option 1: Good for order book snapshots, but not valid for trades because you lose them.
# For us, order book snapshots that happen within the same microsecond are virtually non-existent: the only one we
# can observe is the last order book of that microsecond. Therefore, for order book snapshots it is valid to keep only the last.
df_clean = df.drop_duplicates(subset='epoch', keep='last').copy()
display(df_clean.head())

In [None]:
# Option 2 (Intermediate): the nanoseconds trick
temp_ts = pd.to_datetime(df['epoch'], unit='us')
display(temp_ts)

In [None]:
# groupby().cumcount() numbers the rows within each group: 0, 1, 2, 3...
# We group by 'epoch' to find trades happening at the same time,
# and we treat that count as nanoseconds.

# In other words, for every trade we count how many earlier trades share its epoch; this counter lets us separate them.
offset_ns = df.groupby('epoch').cumcount()
display(offset_ns)

In [None]:
# We check that no more than 1000 executions share the same microsecond.
# Offsets start at 0, so the maximum must stay at or below 999: an offset of 1000 ns
# would push that trade into the next microsecond.
offset_ns.max()

In [None]:
# Create the final High-Res Timestamp
# Base Time (us) + Offset (ns)
df['ts'] = temp_ts + pd.to_timedelta(offset_ns, unit='ns')
df.set_index('ts', inplace=True)
display(df.head())

In [None]:
# Option 3 (Advanced): Set a MultiIndex of (ts, sequence)
# We work on a copy with a fresh index so this cell does not depend on the Option 2 cell above.
# 'epoch' stays as a column too, just in case we need the raw int later.
df_multi = df.reset_index(drop=True)
df_multi['ts'] = pd.to_datetime(df_multi['epoch'], unit='us')
df_multi.set_index(['ts', 'sequence'], inplace=True)
display(df_multi.head())

# The problem with this method is that it prevents us from using merge_asof directly on the index; merge_asof works
# best with a single, sorted time key. In real life you would not pivot the data at all: you would process the full
# stream of data tick by tick. That is much more complex, so for this lecture we use a simpler approach that is
# sufficiently valid for our use case, which does not require the utmost precision.

# The simpler approach will also be useful for your homework.

In [None]:
# For simplicity, and to be able to use pivot and merge_asof, we will go for Option 2.
# Here is the whole process encapsulated
def clean_hft_data_preserve_all(df):
    """
    1. Sorts by epoch and sequence.
    2. Converts epoch (microseconds) to a base timestamp.
    3. Adds a nanosecond offset to trades sharing an epoch to make time unique.
    4. Sets the unique timestamp as the index.
    """
    # 1. Sort strictly (sort_values returns a new frame, so the caller's df is not modified)
    df = df.sort_values(by=['epoch', 'sequence'], ascending=[True, True])

    # 2. Convert to a base timestamp (microsecond resolution)
    temp_ts = pd.to_datetime(df['epoch'], unit='us')

    # 3. The "nanosecond trick"
    # groupby().cumcount() numbers the rows within each group: 0, 1, 2, 3...
    # Grouping by 'epoch' numbers the trades that share a timestamp,
    # and we treat that number as an offset in nanoseconds.
    offset_ns = df.groupby('epoch').cumcount()

    # Offsets 0..999 fit inside one microsecond; an offset of 1000 would spill into the next one
    max_offset = offset_ns.max()
    if max_offset >= 1000:
        raise ValueError(
            f"More than 1000 executions share the same microsecond "
            f"(max executions per epoch: {max_offset + 1})"
        )

    # 4. Create the final high-resolution timestamp: base time (us) + offset (ns)
    df['ts'] = temp_ts + pd.to_timedelta(offset_ns, unit='ns')

    # 5. Set the index
    df.set_index('ts', inplace=True)

    return df

# Apply the new logic
print("--- Processing XMAD (Preserving All) ---")
df_xmad_full = clean_hft_data_preserve_all(df_xmad)

print("\n--- Processing CEUX (Preserving All) ---")
df_ceux_full = clean_hft_data_preserve_all(df_ceux)

# Verify Uniqueness
print(f"\nIs XMAD index unique now? {df_xmad_full.index.is_unique}")

# Let's inspect a "Burst" (Trades that happened at the same microsecond)
# We look for where the nanosecond part is not zero
bursts = df_xmad_full[df_xmad_full.index.nanosecond > 0]

if not bursts.empty:
    print("\nExample of a resolved collision (Look at the timestamps!):")
    # Show the base trade (ns=0) and the offset trade (ns=1)
    t_base = bursts.index[0].replace(nanosecond=0)
    print(df_xmad_full.loc[t_base : t_base + pd.Timedelta(nanoseconds=5), ['px', 'qty']])
else:
    print("No collisions found in this slice, but the code is ready for them.")

### Building the Consolidated Tape

Now that we have cleaned the data for both venues (Madrid and CBOE Europe), we need to combine them into a single view of the market. This is often called a **Consolidated Tape**.

**The Challenge: Asynchronous Data**

Financial markets are *asynchronous*. A trade happens on BME at `t=100`, and a trade happens on CBOE at `t=105`. Their timestamps rarely align exactly.

If we simply join them on the timestamp, we get gaps (NaNs) wherever only one venue traded. To solve this, we use the **Last Traded Price (LTP)** logic:

> *"If a trade didn't happen on this venue at this exact microsecond, the valid price is still the price of the **last** trade that did happen."*

**The Solution:**

1. **Concat & Sort:** Put all trades in one long list sorted by time.
2. **Pivot:** Create columns for each venue (`XMAD`, `CEUX`).
3. **Forward Fill (`ffill`):** This is the magic step. It propagates the last known valid observation forward to the next valid index.
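
The three steps can be sketched on a toy example before applying them to the real data. The prices and timestamps below are made up; only the column names `mic` and `px` come from the TRD files.

```python
import pandas as pd

# Made-up trades: XMAD trades at t=100us and t=300us, CEUX trades at t=200us
xmad = pd.DataFrame(
    {"mic": "XMAD", "px": [9.10, 9.12]},
    index=pd.to_datetime(["2025-11-07 10:00:00.000100", "2025-11-07 10:00:00.000300"]),
)
ceux = pd.DataFrame(
    {"mic": "CEUX", "px": [9.11]},
    index=pd.to_datetime(["2025-11-07 10:00:00.000200"]),
)

# 1. Concat & sort: one long list of trades ordered by time
tape = pd.concat([xmad, ceux]).sort_index()

# 2. Pivot: one price column per venue, NaN where that venue did not trade
wide = tape.pivot(columns="mic", values="px")

# 3. Forward fill: carry each venue's last traded price forward
filled = wide.ffill()
print(filled)
```

Note that `ffill` cannot invent a price before a venue's first trade, so the `CEUX` column is still NaN in the first row of this toy tape.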

In [None]:
# 1. Combine
tape_full = pd.concat([df_xmad_full, df_ceux_full])
tape_full.sort_index(inplace=True)

# 2. Pivot
# Each (timestamp, mic) pair is now unique, so pivot can reshape without failing on duplicate entries
tape_pivot = tape_full.pivot(columns='mic', values='px')

# 3. Fill Forward
# This propagates each venue's last traded price forward; rows before a venue's first trade stay NaN
consolidated_tape = tape_pivot.ffill()

print(f"Total Updates in Tape: {len(consolidated_tape)}")
print(consolidated_tape.head())

In [None]:
# Instant Vectorized Calculation
consolidated_tape['spread'] = consolidated_tape['XMAD'] - consolidated_tape['CEUX']
display(consolidated_tape)

# Visualize the discrepancy
consolidated_tape['spread'].plot(title="Price Difference (XMAD - CEUX) in Euros", figsize=(10, 4))

### Merging Low-Frequency Data (News) with High-Frequency Data (Trades)

We often need to enrich our trade data with external events (News, Macro announcements, Tweets). However, we face a major problem: **Timestamp Mismatch**.

- **News arrives at:** `11:40:00.000000`
- **Trade happens at:** `11:40:00.005231`

A standard SQL-style `LEFT JOIN` or pandas `merge` requires an **exact match** on the key. Since the timestamps rarely match down to the microsecond, a standard merge would leave almost every trade without a matching news row.

**The Solution: `merge_asof`**

Pandas `merge_asof` allows for **inexact matches** on a sorted key. It tells the computer:

> *"For every trade, look **backwards** in time and find the most recent news headline that appeared at or before this trade."*

- **`direction='backward'`**: Ensures we avoid "look-ahead bias." We only tag a trade with news that was already public when the trade occurred.

In [None]:
# 1. Create the News Dataframe
news_data = pd.DataFrame([
    {'ts': '2025-11-07 11:00:00', 'headline': 'Market Open', 'sentiment': 'NEUTRAL'},
    {'ts': '2025-11-07 11:40:00', 'headline': 'RUMOR: Acquisition Talks', 'sentiment': 'POSITIVE'},
    {'ts': '2025-11-07 12:10:00', 'headline': 'CORRECTION: Rumor Denied', 'sentiment': 'NEGATIVE'}
])

# Convert news time to datetime
news_data['ts'] = pd.to_datetime(news_data['ts'])
news_data.sort_values('ts', inplace=True)
display(news_data)

In [None]:
# 2. Prepare Trade Data
# Use the cleaned XMAD data with unique timestamps
trades_for_merge = df_xmad_full[['px', 'qty']].copy()
trades_for_merge.reset_index(inplace=True)

# 3. Use merge_asof
# This is the key function for time-aware inexact joins
enriched_trades = pd.merge_asof(
    trades_for_merge,
    news_data,
    on='ts',
    direction='backward'  # Look backwards in time to find the most recent news
)

print("Enriched Trades (Sample):")
display(enriched_trades.sample(10))

In [None]:
# Analysis: What was the average trade price during each "news regime"?
# Any trades before the first headline have no sentiment (NaN) and are excluded by groupby
print("\nAverage Price by News Sentiment:")
print(enriched_trades.groupby('sentiment')['px'].mean())