# Lab Exercise: High-Frequency Arbitrage in Fragmented Markets

**Deadline:** 9th of December 23:59 CET

**Submission:** Email to francisco.merlos@six-group.com with title: "Arbitrage study in BME | Your name"

In [1]:
# Import necessary libraries for data processing and analysis
import pandas as pd
import numpy as np
import os
from pathlib import Path
from glob import glob
import warnings
warnings.filterwarnings('ignore')

# Set pandas display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)


## 1. Context: The Fragmented Market

In modern European equity markets, liquidity is **fragmented**. The same stock (ISIN) trades simultaneously on the primary exchange (BME) and various Multilateral Trading Facilities (MTFs) like CBOE, Turquoise, and Aquis.

Due to this fragmentation, temporary price discrepancies occur. A stock might be offered for sale at €10.00 on Turquoise while a buyer is bidding €10.01 on BME. A High-Frequency Trader (HFT) can profit from this by buying on Turquoise and selling on BME almost simultaneously, earning €0.01 per share before fees.

However, these opportunities are fleeting. The "theoretical" profit you see in a snapshot might disappear by the time your order reaches the exchange due to **latency**.
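
As a minimal sketch of the price discrepancy described above, the snippet below checks whether the best bid on one venue exceeds the best ask on another. The function name `best_cross_venue_arbitrage`, the dictionary layout, and the quantities are illustrative assumptions; only the €10.00 and €10.01 prices come from the example in the text.

```python
from typing import Dict, Optional, Tuple

# Hypothetical top-of-book layout: venue -> (price, quantity)
Book = Dict[str, Tuple[float, int]]

def best_cross_venue_arbitrage(bids: Book, asks: Book) -> Optional[dict]:
    """Return the most profitable buy/sell pair across two different venues, or None."""
    best = None
    for sell_venue, (bid_px, bid_qty) in bids.items():
        for buy_venue, (ask_px, ask_qty) in asks.items():
            if sell_venue == buy_venue:
                continue  # a single book is never crossed in continuous trading
            edge = bid_px - ask_px  # profit per share
            if edge <= 0:
                continue
            qty = min(bid_qty, ask_qty)  # only the overlapping volume is tradable
            profit = edge * qty
            if best is None or profit > best["profit"]:
                best = {
                    "buy_venue": buy_venue,
                    "sell_venue": sell_venue,
                    "qty": qty,
                    "profit": profit,
                }
    return best

# Prices from the example in the text; quantities are made up
bids = {"BME": (10.01, 100), "TURQUOISE": (9.99, 300)}
asks = {"BME": (10.02, 200), "TURQUOISE": (10.00, 250)}
opportunity = best_cross_venue_arbitrage(bids, asks)
print(opportunity)
```

The tradable quantity is capped by the smaller of the two resting volumes, which is why the theoretical profit depends on `qty_bid_0` and `qty_ask_0` as well as on prices.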

### The Mission

You have been hired as a Quantitative Researcher at a proprietary trading firm. Your boss has given you a dataset of high-resolution market data and asked you to answer three critical questions:

1. **Do arbitrage opportunities still exist in Spanish equities?**
2. **What is the maximum theoretical profit** (assuming 0 latency)?
3. **The "Latency Decay" Curve:** How quickly does this profit vanish as our trading system gets slower (from 0µs to 100ms)?
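
One possible way to frame the latency decay question is sketched below. It assumes a hypothetical consolidated DataFrame `signal` with an `epoch` column (microseconds, sorted) and a `profit` column holding the theoretical profit visible in each snapshot (0 when there is none). An opportunity detected at time `t` is valued at the profit still visible in the last snapshot at or before `t + latency`. This is a simplification rather than the required method: every profitable snapshot counts as a separate opportunity here, which a real study would probably need to deduplicate.

```python
import numpy as np
import pandas as pd

def profit_at_latency(signal: pd.DataFrame, latency_us: int) -> float:
    """Sum the profit still available `latency_us` microseconds after each detection."""
    epochs = signal["epoch"].to_numpy()
    profit = signal["profit"].to_numpy()
    detected = epochs[profit > 0]  # snapshots where an opportunity is visible
    # Index of the last snapshot at or before (detection time + latency)
    idx = np.searchsorted(epochs, detected + latency_us, side="right") - 1
    return float(profit[idx].sum())

# Toy data: one opportunity worth 5.0 that is visible from t=100µs until t=200µs
signal = pd.DataFrame({
    "epoch": [0, 100, 200, 1000],
    "profit": [0.0, 5.0, 0.0, 0.0],
})
decay = {lat: profit_at_latency(signal, lat) for lat in (0, 50, 100, 1000)}
print(decay)
```

In the toy data the opportunity survives a 50µs delay but is gone at 100µs, which is the shape of result the latency decay curve is meant to expose.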

## 2. Data Specifications

You are provided with a `DATA_BIG/` folder containing subfolders for specific trading dates. Inside, you will find three types of compressed CSV files for various instruments.

**Note:** You can also find a `DATA_SMALL` folder that you can use to test quickly without needing to run the simulation over all the data.

### File Naming Convention

The naming pattern for all three file types (QTE, STS, TRD) is:

```
<type>_<session>_<isin>_<ticker>_<mic>_<part>.csv.gz
```

| Field | Description |
|-------|-------------|
| **type** | QTE, TRD, or STS |
| **session** | Trading date (YYYY-MM-DD) |
| **isin** | Cross-venue **ISIN** (International Securities Identification Number) |
| **ticker** | Venue-specific trading symbol (distinguishes multiple books for the same ISIN on the same MIC) |
| **mic** | Market Identifier Code (MIC, e.g., XMAD) |
| **part** | Integer part number. Assume it is always 1 for simplicity. |
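
The naming convention can be parsed with a small helper such as the one below. This is a sketch that assumes no field contains an underscore; the names `MarketFile` and `parse_market_filename` are hypothetical, and the ticker `SAN` in the example filename is invented for illustration.

```python
from typing import NamedTuple

class MarketFile(NamedTuple):
    file_type: str  # QTE, TRD, or STS
    session: str    # trading date, YYYY-MM-DD
    isin: str
    ticker: str
    mic: str
    part: int

def parse_market_filename(filename: str) -> MarketFile:
    """Split '<type>_<session>_<isin>_<ticker>_<mic>_<part>.csv.gz' into its fields."""
    suffix = ".csv.gz"
    if not filename.endswith(suffix):
        raise ValueError(f"Unexpected extension: {filename}")
    fields = filename[: -len(suffix)].split("_")
    if len(fields) != 6:
        raise ValueError(f"Expected 6 underscore-separated fields: {filename}")
    file_type, session, isin, ticker, mic, part = fields
    return MarketFile(file_type, session, isin, ticker, mic, int(part))

# Illustrative filename (the ticker is an assumption)
info = parse_market_filename("QTE_2025-11-07_ES0113900J37_SAN_XMAD_1.csv.gz")
print(info)
```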

### Order Book Identity and Join Key

A single **order book identity** is defined by the tuple:

```
(session, isin, mic, ticker)
```

This identity is the **key used to join** corresponding QTE, TRD, and STS data belonging to the same book.
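
A minimal sketch of how this key could be used to pair files belonging to the same book is shown below. The helper names `book_key` and `group_by_book` are hypothetical, and the tickers and the second MIC in the example filenames are invented for illustration.

```python
from collections import defaultdict

def book_key(filename: str):
    """Return (file type, (session, isin, mic, ticker)) parsed from a filename."""
    stem = filename[: -len(".csv.gz")]
    file_type, session, isin, ticker, mic, _part = stem.split("_")
    return file_type, (session, isin, mic, ticker)

def group_by_book(filenames):
    """Map each order book identity to its files, keyed by file type."""
    books = defaultdict(dict)
    for name in filenames:
        file_type, key = book_key(name)
        books[key][file_type] = name
    return dict(books)

# Illustrative filenames: two files for one book, one file for another book
files = [
    "QTE_2025-11-07_ES0113900J37_SAN_XMAD_1.csv.gz",
    "STS_2025-11-07_ES0113900J37_SAN_XMAD_1.csv.gz",
    "QTE_2025-11-07_ES0113900J37_SANe_CEUX_1.csv.gz",
]
books = group_by_book(files)
for key, members in books.items():
    print(key, sorted(members))
```

Grouping by the full tuple rather than by ISIN alone keeps quotes from one book from being matched against the trading status of another.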

### File Types

1. **QTE (Quotes/Snapshots):** Represents the state of the order book (up to 10 levels deep).
   - `epoch`: Timestamp in microseconds (UTC).
   - `px_bid_0`, `px_ask_0`: Best Bid and Best Ask prices.
   - `qty_bid_0`, `qty_ask_0`: Available volume at the best price.
   - *Note: Columns exist for levels 0-9.*

2. **STS (Trading Status):** Updates on the market phase (e.g., Open, Auction, Closed).
   - `epoch`: Timestamp.
   - `market_trading_status`: An integer code representing the state.

3. **TRD (Trades):** Represents the transactions. Not needed for this exercise.
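
The snippet below illustrates the QTE layout on a toy frame. It assumes that deeper levels follow the same `px_bid_N` / `qty_bid_N` naming pattern as level 0, and the two sample rows are made up. It converts `epoch` from microseconds to a UTC timestamp and computes the quoted spread at the top of the book.

```python
import pandas as pd

def level_columns(depth: int = 10) -> list:
    """Column names for price and quantity at each book level (assumed naming pattern)."""
    cols = []
    for level in range(depth):
        cols += [f"px_bid_{level}", f"qty_bid_{level}",
                 f"px_ask_{level}", f"qty_ask_{level}"]
    return cols

# Toy snapshots: epoch is in microseconds since 1970-01-01 UTC
qte = pd.DataFrame({
    "epoch": [1762502400000000, 1762502400000250],
    "px_bid_0": [10.00, 10.01],
    "px_ask_0": [10.02, 10.02],
})
qte["ts"] = pd.to_datetime(qte["epoch"], unit="us", utc=True)
qte["spread"] = qte["px_ask_0"] - qte["px_bid_0"]
print(qte[["ts", "px_bid_0", "px_ask_0", "spread"]])
```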

### CRITICAL: Vendor Data Definitions

Real-world financial data is rarely clean. The data vendor has provided the following specifications. **Ignoring these will result in massive errors in your P&L calculation.**

#### A. "Magic Numbers" (Special Price Codes)

The vendor uses specific high-value constants to indicate non-tradable states (e.g., Market Orders during auctions). **These are NOT real prices.** If you treat 999999.999 as a valid bid, your algorithm will assume you can sell each share for roughly a million euros.

| Value | Meaning | Action Required |
|-------|---------|----------------|
| 666666.666 | Unquoted/Unknown | **Discard** |
| 999999.999 | Market Order (At Best) | **Discard** |
| 999999.989 | At Open Order | **Discard** |
| 999999.988 | At Close Order | **Discard** |
| 999999.979 | Pegged Order | **Discard** |
| 999999.123 | Unquoted/Unknown | **Discard** |
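
One way to discard these codes is sketched below. It compares prices against the table with a small tolerance instead of exact equality, on the assumption that a value parsed from CSV may differ from the literal by a rounding error. The helper name `is_magic_price`, the tolerance, and the sample quotes are illustrative choices.

```python
import numpy as np
import pandas as pd

# Special price codes from the vendor table above
MAGIC_PRICES = np.array([
    666666.666, 999999.999, 999999.989,
    999999.988, 999999.979, 999999.123,
])

def is_magic_price(prices: pd.Series, atol: float = 1e-6) -> pd.Series:
    """True where a price matches any vendor special code within `atol`."""
    values = prices.to_numpy(dtype=float)
    # Broadcast every price against every code: shape (n_prices, n_codes)
    hits = np.isclose(values[:, None], MAGIC_PRICES[None, :], rtol=0.0, atol=atol)
    return pd.Series(hits.any(axis=1), index=prices.index)

# Toy quotes: only the first row has a tradable bid and ask
quotes = pd.DataFrame({
    "px_bid_0": [10.01, 999999.999, 10.00, 666666.666],
    "px_ask_0": [10.02, 10.05, 999999.989, 10.03],
})
valid = ~(is_magic_price(quotes["px_bid_0"]) | is_magic_price(quotes["px_ask_0"]))
clean = quotes[valid]
print(clean)
```

The tolerance is far smaller than the 0.001 gap between the closest codes, so neighbouring codes are still told apart.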

#### B. Market Status Codes

You can only trade when the market is in **Continuous Trading**. If you send an order during an Auction, a Halt, or Pre-Open, it will not execute immediately. A snapshot is only valid (addressable) if the most recent STS record for that venue carries one of these codes:

| Venue | Continuous Trading Code |
|-------|------------------------|
| AQUIS | 5308427 |
| BME | 5832713, 5832756 |
| CBOE | 12255233 |
| TURQUOISE | 7608181 |
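
The sketch below shows one way to apply this table: attach the most recent status to each quote with `pd.merge_asof` and keep only quotes in Continuous Trading. The helper name `keep_addressable` is hypothetical, and the status codes `1` and `2` in the toy data are made-up placeholders for non-continuous phases. Quotes that arrive before the first status record get no status and are dropped.

```python
import pandas as pd

# Continuous Trading codes from the table above
CONTINUOUS_CODES = {
    "AQUIS": [5308427],
    "BME": [5832713, 5832756],
    "CBOE": [12255233],
    "TURQUOISE": [7608181],
}

def keep_addressable(qte: pd.DataFrame, sts: pd.DataFrame, venue: str) -> pd.DataFrame:
    """Keep quotes whose latest known status is Continuous Trading for `venue`."""
    qte = qte.sort_values("epoch", kind="mergesort")
    sts = sts.sort_values("epoch", kind="mergesort")
    merged = pd.merge_asof(
        qte,
        sts[["epoch", "market_trading_status"]],
        on="epoch",
        direction="backward",  # only use a status already published at quote time
    )
    return merged[merged["market_trading_status"].isin(CONTINUOUS_CODES[venue])]

# Toy data: status 1 and 2 are placeholder codes for non-continuous phases
sts = pd.DataFrame({"epoch": [100, 500, 900],
                    "market_trading_status": [1, 5832713, 2]})
qte = pd.DataFrame({"epoch": [50, 150, 600, 950],
                    "px_bid_0": [9.98, 9.99, 10.00, 10.01]})
addressable = keep_addressable(qte, sts, "BME")
print(addressable)
```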

## 3. Implementation Guide

You are encouraged to use AI tools (ChatGPT, Claude, etc.) to generate the Python/Pandas code. However, **you** are responsible for the logic and the financial validity of the results.

### Step 1: Data Ingestion & Cleaning

- Write a function to load the QTE and STS files for a given ISIN.
- **Task:** Ensure you use only valid prices (discard the vendor's magic numbers).
- **Task:** Ensure you only look at addressable order books (venues in Continuous Trading).

In [2]:
# ============================================================================
# STEP 1: DATA INGESTION & CLEANING
# ============================================================================
# This step discovers all ISINs in DATA_BIG and loads QTE (quotes) and STS 
# (trading status) files for each ISIN across all venues.
# ============================================================================

# Define the base data directory and trading session
DATA_DIR = 'DATA_BIG'
SESSION = '2025-11-07'

# Define the venue folder prefixes to process (folders are named <venue>_<session>;
# the MIC itself is parsed from each filename)
VENUES = ['BME', 'CBOE', 'TURQUOISE', 'AQUIS']

# Define "magic numbers" that represent invalid/non-tradable prices
# These are special codes used by the vendor to indicate non-executable states
INVALID_PRICES = [
    666666.666,  # Unquoted/Unknown
    999999.999,  # Market Order (At Best)
    999999.989,  # At Open Order
    999999.988,  # At Close Order
    999999.979,  # Pegged Order
    999999.123   # Unquoted/Unknown
]

# Define valid market trading status codes for continuous trading per venue
# Only orderbooks in continuous trading are "addressable" (can be traded)
VALID_STATUS_CODES = {
    'AQUIS': [5308427],
    'BME': [5832713, 5832756],
    'CBOE': [12255233],
    'TURQUOISE': [7608181]
}

def discover_all_isins(data_dir=DATA_DIR, session=SESSION, venues=VENUES):
    """
    Discovers all unique ISINs across all venues by scanning QTE files.
    
    In fragmented markets, the same ISIN trades on multiple venues. This function
    scans all venue folders to build a comprehensive list of all available instruments.
    
    Returns:
        list: Sorted unique ISINs found across all venues
    """
    isins = set()
    
    # Scan each venue folder for QTE files
    for venue in venues:
        venue_folder = f"{data_dir}/{venue}_{session}"
        
        if not os.path.exists(venue_folder):
            print(f"Warning: Venue folder {venue_folder} not found. Skipping.")
            continue
        
        # Find all QTE files in this venue
        # File pattern: QTE_<session>_<isin>_<ticker>_<mic>_<part>.csv.gz
        qte_files = glob(f"{venue_folder}/QTE_*.csv.gz")
        
        # Extract ISIN from each filename (3rd component after splitting by '_')
        for file_path in qte_files:
            filename = os.path.basename(file_path)
            parts = filename.split('_')
            if len(parts) >= 3:
                isin = parts[2]  # ISIN is the 3rd component
                isins.add(isin)
    
    print(f"Discovered {len(isins)} unique ISINs across all venues")
    return sorted(isins)

def load_qte_sts_for_isin(isin, data_dir=DATA_DIR, session=SESSION, venues=VENUES):
    """
    Loads all QTE (quotes) and STS (trading status) files for a given ISIN across all venues.
    
    Market Microstructure Context:
    - QTE files contain orderbook snapshots showing bid/ask prices and quantities
    - STS files contain market phase information (Open, Continuous Trading, Auction, etc.)
    - We need both to determine: 1) What prices are available, 2) Whether we can actually trade
    
    Args:
        isin: The ISIN identifier (e.g., 'ES0113900J37')
        data_dir: Base directory containing venue folders
        session: Trading date (YYYY-MM-DD)
        venues: List of venue codes to process
    
    Returns:
        tuple: (qte_dataframes_dict, sts_dataframes_dict)
            - qte_dataframes_dict: {venue: DataFrame} mapping
            - sts_dataframes_dict: {venue: DataFrame} mapping
    """
    qte_dfs = {}
    sts_dfs = {}
    
    # Process each venue
    for venue in venues:
        venue_folder = f"{data_dir}/{venue}_{session}"
        
        if not os.path.exists(venue_folder):
            continue
        
        # Find QTE and STS files for this ISIN in this venue
        # Note: There may be multiple tickers for the same ISIN on the same venue
        qte_pattern = f"{venue_folder}/QTE_{session}_{isin}_*_*.csv.gz"
        sts_pattern = f"{venue_folder}/STS_{session}_{isin}_*_*.csv.gz"
        
        qte_files = glob(qte_pattern)
        sts_files = glob(sts_pattern)
        
        # Load all QTE files for this venue (may have multiple orderbooks)
        if qte_files:
            # Load and concatenate all QTE files for this ISIN+venue combination
            qte_list = []
            for qte_file in qte_files:
                try:
                    df_qte = pd.read_csv(qte_file, sep=';', compression='gzip')
                    # Add metadata columns to identify the orderbook
                    filename = os.path.basename(qte_file)
                    parts = filename.split('_')
                    if len(parts) >= 5:
                        df_qte['ticker'] = parts[3]  # Extract ticker from filename
                        df_qte['mic'] = parts[4]     # Extract MIC from filename
                        df_qte['venue'] = venue      # Add venue identifier
                    qte_list.append(df_qte)
                except Exception as e:
                    print(f"Error loading {qte_file}: {e}")
                    continue
            
            if qte_list:
                # Combine all QTE files for this venue
                qte_combined = pd.concat(qte_list, ignore_index=True)
                qte_dfs[venue] = qte_combined
                print(f"  {venue}: Loaded {len(qte_combined)} QTE records")
        
        # Load all STS files for this venue
        if sts_files:
            sts_list = []
            for sts_file in sts_files:
                try:
                    df_sts = pd.read_csv(sts_file, sep=';', compression='gzip')
                    filename = os.path.basename(sts_file)
                    parts = filename.split('_')
                    if len(parts) >= 5:
                        df_sts['ticker'] = parts[3]
                        df_sts['mic'] = parts[4]
                        df_sts['venue'] = venue
                    sts_list.append(df_sts)
                except Exception as e:
                    print(f"Error loading {sts_file}: {e}")
                    continue
            
            if sts_list:
                sts_combined = pd.concat(sts_list, ignore_index=True)
                sts_dfs[venue] = sts_combined
                print(f"  {venue}: Loaded {len(sts_combined)} STS records")
    
    return qte_dfs, sts_dfs

# ============================================================================
# EXECUTE: Discover all ISINs and load QTE/STS files
# ============================================================================

print("=" * 70)
print("STEP 1: DISCOVERING ISINs AND LOADING QTE/STS FILES")
print("=" * 70)

# Discover all unique ISINs in the dataset
all_isins = discover_all_isins()
print(f"\nTotal unique ISINs found: {len(all_isins)}")
print(f"First 10 ISINs: {all_isins[:10]}\n")

# Load QTE and STS files for each ISIN
# Note: We'll process all ISINs, but for large datasets, you might want to 
# process in batches or use lazy loading strategies
print("\nLoading QTE and STS files for all ISINs...")
print("-" * 70)

# Store all loaded data in a dictionary structure
# Structure: {isin: {'qte': {venue: df}, 'sts': {venue: df}}}
all_data = {}

for idx, isin in enumerate(all_isins, 1):
    print(f"\n[{idx}/{len(all_isins)}] Processing ISIN: {isin}")
    qte_dfs, sts_dfs = load_qte_sts_for_isin(isin)
    
    if qte_dfs or sts_dfs:
        all_data[isin] = {
            'qte': qte_dfs,
            'sts': sts_dfs
        }
    else:
        print(f"  Warning: No QTE/STS files found for {isin}")

print("\n" + "=" * 70)
print(f"STEP 1 COMPLETE: Loaded data for {len(all_data)} ISINs")
print("=" * 70)


STEP 1: DISCOVERING ISINs AND LOADING QTE/STS FILES
Discovered 195 unique ISINs across all venues

Total unique ISINs found: 195
First 10 ISINs: ['ARP125991090', 'AU000000BKY0', 'BRBBDCACNPR8', 'BRPETRACNOR9', 'BRPETRACNPR6', 'BRUSIMACNOR3', 'DE000FA5G8E7', 'DE000FA5HCF1', 'DE000FA5HGL0', 'DE000FA5HH03']


Loading QTE and STS files for all ISINs...
----------------------------------------------------------------------

[1/195] Processing ISIN: ARP125991090
  BME: Loaded 48 QTE records
  BME: Loaded 7 STS records

[2/195] Processing ISIN: AU000000BKY0
  BME: Loaded 1256 QTE records
  BME: Loaded 7 STS records
  CBOE: Loaded 928 QTE records
  CBOE: Loaded 7 STS records
  AQUIS: Loaded 1135 QTE records
  AQUIS: Loaded 3 STS records

[3/195] Processing ISIN: BRBBDCACNPR8
  BME: Loaded 25 QTE records
  BME: Loaded 4 STS records

[4/195] Processing ISIN: BRPETRACNOR9
  BME: Loaded 1324 QTE records
  BME: Loaded 4 STS records

[5/195] Processing ISIN: BRPETRACNPR6
  BME: Loaded 1288 QTE records

In [3]:
# ============================================================================
# DATA CLEANING: Filter Invalid Prices and Non-Continuous Trading Periods
# ============================================================================
# This step filters QTE snapshots to keep only:
# 1. Snapshots during Continuous Trading (addressable orderbooks)
# 2. Snapshots with valid prices (no magic numbers)
# ============================================================================

def clean_qte_data(qte_df, sts_df, venue, invalid_prices=INVALID_PRICES, valid_status_codes=VALID_STATUS_CODES):
    """
    Cleans QTE data by:
    1. Merging with STS to get market trading status at each timestamp
    2. Filtering to keep only continuous trading periods
    3. Removing rows with invalid prices (magic numbers)
    
    Market Microstructure Context:
    - We can only trade during Continuous Trading (not during auctions/halts)
    - Magic numbers represent non-executable orders (market orders, pegged orders, etc.)
    - These must be filtered out to avoid false arbitrage signals
    
    Args:
        qte_df: DataFrame with QTE (quote) data
        sts_df: DataFrame with STS (trading status) data
        venue: Venue identifier (BME, CBOE, etc.)
        invalid_prices: List of magic numbers to filter out
        valid_status_codes: Dict mapping venue to valid status codes
    
    Returns:
        Cleaned QTE DataFrame
    """
    if qte_df.empty:
        return qte_df
    
    # Make a copy to avoid modifying original
    qte_clean = qte_df.copy()
    
    # Convert epoch to datetime for time-based merging
    # QTE data: epoch is in microseconds
    qte_clean['ts'] = pd.to_datetime(qte_clean['epoch'], unit='us')
    
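    # Sort by timestamp for merge_asof; a stable sort keeps the file order of
    # snapshots that share the same microsecond
    qte_clean = qte_clean.sort_values('ts', kind='mergesort').reset_index(drop=True)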
    
    # Process STS data if available
    if not sts_df.empty:
        # Prepare STS data for merging
        sts_clean = sts_df.copy()
        sts_clean['ts'] = pd.to_datetime(sts_clean['epoch'], unit='us')
        sts_clean = sts_clean.sort_values('ts').reset_index(drop=True)
        
        # Get valid status codes for this venue
        venue_valid_codes = valid_status_codes.get(venue, [])
        
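        # Merge STS status with QTE using merge_asof (backward direction)
        # This assigns the most recent market status to each quote
        # Direction='backward' ensures we only use status that was known at quote time
        # If both frames carry a 'ticker' column, match per order book so that one
        # book's status is never applied to another book on the same venue
        has_ticker = 'ticker' in qte_clean.columns and 'ticker' in sts_clean.columns
        sts_cols = ['ts', 'market_trading_status'] + (['ticker'] if has_ticker else [])
        qte_clean = pd.merge_asof(
            qte_clean,
            sts_clean[sts_cols],
            on='ts',
            by='ticker' if has_ticker else None,
            direction='backward'
        )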
        
        # Filter: Keep only rows where market status is in continuous trading
        if venue_valid_codes:
            qte_clean = qte_clean[
                qte_clean['market_trading_status'].isin(venue_valid_codes)
            ].copy()
        
        # Drop the market_trading_status column (no longer needed)
        qte_clean = qte_clean.drop(columns=['market_trading_status'], errors='ignore')
    else:
        # If no STS data, we can't filter by status, but we'll still filter invalid prices
        print(f"    Warning: No STS data for {venue}, cannot filter by trading status")
    
    # Filter out invalid prices (magic numbers)
    # Check both bid and ask prices at level 0 (best bid/ask)
    price_columns = ['px_bid_0', 'px_ask_0']
    
    # Create a mask: True for rows with valid prices
    valid_price_mask = pd.Series(True, index=qte_clean.index)
    
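    for col in price_columns:
        if col in qte_clean.columns:
            # Compare with a tolerance: exact float equality can miss a magic
            # number that was parsed with a tiny rounding error
            prices = qte_clean[col].to_numpy(dtype=float)
            is_magic = np.isclose(prices[:, None], np.asarray(invalid_prices)[None, :],
                                  rtol=0, atol=1e-6).any(axis=1)
            valid_price_mask = valid_price_mask & ~is_magic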
    
    # Apply the filter
    qte_clean = qte_clean[valid_price_mask].copy()
    
    # Drop the temporary timestamp column (keep epoch as the time reference)
    qte_clean = qte_clean.drop(columns=['ts'], errors='ignore')
    
    return qte_clean

# ============================================================================
# EXECUTE: Clean all loaded QTE data
# ============================================================================

print("\n" + "=" * 70)
print("CLEANING QTE DATA: Filtering Invalid Prices and Non-Continuous Trading")
print("=" * 70)

# Clean QTE data for each ISIN
cleaned_data = {}

for isin, data in all_data.items():
    print(f"\nCleaning data for ISIN: {isin}")
    qte_dfs = data.get('qte', {})
    sts_dfs = data.get('sts', {})
    
    cleaned_qte = {}
    
    for venue in qte_dfs.keys():
        qte_df = qte_dfs[venue]
        sts_df = sts_dfs.get(venue, pd.DataFrame())
        
        # Get original count
        original_count = len(qte_df)
        
        # Clean the data
        qte_cleaned = clean_qte_data(qte_df, sts_df, venue)
        
        # Get cleaned count
        cleaned_count = len(qte_cleaned)
        removed_count = original_count - cleaned_count
        
        if original_count > 0:
            removal_pct = (removed_count / original_count) * 100
            print(f"  {venue}: {original_count:,} -> {cleaned_count:,} records "
                  f"({removed_count:,} removed, {removal_pct:.2f}%)")
        
        if cleaned_count > 0:
            cleaned_qte[venue] = qte_cleaned
    
    if cleaned_qte:
        cleaned_data[isin] = {
            'qte': cleaned_qte,
            'sts': sts_dfs  # STS data unchanged
        }

print("\n" + "=" * 70)
print(f"CLEANING COMPLETE: Cleaned data for {len(cleaned_data)} ISINs")
print("=" * 70)

# Update all_data with cleaned version
all_data = cleaned_data



CLEANING QTE DATA: Filtering Invalid Prices and Non-Continuous Trading

Cleaning data for ISIN: ARP125991090
  BME: 48 -> 22 records (26 removed, 54.17%)

Cleaning data for ISIN: AU000000BKY0
  BME: 1,256 -> 1,127 records (129 removed, 10.27%)
  CBOE: 928 -> 926 records (2 removed, 0.22%)
  AQUIS: 1,135 -> 1,132 records (3 removed, 0.26%)

Cleaning data for ISIN: BRBBDCACNPR8
  BME: 25 -> 20 records (5 removed, 20.00%)

Cleaning data for ISIN: BRPETRACNOR9
  BME: 1,324 -> 1,249 records (75 removed, 5.66%)

Cleaning data for ISIN: BRPETRACNPR6
  BME: 1,288 -> 1,245 records (43 removed, 3.34%)

Cleaning data for ISIN: BRUSIMACNOR3
  BME: 56 -> 54 records (2 removed, 3.57%)

Cleaning data for ISIN: DE000FA5G8E7
  BME: 340 -> 340 records (0 removed, 0.00%)

Cleaning data for ISIN: DE000FA5HCF1
  BME: 652 -> 652 records (0 removed, 0.00%)

Cleaning data for ISIN: DE000FA5HGL0
  BME: 1,501 -> 1,497 records (4 removed, 0.27%)

Cleaning data for ISIN: DE000FA5HH03
  BME: 784 -> 784 records (0 removed, 0.00%)

In [None]:
# ============================================================================
# DATA PREPARATION: Create Unique Timestamps for QTE Data
# ============================================================================
# QTE snapshots can have multiple records at the same microsecond.
# We use the "nanosecond trick" to create unique timestamps for each snapshot.
# This is essential for operations like merge_asof and pivot.
# ============================================================================

def prepare_qte_timestamps(qte_df):
    """
    Prepares QTE data by creating unique timestamps using the nanosecond trick.
    
    Market Microstructure Context:
    - Multiple orderbook snapshots can occur at the same microsecond
    - We need unique timestamps for time-based operations (merge_asof, pivot, etc.)
    - The nanosecond trick preserves all snapshots while ensuring uniqueness
    
    Args:
        qte_df: DataFrame with QTE data (must have 'epoch' column)
    
    Returns:
        DataFrame with unique timestamp index
    """
    if qte_df.empty:
        return qte_df
    
    # Make a copy to avoid modifying original
    df = qte_df.copy()
    
    # 1. Sort by epoch (and sequence if available)
    # Sequence helps maintain order when multiple snapshots share the same epoch
    sort_cols = ['epoch']
    if 'sequence' in df.columns:
        sort_cols.append('sequence')
    
    # A stable sort keeps the incoming row order for ties when there is no sequence column
    df = df.sort_values(by=sort_cols, kind='mergesort')
    
    # 2. Convert epoch to datetime (microseconds)
    temp_ts = pd.to_datetime(df['epoch'], unit='us')
    
    # 3. The "Nanosecond Trick"
    # groupby().cumcount() numbers items in a group: 0, 1, 2, 3...
    # We group by 'epoch' to find snapshots happening at the same time
    # We treat that count as nanoseconds to create unique timestamps
    offset_ns = df.groupby('epoch').cumcount()
    
    # Safety check: ensure we don't exceed 1000 snapshots per microsecond
    # (nanoseconds go from 0-999, so 1000 would overflow to next microsecond)
    max_offset = offset_ns.max() if len(offset_ns) > 0 else 0
    if max_offset >= 1000:
        raise ValueError(f"Too many snapshots at the same microsecond. Max offset: {max_offset}")
    
    # 4. Create the final High-Resolution Timestamp
    # Base Time (microseconds) + Offset (nanoseconds)
    df['ts'] = temp_ts + pd.to_timedelta(offset_ns, unit='ns')
    
    # 5. Set timestamp as index
    df.set_index('ts', inplace=True)
    
    return df

# ============================================================================
# EXECUTE: Prepare timestamps for all QTE data
# ============================================================================

print("\n" + "=" * 70)
print("PREPARING QTE DATA: Creating Unique Timestamps")
print("=" * 70)

# Prepare QTE data for each ISIN
prepared_data = {}

for isin, data in all_data.items():
    print(f"\nPreparing timestamps for ISIN: {isin}")
    qte_dfs = data.get('qte', {})
    sts_dfs = data.get('sts', {})
    
    prepared_qte = {}
    
    for venue, qte_df in qte_dfs.items():
        if qte_df.empty:
            continue
        
        # Check for duplicate epochs before preparation
        duplicates_before = qte_df.duplicated(subset='epoch', keep=False).sum()
        
        # Prepare timestamps
        qte_prepared = prepare_qte_timestamps(qte_df)
        
        # Verify uniqueness
        is_unique = qte_prepared.index.is_unique
        duplicates_after = qte_prepared.index.duplicated().sum()
        
        print(f"  {venue}: {len(qte_prepared):,} snapshots, "
              f"{duplicates_before} duplicate epochs -> "
              f"Unique timestamps: {is_unique}")
        
        if not is_unique:
            print(f"    WARNING: Index is not unique! {duplicates_after} duplicates found.")
        
        prepared_qte[venue] = qte_prepared
    
    if prepared_qte:
        prepared_data[isin] = {
            'qte': prepared_qte,
            'sts': sts_dfs  # STS data unchanged
        }

print("\n" + "=" * 70)
print(f"PREPARATION COMPLETE: Prepared data for {len(prepared_data)} ISINs")
print("=" * 70)

# Update all_data with prepared version
all_data = prepared_data

# ============================================================================
# DEMONSTRATION: Nanosecond Trick for Duplicate Epochs
# ============================================================================
# Demonstrate how the nanosecond trick handles multiple snapshots
# that share the same microsecond
# ============================================================================

print("\n" + "=" * 70)
print("DEMONSTRATION: Handling Duplicate Epochs with Nanosecond Trick")
print("=" * 70)

# Find an example with duplicate epochs (one that needed the nanosecond trick)
example_found = False
example_isin = None
example_venue = None
example_df = None

for isin, data in prepared_data.items():
    qte_dfs = data.get('qte', {})
    for venue, qte_df in qte_dfs.items():
        if not qte_df.empty:
            # A non-zero nanosecond component in the index marks a snapshot whose
            # epoch collided with an earlier snapshot and was offset to stay unique
            bursts = qte_df[qte_df.index.nanosecond > 0]
            if not bursts.empty:
                example_isin = isin
                example_venue = venue
                example_df = qte_df
                example_found = True
                break
    if example_found:
        break

if example_found and example_df is not None:
    print(f"\n📊 Example: {example_isin} on {example_venue}")
    print(f"   Total snapshots: {len(example_df):,}")
    
    # Count snapshots whose epoch is shared with at least one other snapshot
    # (the original 'epoch' column is still present alongside the new index)
    duplicates = example_df.duplicated(subset='epoch', keep=False).sum()
    print(f"   Found {duplicates} snapshots sharing the same microsecond (duplicate epochs)")
    
    # Find a specific burst example
    bursts = example_df[example_df.index.nanosecond > 0]
    if not bursts.empty:
        print(f"   Found {len(bursts)} snapshots with nanosecond offsets (collisions resolved)")
        
        # Get the first burst timestamp
        burst_ts = bursts.index[0]
        base_ts = burst_ts.replace(nanosecond=0)
        
        # Get all snapshots at this base timestamp (the "burst")
        burst_snapshots = example_df.loc[base_ts : base_ts + pd.Timedelta(nanoseconds=999)]
        
        print(f"\n💥 Example of a resolved collision (Look at the timestamps!):")
        print(f"   Base timestamp: {base_ts}")
        print(f"   Number of snapshots at this microsecond: {len(burst_snapshots)}")
        print(f"\n   All snapshots at this microsecond:")
        
        # Show relevant columns including sequence if available
        display_cols = ['epoch', 'px_bid_0', 'px_ask_0', 'qty_bid_0', 'qty_ask_0']
        if 'sequence' in burst_snapshots.columns:
            display_cols.insert(1, 'sequence')
        
        print(burst_snapshots[display_cols].to_string())
        
        print(f"\n   Note: Each snapshot has a unique timestamp index with nanosecond offsets")
        print(f"   This preserves the order (by sequence if available) while ensuring uniqueness")
    else:
        print("   No duplicate epochs found in this example")
else:
    print("\n⚠️  No examples with duplicate epochs found to demonstrate")

# ============================================================================
# STATISTICS: Epochs Requiring Nanosecond Trick
# ============================================================================

print("\n" + "=" * 70)
print("STATISTICS: Epochs Requiring Nanosecond Trick")
print("=" * 70)

# Collect statistics about duplicate epochs
duplicate_stats = []

for isin, data in prepared_data.items():
    qte_dfs = data.get('qte', {})
    for venue, qte_df in qte_dfs.items():
        if not qte_df.empty:
            # Count snapshots with nanosecond offsets (indicating duplicate epochs were resolved)
            bursts = qte_df[qte_df.index.nanosecond > 0]
            num_bursts = len(bursts)
            
            if num_bursts > 0:
                # Count unique epochs that had duplicates
                burst_epochs = qte_df.loc[bursts.index, 'epoch'].unique()
                num_duplicate_epochs = len(burst_epochs)
                
                # Find max number of snapshots at a single epoch
                epoch_counts = qte_df.groupby('epoch').size()
                max_snapshots_per_epoch = epoch_counts.max()
                
                duplicate_stats.append({
                    'ISIN': isin,
                    'Venue': venue,
                    'Total_Snapshots': len(qte_df),
                    'Snapshots_With_NS_Offset': num_bursts,
                    'Unique_Epochs_With_Duplicates': num_duplicate_epochs,
                    'Max_Snapshots_Per_Epoch': max_snapshots_per_epoch
                })

if duplicate_stats:
    df_duplicate_stats = pd.DataFrame(duplicate_stats)
    
    print(f"\n📈 Summary of Epochs Requiring Nanosecond Trick:")
    print(f"   • Total venue-ISIN combinations with duplicate epochs: {len(df_duplicate_stats)}")
    print(f"   • Total snapshots requiring nanosecond offsets: {df_duplicate_stats['Snapshots_With_NS_Offset'].sum():,}")
    print(f"   • Total unique epochs with duplicates: {df_duplicate_stats['Unique_Epochs_With_Duplicates'].sum():,}")
    print(f"   • Maximum snapshots at a single epoch: {df_duplicate_stats['Max_Snapshots_Per_Epoch'].max()}")
    
    print(f"\n📋 Top 10 ISIN-Venue combinations by number of duplicate epochs:")
    top_duplicates = df_duplicate_stats.nlargest(10, 'Unique_Epochs_With_Duplicates')
    print(top_duplicates[['ISIN', 'Venue', 'Total_Snapshots', 'Unique_Epochs_With_Duplicates', 
                          'Max_Snapshots_Per_Epoch']].to_string(index=False))
    
    print(f"\n💡 Market Microstructure Context:")
    print(f"   Multiple orderbook snapshots at the same microsecond occur when:")
    print(f"   - Rapid orderbook updates happen faster than microsecond resolution")
    print(f"   - Market data feeds batch multiple updates together")
    print(f"   - High-frequency trading activity creates bursts of updates")
    print(f"   The nanosecond trick preserves all snapshots while maintaining chronological order")
else:
    print("\n✅ No duplicate epochs found - all snapshots have unique microsecond timestamps")



PREPARING QTE DATA: Creating Unique Timestamps

Preparing timestamps for ISIN: ARP125991090
  BME: 22 snapshots, 0 duplicate epochs -> Unique timestamps: True

Preparing timestamps for ISIN: AU000000BKY0
  BME: 1,127 snapshots, 105 duplicate epochs -> Unique timestamps: True
  CBOE: 926 snapshots, 0 duplicate epochs -> Unique timestamps: True
  AQUIS: 1,132 snapshots, 0 duplicate epochs -> Unique timestamps: True

Preparing timestamps for ISIN: BRBBDCACNPR8
  BME: 20 snapshots, 0 duplicate epochs -> Unique timestamps: True

Preparing timestamps for ISIN: BRPETRACNOR9
  BME: 1,249 snapshots, 0 duplicate epochs -> Unique timestamps: True

Preparing timestamps for ISIN: BRPETRACNPR6
  BME: 1,245 snapshots, 2 duplicate epochs -> Unique timestamps: True

Preparing timestamps for ISIN: BRUSIMACNOR3
  BME: 54 snapshots, 0 duplicate epochs -> Unique timestamps: True

Preparing timestamps for ISIN: DE000FA5G8E7
  BME: 340 snapshots, 0 duplicate epochs -> Unique timestamps: True

Preparing time

In [None]:
# ============================================================================
# DESCRIPTIVE SUMMARY: Overview of Loaded Data
# ============================================================================
# This section creates a summary DataFrame showing key statistics about the
# downloaded data: venues, ISINs, record counts, etc.
# ============================================================================

print("\n" + "=" * 70)
print("DESCRIPTIVE SUMMARY OF LOADED DATA")
print("=" * 70)

# Initialize summary statistics
summary_data = []

# Analyze each ISIN
for isin, data in all_data.items():
    qte_dfs = data.get('qte', {})
    sts_dfs = data.get('sts', {})
    
    # Count records per venue for this ISIN
    for venue in VENUES:
        qte_count = len(qte_dfs.get(venue, pd.DataFrame()))
        sts_count = len(sts_dfs.get(venue, pd.DataFrame()))
        
        if qte_count > 0 or sts_count > 0:
            summary_data.append({
                'ISIN': isin,
                'Venue': venue,
                'QTE_Records': qte_count,
                'STS_Records': sts_count,
                'Total_Records': qte_count + sts_count
            })

# Create summary DataFrame
df_summary = pd.DataFrame(summary_data)

if len(df_summary) > 0:
    # Overall statistics
    print(f"\n📊 OVERALL STATISTICS:")
    print(f"   • Total unique ISINs: {df_summary['ISIN'].nunique()}")
    print(f"   • Total venue-ISIN combinations: {len(df_summary)}")
    print(f"   • Total QTE records: {df_summary['QTE_Records'].sum():,}")
    print(f"   • Total STS records: {df_summary['STS_Records'].sum():,}")
    print(f"   • Total records: {df_summary['Total_Records'].sum():,}")
    
    # Statistics by venue
    print(f"\n🏢 STATISTICS BY VENUE:")
    venue_stats = df_summary.groupby('Venue').agg({
        'ISIN': 'nunique',
        'QTE_Records': 'sum',
        'STS_Records': 'sum',
        'Total_Records': 'sum'
    }).rename(columns={'ISIN': 'Unique_ISINs'})
    venue_stats = venue_stats.sort_values('Total_Records', ascending=False)
    print(venue_stats.to_string())
    
    # Statistics by ISIN (top 10)
    print(f"\n📈 TOP 10 ISINs BY TOTAL RECORDS:")
    isin_stats = df_summary.groupby('ISIN').agg({
        'Venue': 'count',
        'QTE_Records': 'sum',
        'STS_Records': 'sum',
        'Total_Records': 'sum'
    }).rename(columns={'Venue': 'Num_Venues'})
    isin_stats = isin_stats.sort_values('Total_Records', ascending=False)
    print(isin_stats.head(10).to_string())
    
    # Venue coverage per ISIN
    print(f"\n🔗 VENUE COVERAGE:")
    venue_coverage = df_summary.groupby('ISIN')['Venue'].count()
    coverage_stats = venue_coverage.value_counts().sort_index()
    print("Number of venues per ISIN:")
    for num_venues, count in coverage_stats.items():
        print(f"   • {num_venues} venue(s): {count} ISIN(s)")
    
    # Statistics about duplicate epochs (nanosecond trick usage)
    print(f"\n⏱️  DUPLICATE EPOCHS STATISTICS (Nanosecond Trick Required):")
    duplicate_epoch_stats = []
    
    for isin, data in all_data.items():
        qte_dfs = data.get('qte', {})
        for venue, qte_df in qte_dfs.items():
            if not qte_df.empty:
                # Count snapshots with nanosecond offsets (indicating duplicate epochs)
                bursts = qte_df[qte_df.index.nanosecond > 0]
                num_bursts = len(bursts)
                
                if num_bursts > 0:
                    # Count unique epochs that had duplicates
                    # (read 'epoch' straight from the filtered rows; no second .loc lookup needed)
                    burst_epochs = bursts['epoch'].unique()
                    num_duplicate_epochs = len(burst_epochs)
                    
                    # Find max number of snapshots at a single epoch
                    epoch_counts = qte_df.groupby('epoch').size()
                    max_snapshots_per_epoch = epoch_counts.max()
                    
                    duplicate_epoch_stats.append({
                        'ISIN': isin,
                        'Venue': venue,
                        'Snapshots_With_NS_Offset': num_bursts,
                        'Unique_Epochs_With_Duplicates': num_duplicate_epochs,
                        'Max_Snapshots_Per_Epoch': max_snapshots_per_epoch
                    })
    
    if duplicate_epoch_stats:
        df_dup_epochs = pd.DataFrame(duplicate_epoch_stats)
        total_dup_combinations = len(df_dup_epochs)
        total_snapshots_with_offset = df_dup_epochs['Snapshots_With_NS_Offset'].sum()
        total_dup_epochs = df_dup_epochs['Unique_Epochs_With_Duplicates'].sum()
        max_snapshots = df_dup_epochs['Max_Snapshots_Per_Epoch'].max()
        
        print(f"   • Venue-ISIN combinations with duplicate epochs: {total_dup_combinations}")
        print(f"   • Total snapshots requiring nanosecond offsets: {total_snapshots_with_offset:,}")
        print(f"   • Total unique epochs with duplicates: {total_dup_epochs:,}")
        print(f"   • Maximum snapshots at a single epoch: {max_snapshots}")
        print(f"\n   Top 5 by number of duplicate epochs:")
        top_5_dup = df_dup_epochs.nlargest(5, 'Unique_Epochs_With_Duplicates')
        print(top_5_dup[['ISIN', 'Venue', 'Unique_Epochs_With_Duplicates', 'Max_Snapshots_Per_Epoch']].to_string(index=False))
    else:
        print(f"   • No duplicate epochs found - all snapshots have unique microsecond timestamps")
    
    # Create a detailed summary DataFrame for display
    print(f"\n📋 DETAILED SUMMARY DATAFRAME (First 20 rows):")
    print(df_summary.head(20).to_string(index=False))
    
    # Store the summary DataFrame for later use
    print(f"\n✅ Summary DataFrame stored as 'df_summary' ({len(df_summary)} rows)")
    
else:
    print("⚠️  No data loaded. Please check Step 1 execution.")

print("\n" + "=" * 70)



DESCRIPTIVE SUMMARY OF LOADED DATA

📊 OVERALL STATISTICS:
   • Total unique ISINs: 185
   • Total venue-ISIN combinations: 434
   • Total QTE records: 7,311,154
   • Total STS records: 2,374
   • Total records: 7,313,528

🏢 STATISTICS BY VENUE:
           Unique_ISINs  QTE_Records  STS_Records  Total_Records
Venue                                                           
BME                 185      3529996         1125        3531121
CBOE                 89      1519860          632        1520492
AQUIS                96      1359160          423        1359583
TURQUOISE            64       902138          194         902332

📈 TOP 10 ISINs BY TOTAL RECORDS:
              Num_Venues  QTE_Records  STS_Records  Total_Records
ISIN                                                             
ES0177542018           4      1115460           26        1115486
ES0113900J37           4       586156           23         586179
ES0113211835           4       333681           22         333703


### Step 2: Create the "Consolidated Tape"

- To detect arbitrage, you need to compare prices across venues *at the exact same time*.
- **Task:** Create a single DataFrame where the index is the timestamp, and the columns represent the Best Bid and Best Ask for **every** venue (BME, CBOE, AQUIS, TURQUOISE).
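
One possible way to approach the task above is sketched below. It is a minimal illustration, not the reference solution: the helper name `build_consolidated_tape` and the column names `bid_px` / `ask_px` are assumptions made for this example, and it assumes each venue's DataFrame already has a unique, sorted timestamp index (as produced by the nanosecond trick). Venues quote at different instants, so the sketch takes the union of all timestamps and carries each venue's last known quote forward.

```python
import pandas as pd

def build_consolidated_tape(quotes_by_venue):
    """Align per-venue best quotes on one shared timeline.

    quotes_by_venue: dict mapping venue name -> DataFrame indexed by unique
    timestamps (hypothetical columns: bid_px, ask_px).
    """
    # Prefix columns with the venue name so they stay distinguishable.
    renamed = [df.add_prefix(f"{venue}_") for venue, df in quotes_by_venue.items()]
    # Outer join = union of all timestamps; ffill = last known quote per venue.
    return pd.concat(renamed, axis=1, join="outer").sort_index().ffill()

# Toy data: BME quotes at 1 µs and 5 µs, CBOE quotes once at 3 µs.
bme = pd.DataFrame(
    {"bid_px": [10.00, 10.01], "ask_px": [10.02, 10.03]},
    index=pd.to_datetime(["2025-01-02 09:00:00.000001", "2025-01-02 09:00:00.000005"]),
)
cboe = pd.DataFrame(
    {"bid_px": [9.99], "ask_px": [10.00]},
    index=pd.to_datetime(["2025-01-02 09:00:00.000003"]),
)
tape = build_consolidated_tape({"BME": bme, "CBOE": cboe})
print(tape)
```

Rows before a venue's first quote stay `NaN` on purpose: a venue with no known quote should not take part in any comparison.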

### Step 3: Signal Generation

- **Arbitrage Condition:** An opportunity exists when Global Max Bid > Global Min Ask.
- **Profit Calc:** (Max Bid - Min Ask) * Min(BidQty, AskQty), where BidQty is the quantity available at the Global Max Bid and AskQty is the quantity available at the Global Min Ask.
- **Rising Edge:** In a simulation, if an opportunity persists for 1 second (1000 snapshots), you can only trade it *once* (the first time it appears). Ensure you aren't "double counting" the same opportunity. If the opportunity vanishes and quickly reappears, you can count it as a new opportunity for simplicity.
- **Simplification:** Only look at opportunities between Global Max Bid and Global Min Ask. There might be others at the second or third price levels of the orderbook, but let's make it simple and use only the best Bid Ask of each trading venue.
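
The rules above can be sketched as follows. This is only an illustration under stated assumptions: the function name `detect_opportunities` and the input layout (four DataFrames with one column per venue, in the same column order) are hypothetical and not part of the provided data format. Missing quotes are pushed to ±infinity so they can never be picked as the best price.

```python
import numpy as np
import pandas as pd

def detect_opportunities(bid_px, ask_px, bid_qty, ask_qty):
    """Flag arbitrage per snapshot; inputs are (timestamps x venues) frames."""
    bids = bid_px.to_numpy(dtype=float)
    asks = ask_px.to_numpy(dtype=float)
    # A missing quote must never win the max-bid / min-ask comparison.
    bids = np.where(np.isnan(bids), -np.inf, bids)
    asks = np.where(np.isnan(asks), np.inf, asks)

    rows = np.arange(len(bid_px))
    best_bid_col = bids.argmax(axis=1)   # venue holding the Global Max Bid
    best_ask_col = asks.argmin(axis=1)   # venue holding the Global Min Ask
    max_bid = bids[rows, best_bid_col]
    min_ask = asks[rows, best_ask_col]

    is_arb = max_bid > min_ask
    # Tradable size is limited by the smaller of the two best-level quantities.
    qty = np.minimum(bid_qty.to_numpy(dtype=float)[rows, best_bid_col],
                     ask_qty.to_numpy(dtype=float)[rows, best_ask_col])
    qty = np.nan_to_num(qty, nan=0.0)
    profit = np.where(is_arb, max_bid - min_ask, 0.0) * np.where(is_arb, qty, 0.0)

    out = pd.DataFrame({"max_bid": max_bid, "min_ask": min_ask,
                        "profit": profit, "is_arb": is_arb}, index=bid_px.index)
    # Rising edge: opportunity is present now but was absent one snapshot ago.
    out["rising_edge"] = out["is_arb"] & ~out["is_arb"].shift(1, fill_value=False)
    return out

# Toy example with two hypothetical venues A and B over five snapshots.
idx = pd.Timestamp("2025-01-02 09:00:00") + pd.to_timedelta(range(5), unit="us")
bid_px = pd.DataFrame({"A": [10.00, 10.02, 10.02, 10.00, 10.03], "B": [9.99] * 5}, index=idx)
ask_px = pd.DataFrame({"A": [10.03, 10.04, 10.04, 10.03, 10.05], "B": [10.01] * 5}, index=idx)
bid_qty = pd.DataFrame({"A": [100] * 5, "B": [80] * 5}, index=idx)
ask_qty = pd.DataFrame({"A": [70] * 5, "B": [50] * 5}, index=idx)

signals = detect_opportunities(bid_px, ask_px, bid_qty, ask_qty)
total_profit = signals.loc[signals["rising_edge"], "profit"].sum()
print(signals)
```

In the toy data the opportunity at snapshots 2 and 3 is one continuous episode, so only its first snapshot is flagged as a rising edge; the reappearance at snapshot 5 counts as a new one.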

In [6]:
# Your code here


### Step 4: The "Time Machine" (Latency Simulation)

- In reality, if you see a price at time $T$, you cannot trade until $T + \Delta$.
- **Task:** Simulate execution latencies of [0, 100, 500, 1000, 2000, 3000, 4000, 5000, 10000, 15000, 20000, 30000, 50000, 100000] microseconds.
- *Method:* If a signal is detected at $T$, look up what the profit *actually is* at $T + \Delta$ in your DataFrame.
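
A minimal sketch of the time lookup described above, assuming a `profit` Series that holds the instantaneous theoretical profit at every snapshot (0 when no arbitrage exists) and a boolean `rising_edge` Series on the same sorted, unique timestamp index. Both names and the helper `realized_profit` are hypothetical. For each signal it finds the last snapshot at or before the execution time and reads the profit there. If the execution time falls after the final snapshot, the last known state is used, which is a simplifying assumption.

```python
import numpy as np
import pandas as pd

def realized_profit(profit, rising_edge, latency_us):
    """Sum the profit still available `latency_us` microseconds after each signal."""
    signal_times = profit.index[rising_edge.to_numpy()]
    exec_times = signal_times + pd.Timedelta(microseconds=int(latency_us))
    # side="right" minus 1 -> position of the last snapshot at or before exec time.
    pos = profit.index.searchsorted(exec_times, side="right") - 1
    return float(profit.to_numpy()[pos].sum())

# Toy tape: snapshots at 0, 100, 200 and 1000 µs after a base time.
idx = pd.Timestamp("2025-01-02 09:00:00") + pd.to_timedelta([0, 100, 200, 1000], unit="us")
profit = pd.Series([0.5, 0.5, 0.0, 1.0], index=idx)
rising_edge = pd.Series([True, False, False, True], index=idx)

decay = {lat: realized_profit(profit, rising_edge, lat) for lat in [0, 100, 500]}
print(decay)
```

In this toy tape the first opportunity is still there after 100 µs but has vanished by 500 µs, so the realized total drops once the latency exceeds the lifetime of the opportunity.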

In [7]:
# Your code here


## 4. Deliverables & Evaluation

Submit a Jupyter Notebook containing your code and the following analysis:

1. **The "Money Table":** A summary table showing the Total Realized Profit for all processed ISINs at each latency level.
2. **The Decay Chart:** A line chart visualizing how Total Profit (Y-axis) decays as Latency (X-axis) increases.
3. **Top Opportunities:** A list of the Top 5 most profitable ISINs (at 0 latency). **Sanity check these results**—do they look real?
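
As a rough illustration of how the three deliverables fit together, the sketch below assumes the simulation results have been collected in long format, one row per (ISIN, latency). The column names and the placeholder values `ISIN_A` / `ISIN_B` are invented for the example and are not real results.

```python
import pandas as pd

# Placeholder results only; replace with the output of your own simulation.
results = pd.DataFrame([
    {"ISIN": "ISIN_A", "latency_us": 0, "profit": 12.0},
    {"ISIN": "ISIN_A", "latency_us": 1000, "profit": 5.0},
    {"ISIN": "ISIN_B", "latency_us": 0, "profit": 3.0},
    {"ISIN": "ISIN_B", "latency_us": 1000, "profit": 1.0},
])

# Money table: one row per latency level, one column per ISIN, plus a total.
money_table = results.pivot_table(index="latency_us", columns="ISIN",
                                  values="profit", aggfunc="sum", fill_value=0.0)
money_table["Total"] = money_table.sum(axis=1)

# Top opportunities: most profitable ISINs at 0 latency.
top = money_table.drop(columns="Total").loc[0].nlargest(5)
print(money_table)
print(top)
```

Plotting the `Total` column against the `latency_us` index then gives the decay chart.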

### 1. The "Money Table"

In [8]:
# Your code here


### 2. The Decay Chart

In [9]:
# Your code here


### 3. Top Opportunities

In [10]:
# Your code here


## Grading Rubric (Max 10 Points)

- **5-6 Points (Baseline):** The code runs, correctly calculates the consolidated tape, identifies Bid > Ask opportunities, and estimates theoretical (0 latency) profit.

- **7-8 Points (Robust):** The simulation accurately models latency (using strict time-lookups) and strictly adheres to the vendor's data quality specs.

- **9-10 Points (Expert):** You demonstrate deep understanding of market microstructure. You handle **Market Status** correctly to avoid fake signals, identify anomalies in the instrument list, and handle edge cases around Market Open/Close.