# pageindex-rs vs PageIndex (Python) — Benchmark

Measures **index build speed**, **node retrieval speed**, and **consistency** across document sizes.

- 500 iterations per build benchmark
- 1000 random lookups per retrieval benchmark
- Reports mean, median, stdev, p95, p99, and max for every metric

No LLM calls — pure parsing and retrieval only.

## 1. Imports and Configuration

In [1]:
import os
os.environ['GROQ_API_KEY']=""

In [2]:
import sys, time, random, statistics, re
from pathlib import Path

import pageindex_rs

ORIG_PATH = "/Volumes/ExtraStorage/PageIndexOriginal/PageIndex"
if ORIG_PATH not in sys.path:
    sys.path.insert(0, ORIG_PATH)

from pageindex.page_index_md import (
    extract_nodes_from_markdown,
    extract_node_text_content,
    build_tree_from_nodes,
)
from pageindex.utils import write_node_id, format_structure, structure_to_list

TESTS_DIR = Path("/Volumes/ExtraStorage/PageIndexRust/tests")
TESTS_DIR.mkdir(exist_ok=True)
print("Setup complete")

Setup complete


## 2. Generate Benchmark Documents

Pulls Wikipedia articles into documents of increasing size. Skip if files already exist.

In [3]:
import wikipedia

def wiki_to_markdown(topic):
    try:
        page = wikipedia.page(topic, auto_suggest=False)
        content = page.content
        content = re.sub(r"=== (.+?) ===", r"### \\1", content)
        content = re.sub(r"== (.+?) ==", r"## \\1", content)
        return f"# {page.title}\\n\\n{content}"
    except Exception as e:
        print(f"  Skipped {topic}: {e}")
        return ""

TOPICS_SMALL = ["Corporate finance"]

TOPICS_MEDIUM = [
    "Corporate finance", "Investment banking", "Mergers and acquisitions",
    "Financial statement", "Valuation (finance)", "Private equity",
    "Hedge fund", "Capital structure", "Financial risk management", "Stock market",
]

TOPICS_LARGE = TOPICS_MEDIUM + [
    "Bond (finance)", "Derivative (finance)", "Financial modeling",
    "Leveraged buyout", "Initial public offering", "Venture capital",
    "Asset management", "Portfolio management", "Risk management",
    "Financial regulation", "Banking", "Central bank", "Monetary policy",
    "Fiscal policy", "Economic growth", "Inflation", "Interest rate",
    "Foreign exchange market", "Commodity market", "Real estate investment trust",
]

def build_and_save(topics, filename):
    path = TESTS_DIR / filename
    if path.exists():
        print(f"  {filename} exists ({path.stat().st_size/1024:.0f} KB) — skipping")
        return
    combined = ""
    for topic in topics:
        md_text = wiki_to_markdown(topic)
        combined += md_text + "\\n\\n"
        print(f"  {topic}: {len(md_text):,} chars")
    path.write_text(combined)
    print(f"  => {filename}: {path.stat().st_size/1024:.0f} KB")

print("Generating small (~43 KB)...")
build_and_save(TOPICS_SMALL,  "bench_small.md")
print("Generating medium (~3 MB)...")
build_and_save(TOPICS_MEDIUM, "bench_medium.md")
print("Generating large (~10 MB)...")
build_and_save(TOPICS_LARGE,  "bench_large.md")

Generating small (~43 KB)...
  Corporate finance: 43,082 chars
  => bench_small.md: 42 KB
Generating medium (~3 MB)...
  Corporate finance: 43,082 chars
  Investment banking: 32,507 chars
  Mergers and acquisitions: 59,540 chars
  Financial statement: 6,773 chars
  Valuation (finance): 25,363 chars
  Private equity: 62,795 chars
  Hedge fund: 68,938 chars
  Capital structure: 17,879 chars
  Financial risk management: 47,198 chars
  Stock market: 39,487 chars
  => bench_medium.md: 395 KB
Generating large (~10 MB)...
  Corporate finance: 43,082 chars
  Investment banking: 32,507 chars
  Mergers and acquisitions: 59,540 chars
  Financial statement: 6,773 chars
  Valuation (finance): 25,363 chars
  Private equity: 62,795 chars
  Hedge fund: 68,938 chars
  Capital structure: 17,879 chars
  Financial risk management: 47,198 chars
  Stock market: 39,487 chars
  Bond (finance): 34,800 chars
  Derivative (finance): 58,927 chars
  Financial modeling: 12,131 chars
  Leveraged buyout: 22,188 chars



  lis = BeautifulSoup(html).find_all('li')


  Skipped Portfolio management: "Portfolio management" may refer to: 
Portfolio manager
Investment management
IT portfolio management
Application portfolio management
Product portfolio management
Project management
Project portfolio management
  Portfolio management: 0 chars
  Risk management: 42,215 chars
  Financial regulation: 4,487 chars
  Banking: 42,480 chars
  Central bank: 44,079 chars
  Monetary policy: 50,730 chars
  Fiscal policy: 16,796 chars
  Economic growth: 65,061 chars
  Inflation: 66,912 chars
  Interest rate: 24,457 chars
  Foreign exchange market: 41,192 chars
  Commodity market: 27,948 chars
  Real estate investment trust: 32,676 chars
  => bench_large.md: 1055 KB


## 3. Helper Functions

In [4]:
def percentile(data, p):
    """Return the p-th percentile of sorted data."""
    sorted_data = sorted(data)
    k = (len(sorted_data) - 1) * p / 100
    lo, hi = int(k), min(int(k) + 1, len(sorted_data) - 1)
    return sorted_data[lo] + (sorted_data[hi] - sorted_data[lo]) * (k - lo)

def stats(times):
    return {
        "mean":   statistics.mean(times),
        "median": statistics.median(times),
        "stdev":  statistics.stdev(times),
        "min":    min(times),
        "p95":    percentile(times, 95),
        "p99":    percentile(times, 99),
        "max":    max(times),
    }

def build_py_index(path):
    content = Path(path).read_text()
    node_list, md_lines = extract_nodes_from_markdown(content)
    nodes_with_content = extract_node_text_content(node_list, md_lines)
    tree = build_tree_from_nodes(nodes_with_content)
    write_node_id(tree)
    return format_structure(tree, order=["title", "node_id", "text", "line_num", "nodes"])

def find_py_node(tree, node_id):
    for node in structure_to_list(tree):
        if node["node_id"] == node_id:
            return node
    return structure_to_list(tree)[0]

def print_stats_table(label, rs, py):
    speedups = {
        k: py[k] / rs[k] if rs[k] > 0 else float('inf')
        for k in ["mean", "median", "stdev", "p95", "p99", "max"]
    }
    print(f"\\n  {label}")
    print(f"  {'Metric':<10} {'Rust':>10} {'Python':>10} {'Speedup':>10}")
    print(f"  {'-'*44}")
    for k in ["mean", "median", "stdev", "p95", "p99", "max"]:
        print(f"  {k:<10} {rs[k]:>9.4f}ms {py[k]:>9.4f}ms {speedups[k]:>9.2f}x")

print("Helpers ready")

Helpers ready


## 4. Index Build Benchmark (500 iterations)

Pure markdown parsing — no API calls.

Each iteration builds a full tree from scratch including:
- Header extraction
- Text content extraction per node
- Tree assembly
- Node ID assignment

In [5]:
N_BUILD = 500
build_results = {}

for fname in ["bench_small.md", "bench_medium.md", "bench_large.md"]:
    path = TESTS_DIR / fname
    if not path.exists():
        print(f"Skipping {fname} — not found")
        continue

    size_kb = path.stat().st_size / 1024
    print(f"\\nBenchmarking {fname} ({size_kb:.0f} KB) — {N_BUILD} iterations...")

    # Rust
    rs_times = []
    for _ in range(N_BUILD):
        t0 = time.perf_counter()
        pageindex_rs.PageIndex.from_file(fname, str(path))
        rs_times.append((time.perf_counter() - t0) * 1000)

    # Python
    content = path.read_text()
    py_times = []
    for _ in range(N_BUILD):
        t0 = time.perf_counter()
        nl, ml = extract_nodes_from_markdown(content)
        nc = extract_node_text_content(nl, ml)
        t = build_tree_from_nodes(nc)
        write_node_id(t)
        py_times.append((time.perf_counter() - t0) * 1000)

    rs_s = stats(rs_times)
    py_s = stats(py_times)
    build_results[fname] = {"size_kb": size_kb, "rs": rs_s, "py": py_s}
    print_stats_table(f"{fname} ({size_kb:.0f} KB)", rs_s, py_s)

print("\\nBuild benchmark complete")

\nBenchmarking bench_small.md (42 KB) — 500 iterations...
\n  bench_small.md (42 KB)
  Metric           Rust     Python    Speedup
  --------------------------------------------
  mean          0.2067ms    0.1531ms      0.74x
  median        0.1084ms    0.1495ms      1.38x
  stdev         0.8347ms    0.0141ms      0.02x
  p95           0.4845ms    0.1665ms      0.34x
  p99           1.3354ms    0.2057ms      0.15x
  max          17.4007ms    0.3755ms      0.02x
\nBenchmarking bench_medium.md (395 KB) — 500 iterations...
\n  bench_medium.md (395 KB)
  Metric           Rust     Python    Speedup
  --------------------------------------------
  mean          0.8726ms    1.3693ms      1.57x
  median        0.8452ms    1.3814ms      1.63x
  stdev         0.0602ms    0.0534ms      0.89x
  p95           0.9740ms    1.4553ms      1.49x
  p99           1.1289ms    1.5107ms      1.34x
  max           1.2899ms    1.6474ms      1.28x
\nBenchmarking bench_large.md (1055 KB) — 500 iterations...
\n  

## 5. Node Retrieval Benchmark (1000 lookups)

Random node lookups across the full tree.

- **Rust**: `HashMap` — O(1) lookup
- **Python**: linear scan — O(n) — performance degrades as node count grows

In [6]:
N_RETRIEVAL = 1000
retrieval_results = {}

for fname in ["bench_small.md", "bench_medium.md", "bench_large.md"]:
    path = TESTS_DIR / fname
    if not path.exists():
        print(f"Skipping {fname} — not found")
        continue

    size_kb = path.stat().st_size / 1024
    print(f"\\nBuilding indexes for {fname} ({size_kb:.0f} KB)...")
    rs_idx = pageindex_rs.PageIndex.from_file(fname, str(path))
    py_tree = build_py_index(path)
    node_count = len(rs_idx.node_ids())
    print(f"  Node count: {node_count}")

    rs_node_ids = rs_idx.node_ids()
    py_node_ids = [n["node_id"] for n in structure_to_list(py_tree)]

    rs_times = []
    for _ in range(N_RETRIEVAL):
        nid = random.choice(rs_node_ids)
        t0 = time.perf_counter()
        rs_idx.get_node(nid)
        rs_times.append((time.perf_counter() - t0) * 1000)

    py_times = []
    for _ in range(N_RETRIEVAL):
        nid = random.choice(py_node_ids)
        t0 = time.perf_counter()
        find_py_node(py_tree, nid)
        py_times.append((time.perf_counter() - t0) * 1000)

    rs_s = stats(rs_times)
    py_s = stats(py_times)
    retrieval_results[fname] = {"size_kb": size_kb, "node_count": node_count, "rs": rs_s, "py": py_s}
    print_stats_table(f"{fname} — {node_count} nodes", rs_s, py_s)

print("\\nRetrieval benchmark complete")

\nBuilding indexes for bench_small.md (42 KB)...
  Node count: 28
\n  bench_small.md — 28 nodes
  Metric           Rust     Python    Speedup
  --------------------------------------------
  mean          0.0072ms    0.0060ms      0.83x
  median        0.0054ms    0.0048ms      0.90x
  stdev         0.0098ms    0.0100ms      1.02x
  p95           0.0198ms    0.0073ms      0.37x
  p99           0.0428ms    0.0303ms      0.71x
  max           0.2055ms    0.2320ms      1.13x
\nBuilding indexes for bench_medium.md (395 KB)...
  Node count: 261
\n  bench_medium.md — 261 nodes
  Metric           Rust     Python    Speedup
  --------------------------------------------
  mean          0.0119ms    0.0272ms      2.29x
  median        0.0121ms    0.0270ms      2.23x
  stdev         0.0055ms    0.0023ms      0.42x
  p95           0.0210ms    0.0307ms      1.46x
  p99           0.0227ms    0.0319ms      1.41x
  max           0.0322ms    0.0591ms      1.83x
\nBuilding indexes for bench_large.md (10

## 6. Summary

Consolidated view across all document sizes.

In [7]:
W = 90
print("=" * W)
print("INDEX BUILD — 500 iterations per size")
print("=" * W)
print(f"{'Document':<20} {'Size':>8}  {'Mean':>8}  {'Median':>8}  {'Stdev':>8}  {'p95':>8}  {'p99':>8}  {'Max':>8}  {'Speedup':>9}")
print(f"  (all times in ms)")
print("-" * W)
for fname, r in build_results.items():
    for impl, label in [(r['rs'], 'Rust'), (r['py'], 'Python')]:
        speedup = r['py']['mean'] / r['rs']['mean']
        sp_str = f"{speedup:.2f}x" if label == 'Rust' else ""
        print(f"  {fname:<18} {r['size_kb']:>7.0f}KB  {impl['mean']:>8.3f}  {impl['median']:>8.3f}  {impl['stdev']:>8.3f}  {impl['p95']:>8.3f}  {impl['p99']:>8.3f}  {impl['max']:>8.3f}  {sp_str:>9}  [{label}]")
    print()

print("=" * W)
print("NODE RETRIEVAL — 1000 random lookups per size")
print("=" * W)
print(f"{'Document':<20} {'Nodes':>7}  {'Mean':>8}  {'Median':>8}  {'Stdev':>8}  {'p95':>8}  {'p99':>8}  {'Max':>8}  {'Speedup':>9}")
print(f"  (all times in ms)")
print("-" * W)
for fname, r in retrieval_results.items():
    for impl, label in [(r['rs'], 'Rust'), (r['py'], 'Python')]:
        speedup = r['py']['mean'] / r['rs']['mean']
        sp_str = f"{speedup:.2f}x" if label == 'Rust' else ""
        print(f"  {fname:<18} {r['node_count']:>7}  {impl['mean']:>8.4f}  {impl['median']:>8.4f}  {impl['stdev']:>8.4f}  {impl['p95']:>8.4f}  {impl['p99']:>8.4f}  {impl['max']:>8.4f}  {sp_str:>9}  [{label}]")
    print()

INDEX BUILD — 500 iterations per size
Document                 Size      Mean    Median     Stdev       p95       p99       Max    Speedup
  (all times in ms)
------------------------------------------------------------------------------------------
  bench_small.md          42KB     0.207     0.108     0.835     0.485     1.335    17.401      0.74x  [Rust]
  bench_small.md          42KB     0.153     0.149     0.014     0.166     0.206     0.376             [Python]

  bench_medium.md        395KB     0.873     0.845     0.060     0.974     1.129     1.290      1.57x  [Rust]
  bench_medium.md        395KB     1.369     1.381     0.053     1.455     1.511     1.647             [Python]

  bench_large.md        1055KB     2.549     2.543     0.104     2.685     2.781     3.706      1.68x  [Rust]
  bench_large.md        1055KB     4.278     3.960     2.782     4.158    20.993    42.890             [Python]

NODE RETRIEVAL — 1000 random lookups per size
Document               Nodes      M