Profile existing SimpleMapStore: Criterion benchmarks for insert, query, and store analyze #211

Merged
zzylol merged 4 commits into main from 139-profile-existing-simple-store
Mar 20, 2026
Conversation


@zzylol zzylol commented Mar 20, 2026

Summary

Adds a Criterion benchmark suite to profile the existing SimpleMapStore before the replacement in #175. Both PerKey and Global lock strategies are covered, and both SumAccumulator (trivial baseline) and DatasketchesKLLAccumulator k=200 (realistic sketch) are benchmarked to expose clone cost during queries.

Algorithm complexity

| Operation | Complexity | Notes |
| --- | --- | --- |
| insert_precomputed_output_batch | O(B + A·log A) | B = batch size, A = agg IDs; +O(W·log W) if cleanup triggers |
| query_precomputed_output (range) | O(W·log W + k) | full scan + sort on every call, even for small ranges |
| query_precomputed_output_exact | O(k) | direct HashMap lookup, unaffected by store size |
| get_earliest_timestamp (analyze) | O(A) | PerKey: lock-free atomics; Global: clone under Mutex |

Space: O(A · W · L · sketch_size) — sketches are heap-allocated per entry, no sharing across windows.

Concurrency: both strategies serialize readers — query_precomputed_output takes a write lock to update read_counts, so concurrent reads degrade linearly with thread count.
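The serialization mechanism can be sketched with hypothetical stand-in types (this is not the actual SimpleMapStore API): because a query must bump read_counts, the read path is forced through a write lock, so concurrent readers queue behind one another.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical stand-in for the store's state (not the real SimpleMapStore).
struct Store {
    windows: RwLock<HashMap<u64, f64>>,     // timestamp -> precomputed value
    read_counts: RwLock<HashMap<u64, u64>>, // timestamp -> times queried
}

impl Store {
    // A "read" that nevertheless mutates state: it must take a write lock
    // on read_counts, which is what serializes otherwise read-only queries.
    fn query(&self, ts: u64) -> Option<f64> {
        let val = self.windows.read().unwrap().get(&ts).copied();
        if val.is_some() {
            *self.read_counts.write().unwrap().entry(ts).or_insert(0) += 1;
        }
        val
    }
}

fn main() {
    let store = Store {
        windows: RwLock::new(HashMap::from([(100, 1.5)])),
        read_counts: RwLock::new(HashMap::new()),
    };
    assert_eq!(store.query(100), Some(1.5));
    assert_eq!(store.query(100), Some(1.5));
    // Both "reads" had to acquire the write lock to record themselves.
    assert_eq!(*store.read_counts.read().unwrap().get(&100).unwrap(), 2);
}
```

Whichever lock strategy wraps the map, this write acquisition on the read path caps reader concurrency at one.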

Benchmarks

| Group | Parameter swept | Accumulator types |
| --- | --- | --- |
| insert/batch_size | 100 → 50 000 items | sum, kll |
| insert/num_agg_ids | 1 → 1 000 distinct agg IDs (1 000 items total) | sum |
| query/range_store_size | 500 → 50 000 stored windows | sum, kll |
| query/exact_store_size | 500 → 50 000 stored windows | sum |
| store_analyze/num_agg_ids | 10 → 5 000 agg IDs | sum |
| concurrent_reads/thread_count | 1 → 16 threads (5 000 windows) | sum |

Accumulator types:

  • sum: SumAccumulator (single f64, ~0 clone cost; isolates the lock/sort baseline)
  • kll: DatasketchesKLLAccumulator with k=200 (sketch populated with 20 values, realistic clone cost)
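The suite itself uses Criterion, but the shape of each parameter sweep can be sketched with a std-only harness (hypothetical insert_batch helper; Criterion adds warm-up, outlier rejection, and statistics on top of this loop):

```rust
use std::time::Instant;

// Hypothetical stand-in for one store insert: fold a batch into an accumulator.
fn insert_batch(batch: &[f64]) -> f64 {
    batch.iter().sum()
}

fn main() {
    // Mirrors the insert/batch_size sweep: one timed loop per batch size.
    for &batch_size in &[100usize, 1_000, 5_000] {
        let batch: Vec<f64> = (0..batch_size).map(|i| i as f64).collect();
        let iters: u32 = 1_000;
        let mut sink = 0.0; // consume results so the loop isn't optimized away
        let start = Instant::now();
        for _ in 0..iters {
            sink += insert_batch(&batch);
        }
        let per_iter_ns = start.elapsed().as_nanos() / iters as u128;
        println!("batch_size={batch_size}: ~{per_iter_ns} ns/iter (sink={sink})");
    }
}
```

Criterion replaces the manual loop with `b.iter(...)` inside a benchmark group, and emits the HTML reports referenced below.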

How to run

```shell
cargo bench -p query_engine_rust --bench simple_store_bench
# HTML reports → target/criterion/
```

Benchmark results

Measured on this machine (Linux, optimized build). All times are median of 100 samples.

insert/batch_size — O(B), throughput roughly constant

| batch size | sum/per_key | sum/global | kll/per_key | kll/global |
| --- | --- | --- | --- | --- |
| 100 | 49.8 µs | 35.6 µs | 70.0 µs | 55.0 µs |
| 1 000 | 505.6 µs | 376.8 µs | 724.5 µs | 596.3 µs |
| 5 000 | 2.58 ms | 1.90 ms | 3.79 ms | 3.24 ms |
| 10 000 | 5.19 ms | 3.92 ms | 8.25 ms | 7.30 ms |
| 50 000 | 28.6 ms | 21.0 ms | 71.1 ms | 58.8 ms |

  • Global lock is ~30% faster for inserts across all batch sizes (one Mutex vs per-shard DashMap locking).
  • KLL adds ~40% overhead for sum at small batches, growing to ~2.5× at 50 000 items due to sketch construction cost (20 updates per sketch).

insert/num_agg_ids — fixed 1 000-item batch, spread across N agg IDs

| agg IDs | per_key | global |
| --- | --- | --- |
| 1 | 512.8 µs | 374.3 µs |
| 10 | 528.4 µs | 367.3 µs |
| 50 | 584.0 µs | 415.2 µs |
| 200 | 678.6 µs | 503.2 µs |
| 1 000 | 1.355 ms | 599.9 µs |

Per-key latency grows 2.6× from 1 → 1 000 agg IDs, while global grows only ~1.6×. At 1 000 agg IDs global is 2.3× faster — each agg ID in per-key hits a separate DashMap shard and acquires a separate RwLock.

query/range_store_size — O(W·log W + k), KLL clone cost dominates

| stored windows | sum/per_key | sum/global | kll/per_key | kll/global | kll overhead |
| --- | --- | --- | --- | --- | --- |
| 500 | 83.7 µs | 87.3 µs | 816.7 µs | 888.6 µs | ~10× |
| 1 000 | 177.0 µs | 168.4 µs | 1.72 ms | 1.71 ms | ~10× |
| 5 000 | 979.3 µs | 961.6 µs | 8.71 ms | 8.61 ms | ~9× |
| 10 000 | 2.01 ms | 1.99 ms | 18.5 ms | 18.4 ms | ~9× |
| 50 000 | 11.6 ms | 12.0 ms | 144.4 ms | 141.3 ms | ~12× |

The sum baseline confirms O(W·log W) sorting behaviour. KLL adds a consistent ~10× overhead: clone_boxed_core() on a k=200 sketch dominates the query cost at every scale. With real sketches, range queries scale far worse than the sort alone suggests. Lock strategy makes no practical difference for range queries at any scale.
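The shape of the range query can be sketched with hypothetical types (not the real store): every call scans all W windows, sorts, then clones the k in-range accumulators, so an expensive accumulator clone multiplies the entire k term.

```rust
use std::collections::HashMap;

// Hypothetical accumulator with a heavyweight clone (stands in for a KLL sketch,
// which must copy its internal buffers on clone).
#[derive(Clone)]
struct Sketch {
    values: Vec<f64>,
}

// O(W·log W + k): full scan of every stored window, a sort, then k clones.
fn query_range(store: &HashMap<u64, Sketch>, start: u64, end: u64) -> Vec<(u64, Sketch)> {
    let mut timestamps: Vec<u64> = store.keys().copied().collect(); // O(W) scan
    timestamps.sort_unstable();                                     // O(W·log W)
    timestamps
        .into_iter()
        .filter(|ts| (start..=end).contains(ts))
        .map(|ts| (ts, store[&ts].clone())) // O(k) clones: dominant for sketches
        .collect()
}

fn main() {
    let store: HashMap<u64, Sketch> = (0..100)
        .map(|ts| (ts, Sketch { values: vec![0.0; 200] }))
        .collect();
    let hits = query_range(&store, 10, 19);
    assert_eq!(hits.len(), 10);
    assert!(hits.windows(2).all(|w| w[0].0 < w[1].0)); // results come back sorted
}
```

Note the scan and sort run even when the requested range covers a handful of windows, matching the "full scan + sort on every call" complexity note above.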

query/exact_store_size — O(1) confirmed, flat regardless of store size

| stored windows | per_key | global |
| --- | --- | --- |
| 500 | 304 ns | 283 ns |
| 1 000 | 289 ns | 289 ns |
| 5 000 | 306 ns | 301 ns |
| 10 000 | 289 ns | 293 ns |
| 50 000 | 304 ns | 291 ns |

Flat ~290–305 ns across 500 → 50 000 windows. O(1) HashMap lookup confirmed. Lock strategy irrelevant.

store_analyze/num_agg_ids — global is 58–130× faster

| agg IDs | per_key | global | global speedup |
| --- | --- | --- | --- |
| 10 | 6.60 µs | 50.7 ns | 130× |
| 100 | 13.5 µs | 128 ns | 106× |
| 500 | 48.8 µs | 837 ns | 58× |
| 1 000 | 90.6 µs | 1.55 µs | 58× |
| 5 000 | 413 µs | 6.02 µs | 69× |

Global maintains a pre-built HashMap cloned under one Mutex in nanoseconds. PerKey must iterate all DashMap shards with per-shard locking — O(A) with high constant. The gap remains 2–3 orders of magnitude across all scales.

concurrent_reads/thread_count — both strategies serialize (5 000 windows)

| threads | per_key | global |
| --- | --- | --- |
| 1 | 2.11 ms | 1.94 ms |
| 2 | 3.27 ms | 3.25 ms |
| 4 | 6.12 ms | 6.00 ms |
| 8 | 8.98 ms | 8.90 ms |
| 16 | 16.3 ms | 16.7 ms |

Both strategies degrade linearly with thread count (~2× per doubling, confirming full serialisation). The write lock taken during query_precomputed_output (to update read_counts) eliminates any concurrency benefit from either design. Global and per-key are indistinguishable at scale.

Test plan

  • cargo build --benches passes with no new warnings
  • Run benchmarks and record baseline numbers in the issue thread

🤖 Generated with Claude Code

Adds six benchmark groups to measure the existing SimpleMapStore's
algorithm complexity before the inverted-index replacement in PR #175:
- insert/batch_size: O(B) insert scaling across 10–5000 items
- insert/num_agg_ids: lock overhead across 1–200 aggregation IDs
- query/range_store_size: O(W·log W + k) range query across 100–5000 windows
- query/exact_store_size: O(1) HashMap lookup verified across store sizes
- store_analyze/num_agg_ids: O(A) earliest-timestamp scan across 10–1000 IDs
- concurrent_reads/thread_count: write-lock serialisation with 1–8 threads

Both Global and PerKey lock strategies are profiled in each group.
Results land in target/criterion/ as HTML reports.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zzylol zzylol linked an issue Mar 20, 2026 that may be closed by this pull request
- Apply rustfmt to simple_store_bench.rs (alignment, closure brace style)
- Add dummy benches/simple_store_bench.rs stub to Dockerfile dep-cache layer
  so cargo can parse the [[bench]] manifest entry during docker build

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
zzylol and others added 2 commits March 19, 2026 20:59
- Increase parameter ranges (batch: 100→50k, windows: 500→50k,
  agg IDs: 1→1k/5k, threads: 1→16)
- Add DatasketchesKLLAccumulator k=200 variant to insert and range
  query benchmarks to expose realistic sketch clone cost
- KLL results show ~10x overhead on range queries vs trivial SumAccumulator

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…_bench

Same fix as asap-query-engine/Dockerfile — the workspace Cargo.toml for
asap-query-engine declares [[bench]] simple_store_bench, so Cargo requires
the file to exist even during dependency-only builds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@milindsrivastava1997 milindsrivastava1997 left a comment

Thanks for this! Pls merge after CI.

@zzylol zzylol merged commit 1ce5abe into main Mar 20, 2026
4 checks passed
@zzylol zzylol deleted the 139-profile-existing-simple-store branch March 20, 2026 02:33

Successfully merging this pull request may close these issues.

profile existing simple store

2 participants