Profile existing SimpleMapStore: Criterion benchmarks for insert, query, and store analyze #211

Merged
zzylol merged 4 commits into main from 139-profile-existing-simple-store
Mar 20, 2026
Conversation


@zzylol zzylol commented Mar 20, 2026

Summary

Adds a Criterion benchmark suite to profile the existing SimpleMapStore before the replacement in #175. Both PerKey and Global lock strategies are covered, and both SumAccumulator (trivial baseline) and DatasketchesKLLAccumulator k=200 (realistic sketch) are benchmarked to expose clone cost during queries.

Algorithm complexity

| Operation | Complexity | Notes |
| --- | --- | --- |
| insert_precomputed_output_batch | O(B + A·log A) | B = batch size, A = agg IDs; +O(W·log W) if cleanup triggers |
| query_precomputed_output (range) | O(W·log W + k) | full scan + sort on every call, even for small ranges |
| query_precomputed_output_exact | O(k) | direct HashMap lookup, unaffected by store size |
| get_earliest_timestamp (analyze) | O(A) | PerKey: lock-free atomics; Global: clone under Mutex |

Space: O(A · W · L · sketch_size) — sketches are heap-allocated per entry, no sharing across windows.

Concurrency: both strategies serialize readers — query_precomputed_output takes a write lock to update read_counts, so concurrent reads degrade linearly with thread count.
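The serialization mechanism can be sketched with hypothetical stand-in types (this is not the actual SimpleMapStore API): because a query must bump read_counts, the read path is forced through a write lock, so concurrent readers queue behind one another.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical stand-in for the store's state (not the real SimpleMapStore).
struct Store {
    windows: RwLock<HashMap<u64, f64>>,     // timestamp -> precomputed value
    read_counts: RwLock<HashMap<u64, u64>>, // timestamp -> times queried
}

impl Store {
    // A "read" that nevertheless mutates state: it must take a write lock
    // on read_counts, which is what serializes otherwise read-only queries.
    fn query(&self, ts: u64) -> Option<f64> {
        let val = self.windows.read().unwrap().get(&ts).copied();
        if val.is_some() {
            *self.read_counts.write().unwrap().entry(ts).or_insert(0) += 1;
        }
        val
    }
}

fn main() {
    let store = Store {
        windows: RwLock::new(HashMap::from([(100, 1.5)])),
        read_counts: RwLock::new(HashMap::new()),
    };
    assert_eq!(store.query(100), Some(1.5));
    assert_eq!(store.query(100), Some(1.5));
    // Both "reads" had to acquire the write lock to record themselves.
    assert_eq!(*store.read_counts.read().unwrap().get(&100).unwrap(), 2);
}
```

Whichever lock strategy wraps the map, this write acquisition on the read path caps reader concurrency at one.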

Benchmarks

| Group | Parameter swept | Accumulator types |
| --- | --- | --- |
| insert/batch_size | 100 → 50 000 items | sum, kll |
| insert/num_agg_ids | 1 → 1 000 distinct agg IDs (1 000 items total) | sum |
| query/range_store_size | 500 → 50 000 stored windows | sum, kll |
| query/exact_store_size | 500 → 50 000 stored windows | sum |
| store_analyze/num_agg_ids | 10 → 5 000 agg IDs | sum |
| concurrent_reads/thread_count | 1 → 16 threads (5 000 windows) | sum |

Accumulator types:

  • sum: SumAccumulator (single f64, ~0 clone cost; isolates the lock/sort baseline)
  • kll: DatasketchesKLLAccumulator with k=200 (sketch populated with 20 values, realistic clone cost)
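The suite itself uses Criterion, but the shape of each parameter sweep can be sketched with a std-only harness (hypothetical insert_batch helper; Criterion adds warm-up, outlier rejection, and statistics on top of this loop):

```rust
use std::time::Instant;

// Hypothetical stand-in for one store insert: fold a batch into an accumulator.
fn insert_batch(batch: &[f64]) -> f64 {
    batch.iter().sum()
}

fn main() {
    // Mirrors the insert/batch_size sweep: one timed loop per batch size.
    for &batch_size in &[100usize, 1_000, 5_000] {
        let batch: Vec<f64> = (0..batch_size).map(|i| i as f64).collect();
        let iters: u32 = 1_000;
        let mut sink = 0.0; // consume results so the loop isn't optimized away
        let start = Instant::now();
        for _ in 0..iters {
            sink += insert_batch(&batch);
        }
        let per_iter_ns = start.elapsed().as_nanos() / iters as u128;
        println!("batch_size={batch_size}: ~{per_iter_ns} ns/iter (sink={sink})");
    }
}
```

Criterion replaces the manual loop with `b.iter(...)` inside a benchmark group, and emits the HTML reports referenced below.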

How to run

```shell
cargo bench -p query_engine_rust --bench simple_store_bench
# HTML reports → target/criterion/
```

Benchmark results

Measured on this machine (Linux, optimized build). All times are median of 100 samples.

insert/batch_size — O(B), throughput roughly constant

| batch size | sum/per_key | sum/global | kll/per_key | kll/global |
| --- | --- | --- | --- | --- |
| 100 | 49.8 µs | 35.6 µs | 70.0 µs | 55.0 µs |
| 1 000 | 505.6 µs | 376.8 µs | 724.5 µs | 596.3 µs |
| 5 000 | 2.58 ms | 1.90 ms | 3.79 ms | 3.24 ms |
| 10 000 | 5.19 ms | 3.92 ms | 8.25 ms | 7.30 ms |
| 50 000 | 28.6 ms | 21.0 ms | 71.1 ms | 58.8 ms |

  • Global lock is ~30% faster for inserts across all batch sizes (one Mutex vs per-shard DashMap locking).
  • KLL adds ~40% overhead for sum at small batches, growing to ~2.5× at 50 000 items due to sketch construction cost (20 updates per sketch).

insert/num_agg_ids — fixed 1 000-item batch, spread across N agg IDs

| agg IDs | per_key | global |
| --- | --- | --- |
| 1 | 512.8 µs | 374.3 µs |
| 10 | 528.4 µs | 367.3 µs |
| 50 | 584.0 µs | 415.2 µs |
| 200 | 678.6 µs | 503.2 µs |
| 1 000 | 1.355 ms | 599.9 µs |

Per-key latency grows 2.6× from 1 → 1 000 agg IDs, while global grows only ~1.6×. At 1 000 agg IDs global is 2.3× faster — each agg ID in per-key hits a separate DashMap shard and acquires a separate RwLock.

query/range_store_size — O(W·log W + k), KLL clone cost dominates

| stored windows | sum/per_key | sum/global | kll/per_key | kll/global | kll overhead |
| --- | --- | --- | --- | --- | --- |
| 500 | 83.7 µs | 87.3 µs | 816.7 µs | 888.6 µs | ~10× |
| 1 000 | 177.0 µs | 168.4 µs | 1.72 ms | 1.71 ms | ~10× |
| 5 000 | 979.3 µs | 961.6 µs | 8.71 ms | 8.61 ms | ~9× |
| 10 000 | 2.01 ms | 1.99 ms | 18.5 ms | 18.4 ms | ~9× |
| 50 000 | 11.6 ms | 12.0 ms | 144.4 ms | 141.3 ms | ~12× |

The sum baseline confirms O(W·log W) sorting behaviour. KLL adds a consistent ~10× overhead: clone_boxed_core() on a k=200 sketch dominates the query cost at every scale. With real sketches, range queries scale far worse than the sort alone suggests. Lock strategy makes no practical difference for range queries at any scale.
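The shape of the range query can be sketched with hypothetical types (not the real store): every call scans all W windows, sorts, then clones the k in-range accumulators, so an expensive accumulator clone multiplies the entire k term.

```rust
use std::collections::HashMap;

// Hypothetical accumulator with a heavyweight clone (stands in for a KLL sketch,
// which must copy its internal buffers on clone).
#[derive(Clone)]
struct Sketch {
    values: Vec<f64>,
}

// O(W·log W + k): full scan of every stored window, a sort, then k clones.
fn query_range(store: &HashMap<u64, Sketch>, start: u64, end: u64) -> Vec<(u64, Sketch)> {
    let mut timestamps: Vec<u64> = store.keys().copied().collect(); // O(W) scan
    timestamps.sort_unstable();                                     // O(W·log W)
    timestamps
        .into_iter()
        .filter(|ts| (start..=end).contains(ts))
        .map(|ts| (ts, store[&ts].clone())) // O(k) clones: dominant for sketches
        .collect()
}

fn main() {
    let store: HashMap<u64, Sketch> = (0..100)
        .map(|ts| (ts, Sketch { values: vec![0.0; 200] }))
        .collect();
    let hits = query_range(&store, 10, 19);
    assert_eq!(hits.len(), 10);
    assert!(hits.windows(2).all(|w| w[0].0 < w[1].0)); // results come back sorted
}
```

Note the scan and sort run even when the requested range covers a handful of windows, matching the "full scan + sort on every call" complexity note above.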

query/exact_store_size — O(1) confirmed, flat regardless of store size

| stored windows | per_key | global |
| --- | --- | --- |
| 500 | 304 ns | 283 ns |
| 1 000 | 289 ns | 289 ns |
| 5 000 | 306 ns | 301 ns |
| 10 000 | 289 ns | 293 ns |
| 50 000 | 304 ns | 291 ns |

Flat ~290–305 ns across 500 → 50 000 windows. O(1) HashMap lookup confirmed. Lock strategy irrelevant.

store_analyze/num_agg_ids — global is 58–130× faster

| agg IDs | per_key | global | global speedup |
| --- | --- | --- | --- |
| 10 | 6.60 µs | 50.7 ns | 130× |
| 100 | 13.5 µs | 128 ns | 106× |
| 500 | 48.8 µs | 837 ns | 58× |
| 1 000 | 90.6 µs | 1.55 µs | 58× |
| 5 000 | 413 µs | 6.02 µs | 69× |

Global maintains a pre-built HashMap cloned under one Mutex in nanoseconds. PerKey must iterate all DashMap shards with per-shard locking — O(A) with high constant. The gap remains 2–3 orders of magnitude across all scales.

concurrent_reads/thread_count — both strategies serialize (5 000 windows)

| threads | per_key | global |
| --- | --- | --- |
| 1 | 2.11 ms | 1.94 ms |
| 2 | 3.27 ms | 3.25 ms |
| 4 | 6.12 ms | 6.00 ms |
| 8 | 8.98 ms | 8.90 ms |
| 16 | 16.3 ms | 16.7 ms |

Both strategies degrade linearly with thread count (~2× per doubling, confirming full serialisation). The write lock taken during query_precomputed_output (to update read_counts) eliminates any concurrency benefit from either design. Global and per-key are indistinguishable at scale.

Test plan

  • cargo build --benches passes with no new warnings
  • Run benchmarks and record baseline numbers in the issue thread

🤖 Generated with Claude Code

Adds six benchmark groups to measure the existing SimpleMapStore's
algorithm complexity before the inverted-index replacement in PR #175:
- insert/batch_size: O(B) insert scaling across 10–5000 items
- insert/num_agg_ids: lock overhead across 1–200 aggregation IDs
- query/range_store_size: O(W·log W + k) range query across 100–5000 windows
- query/exact_store_size: O(1) HashMap lookup verified across store sizes
- store_analyze/num_agg_ids: O(A) earliest-timestamp scan across 10–1000 IDs
- concurrent_reads/thread_count: write-lock serialisation with 1–8 threads

Both Global and PerKey lock strategies are profiled in each group.
Results land in target/criterion/ as HTML reports.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zzylol zzylol linked an issue Mar 20, 2026 that may be closed by this pull request
- Apply rustfmt to simple_store_bench.rs (alignment, closure brace style)
- Add dummy benches/simple_store_bench.rs stub to Dockerfile dep-cache layer
  so cargo can parse the [[bench]] manifest entry during docker build

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
zzylol and others added 2 commits March 19, 2026 20:59
- Increase parameter ranges (batch: 100→50k, windows: 500→50k,
  agg IDs: 1→1k/5k, threads: 1→16)
- Add DatasketchesKLLAccumulator k=200 variant to insert and range
  query benchmarks to expose realistic sketch clone cost
- KLL results show ~10x overhead on range queries vs trivial SumAccumulator

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…_bench

Same fix as asap-query-engine/Dockerfile — the workspace Cargo.toml for
asap-query-engine declares [[bench]] simple_store_bench, so Cargo requires
the file to exist even during dependency-only builds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@milindsrivastava1997 milindsrivastava1997 left a comment

Thanks for this! Pls merge after CI.

@zzylol zzylol merged commit 1ce5abe into main Mar 20, 2026
4 checks passed
@zzylol zzylol deleted the 139-profile-existing-simple-store branch March 20, 2026 02:33

Successfully merging this pull request may close these issues.

profile existing simple store

2 participants