The current methodology runs each query 3 times and reports min(iter2, iter3) as "hot" (README "Caching"). For engines that run on a JVM (Trino, Presto, Spark, QuestDB, Druid, Pinot, CrateDB, Doris FE, Greengage / WarehousePG, …), this window catches the JVM mid-warmup. The reported "hot" times are therefore systematically pessimistic compared to actual steady-state performance, by a measurable amount.
This issue proposes a discussion about raising the default hot iteration count above 2.
The mechanism
A JVM starts each query in the interpreter or low-tier compiled code, then progressively tiers up to fully-optimised native code as the JIT meets compilation thresholds. Methods can also be deoptimised when a speculation fails (e.g. a newly-loaded class invalidates an inlining assumption) and recompiled. All of this happens online, in microsecond-to-millisecond bursts, inside query iterations. Two consequences:
1. The first few iterations of any query are slower than the final steady state by a factor that depends on how much code has tiered up. For engines that use per-worker thread pools, every worker crosses its own JIT thresholds independently: a 192-vCPU box has ~24x as many workers as an 8-vCPU one, so the bigger box takes proportionally more iterations to fully warm.
2. JIT-compile and deopt events land at random points inside iterations. With only 2 hot iterations to choose from, the min definition gives lucky-or-unlucky outcomes: a single mid-iteration compile spike can push that iteration to 2-30x its steady-state time, and if it lands in iter 2 or 3, that is what ClickBench reports.
This is intrinsic to JVM runtime compilation, not a quirk of any single engine.
Observed magnitude
Running the full 43-query suite for 10 total runs (1 cold + 9 hot, instead of 1 cold + 2 hot) on a 192-vCPU box exposes the gap between the 3-iter window and true steady state for one JVM-based engine in ClickBench:
| Hot iters | Total runs | Suite-sum (s) | Gap vs steady state | Queries within ±10% of steady state |
|---|---|---|---|---|
| 2 (current ClickBench) | 3 | 4.88 | +48% | 5/43 |
| 3 | 4 | 4.07 | +23% | 19/43 |
| 4 | 5 | 3.61 | +9% | 24/43 |
| 5 | 6 | 3.48 | +5% | 34/43 |
| 6 | 7 | 3.41 | +3% | 35/43 |
| 9 (steady state) | 10 | 3.30 | - | 43/43 |
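The "Gap vs steady state" column can be re-derived from the suite sums. The snippet below is a minimal sketch of that arithmetic; the function and variable names are illustrative and not part of ClickBench, and it assumes the 9-hot-iteration row is the steady-state reference, as the table labels it.

```python
# Suite-sum seconds keyed by number of hot iterations, copied from the table above.
suite_sum_s = {2: 4.88, 3: 4.07, 4: 3.61, 5: 3.48, 6: 3.41, 9: 3.30}


def gap_vs_steady_state(suite_sums, steady_key):
    """Return {hot_iters: whole-percent excess over the steady-state suite sum}."""
    steady = suite_sums[steady_key]
    return {k: round((v / steady - 1.0) * 100) for k, v in suite_sums.items()}


# Assumption: the 9-hot-iteration row is treated as steady state.
gaps = gap_vs_steady_state(suite_sum_s, steady_key=9)
for hot_iters, pct in gaps.items():
    print(f"{hot_iters} hot iters: +{pct}% vs steady state")
```

With the table's numbers this reproduces +48%, +23%, +9%, +5% and +3% for 2, 3, 4, 5 and 6 hot iterations.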
Two representative query trajectories (cold + 9 hot iters, seconds), showing how a single mid-iteration JIT event distorts the 3-iter result:
cold=9.83, 0.256, 1.500, 0.043, 0.042, 0.043, 0.054, 0.042, 0.042, 0.042, 0.042: iter 3 takes 1.5 s because a JIT recompile pause lands inside the iteration, so the hot time rests entirely on iter 2. min(iter2, iter3) = 0.256 s is what ClickBench records, about 6x the ~42 ms steady state.
cold=0.972, 0.461, 0.116, 0.066, 0.055, 0.030, 0.036, 0.039, 0.041, 0.041 - monotonic convergence; iter 2 is 7-15x slower than steady state.
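To see how the reported hot time moves with the window size, the min-over-hot-window rule can be applied to the two trajectories above. This is a sketch only: `hot_time` is a hypothetical helper rather than ClickBench code, and it assumes the first value of each list is the cold run.

```python
def hot_time(runs, hot_iters):
    """Min over the first `hot_iters` runs that follow the cold run (runs[0])."""
    return min(runs[1:1 + hot_iters])


# Per-iteration timings in seconds, copied from the two trajectories above.
traj_a = [9.83, 0.256, 1.500, 0.043, 0.042, 0.043, 0.054, 0.042, 0.042, 0.042, 0.042]
traj_b = [0.972, 0.461, 0.116, 0.066, 0.055, 0.030, 0.036, 0.039, 0.041, 0.041]

for name, runs in (("A", traj_a), ("B", traj_b)):
    for n in (2, 3, 5, 9):
        print(f"trajectory {name}, {n} hot iters: {hot_time(runs, n):.3f} s")
```

With 2 hot iterations this gives 0.256 s and 0.116 s; with 3 it gives 0.043 s and 0.066 s, and by 5 hot iterations both trajectories have reached their minimum (0.042 s and 0.030 s).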
The 1.5-2x iter-3 spike pattern reproduces on both HotSpot C2 (OpenJDK 25) and GraalVM CE 25 on the same binary, so it's not a single-compiler artefact.
The exact magnitude will differ across JVM-based engines depending on codebase, GC, JIT mode, and hardware. The direction is the same for all of them: longer windows produce closer-to-steady-state numbers. JVM engines other than the one measured here will exhibit the same kind of curve, with their own numerical specifics.
Existing data points to the same gap
ClickBench's recent methodology refresh added a sustained-throughput metric (concurrent_qps, 10 worker connections x 600 s window). By second ~30 of that window every hot JIT method has been compiled across every worker, so concurrent_qps reflects actual steady-state throughput, not warmup.
The presence of both hot and concurrent_qps in the methodology is implicit acknowledgement that single-shot timings don't fully characterise an engine's performance. The current proposal targets the same gap from the other side: make the cheap, single-threaded hot number itself a closer reflection of what an engine actually does at steady state, so that the two metrics tell consistent stories. JVM engines today are the largest source of disagreement between the two.
Proposal for discussion
Increase the default number of hot iterations above 2. The exact number is for the maintainers and the community to decide. A few aspects worth thinking through together:
Cost. Most additional iterations are near-steady-state and cheap. For the bench run above, going from 3 to 7 total runs added ~10 s of wall-clock to a 7-minute single-machine suite. For slower systems (Postgres, MariaDB, …), load and cold-run time dominate, so the relative cost is even smaller.
Backwards compatibility. The website's Hot Run = min(runs[1:]) formula is agnostic to how many runs are recorded; existing 3-iter result JSON remains valid input. Re-running individual submissions is gradual, as with previous methodology refreshes (Migrate from lukewarm to true cold runs #793).
Diminishing returns. Marginal closure of the warmup gap drops sharply after a small number of additional iterations; the table above gives one engine's curve, and other JVM engines would have their own. Coordinating with several JVM-engine maintainers (or running 10-iter probes on a representative few ourselves) could help pick a default.
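The backwards-compatibility point can be illustrated with a small sketch of the `min(runs[1:])` rule. The function below is not the website's actual code; the per-query list layout and the handling of failed runs as `None` are assumptions made for illustration, and the sample timings are the first trajectory above, truncated.

```python
from typing import Optional, Sequence


def hot_run(runs: Sequence[Optional[float]]) -> Optional[float]:
    """Hot Run = min(runs[1:]): runs[0] is the cold run, the rest are hot.

    Assumption: a failed run is stored as None and is skipped here.
    """
    hot = [t for t in runs[1:] if t is not None]
    return min(hot) if hot else None


# Hypothetical result layout: one inner list of timings (seconds) per query.
three_run_result = [[9.83, 0.256, 1.500]]
seven_run_result = [[9.83, 0.256, 1.500, 0.043, 0.042, 0.043, 0.054]]

for result in (three_run_result, seven_run_result):
    print([hot_run(runs) for runs in result])
```

The same function handles both inputs without modification, which is the sense in which the formula is agnostic to the number of recorded runs.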
I'd be happy to share more data, run additional experiments on other hardware, or coordinate with maintainers of other JVM-based engines.
Why now
The recent methodology refresh added sustained-throughput (concurrent_qps) and migrated to true cold runs (#793). Increasing the hot iteration count would close out a third class of methodology gap that's currently observable: the single-threaded hot metric systematically under-reporting JVM-based engines' steady-state performance.