Methodology: 3 hot iterations is insufficient for JVM-based engines #934

## Summary

The current methodology runs each query 3 times and reports `min(iter2, iter3)` as "hot" ([README "Caching"](https://github.com/ClickHouse/ClickBench#caching)). For engines that run on a JVM (Trino, Presto, Spark, QuestDB, Druid, Pinot, CrateDB, Doris FE, Greengage / WarehousePG, …), this window catches the JVM mid-warmup, so the reported "hot" times are systematically pessimistic compared to actual steady-state performance: +48% on the suite sum for the one engine measured below.

This issue proposes discussing an increase in the default hot iteration count above 2.

## The mechanism

A JVM initially runs a query's code paths in the interpreter or low-tier compiled code, then progressively tiers them up to fully-optimised native code as methods cross the JIT compilation thresholds. Methods can also be *deoptimised* when a speculation fails (e.g. a newly-loaded class invalidates an inlining assumption) and then recompiled. All of this happens online, in microsecond-to-millisecond bursts, **inside query iterations**. Two consequences:

1. The first few iterations of any query are slower than the final steady-state by a factor that depends on how much code has tiered up. For engines that use per-worker thread pools, every worker crosses its own JIT thresholds independently - a 192-vCPU box has 24x as many workers as an 8-vCPU one, so the bigger box takes proportionally more iterations to fully warm.
2. JIT-compile and deopt events land at random points inside iterations. With only 2 hot iterations to choose from, `min` has little room to filter them out: a single mid-iteration compile spike can push an iteration to 2-30x its steady-state time, and if one lands in iter 2 or iter 3, the value ClickBench reports rests on the single remaining sample, which may itself still be warming up.

This is intrinsic to JVM runtime compilation, not a quirk of any single engine.
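The second consequence can be illustrated with a toy model in Python. It assumes, purely for illustration, that each hot iteration is independently hit by a compile or deopt spike with probability `p`. Neither the independence assumption nor the value of `p` comes from the measurements in this issue, and real JIT activity is front-loaded rather than independent. Under that assumption, the `min` over `k` hot iterations is contaminated only when all `k` are hit, which happens with probability `p**k`.

```python
import random


def p_all_spiked(p, k):
    """Probability that every one of k independent hot iterations contains a spike."""
    return p ** k


def simulate(p, k, trials=100_000, seed=0):
    """Monte Carlo estimate of the same probability, using a seeded RNG."""
    rng = random.Random(seed)
    hits = sum(
        all(rng.random() < p for _ in range(k))
        for _ in range(trials)
    )
    return hits / trials


p = 0.3  # hypothetical per-iteration spike probability, not a measured value
for k in (2, 3, 5, 9):
    print(f"k={k}: analytic={p_all_spiked(p, k):.4f}, simulated={simulate(p, k):.4f}")
```

Even in this simplified model, the chance that the reported `min` still contains a spike shrinks geometrically with each extra hot iteration.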

## Observed magnitude

Running a full 43-query suite for 10 total runs (1 cold + 9 hot, instead of 1 cold + 2 hot) on a 192-vCPU box exposes the gap between the 3-iter window and true steady state for one JVM-based engine in ClickBench:

| Hot iters | Total runs | Suite-sum (s) | Gap vs steady state | Queries within ±10% of steady state |
| --------- | ---------- | ------------- | ------------------- | ----------------------------------- |
| **2 (current ClickBench)** | **3** | **4.88** | **+48%** | **5/43** |
| 3         | 4          | 4.07          | +23%                | 19/43 |
| 4         | 5          | 3.61          | +9%                 | 24/43 |
| 5         | 6          | 3.48          | +5%                 | 34/43 |
| 6         | 7          | 3.41          | +3%                 | 35/43 |
| 9 (steady state) | 10  | 3.30          | -                   | 43/43 |
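The "Gap vs steady state" column can be reproduced from the suite sums. A minimal Python sketch, using only the figures in the table above and treating the 9-hot-iteration row as steady state (the variable and function names are illustrative):

```python
# Suite-sum in seconds, keyed by number of hot iterations (copied from the table).
suite_sum = {2: 4.88, 3: 4.07, 4: 3.61, 5: 3.48, 6: 3.41, 9: 3.30}

steady = suite_sum[9]  # the 9-hot-iteration row is treated as steady state


def gap_pct(hot_iters):
    """Percentage by which the suite-sum exceeds the steady-state suite-sum."""
    return (suite_sum[hot_iters] / steady - 1.0) * 100.0


# Rounded gaps, matching the precision used in the table.
gaps = {n: round(gap_pct(n)) for n in suite_sum}

for n in sorted(gaps):
    print(f"{n} hot iters: +{gaps[n]}% vs steady state")
```

The rounded values match the table: 48, 23, 9, 5, and 3 percent for 2 through 6 hot iterations.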

Two representative query trajectories (cold + 9 hot iters, seconds), showing how a single mid-iteration JIT event distorts the 3-iter result:

- `cold=9.83, 0.256, 1.500, 0.043, 0.042, 0.043, 0.054, 0.042, 0.042, 0.042, 0.042` - iter 3 takes 1.5 s because a JIT recompile pause happens inside the iteration. That leaves iter 2 (0.256 s) as the only usable sample in the 3-iter window, so `min(iter2, iter3)` reports 0.256 s, about 6x the ~42 ms steady state.
- `cold=0.972, 0.461, 0.116, 0.066, 0.055, 0.030, 0.036, 0.039, 0.041, 0.041` - roughly monotonic convergence; iter 2 (0.461 s) is about 11x the ~41 ms steady state, and `min(iter2, iter3)` = 0.116 s is still about 2.8x.
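To see what different window sizes would report for these two trajectories, here is a small Python sketch. The helper `hot_time` is a hypothetical name; it applies the `min` over the first `hot_iters` runs after the cold run, which for `hot_iters=2` is the current `min(iter2, iter3)` definition.

```python
def hot_time(runs, hot_iters):
    """Minimum over the first `hot_iters` runs after the cold run (runs[0])."""
    hot = runs[1:1 + hot_iters]
    if not hot:
        raise ValueError("need at least one hot run")
    return min(hot)


# Trajectories copied from the two bullets above (seconds); index 0 is the cold run.
query_a = [9.83, 0.256, 1.500, 0.043, 0.042, 0.043, 0.054, 0.042, 0.042, 0.042, 0.042]
query_b = [0.972, 0.461, 0.116, 0.066, 0.055, 0.030, 0.036, 0.039, 0.041, 0.041]

for name, runs in (("A", query_a), ("B", query_b)):
    for n in (2, 3, 9):
        print(f"query {name}, {n} hot iters: {hot_time(runs, n):.3f} s")
```

For the first trajectory the reported value drops from 0.256 s with 2 hot iterations to 0.043 s with 3, and for the second from 0.116 s to 0.066 s.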

The 1.5-2x iter-3 spike pattern reproduces on both HotSpot C2 (OpenJDK 25) and GraalVM CE 25 on the same binary, so it's not a single-compiler artefact.

The exact magnitude will differ across JVM-based engines depending on codebase, GC, JIT mode, and hardware. The *direction* should be the same for all of them: longer windows produce closer-to-steady-state numbers, each engine following its own curve.

## Existing data points to the same gap

ClickBench's recent methodology refresh added a sustained-throughput metric (`concurrent_qps`, 10 worker connections x 600 s window). By second ~30 of that window every hot method has been JIT-compiled across every worker, so `concurrent_qps` reflects actual steady-state throughput, not warmup.

The presence of both `hot` and `concurrent_qps` in the methodology is implicit acknowledgement that single-shot timings don't fully characterise an engine's performance. The current proposal targets the same gap from the other side: make the cheap, single-threaded `hot` number itself a closer reflection of what an engine actually does at steady state, so that the two metrics tell consistent stories. JVM engines today are the largest source of disagreement between the two.

## Proposal for discussion

Increase the default number of hot iterations above 2. The exact number is for the maintainers and the community to decide. A few aspects worth thinking through together:

- **Cost.** Most additional iterations are near-steady-state and cheap. For the bench run above, going from 3 to 7 total runs added ~10 s of wall-clock to a 7-minute single-machine suite. For slower systems (Postgres, MariaDB, …), load and cold-run time dominate, so the relative cost is even smaller.
- **Backwards compatibility.** The website's `Hot Run = min(runs[1:])` formula is agnostic to how many runs are recorded; existing 3-iter result JSON remains valid input. Re-running individual submissions is gradual, as with previous methodology refreshes (#793).
- **Diminishing returns.** Marginal closure of the warmup gap drops sharply after a small number of additional iterations; the table above gives one engine's curve, and other JVM engines would have their own. Coordinating with several JVM-engine maintainers (or running 10-iter probes on a representative few ourselves) could help pick a default.

I'd be happy to share more data, run additional experiments on other hardware, or coordinate with maintainers of other JVM-based engines.

## Why now

The recent methodology refresh added sustained-throughput (`concurrent_qps`) and migrated to true cold runs (#793). Increasing the hot iteration count would close out a third class of methodology gap that's currently observable: the single-threaded hot metric systematically under-reporting JVM-based engines' steady-state performance.
