Fix StarRocks and Doris cheating at cold runs #845

Merged
alexey-milovidov merged 4 commits into main from fix-cold-run-cheating-starrocks-doris
May 4, 2026

Conversation

@alexey-milovidov
Member

No description provided.

alexey-milovidov and others added 4 commits May 2, 2026 23:48
Both StarRocks and Doris run a long-lived BE daemon with a process-internal
`storage_page_cache` (default ~20% of RAM) that holds decoded column data
across queries. The benchmark's `run.sh` only does
`echo 3 > /proc/sys/vm/drop_caches`, which clears the OS page cache but
does NOT touch the BE's in-process memory. As a result, the "cold run"
(first of three tries) is served from the BE's warm in-memory cache and
underreports cold-run latency: a clear violation of benchmark rules
(README "Caching" section: cold runs require all database caches to be
cleared, not only the OS page cache).

This is effectively cheating: every system with internal in-memory caching
that does not clear it before the first run gets an unearned advantage on
the cold-run leaderboard. Both systems' existing results are already
tagged `lukewarm-cold-run`, but they are still displayed under the cold
metric on the website.

Fix: disable the relevant in-process caches in `be.conf` before starting
the BE, so that all reads must go through the OS page cache (which
`run.sh` does clear).

  starrocks/benchmark.sh:
    disable_storage_page_cache = true
    datacache_enable = false      # covers unified Data Cache in v3.3+

  doris/benchmark.sh:
    disable_storage_page_cache = true
    segment_cache_capacity = 0
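
Applied as a pre-start step, the settings above might look like the
following sketch. The conf path and the `BE_CONF` variable are
placeholders for illustration, not the repo's actual layout:

```shell
# Sketch only: append the cache-disabling settings to the BE config
# before the daemon starts. BE_CONF is a placeholder path here.
BE_CONF="${BE_CONF:-$(mktemp)}"
{
  echo "disable_storage_page_cache = true"
  echo "datacache_enable = false"   # unified Data Cache in v3.3+
} >> "$BE_CONF"
```

The same pattern would apply to the Doris settings
(`segment_cache_capacity = 0` instead of `datacache_enable`).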

Existing results still carry the stale `lukewarm-cold-run` tag and need
to be re-collected on AWS hardware to reflect the corrected configuration.

DuckDB does not have this problem: its `run.sh` launches a fresh `duckdb`
CLI process per query, so the buffer pool is empty at the start of each
cold run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Commit 1753902 added an inline comment after the line-continuation
backslash:

    -H "timeout:1000" \ # see #740

In bash this is *not* a continuation: the backslash escapes the
space (not the newline), the `#` then starts an end-of-line
comment, and the unescaped newline terminates the curl command.
Curl runs without its URL and fails:

    curl: (3) URL using bad/illegal format or missing URL

so the data never gets loaded into StarRocks.

Move the comment to its own line above the curl invocation. No
similar pattern was found in any other benchmark.sh / run.sh.
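
The parsing behavior can be reproduced with a tiny bash demo
(`count_args` is a throwaway helper for illustration, not part of the
benchmark scripts):

```shell
# Helper that just reports how many arguments it received.
count_args() { echo $#; }

# Broken: the backslash escapes the following space (which becomes a
# literal one-space argument), '#' starts a comment, and the unescaped
# newline ends the command -- the next line runs as a separate command.
broken=$(count_args a b \ # intended continuation
count_args c d)
# two commands ran: one with 3 args, then one with 2 args

# Fixed: with the comment moved away, the backslash escapes the
# newline and the argument list really continues onto the next line.
fixed=$(count_args a b \
c d)
# one command ran, with 4 args
```

Moving the comment to its own line restores the continuation, which is
exactly what the fix does for the curl invocation.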

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ickHouse/ClickBench into fix-cold-run-cheating-starrocks-doris
@alexey-milovidov alexey-milovidov self-assigned this May 4, 2026
@alexey-milovidov alexey-milovidov merged commit e3b12cd into main May 4, 2026
@rschu1ze
Member

rschu1ze commented May 4, 2026

@HappenLee Can you please clarify what information the segment cache and the storage page cache in Doris store and how they work (lifecycle)? I found some scattered bits of information in the Doris docs (e.g. https://doris.apache.org/docs/3.x/admin-manual/trouble-shooting/memory-management/memory-analysis/doris-cache-memory-analysis), but it is not really well documented. In particular, is there a way to clear these caches before each first cold query? If yes, then let's do so instead. Note that "doris-parquet/run.sh" (but not "doris/run.sh") already runs curl -sS http://127.0.0.1:8040/api/clear_cache/all - will that do the trick?

This PR disables the caches globally, which also impacts hot runs (and that may be unfair).

@murphyatwork I have a similar question for StarRocks. However, in the case of the data cache, the docs say

Currently, Data Cache does not provide a direct interface to clear the cached data.

so disabling the data cache globally seems fair.

Can you please also explain what the block cache is doing (mentioned here)? It is not disabled by this PR. Should it be? Can it be cleared between queries otherwise?

@HappenLee
Contributor

@rschu1ze @alexey-milovidov Hello, here are my two questions regarding this issue:

First, we consider the page cache mechanism to be reasonable. Its logic is similar to that of DuckDB's buffer pool: previously accessed disk data is pinned and cached in memory across queries. In real-world production environments, users actually use it this way. So why is it considered unreasonable? If that is the case, could I equally argue that DuckDB's results are unreasonable as well?

Second, if this is indeed unreasonable, shouldn't we state the rules clearly and check each database's results against them? For closed-source databases, how can we ensure fairness and verifiability under such rules?

@rschu1ze rschu1ze mentioned this pull request May 9, 2026
rschu1ze added a commit that referenced this pull request May 9, 2026
@rschu1ze
Member

rschu1ze commented May 9, 2026

Nevermind, I reverted this PR, sorry for the confusion.

@alexey-milovidov
Member Author

alexey-milovidov commented May 9, 2026

@HappenLee, @rschu1ze, cold runs should execute with no caches. Otherwise, the results are non-representative.

alexey-milovidov added a commit that referenced this pull request May 9, 2026
lukasvogel pushed a commit to lukasvogel/ClickBench that referenced this pull request May 11, 2026
lukasvogel pushed a commit to lukasvogel/ClickBench that referenced this pull request May 11, 2026