Merging #86935: Fix logical error when enable enable_filesystem_query_cache_limit by alexey-milovidov · Pull Request #101428 · ClickHouse/ClickHouse

alexey-milovidov · 2026-03-31T18:56:30Z

Continuation of #86935.

After rebasing onto current master, only one of the original three changes from #86935 is still applicable:

IFileCachePriority::check now treats max_size = 0 and max_elements = 0 as "no limit", consistent with canFit (which already treats 0 as unlimited). This is a defensive guard for the per-query priority, which is constructed with max_elements = 0 (matching the 0 == unlimited convention used elsewhere).
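As a rough illustration of the condition described above, the following self-contained C++ sketch compares the old and new checks. `PriorityLimits`, `limitsViolated`, and `limitsViolatedOld` are hypothetical names introduced here for illustration only; the real logic lives in `IFileCachePriority::check` and takes a lock, which this sketch omits.

```cpp
#include <cstddef>

// Hypothetical stand-in for the limits held by a cache priority queue.
struct PriorityLimits
{
    std::size_t max_size;      // 0 is read as "no size limit"
    std::size_t max_elements;  // 0 is read as "no element-count limit"
};

// New-style condition: a limit of 0 never counts as violated.
inline bool limitsViolated(const PriorityLimits & limits, std::size_t size, std::size_t elements)
{
    return (limits.max_size != 0 && size > limits.max_size)
        || (limits.max_elements != 0 && elements > limits.max_elements);
}

// Old-style condition, for comparison: with max_elements == 0,
// any non-empty queue is reported as a violation.
inline bool limitsViolatedOld(const PriorityLimits & limits, std::size_t size, std::size_t elements)
{
    return size > limits.max_size || elements > limits.max_elements;
}
```

Under these assumptions, a priority with `max_size = 1024` and `max_elements = 0` holding 3 entries of 512 bytes in total is flagged by the old condition but not by the new one, while genuinely exceeding `max_size` is still flagged.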

The original PR's other two changes are no longer needed:

The FileCacheQueryLimit::QueryContext::remove -> tryRemove rename was needed because EvictionCandidates::finalize called query_context->remove(...) for every evicted entry. That call site was removed from master in commit 62ffdf7, so tryRemove would have no callers.
The integration test test_caches_with_query_limit could not reliably reproduce the bug from #86855 ("Use s3 with filesystem cache got Logical error: Cache limits violated"): with the current tryReserve shape, query_priority_iterator->incrementSize is unreachable on the first reservation (the iterator is only assigned inside the !main_priority_iterator branch, and add does not return it), so the per-query check is no longer hit. The "Bugfix validation" CI job confirmed this: the test passed against the master binary.

Original issue: #86855

CI report (previous run with the wider scope): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101428&sha=95a5413b6250fe9b49e79ae60f50d1babb31909b&name_0=PR

Changelog category (leave one):

Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Treat max_size = 0 and max_elements = 0 in IFileCachePriority::check as "no limit", consistent with how canFit interprets these values. This silences spurious Cache limits violated logical errors on priorities created with zero limits (such as the per-query filesystem cache priority used when enable_filesystem_query_cache_limit is on).

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

…limit # Conflicts: # src/Interpreters/Cache/EvictionCandidates.cpp

clickhouse-gh · 2026-03-31T18:57:13Z

Workflow [PR], commit [3a6ecaf]

Summary: ✅

AI Review

Summary

This PR updates IFileCachePriority::check so max_size = 0 and max_elements = 0 are treated as unlimited, matching existing canFit semantics and avoiding false LOGICAL_ERROR exceptions for zero-limit priorities. The change is small, coherent with current cache-limit conventions, and I did not find correctness, safety, or rollout issues in the current diff.

ClickHouse Rules

Item	Status	Notes
Deletion logging	➖
Serialization versioning	➖
Core-area scrutiny	✅
No test removal	✅
Experimental gate	➖
No magic constants	✅
Backward compatibility	✅
`SettingsChangesHistory.cpp`	➖
PR metadata quality	✅
Safe rollout	✅
Compilation time	✅
No large/binary files	✅

Final Verdict

Status: ✅ Approve

clickhouse-gh · 2026-03-31T19:01:04Z

+            );
+        """
+    )
+    node.query("insert into fs_cache_query_limit select number,randomString(4096) from system.numbers limit 1000000")


⚠️ This test writes 1_000_000 rows with randomString(4096), i.e. roughly 4 GiB of random payload before compression, which is very heavy for an integration test and can significantly increase CI runtime and flakiness.

Could we reduce the data volume while keeping the same coverage goal (triggering eviction with enable_filesystem_query_cache_limit)? For example, lower the row count and/or set a much smaller per-query write limit so we still exercise the same path with less data.

alexey-milovidov · 2026-04-07T00:28:48Z

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

clickhouse-gh · 2026-04-07T01:29:14Z

@@ -49,7 +49,7 @@ std::string IFileCachePriority::Entry::toString(const std::string & prefix) cons

 void IFileCachePriority::check(const CacheStateGuard::Lock & lock) const
 {
-    if (getSize(lock) > max_size || getElementsCount(lock) > max_elements)
+    if ((max_size != 0 && getSize(lock) > max_size) || (max_elements != 0 && getElementsCount(lock) > max_elements))


Changelog category in the PR template is currently CI Fix or Improvement, but this PR changes runtime cache behavior (IFileCachePriority::check and query-limit eviction handling), not CI/test infrastructure.

Please switch the category to a runtime-facing one (likely Bug Fix or Improvement) so release notes classification stays accurate.

…limit # Conflicts: # tests/integration/test_filesystem_cache/test.py

The test was writing 1M rows * 4096 bytes (~4 GiB) which is excessive for a 1 MiB cache. Reduce to 1000 rows (~4 MiB) which still exceeds the cache size and triggers eviction, while being much lighter on CI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
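The data volumes quoted in this commit message can be checked with a little compile-time arithmetic. This is only a sketch: `kBytesPerRow`, `kCacheBytes`, and `payloadBytes` are hypothetical names, and the calculation ignores compression and per-row storage overhead.

```cpp
#include <cstdint>

constexpr std::uint64_t kBytesPerRow = 4096;        // randomString(4096)
constexpr std::uint64_t kCacheBytes = 1024 * 1024;  // 1 MiB cache, per the commit message

// Raw payload written by an INSERT of `rows` rows (no compression, no overhead).
constexpr std::uint64_t payloadBytes(std::uint64_t rows)
{
    return rows * kBytesPerRow;
}

// 1,000,000 rows: 4,096,000,000 bytes, about 3.8 GiB.
static_assert(payloadBytes(1000000) == 4096000000ULL, "1M rows payload");
// 1,000 rows: 4,096,000 bytes, about 3.9 MiB.
static_assert(payloadBytes(1000) == 4096000ULL, "1000 rows payload");
// The reduced volume still exceeds the 1 MiB cache.
static_assert(payloadBytes(1000) > kCacheBytes, "1000 rows exceed the cache");
```

So the reduced test still writes a little under four times the cache capacity, which is why eviction is still expected to occur.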

alexey-milovidov · 2026-04-10T02:28:07Z

The flaky check failure is fixed in #102148, let's update the branch.

clickhouse-gh · 2026-04-10T04:03:24Z

@@ -51,7 +51,7 @@ std::string IFileCachePriority::Entry::toString(const std::string & prefix) cons

 void IFileCachePriority::check(const CacheStateGuard::Lock & lock) const
 {
-    if (getSize(lock) > max_size || getElementsCount(lock) > max_elements)
+    if ((max_size != 0 && getSize(lock) > max_size) || (max_elements != 0 && getElementsCount(lock) > max_elements))


Changelog entry is currently Fix #86855, which is too vague for CHANGELOG.md users.

Please replace it with a user-readable sentence describing the behavior change and scope, for example:
Fix a logical error in filesystem cache query-limit mode: max_size=0 and max_elements=0 are now treated as unlimited, and query-limit eviction no longer throws when an entry was already removed by the main cache priority path.

…limit

clickhouse-gh · 2026-04-24T10:35:13Z

    assert int(elapsed_time) > 1
    assert int(elapsed_time) < 5

+def test_caches_with_query_limit(cluster):


⚠️ This test exercises query-limit eviction, but it does not cover the max_size = 0 / max_elements = 0 behavior changed in IFileCachePriority::check.

Please add a case (in this test or a dedicated one) with enable_filesystem_query_cache_limit = 'true' and zero limits, and assert the query finishes without LOGICAL_ERROR. Otherwise the first fix can regress without test coverage.

…limit

The original test in this PR was reduced from `1_000_000` to `1000` rows to keep CI fast, but the smaller volume no longer reliably triggered the bug (`Bugfix validation (integration tests)` showed the test passing on the unfixed master binary, which means it was not exercising the fixed code paths).

The bug only fires on the eviction/re-add path in `FileCache::tryReserve` (`src/Interpreters/FileCache/FileCache.cpp` around lines 1219-1246), where `query_priority_iterator` becomes non-null because an entry is still in the per-query priority but has been evicted from the main priority. Reaching that path within a single query requires significantly more data than the cache size (1 MiB), and ideally re-reads of the same segments.

Increase to `100000` rows of `randomString(4096)` (~400 MiB of incompressible data, ~400x the cache capacity) and run the `SELECT` three times within the test to amplify eviction/re-add pressure on the per-query priority.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101428&sha=018ad05eaddbfb4e8961766a5cc955805f8be278&name_0=PR&name_1=Bugfix%20validation%20%28integration%20tests%29

clickhouse-gh · 2026-04-25T04:36:13Z

-        throw Exception(ErrorCodes::LOGICAL_ERROR, "There is no {}:{} in query context", key, offset);
+        return;

    record->second->remove(lock);


tryRemove can still hit an exception when the key exists in records but its stored iterator was already invalidated/removed by a prior eviction path. In that case record->second->remove(lock) calls assertValid and throws LOGICAL_ERROR (Attempt to use invalid iterator).

Please guard the removal with isValid(lock) and always erase from records, e.g.:

auto record = records.find({key, offset});
if (record == records.end())
    return;
if (record->second->isValid(lock))
    record->second->remove(lock);
records.erase(record);
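The suggestion above can be tried in isolation with a mock. The sketch below is an assumption-laden stand-in, not the ClickHouse code: `MockIterator`, `RecordKey`, `Records`, and the free function `tryRemove` are hypothetical, the lock argument is dropped, and the mock only counts `remove` calls instead of throwing on an invalid iterator.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a priority-queue iterator that another
// eviction path may already have invalidated.
struct MockIterator
{
    bool valid = true;
    int remove_calls = 0;

    bool isValid() const { return valid; }

    // The real iterator would raise LOGICAL_ERROR when invalid;
    // the mock only records that it was called.
    void remove()
    {
        ++remove_calls;
        valid = false;
    }
};

using RecordKey = std::pair<std::string, std::size_t>;  // (key, offset)
using Records = std::map<RecordKey, std::shared_ptr<MockIterator>>;

// Erases the record for (key, offset) if present, calling remove()
// only when the stored iterator is still valid.
// Returns true when a record was erased.
inline bool tryRemove(Records & records, const std::string & key, std::size_t offset)
{
    auto record = records.find({key, offset});
    if (record == records.end())
        return false;

    if (record->second->isValid())
        record->second->remove();

    records.erase(record);
    return true;
}
```

With a stale (already invalid) iterator stored in the map, this version erases the map entry without touching the iterator, which is the behavior the review comment asks for.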

…limit

…limit # Conflicts: # programs/keeper-bench/Runner.cpp # src/Core/ServerSettings.cpp

…move` `FileCacheQueryLimit::QueryContext::tryRemove` previously returned only when the entry was missing from `records`. If the entry was present but its stored iterator had already been invalidated by a prior eviction path, `record->second->remove(lock)` would call `assertValid` and throw `LOGICAL_ERROR ("Attempt to use invalid iterator")`. Guard the call with `isValid(lock)` and always erase the record entry from the map.

The previous version of the test did not set `filesystem_cache_max_download_size`, so `FileCache::getQueryContextHolder` returned a null holder and the per-query `QueryContext` was never created. That meant the test passed both before and after the fix and the bugfix-validation CI job correctly reported "Failed to reproduce the bug".

Set `filesystem_cache_max_download_size` on each `SELECT` so that the per-query `QueryContext` is constructed (with `max_elements = 0`) and the per-query priority's `LRUFileCachePriority::LRUIterator::incrementSize` calls `IFileCachePriority::check`, which is the path that used to throw `LOGICAL_ERROR`.

Setting `skip_download_if_exceeds_query_cache = 0` also exercises the recache-on-exceed path so that entries can be re-added to the per-query priority after eviction.

`enable_filesystem_cache_on_write_operations` is set to `0` for the INSERT to keep cache state deterministic and to avoid filling the cache on the write path before reads even start.

CI reports for context:
- https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101428&sha=d119180d9433be7fae7d6d164b9ca8985968dbf8&name_0=PR&name_1=Bugfix%20validation%20%28integration%20tests%29

clickhouse-gh · 2026-04-28T04:28:42Z


-    record->second->remove(lock);
-    records.erase({key, offset});
+    if (record->second->isValid(lock))


QueryContext::tryRemove is introduced here, but I can't find any call site in the tree (rg "tryRemove\(" src/Interpreters/FileCache only matches this definition/declaration).

That means the second fix in the PR description is still effectively inactive: records keeps stale iterators for entries evicted through other paths, and a later lookup can still return an invalid iterator instead of recreating a fresh record.

Can we wire tryRemove into the eviction finalization path (for candidates removed by main/query priority), so records is kept in sync with actual queue state?

…limit

After rebasing onto current master, two of the original three changes in #86935 are no longer applicable:

1. The `FileCacheQueryLimit::QueryContext::remove` -> `tryRemove` rename was needed because `EvictionCandidates::finalize` called `query_context->remove(...)` for every evicted entry. That call site was removed from master in 62ffdf7, so `tryRemove` had no callers and was effectively dead code. Reverting the `QueryLimit.cpp` / `QueryLimit.h` changes to match master.
2. The integration test `test_caches_with_query_limit` was meant to reproduce the bug from #86855, but with the current `tryReserve` shape `query_priority_iterator->incrementSize` is unreachable on the first reservation (the iterator is only assigned inside the `!main_priority_iterator` branch and `add` does not return it), so the per-query `check` is no longer hit. The "Bugfix validation" job confirms this - the test passes against the master binary. Removing the test since it does not validate the fix.

What remains is the `IFileCachePriority::check` change: treat `max_size = 0` and `max_elements = 0` as "no limit" (consistent with `canFit`, which already does so). This is a defensive guard for the per-query priority, which is constructed with `max_elements = 0`, matching the `0 == unlimited` convention used elsewhere.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101428&sha=95a5413b6250fe9b49e79ae60f50d1babb31909b&name_0=PR

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A trailing blank line between `test_finished_download_time` and `test_concurrent_eviction` was inadvertently dropped. Restore it so the net diff against master is just the one-line change in `IFileCachePriority::check`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…limit

clickhouse-gh · 2026-04-30T22:56:03Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.00%	84.10%	+0.10%
Functions	91.10%	91.10%	+0.00%
Branches	76.50%	76.60%	+0.10%

Changed lines: 50.00% (1/2) · Uncovered code

Full report · Diff report

linkwk7 and others added 4 commits September 14, 2025 09:51

Add integration test for fs cache with query limit

bff780e

Fix logical error when enable enable_filesystem_query_cache_limit

d028244

Merge branch 'fixup-after-merge' into fix_fs_cache_query_limit

89f656c

Merge remote-tracking branch 'origin/master' into fix_fs_cache_query_…

bcfff23

…limit # Conflicts: # src/Interpreters/Cache/EvictionCandidates.cpp

alexey-milovidov mentioned this pull request Mar 31, 2026

Fix logical error when enable enable_filesystem_query_cache_limit #86935

Closed

clickhouse-gh Bot added the pr-ci label Mar 31, 2026

clickhouse-gh Bot reviewed Mar 31, 2026


kssenii self-assigned this Apr 1, 2026

Merge branch 'master' into fix_fs_cache_query_limit

fd3635e

clickhouse-gh Bot reviewed Apr 7, 2026


kssenii approved these changes Apr 7, 2026


alexey-milovidov and others added 2 commits April 9, 2026 20:12

Merge remote-tracking branch 'origin/master' into fix_fs_cache_query_…

4dd4c7e

…limit # Conflicts: # tests/integration/test_filesystem_cache/test.py

Merge branch 'master' into fix_fs_cache_query_limit

1bfba40

clickhouse-gh Bot reviewed Apr 10, 2026


alexey-milovidov added 6 commits April 12, 2026 13:32

Merge branch 'master' into fix_fs_cache_query_limit

1b63ef3

Merge remote-tracking branch 'origin/master' into fix_fs_cache_query_…

29b8ab6

…limit

Merge remote-tracking branch 'origin/master' into fix_fs_cache_query_…

197870a

…limit

Merge remote-tracking branch 'origin/master' into fix_fs_cache_query_…

9cbe621

…limit

Merge remote-tracking branch 'origin/master' into fix_fs_cache_query_…

4ca2575

…limit

Merge remote-tracking branch 'origin/master' into fix_fs_cache_query_…

018ad05

…limit

clickhouse-gh Bot added the pr-bugfix label (Pull request with bugfix, not backported by default) and removed the pr-ci label Apr 24, 2026

clickhouse-gh Bot reviewed Apr 24, 2026


alexey-milovidov added 2 commits April 25, 2026 04:20

Merge remote-tracking branch 'origin/master' into fix_fs_cache_query_…

ea49278

…limit

clickhouse-gh Bot reviewed Apr 25, 2026


alexey-milovidov added 4 commits April 26, 2026 17:49

Merge remote-tracking branch 'origin/master' into fix_fs_cache_query_…

157d3dc

…limit

Merge remote-tracking branch 'origin/master' into fix_fs_cache_query_…

ca2f180

…limit # Conflicts: # programs/keeper-bench/Runner.cpp # src/Core/ServerSettings.cpp

clickhouse-gh Bot reviewed Apr 28, 2026


alexey-milovidov and others added 4 commits April 28, 2026 21:23

Merge remote-tracking branch 'origin/master' into fix_fs_cache_query_…

f59c525

…limit

Merge remote-tracking branch 'origin/master' into fix_fs_cache_query_…

b58e40f

…limit

clickhouse-gh Bot added the pr-improvement label (Pull request with some product improvements) and removed the pr-bugfix label (Pull request with bugfix, not backported by default) Apr 29, 2026

alexey-milovidov added 2 commits April 30, 2026 05:14

Merge remote-tracking branch 'origin/master' into fix_fs_cache_query_…

1afa4c9

…limit

Merge remote-tracking branch 'origin/master' into fix_fs_cache_query_…

3a6ecaf

…limit

alexey-milovidov added this pull request to the merge queue May 1, 2026

Merged via the queue into master with commit c316f0f May 1, 2026
165 checks passed

alexey-milovidov deleted the fix_fs_cache_query_limit branch May 1, 2026 05:10

robot-ch-test-poll3 added the pr-synced-to-cloud label (The PR is synced to the cloud repo) May 1, 2026

groeneai mentioned this pull request May 5, 2026

Logical error: Not-ready Set is passed as the second argument for function 'A (STID: 0250-4409) #104130

Open

groeneai mentioned this pull request May 16, 2026

Make filesystem() honor the memory limit when loading file content #104956

Merged

1 task

alexey-milovidov merged 26 commits into master from fix_fs_cache_query_limit

4 participants
