Add CGroupMemoryUsedWithoutPageCache async metric and clarify CGroupMemoryUsed description #101513

Merged
alexey-milovidov merged 16 commits into ClickHouse:master from primeroz:cgroupmemorywithoutuserspacepagecache
Apr 10, 2026

@primeroz (Member) commented Apr 1, 2026

CGroupMemoryUsed is the preferred metric for memory accounting in cgroup environments (as noted in support escalation #7289), where autoscaling and memory overload warnings are moving away from MemoryResident.

However, when the ClickHouse userspace page cache is enabled, CGroupMemoryUsed includes that cache's memory in its value — just like MemoryResident does. PR #81233 addressed this for the RSS-based path by introducing MemoryResidentWithoutPageCache. This PR adds the equivalent for the cgroup path.

Additionally, the existing CGroupMemoryUsed description said "(excluding page cache)" without specifying which page cache — it actually excludes only the kernel OS page cache, not the ClickHouse userspace page cache. This was confusing.

Changes

  • Simplified the CGroupMemoryUsed description to explicitly state it excludes the kernel OS page cache.
  • Added CGroupMemoryUsedWithoutPageCache async metric:
    • Formula: max(0, CGroupMemoryUsed - page_cache_bytes)
    • When userspace page cache is disabled, equals CGroupMemoryUsed
    • Mirrors the pattern established by MemoryResidentWithoutPageCache in #81233
  • Added a stateless test verifying CGroupMemoryUsedWithoutPageCache presence and invariant (<= CGroupMemoryUsed).
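
The formula in the list above can be sketched in C++ as a saturating subtraction. This is a minimal illustration and not the code from `AsynchronousMetrics.cpp`; the function name `cgroupUsedWithoutPageCache` and its parameter names are hypothetical. Since the byte counts are unsigned, a plain subtraction would wrap around whenever the page cache size exceeds the cgroup reading, so the comparison is done before subtracting.

```cpp
#include <cstdint>

// Hypothetical helper illustrating max(0, CGroupMemoryUsed - page_cache_bytes).
// Both inputs are unsigned, so the clamp must be applied before subtracting
// to avoid wrapping around to a huge value.
inline std::uint64_t cgroupUsedWithoutPageCache(std::uint64_t cgroup_used, std::uint64_t page_cache_bytes)
{
    return cgroup_used > page_cache_bytes ? cgroup_used - page_cache_bytes : 0;
}
```

With the userspace page cache disabled, `page_cache_bytes` is 0 and the result equals `cgroup_used`, which matches the second bullet of the formula description.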

Related: #81233 (added MemoryResidentWithoutPageCache)
Related: #100901 (improves CGroupMemoryUsed calculation by subtracting slab_reclaimable)
Related: https://github.com/ClickHouse/support-escalation/issues/7289

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Added CGroupMemoryUsedWithoutPageCache async metric that reports cgroup memory usage excluding both the kernel OS page cache and the ClickHouse userspace page cache, mirroring MemoryResidentWithoutPageCache. Also clarified the CGroupMemoryUsed metric description.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Co-authored-by: Claude <noreply@anthropic.com>

…emoryUsed description

Add a new `CGroupMemoryUsedWithoutPageCache` async metric that subtracts the
ClickHouse userspace page cache from `CGroupMemoryUsed`, mirroring what
`MemoryResidentWithoutPageCache` does for RSS (added in ClickHouse#81233).

Also clarify the `CGroupMemoryUsed` description: it previously said
"excluding page cache" without specifying which page cache. It now explicitly
states that the kernel page cache (OS-level file cache) is excluded because
`memory.stat` does not account for the `file` field, while the ClickHouse
userspace page cache is NOT excluded - that is what the new metric is for.

Closes ClickHouse/support-escalation#7289

Co-authored-by: Claude <noreply@anthropic.com>

@clickhouse-gh Bot (Contributor) commented Apr 1, 2026

Workflow [PR], commit [0de40db]

AI Review

Summary

This PR adds CGroupMemoryUsedWithoutPageCache, updates CGroupMemoryUsed description to be cgroup-version-aware, and adds a stateless query test validating presence symmetry and the invariant CGroupMemoryUsedWithoutPageCache <= CGroupMemoryUsed. I did not find correctness, safety, performance, or compatibility regressions in the current PR head.

Missing context
  • ⚠️ CI checks were still in progress while reviewing; no full CI log set was available yet.
ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time
No large/binary files
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-improvement Pull request with some product improvements label Apr 1, 2026
Comment thread src/Common/AsynchronousMetrics.cpp Outdated

UInt64 cgroup_page_cache_bytes = 0;
if (context && context->getPageCache())
    cgroup_page_cache_bytes = context->getPageCache()->sizeInBytes();

Member

We already have two *WithoutPageCache metrics. Can we expose the size of the user-space page cache in asynchronous metrics instead?

@primeroz (Member Author) commented Apr 1, 2026

we do expose the size of the user-space page cache already, but it is not usable in clickhouse-scraper

because clickhouse-scraper can only fetch metrics from a table and aggregate them over 1 minute. so we end up with

max_over_1m(CGroupMemoryUsed) - max_over_1m(page_cache), which is not the same as max_over_1m(CGroupMemoryUsed - page_cache)

We already tried this before and it failed; that's why we ended up creating MemoryResidentWithoutPageCache in #81233

we already had this conversation in https://github.com/ClickHouse/data-plane-application/pull/19933, which led to exposing a raw metric that already takes the WithoutPageCache subtraction into account
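
The aggregation argument in the comment above can be illustrated with a small C++ sketch. It assumes a hypothetical `Sample` struct holding the two readings taken at the same instant, and models the scraper's 1-minute window as a vector of such samples; none of these names come from clickhouse-scraper or ClickHouse itself.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical pair of readings taken at the same instant inside one window.
struct Sample
{
    std::int64_t cgroup_used;
    std::int64_t page_cache;
};

// What can be computed from two separately aggregated metrics:
// max(cgroup_used) - max(page_cache).
inline std::int64_t differenceOfMaxima(const std::vector<Sample> & window)
{
    std::int64_t max_used = 0;
    std::int64_t max_cache = 0;
    for (const auto & s : window)
    {
        max_used = std::max(max_used, s.cgroup_used);
        max_cache = std::max(max_cache, s.page_cache);
    }
    return max_used - max_cache;
}

// What a pre-computed "without page cache" metric yields after aggregation:
// max(cgroup_used - page_cache).
inline std::int64_t maximumOfDifferences(const std::vector<Sample> & window)
{
    std::int64_t result = 0;
    for (const auto & s : window)
        result = std::max(result, s.cgroup_used - s.page_cache);
    return result;
}
```

For a window with the samples (100, 10) and (60, 50), the first function returns 50 while the second returns 90: the two maxima come from different instants, so subtracting them understates the real peak.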

Member

Sorry, maybe I wasn't clear, let me clarify.

Right now we expose PageCacheBytes only in system.metrics. I'm suggesting to expose it also in system.asynchronous_metrics; that way you will get it in one place and can do whatever arithmetic you need.

Member Author

ah, yeah, i misunderstood. 👍

indeed, right now we graph the page cache usage from metrics and the max page cache from async metrics

> I'm suggesting to expose it also in system.asynchronous_metrics; that way you will get it in one place and can do whatever arithmetic you need.

if you think that is best, sure 💯

do you want to do it, or should i throw my claude at it?

Side note: i just noticed we also use the page cache in a warning about memory usage

@azat (Member) commented Apr 1, 2026

Feel free to do it 👍

Member Author

can you check the current version? i have a feeling you meant something slightly different, because this way i don't really need to expose it as an async metric as well as a standard metric ...

@azat azat self-assigned this Apr 1, 2026
primeroz and others added 2 commits April 1, 2026 18:38
Add `MemoryUserSpacePageCache` async metric that exposes the ClickHouse
userspace page cache size in `system.asynchronous_metrics`, alongside the
existing `MemoryResident`, `MemoryResidentWithoutPageCache`, `CGroupMemoryUsed`
and `CGroupMemoryUsedWithoutPageCache` metrics.

Previously this value was only available in `system.metrics` as `PageCacheBytes`.
Having it in `system.asynchronous_metrics` means operators can do arbitrary
arithmetic directly in one place, e.g.:
  CGroupMemoryUsed - MemoryUserSpacePageCache
  MemoryResident - MemoryUserSpacePageCache

The value is computed from the already-fetched `context->getPageCache()->sizeInBytes()`
local variable, so no extra call is made.

Co-authored-by: Claude <noreply@anthropic.com>
…ved metrics

`MemoryUserSpacePageCache` is now populated first (single call to
`context->getPageCache()->sizeInBytes()`), and both
`MemoryResidentWithoutPageCache` and `CGroupMemoryUsedWithoutPageCache`
read back that value instead of calling `getPageCache()->sizeInBytes()`
independently.

Co-authored-by: Claude <noreply@anthropic.com>
Comment thread src/Common/AsynchronousMetrics.cpp Outdated
Comment thread src/Common/AsynchronousMetrics.cpp
…s test

- Reword `CGroupMemoryUsed` description to attribute the page cache
  exclusion to ClickHouse's own field selection from `memory.stat`
  (anonymous memory, socket buffers, non-reclaimable kernel memory),
  rather than implying it is a kernel accounting guarantee.

- Add stateless test 04075 that:
  - Verifies `MemoryUserSpacePageCache` is always present in
    `system.asynchronous_metrics`
  - When cgroup metrics are available, asserts the invariant
    `CGroupMemoryUsedWithoutPageCache <= CGroupMemoryUsed`

Co-authored-by: Claude <noreply@anthropic.com>
Comment thread tests/queries/0_stateless/04075_memory_userspace_pagecache_metrics.sql Outdated
Replace the WHERE-based filter that silently produced no output when
CGroupMemoryUsedWithoutPageCache was missing with explicit if()-based
assertions that always produce deterministic output. The existence
check and invariant check now properly fail when cgroup metrics are
available but the expected metric is absent, while gracefully passing
in environments without cgroup support.

Co-authored-by: Claude <noreply@anthropic.com>
Comment thread src/Common/AsynchronousMetrics.cpp Outdated
…description

Remove the intermediate MemoryUserSpacePageCache async metric — it is
not needed since page_cache_bytes is already in scope.
CGroupMemoryUsedWithoutPageCache now reads page_cache_bytes directly.

Simplify the CGroupMemoryUsed description to say it excludes the kernel
OS page cache, without claiming specifics about cgroup v1/v2 memory.stat
field accounting.

Update the stateless test accordingly.

Co-authored-by: Claude <noreply@anthropic.com>
page_cache_bytes is local to an earlier #if block and not visible in the
cgroup metrics section. Read the value from context->getPageCache()
directly instead.

Co-authored-by: Claude <noreply@anthropic.com>

@alexey-milovidov (Member)

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

alexey-milovidov and others added 2 commits April 6, 2026 17:50
The if() false-branch used literal `true` (Bool) while the true-branch
returned UInt8 from comparisons. ClickHouse promotes both to Bool,
printing `true` instead of `1`, causing the reference file mismatch.

Co-authored-by: Claude <noreply@anthropic.com>
Comment on lines +14 to +15
(SELECT value FROM system.asynchronous_metrics WHERE metric = 'CGroupMemoryUsedWithoutPageCache')
<= (SELECT value FROM system.asynchronous_metrics WHERE metric = 'CGroupMemoryUsed'),

Member

This will be flaky: you cannot read the metric two times, since it can be updated between the reads.

Member

WITH
    (SELECT groupArray((metric, value)) FROM system.asynchronous_metrics
     WHERE metric IN ('CGroupMemoryUsed', 'CGroupMemoryUsedWithoutPageCache')) AS metrics,
    arrayFirst(x -> x.1 = 'CGroupMemoryUsed', metrics) AS used,
    arrayFirst(x -> x.1 = 'CGroupMemoryUsedWithoutPageCache', metrics) AS without_cache
SELECT
    if(used.2 > 0, without_cache.2 > 0, 1),
    if(used.2 > 0, without_cache.2 <= used.2, 1);

Member Author

thank you! updating now with your fix

Member Author

adf682e

thanks again

primeroz and others added 2 commits April 8, 2026 10:48
…kiness

Read `CGroupMemoryUsed` and `CGroupMemoryUsedWithoutPageCache` in one
`groupArray` query instead of multiple subqueries against
`system.asynchronous_metrics`. The metrics can be updated between
separate reads, making the comparison racy.

ClickHouse#101513

Co-authored-by: Claude <noreply@anthropic.com>
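
The race described in the commit message above can be modelled with a short C++ sketch. The `CGroupSnapshot` struct and both function names are hypothetical and exist only for illustration; the point is that the invariant is guaranteed within one consistent reading of the metrics, but not across two readings that straddle an update.

```cpp
#include <cstdint>

// One consistent reading of both metrics, taken from the same update.
struct CGroupSnapshot
{
    std::uint64_t used;           // CGroupMemoryUsed
    std::uint64_t without_cache;  // CGroupMemoryUsedWithoutPageCache
};

// Safe: both values come from the same snapshot (the single groupArray query).
inline bool invariantWithinSnapshot(const CGroupSnapshot & s)
{
    return s.without_cache <= s.used;
}

// Racy: models two separate subqueries, where the first reads without_cache
// and the second reads used after the metrics were refreshed in between.
inline bool invariantAcrossReads(const CGroupSnapshot & first_read, const CGroupSnapshot & second_read)
{
    return first_read.without_cache <= second_read.used;
}
```

If memory usage drops between the reads, for example from `{used = 100, without_cache = 80}` to `{used = 70, without_cache = 60}`, each snapshot satisfies the invariant on its own, yet the mixed comparison `80 <= 70` fails.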
…then test

Make `CGroupMemoryUsed` description explicitly document cgroup v1 vs v2
differences (RSS vs anon+sock+kernel). Simplify `CGroupMemoryUsedWithoutPageCache`
description to accurately state it subtracts the userspace page cache from
`CGroupMemoryUsed`.

Strengthen `04075_memory_userspace_pagecache_metrics` test with explicit
metric-existence assertions using `countIf` instead of relying on `arrayFirst`
default values, which could mask registration/rename regressions.

Co-authored-by: Claude <noreply@anthropic.com>
-- Both metrics must be either both present or both absent.
has_used = has_without_cache,
-- When present, without_cache must be positive and <= used.
if(has_used = 1, without_cache.2 > 0, 1),

Contributor

The assertion if(has_used = 1, without_cache.2 > 0, 1) is too strict for this metric.

CGroupMemoryUsedWithoutPageCache is computed as max(0, CGroupMemoryUsed - userspace_page_cache_bytes), so 0 is a valid value (for example, when userspace page cache fully covers cgroup usage, or when usage is zero). This can make the test fail even when the metric is present and correct.

Please remove the strict > 0 check and keep the presence + without_cache <= used invariant checks.

- Replace countIf (aggregate, no lambda support) with arrayCount.
- Remove the without_cache > 0 check: the metric is max(0, CGroupMemoryUsed - page_cache_bytes) so 0 is valid.
- Keep only presence and without_cache <= used invariants.

Co-authored-by: Claude <noreply@anthropic.com>
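
The two assertions that remain after this commit can be sketched in C++ (C++17, for `std::optional`). This is an illustrative model of the test logic under the assumption that an absent metric is represented as an empty optional; `CheckResult` and `checkCGroupMetrics` are hypothetical names, not part of the ClickHouse test suite.

```cpp
#include <cstdint>
#include <optional>

// Result of the two checks kept in the test.
struct CheckResult
{
    bool presence_matches;  // both metrics present, or both absent
    bool invariant_holds;   // without_cache <= used (vacuously true if either is absent)
};

// An empty optional models a metric that is missing from
// system.asynchronous_metrics, e.g. in an environment without cgroups.
inline CheckResult checkCGroupMetrics(std::optional<std::uint64_t> used, std::optional<std::uint64_t> without_cache)
{
    CheckResult result;
    result.presence_matches = used.has_value() == without_cache.has_value();
    result.invariant_holds = !(used && without_cache) || *without_cache <= *used;
    return result;
}
```

Note that `checkCGroupMetrics(100, 0)` passes both checks: a value of 0 is legitimate because the metric is clamped at zero, which is why the strict `> 0` assertion was dropped.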

@alexey-milovidov (Member)

The 02346_text_index_bug101913 test failure is fixed by #102108 (already merged). Please rebase or merge master to pick up the fix.

@alexey-milovidov (Member)

The MSan stress test failure (MemorySanitizer: use-of-uninitialized-value, STID 4179-5154 or 4148-3044) is a known pre-existing issue unrelated to this PR. Fix: #102158

@primeroz (Member Author) commented Apr 9, 2026

weird, now i am getting permission denied errors on the filesystem

[2026-04-09 08:49:17] WARNING: stderr: warning: could not open directory 'disk3_02961/': Permission denied
[2026-04-09 08:49:16] 
[2026-04-09 08:49:16] ##### Final Verdict
[2026-04-09 08:49:16] - Status: **✅ Approve**
[2026-04-09 08:49:16] 

[2026-04-09 08:49:17] WARNING: stderr: warning: could not open directory 'disk3_02961/': Permission denied

according to claude

All three logs show the same CI run (PR #101513, sha 90026a87455) and the AI review approves the PR. The failures found are:

     1. test_s3_cluster/test.py::test_wrong_cluster -- test infra issue (FileNotFoundError for missing data/generated/ directory), not related to our change
     2. Server died (arm_msan stress test) -- linked to existing issue #102044, not our code
     3. MemorySanitizer: use-of-uninitialized-value (arm_msan stress test) -- same issue #102044
     4. Upgrade check error message -- Azure StorageException: 503 rate limit, transient infra issue

     None of these failures are related to our 04075_memory_userspace_pagecache_metrics test or the CGroupMemoryUsedWithoutPageCache metric. Our test now passes. All failures are pre-existing flaky tests or transient infrastructure problems.

     Nothing to fix on our side.

but i will see if i can somehow get the whole pr green

@primeroz (Member Author) commented Apr 9, 2026

ran the ci script and created

#102238
#102241

@alexey-milovidov (Member)

The `Can't adjust last granule` error in CI is a known issue. The fix is in #101641

@clickhouse-gh Bot (Contributor) commented Apr 10, 2026

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|---|---|---|---|
| Lines | 84.00% | 84.00% | +0.00% |
| Functions | 90.90% | 90.90% | +0.00% |
| Branches | 76.50% | 76.50% | +0.00% |

Changed lines: 100.00% (20/20)

@alexey-milovidov alexey-milovidov merged commit bdc558e into ClickHouse:master Apr 10, 2026
162 of 163 checks passed
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 10, 2026