Per-CPU bound on untracked memory to prevent OOMs by azat · Pull Request #106055 · ClickHouse/ClickHouse

azat · 2026-05-28T14:37:02Z

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Per-CPU bound on untracked memory to prevent OOMs

Previously ClickHouse works memory overcommit was "max_untracked_memory*threads", which may in theory lead to ±40GiB of overcommited memory in a worst case scenario and 10K threads

By bounding maximum untracked memory to "ncpu*max_untracked_memory" we can avoid this massive memory overcommit, and also with default max_server_memory_usage_to_ram_ratio=0.9, all this memory is already reserved, so for now per-cpu memory is not reserved explicitly.

"max_untracked_memory" per query continues to control the per-thread accumulator and additionally sizes the thread's per-CPU bucket (rounded up to a power of two), so raising it opts that thread into looser tracking without affecting other queries.

Note, on Linux it uses librseq gives only ±1% (in the best case), so this patch has been removed from the patch set

clickhouse-gh · 2026-05-28T14:37:50Z

Workflow [PR], commit [30446d9]

Summary: ⏳

job_name	test_name	status	info	comment
Stateless tests (arm_asan_ubsan, targeted)		FAIL
	Server died	FAIL	cidb

AI Review

Summary

This PR adds a Linux-only process-wide PerCPUMemory budget to bound untracked memory by CPU instead of by thread count, plus server settings, tests, and a benchmark. The design is much stronger than the earlier bucket version, but the current code still loses the new bound on an allocation-limit exception path and leaves several server-setting surfaces inconsistent, so I do not think it is mergeable yet.

Missing context / blind spots

⚠️ CI for head 30446d93 is still mostly pending: the fetched Praktika PR report only had config and style complete, with performance comparison jobs pending. fetch_perf_report.py could not run locally because no clickhouse binary is in PATH. Completed performance comparison or benchmark numbers would close this gap.

Findings

❌ Blockers

src/Common/CurrentMemoryTracker.cpp:88 The rollback path removes the thread's per-CPU contribution before calling MemoryTracker::allocImpl, but on MEMORY_LIMIT_EXCEEDED it restores only ThreadStatus::untracked_memory. A query that catches the exception can keep the previous deferred bytes outside per_cpu_memory, so the PR's bound is lost exactly on the OOM-prevention path. Restore or re-account the previous PerCPUMemoryThreadState contribution when rethrowing, or delay releasing it until the tracker update succeeds.

⚠️ Majors

programs/server/Server.cpp:2322 clickhouse-local loads ServerSettings and installs total_memory_tracker, but never applies max_per_cpu_untracked_memory or per_cpu_untracked_memory_thread_buffer. The default per-CPU accounting is active there, while --max_per_cpu_untracked_memory=0 and buffer overrides are ignored. Wire the same settings in LocalServer next to max_server_memory_usage.
programs/server/Server.cpp:2321 The new settings are applied on config reload, but system.server_settings will not report their current values because dumpToSystemServerSettingsColumns does not list them in changeable_settings. Add getters on PerCPUMemory if needed and expose the runtime values there.
src/Common/PerCPUMemory.h:137 per_cpu_untracked_memory_thread_buffer=0 is silently ignored by setThreadBuffer, keeping the previous/default buffer. Either make 0 a real no-buffer value or reject/document it explicitly.
src/Common/CurrentMemoryTracker.cpp [dismissed by author] The PR is categorized as Performance Improvement and changes the hot allocation/free path, but the current PR state does not include before/after measurements and the performance jobs are still pending. I still consider this real because existing broad perf tests are not evidence until their results are available for this head.

Tests

⚠️ Add a focused regression for the MEMORY_LIMIT_EXCEEDED rollback path: start with a non-zero deferred contribution, force a per-CPU flush that throws, catch it, and assert the previous bytes remain accounted in per_cpu_memory.
⚠️ Add coverage for the clickhouse-local setting path and for runtime visibility of max_per_cpu_untracked_memory / per_cpu_untracked_memory_thread_buffer through system.server_settings.
⚠️ Provide before/after benchmark or performance-comparison evidence for representative allocation/free patterns before relying on the Performance Improvement claim.

Final Verdict

Status: ❌ Block

Minimum required actions: fix the allocation-exception rollback, wire the settings consistently in clickhouse-local and system.server_settings, define 0 semantics for the thread buffer setting, and provide completed CI/performance evidence.

…OMs) Bounds untracked_memory to ncpu * SLICE (=4MiB) instead of nthreads * max_untracked_memory, so server memory limits becomes a true upper bound on real memory usage (almost). But leave small per-thread buffer (BUFFER_SIZE=32KiB) as a hot path With default max_server_memory_usage_to_ram_ratio=0.9, OOMs due to untracked_memory is not possible (later we can do a real reservation of this memory, but then we need to decrease max_server_memory_usage_to_ram_ratio) On Linux with librseq the slot bump is one rseq load-cbne-store; otherwise it falls back to atomic operations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Those allocations may be outside of allowed range [memory_profiler_sample_min_allocation_size, memory_profiler_sample_max_allocation_size] Since for isSizeOkForSampling() we may have size bigger then the initial allocation (due to other allocations on this CPU), but later MemorySample will be written with initial size.

…at the query needs Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…r-query Now with per-cpu untracked memory server cannot start with such low limit Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tracked_memory After per-cpu memory tracking it make flaky because now the flush (and hence the check) can happens earlier

…e what the query needs Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Ergus

This is labelled "Performance Improvement" but the primary goal is OOM prevention... a correctness/safety fix.

So I think this is an Improvement,

…ry cleanup "system free memory" frees the allocated memory deterministically, so it and the user cleanup after it have headroom and need no retry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The per-CPU bucket size is rounded up to the next power of two (std::bit_width in applyQuerySettings), so the effective threshold may be slightly higher than the configured value. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Encapsulate the per-CPU state save/reset/restore in PerCPUMemoryThreadState and use it from MemoryTrackerSwitcher, now covering alloc_cpu/free_cpu and nallocs/nfrees in addition to the pending counters (slice_log2 stays, it is a query property). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

azat · 2026-05-31T16:46:07Z

This is labelled "Performance Improvement" but the primary goal is OOM prevention... a correctness/safety fix.

It is marked as so to run performance tests, and in some sense "OOM prevention" can be "performance improvement"

Restore max_untracked_memory and memory_profiler_step defaults (4 MiB) and the per-thread wording, and drop the SettingsChangesHistory entries. The per-CPU bound is reintroduced in the following commit as a separate server-wide budget, leaving max_untracked_memory as the per-thread cap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

To keep the global MemoryTracker off the hot path, each thread accumulates small allocations/deallocations in a thread-local `untracked_memory` and only flushes to the tracker once it exceeds `max_untracked_memory`. The downside is that the tracker undercounts by up to `threads * max_untracked_memory` at any moment - with thousands of threads that is tens of GiB the server never sees, so it can blow past `max_server_memory_usage` and OOM. Bound that overcommit per-CPU instead of per-thread. Each CPU owns a signed counter of the net un-flushed bytes of the threads running on it (allocations add, frees subtract); a thread reflects its `untracked_memory` into that counter and must flush once the counter leaves [-max_per_cpu_untracked_memory, +max_per_cpu_untracked_memory]. Only one thread runs on a CPU at a time and admission is gated, so the deferred total is bounded by `ncpu * max_per_cpu_untracked_memory` regardless of the thread count, and even threads that park while holding untracked memory cannot inflate it past the budget. The counter is signed, so a thread's deferred allocations and another's deferred frees on the same CPU cancel instead of both forcing a flush. To avoid touching the shared counter on every allocation, a thread only updates it after drifting by `per_cpu_untracked_memory_thread_buffer` (default 32 KiB); that is the only slack on top of the per-CPU budget. The update is exact - a thread's contribution is added to and removed from the same CPU and moved on migration - so the counter never drifts; only the threshold check is best-effort under concurrency, which is fine because the MemoryTracker stays authoritative. `max_untracked_memory` keeps its per-thread meaning and still caps a single thread; `max_per_cpu_untracked_memory` (default 8 MiB, 0 disables the bound) is a new server setting. Supported on Linux (needs sched_getcpu); elsewhere the behaviour is unchanged (per-thread only).

…emory

azat · 2026-06-01T18:48:49Z

@Ergus I do not like current implementation, will keep it as draft for now, and ping you/undraft once it will be ready

clickhouse-gh · 2026-06-01T18:57:07Z

                new_server_settings[ServerSetting::min_allocation_size_to_throw_on_memory_limit]);

+            per_cpu_memory.setBudgetCapacity(new_server_settings[ServerSetting::max_per_cpu_untracked_memory]);
+            per_cpu_memory.setThreadBuffer(new_server_settings[ServerSetting::per_cpu_untracked_memory_thread_buffer]);


These settings are only applied in clickhouse-server, but clickhouse-local also loads ServerSettings and installs total_memory_tracker in programs/local/LocalServer.cpp. With this change the process-wide per_cpu_memory default is active in clickhouse-local, yet --max_per_cpu_untracked_memory=0 and --per_cpu_untracked_memory_thread_buffer=... are silently ignored because LocalServer never calls setBudgetCapacity or setThreadBuffer.

Please wire the same settings in LocalServer next to max_server_memory_usage and min_allocation_size_to_throw_on_memory_limit, otherwise local runs and tests cannot tune or disable the new accounting behavior.

clickhouse-gh · 2026-06-01T18:57:16Z

+        if (current_thread->untracked_memory > current_thread->untracked_memory_limit
+            || !DB::per_cpu_memory.reserve(current_thread->untracked_memory, current_thread->per_cpu_untracked_memory))
        {
+            DB::per_cpu_memory.release(current_thread->per_cpu_untracked_memory);


The rollback restores only untracked_memory, but the per-CPU contribution has already been removed by reserve/release before this allocImpl call. If the flush throws MEMORY_LIMIT_EXCEEDED, the failed allocation is not made, while the previous unflushed bytes remain in ThreadStatus::untracked_memory with per_cpu_untracked_memory reset to zero. A query that catches the exception can then keep those bytes outside the per-CPU budget until a later allocation or free happens to re-reserve them, so the new bound is lost on the OOM path.

Please restore or re-account the previous PerCPUMemoryThreadState contribution together with previous_untracked_memory when the flush throws, or delay dropping the previous contribution until the tracker update has succeeded.

clickhouse-gh · 2026-06-01T21:50:05Z

            CurrentMemoryTracker::setMinAllocationSizeBytesToThrow(
                new_server_settings[ServerSetting::min_allocation_size_to_throw_on_memory_limit]);

+            per_cpu_memory.setBudgetCapacity(new_server_settings[ServerSetting::max_per_cpu_untracked_memory]);


These settings are applied on config reload, but system.server_settings will still report the stale value because dumpToSystemServerSettingsColumns only overrides current values for entries listed in its changeable_settings map. After an operator changes max_per_cpu_untracked_memory or per_cpu_untracked_memory_thread_buffer and reloads config, the new budget is active in per_cpu_memory but the system table continues to show the old/default value and changeable_without_restart = No.

Please add these settings to the runtime-current value path, with getters on PerCPUMemory if needed, so the observable server settings match the component state after reload.

clickhouse-gh · 2026-06-01T21:50:05Z

+    /// running counters are left alone.
+    void setThreadBuffer(Int64 bytes)
+    {
+        if (bytes > 0)


per_cpu_untracked_memory_thread_buffer is documented as a byte amount, but setThreadBuffer silently ignores 0 and keeps the previous/default buffer. That makes --per_cpu_untracked_memory_thread_buffer=0 unusable for disabling per-thread slack, and on reload it can leave an old value active while the config appears to request a stricter bound.

Please either store 0 as a real no-buffer value or reject/document it explicitly instead of falling back to the previous value.

clickhouse-gh · 2026-06-02T01:19:40Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.40%	84.40%	+0.00%
Functions	92.40%	92.40%	+0.00%
Branches	77.00%	77.00%	+0.00%

Changed lines: Changed C/C++ lines covered by tests: 274/281 (97.51%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 3 line(s) · Uncovered code

Full report · Diff report

azat mentioned this pull request May 28, 2026

Per-CPU memory budget for max_untracked_memory to avoid OOMs #104125

Closed

clickhouse-gh Bot added pr-performance Pull request with some performance improvements submodule changed At least one submodule changed in this PR. labels May 28, 2026

azat mentioned this pull request May 28, 2026

Add dynamic untracked memory limits for more precise memory tracking #101251

Draft

Add benchmark for memory tracking

5a8344c

azat force-pushed the per-cpu-untracked-memory branch from 359d2e9 to b4bca3f Compare May 28, 2026 16:08

clickhouse-gh Bot reviewed May 28, 2026

View reviewed changes

Comment thread src/Common/tests/gtest_per_cpu_memory.cpp

azat and others added 8 commits May 28, 2026 22:24

tests: fix 04117_max_threads_min_free_memory_per_thread flakiness

6167884

test_settings_constraints_distributed: raise initiator limit above wh…

1883487

…at the query needs Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test_failed_async_inserts: move the 1000-byte limit from server to pe…

6e17d32

…r-query Now with per-cpu untracked memory server cannot start with such low limit Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Respect max_untracked_memory for per-cpu untracked memory flushes

dd442f8

test_early_memory_limit_exception: fix flakiness after per-cpu max_un…

edad089

…tracked_memory After per-cpu memory tracking it make flaky because now the flush (and hence the check) can happens earlier

test_settings_constraints_distributed_ddl: raise initiator limit abov…

d935181

…e what the query needs Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

azat force-pushed the per-cpu-untracked-memory branch from b4bca3f to d935181 Compare May 28, 2026 20:26

clickhouse-gh Bot reviewed May 28, 2026

View reviewed changes

Comment thread src/Common/CurrentMemoryTracker.cpp Outdated

azat and others added 2 commits May 29, 2026 13:02

gtest_per_cpu_memory: test for per-CPU flushes

68983b8

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Increase max_untracked_memory to improve performance

5dcbca9

clickhouse-gh Bot reviewed May 29, 2026

View reviewed changes

Comment thread src/Common/ThreadStatus.h

MemoryTrackerSwitcher: save/restore per-CPU pending counters

8f85796

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

clickhouse-gh Bot reviewed May 29, 2026

View reviewed changes

Comment thread src/Core/Settings.cpp Outdated

Ergus self-assigned this May 29, 2026

Ergus reviewed May 29, 2026

View reviewed changes

Comment thread src/Common/MemoryTrackerSwitcher.cpp Outdated

Comment thread tests/integration/test_early_memory_limit_exception/test.py Outdated

Comment thread tests/integration/test_early_memory_limit_exception/test.py

Comment thread src/Common/CurrentMemoryTracker.cpp Outdated

azat and others added 4 commits May 31, 2026 16:42

Fix a comment for ignoring memory_tracker->allocImpl()

61ea8b3

clickhouse-gh Bot reviewed May 31, 2026

View reviewed changes

Comment thread src/Common/PerCPUMemory.h Outdated

azat marked this pull request as draft June 1, 2026 09:27

azat and others added 3 commits June 1, 2026 18:43

Merge remote-tracking branch 'origin/master' into per-cpu-untracked-m…

2252404

…emory

clickhouse-gh Bot reviewed Jun 1, 2026

View reviewed changes

azat added 2 commits June 1, 2026 21:18

Fix new cppcoreguidelines-init-variables

14d33a6

Fix memory_tracker_benchmark build

30446d9

clickhouse-gh Bot reviewed Jun 1, 2026

View reviewed changes

Conversation

azat commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Uh oh!

clickhouse-gh Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Missing context / blind spots

Findings

Tests

Final Verdict

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Ergus left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

azat commented May 31, 2026

Uh oh!

Uh oh!

azat commented Jun 1, 2026

Uh oh!

clickhouse-gh Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

clickhouse-gh Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

clickhouse-gh Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

clickhouse-gh Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

clickhouse-gh Bot commented Jun 2, 2026

LLVM Coverage Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

azat commented May 28, 2026 •

edited

Loading

clickhouse-gh Bot commented May 28, 2026 •

edited

Loading