Skip to content

Per-CPU bound on untracked memory to prevent OOMs#106055

Draft
azat wants to merge 21 commits into
ClickHouse:masterfrom
azat:per-cpu-untracked-memory
Draft

Per-CPU bound on untracked memory to prevent OOMs#106055
azat wants to merge 21 commits into
ClickHouse:masterfrom
azat:per-cpu-untracked-memory

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented May 28, 2026

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Per-CPU bound on untracked memory to prevent OOMs

Previously ClickHouse works memory overcommit was "max_untracked_memory*threads", which may in theory lead to ±40GiB of overcommited memory in a worst case scenario and 10K threads

By bounding maximum untracked memory to "ncpu*max_untracked_memory" we can avoid this massive memory overcommit, and also with default max_server_memory_usage_to_ram_ratio=0.9, all this memory is already reserved, so for now per-cpu memory is not reserved explicitly.

"max_untracked_memory" per query continues to control the per-thread accumulator and additionally sizes the thread's per-CPU bucket (rounded up to a power of two), so raising it opts that thread into looser tracking without affecting other queries.

Note, on Linux it uses librseq gives only ±1% (in the best case), so this patch has been removed from the patch set

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh Bot commented May 28, 2026

Workflow [PR], commit [30446d9]

Summary:

job_name test_name status info comment
Stateless tests (arm_asan_ubsan, targeted) FAIL
Server died FAIL cidb

AI Review

Summary

This PR adds a Linux-only process-wide PerCPUMemory budget to bound untracked memory by CPU instead of by thread count, plus server settings, tests, and a benchmark. The design is much stronger than the earlier bucket version, but the current code still loses the new bound on an allocation-limit exception path and leaves several server-setting surfaces inconsistent, so I do not think it is mergeable yet.

Missing context / blind spots
  • ⚠️ CI for head 30446d93 is still mostly pending: the fetched Praktika PR report only had config and style complete, with performance comparison jobs pending. fetch_perf_report.py could not run locally because no clickhouse binary is in PATH. Completed performance comparison or benchmark numbers would close this gap.
Findings

❌ Blockers

  • src/Common/CurrentMemoryTracker.cpp:88 The rollback path removes the thread's per-CPU contribution before calling MemoryTracker::allocImpl, but on MEMORY_LIMIT_EXCEEDED it restores only ThreadStatus::untracked_memory. A query that catches the exception can keep the previous deferred bytes outside per_cpu_memory, so the PR's bound is lost exactly on the OOM-prevention path. Restore or re-account the previous PerCPUMemoryThreadState contribution when rethrowing, or delay releasing it until the tracker update succeeds.

⚠️ Majors

  • programs/server/Server.cpp:2322 clickhouse-local loads ServerSettings and installs total_memory_tracker, but never applies max_per_cpu_untracked_memory or per_cpu_untracked_memory_thread_buffer. The default per-CPU accounting is active there, while --max_per_cpu_untracked_memory=0 and buffer overrides are ignored. Wire the same settings in LocalServer next to max_server_memory_usage.
  • programs/server/Server.cpp:2321 The new settings are applied on config reload, but system.server_settings will not report their current values because dumpToSystemServerSettingsColumns does not list them in changeable_settings. Add getters on PerCPUMemory if needed and expose the runtime values there.
  • src/Common/PerCPUMemory.h:137 per_cpu_untracked_memory_thread_buffer=0 is silently ignored by setThreadBuffer, keeping the previous/default buffer. Either make 0 a real no-buffer value or reject/document it explicitly.
  • src/Common/CurrentMemoryTracker.cpp [dismissed by author] The PR is categorized as Performance Improvement and changes the hot allocation/free path, but the current PR state does not include before/after measurements and the performance jobs are still pending. I still consider this real because existing broad perf tests are not evidence until their results are available for this head.
Tests
  • ⚠️ Add a focused regression for the MEMORY_LIMIT_EXCEEDED rollback path: start with a non-zero deferred contribution, force a per-CPU flush that throws, catch it, and assert the previous bytes remain accounted in per_cpu_memory.
  • ⚠️ Add coverage for the clickhouse-local setting path and for runtime visibility of max_per_cpu_untracked_memory / per_cpu_untracked_memory_thread_buffer through system.server_settings.
  • ⚠️ Provide before/after benchmark or performance-comparison evidence for representative allocation/free patterns before relying on the Performance Improvement claim.
Final Verdict

Status: ❌ Block

Minimum required actions: fix the allocation-exception rollback, wire the settings consistently in clickhouse-local and system.server_settings, define 0 semantics for the thread buffer setting, and provide completed CI/performance evidence.

@clickhouse-gh clickhouse-gh Bot added pr-performance Pull request with some performance improvements submodule changed At least one submodule changed in this PR. labels May 28, 2026
@azat azat force-pushed the per-cpu-untracked-memory branch from 359d2e9 to b4bca3f Compare May 28, 2026 16:08
Comment thread src/Common/tests/gtest_per_cpu_memory.cpp
azat and others added 8 commits May 28, 2026 22:24
…OMs)

Bounds untracked_memory to ncpu * SLICE (=4MiB) instead of nthreads *
max_untracked_memory, so server memory limits becomes a true upper bound
on real memory usage (almost).

But leave small per-thread buffer (BUFFER_SIZE=32KiB) as a hot path

With default max_server_memory_usage_to_ram_ratio=0.9, OOMs due to
untracked_memory is not possible (later we can do a real reservation of
this memory, but then we need to decrease
max_server_memory_usage_to_ram_ratio)

On Linux with librseq the slot bump is one rseq load-cbne-store;
otherwise it falls back to atomic operations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Those allocations may be outside of allowed range

  [memory_profiler_sample_min_allocation_size, memory_profiler_sample_max_allocation_size]

Since for isSizeOkForSampling() we may have size bigger then the initial
allocation (due to other allocations on this CPU), but later
MemorySample will be written with initial size.
…at the query needs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r-query

Now with per-cpu untracked memory server cannot start with such low limit

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tracked_memory

After per-cpu memory tracking it make flaky because now the flush (and
hence the check) can happens earlier
…e what the query needs

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@azat azat force-pushed the per-cpu-untracked-memory branch from b4bca3f to d935181 Compare May 28, 2026 20:26
Comment thread src/Common/CurrentMemoryTracker.cpp Outdated
azat and others added 2 commits May 29, 2026 13:02
Comment thread src/Common/ThreadStatus.h
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/Core/Settings.cpp Outdated
@Ergus Ergus self-assigned this May 29, 2026
Copy link
Copy Markdown
Member

@Ergus Ergus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is labelled "Performance Improvement" but the primary goal is OOM prevention... a correctness/safety fix.

So I think this is an Improvement,

Comment thread src/Common/MemoryTrackerSwitcher.cpp Outdated
Comment thread tests/integration/test_early_memory_limit_exception/test.py Outdated
Comment thread tests/integration/test_early_memory_limit_exception/test.py
Comment thread src/Common/CurrentMemoryTracker.cpp Outdated
azat and others added 4 commits May 31, 2026 16:42
…ry cleanup

"system free memory" frees the allocated memory deterministically, so
it and the user cleanup after it have headroom and need no retry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The per-CPU bucket size is rounded up to the next power of two
(std::bit_width in applyQuerySettings), so the effective threshold may
be slightly higher than the configured value.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Encapsulate the per-CPU state save/reset/restore in
PerCPUMemoryThreadState and use it from MemoryTrackerSwitcher, now
covering alloc_cpu/free_cpu and nallocs/nfrees in addition to the
pending counters (slice_log2 stays, it is a query property).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@azat
Copy link
Copy Markdown
Member Author

azat commented May 31, 2026

This is labelled "Performance Improvement" but the primary goal is OOM prevention... a correctness/safety fix.

It is marked as so to run performance tests, and in some sense "OOM prevention" can be "performance improvement"

Comment thread src/Common/PerCPUMemory.h Outdated
@azat azat marked this pull request as draft June 1, 2026 09:27
azat and others added 3 commits June 1, 2026 18:43
Restore max_untracked_memory and memory_profiler_step defaults (4 MiB)
and the per-thread wording, and drop the SettingsChangesHistory entries.
The per-CPU bound is reintroduced in the following commit as a separate
server-wide budget, leaving max_untracked_memory as the per-thread cap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
To keep the global MemoryTracker off the hot path, each thread accumulates
small allocations/deallocations in a thread-local `untracked_memory` and only
flushes to the tracker once it exceeds `max_untracked_memory`. The downside is
that the tracker undercounts by up to `threads * max_untracked_memory` at any
moment - with thousands of threads that is tens of GiB the server never sees,
so it can blow past `max_server_memory_usage` and OOM.

Bound that overcommit per-CPU instead of per-thread. Each CPU owns a signed
counter of the net un-flushed bytes of the threads running on it (allocations
add, frees subtract); a thread reflects its `untracked_memory` into that
counter and must flush once the counter leaves
[-max_per_cpu_untracked_memory, +max_per_cpu_untracked_memory]. Only one thread
runs on a CPU at a time and admission is gated, so the deferred total is bounded
by `ncpu * max_per_cpu_untracked_memory` regardless of the thread count, and
even threads that park while holding untracked memory cannot inflate it past the
budget. The counter is signed, so a thread's deferred allocations and another's
deferred frees on the same CPU cancel instead of both forcing a flush.

To avoid touching the shared counter on every allocation, a thread only updates
it after drifting by `per_cpu_untracked_memory_thread_buffer` (default 32 KiB);
that is the only slack on top of the per-CPU budget. The update is exact - a
thread's contribution is added to and removed from the same CPU and moved on
migration - so the counter never drifts; only the threshold check is best-effort
under concurrency, which is fine because the MemoryTracker stays authoritative.

`max_untracked_memory` keeps its per-thread meaning and still caps a single
thread; `max_per_cpu_untracked_memory` (default 8 MiB, 0 disables the bound) is
a new server setting. Supported on Linux (needs sched_getcpu); elsewhere the
behaviour is unchanged (per-thread only).
@azat
Copy link
Copy Markdown
Member Author

azat commented Jun 1, 2026

@Ergus I do not like current implementation, will keep it as draft for now, and ping you/undraft once it will be ready

new_server_settings[ServerSetting::min_allocation_size_to_throw_on_memory_limit]);

per_cpu_memory.setBudgetCapacity(new_server_settings[ServerSetting::max_per_cpu_untracked_memory]);
per_cpu_memory.setThreadBuffer(new_server_settings[ServerSetting::per_cpu_untracked_memory_thread_buffer]);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These settings are only applied in clickhouse-server, but clickhouse-local also loads ServerSettings and installs total_memory_tracker in programs/local/LocalServer.cpp. With this change the process-wide per_cpu_memory default is active in clickhouse-local, yet --max_per_cpu_untracked_memory=0 and --per_cpu_untracked_memory_thread_buffer=... are silently ignored because LocalServer never calls setBudgetCapacity or setThreadBuffer.

Please wire the same settings in LocalServer next to max_server_memory_usage and min_allocation_size_to_throw_on_memory_limit, otherwise local runs and tests cannot tune or disable the new accounting behavior.

if (current_thread->untracked_memory > current_thread->untracked_memory_limit
|| !DB::per_cpu_memory.reserve(current_thread->untracked_memory, current_thread->per_cpu_untracked_memory))
{
DB::per_cpu_memory.release(current_thread->per_cpu_untracked_memory);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rollback restores only untracked_memory, but the per-CPU contribution has already been removed by reserve/release before this allocImpl call. If the flush throws MEMORY_LIMIT_EXCEEDED, the failed allocation is not made, while the previous unflushed bytes remain in ThreadStatus::untracked_memory with per_cpu_untracked_memory reset to zero. A query that catches the exception can then keep those bytes outside the per-CPU budget until a later allocation or free happens to re-reserve them, so the new bound is lost on the OOM path.

Please restore or re-account the previous PerCPUMemoryThreadState contribution together with previous_untracked_memory when the flush throws, or delay dropping the previous contribution until the tracker update has succeeded.

CurrentMemoryTracker::setMinAllocationSizeBytesToThrow(
new_server_settings[ServerSetting::min_allocation_size_to_throw_on_memory_limit]);

per_cpu_memory.setBudgetCapacity(new_server_settings[ServerSetting::max_per_cpu_untracked_memory]);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These settings are applied on config reload, but system.server_settings will still report the stale value because dumpToSystemServerSettingsColumns only overrides current values for entries listed in its changeable_settings map. After an operator changes max_per_cpu_untracked_memory or per_cpu_untracked_memory_thread_buffer and reloads config, the new budget is active in per_cpu_memory but the system table continues to show the old/default value and changeable_without_restart = No.

Please add these settings to the runtime-current value path, with getters on PerCPUMemory if needed, so the observable server settings match the component state after reload.

Comment thread src/Common/PerCPUMemory.h
/// running counters are left alone.
void setThreadBuffer(Int64 bytes)
{
if (bytes > 0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per_cpu_untracked_memory_thread_buffer is documented as a byte amount, but setThreadBuffer silently ignores 0 and keeps the previous/default buffer. That makes --per_cpu_untracked_memory_thread_buffer=0 unusable for disabling per-thread slack, and on reload it can leave an old value active while the config appears to request a stricter bound.

Please either store 0 as a real no-buffer value or reject/document it explicitly instead of falling back to the previous value.

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh Bot commented Jun 2, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.40% 84.40% +0.00%
Functions 92.40% 92.40% +0.00%
Branches 77.00% 77.00% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 274/281 (97.51%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 3 line(s) · Uncovered code

Full report · Diff report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements submodule changed At least one submodule changed in this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants