Per-CPU bound on untracked memory to prevent OOMs#106055
Conversation
|
Workflow [PR], commit [30446d9] Summary: ⏳
AI ReviewSummaryThis PR adds a Linux-only process-wide Missing context / blind spots
Findings❌ Blockers
Tests
Final VerdictStatus: ❌ Block Minimum required actions: fix the allocation-exception rollback, wire the settings consistently in |
359d2e9 to
b4bca3f
Compare
…OMs) Bounds untracked_memory to ncpu * SLICE (=4MiB) instead of nthreads * max_untracked_memory, so server memory limits becomes a true upper bound on real memory usage (almost). But leave small per-thread buffer (BUFFER_SIZE=32KiB) as a hot path With default max_server_memory_usage_to_ram_ratio=0.9, OOMs due to untracked_memory is not possible (later we can do a real reservation of this memory, but then we need to decrease max_server_memory_usage_to_ram_ratio) On Linux with librseq the slot bump is one rseq load-cbne-store; otherwise it falls back to atomic operations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Those allocations may be outside of allowed range [memory_profiler_sample_min_allocation_size, memory_profiler_sample_max_allocation_size] Since for isSizeOkForSampling() we may have size bigger then the initial allocation (due to other allocations on this CPU), but later MemorySample will be written with initial size.
…at the query needs Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r-query Now with per-cpu untracked memory server cannot start with such low limit Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tracked_memory After per-cpu memory tracking it make flaky because now the flush (and hence the check) can happens earlier
…e what the query needs Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
b4bca3f to
d935181
Compare
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Ergus
left a comment
There was a problem hiding this comment.
This is labelled "Performance Improvement" but the primary goal is OOM prevention... a correctness/safety fix.
So I think this is an Improvement,
…ry cleanup "system free memory" frees the allocated memory deterministically, so it and the user cleanup after it have headroom and need no retry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The per-CPU bucket size is rounded up to the next power of two (std::bit_width in applyQuerySettings), so the effective threshold may be slightly higher than the configured value. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Encapsulate the per-CPU state save/reset/restore in PerCPUMemoryThreadState and use it from MemoryTrackerSwitcher, now covering alloc_cpu/free_cpu and nallocs/nfrees in addition to the pending counters (slice_log2 stays, it is a query property). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
It is marked as so to run performance tests, and in some sense "OOM prevention" can be "performance improvement" |
Restore max_untracked_memory and memory_profiler_step defaults (4 MiB) and the per-thread wording, and drop the SettingsChangesHistory entries. The per-CPU bound is reintroduced in the following commit as a separate server-wide budget, leaving max_untracked_memory as the per-thread cap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
To keep the global MemoryTracker off the hot path, each thread accumulates small allocations/deallocations in a thread-local `untracked_memory` and only flushes to the tracker once it exceeds `max_untracked_memory`. The downside is that the tracker undercounts by up to `threads * max_untracked_memory` at any moment - with thousands of threads that is tens of GiB the server never sees, so it can blow past `max_server_memory_usage` and OOM. Bound that overcommit per-CPU instead of per-thread. Each CPU owns a signed counter of the net un-flushed bytes of the threads running on it (allocations add, frees subtract); a thread reflects its `untracked_memory` into that counter and must flush once the counter leaves [-max_per_cpu_untracked_memory, +max_per_cpu_untracked_memory]. Only one thread runs on a CPU at a time and admission is gated, so the deferred total is bounded by `ncpu * max_per_cpu_untracked_memory` regardless of the thread count, and even threads that park while holding untracked memory cannot inflate it past the budget. The counter is signed, so a thread's deferred allocations and another's deferred frees on the same CPU cancel instead of both forcing a flush. To avoid touching the shared counter on every allocation, a thread only updates it after drifting by `per_cpu_untracked_memory_thread_buffer` (default 32 KiB); that is the only slack on top of the per-CPU budget. The update is exact - a thread's contribution is added to and removed from the same CPU and moved on migration - so the counter never drifts; only the threshold check is best-effort under concurrency, which is fine because the MemoryTracker stays authoritative. `max_untracked_memory` keeps its per-thread meaning and still caps a single thread; `max_per_cpu_untracked_memory` (default 8 MiB, 0 disables the bound) is a new server setting. Supported on Linux (needs sched_getcpu); elsewhere the behaviour is unchanged (per-thread only).
|
@Ergus I do not like current implementation, will keep it as draft for now, and ping you/undraft once it will be ready |
| new_server_settings[ServerSetting::min_allocation_size_to_throw_on_memory_limit]); | ||
|
|
||
| per_cpu_memory.setBudgetCapacity(new_server_settings[ServerSetting::max_per_cpu_untracked_memory]); | ||
| per_cpu_memory.setThreadBuffer(new_server_settings[ServerSetting::per_cpu_untracked_memory_thread_buffer]); |
There was a problem hiding this comment.
These settings are only applied in clickhouse-server, but clickhouse-local also loads ServerSettings and installs total_memory_tracker in programs/local/LocalServer.cpp. With this change the process-wide per_cpu_memory default is active in clickhouse-local, yet --max_per_cpu_untracked_memory=0 and --per_cpu_untracked_memory_thread_buffer=... are silently ignored because LocalServer never calls setBudgetCapacity or setThreadBuffer.
Please wire the same settings in LocalServer next to max_server_memory_usage and min_allocation_size_to_throw_on_memory_limit, otherwise local runs and tests cannot tune or disable the new accounting behavior.
| if (current_thread->untracked_memory > current_thread->untracked_memory_limit | ||
| || !DB::per_cpu_memory.reserve(current_thread->untracked_memory, current_thread->per_cpu_untracked_memory)) | ||
| { | ||
| DB::per_cpu_memory.release(current_thread->per_cpu_untracked_memory); |
There was a problem hiding this comment.
The rollback restores only untracked_memory, but the per-CPU contribution has already been removed by reserve/release before this allocImpl call. If the flush throws MEMORY_LIMIT_EXCEEDED, the failed allocation is not made, while the previous unflushed bytes remain in ThreadStatus::untracked_memory with per_cpu_untracked_memory reset to zero. A query that catches the exception can then keep those bytes outside the per-CPU budget until a later allocation or free happens to re-reserve them, so the new bound is lost on the OOM path.
Please restore or re-account the previous PerCPUMemoryThreadState contribution together with previous_untracked_memory when the flush throws, or delay dropping the previous contribution until the tracker update has succeeded.
| CurrentMemoryTracker::setMinAllocationSizeBytesToThrow( | ||
| new_server_settings[ServerSetting::min_allocation_size_to_throw_on_memory_limit]); | ||
|
|
||
| per_cpu_memory.setBudgetCapacity(new_server_settings[ServerSetting::max_per_cpu_untracked_memory]); |
There was a problem hiding this comment.
These settings are applied on config reload, but system.server_settings will still report the stale value because dumpToSystemServerSettingsColumns only overrides current values for entries listed in its changeable_settings map. After an operator changes max_per_cpu_untracked_memory or per_cpu_untracked_memory_thread_buffer and reloads config, the new budget is active in per_cpu_memory but the system table continues to show the old/default value and changeable_without_restart = No.
Please add these settings to the runtime-current value path, with getters on PerCPUMemory if needed, so the observable server settings match the component state after reload.
| /// running counters are left alone. | ||
| void setThreadBuffer(Int64 bytes) | ||
| { | ||
| if (bytes > 0) |
There was a problem hiding this comment.
per_cpu_untracked_memory_thread_buffer is documented as a byte amount, but setThreadBuffer silently ignores 0 and keeps the previous/default buffer. That makes --per_cpu_untracked_memory_thread_buffer=0 unusable for disabling per-thread slack, and on reload it can leave an old value active while the config appears to request a stricter bound.
Please either store 0 as a real no-buffer value or reject/document it explicitly instead of falling back to the previous value.
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 274/281 (97.51%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 3 line(s) · Uncovered code |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Per-CPU bound on untracked memory to prevent OOMs
Previously ClickHouse works memory overcommit was "max_untracked_memory*threads", which may in theory lead to ±40GiB of overcommited memory in a worst case scenario and 10K threads
By bounding maximum untracked memory to "ncpu*max_untracked_memory" we can avoid this massive memory overcommit, and also with default max_server_memory_usage_to_ram_ratio=0.9, all this memory is already reserved, so for now per-cpu memory is not reserved explicitly.
"max_untracked_memory" per query continues to control the per-thread accumulator and additionally sizes the thread's per-CPU bucket (rounded up to a power of two), so raising it opts that thread into looser tracking without affecting other queries.
Note, on Linux it uses librseq gives only ±1% (in the best case), so this patch has been removed from the patch set