Skip to content

Apply Poisson sampling correction to collapsed jemalloc heap profiles#102759

Merged
antonio2368 merged 5 commits intomasterfrom
fix-jemalloc-collapsed
Apr 16, 2026
Merged

Apply Poisson sampling correction to collapsed jemalloc heap profiles#102759
antonio2368 merged 5 commits intomasterfrom
fix-jemalloc-collapsed

Conversation

@antonio2368
Copy link
Copy Markdown
Member

Collapsed jemalloc heap profiles generated by system.jemalloc_profile_text and SYSTEM JEMALLOC FLUSH PROFILE reported raw sampled bytes without the Poisson sampling correction that jeprof applies. This caused the reported totals to underestimate actual allocation sizes — for example, ~5.9 GB raw vs ~11.2 GB corrected for a profile with the default 524288-byte sampling interval.

The fix parses the heap_v2/N header to extract the sampling interval, then scales each sample by 1/(1-exp(-mean_size/interval)) matching jeprof's AdjustSamples algorithm. Uses std::expm1 for numerical stability with small ratios and guards against edge cases (zero bytes, non-finite results, overflow).

Also adds a WriteBuffer overload of symbolizeJemallocHeapProfile and unit tests.

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Apply Poisson sampling correction to collapsed jemalloc heap profiles to match jeprof output. Previously the collapsed format underestimated actual allocation sizes by not accounting for the sampling probability.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

No documentation changes needed — this is a correctness fix for existing functionality.

🤖 Generated with Claude Code

Previously, `JemallocProfileSource::generateCollapsed` reported raw
sampled bytes/counts without the Poisson sampling correction that
jeprof applies via `AdjustSamples`. This caused collapsed profiles to
underestimate actual allocation sizes (e.g. ~5.9 GB raw vs ~11.2 GB
corrected for a real profile with 524288-byte sampling interval).

The fix parses the `heap_v2/N` header to extract the sampling interval,
then scales each sample by `1/(1-exp(-mean_size/interval))` using
`std::expm1` for numerical stability. Edge cases (zero bytes, zero
count, non-finite results, overflow) are guarded.

Also adds a `WriteBuffer` overload of `symbolizeJemallocHeapProfile`
and unit tests verifying the correction against a hardcoded expected
value computed from jeprof's formula.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@antonio2368 antonio2368 added pr-must-backport Pull request should be backported intentionally. Use this label with great care! v25.12-must-backport labels Apr 15, 2026
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh Bot commented Apr 15, 2026

Workflow [PR], commit [a795f5e]

Summary:

job_name test_name status info comment
Integration tests (amd_msan, 6/6) failure
test_storage_rabbitmq/test.py::test_rabbitmq_select[0] FAIL cidb
test_storage_rabbitmq/test.py::test_rabbitmq_select[1] FAIL cidb
test_storage_rabbitmq/test.py::test_rabbitmq_select_empty FAIL cidb
test_storage_rabbitmq/test.py::test_rabbitmq_json_without_delimiter FAIL cidb
test_storage_rabbitmq/test.py::test_rabbitmq_csv_with_delimiter FAIL cidb
test_storage_rabbitmq/test.py::test_rabbitmq_tsv_with_delimiter FAIL cidb
test_storage_rabbitmq/test.py::test_rabbitmq_macros FAIL cidb
test_storage_rabbitmq/test.py::test_rabbitmq_materialized_view FAIL cidb
test_storage_rabbitmq/test.py::test_rabbitmq_materialized_view_with_subquery FAIL cidb
test_storage_rabbitmq/test.py::test_rabbitmq_many_materialized_views FAIL cidb
47 more test cases not shown
Unit tests (msan) error

AI Review

Summary

This PR applies Poisson sampling correction to collapsed jemalloc heap profiles so totals match jeprof. The core algorithm is sound, but the collapsed parser became less robust on malformed profile records because it now parses both <live_count> and <live_bytes> unconditionally before deciding which metric to emit. This can turn previously best-effort processing into an exception path for corrupted input.

Missing context
  • ⚠️ No CI logs or test run results were provided in this review context.
Findings

❌ Blockers

  • [src/Processors/Sources/JemallocProfileSource.cpp:531-541] generateCollapsed now parses both numeric fields with parseInt on every record. In count mode, malformed <live_bytes> can throw even when that field is not needed; in bytes mode the symmetric case applies. This is a robustness regression for malformed/corrupted profiles.
    • Suggested fix: parse only the field required for output when sampling_interval == 0; parse both fields only when correction is active.

⚠️ Majors

  • [src/Processors/tests/gtest_jemalloc_profile_source.cpp:191-216] Tests cover malformed heap_v2 header but do not cover malformed metric fields in allocation lines. This leaves the above regression unguarded.
    • Suggested fix: add regression tests with invalid <live_bytes> in count mode and invalid <live_count> in bytes mode.
Tests
  • ⚠️ Add a test where sampling_interval == 0 and <live_bytes> is malformed while running count mode; verify no exception and best-effort output.
  • ⚠️ Add the symmetric test for bytes mode with malformed <live_count>.
ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time
No large/binary files
Final Verdict
  • Status: ⚠️ Request changes
  • Minimum required actions:
    • Fix malformed-line robustness in generateCollapsed so non-required malformed fields do not throw.
    • Add targeted regression tests for malformed allocation metric fields.

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Apr 15, 2026
Comment thread src/Processors/Sources/JemallocProfileSource.cpp Outdated
antonio2368 and others added 2 commits April 15, 2026 11:34
Use tryReadIntText instead of parseInt so that malformed heap_v2
headers (e.g. heap_v2/abc) return 0 instead of throwing, keeping
collapsed output generation best-effort.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verify that heap_v2/abc falls back to raw values without throwing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread src/Processors/Sources/JemallocProfileSource.cpp
antonio2368 and others added 2 commits April 15, 2026 13:09
# Conflicts:
#	src/Processors/Sources/JemallocProfileSource.cpp
Resolve conflict in symbolizeJemallocHeapProfile overloads — master
already extracted pullProfileLines and added
symbolizeJemallocHeapProfileToString, so drop the redundant WriteBuffer
overload and use the new API in tests. Also use std::move on
out.str() to avoid copying the string buffer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
std::filesystem::remove(input);
}

/// When the heap_v2 header has a malformed interval (e.g. "heap_v2/abc"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CollapsedMalformedHeaderRawPassthrough verifies malformed heap_v2 headers, but it does not cover malformed allocation records (e.g. invalid <live_bytes> in count mode or invalid <live_count> in bytes mode).

That gap matters here because collapsed parsing now unconditionally parses both fields before choosing the output metric, so malformed data in an otherwise-unused field can raise an exception where previous behavior was best-effort. Please add a regression test for malformed metric fields to lock this behavior down.

@azat azat self-assigned this Apr 15, 2026
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh Bot commented Apr 15, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.00% 84.00% +0.00%
Functions 90.90% 90.90% +0.00%
Branches 76.50% 76.50% +0.00%

Changed lines: 96.00% (120/125) · Uncovered code

Full report · Diff report

WriteBufferFromOwnString out;
pullProfileLines(input_filename, format, symbolize_with_inline, out);
return out.str();
return std::move(out.str());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it always return rvalue?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably makes sense to always return rvalue, not sure if it will break somewhere but we can check.

Copy link
Copy Markdown
Member

@azat azat Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, but out.str() should be an rvalue already, to be precise out.str() is an prvalue and std::move(out.str()) is an xvalue.

And RVO should be applied here before

I don't understand why clang-tidy does not complain about this line...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow. out.str() returns std::string & so it's lvalue. Am I missing something?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry for the noise, I thought that str() returns std::string.

@antonio2368 antonio2368 added this pull request to the merge queue Apr 16, 2026
Merged via the queue into master with commit 3ec2366 Apr 16, 2026
158 of 161 checks passed
@antonio2368 antonio2368 deleted the fix-jemalloc-collapsed branch April 16, 2026 06:57
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 16, 2026
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR label Apr 16, 2026
robot-ch-test-poll2 added a commit that referenced this pull request Apr 16, 2026
Cherry pick #102759 to 26.2: Apply Poisson sampling correction to collapsed jemalloc heap profiles
robot-clickhouse added a commit that referenced this pull request Apr 16, 2026
robot-ch-test-poll2 added a commit that referenced this pull request Apr 16, 2026
Cherry pick #102759 to 26.3: Apply Poisson sampling correction to collapsed jemalloc heap profiles
robot-clickhouse added a commit that referenced this pull request Apr 16, 2026
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Apr 16, 2026
clickhouse-gh Bot added a commit that referenced this pull request Apr 16, 2026
Backport #102759 to 26.2: Apply Poisson sampling correction to collapsed jemalloc heap profiles
antonio2368 added a commit that referenced this pull request Apr 16, 2026
Backport #102759 to 26.3: Apply Poisson sampling correction to collapsed jemalloc heap profiles
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-bugfix Pull request with bugfix, not backported by default pr-must-backport Pull request should be backported intentionally. Use this label with great care! pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR pr-synced-to-cloud The PR is synced to the cloud repo v25.12-must-backport

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants