Skip to content

Fix runningAccumulate with nested states#105085

Merged
alexey-milovidov merged 7 commits into
masterfrom
fix-msan-hll-uninit-write
May 30, 2026
Merged

Fix runningAccumulate with nested states#105085
alexey-milovidov merged 7 commits into
masterfrom
fix-msan-hll-uninit-write

Conversation

@antaljanosbenjamin
Copy link
Copy Markdown
Member

@antaljanosbenjamin antaljanosbenjamin commented May 15, 2026

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixes a use-after-free in runningAccumulate when called on a column whose aggregate function returns its own state (e.g. uniqStateOrDefaultState, sumStateOrDefaultState, uniqStateForEachState).


Note

Medium Risk
Changes how runningAccumulate materializes results for state-returning aggregate functions and alters Merge combinator behavior, which can affect query results and previously failing code paths for AggregateFunction state columns.

Overview
Fixes runningAccumulate for aggregate functions that return their own state: it now uses insertMergeResultInto (instead of insertResultInto) when agg_func.isState() is true, preventing aliasing of the stack-local accumulator and a use-after-free in ColumnAggregateFunction results.

Adds an AggregateFunctionMerge::insertMergeResultInto override to delegate through combinator chains (e.g. sumStateMerge) instead of hitting the base NOT_IMPLEMENTED path, and introduces a stateless regression test (04218_running_accumulate_state_uaf) covering -State plus combinators like -OrDefault, -ForEach, and -Merge.

Reviewed by Cursor Bugbot for commit 3d7ce6e. Bugbot is set up for automated code reviews on this repo. Configure here.

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh Bot commented May 15, 2026

Workflow [PR], commit [7fb4190]

Summary:


AI Review

Summary

This PR fixes runningAccumulate for AggregateFunction columns whose function returns a state by routing state materialization through insertMergeResultInto, and adds the needed Merge combinator override plus focused stateless regression coverage. I did not find unresolved correctness or safety issues in the current diff.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-backward-incompatible Pull request with backwards incompatible changes label May 15, 2026
Copy link
Copy Markdown
Member

@alexey-milovidov alexey-milovidov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this a kludge? We should find and fix the root cause.

@alexey-milovidov alexey-milovidov self-assigned this May 15, 2026
Comment thread src/AggregateFunctions/Combinators/AggregateFunctionState.cpp Outdated
@antaljanosbenjamin
Copy link
Copy Markdown
Member Author

antaljanosbenjamin commented May 15, 2026

Isn't this a kludge? We should find and fix the root cause.

I think this is the proper fix. Without this runningAccumulate insert a state (technically a raw pointer) into the column while the allocated memory is freed on scope exit. This happens because the state is nested inside another state. This pattern can cause issues basically in multiple places, where we expect states, but got a state "nested into" another state.

Furthermoe, xxxStateState is already forbidden.

@antaljanosbenjamin antaljanosbenjamin changed the title Forbid nesting-State aggregate function combinators Fix runningAccumulate with nested states May 15, 2026
Comment thread src/AggregateFunctions/Combinators/AggregateFunctionState.cpp Outdated
Comment thread src/Functions/runningAccumulate.cpp Outdated
Comment thread tests/queries/0_stateless/04218_running_accumulate_state_uaf.sql
@antaljanosbenjamin antaljanosbenjamin removed the pr-backward-incompatible Pull request with backwards incompatible changes label May 15, 2026
@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 15, 2026
Comment thread src/Functions/runningAccumulate.cpp
@antaljanosbenjamin
Copy link
Copy Markdown
Member Author

@groeneai investigate the failures

@groeneai
Copy link
Copy Markdown
Contributor

@antaljanosbenjamin — quick summary, then details.

CI failures: chronic infra, not PR-caused. Only two checks failed:

Check Tests Status
Stress test (amd_msan) Cannot start clickhouse-server, Check failed (exit code 1) infra flake
Stress test (arm_tsan) same infra flake

CIDB cross-PR over 30 days (pull_request_number > 0, test_name = 'Cannot start clickhouse-server'):

Check master hits PR hits distinct PRs
Stress test (amd_msan) 0 10 10
Stress test (arm_tsan) 0 32 30

Zero master signal, broad cross-PR scatter — these are the well-known "server cannot start under stress" infra flakes. Nothing in your diff is plausibly related to server startup.


The bot's *StateMerge NOT_IMPLEMENTED concern is real — reproduced locally.

I applied your patch on top of master, built, and ran:

CREATE TABLE t_state_merge
    (k UInt32, s AggregateFunction(sumStateMerge, AggregateFunction(sum, UInt64)))
ENGINE = MergeTree ORDER BY k;
INSERT INTO t_state_merge
SELECT k, sumStateMergeState(s) FROM t_sum_state GROUP BY k;

SELECT runningAccumulate(s) FROM t_state_merge;

With the patch applied:

Code: 48. DB::Exception: Function sumStateMerge is marked as State but
method insertMergeResultInto is not implemented:
while executing 'FUNCTION runningAccumulate(__table1.s :: 1)
  -> runningAccumulate(__table1.s) AggregateFunction(sum, UInt64) : 0'. (NOT_IMPLEMENTED)

On master (no patch) the same query returns a (UAF-tainted) AggregateFunction(sum, UInt64) column. So the patch swaps a silent UAF for a hard NOT_IMPLEMENTED for this specific shape.

Scope of the regression. It only reproduces when the column itself has stored type AggregateFunction(sumStateMerge, ...) — i.e. an explicitly-typed MergeTree column, a stored MV with that schema, an arrayReduceInPlace-style construction, etc. Inline expressions like SELECT sumStateMerge(s) FROM ... resolve the result type to AggregateFunction(sum, UInt64) (the result type of -Merge), so the column stores plain sum, and column_with_states->getAggregateFunction() returns sum, whose isState() is false. So returns_state is false, the insertResultInto branch is taken, and nothing breaks. The patch's existing regression test suite all passes — none of those cases exercises a stored AggregateFunction(*StateMerge, ...) column.

Why it happens. AggregateFunctionMerge::isState() returns nested_func->isState() (AggregateFunctionMerge.h:137-140), which is true whenever the merged nested function is itself a state-returning combinator. AggregateFunctionMerge does not override insertMergeResultInto, so it falls through to IAggregateFunction::insertMergeResultInto (IAggregateFunction.cpp:148-155), which sees isState() == true and throws NOT_IMPLEMENTED.

AggregateFunctionMerge is the only combinator with this isState()/insertMergeResultInto asymmetry — State, If, Null, OrFill, Array, ForEach, Distinct, Resample, Map, argMinArgMax all either override insertMergeResultInto or never report isState() == true. So the fix is local.


Suggested fix. Add a delegating override to src/AggregateFunctions/Combinators/AggregateFunctionMerge.h:

void insertMergeResultInto(AggregateDataPtr __restrict place, IColumn & to, Arena * arena) const override
{
    nested_func->insertMergeResultInto(place, to, arena);
}

For sumStateMerge, nested_func is AggregateFunctionState<sum>, whose insertMergeResultInto is assert_cast<ColumnAggregateFunction &>(to).insertFrom(place) — the same safe arena-owning path your patch relies on for direct *State. Symmetrical to the existing insertResultInto override (AggregateFunctionMerge.h:125-128).

I built and tested this on top of your patch:

Case Master Your patch Patch + this override
runningAccumulate over stored AggregateFunction(sumStateMerge, …) silent UAF (returns garbage) NOT_IMPLEMENTED returns AggregateFunction(sum, UInt64), finalizes to [63, 133, 190] for the test data
Patch's 04218_running_accumulate_state_uaf.sql (sumStateOrDefaultState, uniqHLL12StateOrDefaultState, …ForEachState, initializeAggregation('sumState', …)) UAF / MSAN OK OK (identical reference)
runningAccumulate(uniqState(…)) classic OK OK OK

The same override probably also helps WindowTransform (src/Processors/Transforms/WindowTransform.cpp:1083-1088 takes the same is_aggregate_function_state ? insertMergeResultInto : insertResultInto branch) for the equivalent stored-*StateMerge-over-window case, but I didn't exercise that path.

Optional: a small regression test that creates a stored AggregateFunction(sumStateMerge, …) column and runs runningAccumulate over it would lock this in — your current test suite uses only inline state-returning aggregates, which (per above) doesn't reach the Merge-combinator code path.


Related. This PR also fixes STID 2243-483f (MSAN use-of-uninitialized-value in SingleValueDataFixed::writeColumnAggregateFunction::updateHashWithValueDistinctTransform, origin AlignedBuffer::~AlignedBuffer in FunctionRunningAccumulate::executeImpl), surfaced on PR #101354. The lifetime hypothesis in that report is exactly what your patch addresses.

Local repro session: cron:clickhouse-ci-task-worker:20260516-080000. Happy to file the AggregateFunctionMerge override as a follow-up PR if you'd rather keep this PR scoped to runningAccumulate.cpp.

alexey-milovidov and others added 3 commits May 23, 2026 09:42
When `runningAccumulate` is invoked on a column whose attached aggregate
function is a `Merge` wrapper over a state-returning nested function
(e.g. a column of type `AggregateFunction(sumStateMerge, ...)`),
`AggregateFunctionMerge::isState` reports `true` because it is
propagated from the nested function. The PR's `runningAccumulate` then
takes the `insertMergeResultInto` branch — but
`AggregateFunctionMerge` had no override, so the base
`IAggregateFunction::insertMergeResultInto` would throw
`NOT_IMPLEMENTED` for any function that advertises `isState`.

Add the override to delegate to `nested_func->insertMergeResultInto`,
matching how every other state-propagating combinator
(`ForEach`/`OrFill`/`If`/`Array`/...) already routes the call.
Extend the regression test in
`tests/queries/0_stateless/04218_running_accumulate_state_uaf.sql`
to cover this path via a `CAST` to `AggregateFunction(sumStateMerge, ...)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh Bot commented May 29, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.40% 84.40% +0.00%
Functions 91.30% 91.40% +0.10%
Branches 76.90% 77.00% +0.10%

Changed lines: Changed C/C++ lines covered by tests: 13/13 (100.00%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

@alexey-milovidov alexey-milovidov merged commit 145b68d into master May 30, 2026
165 of 166 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-msan-hll-uninit-write branch May 30, 2026 05:12
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants