Fix sparse column aggregation with sum() and timeseries#95301

Merged
antaljanosbenjamin merged 6 commits into ClickHouse:master from mkmkme:aggegate-crash
Feb 3, 2026

Conversation

mkmkme (Contributor) commented Jan 27, 2026

This PR can be considered a follow-up to #88440. That PR added a missing nullptr check to the addBatchSparse function in IAggregateFunction.h. However, some of the child classes override that function, and those overridden functions still didn't have the nullptr check. That could lead to crashes.
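
To make the failure mode concrete, here is a minimal toy sketch of the pattern (illustrative only, not the actual ClickHouse code; the function name addBatchSparseFixed and the types are made up for this example). With group_by_overflow_mode = 'any', rows that arrive after max_rows_to_group_by has been reached get no aggregation state, so their entry in places is nullptr. The base-class fix from #88440 skips such rows; the overrides still dereferenced them unconditionally:

#include <cstddef>
#include <cstdint>
#include <vector>

using AggregateDataPtr = char *;

/// Toy stand-in for the sparse-batch loop of an aggregate function.
/// places[row] points to the per-group state for that row, or is nullptr
/// for rows dropped by group_by_overflow_mode = 'any'.
void addBatchSparseFixed(size_t row_begin, size_t row_end,
                         const AggregateDataPtr * places,
                         const std::vector<uint64_t> & values)
{
    for (size_t row = row_begin; row < row_end; ++row)
    {
        if (!places[row]) /// the check that the overrides were missing
            continue;
        *reinterpret_cast<uint64_t *>(places[row]) += values[row];
    }
}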

Unfortunately I was not able to create a stateless test case that would crash the server without that fix, but I have a consistent reproduction:

  1. Fetch the ClickBench data: seq 0 99 | xargs -P10 -I{} bash -c 'wget --continue --progress=dot:giga https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_{}.parquet'
  2. Create the hits table from ClickBench: clickhouse-client < ClickBench/clickhouse/create.sql
  3. Start the following query:
SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth)
FROM hits
GROUP BY WatchID, ClientIP
ORDER BY c DESC
LIMIT 10
SETTINGS max_rows_to_group_by = 1000000,
         group_by_overflow_mode = 'any',
         distributed_aggregation_memory_efficient = true;

This led to a crash in AggregateFunctionSum: https://pastila.nl/?000338e6/ca7e40b9d60363e10869f5ac6cd9ade7#lcyVOtREywPI3v8iQZCwgg==GCM

The first commit of this PR fixes that crash.
After that, I took a closer look and found the same pattern in the TimeSeries functions as well, so the second commit adds the missing check there.

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix aggregation of sparse columns for sum and timeseries when group_by_overflow_mode is set to any

Commit: As with IAggregateFunction.h in [1], this should be done in
AggregateFunctionSum.h as well.

[1]: ClickHouse#88440
mkmkme (Contributor, Author) commented Jan 27, 2026

@korowa @antaljanosbenjamin kindly pinging you for a review, since you both have the context from the original PR.

@antaljanosbenjamin self-assigned this on Jan 27, 2026
@antaljanosbenjamin added the "can be tested" label (allows running workflows for external contributors) on Jan 27, 2026
clickhouse-gh bot commented Jan 27, 2026

Workflow [PR], commit [d2b1257]

Summary:

job_name / test_name: status, comment

  Stateless tests (amd_binary, ParallelReplicas, s3 storage, parallel): failure
    03812_join_order_ubsan_overflow: FAIL (cidb), IGNORED
  Stress test (amd_ubsan): failure
    Logical error: Block structure mismatch in A stream: different number of columns: (STID: 0993-38e6): FAIL (cidb, issue), ISSUE EXISTS
  Upgrade check (amd_tsan): failure
    Error message in clickhouse-server.log (see upgrade_error_messages.txt): FAIL (cidb), IGNORED
  Upgrade check (amd_msan): failure
    Error message in clickhouse-server.log (see upgrade_error_messages.txt): FAIL (cidb), IGNORED

@clickhouse-gh bot added the "pr-bugfix" label (pull request with bugfix, not backported by default) on Jan 27, 2026
antaljanosbenjamin (Member) commented:

The changes look good. As for the test: in the stateless tests we have the hits dataset (at least some version of it) preloaded into ClickHouse. A lot of tests use test.hits. Maybe the issue would reproduce on that? Let me check.

mkmkme (Contributor, Author) commented Jan 27, 2026

> The changes look good. As for the test: in the stateless tests we have the hits dataset (at least some version of it) preloaded into ClickHouse. A lot of tests use test.hits. Maybe the issue would reproduce on that? Let me check.

Thanks! If the whole hits dataset is there, then the query from my original post should reproduce the crash.
Let me know if you want me to add it as a test.

Also, I haven't worked with TimeSeries before, so it might take some research for me to add a test for the changes related to it.

antaljanosbenjamin (Member) commented:

I didn't get to this, will check it tomorrow

antaljanosbenjamin (Member) commented:

I lied, I checked it out now 😄

Use this as a test:

-- Tags: stateful
SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(Refresh), AVG(ResolutionWidth)
FROM test.hits
GROUP BY WatchID, ClientIP
ORDER BY c DESC
LIMIT 10
SETTINGS max_rows_to_group_by = 100,
         group_by_overflow_mode = 'any',
         distributed_aggregation_memory_efficient = true
FORMAT NULL;

I ran it with python3 -m ci.praktika run "Stateless tests (amd_debug, parallel)" --test 03811_sparse_column_aggregation_with_sum, but it didn't trigger the issue. Tomorrow I will try to write a test.

mkmkme (Contributor, Author) commented Jan 29, 2026

Hey @antaljanosbenjamin,

Thanks for checking! I repeated your steps and indeed couldn't reproduce the issue. I think I see the reason why. In addition to the original query, I've added some debug information to the test:

-- Tags: stateful

SELECT
    column,
    serialization_kind
FROM system.parts_columns
WHERE database = 'test' AND table = 'hits' AND column = 'Refresh';

SELECT count(*) from test.hits;

SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(Refresh), AVG(ResolutionWidth)
FROM test.hits
GROUP BY WatchID, ClientIP
ORDER BY c DESC
LIMIT 10
SETTINGS max_rows_to_group_by = 100,
         group_by_overflow_mode = 'any',
         distributed_aggregation_memory_efficient = true;

And here's the output this test produces:

Refresh    Default
8873898
8853364585571868441        1944989242      2       2       1339
7706103715086359932        1944989242      2       0       1339
5925798870912605713        2293014471      2       0       1339
9018452160724543599        215582214       2       0       1339
8237934910330383345        1944989242      2       0       1339
8013614557750014932        2917578885      1       0       1339
9054025820695787027        1170811649      1       0       1339
8613910036935888828        931795585       1       0       1339
7492569369978058606        2850809659      1       0       1339
7678041744361854787        1690373472      1       0       1339

Having 8 million rows instead of 99 million probably doesn't make the difference. What is crucial, though, is that hits in "Stateless tests" doesn't use sparse serialization, so this code path doesn't get triggered.

With the whole hits dataset where I could reproduce the issue, I have this:

SELECT
    column,
    serialization_kind
FROM system.parts_columns
WHERE (`table` = 'hits') AND (column = 'IsRefresh')

Query id: 018de9e4-0902-4d73-87b1-a72f729591e9

   ┌─column────┬─serialization_kind─┐
1. │ IsRefresh │ Sparse             │
2. │ IsRefresh │ Default            │
3. │ IsRefresh │ Default            │
4. │ IsRefresh │ Default            │
   └───────────┴────────────────────┘

korowa (Contributor) commented Jan 29, 2026

@mkmkme, I think this can be a reproducer for the issue: https://fiddle.clickhouse.com/fd9aa80c-8435-4633-932b-f44497839d97

mkmkme (Contributor, Author) commented Jan 29, 2026

> @mkmkme, I think this can be a reproducer for the issue: https://fiddle.clickhouse.com/fd9aa80c-8435-4633-932b-f44497839d97

Hey @korowa, thanks a lot! That did the trick. Oddly enough, I tried something similar as a test but couldn't make it fail. I'll push this test to the branch.

@@ -0,0 +1,18 @@
CREATE TABLE sum_overflow(key UInt128, val UInt16) ENGINE = MergeTree ORDER BY tuple();

insert into sum_overflow SELECT number, rand() % 10000 = 0 from numbers(100000)
korowa (Contributor) commented:

We can actually use number % 10000 just to keep the dataset deterministic (I started with rand() and only now realized that it's unnecessary).

antaljanosbenjamin (Member) commented:

First of all, thanks to both of you for making the effort to create a test!

Second, I would also prefer the deterministic test over a non-deterministic one.

Suggested change:
- insert into sum_overflow SELECT number, rand() % 10000 = 0 from numbers(100000)
+ insert into sum_overflow SELECT number, number % 10000 = 0 from numbers(100000)

mkmkme (Contributor, Author) commented:

I also need to drop the table after the test and probably need to have a better name for the table.

Is the FORMAT Null part still okay or do we want some actual output to verify?

antaljanosbenjamin (Member) commented:

I think it is okay. This test doesn't aim to test the correctness of the function, but the fact that it doesn't crash. Hopefully we already have tests to check its correctness.

mkmkme (Contributor, Author) commented:

Thanks, I've changed the test to be more deterministic, changed the table name, and added a drop at the end of the test. Hope it looks fine now.

antaljanosbenjamin (Member) commented:

For the time series function, I couldn't reproduce it. I tried with this query and tweaked the values/settings/etc. Let me ask the author.

antaljanosbenjamin (Member) commented Jan 29, 2026

Okay, I found the reason it cannot be triggered with time series functions: we convert sparse columns to full columns when there is more than a single column to aggregate:

bool allow_sparse_arguments = aggregate_columns[i].size() == 1;
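
For context, here is a simplified paraphrase of what that line implies (not the exact Aggregator.cpp code; the loop and types are condensed for illustration, though convertToFullColumnIfSparse is a real IColumn method):

/// Sparse columns are only passed through when the aggregate function
/// takes exactly one argument; multi-argument functions, such as the
/// timeseries ones, always receive full columns, so their addBatchSparse
/// overrides were never reached.
bool allow_sparse_arguments = aggregate_columns[i].size() == 1;
if (!allow_sparse_arguments)
    for (auto & column : aggregate_columns[i])
        column = column->convertToFullColumnIfSparse();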

mkmkme (Contributor, Author) commented Jan 30, 2026

> Okay, I found the reason it cannot be triggered with time series functions: we convert sparse columns to full columns when there is more than a single column to aggregate.

So AFAIU that means the timeseries changes are not really applicable and I can simply drop them?

antaljanosbenjamin (Member) commented:

> So AFAIU that means the timeseries changes are not really applicable and I can simply drop them?

At this moment it is not important, but I think the real question is about addBatchSparse itself: right now it doesn't make sense, as far as I can see. I think the limitation to a single sparse column was meant to be temporary and is not a really well-known thing, so I would assume we want to get rid of that limitation eventually. To sum up, I think our options are:

  1. Keep addBatchSparse and fix it
  2. Remove addBatchSparse

Out of these two, I would prefer option 1: keep it and keep the fix too.

antaljanosbenjamin (Member) commented:

Actually I think the implementation of addBatchSparse doesn't make sense. It only deals with one column. Let me follow up.

antaljanosbenjamin (Member) commented:

So, sorry for going back and forth, but let's remove those functions. They are not correct; the only reason they don't cause issues is that they aren't called at all.

Commit: These functions had a missing nullptr check that could lead to a crash.
During the investigation it was discovered that these functions are never
called, because the timeseries functions are converted to full columns when
there is more than a single column to aggregate (see [1]).
Therefore, it was decided to delete these functions for now [2].

[1]: https://github.com/ClickHouse/ClickHouse/blob/0ca9499b4f78f6ddb03835339514132de81547d5/src/Interpreters/Aggregator.cpp#L1651
[2]: ClickHouse#95301 (comment)
mkmkme (Contributor, Author) commented Jan 30, 2026

> So, sorry for going back and forth, but let's remove those functions. They are not correct; the only reason they don't cause issues is that they aren't called at all.

No problem, I'm happy to help :) I removed the functions, please have a look.

antaljanosbenjamin (Member) left a review:

LGTM! Thanks!

antaljanosbenjamin (Member) commented:

I am not sure if you merged master because of the failed CI or because there were some conflicts, but if it was the former, next time let me check the results first. It is not necessary to have a fully green CI, as we have flaky tests (we are constantly trying to fix them, but it is a never-ending battle 😄).

mkmkme (Contributor, Author) commented Feb 2, 2026

> I am not sure if you merged master because of the failed CI or because there were some conflicts, but if it was the former, next time let me check the results first. It is not necessary to have a fully green CI, as we have flaky tests (we are constantly trying to fix them, but it is a never-ending battle 😄).

Thanks! Yeah, I wanted to give the CI another try; the failures looked unrelated to me :) Do you know when it can be merged?

antaljanosbenjamin (Member) commented:

> Do you know when it can be merged?

There is no hard rule about this. If everything is green, then it can be merged for sure. If the CI is not fully green, then somebody (probably me, as I am reviewing the PR) has to check the failures and decide whether they are related or not. My rule is: when in doubt, it is related. However, for this PR I think it is very unlikely that any of the failures will be related. So I would say the CI finishes, and after that I can check it, which time-wise means either later today or tomorrow. I cannot promise, though.

@antaljanosbenjamin antaljanosbenjamin added this pull request to the merge queue Feb 3, 2026
Merged via the queue into ClickHouse:master with commit 76ce6c3 Feb 3, 2026
129 of 134 checks passed
@robot-ch-test-poll3 added the "pr-synced-to-cloud" label (the PR is synced to the cloud repo) on Feb 3, 2026
mkmkme deleted the aggegate-crash branch February 3, 2026 12:02
antaljanosbenjamin (Member) commented:

Thanks for the fix and the huge effort you put into this PR @mkmkme!

mkmkme (Contributor, Author) commented Feb 3, 2026

Thanks for the quick review and merge :)

zvonand pushed a commit to Altinity/ClickHouse referencing this pull request on Feb 5, 2026: "Fix sparse column aggregation with sum() and timeseries"
