Fix sparse column aggregation with sum() and timeseries#95301

Merged
antaljanosbenjamin merged 6 commits into ClickHouse:master from mkmkme:aggegate-crash
Feb 3, 2026

Conversation

mkmkme (Contributor) commented Jan 27, 2026

This PR can be considered a follow-up to #88440. That PR added a missing nullptr check to the addBatchSparse function in IAggregateFunction.h. However, some of the child classes override that function, and those overridden functions still didn't have the nullptr check. That could lead to crashes.
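
To make the failure mode concrete, here is a minimal toy sketch of the pattern (illustrative only, not the actual ClickHouse code; the function name addBatchSparseFixed and the types are made up for this example). With group_by_overflow_mode = 'any', rows that arrive after max_rows_to_group_by has been reached get no aggregation state, so their entry in places is nullptr. The base-class fix from #88440 skips such rows; the overrides still dereferenced them unconditionally:

#include <cstddef>
#include <cstdint>
#include <vector>

using AggregateDataPtr = char *;

/// Toy stand-in for the sparse-batch loop of an aggregate function.
/// places[row] points to the per-group state for that row, or is nullptr
/// for rows dropped by group_by_overflow_mode = 'any'.
void addBatchSparseFixed(size_t row_begin, size_t row_end,
                         const AggregateDataPtr * places,
                         const std::vector<uint64_t> & values)
{
    for (size_t row = row_begin; row < row_end; ++row)
    {
        if (!places[row]) /// the check that the overrides were missing
            continue;
        *reinterpret_cast<uint64_t *>(places[row]) += values[row];
    }
}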

Unfortunately I was not able to create a stateless test case that would crash the server without that fix, but I have a consistent reproduction:

  1. Fetch the ClickBench data: seq 0 99 | xargs -P10 -I{} bash -c 'wget --continue --progress=dot:giga https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_{}.parquet'
  2. Create the hits table from ClickBench: clickhouse-client < ClickBench/clickhouse/create.sql
  3. Start the following query:
SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth)
FROM hits
GROUP BY WatchID, ClientIP
ORDER BY c DESC
LIMIT 10
SETTINGS max_rows_to_group_by = 1000000,
         group_by_overflow_mode = 'any',
         distributed_aggregation_memory_efficient = true;

This led to a crash in AggregateFunctionSum: https://pastila.nl/?000338e6/ca7e40b9d60363e10869f5ac6cd9ade7#lcyVOtREywPI3v8iQZCwgg==GCM

The first commit of this PR fixes that crash.
After that, I took a closer look and found the same pattern in the TimeSeries functions as well, so the second commit adds the missing check there.

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix aggregation of sparse columns for sum and timeseries when group_by_overflow_mode is set to any

Commit: As with IAggregateFunction.h in [1], this should be done in
AggregateFunctionSum.h as well.

[1]: ClickHouse#88440
mkmkme (Contributor, Author) commented Jan 27, 2026

@korowa @antaljanosbenjamin kindly pinging you for a review, since you both have the context from the original PR.

@antaljanosbenjamin self-assigned this on Jan 27, 2026
@antaljanosbenjamin added the "can be tested" label (allows running workflows for external contributors) on Jan 27, 2026
clickhouse-gh bot commented Jan 27, 2026

Workflow [PR], commit [d2b1257]

Summary:

job_name / test_name: status, comment

  Stateless tests (amd_binary, ParallelReplicas, s3 storage, parallel): failure
    03812_join_order_ubsan_overflow: FAIL (cidb), IGNORED
  Stress test (amd_ubsan): failure
    Logical error: Block structure mismatch in A stream: different number of columns: (STID: 0993-38e6): FAIL (cidb, issue), ISSUE EXISTS
  Upgrade check (amd_tsan): failure
    Error message in clickhouse-server.log (see upgrade_error_messages.txt): FAIL (cidb), IGNORED
  Upgrade check (amd_msan): failure
    Error message in clickhouse-server.log (see upgrade_error_messages.txt): FAIL (cidb), IGNORED

@clickhouse-gh bot added the "pr-bugfix" label (pull request with bugfix, not backported by default) on Jan 27, 2026
antaljanosbenjamin (Member) commented:

The changes look good. As for the test: in the stateless tests we have the hits dataset (at least some version of it) preloaded into ClickHouse. A lot of tests use test.hits. Maybe the issue would reproduce on that? Let me check.

mkmkme (Contributor, Author) commented Jan 27, 2026

> The changes look good. As for the test: in the stateless tests we have the hits dataset (at least some version of it) preloaded into ClickHouse. A lot of tests use test.hits. Maybe the issue would reproduce on that? Let me check.

Thanks! If the whole hits dataset is there, then the query from my original post should reproduce the crash.
Let me know if you want me to add it as a test.

Also, I haven't worked with TimeSeries before, so it might take some research for me to add a test for the changes related to it.

antaljanosbenjamin (Member) commented:

I didn't get to this, will check it tomorrow

antaljanosbenjamin (Member) commented:

I lied, I checked it out now 😄

Use this as a test:

-- Tags: stateful
SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(Refresh), AVG(ResolutionWidth)
FROM test.hits
GROUP BY WatchID, ClientIP
ORDER BY c DESC
LIMIT 10
SETTINGS max_rows_to_group_by = 100,
         group_by_overflow_mode = 'any',
         distributed_aggregation_memory_efficient = true
FORMAT NULL;

I ran it with python3 -m ci.praktika run "Stateless tests (amd_debug, parallel)" --test 03811_sparse_column_aggregation_with_sum, but it didn't trigger the issue. Tomorrow I will try to write a test.

mkmkme (Contributor, Author) commented Jan 29, 2026

Hey @antaljanosbenjamin,

Thanks for checking! I repeated your steps and indeed couldn't reproduce the issue. I think I see the reason why. In addition to the original query, I've added some debug information to the test:

-- Tags: stateful

SELECT
    column,
    serialization_kind
FROM system.parts_columns
WHERE database = 'test' AND table = 'hits' AND column = 'Refresh';

SELECT count(*) from test.hits;

SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(Refresh), AVG(ResolutionWidth)
FROM test.hits
GROUP BY WatchID, ClientIP
ORDER BY c DESC
LIMIT 10
SETTINGS max_rows_to_group_by = 100,
         group_by_overflow_mode = 'any',
         distributed_aggregation_memory_efficient = true;

And here's the output this test produces:

Refresh    Default
8873898
8853364585571868441        1944989242      2       2       1339
7706103715086359932        1944989242      2       0       1339
5925798870912605713        2293014471      2       0       1339
9018452160724543599        215582214       2       0       1339
8237934910330383345        1944989242      2       0       1339
8013614557750014932        2917578885      1       0       1339
9054025820695787027        1170811649      1       0       1339
8613910036935888828        931795585       1       0       1339
7492569369978058606        2850809659      1       0       1339
7678041744361854787        1690373472      1       0       1339

Having 8 million rows instead of 99 million probably doesn't make the difference. What is crucial, though, is that hits in "Stateless tests" doesn't use sparse serialization, so this code path doesn't get triggered.

With the whole hits dataset where I could reproduce the issue, I have this:

SELECT
    column,
    serialization_kind
FROM system.parts_columns
WHERE (`table` = 'hits') AND (column = 'IsRefresh')

Query id: 018de9e4-0902-4d73-87b1-a72f729591e9

   ┌─column────┬─serialization_kind─┐
1. │ IsRefresh │ Sparse             │
2. │ IsRefresh │ Default            │
3. │ IsRefresh │ Default            │
4. │ IsRefresh │ Default            │
   └───────────┴────────────────────┘

korowa (Contributor) commented Jan 29, 2026

@mkmkme, I think this can be a reproducer for the issue: https://fiddle.clickhouse.com/fd9aa80c-8435-4633-932b-f44497839d97

mkmkme (Contributor, Author) commented Jan 29, 2026

> @mkmkme, I think this can be a reproducer for the issue: https://fiddle.clickhouse.com/fd9aa80c-8435-4633-932b-f44497839d97

Hey @korowa, thanks a lot! That did the trick. Oddly enough, I tried something similar as a test but couldn't make it fail. I'll push this test to the branch.

@@ -0,0 +1,18 @@
CREATE TABLE sum_overflow(key UInt128, val UInt16) ENGINE = MergeTree ORDER BY tuple();

insert into sum_overflow SELECT number, rand() % 10000 = 0 from numbers(100000)
korowa (Contributor) commented:

We can actually use number % 10000 just to keep the dataset deterministic (I started with rand() and only now realized that it's unnecessary).

antaljanosbenjamin (Member) commented:

First of all, thanks to both of you for making the effort to create a test!

Second, I would also prefer the deterministic test over a non-deterministic one.

Suggested change:
- insert into sum_overflow SELECT number, rand() % 10000 = 0 from numbers(100000)
+ insert into sum_overflow SELECT number, number % 10000 = 0 from numbers(100000)

mkmkme (Contributor, Author) commented:

I also need to drop the table after the test and probably need to have a better name for the table.

Is the FORMAT Null part still okay or do we want some actual output to verify?

antaljanosbenjamin (Member) commented:

I think it is okay. This test doesn't aim to test the correctness of the function, but the fact that it doesn't crash. Hopefully we already have tests to check its correctness.

mkmkme (Contributor, Author) commented:

Thanks, I've changed the test to be more deterministic, changed the table name, and added a drop at the end of the test. Hope it looks fine now.

antaljanosbenjamin (Member) commented:

For the time series function, I couldn't reproduce it. I tried with this query and tweaked the values/settings/etc. Let me ask the author.

antaljanosbenjamin (Member) commented Jan 29, 2026

Okay, I found the reason it cannot be triggered with time series functions: we convert sparse columns to full columns when there is more than a single column to aggregate:

bool allow_sparse_arguments = aggregate_columns[i].size() == 1;
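
For context, here is a simplified paraphrase of what that line implies (not the exact Aggregator.cpp code; the loop and types are condensed for illustration, though convertToFullColumnIfSparse is a real IColumn method):

/// Sparse columns are only passed through when the aggregate function
/// takes exactly one argument; multi-argument functions, such as the
/// timeseries ones, always receive full columns, so their addBatchSparse
/// overrides were never reached.
bool allow_sparse_arguments = aggregate_columns[i].size() == 1;
if (!allow_sparse_arguments)
    for (auto & column : aggregate_columns[i])
        column = column->convertToFullColumnIfSparse();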

mkmkme (Contributor, Author) commented Jan 30, 2026

> Okay, I found the reason it cannot be triggered with time series functions: we convert sparse columns to full columns when there is more than a single column to aggregate.

So AFAIU that means the timeseries changes are not really applicable and I can simply drop them?

antaljanosbenjamin (Member) commented:

> So AFAIU that means the timeseries changes are not really applicable and I can simply drop them?

At this moment it is not important, but I think the real question is about addBatchSparse itself: right now it doesn't make sense, as far as I can see. I think the limitation to a single sparse column was meant to be temporary and is not a really well-known thing, so I would assume we want to get rid of that limitation eventually. To sum up, I think our options are:

  1. Keep addBatchSparse and fix it
  2. Remove addBatchSparse

Out of these two, I would prefer option 1: keep it and keep the fix too.

antaljanosbenjamin (Member) commented:

Actually I think the implementation of addBatchSparse doesn't make sense. It only deals with one column. Let me follow up.

antaljanosbenjamin (Member) commented:

So, sorry for going back and forth, but let's remove those functions. They are not correct; the only reason they don't cause issues is that they aren't called at all.

Commit: These functions had a missing nullptr check that could lead to a crash.
During the investigation it was discovered that these functions are never
called, because the timeseries functions are converted to full columns when
there is more than a single column to aggregate (see [1]).
Therefore, it was decided to delete these functions for now [2].

[1]: https://github.com/ClickHouse/ClickHouse/blob/0ca9499b4f78f6ddb03835339514132de81547d5/src/Interpreters/Aggregator.cpp#L1651
[2]: ClickHouse#95301 (comment)
mkmkme (Contributor, Author) commented Jan 30, 2026

> So, sorry for going back and forth, but let's remove those functions. They are not correct; the only reason they don't cause issues is that they aren't called at all.

No problem, I'm happy to help :) I removed the functions, please have a look.

antaljanosbenjamin (Member) left a review:

LGTM! Thanks!

antaljanosbenjamin (Member) commented:

I am not sure if you merged master because of the failed CI or because there were some conflicts, but if it was the former, next time let me check the results first. It is not necessary to have a fully green CI, as we have flaky tests (we are constantly trying to fix them, but it is a never-ending battle 😄).

mkmkme (Contributor, Author) commented Feb 2, 2026

> I am not sure if you merged master because of the failed CI or because there were some conflicts, but if it was the former, next time let me check the results first. It is not necessary to have a fully green CI, as we have flaky tests (we are constantly trying to fix them, but it is a never-ending battle 😄).

Thanks! Yeah, I wanted to give the CI another try; the failures looked unrelated to me :) Do you know when it can be merged?

antaljanosbenjamin (Member) commented:

> Do you know when it can be merged?

There is no hard rule about this. If everything is green, then it can be merged for sure. If the CI is not fully green, then somebody (probably me, as I am reviewing the PR) has to check the failures and decide whether they are related or not. My rule is: when in doubt, it is related. However, for this PR I think it is very unlikely that any of the failures will be related. So I would say the CI finishes, and after that I can check it, which time-wise means either later today or tomorrow. I cannot promise, though.

@antaljanosbenjamin antaljanosbenjamin added this pull request to the merge queue Feb 3, 2026
Merged via the queue into ClickHouse:master with commit 76ce6c3 Feb 3, 2026
129 of 134 checks passed
@robot-ch-test-poll3 added the "pr-synced-to-cloud" label (the PR is synced to the cloud repo) on Feb 3, 2026
mkmkme deleted the aggegate-crash branch February 3, 2026 12:02
antaljanosbenjamin (Member) commented:

Thanks for the fix and the huge effort you put into this PR @mkmkme!

mkmkme (Contributor, Author) commented Feb 3, 2026

Thanks for the quick review and merge :)

zvonand pushed a commit to Altinity/ClickHouse referencing this pull request on Feb 5, 2026: "Fix sparse column aggregation with sum() and timeseries"
