Fix sparse column aggregation with sum() and timeseries #95301
antaljanosbenjamin merged 6 commits into ClickHouse:master
Conversation
Similarly to IAggregateFunction.h in [1], this should be done in AggregateFunctionSum.h as well. [1]: ClickHouse#88440
|
@korowa @antaljanosbenjamin kindly pinging you for a review because you are in the context of the original PR |
|
Workflow [PR], commit [d2b1257] Summary: ❌
|
The changes look good. For the test: in stateless tests we have |
Thanks! Also, I haven't worked with TimeSeries before, so it might take some research for me to add a test for the changes related to it |
|
I didn't get to this, will check it tomorrow |
|
I lied, I checked it out now 😄 Use this as a test: I ran it with |
|
Hey @antaljanosbenjamin, thanks for checking! I repeated your steps and indeed couldn't reproduce the issue. I think I see the reason why. In addition to the original query, I've added some debug information to the test:

-- Tags: stateful
SELECT
column,
serialization_kind
FROM system.parts_columns
WHERE database = 'test' AND table = 'hits' AND column = 'Refresh';
SELECT count(*) from test.hits;
SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(Refresh), AVG(ResolutionWidth)
FROM test.hits
GROUP BY WatchID, ClientIP
ORDER BY c DESC
LIMIT 10
SETTINGS max_rows_to_group_by = 100,
group_by_overflow_mode = 'any',
distributed_aggregation_memory_efficient = true;
And here's the output this test produces: Probably having 8 million rows instead of 99 million does not make the difference. What is crucial, though, is the serialization kind of the column. With the whole dataset:

SELECT
column,
serialization_kind
FROM system.parts_columns
WHERE (`table` = 'hits') AND (column = 'IsRefresh')
Query id: 018de9e4-0902-4d73-87b1-a72f729591e9
┌─column────┬─serialization_kind─┐
1. │ IsRefresh │ Sparse │
2. │ IsRefresh │ Default │
3. │ IsRefresh │ Default │
4. │ IsRefresh │ Default │
└───────────┴────────────────────┘
|
@mkmkme, I think this can be a reproducer for the issue: https://fiddle.clickhouse.com/fd9aa80c-8435-4633-932b-f44497839d97 |
Hey @korowa, thanks a lot! That did the trick. Oddly enough I did try something similar as a test but couldn't make it fail. I'll push this test into the branch |
@@ -0,0 +1,18 @@
CREATE TABLE sum_overflow(key UInt128, val UInt16) ENGINE = MergeTree ORDER BY tuple();

insert into sum_overflow SELECT number, rand() % 10000 = 0 from numbers(100000)
We can actually use number % 10000 just to keep the dataset deterministic (I just started with rand(), and realized that it's unnecessary only now)
First of all, thanks to both of you for making the effort to create a test!
Second, I would also prefer the deterministic test over a non-deterministic one.
- insert into sum_overflow SELECT number, rand() % 10000 = 0 from numbers(100000)
+ insert into sum_overflow SELECT number, number % 10000 = 0 from numbers(100000)
I also need to drop the table after the test and probably need to have a better name for the table.
Is the FORMAT Null part still okay or do we want some actual output to verify?
I think it is okay. This test doesn't aim to test the correctness of the function, but the fact that it doesn't crash. Hopefully we already have tests to check its correctness.
Thanks, I've changed the test to be more deterministic, changed the table name and added a drop at the end of the test. Hope it looks fine now
|
For the time series function, I couldn't reproduce it. Tried with this query and tried to tweak the values/settings/etc. Let me ask the author. |
|
Okay, I found the reason it cannot be triggered with time series functions: we convert sparse columns to full columns in case there is more than a single column to aggregate: ClickHouse/src/Interpreters/Aggregator.cpp Line 1651 in 0ca9499 |
So AFAIU that means that the timeseries changes are not really applicable and I can simply drop them? |
At this moment, it is not important, but I think the question is more likely
Of these two, I would prefer option 1: keep it and keep the fix too. |
|
Actually I think the implementation of |
|
So, sorry for going back and forth, but let's remove those functions. They are not correct, the only reason they don't cause issues is because they aren't called at all. |
These functions had a missing nullptr check that could lead to a crash. During investigation, it was discovered that those functions are never called, because the sparse columns are converted to full columns when there is more than a single column to aggregate (see [1]). Therefore, it was decided to delete those functions for now [2]. [1]: https://github.com/ClickHouse/ClickHouse/blob/0ca9499b4f78f6ddb03835339514132de81547d5/src/Interpreters/Aggregator.cpp#L1651 [2]: ClickHouse#95301 (comment)
No problem, I'm happy to help :) Removed the functions, please have a look |
|
I am not sure if you merged master because of failed CI or because there were some conflicts, but if it was the former, then next time let me check the results first. It is not necessary to have a fully green CI, as we have flaky tests (we are constantly trying to fix them, but it is a never-ending battle 😄) |
Thanks! Yeah, I wanted to give it another try with the CI, the failures looked irrelevant to me :) Do you know when it can be merged? |
There is no "hard rule" about this. If everything is green, then for sure it can be merged. If the CI is not fully green, then somebody (probably me, as I am reviewing the PR) has to check the failures and decide whether they are related or not. My rule is: when I am in doubt, it is related. However, for this PR I think it is very unlikely that any of the failures will be related. So I would wait until the CI finishes, and after that I can check it, which timewise means either later today or tomorrow. I cannot promise though. |
76ce6c3
|
Thanks for the fix and the huge effort you put into this PR @mkmkme! |
|
Thanks for the quick review and merge :) |
This PR can be considered a follow-up for #88440. That PR added a missing nullptr check to the addBatchSparse function in IAggregateFunction.h. However, some of the child classes overrode that function, and those overridden functions still didn't have that nullptr check. That could lead to crashes.

Unfortunately, I was not able to create a stateless test case that would crash the server without the fix, but I have a consistent reproduction:
1. Download the partitioned hits dataset:
seq 0 99 | xargs -P10 -I{} bash -c 'wget --continue --progress=dot:giga https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_{}.parquet'
2. Create the hits table from ClickBench:
clickhouse-client < ClickBench/clickhouse/create.sql

This led to a crash in AggregateFunctionSum: https://pastila.nl/?000338e6/ca7e40b9d60363e10869f5ac6cd9ade7#lcyVOtREywPI3v8iQZCwgg==GCM

The first commit of this PR fixes that crash.
After that, I took a look and found this pattern in TimeSeries as well, so the second commit adds the missing check there.

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix aggregation of sparse columns for sum and timeseries when group_by_overflow_mode is set to any