Window Functions don't respect partitions (sometimes) #26580
Comments
If you do |
It does: The Plan looks like this:
and the specific query for this PLAN:
|
If it helps, this only happens when step0=step1=step2. For example, if the steps were distinct, the partitions are respected always. It seems almost as if the OR because of the ordering & ROWS in the frame, the partition "overflows". I'm going to try again using just |
AH! The event thing is a red herring. The real deal seems to be the The
|
I'm trying to reproduce it locally but without success. Here is what I use: https://gist.github.com/akuzm/9acd7139dd7d790889fb958c775bec08 The query should show partitions which contain more than one person_ids, but it's always empty. |
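The same check can also be done client-side on fetched rows. A minimal Python sketch, assuming each result row is a pair of the row's own person_id and the list of person_ids that a groupArray(person_id) over the window returned for it. The row shape and the name `leaking_frames` are hypothetical and not taken from the gist:

```python
def leaking_frames(rows):
    """Return the rows whose window frame contains a foreign person_id.

    Each row is (person_id, frame_person_ids). With PARTITION BY person_id,
    every frame should contain the row's own person_id and nothing else.
    """
    return [(pid, frame) for pid, frame in rows if set(frame) != {pid}]


# Hypothetical result rows: the frame for user "c" also picked up user "b".
rows = [("b", ["b", "b"]), ("c", ["b", "b", "c"])]
bad = leaking_frames(rows)
```

An empty result means every frame stayed inside its partition, which is what the query in the gist reports on a healthy run.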
I started with a clean install, and couldn't reproduce either. I suspect it may have something to do with my test database settings? Is there any other reason you can think of why it might happen in Database 1 but not in Database 2? Say, even if I insert the data into database 2 using the data in database 1? Also, is there a good way to check that the table in database 1 is the exact same as the table in database 2? I've simply been checking that Let me investigate this a bit to accurately replicate it in a fresh install. |
One reason might be a different MergeTree part structure, e.g. one table has a single part and another has multiple. You can control this by running Different settings might also be the reason. To see which settings are changed, run |
By the way, I added some debug checks to window functions. If you wish, you can download the latest debug build of the master branch as described here and try to reproduce with it. It would check both of your hypotheses -- that we use a different column, or that we go out of frame. Make sure to download the debug build, not the release one, because these checks are only enabled in debug. |
Nice, I'll try this! Is there a corresponding docker image I can use? Been running CH in docker, trying to connect it up to a local build seems a lot more involved + more moving parts for me. |
No docker for this, unfortunately. To run a local build, download the single binary, make a directory (e.g.
You'll need to initialize the configs in the local directory. Running Or you can run the entire script in a temporary DB w/o running the server, using |
Yep. You only need the |
Also, just to demonstrate that it's worth getting to the bottom of this: I wrote a test that uses your query (and then simplified it further, turns out the https://github.com/PostHog/posthog/pull/5317/checks?check_run_id=3144567440 (Got lucky on this run: Note how it passes for Python 3.7, but not for 3.8 and 3.9. And also, the failures are different in both - the first one failed for 3.8, while the second one failed for 3.9) and the corresponding change: https://github.com/PostHog/posthog/pull/5317/files#diff-2b21391aae2176bf6c4c4d9efdede29feae45a91e78f401f570b752d39f64c2f No matter how the table is populated, or whatever is happening in CH, the |
Hey @akuzm, finally got this to work (hacked my way around to make it work inside an Ubuntu docker container). The https://gist.github.com/neilkakkar/5d6b958924ec149d248f832ad92c5ea3 The first query (groupArray) returned: while the second query (uniq) returned: |
I reran this query with changing https://gist.github.com/neilkakkar/d37b81b48db22aa5fcda58e7b0fa1b2e The first query (groupArray) returned: while the second query (uniq) returned: Maybe it's clearer to read in here, so the first query:
and the second query:
|
OMG this is so broken... |
....... |
Never have I been so happy for validation 😂 - thought I was going crazy since I couldn't reliably reproduce it. Thanks for looking into it! |
Thanks for reproducing :) It was so tricky because it depends on how the rows are grouped into blocks, which can change because of background merges or other processing steps. You can sometimes see the block structure in |
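To illustrate why block structure should not matter, here is a small Python sketch of a block-streamed window aggregate. It is not ClickHouse's implementation. It assumes the input is already sorted by partition key and that the frame is the equivalent of ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW; the function name and the sample rows are hypothetical. The point is that the "current partition" state has to survive across block boundaries and reset only when the key changes:

```python
_NO_ROW = object()  # sentinel: no row has been processed yet


def running_group_array(blocks):
    """Cumulative groupArray per partition over rows that arrive in blocks.

    `blocks` is a list of blocks; each block is a list of
    (partition_key, value) rows, sorted by partition_key across the stream.
    """
    out = []
    current_key = _NO_ROW  # partition of the previous row, kept across blocks
    frame = []             # values accumulated for the current partition
    for block in blocks:
        for key, value in block:
            if key != current_key:
                # New partition: start a fresh frame, even in mid-block.
                current_key = key
                frame = []
            frame.append(value)
            out.append((key, list(frame)))
    return out


rows = [("b", "13:00"), ("b", "13:37"), ("c", "06:00")]
as_one_block = running_group_array([rows])
as_two_blocks = running_group_array([rows[:1], rows[1:]])
as_single_rows = running_group_array([[r] for r in rows])
```

All three calls should produce identical output, however the rows are split. A result that changes with the split is the symptom seen in this issue.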
Describe the bug
From time to time, window functions stop respecting their partitions.
I haven't been able to figure out the exact conditions that lead to this, but there's a reproducible(ish) example below.
Does it reproduce on recent release?
Yes
How to reproduce
SETTINGS allow_experimental_window_functions = 1
CREATE TABLE
+data statements for all tables involved
Here's the smallest reproducible example I could create (sorry it's still huge, but without the inner joins it works as expected, which is mystifying)
Basically, this query groups all the expected values in the frame together. In every frame, I'd only expect to see values corresponding to a person_id.
Expected behavior
The groupArray only returns values where person_id is correct.
Further, this is non-deterministic: it happens sometimes, but not always.
Error output (sometimes):
Notice the array:
['2021-06-09 13:37:00.000000','2021-06-09 13:00:00.000000','2021-06-12 06:00:00.000000']
- it has values from user b's frame (13:00, 13:37), while it should only have values from user c's frame: 06:00.
Additional context
This doesn't always happen, but often enough to be worrisome. I did a few tests to try and figure out how often it occurs, running the same query in batches of 100, and:
first two batches: all values correct.
third batch:
(most of them in the third batch were bad)
fourth batch: all values correct.
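For anyone who wants to repeat this kind of measurement, a rough Python sketch of the batch loop follows. `run_query` and `is_correct` are hypothetical callables that you would supply yourself (for example, a wrapper around your ClickHouse client and a check on the returned arrays); nothing here is tied to a particular client library:

```python
def count_bad_runs(run_query, is_correct, batches=4, batch_size=100):
    """Run the query batches * batch_size times.

    Returns a list with the number of incorrect results in each batch.
    """
    bad_per_batch = []
    for _ in range(batches):
        bad = sum(1 for _ in range(batch_size) if not is_correct(run_query()))
        bad_per_batch.append(bad)
    return bad_per_batch


# Stand-in for a real query: replays canned results (True = correct).
canned = iter([True, True, True, False, True, False])
counts = count_bad_runs(lambda: next(canned), bool, batches=2, batch_size=3)
```

With a real query, a list such as all zeros except one batch would match the pattern described above.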