Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Window Functions don't respect partitions (sometimes) #26580

Closed
neilkakkar opened this issue Jul 20, 2021 · 19 comments · Fixed by #26960
Closed

Window Functions don't respect partitions (sometimes) #26580

neilkakkar opened this issue Jul 20, 2021 · 19 comments · Fixed by #26960
Assignees
Labels
bug Confirmed user-visible misbehaviour in official release comp-window-functions

Comments

@neilkakkar
Copy link

neilkakkar commented Jul 20, 2021

Describe the bug

From time to time, window functions stop respecting their partitions.

I haven't been able to figure out the exact conditions that lead to this, but there's a reproducible(ish) example below.

A clear and concise description of what works not as it is supposed to.

Does it reproduce on recent release?

Yes

How to reproduce

  • Which ClickHouse server version to use: 21.7.4.18
  • Non-default settings, if any: SETTINGS allow_experimental_window_functions = 1
  • CREATE TABLE +data statements for all tables involved
CREATE DATABASE IF NOT EXISTS playground;
CREATE TABLE playground.events_prod
(
    uuid UUID,
    event VARCHAR,
    timestamp DateTime64(6, 'UTC'),
    distinct_id VARCHAR,
    team_id Int64,
    _timestamp DateTime

) ENGINE = ReplacingMergeTree(_timestamp)
PARTITION BY toYYYYMM(timestamp)
ORDER BY (team_id, toDate(timestamp), distinct_id, uuid)
SAMPLE BY uuid;

INSERT INTO playground.events_prod VALUES
    ( generateUUIDv4(), 'step one', toDateTime('2021-06-08 18:00:00'), 'user a', 1, now()),
    ( generateUUIDv4(), 'step one', toDateTime('2021-06-08 19:00:00'), 'user a', 1, now())
    ( generateUUIDv4(), 'step one', toDateTime('2021-06-08 21:00:00'), 'user a', 1, now())
    ( generateUUIDv4(), 'step one', toDateTime('2021-06-09 13:00:00'), 'user b', 1, now())
    ( generateUUIDv4(), 'step one', toDateTime('2021-06-09 13:37:00'), 'user b', 1, now())
    ( generateUUIDv4(), 'step one', toDateTime('2021-06-11 07:00:00'), 'user c', 1, now())
    ( generateUUIDv4(), 'step one', toDateTime('2021-06-12 06:00:00'), 'user c', 1, now());

CREATE TABLE playground.person
(
    id UUID,
    created_at DateTime64,
    team_id Int64,
    _timestamp DateTime
) ENGINE = ReplacingMergeTree(_timestamp)
Order By (team_id, id);

INSERT INTO playground.person values
    ('017ac40e-08de-0000-10cd-1b37fea376c4', now(), 1, now()),
    ('017ac40e-08ec-0000-3fc6-fe5226831996', now(), 1, now()),
    ('017ac40e-08ce-0000-bb19-ac0dedd370ea', now(), 1, now());

CREATE TABLE playground.person_distinct_id
(
    id Int64,
    distinct_id VARCHAR,
    person_id UUID,
    team_id Int64,
    _timestamp DateTime
) ENGINE = ReplacingMergeTree(_timestamp)
Order By (team_id, distinct_id, person_id, id)

INSERT INTO playground.person_distinct_id VALUES
    (1, 'user a', '017ac40e-08de-0000-10cd-1b37fea376c4', 1, now())
    (2, 'user c', '017ac40e-08ec-0000-3fc6-fe5226831996', 1, now())
    (3, 'user b', '017ac40e-08ce-0000-bb19-ac0dedd370ea', 1, now());
  • Queries to run that lead to unexpected result

Here's the smallest reproducible example I could create (sorry it's still huge, but without the inner joins, it works as expected, which is mistifying)

Basically, this query groups all the expected values in the frame together. In every frame, I'd only expect to see values corresponding to a person_id.

SELECT person_id,
       timestamp,
       step_0,
       latest_0,
       step_1,
       groupArray(latest_1) over (PARTITION by person_id ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) latest_1,
       step_2,
       groupArray(latest_2) over (PARTITION by person_id ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) latest_2
  FROM (
        SELECT person_id,
               timestamp,
               if(event = 'step one', 1, 0) as step_0,
               if(step_0 = 1, timestamp, null) as latest_0,
               if(event = 'step one', 1, 0) as step_1,
               if(step_1 = 1, timestamp, null) as latest_1,
               if(event = 'step one', 1, 0) as step_2,
               if(step_2 = 1, timestamp, null) as latest_2
          FROM (
                SELECT e.event as event,
                       e.team_id as team_id,
                       e.distinct_id as distinct_id,
                       e.timestamp as timestamp,
                       pdi.person_id as person_id
                  FROM playground.events_prod e
                 INNER JOIN (
                        SELECT person_id,
                               distinct_id
                          FROM playground.person_distinct_id
                       ) AS pdi
                    ON e.distinct_id = pdi.distinct_id
                 WHERE team_id = 1
                   AND event IN ['step one', 'step one', 'step one', 'step one', 'step one', 'step one']
                   AND timestamp >= '2021-06-07 00:00:00'
                   AND timestamp <= '2021-06-13 23:59:59'
               ) events
         WHERE (step_0 = 1 OR step_1 = 1 OR step_2 = 1)
       ) SETTINGS allow_experimental_window_functions = 1

Expected behavior

The groupArray only returns values where person_id is correct.
Further, this is non-deterministic, it happens some times, but not all times.

Error output (sometimes):

017ac40e-08ec-0000-3fc6-fe5226831996,2021-06-11 07:00:00.000000,1,2021-06-11 07:00:00.000000,1,"['2021-06-09 13:37:00.000000','2021-06-09 13:00:00.000000','2021-06-12 06:00:00.000000']",1,"['2021-06-09 13:37:00.000000','2021-06-09 13:00:00.000000','2021-06-12 06:00:00.000000']"

Notice the array: ['2021-06-09 13:37:00.000000','2021-06-09 13:00:00.000000','2021-06-12 06:00:00.000000'] - it has values from user bs frame (13:00, 13:37), while it should only have values from user cs frame: 06:00.

Additional context

This doesn't happen always, but often enough to be worrysome. I did a few tests to try and figure out how often it occurs, running the same query in batches of 100s, and:

first two batches: all values correct.
third batch:

[bad, bad, bad, bad, bad, bad, bad, bad, bad, bad, bad, bad, bad, bad, good, bad, bad, bad, bad, bad, bad, bad, bad, bad, good, bad, bad, bad, bad, bad, bad, bad, bad, bad, bad, good, good, bad, bad, bad, bad, bad, bad, bad, bad, bad, bad, bad, bad, bad, good, bad, bad, bad, bad, bad, bad, bad, bad, bad, bad, bad, bad, bad, bad, bad, bad, bad, good, bad, bad, bad, good, bad, bad, good, good, bad, bad, bad, bad, bad, bad, bad, good, good, bad, bad, bad, good, bad, bad, bad, good, bad, good, bad, bad, good, good]

(most of them in the third batch were bad)
fourth batch: all values correct.

@neilkakkar neilkakkar added the bug Confirmed user-visible misbehaviour in official release label Jul 20, 2021
@abyss7 abyss7 assigned abyss7 and akuzm and unassigned abyss7 Jul 21, 2021
@akuzm
Copy link
Contributor

akuzm commented Jul 21, 2021

Notice the array: ['2021-06-09 13:37:00.000000','2021-06-09 13:00:00.000000','2021-06-12 06:00:00.000000'] - it has values from user bs frame (13:00, 13:37), while it should only have values from user cs frame: 06:00.

If you do groupArray(person_id) over (partition by person_id ....), does it also have different person_ids? Maybe some sorting problem, the plan looks correct though...

@neilkakkar
Copy link
Author

neilkakkar commented Jul 22, 2021

It does: ['017acde4-f4be-0000-38c0-c64214745da0','017acde4-f4be-0000-38c0-c64214745da0','017acde4-f4a0-0000-7f61-e9fa34a1397e']

The Plan looks like this:

[
    ('Expression ((Projection + Before ORDER BY))',),
    ("  Window (Window step for window 'PARTITION BY person_id ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING')",),
    ("    MergingSorted (Merge sorted streams for window 'PARTITION BY person_id ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING')",),
    ("      MergeSorting (Merge sorted blocks for window 'PARTITION BY person_id ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING')",),
    ("        PartialSorting (Sort each block for window 'PARTITION BY person_id ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING')",),
    ('          Expression ((Before window functions + (Projection + Before ORDER BY)))',),
    ('            Filter ((WHERE + (Projection + Before ORDER BY)))',),
    ('              Filter (WHERE)',),
    ('                Join (JOIN)',),
    ('                  Expression (Before JOIN)',),
    ('                    SettingQuotaAndLimits (Set limits and quota after reading from storage)',),
    ('                      ReadFromMergeTree',),
    ('                  Expression ((Joined actions + (Rename joined columns + (Projection + (Before ORDER BY + (Projection + Before ORDER BY))))))',),
    ('                    Join (JOIN)',),
    ('                      Expression (Before JOIN)',),
    ('                        SettingQuotaAndLimits (Set limits and quota after reading from storage)',),
    ('                          ReadFromMergeTree',),
    ('                      Expression ((Joined actions + (Rename joined columns + (Projection + Before ORDER BY))))',),
    ('                        Aggregating',),
    ('                          Expression (Before GROUP BY)',),
    ('                            SettingQuotaAndLimits (Set limits and quota after reading from storage)',),
    ('                              ReadFromMergeTree',)
]

and the specific query for this PLAN:

EXPLAIN PLAN SELECT person_id,
       timestamp,
       step_0,
       latest_0,
       step_1,
       groupArray(person_id) over (PARTITION by person_id ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) latest_1,
       step_2,
       groupArray(person_id) over (PARTITION by person_id ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) latest_2
  FROM (
        SELECT person_id,
               timestamp,
               if(event = 'step one', 1, 0) as step_0,
               if(step_0 = 1, timestamp, null) as latest_0,
               if(event = 'step one', 1, 0) as step_1,
               if(step_1 = 1, timestamp, null) as latest_1,
               if(event = 'step one', 1, 0) as step_2,
               if(step_2 = 1, timestamp, null) as latest_2
          FROM (
                SELECT e.event as event,
                       e.team_id as team_id,
                       e.distinct_id as distinct_id,
                       e.timestamp as timestamp,
                       e.properties as properties,
                       e.elements_chain as elements_chain,
                       pdi.person_id as person_id
                  FROM posthog_test.events e
                 INNER JOIN (
                        SELECT person_id,
                               distinct_id
                          FROM (
                                SELECT *
                                  FROM posthog_test.person_distinct_id
                                  JOIN (
                                        SELECT distinct_id,
                                               max(_offset) as _offset
                                          FROM posthog_test.person_distinct_id
                                         GROUP BY distinct_id
                                       ) as person_max USING (distinct_id, _offset)
                               )
                       ) AS pdi
                    ON e.distinct_id = pdi.distinct_id
                 WHERE team_id = 2
                   AND event IN ['step one', 'step one', 'step one', 'step one', 'step one', 'step one']
                   AND timestamp >= '2021-06-07 00:00:00'
                   AND timestamp <= '2021-06-13 23:59:59'
               ) events
         WHERE (step_0 = 1 OR step_1 = 1 OR step_2 = 1)
       ) SETTINGS allow_experimental_window_functions = 1

@neilkakkar
Copy link
Author

neilkakkar commented Jul 22, 2021

If it helps, this only happens when step0=step1=step2.

For example, if the steps were distinct, the partitions are respected always.

It seems almost as if the event is the partition key, instead of person_id.

OR

because of the ordering & ROWS in the frame, the partition "overflows".

I'm going to try again using just PARTITION BY person_id in the window clause, to see if it still takes the wrong partitions.

@neilkakkar
Copy link
Author

neilkakkar commented Jul 22, 2021

AH! The event thing is a red-herring. The real deal seems to be the ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING

The X PRECEDING in the end is the problem. (That's what's different when you have distinct vs non-distinct events).

PARTITION clause Result
PARTITION BY person_id good
PARTITION BY person_id ORDER BY timestamp DESC good
PARTITION BY person_id ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING good
... ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING good
... ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING bad
... ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING bad
... ROWS BETWEEN UNBOUNDED PRECEDING AND 2 PRECEDING bad
... ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING bad

bad means: within 5 tries, I got a wrong partitioning.
good means: within 5 tries, I didn't get a wrong partitioning.

@akuzm
Copy link
Contributor

akuzm commented Jul 22, 2021

I'm trying to reproduce it locally but without success. Here is what I use: https://gist.github.com/akuzm/9acd7139dd7d790889fb958c775bec08

The query should show partitions which contain more than one person_ids, but it's always empty.

@neilkakkar
Copy link
Author

I started with a clean install, and couldn't reproduce either. I suspect it may have something to do with my test database settings? Is there any other reason you can think of why it might happen in Database 1 but not in Database 2? Say, even if I insert the data into database 2 using the data in database 1?

Also, is there a good way to check that the table in database 1 is the exact same as the table in database 2? I've simply been checking that DESCRIBE gives the same result.

Let me investigate this a bit to accurately replicate it in a fresh install.

@akuzm
Copy link
Contributor

akuzm commented Jul 23, 2021

One reason might be the different MergeTree part structure, e.g. in some table there is single part and in some -- multiple. You can control this by runnging system stop merges <table> that disables background merges, and then inserting in several queries. One insert (up to some size limit) becomes one part. But your example script does single inserts, so this is probably not the reason here.

Different settings might also be the reason. To see which settings are changed, run select * from system.settings where changed.

@akuzm
Copy link
Contributor

akuzm commented Jul 23, 2021

By the way, I added some debug checks to window functions. If you wish, you can download a latest debug build of the master branch as described here and try to reproduce with it. It would check both of your hypotheses -- that we use a different column, or that we go out of frame. Make sure to download the debug build, not the release one, because these checks are only enabled in debug.

@neilkakkar
Copy link
Author

Nice, I'll try this! Is there a corresponding docker image I can use? Been running CH in docker, trying to connect it up to a local build seems a lot more involved + more moving parts for me.

@akuzm
Copy link
Contributor

akuzm commented Jul 23, 2021

Nice, I'll try this! Is there a corresponding docker image I can use? Been running CH in docker, trying to connect it up to a local build seems a lot more involved + more moving parts for me.

No docker for this, unfortunately. To run local build, download the single binary, make a directory (e.g. local-server for configs and data, and run something like:

cd local-server
./clickhouse server

You'll need to initialize the configs in the local directory. Running <your CH source dir>/contrib/tests/config/install.sh local-server local-server/client is probably the simplest option, it will prepare a configuration we normally use for tests.

Or you can run the entire script in a temporary DB w/o running the server, using clickhouse-local, e.g. ./clickhouse local --queries-file=test-queries.sql.

@neilkakkar
Copy link
Author

Got it. Just to confirm:
image

This is the correct build, right? (2.6GB!)

@akuzm
Copy link
Contributor

akuzm commented Jul 23, 2021

Yep. You only need the clickhouse binary. It is huge because of the debug symbols, for release we split them into a separate package.

@neilkakkar
Copy link
Author

neilkakkar commented Jul 23, 2021

Also, just to demonstrate that it's worth getting to the bottom of this: I wrote a test that uses your query (and then simplified it further, turns out the INNER JOIN isn't necessary to make it happen)

https://github.com/PostHog/posthog/pull/5317/checks?check_run_id=3144567440

(Got lucky on this run: Note how it passes for Python 3.7, but not not for 3.8 and 3.9. And also, the failures are different in both - the first one failed for 3.8, while the second one failed for 3.9)

and the corresponding change: https://github.com/PostHog/posthog/pull/5317/files#diff-2b21391aae2176bf6c4c4d9efdede29feae45a91e78f401f570b752d39f64c2f

No matter how the table is populated, or whatever is happening in CH, the uniq query should always return an empty result.

@neilkakkar
Copy link
Author

Hey @akuzm , finally got this to work (hacked my way around to make it work inside an Ubuntu docker container).

The sample by uuid wasn't working, so I removed that, but thankfully the error still happens. Here's the log dump:

https://gist.github.com/neilkakkar/5d6b958924ec149d248f832ad92c5ea3

The first query (groupArray) returned: [(['user b', 'user b'],), (['user b', 'user b', 'user c'],), ([],), (['user b'],), ([],), (['user a'],), (['user a', 'user a'],)]

while the second query (uniq) returned:[(2,)]

@neilkakkar
Copy link
Author

I reran this query with changing 1 PRECEDING to 0 PRECEDING (the test then passes), and here's how the log looks like:

https://gist.github.com/neilkakkar/d37b81b48db22aa5fcda58e7b0fa1b2e

The first query (groupArray) returned: [(['user c'],), (['user c', 'user c'],), (['user b'],), (['user b', 'user b'],), (['user a'],), (['user a', 'user a'],), (['user a', 'user a', 'user a'],)]

while the second query (uniq) returned: [].

Maybe it's clearer to read in here, so the first query:

SELECT
            groupArray(person_id) over (PARTITION by person_id ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
        FROM (
                SELECT person_id,
                    timestamp,
                    if(event = 'step one', 1, 0) as step_0,
                    if(event = 'step one', 1, 0) as step_1,
                    if(event = 'step one', 1, 0) as step_2
                FROM (
                        SELECT e.event as event,
                            e.team_id as team_id,
                            e.distinct_id as person_id,
                            e.timestamp as timestamp
                        FROM posthog_test.events e
                        WHERE team_id = {self.team.pk}
                        AND event IN ['step one']
                        AND timestamp >= '2021-06-08 00:00:00'
                        AND timestamp <= '2021-06-13 23:59:59'
                    )
                WHERE (step_0 = 1 OR step_1 = 1 OR step_2 = 1)
            )
            order by timestamp desc SETTINGS allow_experimental_window_functions = 1;

and the second query:

SELECT *
        FROM
        (
            SELECT
                uniq(person_id) OVER w AS unique
            FROM
            (
                SELECT
                    person_id,
                    timestamp,
                    if(event = 'step one', 1, 0) AS step_0,
                    if(event = 'step one', 1, 0) AS step_1,
                    if(event = 'step one', 1, 0) AS step_2
                FROM
                (
                    SELECT
                        e.event AS event,
                        e.team_id AS team_id,
                        e.distinct_id AS person_id,
                        e.timestamp AS timestamp
                    FROM posthog_test.events AS e
                    WHERE (team_id = {self.team.pk}) AND (event IN ['step one'])
                      AND (timestamp >= '2021-06-07 00:00:00')
                      AND (timestamp <= '2021-06-13 23:59:59')
                ) AS events
                WHERE (step_0 = 1) OR (step_1 = 1) OR (step_2 = 1)
            )
            WINDOW w AS (PARTITION BY person_id ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
        )
        WHERE unique > 1
        SETTINGS allow_experimental_window_functions = 1

@akuzm
Copy link
Contributor

akuzm commented Jul 29, 2021

reate database if not exists posthog_test;
use posthog_test;
drop table if exists events;
CREATE TABLE events ( uuid UUID, event VARCHAR, timestamp DateTime64(6, 'UTC'), team_id Int64, distinct_id VARCHAR, created_at DateTime64(6, 'UTC'), _timestamp DateTime, _offset UInt64 ) ENGINE = ReplacingMergeTree(_timestamp) PARTITION BY toYYYYMM(timestamp) ORDER BY (team_id, toDate(timestamp), distinct_id, uuid) ;
INSERT INTO events SELECT '8cbf8225-e177-491f-839b-5ed5de0ae832', 'step one', '2021-06-08 18:00:00.000000', 5, 'user a', '2021-06-08 18:00:00.000000', now(), 0;
INSERT INTO events SELECT '07fa9072-16cd-47d9-a83d-28cdcc9008f7', 'step one', '2021-06-08 19:00:00.000000', 5, 'user a', '2021-06-08 19:00:00.000000', now(), 0;
INSERT INTO events SELECT 'a92da76b-3f5e-4c9e-ab2a-cb85e5c67643', 'step one', '2021-06-08 21:00:00.000000', 5, 'user a', '2021-06-08 21:00:00.000000', now(), 0;
INSERT INTO events SELECT '0e2f20ef-d51e-4532-bff0-a66152141c2f', 'step one', '2021-06-09 13:00:00.000000', 5, 'user b', '2021-06-09 13:00:00.000000', now(), 0;
INSERT INTO events SELECT 'b91f2c86-b7d2-442d-b527-a041717c616a', 'step one', '2021-06-09 13:37:00.000000', 5, 'user b', '2021-06-09 13:37:00.000000', now(), 0;
INSERT INTO events SELECT 'f4a7cd29-f59d-48b6-b93f-838212dd2a73', 'step one', '2021-06-11 07:00:00.000000', 5, 'user c', '2021-06-11 07:00:00.000000', now(), 0;
INSERT INTO events SELECT '2fd51cc8-2c54-4266-b0e1-57c356f3ca04', 'step one', '2021-06-12 06:00:00.000000', 5, 'user c', '2021-06-12 06:00:00.000000', now(), 0;


SELECT
    distinct_id,
    groupArray(distinct_id) OVER w,
    row_number() OVER w
FROM posthog_test.events
WINDOW w AS (PARTITION BY distinct_id ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)

Query id: 1f30464a-3081-4360-a4db-427720e17d77

┌─distinct_id─┬─groupArray(distinct_id) OVER w─┬─row_number() OVER w─┐
│ user a      │ []                             │                   1 │
│ user a      │ ['user a']                     │                   2 │
│ user a      │ ['user a','user a']            │                   3 │
│ user b      │ []                             │                   1 │
│ user b      │ ['user b']                     │                   2 │
└─────────────┴────────────────────────────────┴─────────────────────┘
┌─distinct_id─┬─groupArray(distinct_id) OVER w─┬─row_number() OVER w─┐
│ user c      │ ['user b','user b']            │                   3 │
└─────────────┴────────────────────────────────┴─────────────────────┘
┌─distinct_id─┬─groupArray(distinct_id) OVER w─┬─row_number() OVER w─┐
│ user c      │ ['user b','user b','user c']   │                   4 │
└─────────────┴────────────────────────────────┴─────────────────────┘

OMG this is so broken...

@akuzm
Copy link
Contributor

akuzm commented Jul 29, 2021

*2* :) select number, row_number() over (partition by number rows between unbounded preceding and 1 preceding) from numbers(4) settings max_block_size = 2;

SELECT
    number,
    row_number() OVER (PARTITION BY number ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
FROM numbers(4)
SETTINGS max_block_size = 2

Query id: 7877ce37-8133-40b1-b012-4fafc485df2a

┌─number─┬─row_number() OVER (PARTITION BY number ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)─┐
│      0 │                                                                                        1 │
│      1 │                                                                                        1 │
└────────┴──────────────────────────────────────────────────────────────────────────────────────────┘
┌─number─┬─row_number() OVER (PARTITION BY number ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)─┐
│      2 │                                                                                        2 │
│      3 │                                                                                        1 │
└────────┴──────────────────────────────────────────────────────────────────────────────────────────┘

.......

@neilkakkar
Copy link
Author

OMG this is so broken...

Never have I been so happy for validation 😂 - thought I was going crazy since I couldn't reliably reproduce it.

Thanks for looking into it!

@akuzm
Copy link
Contributor

akuzm commented Jul 29, 2021

OMG this is so broken...

Never have I been so happy for validation - thought I was going crazy since I couldn't reliably reproduce it.

Thanks for looking into it!

Thanks for reproducing :) It was so tricky because it depends on how the rows are grouped into blocks, which can change because of background merges or other processing steps. You can sometimes see the block structure in PrettyCompact format, it prints the header again for every block. Hopefully the fix will be simple.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Confirmed user-visible misbehaviour in official release comp-window-functions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants