WITH FILL improvement (PARTITION BY, POPULATE) #33203

Open
UnamedRus opened this issue Dec 26, 2021 · 24 comments
Labels
feature minor Priority: minor

Comments

@UnamedRus
Contributor

UnamedRus commented Dec 26, 2021

Use case
Improve the usability of WITH FILL:
Allow it to be used with multiple time series.
Backfill some columns with the previous value instead of the default.

Describe the solution you'd like

Current behavior:

SELECT *
FROM values('dt Date, product String, sales UInt32', ('2021-12-01', 'A', 10), ('2021-12-05', 'A', 30), ('2021-12-02', 'B', 1), ('2021-12-07', 'B', 8))

┌─────────dt─┬─product─┬─sales─┐
│ 2021-12-01 │ A       │    10 │
│ 2021-12-05 │ A       │    30 │
│ 2021-12-02 │ B       │     1 │
│ 2021-12-07 │ B       │     8 │
└────────────┴─────────┴───────┘

SELECT *
FROM values('dt Date, product String, sales UInt32', ('2021-12-01', 'A', 10), ('2021-12-05', 'A', 30), ('2021-12-02', 'B', 1), ('2021-12-07', 'B', 8))
ORDER BY dt ASC WITH FILL

┌─────────dt─┬─product─┬─sales─┐
│ 2021-12-01 │ A       │    10 │
│ 2021-12-02 │ B       │     1 │
│ 2021-12-03 │         │     0 │
│ 2021-12-04 │         │     0 │
│ 2021-12-05 │ A       │    30 │
│ 2021-12-06 │         │     0 │
│ 2021-12-07 │ B       │     8 │
└────────────┴─────────┴───────┘

Rows 7

SELECT *
FROM values('dt Date, product String, sales UInt32', ('2021-12-01', 'A', 10), ('2021-12-05', 'A', 30), ('2021-12-02', 'B', 1), ('2021-12-07', 'B', 8))
ORDER BY
    product ASC,
    dt ASC WITH FILL

┌─────────dt─┬─product─┬─sales─┐
│ 2021-12-01 │ A       │    10 │
│ 2021-12-02 │         │     0 │
│ 2021-12-03 │         │     0 │
│ 2021-12-04 │         │     0 │
│ 2021-12-05 │ A       │    30 │
│ 2021-12-02 │ B       │     1 │
│ 2021-12-05 │         │     0 │
│ 2021-12-06 │         │     0 │
│ 2021-12-07 │ B       │     8 │
└────────────┴─────────┴───────┘

Rows 9

SELECT *
FROM values('dt Date, product String, sales UInt32', ('2021-12-01', 'A', 10), ('2021-12-05', 'A', 30), ('2021-12-02', 'B', 1), ('2021-12-07', 'B', 8))
ORDER BY
    product DESC,
    dt ASC WITH FILL

┌─────────dt─┬─product─┬─sales─┐
│ 2021-12-02 │ B       │     1 │
│ 2021-12-03 │         │     0 │
│ 2021-12-04 │         │     0 │
│ 2021-12-05 │         │     0 │
│ 2021-12-06 │         │     0 │
│ 2021-12-07 │ B       │     8 │
│ 2021-12-01 │ A       │    10 │
│ 2021-12-05 │ A       │    30 │
└────────────┴─────────┴───────┘

Rows 8

It's a bit unexpected that WITH FILL produces a different result set depending on the sorting condition.

PARTITION BY expr

SELECT *
FROM values('dt Date, product String, sales UInt32', ('2021-12-01', 'A', 10), ('2021-12-05', 'A', 30), ('2021-12-02', 'B', 1), ('2021-12-07', 'B', 8))
ORDER BY
    product ASC,
    dt ASC WITH FILL PARTITION BY product
	
┌─────────dt─┬─product─┬─sales─┐
│ 2021-12-01 │ A       │    10 │
│ 2021-12-02 │         │     0 │
│ 2021-12-03 │         │     0 │
│ 2021-12-04 │         │     0 │
│ 2021-12-05 │ A       │    30 │
│ 2021-12-02 │ B       │     1 │
│ 2021-12-03 │         │     0 │
│ 2021-12-04 │         │     0 │
│ 2021-12-05 │         │     0 │
│ 2021-12-06 │         │     0 │
│ 2021-12-07 │ B       │     8 │
└────────────┴─────────┴───────┘

POPULATE column_name

SELECT *
FROM values('dt Date, product String, sales UInt32', ('2021-12-01', 'A', 10), ('2021-12-05', 'A', 30), ('2021-12-02', 'B', 1), ('2021-12-07', 'B', 8))
ORDER BY
    product ASC,
    dt ASC WITH FILL PARTITION BY product POPULATE product
	
┌─────────dt─┬─product─┬─sales─┐
│ 2021-12-01 │ A       │    10 │
│ 2021-12-02 │ A       │     0 │
│ 2021-12-03 │ A       │     0 │
│ 2021-12-04 │ A       │     0 │
│ 2021-12-05 │ A       │    30 │
│ 2021-12-02 │ B       │     1 │
│ 2021-12-03 │ B       │     0 │
│ 2021-12-04 │ B       │     0 │
│ 2021-12-05 │ B       │     0 │
│ 2021-12-06 │ B       │     0 │
│ 2021-12-07 │ B       │     8 │
└────────────┴─────────┴───────┘

POPULATE was implemented as the INTERPOLATE expression in PR #35349.
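
For illustration, a rough sketch of how the column-backfill part of POPULATE could be approximated with the INTERPOLATE expression from #35349 (without partitioning, the fill still crosses product boundaries, so this only covers carrying the column value forward, not the per-product fill ranges):

SELECT *
FROM values('dt Date, product String, sales UInt32', ('2021-12-01', 'A', 10), ('2021-12-05', 'A', 30), ('2021-12-02', 'B', 1), ('2021-12-07', 'B', 8))
ORDER BY
    product ASC,
    dt ASC WITH FILL
INTERPOLATE ( product AS product ) -- filled rows repeat the previous product instead of the default ''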

@devcrafter
Member

devcrafter commented Feb 6, 2023

AFAIS, in the PARTITION BY example, the clause is not really necessary. It should be possible to group filling by the sorting prefix in ORDER BY by default. Or do we need end-to-end filling (i.e. without grouping) in some cases?

@UnamedRus @KochetovNicolai

@UnamedRus
Contributor Author

We may not have the product ASC column in ORDER BY, and logic like "all columns except the last numeric column in ORDER BY act as a kind of PARTITION BY key" is a bit weak and not user friendly.

@devcrafter
Member

devcrafter commented Feb 6, 2023

logic like "all columns except the last numeric column in ORDER BY act as a kind of PARTITION BY key"

Not sure I understand where "except the last numeric column" comes from. I proposed to use the ORDER BY sorting prefix before the column WITH FILL as the partition key.
Example:
... ORDER BY sensor_id, timestamp WITH FILL, value ... -- sensor_id is the sorting prefix here

It doesn't eliminate the need for PARTITION BY if the user doesn't need/want sorting by the sensor_id column. But in the absence of PARTITION BY it could be a working solution (and it should be easier to implement).
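
To make the two alternatives concrete, a rough sketch of the query shapes (sensors is a hypothetical table, and the PARTITION BY form is the proposed syntax from the issue body, not an existing clause):

-- implicit grouping derived from the sorting prefix (sensor_id):
SELECT * FROM sensors ORDER BY sensor_id, timestamp WITH FILL;

-- explicit grouping, independent of the sort order (proposed syntax):
SELECT * FROM sensors ORDER BY timestamp WITH FILL PARTITION BY sensor_id;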

@devcrafter
Member

devcrafter commented Feb 13, 2023

After some discussions, I think PARTITION BY is necessary to control WITH FILL behavior. I'll put down some info gathered during the discussion here, so it doesn't have to be reconstructed later.

(0) Currently, a column WITH FILL is independent of the order of any other columns. It depends on the starting value, either in the column itself or in the FILL FROM clause, and it is simply incremented according to STEP. Whenever the current filling value is greater than the latest seen column value and less than the next existing column value, we fill the column with the current filling value (and the rest of the columns according to INTERPOLATE or default values) and increment it by the step for the next possible filling. In the example below, rows with row_number = 0 were created by the WITH FILL clause:

SELECT
    rowNumberInAllBlocks() + 1 AS row_number,
    *
FROM values('sensor_id UInt64, timestamp DateTime64(3, \'UTC\'), value Float64', (234, '2021-12-01 00:00:01', 1), (234, '2021-12-01 00:00:07', 2), (432, '2021-12-01 00:00:02', 0), (432, '2021-12-01 00:00:10', 3))
ORDER BY
    sensor_id ASC,
    timestamp ASC WITH FILL STEP 2
INTERPOLATE ( sensor_id AS sensor_id, value AS value )

Query id: 2aadaa59-44d2-4e43-9429-a903bbd0fd5f

┌─row_number─┬─sensor_id─┬───────────────timestamp─┬─value─┐
│          1 │       234 │ 2021-12-01 00:00:01.000 │     1 │
│          0 │       234 │ 2021-12-01 00:00:03.000 │     1 │
│          0 │       234 │ 2021-12-01 00:00:05.000 │     1 │
│          2 │       234 │ 2021-12-01 00:00:07.000 │     2 │
│          3 │       432 │ 2021-12-01 00:00:02.000 │     0 │
│          0 │       432 │ 2021-12-01 00:00:07.000 │     0 │
│          0 │       432 │ 2021-12-01 00:00:09.000 │     0 │
│          4 │       432 │ 2021-12-01 00:00:10.000 │     3 │
└────────────┴───────────┴─────────────────────────┴───────┘

(1) A kind of special case of the current behavior, which is probably unexpected if we consider grouping by the sorting prefix.
Just for clarity, the query without FILL:

SELECT
    rowNumberInAllBlocks() + 1 AS row_number,
    *
FROM values('sensor_id UInt64, timestamp DateTime64(3, \'UTC\'), value Float64', (234, '2021-12-01 00:00:00', 1), (234, '2021-12-01 00:00:02', 2), (432, '2021-12-01 00:00:05', 0), (432, '2021-12-01 00:00:07', 3))
ORDER BY
    sensor_id ASC,
    timestamp ASC

┌─row_number─┬─sensor_id─┬───────────────timestamp─┬─value─┐
│          1 │       234 │ 2021-12-01 00:00:00.000 │     1 │
│          2 │       234 │ 2021-12-01 00:00:02.000 │     2 │
│          3 │       432 │ 2021-12-01 00:00:05.000 │     0 │
│          4 │       432 │ 2021-12-01 00:00:07.000 │     3 │
└────────────┴───────────┴─────────────────────────┴───────┘

In the example below, there is filling (the rows with row_number = 0) between row_number 2 and 3. The filling is done because the last timestamp for sensor_id 234 is less than the first timestamp for sensor_id 432.

SELECT
    rowNumberInAllBlocks() + 1 AS row_number,
    *
FROM values('sensor_id UInt64, timestamp DateTime64(3, \'UTC\'), value Float64', (234, '2021-12-01 00:00:00', 1), (234, '2021-12-01 00:00:02', 2), (432, '2021-12-01 00:00:05', 0), (432, '2021-12-01 00:00:07', 3))
ORDER BY
    sensor_id ASC,
    timestamp ASC WITH FILL
INTERPOLATE ( sensor_id AS sensor_id, value AS value )

┌─row_number─┬─sensor_id─┬───────────────timestamp─┬─value─┐
│          1 │       234 │ 2021-12-01 00:00:00.000 │     1 │
│          0 │       234 │ 2021-12-01 00:00:01.000 │     1 │
│          2 │       234 │ 2021-12-01 00:00:02.000 │     2 │
│          0 │       234 │ 2021-12-01 00:00:03.000 │     2 │
│          0 │       234 │ 2021-12-01 00:00:04.000 │     2 │
│          3 │       432 │ 2021-12-01 00:00:05.000 │     0 │
│          0 │       432 │ 2021-12-01 00:00:06.000 │     0 │
│          4 │       432 │ 2021-12-01 00:00:07.000 │     3 │
└────────────┴───────────┴─────────────────────────┴───────┘

@alexey-milovidov Would you add some thoughts?

cc @yakov-olkhovskiy

@LukeHarrimanHQ

+1 for this feature

@petrjanda

+1

@LukeHarrimanHQ

@alexey-milovidov Are you able to provide any thoughts on this? It would be EXTREMELY useful for our application - the lack of this feature is a pretty big blocker atm.

@devcrafter
Member

@LukeHarrimanHQ Currently, we have agreed to implement the proposal from here as the default behavior (i.e. w/o POPULATE).

Please let us know if that would suffice for your use case.

@Recodify

Recodify commented Mar 17, 2023

+1 for this feature. I'm currently solving a similar use case as follows, but it would be super nice to have this feature to make things more straightforward.

Create a utility table containing all hours for 30 years.

CREATE TABLE hour_buckets
(
    `created_at` DateTime
)
ENGINE = MergeTree
ORDER BY created_at
SETTINGS index_granularity = 8192;

Populate hours table

INSERT INTO hour_buckets
SELECT toStartOfHour(now()) as created_at
FROM numbers(1)
ORDER BY created_at 
    WITH FILL 
        FROM date_sub(YEAR, 5, toStartOfHour(now())) 
        TO date_add(YEAR, 25, toStartOfHour(now())) 
        STEP INTERVAL 1 HOUR;

Create a metrics table; this is our main data table.

CREATE TABLE metrics
(
    tag String,
    value UInt32,
    created_at DateTime
)
ENGINE = MergeTree
ORDER BY (tag, created_at);

Insert some test data


INSERT INTO metrics(tag, value, created_at)
VALUES ('server1.cpu', 84, now());

INSERT INTO metrics(tag, value, created_at)
VALUES ('server1.cpu', 73, date_add(HOUR, 1, now()));

INSERT INTO metrics(tag, value, created_at)
VALUES ('server1.cpu', 63, date_add(HOUR, 2, now()));

INSERT INTO metrics(tag, value, created_at)
VALUES ('server2.cpu', 16, now());

INSERT INTO metrics(tag, value, created_at)
VALUES ('server2.cpu', 33, date_add(HOUR, 1, now()));

INSERT INTO metrics(tag, value, created_at)
VALUES ('server2.cpu', 15, date_add(HOUR, 2, now()));

Select multiple time series and zero-fill the missing buckets.

SELECT		
	c.avgValue AS avgValue,
	a.tag AS tag,
	b.created_at AS created_at
FROM hour_buckets b
CROSS JOIN (
	select distinct tag from metrics
) a
LEFT JOIN (
	SELECT
		AVG(value) AS avgValue,
		tag AS tag,
		toStartOfHour(created_at) AS created_at
	FROM metrics
	GROUP BY tag, created_at
) c ON c.tag = a.tag AND b.created_at = c.created_at
WHERE b.created_at < toStartOfHour(date_add(HOUR, 4, now()))
AND b.created_at > toStartOfHour(date_sub(HOUR, 3, now()));

@seanlaff
Contributor

seanlaff commented Apr 30, 2023

Running into this - I had hoped to FILL both aaa and bbb in the example below, but only aaa is filled.

SELECT * FROM VALUES('id String, ts Date',
    ('aaa', '2023-04-20'),
    ('aaa', '2023-04-25'),
    ('bbb', '2023-04-22'),
    ('bbb', '2023-04-25')
) ORDER BY id, ts WITH FILL
┌─id──┬─────────ts─┐
│ aaa │ 2023-04-20 │
│     │ 2023-04-21 │
│     │ 2023-04-22 │
│     │ 2023-04-23 │
│     │ 2023-04-24 │
│ aaa │ 2023-04-25 │
│ bbb │ 2023-04-22 │
│ bbb │ 2023-04-25 │
└─────┴────────────┘

@devcrafter
Member

Just for transparency - some groundwork was done related to this issue:

@LukeHarrimanHQ

Thanks @devcrafter! Can't wait.

@devcrafter
Member

The proposed solution is implemented in #49503, and it's part of the 23.5 release.
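
Taking @seanlaff's example above as a sketch of the expected 23.5+ behavior: since filling now restarts for each value of the sorting prefix (id), the gaps for both aaa and bbb should be filled, and INTERPOLATE can additionally carry id into the inserted rows instead of the default '':

SELECT * FROM VALUES('id String, ts Date',
    ('aaa', '2023-04-20'),
    ('aaa', '2023-04-25'),
    ('bbb', '2023-04-22'),
    ('bbb', '2023-04-25')
) ORDER BY id, ts WITH FILL
INTERPOLATE ( id AS id )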

@canopenerda

canopenerda commented Sep 22, 2023

@devcrafter Thanks for your implementation. I have a scenario which is hard to achieve even with your fix. I need to do a forward fill for some metrics along a timeline across multiple categories. Let me elaborate with an example.
select * from values ('time Int64, category String, metric Float64',
(1, 'a', 1.0),
(3, 'b', 3.0),
(6, 'c', 5.0)
)

I would like to get the forward-filled output below:
1, 'a', 1.0
3, 'a', 1.0
6, 'a', 1.0
3, 'b', 3.0
6, 'b', 3.0
6, 'c', 5.0

With your fix, I can get the output below with fill step 1:
1, 'a', 1.0
2, 'a', 1.0
3, 'a', 1.0
4, 'a', 1.0
5, 'a', 1.0
6, 'a', 1.0
3, 'b', 3.0
4, 'b', 3.0
5, 'b', 3.0
6, 'b', 3.0
6, 'c', 5.0

However, in the real world the interval between adjacent values could be huge, so it's not feasible to fill with step 1, and no other step size can guarantee that the later time slots are filled for each category.

@yakov-olkhovskiy
Member

@canopenerda

I would like to get the forward-filled output below:
1, 'a', 1.0
3, 'a', 1.0
6, 'a', 1.0
3, 'b', 3.0
6, 'b', 3.0
6, 'c', 5.0

this filling is inconsistent - the time values between the first and second rows have step 2, while the next steps are 3 - with step 3 you could instead get:
1 a 1
4 a 1
3 b 3
6 b 3
6 c 5

@canopenerda

canopenerda commented Sep 22, 2023


@yakov-olkhovskiy a real-world example is trading stocks. I could open a position in a stock at any time and reduce / increase / close each position later on; I want to get the total position of all stocks along a timeline.

@yakov-olkhovskiy
Member

@canopenerda I don't quite understand what you want to achieve with the FILL expression. WITH FILL is a simple algorithm which allows you to fill up a column with values in a specified interval with a specified step.

@canopenerda

@yakov-olkhovskiy I just described a forward-fill scenario which is not possible to achieve in ClickHouse. It would help if WITH FILL could interpolate along the values already present in the filled column instead of interpolating in lock step.

@yakov-olkhovskiy
Member

@canopenerda we do have an interpolate feature for columns other than the filled one (see the INTERPOLATE expression), but it has a limitation: it can only refer to the previous row's values - since we don't read forward - so it's not exactly interpolation, but rather extrapolation.
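
A minimal sketch of that extrapolation behavior (synthetic values, default STEP 1): the inserted rows repeat the previous row's value, since INTERPOLATE only looks backwards.

SELECT *
FROM values('ts UInt64, value Float64', (1, 1.0), (5, 5.0))
ORDER BY ts ASC WITH FILL
INTERPOLATE ( value AS value ) -- rows for ts = 2, 3, 4 repeat the previous value (1)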

@weres-sa

I guess this example would be more illustrative:

-- SET join_use_nulls = 1

with stocks as (
SELECT 
  dt as dt,
  stock,
  opened
FROM values(
    'dt Datetime, stock String, opened UInt32', 
    ('2023-09-01T00:00:00', 'a', 1.0),
    ('2023-09-01T00:03:00', 'b', 3.0),
    ('2023-09-01T12:00:00', 'c', 4.0),
    ('2023-09-01T12:10:00', 'a', 5.0),
    ('2023-09-01T13:20:00', 'b', 7.0)
    )
),
dates as (
    select distinct dt from stocks
),
stocks_a as (
  select 
    dates.dt,
    stocks.opened,
    case when stock = '' then 'a' else stock end as stock
  from dates
  left join stocks on stocks.dt = dates.dt and stocks.stock = 'a'
  order by dt
  union all 
  select 
    dates.dt,
    stocks.opened,
    case when stock = '' then 'b' else stock end as stock
  from dates
  left join stocks on stocks.dt = dates.dt and stocks.stock = 'b'
  order by dt
  union all 
  select 
    dates.dt,
    stocks.opened,
    case when stock = '' then 'c' else stock end as stock
  from dates
  left join stocks on stocks.dt = dates.dt and stocks.stock = 'c'
  order by dt
),
stocks_a2 as (
  select distinct
    dt,
    stock,
    -- last_value(case when stock = '' then NULL else stock end) over row_window as stock,
    last_value(case when opened = 0 then NULL else opened end) over row_window as opened
  from stocks_a
  window row_window as (PARTITION BY stock ORDER BY dt rows between unbounded preceding and current row)
  order by dt
)
select 
    *
from stocks_a2
;

It works, and for each point in time it shows the positions actually open at that time - so you need a WINDOW here.

But I failed to find a way to do it without building the Cartesian product via that ugly combination of unions.

@canopenerda

@yakov-olkhovskiy I'm aware of the INTERPOLATE expression; the problem is that WITH FILL only supports a fixed STEP, which would not work in the stock scenario I depicted. By the way, I'm not saying there is something wrong with the current implementation of WITH FILL and INTERPOLATE, I'm just facing a scenario where WITH FILL is the closest I can get within ClickHouse. If you know any other workaround, I'd really appreciate it.

@canopenerda

@weres-sa Thanks for your solution. It works if the number of stocks is relatively small and stable.

@yakov-olkhovskiy
Member

@canopenerda we are positioning ClickHouse as well suited for time series and we are always open to suggestions for improvements, so if you have any - please step forward. The idea of a non-fixed STEP sounds quite reasonable.
Also, maybe you will find this blog post useful:
https://clickhouse.com/blog/working-with-time-series-data-and-functions-ClickHouse

@UnamedRus
Contributor Author

Why not just?

WITH
    stocks AS
    (
        SELECT
            dt AS dt,
            stock,
            opened
        FROM values('dt Datetime, stock String, opened UInt32', ('2023-09-01T00:00:00', 'a', 1.), ('2023-09-01T00:03:00', 'b', 3.), ('2023-09-01T12:00:00', 'c', 4.), ('2023-09-01T12:10:00', 'a', 5.), ('2023-09-01T13:20:00', 'b', 7.))
    ),
    groupArray(dt) OVER (ORDER BY dt ASC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS possible_ts,
    MIN(dt) OVER (PARTITION BY stock ORDER BY dt ASC ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS next_ts
SELECT
    stock,
    opened,
    arrayJoin(ts_arr) AS ts
FROM
(
    SELECT
        dt AS dt,
        stock,
        opened,
        arrayFilter(x -> (x < next_ts), possible_ts) AS ts_arr
    FROM stocks
)

Query id: 883e6ff7-56a8-49d4-8a9f-9c9a730e696b

┌─stock─┬─opened─┬──────────────────ts─┐
│ a     │      1 │ 2023-09-01 00:00:00 │
│ a     │      1 │ 2023-09-01 00:03:00 │
│ a     │      1 │ 2023-09-01 12:00:00 │
│ b     │      3 │ 2023-09-01 00:03:00 │
│ b     │      3 │ 2023-09-01 12:00:00 │
│ b     │      3 │ 2023-09-01 12:10:00 │
└───────┴────────┴─────────────────────┘
