
'-Resample' combinator limitations #13493

Open
RikeVoltz opened this issue Aug 7, 2020 · 16 comments

@RikeVoltz

Describe the issue
Aggregate functions with the -Resample combinator don't accept parameter expressions that could be pre-evaluated and folded into constants.

How to reproduce

CREATE TABLE IF NOT EXISTS Test
(
    `time` DateTime, 
    `id` Int32
)
ENGINE = Memory();
INSERT INTO Test VALUES ('2020-08-07 01:03:05',1), ('2020-08-07 01:03:06',2), 
                        ('2020-08-07 04:08:12',3), ('2020-08-07 15:06:13',4), 
                        ('2020-08-07 15:06:14',5), ('2020-08-07 15:06:15',6);

The following queries all fail with the same exception:

SELECT countResample(toDateTime('2020-08-07 00:00:00'), toDateTime('2020-08-08 00:00:00'), INTERVAL 1 HOUR)(id, time) FROM Test;
SELECT countResample(toUnixTimestamp('2020-08-07 00:00:00'), toUnixTimestamp('2020-08-08 00:00:00'), INTERVAL 1 HOUR)(id, time) FROM Test;
SELECT countResample(toUnixTimestamp('2020-08-07 00:00:00'), toUnixTimestamp('2020-08-08 00:00:00'), 3600)(id, time) FROM Test;
SELECT countResample(1596747600, 1596834000, INTERVAL 1 HOUR)(id, time) FROM Test;

Expected behavior

SELECT countResample(1596747600, 1596834000, 3600)(id, time) FROM Test;

┌─countResample(1596747600, 1596834000, 3600)(id, time)─┐
│ [0,2,0,0,1,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0]     │
└───────────────────────────────────────────────────────┘

Error message and/or stacktrace
Code: 134. DB::Exception: Received from localhost:19019. DB::Exception: Parameters to aggregate functions must be literals.
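Until the combinator accepts such expressions, one workaround is to precompute the boundary timestamps client-side and splice them into the query as integer literals. A minimal sketch in plain Python (the `resample_query` helper is hypothetical, not part of any ClickHouse client API; timestamps are interpreted as UTC here, so adjust to your server's timezone):

```python
from datetime import datetime, timezone

def resample_query(start: str, end: str, step_seconds: int) -> str:
    """Build a countResample query with precomputed literal parameters."""
    def to_unix(s: str) -> int:
        # Parse "yyyy-MM-dd hh:mm:ss" and convert to a Unix timestamp (UTC).
        return int(datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
                   .replace(tzinfo=timezone.utc).timestamp())
    return (f"SELECT countResample({to_unix(start)}, {to_unix(end)}, "
            f"{step_seconds})(id, time) FROM Test")

query = resample_query("2020-08-07 00:00:00", "2020-08-08 00:00:00", 3600)
# The generated query contains only integer literals, which -Resample accepts.
```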

@RikeVoltz

Also, it should be investigated whether the -Resample combinator works with the DateTime type at all.

@somewheve

Can you provide me an example on how to resample n min bar?

Like:

CREATE TABLE IF NOT EXISTS Test
(
    `datetime` DateTime, 
    `last_price` Float64,
    `volume`  Int32,
    `amount`  Int32
)

@RikeVoltz

RikeVoltz commented Aug 12, 2020

Can you provide me an example on how to resample n min bar?

It depends on what you want to get in the answer, which is why it's called a "combinator" :)

For example, if you want to get the total (sum) amount of "something" for every 7 minutes between 2020-01-02 00:00:00 and 2020-01-02 01:10:00 (a 70-minute interval), you need to:

  1. Convert DateTime to Unix timestamp (2020-01-02 00:00:00=1577923200, 2020-01-02 01:10:00=1577927400);
  2. Convert bucket size to seconds (7 min = 420 sec)
  3. Use -Resample combinator with sum aggregate function:
SELECT sumResample(1577923200, 1577927400, 420)(amount, datetime) FROM Test;
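The first two conversion steps can be double-checked with a quick sketch in plain Python (timestamps taken in UTC, which matches the numbers above):

```python
from datetime import datetime, timezone

def to_unix(s: str) -> int:
    # Parse "yyyy-MM-dd hh:mm:ss" and convert to a Unix timestamp (UTC).
    return int(datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
               .replace(tzinfo=timezone.utc).timestamp())

start = to_unix("2020-01-02 00:00:00")  # 1577923200
end = to_unix("2020-01-02 01:10:00")    # 1577927400
bucket = 7 * 60                         # 420 seconds
```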

So, for a dataset like this:

INSERT INTO Test VALUES ('2020-01-02 00:00:00', 123.11, 12, 10), ('2020-01-02 00:01:10', 245.12, 6, 5), ('2020-01-02 00:02:20', 789.13, 7, 11), ('2020-01-02 00:56:13', 1001.1, 31, 156), ('2020-01-02 00:57:11', 1.23, 57, 13), ('2020-01-03 00:00:00', 111.22, 12, 11);

We'll get:

SELECT sumResample(1577923200, 1577927400, 420)(amount, datetime)
FROM Test

┌─sumResample(1577923200, 1577927400, 420)(amount, datetime)─┐
│ [26,0,0,0,0,0,0,0,169,0]                                   │
└────────────────────────────────────────────────────────────┘

From my point of view, the issue is with the first two steps: ClickHouse could easily precalculate these constants for us, and it would be great if we could write queries like:

SELECT sumResample('2020-01-02 00:00:00', '2020-01-02 01:10:00', INTERVAL 7 MINUTE)(amount, datetime) FROM Test;

@somewheve

Does it work now?

@RikeVoltz

The first one, yes:

SELECT sumResample(1577923200, 1577927400, 420)(amount, datetime) FROM Test

The second one, no; that's why the issue is open:

SELECT sumResample('2020-01-02 00:00:00', '2020-01-02 01:10:00', INTERVAL 7 MINUTE)(amount, datetime) FROM Test;

@somewheve

Got it, thanks for your reply. I will try the first one.

@somewheve

When I increase the interval between start and end, I get:

infi.clickhouse_orm.database.ServerError: The range given in function sumResample contains too many elements (version 20.5.4.40 (official build)) (69)

Also, can the resample combinator only operate on one element? I'd like to get the max price, min price, open price, and last price in n-minute bars, and I also need to sum the amount and volume.

here is the 1min resample code link

If you can, could you give me some suggestions? I am new to ClickHouse.

@RikeVoltz RikeVoltz reopened this Aug 12, 2020
@RikeVoltz

infi.clickhouse_orm.database.ServerError: The range given in function sumResample contains too many elements (version 20.5.4.40 (official build)) (69)

This exception occurs when we try to split our interval into more than 4096 buckets (probably this and this lines are responsible). If your case needs more than 4096 buckets, the -Resample combinator probably isn't suitable for you, and you can use one of the other approaches.
One of them is to use a function from the toStartOf family, as suggested in the SO solution you provided (in our example, toStartOfInterval(datetime, INTERVAL 7 MINUTE), a function from the ClickHouse docs, is probably the best fit). But IMHO that's not a fully correct solution, because we can't control the StartTimestamp of the first bucket: toStartOfInterval returns a DateTime that is divisible by the interval in seconds, but the start of our interval may not be.
So a less elegant, but more correct, solution is to split the timestamps into buckets ourselves.
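Before issuing a -Resample query, one can check client-side whether the range fits under that limit. A quick sketch in plain Python (the 4096 constant is the limit mentioned above):

```python
MAX_BUCKETS = 4096  # ClickHouse's limit for -Resample ranges

def bucket_count(start_ts: int, end_ts: int, step: int) -> int:
    # Number of buckets a -Resample over [start_ts, end_ts) would create.
    return -(-(end_ts - start_ts) // step)  # ceiling division

# The 70-minute / 7-minute example fits comfortably:
n = bucket_count(1577923200, 1577927400, 420)  # 10 buckets
assert n <= MAX_BUCKETS
# But, for example, one year of 1-second buckets does not:
assert bucket_count(0, 365 * 86400, 1) > MAX_BUCKETS
```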

    WITH 
        toDateTime('2020-01-02 00:00:00') AS StartTimestamp, 
        toDateTime('2020-01-02 01:10:00') AS EndTimestamp, 
        7 * 60 AS BucketSize, 
        datetime - StartTimestamp AS RelativeDatetime, 
        intDiv(RelativeDatetime, BucketSize) AS slot
    SELECT 
        StartTimestamp + slot * BucketSize as StartOfBucket, 
        StartTimestamp + (slot+1) * BucketSize as EndOfBucket, 
        max(last_price) AS HighestPrice, 
        min(last_price) AS LowestPrice, 
        argMin(last_price, datetime) AS OpenPrice, 
        argMax(last_price, datetime) AS LatestPrice
    FROM Test
    WHERE (datetime >= StartTimestamp) AND (datetime < EndTimestamp)
    GROUP BY slot
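The slot arithmetic in the query above can be sketched in plain Python (same constants as in the query; the timestamp at 00:56:13 is one of the inserted test rows):

```python
START = 1577923200          # 2020-01-02 00:00:00 UTC
BUCKET = 7 * 60             # 420 seconds

def slot_of(ts: int) -> int:
    # Mirrors intDiv(datetime - StartTimestamp, BucketSize)
    return (ts - START) // BUCKET

def bucket_bounds(slot: int) -> tuple[int, int]:
    # Mirrors StartTimestamp + slot * BucketSize and the next boundary
    return START + slot * BUCKET, START + (slot + 1) * BUCKET

ts = START + 56 * 60 + 13   # 2020-01-02 00:56:13
s = slot_of(ts)             # slot 8
lo, hi = bucket_bounds(s)   # 00:56:00 .. 01:03:00, as in the result below
```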

The result is:

┌───────StartOfBucket─┬─────────EndOfBucket─┬─HighestPrice─┬─LowestPrice─┬─OpenPrice─┬─LatestPrice─┐
│ 2020-01-02 00:00:00 │ 2020-01-02 00:07:00 │       789.13 │      123.11 │    123.11 │      789.13 │
│ 2020-01-02 00:56:00 │ 2020-01-02 01:03:00 │       1001.1 │        1.23 │    1001.1 │        1.23 │
└─────────────────────┴─────────────────────┴──────────────┴─────────────┴───────────┴─────────────┘

Compare with the usage of toStartOfInterval function:

WITH 
    toDateTime('2020-01-02 00:00:00') AS StartTimestamp, 
    toDateTime('2020-01-02 01:10:00') AS EndTimestamp, 
    7 * 60 AS BucketSize, 
    toStartOfInterval(datetime, toIntervalMinute(7)) AS StartOfBucket
SELECT 
    StartOfBucket, 
    StartOfBucket + BucketSize AS EndOfBucket, 
    max(last_price) AS HighestPrice, 
    min(last_price) AS LowestPrice, 
    argMin(last_price, datetime) AS OpenPrice, 
    argMax(last_price, datetime) AS LatestPrice
FROM Test
WHERE (datetime >= StartTimestamp) AND (datetime < EndTimestamp)
GROUP BY StartOfBucket
ORDER BY StartOfBucket ASC

┌───────StartOfBucket─┬─────────EndOfBucket─┬─HighestPrice─┬─LowestPrice─┬─OpenPrice─┬─LatestPrice─┐
│ 2020-01-01 23:58:00 │ 2020-01-02 00:05:00 │       789.13 │      123.11 │    123.11 │      789.13 │
│ 2020-01-02 00:54:00 │ 2020-01-02 01:01:00 │       1001.1 │        1.23 │    1001.1 │        1.23 │
└─────────────────────┴─────────────────────┴──────────────┴─────────────┴───────────┴─────────────┘

P.S. The above queries may contain mistakes (and can certainly be rewritten in a shorter, cleaner way)!

@somewheve

somewheve commented Aug 12, 2020

really thanks for your reply, you are so handsome and cool ~!!

ps: Have a good day,

@RikeVoltz

RikeVoltz commented Aug 12, 2020

really thanks for your reply, you are so handsome and cool ~!!

ps: Have a good day,

You're welcome :)
P.S.: If you want the empty buckets too, you can use a trick with a JOIN:

WITH 
    toDateTime('2020-01-02 00:00:00') AS StartTimestamp, 
    toDateTime('2020-01-02 01:10:00') AS EndTimestamp, 
    7 * 60 AS BucketSize
SELECT 
    StartTimestamp + (slot * BucketSize) AS StartOfBucket, 
    StartTimestamp + ((slot + 1) * BucketSize) AS EndOfBucket, 
    HighestPrice, 
    LowestPrice, 
    OpenPrice, 
    LatestPrice
FROM 
(
    WITH 
        toDateTime('2020-01-02 00:00:00') AS StartTimestamp, 
        toDateTime('2020-01-02 01:10:00') AS EndTimestamp, 
        7 * 60 AS BucketSize, 
        datetime - StartTimestamp AS RelativeDatetime, 
        intDiv(RelativeDatetime, BucketSize) AS slot
    SELECT 
        slot, 
        max(last_price) AS HighestPrice, 
        min(last_price) AS LowestPrice, 
        argMin(last_price, datetime) AS OpenPrice, 
        argMax(last_price, datetime) AS LatestPrice
    FROM Test
    WHERE (datetime >= StartTimestamp) AND (datetime < EndTimestamp)
    GROUP BY slot
)
ANY RIGHT JOIN 
(
    WITH 
        toDateTime('2020-01-02 00:00:00') AS StartTimestamp, 
        toDateTime('2020-01-02 01:10:00') AS EndTimestamp, 
        7 * 60 AS BucketSize
    SELECT toInt32(number) AS slot
    FROM system.numbers
    LIMIT intDiv(EndTimestamp - StartTimestamp, BucketSize)
) USING (slot)
ORDER BY slot ASC
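The effect of that RIGHT JOIN against system.numbers can be sketched in plain Python: generate every slot up front, then overlay the aggregated slots, letting the rest default to zeros (the `aggregated` dict mimics the two non-empty buckets in this example):

```python
START, END, BUCKET = 1577923200, 1577927400, 420

# All slots in [START, END), mirroring the system.numbers subquery
all_slots = list(range((END - START) // BUCKET))

# Aggregates for non-empty slots only, as the inner SELECT would return them
aggregated = {0: {"high": 789.13, "low": 123.11},
              8: {"high": 1001.1, "low": 1.23}}

# RIGHT JOIN effect: every slot appears; missing ones get default zeros
rows = [aggregated.get(slot, {"high": 0, "low": 0}) for slot in all_slots]
```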

The result will be:

┌───────StartOfBucket─┬─────────EndOfBucket─┬─HighestPrice─┬─LowestPrice─┬─OpenPrice─┬─LatestPrice─┐
│ 2020-01-02 00:00:00 │ 2020-01-02 00:07:00 │       789.13 │      123.11 │    123.11 │      789.13 │
│ 2020-01-02 00:07:00 │ 2020-01-02 00:14:00 │            0 │           0 │         0 │           0 │
│ 2020-01-02 00:14:00 │ 2020-01-02 00:21:00 │            0 │           0 │         0 │           0 │
│ 2020-01-02 00:21:00 │ 2020-01-02 00:28:00 │            0 │           0 │         0 │           0 │
│ 2020-01-02 00:28:00 │ 2020-01-02 00:35:00 │            0 │           0 │         0 │           0 │
│ 2020-01-02 00:35:00 │ 2020-01-02 00:42:00 │            0 │           0 │         0 │           0 │
│ 2020-01-02 00:42:00 │ 2020-01-02 00:49:00 │            0 │           0 │         0 │           0 │
│ 2020-01-02 00:49:00 │ 2020-01-02 00:56:00 │            0 │           0 │         0 │           0 │
│ 2020-01-02 00:56:00 │ 2020-01-02 01:03:00 │       1001.1 │        1.23 │    1001.1 │        1.23 │
│ 2020-01-02 01:03:00 │ 2020-01-02 01:10:00 │            0 │           0 │         0 │           0 │
└─────────────────────┴─────────────────────┴──────────────┴─────────────┴───────────┴─────────────┘

@somewheve

ok got it ~

@somewheve

And can I use the same approach to resample by price, i.e. whenever the price moves up or down by 4?

@RikeVoltz

Probably I didn't quite understand what you meant, could you please describe it?

@somewheve

the price is

[ 3001, 3002, 3004,3006, 3007, 3008, 3009, 3001, 3002, 3002, 3003, 3004,3000] 
# I want to resample the bar to  
[3001, 3002,3004, 3006] , [ 3007, 3008, 3009, 3001] , [ 3002, 3002, 3003, 3000] 

When the difference between the lowest price and the highest price is more than 4, it should be regarded as one bar. I'd also like to calculate the max, min, open, close, sum(volume), and sum(amount).

@RikeVoltz

RikeVoltz commented Aug 12, 2020

Do you want to resample by DateTime, by another column (price or something else), or by both?

@somewheve

Not both, just a single one, maybe price or volume; it's important to analyze them.
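As an aside, a greedy segmentation along those lines can be sketched in plain Python. This assumes one particular rule (a bar closes as soon as its high-low range exceeds the threshold, keeping the breaching tick); the thread never settles on an exact rule, so the last bar differs slightly from the example above:

```python
def split_by_range(prices: list[float], threshold: float = 4) -> list[list[float]]:
    """Close the current bar once (max - min) of its ticks exceeds threshold."""
    bars, current = [], []
    for p in prices:
        current.append(p)
        if max(current) - min(current) > threshold:
            bars.append(current)
            current = []
    if current:
        bars.append(current)  # keep the trailing, still-open bar
    return bars

prices = [3001, 3002, 3004, 3006, 3007, 3008, 3009, 3001, 3002, 3002, 3003, 3004, 3000]
bars = split_by_range(prices)
# → [[3001, 3002, 3004, 3006], [3007, 3008, 3009, 3001], [3002, 3002, 3003, 3004, 3000]]
```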
