Projection with window function #54818
Comments
I don't understand what you want with sumMerge, I don't see I think you really need is |
It might be simpler to remove all the maxMerge columns and replace them with a single maxMerge column, though I am not sure that makes it much easier to understand. So again, we are focusing on the volume column. We don't want argMaxMerge for volume, since volume is a sum (within a given minute). I omitted the sum from the PROJECTION to make this clear, as it is not what we want and it is not clear how it could be added to the projection. Here it is, if that is what you need for better understanding: for peace of mind.
However, this way we get the latest numbers for prices (bid_price, etc.) and the sum of all volume over the entire duration of the table. We want only the sum of volumes in the LAST minute, so we can't use argMaxMerge. The only workaround would be to calculate the accumulating sum within a minute in the application and then, as you say, use argMaxMerge. But again, this is not what I described above.
I also provided a way of obtaining the correct data for us, which is slow however: This gives the correct result, as we expect. I am not really sure what you mean by simpler. I provided null table -> materialized view -> aggregate minute table + projection on top of the aggregate minute table to illustrate what we have and which particular query we struggle with. I am not really sure how finalizeAggregation could help here, since we are not struggling with finalization of the query. |
I see. I guess Window view should solve it, but they never worked properly. So I would solve this task outside of Clickhouse. |
Unfortunately, window view is not an option either, even if it worked properly. We want the last minute per id, which is not necessarily uniform across all ids or tied to real time: if some id has not been updated for an hour, its last minute is 1 hour old (because no update came), and that 1-hour-old minute is the one we want for that id.
Maybe the question I asked above could still be answered: could a function like sumMergeIf(d.agg_volume, d.agg_time >= f.threshold_time) AS LatestVolume be optimized further (as a window function), or used directly in a projection? I saw multiple items on GitHub related to optimization of window functions, so maybe in half a year or so this will be much quicker, or supported for materialized views and projections. The ideal case is that we do not need to calculate cumulative volume ourselves and can leave that to ClickHouse, since then we can take any data source and freely backfill data without having to worry about it ourselves. |
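A minimal Python sketch of the semantics asked for above, not ClickHouse code. It assumes hypothetical rows of (id, minute bucket, volume) and mirrors the two steps implied by the sumMergeIf expression: find a per-id threshold (that id's own latest minute), then sum only the rows at or after it. The sample data and the function name `latest_minute_volume` are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (id, minute bucket, volume of one update).
rows = [
    ("A", datetime(2023, 9, 20, 10, 0), 5),
    ("A", datetime(2023, 9, 20, 10, 1), 2),
    ("A", datetime(2023, 9, 20, 10, 1), 3),
    ("B", datetime(2023, 9, 20, 9, 0), 7),  # B was last updated an hour earlier
    ("B", datetime(2023, 9, 20, 9, 0), 1),
]


def latest_minute_volume(rows):
    # Pass 1: per-id threshold = the most recent minute seen for that id.
    threshold = {}
    for id_, minute, _ in rows:
        if id_ not in threshold or minute > threshold[id_]:
            threshold[id_] = minute
    # Pass 2: conditional sum, keeping only rows at or after that id's threshold.
    volume = defaultdict(int)
    for id_, minute, vol in rows:
        if minute >= threshold[id_]:
            volume[id_] += vol
    return {id_: (threshold[id_], volume[id_]) for id_ in threshold}


result = latest_minute_volume(rows)
```

Each id gets the sum for its own latest minute, so "B" reports its 09:00 bucket even though "A" has newer data. A fixed wall-clock window would not give this.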
Window functions cannot be used in projections. The scope of a projection's calculation is limited to a single part. |
@jozefRudy You can try the following and see if the performance is acceptable.
Then you can use this query to calculate last sum
In general, we can introduce an aggregate function: argMaxReduce, which aggregates data for maximum arguments only. |
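To illustrate what "aggregates data for maximum arguments only" could mean, here is a small Python sketch of a mergeable partial state. It is an assumption about the general idea, not ClickHouse's implementation: the class `SumAtMaxState` and the functions `add` and `merge` are hypothetical names. The state keeps the largest key seen (for example a minute bucket) and the sum of values at that key. Because two states can be merged, each part can build its own state independently.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Optional


@dataclass(frozen=True)
class SumAtMaxState:
    """Partial state: the largest key seen so far and the sum of values at that key."""
    key: Optional[int] = None  # e.g. a minute bucket as an integer timestamp
    total: int = 0


def add(state, key, value):
    # A strictly larger key discards everything accumulated for smaller keys.
    if state.key is None or key > state.key:
        return SumAtMaxState(key, value)
    if key == state.key:
        return SumAtMaxState(key, state.total + value)
    return state  # smaller key: ignored


def merge(a, b):
    # Combine two partial states, e.g. ones built from different parts.
    if a.key is None:
        return b
    if b.key is None:
        return a
    if a.key != b.key:
        return a if a.key > b.key else b
    return SumAtMaxState(a.key, a.total + b.total)


def build(rows):
    return reduce(lambda s, kv: add(s, kv[0], kv[1]), rows, SumAtMaxState())


# Hypothetical (minute, volume) rows for one id, split across two "parts".
part1 = [(100, 5), (101, 2)]
part2 = [(101, 3), (99, 9)]

merged = merge(build(part1), build(part2))
single = build(part1 + part2)
```

Merging the per-part states gives the same state as a single pass over all rows, which is the property a part-local pre-aggregation needs.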
Thank you @amosbird for a good suggestion, which I was not aware of. We plan to have years of 1-minute prices. My assumption is that doing this to get the last minute will not perform well: if I ran the performance test correctly today, with 1 year of 1-minute data for 10,000 instruments, it took seconds, so I abandoned it. (I did not get into arrayReduce('argMax'), which would be more efficient than arraySort, but I still believe the approach I came up with might be much more performant.) I decided to calculate the last-minute volume per id in a separate table. We want to get the last available minute for all ids (potentially 10,000 instruments). Getting prices as mentioned above is fast with the projection. The only problem is volume. I am using ReplacingMergeTree, where we keep all ids, and the update version is defined by endOfMinute(time) in the materialized view. So all updates per id stay, and are efficiently disposed of when the next minute comes.
Then, to get the volume for all ids from the table (in case the replacing merge has not yet run), we need to select the latest interval. There should not be too many intervals per id since the last merge operation, so this should be reasonably quick.
This is the selection I am using for the latest volumes, which I then merge on id with the latest prices from the price table, and I am done. |
@jozefRudy Projections are not supported in ReplacingMergeTree yet. You will end up with incorrect query result.
Yeah, the solution I posted will not scale well. I guess the right path is to build an argMaxReduce function. Your use case is good enough to deserve the effort. Here is a prototype:
|
That would be awesome (basically the ability to have one aggregation function on the smaller interval [minute -> sum] and a different one on the higher level -> max of those). I think for now we will live with NOT being able to get the latest minute's volume and wait for the implementation. Obviously I do not want to push for this, but is 6 months a good estimate for this feature to be implemented (just to have a rough idea)? |
I'll try to implement it this weekend. If everything goes well, it could land in October. |
It's better to implement -Min/-Max combinators.
|
@amosbird I noticed that #54947 has not had any activity recently. Should we take that to mean there is a fundamental problem? This should generalize to different combinators: imagine we also want the open, high, low, and close of the last minute's price (e.g. bid) as a projection. So we would have sumMax (as you suggested) and also maxMax, minMax, and then maybe firstMax and lastMax? Extending the projection from above:
|
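For readers following the open/high/low/close idea, here is a small Python sketch of the intended result, not ClickHouse code. It assumes hypothetical trades of (id, minute bucket, event order, price). For each id it keeps only the rows in that id's latest minute and then applies first, max, min, and last to them. The names `trades` and `last_minute_ohlc` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical trades: (id, minute bucket, event order, price).
trades = [
    ("A", 100, 1, 10.0),
    ("A", 101, 2, 11.0),
    ("A", 101, 3, 9.5),
    ("A", 101, 4, 10.5),
    ("B", 50, 1, 20.0),
]


def last_minute_ohlc(trades):
    # Latest minute bucket per id.
    latest = {}
    for id_, minute, _, _ in trades:
        latest[id_] = max(minute, latest.get(id_, minute))
    # Keep only the trades that fall in each id's latest minute.
    by_id = defaultdict(list)
    for id_, minute, order, price in trades:
        if minute == latest[id_]:
            by_id[id_].append((order, price))
    # Apply first / max / min / last to the surviving rows.
    out = {}
    for id_, items in by_id.items():
        prices = [price for _, price in sorted(items)]
        out[id_] = {
            "open": prices[0],
            "high": max(prices),
            "low": min(prices),
            "close": prices[-1],
        }
    return out


ohlc = last_minute_ohlc(trades)
```

The trade in minute 100 for "A" is excluded entirely, so the open is the first trade of the last minute rather than the first trade of the whole series.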
@alexey-milovidov said he will review the PR this week.
What do BTW, the combinator has been changed to |
the idea was to get the first price for last hence for same |
That's invalid, because |
I think that is what you mean.
aggregation table ->
materialized view ->
But projection currently is not possible for In other words, projection below is possible currently, but not first trade of last minute (as opposed to first trade of entire series which is not what we want), same with min and max of trade in last minute.
|
You need the following:
In projection:
|
Yes, that is what I meant: this is currently not possible, only after this PR gets merged. |
@jozefRudy It's merged. |
We are working with time series data. We have a raw table for ingestion, typically receiving around 100,000 items per minute (not too much).
On top of this we aggregate into 1-minute intervals.
aggregation table ->
materialized view ->
Our use case is querying a certain frequency (e.g. 1-min or 1-hour) for a given id, which is reasonably quick.
We have an additional query which we are struggling with -> the latest minute's data.
To make this quick we added a projection ->
This works as we want (giving the last non-null values for prices and sizes). However, we are struggling with volume. Since we have multiple updates in a minute, we are using sumState. Unfortunately, when selecting the last minute in a projection, it is impossible to do something like maxMerge(sumMerge(last_minute)). We do not want to sum volume for the entire table, just for the last minute.
We have spent time trying this in various ways, and there are limitations in every direction. We cannot use a window function in a projection, so the following query is slow ->
So we are able to get the sumMerge of volumes for the last minute, but not in a pre-calculated way.
Is there something we are missing, or is some feature planned to be added reasonably soon that we should wait for?