Global Streaming Aggregation over infinite stream data in Kafka #54776
Comments
Syntax / Semantics
The concept of EXTERNAL STREAM looks good. I think we will need it eventually. However, it would be easier to start with the existing StorageKafka.
Flat Transformation Query
For the query example, it is interesting why a special syntax like
^ This looks good to me.
Global Aggregation Query
Here, I see a new keyword:
Thanks @KochetovNicolai for your comments.
Sounds good. Let's stick to StorageKafka engine for now.
BTW, Timeplus's streaming query syntax and semantics are derived from academic research and industry practice. You can find more information in this blog. Using ClickHouse settings to control the behavior would technically work as well, I think. Since Timeplus focuses on a streaming-first solution, we adopted streaming industry practices from day one.
Tackling this feature in PR #54870
Use case
Streaming Query / Aggregation over Kafka topic
Describe the solution you'd like
This issue targets porting the first batch of existing stream processing functionality from Proton, Timeplus's open source streaming analytics repo, to the ClickHouse community.
It is also the first batch implementing the Streaming Query RFC:
allow aggregation and JOINing of infinite streams of data;
which has quite different query behavior from a regular historical query, since it emits (intermediate) query results according to a “watermark”.
Syntax / Semantics
We can either leverage the existing ClickHouse Kafka table engine or introduce a separate external stream concept, as Timeplus Proton does.
If we would like to introduce a new stream concept, the provisioning SQL would look something like this:
Otherwise, we can keep using the existing ClickHouse Kafka table engine, which may be problematic with regard to offset checkpointing etc. during query recovery.
No matter which storage engine we adopt, once the table pointing to a Kafka topic has been provisioned, users can run different queries against this special table using regular ClickHouse query syntax, but with the stream processing semantics detailed below.
Flat Transformation Query
The query example below runs forever, since the stream data never ends. It first rewinds the offset to the earliest data to replay all of the available history, then extracts JSON fields from each GitHub event and continuously emits the parsed fields to end users.
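To make these semantics concrete, here is a minimal Python sketch of a flat transformation. It is not the proposed ClickHouse implementation. It assumes the topic can be modelled as an iterable of raw JSON strings read from the earliest offset, and the field names `id`, `type`, and `actor` are hypothetical stand-ins for whatever the query would extract.

```python
import json
from typing import Iterable, Iterator


def flat_transform(raw_events: Iterable[str]) -> Iterator[dict]:
    """Parse each raw JSON event and emit the selected fields, one row per event.

    `raw_events` stands in for a Kafka topic replayed from the earliest
    offset (an assumption of this sketch). With a real topic the iterable
    never ends, so this generator never finishes either.
    """
    for raw in raw_events:
        event = json.loads(raw)
        # Hypothetical field names; no state is kept between events.
        yield {
            "id": event.get("id"),
            "type": event.get("type"),
            "actor": event.get("actor"),
        }
```

Because each output row depends only on its own input event, rows can be emitted as soon as they are parsed, which is why this kind of query needs no watermark.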
Global Aggregation Query
The following query example continuously evaluates real-time data from the Kafka topic and emits the top 10 contributors every 5 seconds.
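The periodic-emit behavior can be sketched in Python as follows. This is an illustration under stated assumptions, not the engine's actual logic: events arrive as `(timestamp_seconds, contributor)` pairs in timestamp order, and emission is driven by event timestamps rather than by a wall clock or watermark so that the example stays deterministic. The function name `top_contributors` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def top_contributors(
    events: Iterable[Tuple[float, str]],
    emit_interval: float = 5.0,
    k: int = 10,
) -> Iterator[List[Tuple[str, int]]]:
    """Emit the current top-k contributors once per elapsed interval.

    The aggregation is global: `counts` is never reset, so every snapshot
    covers all events seen since the start of the query.
    """
    counts: Counter = Counter()
    next_emit = None
    for ts, contributor in events:
        if next_emit is None:
            # The first event starts the emit timer (an assumption of this sketch).
            next_emit = ts + emit_interval
        # Emit one snapshot for every interval boundary this event crosses,
        # before folding the event itself into the running state.
        while ts >= next_emit:
            yield counts.most_common(k)
            next_emit += emit_interval
        counts[contributor] += 1
```

The key difference from a windowed aggregation is that the state accumulates for the lifetime of the query; each emitted result is an intermediate snapshot, not a final answer.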
Data Enrichment Join
As with the data enrichment join in Timeplus, this is an infinite stream joined with static historical table data.
In this query, users_dim is a regular ClickHouse (Replicated/Shared) MergeTree table. Once users_dim is loaded and the hash table is built, it stays static (no updates to the hash table in the future for this case). Events from the left stream continuously join against the static hash table on the right as they arrive.
Global Aggregation after Data Enrichment Join
As with streaming global aggregation, users can run a global aggregation over the streaming data enrichment join. The user_nt_name in the following query comes from the users_dim dimension table.
Please note that for all of the above example queries, streaming query semantics are enabled automatically, and only for the Kafka table engine (or external stream, if we decide to create a new concept). There are no opt-in tuning knobs to enable this streaming behavior, since these are probably the most common query semantics for the Kafka table engine.
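As a rough illustration of the enrichment join and the global aggregation over it, the Python sketch below builds a hash table once from a static `users_dim` and then probes it for every stream event. Several details are assumptions made for the example: the join key `actor`, inner-join semantics (unmatched events are dropped), and emitting a snapshot after every enriched row instead of periodically. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def build_hash_table(users_dim: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Load the dimension table once; the result is never updated afterwards."""
    return dict(users_dim)


def enrich(stream: Iterable[dict], users: Dict[str, str]) -> Iterator[dict]:
    """Probe the static hash table for each left-side event as it arrives."""
    for event in stream:
        name = users.get(event["actor"])  # hypothetical join key
        if name is not None:  # inner join assumed: skip unmatched events
            yield {**event, "user_nt_name": name}


def count_by_user(enriched: Iterable[dict]) -> Iterator[Dict[str, int]]:
    """Global aggregation over the enriched stream; state is never reset."""
    counts: Counter = Counter()
    for row in enriched:
        counts[row["user_nt_name"]] += 1
        # Simplification: snapshot after every row rather than on a timer.
        yield dict(counts)
```

Chaining the generators, as in `count_by_user(enrich(stream, build_hash_table(rows)))`, mirrors the pipeline shape: the right side is materialized once, while the left side and the aggregation run for as long as events keep arriving.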
Backward Compatibility
The above query syntax and semantics should not conflict with existing Kafka table engine query behavior; they are purely additive enhancements.
Describe alternatives you've considered
Additional context