-
Notifications
You must be signed in to change notification settings - Fork 49
chore: update flow dev guide #2184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
8 commits
Select commit
Hold shift + click to select a range
b64c1ba
chore: update stat
discord9 39993f6
docs: explain batching mode
discord9 64fa933
side bar
discord9 1812d70
front matter
discord9 5febf8e
revert unwanted change(already explain elsewhere)
discord9 bb89df8
docs: add arch image
discord9 9edbe0a
fix link
discord9 0c98494
中
discord9 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,74 @@ | ||
| --- | ||
| keywords: [streaming process, flow management, Flownode components, Flownode limitations, batching mode] | ||
| description: Overview of Flownode's batching mode, a component providing continuous data aggregation capabilities to the database, including its architecture and query execution flow. | ||
| --- | ||
|
|
||
| # Flownode Batching Mode Developer Guide | ||
|
|
||
| This guide provides a brief overview of the batching mode in `flownode`. It's intended for developers who want to understand the internal workings of this mode. | ||
|
|
||
| ## Overview | ||
|
|
||
| The batching mode in `flownode` is designed for continuous data aggregation. It periodically executes a user-defined SQL query over small, discrete time windows. This is in contrast to a streaming mode where data is processed as it arrives. | ||
|
|
||
| The core idea is to: | ||
| 1. Define a `flow` with a SQL query that aggregates data from a source table into a sink table. | ||
| 2. The query typically includes a time window function (e.g., `date_bin`) on a timestamp column. | ||
| 3. When new data is inserted into the source table, the system marks the corresponding time windows as "dirty." | ||
| 4. A background task periodically wakes up, identifies these dirty windows, and re-runs the aggregation query for those specific time ranges. | ||
| 5. The results are then inserted into the sink table, effectively updating the aggregated view. | ||
|
|
||
| ## Architecture | ||
|
|
||
| The batching mode consists of several key components that work together to achieve this continuous aggregation. As shown in the diagram below: | ||
|
|
||
|  | ||
|
|
||
| ### `BatchingEngine` | ||
|
|
||
| The `BatchingEngine` is the heart of the batching mode. It's a central component that manages all active flows. Its primary responsibilities are: | ||
|
|
||
| - **Task Management**: It maintains a map of `FlowId` to `BatchingTask`. It handles the creation, deletion, and retrieval of these tasks. | ||
| - **Event Dispatching**: When new data arrives (via `handle_inserts_inner`) or when time windows are explicitly marked as dirty (`handle_mark_dirty_time_window`), the `BatchingEngine` identifies which flows are affected and forwards the information to the corresponding `BatchingTask`s. | ||
|
|
||
| ### `BatchingTask` | ||
|
|
||
| A `BatchingTask` represents a single, independent data flow. Each task is associated with one `flow` definition and runs in its own asynchronous loop. | ||
|
|
||
| - **Configuration (`TaskConfig`)**: This struct holds the immutable configuration for a flow, such as the SQL query, source and sink table names, and time window expression. | ||
| - **State (`TaskState`)**: This contains the dynamic, mutable state of the task, most importantly the `DirtyTimeWindows`. | ||
| - **Execution Loop**: The task runs an infinite loop (`start_executing_loop`) that: | ||
| 1. Checks for a shutdown signal. | ||
| 2. Waits for a scheduled interval or until it's woken up. | ||
| 3. Generates a new query plan (`gen_insert_plan`) based on the current set of dirty time windows. | ||
| 4. Executes the query (`execute_logical_plan`) against the database. | ||
| 5. Cleans up the processed dirty windows. | ||
|
|
||
| ### `TaskState` and `DirtyTimeWindows` | ||
|
|
||
| - **`TaskState`**: This struct tracks the runtime state of a `BatchingTask`. It includes `dirty_time_windows`, which is crucial for determining what work needs to be done. | ||
| - **`DirtyTimeWindows`**: This is a key data structure that keeps track of which time windows have received new data since the last query execution. It stores a set of non-overlapping time ranges. When a task's execution loop runs, it consults this structure to build a `WHERE` clause that filters the source table for only the dirty time windows. | ||
|
|
||
| ### `TimeWindowExpr` | ||
|
|
||
| The `TimeWindowExpr` is a helper utility for dealing with time window functions like `TUMBLE`. | ||
|
|
||
| - **Evaluation**: It can take a timestamp and evaluate the time window expression to determine the start and end of the window that the timestamp falls into. | ||
| - **Window Size**: It can also determine the size (duration) of the time window from the expression. | ||
|
|
||
| This is essential for both marking windows as dirty and for generating the correct filter conditions when querying the source table. | ||
|
|
||
| ## Query Execution Flow | ||
|
|
||
| Here's a simplified step-by-step walkthrough of how a query is executed in batch mode: | ||
|
|
||
| 1. **Data Ingestion**: New data is written to a source table. | ||
| 2. **Marking Dirty**: The `BatchingEngine` receives a notification about the new data. It uses the `TimeWindowExpr` associated with each relevant flow to determine which time windows are affected by the new data points. These windows are then added to the `DirtyTimeWindows` set in the corresponding `TaskState`. | ||
| 3. **Task Wake-up**: The `BatchingTask`'s execution loop wakes up, either due to its periodic schedule or because it was notified of a large backlog of dirty windows. | ||
| 4. **Plan Generation**: The task calls `gen_insert_plan`. This method: | ||
| - Inspects the `DirtyTimeWindows`. | ||
| - Generates a series of `OR`'d `WHERE` clauses (e.g., `(ts >= 't1' AND ts < 't2') OR (ts >= 't3' AND ts < 't4') ...`) that cover the dirty windows. | ||
| - Rewrites the original SQL query to include this new filter, ensuring that only the necessary data is processed. | ||
| 5. **Execution**: The modified query plan is sent to the `Frontend` for execution. The database processes the aggregation on the filtered data. | ||
| 6. **Upsert**: The results are inserted into the sink table. The sink table is typically defined with a primary key that includes the time window column, so new results for an existing window will overwrite (upsert) the old ones. | ||
| 7. **State Update**: The `DirtyTimeWindows` set is cleared of the windows that were just processed. The task then goes back to sleep until the next interval. | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
74 changes: 74 additions & 0 deletions
74
...usaurus-plugin-content-docs/current/contributor-guide/flownode/batching_mode.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,74 @@ | ||
| --- | ||
| keywords: [流处理, flow 管理, Flownode 组件, Flownode 限制, 批处理模式] | ||
| description: Flownode 批处理模式概述,一个为数据库提供持续数据聚合能力的组件,包括其架构和查询执行流程。 | ||
| --- | ||
|
|
||
| # Flownode 批处理模式开发者指南 | ||
|
|
||
| 本指南简要概述了 `flownode` 中的批处理模式。它旨在帮助希望了解此模式内部工作原理的开发人员。 | ||
|
|
||
| ## 概述 | ||
|
|
||
| `flownode` 中的批处理模式专为持续数据聚合而设计。它在离散的、微小的时间窗口上周期性地执行用户定义的 SQL 查询。这与数据在到达时即被处理的流处理模式形成对比。 | ||
|
|
||
| 其核心思想是: | ||
| 1. 定义一个带有 SQL 查询的 `flow`,该查询将数据从源表聚合到目标表。 | ||
| 2. 查询通常在时间戳列上包含一个时间窗口函数(例如 `date_bin`)。 | ||
| 3. 当新数据插入源表时,系统会将相应的时间窗口标记为“脏”(dirty)。 | ||
| 4. 一个后台任务会周期性地唤醒,识别这些脏窗口,并为那些特定的时间范围重新运行聚合查询。 | ||
| 5. 然后将结果插入到目标表中,从而有效地更新聚合视图。 | ||
|
|
||
| ## 架构 | ||
|
|
||
| 批处理模式由几个协同工作的关键组件组成,以实现这种持续聚合。如下图所示: | ||
|
|
||
|  | ||
|
|
||
| ### `BatchingEngine` | ||
|
|
||
| `BatchingEngine` 是批处理模式的核心。它是一个管理所有活动 flow 的中心组件。其主要职责是: | ||
|
|
||
| - **任务管理**: 维护一个从 `FlowId` 到 `BatchingTask` 的映射。它处理这些任务的创建、删除和检索。 | ||
| - **事件分发**: 当新数据到达(通过 `handle_inserts_inner`)或当时间窗口被显式标记为脏(`handle_mark_dirty_time_window`)时,`BatchingEngine` 会识别受影响的 flow,并将信息转发给相应的 `BatchingTask`。 | ||
|
|
||
| ### `BatchingTask` | ||
|
|
||
| `BatchingTask` 代表一个独立的、单个的数据流。每个任务都与一个 `flow` 定义相关联,并在其自己的异步循环中运行。 | ||
|
|
||
| - **配置 (`TaskConfig`)**: 此结构体持有 flow 的不可变配置,例如 SQL 查询、源表和目标表名以及时间窗口表达式。 | ||
| - **状态 (`TaskState`)**: 包含任务的动态、可变状态,最重要的是 `DirtyTimeWindows`。 | ||
| - **执行循环**: 任务运行一个无限循环 (`start_executing_loop`),该循环: | ||
| 1. 检查关闭信号。 | ||
| 2. 等待一个预定的时间间隔或直到被唤醒。 | ||
| 3. 基于当前的脏时间窗口集合生成一个新的查询计划 (`gen_insert_plan`)。 | ||
| 4. 对数据库执行查询 (`execute_logical_plan`)。 | ||
| 5. 清理已处理的脏窗口。 | ||
|
|
||
| ### `TaskState` 和 `DirtyTimeWindows` | ||
|
|
||
| - **`TaskState`**: 此结构体跟踪 `BatchingTask` 的运行时状态。它包括 `dirty_time_windows`,这对于确定需要完成哪些操作至关重要。 | ||
| - **`DirtyTimeWindows`**: 这是一个关键的数据结构,用于跟踪自上次查询执行以来哪些时间窗口接收到了新数据。它存储一组不重叠的时间范围。当任务的执行循环运行时,它会参考此结构来构建一个 `WHERE` 子句,该子句仅过滤源表中的脏时间窗口。 | ||
|
|
||
| ### `TimeWindowExpr` | ||
|
|
||
| `TimeWindowExpr` 是一个用于处理像 `TUMBLE` 这样的时间窗口函数的辅助工具。 | ||
|
|
||
| - **求值**: 它可以接受一个时间戳并对时间窗口表达式求值,以确定该时间戳所属窗口的开始和结束。 | ||
| - **窗口大小**: 它还可以从表达式中确定时间窗口的大小(持续时间)。 | ||
|
|
||
| 这对于标记窗口为脏以及在查询源表时生成正确的过滤条件都至关重要。 | ||
|
|
||
| ## 查询执行流程 | ||
|
|
||
| 以下是批处理模式下查询执行的简化分步演练: | ||
|
|
||
| 1. **数据摄取**: 新数据被写入源表。 | ||
| 2. **标记为脏**: `BatchingEngine` 收到有关新数据的通知。它使用与每个相关 flow 关联的 `TimeWindowExpr` 来确定哪些时间窗口受到新数据点的影响。然后将这些窗口添加到相应 `TaskState` 中的 `DirtyTimeWindows` 集合中。 | ||
| 3. **任务唤醒**: `BatchingTask` 的执行循环被唤醒,原因可能是其周期性调度,也可能是因为它被通知有大量积压的脏窗口。 | ||
| 4. **计划生成**: 任务调用 `gen_insert_plan`。此方法: | ||
| - 检查 `DirtyTimeWindows`。 | ||
| - 生成一系列 `OR` 连接的 `WHERE` 子句(例如 `(ts >= 't1' AND ts < 't2') OR (ts >= 't3' AND ts < 't4') ...`),覆盖所有脏窗口。 | ||
| - 重写原始 SQL 查询以包含此新过滤器,确保只处理必要的数据。 | ||
| 5. **执行**: 修改后的查询计划被发送到 `Frontend` 执行。数据库处理已过滤数据的聚合。 | ||
| 6. **Upsert**: 结果被插入到目标表中。目标表通常定义了一个包含时间窗口列的主键,因此现有窗口的新结果将覆盖(upsert)旧结果。 | ||
| 7. **状态更新**: `DirtyTimeWindows` 集合中刚刚处理过的窗口被清除。然后任务返回睡眠状态,直到下一个时间间隔。 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.