Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand support for sql window functions to allow more functions, window definitions, and updating inputs #240

Open
jacksonrnewhouse opened this issue Aug 10, 2023 · 1 comment
Labels
enhancement New feature or request sql Related to the DataFusion SQL integration

Comments

@jacksonrnewhouse
Copy link
Contributor

Arroyo supports window functions, but only under very specific circumstances:

  • Only ROW_NUMBER() is supported.
  • The input data can't be updating.
  • The first term of the partition clause has to be a window struct, from either hop() or tumble().

The first two constraints are simply a matter of missing implementations. There's no semantic issue with supporting window functions such as LAG or ntile(), just a matter of correctly generating the code.

Similarly, updating inputs just require being able to remove or modify records from state, which is doable.

For the final constraint, it's a little trickier. Many of the window functions, such as ROW_NUMBER(), are sensitive to the total order of the partition. A single record coming in can change the value for an arbitrary number of rows. Supporting this style of update may be asking for trouble. That being said, there are some other time-sensitive windowing functions we could support. For instance, if your ordering is by event time and your window function only needs to know preceding terms, you could compute the streaming result by using the watermarks.

This is a tracking issue just to start the discussion. Individual improvements should reference it as we make progress.

@jacksonrnewhouse jacksonrnewhouse added enhancement New feature or request sql Related to the DataFusion SQL integration labels Aug 10, 2023
@lzwaan
Copy link

lzwaan commented Mar 16, 2024

Things like a LAG function is very handy for example to calculate the difference in time between 2 different column values like this:

WITH cte1 AS
  (
    SELECT
        window_start AS start,
        node,
        to_time(lastoccurrence, 'UTC') AS last_occurrence,
        to_time(lastclear, 'UTC') AS last_clear,
        raisetally AS alert_count
      FROM tumble(netcool, to_time(actiontime, 'UTC'), 10s)
      WHERE match(summary, 'is not reachable$')
      AND alert_count > 1
      PARTITION BY node
      GROUP BY start, node, last_occurrence, last_clear, alert_count
      ORDER BY last_occurrence
      SETTINGS seek_to='earliest'
  ),
  cte2 AS
    (
      SELECT
        *,
        lag(last_occurrence) OVER (PARTITION BY node) AS prev_occurrence,
        lag(alert_count) OVER (PARTITION BY node) AS prev_alert_count
      FROM alerts
    )
    SELECT
      window_start AS start,
      node,
      date_diff('second',prev_occurrence, last_clear) AS downtime
    FROM tumble(cte2, ts, 1m)
    WHERE alert_count > prev_alert_count;

┌───────────────────start─┬─node───┬─────────last_occurrence─┬──────────────last_clear─┬─alert_count─┬─────────prev_occurrence─┬─prev_alert_count─┐
│ 2024-03-09 20:40:10.000 │ node01 │ 2024-03-09 20:40:10.0002024-03-09 20:40:06.00042024-03-09 20:40:08.0003 │
│ 2024-03-09 20:40:10.000 │ node01 │ 2024-03-09 20:40:12.0002024-03-09 20:40:10.00052024-03-09 20:40:10.0004 │
│ 2024-03-09 20:40:10.000 │ node01 │ 2024-03-09 20:40:14.0002024-03-09 20:40:12.00062024-03-09 20:40:12.0005 │
│ 2024-03-09 20:40:10.000 │ node01 │ 2024-03-09 20:40:16.0002024-03-09 20:40:12.00072024-03-09 20:40:14.0006 │
│ 2024-03-09 20:40:10.000 │ node01 │ 2024-03-09 20:40:18.0002024-03-09 20:40:13.00082024-03-09 20:40:16.0007 │
└─────────────────────────┴────────┴─────────────────────────┴─────────────────────────┴─────────────┴─────────────────────────┴──────────────────┘

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request sql Related to the DataFusion SQL integration
Projects
None yet
Development

No branches or pull requests

2 participants