windowing functions: partial:false - option to exclude incomplete window #3618

jangorecki · 2023-10-07T09:30:20Z

Hello,
I would like to file a FR for additional argument to windowing functions interface that could control computation for an incomplete (partial) window.
My observation is that SQL databases are generally not giving any flexibility in that matter, and always compute and return partial window. On the other hand analytical tools are generally not only giving such flexibility but most of them defaults to not returning partial window results. What behavior is desires depends on the use case. From my personal uses cases, it is definitely the latter one that was useful.
My perspective is that if an analysts asks in his/her query for SMA10, then he/she wants SMA10, not an SMA9, SMA8, etc. Therefore whenever SMA10 is not possible to be computed (partial window) then NULL/missing values is expected.
Asking for changing the default in prql would be insane, but an option would be highly desired, and it actually could be another advantage of PRQL over SQL.
So lets look at the example

library(prqlr)
"from EuStockMarkets | select SMI | derive {rn = row_number this} | sort rn | window rolling:10 (derive {sma10 = average SMI})" |> prql_compile() |> cat()
#WITH table_0 AS (
#  SELECT
#    "SMI",
#    ROW_NUMBER() OVER () AS rn
#  FROM
#    "EuStockMarkets"
#)
#SELECT
#  "SMI",
#  rn,
#  AVG("SMI") OVER (
#    ORDER BY
#      rn ROWS BETWEEN 9 PRECEDING AND CURRENT ROW
#  ) AS sma10
#FROM
#  table_0
#ORDER BY
#  rn

This SQL query will compute SMA10, but also SMA9...SMA1 whenever 10 observations will not be available in the window.

So my FR is about extra arg, lets call it partial

window rolling:10 partial:false (derive {sma10 = average SMI})"

and the SQL generated with partial=false would need to have a subquery and case when

SELECT
  SMI,
  CASE WHEN partial=1 THEN NULL ELSE AVG(SMI) OVER (ORDER BY id) END AS sma10
FROM (
  SELECT
    SMI,
    CASE WHEN rn<10 THEN 1 ELSE 0 END AS partial
  FROM (
    SELECT
      SMI,
      ROW_NUMBER() OVER () AS rn FROM "EuStockMarkets"
  )
)

value added for prql is potentially big, just one attribute partial:false and much less typing than standard SQL way.

The text was updated successfully, but these errors were encountered:

jangorecki · 2023-10-07T09:35:10Z

Related discussion in duckdb can be found here: duckdb/duckdb#8340
In case duckdb will provide option for that, then pqrl compiler could for this target use new duckdb option

snth · 2023-10-07T14:50:26Z

Hi,

this is a good observation and something I noticed as well when answering your other question.

It's not too difficult to add this in the PRQL query but I agree with you that it might be nice for PRQL to add this as an option as you suggested.

In the meantime you could work around it with the following:

let data = [
  {id=1, value=2},
  {id=2, value=3},
  {id=3, value=4},
  {id=6, value=2},
  {id=5, value=3},
  {id=4, value=4},
  ]

from data
window rolling:3 (
  sort id
  derive {rn = row_number this, sma3 = average value}
  filter rn>=3
  select {data.*, sma3}
  )

jangorecki · 2023-10-07T18:44:58Z

Your example won't exactly give what was requested in the question because it filter out rows having partial window. Question asked to just have SMA10 value NULL but still keep those rows.
Could you please point me to place in the code where changes would have to happen for 1) adding partial argument to window, 2) adding handling of new argument in compiler.

snth · 2023-10-07T19:47:10Z

Ah, I misunderstood your requirement.

The following query should produce the result you want:

let data = [
  {id=1, value=2},
  {id=2, value=3},
  {id=3, value=4},
  {id=6, value=2},
  {id=5, value=3},
  {id=4, value=4},
  ]

from data
window rolling:3 (
  sort id
  derive {rn = row_number this, 
          sma3 = case {rn<3 => null, true => average value}
          }
  select {data.*, sma3}
  )

Please note: case currently still uses {} but this will change to [] in 0.10 (we can't release this currently because it's a breaking change).

I'm not the best person to ask about the compiler. @aljazerzen or @max-sixty , can you answer the following?

Could you please point me to place in the code where changes would have to happen for 1) adding partial argument to window, 2) adding handling of new argument in compiler.

aljazerzen · 2023-10-12T13:44:07Z

This would be super useful. But I'm not sure how it fits with current window semantics.

To implement it, we'd probably have to hack something up during resolution. Not the easiest.

eitsupi · 2023-10-12T13:59:33Z

Just a link to a book that compares this:
https://eitsupi.github.io/querying-with-prql/timeseries#sec-moving-ave

max-sixty · 2023-10-12T19:28:13Z

I think the request is for something similar to pandas' min_periods, is that correct?

To develop this, the first thing to do would be to write out the SQL that we'd want to generate. I think this can probably be done with another window function which counts the number of values that are non-null, which we then return NULL for when it's below the min_periods value.

(FWIW I remember writing something a bit like this here, though that's possibly more complicated than this case)

max-sixty · 2023-10-12T19:28:28Z

Just a link to a book that compares this:
https://eitsupi.github.io/querying-with-prql/timeseries#sec-moving-ave

These are really great!!

snth · 2023-10-13T05:05:57Z

Just a link to a book that compares this:
https://eitsupi.github.io/querying-with-prql/timeseries#sec-moving-ave

Very cool!

I also used an SMA like that in my presentations, as well as an EWMA which you can find in this Google Colab if you want to include that as well?
https://colab.research.google.com/drive/1asOWKjQbv6VWW9WjWR2Mer7PLmm0S0dT?usp=sharing

jangorecki · 2023-10-13T05:34:23Z

I think the request is for something similar to pandas' min_periods, is that correct?

To develop this, the first thing to do would be to write out the SQL that we'd want to generate. I think this can probably be done with another window function which counts the number of values that are non-null, which we then return NULL for when it's below the min_periods value.

(FWIW I remember writing something a bit like this here, though that's possibly more complicated than this case)

yes, min_period in pandas

For a window that is specified by an integer, min_periods will default to the size of the window.

while in SQL it is always 1.

The request is about adding an option for pandas default behavior.

The written SQL is in the first post, no need to count nulls there, just case row_number().

max-sixty · 2023-10-16T18:57:03Z

The written SQL is in the first post, no need to count nulls there, just case row_number().

Yes great, thanks.

jangorecki changed the title ~~windowing functions: simple moving average, option to exclude incomplete window~~ windowing functions: partial:false - option to exclude incomplete window Oct 7, 2023

eitsupi mentioned this issue Oct 13, 2023

EWMA in PRQL with DuckDB eitsupi/querying-with-prql#4

Open

eitsupi added enhancement language-design Changes to PRQL-the-language sql-translation labels Oct 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

windowing functions: partial:false - option to exclude incomplete window #3618

windowing functions: partial:false - option to exclude incomplete window #3618

jangorecki commented Oct 7, 2023 •

edited

jangorecki commented Oct 7, 2023

snth commented Oct 7, 2023

jangorecki commented Oct 7, 2023 •

edited

snth commented Oct 7, 2023

aljazerzen commented Oct 12, 2023

eitsupi commented Oct 12, 2023

max-sixty commented Oct 12, 2023

max-sixty commented Oct 12, 2023

snth commented Oct 13, 2023

jangorecki commented Oct 13, 2023

max-sixty commented Oct 16, 2023

windowing functions: partial:false - option to exclude incomplete window #3618

windowing functions: partial:false - option to exclude incomplete window #3618

Comments

jangorecki commented Oct 7, 2023 • edited

jangorecki commented Oct 7, 2023

snth commented Oct 7, 2023

jangorecki commented Oct 7, 2023 • edited

snth commented Oct 7, 2023

aljazerzen commented Oct 12, 2023

eitsupi commented Oct 12, 2023

max-sixty commented Oct 12, 2023

max-sixty commented Oct 12, 2023

snth commented Oct 13, 2023

jangorecki commented Oct 13, 2023

max-sixty commented Oct 16, 2023

jangorecki commented Oct 7, 2023 •

edited

jangorecki commented Oct 7, 2023 •

edited