Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

windowing functions: partial:false - option to exclude incomplete window #3618

Open
jangorecki opened this issue Oct 7, 2023 · 11 comments
Open
Labels

Comments

@jangorecki
Copy link
Contributor

jangorecki commented Oct 7, 2023

Hello,
I would like to file a FR for additional argument to windowing functions interface that could control computation for an incomplete (partial) window.
My observation is that SQL databases are generally not giving any flexibility in that matter, and always compute and return partial window. On the other hand analytical tools are generally not only giving such flexibility but most of them defaults to not returning partial window results. What behavior is desires depends on the use case. From my personal uses cases, it is definitely the latter one that was useful.
My perspective is that if an analysts asks in his/her query for SMA10, then he/she wants SMA10, not an SMA9, SMA8, etc. Therefore whenever SMA10 is not possible to be computed (partial window) then NULL/missing values is expected.
Asking for changing the default in prql would be insane, but an option would be highly desired, and it actually could be another advantage of PRQL over SQL.
So lets look at the example

library(prqlr)
"from EuStockMarkets | select SMI | derive {rn = row_number this} | sort rn | window rolling:10 (derive {sma10 = average SMI})" |> prql_compile() |> cat()
#WITH table_0 AS (
#  SELECT
#    "SMI",
#    ROW_NUMBER() OVER () AS rn
#  FROM
#    "EuStockMarkets"
#)
#SELECT
#  "SMI",
#  rn,
#  AVG("SMI") OVER (
#    ORDER BY
#      rn ROWS BETWEEN 9 PRECEDING AND CURRENT ROW
#  ) AS sma10
#FROM
#  table_0
#ORDER BY
#  rn

This SQL query will compute SMA10, but also SMA9...SMA1 whenever 10 observations will not be available in the window.

So my FR is about extra arg, lets call it partial

window rolling:10 partial:false (derive {sma10 = average SMI})"

and the SQL generated with partial=false would need to have a subquery and case when

SELECT
  SMI,
  CASE WHEN partial=1 THEN NULL ELSE AVG(SMI) OVER (ORDER BY id) END AS sma10
FROM (
  SELECT
    SMI,
    CASE WHEN rn<10 THEN 1 ELSE 0 END AS partial
  FROM (
    SELECT
      SMI,
      ROW_NUMBER() OVER () AS rn FROM "EuStockMarkets"
  )
)

value added for prql is potentially big, just one attribute partial:false and much less typing than standard SQL way.

@jangorecki
Copy link
Contributor Author

Related discussion in duckdb can be found here: duckdb/duckdb#8340
In case duckdb will provide option for that, then pqrl compiler could for this target use new duckdb option

@jangorecki jangorecki changed the title windowing functions: simple moving average, option to exclude incomplete window windowing functions: partial:false - option to exclude incomplete window Oct 7, 2023
@snth
Copy link
Member

snth commented Oct 7, 2023

Hi,

this is a good observation and something I noticed as well when answering your other question.

It's not too difficult to add this in the PRQL query but I agree with you that it might be nice for PRQL to add this as an option as you suggested.

In the meantime you could work around it with the following:

let data = [
  {id=1, value=2},
  {id=2, value=3},
  {id=3, value=4},
  {id=6, value=2},
  {id=5, value=3},
  {id=4, value=4},
  ]

from data
window rolling:3 (
  sort id
  derive {rn = row_number this, sma3 = average value}
  filter rn>=3
  select {data.*, sma3}
  )

@jangorecki
Copy link
Contributor Author

jangorecki commented Oct 7, 2023

Your example won't exactly give what was requested in the question because it filter out rows having partial window. Question asked to just have SMA10 value NULL but still keep those rows.
Could you please point me to place in the code where changes would have to happen for 1) adding partial argument to window, 2) adding handling of new argument in compiler.

@snth
Copy link
Member

snth commented Oct 7, 2023

Ah, I misunderstood your requirement.

The following query should produce the result you want:

let data = [
  {id=1, value=2},
  {id=2, value=3},
  {id=3, value=4},
  {id=6, value=2},
  {id=5, value=3},
  {id=4, value=4},
  ]

from data
window rolling:3 (
  sort id
  derive {rn = row_number this, 
          sma3 = case {rn<3 => null, true => average value}
          }
  select {data.*, sma3}
  )

Please note: case currently still uses {} but this will change to [] in 0.10 (we can't release this currently because it's a breaking change).

I'm not the best person to ask about the compiler. @aljazerzen or @max-sixty , can you answer the following?

Could you please point me to place in the code where changes would have to happen for 1) adding partial argument to window, 2) adding handling of new argument in compiler.

@aljazerzen
Copy link
Member

This would be super useful. But I'm not sure how it fits with current window semantics.

To implement it, we'd probably have to hack something up during resolution. Not the easiest.

@eitsupi
Copy link
Member

eitsupi commented Oct 12, 2023

Just a link to a book that compares this:
https://eitsupi.github.io/querying-with-prql/timeseries#sec-moving-ave

@max-sixty
Copy link
Member

I think the request is for something similar to pandas' min_periods, is that correct?

To develop this, the first thing to do would be to write out the SQL that we'd want to generate. I think this can probably be done with another window function which counts the number of values that are non-null, which we then return NULL for when it's below the min_periods value.

(FWIW I remember writing something a bit like this here, though that's possibly more complicated than this case)

@max-sixty
Copy link
Member

Just a link to a book that compares this:
https://eitsupi.github.io/querying-with-prql/timeseries#sec-moving-ave

These are really great!!

@snth
Copy link
Member

snth commented Oct 13, 2023

Just a link to a book that compares this:
https://eitsupi.github.io/querying-with-prql/timeseries#sec-moving-ave

Very cool!

I also used an SMA like that in my presentations, as well as an EWMA which you can find in this Google Colab if you want to include that as well?
https://colab.research.google.com/drive/1asOWKjQbv6VWW9WjWR2Mer7PLmm0S0dT?usp=sharing

@jangorecki
Copy link
Contributor Author

I think the request is for something similar to pandas' min_periods, is that correct?

To develop this, the first thing to do would be to write out the SQL that we'd want to generate. I think this can probably be done with another window function which counts the number of values that are non-null, which we then return NULL for when it's below the min_periods value.

(FWIW I remember writing something a bit like this here, though that's possibly more complicated than this case)

yes, min_period in pandas

For a window that is specified by an integer, min_periods will default to the size of the window.

while in SQL it is always 1.

The request is about adding an option for pandas default behavior.

The written SQL is in the first post, no need to count nulls there, just case row_number().

@max-sixty
Copy link
Member

The written SQL is in the first post, no need to count nulls there, just case row_number().

Yes great, thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants