-
Notifications
You must be signed in to change notification settings - Fork 996
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make it easier to create WindowFunctions with the Expr API #6747
Comments
Here is another example from #10345 / @timsaucer showing how non easy it is to create a window function via the expr API use datafusion::{logical_expr::{expr::WindowFunction, BuiltInWindowFunction, WindowFrame, WindowFunctionDefinition}, prelude::*};
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
let ctx = SessionContext::new();
let mut df = ctx.read_csv("/Users/tsaucer/working/testing_ballista/lead_lag/example.csv", CsvReadOptions::default()).await?;
df = df.with_column("array_col", make_array(vec![col("a"), col("b"), col("c")]))?;
df.clone().show().await?;
let lag_expr = Expr::WindowFunction(WindowFunction::new(
WindowFunctionDefinition::BuiltInWindowFunction(
BuiltInWindowFunction::Lead,
),
vec![col("array_col")],
vec![],
vec![],
WindowFrame::new(None),
None,
));
df = df.select(vec![col("a"), col("b"), col("c"), col("array_col"), lag_expr.alias("lagged")])?;
df.show().await?;
Ok(())
} It would be great if instead of let lag_expr = Expr::WindowFunction(WindowFunction::new(
WindowFunctionDefinition::BuiltInWindowFunction(
BuiltInWindowFunction::Lead,
),
vec![col("array_col")],
vec![],
vec![],
WindowFrame::new(None),
None,
)); It looked more like let lag_expr = lead(
vec![col("array_col")],
vec![],
vec![],
WindowFrame::new(None),
None,
)); Maybe even better like a builder style let lag_expr = lead(col("array_col")).build() Which would permit adding the various let lag_expr = lead(col("array_col"))
.partition_by(vec![])
.order_by(vec![])
.build() Maybe there are some inspirations in the polars API too: https://docs.pola.rs/user-guide/expressions/window/#group-by-aggregations-in-selection |
🤔 it seems like spark's API is like
https://stackoverflow.com/questions/32769328/how-to-use-window-functions-in-pyspark-using-dataframes So maybe for DataFusion it could look like let w = Window::new()
.partition_by(col("id"))
.order_by(col("dt"));
let lag_expr = lag(col("array_col"))
.over(w) |
Note I have some code in #6746 that had some part of it (along with an example) |
I am willing to help with this task. |
Great! I've rebased @alamb 's branch and added the changes I suggested. I was about to start testing the code and then I was going to write up the unit tests. My work in progress is here: https://github.com/timsaucer/datafusion/tree/feature/easier_window_funcs There was a little bit of changes I needed to make around the null_options. I got distracted by a task in the |
Thanks for your update! I'll work on the tests. |
FYI, my work is in: https://github.com/shanretoo/datafusion/tree/feat-window-fn |
@timsaucer I have fixed the calls of |
Oh, great. Have you been able to run the example code above using the new easy interface? |
You can check it in the unit test: |
Thank you. I pulled your branch and many of the tests are failing for me even though the functions are returning correct values when I add additional debug statements. I think what's happening here is that because we have the partition_by there is no guarantee what order the results come back as. On my machine the unit tests are returning the partitions on column C in order 10 then 1. I'm guessing on yours it was the opposite. There are a couple of things I think we can do to resolve this. One way would be to make a new macro for testing these partitioned functions. I could do something like
And then the lead function test would become
I've added an |
The one thing I think we're missing is the other variants for these. I don't think it's covered in other unit tests that I can find. So for example, for lead we would want to validate:
What do you think? I might try to write a macro around all these variants. I'm now unblocked on the other task I was working on, so I can pick it up if you'd like or I'm happy to work on other things. Please let me know. |
Sorry, my fault. I haven't taken into account the ordering issue. Maybe we could add a following match arm in the macro to omit the macro_rules! assert_sorted_fn_batches {
($EXPR:expr, $EXPECTED: expr) => {
let sort_by = $EXPR
.iter()
.map(|e| {
let alias = e.name_for_alias().expect("failed to get an alias");
col(alias).sort(true, true)
})
.collect::<Vec<_>>();
assert_sorted_fn_batches!($EXPR, $EXPECTED, sort_by);
}; |
Have you checked tests in sqllogictest? You can take over this and I'm happy to help when needed. |
I think you're doing a great job, and good point on the sqllogictest. TBH I find those tests harder to wrap my head around than the rust tests, but that's more personal preference. About the test function, I realize we can probably make it simpler:
What do you think? |
Looks good. It is clearer to understand the results in this way. |
Update here is that @jayzhan211 and I have been working on a similar API for creating |
Is your feature request related to a problem or challenge?
Follow on to #5781
There are at least three things named
WindowFunction
in DataFusion --Expr::WindowFunction
,window_function::WindowFunction
andexpr::WindowFunction
https://docs.rs/datafusion-expr/26.0.0/datafusion_expr/index.html?search=WindowFunction
Constructing an Expr::WindowFunction to pass to
LogicalPlanBuilder::window
is quite challengingDescribe the solution you'd like
I would like to make this process easier with a builder style:
for
lead(foo) OVER(PARTITION BY bar)
for example:Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: