Hook in agg expressions to the table.agg API (global mode only). #702
Conversation
pub fn agg_global(&self, to_agg: &[(Expr, &str)]) -> DaftResult<Table> {
    // Convert the input (child, name) exprs to the enum form,
    // e.g. (expr, "sum") to Expr::Agg(AggExpr::Sum(expr)).
    // (We could do this at the pyo3 layer, but I'm not sure why they're strings in the first place.)
@jaychia @samster25 I was a bit confused about why we do (expr, agg name) in the Table.agg API. In the interest of speed I'm just rolling with it for now, but let me know if there's something I'm missing here.
So I did some more digging after you asked about it yesterday.
This is an artifact of our old code, where we didn't allow evaluating aggregation expressions: the evaluation was handled separately, outside of eval_expressions, as its own call to Block.agg, which takes a string argument: https://github.com/Eventual-Inc/Daft/blob/main/daft/runners/partitioning.py#L461-L484
It does seem a little odd, though, and we can probably change that behavior. But yes, in the interest of speed we can keep this API for now until we're fully integrated.
Maybe @samster25 can add some more color here.
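To make the conversion concrete, here is a minimal, self-contained sketch of turning the `(expr, agg_name)` string pairs accepted by `Table.agg` into a strongly-typed `AggExpr` enum, the way `agg_global` does internally. The enum variants and the `to_agg_expr` helper are simplified stand-ins, not Daft's actual types:

```rust
// Simplified stand-ins for Daft's Expr / AggExpr (illustrative only).
#[derive(Debug, PartialEq)]
enum Expr {
    Column(String),
    Agg(AggExpr),
}

#[derive(Debug, PartialEq)]
enum AggExpr {
    Sum(Box<Expr>),
    Count(Box<Expr>),
}

// Hypothetical helper mirroring the (child, name) -> enum conversion;
// the "sum"/"count" names are assumptions based on the snippet above.
fn to_agg_expr(child: Expr, name: &str) -> Result<Expr, String> {
    match name {
        "sum" => Ok(Expr::Agg(AggExpr::Sum(Box::new(child)))),
        "count" => Ok(Expr::Agg(AggExpr::Count(Box::new(child)))),
        other => Err(format!("unsupported aggregation: {other}")),
    }
}

fn main() {
    // ("input", "sum") becomes Expr::Agg(AggExpr::Sum(col)).
    let e = to_agg_expr(Expr::Column("input".to_string()), "sum").unwrap();
    assert_eq!(
        e,
        Expr::Agg(AggExpr::Sum(Box::new(Expr::Column("input".to_string()))))
    );
}
```

Doing this conversion once at the API boundary keeps the rest of the Rust kernel working on typed enums instead of strings.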
daft_table = Table.from_pydict({"input": input})
daft_table = daft_table.agg(
    [
        (col("input").cast(DataType.int32()).alias("count"), "count"),
These casts are necessary because the null columns get typed as Null and the aggregations are not implemented on them.
It would not be too hard to implement all aggregations on NullType, and then users can e.g. call sum() on columns without worrying if they're empty.
If this sounds like a good thing to do, I can do it as a fast follow. (@jaychia the type checking logic would also have to change though)
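The proposed NullType aggregation semantics can be sketched with a toy series type: a column whose dtype is Null only knows its length, so count (which ignores nulls) is always 0, and the sum of no valid values is null. These types are illustrative assumptions, not Daft's real Series implementation:

```rust
#[derive(Debug, PartialEq)]
enum Scalar {
    Null,
    Int(i64),
}

// A NullType column: every value is null; only the length is stored.
struct NullSeries {
    len: usize,
}

impl NullSeries {
    // count skips nulls, so a NullType column always counts to 0.
    fn count(&self) -> Scalar {
        Scalar::Int(0)
    }
    // Summing zero valid values yields null (matching, e.g., Arrow's
    // null-skipping aggregation semantics).
    fn sum(&self) -> Scalar {
        Scalar::Null
    }
}

fn main() {
    let s = NullSeries { len: 4 };
    assert_eq!(s.count(), Scalar::Int(0));
    assert_eq!(s.sum(), Scalar::Null);
}
```

With this in place, callers could sum() an empty or all-null column directly, without the defensive casts shown in the test above.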
Yes, the type-checking logic is entirely in your control, so feel free to make any changes you deem fit. I also ran into issues with some of the table tests that run on empty arrays, and resorted to casting as you do here.
The one thing we should try to adhere to though is that things that pass type-checks should actually be runnable and vice versa :P
I can run you through the unit test which I use to do that tomorrow (https://github.com/Eventual-Inc/Daft/blob/main/tests/test_runtime_matches_static_types.py#L70-L71)
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #702      +/-   ##
==========================================
+ Coverage   85.86%   85.91%   +0.05%
==========================================
  Files         143      146       +3
  Lines       11991    12155     +164
==========================================
+ Hits        10296    10443     +147
- Misses       1695     1712      +17
Implement the table.agg API, but only for global mode. This is so we can start using ungrouped aggregations right away instead of going through eval_expression_list.