Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hook in agg expressions to the table.agg API (global mode only). #702

Merged
merged 6 commits into from Mar 15, 2023

Conversation

xcharleslin
Copy link
Contributor

Implement the table.agg API, but only for global mode. This is so we can start using ungrouped aggregations right away instead of going through eval_expression_list.

@xcharleslin xcharleslin marked this pull request as ready for review March 15, 2023 05:37
pub fn agg_global(&self, to_agg: &[(Expr, &str)]) -> DaftResult<Table> {
// Convert the input (child, name) exprs to the enum form.
// e.g. (expr, "sum") to Expr::Agg(AggEXpr::Sum(expr))
// (We could do this at the pyo3 layer but I'm not sure why they're strings in the first place)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaychia @samster25 I was a bit confused about why we do (expr, agg name) in the Table.agg API. In the interest of speed I'm just rolling with it for now, but let me know if there's something I'm missing here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I did some more digging after you asked about it yesterday

This is an artifact of our old code where we didn't allow for evaluating on aggregation expressions, and the evaluation was handled separately outside of eval_expressions as a separate call to Block.agg which takes a string argument: https://github.com/Eventual-Inc/Daft/blob/main/daft/runners/partitioning.py#L461-L484

It does seem a little odd though and we can probably change that behavior. But yes in the interest of speed we can just keep this API first until we're fully integrated.

Maybe @samster25 can add some more color here

daft_table = Table.from_pydict({"input": input})
daft_table = daft_table.agg(
[
(col("input").cast(DataType.int32()).alias("count"), "count"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These casts are necessary because the null columns get typed as Null and the aggregations are not implemented on them.

It would not be too hard to implement all aggregations on NullType, and then users can e.g. call sum() on columns without worrying if they're empty.

If this sounds like a good thing to do, I can do it as a fast follow. (@jaychia the type checking logic would also have to change though)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the type checking logic is entirely in your control, so feel free to make any changes you deem fit. I also did run into issues too with some of the table tests which run on empty arrays and resorted to casting like you do so here.

The one thing we should try to adhere to though is that things that pass type-checks should actually be runnable and vice versa :P

I can run you through the unit test which I use to do that tomorrow (https://github.com/Eventual-Inc/Daft/blob/main/tests/test_runtime_matches_static_types.py#L70-L71)

@codecov
Copy link

codecov bot commented Mar 15, 2023

Codecov Report

Merging #702 (c6e445a) into main (676b776) will increase coverage by 0.05%.
The diff coverage is 80.48%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #702      +/-   ##
==========================================
+ Coverage   85.86%   85.91%   +0.05%     
==========================================
  Files         143      146       +3     
  Lines       11991    12155     +164     
==========================================
+ Hits        10296    10443     +147     
- Misses       1695     1712      +17     
Impacted Files Coverage Δ
daft/runners/blocks.py 85.59% <ø> (ø)
daft/table/table.py 82.80% <ø> (+3.89%) ⬆️
src/series/ops/arithmetic.rs 99.01% <ø> (ø)
src/dsl/expr.rs 81.79% <69.23%> (-0.53%) ⬇️
src/table/ops/agg.rs 78.94% <78.94%> (ø)
src/python/table.rs 89.30% <100.00%> (+0.94%) ⬆️

... and 7 files with indirect coverage changes

@xcharleslin xcharleslin merged commit 62ac0c1 into main Mar 15, 2023
7 of 8 checks passed
@xcharleslin xcharleslin deleted the charles/rs-globalagg branch March 15, 2023 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants