Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CHORE] Improve error messages when calling aggregation methods on dataframe without input columns #1587

Merged
merged 5 commits into from
Nov 13, 2023

Conversation

colin-ho
Copy link
Contributor

@colin-ho colin-ho commented Nov 9, 2023

Fixes #1583.

When a user does not specify columns in df aggregation methods, e.g. df.count():

  • Default to running aggregation on all columns
  • Log warning messages with an example to pass in columns.

Copy link

codecov bot commented Nov 9, 2023

Codecov Report

Merging #1587 (5c787ae) into main (96b73d6) will decrease coverage by 2.01%.
Report is 7 commits behind head on main.
The diff coverage is 100.00%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1587      +/-   ##
==========================================
- Coverage   84.90%   82.89%   -2.01%     
==========================================
  Files          54       54              
  Lines        5162     5222      +60     
==========================================
- Hits         4383     4329      -54     
- Misses        779      893     +114     
Files Coverage Δ
daft/dataframe/dataframe.py 87.07% <100.00%> (-2.52%) ⬇️

... and 12 files with indirect coverage changes

logger.warning(
"No columns specified; performing count on all columns. Specify columns using df.count('col1', 'col2', ...) or use df.count_rows() for row counts."
)
cols = tuple(self.columns)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @samster25 ,

Could you take a look at this and lmk if this aligns with what you expected? Once it's all good with you I'll modify the rest of the agg methods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! Though I wonder if we should corner-case the .count() aggregation method to just forward the call to .count_rows() instead 😛 . I believe that in PostgreSQL for example, COUNT(*) is actually a row count of the entire result set vs COUNT(c) which does a null-aware count of a column.

These semantics sound good for all the other aggregation-type methods though!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, though .count_rows() returns an int whereas .count() should return a dataframe, so instead of simply forwarding the call we could do return DataFrame(self._builder.count())? which would result in a dataframe that looks like:

+--------+
| count  |
| UInt64 |
+--------+
| 3      |
+--------+

Copy link
Contributor

@jaychia jaychia Nov 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes good point. The current semantics of this PR (broadcasting the count operation on all columns) makes perfect sense then 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sweet, thanks! just made changes for the rest of the aggregation methods in latest commit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, some of the integration tests are failing with Error: Credentials could not be loaded, please check your action inputs: Could not load credentials from any providers, example. Do you know what's the cause 😅 ?

the release drafter label test too: [Error: Resource not accessible by integration](https://github.com/Eventual-Inc/Daft/actions/runs/6829508921/job/18575800562?pr=1587#step:2:25)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes that's security related because these CI tests are actually running on your fork, not on the main repo 😛

We need to figure out a (secure) way to run our integration tests on incoming PRs, or maybe a better policy here around accepting external contributions.

Copy link
Member

@samster25 samster25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @colin-ho,

Thanks for making this PR and taking on the issue. One issue I see is that we should use the python warning module for propagating these user warnings rather than the python logger which will be noisy.

warnings.warn("...")

Also don't worry about the integration tests for now, currently we rely on an identity provider that requires you to be in the eventual github org to run those correctly.

@colin-ho
Copy link
Contributor Author

Got it, replaced the logger with warnings.warn, thanks for bringing this up!

@jaychia
Copy link
Contributor

jaychia commented Nov 13, 2023

Approved! Looks like it's failing style lint checks though:
https://github.com/Eventual-Inc/Daft/actions/runs/6855581643/job/18640931993?pr=1587

You can automatically apply lints locally by installing pre-commit which will have it run on every commit you're making. To fix this retroactively, you can run pre-commit run --all-files which will run our pre-commit checks/lints!

@samster25
Copy link
Member

@jaychia good to merge?

@jaychia jaychia merged commit b4467ad into Eventual-Inc:main Nov 13, 2023
31 of 37 checks passed
@jaychia
Copy link
Contributor

jaychia commented Nov 13, 2023

Merged, congrats on first PR @colin-ho :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve Error message when calling aggregation method on dataframe without any columns
3 participants