Support column level statistics #514

Open
nicor88 opened this issue Nov 18, 2023 · 4 comments
Labels
feature New feature or request

Comments

@nicor88
Member

nicor88 commented Nov 18, 2023

https://aws.amazon.com/about-aws/whats-new/2023/11/aws-glue-data-catalog-generating-column-level-statistics/

Add configuration options that let the user generate column-level statistics for a table.
Minimal config to make it work:

  • collect_statistics: boolean (can be configured at the project level)
  • Glue role used to collect statistics
  • columns to collect statistics for: optional when statistics are enabled (if empty or None, statistics are collected for all columns)
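
The minimal config above could map onto a Glue request roughly as follows. This is only a sketch: `ColumnStatsConfig`, its field names `statistics_role` and `statistics_columns`, and `build_task_run_request` are hypothetical and not part of dbt-athena. The request keys are the ones accepted by the Glue `start_column_statistics_task_run` call in boto3.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ColumnStatsConfig:
    """Hypothetical config holder; these field names are assumptions, not an existing API."""

    collect_statistics: bool = False
    statistics_role: Optional[str] = None  # IAM role Glue assumes to compute statistics
    statistics_columns: Optional[List[str]] = None  # None or [] means "all columns"


def build_task_run_request(
    database: str, table: str, config: ColumnStatsConfig
) -> Optional[Dict[str, Any]]:
    """Return kwargs for Glue's start_column_statistics_task_run, or None if disabled."""
    if not config.collect_statistics:
        return None
    if not config.statistics_role:
        raise ValueError("statistics_role is required when collect_statistics is true")

    request: Dict[str, Any] = {
        "DatabaseName": database,
        "TableName": table,
        "Role": config.statistics_role,
    }
    # Leaving ColumnNameList out asks Glue to cover every column of the table.
    if config.statistics_columns:
        request["ColumnNameList"] = list(config.statistics_columns)
    return request
```

The returned dict could then be passed as `boto3.client("glue").start_column_statistics_task_run(**request)`. Keeping the request building separate from the AWS call makes the "empty or None means all columns" rule easy to unit test.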

Open questions

  • For incremental loads, do we drop all existing statistics and recreate them, or do we just start a new run with start_column_statistics_task_run?
  • Are all table types supported? It seems only Hive tables are supported, not Iceberg.
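
To make the first option concrete, here is a rough sketch of the "just start a new run" approach, skipping Iceberg tables. It assumes, without having verified it, that a fresh run replaces the earlier statistics, and that Iceberg tables can be recognised by the `table_type` parameter on the Glue table. `is_iceberg` and `refresh_statistics` are hypothetical helpers; `glue` is expected to be a boto3 Glue client.

```python
import time
from typing import Any, Dict, List, Optional

TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "STOPPED"}


def is_iceberg(glue: Any, database: str, table: str) -> bool:
    """Check the table_type parameter on the Glue table (assumed Iceberg marker)."""
    params = glue.get_table(DatabaseName=database, Name=table)["Table"].get("Parameters", {})
    return params.get("table_type", "").upper() == "ICEBERG"


def refresh_statistics(
    glue: Any,
    database: str,
    table: str,
    role: str,
    columns: Optional[List[str]] = None,
    poll_seconds: float = 10.0,
    max_polls: int = 60,
) -> str:
    """Start a new statistics run after an incremental load and wait for it.

    Returns the final run status, or "SKIPPED" for Iceberg tables.
    """
    if is_iceberg(glue, database, table):
        return "SKIPPED"

    request: Dict[str, Any] = {"DatabaseName": database, "TableName": table, "Role": role}
    if columns:
        request["ColumnNameList"] = list(columns)
    run_id = glue.start_column_statistics_task_run(**request)["ColumnStatisticsTaskRunId"]

    # Poll until the run reaches a terminal status or we give up.
    for _ in range(max_polls):
        run = glue.get_column_statistics_task_run(ColumnStatisticsTaskRunId=run_id)
        status = run["ColumnStatisticsTaskRun"]["Status"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"statistics run {run_id} did not finish after {max_polls} polls")
```

Blocking until the run finishes is a design choice, not a requirement: it keeps downstream models from running against stale statistics, at the cost of a longer build.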

Notes

Currently not available in all regions

@nicor88 nicor88 added the feature New feature or request label Nov 18, 2023
@roslovets
Contributor

Did anyone find a use case for the column statistics feature?

I tried it on the unique ID field of a big table, and after several minutes it computed a totally wrong number of unique values. It also did not speed up simple SQL queries at all.

I agree that automating these statistics with dbt looks appealing. But would it be useful in real life, given that it can slow down project builds significantly?

@jessedobbelaere
Member

@roslovets I believe the main use case is indeed a potential performance gain, according to this new Cost-Based Optimizer for Athena. I haven't seen hands-on test results yet, though.

@nicor88
Member Author

nicor88 commented Nov 22, 2023

@roslovets
Contributor

Thank you for the links, folks. According to their fancy examples, we should really be able to save time on downstream models and tests, even if it takes up to several minutes to compute statistics for one table.

But I still don't understand why the result of SELECT COUNT(DISTINCT ...) differs from the value I see in the computed statistics. It undermines the whole idea and could make query planning and data processing unreliable.

Maybe you could do tests on your big tables as well?
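
For anyone who wants to run that comparison, a small sketch for putting a number on the gap. The helper names `distinct_count_from_stats` and `relative_error` are hypothetical. The response shape is the one returned by the Glue `get_column_statistics_for_table` call in boto3, and the exact count is assumed to come from your own `SELECT COUNT(DISTINCT ...)` query in Athena.

```python
from typing import Any, Dict, Optional


def distinct_count_from_stats(response: Dict[str, Any], column: str) -> Optional[int]:
    """Pull NumberOfDistinctValues for one column out of a
    get_column_statistics_for_table response; None if it is not present."""
    for entry in response.get("ColumnStatisticsList", []):
        if entry.get("ColumnName") != column:
            continue
        data = entry.get("StatisticsData", {})
        # The payload sits under a type-specific key such as LongColumnStatisticsData.
        for key, value in data.items():
            if key.endswith("ColumnStatisticsData") and "NumberOfDistinctValues" in value:
                return int(value["NumberOfDistinctValues"])
    return None


def relative_error(estimated: int, exact: int) -> float:
    """Relative difference between the catalog statistic and the exact count."""
    if exact == 0:
        return 0.0 if estimated == 0 else float("inf")
    return abs(estimated - exact) / exact
```

The response would come from something like `boto3.client("glue").get_column_statistics_for_table(DatabaseName=..., TableName=..., ColumnNames=["id"])`. One thing worth checking when the numbers disagree is whether the run used a `SampleSize` below 100, since a sampled run can only estimate the distinct count; that is a possible cause, not a confirmed one.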
