
[Feature] Support configurable management of Table Optimisers for Iceberg tables #627

antonysouthworth-halter opened this issue Apr 18, 2024 · 3 comments

@antonysouthworth-halter
Contributor

Is this your first time submitting a feature request?

  • I have searched the existing issues, and I could not find an existing issue for this feature

Describe the feature

https://aws.amazon.com/blogs/aws/aws-glue-data-catalog-now-supports-automatic-compaction-of-apache-iceberg-tables/

Honestly, I have not fully thought through how it would work; I am hoping to spark some discussion in the thread.

Perhaps just another config variable for this? E.g. for an incremental model (or sp_insert_by_period; not relevant for table/view):

{{config(
    materialized='incremental',
    table_type='iceberg',
    use_glue_automatic_compaction=true
)}}

When use_glue_automatic_compaction is specified, we would use the Glue {Create,Update}TableOptimizer API operations to create the optimiser for compaction.
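
For illustration, a rough sketch of the underlying call the adapter could make, assuming the boto3 Glue client's create_table_optimizer / update_table_optimizer operations (the account id, database, table name and role ARN below are placeholders):

import boto3

glue = boto3.client("glue")

optimizer_config = {
    # Placeholder role; it needs permission to rewrite the table's data files.
    "roleArn": "arn:aws:iam::123456789012:role/GlueTableOptimizerRole",
    "enabled": True,
}

try:
    # Create a compaction optimizer for the Iceberg table.
    glue.create_table_optimizer(
        CatalogId="123456789012",      # placeholder account id
        DatabaseName="analytics",      # placeholder Glue database
        TableName="my_iceberg_table",  # placeholder table name
        Type="compaction",
        TableOptimizerConfiguration=optimizer_config,
    )
except glue.exceptions.AlreadyExistsException:
    # Assumption: Glue raises AlreadyExistsException when an optimizer of this
    # type already exists; fall back to updating it instead.
    glue.update_table_optimizer(
        CatalogId="123456789012",
        DatabaseName="analytics",
        TableName="my_iceberg_table",
        Type="compaction",
        TableOptimizerConfiguration=optimizer_config,
    )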

Describe alternatives you've considered

You can just run OPTIMIZE {{ this.schema }}.{{ this.identifier }} ... in your post_hook, yes, but on a full refresh of a very large table (e.g. one requiring insert_by_period) this may fail due to a timeout or the Iceberg "not finished, please run compaction again" message. Regardless, I think it would be good to just let AWS handle it.
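
For reference, a minimal sketch of that alternative, assuming Athena's OPTIMIZE ... REWRITE DATA USING BIN_PACK statement (the hook string is rendered when the hook runs, so {{ this.schema }} / {{ this.identifier }} resolve to the model's relation):

{{config(
    materialized='incremental',
    table_type='iceberg',
    post_hook=[
        "OPTIMIZE {{ this.schema }}.{{ this.identifier }} REWRITE DATA USING BIN_PACK"
    ]
)}}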

Caveat: I haven't actually tried the automatic compaction feature, so I have no idea how it performs in practice. Maybe it just scans your entire table once a day and you get charged for 100 DPUs 😂.

Who will this benefit?

Anybody with large datasets in Iceberg. I would think there is quite a lot of overlap with users of insert_by_period.

Are you interested in contributing this feature?

Maybe, depending on how much work it would be.

Anything else?

#514 is somewhat related, in the realm of "table optimisation management".

antonysouthworth-halter added the feature New feature or request label on Apr 18, 2024
@nicor88
Member

nicor88 commented Apr 19, 2024

@antonysouthworth-halter regarding this:

You can just run OPTIMIZE {{ this.schema }}.{{ this.identifier }} ... in your post_hook, yes, but on a full refresh of a very large table (e.g. one requiring insert_by_period) this may fail due to a timeout or the Iceberg "not finished, please run compaction again" message. Regardless, I think it would be good to just let AWS handle it.

If you partition your table, I can share a "post-hook" that is ugly but does the job: it pretty much optimizes your table by partition values, using a batch size < 100 to avoid the partition limitation issue in Athena.

That said, what you describe is relevant, and should be relatively easy to implement:

  • we need a new method in impl.py that does a boto3 call to add automatic compaction
  • we expose the method via a macro that can then be called from the different table materializations (see the sketch below).
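
A rough sketch of how that could look, assuming a new @available method on the adapter in impl.py (the method name, parameters, and the direct boto3 client below are made up for illustration; the real implementation would reuse dbt-athena's existing session/client handling and the create/update fallback sketched in the issue description):

import boto3
from dbt.adapters.base import available
from dbt.adapters.sql import SQLAdapter


class AthenaAdapter(SQLAdapter):  # the existing adapter class in impl.py
    @available
    def add_iceberg_compaction_optimizer(self, database: str, table: str, role_arn: str) -> None:
        """Hypothetical method: enable Glue automatic compaction for an Iceberg table."""
        # Table optimizer APIs require the catalog (account) id.
        account_id = boto3.client("sts").get_caller_identity()["Account"]
        glue = boto3.client("glue")
        glue.create_table_optimizer(
            CatalogId=account_id,
            DatabaseName=database,
            TableName=table,
            Type="compaction",
            TableOptimizerConfiguration={"roleArn": role_arn, "enabled": True},
        )

A macro could then simply call adapter.add_iceberg_compaction_optimizer(this.schema, this.identifier, role_arn) from the relevant materializations whenever use_glue_automatic_compaction is set (the role ARN would presumably come from the profile or model config).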

@antonysouthworth-halter
Contributor Author

If you partition your table, I can share a "post-hook" that is ugly but does the job: it pretty much optimizes your table by partition values, using a batch size < 100 to avoid the partition limitation issue in Athena.

😮 I would love to see it! I think it might also be helpful for others who stumble upon this ticket.

@Jrmyy
Member

Jrmyy commented May 20, 2024

Also note that we now have a fix for the optimize post_hook in dbt-athena, which will retry if it hits the "optimize needs to be run again" issue, cf. here:

def run_optimize_with_partition_limit_catching(self, optimize_query: str) -> None:
