
Support incremental_strategy and data cleaning up with boto3 #27

Merged
merged 12 commits into Tomme:master from incremental_update on Oct 29, 2021

Conversation

tuan-seek (Contributor) commented Aug 10, 2021

Changes:

  • Add support for two incremental update strategies: 'insert_overwrite' and 'append'
  • For 'insert_overwrite', overlapping partitions are cleaned up from the S3 location (see the sketch below)
  • Works with tables partitioned by multiple keys; only tested with keys of types 'string' and 'integer'
  • Add support for 'external_location' by removing physical files on S3. This change overcomes the HIVE_PATH_ALREADY_EXISTS error.
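
Roughly, the 'insert_overwrite' cleanup works like the minimal sketch below (not the exact code in this PR; database, table, and expression values are illustrative):

import boto3

# Find the Glue partitions that overlap with the new data, then delete the
# matching objects from S3 so those partitions can be rewritten.
# Pagination of get_partitions is omitted for brevity.
glue_client = boto3.client("glue")
s3_resource = boto3.resource("s3")

response = glue_client.get_partitions(
    DatabaseName="my_database",
    TableName="my_table",
    Expression="(dt = '2021-08-10')",  # built from the model's partition keys
)

for partition in response["Partitions"]:
    # e.g. s3://my-bucket/tables/my_table/dt=2021-08-10/
    location = partition["StorageDescriptor"]["Location"]
    bucket_name, prefix = location.replace("s3://", "").split("/", 1)
    s3_resource.Bucket(bucket_name).objects.filter(Prefix=prefix).delete()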

@Tomme self-requested a review August 11, 2021 13:53
@roslovets

Looking forward to this PR being merged. I can't use this adapter while data in S3 is not being overwritten.

@Tomme mentioned this pull request Aug 17, 2021
glue_client = boto3.client('glue')
s3_resource = boto3.resource('s3')
partitions = glue_client.get_partitions(
# CatalogId='awsdatacatalog', # Using this caused permission error that 'glue:GetPartitions' is required
mrshu (Contributor)
@tuan-seek I am afraid this will need to be configurable

tuan-seek (Contributor, Author)
@mrshu Thanks for checking this out. Can you elaborate on the scenarios where you would need to configure this CatalogId, instead of using the default AWS account ID?

mrshu (Contributor)
@tuan-seek please take this with a grain of salt (the bigger the better).

It is my understanding that you can have multiple Glue Data Catalogs in one account. I don't think it happens often and by default the AWS Account ID is used -- that would probably explain why passing awsdatacatalog resulted in errors and why not passing it ended up working.

Here is what the docs for Glue's CreateDatabase method say:

The ID of the Data Catalog in which to create the database. If none is provided, the AWS account ID is used by default.

All of the other Glue Database methods take the CatalogId as a parameter, so having a non-default one does seem like a real possibility.

I wouldn't worry too much about it for now though.
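
If it ever becomes necessary, making the catalog configurable could look roughly like this minimal sketch (the function name and the catalog_id parameter are hypothetical, not part of this PR):

from typing import Optional

import boto3

def get_table_partitions(database: str, table: str, expression: str,
                         catalog_id: Optional[str] = None) -> dict:
    # When catalog_id is None, Glue falls back to the caller's AWS account ID,
    # which is the default behaviour this PR relies on.
    glue_client = boto3.client("glue")
    kwargs = {"DatabaseName": database, "TableName": table, "Expression": expression}
    if catalog_id is not None:
        kwargs["CatalogId"] = catalog_id  # only sent when explicitly configured
    return glue_client.get_partitions(**kwargs)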

tuan-seek (Contributor, Author)
Sorry @mrshu for my late response.

Yes, I agree with you that there is a possibility that a non-default one will be needed. Part of the reason for my late response was to take some time to investigate whether we have such a need in our systems, and I confirmed that all our use cases use the default one. Btw, we have quite a complex data platform, with a lot of data teams interacting with it from different AWS accounts and IAM settings.

Also for that reason, I agree with you that we don't need to worry too much about making this configurable now. We can always address it when we cross that bridge :)

{%- do partitions.append('(' + single_partition_expression + ')') -%}
{%- endfor -%}
{%- set partition_expression = partitions | join(' or ') -%}
delete from
Collaborator
Is DELETE FROM supported in Athena?
Running it manually on a table yields [ErrorCategory:USER_ERROR, ErrorCode:NOT_SUPPORTED], Detail:NOT_SUPPORTED: Cannot delete from non-managed Hive table in Athena (version 2).

Tomme (Owner), Sep 8, 2021
I believe these delete from statements are getting caught in the execute function and not actually being run.

I'm not a particular fan of this since it is not very clear and seems verbose. @tuan-seek is there a reason behind this pattern, why not just call _clean_up_partitions directly?

tuan-seek (Contributor, Author)
The short answer is, I don't know if there is a way to call _clean_up_partitions() directly from macros. All Athena queries and AWS SDK calls are delegated back to connections.py. So I followed the same pattern here.

DELETE FROM is currently not supported in Athena, so the statement is intercepted in connections.py for us to do the custom cleanup and is never issued to Athena. If AWS supports DELETE FROM in the future, we can simply remove this custom implementation.
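
For illustration, the interception pattern looks roughly like this sketch (not the actual connections.py code; the helper methods here are hypothetical):

import re

DELETE_FROM_PATTERN = re.compile(r"^\s*delete\s+from\s+", re.IGNORECASE)

class CursorSketch:
    # Illustrative only: shows the "intercept before Athena" idea, not the real
    # dbt-athena cursor implementation.
    def execute(self, sql: str, **kwargs):
        if DELETE_FROM_PATTERN.match(sql):
            # Athena cannot DELETE FROM an external table, so instead of sending
            # the statement we delete the matching partition data on S3 ourselves.
            self._clean_up_partitions(sql)
            return None
        return self._run_on_athena(sql, **kwargs)

    def _clean_up_partitions(self, sql: str) -> None:
        # Would parse the partition predicate out of `sql` and remove the matching
        # S3 objects via boto3, as _clean_up_partitions does in this PR.
        ...

    def _run_on_athena(self, sql: str, **kwargs):
        # Would hand the statement to the underlying Athena cursor.
        ...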

Tomme (Owner)
@tuan-seek If you add functions to the Adapter class you can call them within Jinja. The Adapter class can be found within impl.py and can be called as a macro like {% do adapter.your_function(your_parameters) %} or {{ adapter.drop_relation(existing_relation) }}.

I will add a review pointing out what would be good to migrate to the Adapter class.
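
For illustration, a minimal sketch of exposing such a function (assuming dbt's available decorator from dbt.adapters.base; the method name and signature are illustrative):

from dbt.adapters.base import available
from dbt.adapters.sql import SQLAdapter

class AthenaAdapter(SQLAdapter):
    @available
    def clean_up_partitions(self, database_name: str, table_name: str, where_condition: str) -> None:
        # Because of @available, a macro can call
        # {% do adapter.clean_up_partitions(database, table, where) %} from Jinja.
        # The body would use boto3, as in the sketches above.
        ...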

tuan-seek (Contributor, Author)
Sounds good. Let me try that one!

Tomme (Owner)
@tuan-seek Great, and a big thanks for this PR; the functionality it provides will be appreciated by many. Also, apologies for taking so long to review!

tuan-seek (Contributor, Author)
Likewise, thanks for starting this project. I've refactored the change as discussed. Please have a look when you have time @Tomme

@@ -57,6 +59,74 @@ def _collect_result_set(self, query_id: str) -> AthenaResultSet:
retry_config=self._retry_config,
)

def _clean_up_partitions(
Tomme (Owner)
I would advise migrating this to the AthenaAdapter class within impl.py and calling it directly within the incremental.sql file.

s3_bucket.objects.filter(Prefix=prefix).delete()


def _clean_up_table(
Tomme (Owner)
Similar to _clean_up_partitions, I would migrate this to impl.py, and it should be incorporated into the drop_relation function.
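
For illustration, folding this cleanup into drop_relation could look roughly like the sketch below (the S3 helper is hypothetical):

from dbt.adapters.sql import SQLAdapter

class AthenaAdapter(SQLAdapter):
    def drop_relation(self, relation) -> None:
        # Remove the relation's files from S3 before dropping it in the catalog,
        # so a later CREATE TABLE at the same external_location does not fail
        # with HIVE_PATH_ALREADY_EXISTS.
        self._delete_s3_files_for(relation)
        super().drop_relation(relation)

    def _delete_s3_files_for(self, relation) -> None:
        # Would look up the table's S3 location and delete the objects under it,
        # similar to _clean_up_table in this PR.
        ...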

tuan-seek (Contributor, Author)
Hi @Tomme, have you had time to review the latest change?

Tomme (Owner) commented Sep 27, 2021

> Hi @Tomme, have you had time to review the latest change?

@tuan-seek Planned for this week. Apologies for my slow replies and reviews.

Ojasvi Harsola and others added 2 commits October 8, 2021 23:26
Added the fix for Glue Crawler get_partitions where clause 2048 char limit
tuan-seek (Contributor, Author) commented Oct 25, 2021

Hi @Tomme, is there any blocker for this PR being merged? I'm planning some new feature changes and would prefer to do them in a new PR from master, instead of adding to this one.

Btw, I have updated the README to reflect the changes in this PR. Feel free to update the README as you like.

@Tomme merged commit 968cd10 into Tomme:master Oct 29, 2021
@tuan-seek deleted the incremental_update branch November 22, 2021 22:03