Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support s3_data_dir and s3_data_naming #50

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

chronitis
Copy link

Problem

Currently, the only options for determining where data ends up in S3 are to set s3_staging_bucket in the connection properties or set external_location on each model.

The s3_staging_bucket argument is ignored if the Athena workgroup already has a staging bucket configured (but is used by dbt seed, which always sets a location).

We have a (not uncommon?) layout of one S3 bucket for the results of all Athena queries from which objects are rapidly expired, and another with appropriate lifecycle configuration for storing created tables we want to keep.

Aside from setting external_location on every model (which isn't so easily composable across different profiles), there isn't a nice way to control S3 layout in dbt-athena as it stands.

Implementation

This adds two new (optional) connection options:

  • s3_data_dir: if set, the default root directory to create tables and seeds in (eg, s3://my-data-bucket/dbt/)
  • s3_data_naming: the strategy for naming subdirectories of s3_data_dir in which we'll actually store tables; two implemented options:
    • uuid: tables or seeds are saved to {s3_data_dir}/{uuid4}/ (this was how seed tables were already named)
    • schema_table: name according to the schema (ie, Athena database) and table like {s3_data_dir}/{schema}/{table}/

If s3_data_dir is unset, the behaviour should be unchanged:

  • Seed tables are saved to {s3_staging_dir}/tables/{uuid4}/
  • Tables are saved to external_location if set, and Athena's default otherwise.

README.md Outdated Show resolved Hide resolved
Copy link

@fbrueck fbrueck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some manual testing- looks good to me 👍 Looking forward to use it!

What are the contribution guidelines here? Can we merge to main?

fbrueck
fbrueck previously approved these changes Jan 4, 2022
@chronitis
Copy link
Author

Worth noting that if using a static external location (ie, s3_data_naming=schema_table), this is affected by #54, so after #43 non-incremental tables don't get cleaned up, so seeds get their data duplicated and non-incremental tables will fail with a warning that there is data in the prefix already.

Either reverting #43 or merging #49 should fix this. s3_data_naming=uuid isn't affected since it won't try and reuse old prefixes.

@chronitis
Copy link
Author

Rebased on 1.0.1

@jessedobbelaere
Copy link

jessedobbelaere commented May 25, 2022

That looks exactly what I was looking for. We want to move to a similar S3 structure, where we have a query bucket (s3_staging_dir) and a data bucket. With a lot of dbt deploys, the S3 bucket grows a lot because of the new UUID folders which we want to clean-up and set lifecycle policy rules on both the query results and parquet data.

@jessedobbelaere
Copy link

FYI, #49 has been merged (actually by #73) so this PR should be unblocked now? 😃

@mrshu
Copy link
Contributor

mrshu commented Jun 23, 2022

@Tomme is there anything preventing this from being merged? It does seem like a very useful addition 🙂

@VDFaller
Copy link

VDFaller commented Jul 8, 2022

Not sure if it helps if I comment to gain visibility on this approval. But my team is also waiting on this. If there's anything I can do to help I'd love to help.

Edit I did run it (s3_data_naming: schema_table, as well as s3_data_naming: uuid) and they both ran as expected.

@rumbin
Copy link

rumbin commented Oct 24, 2022

I'd also be interested in having this PR merged.

@mattiamatrix
Copy link

This PR is currently being review at dbt-athena/dbt-athena#4

@nicor88
Copy link

nicor88 commented Dec 1, 2022

This PR and its refactor is being merged in dbt-athena/dbt-athena#39 and available in dbt-athena-community==1.3.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants