
Read the data only once for incremental models using the merge strategy #455

Open
iconara opened this issue Oct 13, 2023 · 3 comments
Labels: feature (New feature or request)

iconara commented on Oct 13, 2023

Incremental models can in the worst case cost double, because the data is first written to a temp table and then read back and inserted into the destination. Even when the data shrinks between source and destination, the cost is higher than if the data were written to the destination directly. It would be great to have an option to skip the temp table in cases where it is not strictly needed, to reduce costs.
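
For illustration, the current flow is roughly equivalent to the two statements below, versus the single direct write that an opt-out would allow (the table names and filter are made up for the example):

```sql
-- Current flow (illustrative): the model SQL is materialized in full
-- into a temp table, then read back and written again into the
-- destination, so the data is scanned and written twice.
CREATE TABLE my_model__dbt_tmp AS
SELECT id, status, updated_at
FROM source_events
WHERE updated_at > TIMESTAMP '2023-10-01 00:00:00';

INSERT INTO my_model
SELECT id, status, updated_at
FROM my_model__dbt_tmp;

-- With an option to skip the temp table, the same result would be a
-- single scan and a single write:
INSERT INTO my_model
SELECT id, status, updated_at
FROM source_events
WHERE updated_at > TIMESTAMP '2023-10-01 00:00:00';
```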

nicor88 added the feature label on Oct 13, 2023
nicor88 (Member) commented on Oct 13, 2023

First of all, thank you for the issue and for raising such a relevant point. The behavior you observed is most likely because we use a tmp table to understand which columns to add in case of a schema change.
That said, I believe that when on_schema_change is ignore or fail, we can simply avoid the intermediate table. For sync_all_columns and append_new_columns we need to investigate further whether the tmp table can be avoided.
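
As a rough sketch of that branch (not the actual adapter code; `target_relation` and `sql` are assumed to be available in the materialization context, and the else branch is a placeholder):

```sql
{% set on_schema_change = config.get('on_schema_change', 'ignore') %}

{% if on_schema_change in ('ignore', 'fail') %}
  {# No schema comparison is needed, so the model SQL can be written
     straight into the destination, skipping the tmp table entirely. #}
  insert into {{ target_relation }}
  {{ sql }}
{% else %}
  {# sync_all_columns / append_new_columns currently still build the
     tmp table to compare schemas before writing. #}
  {% do run_tmp_table_flow() %}  {# placeholder for the existing logic #}
{% endif %}
```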

iconara (Author) commented on Jan 22, 2024

I think there may be a way to discover schema changes and avoid creating a temp table in all cases: there's a trick where running a SELECT with LIMIT 0 will return the result set schema but not execute the query (the query planner figures out that the query would always return zero rows, so it short-circuits execution and returns a full result set with metadata, but zero rows). This scans no data and is therefore free of charge (apart from Glue Data Catalog and S3 request overhead).
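
As a sketch, the probe could be wrapped in a macro like this (`get_query_columns` is a hypothetical name; `run_query` is dbt's built-in and returns an agate table for the zero-row result):

```sql
{% macro get_query_columns(sql) %}
  {# Run the model SQL with LIMIT 0: Athena plans the query and returns
     the result-set metadata, but scans no data. #}
  {% set probe_sql %}
    select * from ({{ sql }}) limit 0
  {% endset %}
  {% set result = run_query(probe_sql) %}
  {{ return(result.columns) }}
{% endmacro %}
```

One caveat: with zero rows, agate's inferred column types may be coarser than Athena's, so the dependable part of this probe is the column names; precise types would have to come from the query's result-set metadata.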

Instead of running a CTAS and inspecting the resulting table to determine whether the schema has changed, the SELECT … LIMIT 0 trick can be used. If the schema hasn't changed, which is the most common case, an INSERT INTO or MERGE can be run directly. If the schema has changed, the change can be applied first, and then the INSERT INTO or MERGE can be run.
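
Putting the pieces together, a minimal sketch of that flow (using the hypothetical get_query_columns macro above; adapter.get_columns_in_relation is a standard dbt adapter method):

```sql
{% set dest_names = adapter.get_columns_in_relation(target_relation)
                    | map(attribute='name') | list %}
{% set query_names = get_query_columns(sql) | map(attribute='name') | list %}

{% if query_names == dest_names %}
  {# Most common case: schema unchanged, write directly, no temp table. #}
  insert into {{ target_relation }} ({{ query_names | join(', ') }})
  select {{ query_names | join(', ') }} from ({{ sql }})
{% else %}
  {# Schema changed: apply the change first (e.g. alter table ... add
     columns), then run the same direct INSERT INTO or MERGE. #}
{% endif %}
```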

@nicor88 do you think this is feasible?

nicor88 (Member) commented on Jan 22, 2024

In theory the approach should work, but the core logic to detect schema changes is in the process_schema_changes macro (see the linked code; there is equivalent usage for merge statements).

I didn't investigate much, but it seems that process_schema_changes comes from dbt-core, and it expects a tmp relation, so we might need to override it with the logic that you propose:

  • select * from ({{sql}}) limit 0 to extract the column names and types
  • use the column names and types to decide what needs to be added/removed/synced based on the on_schema_change behavior (see the sketch below)
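
Something along these lines, as a rough sketch (the macro name and the diff logic are illustrative; run_query, adapter.get_columns_in_relation, and exceptions.raise_compiler_error are real dbt built-ins, while get_query_columns is the hypothetical LIMIT 0 probe from above):

```sql
{% macro athena__process_schema_changes(on_schema_change, sql, target_relation) %}
  {# Probe the model SQL with LIMIT 0 instead of materializing a tmp relation. #}
  {% set query_names = get_query_columns(sql) | map(attribute='name') | list %}
  {% set dest_names = adapter.get_columns_in_relation(target_relation)
                      | map(attribute='name') | list %}

  {% set added = query_names | reject('in', dest_names) | list %}
  {% set removed = dest_names | reject('in', query_names) | list %}

  {% if added or removed %}
    {% if on_schema_change == 'fail' %}
      {% do exceptions.raise_compiler_error(
           'Schema changed: added ' ~ added ~ ', removed ' ~ removed) %}
    {% elif on_schema_change == 'append_new_columns' %}
      {# run alter table ... add columns for the added columns #}
    {% elif on_schema_change == 'sync_all_columns' %}
      {# add the new columns and reconcile the removed ones #}
    {% endif %}
  {% endif %}
{% endmacro %}
```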
