
Read the data only once for incremental models using the merge strategy #455

Open
iconara opened this issue Oct 13, 2023 · 3 comments
Labels: feature (New feature or request)

iconara commented on Oct 13, 2023

Incremental models can in the worst case cost double, because the data is first written to a temp table and then read back and inserted into the destination. Even when the data shrinks between source and destination, the cost is higher than if the data were written to the destination directly. It would be great to have an option to skip the temp table in cases where it is not strictly needed, to reduce costs.
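
For illustration, the current flow is roughly equivalent to the two statements below, versus the single direct write that an opt-out would allow (the table names and filter are made up for the example):

```sql
-- Current flow (illustrative): the model SQL is materialized in full
-- into a temp table, then read back and written again into the
-- destination, so the data is scanned and written twice.
CREATE TABLE my_model__dbt_tmp AS
SELECT id, status, updated_at
FROM source_events
WHERE updated_at > TIMESTAMP '2023-10-01 00:00:00';

INSERT INTO my_model
SELECT id, status, updated_at
FROM my_model__dbt_tmp;

-- With an option to skip the temp table, the same result would be a
-- single scan and a single write:
INSERT INTO my_model
SELECT id, status, updated_at
FROM source_events
WHERE updated_at > TIMESTAMP '2023-10-01 00:00:00';
```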

nicor88 added the feature label on Oct 13, 2023
nicor88 (Member) commented on Oct 13, 2023

First of all, thank you for the issue and for raising such a relevant point. The behavior you observed is most likely because we use a tmp table to understand which columns to add in case of a schema change.
That said, I believe that when on_schema_change is ignore or fail, we can simply avoid the intermediate table. For sync_all_columns and append_new_columns we need to investigate further whether the tmp table can be avoided.
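
As a rough sketch of that branch (not the actual adapter code; `target_relation` and `sql` are assumed to be available in the materialization context, and the else branch is a placeholder):

```sql
{% set on_schema_change = config.get('on_schema_change', 'ignore') %}

{% if on_schema_change in ('ignore', 'fail') %}
  {# No schema comparison is needed, so the model SQL can be written
     straight into the destination, skipping the tmp table entirely. #}
  insert into {{ target_relation }}
  {{ sql }}
{% else %}
  {# sync_all_columns / append_new_columns currently still build the
     tmp table to compare schemas before writing. #}
  {% do run_tmp_table_flow() %}  {# placeholder for the existing logic #}
{% endif %}
```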

iconara (Author) commented on Jan 22, 2024

I think there may be a way to discover schema changes and avoid creating a temp table in all cases: there's a trick where running a SELECT with LIMIT 0 will return the result set schema but not execute the query (the query planner figures out that the query would always return zero rows, so it short-circuits execution and returns a full result set with metadata, but zero rows). This scans no data and is therefore free of charge (apart from Glue Data Catalog and S3 request overhead).
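
As a sketch, the probe could be wrapped in a macro like this (`get_query_columns` is a hypothetical name; `run_query` is dbt's built-in and returns an agate table for the zero-row result):

```sql
{% macro get_query_columns(sql) %}
  {# Run the model SQL with LIMIT 0: Athena plans the query and returns
     the result-set metadata, but scans no data. #}
  {% set probe_sql %}
    select * from ({{ sql }}) limit 0
  {% endset %}
  {% set result = run_query(probe_sql) %}
  {{ return(result.columns) }}
{% endmacro %}
```

One caveat: with zero rows, agate's inferred column types may be coarser than Athena's, so the dependable part of this probe is the column names; precise types would have to come from the query's result-set metadata.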

Instead of running a CTAS and inspecting the resulting table to determine whether the schema has changed, the SELECT … LIMIT 0 trick can be used. If the schema hasn't changed, which is the most common case, an INSERT INTO or MERGE can be run directly. If the schema has changed, the change can be applied first, and then the INSERT INTO or MERGE can be run.
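
Putting the pieces together, a minimal sketch of that flow (using the hypothetical get_query_columns macro above; adapter.get_columns_in_relation is a standard dbt adapter method):

```sql
{% set dest_names = adapter.get_columns_in_relation(target_relation)
                    | map(attribute='name') | list %}
{% set query_names = get_query_columns(sql) | map(attribute='name') | list %}

{% if query_names == dest_names %}
  {# Most common case: schema unchanged, write directly, no temp table. #}
  insert into {{ target_relation }} ({{ query_names | join(', ') }})
  select {{ query_names | join(', ') }} from ({{ sql }})
{% else %}
  {# Schema changed: apply the change first (e.g. alter table ... add
     columns), then run the same direct INSERT INTO or MERGE. #}
{% endif %}
```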

@nicor88 do you think this is feasible?

nicor88 (Member) commented on Jan 22, 2024

In theory the approach should work, but the core logic to detect schema changes is in the process_schema_changes macro (see the linked code; there is equivalent usage for merge statements).

I didn't investigate much, but it seems that process_schema_changes comes from dbt-core, and it expects a tmp relation, so we might need to override it with the logic that you propose:

  • select * from ({{sql}}) limit 0 to extract the column names and types
  • use the column names and types to decide what needs to be added/removed/synced based on the on_schema_change behavior (see the sketch below)
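
Something along these lines, as a rough sketch (the macro name and the diff logic are illustrative; run_query, adapter.get_columns_in_relation, and exceptions.raise_compiler_error are real dbt built-ins, while get_query_columns is the hypothetical LIMIT 0 probe from above):

```sql
{% macro athena__process_schema_changes(on_schema_change, sql, target_relation) %}
  {# Probe the model SQL with LIMIT 0 instead of materializing a tmp relation. #}
  {% set query_names = get_query_columns(sql) | map(attribute='name') | list %}
  {% set dest_names = adapter.get_columns_in_relation(target_relation)
                      | map(attribute='name') | list %}

  {% set added = query_names | reject('in', dest_names) | list %}
  {% set removed = dest_names | reject('in', query_names) | list %}

  {% if added or removed %}
    {% if on_schema_change == 'fail' %}
      {% do exceptions.raise_compiler_error(
           'Schema changed: added ' ~ added ~ ', removed ' ~ removed) %}
    {% elif on_schema_change == 'append_new_columns' %}
      {# run alter table ... add columns for the added columns #}
    {% elif on_schema_change == 'sync_all_columns' %}
      {# add the new columns and reconcile the removed ones #}
    {% endif %}
  {% endif %}
{% endmacro %}
```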
