feat: Add SCD Type 2 Column Support #1997

Merged
eakmanrq merged 8 commits into main from eakmanrq/add_scd_type_2_column_support
Feb 9, 2024

Conversation

@eakmanrq eakmanrq commented Jan 21, 2024

Prior to this PR, you could only create an SCD Type 2 table if the source table had an "Updated At" timestamp. This PR makes it possible to create an SCD Type 2 table from any source by checking whether specific columns have changed. Since there is no longer an "Updated At" value to tell you when a change was made, it uses execution_time instead. As a result, you can think of it as a less precise approach to SCD Type 2.
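To make the column-based idea concrete, here is a minimal Python sketch. It is not the code from this PR: the function name `apply_scd2_by_column`, the dict-based rows, and the choice to close a record when its key disappears from the source are all assumptions made for illustration. The point it shows is that, with no "Updated At" available, the execution time is the only timestamp that can mark when a change was observed.

```python
from datetime import datetime


def apply_scd2_by_column(existing, source, key, check_columns, execution_time):
    """Merge the current source rows into an SCD Type 2 history (illustrative only).

    existing: history rows (dicts) carrying "valid_from" and "valid_to"
    source: current rows (dicts) from the source table
    key: name of the unique key column
    check_columns: columns whose values are compared to detect a change
    execution_time: timestamp recorded for any change detected in this run
    """
    source_by_key = {row[key]: row for row in source}
    result = []
    still_open = set()

    for row in existing:
        row = dict(row)  # copy so the caller's history is not mutated
        if row["valid_to"] is None:
            current = source_by_key.get(row[key])
            # Assumption: a key missing from the source also closes the record.
            changed = current is None or any(
                current[col] != row[col] for col in check_columns
            )
            if changed:
                row["valid_to"] = execution_time
            else:
                still_open.add(row[key])
        result.append(row)

    # Open a new record for every key that is new or whose open record was closed.
    for k, current in source_by_key.items():
        if k not in still_open:
            result.append({**current, "valid_from": execution_time, "valid_to": None})
    return result


run_1 = datetime(2024, 1, 1)
run_2 = datetime(2024, 1, 2)
history = apply_scd2_by_column([], [{"id": 1, "price": 20}], "id", ["price"], run_1)
history = apply_scd2_by_column(history, [{"id": 1, "price": 21}], "id", ["price"], run_2)
```

After the second run, the record with price 20 is closed at `run_2` and a new open record with price 21 starts at `run_2`. The actual change could have happened any time between the two runs, which is why this is less precise than the by-time approach.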

This PR includes support for both native SQLMesh and the dbt adapter.

One challenge when testing the dbt runtime is that dbt doesn't allow freezing "now()" (their execution time). Previously I had a simple way of patching their "now()" with the frozen now, but it had a bug, which I fixed in this PR. That bug is actually what created the perceived behavior difference between dbt and SQLMesh, so now they appear to behave the same.

I decided to leave the SCD_TYPE_2 model kind in place and alias it to SCD_TYPE_2_BY_TIME, which is why this PR is not a breaking change. To me this seems fair: by time is the recommended and default approach, so having the unqualified version of the name point to what we recommend seems fine. If others disagree, I can remove the alias and make this a breaking change.
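As a rough illustration of how such an alias keeps existing projects working, the sketch below resolves the unqualified name to the by-time kind. The names `KIND_ALIASES`, `KNOWN_KINDS`, and `normalize_kind` are hypothetical and are not taken from this PR's diff; only the three kind names come from the description above.

```python
# Hypothetical lookup tables; only the kind names themselves come from the PR text.
KIND_ALIASES = {"SCD_TYPE_2": "SCD_TYPE_2_BY_TIME"}
KNOWN_KINDS = {"SCD_TYPE_2_BY_TIME", "SCD_TYPE_2_BY_COLUMN"}


def normalize_kind(name: str) -> str:
    """Resolve a user-supplied kind name to its canonical form."""
    canonical = name.strip().upper()
    canonical = KIND_ALIASES.get(canonical, canonical)
    if canonical not in KNOWN_KINDS:
        raise ValueError(f"Unknown model kind: {name}")
    return canonical


# A model that was written before this PR keeps resolving to the by-time kind.
legacy_kind = normalize_kind("SCD_TYPE_2")
```

With a mapping like this, existing models that declare SCD_TYPE_2 need no changes, while new models can opt into either qualified name.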

@eakmanrq eakmanrq force-pushed the eakmanrq/add_scd_type_2_column_support branch 4 times, most recently from 6ab2385 to 3276825 Compare January 21, 2024 21:44
@eakmanrq eakmanrq requested a review from a team January 22, 2024 19:34
@plaflamme
Contributor

@eakmanrq we have a slightly similar use-case, but where the "updated at" timestamp could still be inferred from the source.

The source table is a daily snapshot of raw data that changes over time. It has a "snapshot_date" timestamp. This column doesn't tell you whether the record changed (so it doesn't fulfill the updated_at semantics), but it could be used for the resulting valid_from / valid_to column values. For example:

name,price,snapshot_date
foo,20,2024-01-01
foo,20,2024-01-02
foo,21,2024-01-03
foo,21,2024-01-04
food,21,2024-01-05

We can build an SCD Type 2 table from this and use snapshot_date instead of execution_time (if I understand correctly). The expected result would be:

name,price,valid_from,valid_to
foo,20,2024-01-01,2024-01-02
foo,21,2024-01-03,2024-01-04
food,21,2024-01-05,NULL

Would it make sense to add support for this case out of the box? Perhaps this is already supported by the new kind?
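For reference, the transformation described above can be sketched in plain Python. This is only an illustration of the expected semantics, not something SQLMesh provides: the function name `snapshots_to_scd2` and its parameters are made up, and it assumes that a record stays open (valid_to is NULL) only if it appears in the most recent snapshot.

```python
import csv
from io import StringIO
from itertools import groupby

SNAPSHOTS = """name,price,snapshot_date
foo,20,2024-01-01
foo,20,2024-01-02
foo,21,2024-01-03
foo,21,2024-01-04
food,21,2024-01-05
"""


def snapshots_to_scd2(rows, key="name", check_columns=("price",), date_column="snapshot_date"):
    """Collapse daily snapshot rows into SCD Type 2 records (illustrative only)."""
    if not rows:
        return []
    # ISO dates sort correctly as strings, so no date parsing is needed here.
    rows = sorted(rows, key=lambda r: (r[key], r[date_column]))
    latest_snapshot = max(r[date_column] for r in rows)

    def version(r):
        return (r[key], tuple(r[c] for c in check_columns))

    history = []
    # groupby only merges consecutive rows, so a value that changes and later
    # changes back produces separate records.
    for _, run in groupby(rows, key=version):
        run = list(run)
        first, last = run[0], run[-1]
        # Assumption: a record is still open only if it is in the latest snapshot.
        valid_to = None if last[date_column] == latest_snapshot else last[date_column]
        record = {key: first[key]}
        record.update({c: first[c] for c in check_columns})
        record["valid_from"] = first[date_column]
        record["valid_to"] = valid_to
        history.append(record)
    return history


scd2_rows = snapshots_to_scd2(list(csv.DictReader(StringIO(SNAPSHOTS))))
```

Run against the sample data, this yields the three records listed in the expected result, with valid_to set to the last snapshot date on which each version was seen.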

@eakmanrq eakmanrq force-pushed the eakmanrq/add_scd_type_2_column_support branch from 3276825 to 5030084 Compare January 23, 2024 21:13

eakmanrq commented Jan 23, 2024

Thanks @plaflamme for sharing this use case. Basically, you want to create an SCD Type 2 table out of a snapshot table, and that use case makes sense. There is already a bool called updated_at_as_valid_from, but it is exclusive to SCD_TYPE_2_BY_TIME. Conceptually that makes sense, since if you have an updated_at you would normally just use BY_TIME, but you have a good example here where you want BY_COLUMN and still want to use an updated_at value.

I like the idea of adding support for this, but I'm leaning towards doing it in another PR because this one is already complex enough.

@plaflamme
Contributor

@eakmanrq Great! Glad to hear this might be a useful addition.

Another PR sounds like a good idea. I'll open a separate issue so this can be tracked separately; feel free to close it if it's not useful.

@eakmanrq eakmanrq force-pushed the eakmanrq/add_scd_type_2_column_support branch from 5030084 to b33b498 Compare January 23, 2024 21:30
@eakmanrq eakmanrq force-pushed the eakmanrq/add_scd_type_2_column_support branch 4 times, most recently from f001248 to b61e094 Compare January 28, 2024 22:12
@eakmanrq eakmanrq force-pushed the eakmanrq/add_scd_type_2_column_support branch from 9a76e6f to e1459f9 Compare January 31, 2024 21:13
@CLAassistant

CLAassistant commented Jan 31, 2024

CLA assistant check
All committers have signed the CLA.

@eakmanrq eakmanrq force-pushed the eakmanrq/add_scd_type_2_column_support branch from e1459f9 to 4c6f23b Compare January 31, 2024 21:18
Contributor

@tobymao tobymao left a comment

z

@eakmanrq eakmanrq force-pushed the eakmanrq/add_scd_type_2_column_support branch from 5c79bcd to 85a9961 Compare February 1, 2024 21:28
@eakmanrq eakmanrq force-pushed the eakmanrq/add_scd_type_2_column_support branch 5 times, most recently from 254d31a to f8789ab Compare February 7, 2024 02:52
@eakmanrq eakmanrq force-pushed the eakmanrq/add_scd_type_2_column_support branch from 97aa0d5 to 5934c61 Compare February 8, 2024 22:58
@eakmanrq eakmanrq enabled auto-merge (squash) February 9, 2024 00:09
@eakmanrq eakmanrq merged commit 3397645 into main Feb 9, 2024
@eakmanrq eakmanrq deleted the eakmanrq/add_scd_type_2_column_support branch February 9, 2024 00:15

5 participants