Collisions in dbt_scd_id while calculating snapshots #160

Closed
rafaelkrysciak opened this issue May 19, 2024 · 3 comments · Fixed by #168
Labels
bug Something isn't working

Comments

@rafaelkrysciak

rafaelkrysciak commented May 19, 2024

Describe the bug

The snapshot calculation relies on the Teradata HASHROW function. The dbt_scd_id is generated for each row from the provided unique_key and the current timestamp. However, HASHROW produces a 4-byte (32-bit) hash, which is highly prone to collisions. For instance, the values d3dadd49420542fb49ffbf6a77349b45 and 34f325fe5a4216f27357328b61c9eccb both produce the hash 02-27-E3-B4. Similarly, the numbers 162181727 and 880145039 both produce the hash 2E-5B-FE-DD. In a source with 36 million numbers, we get over 180 thousand duplicate dbt_scd_id values.

These collisions cause the snapshot update to fail with the error: [Error 7547] Target row updated by multiple source rows.
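For a rough sense of scale, the birthday approximation can be applied to the figures in this report. The sketch below assumes the 32-bit hash is uniformly distributed (the real HASHROW output may not be) and compares it with a 128-bit digest such as MD5. The function name is made up for illustration and is not part of dbt or Teradata.

```python
def expected_colliding_pairs(n_keys: int, hash_bits: int) -> float:
    """Birthday approximation for a uniform hash (an assumption).

    There are n*(n-1)/2 pairs of keys, and each pair collides with
    probability 1 / 2**hash_bits.
    """
    space = 2 ** hash_bits
    return n_keys * (n_keys - 1) / (2 * space)


N_KEYS = 36_000_000  # source size quoted in the report

# 4-byte hash, as described for HASHROW in the report
pairs_32 = expected_colliding_pairs(N_KEYS, 32)
# 16-byte hash, e.g. an MD5 digest
pairs_128 = expected_colliding_pairs(N_KEYS, 128)

print(f"32-bit:  ~{pairs_32:,.0f} expected colliding pairs")
print(f"128-bit: ~{pairs_128:.3e} expected colliding pairs")
```

Under these assumptions the 32-bit estimate lands in the hundred-thousands for 36 million keys, the same order of magnitude as the duplicates reported here, while the 128-bit estimate is effectively zero.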

Steps To Reproduce

Create a source with the provided values as IDs and then try to create a snapshot of them.

Expected behavior

Calculating the snapshot without errors.

Screenshots and log output

The output of dbt --version:

Core:
  - installed: 1.7.11
  - latest:    1.8.0  - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - teradata: 1.7.2 - Up to date!

The operating system you're using:
Windows 11

The output of python --version:
Python 3.11.3

Additional context

rafaelkrysciak added the bug label May 19, 2024
@datenbaecker

I had the same problem; it would be nice if this could be fixed! @rafaelkrysciak, in the meantime you can override the macro that generates the hash:

{% macro teradata__snapshot_hash_arguments(args) -%}
    usrlib.hash_md5({%- for arg in args -%}
        coalesce(cast({{ arg }} as varchar(200)), '')
        {% if not loop.last %} || '|' || {% endif %}
    {%- endfor -%})
{%- endmacro %}

You may have to install the hash_md5 function (called here as usrlib.hash_md5) on your system first.
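To make the macro's logic easier to follow, here is a minimal Python analogue of the expression it renders: replace NULL with an empty string, join the arguments with '|', and take the MD5 of the result. The function name snapshot_hash is hypothetical, and the sketch does not reproduce Teradata's cast formatting or the varchar(200) limit, so its digests are not guaranteed to match what usrlib.hash_md5 returns.

```python
import hashlib
from typing import Optional, Sequence


def snapshot_hash(args: Sequence[Optional[object]]) -> str:
    """Hypothetical analogue of the macro above, not the adapter's code."""
    # coalesce(cast(arg as varchar), ''): NULL becomes an empty string
    parts = ["" if arg is None else str(arg) for arg in args]
    # || '|' ||: the separator keeps ('ab', 'c') distinct from ('a', 'bc')
    joined = "|".join(parts)
    # MD5 yields a 128-bit digest, shown as 32 hex characters
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# The two numbers from the report that collide under the 4-byte hash
ts = "2024-05-19 00:00:00"  # placeholder timestamp for illustration
print(snapshot_hash([162181727, ts]))
print(snapshot_hash([880145039, ts]))
```

The separator matters: without it, different argument lists could concatenate to the same string and hash identically regardless of digest width.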

@rafaelkrysciak
Author

Thanks @datenbaecker. It works fine 👍

@tallamohan
Contributor

@rafaelkrysciak, the fix for this issue is available in the dbt-teradata 1.8.0 release.
