Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Link record is NULL and not loaded when some FKs are NULL #65

Closed
koillinengit opened this issue Sep 22, 2021 · 12 comments
Closed
Assignees
Labels
bug Something isn't working needs documentation Not a bug. Needs guidance in dbtvault documentation

Comments

@koillinengit
Copy link

koillinengit commented Sep 22, 2021

Describe the bug
Link record is NULL and not loaded when some FKs are NULL

Versions

dbt: 0.20.2
dbtvault: 0.7.7
snowflake: 5.34.1

To Reproduce
Steps to reproduce the behavior:

  1. Go to hashed view ddl
  2. See how hashkeys are generated: CAST(MD5_BINARY(NULLIF(CONCAT_WS('||', IFNULL(NULLIF(UPPER(TRIM(CAST(adrnr__address AS VARCHAR))), ''), '^^') ), '^^')) AS BINARY(16)) AS ADDRESS_HK
  3. It seems that snowflake returns null hashkey when there null value in foreign key field.
  4. Because in link populating sql's all hashkeys are checked that they are not null, the rows that have null foreign keys does not come to link table at all:
WITH row_rank_1 AS (
    SELECT lnk_sales_document_partner_hk, sales_document_hk, sales_document_partner_hkd, address_hk, customer_hk, sales_document_item_hk, dv_load_date, dv_record_source,
           ROW_NUMBER() OVER(
               PARTITION BY lnk_sales_document_partner_hk
               ORDER BY dv_load_date
           ) AS row_number
    FROM xxx.xxx.hashed_vbpa__sales_document_partner
    WHERE lnk_sales_document_partner_hk IS NOT NULL
    AND sales_document_hk IS NOT NULL
    AND sales_document_partner_hkd IS NOT NULL
    AND address_hk IS NOT NULL // <--- This
    AND customer_hk IS NOT NULL
    AND sales_document_item_hk IS NOT NULL
    QUALIFY row_number = 1
),
records_to_insert AS (
    SELECT a.lnk_sales_document_partner_hk, a.sales_document_hk, a.sales_document_partner_hkd, a.address_hk, a.customer_hk, a.sales_document_item_hk, a.dv_load_date, a.dv_record_source
    FROM row_rank_1 AS a
)
SELECT count(1) FROM records_to_insert
@koillinengit koillinengit added the bug Something isn't working label Sep 22, 2021
@DVAlexHiggs DVAlexHiggs changed the title [BUG] [BUG] Link record is NULL and not loaded when some FKs are NULL Sep 22, 2021
@DVAlexHiggs
Copy link
Member

DVAlexHiggs commented Sep 22, 2021

Thanks for this report, I will investigate this and get back to you shortly

P.S I edited your report but only for formatting and such to make it easier to read.

@koillinengit
Copy link
Author

koillinengit commented Sep 22, 2021

I edited it also. Because problem was not the amount of hat signs. It seems to be how the ifnull and nullif functions are used. Almost like hashkey is wanted to be null, when the actual foreign key value is null. Is this intended? How you should model those links then, if you cannot add those nullable foreign keys in there and the link row is completely ignored (even if there is some other foreign keys that should be active)?

@DVAlexHiggs
Copy link
Member

DVAlexHiggs commented Sep 22, 2021

Hello. I've talked to our internal Data Vault 2.0 expert on this. Let me explain.

For a Link, the business will likely have an idea of what keys (foreign or primary) are necessary for business meaning, and which are not. This is the concept of mandatory and optional keys, i.e. values the business requires and values the business doesn't mind being NULL.

We don't want to load any records into a link which are truly NULL, hence why we have the logic in your initial report above, which generates IF NOT NULL for foreign and primary keys. However, we may want to ensure that missing keys do not prevent the loading of a Link record if they are NULL. To do this, we substitute these values with a token, usually 11111... for keys which are mandatory and 22222... for keys which are optional. These tokens indicate that the value is missing, but because it is not a NULL value, the Links will load them.

We provide these tokens as part of the staging process, and not as part of the Link logic itself. This allows us to still handle the base or 'simple' case of not loading a Link record if any keys are NULL.

The above explanation is Data Vault 2.0 Best practise, and dbtvault does support this approach, however our documentation/guidance on this is missing, and we would like to add helper macros, syntactic sugar and make this approach easier for the user.

Whilst not a bug, this will be an issue for many users, and is something we want to improve. I will leave this issue here to help other users, and as a way of tracking changes aimed at improving this functionality. Thank you for the report and hope this helps!

@koillinengit
Copy link
Author

Thanks, I think this is adequate explanation for me at least. I checked the datavault material. In my material the values were -1 and -2:

Dealing with Null Business Keys
HANDLING NULL BUSINESS KEYS

Null Business Key Arrival
• Hash = BK Arrives NULL but is Required
— Replace BK with -1

• Hash = BK Arrives NULL but is Optional
— Replace BK with -2

• ZERO KEY is the concept, but it should never be "ZERO"
that is used.

• Pre-Populate all Tables in the Data Vault with two zero
key records to set "earliest" load-dates possible

@DVAlexHiggs DVAlexHiggs added the needs documentation Not a bug. Needs guidance in dbtvault documentation label Sep 22, 2021
@koillinengit
Copy link
Author

I have to get back little bit in this one. I have an issue because the dbtvault hashing is converting empty strings as null. Because for example in SAP, empty string could actually mean something. At the moment I'm not converting empty strings as zero keys, but dbtvault hashing is interpreting it as null and not showing those values in link table. I can get round of this, if I implement the same empty string conversion as you have. But then the auditability is lost. Because the empty string key might have in source system some actual meaning and description. Now it is converted as zero key, which meaning is totally different...

@DVAlexHiggs
Copy link
Member

DVAlexHiggs commented Sep 27, 2021

I have to get back little bit in this one. I have an issue because the dbtvault hashing is converting empty strings as null. Because for example in SAP, empty string could actually mean something. At the moment I'm not converting empty strings as zero keys, but dbtvault hashing is interpreting it as null and not showing those values in link table. I can get round of this, if I implement the same empty string conversion as you have. But then the auditability is lost. Because the empty string key might have in source system some actual meaning and description. Now it is converted as zero key, which meaning is totally different...

In Data Vault 2.0, this is the standard, An empty string generally means absolutely nothing, and they often imply data quality issues. If they do mean something, you should convert them to tokens/codes prior to staging. This would be a layer (tables or views) between your raw stage and the prepared stage (dbtvault staging). This would mean that by the time these values get to the dbtvault staging layer for hashing, they are no longer empty strings. To clarify, you should give the empty strings the same business meaning in Data Vault, ensuring they have codes for these different business meanings. This also solves the issue of empty strings meaning different things in different places; in this world, you would just have different codes instead.

@koillinengit
Copy link
Author

koillinengit commented Sep 27, 2021

Ok. Probably I have to add -3 for empty string as a token. Then I can preserve the auditability. Thanks for clarification.

@DatavaultUK
Copy link
Member

DatavaultUK commented Sep 27, 2021 via email

@koillinengit
Copy link
Author

koillinengit commented Sep 27, 2021

It is not missing value. It is empty string which has actual key attach to it. It could be something like id='' and description='Description for empty string'. If I use the same values as for null, auditability is broken. Because null means that value is not given at all and the empty string has real world values attached to it. That's why I need a third token as Alex suggested.

@DatavaultUK
Copy link
Member

DatavaultUK commented Sep 27, 2021 via email

@DVAlexHiggs
Copy link
Member

The token approach should only be used for cases where you have genuine missing values. For actual values that are empty strings you should use reference data/lookups which Neil is getting at, or apply business rules downstream.

@DVAlexHiggs
Copy link
Member

Closing due to age and resolution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs documentation Not a bug. Needs guidance in dbtvault documentation
Projects
None yet
Development

No branches or pull requests

3 participants