Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Hub natural keys as strings #135

Open
abettsmp opened this issue Jul 12, 2022 · 6 comments
Open

[FEATURE] Hub natural keys as strings #135

abettsmp opened this issue Jul 12, 2022 · 6 comments
Assignees
Labels
feature This is is requesting a new feature

Comments

@abettsmp
Copy link

abettsmp commented Jul 12, 2022

Is your feature request related to a problem? Please describe.
The current release of dbtvault creates the natural keys in the hub based on the type exposed from staging view. If the type of this column is an integer (or other restrictive type) then the hub cannot be the target for sources with broader types for their representation of the natural key. As an example first system uses some id type column (integer) for company entities and then you wish to bring in SalesForce company type entities which represents the natural key as a guid (or similar).

Describe the solution you'd like
Default hub macro to force all fields of the natural key of a hub to be a string type.

Describe alternatives you've considered
If this is an issue for backward compatibility, expose the functionality via a flag/parameter.

Additional context
taken from: Cuba, Patrick. The Data Vault Guru: a pragmatic guide on building a data vault (p. 99). Kindle Edition.

Cast to text Statement: all business keys in a data warehouse or a data lake must be stored as text fields.

There are additional, and I believe valid, reasons given in this section.

AB#5373

@abettsmp abettsmp added the feature This is is requesting a new feature label Jul 12, 2022
@DVAlexHiggs
Copy link
Member

DVAlexHiggs commented Dec 17, 2022

Hi. Apologies for the very late response on this. I thought I had replied.

Can I ask if you've tried derived columns? You can cast columns to whatever you want using derived columns.

Here's an example

Generally speaking you shouldn't be modifying the types on the way in to the Raw Vault, but I do understand there's sometimes a need.

I can see how this could be easily implemented as a config flag in the hub though, if that seems more sensible to you?

@FrenkLoonen
Copy link

Hi Alex, just to chime in here.

In my project, where we generate all dbtvault config files, we have indeed baked it in to the derived columns. But as you said, it would be nicer to have something like a config flag, as we must make sure that it is indeed applied to all Hub BKs.

Although, if you build it in to the Hub macro, it might not be really visible to the end user. You could perhaps also make it similar to the null_columns, maybe something like string_columns?

In any case, this feature gets my upvote :)

@DVAlexHiggs
Copy link
Member

DVAlexHiggs commented Dec 18, 2022

In that case, I'm thinking a typed_columns/cast_columnssection:

typed_columns:
  string_columns:
    - CUSTOMER_ID
  integer_columns:
    - STOCK_QUANTITY

etc.. We can make it string columns only for now and expand on it, but this could be a solution for all typing issues/rules long-term.

Perhaps uneccesary to do anything other than strings though? Not sure

@DVAlexHiggs
Copy link
Member

DVAlexHiggs commented Dec 18, 2022

Additionally, we could have a force_hub_bks_to_string global config as a way to do it across all hubs?

The reason I suggest a hub config option though is because these can be provided in YAML properties files and dbt_project.yml

All this config option would do is trigger a CAST on the BK insider the Hub SQL. So, in that regard it would be visible.

@FrenkLoonen
Copy link

Not sure if I can think of a need for integer columns right now. But maybe date columns? Based on a certain format. On my current project I have to deal with all types of date formatting that are being delivered as strings.

DD-MM-YYYY
DD-MMM-YYYY
YYYY.MM.DD
And the list goes on and on..

typed_columns:
  string_columns:
    - CUSTOMER_ID
  date_columns:
    - SHIPPING_DATE: DD-MM-YYYY
    - ORDER_DATE: YYYY.MM.DD

@FrenkLoonen
Copy link

Additionally, we could have a force_hub_bks_to_string global config as a way to do it across all hubs?

The reason I suggest a hub config option though is because these can be provided in YAML properties files and dbt_project.yml

All this config option would do is trigger a CAST on the BK insider the Hub SQL. So, in that regard it would be visible.

This sounds perfect!

@DVAlexHiggs DVAlexHiggs added feature This is is requesting a new feature and removed feature This is is requesting a new feature labels May 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature This is is requesting a new feature
Projects
None yet
Development

No branches or pull requests

3 participants