Staging

This macro creates the staging layer for the Data Vault model. The layer's main purpose is hashing, but it also offers options to create derived columns, perform prejoins, and add NULL-valued columns in place of missing ones. Always create one stage per source table that you want to add to the Data Vault model. The staging layer is not meant to harmonize data; that happens in later layers.

| Parameter | Data Type | Explanation |
|---|---|---|
| ldts | string | Name of the column inside the source data that holds the Load Date Timestamp. Can also be a SQL expression. |
| rsrc | string | Name of the column inside the source data that holds the Record Source. Can also be a SQL expression or a static string. A static string must begin with a '!'. |
| source_model | string \| dictionary | Can be a plain string holding the name of the referred dbt model to use as a source. If the 'source' functionality inside the .yml file is used, it must be a dictionary of the form 'source_name': 'source_table'. |
| include_source_columns | boolean | Defines whether all columns from the referred source table should be included in the result table, or only the added columns. By default, the source columns are included. |
| hashed_columns | dictionary | Defines the names and inputs of all hashkeys and hashdiffs to create. The key of each entry is the name of the hash column. For hashkeys, the value is a list of input business keys; for hashdiffs, it is another dictionary with the keys 'is_hashdiff: true' and 'columns:', plus optionally 'use_rtrim: true' if the hashdiff inputs should be trimmed on the right side (for now, this option is only available in the Snowflake adapter). |
| derived_columns | dictionary | Learn more about Derived Columns [here](https://github.com/ScalefreeCOM/datavault4dbt/wiki/Derived-Columns). |
| sequence | string | Name of the column inside the source data that holds a sequence number generated during the data source extraction process. Optional. |
| prejoined_columns | dictionary | Learn more about Prejoining [here](https://github.com/ScalefreeCOM/datavault4dbt/wiki/Prejoining). |
| missing_columns | dictionary | If the schema of the source changes over time and columns disappear, this parameter lets you create additional NULL-valued columns that replace the columns that were previously there. This way, hashdiff calculations and satellite payloads won't break. The dictionary holds the names of those columns as keys and their SQL datatypes as values. |
| multi_active_config | dictionary | If the source data holds multi-active data, define here the column(s) holding the multi-active key and the main hashkey column. If the source data is multi-active but has no natural multi-active key, create one using the ROW_NUMBER SQL function (or similar) one layer before the stage, and insert the name of that artificial column into the 'multi_active_key' parameter. The combination of the multi-active key(s), the main hashkey, and the ldts column should be unique in the final satellite. If not set, the stage is treated as single-active. |
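
For orientation, a minimal stage that uses only the most common parameters could look like the sketch below. The model name 'source_account' and the column names are placeholders; the static record source '!crm_system' is an illustrative assumption that shows the leading-'!' convention from the table above.

    {{ config(materialized='view') }}

    {%- set yaml_metadata -%}
    source_model: 'source_account'
    ldts: 'edwLoadDate'
    rsrc: '!crm_system'   # static record source string, marked by the leading '!'
    hashed_columns:
        hk_account_h:
            - account_number
    {%- endset -%}

    {%- set metadata_dict = fromyaml(yaml_metadata) -%}

    {{ datavault4dbt.stage(source_model=metadata_dict.get('source_model'),
                           ldts=metadata_dict.get('ldts'),
                           rsrc=metadata_dict.get('rsrc'),
                           hashed_columns=metadata_dict.get('hashed_columns')) }}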

Example 1

    {{ config(materialized='view') }}

    {%- set yaml_metadata -%}
    source_model: 'source_account'
    ldts: 'edwLoadDate'
    rsrc: 'edwRecordSource'
    hashed_columns: 
        hk_account_h:
            - account_number
            - account_key
        hd_account_s:
            is_hashdiff: true
            columns:
                - name
                - address
                - phone
                - email
    derived_columns:
        conversion_duration:
            value: 'TIMESTAMP_DIFF(conversion_date, created_date, DAY)'
            datatype: 'INT64'
        country_isocode:
            value: '!GER'
            datatype: 'STRING'
        account_name:
            value: 'name'
            datatype: 'STRING'
    prejoined_columns:
        contractnumber:
            src_name: 'source_data'
            src_table: 'contract'
            bk: 'contractnumber'
            this_column_name: 'ContractId'
            ref_column_name: 'Id'
        master_account_key:
            src_name: 'source_data'
            src_table: 'account'
            bk: 'account_key'
            this_column_name: 'master_account_id'
            ref_column_name: 'Id'
    missing_columns:
        legacy_account_uuid: 'INT64'
        shipping_address: 'STRING'
    {%- endset -%}

    {%- set metadata_dict = fromyaml(yaml_metadata) -%}

    {%- set source_model = metadata_dict.get('source_model') -%}
    {%- set ldts = metadata_dict.get('ldts') -%}
    {%- set rsrc = metadata_dict.get('rsrc') -%}
    {%- set hashed_columns = metadata_dict.get('hashed_columns') -%}
    {%- set derived_columns = metadata_dict.get('derived_columns') -%}
    {%- set prejoined_columns = metadata_dict.get('prejoined_columns') -%}
    {%- set missing_columns = metadata_dict.get('missing_columns') -%}

    {{ datavault4dbt.stage(source_model=source_model,
                        ldts=ldts,
                        rsrc=rsrc,
                        hashed_columns=hashed_columns,
                        derived_columns=derived_columns,
                        prejoined_columns=prejoined_columns,
                        missing_columns=missing_columns) }}

Description:

  • source_model: The source model that you want to use for the stage is available as another dbt model with the name 'source_account'.

  • ldts: Uses the column called 'edwLoadDate' as it is from the source model.

  • rsrc: Uses the column called 'edwRecordSource' as it is from the source model.

  • hashed_columns:

    • hk_account_h: A hashkey called 'hk_account_h' is defined, calculated from the two business keys 'account_number' and 'account_key'.

    • hd_account_s: A hashdiff called 'hd_account_s' is calculated from the descriptive attributes 'name', 'address', 'phone', and 'email'.

  • derived_columns:

    • conversion_duration: The column 'conversion_duration' calculates the number of days between two timestamp columns available inside the source data.

    • country_isocode: The column 'country_isocode' inserts the static string 'GER' for all rows.

    • account_name: The column 'account_name' duplicates an already existing column and gives it another name.

  • prejoined_columns:

    • contractnumber: Creates a column called 'contractnumber' that holds the values of the column with the same name (specified in 'bk') from the source table 'contract' in the source 'source_data', joined on 'this.ContractId = contract.Id'. Here the prejoined column alias equals the name of the original business key column, which is the case for most prejoins. Sometimes, however, the same object is prejoined multiple times, or a self-prejoin happens; then you have to rename the final columns to avoid duplicate column names. That behaviour is shown in the next prejoined column, and a conceptual sketch of the resulting SQL follows this list.

    • master_account_key: The column 'master_account_key' holds the values of the column 'account_key' inside the source table 'account'. Since this prejoin is done inside the stage of 'account' itself, it is a self-prejoin ON 'account.master_account_id = account.Id'. Because the table 'account' already has a column 'account_key', the prejoined column is renamed to 'master_account_key'.

  • missing_columns: Two additional columns are added to the source table holding NULL values. The column 'legacy_account_uuid' will have the datatype 'INT64' and the column 'shipping_address' will have the datatype 'STRING'.
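
To make the effect of 'hashed_columns' and 'prejoined_columns' more tangible, the staged view of Example 1 conceptually resembles the hand-written SQL below. This is a simplified sketch only: the macro's actual output depends on the configured hash function, casting, NULL handling, and target platform; MD5 and the '||' concatenation delimiter are illustrative assumptions.

    SELECT
        src.*,

        -- hashkey: hash over the concatenated business keys (illustrative MD5)
        TO_HEX(MD5(CONCAT(
            COALESCE(CAST(src.account_number AS STRING), ''), '||',
            COALESCE(CAST(src.account_key AS STRING), '')
        ))) AS hk_account_h,

        -- hashdiff: hash over the concatenated descriptive attributes
        TO_HEX(MD5(CONCAT(
            COALESCE(CAST(src.name AS STRING), ''), '||',
            COALESCE(CAST(src.address AS STRING), ''), '||',
            COALESCE(CAST(src.phone AS STRING), ''), '||',
            COALESCE(CAST(src.email AS STRING), '')
        ))) AS hd_account_s,

        -- prejoined business key, fetched via the defined join condition
        contract.contractnumber AS contractnumber

    FROM source_account AS src
    LEFT JOIN contract
        ON src.ContractId = contract.Id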

Example 2

    {%- set yaml_metadata -%}
    source_model: 
        'source_data': 'source_account'
    ldts: "PARSE_TIMESTAMP('%Y-%m-%dT%H-%M-%S', edwLoadDate)"
    rsrc: "CONCAT(source_system, '||', source_object)"
    hashed_columns: 
        hk_account_h:
            - account_number
            - account_key
        hd_account_s:
            is_hashdiff: true
            columns:
                - name
                - address
                - phone
                - email
        hk_account_contract_l:
            - account_number
            - account_name
            - contractnumber
    derived_columns:
        conversion_duration:
            value: 'TIMESTAMP_DIFF(conversion_date, created_date, DAY)'
            datatype: 'INT64'
        country_isocode:
            value: '!GER'
            datatype: 'STRING'
        account_name:
            value: 'name'
            datatype: 'STRING'
    prejoined_columns:
        contractnumber:
            src_name: 'source_data'
            src_table: 'contract'
            bk: 'contractnumber'
            this_column_name: 'ContractId'
            ref_column_name: 'Id'
    multi_active_config:
        multi_active_key: 'ma_attribute'
        main_hashkey_column: 'hk_account_h'
    {%- endset -%}

    {%- set metadata_dict = fromyaml(yaml_metadata) -%}

    {%- set source_model = metadata_dict.get('source_model') -%}
    {%- set ldts = metadata_dict.get('ldts') -%}
    {%- set rsrc = metadata_dict.get('rsrc') -%}
    {%- set hashed_columns = metadata_dict.get('hashed_columns') -%}
    {%- set derived_columns = metadata_dict.get('derived_columns') -%}
    {%- set prejoined_columns = metadata_dict.get('prejoined_columns') -%}
    {%- set multi_active_config = metadata_dict.get('multi_active_config') -%}

    {{ datavault4dbt.stage(source_model=source_model,
                        ldts=ldts,
                        rsrc=rsrc,
                        hashed_columns=hashed_columns,
                        derived_columns=derived_columns,
                        prejoined_columns=prejoined_columns,
                        missing_columns=none,
                        multi_active_config=multi_active_config) }}

Description:

  • source_model: The source model that you want to use for the stage is available as a source defined inside the .yml file with the name 'source_data', and you select the table 'source_account' out of that source.

  • ldts: Applies the SQL function 'PARSE_TIMESTAMP' to the input column 'edwLoadDate'.

  • rsrc: Applies the SQL function 'CONCAT' to concatenate two source columns.

  • hashed_columns:

    • hk_account_h: A hub hashkey called 'hk_account_h' is defined, calculated from the two business keys 'account_number' and 'account_key'.

    • hd_account_s: A hashdiff called 'hd_account_s' is calculated from the descriptive attributes 'name', 'address', 'phone', and 'email'.

    • hk_account_contract_l: A link hashkey called 'hk_account_contract_l' is defined, calculated from the business keys of the connected hubs.

  • derived_columns:

    • conversion_duration: The column 'conversion_duration' calculates the number of days between two timestamp columns available inside the source data.

    • country_isocode: The column 'country_isocode' inserts the static string 'GER' for all rows.

    • account_name: The column 'account_name' duplicates an already existing column and gives it another name.

  • prejoined_columns:

    • contractnumber: Creates a column called 'contractnumber' that holds the values of the column with the same name (specified in 'bk') from the source table 'contract' in the source 'source_data', joined on 'this.ContractId = contract.Id'. Here the prejoined column alias equals the name of the original business key column, which is the case for most prejoins. If the same object is prejoined multiple times or a self-prejoin happens, the final columns have to be renamed to avoid duplicate column names, as shown in Example 1.

  • multi_active_config:

    • multi_active_key: The multi-active key(s) inside the source data. The combination of all business keys and multi-active keys needs to be unique per load date inside the source data.

    • main_hashkey_column: The hashkey column that is unique over the stage when combined with the multi-active key(s). It needs to be one of the columns generated inside 'hashed_columns' and should be the hashkey of the hub generated out of the source data.
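
If the source data is multi-active but has no natural multi-active key, the parameter table above suggests creating an artificial one with ROW_NUMBER (or similar) one layer before the stage. A minimal sketch of such a preprocessing model is shown below; the model name 'source_account', the columns 'account_number', 'edwLoadDate', and 'phone', and the alias 'ma_attribute' are placeholders, and the ORDER BY column only has to yield a deterministic ordering.

    {{ config(materialized='view') }}

    SELECT
        src.*,

        -- artificial multi-active key: consecutive number per business key and load date
        ROW_NUMBER() OVER (
            PARTITION BY src.account_number, src.edwLoadDate
            ORDER BY src.phone
        ) AS ma_attribute

    FROM {{ ref('source_account') }} AS src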