We are the boutique analytics consultancy that turns disorganised data into real business value. Get in touch to learn more about how Tasman can help solve your organisation's data challenges.
# tasman_dbt_utils

`tasman_dbt_utils` is a dbt package with reusable macros. It includes macros for, amongst other things:
- tests
- monitoring & auditing
- SQL functions
- ops functions
The intention for this package is to have each macro available for both dbt-snowflake and dbt-bigquery. Some macros are only available on one of the platforms because of fundamental differences between them; for example, managing warehouse sizes is only relevant in Snowflake. This package does not intend to replace other commonly used packages such as dbt-utils or dbt-expectations; it adds functionality which those packages don't provide.
Install the package by adding the following to your `packages.yml`:

```yaml
packages:
  - git: "https://github.com/TasmanAnalytics/tasman_dbt_utils.git"
    revision: 1.0.0
```
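Then pull the package into your project with dbt's standard dependency command:

```shell
dbt deps
```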
## test_count_distinct_matches_source (source)

This test:
- Takes the count distinct of the test column (e.g. `transaction_id`)
- Aggregates this by a specified field (e.g. date) to get an aggregated measure (e.g. date | count_transactions)
- Compares this against another model (ideally with the same granularity) (e.g. date | count_transactions)
- Returns any rows where there is a discrepancy between the aggregated measures
- (Bonus) If you are fine with tests not being an exact match, you can specify a threshold within which discrepancies are tolerated, e.g. the transaction count may fluctuate within a ±5% range of the source
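As a rough illustration, the comparison can be sketched in SQL like this. This is a minimal, hypothetical sketch using the column and model names from the example further down; the actual compiled SQL generated by the test will differ in its details:

```sql
-- Hypothetical sketch of the test's comparison logic, not the exact compiled SQL.
with model_agg as (

    select
        date_trunc(day, created_timestamp) as comparison_field,
        count(distinct transaction_id) as metric
    from dmn_jaffle_shop
    group by 1

),

source_agg as (

    select
        sale_date as comparison_field,
        count(distinct transaction_id) as metric
    from raw_jaffle_shop
    group by 1

)

-- The test fails on every row returned, i.e. every bucket where the two
-- counts diverge by more than percent_mismatch_threshold (here 5%).
select
    model_agg.comparison_field,
    model_agg.metric as model_metric,
    source_agg.metric as source_metric
from model_agg
left join source_agg
    on model_agg.comparison_field = source_agg.comparison_field
where abs(model_agg.metric - coalesce(source_agg.metric, 0))
    > coalesce(source_agg.metric, 0) * 0.05
```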
### Arguments
- `source_model` (required): The name of the model that contains the source of truth. Specify this as a ref function, e.g. `ref('raw_jaffle_shop')`.
  - These can be seed files or dbt models, so there's a degree of flexibility here.
- `source_metric` (required): The name of the column/metric sourced from `source_model`.
- `comparison_field` (required): The name of the column/metric sourced from the model in the YAML file, i.e. the column/metric that is being compared against.
- `percent_mismatch_threshold` (optional, default = 0): The threshold that you would allow your tests to be out by, e.g. if you are happy with a ±5% discrepancy, set this to 5.
### Usage
This works similarly to the out-of-the-box tests offered by dbt (`unique`, `not_null`, etc.):

- Create the test in the YAML config, specifying values for all arguments marked as required above. Example below.
- Add any additional filtering conditions to your model via the `config`/`where` block.
```yaml
version: 2

models:
  - name: dmn_jaffle_shop
    description: ""
    columns:
      - name: transaction_id
        description: ""
        tests:
          - not_null
          - unique
          - count_aggregate_matches_source:
              name: count_transactions_matches_source__dmn_jaffle_shop
              source_model: ref('raw_jaffle_shop')
              source_field: sale_date
              source_metric: transaction_amount
              comparison_field: date_trunc(day, created_timestamp)
              config:
                where: date_trunc(day, created_timestamp) between '2022-01-11' and '2022-12-31' and sale_type != 'CANCELLED'
```
- Specify a unique test name for `name`. If this is not specified, dbt will by default concatenate all the test arguments into one long name, making the test unreadable.
- Run `dbt test` as you normally would, e.g. `dbt test -s dmn_jaffle_shop`.
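If exact matches are too strict, the optional `percent_mismatch_threshold` argument slots into the same test definition. A sketch, reusing the hypothetical example above:

```yaml
          - count_aggregate_matches_source:
              name: count_transactions_matches_source__dmn_jaffle_shop
              source_model: ref('raw_jaffle_shop')
              source_field: sale_date
              source_metric: transaction_amount
              comparison_field: date_trunc(day, created_timestamp)
              percent_mismatch_threshold: 5  # tolerate up to a ±5% discrepancy
```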
## test_sum_matches_source (source)

This test:
- Takes the sum of the test column (e.g. revenue)
- Aggregates this by a specified field (e.g. date) to get an aggregated measure (e.g. date | sum_revenue)
- Compares this against another model (ideally with the same granularity) (e.g. date | sum_revenue)
- Returns any rows where there is a discrepancy between the aggregated measures
- (Bonus) If you are fine with tests not being an exact match, you can specify a threshold within which discrepancies are tolerated, e.g. the summed revenue may fluctuate within a ±5% range of the source
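The threshold behaves the same way in both tests. As a hypothetical illustration of what `percent_mismatch_threshold: 5` means in practice:

```sql
-- Hypothetical failure condition for percent_mismatch_threshold = 5:
-- a date bucket with source sum_revenue = 1000 tolerates model values
-- between 950 and 1050; anything outside that range is returned as a failure.
abs(model_metric - source_metric) * 100.0 / source_metric > 5
```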
### Arguments
- `source_model` (required): The name of the model that contains the source of truth. Specify this as a ref function, e.g. `ref('raw_jaffle_shop')`.
  - These can be seed files or dbt models, so there's a degree of flexibility here.
- `source_metric` (required): The name of the column/metric sourced from `source_model`.
- `comparison_field` (required): The name of the column/metric sourced from the model in the YAML file, i.e. the column/metric that is being compared against.
- `percent_mismatch_threshold` (optional, default = 0): The threshold that you would allow your tests to be out by, e.g. if you are happy with a ±5% discrepancy, set this to 5.
### Usage
This works similarly to the out-of-the-box tests offered by dbt (`unique`, `not_null`, etc.):

- Create the test in the YAML config, specifying values for all arguments marked as required above. Example below.
- Add any additional filtering conditions to your model via the `config`/`where` block.
```yaml
version: 2

models:
  - name: dmn_jaffle_shop
    description: ""
    columns:
      - name: revenue
        description: ""
        tests:
          - not_null
          - sum_aggregate_matches_source:
              name: sum_revenue_matches_source__dmn_jaffle_shop
              source_model: ref('raw_jaffle_shop')
              source_field: sale_date
              source_metric: sum_revenue
              comparison_field: date_trunc(day, created_timestamp)
              config:
                where: date_trunc(day, created_timestamp) between '2022-01-11' and '2022-12-31' and sale_type != 'CANCELLED'
```
- Specify a unique test name for `name`. If this is not specified, dbt will by default concatenate all the test arguments into one long name, making the test unreadable.
- Run `dbt test` as you normally would, e.g. `dbt test -s dmn_jaffle_shop`.
## include_source (source)

A frequently used pattern for creating initial CTEs that reference sources, establishing a dbt model dependency.
| Platform | Support |
| --- | --- |
| BigQuery | ✅ |
| Snowflake | ✅ |
### Arguments

- `source` (required): Source model name to be used in the script. This is also used to name the CTE.
### Usage

```sql
{{ include_source('dbo', 'user') }}
{{ include_source('dbo', 'event') }}
```
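For context, a model built on this pattern might look like the following hypothetical sketch, assuming the macro renders a CTE named after the table over the corresponding `source()` relation (the exact compiled SQL depends on the macro's implementation):

```sql
-- models/stg_user.sql (hypothetical example model)
{{ include_source('dbo', 'user') }}

-- assuming the macro rendered: with user as (select * from <source relation>)
select * from user
```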
## include_ref (source)

A frequently used pattern for creating initial CTEs that reference upstream models, establishing a dbt model dependency.
| Platform | Support |
| --- | --- |
| BigQuery | ✅ |
| Snowflake | ✅ |
### Arguments

- `source` (required): Source model name to be used in the script. This is also used to name the CTE.
- `where_statement` (optional): This can be used to do an initial filter on the model.
### Usage

```sql
{{ include_ref('stg_user', 'where user_active = true') }}
{{ include_ref('dmn_pipeline') }}
```
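The optional `where_statement` is presumably injected into the generated CTE. A hypothetical sketch of what the first call above could compile to (the real macro may differ):

```sql
-- Hypothetical compiled output of include_ref('stg_user', 'where user_active = true'):
with stg_user as (

    select *
    from analytics.dbt.stg_user  -- resolved via ref('stg_user')
    where user_active = true     -- the optional where_statement

)
```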
## create_table_profile (source)

Prints a summary of statistics about the target model to the terminal:
| database_name | schema_name | table_name | column_name | ordinal_position | row_count | distinct_count | null_count | is_unique | max | min | avg |
| --------------- | ----------- | ---------- | -------------------- | ---------------- | --------- | -------------- | ---------- | --------- | ---------- | ---------- | ------- |
| tasman-internal | dbt_jurri | customers | customer_id | 1 | 100 | 100 | 0 | True | 100 | 1 | 50.500… |
| tasman-internal | dbt_jurri | customers | first_name | 2 | 100 | 79 | 0 | False | | | |
| tasman-internal | dbt_jurri | customers | last_name | 3 | 100 | 19 | 0 | False | | | |
| tasman-internal | dbt_jurri | customers | first_order | 4 | 100 | 46 | 38 | False | 2018-04-07 | 2018-01-01 | |
| tasman-internal | dbt_jurri | customers | most_recent_order | 5 | 100 | 52 | 38 | False | 2018-04-09 | 2018-01-09 | |
| tasman-internal | dbt_jurri | customers | number_of_orders | 6 | 100 | 4 | 38 | False | 5 | 1 | 1.597… |
| tasman-internal | dbt_jurri | customers | customer_lifetime... | 7 | 100 | 35 | 38 | False | 99 | 1 | 26.968… |
| Platform | Support |
| --- | --- |
| BigQuery | ✅ |
| Snowflake | ✅ |
Scope: model, seed, snapshot
### Arguments

- `table`: the name of the table to run the table scan on.
- `schema` (optional, default=`target.schema`): the schema where the target table is located.
- `database` (optional, default=`target.database`): the database where the target table is located.
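No usage example is given for this macro. Assuming it is invoked as a run-operation like `drop_old_relations` below, a call could look like this (table name hypothetical):

```shell
dbt run-operation create_table_profile --args '{table: customers}'
```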
## get_object_keys (source)
Gets all of the object keys (including nested keys) of a column and prints them to the terminal.
| Platform | Support |
| --- | --- |
| BigQuery | ❌ |
| Snowflake | ✅ |
Scope: model, snapshot
### Arguments

- `column`: the name of the object column.
- `table`: the name of the target table.
- `schema` (optional, default=`target.schema`): the schema where the target table is located.
- `database` (optional, default=`target.database`): the database where the target table is located.
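Likewise, assuming a run-operation invocation (column and table names hypothetical):

```shell
dbt run-operation get_object_keys --args '{column: event_payload, table: stg_events}'
```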
## set_warehouse_size (source)
Sets a custom warehouse size for individual models.
| Platform | Support |
| --- | --- |
| BigQuery | ❌ |
| Snowflake | ✅ |
Scope: model, snapshot
### Arguments

- `size` (required): the size of the warehouse.
### Usage

```sql
{{
    config(
        snowflake_warehouse=tasman_dbt_utils.set_warehouse_size('M')
    )
}}
```
This requires a dict of environments, warehouses, and the available warehouse sizes to be set in `dbt_project.yml`. If the environment is missing from `dbt_project.yml`, it uses the default warehouse.
```yaml
vars:
  tasman_dbt_utils:
    snowflake_warehouses:
      prod:
        warehouse_prefix: "PROD_WH_"
        size: ["XS", "S", "M"]
      ci:
        warehouse_prefix: "CI_WH_"
        size: ["XS", "S", "M"]
```
## drop_old_relations (source)

This macro takes the relations in the manifest and compares them to the tables and views in the warehouse. Tables and views which are in the warehouse but not in the manifest will be dropped.
| Platform | Support |
| --- | --- |
| BigQuery | ❌ |
| Snowflake | ✅ |
Scope: model, seed, snapshot
### Arguments

- `schema_prefix` (optional, default=`target.schema`): the prefix of the schemas where the relations should be deleted.
- `database` (optional, default=`target.database`): the database where the relations should be deleted.
- `dry_run` (optional, default=True): when set to True it only prints the drop statements; when set to False it actually removes the relations.
### Usage

```shell
dbt run-operation drop_old_relations --args '{dry_run: False, schema_prefix: dbt}'
```
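Since this operation is destructive, it's sensible to rely on the default `dry_run: True` first and review the printed statements before re-running with `dry_run: False`:

```shell
# Dry run (the default): prints the DROP statements without executing them
dbt run-operation drop_old_relations --args '{schema_prefix: dbt}'
```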