Introducing (and a refresher of) the more advanced dbt_ features and usage. This repo contains the examples used during the teach-in session.
- The dbt_ documentation website for this repo!
- dbt presentation for Data Council KL
- Data Engineering's Confluence page on dbt
- A Database
  - You may either use the Analytics Redshift; or
  - Spin up a local Postgres database instance (using Docker)
- Python
  - Version 3.6 or above
- Git
  - To clone this repository on GitHub
- dbt_
- For other Python modules (if any), see:
- Prepare the Database
  - If you are using the Analytics Redshift, make sure to:
    - Set the `profiles.yml` in the dbt project accordingly:
      - Set your `user`
      - Set your `schema`
      - Change the `target` to `redshift`
    - Export your password to the environment variable:

      ```bash
      export PASS='<your_password_here>'
      ```
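    - As a rough sketch (placeholder values, not the repo's actual settings), the `redshift` target in `profiles.yml` could look something like this, with the password read from the `PASS` environment variable exported above:

      ```yaml
      dbt_teach_in:            # profile name is an assumption; match the `profile:` in dbt_project.yml
        target: redshift
        outputs:
          redshift:
            type: redshift
            host: <your_redshift_host>
            user: <your_user>
            password: "{{ env_var('PASS') }}"   # read from the exported environment variable
            port: 5439
            dbname: <your_database>
            schema: <your_schema>
            threads: 4
      ```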
  - If you want to run a local Postgres database instance, you will need to:
    - Install Docker
    - Docker pull and run a local Postgres instance:

      ```bash
      docker run --rm --name pg-docker -e POSTGRES_PASSWORD=docker -d -p 5432:5432 postgres
      ```

      Tip: add `-v $HOME/docker/volumes/postgres:/var/lib/postgresql/data` (before the `postgres` image name) to persist the database.

    - To stop the database when done, run:

      ```bash
      docker stop pg-docker
      ```
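    - A matching `profiles.yml` target for this local instance might look like the sketch below (profile/target names and the schema are placeholders; the credentials follow the `docker run` command above, which relies on the Postgres image defaults):

      ```yaml
      dbt_teach_in:            # profile name is an assumption
        target: postgres
        outputs:
          postgres:
            type: postgres
            host: localhost
            user: postgres       # default user of the official Postgres image
            password: docker     # set via POSTGRES_PASSWORD above
            port: 5432
            dbname: postgres     # default database of the official image
            schema: <your_schema>
            threads: 4
      ```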
- Installing Python
  - Follow the guides to install Python if you don't have it installed already.
- Cloning this repository
  - If you haven't already, install Git (we recommend installing via Homebrew)
  - Create a directory:

    ```bash
    mkdir ~/github
    cd ~/github
    ```

  - Clone the repository:

    ```bash
    git clone git@github.com:1bk/dbt-teach-in.git
    ```

    If you are having trouble with this, please reach out to me.
- Installing dbt_
  - While there are many ways to install dbt_, we suggest installing it in a Python virtual environment.
  - Follow the official dbt_ guide that details each step:
    - Create a local environment:

      ```bash
      cd dbt-teach-in              # set the working directory to the cloned repo
      python3 -m venv dbt-env      # create the environment
      source dbt-env/bin/activate  # activate the environment
      ```

    - Install `dbt`:

      ```bash
      pip install dbt
      ```
The topics here are shared in the order in which they were presented.

Note:

- Here we are applying dbt on a local Postgres instance.
- Make sure to be in the dbt project directory:

  ```bash
  cd ~/github/dbt-teach-in/dbt_teach_in
  ```
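- (Optional) To check that dbt can find your profile and connect to the database before running anything, you can use `dbt debug`, for example:

  ```bash
  dbt debug --profiles-dir ./
  ```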
Teach-in file reference:
Teach-in dbt_ documentation reference:
Official dbt_ docs:
- Specify a schema to populate the seed files
  - In `dbt_project.yml`:

    ```yaml
    seeds:
      dbt_teach_in:
        +schema: seeds
        quote_columns: true
    ```

- Seed the data
  - Making sure the files you want to seed are in `/data`, run:

    ```bash
    dbt seed --profiles-dir ./
    ```

- Setting custom column types
  - In `dbt_project.yml`:

    ```yaml
    seeds:
      dbt_teach_in:
        ...
        raw_orders:
          # Override column types
          +column_types:
            order_priority: varchar(1)
    ```

- Enable/Disable seed files
  - In `dbt_project.yml`:

    ```yaml
    seeds:
      dbt_teach_in:
        +schema: seeds
        ...
        snapshots:
          +schema: snapshots
          raw_orders_snapshot_backup:
            +enabled: true
    ```

- Seeding specific files
  - Using `--select`:

    ```bash
    dbt seed --profiles-dir ./ --select raw_orders_snapshot_backup
    ```
Official dbt_ docs:
- Schema Tests
  - Creating simple schema tests:
    - In the properties (e.g. `schema.yml` files), add `tests:`
    - For example, in `/data/seeds.yml` (yes, you can do it for seed files):

      ```yaml
      seeds:
        - name: raw_orders
          description: ...
          columns:
            - name: id
              description: Integer ID of an order entry in this table.
              tests:
                - unique
                - not_null
            - name: order_priority
              tests:
                - accepted_values:
                    values: ["C", "H", "M", "L"]
        ...
      ```
  - Creating `relationships` tests between tables:
    - In the properties (e.g. `schema.yml` files), add `tests:`
    - For example, in `/data/seeds.yml`:

      ```yaml
      seeds:
        - name: raw_orders
          description: ...
          columns:
            ...
            - name: country_id
              description: Integer ID of a country entry in the `raw_countries` table.
              tests:
                - relationships:
                    to: ref('raw_countries')
                    field: id
            - name: item_type_id
              description: Integer ID of an item type entry in the `raw_item_types` table.
              tests:
                - relationships:
                    to: ref('raw_item_types')
                    field: id
      ```
  - Reusing tests using anchors in `yaml`:
    - For example, in `/models/reporting/staging/reporting_staging.yml.yml`:
      - Define an anchor:

        ```yaml
        ##############################################
        ## Defining some anchor labels to stay DRY. ##
        ##############################################
        test__boolean_true_only: &test__boolean_true_only
          tests:
            - not_null
            - accepted_values:
                values:
                  - true
        ```

      - Using the anchor:

        ```yaml
        models:
          - name: stg_orders__calc_check
            description: |
              A simple view used to visually check the reports v.s. calculated:
                - Revenue
                - Cost
                - Profit

              Schema test is executed on the Boolean column and throws an error if any calculation is off!
            columns:
              - name: same_reveue
                description: ""
                <<: *test__boolean_true_only
              - name: same_cost
                description: ""
                <<: *test__boolean_true_only
              - name: same_profit
                description: ""
                <<: *test__boolean_true_only
        ```
  - But note that a schema test only shows whether it `PASS`es or `FAIL`s. To get the number of rows that fail a particular test, you can use a Data Test instead.
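  - To actually execute the tests defined above (schema and data tests alike), run `dbt test` the same way the other commands in this repo are invoked, for example:

    ```bash
    dbt test --profiles-dir ./
    ```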
- Data Tests
  - Creating a simple data test:
    - Create a query that would return rows for a specific condition (in the `WHERE` clause).
    - If rows are returned, then the test will fail and return an `ERROR` (default) or `WARN` (see below).
    - Save the query in the `/tests` directory.
    - See example `test_stg_orders__calc_check.sql`
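    - As a hedged illustration (a hypothetical test, not the repo's `test_stg_orders__calc_check.sql`), a data test that fails if any seeded order has a negative `units_sold` could look like this:

      ```sql
      -- tests/test_no_negative_units_sold.sql (hypothetical example)
      -- dbt treats any rows returned by this query as test failures.
      select
          id,
          units_sold
      from {{ ref('raw_orders') }}
      where units_sold < 0
      ```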
  - Defining custom config for each test:
    - By default, a test is `enabled` and the `severity` is `error`.
    - To overwrite this, define custom config in the test's `.sql` files:

      ```sql
      {{ config(
          enabled=true | false,
          severity='warn' | 'error'
      ) }}
      ```

    - See example `test_stg_orders__calc_check.sql`
Use snapshots to track changes to a table over time. You can also use them to view a snapshot of the table at a specific period in time.
- Create a snapshot
  - Follow the official dbt guide
  - For example, see `/snapshots/raw_orders_snapshot.sql`
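  - As a rough sketch of what such a snapshot can look like (the repo's actual file may use a different strategy or configuration):

    ```sql
    {% snapshot raw_orders_snapshot %}

    {{
        config(
          target_schema='snapshots',
          unique_key='id',
          strategy='check',
          check_cols='all'
        )
    }}

    select * from {{ ref('raw_orders') }}

    {% endsnapshot %}
    ```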
- To revisit the table at a specific period in time
  - In our example, there was an error in the `units_sold` column of the `raw_orders` table where the number was doubled in the month of June 2017.
  - We use a custom query to put ourselves at a point in time before or after the fix was executed (i.e. the fix was on '2017-07-01').
  - For the custom query, see `/analysis/tracing_units_sold_error.sql`
  - Documentation of the Query
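  - As a hedged sketch of the idea (the real query lives in `/analysis/tracing_units_sold_error.sql`), dbt's snapshot metadata columns `dbt_valid_from` and `dbt_valid_to` let you pick the version of each row that was valid at a given date:

    ```sql
    -- Rows as they looked just before the fix on '2017-07-01'
    select *
    from {{ ref('raw_orders_snapshot') }}
    where dbt_valid_from < '2017-07-01'
      and (dbt_valid_to >= '2017-07-01' or dbt_valid_to is null)
    ```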
Teach-in file reference:
Teach-in dbt_ documentation reference:
Official dbt_ docs:
Basically, if you have analytical queries that you want to version control and also run using dbt, but not materialise as tables, then you can put them in the `/analysis` directory.

- Remember, anything in the `/analysis` directory will not run when you execute `dbt run`. Instead, it will only be compiled.
- Compiling turns the queries in the `.sql` files into runnable SQL you can copy & paste into your SQL clients or Sisense (Periscope).
- To compile manually, just run `dbt compile`.
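- For example, after compiling, the rendered SQL for each analysis ends up under the project's `target/compiled/` directory (the exact sub-path depends on your project name):

  ```bash
  dbt compile --profiles-dir ./
  # Compiled, copy-paste-ready SQL lands under target/compiled/, e.g.
  # target/compiled/dbt_teach_in/analysis/tracing_units_sold_error.sql
  ```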
Official dbt_ docs:
- Setting variables in `dbt_project.yml`:
  - A simple example is to set the variable `test_name` to `Bob`:

    ```yaml
    vars:
      dbt_teach_in:
        test_name: "Bob"
    ```

  - This variable is available project wide.
- Setting variables within `.sql` scripts:
  - A simple example is to set the variable `test_name` to `Bob`:

    ```sql
    {% set test_name = 'Bob' %}
    ```

  - This variable is available only in this script.
  - We can even use Python modules within this Jinja context:
    - Here we set a variable called `now` using the Python `datetime` module:

      ```sql
      {% set now = modules.datetime.datetime.now() %}
      ```

    - This example is implemented in `models/experiments/crazy_experiment.sql`
- Calling variables within `.sql` scripts:
  - Simply use double curly brackets in `.sql` files, like this:

    ```sql
    {{ test_name }}
    ```

  - For variables assigned Python modules, you can call their methods and attributes, like this:

    ```sql
    {{ now.minute }}
    ```
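  - Note that project-wide variables declared under `vars:` in `dbt_project.yml` (like `test_name` above) are retrieved with dbt's `var()` function rather than as bare names. A minimal sketch (the column alias is just for illustration):

    ```sql
    select '{{ var("test_name") }}' as greeted_name
    ```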
- Note that the `ref` and `source` used here are macros called using Jinja:
  - e.g.

    ```sql
    with source_table_name as (
        select * from {{ source('schema_name', 'table_name') }}
    )

    , model_name as (
        select * from {{ ref('model_name') }}
    )
    ```
There are multiple types of Hooks & Operations dbt can run; see the official dbt documentation.

- Run Post-Hooks (i.e. after the model is materialised)
  - Example:

    ```sql
    {{ config(
        materialized='table',
        post_hook=[
            'drop table if exists dbt_experiment.crazy_experiment_fruit_only',
            'create table dbt_experiment.crazy_experiment_fruit_only as (select * from dbt_experiment.crazy_experiment where item_type = ''Fruits'')',
        ]
    ) }}
    ```
Official dbt_ docs:
Official dbt_ docs:
- Formatting YAML Multiline Strings in descriptions with “>”, “|”, etc.
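  - In plain YAML terms, `|` keeps line breaks while `>` folds them into spaces. A generic illustration (not a file from this repo):

    ```yaml
    models:
      - name: some_model
        description: |
          Line breaks are preserved,
          so this renders as two lines.
      - name: another_model
        description: >
          These lines are folded
          into a single line.
    ```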
- Using and linking to markdown files with `{% docs <name_of_doc> %}`
  - For example declaration, see `/analysis/docs/analysis.md`
  - For example implementation, see `/analysis/analysis.yml`:

    ```yaml
    analyses:
      - name: <name_of_sql>
        description: '{{ doc(''<name_of_doc>'') }}'
    ```

  - Note: can be used for `seeds`, `analyses`, or `models` too.
- Adding Images or Hyperlinks
- Formatting Reference
Official dbt_ docs:
- Applying to the final `fct_` table
  - For example declaration, see `/models/reporting/core/fct_orders__dashboard.sql`:

    ```sql
    -- In `config` here, `sort` and `dist` only work for Redshift and not for Postgres.
    {{ config(
        materialized='table',
        sort=[
            'item_type',
            'order_id',
            'order_ref_id',
        ],
        dist='order_id',
    ) }}
    ```