# dbt with Databricks - Hands-on Exercises

Welcome to the dbt training session! In this workshop, you'll learn to build data transformations using the medallion architecture pattern.

![Medallion Architecture Overview](../z_assets/medallion-architecture-overview.png)

## What You'll Build
- **Bronze Layer**: Raw TPC-H data (already provided)
- **Silver Layer**: Cleansed data with surrogate keys and SCD2 tracking
- **Gold Layer**: Dimensional model for analytics

## Prerequisites
- Access to Databricks workspace

## Exercise 1: dbt Setup

### Goal
Initialize your dbt project and establish a connection to Databricks.

### Steps

#### 1.1 Initialize virtual environment
```bash
python -m venv .venv

# Windows
.venv\Scripts\activate

# Mac/Linux
. .venv/bin/activate
```

#### 1.2 Initialize dbt Project
```bash
pip install -r requirements-dev.txt

dbt init
dbt debug
```

#### 1.3 Review Connection
Review `C:\Users\<Username>\.dbt\profiles.yml` (Windows) or `~/.dbt/profiles.yml` (Mac/Linux). Optionally, copy the file into the `dbt-workshop` folder:
```yaml
dbt-workshop:
  outputs:
    dev:
      catalog: dbt-workshop
      host: adb-673939630363416.16.azuredatabricks.net
      http_path: /sql/1.0/warehouses/42d4ebf47e760187
      schema: <YOUR INITIALS>
      threads: 4
      token: <PAT>
      type: databricks
  target: dev
```

#### 1.4 Test Connection
```bash
dbt debug
```

### Expected Outcome
- `dbt debug` shows all green checkmarks (except git)
- Connection to Databricks is successful

## Exercise 2: Explore Bronze Data

### Goal
Understand the TPC-H source data structure.

### TPC-H Data Model
![TPC-H Data Model](../z_assets/tpc-h.png)

>> The column names shown in the diagram are not exact. The actual column names carry the prefix shown in parentheses after each table name. For example, `custkey` in the **customer** table is actually `c_custkey`. Keep this in mind during the exercises.

### Steps

#### 2.1 Login to the Databricks Workspace
1. Go to: https://adb-673939630363416.16.azuredatabricks.net/
2. In the menu, click "SQL Editor".
3. Make sure that you have the catalog 'dbt-workshop' and the schema 'default' selected.
![image.png](../z_assets/exercise-2-1.png)

#### 2.2 Explore Tables
Explore the tables in the catalog using the UI.

Run these queries in Databricks to understand the data:

```sql
-- Check available tables
SHOW TABLES IN bronze;

-- Explore customer data
SELECT * FROM bronze.customer LIMIT 10;
DESCRIBE bronze.customer;

-- Check data volumes
SELECT 'customer' as table_name, COUNT(*) as row_count FROM bronze.customer
UNION ALL
SELECT 'orders', COUNT(*) FROM bronze.orders
UNION ALL
SELECT 'lineitem', COUNT(*) FROM bronze.lineitem;
```
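
Because the bronze columns carry table prefixes, joins have to use the prefixed names as well. As a minimal sketch, assuming the standard TPC-H column names (`c_nationkey`, `n_nationkey`, `n_regionkey`, `r_regionkey`; confirm them with `DESCRIBE` first), this query resolves each customer's nation and region:

```sql
-- Assumes standard TPC-H column names; verify with DESCRIBE before relying on them
SELECT
    c.c_custkey,
    c.c_name,
    n.n_name AS nation_name,
    r.r_name AS region_name
FROM bronze.customer c
JOIN bronze.nation n ON c.c_nationkey = n.n_nationkey
JOIN bronze.region r ON n.n_regionkey = r.r_regionkey
LIMIT 10;
```

The same join path (customer → nation → region) is what the gold-layer dimensions rely on later.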

### Tables Available
- `customer`: Customer information
- `orders`: Customer orders 
- `lineitem`: Order line items (largest table)
- `part`: Product parts
- `partsupp`: Part-supplier relationships
- `supplier`: Supplier information
- `nation`: Countries
- `region`: Geographic regions

## Exercise 3: Silver Layer - Basic Models (15 minutes)

### Goal
Create your first dbt models with renamed tables and columns.

### Steps

#### 3.1 Create Source Configuration
Create `models/bronze/sources.yml`:
```yaml
version: 2

sources:
  - name: bronze
    description: 'Raw TPC-H data'
    tables:
      - name: customer
      - name: orders
      - name: lineitem
      - name: part
      - name: partsupp
      - name: supplier
      - name: nation
      - name: region
```

#### 3.2 Configure Silver Models
Create `models/silver/schema.yml`:
```yaml
version: 2

models:
  - name: customers
    description: 'Cleansed customer data'
  - name: orders
    description: 'Cleansed orders data'
  - name: lineitems
    description: 'Cleansed line item data'
  - name: parts
    description: 'Cleansed parts data'
  - name: suppliers
    description: 'Cleansed supplier data'
  - name: nations
    description: 'Cleansed nation data'
  - name: regions
    description: 'Cleansed region data'
```

#### 3.3 Create Silver Models

Create `models/silver/customers.sql`:
```sql
{{ config(materialized='table') }}

SELECT 
    c_custkey as customer_id,
    c_name as customer_name,
    c_address as customer_address,
    c_nationkey as nation_id,
    c_phone as customer_phone,
    c_acctbal as account_balance,
    c_mktsegment as market_segment,
    c_comment as customer_comment
FROM {{ source('bronze', 'customer') }}
```

Create similar models for:
- `orders.sql` (rename `o_orderkey` → `order_id`, `o_custkey` → `customer_id`, etc.)
- `lineitems.sql` (rename `l_orderkey` → `order_id`, `l_partkey` → `part_id`, etc.)
- `parts.sql`, `suppliers.sql`, `nations.sql`, `regions.sql`
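
To illustrate the pattern for the remaining models, here is one possible `models/silver/orders.sql`. It is a sketch, not the reference solution: the source columns assume the standard TPC-H `orders` table, and the target names on the right are suggestions you can adapt.

```sql
-- models/silver/orders.sql
-- Sketch only: source columns assume the standard TPC-H orders table;
-- the target column names are suggestions, not requirements.
{{ config(materialized='table') }}

SELECT
    o_orderkey as order_id,
    o_custkey as customer_id,
    o_orderstatus as order_status,
    o_totalprice as total_price,
    o_orderdate as order_date,
    o_orderpriority as order_priority,
    o_clerk as clerk,
    o_shippriority as ship_priority,
    o_comment as order_comment
FROM {{ source('bronze', 'orders') }}
```

Whatever names you choose, keep the same ID column names (`order_id`, `customer_id`, ...) across all models so that later joins stay readable.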

#### 3.4 Run Your Models
```bash
dbt run --select silver
```

### Success Criteria
- All silver models run successfully
- Tables have consistent naming (plural, snake_case)
- Columns are renamed for clarity

## Exercise 4: Silver Layer - Surrogate Keys & Snapshots (15 minutes)

### Goal
Add surrogate keys and implement SCD2 tracking with snapshots.

### Silver Layer Data Model
Your target silver layer model with surrogate keys and SCD2 tracking:

![Silver Layer Data Model](../z_assets/silver-layer-model.png)

### Steps

#### 4.1 Add dbt-utils Package
Create `packages.yml` in project root:
```yaml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
```

Install packages:
```bash
dbt deps
```

#### 4.2 Update Models with Surrogate Keys
Update `models/silver/customers.sql`:
```sql
{{ config(materialized='table') }}

SELECT 
    {{ dbt_utils.generate_surrogate_key(['c_custkey']) }} as customer_key,
    c_custkey as customer_id,
    c_name as customer_name,
    c_address as customer_address,
    {{ dbt_utils.generate_surrogate_key(['c_nationkey']) }} as nation_key,
    c_nationkey as nation_id,
    c_phone as customer_phone,
    c_acctbal as account_balance,
    c_mktsegment as market_segment,
    c_comment as customer_comment
FROM {{ source('bronze', 'customer') }}
```

Update all other silver models to include surrogate keys for all ID fields.
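
The fact table in Exercise 6 expects columns such as `lineitem_key`, `order_key`, `part_key`, and `supplier_key`, so `lineitems.sql` is a good model to work through next. The sketch below assumes the standard TPC-H `lineitem` columns and uses `generate_surrogate_key`, the macro name in dbt_utils 1.x. Building `lineitem_key` from the order key plus the line number is an assumption about what identifies a line item; the `_id` and `_comment` target names are likewise only suggestions.

```sql
-- models/silver/lineitems.sql (sketch; assumes standard TPC-H lineitem columns)
{{ config(materialized='table') }}

SELECT
    -- Assumption: a line item is identified by its order plus its line number
    {{ dbt_utils.generate_surrogate_key(['l_orderkey', 'l_linenumber']) }} as lineitem_key,
    {{ dbt_utils.generate_surrogate_key(['l_orderkey']) }} as order_key,
    l_orderkey as order_id,
    {{ dbt_utils.generate_surrogate_key(['l_partkey']) }} as part_key,
    l_partkey as part_id,
    {{ dbt_utils.generate_surrogate_key(['l_suppkey']) }} as supplier_key,
    l_suppkey as supplier_id,
    l_linenumber as line_number,
    l_quantity as quantity,
    l_extendedprice as extended_price,
    l_discount as discount,
    l_tax as tax,
    l_returnflag as return_flag,
    l_linestatus as line_status,
    l_shipdate as ship_date,
    l_commitdate as commit_date,
    l_receiptdate as receipt_date,
    l_shipinstruct as ship_instructions,
    l_shipmode as ship_mode,
    l_comment as lineitem_comment
FROM {{ source('bronze', 'lineitem') }}
```

The macro hashes the values, not the column names, so a key built from `l_orderkey` here matches a key built from `o_orderkey` in the orders model for the same order.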

#### 4.3 Create Snapshots
Create `snapshots/customers_snapshot.sql`:
```sql
{% snapshot customers_snapshot %}

{{
    config(
      target_schema='silver_snapshots',
      unique_key='customer_id',
      strategy='check',
      check_cols=['customer_name', 'customer_address', 'account_balance', 'market_segment']
    )
}}

SELECT * FROM {{ ref('customers') }}

{% endsnapshot %}
```

Create snapshots for all silver tables used in the gold layer (customers, orders, lineitems, parts, suppliers, nations, regions).
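
The fact table in Exercise 6 reads from `lineitems_snapshot`, so line items need a snapshot too. A minimal sketch of `snapshots/lineitems_snapshot.sql`, assuming the silver model is named `lineitems` and exposes a `lineitem_key` column, could look like this. It uses `check_cols='all'` instead of a column list, which is simpler to write but compares every column on each run.

```sql
{% snapshot lineitems_snapshot %}

{{
    config(
      target_schema='silver_snapshots',
      unique_key='lineitem_key',
      strategy='check',
      check_cols='all'
    )
}}

-- Assumes a silver model named lineitems with a lineitem_key column
SELECT * FROM {{ ref('lineitems') }}

{% endsnapshot %}
```

For smaller tables such as nations and regions, the same template applies with the respective key column as `unique_key`.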

#### 4.4 Run Updated Models
```bash
# Rebuild the silver tables with the new columns
dbt run --select silver --full-refresh

# Create snapshots
dbt snapshot
```

### Success Criteria
- All models include surrogate keys
- Snapshots are created successfully
- SCD2 tracking is enabled (check dbt_valid_from, dbt_valid_to columns)
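
To check the SCD2 columns, you can query a snapshot directly in the SQL Editor. The queries below are a sketch that assumes the snapshot was built into the `silver_snapshots` schema of the selected catalog, as configured in `customers_snapshot.sql`; adjust the schema name if yours differs.

```sql
-- Version history per customer; dbt_valid_to IS NULL marks the current version
SELECT
    customer_id,
    customer_name,
    account_balance,
    dbt_valid_from,
    dbt_valid_to
FROM silver_snapshots.customers_snapshot
ORDER BY customer_id, dbt_valid_from
LIMIT 20;

-- Sanity check: no customer should have more than one current version
SELECT customer_id, COUNT(*) AS current_rows
FROM silver_snapshots.customers_snapshot
WHERE dbt_valid_to IS NULL
GROUP BY customer_id
HAVING COUNT(*) > 1;
```

Right after the first `dbt snapshot` run, every row is a current version, so the first query shows `NULL` in `dbt_valid_to` throughout and the second query returns no rows.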

## Exercise 5: Gold Layer - Dimensions (20 minutes)

### Goal
Create a dimensional model with clean dimension tables.

### Gold Layer Data Model
Your target star schema for analytics:

![Gold Layer Data Model](../z_assets/gold-layer-model.png)

### Steps

#### 5.1 Configure Gold Models
Create `models/gold/schema.yml`:
```yaml
version: 2

models:
  - name: dim_customer
    description: 'Customer dimension with geography'
    columns:
      - name: customer_key
        description: 'Surrogate key for customer'
        tests:
          - unique
          - not_null
      - name: customer_id
        description: 'Natural key for customer'
        tests:
          - unique
          - not_null
  
  - name: dim_supplier
    description: 'Supplier dimension with geography'
  
  - name: dim_part
    description: 'Part dimension'
  
  - name: fact_lineitem
    description: 'Line item fact table'
```

#### 5.2 Create Customer Dimension
Create `models/gold/dim_customer.sql`:
```sql
{{ config(materialized='table') }}

SELECT 
    c.customer_key,
    c.customer_id,
    c.customer_name,
    c.customer_address,
    c.customer_phone,
    c.account_balance,
    c.market_segment,
    n.nation_name,
    r.region_name
FROM {{ ref('customers_snapshot') }} c
JOIN {{ ref('nations_snapshot') }} n ON c.nation_key = n.nation_key
JOIN {{ ref('regions_snapshot') }} r ON n.region_key = r.region_key
WHERE c.dbt_valid_to IS NULL  -- Current records only
  AND n.dbt_valid_to IS NULL
  AND r.dbt_valid_to IS NULL
```

#### 5.3 Create Supplier Dimension
Create `models/gold/dim_supplier.sql`:
```sql
{{ config(materialized='table') }}

SELECT 
    s.supplier_key,
    s.supplier_id,
    s.supplier_name,
    s.supplier_address,
    s.supplier_phone,
    s.account_balance,
    n.nation_name,
    r.region_name
FROM {{ ref('suppliers_snapshot') }} s
JOIN {{ ref('nations_snapshot') }} n ON s.nation_key = n.nation_key
JOIN {{ ref('regions_snapshot') }} r ON n.region_key = r.region_key
WHERE s.dbt_valid_to IS NULL
  AND n.dbt_valid_to IS NULL
  AND r.dbt_valid_to IS NULL
```

#### 5.4 Create Part Dimension
Create `models/gold/dim_part.sql`:
```sql
{{ config(materialized='table') }}

SELECT 
    part_key,
    part_id,
    part_name,
    manufacturer,
    brand,
    part_type,
    part_size,
    container,
    retail_price
FROM {{ ref('parts_snapshot') }}
WHERE dbt_valid_to IS NULL
```

#### 5.5 Run Dimension Models
```bash
dbt run --select gold
```

### Success Criteria
- All dimension tables created successfully
- Joins between dimensions work correctly
- Only current records (dbt_valid_to IS NULL) are included
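
Because `dim_customer` uses inner joins, a customer whose nation or region has no current snapshot row would silently drop out of the dimension. One way to catch this is a dbt singular test, which fails when its query returns rows. The file name below is hypothetical, and the sketch assumes the `customers_snapshot` and `dim_customer` models from this exercise.

```sql
-- tests/assert_dim_customer_keeps_all_customers.sql (hypothetical file name)
-- Returns a row (and therefore fails) if the dimension's row count differs
-- from the number of current customer versions in the snapshot.
WITH current_customers AS (
    SELECT COUNT(*) AS row_count
    FROM {{ ref('customers_snapshot') }}
    WHERE dbt_valid_to IS NULL
),

dimension AS (
    SELECT COUNT(*) AS row_count
    FROM {{ ref('dim_customer') }}
)

SELECT
    current_customers.row_count AS snapshot_rows,
    dimension.row_count AS dimension_rows
FROM current_customers
CROSS JOIN dimension
WHERE current_customers.row_count <> dimension.row_count
```

Run it with `dbt test` once the gold models have been built.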

## Exercise 6: Gold Layer - Fact Table (20 minutes)

### Goal
Create an incremental fact table for line items.

### Steps

#### 6.1 Create Fact Table
Create `models/gold/fact_lineitem.sql`:
```sql
{{
    config(
        materialized='incremental',
        unique_key='lineitem_key',
        on_schema_change='fail'
    )
}}

SELECT 
    l.lineitem_key,
    l.order_key,
    l.part_key,
    l.supplier_key,
    l.line_number,
    l.quantity,
    l.extended_price,
    l.discount,
    l.tax,
    l.return_flag,
    l.line_status,
    l.ship_date,
    l.commit_date,
    l.receipt_date,
    l.ship_instructions,
    l.ship_mode,
    
    -- Calculated measures
    l.extended_price * (1 - l.discount) as discounted_price,
    l.extended_price * (1 - l.discount) * (1 + l.tax) as total_price,
    
    -- Foreign keys to dimensions
    o.customer_key,
    
    -- Audit fields
    l.dbt_updated_at
    
FROM {{ ref('lineitems_snapshot') }} l
JOIN {{ ref('orders_snapshot') }} o ON l.order_key = o.order_key
WHERE l.dbt_valid_to IS NULL
  AND o.dbt_valid_to IS NULL

{% if is_incremental() %}
    -- Only process new or updated records
    AND l.dbt_updated_at > (SELECT max(dbt_updated_at) FROM {{ this }})
{% endif %}
```

#### 6.2 Run Fact Table
```bash
# Initial full load
dbt run --select fact_lineitem

# Subsequent incremental runs
dbt run --select fact_lineitem
```

#### 6.3 Test Your Star Schema
Run this query to verify your dimensional model (replace `gold` with the schema your gold models were built in, if it differs):
```sql
SELECT 
    c.customer_name,
    c.nation_name,
    p.part_name,
    s.supplier_name,
    f.quantity,
    f.total_price
FROM gold.fact_lineitem f
JOIN gold.dim_customer c ON f.customer_key = c.customer_key
JOIN gold.dim_part p ON f.part_key = p.part_key
JOIN gold.dim_supplier s ON f.supplier_key = s.supplier_key
LIMIT 10;
```

### Success Criteria
- Fact table loads successfully
- Incremental logic works (run twice to test)
- Star schema query returns meaningful results
- All foreign key relationships are intact
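
The last criterion can be checked with a singular test that looks for fact rows without a matching dimension row. This is a sketch with a hypothetical file name; it assumes the model and column names used in the exercises above.

```sql
-- tests/assert_fact_lineitem_has_valid_keys.sql (hypothetical file name)
-- Returns every fact row whose customer, part, or supplier key
-- has no match in the corresponding dimension; the test fails if any exist.
SELECT
    f.lineitem_key,
    f.customer_key,
    f.part_key,
    f.supplier_key
FROM {{ ref('fact_lineitem') }} f
LEFT JOIN {{ ref('dim_customer') }} c ON f.customer_key = c.customer_key
LEFT JOIN {{ ref('dim_part') }} p ON f.part_key = p.part_key
LEFT JOIN {{ ref('dim_supplier') }} s ON f.supplier_key = s.supplier_key
WHERE c.customer_key IS NULL
   OR p.part_key IS NULL
   OR s.supplier_key IS NULL
```

A left join keeps all fact rows, so unmatched keys show up as `NULL` on the dimension side instead of disappearing as they would with an inner join.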

## Exercise 7: Testing & Documentation (If Time Permits)

### Goal
Add data quality tests and generate documentation.

### Steps

#### 7.1 Run Tests
```bash
dbt test
```

#### 7.2 Generate Documentation
```bash
dbt docs generate
dbt docs serve
```

#### 7.3 View Lineage
In the docs interface, explore:
- Data lineage graph
- Model documentation
- Test results

### Success Criteria
- All tests pass
- Documentation site loads
- Lineage graph shows bronze → silver → gold flow

## Final Validation

### Check Your Work
Run these commands to validate your complete solution:

```bash
# Run everything
dbt build

# Check row counts (assumes your project defines a print_relation_row_counts macro)
dbt run-operation print_relation_row_counts
```

### Expected Results
You should have:
- 8 bronze source tables
- 7 silver models + snapshots
- 3 gold dimensions + 1 fact table
- SCD2 tracking with dbt_valid_from/to
- Surrogate keys throughout
- Incremental fact table
- Star schema for analytics

## Congratulations!

You've successfully built a complete medallion architecture data pipeline using dbt!

### What You've Learned
- dbt project setup and configuration
- Source and model definitions
- Data transformation and cleansing
- Surrogate key generation
- SCD2 implementation with snapshots
- Dimensional modeling
- Incremental models for performance
- Testing and documentation

### Next Steps
- Add more complex business logic
- Implement additional tests
- Create marts for specific use cases
- Set up CI/CD for production deployment
- Explore dbt Cloud features