# LAB 08: Lakeflow Jobs — Triggers, Dependencies & Orchestration

**Duration:** ~30 min | **Day:** 3 | **Difficulty:** Intermediate
**After module:** M08: Lakeflow Jobs & Orchestration

> *"The RetailHub pipeline works — now automate it. Configure triggers, define task dependencies, handle failures with repair runs, and monitor via system tables."*

### Lab Structure

| Section | Focus | Format |
|---------|-------|--------|
| **Section 1: Workshop** | Hands-on job creation in Databricks UI | Guided walkthrough with screenshots |
| **Section 2: Practice** | Orchestration concepts & configuration | Fill-in-the-blank exercises with verification |

## Setup

In [0]:
%run ../../setup/00_setup

---

## Section 1: Workshop — Creating & Running Jobs in Databricks UI

In this section you will create two Lakeflow Jobs through the Databricks UI:

| Job | Tasks | Purpose |
|-----|-------|---------|
| **customer_pipeline** | `bronze_customer` → `silver_customer` | Ingest and cleanse customer data |
| **orders_pipeline** | `bronze_orders` → `silver_orders` → `gold_daily_orders` → `gold_customer_orders_summary` | Full orders medallion pipeline, triggered by customer updates |

> **Goal:** Create both jobs, configure task dependencies, set up a **Table Update trigger** so the orders pipeline runs automatically when `silver_customers` is updated, then execute the pipeline end-to-end.

### Step 1: Create a New Job — `customer_pipeline`

Navigate to **Jobs & Pipelines** in the left sidebar, then click **Create Job**.

![Create Job](../../../assets/images/training_2026/day3/6f524625f8464c29a2572b0a324333a5.webp)

---

### Step 2: Set the Job Name

Enter a name that identifies the job as yours in the shared workspace (e.g., `<your_name>_customer_pipeline`).

![Job Name](../../../assets/images/training_2026/day3/ae8ab8cafdcf4690b5d9665cbf058798.webp)

> **Note:** The job name does not need to be globally unique — each job is identified by a unique **Job ID** assigned automatically by Databricks.

![Job Details — ID & Creator](../../../assets/images/training_2026/day3/460925af2b214f93a2252aed0ea86936.webp)

---

### Step 3: Configure Job Parameters

Add the following **job parameters** so that all tasks share the same catalog and source path:

| Parameter | Value |
|-----------|-------|
| `catalog` | `retailhub_<your_name>` |
| `source_path` | `/Volumes/retailhub_<your_name>/default/datasets` |

![Job Parameters](../../../assets/images/training_2026/day3/e24f8fbe50f642e0a28054972140dcb1.webp)

---

### Step 4: Add the First Task — `bronze_customer`

Click **Add task** and configure:

| Field | Value |
|-------|-------|
| **Task name** | `bronze_customer` |
| **Type** | Notebook |
| **Source** | Workspace |
| **Path** | Path to your `bronze_customers` notebook |
| **Compute** | Shared training cluster *(for workshop only — in production, use a dedicated Job cluster)* |

![Add Task](../../../assets/images/training_2026/day3/8d2245e891b849e5966aae38a0e33c5f.webp)

![Task Configuration](../../../assets/images/training_2026/day3/c89c69190d40463faae2b5186c78c271.webp)

Click **Create task** when done.

---

### Step 5: Add the Second Task — `silver_customer`

Repeat the same process for `silver_customer`:
- Set **Depends on** → `bronze_customer` (so it runs only after bronze succeeds)
- Point the notebook path to your `silver_customers` notebook

Your completed job should look like this:

![Customer Pipeline — 2 Tasks](../../../assets/images/training_2026/day3/8bdeb6d46c284e3c85bf1c04f4b27657.webp)

### Step 6: Create the Second Job — `orders_pipeline`

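Create a **new job** following the same steps as above. Name it `<your_name>_orders_pipeline` and add four tasks chained in this order: `bronze_orders` → `silver_orders` → `gold_daily_orders` → `gold_customer_orders_summary`. For every task after the first, set **Depends on** to the task immediately before it.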

Your completed pipeline should look like this:

![Orders Pipeline — 4 Tasks](../../../assets/images/training_2026/day3/119b8804bea74938aff23d816140bf4b.webp)

---

### Step 7: Configure a Table Update Trigger

Navigate to the **Triggers** tab of the `orders_pipeline` job and add a **Table Update** trigger on `silver_customers`.

This means the orders pipeline will **start automatically** whenever the `silver_customers` table is updated by the customer pipeline.

![Table Update Trigger Configuration](../../../assets/images/training_2026/day3/d4dd6298fc9b41e9bc784878ccd8dc3a.webp)

---

### Step 8: Run the Pipeline

Click **Run now** on the `customer_pipeline` job first.

![Run Job](../../../assets/images/training_2026/day3/73685979cbb243b3af4e01fdcadb1a32.webp)

> **Expected result:** The customer pipeline completes successfully, which updates `silver_customers`. This triggers the orders pipeline automatically — both jobs should show a successful run in the **Run History** tab.

### Reference: YAML Definition — `orders_pipeline`

> The YAML below shows the **Databricks Asset Bundle** definition for the orders pipeline. This is the programmatic equivalent of what you configured in the UI above.

```yaml
resources:
  jobs:
    demo_orders_pipeline:
      name: demo_orders_pipeline
      trigger:
        pause_status: UNPAUSED
        table_update:
          table_names:
            - retailhub_trainer.silver.silver_customers
      tasks:
        - task_key: bronze_orders
          notebook_task:
            notebook_path: /Workspace/.../bronze_orders
            source: WORKSPACE
          existing_cluster_id: <CLUSTER_ID>
        - task_key: silver_orders
          depends_on:
            - task_key: bronze_orders
          notebook_task:
            notebook_path: /Workspace/.../silver_orders
            source: WORKSPACE
          existing_cluster_id: <CLUSTER_ID>
        - task_key: gold_daily_orders
          depends_on:
            - task_key: silver_orders
          notebook_task:
            notebook_path: /Workspace/.../gold_daily_orders
            source: WORKSPACE
          existing_cluster_id: <CLUSTER_ID>
        - task_key: gold_customer_orders_summary
          depends_on:
            - task_key: gold_daily_orders
          notebook_task:
            notebook_path: /Workspace/.../gold_customer_orders_summary
            source: WORKSPACE
          existing_cluster_id: <CLUSTER_ID>
      queue:
        enabled: true
      parameters:
        - name: catalog
          default: retailhub_trainer
        - name: source_path
          default: /Volumes/retailhub_trainer/default/datasets
```

### Reference: YAML Definition — `customer_pipeline`

```yaml
resources:
  jobs:
    demo_customer_pipeline:
      name: demo_customer_pipeline
      tasks:
        - task_key: bronze_customer
          notebook_task:
            notebook_path: /Workspace/.../bronze_customers
            source: WORKSPACE
          existing_cluster_id: <CLUSTER_ID>
        - task_key: silver_customer
          depends_on:
            - task_key: bronze_customer
          notebook_task:
            notebook_path: /Workspace/.../silver_customers
            source: WORKSPACE
          existing_cluster_id: <CLUSTER_ID>
      queue:
        enabled: true
      parameters:
        - name: catalog
          default: retailhub_trainer
        - name: source_path
          default: /Volumes/retailhub_trainer/default/datasets
```

> **Key differences:** The customer pipeline has **no trigger** (runs manually or on-demand), while the orders pipeline uses a **table_update trigger** on `silver_customers`.

---

## Section 2: Practice — Orchestration Concepts & Configuration

Complete the exercises below to reinforce your understanding of Lakeflow Jobs orchestration.

| Task | Topic | Skills Tested |
|------|-------|---------------|
| 1 | Job Configuration (JSON) | Reading job definitions, timeouts, retries |
| 2 | Trigger Types | Matching scenarios to trigger types |
| 3 | CRON Expressions | Writing schedule expressions |
| 4 | Task Dependencies (DAG) | Fan-out / fan-in patterns |
| 5 | Repair Runs | Understanding re-execution scope |
| 6 | Task Values | Passing parameters between tasks |
| 7 | System Tables | SQL queries for job monitoring |
| 8 | Cluster Selection | Job cluster vs. All-purpose cluster |

---
### Task 1: Understanding Job Configuration (JSON)

Databricks Jobs can be configured in the UI, or programmatically through the **REST API (JSON)** or **Databricks Asset Bundles (YAML)**.
Examine the job configuration below and answer the questions.

```json
{
  "name": "RetailHub_Daily_Refresh",
  "tasks": [
    {
      "task_key": "refresh_pipeline",
      "pipeline_task": { "pipeline_id": "<PIPELINE_ID>" },
      "timeout_seconds": 1800
    },
    {
      "task_key": "validate_results",
      "depends_on": [{ "task_key": "refresh_pipeline" }],
      "notebook_task": { "notebook_path": "/Workspace/.../task_01_validate" },
      "max_retries": 2,
      "retry_on_timeout": false
    },
    {
      "task_key": "generate_report",
      "depends_on": [{ "task_key": "validate_results" }],
      "notebook_task": { "notebook_path": "/Workspace/.../task_03_report" }
    }
  ],
  "trigger": {
    "periodic": { "interval": 1, "unit": "DAYS" }
  }
}
```

> **Hint:** Pay attention to `timeout_seconds`, `max_retries`, and the `depends_on` chains.

In [0]:
# TODO: Answer the questions about the job configuration above

# Q1: How many tasks does this job have?
num_tasks = ______  # int

# Q2: Which task runs first (has no dependencies)?
first_task = "______"  # str

# Q3: What is the maximum time (in minutes) the pipeline task can run before timeout?
timeout_minutes = ______  # int

# Q4: How many times will validate_results retry on failure?
max_retries = ______  # int

# Q5: If refresh_pipeline fails, will validate_results run (with the default "all succeeded" dependency rule)?
validate_runs_on_failure = ______  # bool (True/False)

In [0]:
# Verification
assert num_tasks == 3, f"Expected 3 tasks, got {num_tasks}"
assert first_task == "refresh_pipeline", f"First task should be refresh_pipeline, got {first_task}"
assert timeout_minutes == 30, f"1800 seconds = 30 minutes, got {timeout_minutes}"
assert max_retries == 2, f"Expected 2 retries, got {max_retries}"
assert validate_runs_on_failure == False, "Dependent tasks do NOT run if their dependency fails"
print("Task 1 PASSED: Job configuration understood correctly")
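
The same facts can also be read out of a job definition programmatically. The sketch below is one possible approach, not part of the lab solution: it parses a trimmed copy of the configuration above with the standard `json` module. The helper name `summarize_job` is hypothetical, and treating a missing `max_retries` as 0 is an assumption about the default.

```python
import json

# Trimmed copy of the job definition above (task bodies omitted).
JOB_JSON = """
{
  "name": "RetailHub_Daily_Refresh",
  "tasks": [
    {"task_key": "refresh_pipeline", "timeout_seconds": 1800},
    {"task_key": "validate_results",
     "depends_on": [{"task_key": "refresh_pipeline"}],
     "max_retries": 2},
    {"task_key": "generate_report",
     "depends_on": [{"task_key": "validate_results"}]}
  ]
}
"""


def summarize_job(job: dict) -> dict:
    """Hypothetical helper: collect the facts asked about in Task 1."""
    tasks = job["tasks"]
    return {
        "num_tasks": len(tasks),
        # Root tasks are the ones without a depends_on entry.
        "root_tasks": [t["task_key"] for t in tasks if not t.get("depends_on")],
        # Convert seconds to minutes for tasks that declare a timeout.
        "timeouts_min": {
            t["task_key"]: t["timeout_seconds"] / 60
            for t in tasks
            if "timeout_seconds" in t
        },
        # Assumption: a task without max_retries is not retried.
        "retries": {t["task_key"]: t.get("max_retries", 0) for t in tasks},
    }


summary = summarize_job(json.loads(JOB_JSON))
print(summary)
```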

---
### Task 2: Trigger Types

Match each business scenario to the correct **trigger type**.

| Scenario | Trigger Type |
|----------|:------------:|
| Run ETL pipeline every day at 6 AM | ? |
| Process files as soon as they land in a Volume | ? |
| Continuously process streaming data with minimal latency | ? |
| Run only when manually triggered by a data engineer | ? |

> **Options:** `scheduled`, `file_arrival`, `continuous`, `manual`

In [0]:
# TODO: Fill in the correct trigger type for each scenario
# Options: "scheduled", "file_arrival", "continuous", "manual"

trigger_daily_6am = "______"
trigger_new_files = "______"
trigger_streaming = "______"
trigger_adhoc = "______"

In [0]:
# Verification
assert trigger_daily_6am == "scheduled", f"Daily at 6 AM = scheduled trigger, got {trigger_daily_6am}"
assert trigger_new_files == "file_arrival", f"Process on file landing = file_arrival, got {trigger_new_files}"
assert trigger_streaming == "continuous", f"Minimal latency streaming = continuous, got {trigger_streaming}"
assert trigger_adhoc == "manual", f"Ad-hoc = manual, got {trigger_adhoc}"
print("Task 2 PASSED: Trigger types matched correctly")
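
As a rough illustration of how these trigger types show up in a job's settings, the sketch below classifies a settings dictionary by the keys it contains. It assumes the Jobs API field names `schedule`, `continuous`, and `trigger` (with `file_arrival`, `table_update`, or `periodic` nested inside). The function `trigger_kind`, the cron value, and the Volume path are illustrative only.

```python
def trigger_kind(settings: dict) -> str:
    """Hypothetical helper: name the trigger type a job settings dict uses."""
    if "continuous" in settings:
        return "continuous"
    if "schedule" in settings:
        return "scheduled"
    trigger = settings.get("trigger", {})
    for kind in ("file_arrival", "table_update", "periodic"):
        if kind in trigger:
            return kind
    # No schedule, trigger, or continuous block: the job only runs on demand.
    return "manual"


# Illustrative settings fragments; all values are placeholders.
EXAMPLES = {
    "daily_6am": {
        "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC"}
    },
    "new_files": {
        "trigger": {"file_arrival": {"url": "/Volumes/retailhub_trainer/default/datasets/"}}
    },
    "streaming": {"continuous": {"pause_status": "UNPAUSED"}},
    "adhoc": {},
}

for name, settings in EXAMPLES.items():
    print(f"{name}: {trigger_kind(settings)}")
```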

---
### Task 3: CRON Expressions

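Write the CRON expression for each schedule using the standard 5-field format. This exercise practices the 5-field form only: Databricks job schedules themselves take **Quartz** cron syntax, which adds a leading seconds field and requires `?` in one of the two day fields.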

**Format:** `minute hour day_of_month month day_of_week`

| Field | Allowed Values |
|-------|---------------|
| Minute | `0–59` |
| Hour | `0–23` |
| Day of month | `1–31` |
| Month | `1–12` |
| Day of week | `0–6` (0 = Sunday); a range such as `1-5` means Mon–Fri |

> **Tip:** Use `*` for "every" and `*/N` for "every N units".

In [0]:
# TODO: Write CRON expressions

# Every day at 6:00 AM
cron_daily_6am = "______"

# Every hour (at minute 0)
cron_hourly = "______"

# Monday to Friday at 8:00 AM
cron_weekdays_8am = "______"

# Every 15 minutes
cron_every_15min = "______"

In [0]:
# Verification
assert cron_daily_6am == "0 6 * * *", f"Expected '0 6 * * *', got '{cron_daily_6am}'"
assert cron_hourly == "0 * * * *", f"Expected '0 * * * *', got '{cron_hourly}'"
assert cron_weekdays_8am == "0 8 * * 1-5", f"Expected '0 8 * * 1-5', got '{cron_weekdays_8am}'"
assert cron_every_15min == "*/15 * * * *", f"Expected '*/15 * * * *', got '{cron_every_15min}'"
print("Task 3 PASSED: CRON expressions correct")
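
Databricks job schedules are written in Quartz cron syntax rather than the 5-field form practiced here, so a conversion step is sometimes useful. The sketch below is a minimal, partial converter, assuming only the simple cases from this exercise: it prepends a seconds field, puts `?` in one of the two day fields, and rewrites numeric weekdays as names because Quartz numbers weekdays differently. The function name `to_quartz` is hypothetical, and step values in the day-of-week field are deliberately rejected.

```python
import re

# Weekday names by 5-field cron number; both 0 and 7 mean Sunday.
_DOW_NAMES = ["SUN", "MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"]


def _dow_name(match: re.Match) -> str:
    number = int(match.group())
    if number > 7:
        raise ValueError(f"day-of-week value out of range: {number}")
    return _DOW_NAMES[number]


def to_quartz(cron5: str) -> str:
    """Convert a simple 5-field cron string to a 6-field Quartz expression."""
    fields = cron5.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    minute, hour, dom, month, dow = fields

    if dow == "*":
        # No weekday constraint: leave day-of-month as is, blank out weekday.
        dow = "?"
    elif dom == "*":
        if "/" in dow:
            raise ValueError("step values in day-of-week are not handled here")
        dom = "?"
        dow = re.sub(r"\d+", _dow_name, dow)
    else:
        raise ValueError("Quartz needs '?' in day-of-month or day-of-week")

    # Quartz starts with a seconds field; fire at second 0.
    return f"0 {minute} {hour} {dom} {month} {dow}"


for expr in ["0 6 * * *", "0 * * * *", "0 8 * * 1-5", "*/15 * * * *"]:
    print(f"{expr!r:>16} -> {to_quartz(expr)!r}")
```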

---
### Task 4: Task Dependencies — DAG Design

Design a task dependency graph (DAG) for the following requirements:

1. **`ingest`** — runs first (no dependencies)
2. **`build_dim_tables`** and **`build_fact_tables`** — run **in parallel** after `ingest` completes *(fan-out)*
3. **`generate_report`** — runs only after **both** DIM and FACT tasks complete *(fan-in)*
4. **`send_notification`** — runs after `generate_report`

```
  ingest
    ├── build_dim_tables ──┐
    └── build_fact_tables ─┤
                           └── generate_report
                                 └── send_notification
```

Define the dependencies for each task as a list.

In [0]:
# TODO: Define task dependencies
# Use a list of task names that each task depends on.
# An empty list [] means no dependencies (runs first).

task_dependencies = {
    "ingest":            ______,  # list of dependencies
    "build_dim_tables":  ______,  # list of dependencies
    "build_fact_tables": ______,  # list of dependencies
    "generate_report":   ______,  # list of dependencies
    "send_notification": ______,  # list of dependencies
}

In [0]:
# Verification
assert task_dependencies["ingest"] == [], "Ingest has no dependencies"
assert task_dependencies["build_dim_tables"] == ["ingest"], "DIM depends on ingest"
assert task_dependencies["build_fact_tables"] == ["ingest"], "FACT depends on ingest"
assert sorted(task_dependencies["generate_report"]) == ["build_dim_tables", "build_fact_tables"], \
    "Report depends on BOTH dim and fact (fan-in)"
assert task_dependencies["send_notification"] == ["generate_report"], "Notification depends on report"
print("Task 4 PASSED: DAG dependencies correct")
print()
print("DAG structure:")
print("  ingest")
print("    +-- build_dim_tables")
print("    +-- build_fact_tables")
print("          +-- generate_report (waits for both)")
print("                +-- send_notification")
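
To see why fan-out tasks can run in parallel, it helps to group a dependency dictionary into "waves" of tasks whose dependencies are all satisfied. The sketch below does this with `graphlib.TopologicalSorter` from the Python standard library (3.9+). It only illustrates the ordering logic and says nothing about how Lakeflow Jobs schedules tasks internally; `execution_waves` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter


def execution_waves(deps: dict) -> list:
    """Group tasks into waves; every task in a wave can start at the same time."""
    sorter = TopologicalSorter(deps)  # maps task -> tasks it depends on
    sorter.prepare()                  # raises CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)                 # mark the whole wave as finished
    return waves


dag = {
    "ingest": [],
    "build_dim_tables": ["ingest"],
    "build_fact_tables": ["ingest"],
    "generate_report": ["build_dim_tables", "build_fact_tables"],
    "send_notification": ["generate_report"],
}

for number, wave in enumerate(execution_waves(dag), start=1):
    print(f"wave {number}: {wave}")
```

The second wave contains both build tasks (fan-out), and `generate_report` appears only in the wave after both have finished (fan-in).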

---
### Task 5: Repair Run Scenarios

The job from Task 4 ran, but **`build_fact_tables` failed**. Every other task that actually ran finished successfully.

Answer the following questions about **Repair Run** behavior.

> **Reminder:** A Repair Run re-executes only the **failed task** and all its **downstream dependents** — it skips tasks that already succeeded.

In [0]:
# TODO: Answer Repair Run questions

# Q1: Which tasks will be RE-EXECUTED during a Repair Run?
# Options: list the task names that will run again
repair_rerun_tasks = [______]  # list of str

# Q2: Will 'ingest' run again during repair?
ingest_reruns = ______  # bool

# Q3: Will 'build_dim_tables' run again during repair?
dim_reruns = ______  # bool

# Q4: Is Repair Run cheaper than a full re-run? (less compute used)
repair_is_cheaper = ______  # bool

In [0]:
# Verification
expected_repair = sorted(["build_fact_tables", "generate_report", "send_notification"])
assert sorted(repair_rerun_tasks) == expected_repair, \
    f"Repair re-runs the failed task + all downstream. Expected {expected_repair}, got {sorted(repair_rerun_tasks)}"
assert ingest_reruns == False, "Ingest succeeded -- NOT re-executed in repair"
assert dim_reruns == False, "DIM succeeded -- NOT re-executed in repair"
assert repair_is_cheaper == True, "Repair skips successful tasks, using less compute"
print("Task 5 PASSED: Repair Run behavior understood correctly")
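
The repair scope described in the reminder can be computed from the dependency dictionary: start at the failed task and walk forward along the edges. The sketch below is a simplified model under that reading (failed task plus everything downstream); `repair_scope` is a hypothetical helper, not a Databricks API.

```python
from collections import deque


def repair_scope(deps: dict, failed: list) -> set:
    """Return the failed tasks plus every task downstream of them."""
    # Invert the graph: task -> tasks that depend on it.
    downstream = {task: [] for task in deps}
    for task, upstreams in deps.items():
        for upstream in upstreams:
            downstream.setdefault(upstream, []).append(task)

    to_run = set(failed)
    queue = deque(failed)
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in to_run:
                to_run.add(child)
                queue.append(child)
    return to_run


dag = {
    "ingest": [],
    "build_dim_tables": ["ingest"],
    "build_fact_tables": ["ingest"],
    "generate_report": ["build_dim_tables", "build_fact_tables"],
    "send_notification": ["generate_report"],
}

rerun = repair_scope(dag, ["build_fact_tables"])
skipped = set(dag) - rerun
print("re-run:", sorted(rerun))
print("kept from original run:", sorted(skipped))
```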

---
### Task 6: Passing Parameters Between Tasks (`dbutils.jobs.taskValues`)

Tasks within a Lakeflow Job can share data using **`dbutils.jobs.taskValues`**.

**Setting a value (in Task A):**
```python
dbutils.jobs.taskValues.set(key="row_count", value=42)
```

**Getting a value (in Task B, which depends on Task A):**
```python
count = dbutils.jobs.taskValues.get(taskKey="task_a", key="row_count")
```

> Since we are not running inside a Job, the code below **simulates** this mechanism using a dictionary. Complete the missing values.

In [0]:
# TODO: Complete the task value operations
# Since we are not inside a Job, we will simulate with a dictionary
from datetime import datetime

# Simulated task values store
task_values = {}

# Task A: "refresh_pipeline" -- sets the number of rows processed
rows_processed = 15420
task_values["refresh_pipeline"] = {"rows_processed": ______}  # TODO: set the value

# Task A also sets the processing timestamp
processing_time = datetime.now().isoformat()
task_values["refresh_pipeline"]["processing_time"] = ______  # TODO: set the value

# Task B: "validate_results" -- reads values from Task A
retrieved_rows = task_values[______][______]  # TODO: get rows_processed from refresh_pipeline
retrieved_time = task_values["refresh_pipeline"]["processing_time"]

In [0]:
# Verification
assert task_values["refresh_pipeline"]["rows_processed"] == 15420, "rows_processed should be 15420"
assert task_values["refresh_pipeline"]["processing_time"] == processing_time, "processing_time should match"
assert retrieved_rows == 15420, "Should retrieve 15420 from refresh_pipeline"
print("Task 6 PASSED: Task value passing works correctly")
print(f"  rows_processed: {retrieved_rows}")
print(f"  processing_time: {retrieved_time}")
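
The dictionary simulation above hides the call shape of the real API. As a closer approximation, the sketch below defines a small in-memory stand-in with `set` and `get` methods that mirror the `dbutils.jobs.taskValues` signatures shown earlier. The class `FakeTaskValues` and its `current_task` attribute are hypothetical, and the error handling and JSON check are assumptions about how the real utility behaves, so treat it as a local testing aid only.

```python
import json


class FakeTaskValues:
    """Hypothetical in-memory stand-in for dbutils.jobs.taskValues."""

    def __init__(self):
        self._store = {}
        self.current_task = None  # the real API infers this from the running task

    def set(self, key, value):
        # Assumption: values must be JSON-serializable; json.dumps raises otherwise.
        json.dumps(value)
        self._store[(self.current_task, key)] = value

    def get(self, taskKey, key, default=None):
        try:
            return self._store[(taskKey, key)]
        except KeyError:
            if default is None:
                raise ValueError(f"no value for {taskKey!r}/{key!r}") from None
            return default


task_values_api = FakeTaskValues()

# Task A writes a value under its own task key.
task_values_api.current_task = "refresh_pipeline"
task_values_api.set(key="rows_processed", value=15420)

# Task B reads it back by naming the upstream task.
task_values_api.current_task = "validate_results"
rows = task_values_api.get(taskKey="refresh_pipeline", key="rows_processed")
rejected = task_values_api.get(taskKey="refresh_pipeline", key="rejected_rows", default=0)
print(rows, rejected)
```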

---
### Task 7: Job Monitoring via System Tables

Databricks **system tables** provide metadata about job executions for monitoring and auditing.

| System Table | Content |
|-------------|---------|
| `system.lakeflow.job_run_timeline` | Job-level run history (status, duration, result) |
| `system.lakeflow.job_task_run_timeline` | Task-level run details (per-task status, timing) |

Write a SQL query to find all **failed** job runs in the last **7 days**.

In [0]:
# TODO: Write a SQL query to find all FAILED job runs in the last 7 days
# Columns available: job_id, run_id, result_state, period_start_time, period_end_time

query_failed_runs = """
SELECT job_id, run_id, result_state, period_start_time, period_end_time
FROM system.lakeflow.job_run_timeline
WHERE result_state = '______'
  AND period_start_time >= current_date() - INTERVAL ______ DAYS
ORDER BY period_start_time DESC
"""

In [0]:
# Verification
assert "FAILED" in query_failed_runs.upper(), "Should filter for FAILED result_state"
assert "7" in query_failed_runs, "Should look back 7 days"
assert "ORDER BY" in query_failed_runs.upper(), "Should order results"
print("Task 7 PASSED: System table query is correct")
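
When the look-back window or the result state varies, the query text can be generated instead of edited by hand. The sketch below is one way to do that in plain Python. The function name `failed_runs_query` is hypothetical, and the default time column `period_start_time` is an assumption about the system table schema that should be checked against your workspace before use.

```python
def failed_runs_query(days: int = 7, result_state: str = "FAILED",
                      time_column: str = "period_start_time") -> str:
    """Build the monitoring query for a given look-back window (hypothetical helper)."""
    if days <= 0:
        raise ValueError("days must be positive")
    # Only allow simple identifiers so the values are safe to inline into SQL.
    if not (result_state.isidentifier() and time_column.isidentifier()):
        raise ValueError("unexpected characters in result_state or time_column")
    return (
        f"SELECT job_id, run_id, result_state, {time_column}\n"
        "FROM system.lakeflow.job_run_timeline\n"
        f"WHERE result_state = '{result_state}'\n"
        f"  AND {time_column} >= current_date() - INTERVAL {int(days)} DAYS\n"
        f"ORDER BY {time_column} DESC"
    )


print(failed_runs_query(days=7))
```

In a notebook, the returned string could then be passed to `spark.sql(...)`.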

---
### Task 8: Job Cluster vs. All-Purpose Cluster

Choose the correct **cluster type** for each scenario.

| Scenario | Cluster Type |
|----------|:------------:|
| Scheduled nightly ETL job (runs 2 hours) | ? |
| Interactive notebook development | ? |
| Ad-hoc data investigation by an analyst | ? |
| Production ML model training (weekly) | ? |

> **Rule of thumb:** Use **Job clusters** for automated/scheduled workloads (cost-optimized, auto-terminates). Use **All-purpose clusters** for interactive/development work (long-running, shared).

In [0]:
# TODO: Assign the correct cluster type
# Options: "job_cluster" or "all_purpose"

# Scenario 1: Scheduled nightly ETL job that runs for 2 hours
nightly_etl = "______"

# Scenario 2: Interactive notebook development and exploration
interactive_dev = "______"

# Scenario 3: Ad-hoc data investigation by an analyst
adhoc_analysis = "______"

# Scenario 4: Production ML model training triggered weekly
ml_training = "______"

In [0]:
# Verification
assert nightly_etl == "job_cluster", "Scheduled ETL -> job cluster (cheaper, auto-terminates)"
assert interactive_dev == "all_purpose", "Interactive work -> all-purpose (stays running)"
assert adhoc_analysis == "all_purpose", "Ad-hoc analysis -> all-purpose (interactive)"
assert ml_training == "job_cluster", "Scheduled ML training -> job cluster (cost-effective)"
print("Task 8 PASSED: Cluster type selection correct")
print()
print("Key insight:")
print("  job_cluster -> automated/scheduled workloads (cost-optimized)")
print("  all_purpose -> interactive/development work (long-running)")

---

## Summary

### Section 1 — Workshop
You created two Lakeflow Jobs through the Databricks UI, configured task dependencies, set up a **Table Update trigger**, and executed the full pipeline end-to-end.

### Section 2 — Practice
You reinforced key orchestration concepts:

| Topic | Key Takeaway |
|-------|-------------|
| **Job Configuration** | Jobs are defined as JSON/YAML with tasks, dependencies, triggers, and compute settings |
| **Trigger Types** | `scheduled`, `file_arrival`, `continuous`, `manual` — choose based on latency requirements |
| **CRON Expressions** | 5-field format: `minute hour day month weekday` |
| **DAG Dependencies** | Fan-out (parallel) and fan-in (wait for all) patterns |
| **Repair Runs** | Re-execute only the failed task + downstream — saves compute cost |
| **Task Values** | `dbutils.jobs.taskValues` for passing data between tasks |
| **System Tables** | `system.lakeflow.job_run_timeline` for monitoring and auditing |
| **Cluster Selection** | Job clusters for automation, All-purpose for interactive work |

> **Exam Tip:** Focus on **when** to use each trigger type, how repair runs scope re-execution, and the cost implications of cluster selection.