# LAB 07B: Orchestration — Multi-task Jobs with Triggers

**Duration:** ~45 min | **Day:** 3 | **Difficulty:** Advanced  
**After module:** M08: Orchestration & Lakeflow Jobs

> *"Build two Lakeflow Jobs that orchestrate medallion pipeline notebooks:  
> Job A uses a File Arrival trigger, Job B uses a Table Update trigger."*

**What you'll do:**
1. Import medallion notebooks to your workspace
2. Create **Job A**: orders pipeline with a File Arrival trigger
3. Create **Job B**: customer pipeline with a Table Update trigger
4. Trigger both jobs and verify results

## Setup

In [None]:
%run ../../setup/00_setup

---
## Task 1: Prepare Workspace

Import the medallion notebooks from `materials/medallion/` into your Databricks workspace.

**Steps:**
1. In Databricks Workspace, create folder: `/Workspace/Users/<your-email>/medallion/`
2. Import (or copy) the following notebooks:

| Notebook | Layer | Purpose |
|----------|-------|---------|
| `bronze_orders.py` | Bronze | Batch JSON → Delta |
| `silver_orders_cleaned.py` | Silver | Quality filters + computed columns |
| `gold_daily_orders.py` | Gold | Daily aggregation |
| `bronze_customers.py` | Bronze | Batch CSV → Delta |
| `silver_customers.py` | Silver | Dedup + normalize |
| `gold_customer_orders_summary.py` | Gold | Join + aggregate metrics |

3. Also import: `materials/orchestration/task_validate_pipeline.py`

<!-- INSTRUCTOR: Screenshot placeholder - workspace with imported notebooks -->

In [None]:
# Verify dataset files exist
print("Orders files:")
for f in dbutils.fs.ls(f"{DATASET_PATH}/orders/stream/"):
    print(f"  {f.name} ({f.size} bytes)")

print("\nCustomers files:")
for f in dbutils.fs.ls(f"{DATASET_PATH}/customers/"):
    print(f"  {f.name} ({f.size} bytes)")

---
## Task 2: Create Job A — Orders Pipeline (File Arrival)

Create a multi-task Job that processes orders through the medallion layers.

**Job configuration:**

| Setting | Value |
|---------|-------|
| Job name | `LAB_Orders_Pipeline` |
| Cluster | Serverless or new Job cluster |

**Tasks (in order):**

| # | Task name | Notebook | Depends on | Parameters |
|---|-----------|----------|------------|------------|
| 1 | `bronze_orders` | `medallion/bronze_orders` | — | `catalog`, `schema=bronze`, `source_path` |
| 2 | `silver_orders` | `medallion/silver_orders_cleaned` | `bronze_orders` | `catalog`, `schema_bronze=bronze`, `schema_silver=silver` |
| 3 | `gold_daily` | `medallion/gold_daily_orders` | `silver_orders` | `catalog`, `schema_silver=silver`, `schema_gold=gold` |
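
The "Depends on" column above corresponds to the `depends_on` field of each task when a job is defined as a Jobs API payload instead of through the UI. The following is a minimal sketch, not the lab's reference solution: `NOTEBOOK_ROOT`, the helpers `notebook_task` and `execution_order`, and the `<source_path>` placeholder are assumptions made for illustration. It runs as plain Python and only builds and checks the structure; it does not call Databricks.

```python
import json

# Hypothetical workspace folder; replace with the folder you created in Task 1.
NOTEBOOK_ROOT = "/Workspace/Users/<your-email>/medallion"


def notebook_task(task_key, notebook, depends_on=(), parameters=None):
    """Build one notebook task entry with optional upstream dependencies."""
    task = {
        "task_key": task_key,
        "notebook_task": {
            "notebook_path": f"{NOTEBOOK_ROOT}/{notebook}",
            "base_parameters": dict(parameters or {}),
        },
    }
    if depends_on:
        task["depends_on"] = [{"task_key": key} for key in depends_on]
    return task


# The catalog reference and <source_path> are placeholders, not lab-provided values.
CATALOG_REF = "{{job.parameters.catalog}}"

orders_tasks = [
    notebook_task(
        "bronze_orders", "bronze_orders",
        parameters={"catalog": CATALOG_REF, "schema": "bronze",
                    "source_path": "<source_path>"},
    ),
    notebook_task(
        "silver_orders", "silver_orders_cleaned", depends_on=["bronze_orders"],
        parameters={"catalog": CATALOG_REF, "schema_bronze": "bronze",
                    "schema_silver": "silver"},
    ),
    notebook_task(
        "gold_daily", "gold_daily_orders", depends_on=["silver_orders"],
        parameters={"catalog": CATALOG_REF, "schema_silver": "silver",
                    "schema_gold": "gold"},
    ),
]


def execution_order(tasks):
    """Return task keys in an order that respects depends_on (topological sort)."""
    remaining = {
        t["task_key"]: {d["task_key"] for d in t.get("depends_on", [])}
        for t in tasks
    }
    order = []
    while remaining:
        done = set(order)
        ready = sorted(key for key, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("Cycle or unknown dependency in task list")
        for key in ready:
            order.append(key)
            del remaining[key]
    return order


print(execution_order(orders_tasks))
print(json.dumps(orders_tasks[1], indent=2))
```

Checking the order locally is a cheap way to catch a misspelled task name in "Depends on" before the job is created.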

**Trigger:**
- Type: **File Arrival**
- URL: `/Volumes/<catalog>/default/landing_zone/trigger`
- Min time between triggers: `60s`
- Wait after last change: `15s`
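
For reference, the same trigger settings can be written as a fragment of a job definition. This is a sketch under assumptions: the field names follow the Jobs API trigger settings as I understand them, `my_catalog` is a hypothetical catalog name, and the helper `file_arrival_trigger` is invented for illustration. The numbers are copied from the list above; the service may enforce its own minimum values, so check the API reference before relying on them.

```python
import json


def file_arrival_trigger(url, min_time_between_triggers_seconds,
                         wait_after_last_change_seconds):
    """Build the trigger fragment of a job definition for a File Arrival trigger."""
    return {
        "trigger": {
            "pause_status": "UNPAUSED",
            "file_arrival": {
                "url": url,
                "min_time_between_triggers_seconds": min_time_between_triggers_seconds,
                "wait_after_last_change_seconds": wait_after_last_change_seconds,
            },
        }
    }


# "my_catalog" is a placeholder; use the catalog assigned to you in the lab.
job_a_trigger = file_arrival_trigger(
    url="/Volumes/my_catalog/default/landing_zone/trigger",
    min_time_between_triggers_seconds=60,
    wait_after_last_change_seconds=15,
)

print(json.dumps(job_a_trigger, indent=2))
```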

<!-- INSTRUCTOR: Screenshot placeholder - Job A DAG view -->
<!-- INSTRUCTOR: Screenshot placeholder - File Arrival trigger config -->

**Steps:**
- [ ] Workflows → Create Job
- [ ] Add 3 tasks with dependencies (DAG)
- [ ] Set parameters for each task
- [ ] Add File Arrival trigger
- [ ] **Do NOT run yet** — we'll trigger it in Task 3

---
## Task 3: Trigger Job A — Create Signal File

Job A is configured with a File Arrival trigger. To start it, create a file in the monitored Volume path.

**Fill in** the Volume path and run the cell to create a signal file:

In [None]:
import uuid, json
from datetime import datetime

# TODO: Set your Volume path (same as configured in Job A trigger)
volume_path = f"/Volumes/{CATALOG}/________/landing_zone/trigger"  # Fill schema

# Create signal file
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
file_id = str(uuid.uuid4())[:8]
file_path = f"{volume_path}/signal_{timestamp}_{file_id}.json"

signal = {"event": "data_ready", "timestamp": timestamp, "id": file_id}

dbutils.fs.mkdirs(volume_path)
dbutils.fs.put(file_path, json.dumps(signal), overwrite=True)

print(f"Signal file created: {file_path}")
print("Job A should trigger within ~60 seconds!")

In [None]:
# Validation: file was created
files = dbutils.fs.ls(volume_path)
assert len(files) > 0, "No files found in trigger path"
print(f"Task 3 OK: {len(files)} file(s) in {volume_path}")
for f in files:
    print(f"  {f.name}")

---
## Task 4: Create Job B — Customer Pipeline (Table Update)

Create a second Job that processes customers and joins with orders.
This Job triggers when the **silver_orders_cleaned** table is updated (by Job A).

**Job configuration:**

| Setting | Value |
|---------|-------|
| Job name | `LAB_Customer_Pipeline` |
| Cluster | Serverless or new Job cluster |

**Tasks (in order):**

| # | Task name | Notebook | Depends on | Parameters |
|---|-----------|----------|------------|------------|
| 1 | `bronze_customers` | `medallion/bronze_customers` | — | `catalog`, `schema=bronze`, `source_path` |
| 2 | `silver_customers` | `medallion/silver_customers` | `bronze_customers` | `catalog`, `schema_bronze=bronze`, `schema_silver=silver` |
| 3 | `gold_summary` | `medallion/gold_customer_orders_summary` | `silver_customers` | `catalog`, `schema_silver=silver`, `schema_gold=gold` |
| 4 | `validate` | `orchestration/task_validate_pipeline` | `gold_summary` | `catalog`, all schemas, `job_run_id={{job.run_id}}` |

**Trigger:**
- Type: **Table updated**
- Table: `<catalog>.silver.silver_orders_cleaned`
- Condition: Any new rows
- Min time between triggers: `60s`
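
As a job definition fragment, a Table Update trigger might look like the sketch below. The field names (`table_update`, `table_names`, `condition`) reflect my reading of the Jobs API and should be verified against the API reference; `my_catalog` and the helper `table_update_trigger` are hypothetical. In the API, `condition` describes how updates to several monitored tables are combined, which is not necessarily the same setting as the "Any new rows" option listed above.

```python
import json


def table_update_trigger(table_names, min_time_between_triggers_seconds,
                         condition="ANY_UPDATED"):
    """Build the trigger fragment of a job definition for a Table Update trigger."""
    return {
        "trigger": {
            "table_update": {
                "table_names": list(table_names),
                "condition": condition,
                "min_time_between_triggers_seconds": min_time_between_triggers_seconds,
            }
        }
    }


# "my_catalog" is a placeholder; the schema matches the table checked in Task 5.
job_b_trigger = table_update_trigger(
    table_names=["my_catalog.silver.silver_orders_cleaned"],
    min_time_between_triggers_seconds=60,
)

print(json.dumps(job_b_trigger, indent=2))
```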

<!-- INSTRUCTOR: Screenshot placeholder - Job B DAG view -->
<!-- INSTRUCTOR: Screenshot placeholder - Table Update trigger config -->

**Steps:**
- [ ] Workflows → Create Job
- [ ] Add 4 tasks with dependencies (DAG)
- [ ] Set parameters — use `{{job.parameters.catalog}}` for dynamic catalog
- [ ] Add **Table updated** trigger → point to `silver_orders_cleaned`
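- [ ] Save Job (it will trigger automatically when Job A updates silver_orders_cleaned). If Job A already finished before you saved Job B and Job B does not start, rerun the Task 3 cell to create a new signal file so Job A updates the table again.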

---
## Task 5: Verify Pipeline Execution

After both jobs complete, verify the results.

**Wait for both jobs to complete** (check the Workflows UI), then check the tables:

In [None]:
# TODO: Uncomment and run after jobs complete

# -- Check bronze layer
# print("Bronze Orders:", spark.table(f"{CATALOG}.bronze.bronze_orders").count())
# print("Bronze Customers:", spark.table(f"{CATALOG}.bronze.bronze_customers").count())

# -- Check silver layer
# print("Silver Orders:", spark.table(f"{CATALOG}.silver.silver_orders_cleaned").count())
# print("Silver Customers:", spark.table(f"{CATALOG}.silver.silver_customers").count())

# -- Check gold layer
# print("Gold Daily Orders:", spark.table(f"{CATALOG}.gold.gold_daily_orders").count())
# print("Gold Customer Summary:", spark.table(f"{CATALOG}.gold.gold_customer_orders_summary").count())
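
The row-count checks above can also be wrapped in a small helper that reports which tables are still empty. This is a sketch, not part of the lab materials: `find_empty_tables` and `LAYER_TABLES` are hypothetical names, and the counts below are made-up stand-ins so the code runs outside Databricks. In the notebook you would pass `lambda name: spark.table(name).count()` as `count_rows`.

```python
# Tables from the verification cell above, written as schema.table.
LAYER_TABLES = [
    "bronze.bronze_orders",
    "bronze.bronze_customers",
    "silver.silver_orders_cleaned",
    "silver.silver_customers",
    "gold.gold_daily_orders",
    "gold.gold_customer_orders_summary",
]


def find_empty_tables(catalog, count_rows, tables=LAYER_TABLES):
    """Return (counts, empty) where empty lists fully qualified tables with 0 rows."""
    counts = {}
    for name in tables:
        full_name = f"{catalog}.{name}"
        counts[full_name] = count_rows(full_name)
    empty = [name for name, rows in counts.items() if rows == 0]
    return counts, empty


# Made-up counts standing in for spark.table(name).count().
fake_counts = {f"demo_catalog.{name}": 10 for name in LAYER_TABLES}
fake_counts["demo_catalog.gold.gold_daily_orders"] = 0

counts, empty = find_empty_tables("demo_catalog", fake_counts.__getitem__)
for name, rows in counts.items():
    print(f"{name}: {rows} rows")
print("Empty tables:", empty)
```

An empty list means every layer received data; a non-empty list points at the task to inspect in the Workflows UI.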

---
## Task 6: Check Event Log

The validation task logs results to `pipeline_event_log`. Check if it recorded success:

In [None]:
# TODO: Uncomment and run after validation task completes

# display(spark.sql(f"""
#     SELECT event_id, event_timestamp, job_run_id, status, details
#     FROM {CATALOG}.default.pipeline_event_log
#     WHERE event_type = 'PIPELINE_VALIDATION'
#     ORDER BY event_timestamp DESC
#     LIMIT 5
# """))

---
## Task 7: Cross-Job Orchestration Pattern

Fill in the blanks to describe the trigger chain:

```
Signal file → ________ trigger → Job A runs → writes silver_orders_cleaned
                                                         ↓
                                              ________ trigger → Job B runs → validates all tables
```

**Questions:**
1. What happens if Job A fails at `silver_orders` task? Does Job B trigger?
2. What is the advantage of Repair Runs when a task fails?
3. How would you add an email alert for Job B failures?

---
## Lab Complete!

### What you've learned:
- Creating multi-task Jobs with task dependencies (DAG)
- Configuring **File Arrival** triggers on UC Volumes
- Configuring **Table Update** triggers for cross-job orchestration
- Using a validation task with `pipeline_event_log` for monitoring
- Passing parameters with Job-level parameters and dynamic value references such as `{{job.run_id}}`

### Exam Tips:
- **File Arrival trigger** monitors a cloud storage path or UC Volume for new files
- **Table Update trigger** fires when a monitored Delta table is updated (in this lab, when Job A writes new rows to `silver_orders_cleaned`)
- **Repair Runs** re-execute only failed and downstream tasks, skipping successful ones
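- **`dbutils.notebook.exit()`** ends the notebook and returns a string (often serialized JSON) to the caller of `dbutils.notebook.run()`; to pass values between tasks in a Job, use `dbutils.jobs.taskValues.set()` and `dbutils.jobs.taskValues.get()`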
- **`max_concurrent_runs: 1`** prevents duplicate runs when triggers fire rapidly
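- Jobs can run on **Serverless compute**, dedicated **Job clusters**, or existing All-Purpose clusters; Serverless or Job clusters are the recommended choice for scheduled and triggered workloads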
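
To make the Repair Runs rule concrete, the sketch below computes which tasks would be re-executed for Job B if one task fails. It is a plain-Python model of the selection rule described above, not Databricks' implementation; `tasks_to_repair` and the `job_b` mapping are hypothetical names built from the Task 4 table.

```python
def tasks_to_repair(dependencies, failed):
    """Return the failed tasks plus every task downstream of them.

    dependencies maps each task name to the list of tasks it depends on.
    """
    to_run = set(failed)
    changed = True
    while changed:
        changed = False
        for task, upstream in dependencies.items():
            # A task must re-run if any of its upstream tasks is being re-run.
            if task not in to_run and to_run.intersection(upstream):
                to_run.add(task)
                changed = True
    return to_run


# Job B from Task 4: task -> tasks it depends on.
job_b = {
    "bronze_customers": [],
    "silver_customers": ["bronze_customers"],
    "gold_summary": ["silver_customers"],
    "validate": ["gold_summary"],
}

print(sorted(tasks_to_repair(job_b, {"gold_summary"})))
# ['gold_summary', 'validate']
```

Here `bronze_customers` and `silver_customers` already succeeded, so a repair skips them and saves their compute time.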

### Trigger Comparison:

| Trigger | Fires when | Best for | Config key |
|---------|-----------|----------|------------|
| **Scheduled** | CRON time reached | Regular ETL | `schedule.quartz_cron_expression` |
| **File Arrival** | New file in path | Event-driven ingestion | `trigger.file_arrival.url` |
| **Table Update** | Table update detected | Cross-pipeline chains | `trigger.table_update.table_names` |
| **Continuous** | Always running | Near-real-time | `continuous` |

---

## Cleanup (Optional)

In [None]:
# Delete trigger files
# dbutils.fs.rm(f"/Volumes/{CATALOG}/default/landing_zone/trigger", recurse=True)

# Delete both Jobs from Workflows UI
print("LAB 07B complete. Delete Jobs from Workflows UI when done.")