# M08: Orchestration & Lakeflow Jobs

Orchestration is the glue connecting pipelines into production workflows. We'll learn to create multi-task Jobs with dependencies (DAG), configure triggers (CRON, File Arrival), set up retries and alerts, pass parameters between tasks (taskValues), and monitor execution via System Tables.

| Exam Domain | Weight |
|---|---|
| Production Pipelines | 13% |


---

## Introduction to Lakeflow Jobs

**Lakeflow Jobs** (formerly Databricks Jobs) is a managed orchestration service.

| Scenario | Solution |
|----------|----------|
| ETL pipeline with multiple steps | Multi-task Job |
| Daily report at fixed time | Scheduled Job |
| Reaction to new files | File Arrival Trigger |
| ML training pipeline | Job with notebook tasks |
| Run Lakeflow Pipeline | Job with Pipeline task |

| Feature | Jobs | Lakeflow Pipelines |
|---------|------|-----|
| Orchestration | General | ETL only |
| Dependencies | Manual configuration | Automatic (DAG) |
| Data Quality | Custom code | Built-in expectations |
| Flexibility | High | Opinionated |

**Best Practice**: Use Lakeflow Pipelines for ETL, Jobs for orchestrating Pipelines + other tasks.

---

### Task Types in Lakeflow Jobs

Overview of commonly used task types in Lakeflow Jobs and when to use each one.

| Task Type | Description | Use Case |
|-----------|-------------|----------|
| Notebook | Run a Databricks notebook | ETL logic, ML training |
| Pipeline | Run a Lakeflow Declarative Pipeline | Streaming/batch pipelines |
| Python Script | Run a Python file | Utility scripts |
| SQL | Run a SQL query | DDL, reporting queries |
| JAR | Run a Java/Scala JAR | Legacy Spark jobs |
| Spark Submit | Submit a Spark application | Custom Spark apps |
| If/Else Condition | Branch based on condition | Conditional workflows |
| For Each | Iterate over a list | Parameterized batch runs |

**Repair Runs:** Re-runs only **failed and downstream tasks**, skipping successful ones — saves compute and time.

**Exam Note:** Know that repair runs skip already-successful tasks. If/Else and For Each enable conditional and iterative workflows.
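
The repair-run rule above (re-run the failed tasks plus everything downstream, skip the rest) can be sketched in plain Python. This is only an illustration of the selection logic, not how Lakeflow Jobs implements it; the function name `tasks_to_repair` and the three-task job are hypothetical.

```python
from collections import deque

def tasks_to_repair(depends_on, failed):
    """Return the failed tasks plus every task downstream of them."""
    # Invert "task -> upstream tasks" into "task -> downstream tasks"
    downstream = {task: [] for task in depends_on}
    for task, upstreams in depends_on.items():
        for upstream in upstreams:
            downstream[upstream].append(task)

    # Breadth-first walk starting from the failed tasks
    to_rerun = set(failed)
    queue = deque(failed)
    while queue:
        current = queue.popleft()
        for child in downstream[current]:
            if child not in to_rerun:
                to_rerun.add(child)
                queue.append(child)
    return to_rerun

# Hypothetical job: validate -> transform -> report
depends_on = {"validate": [], "transform": ["validate"], "report": ["transform"]}

# 'transform' failed: 'validate' already succeeded and is skipped
print(sorted(tasks_to_repair(depends_on, ["transform"])))
```

With `transform` marked as failed, only `transform` and `report` are selected; `validate` is not re-run.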

## Preparing Notebooks for Job

Below are 3 simple notebooks that we'll use in the demo.

**Instructions**: 
1. Create folder `/Workspace/Users/<your-email>/jobs_demo/`
2. Copy each of the following code snippets to a separate notebook

---

### Task 1: Validate Source

Validates row count against `min_rows` threshold. Returns status + count via `dbutils.notebook.exit()`.

In [0]:
# TASK 1: Validate Source Data
# Copy this code to notebook: task_01_validate

# Parameters from Job
dbutils.widgets.text("source_table", "samples.nyctaxi.trips")
dbutils.widgets.text("min_rows", "100")

source_table = dbutils.widgets.get("source_table")
min_rows = int(dbutils.widgets.get("min_rows"))

# Validation
df = spark.table(source_table)
row_count = df.count()

if row_count < min_rows:
    raise Exception(f"Validation FAILED: {row_count} rows < {min_rows} minimum")

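# Return result to next task
import json
result = json.dumps({
    "status": "SUCCESS",
    "source_table": source_table,
    "row_count": row_count
})

# Publish the result as a task value so that downstream tasks can read it with
# dbutils.jobs.taskValues.get(taskKey="validate", key="returnValue")
dbutils.jobs.taskValues.set(key="returnValue", value=result)

# Also return it as the notebook exit value (visible in the run output)
dbutils.notebook.exit(result)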

### Task 2: Transform Data

Reads previous task result via `taskValues`, applies transformations (duration, cost per mile). Returns row count.

In [0]:
# TASK 2: Transform Data
# Copy this code to notebook: task_02_transform

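# The star import intentionally shadows round/sum/max with their Spark column versions
from pyspark.sql.functions import *
from datetime import date
import json

# Parameters
dbutils.widgets.text("source_table", "samples.nyctaxi.trips")
dbutils.widgets.text("run_date", "")

source_table = dbutils.widgets.get("source_table")
# Fall back to today's date (YYYY-MM-DD) when run_date is not supplied;
# str(current_date()) would only produce the text of a Column expression
run_date = dbutils.widgets.get("run_date") or date.today().isoformat()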

# Get result from previous task (optional)
try:
    prev_result = dbutils.jobs.taskValues.get(
        taskKey="validate",
        key="returnValue",
        default="{}"
    )
    prev_data = json.loads(prev_result)
    print(f"Previous task result: {prev_data}")
except Exception:
    print("Running standalone (no previous task)")

# Transformation
print(f"Transforming: {source_table}")

df = spark.table(source_table)

df_transformed = (
    df
    .withColumn("trip_duration_minutes", 
                round((col("tpep_dropoff_datetime").cast("long") - 
                       col("tpep_pickup_datetime").cast("long")) / 60, 2))
    .withColumn("cost_per_mile", 
                when(col("trip_distance") > 0, 
                     round(col("fare_amount") / col("trip_distance"), 2))
                .otherwise(0))
    .withColumn("processing_date", lit(run_date))
)

row_count = df_transformed.count()
print(f"Transformed {row_count} rows")

df_transformed.select(
    "trip_distance", "fare_amount", "trip_duration_minutes", "cost_per_mile"
).show(5)

# Return result
dbutils.notebook.exit(json.dumps({
    "status": "SUCCESS",
    "rows_transformed": row_count
}))
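
A note on the hand-off between tasks: `dbutils.jobs.taskValues.get()` reads values that an upstream task stored with `dbutils.jobs.taskValues.set()`; the notebook exit value is a separate mechanism. The sketch below shows the set/get pairing with JSON payloads. `InMemoryTaskValues` is a hypothetical local stand-in so the example runs outside Databricks, and the helper names and sample payload are illustrative assumptions.

```python
import json

class InMemoryTaskValues:
    """Hypothetical local stand-in that mimics the set/get calls of dbutils.jobs.taskValues."""

    def __init__(self):
        self._store = {}
        self.current_task = None  # name of the task that is "running"

    def set(self, key, value):
        self._store[(self.current_task, key)] = value

    def get(self, taskKey, key, default=None, debugValue=None):
        return self._store.get((taskKey, key), default)

def publish_result(task_values, result):
    """Upstream task: store the result as a JSON string under 'returnValue'."""
    task_values.set(key="returnValue", value=json.dumps(result))

def read_result(task_values, upstream_task):
    """Downstream task: read the upstream value, falling back to an empty dict."""
    raw = task_values.get(taskKey=upstream_task, key="returnValue", default="{}")
    return json.loads(raw)

task_values = InMemoryTaskValues()  # inside a job, pass dbutils.jobs.taskValues instead
task_values.current_task = "validate"
publish_result(task_values, {"status": "SUCCESS", "row_count": 100})  # sample payload

print(read_result(task_values, "validate"))
print(read_result(task_values, "missing_task"))  # falls back to {}
```

Keeping the payload as a JSON string means the downstream task can parse it the same way regardless of how many fields the upstream task adds later.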

### Task 3: Generate Report

Aggregates metrics (total trips, revenue, avg fare/distance). Prints summary report.

In [0]:
# TASK 3: Generate Report
# Copy this code to notebook: task_03_report

from pyspark.sql.functions import *
import json
from datetime import datetime

# Parameters
dbutils.widgets.text("source_table", "samples.nyctaxi.trips")

source_table = dbutils.widgets.get("source_table")

# Aggregations
df = spark.table(source_table)

report = df.agg(
    count("*").alias("total_trips"),
    round(sum("fare_amount"), 2).alias("total_revenue"),
    round(avg("fare_amount"), 2).alias("avg_fare"),
    round(avg("trip_distance"), 2).alias("avg_distance"),
    round(max("fare_amount"), 2).alias("max_fare")
).collect()[0]

# Display report
print("\n" + "="*50)
print("DAILY REPORT")
print("="*50)
print(f"Avg Fare:       ${report.avg_fare:.2f}")
print(f"Avg Distance:   {report.avg_distance:.2f} miles")
print(f"Max Fare:       ${report.max_fare:.2f}")
print("="*50)
print(f"Generated at:   {datetime.now()}")
print("="*50 + "\n")

# Return result
dbutils.notebook.exit(json.dumps({
    "status": "SUCCESS",
    "total_trips": report.total_trips,
    "total_revenue": float(report.total_revenue)
}))

### [UI DEMO] Creating Multi-task Job

**Step 1: Create new Job**
- [ ] Workflows → Create Job
- [ ] Name: `Demo_ETL_Pipeline`

<img src="../../../assets/images/93c107ca21a54aab98249cf47db0337d.png" width="800">


- [ ] Compute: create a new job cluster

<img src="../../../assets/images/a967557a143a40c0ac7ed26ce469866a.png" width="800">

**Step 2: Add Task 1 (Validate)**
- [ ] Task name: `validate`
- [ ] Type: Notebook
- [ ] Path: `/Workspace/.../task_01_validate`
- [ ] Cluster: Serverless or new Job Cluster
- [ ] Parameters: `source_table` = `samples.nyctaxi.trips`

<img src="../../../assets/images/214b868309344df3a6e81f3cc2a84c13.png" width="800">

**Step 3: Add Task 2 (Transform)**
- [ ] Task name: `transform`
- [ ] Depends on: `validate`
- [ ] Path: `/Workspace/.../task_02_transform`
- [ ] Parameters: `source_table` = `samples.nyctaxi.trips`

**Step 4: Add Task 3 (Report)**
- [ ] Task name: `report`
- [ ] Depends on: `transform`
- [ ] Path: `/Workspace/.../task_03_report`

<img src="../../../assets/images/a3cc387f44de4247bf275bdbd38efb84.png" width="800">

**Step 5: Run Job**
- [ ] Run now
- [ ] Show: DAG visualization
- [ ] Show: Task logs and output

---

## [UI DEMO 2] Medallion Pipeline Job — Ready-to-Use Config

We already have 6 production-ready notebooks in `lab/materials/medallion/` that implement a full Medallion pipeline:

| Layer | Notebook | Input | Output |
|-------|----------|-------|--------|
| **Bronze** | `bronze_customers` | CSV files | `bronze.bronze_customers` |
| **Bronze** | `bronze_orders` | JSON files | `bronze.bronze_orders` |
| **Silver** | `silver_customers` | bronze_customers | `silver.silver_customers` |
| **Silver** | `silver_orders_cleaned` | bronze_orders | `silver.silver_orders_cleaned` |
| **Gold** | `gold_customer_orders_summary` | silver_customers + silver_orders_cleaned | `gold.gold_customer_orders_summary` |
| **Gold** | `gold_daily_orders` | silver_orders_cleaned | `gold.gold_daily_orders` |

**DAG Structure:**
```
bronze_customers ──→ silver_customers ──────→ gold_customer_orders_summary
bronze_orders ────→ silver_orders_cleaned ─┤
                                           └→ gold_daily_orders
```

**How to deploy:** Copy the JSON config below → Workflows → Create Job → switch to JSON editor (or use Databricks CLI: `databricks jobs create --json @job.json`).
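
To see how the dependencies in the DAG above translate into an execution order, here is a small sketch using the standard-library `graphlib` module (Python 3.9+). It groups tasks into "waves" that could run in parallel. The function name `execution_waves` is an assumption for illustration, and a real scheduler may start a task as soon as its own upstream tasks finish rather than waiting for a whole wave.

```python
from graphlib import TopologicalSorter

# task -> set of upstream tasks, mirroring the DAG structure shown above
depends_on = {
    "bronze_customers": set(),
    "bronze_orders": set(),
    "silver_customers": {"bronze_customers"},
    "silver_orders_cleaned": {"bronze_orders"},
    "gold_customer_orders_summary": {"silver_customers", "silver_orders_cleaned"},
    "gold_daily_orders": {"silver_orders_cleaned"},
}

def execution_waves(graph):
    """Group tasks into waves; every task in a wave has all upstream tasks finished."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose upstream tasks are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves

for number, wave in enumerate(execution_waves(depends_on), start=1):
    print(f"Wave {number}: {wave}")
```

The three waves correspond to the Bronze, Silver, and Gold layers.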

---

In [None]:
import json

# ============================================================
# Medallion Pipeline Job — JSON Configuration
# Copy this JSON → Databricks Workflows → JSON Editor
# Or deploy via CLI: databricks jobs create --json @job.json
# ============================================================

# IMPORTANT: Replace these placeholders before deploying:
#   <YOUR_CATALOG>     → your Unity Catalog name (e.g. retailhub_jsmith)
#   <NOTEBOOK_ROOT>    → workspace path to medallion notebooks
#   <SOURCE_CUSTOMERS> → path to customers CSV (e.g. /Volumes/.../customers.csv)
#   <SOURCE_ORDERS>    → path to orders JSON  (e.g. /Volumes/.../orders/)

job_config = {
    "name": "RetailHub_Medallion_Pipeline",
    "tasks": [
        {
            "task_key": "bronze_customers",
            "notebook_task": {
                "notebook_path": "<NOTEBOOK_ROOT>/bronze_customers",
                "base_parameters": {
                    "catalog": "<YOUR_CATALOG>",
                    "schema": "bronze",
                    "source_path": "<SOURCE_CUSTOMERS>"
                }
            },
            "job_cluster_key": "pipeline_cluster"
        },
        {
            "task_key": "bronze_orders",
            "notebook_task": {
                "notebook_path": "<NOTEBOOK_ROOT>/bronze_orders",
                "base_parameters": {
                    "catalog": "<YOUR_CATALOG>",
                    "schema": "bronze",
                    "source_path": "<SOURCE_ORDERS>"
                }
            },
            "job_cluster_key": "pipeline_cluster"
        },
        {
            "task_key": "silver_customers",
            "depends_on": [{"task_key": "bronze_customers"}],
            "notebook_task": {
                "notebook_path": "<NOTEBOOK_ROOT>/silver_customers",
                "base_parameters": {
                    "catalog": "<YOUR_CATALOG>",
                    "schema_bronze": "bronze",
                    "schema_silver": "silver"
                }
            },
            "job_cluster_key": "pipeline_cluster"
        },
        {
            "task_key": "silver_orders_cleaned",
            "depends_on": [{"task_key": "bronze_orders"}],
            "notebook_task": {
                "notebook_path": "<NOTEBOOK_ROOT>/silver_orders_cleaned",
                "base_parameters": {
                    "catalog": "<YOUR_CATALOG>",
                    "schema_bronze": "bronze",
                    "schema_silver": "silver"
                }
            },
            "job_cluster_key": "pipeline_cluster"
        },
        {
            "task_key": "gold_customer_orders_summary",
            "depends_on": [
                {"task_key": "silver_customers"},
                {"task_key": "silver_orders_cleaned"}
            ],
            "notebook_task": {
                "notebook_path": "<NOTEBOOK_ROOT>/gold_customer_orders_summary",
                "base_parameters": {
                    "catalog": "<YOUR_CATALOG>",
                    "schema_silver": "silver",
                    "schema_gold": "gold"
                }
            },
            "job_cluster_key": "pipeline_cluster"
        },
        {
            "task_key": "gold_daily_orders",
            "depends_on": [{"task_key": "silver_orders_cleaned"}],
            "notebook_task": {
                "notebook_path": "<NOTEBOOK_ROOT>/gold_daily_orders",
                "base_parameters": {
                    "catalog": "<YOUR_CATALOG>",
                    "schema_silver": "silver",
                    "schema_gold": "gold"
                }
            },
            "job_cluster_key": "pipeline_cluster"
        }
    ],
    "job_clusters": [
        {
            "job_cluster_key": "pipeline_cluster",
            "new_cluster": {
                "spark_version": "16.2.x-scala2.12",
                "num_workers": 1,
                "node_type_id": "i3.xlarge",
                "data_security_mode": "SINGLE_USER"
            }
        }
    ],
    "trigger": {
        "pause_status": "PAUSED",
        "periodic": {
            "interval": 1,
            "unit": "DAYS"
        }
    },
    "queue": {"enabled": True},
    "max_concurrent_runs": 1
}

print(json.dumps(job_config, indent=2))
print(f"\n--- Total tasks: {len(job_config['tasks'])}")
print("--- DAG: bronze(2) → silver(2) → gold(2)")

### Deploy Checklist

**Option A — JSON Editor (UI):**
1. [ ] Workflows → Create Job
2. [ ] Click `⚙ Edit as JSON` (top-right)
3. [ ] Paste the JSON above (replace placeholders)
4. [ ] Save → Run now

**Option B — Databricks CLI:**
```bash
# Save JSON to file, then:
databricks jobs create --json @medallion_pipeline_job.json
```

**After deployment — show participants:**
- [ ] DAG visualization (two parallel Bronze branches, fan-in at Gold)
- [ ] Task-level parameters (catalog, schema, source_path)
- [ ] Trigger: set to PAUSED — run manually for demo
- [ ] Run → observe sequential layer execution (Bronze → Silver → Gold)
- [ ] Show Repair Run: intentionally fail one task → repair reruns only failed + downstream

---

## [UI DEMO] Triggers and Schedule

How to configure different trigger types for Lakeflow Jobs — scheduled (CRON), file arrival, continuous, and manual triggers.

---

| Trigger Type | Usage | Example |
|---|---|---|
| **Scheduled** | Fixed schedule (CRON) | `0 0 2 * * ?` — daily at 2:00 |
| **File arrival** | Reaction to new files | New file in UC Volume |
| **Continuous** | Continuous processing | Streaming-like |
| **Manual** | On-demand | Testing |

**Exam Note:** Know CRON syntax and File Arrival trigger configuration.
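
For reference, the two trigger types most likely to appear in questions can also be written as job settings. The sketch below builds them as Python dictionaries and prints the JSON. The field names follow the Jobs API as I understand it and should be checked against the API reference; the volume path is a placeholder, and the 60-second values are illustrative assumptions.

```python
import json

# Scheduled trigger: Quartz cron expression plus a timezone
scheduled_settings = {
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # daily at 2:00
        "timezone_id": "Europe/Warsaw",
        "pause_status": "PAUSED",  # keep paused until the demo is ready
    }
}

# File arrival trigger: watches a storage location for new files
file_arrival_settings = {
    "trigger": {
        "pause_status": "PAUSED",
        "file_arrival": {
            "url": "/Volumes/<catalog>/<schema>/<volume>/landing/",  # placeholder path
            "min_time_between_triggers_seconds": 60,  # illustrative value
            "wait_after_last_change_seconds": 60,     # illustrative value
        },
    }
}

print(json.dumps(scheduled_settings, indent=2))
print(json.dumps(file_arrival_settings, indent=2))
```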


### Trigger Configuration Checklist

Step-by-step instructor checklist for demonstrating trigger options in the Lakeflow Jobs UI.

**Trigger Options** (Triggers tab):

| Trigger Type | Usage | Example |
|--------------|-------|---------|
| **Scheduled** | Fixed schedule | Daily at 2:00 |
| **File arrival** | Reaction to new files | New file in `/landing/` |
| **Continuous** | Continuous processing | Streaming-like |
| **Manual** | On-demand | Testing |

<img src="../../../assets/images/6e23746ce063491fa8afb3dea6268a1d.png" width="800">


**Demo: Scheduled Trigger**
- [ ] Add trigger → Scheduled
- [ ] Cron expression: `0 0 2 * * ?` (daily at 2:00)
- [ ] Timezone: `Europe/Warsaw`
- [ ] Show: Preview next runs


<img src="../../../assets/images/cf3cfc85162a466eb77e15e20df5c15c.png" width="800">

**Demo: File Arrival Trigger** (optional)
- [ ] Add trigger → File arrival
- [ ] URL: Unity Catalog Volume path
- [ ] Min files: 1

### Useful CRON Expressions

Common CRON patterns for scheduling Lakeflow Jobs at various intervals.

```
0 0 2 * * ?        # Daily at 2:00
0 0 * * * ?        # Every hour
0 0 9 ? * MON-FRI  # Mon-Fri at 9:00
0 0 0 1 * ?        # First day of month
0 */15 * * * ?     # Every 15 minutes
```
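
These expressions use Quartz syntax, which starts with a seconds field, so they have six fields rather than the five of classic Unix cron. The sketch below splits an expression into named fields to make the positions explicit. `split_quartz` is a hypothetical helper that only checks the field count and the `?` placement; it is not a full cron validator.

```python
QUARTZ_FIELDS = ("seconds", "minutes", "hours", "day_of_month", "month", "day_of_week")

def split_quartz(expression):
    """Split a Quartz cron expression into named fields (6 fields, optional 7th for year)."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(parts)}: {expression!r}")
    fields = dict(zip(QUARTZ_FIELDS, parts))
    if len(parts) == 7:
        fields["year"] = parts[6]
    # Quartz expects '?' ("no specific value") in exactly one of the two day fields
    if (fields["day_of_month"] == "?") == (fields["day_of_week"] == "?"):
        raise ValueError("use '?' in exactly one of day_of_month / day_of_week")
    return fields

for expr in ("0 0 2 * * ?", "0 0 9 ? * MON-FRI", "0 */15 * * * ?"):
    print(f"{expr:20} -> {split_quartz(expr)}")
```

A five-field Unix expression such as `0 2 * * *` is rejected by this helper, which is a common source of confusion when moving schedules from crontab.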

---

## [UI DEMO] Options, Retry and Alerting

**Task-level:** Timeout, Retries (count + delay)  
**Job-level:** Max concurrent runs, Job timeout  
**Notifications:** Email (on failure/success), Webhooks (Slack/Teams via Destinations)

| Scenario | Retry? | Why |
|----------|--------|-----|
| Network timeout | Yes | Transient error |
| API rate limit | Yes | Transient error |
| Data quality issue | No | Retry won't fix data |
| Code bug | No | Retry won't fix code |
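
The reasoning in the table can be sketched as a small retry wrapper: transient errors are retried up to a limit, permanent ones fail immediately. This is plain Python for illustration, not the Jobs retry mechanism; the exception classes and `run_with_retries` are hypothetical names.

```python
import time

class TransientError(Exception):
    """Errors worth retrying, e.g. a network timeout or an API rate limit."""

class PermanentError(Exception):
    """Errors a retry cannot fix, e.g. bad data or a code bug."""

def run_with_retries(task, max_retries=2, delay_seconds=0.0):
    """Run task(); retry only on TransientError, at most max_retries times."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return task()
        except TransientError:
            if attempt > max_retries:
                raise  # retries exhausted, surface the last error
            time.sleep(delay_seconds)
        # PermanentError (and anything else) propagates on the first attempt

calls = {"count": 0}

def flaky_task():
    """Fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("network timeout")
    return "ok"

print(run_with_retries(flaky_task, max_retries=2))
print(f"attempts: {calls['count']}")
```

When configuring retries on a task, the same question applies: if the most likely failure is a data quality issue or a code bug, extra attempts only add compute cost.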

---

## Demo: Creating a Job from YAML

Create a Medallion architecture job with dependencies based on the YAML below.

In [0]:
resources:
  jobs:
    job_medalion_run:
      name: job_medalion_run
      tasks:
        - task_key: bronze_customer
          notebook_task:
            notebook_path: /Workspace/Users/krzysztof.burejza@outlook.com/databricks-de-associate-training/notebooks/day3/lab/materials/medallion/bronze_customers
            source: WORKSPACE
          existing_cluster_id: 0215-230117-3c5px09z
        - task_key: bronze_orders
          notebook_task:
            notebook_path: /Workspace/Users/krzysztof.burejza@outlook.com/databricks-de-associate-training/notebooks/day3/lab/materials/medallion/bronze_orders
            source: WORKSPACE
          existing_cluster_id: 0215-230117-3c5px09z
        - task_key: silver_customers
          depends_on:
            - task_key: bronze_customer
          notebook_task:
            notebook_path: /Workspace/Users/krzysztof.burejza@outlook.com/databricks-de-associate-training/notebooks/day3/lab/materials/medallion/silver_customers
            source: WORKSPACE
          existing_cluster_id: 0215-230117-3c5px09z
        - task_key: silver_orders_cleaned
          depends_on:
            - task_key: bronze_orders
          notebook_task:
            notebook_path: /Workspace/Users/krzysztof.burejza@outlook.com/databricks-de-associate-training/notebooks/day3/lab/materials/medallion/silver_orders_cleaned
            source: WORKSPACE
          existing_cluster_id: 0215-230117-3c5px09z
        - task_key: gold_customer_order_summary
          depends_on:
            - task_key: silver_customers
            - task_key: silver_orders_cleaned
          notebook_task:
            notebook_path: /Workspace/Users/krzysztof.burejza@outlook.com/databricks-de-associate-training/notebooks/day3/lab/materials/medallion/gold_customer_orders_summary
            source: WORKSPACE
          existing_cluster_id: 0215-230117-3c5px09z
        - task_key: gold_daily_orders
          depends_on:
            - task_key: gold_customer_order_summary
          notebook_task:
            notebook_path: /Workspace/Users/krzysztof.burejza@outlook.com/databricks-de-associate-training/notebooks/day3/lab/materials/medallion/gold_daily_orders
            source: WORKSPACE
          existing_cluster_id: 0215-230117-3c5px09z
      queue:
        enabled: true
      parameters:
        - name: catalog
          default: retailhub_trainer
        - name: source_path
          default: /Volumes/retailhub_trainer/default/datasets


## Demo: Widgets and Parameters

Databricks Widgets allow you to parameterize notebooks.

---

In [0]:
# Widget types

# Text - any text
dbutils.widgets.text("environment", "dev", "Environment")

# Dropdown - select from list
dbutils.widgets.dropdown("region", "EU", ["EU", "US", "APAC"], "Region")

# Combobox - dropdown with typing option
dbutils.widgets.combobox("table", "orders", ["orders", "customers", "products"], "Table")

# Multiselect - multiple selection
dbutils.widgets.multiselect("columns", "id", ["id", "name", "date", "amount"], "Columns")

In [0]:
# Getting values
environment = dbutils.widgets.get("environment")
region = dbutils.widgets.get("region")
table = dbutils.widgets.get("table")
columns = dbutils.widgets.get("columns")  # returns comma-separated string

print(f"Environment: {environment}")
print(f"Region: {region}")
print(f"Table: {table}")
print(f"Columns: {columns}")
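
Widget values always arrive as strings, and a multiselect returns its selections as one comma-separated string. A minimal sketch of turning that string into a list follows; `parse_multiselect` is a hypothetical helper, and the sample strings stand in for what `dbutils.widgets.get("columns")` might return.

```python
def parse_multiselect(raw):
    """Turn a comma-separated widget value into a list, ignoring blanks."""
    return [item.strip() for item in raw.split(",") if item.strip()]

# Sample values standing in for dbutils.widgets.get("columns")
print(parse_multiselect("id,name,amount"))
print(parse_multiselect(""))  # nothing selected
```

The resulting list can be passed straight to `df.select(*selected_columns)` in a notebook.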

In [0]:
# Dynamic parameters in Job
# These values are available when notebook is run as a task in Job

dynamic_params = {
    "{{job.start_time.iso_date}}": "Run date (YYYY-MM-DD)",
    "{{job.start_time.iso_datetime}}": "Full timestamp",
    "{{job.id}}": "Job ID",
    "{{job.run_id}}": "Current job run ID",
    "{{task.name}}": "Current task name"
}

for param, description in dynamic_params.items():
    print(f"{param:35} -> {description}")
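
Dynamic value references are resolved when they appear in task parameters, so the usual pattern is to map one onto a notebook widget. The sketch below shows a task definition that feeds the run date into the `run_date` widget of the transform notebook. `<NOTEBOOK_ROOT>` is a placeholder, and the structure assumes the same job JSON layout used in the Medallion example.

```python
import json

# Task definition passing a dynamic value reference to a notebook widget
transform_task = {
    "task_key": "transform",
    "depends_on": [{"task_key": "validate"}],
    "notebook_task": {
        "notebook_path": "<NOTEBOOK_ROOT>/task_02_transform",  # placeholder path
        "base_parameters": {
            "source_table": "samples.nyctaxi.trips",
            # Replaced with the job start date (YYYY-MM-DD) at run time
            "run_date": "{{job.start_time.iso_date}}",
        },
    },
}

print(json.dumps(transform_task, indent=2))
```

Inside the notebook, `dbutils.widgets.get("run_date")` then returns the resolved date instead of the literal `{{...}}` text.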

In [0]:
# Widget cleanup
dbutils.widgets.removeAll()

## Monitoring via System Tables

Key table: `system.lakeflow.job_run_timeline` — contains run history, duration, result state.

---

In [0]:
spark.sql("""
    SELECT 
        *
    FROM system.lakeflow.job_run_timeline
    ORDER BY period_start_time DESC
    LIMIT 20
""").display()

In [0]:
spark.sql("""
    SELECT 
        DATE(period_start_time) as run_date,
        run_name as job_name,
        ROUND(
            AVG(
                (UNIX_TIMESTAMP(period_end_time) - UNIX_TIMESTAMP(period_start_time)) / 60
            ), 1
        ) as avg_duration_min,
        COUNT(*) as runs
    FROM system.lakeflow.job_run_timeline
    WHERE period_start_time >= current_date() - INTERVAL 14 DAYS
        AND result_state = 'SUCCEEDED'
    GROUP BY run_date, job_name
    ORDER BY run_date DESC
""").display()

## Summary

| Topic | Key Concept | Exam Keywords |
|---|---|---|
| **Multi-task Jobs** | DAG workflow, task dependencies | Task types, Repair Runs |
| **Triggers** | Scheduled (CRON), File arrival, Continuous | `0 0 2 * * ?` |
| **Options** | Timeout, Retry, Max concurrent runs | Transient vs permanent errors |
| **Alerting** | Email, Webhooks (Slack/Teams) | Notification destinations |
| **Parameters** | Widgets, dynamic values, taskValues | `dbutils.widgets`, `dbutils.jobs.taskValues` |
| **Monitoring** | System tables, success rate, duration | `system.lakeflow.job_run_timeline` |
