# LAB 08: Lakeflow Jobs -- Triggers, Dependencies & Orchestration

**Duration:** ~30 min | **Day:** 3 | **Difficulty:** Intermediate
**After module:** M08: Lakeflow Jobs & Orchestration

> *"The RetailHub pipeline works -- now automate it. Configure triggers, define task dependencies, handle failures with repair runs, and monitor via system tables."*

## Setup

In [None]:
%run ../../setup/00_setup

---
## Task 1: Understanding Job Configuration (JSON)

Databricks Jobs can be configured through the UI or programmatically through the REST API, using a JSON job definition.
Examine the job configuration structure below and answer the questions.

```json
{
  "name": "RetailHub_Daily_Refresh",
  "tasks": [
    {
      "task_key": "refresh_pipeline",
      "pipeline_task": { "pipeline_id": "<PIPELINE_ID>" },
      "timeout_seconds": 1800
    },
    {
      "task_key": "validate_results",
      "depends_on": [{ "task_key": "refresh_pipeline" }],
      "notebook_task": { "notebook_path": "/Workspace/.../task_01_validate" },
      "max_retries": 2,
      "retry_on_timeout": false
    },
    {
      "task_key": "generate_report",
      "depends_on": [{ "task_key": "validate_results" }],
      "notebook_task": { "notebook_path": "/Workspace/.../task_03_report" }
    }
  ],
  "trigger": {
    "periodic": { "interval": 1, "unit": "DAYS" }
  }
}
```

In [None]:
# TODO: Answer the questions about the job configuration above

# Q1: How many tasks does this job have?
num_tasks = ______  # int

# Q2: Which task runs first (has no dependencies)?
first_task = "______"  # str

# Q3: What is the maximum time (in minutes) the pipeline task can run before timeout?
timeout_minutes = ______  # int

# Q4: How many times will validate_results retry on failure?
max_retries = ______  # int

# Q5: If refresh_pipeline fails, will validate_results run? (assume the default run condition: all dependencies succeeded)
validate_runs_on_failure = ______  # bool (True/False)

In [None]:
# Verification
assert num_tasks == 3, f"Expected 3 tasks, got {num_tasks}"
assert first_task == "refresh_pipeline", f"First task should be refresh_pipeline, got {first_task}"
assert timeout_minutes == 30, f"1800 seconds = 30 minutes, got {timeout_minutes}"
assert max_retries == 2, f"Expected 2 retries, got {max_retries}"
assert validate_runs_on_failure == False, "Dependent tasks do NOT run if their dependency fails"
print("Task 1 PASSED: Job configuration understood correctly")
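
The answers above can also be derived programmatically from the job definition. The following is a minimal sketch, not part of the lab's graded cells: it parses a trimmed copy of the JSON (the notebook tasks are omitted because their paths are elided in the example) and summarizes each task. The helper name `summarize_tasks` and the fallback of `0` retries when `max_retries` is unset are assumptions made for illustration.

```python
import json

# Trimmed copy of the job definition from Task 1; <PIPELINE_ID> stays a placeholder.
job_json = """
{
  "name": "RetailHub_Daily_Refresh",
  "tasks": [
    {"task_key": "refresh_pipeline",
     "pipeline_task": {"pipeline_id": "<PIPELINE_ID>"},
     "timeout_seconds": 1800},
    {"task_key": "validate_results",
     "depends_on": [{"task_key": "refresh_pipeline"}],
     "max_retries": 2},
    {"task_key": "generate_report",
     "depends_on": [{"task_key": "validate_results"}]}
  ]
}
"""

job = json.loads(job_json)


def summarize_tasks(job):
    """Return one summary dict per task (hypothetical helper for this sketch)."""
    summary = []
    for task in job["tasks"]:
        timeout = task.get("timeout_seconds")
        summary.append({
            "task_key": task["task_key"],
            # Flatten [{"task_key": ...}] into a plain list of upstream names
            "depends_on": [d["task_key"] for d in task.get("depends_on", [])],
            "timeout_minutes": timeout / 60 if timeout is not None else None,
            # Assumption: no retries when max_retries is not set
            "max_retries": task.get("max_retries", 0),
        })
    return summary


task_summary = summarize_tasks(job)
# Tasks with an empty depends_on list are the entry points of the DAG
root_tasks = [t["task_key"] for t in task_summary if not t["depends_on"]]

for t in task_summary:
    print(t)
print("Root tasks:", root_tasks)
```

Reading the configuration this way scales better than counting by eye once a job has dozens of tasks.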

---
## Task 2: Trigger Types

Match each scenario to the correct trigger type.

| Scenario | Trigger Type |
|----------|-------------|
| Run ETL pipeline every day at 6 AM | ? |
| Process files as soon as they land in a Volume | ? |
| Continuously process streaming data with minimal latency | ? |
| Run only when manually triggered by a data engineer | ? |

In [None]:
# TODO: Fill in the correct trigger type for each scenario
# Options: "scheduled", "file_arrival", "continuous", "manual"

trigger_daily_6am = "______"
trigger_new_files = "______"
trigger_streaming = "______"
trigger_adhoc = "______"

In [None]:
# Verification
assert trigger_daily_6am == "scheduled", f"Daily at 6 AM = scheduled trigger, got {trigger_daily_6am}"
assert trigger_new_files == "file_arrival", f"Process on file landing = file_arrival, got {trigger_new_files}"
assert trigger_streaming == "continuous", f"Minimal latency streaming = continuous, got {trigger_streaming}"
assert trigger_adhoc == "manual", f"Ad-hoc = manual, got {trigger_adhoc}"
print("Task 2 PASSED: Trigger types matched correctly")
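
To connect these labels back to the JSON from Task 1, the sketch below classifies a job definition by the trigger-related blocks it contains. It assumes the field names `schedule`, `continuous`, `trigger.periodic`, and `trigger.file_arrival` as they appear in Jobs API job settings; treat those names, the placeholder Volume path, and the helper `classify_trigger` as illustrative assumptions and check them against the API reference before relying on them.

```python
def classify_trigger(job_settings: dict) -> str:
    """Map a job definition to one of the lab's trigger labels (hypothetical helper)."""
    trigger = job_settings.get("trigger", {})
    if "continuous" in job_settings:
        return "continuous"
    if "file_arrival" in trigger:
        return "file_arrival"
    # Both a cron schedule and a periodic trigger run on a time basis
    if "schedule" in job_settings or "periodic" in trigger:
        return "scheduled"
    # No trigger block at all: the job only runs when someone starts it
    return "manual"


# Assumed shapes of the trigger-related blocks; all values are placeholders.
examples = {
    "daily_6am": {"schedule": {"quartz_cron_expression": "0 0 6 * * ?",
                               "timezone_id": "UTC"}},
    "new_files": {"trigger": {"file_arrival": {"url": "/Volumes/<catalog>/<schema>/<volume>/"}}},
    "streaming": {"continuous": {"pause_status": "UNPAUSED"}},
    "adhoc": {},
    "task1_job": {"trigger": {"periodic": {"interval": 1, "unit": "DAYS"}}},
}

for name, settings in examples.items():
    print(f"{name}: {classify_trigger(settings)}")
```

Note that the Task 1 job uses a `periodic` trigger, which this sketch groups under "scheduled" because it also runs on a fixed time interval.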

---
## Task 3: CRON Expressions

Write the CRON expression for each schedule.

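CRON format (standard five-field Unix syntax): `minute hour day_of_month month day_of_week`

Note: Databricks job schedules use Quartz cron syntax, which adds a leading seconds field (for example, `0 0 6 * * ?` for 6:00 AM daily). This task uses the five-field form to practice the core scheduling concepts.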

| Field | Values |
|-------|--------|
| Minute | 0-59 |
| Hour | 0-23 |
| Day of month | 1-31 |
| Month | 1-12 |
| Day of week | 0-6 (0 = Sunday); use `1-5` for Monday to Friday |

In [None]:
# TODO: Write CRON expressions

# Every day at 6:00 AM
cron_daily_6am = "______"

# Every hour (at minute 0)
cron_hourly = "______"

# Monday to Friday at 8:00 AM
cron_weekdays_8am = "______"

# Every 15 minutes
cron_every_15min = "______"

In [None]:
# Verification
assert cron_daily_6am == "0 6 * * *", f"Expected '0 6 * * *', got '{cron_daily_6am}'"
assert cron_hourly == "0 * * * *", f"Expected '0 * * * *', got '{cron_hourly}'"
assert cron_weekdays_8am == "0 8 * * 1-5", f"Expected '0 8 * * 1-5', got '{cron_weekdays_8am}'"
assert cron_every_15min == "*/15 * * * *", f"Expected '*/15 * * * *', got '{cron_every_15min}'"
print("Task 3 PASSED: CRON expressions correct")
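
One way to sanity-check a five-field expression is to test it against concrete timestamps. The sketch below is a deliberately small matcher written for this lab, not a full cron implementation: it supports `*`, `*/n`, ranges, lists, and single values, and it requires every field to match (real cron treats day-of-month and day-of-week as alternatives when both are restricted). The function names `expand_field` and `cron_matches` are hypothetical.

```python
from datetime import datetime

# Allowed (min, max) per field, in cron order:
# minute, hour, day of month, month, day of week
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def expand_field(field: str, low: int, high: int) -> set:
    """Expand one cron field ('*', '*/n', 'a-b', 'a,b', 'n') into a set of ints."""
    values = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        step = int(step) if step else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start, end = (int(x) for x in base.split("-"))
        else:
            start = end = int(base)
        values.update(range(start, end + 1, step))
    return values


def cron_matches(expr: str, moment: datetime) -> bool:
    """Return True if the five-field expression fires at the given minute."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError(f"Expected 5 fields, got {len(fields)}: {expr!r}")
    allowed = [expand_field(f, lo, hi) for f, (lo, hi) in zip(fields, FIELD_RANGES)]
    # Python uses Monday=0 ... Sunday=6; cron uses Sunday=0 ... Saturday=6
    cron_weekday = (moment.weekday() + 1) % 7
    actual = [moment.minute, moment.hour, moment.day, moment.month, cron_weekday]
    return all(value in field_set for value, field_set in zip(actual, allowed))


# 2024-01-01 was a Monday, 2024-01-06 a Saturday
print(cron_matches("0 8 * * 1-5", datetime(2024, 1, 1, 8, 0)))    # weekday morning
print(cron_matches("0 8 * * 1-5", datetime(2024, 1, 6, 8, 0)))    # weekend morning
print(cron_matches("*/15 * * * *", datetime(2024, 1, 1, 10, 45)))
```

Checking a few timestamps on both sides of a boundary (a Friday and a Saturday, minute 45 and minute 50) is usually enough to catch an off-by-one in a schedule.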

---
## Task 4: Task Dependencies -- DAG Design

The RetailHub team needs a more complex job with the following requirements:

1. **Ingest** task runs first (no dependencies)
2. **Build DIM tables** and **Build FACT tables** run in parallel AFTER Ingest
3. **Generate Report** runs AFTER both DIM and FACT are complete
4. **Send Notification** runs AFTER Generate Report

Define the dependencies for each task.

In [None]:
# TODO: Define task dependencies
# Use a list of task names that each task depends on.
# An empty list [] means no dependencies (runs first).

task_dependencies = {
    "ingest":            ______,  # list of dependencies
    "build_dim_tables":  ______,  # list of dependencies
    "build_fact_tables": ______,  # list of dependencies
    "generate_report":   ______,  # list of dependencies
    "send_notification": ______,  # list of dependencies
}

In [None]:
# Verification
assert task_dependencies["ingest"] == [], "Ingest has no dependencies"
assert task_dependencies["build_dim_tables"] == ["ingest"], "DIM depends on ingest"
assert task_dependencies["build_fact_tables"] == ["ingest"], "FACT depends on ingest"
assert sorted(task_dependencies["generate_report"]) == ["build_dim_tables", "build_fact_tables"], \
    "Report depends on BOTH dim and fact (fan-in)"
assert task_dependencies["send_notification"] == ["generate_report"], "Notification depends on report"
print("Task 4 PASSED: DAG dependencies correct")
print()
print("DAG structure:")
print("  ingest")
print("    +-- build_dim_tables")
print("    +-- build_fact_tables")
print("          +-- generate_report (waits for both)")
print("                +-- send_notification")
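
The fan-out/fan-in shape can be checked with Python's standard-library `graphlib` module (Python 3.9+), which accepts exactly this "task -> list of upstream tasks" mapping. The sketch below groups tasks into stages that could run in parallel; the helper name `execution_stages` is an assumption, and the lock-step stages are a simplification, since a real scheduler may start a task as soon as its own dependencies finish.

```python
from graphlib import TopologicalSorter

# Completed dependency map for the Task 4 job: task -> upstream tasks
task_dependencies = {
    "ingest": [],
    "build_dim_tables": ["ingest"],
    "build_fact_tables": ["ingest"],
    "generate_report": ["build_dim_tables", "build_fact_tables"],
    "send_notification": ["generate_report"],
}


def execution_stages(dependencies):
    """Group tasks into stages; tasks within a stage have no mutual dependencies."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose upstreams are done
        stages.append(ready)
        sorter.done(*ready)                 # mark the whole stage as finished
    return stages


stages = execution_stages(task_dependencies)
for number, stage in enumerate(stages, start=1):
    print(f"Stage {number}: {', '.join(stage)}")
```

A side benefit of `prepare()` is cycle detection: a dependency map that accidentally makes two tasks wait on each other fails immediately instead of producing a job that can never start.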

---
## Task 5: Repair Run Scenarios

The job from Task 4 ran, but `build_fact_tables` failed.
Answer the following questions about Repair Run behavior.

In [None]:
# TODO: Answer Repair Run questions

# Q1: Which tasks will be RE-EXECUTED during a Repair Run?
# Options: list the task names that will run again
repair_rerun_tasks = [______]  # list of str

# Q2: Will 'ingest' run again during repair?
ingest_reruns = ______  # bool

# Q3: Will 'build_dim_tables' run again during repair?
dim_reruns = ______  # bool

# Q4: Is Repair Run cheaper than a full re-run? (less compute used)
repair_is_cheaper = ______  # bool

In [None]:
# Verification
expected_repair = sorted(["build_fact_tables", "generate_report", "send_notification"])
assert sorted(repair_rerun_tasks) == expected_repair, \
    f"Repair re-runs the failed task + all downstream. Expected {expected_repair}, got {sorted(repair_rerun_tasks)}"
assert ingest_reruns == False, "Ingest succeeded -- NOT re-executed in repair"
assert dim_reruns == False, "DIM succeeded -- NOT re-executed in repair"
assert repair_is_cheaper == True, "Repair skips successful tasks, using less compute"
print("Task 5 PASSED: Repair Run behavior understood correctly")
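
The repair rule (re-run the failed task plus everything downstream of it) is a graph traversal over the dependency map. The following sketch illustrates the idea under the lab's assumptions; `tasks_to_repair` is a hypothetical helper, not a Databricks API.

```python
from collections import deque

# Dependency map from Task 4: task -> upstream tasks
task_dependencies = {
    "ingest": [],
    "build_dim_tables": ["ingest"],
    "build_fact_tables": ["ingest"],
    "generate_report": ["build_dim_tables", "build_fact_tables"],
    "send_notification": ["generate_report"],
}


def tasks_to_repair(dependencies, failed_tasks):
    """Return the failed tasks plus every task that transitively depends on them."""
    # Invert the map: task -> tasks that list it as an upstream dependency
    downstream = {task: [] for task in dependencies}
    for task, upstream in dependencies.items():
        for parent in upstream:
            downstream.setdefault(parent, []).append(task)

    to_rerun = set(failed_tasks)
    queue = deque(failed_tasks)
    while queue:  # breadth-first walk along downstream edges
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in to_rerun:
                to_rerun.add(child)
                queue.append(child)
    return to_rerun


rerun = tasks_to_repair(task_dependencies, ["build_fact_tables"])
skipped = set(task_dependencies) - rerun
print("Re-run:", sorted(rerun))
print("Kept from the original run:", sorted(skipped))
```

The tasks in `skipped` are where the compute savings come from: their successful results are reused rather than recomputed.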

---
## Task 6: Passing Parameters Between Tasks (dbutils.jobs.taskValues)

In Databricks, tasks within a job can pass small values (such as counts or status flags) to downstream tasks using `dbutils.jobs.taskValues`.

**Setter (in Task A):**
```python
dbutils.jobs.taskValues.set(key="row_count", value=42)
```

**Getter (in Task B, which depends on Task A):**
```python
count = dbutils.jobs.taskValues.get(taskKey="task_a", key="row_count")
```

Complete the code below to simulate parameter passing.

In [None]:
# TODO: Complete the task value operations
# Since we are not inside a Job, we will simulate with a dictionary

# Simulated task values store
task_values = {}

# Task A: "refresh_pipeline" -- sets the number of rows processed
rows_processed = 15420
task_values["refresh_pipeline"] = {"rows_processed": ______}  # TODO: set the value

# Task A also sets the processing timestamp
from datetime import datetime
processing_time = datetime.now().isoformat()
task_values["refresh_pipeline"]["processing_time"] = ______  # TODO: set the value

# Task B: "validate_results" -- reads values from Task A
retrieved_rows = task_values[______][______]  # TODO: get rows_processed from refresh_pipeline
retrieved_time = task_values["refresh_pipeline"]["processing_time"]

In [None]:
# Verification
assert task_values["refresh_pipeline"]["rows_processed"] == 15420, "rows_processed should be 15420"
assert task_values["refresh_pipeline"]["processing_time"] == processing_time, "processing_time should match"
assert retrieved_rows == 15420, "Should retrieve 15420 from refresh_pipeline"
print("Task 6 PASSED: Task value passing works correctly")
print(f"  rows_processed: {retrieved_rows}")
print(f"  processing_time: {retrieved_time}")
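
To practice the call shape of the real API outside a job, the dictionary can be wrapped in a small stand-in object. The class below is hypothetical and exists only for this sketch: it mimics `set(key=..., value=...)` and `get(taskKey=..., key=..., default=...)`, and it raises `ValueError` for a missing key without a default, which is an assumption about how the real utility behaves.

```python
class FakeTaskValues:
    """Hypothetical in-memory stand-in for dbutils.jobs.taskValues (not a Databricks class)."""

    _MISSING = object()  # sentinel so that None can still be used as a default

    def __init__(self):
        self._store = {}
        self.current_task = None  # in a real job, the runtime knows the running task

    def set(self, key, value):
        # Values are stored under the task that is currently "running"
        self._store.setdefault(self.current_task, {})[key] = value

    def get(self, taskKey, key, default=_MISSING):
        task_store = self._store.get(taskKey, {})
        if key in task_store:
            return task_store[key]
        if default is not FakeTaskValues._MISSING:
            return default
        raise ValueError(f"No value for key {key!r} set by task {taskKey!r}")


task_values_api = FakeTaskValues()

# Task A: refresh_pipeline publishes its row count
task_values_api.current_task = "refresh_pipeline"
task_values_api.set(key="rows_processed", value=15420)

# Task B: validate_results reads it, with a fallback for an optional key
task_values_api.current_task = "validate_results"
rows = task_values_api.get(taskKey="refresh_pipeline", key="rows_processed")
warnings = task_values_api.get(taskKey="refresh_pipeline", key="warnings", default=0)
print(rows, warnings)
```

Supplying a default for optional keys keeps a downstream task from failing just because an upstream task had nothing to report.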

---
## Task 7: Job Monitoring via System Tables

Databricks system tables provide metadata about job executions.
Write SQL queries to answer monitoring questions.

**Key system tables:**
- `system.lakeflow.job_run_timeline` -- Job-level run history
- `system.lakeflow.job_task_run_timeline` -- Task-level run details

In [None]:
# TODO: Write a SQL query to find all FAILED job runs in the last 7 days
# Columns available: job_id, run_id, result_state, period_start_time, period_end_time

query_failed_runs = """
SELECT job_id, run_id, result_state, period_start_time, period_end_time
FROM system.lakeflow.job_run_timeline
WHERE result_state = '______'
  AND period_start_time >= current_date() - INTERVAL ______ DAYS
ORDER BY period_start_time DESC
"""

In [None]:
# Verification
assert "FAILED" in query_failed_runs.upper(), "Should filter for FAILED result_state"
assert "7" in query_failed_runs, "Should look back 7 days"
assert "ORDER BY" in query_failed_runs.upper(), "Should order results"
print("Task 7 PASSED: System table query is correct")
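
For repeated monitoring it can help to generate this query from parameters instead of editing the string by hand. The sketch below is one possible approach: the helper `build_run_history_query` is hypothetical, the time column is passed in because its exact name should be confirmed against the table schema in your workspace (`period_start_time` is assumed in the usage line), and inputs are validated because they are interpolated directly into SQL.

```python
import re


def build_run_history_query(result_state: str, lookback_days: int, time_column: str) -> str:
    """Build a run-history query for one result state (hypothetical helper)."""
    # Validate everything that is interpolated into the SQL text
    if not re.fullmatch(r"[A-Z_]+", result_state):
        raise ValueError(f"Unexpected result_state: {result_state!r}")
    if not re.fullmatch(r"[a-z_]+", time_column):
        raise ValueError(f"Unexpected column name: {time_column!r}")
    if not isinstance(lookback_days, int) or lookback_days <= 0:
        raise ValueError("lookback_days must be a positive integer")

    return (
        f"SELECT job_id, run_id, result_state, {time_column}\n"
        "FROM system.lakeflow.job_run_timeline\n"
        f"WHERE result_state = '{result_state}'\n"
        f"  AND {time_column} >= current_date() - INTERVAL {lookback_days} DAYS\n"
        f"ORDER BY {time_column} DESC"
    )


# Assumed column name; check the table schema before running the query
query = build_run_history_query("FAILED", 7, time_column="period_start_time")
print(query)
```

In a Databricks notebook, the resulting string could then be passed to `spark.sql(query)` to inspect the rows.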

---
## Task 8: Job Cluster vs All-Purpose Cluster

Choose the correct cluster type for each scenario.

In [None]:
# TODO: Assign the correct cluster type
# Options: "job_cluster" or "all_purpose"

# Scenario 1: Scheduled nightly ETL job that runs for 2 hours
nightly_etl = "______"

# Scenario 2: Interactive notebook development and exploration
interactive_dev = "______"

# Scenario 3: Ad-hoc data investigation by an analyst
adhoc_analysis = "______"

# Scenario 4: Production ML model training triggered weekly
ml_training = "______"

In [None]:
# Verification
assert nightly_etl == "job_cluster", "Scheduled ETL -> job cluster (cheaper, auto-terminates)"
assert interactive_dev == "all_purpose", "Interactive work -> all-purpose (stays running)"
assert adhoc_analysis == "all_purpose", "Ad-hoc analysis -> all-purpose (interactive)"
assert ml_training == "job_cluster", "Scheduled ML training -> job cluster (cost-effective)"
print("Task 8 PASSED: Cluster type selection correct")
print()
print("Key insight:")
print("  job_cluster -> automated/scheduled workloads (cost-optimized)")
print("  all_purpose -> interactive/development work (long-running)")
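
In a job definition, this choice shows up as which compute field each task carries. The sketch below assumes the Jobs API field names `job_clusters`, `job_cluster_key`, `new_cluster`, and `existing_cluster_id`; the task names, cluster key, and all bracketed values are placeholders, and `compute_type` is a hypothetical helper.

```python
# Assumed job definition shape; every bracketed value is a placeholder.
job_settings = {
    "job_clusters": [
        {
            "job_cluster_key": "etl_cluster",
            "new_cluster": {
                "spark_version": "<SPARK_VERSION>",
                "node_type_id": "<NODE_TYPE_ID>",
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        # Runs on a cluster created for the run and terminated afterwards
        {"task_key": "nightly_etl", "job_cluster_key": "etl_cluster"},
        # Attaches to an already running interactive cluster
        {"task_key": "adhoc_check", "existing_cluster_id": "<CLUSTER_ID>"},
    ],
}


def compute_type(task: dict) -> str:
    """Classify a task by the compute field it uses (hypothetical helper)."""
    if "job_cluster_key" in task or "new_cluster" in task:
        return "job_cluster"
    if "existing_cluster_id" in task:
        return "all_purpose"
    return "unspecified"


compute_by_task = {t["task_key"]: compute_type(t) for t in job_settings["tasks"]}
print(compute_by_task)
```

A check like this can be used to flag production jobs that still point at an all-purpose cluster.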

---
## Summary

In this lab you practiced:
- Understanding multi-task Job configuration (JSON structure)
- Matching trigger types to business scenarios
- Writing CRON expressions for scheduling
- Designing task dependency DAGs (fan-out/fan-in)
- Repair Run behavior (re-execute failed + downstream only)
- Passing parameters between tasks with taskValues
- Querying system tables for job monitoring
- Choosing between Job cluster and All-purpose cluster

> **Exam Tip:** The exam tests understanding of job orchestration concepts: trigger types, dependency DAGs, repair runs, and cluster selection. Focus on WHEN to use each approach rather than memorizing API syntax.