# Lesson 7: Workflow Orchestration (Prefect)

**Module 3: Data & Pipeline Engineering**  
**Estimated Time**: 1-2 hours  
**Difficulty**: Beginner-Intermediate

---

## 🎯 Learning Objectives

By the end of this lesson, you will:

✅ Understand Python-native orchestration (Prefect vs Airflow)  
✅ Define **Flows** and **Tasks** using decorators  
✅ Implement automatic **Retries** and **Caching**  
✅ Answer interview questions on Pipeline Orchestration  

---

## 📚 Table of Contents

1. [The "Cron Job" Problem](#1-cron-problem)
2. [Introduction to Prefect](#2-intro-prefect)
3. [Hands-On: First Prefect Flow](#3-hands-on)
4. [Scheduling & UI](#4-scheduling)
5. [Interview Preparation](#5-interview-questions)

---

## 1. The "Cron Job" Problem

You have a script `train_model.py`. You schedule it with Linux `cron` to run daily.

**Issues**:
1. **Failures**: If the job fails, nobody is notified. You only find out by checking the logs.
2. **Retries**: `cron` does not rerun a failed job. The next attempt is the next scheduled run.
3. **Dependencies**: It is hard to express "run B only if A succeeds".
4. **History**: There is no dashboard of past runs.

**Solution**: Use an Orchestrator (Airflow, Prefect, Dagster).
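
To see what an orchestrator saves you from, here is a minimal sketch of the glue code you would otherwise write around a cron job just to get dependency ordering, failure logging, and a run history. The names `run_pipeline` and `steps` are hypothetical and exist only for this illustration. This is not how any particular orchestrator is implemented.

```python
import logging
from typing import Callable, List, Tuple

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def run_pipeline(steps: List[Tuple[str, Callable[[], None]]]) -> List[Tuple[str, str]]:
    """Run steps in order, stop at the first failure, and return a run history."""
    history = []
    for name, step in steps:
        try:
            step()
        except Exception:
            # Failure visibility: log the traceback instead of failing silently.
            log.exception("step %s failed", name)
            history.append((name, "failed"))
            break  # downstream steps depend on this one, so they are skipped
        history.append((name, "succeeded"))
    return history
```

Even this small runner still lacks retries, scheduling, and a UI, which is the gap tools like Airflow, Prefect, and Dagster fill.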

## 2. Introduction to Prefect

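Prefect is often described as a "modern Airflow": an orchestrator designed around plain Python code.
- **Native Python**: Workflows are ordinary Python functions, with no separate DSL to learn.
- **Decorators**: Just add `@task` and `@flow` to your functions.
- **Dynamic DAGs**: The task graph is built at runtime as the flow function executes, so it can depend on loops, conditionals, and input data.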
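
The decorator and runtime-graph ideas can be illustrated without Prefect installed. The sketch below uses a toy decorator, `toy_task`, which is a hypothetical stand-in written for this lesson and not Prefect's implementation. It only shows how a decorator can add retry behavior to a plain function, and how the amount of work can depend on a value known only at runtime.

```python
import functools
import time


def toy_task(retries: int = 0, retry_delay_seconds: float = 0.0):
    """Toy stand-in for a task decorator: re-run the function when it raises."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise  # out of retries: surface the original error
                    time.sleep(retry_delay_seconds)
        return wrapper
    return decorator


calls = {"count": 0}


@toy_task(retries=2)
def flaky_fetch() -> str:
    """Fails on the first two calls, then succeeds (simulates a flaky network)."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "payload"


def pipeline(n_partitions: int) -> list:
    """The number of work items is decided at runtime, not in a static graph."""
    payload = flaky_fetch()
    return [f"{payload}-{i}" for i in range(n_partitions)]
```

Calling `pipeline(3)` succeeds even though `flaky_fetch` raised twice, because the wrapper retried it.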

## 3. Hands-On: First Prefect Flow

Note: Running the flow itself requires `pip install prefect`. The cell below only prints the flow definition and a simulated log, so it runs without Prefect installed.

```python
# Prefect 2.x flow definition, printed as text (not executed in this cell)

print("Concept Code - Prefect 2.0 flow definition\n")

code = """
from prefect import task, flow

# 1. Define Tasks
# retries=3: re-run the task up to 3 times on failure, waiting 10 s between attempts
@task(retries=3, retry_delay_seconds=10)
def extract_data(url: str):
    print(f"Fetching {url}...")
    # return pd.read_csv(...)
    return [1, 2, 3]

@task
def transform_data(data: list):
    print("Cleaning data...")
    return [x * 10 for x in data]

@task
def load_data(data: list):
    print("Saving to Database...")
    print(f"Loaded: {data}")

# 2. Define Flow (The DAG)
# Passing one task's result to the next is what defines the dependency order.
@flow(name="etl-pipeline-daily")
def main_flow():
    raw = extract_data("s3://bucket/data.csv")
    clean = transform_data(raw)
    load_data(clean)

# 3. Run it
if __name__ == '__main__':
    main_flow()
"""

print(code)
print("\n--- OUTPUT SIMULATION ---")
print("15:00:00.000 | INFO | Created task run 'extract_data'")
print("15:00:00.500 | INFO | Fetching s3://bucket/data.csv...")
print("15:00:01.000 | INFO | Task run 'extract_data' completed")
print("15:00:01.100 | INFO | Created task run 'transform_data'")
print("...")
```
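
The flow above demonstrates retries but not caching, the other feature named in the learning objectives. In Prefect 2.x, caching is configured on `@task` through the `cache_key_fn` and `cache_expiration` arguments, commonly with `task_input_hash` from `prefect.tasks` as the key function. The sketch below is a plain-Python illustration of the same idea (skip the work when the inputs have been seen recently), assuming JSON-serializable inputs. `cached_task` and `_cache` are hypothetical names, not Prefect APIs.

```python
import hashlib
import json
import time

_cache: dict = {}  # key -> (timestamp, result); in-memory only, for illustration


def cached_task(expiration_seconds: float):
    """Toy input-hash cache: reuse a result while it is younger than the expiry."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            # Key on the function name plus a hash of its serialized inputs.
            raw = json.dumps([fn.__name__, args, kwargs], sort_keys=True, default=str)
            key = hashlib.sha256(raw.encode()).hexdigest()
            hit = _cache.get(key)
            if hit is not None and time.monotonic() - hit[0] < expiration_seconds:
                return hit[1]  # cache hit: the function body is not executed
            result = fn(*args, **kwargs)
            _cache[key] = (time.monotonic(), result)
            return result
        return wrapper
    return decorator


runs = []  # records each real execution so cache hits are observable


@cached_task(expiration_seconds=3600)
def transform(data):
    runs.append(tuple(data))
    return [x * 10 for x in data]
```

Calling `transform([1, 2, 3])` twice executes the function body once. A call with different inputs produces a different key and runs again.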

## 4. Scheduling & UI

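To schedule the flow with the Prefect 2.x CLI:
```bash
prefect deployment build my_flow.py:main_flow -n daily-etl -q test_queue --cron "0 0 * * *"
prefect deployment apply main_flow-deployment.yaml
```

The first command writes `main_flow-deployment.yaml`, and the second registers it with the Prefect API. The result is a deployment named `daily-etl` that is scheduled every day at midnight (`0 0 * * *`) and routed to the `test_queue` work queue. A deployment only schedules runs. To execute them, an agent or worker polling `test_queue` must be running, and past and upcoming runs can then be inspected in the Prefect UI. These commands are specific to Prefect 2.x, so check the documentation for your installed version.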
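
For readers new to cron syntax, the sketch below splits an expression into its five standard fields and computes the next run time for the daily-at-midnight case. `describe_cron` and `next_midnight` are hypothetical helpers for this lesson. They are not part of Prefect and do not implement a general cron parser.

```python
from datetime import datetime, timedelta, timezone

CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expr: str) -> dict:
    """Map a five-field cron expression onto its field names."""
    parts = expr.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


def next_midnight(now: datetime) -> datetime:
    """Next run for "0 0 * * *": the first midnight strictly after `now`."""
    start_of_day = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return start_of_day + timedelta(days=1)


if __name__ == "__main__":
    print(describe_cron("0 0 * * *"))
    print(next_midnight(datetime.now(timezone.utc)))
```

Reading `0 0 * * *` field by field gives minute 0, hour 0, any day of the month, any month, any day of the week.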

## 5. Interview Preparation

### Common Questions

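#### Q1: "Prefect vs Airflow?"
**Answer**: "Airflow is the industry standard, with a mature ecosystem and a large community. Prefect is more modern and 'Pythonic': flows are plain Python functions. Because Prefect builds the task graph at runtime, it handles dynamic workflows (loops, dynamic mapping) more naturally than Airflow, whose DAGs are traditionally declared up front, although newer Airflow versions also support dynamic task mapping."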

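#### Q2: "How do you handle backfilling?"
**Answer**: "If I fix a bug in the code today, I might need to re-process the data from the last 30 days. Orchestrators let you trigger runs for past dates (a backfill), which keeps historical data consistent with the current logic. This works best when each run is parameterized by date and is idempotent, so re-running a day overwrites its output instead of duplicating it."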
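
A backfill boils down to re-running a date-parameterized step over a range of past dates. The sketch below is a minimal illustration in plain Python. `process_day` and its output path are hypothetical placeholders, and a real orchestrator would submit each date as a separate tracked run.

```python
from datetime import date, timedelta


def backfill_dates(end: date, days: int) -> list:
    """Return the `days` dates before `end`, oldest first (`end` is excluded)."""
    return [end - timedelta(days=offset) for offset in range(days, 0, -1)]


def process_day(run_date: date) -> str:
    """Hypothetical idempotent step: output is keyed by date, so re-runs overwrite."""
    return f"processed/{run_date.isoformat()}.parquet"


def backfill(end: date, days: int) -> dict:
    """Re-process each past date once and collect the output location per date."""
    return {d: process_day(d) for d in backfill_dates(end, days)}


if __name__ == "__main__":
    for run_date, output in backfill(date.today(), 30).items():
        print(run_date, "->", output)
```

Keying the output by date is what makes the backfill safe to repeat: running it twice yields the same set of files.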