# Module 07: Workflow Orchestration with Airflow

**Estimated Time:** 45-60 minutes

## Learning Objectives

By the end of this module, you will:
- Understand what workflow orchestration is
- Learn Apache Airflow concepts (DAGs, Operators, Tasks)
- Understand task dependencies and scheduling
- Know how to monitor and debug workflows
- Compare orchestration tools (Airflow, Prefect, Dagster)

---

**Note**: This module covers Airflow at the conceptual level. Running Airflow itself typically requires Docker or a dedicated environment, so we focus on the concepts and patterns, which apply to any orchestration tool.

## 1. What is Workflow Orchestration?

### Definition

**Workflow Orchestration** is the automated management, coordination, and scheduling of complex data workflows with dependencies.

### Why Do We Need It?

Without orchestration:
- [FAIL] Manual execution of pipelines
- [FAIL] Hard to manage dependencies
- [FAIL] No visibility into failures
- [FAIL] Difficult to schedule recurring jobs
- [FAIL] No centralized monitoring

With orchestration:
- [OK] Automated scheduling
- [OK] Dependency management
- [OK] Error handling and retries
- [OK] Monitoring and alerting
- [OK] Historical run data

### Common Use Cases

- ETL pipeline scheduling
- Data warehouse loading
- ML model training pipelines
- Report generation
- Data quality checks
- Multi-step data transformations

---

## 2. Apache Airflow Core Concepts

### DAG (Directed Acyclic Graph)

A DAG is a collection of tasks with dependencies:

```
start → extract_data → transform_data → load_data → end
                              ↓
                        data_quality_check
```

**Directed**: Every dependency has a direction, from an upstream task to a downstream task

**Acyclic**: No loops (task A can't depend on task B if B depends on A)

**Graph**: Collection of nodes (tasks) and edges (dependencies)
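
The three properties above can be checked in plain Python. The following is a minimal sketch that uses the standard library's `graphlib` (Python 3.9+), not Airflow itself. It assumes the diagram is read with `data_quality_check` branching off `transform_data`; the dictionary layout and variable names are illustrative.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
# Assumption: data_quality_check hangs off transform_data, as in the diagram.
dag = {
    "extract_data": {"start"},
    "transform_data": {"extract_data"},
    "load_data": {"transform_data"},
    "data_quality_check": {"transform_data"},
    "end": {"load_data"},
}

# A valid execution order always places upstream tasks before downstream ones.
order = list(TopologicalSorter(dag).static_order())
print(order)

# Making extract_data depend on load_data closes a loop, so the graph is
# no longer acyclic and no valid execution order exists.
cyclic = dict(dag)
cyclic["extract_data"] = {"start", "load_data"}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print("cycle detected:", has_cycle)
```

Orchestrators perform a similar check when they parse a workflow, which is why a cyclic dependency is rejected before anything runs.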

### Key Components

1. **DAG**: The workflow definition
2. **Operators**: Templates for tasks (PythonOperator, BashOperator, etc.)
3. **Tasks**: Instances of operators
4. **Task Instances**: Specific runs of tasks
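5. **Scheduler**: Decides when DAG runs and tasks are due and queues them for execution
6. **Executor**: Determines how and where queued tasks run (for example locally, on Celery workers, or on Kubernetes)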
7. **Webserver**: UI for monitoring
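
The difference between an operator, a task, and a task instance is easiest to see by analogy with classes and objects. The sketch below is plain Python, not Airflow code: `PrintOperator` and its `execute` signature are hypothetical names chosen only to illustrate the relationship.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PrintOperator:
    """Hypothetical operator: a reusable template for one kind of work."""
    task_id: str
    message: str

    def execute(self, logical_date: date) -> str:
        return f"[{logical_date}] {self.task_id}: {self.message}"


# A task is one configured instance of an operator inside a workflow.
extract = PrintOperator(task_id="extract", message="pulling rows")

# A task instance is that task run for one specific date.
runs = [extract.execute(date(2024, 1, day)) for day in (1, 2)]
for line in runs:
    print(line)
```

One operator class can back many tasks, and one task produces a new task instance for every scheduled run.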

---

## 3. Example Airflow DAG (Conceptual)

Here's what an Airflow DAG looks like:

```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

# Default arguments for all tasks
default_args = {
    'owner': 'data_team',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5)
}

# Define the DAG
dag = DAG(
    'sales_etl_pipeline',
    default_args=default_args,
    description='Daily sales data ETL pipeline',
    schedule_interval='0 2 * * *',  # Daily at 2 AM (Airflow 2.4+ prefers the `schedule` argument)
    catchup=False
)

# Define tasks
def extract_sales_data():
    # Extract logic here
    print("Extracting sales data...")
    return "extraction_complete"

def transform_sales_data():
    # Transform logic here
    print("Transforming sales data...")
    return "transformation_complete"

def load_sales_data():
    # Load logic here
    print("Loading sales data...")
    return "load_complete"

# Create task instances
extract_task = PythonOperator(
    task_id='extract_sales',
    python_callable=extract_sales_data,
    dag=dag
)

transform_task = PythonOperator(
    task_id='transform_sales',
    python_callable=transform_sales_data,
    dag=dag
)

load_task = PythonOperator(
    task_id='load_sales',
    python_callable=load_sales_data,
    dag=dag
)

quality_check_task = BashOperator(
    task_id='quality_check',
    bash_command='python /path/to/quality_check.py',
    dag=dag
)

# Define dependencies
extract_task >> transform_task >> load_task >> quality_check_task
```

---

## 4. Simulating Workflow Orchestration

```python
# Simple workflow orchestrator simulation
from datetime import datetime
from enum import Enum
import time


class TaskStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


class Task:
    def __init__(self, task_id, func, retries=3):
        self.task_id = task_id
        self.func = func
        self.status = TaskStatus.PENDING
        self.retries = retries  # Total number of attempts
        self.start_time = None
        self.end_time = None

    def execute(self):
        self.status = TaskStatus.RUNNING
        self.start_time = datetime.now()

        for attempt in range(1, self.retries + 1):
            try:
                print(f"\n[{self.task_id}] Attempt {attempt}/{self.retries}")
                result = self.func()
                self.status = TaskStatus.SUCCESS
                self.end_time = datetime.now()
                print(f"[{self.task_id}] [OK] SUCCESS")
                return result
            except Exception as e:
                if attempt == self.retries:
                    # Out of attempts: mark the task failed and re-raise
                    self.status = TaskStatus.FAILED
                    self.end_time = datetime.now()
                    print(f"[{self.task_id}] [FAIL] FAILED after {self.retries} attempts")
                    raise
                print(f"[{self.task_id}] [WARNING] Error: {e}, retrying...")
                time.sleep(1)  # Retry delay


class SimpleWorkflow:
    def __init__(self, workflow_name):
        self.workflow_name = workflow_name
        self.tasks = []
        self.dependencies = {}  # task_id -> list of upstream task_ids

    def add_task(self, task):
        self.tasks.append(task)
        self.dependencies[task.task_id] = []

    def set_dependency(self, upstream_task_id, downstream_task_id):
        """Set task dependency: upstream >> downstream"""
        # Reject unknown IDs instead of silently ignoring them
        for task_id in (upstream_task_id, downstream_task_id):
            if task_id not in self.dependencies:
                raise ValueError(f"Unknown task: {task_id}")
        self.dependencies[downstream_task_id].append(upstream_task_id)

    def run(self):
        print(f"\n{'='*60}")
        print(f"Running Workflow: {self.workflow_name}")
        print(f"{'='*60}")

        executed = set()

        while len(executed) < len(self.tasks):
            progressed = False

            for task in self.tasks:
                if task.task_id in executed:
                    continue

                # Check if all dependencies are met
                dependencies_met = all(
                    dep_id in executed for dep_id in self.dependencies[task.task_id]
                )

                if dependencies_met:
                    task.execute()
                    executed.add(task.task_id)
                    progressed = True

            # A full pass that runs nothing means the dependencies form a cycle;
            # stop instead of looping forever
            if not progressed:
                raise RuntimeError("Dependency cycle detected; workflow cannot finish")

        print(f"\n{'='*60}")
        print(f"Workflow Complete: {self.workflow_name}")
        print(f"{'='*60}")
        self._print_summary()

    def _print_summary(self):
        print("\nTask Summary:")
        for task in self.tasks:
            duration = (task.end_time - task.start_time).total_seconds() if task.end_time else 0
            print(f"  {task.task_id}: {task.status.value} ({duration:.2f}s)")


print("[OK] Simple workflow orchestrator created")
```

```python
# Define pipeline tasks
def extract_data():
    print("  Extracting data from source...")
    time.sleep(0.5)
    return "data_extracted"


def transform_data():
    print("  Transforming data...")
    time.sleep(0.5)
    return "data_transformed"


def load_data():
    print("  Loading data to warehouse...")
    time.sleep(0.5)
    return "data_loaded"


def quality_check():
    print("  Running quality checks...")
    time.sleep(0.3)
    return "quality_passed"


def send_notification():
    print("  Sending success notification...")
    return "notification_sent"


# Create workflow
workflow = SimpleWorkflow("Daily Sales ETL")

# Add tasks
extract_task = Task("extract", extract_data)
transform_task = Task("transform", transform_data)
load_task = Task("load", load_data)
quality_task = Task("quality_check", quality_check)
notify_task = Task("notify", send_notification)

workflow.add_task(extract_task)
workflow.add_task(transform_task)
workflow.add_task(load_task)
workflow.add_task(quality_task)
workflow.add_task(notify_task)

# Set dependencies
workflow.set_dependency("extract", "transform")
workflow.set_dependency("transform", "load")
workflow.set_dependency("load", "quality_check")
workflow.set_dependency("quality_check", "notify")

# Run workflow
workflow.run()
```
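
The simulated `Task.execute` above waits a fixed one second between attempts. A common variation is exponential backoff, where each wait is longer than the last so a struggling source gets time to recover. The following is a self-contained sketch of that idea; `run_with_retries` and `flaky_extract` are hypothetical names, and the tiny `base_delay` is chosen only to keep the demo fast.

```python
import time


def run_with_retries(func, retries=3, base_delay=0.01):
    """Call func until it succeeds or `retries` attempts are used up."""
    for attempt in range(1, retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # Out of attempts: surface the last error
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            time.sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky_extract():
    """Hypothetical task that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source temporarily unavailable")
    return "data_extracted"


result = run_with_retries(flaky_extract)
print(result, "after", calls["count"], "attempts")
```

In Airflow, comparable behavior is configured per task through arguments such as `retries` and `retry_delay`, as shown in the `default_args` of the DAG in Section 3.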

---

## 5. Scheduling Patterns

### Cron Expressions

Airflow uses cron expressions for scheduling:

```
┌─── minute (0 - 59)
│ ┌─── hour (0 - 23)
│ │ ┌─── day of month (1 - 31)
│ │ │ ┌─── month (1 - 12)
│ │ │ │ ┌─── day of week (0 - 6, Sunday = 0)
│ │ │ │ │
* * * * *
```

### Common Schedules

| Schedule | Cron Expression | Meaning |
|----------|----------------|----------|
| Every minute | `* * * * *` | Every minute |
| Every hour | `0 * * * *` | At minute 0 of every hour |
| Daily at 2 AM | `0 2 * * *` | 2:00 AM every day |
| Every Monday at 9 AM | `0 9 * * 1` | 9:00 AM every Monday |
| First day of month | `0 0 1 * *` | Midnight on the 1st |
| Every 15 minutes | `*/15 * * * *` | Every 15 minutes |

### Airflow Presets

```
@daily      # 0 0 * * *
@hourly     # 0 * * * *
@weekly     # 0 0 * * 0
@monthly    # 0 0 1 * *
@yearly     # 0 0 1 1 *
```
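
To make the five cron fields concrete, here is a deliberately simplified matcher in plain Python. It is a sketch, not how Airflow evaluates schedules: it supports only `*`, `*/n`, and single numbers (no lists or ranges), and it requires every field to match, whereas real cron has special rules when both day-of-month and day-of-week are restricted. The names `PRESETS`, `field_matches`, and `cron_matches` are illustrative.

```python
from datetime import datetime

# Presets from the list above, mapped to their cron equivalents.
PRESETS = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}


def field_matches(field, value, minimum=0):
    """Check one cron field; supports '*', '*/n', and a single number only."""
    if field == "*":
        return True
    if field.startswith("*/"):
        # Steps count from the field's minimum (0 for minutes, 1 for days).
        return (value - minimum) % int(field[2:]) == 0
    return int(field) == value


def cron_matches(expr, moment):
    """Return True if `moment` falls on a minute selected by `expr`."""
    expr = PRESETS.get(expr, expr)
    minute, hour, day, month, weekday = expr.split()
    # Cron counts Sunday as 0; Python's weekday() counts Monday as 0.
    cron_weekday = (moment.weekday() + 1) % 7
    return all([
        field_matches(minute, moment.minute),
        field_matches(hour, moment.hour),
        field_matches(day, moment.day, minimum=1),
        field_matches(month, moment.month, minimum=1),
        field_matches(weekday, cron_weekday),
    ])


# 2024-01-01 is a Monday and 2024-01-07 is a Sunday.
print(cron_matches("0 2 * * *", datetime(2024, 1, 1, 2, 0)))
print(cron_matches("0 9 * * 1", datetime(2024, 1, 1, 9, 0)))
print(cron_matches("@weekly", datetime(2024, 1, 7, 0, 0)))
```

For production use, rely on the orchestrator's own schedule parsing rather than a hand-written matcher like this one.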

---

## 6. Orchestration Tools Comparison

### Apache Airflow

**Pros**:
- [OK] Most popular and mature
- [OK] Rich ecosystem of operators
- [OK] Great UI
- [OK] Strong community

**Cons**:
- [FAIL] Complex setup
- [FAIL] Requires infrastructure management
- [FAIL] Steep learning curve

**Best For**: Enterprise workflows, complex dependencies

---

### Prefect

**Pros**:
- [OK] Modern Python-first approach
- [OK] Easier to set up
- [OK] Better error handling
- [OK] Cloud-native

**Cons**:
- [FAIL] Less mature than Airflow
- [FAIL] Smaller ecosystem

**Best For**: Python-centric teams, modern data stacks

---

### Dagster

**Pros**:
- [OK] Development-focused
- [OK] Strong typing
- [OK] Testing built-in
- [OK] Data-aware orchestration

**Cons**:
- [FAIL] Newer, smaller community
- [FAIL] Different paradigm (learning curve)

**Best For**: Software engineering teams, data apps

---

### Others

- **Luigi** (Spotify): Older, simpler, fewer features
- **Argo Workflows** (Kubernetes-native)
- **Temporal** (General workflow engine)
- **AWS Step Functions** (AWS-specific)
- **dbt** (SQL transformations only; usually scheduled by one of the orchestrators above rather than replacing them)

---

## 7. Best Practices

### DAG Design

1. **Keep DAGs simple**: One DAG per business process
2. **Make tasks idempotent**: Safe to run multiple times
3. **Use operators wisely**: Don't put too much logic in operators
4. **Handle failures**: Add retries and alerts
5. **Document**: Add descriptions to DAGs and tasks
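
Idempotency is the item in this list that most often causes trouble in practice, because retries and manual reruns execute the same task more than once. The sketch below contrasts two load strategies using an in-memory dictionary as a stand-in for a warehouse table; the function names and sample rows are hypothetical.

```python
def load_partition(warehouse, run_date, rows):
    """Idempotent load: replace the whole partition for run_date."""
    warehouse[run_date] = list(rows)


def append_rows(warehouse, run_date, rows):
    """Non-idempotent load: every rerun appends the same rows again."""
    warehouse.setdefault(run_date, []).extend(rows)


rows = [{"order_id": 1, "amount": 20.0}, {"order_id": 2, "amount": 35.5}]

idempotent, duplicated = {}, {}
for _ in range(2):  # Simulate a retry or a manual rerun of the same date
    load_partition(idempotent, "2024-01-01", rows)
    append_rows(duplicated, "2024-01-01", rows)

print("overwrite strategy rows:", len(idempotent["2024-01-01"]))
print("append strategy rows:", len(duplicated["2024-01-01"]))
```

Overwriting a partition keyed by the run date leaves the same result no matter how many times the task runs, while the append strategy doubles the data on the second run.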

### Performance

1. **Parallelize**: Run independent tasks in parallel
2. **Pool resources**: Limit concurrent tasks
3. **Optimize sensors**: Don't poll too frequently
4. **Monitor**: Track execution times and resource usage
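
The first two points, parallelizing independent tasks while capping concurrency, can be illustrated with the standard library's thread pool. This is a sketch of the idea rather than Airflow's pool mechanism; `fetch` and the source names are hypothetical, and the sleep stands in for real I/O.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(source):
    """Hypothetical independent task: pretend to pull one source."""
    time.sleep(0.1)
    return f"{source}_done"


sources = ["orders", "customers", "products", "inventory"]

# max_workers acts like a pool size: at most two tasks run at once.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(fetch, sources))
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s")
```

Raising `max_workers` shortens the total runtime but puts more simultaneous load on the systems the tasks talk to, which is the trade-off a pool limit controls.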

### Maintenance

1. **Version control**: Keep DAGs in Git
2. **Testing**: Test DAGs before deployment
3. **Backfilling**: Handle historical data loads carefully
4. **Cleanup**: Remove old task instances and logs
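
Backfilling means running a workflow for past dates it never covered. A minimal way to picture it, assuming a daily schedule and a task that processes exactly one date per run, is a loop over the missing dates; `backfill_dates` and `run_for_date` are hypothetical helpers, not Airflow functions.

```python
from datetime import date, timedelta


def backfill_dates(start, end):
    """Yield each daily run date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


processed = []


def run_for_date(run_date):
    """Hypothetical task body: would load only the data for run_date."""
    processed.append(run_date.isoformat())


for run_date in backfill_dates(date(2024, 1, 29), date(2024, 2, 2)):
    run_for_date(run_date)

print(processed)
```

This only works safely when each run is idempotent and scoped to its own date, which is why backfills and idempotent task design go together.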

### Security

1. **Use connections**: Store credentials securely
2. **Variables**: Use Airflow variables for config
3. **Secrets backend**: Integrate with secret managers
4. **RBAC**: Control access to DAGs and features
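
The common thread in these points is that credentials live outside the DAG code. Airflow can read connections from environment variables named `AIRFLOW_CONN_<CONN_ID>` holding a connection URI; the sketch below mimics that naming convention in plain Python. `get_connection` here is a hypothetical helper, not Airflow's API, and the URI is a placeholder set inline only so the example runs on its own.

```python
import os
from urllib.parse import urlsplit

# For the demo only: in practice the deployment environment sets this value,
# and it never appears in the DAG file or in Git.
os.environ["AIRFLOW_CONN_SALES_DB"] = (
    "postgresql://etl_user:example-password@db.example.com:5432/sales"
)


def get_connection(conn_id):
    """Read a connection URI from the environment and split it into parts."""
    uri = os.environ[f"AIRFLOW_CONN_{conn_id.upper()}"]
    parts = urlsplit(uri)
    return {
        "host": parts.hostname,
        "port": parts.port,
        "login": parts.username,
        "password": parts.password,
        "schema": parts.path.lstrip("/"),
    }


conn = get_connection("sales_db")
# Log only non-sensitive fields; never print the password.
print(conn["host"], conn["port"], conn["schema"])
```

Task code then refers to the connection by its ID, so rotating a password means changing the environment or secrets backend, not the DAG.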

---

## 8. Key Takeaways

[OK] **Orchestration**: Automated management of complex workflows

[OK] **DAG**: Directed Acyclic Graph - workflow with dependencies

[OK] **Airflow Components**: DAG, Operators, Tasks, Scheduler, Executor

[OK] **Scheduling**: Use cron expressions or presets (@daily, @hourly)

[OK] **Tools**: Airflow (mature), Prefect (modern), Dagster (dev-focused)

[OK] **Best Practices**: Idempotency, retries, monitoring, testing

### When to Use Orchestration?

- Multiple dependent tasks
- Scheduled recurring jobs
- Need for monitoring and alerts
- Complex data pipelines
- Production environments

---

## Next Steps

In **Module 08: Data Quality and Validation**, we'll:
- Learn data quality dimensions
- Implement validation checks
- Use data quality frameworks
- Test data pipelines
- Set up data contracts

---

**Ready to ensure data quality?** Open `08_data_quality_validation.ipynb`!