# Exercise 9: Apache Airflow - Workflow Orchestration

## Why Airflow?

In previous exercises, we ran individual notebooks manually. In production, data pipelines need:

- **Scheduling**: Run at specific times (daily, hourly)
- **Dependencies**: Job B runs only after Job A completes
- **Retries**: Automatic retry on failures
- **Monitoring**: Track pipeline health
- **Alerting**: Notify on failures

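**Apache Airflow** covers all five: you define a pipeline as Python code, and the Airflow scheduler runs it on schedule, in dependency order, with retries, monitoring, and alerting.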

```
┌─────────────────────────────────────────────────────────────────┐
│                    WITHOUT AIRFLOW                             │
│                                                                 │
│   Cron Job 1    Cron Job 2    Cron Job 3                       │
│   (Extract)     (Transform)   (Load)                           │
│       │             │            │                             │
│       │   Hope they run in order! ❌                           │
│       └─────────────┴────────────┘                             │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    WITH AIRFLOW                                │
│                                                                 │
│   ┌─────────┐    ┌───────────┐    ┌──────────┐                 │
│   │ Extract │───▶│ Transform │───▶│   Load   │                 │
│   └─────────┘    └───────────┘    └──────────┘                 │
│        │               │               │                       │
│        ▼               ▼               ▼                       │
│   ┌───────────────────────────────────────────┐               │
│   │  Airflow Scheduler (Manages Dependencies) │ ✓             │
│   └───────────────────────────────────────────┘               │
└─────────────────────────────────────────────────────────────────┘
```

## Learning Objectives
- Understand Airflow concepts (DAGs, Tasks, Operators)
- Examine real DAG examples
- Trigger and monitor DAG runs
- Create pipelines that orchestrate Hive and Spark jobs

---

## Part 1: Airflow Core Concepts

### Key Terms

| Concept | Description | Example |
|---------|-------------|----------|
| **DAG** | Directed Acyclic Graph - the pipeline | `daily_sales_etl` |
| **Task** | A single unit of work | `extract_sales_data` |
| **Operator** | Template for a task type | `BashOperator`, `PythonOperator` |
| **Sensor** | Waits for a condition | `FileSensor` (wait for file) |
| **XCom** | Share data between tasks | Pass values between operators |
| **Connection** | External system credentials | Hive, Spark, S3 connections |

In [None]:
# Airflow is running as a service in our cluster
# Let's check its status

print("""
╔════════════════════════════════════════════════════════════════╗
║  AIRFLOW WEB UI                                                ║
║                                                                ║
║  URL: http://localhost:8080                                    ║
║  Username: admin                                               ║
║  Password: admin                                               ║
║                                                                ║
║  Open in browser to see DAGs, runs, and logs!                  ║
╚════════════════════════════════════════════════════════════════╝
""")

## Part 2: Anatomy of a DAG

Let's examine a real DAG file from our cluster:

In [None]:
# Show the Hive ETL DAG
hive_etl_dag = '''
"""
Sample DAG: Hive ETL Pipeline
Demonstrates Hive operations from Airflow
"""
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

# Default arguments applied to all tasks
default_args = {
    'owner': 'teaching-lab',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'retries': 2,
    'retry_delay': timedelta(minutes=2),
}

BEELINE_CMD = 'beeline -u "jdbc:hive2://hiveserver2:10000/default" --silent=true -e'

# Define the DAG
with DAG(
    'hive_etl_example',           # DAG ID
    default_args=default_args,
    description='Example Hive ETL pipeline',
    schedule_interval=None,        # Manual trigger only
    catchup=False,
    tags=['demo', 'hive', 'etl'],
) as dag:
    
    # Task 1: Create database
    create_database = BashOperator(
        task_id='create_database',
        bash_command=f'{BEELINE_CMD} "CREATE DATABASE IF NOT EXISTS airflow_demo;"',
    )
    
    # Task 2: Create table
    create_table = BashOperator(
        task_id='create_table',
        bash_command=f\'\'\'
            {BEELINE_CMD} "
                USE airflow_demo;
                CREATE TABLE IF NOT EXISTS daily_metrics (
                    metric_name STRING,
                    metric_value DOUBLE,
                    recorded_at TIMESTAMP
                )
                PARTITIONED BY (date_key STRING)
                STORED AS PARQUET;
            "
        \'\'\',
    )
    
    # Define task dependencies
    create_database >> create_table  # create_table runs AFTER create_database
'''

print("=== Example Hive ETL DAG ===")
print(hive_etl_dag)
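
The `retries` and `retry_delay` entries in `default_args` mean that a failing task is attempted up to three times in total (one initial try plus two retries), with a pause between attempts. The following is a minimal pure-Python sketch of that behaviour, not Airflow's actual implementation. The names `run_with_retries` and `flaky_task` are hypothetical and exist only for this illustration.

```python
import time
from datetime import timedelta


def run_with_retries(task, retries=2, retry_delay=timedelta(minutes=2), sleep=time.sleep):
    """Call `task` until it succeeds or the retry budget is exhausted.

    Hypothetical helper for illustration; it is not part of Airflow.
    Returns a (result, attempts) tuple.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > retries:
                raise  # the initial try and all retries have failed
            sleep(retry_delay.total_seconds())


# A task that fails twice and then succeeds.
calls = {"n": 0}


def flaky_task():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


# Pass a no-op sleep so the example finishes immediately.
result, attempts = run_with_retries(flaky_task, sleep=lambda seconds: None)
print(result, attempts)  # ok 3
```

With `retries=2`, a task that fails a third time would be marked as failed instead of being retried again.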

## Part 3: Understanding Task Dependencies

In Airflow, you define execution order using:

```python
# Method 1: Bit shift operators
task_a >> task_b >> task_c  # Linear: A then B then C

# Method 2: List syntax
[task_a, task_b] >> task_c  # Fan-in: A and B can run in parallel; C waits for both

# Method 3: set_downstream/set_upstream
task_a.set_downstream(task_b)
task_c.set_upstream(task_b)
```
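
The `>>` syntax works because operators overload Python's right-shift methods. The sketch below imitates that mechanism with a tiny stand-in class so you can see why both `task_a >> task_b` and `[task_a, task_b] >> task_c` are valid. The `Task` class here is hypothetical and much simpler than a real Airflow operator.

```python
class Task:
    """Tiny stand-in for an Airflow operator (hypothetical, for illustration)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()
        self.downstream = set()

    def set_downstream(self, other):
        self.downstream.add(other.task_id)
        other.upstream.add(self.task_id)

    def set_upstream(self, other):
        other.set_downstream(self)

    def __rshift__(self, other):
        # Handles `task >> other` and `task >> [other1, other2]`.
        for t in (other if isinstance(other, list) else [other]):
            self.set_downstream(t)
        return other  # returning the right-hand side enables chaining

    def __rrshift__(self, others):
        # Handles `[task_a, task_b] >> task`: a list has no __rshift__,
        # so Python falls back to the right operand's __rrshift__.
        for t in others:
            t.set_downstream(self)
        return self


a, b, c, d = (Task(name) for name in "abcd")
a >> b >> c   # linear chain: a, then b, then c
[a, b] >> d   # fan-in: d waits for both a and b

print(sorted(c.upstream))  # ['b']
print(sorted(d.upstream))  # ['a', 'b']
```

Because `a >> b` returns `b`, the chained expression `a >> b >> c` is evaluated as `(a >> b) >> c`, which is what makes the linear form read left to right.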

In [None]:
# Visual representation of the full hive_etl_example DAG
# (Part 2 showed only its first two tasks)
print("""
HIVE ETL DAG VISUALIZATION

  create_database
        |
        v
  create_table
        |
        v
  insert_data
        |
        v
  run_aggregation
        |
        v
  show_partitions
""")

## Part 4: Common Airflow Operators

Airflow ships with a set of core operators, and provider packages add more (for example Spark, Hive, Docker, and HTTP):

In [None]:
print("""
COMMON AIRFLOW OPERATORS

  BashOperator         - Run shell commands
  PythonOperator       - Execute Python functions
  SparkSubmitOperator  - Submit Spark jobs to cluster
  HiveOperator         - Execute Hive queries
  FileSensor           - Wait for file to appear
  EmailOperator        - Send email notifications
  BranchPythonOperator - Conditional execution paths
  EmptyOperator        - No-op placeholder task (DummyOperator in older releases)
  DockerOperator       - Run commands in Docker containers
  HttpSensor           - Wait for HTTP endpoint response
""")

## Part 5: Interacting with Airflow via API

In [None]:
import requests

# Airflow REST API base URL
AIRFLOW_URL = "http://airflow-webserver:8080/api/v1"
AUTH = ("admin", "admin")  # Default credentials

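def list_dags():
    """Return the parsed /dags response, or an error string on failure."""
    try:
        response = requests.get(
            f"{AIRFLOW_URL}/dags",
            auth=AUTH,
            headers={"Content-Type": "application/json"},
            timeout=10,  # fail fast instead of hanging if the host is unreachable
        )
        if response.status_code == 200:
            return response.json()
        return f"Error: {response.status_code}"
    except requests.RequestException as e:
        return f"Connection error: {e}"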

print("=== Available DAGs ===")
dags = list_dags()
if isinstance(dags, dict) and 'dags' in dags:
    for dag in dags['dags']:
        status = "Active" if not dag.get('is_paused') else "Paused"
        print(f"  {dag['dag_id']}: {status}")
else:
    print(f"Could not connect to Airflow: {dags}")
    print("(This is expected if running outside the Docker network)")

## Part 6: Production Pipeline Pattern

In [None]:
print("""
PRODUCTION ETL PIPELINE PATTERN

      Daily Schedule Trigger
              |
              v
      Wait for Source File (FileSensor)
              |
    +---------+---------+
    |         |         |
    v         v         v
 Extract   Extract   Extract
 Orders    Products  Customers
    |         |         |
    +---------+---------+
              |
              v
      Transform & Join (Spark)
              |
              v
      Load to Hive Table
              |
    +---------+---------+
    |         |         |
    v         v         v
 Run Tests  Notify   Update Stats
""")

## Part 7: Airflow Best Practices

In [None]:
print("""
AIRFLOW BEST PRACTICES

1. IDEMPOTENCY
   - Tasks should produce same result if run multiple times
   - Use OVERWRITE mode, not APPEND
   - Use date partitions for incremental loads

2. ATOMICITY
   - Each task should be a complete unit of work
   - Write to temp location, then move atomically

3. AVOID STORING DATA IN XCOMS
   - XComs are for small metadata only (file paths, counts)
   - Store actual data in HDFS/S3/Database

4. USE CONNECTIONS & VARIABLES
   - Store credentials in Airflow Connections
   - Store config in Airflow Variables
   - Never hardcode secrets in DAG files

5. MEANINGFUL TASK IDS
   - extract_sales_data NOT task1
   - Makes debugging and monitoring easier
""")

## Part 8: Troubleshooting Commands

In [None]:
print("""
COMMON TROUBLESHOOTING COMMANDS

# View DAG runs
docker exec airflow-webserver airflow dags list-runs -d hive_etl_example

# Trigger a DAG manually
docker exec airflow-webserver airflow dags trigger hive_etl_example

# View task logs: open the task instance's Logs tab in the web UI
# (airflow tasks test, shown below, also prints the task log to the terminal)

# Test a single task
docker exec airflow-webserver airflow tasks test hive_etl_example create_database 2024-01-01

# Clear task instances so they are re-run (--yes skips the confirmation prompt)
docker exec airflow-webserver airflow tasks clear hive_etl_example -t create_table --yes

# Pause/unpause a DAG
docker exec airflow-webserver airflow dags pause hive_etl_example
docker exec airflow-webserver airflow dags unpause hive_etl_example
""")

## Exercises

### Exercise 1: Explore the Airflow UI
1. Open http://localhost:8080
2. Log in with `admin` / `admin`
3. Find the `hive_etl_example` DAG
4. Trigger it manually and watch the execution

### Exercise 2: Create Your Own DAG
Create a DAG that:
1. Checks if a Hive table exists
2. Runs a Spark aggregation job
3. Writes results to a new Hive table

In [None]:
print("""
KEY TAKEAWAYS

1. Airflow = Workflow orchestration for data pipelines

2. DAGs define:
   - WHAT tasks to run
   - WHEN to run (schedule)
   - In what ORDER (dependencies)

3. Operators execute work:
   - BashOperator for shell commands
   - SparkSubmitOperator for Spark jobs
   - HiveOperator for Hive queries

4. Integration with Big Data stack:
   - Submits jobs to YARN cluster
   - Queries Hive tables
   - Monitors HDFS for new files

5. Production features:
   - Retries, alerting, logging
   - Backfill historical data
   - Parallel task execution
""")