Apache Airflow ETL Workflow using Docker
Project Overview
This project demonstrates a production-ready data engineering workflow built using Apache Airflow and Docker.
It showcases how to orchestrate end-to-end ETL pipelines, perform data transformations, implement conditional workflows, and handle success/failure notifications in a reproducible containerized environment.
The solution includes five independent DAGs, each designed to highlight a real-world Airflow orchestration pattern commonly used in data engineering teams.
Architecture
Technology Stack
Apache Airflow 2.8.0 (webserver, scheduler, LocalExecutor)
PostgreSQL (Airflow metadata database and data warehouse)
Docker & Docker Compose
Parquet (pyarrow engine, snappy compression)
Data Flow
CSV → PostgreSQL (raw) → PostgreSQL (transformed) → Parquet
    → Conditional Workflow
    → Notification Workflow
Repository Structure
project-root/
├── docker-compose.yml
├── requirements.txt
├── README.md
├── dags/
│   ├── dag1_csv_to_postgres.py
│   ├── dag2_data_transformation.py
│   ├── dag3_postgres_to_parquet.py
│   ├── dag4_conditional_workflow.py
│   └── dag5_notification_workflow.py
├── tests/
│   ├── test_dag1.py
│   ├── test_dag2.py
│   └── test_utils.py
├── data/
│   └── input.csv
├── output/
│   └── (generated Parquet files)
└── plugins/
Prerequisites
Ensure the following are installed on your system:
Docker
Docker Compose
Git
Setup Instructions
1. Clone the Repository
git clone <repository-url>
cd project-root
2. Start Airflow Using Docker
docker-compose up
This command starts the following services:
PostgreSQL database
Airflow initialization container
Airflow webserver
Airflow scheduler
Access the Airflow UI
Open your browser and navigate to http://localhost:8080.
Default Credentials
Username: admin
Password: admin
DAGs Overview
DAG 1: CSV to Postgres Ingestion
DAG ID: csv_to_postgres_ingestion
Schedule: @daily
Functionality
Reads employee data from a CSV file
Creates the target table if it does not exist
Truncates the table to ensure idempotency
Loads data into PostgreSQL
Output
Table: raw_employee_data
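
The snippet below is a minimal sketch of this ingestion pattern, not the project's actual DAG: the connection ID, column handling, and use of pandas' replace mode are assumptions made for illustration (the real DAG creates and truncates the table explicitly).

```python
# Hypothetical sketch of DAG 1's ingestion pattern; connection ID and pandas usage are assumptions.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def load_csv_to_postgres():
    hook = PostgresHook(postgres_conn_id="postgres_default")  # assumed connection ID
    df = pd.read_csv("/opt/airflow/data/input.csv")
    # "replace" keeps reruns idempotent; the project's DAG truncates the table instead
    df.to_sql("raw_employee_data", hook.get_sqlalchemy_engine(), if_exists="replace", index=False)


with DAG(
    dag_id="csv_to_postgres_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="load_csv", python_callable=load_csv_to_postgres)
```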
DAG 2: Data Transformation Pipeline
DAG ID: data_transformation_pipeline
Schedule: @daily
Transformations Applied
full_info → name - city
age_group → Young / Mid / Senior
salary_category → Low / Medium / High
year_joined → extracted from the join date
Output
Table: transformed_employee_data
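
The column logic might look roughly like the sketch below; the bucket boundaries, source column names, and the choice of pandas over SQL are assumptions, not taken from the project.

```python
# Illustrative transformation only; real boundaries and column names may differ.
import pandas as pd


def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["full_info"] = df["name"] + " - " + df["city"]
    df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 200], labels=["Young", "Mid", "Senior"])
    df["salary_category"] = pd.cut(
        df["salary"], bins=[0, 50_000, 100_000, float("inf")], labels=["Low", "Medium", "High"]
    )
    df["year_joined"] = pd.to_datetime(df["join_date"]).dt.year
    return df
```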
DAG 3: Postgres to Parquet Export
DAG ID: postgres_to_parquet_export
Schedule: @weekly
Functionality
Validates source table existence and data
Exports transformed data to Parquet format
Uses snappy compression
Validates schema and file integrity
Output
Parquet files in /opt/airflow/output/
Filename includes execution date
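
As a rough illustration of the export step (the connection ID and exact filename pattern are assumed, not taken from the project), the core logic is a query followed by a pyarrow-backed write:

```python
# Sketch of the Parquet export; `ds` is Airflow's execution-date template value.
import pandas as pd
from airflow.providers.postgres.hooks.postgres import PostgresHook


def export_to_parquet(ds: str) -> None:
    hook = PostgresHook(postgres_conn_id="postgres_default")  # assumed connection ID
    df = hook.get_pandas_df("SELECT * FROM transformed_employee_data")
    df.to_parquet(
        f"/opt/airflow/output/employee_data_{ds}.parquet",
        engine="pyarrow",
        compression="snappy",
        index=False,
    )
```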
DAG 4: Conditional Workflow
DAG ID: conditional_workflow_pipeline
Schedule: @daily
Branching Logic
Monday–Wednesday → Weekday processing
Thursday–Friday → End-of-week processing
Saturday–Sunday → Weekend processing
A unified end task runs regardless of the branch executed.
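
A minimal sketch of this branching pattern with BranchPythonOperator follows; the task IDs, exact day boundaries, and trigger rule are assumptions for illustration.

```python
# Hypothetical branching DAG; the real task names and processing logic may differ.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator


def choose_branch(logical_date, **_):
    weekday = logical_date.weekday()  # Monday = 0 ... Sunday = 6
    if weekday <= 2:
        return "weekday_processing"
    if weekday <= 4:
        return "end_of_week_processing"
    return "weekend_processing"


with DAG(
    dag_id="conditional_workflow_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    branch = BranchPythonOperator(task_id="branch_by_day", python_callable=choose_branch)
    paths = [
        EmptyOperator(task_id=name)
        for name in ("weekday_processing", "end_of_week_processing", "weekend_processing")
    ]
    # The end task must run even though two of its upstream branches are skipped
    end = EmptyOperator(task_id="end", trigger_rule="none_failed_min_one_success")
    branch >> paths >> end
```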
DAG 5: Notification Workflow
DAG ID: notification_workflow
Schedule: @daily
Functionality
Simulates a risky operation
Triggers success or failure callbacks
Executes cleanup logic regardless of outcome
Demonstrates Airflow trigger rules and callbacks
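
The callback and trigger-rule pattern could be sketched like this; the simulated failure, callback bodies, and task names are placeholders rather than the project's code.

```python
# Illustrative callbacks and trigger rules; notifications are simulated with logging.
import logging
import random
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

log = logging.getLogger(__name__)


def risky_operation():
    if random.random() < 0.3:  # simulate an intermittent failure
        raise RuntimeError("Simulated failure")
    log.info("Risky operation succeeded")


def notify_success(context):
    log.info("SUCCESS notification for task %s", context["task_instance"].task_id)


def notify_failure(context):
    log.error("FAILURE notification for task %s", context["task_instance"].task_id)


with DAG(
    dag_id="notification_workflow",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    risky = PythonOperator(
        task_id="risky_operation",
        python_callable=risky_operation,
        on_success_callback=notify_success,
        on_failure_callback=notify_failure,
    )
    # trigger_rule="all_done" makes cleanup run whether the upstream succeeded or failed
    cleanup = PythonOperator(
        task_id="cleanup",
        python_callable=lambda: log.info("Cleaning up"),
        trigger_rule="all_done",
    )
    risky >> cleanup
```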
Running the DAGs
Open the Airflow UI
Enable the DAG
Click Trigger DAG
Monitor execution using Graph View and Logs
Output Verification
PostgreSQL Tables
raw_employee_data
transformed_employee_data
Parquet Files
Location: /opt/airflow/output/
Example:
employee_data_2024-01-01.parquet
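
To spot-check an exported file from the host, something like the following works, assuming the repository's output/ directory is what the containers mount at /opt/airflow/output:

```python
# Quick inspection of an exported file; path and date are examples, not guaranteed to exist.
import pandas as pd

df = pd.read_parquet("output/employee_data_2024-01-01.parquet", engine="pyarrow")
print(df.dtypes)
print(f"{len(df)} rows exported")
```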
Running Unit Tests
Unit tests validate DAG structure only and do not require Airflow or PostgreSQL to be running.
pytest tests/ -v
Tests Cover
DAGs load without import errors
Correct DAG IDs
Correct number of tasks
Task dependencies
Schedule intervals
No cyclic dependencies
All DAGs load successfully
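
A representative structure test (not necessarily identical to the ones in tests/) might look like this:

```python
# Example DagBag-based checks: import errors, DAG presence, and schedule.
from airflow.models import DagBag


def test_dags_load_without_import_errors():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert dag_bag.import_errors == {}


def test_ingestion_dag_schedule():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    dag = dag_bag.get_dag("csv_to_postgres_ingestion")
    assert dag is not None
    assert dag.schedule_interval == "@daily"
```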
Troubleshooting
Airflow UI Not Accessible
Ensure port 8080 is free
Check logs:
docker-compose logs airflow-webserver
DAGs Not Appearing
Verify DAG files are inside the dags/ directory
Restart services:
docker-compose restart
Containers Fail to Start
Reset the environment (the -v flag removes volumes, including the database) and start again:
docker-compose down -v
docker-compose up
Dependencies
All Python dependencies are listed in requirements.txt.
Expected Outcome
Airflow environment starts with a single command
Five DAGs visible and operational in the UI
PostgreSQL tables populated correctly
Parquet files generated successfully
Unit tests pass
Clear, evaluator-ready documentation
Notes
Email and alert notifications are simulated via logging
No external services are required
Designed for local execution, learning, and evaluation purposes