
# Apache Airflow ETL Workflow using Docker

## 📌 Project Overview

This project demonstrates a production-style data engineering workflow built with Apache Airflow and Docker.

It shows how to orchestrate end-to-end ETL pipelines, perform data transformations, implement conditional workflows, and handle success/failure notifications in a reproducible, containerized environment.

The solution includes five independent DAGs, each highlighting a real-world Airflow orchestration pattern commonly used by data engineering teams.

πŸ—οΈ Architecture

Technology Stack

Apache Airflow 2.8.0

Webserver

Scheduler

LocalExecutor

PostgreSQL

Airflow metadata database

Data warehouse

Docker & Docker Compose

Parquet

pyarrow engine

snappy compression

### Data Flow

```
CSV → PostgreSQL (raw) → PostgreSQL (transformed) → Parquet
                       ↘ Conditional Workflow
                       ↘ Notification Workflow
```

## 📂 Repository Structure

```
project-root/
├── docker-compose.yml
├── requirements.txt
├── README.md
├── dags/
│   ├── dag1_csv_to_postgres.py
│   ├── dag2_data_transformation.py
│   ├── dag3_postgres_to_parquet.py
│   ├── dag4_conditional_workflow.py
│   └── dag5_notification_workflow.py
├── tests/
│   ├── test_dag1.py
│   ├── test_dag2.py
│   └── test_utils.py
├── data/
│   └── input.csv
├── output/
│   └── (generated parquet files)
└── plugins/
```

## ⚙️ Prerequisites

Ensure the following are installed on your system:

- Docker
- Docker Compose
- Git

## 🚀 Setup Instructions

### 1️⃣ Clone the Repository

```bash
git clone https://github.com/LALITHA-14/project-root.git
cd project-root
```

### 2️⃣ Start Airflow Using Docker

```bash
docker-compose up
```

This command starts the following services:

- PostgreSQL database
- Airflow initialization container
- Airflow webserver
- Airflow scheduler

## 🌐 Access Airflow UI

Open your browser and navigate to:

http://localhost:8080

### Default Credentials

- Username: `admin`
- Password: `admin`

## 🔍 DAGs Overview

### DAG 1: CSV to Postgres Ingestion

- **DAG ID:** `csv_to_postgres_ingestion`
- **Schedule:** `@daily`

**Functionality**

- Reads employee data from a CSV file
- Creates the target table if it does not exist
- Truncates the table to ensure idempotency
- Loads the data into PostgreSQL

**Output:** table `raw_employee_data`
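A minimal sketch of what such an ingestion task could look like, assuming a `postgres_default` connection ID and a column list inferred from the transformations in DAG 2 — the actual code in `dags/dag1_csv_to_postgres.py` may differ:

```python
# Illustrative sketch only — connection ID, task ID, and column list are
# assumptions, not copied from dags/dag1_csv_to_postgres.py.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def load_csv_to_postgres():
    hook = PostgresHook(postgres_conn_id="postgres_default")
    df = pd.read_csv("/opt/airflow/data/input.csv")
    # Create the target table if missing, then truncate so reruns stay idempotent.
    hook.run("""
        CREATE TABLE IF NOT EXISTS raw_employee_data (
            name TEXT, city TEXT, age INT, salary NUMERIC, join_date DATE
        );
        TRUNCATE TABLE raw_employee_data;
    """)
    df.to_sql("raw_employee_data", hook.get_sqlalchemy_engine(),
              if_exists="append", index=False)


with DAG(
    dag_id="csv_to_postgres_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="load_csv", python_callable=load_csv_to_postgres)
```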

### DAG 2: Data Transformation Pipeline

- **DAG ID:** `data_transformation_pipeline`
- **Schedule:** `@daily`

**Transformations Applied**

- `full_info` → `name - city`
- `age_group` → Young / Mid / Senior
- `salary_category` → Low / Medium / High
- `year_joined` → extracted from the join date

**Output:** table `transformed_employee_data`
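A transformation step in this style could run a single SQL statement from a PythonOperator (as wired up in the DAG 1 sketch). The age and salary thresholds below are assumptions for illustration:

```python
# Illustrative only — band thresholds and column names are assumptions.
from airflow.providers.postgres.hooks.postgres import PostgresHook


def transform_employee_data():
    hook = PostgresHook(postgres_conn_id="postgres_default")
    hook.run("""
        DROP TABLE IF EXISTS transformed_employee_data;
        CREATE TABLE transformed_employee_data AS
        SELECT
            name || ' - ' || city AS full_info,
            CASE WHEN age < 30 THEN 'Young'        -- assumed threshold
                 WHEN age < 50 THEN 'Mid'          -- assumed threshold
                 ELSE 'Senior' END AS age_group,
            CASE WHEN salary < 50000 THEN 'Low'    -- assumed band
                 WHEN salary < 100000 THEN 'Medium'
                 ELSE 'High' END AS salary_category,
            EXTRACT(YEAR FROM join_date)::INT AS year_joined
        FROM raw_employee_data;
    """)
```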

### DAG 3: Postgres to Parquet Export

- **DAG ID:** `postgres_to_parquet_export`
- **Schedule:** `@weekly`

**Functionality**

- Validates source table existence and data
- Exports transformed data to Parquet format
- Uses snappy compression
- Validates schema and file integrity

**Output**

- Parquet files in `/opt/airflow/output/`
- Filename includes the execution date
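A sketch of the export step: `ds` is Airflow's built-in date stamp for the run, which yields the dated filenames shown under Output Verification. The real task also validates the source table and the written file, which is omitted here:

```python
# Illustrative only — connection ID is an assumption; validation steps omitted.
from airflow.providers.postgres.hooks.postgres import PostgresHook


def export_to_parquet(ds: str) -> str:
    # Airflow injects `ds` (the YYYY-MM-DD logical date) automatically when
    # the callable declares it as a parameter.
    hook = PostgresHook(postgres_conn_id="postgres_default")
    df = hook.get_pandas_df("SELECT * FROM transformed_employee_data")
    path = f"/opt/airflow/output/employee_data_{ds}.parquet"
    df.to_parquet(path, engine="pyarrow", compression="snappy", index=False)
    return path
```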

### DAG 4: Conditional Workflow

- **DAG ID:** `conditional_workflow_pipeline`
- **Schedule:** `@daily`

**Branching Logic**

- Monday–Wednesday → weekday processing
- Thursday–Friday → end-of-week processing
- Saturday–Sunday → weekend processing

A unified end task runs regardless of which branch executed, as sketched below.
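A minimal sketch of this pattern with `BranchPythonOperator` (the task IDs are illustrative, not the repository's):

```python
# Illustrative only — task IDs are assumptions, not the repository's names.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.utils.trigger_rule import TriggerRule


def choose_branch(logical_date):
    weekday = logical_date.weekday()  # Monday == 0
    if weekday <= 2:
        return "weekday_processing"
    if weekday <= 4:
        return "end_of_week_processing"
    return "weekend_processing"


with DAG(dag_id="conditional_workflow_pipeline", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    branch = BranchPythonOperator(task_id="choose_branch",
                                  python_callable=choose_branch)
    branches = [EmptyOperator(task_id=t) for t in
                ("weekday_processing", "end_of_week_processing",
                 "weekend_processing")]
    # The end task must tolerate skipped branches, hence the relaxed trigger rule.
    end = EmptyOperator(task_id="end",
                        trigger_rule=TriggerRule.NONE_FAILED_MIN_ONE_SUCCESS)
    branch >> branches >> end
```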

### DAG 5: Notification Workflow

- **DAG ID:** `notification_workflow`
- **Schedule:** `@daily`

**Functionality**

- Simulates a risky operation
- Triggers success or failure callbacks
- Executes cleanup logic regardless of outcome
- Demonstrates Airflow trigger rules and callbacks
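A sketch of the callback-and-cleanup pattern, assuming log-based notifications as described in the Notes section (function and task names are illustrative):

```python
# Illustrative only — callable and task names are assumptions.
import logging
import random
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

log = logging.getLogger(__name__)


def notify_success(context):
    log.info("SUCCESS notification for task %s",
             context["task_instance"].task_id)


def notify_failure(context):
    log.error("FAILURE notification for task %s",
              context["task_instance"].task_id)


def risky_operation():
    if random.random() < 0.5:  # simulated flaky step
        raise RuntimeError("simulated failure")


with DAG(dag_id="notification_workflow", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    risky = PythonOperator(
        task_id="risky_operation",
        python_callable=risky_operation,
        on_success_callback=notify_success,
        on_failure_callback=notify_failure,
    )
    cleanup = PythonOperator(
        task_id="cleanup",
        python_callable=lambda: log.info("cleanup complete"),
        trigger_rule=TriggerRule.ALL_DONE,  # runs whether upstream succeeded or failed
    )
    risky >> cleanup
```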

## ▶️ How to Trigger DAGs

1. Open the Airflow UI
2. Enable the DAG
3. Click **Trigger DAG**
4. Monitor execution using the Graph view and task logs
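Equivalently, a DAG can be unpaused and triggered from the CLI inside the webserver container (shown here with the first DAG's ID):

```bash
docker-compose exec airflow-webserver airflow dags unpause csv_to_postgres_ingestion
docker-compose exec airflow-webserver airflow dags trigger csv_to_postgres_ingestion
```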

## 📦 Output Verification

**PostgreSQL Tables**

- `raw_employee_data`
- `transformed_employee_data`

**Parquet Files**

- Location: `/opt/airflow/output/`
- Example: `employee_data_2024-01-01.parquet`
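One way to spot-check the results, assuming the `postgres` service name and the default `airflow`/`airflow` user and database of a typical Airflow docker-compose setup — adjust to match your `docker-compose.yml`:

```bash
# Row count in the warehouse table (service name and credentials are assumptions).
docker-compose exec postgres psql -U airflow -d airflow \
  -c "SELECT COUNT(*) FROM transformed_employee_data;"

# Parquet files (assuming ./output is mounted at /opt/airflow/output/).
ls output/
```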

## 🧪 Running Unit Tests

Unit tests validate DAG structure only and do not require the Airflow or PostgreSQL services to be running.

```bash
pytest tests/ -v
```

**Tests Cover**

- All DAGs load without import errors
- Correct DAG IDs
- Correct number of tasks
- Task dependencies
- Schedule intervals
- No cyclic dependencies
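A representative structure test in this style uses Airflow's `DagBag`, which parses the DAG files without running a scheduler (test names here are illustrative, not the repository's):

```python
# Illustrative only — mirrors the checks listed above; names are assumptions.
import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dagbag():
    # Parse DAG files from the repo's dags/ folder without loading examples.
    return DagBag(dag_folder="dags/", include_examples=False)


def test_no_import_errors(dagbag):
    assert dagbag.import_errors == {}


def test_ingestion_dag(dagbag):
    dag = dagbag.get_dag("csv_to_postgres_ingestion")
    assert dag is not None
    assert dag.schedule_interval == "@daily"
```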

## 🛠️ Troubleshooting

**Airflow UI Not Accessible**

- Ensure port 8080 is free
- Check the webserver logs:

```bash
docker-compose logs airflow-webserver
```

**DAGs Not Appearing**

- Verify the DAG files are inside the `dags/` directory
- Restart the services:

```bash
docker-compose restart
```

**Containers Fail to Start**

Reset the environment (note that `-v` removes volumes, including the metadata database):

```bash
docker-compose down -v
docker-compose up
```

## 📜 Dependencies

All Python dependencies are listed in `requirements.txt`.

## ✅ Expected Outcome

- The Airflow environment starts with a single command
- Five DAGs are visible and operational in the UI
- PostgreSQL tables are populated correctly
- Parquet files are generated successfully
- All unit tests pass
- Clear, evaluator-ready documentation

## 📌 Notes

- Email and alert notifications are simulated via logging
- No external services are required
- Designed for local execution, learning, and evaluation purposes
