An automated, end-to-end Data Engineering pipeline processing 3.5GB of web logs on a modern zero-copy cloud Lakehouse architecture.
This is a Graduation Project for the DataTalks.Club Data Engineering Zoomcamp 2026.
The goal of this project is to build a production-grade pipeline that extracts, processes, and visualizes massive web server access logs. It is specifically designed to stream data entirely in memory, with zero local disk usage for data files, making it highly cost-effective and scalable.
The pipeline orchestrates a flow from Raw Data to a fully modeled Kimball Star Schema, capped by an interactive BI dashboard.
```mermaid
graph TD
    Kaggle[☁️ Kaggle API] -->|Boto3 Stream| Raw[(S3 Raw Zone)]
    Raw -->|Read Partitions| Spark[⚡ Apache Spark]
    CSV[client_hostname.csv] -->|Broadcast Join| Spark
    Spark -->|Snappy Parquet| Silver[(S3 Silver Zone)]
    Silver -->|Zero-Copy View| MD[🦆 MotherDuck DWH]
    MD -->|Run Models| dbt[🏗️ dbt Core]
    dbt -->|Star Schema| MD
    MD -->|SQLAlchemy| BI[📊 Apache Superset]
```
| Phase | Technology | Purpose |
|---|---|---|
| Orchestration | 🌬️ Apache Airflow 2.9 | Containerized DAG scheduling & failure management |
| Ingestion | 🐍 Python / boto3 | In-memory data streaming to Cloud |
| Storage | 🪣 AWS S3 | Data Lake (Raw / Silver Parquet) |
| Processing | ⚡ Apache Spark 3.5 | Fast CSV parsing, enrichment, and deduplication |
| Data Warehouse | 🦆 MotherDuck | Serverless OLAP connected directly to S3 |
| Transformation | 🏗️ dbt (dbt-duckdb) | staging → core (Kimball) → dashboard aggregations |
| Visualization | 📊 Apache Superset | Real-time interactive dashboards with Dark Theme |
- Docker Desktop
- AWS account (Access Keys)
- MotherDuck account (Service Token)
- Kaggle account (API token `kaggle.json`)
```bash
git clone https://github.com/Ibrahim-Ayman/log-analysis.git
cd log-analysis

# Create your .env file
cp .env.example .env
# Open .env and carefully add your API keys
```

```bash
docker compose up -d
```

This bootstraps Airflow, Spark Master, Spark Workers, Postgres, and Superset.
```bash
# 1. Create S3 raw and silver buckets
docker compose exec airflow-webserver python /opt/airflow/scripts/setup_s3.py

# 2. Register datasets and MotherDuck connection in Superset
docker compose exec superset python /app/superset_register.py
```
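For reference, a minimal sketch of what the bucket setup amounts to; the bucket names and region below are placeholders, while the real `scripts/setup_s3.py` reads its configuration from `.env`:

```python
import boto3

# Placeholder values; the actual script takes these from .env.
BUCKETS = ["log-analysis-raw", "log-analysis-silver"]
REGION = "us-east-1"

s3 = boto3.client("s3", region_name=REGION)

for bucket in BUCKETS:
    # us-east-1 rejects an explicit LocationConstraint, hence the branch.
    if REGION == "us-east-1":
        s3.create_bucket(Bucket=bucket)
    else:
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": REGION},
        )
    print(f"Created bucket: {bucket}")
```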
Go to http://localhost:8080 (admin/admin) and trigger the DAG `nginx_ingestion`. It will automatically cascade:

1. `nginx_ingestion` (Extract/Load to Bronze)
2. `nginx_processing` (Spark Transform to Silver)
3. `nginx_warehouse` (MotherDuck View Setup + dbt Models + Data Quality Tests)
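The cascade is the standard Airflow hand-off pattern; here is a minimal, illustrative sketch using `TriggerDagRunOperator` (the actual DAG files in `dags/` may wire it differently):

```python
# Illustrative only: how DAG A could trigger DAG B at its tail.
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="nginx_ingestion",
    start_date=datetime(2026, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # ... extract/load tasks run first ...

    trigger_processing = TriggerDagRunOperator(
        task_id="trigger_nginx_processing",
        trigger_dag_id="nginx_processing",  # hands off to DAG B
    )
```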
Go to http://localhost:8088 (admin/admin). Your datasets are pre-registered! Apply `nginx_dark_theme.css` and arrange your charts.
```
log-analysis/
├── dags/
│   ├── nginx_ingestion.py       # DAG A: Kaggle → S3
│   ├── nginx_processing.py      # DAG B: Spark Transformations
│   └── nginx_warehouse.py       # DAG C: View Mapping & dbt Execution
├── dbt/logs_analytics/
│   ├── models/
│   │   ├── staging/             # Base views and surrogate key hashing
│   │   ├── core/                # Kimball Dimensional Model (Dim/Fact)
│   │   └── dashboard/           # Pre-aggregated tables for BI
│   └── schema.yml               # Source mapping & Data Quality tests
├── docker/
│   ├── airflow/Dockerfile       # Custom Airflow + boto3 + dbt
│   ├── spark/Dockerfile         # Spark + Hadoop-AWS jars
│   └── superset/Dockerfile      # Superset + duckdb-engine
├── scripts/                     # Registration & setup utilities
├── spark/
│   └── transform.py             # Ultra-fast PySpark CSV processor
└── superset/
    └── dashboards/              # Custom CSS themes
```
- **Zero-Local-Disk IO**: The `boto3` stream fetches from Kaggle and pushes multipart chunks directly to S3 through in-memory buffers, bypassing `/tmp` entirely so the 3.5GB logs can be ingested on a free-tier VPS (first sketch below).
- **Spark Broadcasting**: Hostname enrichment relies on Spark's `broadcast()` join, which keeps the IP lookup map-side and avoids shuffling the large log dataset (second sketch below).
- **MotherDuck Zero-Copy**: The Data Warehouse does not copy data. `nginx_silver_view` is an `httpfs` mapping directly over the partitioned, compressed `.parquet` files in S3 (third sketch below).
- **Strict Dimensional Architecture**: Data is modeled as a Kimball Star Schema for fast aggregation queries.
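A minimal sketch of the streaming idea, assuming the source file is reachable at a plain HTTP URL (the real DAG authenticates through the Kaggle API, and the URL, bucket, and key below are placeholders):

```python
import boto3
import requests

SOURCE_URL = "https://example.com/access.log"  # placeholder for the Kaggle download URL
S3_BUCKET = "log-analysis-raw"                 # placeholder bucket name
S3_KEY = "raw/access.log"                      # placeholder object key

s3 = boto3.client("s3")

# upload_fileobj reads the file-like response body in chunks and performs a
# multipart upload, so the 3.5GB file is relayed through memory buffers and
# never touches local disk.
with requests.get(SOURCE_URL, stream=True) as resp:
    resp.raise_for_status()
    resp.raw.decode_content = True  # let urllib3 transparently decompress gzip
    s3.upload_fileobj(resp.raw, S3_BUCKET, S3_KEY)
```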
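Next, a sketch of the broadcast-join enrichment; the column name `ip` and the S3 paths are illustrative, while the real logic lives in `spark/transform.py`:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("nginx-enrichment").getOrCreate()

logs = spark.read.csv("s3a://log-analysis-raw/raw/", header=True)   # large: 3.5GB of logs
hosts = spark.read.csv("client_hostname.csv", header=True)          # small lookup table

# broadcast() ships the small hostname table to every executor, so the join
# is executed map-side and the large log dataset is never shuffled.
enriched = logs.join(broadcast(hosts), on="ip", how="left")

# Spark writes Snappy-compressed Parquet by default.
enriched.write.mode("overwrite").parquet("s3a://log-analysis-silver/enriched/")
```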
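Finally, a sketch of the zero-copy view; the database name and S3 path are placeholders, and S3 credentials are assumed to be configured on the MotherDuck side:

```python
import duckdb

# Connects to MotherDuck; the service token is read from MOTHERDUCK_TOKEN.
con = duckdb.connect("md:logs_analytics")  # placeholder database name

# The view maps straight onto the Parquet files in S3: nothing is copied into
# the warehouse, and queries scan the files through httpfs at read time.
con.execute("""
    CREATE OR REPLACE VIEW nginx_silver_view AS
    SELECT * FROM read_parquet('s3://log-analysis-silver/enriched/*.parquet')
""")
```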