Learning Data Engineering

A complete, containerized data engineering learning platform
Master modern data engineering in 6 months • No local installations beyond Git and Docker • Production-ready projects

Quick Start • Learning Path • Tech Stack • Contribute

Welcome to Your Data Engineering Journey!

Tired of piecing together scattered tutorials and wrestling with complex local setups? You've found the solution.

This isn't just another tutorial repository—it's a complete, production-ready learning environment that mirrors real-world data engineering workflows. Whether you're transitioning into data engineering, leveling up your skills, or building your portfolio, this platform provides everything you need in one place.

Why This Platform Exists

Data engineering is one of the fastest-growing fields in tech, but learning it effectively requires:

Real infrastructure (not just isolated code examples)
Production patterns (not just theoretical concepts)
Portfolio projects (not just hello-world tutorials)
Community support (not just solo learning)

We've built the platform we wish existed when we started our data engineering journeys.

What Makes This Different?

Traditional Learning	This Platform
Scattered tutorials	Structured 6-month blueprint
Local installations	100% containerized
Theoretical concepts	Real portfolio projects
Solo learning	Community-driven
Hello-world examples	Production-grade code
Static content	Active development

One-Command Setup

Windows Users: Use Git Bash • Mac/Linux Users: Use Terminal

git clone https://github.com/marlonribunal/learning-data-engineering.git
cd learning-data-engineering
./bootstrap.sh

That's it! The environment builds automatically, and once the build finishes the services are available at:

Service	URL	Credentials
Airflow	http://localhost:8080	`admin` / `admin`
Streamlit Dashboard	http://localhost:8501	-
PGAdmin	http://localhost:8081	`admin@datamart.com` / `admin`
Redpanda Console	http://localhost:8082	-
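
If you want to confirm that the services have finished starting, a small script can poll the URLs from the table above. This is a minimal sketch rather than part of the repository: the `is_up` helper and the `SERVICES` mapping are names made up for illustration, and any HTTP response (including a login redirect or an error status) is treated as "up".

```python
import http.client
import urllib.error
import urllib.request

# URLs copied from the service table above; adjust them if you remap ports.
SERVICES = {
    "Airflow": "http://localhost:8080",
    "Streamlit Dashboard": "http://localhost:8501",
    "PGAdmin": "http://localhost:8081",
    "Redpanda Console": "http://localhost:8082",
}


def is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if anything answers HTTP at `url`, whatever the status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied (for example 401 or 404), so the service is running.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a non-HTTP reply: not reachable yet.
        return False


if __name__ == "__main__":
    for name, url in SERVICES.items():
        status = "up" if is_up(url) else "not reachable"
        print(f"{name:<20} {url:<24} {status}")
```

Containers can take a few minutes to become reachable after the first build, so a "not reachable" result right after `./bootstrap.sh` does not necessarily mean something is broken.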

Complete Learning Path

Learning Cadence: By Sprint
Each sprint represents 2 weeks of focused learning, following the agile methodology used by professional data teams. This structure keeps progress steady while you build real portfolio projects.

Phase 1: Foundations (Months 1-2)

Project: E-Commerce Data Pipeline

Sprint	Focus	Skills
1	Cloud Data Ingestion	BigQuery, Python, Data Validation
2	Modern Transformation	dbt Core, Star Schema, Data Modeling
3	Workflow Orchestration	Airflow, DAGs, Task Dependencies
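
To give a feel for what "DAGs" and "task dependencies" mean before Sprint 3, the sketch below orders a small pipeline using only the Python standard library. It is not Airflow code and does not use Airflow's API. The task names are hypothetical, and `graphlib` stands in for the scheduling logic an orchestrator provides.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "extract_orders": set(),
    "extract_customers": set(),
    "build_star_schema": {"extract_orders", "extract_customers"},
    "run_quality_tests": {"build_star_schema"},
}


def execution_order(dependencies):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(execution_order(DEPENDENCIES))

    # A cycle means the graph is not a DAG, so no valid order exists.
    try:
        execution_order({"a": {"b"}, "b": {"a"}})
    except CycleError as exc:
        print("cycle detected:", exc.args[1])
```

The two extract tasks have no dependency on each other, so an orchestrator is free to run them in either order or in parallel. Only the edges of the graph constrain the schedule.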

Phase 2: Scaling (Months 3-4)

Project: Hybrid Cloud Platform

Sprint	Focus	Skills
4	Big Data Processing	Spark, Databricks, Distributed Computing
5	Hybrid Pipelines	Cloud Integration, API Orchestration
6	Data Quality	Testing Frameworks, Monitoring, Alerting

Phase 3: Real-time (Months 5-6)

Project: Real-time Intelligence Platform

Sprint	Focus	Skills
7	Streaming Data	Kafka/Redpanda, Event Processing
8	Real-time Analytics	Spark Streaming, Stateful Processing
9	Unified Dashboards	Streamlit, Real-time Visualization
10-12	Portfolio & Career	Interviews, System Design, Job Search

For detailed daily breakdowns, weekly goals, and specific learning objectives for each sprint, see the Complete Learning Blueprint.

Tech Stack

Category	Technologies
Orchestration	Apache Airflow
Processing	Python, Pandas, PySpark
Transformation	dbt Core
Warehousing	BigQuery, PostgreSQL
Streaming	Redpanda, Spark Streaming
Dashboard	Streamlit, Plotly
Infrastructure	Docker, Docker Compose

**Flexibility Note:** While we use free-tier and open-source tools to make learning accessible, feel free to swap any component for a tool of your choice. The architecture is designed to be modular: replace BigQuery with Snowflake, Airflow with Prefect, or Redpanda with Kafka based on your preferences or workplace requirements.

What's Included

learning-data-engineering/
├── 🐳 Complete Containerized Environment
├── 📚 12 Detailed Sprint Guides
├── 🛠️ Production-Ready Projects
├── 📊 Example Data Pipeline (Datamart Intelligence Platform)
├── 📖 Comprehensive Documentation
└── 🎯 Interview Preparation Materials

Featured Project: Datamart Intelligence Platform

A complete data platform for a fictional e-commerce company featuring:

Batch Processing: Daily ETL with data quality checks
Real-time Analytics: Streaming order processing
Hybrid Architecture: Local orchestration + cloud processing
Data Governance: Comprehensive quality monitoring
Business Intelligence: Interactive Streamlit dashboard
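
As an illustration of the kind of data quality check a daily batch job might run, here is a minimal sketch in plain Python. It is not taken from the project: the `order_id` and `amount` fields, the `QualityReport` class, and the three rules are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class QualityReport:
    """Collects failed checks as {check_name: [row indexes]}."""
    total: int = 0
    failures: dict = field(default_factory=dict)

    @property
    def passed(self) -> bool:
        return not self.failures


def check_orders(rows):
    """Run three example rules over a list of order dicts (hypothetical schema)."""
    report = QualityReport(total=len(rows))
    seen_ids = set()
    for index, row in enumerate(rows):
        order_id = row.get("order_id")
        amount = row.get("amount")

        # Rule 1 and 2: the key must be present and unique.
        if order_id is None:
            report.failures.setdefault("missing_order_id", []).append(index)
        elif order_id in seen_ids:
            report.failures.setdefault("duplicate_order_id", []).append(index)
        else:
            seen_ids.add(order_id)

        # Rule 3: the amount must be a non-negative number.
        if not isinstance(amount, (int, float)) or amount < 0:
            report.failures.setdefault("invalid_amount", []).append(index)
    return report


if __name__ == "__main__":
    sample = [
        {"order_id": 1, "amount": 19.99},
        {"order_id": 1, "amount": 5.00},    # duplicate key
        {"order_id": None, "amount": -3},   # missing key, negative amount
    ]
    report = check_orders(sample)
    print("passed:", report.passed)
    print("failures:", report.failures)
```

In a real pipeline the same idea is usually expressed with dbt tests or a validation library. Reporting failures by row index, rather than stopping at the first bad row, makes it easier to decide whether to quarantine records or fail the whole run.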

Quick Commands

# Start all services
./scripts/start.sh

# Stop services
./scripts/stop.sh

# View service logs
./scripts/logs.sh [service-name]

# Complete cleanup (removes all data)
./scripts/destroy.sh

# Access containers
docker-compose exec airflow-webserver bash
docker-compose exec dbt-service dbt run

Join the Community!

Call for Contributors

Are you a data engineer, data scientist, or aspiring data professional?
We're building the most comprehensive open-source data engineering learning platform, and we need your expertise!

How You Can Contribute

For Senior Data Engineers:

Add advanced patterns: CDC, data mesh, ML pipelines
Create real-world case studies: E-commerce, fintech, healthcare
Contribute production-grade code: Error handling, monitoring, optimization
Mentor: Code reviews, best practices, architecture guidance

For Intermediate Practitioners:

Expand project examples: Add new data sources, transformations
Create cheat sheets: Your favorite tools, optimization techniques
Write tutorials: Debugging guides, performance tuning
Improve documentation: Clarify concepts, add examples

For Beginners:

Test the learning path: Provide feedback on clarity and progression
Report issues: Found something confusing? Let us know!
Suggest improvements: What would help you learn better?
Share your journey: Blog posts, success stories

Contribution Areas

Area	Examples	Skill Level
Data Pipelines	Add CDC, error handling, monitoring	Intermediate+
dbt Models	Advanced patterns, custom tests	All Levels
Airflow DAGs	Complex dependencies, custom operators	Intermediate+
Streaming	Kafka connectors, stateful processing	Advanced
Dashboard	New visualizations, real-time features	All Levels
Documentation	Guides, tutorials, best practices	All Levels

First Time Contributors

Good first issues:

Add more dbt test examples
Create additional Streamlit visualizations
Write a troubleshooting guide for common setup issues
Add more SQL query examples
Create a glossary of data engineering terms

Contribution Process

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Project Roadmap

Phase 1: Core platform (Complete)
Phase 2: Advanced patterns (In Progress)
Phase 3: Real-world case studies (Planned)
Phase 4: Enterprise features (Future)

Troubleshooting

Common Issues & Solutions

Issue	Solution
Port conflicts	Check that ports 8080, 8081, 8082, and 8501 are free
Docker not running	Start Docker Desktop first
Low memory	Allocate 4-8 GB of RAM to Docker
Windows permissions	Use Git Bash instead of PowerShell
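
To check for port conflicts before starting the stack, you can probe the four ports from the table above. The following is a minimal sketch, not a script shipped with the repository. The function names are made up for illustration, and the check only detects processes that accept TCP connections on `127.0.0.1`.

```python
import socket

# Host ports listed in the troubleshooting table above.
REQUIRED_PORTS = [8080, 8081, 8082, 8501]


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if some process is already accepting connections on the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


def busy_ports(ports=REQUIRED_PORTS):
    """Return the subset of `ports` that are currently taken."""
    return [port for port in ports if port_in_use(port)]


if __name__ == "__main__":
    taken = busy_ports()
    if taken:
        print("Ports already in use:", ", ".join(str(p) for p in taken))
    else:
        print("All required ports are free.")
```

Run this while the platform is stopped. If the stack is already up, its own containers will show these ports as in use.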

Reset Everything

./scripts/destroy.sh
./bootstrap.sh

Learning Resources

Full Learning Blueprint - Complete 6-month roadmap
Sprint-by-Sprint Guides - Detailed weekly plans
Tool Cheatsheets - Quick references
Project Documentation - Example implementations

License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with love for the data community
Inspired by modern data stack best practices
Supported by contributors worldwide

Ready to master data engineering?

Star this repo if you find it helpful!

**Get Started • Contribute • Learn More**

Join us in building the world's best data engineering learning platform!
