A complete, containerized data engineering learning platform
Master modern data engineering in 6 months • Zero local installations • Production-ready projects
Tired of piecing together scattered tutorials and wrestling with complex local setups? You've found the solution.
This isn't just another tutorial repository—it's a complete, production-ready learning environment that mirrors real-world data engineering workflows. Whether you're transitioning into data engineering, leveling up your skills, or building your portfolio, this platform provides everything you need in one place.
Data engineering is one of the fastest-growing fields in tech, but learning it effectively requires:
- Real infrastructure (not just isolated code examples)
- Production patterns (not just theoretical concepts)
- Portfolio projects (not just hello-world tutorials)
- Community support (not just solo learning)
We've built the platform we wish existed when we started our data engineering journeys.
| Traditional Learning | This Platform |
|---|---|
| Scattered tutorials | Structured 6-month blueprint |
| Local installations | 100% containerized |
| Theoretical concepts | Real portfolio projects |
| Solo learning | Community-driven |
| Hello-world examples | Production-grade code |
| Static content | Active development |
Windows Users: Use Git Bash • Mac/Linux Users: Use Terminal
```bash
git clone https://github.com/marlonribunal/learning-data-engineering.git
cd learning-data-engineering
./bootstrap.sh
```

That's it! Your complete data engineering environment builds automatically and will be ready at:
| Service | URL | Credentials |
|---|---|---|
| Airflow | http://localhost:8080 | admin / admin |
| Streamlit Dashboard | http://localhost:8501 | - |
| PGAdmin | http://localhost:8081 | admin@datamart.com / admin |
| Redpanda Console | http://localhost:8082 | - |
Learning Cadence: By Sprint
Each sprint represents two weeks of focused learning, following the agile sprint methodology used by professional data teams. This structure ensures steady progress while you build real portfolio projects.
Project: E-Commerce Data Pipeline
| Sprint | Focus | Skills |
|---|---|---|
| 1 | Cloud Data Ingestion | BigQuery, Python, Data Validation |
| 2 | Modern Transformation | dbt Core, Star Schema, Data Modeling |
| 3 | Workflow Orchestration | Airflow, DAGs, Task Dependencies |
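The data-validation work in Sprint 1 can be sketched in plain Python. The field names (`order_id`, `amount`, `ordered_at`) are illustrative, not the repository's actual schema:

```python
from datetime import datetime

def validate_order(record: dict) -> list:
    """Return a list of validation errors for one raw order record (empty list = valid)."""
    errors = []
    if not record.get("order_id"):
        errors.append("missing order_id")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append("amount must be a non-negative number")
    try:
        datetime.fromisoformat(record.get("ordered_at", ""))
    except ValueError:
        errors.append("ordered_at is not an ISO-8601 timestamp")
    return errors

# Split a batch into clean rows and rejects before loading to the warehouse
batch = [
    {"order_id": "A1", "amount": 19.99, "ordered_at": "2024-05-01T10:00:00"},
    {"order_id": "", "amount": -5, "ordered_at": "not-a-date"},
]
clean = [r for r in batch if not validate_order(r)]
rejects = [(r, validate_order(r)) for r in batch if validate_order(r)]
```

Rejects carry their error lists along, so a pipeline can route them to a quarantine table instead of silently dropping them.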
Project: Hybrid Cloud Platform
| Sprint | Focus | Skills |
|---|---|---|
| 4 | Big Data Processing | Spark, Databricks, Distributed Computing |
| 5 | Hybrid Pipelines | Cloud Integration, API Orchestration |
| 6 | Data Quality | Testing Frameworks, Monitoring, Alerting |
Project: Real-time Intelligence Platform
| Sprint | Focus | Skills |
|---|---|---|
| 7 | Streaming Data | Kafka/Redpanda, Event Processing |
| 8 | Real-time Analytics | Spark Streaming, Stateful Processing |
| 9 | Unified Dashboards | Streamlit, Real-time Visualization |
| 10-12 | Portfolio & Career | Interviews, System Design, Job Search |
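The stateful-processing pattern from Sprints 7 and 8 can be illustrated without a running broker. This broker-free sketch keeps a per-customer running total over an event stream; in the actual project the events would arrive from a Redpanda topic rather than an in-memory list:

```python
from collections import defaultdict

def running_totals(events):
    """Consume order events one at a time, yielding (customer_id, total_so_far).

    The defaultdict plays the role of the keyed state store that a streaming
    engine such as Spark Structured Streaming would manage for you.
    """
    state = defaultdict(float)
    for event in events:
        key = event["customer_id"]
        state[key] += event["amount"]
        yield key, state[key]

events = [
    {"customer_id": "c1", "amount": 10.0},
    {"customer_id": "c2", "amount": 5.0},
    {"customer_id": "c1", "amount": 7.5},
]
totals = list(running_totals(events))
```

Because the function is a generator, it processes events incrementally, which is the essence of streaming: results are emitted per event, not after the batch completes.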
For detailed daily breakdowns, weekly goals, and specific learning objectives for each sprint, see the Complete Learning Blueprint.
| Category | Technologies |
|---|---|
| Orchestration | Apache Airflow |
| Processing | Python, Pandas, PySpark |
| Transformation | dbt Core |
| Warehousing | BigQuery, PostgreSQL |
| Streaming | Redpanda, Spark Streaming |
| Dashboard | Streamlit, Plotly |
| Infrastructure | Docker, Docker Compose |
**Flexibility Note:** While we use free-tier and open-source tools to make learning accessible, feel free to swap any component with tools of your choice! The architecture is designed to be modular: replace BigQuery with Snowflake, Airflow with Prefect, or Redpanda with Kafka based on your preferences or workplace requirements.
learning-data-engineering/
├── 🐳 Complete Containerized Environment
├── 📚 12 Detailed Sprint Guides
├── 🛠️ Production-Ready Projects
├── 📊 Example Data Pipeline (Datamart Intelligence Platform)
├── 📖 Comprehensive Documentation
└── 🎯 Interview Preparation Materials
A complete data platform for a fictional e-commerce company featuring:
- Batch Processing: Daily ETL with data quality checks
- Real-time Analytics: Streaming order processing
- Hybrid Architecture: Local orchestration + cloud processing
- Data Governance: Comprehensive quality monitoring
- Business Intelligence: Interactive Streamlit dashboard
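The quality-monitoring idea behind the governance layer can be sketched as a threshold check over per-batch metrics. The metric names and thresholds here are illustrative assumptions, not the repository's actual checks:

```python
def check_batch(metrics, min_rows=1, max_null_rate=0.05):
    """Compare one batch's metrics against thresholds; return alert messages."""
    alerts = []
    if metrics["row_count"] < min_rows:
        alerts.append(f"row_count {metrics['row_count']} below minimum {min_rows}")
    # Guard against division by zero on an empty batch
    null_rate = metrics["null_count"] / max(metrics["row_count"], 1)
    if null_rate > max_null_rate:
        alerts.append(f"null rate {null_rate:.1%} exceeds {max_null_rate:.0%}")
    return alerts

alerts = check_batch({"row_count": 100, "null_count": 12})
```

In a real pipeline these alerts would feed a notification task (e.g. an Airflow callback) rather than just a return value.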
```bash
# Start all services
./scripts/start.sh

# Stop services
./scripts/stop.sh

# View service logs
./scripts/logs.sh [service-name]

# Complete cleanup (removes all data)
./scripts/destroy.sh

# Access containers
docker-compose exec airflow-webserver bash
docker-compose exec dbt-service dbt run
```

Are you a data engineer, data scientist, or aspiring data professional?
We're building the most comprehensive open-source data engineering learning platform, and we need your expertise!
- Add advanced patterns: CDC, data mesh, ML pipelines
- Create real-world case studies: E-commerce, fintech, healthcare
- Contribute production-grade code: Error handling, monitoring, optimization
- Mentor: Code reviews, best practices, architecture guidance
- Expand project examples: Add new data sources, transformations
- Create cheat sheets: Your favorite tools, optimization techniques
- Write tutorials: Debugging guides, performance tuning
- Improve documentation: Clarify concepts, add examples
- Test the learning path: Provide feedback on clarity and progression
- Report issues: Found something confusing? Let us know!
- Suggest improvements: What would help you learn better?
- Share your journey: Blog posts, success stories
| Area | Examples | Skill Level |
|---|---|---|
| Data Pipelines | Add CDC, error handling, monitoring | Intermediate+ |
| dbt Models | Advanced patterns, custom tests | All Levels |
| Airflow DAGs | Complex dependencies, custom operators | Intermediate+ |
| Streaming | Kafka connectors, stateful processing | Advanced |
| Dashboard | New visualizations, real-time features | All Levels |
| Documentation | Guides, tutorials, best practices | All Levels |
Good first issues:
- Add more dbt test examples
- Create additional Streamlit visualization
- Write a troubleshooting guide for common setup issues
- Add more SQL query examples
- Create a glossary of data engineering terms
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Phase 1: Core platform (Complete)
- Phase 2: Advanced patterns (In Progress)
- Phase 3: Real-world case studies (Planned)
- Phase 4: Enterprise features (Future)
| Issue | Solution |
|---|---|
| Port conflicts | Check ports 8080, 8501, 8081, 8082 are free |
| Docker not running | Start Docker Desktop first |
| Low memory | Allocate 4-8GB RAM to Docker |
| Windows permissions | Use Git Bash instead of PowerShell |
If problems persist, reset the environment completely:

```bash
./scripts/destroy.sh
./bootstrap.sh
```

- Full Learning Blueprint - Complete 6-month roadmap
- Sprint-by-Sprint Guides - Detailed weekly plans
- Tool Cheatsheets - Quick references
- Project Documentation - Example implementations
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with ❤️ for the data community
- Inspired by modern data stack best practices
- Supported by contributors worldwide
Star this repo if you find it helpful!
**Get Started • Contribute • Learn More**
Join us in building the world's best data engineering learning platform!