$ pip install dehub-knowledge --upgrade
✓ Loaded: 500+ Resources | 50+ Tools | 10+ Roadmaps | 1 Community

The command center for data engineers worldwide. From zero to petabyte scale: everything you need to become an elite data engineer.
🚀 Get Started • 📚 Resources • 🗺️ Roadmap • 🛠️ Tools • 🌐 Website • 🤝 Contribute
╔══════════════════════════════════════════════════════════════╗
║ dehub@engineer:~$ ║
╠══════════════════════════════════════════════════════════════╣
║ ║
║ $ dbt run --select +orders_mart ║
║ [INFO] Running with dbt=1.8.0 ║
║ [INFO] Found 47 models, 312 tests ║
║ ✓ Completed successfully ║
║ ║
║ $ spark-submit --master yarn pipeline.py ║
║ [INFO] SparkContext initialized ║
║ [INFO] Processing 2.4B records... ║
║ ✓ Job completed: 847s elapsed ║
║ ║
║ $ kafka-topics --create --topic events --partitions 12 ║
║ ✓ Created topic "events" ║
║ ║
║ $ SELECT COUNT(*) FROM iceberg.prod.events; ║
║ > 1,427,893,004 rows ║
║ ║
╚══════════════════════════════════════════════════════════════╝
New to Data Engineering? Follow this path:
📍 You are here
│
▼
[1] 📖 Read "Fundamentals of Data Engineering" (Joe Reis & Matt Housley)
│
▼
[2] 🐍 Learn Python + SQL to professional level
│
▼
[3] 🌊 Build your first pipeline (Batch → Streaming)
│
▼
[4] ☁️ Get cloud certified (AWS/GCP/Azure)
│
▼
[5] 🚀 Contribute to open-source, get hired
Experienced Engineer? Jump to:
- Advanced Tools & Frameworks
- System Design Resources
- Senior-level Interview Prep
- Open Source Projects
| Skill | Resources | Status |
|---|---|---|
| Python for Data Engineering | Python for Data Analysis | 🔥 Essential |
| SQL Mastery | Mode SQL Tutorial | 🔥 Essential |
| Linux & Bash | The Linux Command Line | ✅ Required |
| Git & Version Control | Pro Git Book | ✅ Required |
| Basic ETL Concepts | Fundamentals of Data Engineering | 🔥 Essential |
| Docker Basics | Docker Official Docs | ✅ Required |
| Skill | Resources | Status |
|---|---|---|
| Apache Spark | Spark: The Definitive Guide | 🔥 Essential |
| Apache Airflow | Astronomer Guides | 🔥 Essential |
| dbt (Data Build Tool) | dbt Learn | 🔥 Essential |
| Cloud Platforms | AWS/GCP/Azure Certifications | ✅ Required |
| Data Modeling | Kimball - Data Warehouse Toolkit | 🔥 Essential |
| Kafka/Streaming | Confluent Kafka Tutorials | ⭐ Recommended |
| Skill | Resources | Status |
|---|---|---|
| Data Architecture | Designing Data-Intensive Applications | 🔥 Essential |
| Apache Iceberg | Apache Iceberg: The Definitive Guide | 🔥 Essential |
| Real-Time Streaming | Streaming Systems | ⭐ Recommended |
| Data Mesh | Data Mesh (Zhamak Dehghani) | ⭐ Recommended |
| ML Engineering | Designing Machine Learning Systems | ⭐ Recommended |
| System Design | ByteByteGo | ✅ Required |
| Skill | Focus |
|---|---|
| Data Platform Architecture | Design multi-cloud, multi-region data platforms |
| Cost Optimization | FinOps for data, query optimization at scale |
| Team Leadership | Building and mentoring data engineering teams |
| Open Source Contribution | Build tooling the community relies on |
| # | Book | Author | Level |
|---|---|---|---|
| 1 | Fundamentals of Data Engineering | Joe Reis & Matt Housley | All Levels |
| 2 | Designing Data-Intensive Applications | Martin Kleppmann | Intermediate+ |
| 3 | Designing Machine Learning Systems | Chip Huyen | Advanced |
View all 35+ books
- Fundamentals of Data Engineering
- Designing Data-Intensive Applications
- Data Engineering Design Patterns
- 97 Things Every Data Engineer Should Know
- Data Pipelines Pocket Reference
- Spark: The Definitive Guide
- Learning Spark, 2nd Edition (Free PDF)
- High Performance Spark
- Modern Data Engineering with Apache Spark
- Data Engineering with AWS
- Delta Lake: The Definitive Guide
- Apache Iceberg: The Definitive Guide
- Architecting an Apache Iceberg Lakehouse
- Snowflake Data Engineering
- Data Mesh
- Deciphering Data Architectures
- Data Management at Scale, 2nd Edition
- Data Governance: The Definitive Guide
- Building Evolutionary Architectures, 2nd Edition
- Kimball - The Data Warehouse Toolkit
| Name | Provider | Level | Duration |
|---|---|---|---|
| Data Engineering Zoomcamp | DataTalks.Club | Beginner | 9 weeks |
| Beginner Data Engineering Bootcamp | DataExpert.io | Beginner | 4 weeks |
| Intermediate Bootcamp | DataExpert.io | Intermediate | 6 weeks |
| DE Fundamentals | Databricks | All Levels | Self-paced |
| Name | Provider | Level |
|---|---|---|
| Data Expert Courses | Zach Wilson | All Levels |
| dbt Fundamentals | dbt Labs | Beginner |
| Astronomer Certification | Astronomer | Intermediate |
| Databricks Certified Associate | Databricks | Intermediate |
| Snowflake SnowPro Core | Snowflake | Intermediate |
| Google Cloud Professional Data Engineer | Google Cloud | Advanced |
| AWS Data Analytics Specialty | AWS | Advanced |
| Tool | Stars | Description |
|---|---|---|
| Apache Airflow | - | Industry-standard workflow orchestrator |
| Dagster | - | Data-aware orchestration platform |
| Prefect | - | Modern workflow automation |
| Mage | - | Modern data pipeline tool |
| Kestra | - | Declarative orchestration |
| Hamilton | - | Function-based DAG framework |
| Tool | Stars | Description |
|---|---|---|
| Apache Iceberg | - | Open table format for huge datasets |
| Delta Lake | - | ACID transactions for big data |
| Apache Hudi | - | Incremental data processing |
| Apache Polaris | - | Open catalog for Apache Iceberg |
| DuckLake | - | SQL-native lakehouse |
| Tool | Description |
|---|---|
| Snowflake | Cloud-native data warehouse |
| Google BigQuery | Serverless data warehouse |
| Databricks | Unified analytics platform |
| Amazon Redshift | AWS data warehouse |
| Firebolt | Ultra-fast cloud warehouse |
| Databend | Open-source cloud DW |
| ClickHouse | Real-time analytics |
| Tool | Stars | Description |
|---|---|---|
| Apache Spark | - | Unified analytics engine |
| Apache Flink | - | Stream & batch processing |
| DuckDB | - | In-process analytical DB |
| Trino | - | Distributed SQL query engine |
| Polars | - | Fast DataFrame library |
| Tool | Stars | Description |
|---|---|---|
| dbt | - | SQL-first transformation |
| SQLMesh | - | Next-gen dbt alternative |
| Coalesce | - | Cloud-native transformation |
| Tool | Stars | Description |
|---|---|---|
| Great Expectations | - | Data quality framework |
| Soda | - | Data quality platform |
| dbt tests | - | Built-in dbt testing |
| Metaplane | - | Data observability |
| DQOps | - | Automated data quality |
| Tool | Description |
|---|---|
| Apache Kafka | Distributed event streaming |
| Apache Pulsar | Cloud-native messaging |
| Redpanda | Kafka-compatible streaming |
| Confluent | Managed Kafka platform |
| AWS Kinesis | Real-time data streaming |
| Tool | Description |
|---|---|
| Apache Atlas | Data governance & metadata |
| DataHub | Modern metadata platform |
| OpenMetadata | Open-source data catalog |
| Amundsen | Data discovery & metadata |
| Tool | Description |
|---|---|
| Airbyte | Open-source data integration |
| Fivetran | Automated data movement |
| Debezium | CDC (Change Data Capture) |
| Apache NiFi | Data flow automation |
| dlt | Python data load tool |
| Tool | Description |
|---|---|
| Apache Superset | Open-source BI |
| Metabase | Business intelligence |
| Grafana | Observability & analytics |
| Redash | Query & visualization |
Build these projects to demonstrate real-world data engineering skills:
- End-to-End NYC Taxi Data Pipeline
  - Tools: Python, BigQuery, Looker Studio
  - Skills: ETL, cloud storage, BI visualization
-
  - Tools: Airflow, PostgreSQL, dbt
  - Skills: Orchestration, scheduling, transformation
- Extract YouTube Metadata
  - GitHub Project
  - Tools: AWS Lambda, S3, Free Tier
  - Skills: Serverless, cloud storage, API ingestion
-
  - GitHub
  - Tools: S3, Spark, Delta Lake, Dagster, Superset
  - Skills: Lakehouse, orchestration, visualization
- Azure End-to-End Analytics Platform
  - Tools: ADF, ADLS, Databricks, Synapse, Power BI
  - Skills: Azure ecosystem, medallion architecture
- LLM Data Pipeline
  - Lecture
  - Tools: OpenAI API, vector databases, Airflow
  - Skills: AI integration, vector search
- Real-Time Streaming Analytics
  - Tools: Kafka, Flink, Iceberg, Grafana
  - Skills: Event-driven architecture, stream processing
- SQL Query Engine with LLMs
  - Tutorial
  - Tools: LangChain, LLMs, databases
  - Skills: AI-powered tooling
- Multi-Cloud Lakehouse
  - Tools: Apache Iceberg, AWS + GCP, dbt, Airflow
  - Skills: Cloud-agnostic architecture
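
Before picking a stack, it can help to see the extract, transform, load shape that most of the projects above share. The following is only a minimal sketch using the Python standard library: the sample CSV, the `trips` table, and the function names are hypothetical stand-ins, and SQLite substitutes for a real warehouse such as BigQuery.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a real source file or API response.
RAW_CSV = """trip_id,pickup_zone,fare
1,Midtown,12.50
2,Harlem,8.00
3,Midtown,
4,SoHo,15.25
"""


def extract(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows with a missing fare and cast the remaining fields."""
    cleaned = []
    for row in rows:
        if not row["fare"]:
            continue  # skip incomplete records instead of loading bad data
        cleaned.append((int(row["trip_id"]), row["pickup_zone"], float(row["fare"])))
    return cleaned


def load(conn, records):
    """Upsert records; the primary key keeps reruns from creating duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trips ("
        "trip_id INTEGER PRIMARY KEY, pickup_zone TEXT, fare REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO trips VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
load(conn, transform(extract(RAW_CSV)))  # second run simulates a retry

fare_by_zone = conn.execute(
    "SELECT pickup_zone, SUM(fare) FROM trips GROUP BY pickup_zone ORDER BY pickup_zone"
).fetchall()
print(fare_by_zone)
```

Running the load twice is deliberate: an idempotent load step is what lets an orchestrator retry a failed task safely.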
Join these communities to learn, network, and grow as a data engineer.
| Community | Members | Focus |
|---|---|---|
| DataExpert.io Discord | 10,000+ | Data Engineering |
| AdalFlow | - | AI/ML Engineering |
| Chip Huyen MLOps | 10,000+ | ML Operations |
| Community | Focus |
|---|---|
| Data Talks Club | Data Science & Engineering |
| dbt Community | dbt, Analytics Engineering |
| Great Expectations | Data Quality |
| Prefect | Workflow Orchestration |
| Community | Platform | Focus |
|---|---|---|
| Data Engineer Things | Newsletter/Community | Data Engineering |
| r/dataengineering | Reddit | Data Engineering |
| r/apachespark | Reddit | Apache Spark |
| LinkedIn DE Community | LinkedIn | Professional Networking |
| Podcast | Host | Topics |
|---|---|---|
| The Data Engineering Show | Firebolt | DE tools & practices |
| Data Engineering Podcast | Tobias Macey | Open-source data tools |
| DataTopics | - | Data engineering trends |
| DataAware | Ascend.io | Data pipelines |
| The Data Stack Show | - | Modern data stack |
| Analytics Power Hour | - | Analytics & data |
| Drill to Detail | Mark Rittman | Analytics engineering |
| Newsletter | Author | Topics |
|---|---|---|
| DataEngineer.io Newsletter | Zach Wilson | DE career & tech |
| The Developing Dev | Ryan Peterman | Engineering growth |
| Data Engineering Weekly | Ananth Packkildurai | DE news |
| Benn Stancil's Newsletter | Benn Stancil | Data strategy |
| Ahead of the Trend | - | Data trends |
| Seattle Data Guy | Ben Rogojan | DE tips |
| Channel | Focus | Subscribers |
|---|---|---|
| Zach Wilson | Data Engineering Career | 50,000+ |
| Seattle Data Guy | DE Interviews & Tips | 50,000+ |
| Andreas Kretz | Data Engineering School | 50,000+ |
| ByteByteGo | System Design | 1,000,000+ |
| Alex The Analyst | Data Analysis | 700,000+ |
Technical Interview Topics:
├── SQL
│ ├── Window Functions (ROW_NUMBER, RANK, LAG, LEAD)
│ ├── CTEs and Recursive CTEs
│ ├── Query Optimization & EXPLAIN plans
│ └── Aggregations & Subqueries
├── Python
│ ├── PySpark DataFrames
│ ├── Pandas/Polars operations
│ ├── Generators & Iterators
│ └── OOP for data pipelines
├── System Design
│ ├── Design a data warehouse
│ ├── Design a real-time analytics system
│ ├── Design a CDC pipeline
│ └── Design a data lake
├── Data Modeling
│ ├── Star Schema vs Snowflake
│ ├── Slowly Changing Dimensions (SCD)
│ ├── Data Vault 2.0
│ └── Kimball vs Inmon
└── Infrastructure
  ├── Docker & Kubernetes
  ├── Cloud platforms (AWS/GCP/Azure)
  ├── CI/CD for data pipelines
  └── Monitoring & Alerting
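
As a practice aid for the SQL branch of the tree above, here is a small sketch that combines a CTE with `ROW_NUMBER` and `LAG` to find each customer's latest order and the amount of the order before it. The `orders` table and its rows are made up for illustration, and the example assumes the SQLite bundled with your Python is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and data, used only to exercise the query.
conn.execute("CREATE TABLE orders (customer TEXT, order_day INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("ana", 1, 20.0), ("ana", 3, 35.0), ("ana", 7, 10.0), ("bo", 2, 50.0), ("bo", 5, 5.0)],
)

query = """
WITH ranked AS (
    SELECT
        customer,
        order_day,
        amount,
        -- 1 = most recent order per customer
        ROW_NUMBER() OVER (PARTITION BY customer ORDER BY order_day DESC) AS recency,
        -- amount of the chronologically previous order, NULL for the first one
        LAG(amount) OVER (PARTITION BY customer ORDER BY order_day) AS prev_amount
    FROM orders
)
SELECT customer, order_day, amount, prev_amount
FROM ranked
WHERE recency = 1
ORDER BY customer
"""

latest_orders = conn.execute(query).fetchall()
print(latest_orders)  # [('ana', 7, 10.0, 35.0), ('bo', 5, 5.0, 50.0)]
```

The CTE is needed because a window function cannot be referenced in the `WHERE` clause of the same `SELECT`; filtering on `recency` has to happen one level up.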
- Data Engineering Interview Questions — Comprehensive list
- ByteByteGo System Design — System design prep
- SQLZoo — SQL practice
- LeetCode Database Problems — SQL challenges
- Glassdoor DE Interview Reviews — Company-specific prep
- Quantify impact: "Reduced pipeline runtime by 67%" beats "improved pipeline"
- Include GitHub links to real projects
- Mention data volumes (TB, PB scale)
- List certifications (AWS, GCP, Databricks, dbt)
| Level | YoE | USA Salary Range |
|---|---|---|
| Junior DE | 0–2 | $80k–$120k |
| Mid-level DE | 2–5 | $120k–$170k |
| Senior DE | 5–8 | $160k–$220k |
| Staff DE | 8–12 | $200k–$280k |
| Principal DE | 12+ | $260k–$400k+ |
Source: levels.fyi, LinkedIn Salary, Glassdoor (2024)
| Name | Profile | Followers |
|---|---|---|
| Zach Wilson | EcZachly | 100,000+ |
| Seattle Data Guy | SeattleDataGuy | 50,000+ |
| Andreas Kretz | Andreas Kretz | 50,000+ |
| Lior Gavish | Lior Gavish | 30,000+ |
| Name | Handle | Followers |
|---|---|---|
| ByteByteGo | @alexxubyte | 500,000+ |
| Dan Kornas | @dankornas | 66,000+ |
| Zach Wilson | @EcZachly | 30,000+ |
| Seattle Data Guy | @SeattleDataGuy | 10,000+ |
Contributions are what make the open source community amazing! Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingResource`)
- Commit your Changes (`git commit -m 'Add some AmazingResource'`)
- Push to the Branch (`git push origin feature/AmazingResource`)
- Open a Pull Request
See CONTRIBUTING.md for detailed guidelines.
Distributed under the MIT License. See LICENSE for more information.
Built with ❤️ for the Data Engineering Community
🌐 Website • ⭐ Star this repo • 🐛 Report Bug • 💡 Request Feature