Lutschippi/DEHUB




$ pip install dehub-knowledge --upgrade
✓ Loaded: 500+ Resources | 50+ Tools | 10+ Roadmaps | 1 Community

The command center for data engineers worldwide. From zero to petabyte-scale — everything you need to become an elite data engineer.


🚀 Get Started | 📚 Resources | 🗺️ Roadmap | 🛠️ Tools | 🌐 Website | 🤝 Contribute


📡 Live Terminal Preview

╔══════════════════════════════════════════════════════════════╗
║  dehub@engineer:~$                                           ║
╠══════════════════════════════════════════════════════════════╣
║                                                              ║
║  $ dbt run --select +orders_mart                             ║
║  [INFO] Running with dbt=1.8.0                               ║
║  [INFO] Found 47 models, 312 tests                          ║
║  ✓ Completed successfully                                    ║
║                                                              ║
║  $ spark-submit --master yarn pipeline.py                   ║
║  [INFO] SparkContext initialized                             ║
║  [INFO] Processing 2.4B records...                          ║
║  ✓ Job completed: 847s elapsed                              ║
║                                                              ║
║  $ kafka-topics --create --topic events --partitions 12     ║
║  ✓ Created topic "events"                                   ║
║                                                              ║
║  $ SELECT COUNT(*) FROM iceberg.prod.events;                ║
║  > 1,427,893,004 rows                                       ║
║                                                              ║
╚══════════════════════════════════════════════════════════════╝

🚀 Getting Started

New to Data Engineering? Follow this path:

📍 You are here
    │
    ▼
[1] 📖 Read "Fundamentals of Data Engineering" (Joe Reis & Matt Housley)
    │
    ▼
[2] 🐍 Learn Python + SQL to professional level
    │
    ▼
[3] 🌊 Build your first pipeline (Batch → Streaming)
    │
    ▼
[4] ☁️  Get cloud certified (AWS/GCP/Azure)
    │
    ▼
[5] 🚀 Contribute to open-source, get hired
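
Step [3] moves from batch to streaming. As a rough illustration of the difference, the sketch below computes the same per-window event counts two ways: once over a complete list (batch), and once incrementally from an iterator using tumbling windows (streaming). The event data, the function names, and the 60-second window are all hypothetical, and a real pipeline would typically rely on an engine such as Spark or Flink rather than plain Python.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, Iterator

# Hypothetical events: (epoch_seconds, event_type), already ordered by time.
EVENTS = [(0, "click"), (15, "view"), (59, "click"), (60, "view"), (125, "click")]

WINDOW_SECONDS = 60  # assumed tumbling-window size


def batch_counts(events: list[tuple[int, str]]) -> dict[int, int]:
    """Batch: the full dataset is available, so count everything in one pass."""
    return dict(Counter(ts // WINDOW_SECONDS for ts, _ in events))


def streaming_counts(events: Iterable[tuple[int, str]]) -> Iterator[tuple[int, int]]:
    """Streaming: emit (window_id, count) as soon as a window closes.

    Assumes events arrive in timestamp order (no late data).
    """
    current_window = None
    count = 0
    for ts, _ in events:
        window = ts // WINDOW_SECONDS
        if current_window is not None and window != current_window:
            yield current_window, count  # the previous window is complete
            count = 0
        current_window = window
        count += 1
    if current_window is not None:
        yield current_window, count  # flush the final, still-open window


batch_result = batch_counts(EVENTS)
stream_result = dict(streaming_counts(iter(EVENTS)))
```

Both approaches produce the same counts here; the streaming version simply never needs the whole dataset in memory, which is the property that matters once the input is unbounded.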

Experienced Engineer? Jump to:


🗺️ Roadmap

Beginner Path (0–6 months)

Skill | Resources | Status
Python for Data Engineering | Python for Data Analysis | 🔥 Essential
SQL Mastery | Mode SQL Tutorial | 🔥 Essential
Linux & Bash | The Linux Command Line | ✅ Required
Git & Version Control | Pro Git Book | ✅ Required
Basic ETL Concepts | Fundamentals of Data Engineering | 🔥 Essential
Docker Basics | Docker Official Docs | ✅ Required

Intermediate Path (6–18 months)

Skill | Resources | Status
Apache Spark | Spark: The Definitive Guide | 🔥 Essential
Apache Airflow | Astronomer Guides | 🔥 Essential
dbt (Data Build Tool) | dbt Learn | 🔥 Essential
Cloud Platforms | AWS/GCP/Azure Certifications | ✅ Required
Data Modeling | Kimball - Data Warehouse Toolkit | 🔥 Essential
Kafka/Streaming | Confluent Kafka Tutorials | ⭐ Recommended

Advanced Path (18+ months)

Skill | Resources | Status
Data Architecture | Designing Data-Intensive Applications | 🔥 Essential
Apache Iceberg | Apache Iceberg: The Definitive Guide | 🔥 Essential
Real-Time Streaming | Streaming Systems | ⭐ Recommended
Data Mesh | Data Mesh (Zhamak Dehghani) | ⭐ Recommended
ML Engineering | Designing Machine Learning Systems | ⭐ Recommended
System Design | ByteByteGo | ✅ Required

Expert Path

Skill | Focus
Data Platform Architecture | Design multi-cloud, multi-region data platforms
Cost Optimization | FinOps for data, query optimization at scale
Team Leadership | Building and mentoring data engineering teams
Open Source Contribution | Build tooling the community relies on

📚 Books — Must Read

Top 3 Essential Books

# | Book | Author | Level
1 | Fundamentals of Data Engineering | Joe Reis & Matt Housley | All Levels
2 | Designing Data-Intensive Applications | Martin Kleppmann | Intermediate+
3 | Designing Machine Learning Systems | Chip Huyen | Advanced

Complete Book List

35+ books, grouped by category:

  • Data Engineering Core
  • Apache Spark
  • Streaming
  • Cloud & Lakehouse
  • dbt & Transformation
  • Data Architecture
  • Machine Learning & AI
  • Analytics & Python


🎓 Courses & Bootcamps

Free Bootcamps

Name | Provider | Level | Duration
Data Engineering Zoomcamp | DataTalks.Club | Beginner | 9 weeks
Beginner Data Engineering Bootcamp | DataExpert.io | Beginner | 4 weeks
Intermediate Bootcamp | DataExpert.io | Intermediate | 6 weeks
DE Fundamentals | Databricks | All Levels | Self-paced

Premium Courses

Name | Provider | Level
Data Expert Courses | Zach Wilson | All Levels
dbt Fundamentals | dbt Labs | Beginner
Astronomer Certification | Astronomer | Intermediate
Databricks Certified Associate | Databricks | Intermediate
Snowflake SnowPro Core | Snowflake | Intermediate
Google Cloud Professional Data Engineer | Google | Advanced
AWS Data Analytics Specialty | AWS | Advanced

🛠️ Data Engineering Ecosystem

Orchestration

Tool | Description
Apache Airflow | Industry-standard workflow orchestrator
Dagster | Data-aware orchestration platform
Prefect | Modern workflow automation
Mage | Modern data pipeline tool
Kestra | Declarative orchestration
Hamilton | Function-based DAG framework

Data Lake / Lakehouse

Tool | Description
Apache Iceberg | Open table format for huge datasets
Delta Lake | ACID transactions for big data
Apache Hudi | Incremental data processing
Apache Polaris | Open catalog for Apache Iceberg
DuckLake | SQL-native lakehouse

Data Warehouse

Tool | Description
Snowflake | Cloud-native data warehouse
Google BigQuery | Serverless data warehouse
Databricks | Unified analytics platform
Amazon Redshift | AWS data warehouse
Firebolt | Ultra-fast cloud warehouse
Databend | Open-source cloud DW
ClickHouse | Real-time analytics

Processing Engines

Tool | Description
Apache Spark | Unified analytics engine
Apache Flink | Stream & batch processing
DuckDB | In-process analytical DB
Trino | Distributed SQL query engine
Polars | Fast DataFrame library

Transformation

Tool | Description
dbt | SQL-first transformation
SQLMesh | Next-gen dbt alternative
Coalesce | Cloud-native transformation

Data Quality

Tool | Description
Great Expectations | Data quality framework
Soda | Data quality platform
dbt tests | Built-in dbt testing
Metaplane | Data observability
DQOps | Automated data quality

Streaming & Messaging

Tool | Description
Apache Kafka | Distributed event streaming
Apache Pulsar | Cloud-native messaging
Redpanda | Kafka-compatible streaming
Confluent | Managed Kafka platform
AWS Kinesis | Real-time data streaming

Data Catalog & Governance

Tool | Description
Apache Atlas | Data governance & metadata
DataHub | Modern metadata platform
OpenMetadata | Open-source data catalog
Amundsen | Data discovery & metadata

Ingestion & Integration

Tool | Description
Airbyte | Open-source data integration
Fivetran | Automated data movement
Debezium | CDC (Change Data Capture)
Apache NiFi | Data flow automation
dlt | Python data load tool

Visualization

Tool | Description
Apache Superset | Open-source BI
Metabase | Business intelligence
Grafana | Observability & analytics
Redash | Query & visualization

💼 Projects & Portfolio

Build these projects to demonstrate real-world data engineering skills:

Beginner Projects

  1. End-to-End NYC Taxi Data Pipeline

    • Tools: Python, BigQuery, Looker Studio
    • Skills: ETL, cloud storage, BI visualization
  2. Weather Data Pipeline

    • Tools: Airflow, PostgreSQL, dbt
    • Skills: Orchestration, scheduling, transformation
  3. Extract YouTube Metadata

    • GitHub Project
    • Tools: AWS Lambda, S3, Free Tier
    • Skills: Serverless, cloud storage, API ingestion
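
The beginner projects above all follow the same extract, transform, load shape. The sketch below shows that shape end to end using only the Python standard library. Everything in it is an assumption made for illustration: the inline CSV stands in for a real API or file source, the in-memory SQLite database stands in for PostgreSQL or BigQuery, and the function and table names are hypothetical.

```python
from __future__ import annotations

import csv
import io
import sqlite3

# Hypothetical raw input; a real project would pull this from an API or bucket.
RAW_CSV = """city,temp_f
Berlin,68.0
Oslo,50.0
Lima,not_available
"""


def extract(raw: str) -> list[dict[str, str]]:
    """Extract: parse the raw CSV text into one dictionary per row."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows: list[dict[str, str]]) -> list[tuple[str, float]]:
    """Transform: convert Fahrenheit to Celsius and skip unparseable rows."""
    cleaned = []
    for row in rows:
        try:
            temp_f = float(row["temp_f"])
        except ValueError:
            continue  # a real pipeline would log or quarantine bad rows
        cleaned.append((row["city"], round((temp_f - 32) * 5 / 9, 1)))
    return cleaned


def load(rows: list[tuple[str, float]], conn: sqlite3.Connection) -> None:
    """Load: upsert by city so that re-running the pipeline is idempotent."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS weather (city TEXT PRIMARY KEY, temp_c REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO weather VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse connection
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT city, temp_c FROM weather ORDER BY city").fetchall()
```

Keeping the three stages as separate functions makes each one testable on its own, and it maps naturally onto one task per stage if the pipeline is later moved into an orchestrator such as Airflow.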

Intermediate Projects

  1. Real Estate Data Platform

    • GitHub
    • Tools: S3, Spark, Delta Lake, Dagster, Superset
    • Skills: Lakehouse, orchestration, visualization
  2. Azure End-to-End Analytics Platform

    • Tools: ADF, ADLS, Databricks, Synapse, Power BI
    • Skills: Azure ecosystem, medallion architecture
  3. LLM Data Pipeline

    • Lecture
    • Tools: OpenAI API, vector databases, Airflow
    • Skills: AI integration, vector search

Advanced Projects

  1. Real-Time Streaming Analytics

    • Tools: Kafka, Flink, Iceberg, Grafana
    • Skills: Event-driven architecture, stream processing
  2. SQL Query Engine with LLMs

    • Tutorial
    • Tools: LangChain, LLMs, databases
    • Skills: AI-powered tooling
  3. Multi-Cloud Lakehouse

    • Tools: Apache Iceberg, AWS + GCP, dbt, Airflow
    • Skills: Cloud-agnostic architecture

🌐 Communities

Join these communities to learn, network, and grow as a data engineer.

Discord Communities

Community | Members | Focus
DataExpert.io Discord | 10,000+ | Data Engineering
AdalFlow | - | AI/ML Engineering
Chip Huyen MLOps | 10,000+ | ML Operations

Slack Communities

Community | Focus
Data Talks Club | Data Science & Engineering
dbt Community | dbt, Analytics Engineering
Great Expectations | Data Quality
Prefect | Workflow Orchestration

Online Communities

Community | Platform | Focus
Data Engineer Things | Newsletter/Community | Data Engineering
r/dataengineering | Reddit | Data Engineering
r/apachespark | Reddit | Apache Spark
LinkedIn DE Community | LinkedIn | Professional Networking

🎙️ Podcasts

Podcast | Host | Topics
The Data Engineering Show | Databand | DE tools & practices
Data Engineering Podcast | Tobias Macey | Open-source data tools
DataTopics | - | Data engineering trends
DataWare | Ascend.io | Data pipelines
The Datastack Show | - | Modern data stack
Analytics Power Hour | - | Analytics & data
Drill to Detail | Mark Rittman | Analytics engineering

📰 Newsletters

Newsletter | Author | Topics
DataEngineer.io Newsletter | Zach Wilson | DE career & tech
The Developing Dev | Ryan Peterman | Engineering growth
Data Engineering Weekly | Ananth Packkildurai | DE news
Benn Stancil's Newsletter | Benn Stancil | Data strategy
Ahead of the Trend | - | Data trends
Seattle Data Guy | Ben Rogojan | DE tips

🎥 YouTube Channels

Channel | Focus | Subscribers
Zach Wilson | Data Engineering Career | 50,000+
Seattle Data Guy | DE Interviews & Tips | 50,000+
Andreas Kretz | Data Engineering School | 50,000+
ByteByteGo | System Design | 1,000,000+
Alex The Analyst | Data Analysis | 700,000+

💼 Interview Preparation

Data Engineering Interview Topics

Technical Interview Topics:
├── SQL
│   ├── Window Functions (ROW_NUMBER, RANK, LAG, LEAD)
│   ├── CTEs and Recursive CTEs
│   ├── Query Optimization & EXPLAIN plans
│   └── Aggregations & Subqueries
├── Python
│   ├── PySpark DataFrames
│   ├── Pandas/Polars operations
│   ├── Generators & Iterators
│   └── OOP for data pipelines
├── System Design
│   ├── Design a data warehouse
│   ├── Design a real-time analytics system
│   ├── Design a CDC pipeline
│   └── Design a data lake
├── Data Modeling
│   ├── Star Schema vs Snowflake Schema
│   ├── Slowly Changing Dimensions (SCD)
│   ├── Data Vault 2.0
│   └── Kimball vs Inmon
└── Infrastructure
    ├── Docker & Kubernetes
    ├── Cloud platforms (AWS/GCP/Azure)
    ├── CI/CD for data pipelines
    └── Monitoring & Alerting
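
To make the SQL topics above concrete, here is a small practice sketch that combines a CTE with two of the listed window functions, ROW_NUMBER and LAG. It runs against an in-memory SQLite database through Python's standard `sqlite3` module, which assumes the bundled SQLite is version 3.25 or newer (the release that added window functions). The `orders` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_day INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("ann", 1, 10.0), ("ann", 3, 25.0), ("bob", 2, 40.0), ("ann", 7, 5.0)],
)

# For each customer, number the orders chronologically and look up the
# amount of the previous order (NULL for a customer's first order).
QUERY = """
WITH ranked AS (
    SELECT
        customer,
        order_day,
        ROW_NUMBER() OVER (PARTITION BY customer ORDER BY order_day) AS order_seq,
        LAG(amount)  OVER (PARTITION BY customer ORDER BY order_day) AS prev_amount
    FROM orders
)
SELECT customer, order_day, order_seq, prev_amount
FROM ranked
ORDER BY customer, order_day
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

A common follow-up question is to keep only each customer's first or latest order, which amounts to filtering the CTE on `order_seq = 1` (or ordering descending before numbering).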

Interview Resources

Resume Tips

  • Quantify impact: "Reduced pipeline runtime by 67%" beats "improved pipeline"
  • Include GitHub links to real projects
  • Mention data volumes (TB, PB scale)
  • List certifications (AWS, GCP, Databricks, dbt)

📊 Data Engineering Salary

Level | Years of Experience | US Salary Range (USD)
Junior DE | 0–2 | $80k–$120k
Mid-level DE | 2–5 | $120k–$170k
Senior DE | 5–8 | $160k–$220k
Staff DE | 8–12 | $200k–$280k
Principal DE | 12+ | $260k–$400k+

Source: levels.fyi, LinkedIn Salary, Glassdoor (2024)


🌟 Top Influencers to Follow

LinkedIn

Name | Profile | Followers
Zach Wilson | EcZachly | 100,000+
Seattle Data Guy | SeattleDataGuy | 50,000+
Andreas Kretz | Andreas Kretz | 50,000+
Lior Gavish | Lior Gavish | 30,000+

Twitter/X

Name | Handle | Followers
ByteByteGo | @alexxubyte | 500,000+
Dan Kornas | @dankornas | 66,000+
Zach Wilson | @EcZachly | 30,000+
Seattle Data Guy | @SeattleDataGuy | 10,000+

🤝 Contributing

Contributions are what keep the open-source community thriving, and any contribution you make is greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingResource)
  3. Commit your Changes (git commit -m 'Add some AmazingResource')
  4. Push to the Branch (git push origin feature/AmazingResource)
  5. Open a Pull Request

See CONTRIBUTING.md for detailed guidelines.


📄 License

Distributed under the MIT License. See LICENSE for more information.


Built with ❤️ for the Data Engineering Community

🌐 Website | ⭐ Star this repo | 🐛 Report Bug | 💡 Request Feature
