GitHub - Lutschippi/DEHUB: The Ultimate Data Engineering Hub — 500+ Resources, 50+ Tools, Roadmaps & Community for Data Engineers Worldwide

$ pip install dehub-knowledge --upgrade
✓ Loaded: 500+ Resources | 50+ Tools | 10+ Roadmaps | 1 Community

The command center for data engineers worldwide. From zero to petabyte-scale — everything you need to become an elite data engineer.

🚀 Get Started • 📚 Resources • 🗺️ Roadmap • 🛠️ Tools • 🌐 Website • 🤝 Contribute

📡 Live Terminal Preview

╔══════════════════════════════════════════════════════════════╗
║  dehub@engineer:~$                                           ║
╠══════════════════════════════════════════════════════════════╣
║                                                              ║
║  $ dbt run --select +orders_mart                             ║
║  [INFO] Running with dbt=1.8.0                               ║
║  [INFO] Found 47 models, 312 tests                          ║
║  ✓ Completed successfully                                    ║
║                                                              ║
║  $ spark-submit --master yarn pipeline.py                   ║
║  [INFO] SparkContext initialized                             ║
║  [INFO] Processing 2.4B records...                          ║
║  ✓ Job completed: 847s elapsed                              ║
║                                                              ║
║  $ kafka-topics --create --topic events --partitions 12     ║
║  ✓ Created topic "events"                                   ║
║                                                              ║
║  $ SELECT COUNT(*) FROM iceberg.prod.events;                ║
║  > 1,427,893,004 rows                                       ║
║                                                              ║
╚══════════════════════════════════════════════════════════════╝

📋 Table of Contents

🚀 Getting Started
🗺️ Roadmap
📚 Books
🎓 Courses & Bootcamps
🛠️ Data Engineering Ecosystem
💼 Projects & Portfolio
🌐 Communities
🎙️ Podcasts
📰 Newsletters
🎥 YouTube Channels
💼 Interview Preparation
📊 Data Engineering Salary
🤝 Contributing
📄 License

🚀 Getting Started

New to Data Engineering? Follow this path:

📍 You are here
    │
    ▼
[1] 📖 Read "Fundamentals of Data Engineering" (Joe Reis & Matt Housley)
    │
    ▼
[2] 🐍 Learn Python + SQL to a professional level
    │
    ▼
[3] 🌊 Build your first pipeline (Batch → Streaming)
    │
    ▼
[4] ☁️  Get cloud certified (AWS/GCP/Azure)
    │
    ▼
[5] 🚀 Contribute to open-source, get hired
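
As a rough illustration of step [3], a first batch pipeline can be as small as three functions: extract, transform, and load. The sketch below uses only the Python standard library. The sample data, the `orders` table, and every function name are hypothetical and chosen for illustration; a real pipeline would read from files, APIs, or object storage and load into a proper warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; stands in for a file, API response, or object-store blob.
RAW_CSV = """order_id,customer,amount
1,alice,19.99
2,bob,
3,alice,5.01
"""


def extract(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows with a missing amount and cast columns to proper types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append((int(row["order_id"]), row["customer"], float(row["amount"])))
    return cleaned


def load(rows, conn):
    """Upsert cleaned rows into SQLite and return how many rows were written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    # INSERT OR REPLACE keeps the load idempotent: re-running does not duplicate rows.
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def run_pipeline(conn):
    """Run extract -> transform -> load against the given connection."""
    return load(transform(extract(RAW_CSV)), conn)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(run_pipeline(connection), "rows loaded")
```

Keeping the load step idempotent is a deliberate choice here: batch jobs get retried, and a rerun should leave the table in the same state.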

Experienced Engineer? Jump to:

Advanced Tools & Frameworks
System Design Resources
Senior-level Interview Prep
Open Source Projects

🗺️ Roadmap

Beginner Path (0–6 months)

Skill	Resources	Status
Python for Data Engineering	Python for Data Analysis	🔥 Essential
SQL Mastery	Mode SQL Tutorial	🔥 Essential
Linux & Bash	The Linux Command Line	✅ Required
Git & Version Control	Pro Git Book	✅ Required
Basic ETL Concepts	Fundamentals of Data Engineering	🔥 Essential
Docker Basics	Docker Official Docs	✅ Required

Intermediate Path (6–18 months)

Skill	Resources	Status
Apache Spark	Spark: The Definitive Guide	🔥 Essential
Apache Airflow	Astronomer Guides	🔥 Essential
dbt (Data Build Tool)	dbt Learn	🔥 Essential
Cloud Platforms	AWS/GCP/Azure Certifications	✅ Required
Data Modeling	Kimball - Data Warehouse Toolkit	🔥 Essential
Kafka/Streaming	Confluent Kafka Tutorials	⭐ Recommended

Advanced Path (18+ months)

Skill	Resources	Status
Data Architecture	Designing Data-Intensive Applications	🔥 Essential
Apache Iceberg	Apache Iceberg: The Definitive Guide	🔥 Essential
Real-Time Streaming	Streaming Systems	⭐ Recommended
Data Mesh	Data Mesh (Zhamak Dehghani)	⭐ Recommended
ML Engineering	Designing Machine Learning Systems	⭐ Recommended
System Design	ByteByteGo	✅ Required

Expert Path

Skill	Focus
Data Platform Architecture	Design multi-cloud, multi-region data platforms
Cost Optimization	FinOps for data, query optimization at scale
Team Leadership	Building and mentoring data engineering teams
Open Source Contribution	Build tooling the community relies on

📚 Books — Must Read

Top 3 Essential Books

#	Book	Author	Level
1	Fundamentals of Data Engineering	Joe Reis & Matt Housley	All Levels
2	Designing Data-Intensive Applications	Martin Kleppmann	Intermediate+
3	Designing Machine Learning Systems	Chip Huyen	Advanced

Complete Book List

Data Engineering Core

Apache Spark

Streaming

Cloud & Lakehouse

dbt & Transformation

Data Architecture

Machine Learning & AI

Analytics & Python

Pandas Cookbook, 3rd Edition
Python for Data Analysis, 3E (Free Online)
Trino: The Definitive Guide
Hadoop: The Definitive Guide

🎓 Courses & Bootcamps

Free Bootcamps

Name	Provider	Level	Duration
Data Engineering Zoomcamp	DataTalks.Club	Beginner	9 weeks
Beginner Data Engineering Bootcamp	DataExpert.io	Beginner	4 weeks
Intermediate Bootcamp	DataExpert.io	Intermediate	6 weeks
DE Fundamentals	Databricks	All Levels	Self-paced

Premium Courses

Name	Provider	Level
Data Expert Courses	Zach Wilson	All Levels
dbt Fundamentals	dbt Labs	Beginner
Astronomer Certification	Astronomer	Intermediate
Databricks Certified Associate	Databricks	Intermediate
Snowflake SnowPro Core	Snowflake	Intermediate
Google Cloud Professional Data Engineer	Google	Advanced
AWS Data Analytics Specialty (retired in April 2024; succeeded by AWS Certified Data Engineer - Associate)	AWS	Advanced

🛠️ Data Engineering Ecosystem

Orchestration

Tool	Description
Apache Airflow	Industry-standard workflow orchestrator
Dagster	Data-aware orchestration platform
Prefect	Modern workflow automation
Mage	Modern data pipeline tool
Kestra	Declarative orchestration
Hamilton	Function-based DAG framework

Data Lake / Lakehouse

Tool	Description
Apache Iceberg	Open table format for huge datasets
Delta Lake	ACID transactions for big data
Apache Hudi	Incremental data processing
Apache Polaris	Open catalog for Apache Iceberg
DuckLake	SQL-native lakehouse

Data Warehouse

Tool	Description
Snowflake	Cloud-native data warehouse
Google BigQuery	Serverless data warehouse
Databricks	Unified analytics platform
Amazon Redshift	AWS data warehouse
Firebolt	Ultra-fast cloud warehouse
Databend	Open-source cloud DW
ClickHouse	Real-time analytics

Processing Engines

Tool	Description
Apache Spark	Unified analytics engine
Apache Flink	Stream & batch processing
DuckDB	In-process analytical DB
Trino	Distributed SQL query engine
Polars	Fast DataFrame library

Transformation

Tool	Description
dbt	SQL-first transformation
SQLMesh	Next-gen dbt alternative
Coalesce	Cloud-native transformation

Data Quality

Tool	Description
Great Expectations	Data quality framework
Soda	Data quality platform
dbt tests	Built-in dbt testing
Metaplane	Data observability
DQOps	Automated data quality

Streaming & Messaging

Tool	Description
Apache Kafka	Distributed event streaming
Apache Pulsar	Cloud-native messaging
Redpanda	Kafka-compatible streaming
Confluent	Managed Kafka platform
AWS Kinesis	Real-time data streaming

Data Catalog & Governance

Tool	Description
Apache Atlas	Data governance & metadata
DataHub	Modern metadata platform
OpenMetadata	Open-source data catalog
Amundsen	Data discovery & metadata

Ingestion & Integration

Tool	Description
Airbyte	Open-source data integration
Fivetran	Automated data movement
Debezium	CDC (Change Data Capture)
Apache NiFi	Data flow automation
dlt	Python data load tool

Visualization

Tool	Description
Apache Superset	Open-source BI
Metabase	Business intelligence
Grafana	Observability & analytics
Redash	Query & visualization

💼 Projects & Portfolio

Build these projects to demonstrate real-world data engineering skills:

Beginner Projects

End-to-End NYC Taxi Data Pipeline
- Tools: Python, BigQuery, Looker Studio
- Skills: ETL, cloud storage, BI visualization
Weather Data Pipeline
- Tools: Airflow, PostgreSQL, dbt
- Skills: Orchestration, scheduling, transformation
Extract YouTube Metadata
- GitHub Project
- Tools: AWS Lambda, S3, Free Tier
- Skills: Serverless, cloud storage, API ingestion

Intermediate Projects

Real Estate Data Platform
- GitHub
- Tools: S3, Spark, Delta Lake, Dagster, Superset
- Skills: Lakehouse, orchestration, visualization
Azure End-to-End Analytics Platform
- Tools: ADF, ADLS, Databricks, Synapse, Power BI
- Skills: Azure ecosystem, medallion architecture
LLM Data Pipeline
- Lecture
- Tools: OpenAI API, vector databases, Airflow
- Skills: AI integration, vector search

Advanced Projects

Real-Time Streaming Analytics
- Tools: Kafka, Flink, Iceberg, Grafana
- Skills: Event-driven architecture, stream processing
SQL Query Engine with LLMs
- Tutorial
- Tools: LangChain, LLMs, databases
- Skills: AI-powered tooling
Multi-Cloud Lakehouse
- Tools: Apache Iceberg, AWS + GCP, dbt, Airflow
- Skills: Cloud-agnostic architecture

🌐 Communities

Join these communities to learn, network, and grow as a data engineer.

Discord Communities

Community	Members	Focus
DataExpert.io Discord	10,000+	Data Engineering
AdalFlow	-	AI/ML Engineering
Chip Huyen MLOps	10,000+	ML Operations

Slack Communities

Community	Focus
Data Talks Club	Data Science & Engineering
dbt Community	dbt, Analytics Engineering
Great Expectations	Data Quality
Prefect	Workflow Orchestration

Online Communities

Community	Platform	Focus
Data Engineer Things	Newsletter/Community	Data Engineering
r/dataengineering	Reddit	Data Engineering
r/apachespark	Reddit	Apache Spark
LinkedIn DE Community	LinkedIn	Professional Networking

🎙️ Podcasts

Podcast	Host	Topics
The Data Engineering Show	Firebolt	DE tools & practices
Data Engineering Podcast	Tobias Macey	Open-source data tools
DataTopics	-	Data engineering trends
DataAware	Ascend.io	Data pipelines
The Data Stack Show	-	Modern data stack
Analytics Power Hour	-	Analytics & data
Drill to Detail	Mark Rittman	Analytics engineering

📰 Newsletters

Newsletter	Author	Topics
DataEngineer.io Newsletter	Zach Wilson	DE career & tech
The Developing Dev	Ryan Peterman	Engineering growth
Data Engineering Weekly	Ananth Packkildurai	DE news
Benn Stancil's Newsletter	Benn Stancil	Data strategy
Ahead of the Trend	-	Data trends
Seattle Data Guy	Ben Rogojan	DE tips

🎥 YouTube Channels

Channel	Focus	Subscribers
Zach Wilson	Data Engineering Career	50,000+
Seattle Data Guy	DE Interviews & Tips	50,000+
Andreas Kretz	Data Engineering School	50,000+
ByteByteGo	System Design	1,000,000+
Alex The Analyst	Data Analysis	700,000+

💼 Interview Preparation

Data Engineering Interview Topics

Technical Interview Topics:
├── SQL
│   ├── Window Functions (ROW_NUMBER, RANK, LAG, LEAD)
│   ├── CTEs and Recursive CTEs
│   ├── Query Optimization & EXPLAIN plans
│   └── Aggregations & Subqueries
├── Python
│   ├── PySpark DataFrames
│   ├── Pandas/Polars operations
│   ├── Generators & Iterators
│   └── OOP for data pipelines
├── System Design
│   ├── Design a data warehouse
│   ├── Design a real-time analytics system
│   ├── Design a CDC pipeline
│   └── Design a data lake
├── Data Modeling
│   ├── Star Schema vs Snowflake Schema
│   ├── Slowly Changing Dimensions (SCD)
│   ├── Data Vault 2.0
│   └── Kimball vs Inmon
└── Infrastructure
    ├── Docker & Kubernetes
    ├── Cloud platforms (AWS/GCP/Azure)
    ├── CI/CD for data pipelines
    └── Monitoring & Alerting
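
A common interview exercise that combines several of the SQL topics above is "return each user's latest event together with the previous value". The sketch below is one way to practice it locally, assuming Python's built-in `sqlite3` module linked against SQLite 3.25 or newer (the first release with window functions). The `events` table, its columns, and the `latest_events` helper are hypothetical names used only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical sample data: one row per user event.
conn.executescript("""
CREATE TABLE events (user_id TEXT, event_date TEXT, amount INTEGER);
INSERT INTO events VALUES
  ('a', '2024-01-01', 10),
  ('a', '2024-01-03', 30),
  ('a', '2024-01-02', 20),
  ('b', '2024-01-01', 5),
  ('b', '2024-01-05', 15);
""")

# A CTE computes two window functions per user:
#   ROW_NUMBER ranks events newest-first, so rn = 1 is the latest event.
#   LAG looks one row back in chronological order to fetch the previous amount.
QUERY = """
WITH ranked AS (
  SELECT
    user_id,
    event_date,
    amount,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_date DESC) AS rn,
    LAG(amount) OVER (PARTITION BY user_id ORDER BY event_date) AS prev_amount
  FROM events
)
SELECT user_id, event_date, amount, prev_amount
FROM ranked
WHERE rn = 1
ORDER BY user_id
"""


def latest_events(connection):
    """Return (user_id, event_date, amount, prev_amount) for each user's latest event."""
    return connection.execute(QUERY).fetchall()


if __name__ == "__main__":
    for row in latest_events(conn):
        print(row)
```

The same query shape carries over to most warehouses with little change, which makes it a useful pattern to rehearse before moving on to RANK, LEAD, and recursive CTEs.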

Interview Resources

Data Engineering Interview Questions — Comprehensive list
ByteByteGo System Design — System design prep
SQLZoo — SQL practice
LeetCode Database Problems — SQL challenges
Glassdoor DE Interview Reviews — Company-specific prep

Resume Tips

Quantify impact: "Reduced pipeline runtime by 67%" beats "improved pipeline"
Include GitHub links to real projects
Mention data volumes (TB, PB scale)
List certifications (AWS, GCP, Databricks, dbt)

📊 Data Engineering Salary

Level	YoE	USA Salary Range
Junior DE	0–2	$80k–$120k
Mid-level DE	2–5	$120k–$170k
Senior DE	5–8	$160k–$220k
Staff DE	8–12	$200k–$280k
Principal DE	12+	$260k–$400k+

Source: levels.fyi, LinkedIn Salary, Glassdoor (2024)

🌟 Top Influencers to Follow

LinkedIn

Name	Profile	Followers
Zach Wilson	EcZachly	100,000+
Seattle Data Guy	SeattleDataGuy	50,000+
Andreas Kretz	Andreas Kretz	50,000+
Lior Gavish	Lior Gavish	30,000+

Twitter/X

Name	Handle	Followers
ByteByteGo	@alexxubyte	500,000+
Dan Kornas	@dankornas	66,000+
Zach Wilson	@EcZachly	30,000+
Seattle Data Guy	@SeattleDataGuy	10,000+

🤝 Contributing

Contributions are what make the open source community amazing! Any contributions you make are greatly appreciated.

1. Fork the Project
2. Create your Feature Branch (git checkout -b feature/AmazingResource)
3. Commit your Changes (git commit -m 'Add some AmazingResource')
4. Push to the Branch (git push origin feature/AmazingResource)
5. Open a Pull Request

See CONTRIBUTING.md for detailed guidelines.

📄 License

Distributed under the MIT License. See LICENSE for more information.

Built with ❤️ for the Data Engineering Community

🌐 Website • ⭐ Star this repo • 🐛 Report Bug • 💡 Request Feature
