Build and demo modern data and AI platforms without waiting on infrastructure tickets. This repository collects curated, dockerized blueprints that let data engineers, ML teams, and platform builders spin up end-to-end environments—data lake foundations, pipeline orchestration, observability, and AI-friendly tooling—in a few commands.
- Accelerate experimentation: Stand up realistic data/AI environments locally or on a single VM, then iterate on pipelines, models, and dashboards with production-inspired defaults.
- Stay modular: Each stack is self-contained and composable—pick the lakehouse, orchestration, or monitoring pieces you need today and combine them as your platform grows.
- Promote best practices: Included services cover security, backups, health checks, and resource monitoring so teams focus on insights, not plumbing.
- Bridge personas: Empower data engineers, AI engineers, analytics developers, and operators to collaborate in the same sandbox through role-aligned interfaces.
| Directory | Focus | Highlights | Docs |
|---|---|---|---|
| data-Infrastructure/ | Platform foundations | Opinionated essays covering the why behind stack choices—start with hidden pitfalls that derail data platforms before they scale | The Hidden Problems in Data Infrastructure |
| datalake/ | Data infrastructure | PostgreSQL-based lake with connection pooling, Redis cache, no-code access, backups, and uptime monitoring | Postgres Lake README |
| data_pipeline_orchestration/ | Data & AI engineering | Apache Airflow bundle with MinIO object storage, customizable ETL worker, resource monitoring, and helper scripts | Airflow Stack README |
| ducklake-ai-platform/ | Lakehouse + AI workspace | DuckDB + DuckLake core with Marimo notebooks, MinIO object storage, Postgres metadata, and vector search-ready defaults | DuckLake README |
| dataengineering-dashboard-vision/ | Observability agent | Conversational Grafana + Prometheus assistant that delivers root-cause context and anomaly summaries via chat | Dashboard Agent README |
| dwh-rag-framework/ | Warehouse-first RAG lab | DuckDB snapshots feeding LightRAG indexing with Marimo notebooks and Cronicle automation for agent validation | RAG Framework README |
| n8n-data-ai-orchestration/ | AI-powered job orchestration | Customer retention workflow that blends SQL, enrichment, OpenAI strategy generation, Slack/email reporting, and failure alerting in n8n | n8n Flow README |
| mcp-data-server/ | Universal data loader MCP | Format-agnostic FastAPI server with auto-detect parsers, DuckDB SQL querying, and REST endpoints for instant file-to-query workflows | MCP Data Server README |
| data-agent-sdk/ | Data engineering agent SDK | Minimal SDK for building data agents with SQL/Polars tools, governance hooks, lineage tracking, and MCP server support in ~2,000 lines | Data Agent SDK README |
| python-redis-streaming/ | Streaming ingestion engine | Async Python + Redis Streams + Postgres stack with uv tooling, DLQ handling, and CLI helpers for monitoring and benchmarks | Python Redis Streaming README |
| redis-postgres-pipeline/ | High-performance pipeline | Production-ready data pipeline with Redis queues, dedup, caching, Postgres 18 async I/O, UNLOGGED staging, materialized views, and Polars — handles 500M records without Spark | Redis Postgres Pipeline README |
Pair the conceptual deep dives with the hands-on stack READMEs: skim data-Infrastructure/ to understand the platform philosophy, then jump into the stack directory that matches your next experiment for deployment steps and credentials.
- Install prerequisites: Docker + Docker Compose v2 on a machine with adequate CPU, RAM, and disk (see stack-specific READMEs for sizing).
- Clone the repo:

  ```bash
  git clone https://github.com/hottechstack/simple-data-ai-stack.git
  cd simple-data-ai-stack
  ```

- Choose a stack: Browse the directories above and open the corresponding README for detailed instructions.
- Launch locally: Most stacks run with a single command (`docker compose up -d`, `./start_pipeline.sh start`, etc.). Scripts expose health checks, sample data loaders, and log helpers to keep you moving.
- Compose your platform: Run stacks side-by-side for a fuller platform—pipe object storage into the SQL lake, orchestrate model feature jobs, or layer BI tooling on top (see the smoke-test sketch below).
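Before composing stacks, it helps to verify that each one is reachable from a single script. Below is a minimal smoke-test sketch in Python; the hosts, ports, and credentials (localhost:9000 with minioadmin for MinIO, localhost:5432 with postgres for the lake) are assumed common defaults, not values from this repo, so substitute whatever your chosen stack's README or .env specifies.

```python
# Minimal cross-stack smoke test. All hosts, ports, and credentials below are
# assumed defaults; pull the real values from each stack's README or .env.
import boto3
import psycopg2

# MinIO object storage: S3-compatible API, commonly on port 9000.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",       # assumed default credential
    aws_secret_access_key="minioadmin",   # assumed default credential
)
print("MinIO buckets:", [b["Name"] for b in s3.list_buckets().get("Buckets", [])])

# PostgreSQL data lake: commonly on port 5432.
conn = psycopg2.connect(
    host="localhost", port=5432,
    dbname="postgres", user="postgres", password="postgres",  # assumed defaults
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print("Postgres:", cur.fetchone()[0])
conn.close()
```

If both calls succeed, one stack's outputs can be pointed at the other's endpoints, for example landing MinIO objects in the Postgres lake.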
- Prototype a lakehouse with production-grade components before committing to cloud services.
- Trial ETL & AI feature pipelines with real datasets and observe resource footprints.
- Provide analysts and business users a sandbox with self-service interfaces (NocoDB, pgAdmin, dashboards).
- Validate monitoring/backup strategies in isolation before promoting to shared environments.
- Land structured/unstructured data via MinIO or direct DB ingestion.
- Transform using Airflow-managed ETL jobs powered by DuckDB and Polars.
- Serve & explore through PostgreSQL, NocoDB, BI tools, or custom APIs.
- Observe everything with built-in uptime checks, metrics dashboards, and automated backups.
The stacks are designed to connect: object storage flows into transformation jobs, refined outputs land back into the data lake, and monitoring tools keep the feedback loop tight.
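To make that loop concrete, here is a minimal sketch of the land, transform, and serve steps in Python. The bucket name, column names, table name, and credentials are hypothetical; it assumes DuckDB's httpfs extension for S3-compatible access to MinIO, and Polars' `write_database` (which requires SQLAlchemy) for loading results into Postgres.

```python
# Land -> transform -> serve in one pass. The bucket, columns, table, and
# credentials are hypothetical examples; substitute your stack's values.
import duckdb
import polars as pl

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
# Point DuckDB's S3 layer at local MinIO (assumed default endpoint/credentials).
con.execute("SET s3_endpoint='localhost:9000'")
con.execute("SET s3_access_key_id='minioadmin'")
con.execute("SET s3_secret_access_key='minioadmin'")
con.execute("SET s3_use_ssl=false")
con.execute("SET s3_url_style='path'")

# 1. Land: read raw Parquet files straight out of object storage.
raw = con.execute(
    "SELECT * FROM read_parquet('s3://raw/events/*.parquet')"
).arrow()

# 2. Transform: aggregate daily event counts with Polars.
daily = (
    pl.from_arrow(raw)
    .with_columns(pl.col("event_ts").dt.date().alias("event_date"))
    .group_by("event_date")
    .agg(pl.len().alias("event_count"))
    .sort("event_date")
)

# 3. Serve: load the refined table into the Postgres lake for NocoDB/BI tools.
daily.write_database(
    table_name="daily_events",
    connection="postgresql://postgres:postgres@localhost:5432/postgres",
    if_table_exists="replace",
)
```

The Arrow handoff between DuckDB and Polars avoids a serialization round-trip, which is part of why these two engines pair well in the transformation layer.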
- Vector databases + retrieval-augmented generation demo stack.
- Streaming ingestion profile (Kafka/Redpanda + stream processing + materialized views).
- Notebook & model experimentation workspace with GPU-ready containers.
- Terraform modules to mirror these blueprints in managed cloud environments.
Have an idea or internal stack you want to share? Contributions are welcome—open an issue or PR to propose a new module or enhancement.
- Fork the repository and work inside a dedicated directory for your stack.
- Document your stack thoroughly (architecture, environment variables, health checks, teardown steps).
- Reuse existing patterns for Docker Compose profiles, scripts, and monitoring hooks to keep experiences consistent.
- Submit a PR describing the use case, prerequisites, and any sample data included.
Unless otherwise stated in a subdirectory, content is provided as-is for educational use and production experimentation. Review upstream container licenses before deploying in regulated environments.