
Simple Data AI Stack

Build and demo modern data and AI platforms without waiting on infrastructure tickets. This repository collects curated, dockerized blueprints that let data engineers, ML teams, and platform builders spin up end-to-end environments—data lake foundations, pipeline orchestration, observability, and AI-friendly tooling—in a few commands.


The Vision

  • Accelerate experimentation: Stand up realistic data/AI environments locally or on a single VM, then iterate on pipelines, models, and dashboards with production-inspired defaults.
  • Stay modular: Each stack is self-contained and composable—pick the lakehouse, orchestration, or monitoring pieces you need today and combine them as your platform grows.
  • Promote best practices: Included services cover security, backups, health checks, and resource monitoring so teams focus on insights, not plumbing.
  • Bridge personas: Empower data engineers, AI engineers, analytics developers, and operators to collaborate against the same sandbox with role-aligned interfaces.

Repository Guide

  • data-Infrastructure/ (Platform foundations): Opinionated essays covering the why behind stack choices; start with the hidden pitfalls that derail data platforms before they scale. Docs: The Hidden Problems in Data Infrastructure
  • datalake/ (Data infrastructure): PostgreSQL-based lake with connection pooling, a Redis cache, no-code access, backups, and uptime monitoring. Docs: Postgres Lake README
  • data_pipeline_orchestration/ (Data & AI engineering): Apache Airflow bundle with MinIO object storage, a customizable ETL worker, resource monitoring, and helper scripts. Docs: Airflow Stack README
  • ducklake-ai-platform/ (Lakehouse + AI workspace): DuckDB + DuckLake core with Marimo notebooks, MinIO object storage, Postgres metadata, and vector-search-ready defaults. Docs: DuckLake README
  • dataengineering-dashboard-vision/ (Observability agent): Conversational Grafana + Prometheus assistant that delivers root-cause context and anomaly summaries via chat. Docs: Dashboard Agent README
  • dwh-rag-framework/ (Warehouse-first RAG lab): DuckDB snapshots feeding LightRAG indexing, with Marimo notebooks and Cronicle automation for agent validation. Docs: RAG Framework README
  • n8n-data-ai-orchestration/ (AI-powered job orchestration): Customer-retention workflow that blends SQL, enrichment, OpenAI strategy generation, Slack/email reporting, and failure alerting in n8n. Docs: n8n Flow README
  • mcp-data-server/ (Universal data loader, MCP): Format-agnostic FastAPI server with auto-detect parsers, DuckDB SQL querying, and REST endpoints for instant file-to-query workflows. Docs: MCP Data Server README
  • data-agent-sdk/ (Data engineering agent SDK): Minimal SDK for building data agents with SQL/Polars tools, governance hooks, lineage tracking, and MCP server support in ~2,000 lines. Docs: Data Agent SDK README
  • python-redis-streaming/ (Streaming ingestion engine): Async Python + Redis Streams + Postgres stack with uv tooling, DLQ handling, and CLI helpers for monitoring and benchmarks. Docs: Python Redis Streaming README
  • redis-postgres-pipeline/ (High-performance pipeline): Production-ready pipeline with Redis queues, deduplication, caching, Postgres 18 async I/O, UNLOGGED staging, materialized views, and Polars; handles 500M records without Spark. Docs: Redis Postgres Pipeline README

Pair the conceptual deep dives with the hands-on stack READMEs: skim data-Infrastructure/ to understand the platform philosophy, then jump into the stack directory that matches your next experiment for deployment steps and credentials.


Getting Started

  1. Install prerequisites: Docker + Docker Compose v2 on a machine with adequate CPU, RAM, and disk (see stack-specific READMEs for sizing).
  2. Clone the repo:
    git clone https://github.com/hottechstack/simple-data-ai-stack.git
    cd simple-data-ai-stack
  3. Choose a stack: Browse the directories above and open the corresponding README for detailed instructions.
  4. Launch locally: Most stacks run with a single command (docker compose up -d, ./start_pipeline.sh start, etc.); a typical first-run sequence is sketched after this list. Scripts expose health checks, sample data loaders, and log helpers to keep you moving.
  5. Compose your platform: Run stacks side-by-side for a fuller platform—pipe object storage into the SQL lake, orchestrate model feature jobs, or layer BI tooling on top.
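For orientation, here is what a first run usually looks like. Treat this as a minimal sketch rather than an exact recipe: the directory and service names are illustrative, and each stack's README documents the authoritative commands.

    cd datalake/                      # any stack directory; datalake/ is just an example
    docker compose up -d              # start all services in the background
    docker compose ps                 # confirm containers report a healthy status
    docker compose logs -f postgres   # follow one service's logs (service name is illustrative)
    docker compose down               # stop the stack when you are done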

Typical Use Cases

  • Prototype a lakehouse with production-grade components before committing to cloud services.
  • Trial ETL & AI feature pipelines with real datasets and observe resource footprints.
  • Provide analysts and business users with a sandbox offering self-service interfaces (NocoDB, pgAdmin, dashboards).
  • Validate monitoring/backup strategies in isolation before promoting to shared environments.

Opinionated Workflow

  1. Land structured/unstructured data via MinIO or direct DB ingestion.
  2. Transform using Airflow-managed ETL jobs powered by DuckDB and Polars.
  3. Serve & explore through PostgreSQL, NocoDB, BI tools, or custom APIs.
  4. Observe everything with built-in uptime checks, metrics dashboards, and automated backups.

The stacks are designed to connect: object storage flows into transformation jobs, refined outputs land back into the data lake, and monitoring tools keep the feedback loop tight.
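To make the transform-and-serve hop concrete, the sketch below reads a raw CSV from MinIO with DuckDB and lands the refined table in the Postgres lake. It assumes MinIO on localhost:9000 with its default credentials and a local Postgres database named lake; the bucket, file, and table names are placeholders, so adapt them to the stacks you actually run.

    duckdb -c "
    INSTALL httpfs; LOAD httpfs;
    SET s3_endpoint='localhost:9000';      -- assumed MinIO endpoint
    SET s3_url_style='path';
    SET s3_use_ssl=false;
    SET s3_access_key_id='minioadmin';     -- assumed default MinIO credentials
    SET s3_secret_access_key='minioadmin';
    INSTALL postgres; LOAD postgres;
    ATTACH 'host=localhost dbname=lake user=postgres' AS lake (TYPE postgres);
    CREATE TABLE lake.public.refined_orders AS
      SELECT * FROM read_csv_auto('s3://raw/orders.csv');  -- placeholder bucket and file
    "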


Roadmap Inspiration

  • Vector databases + retrieval-augmented generation demo stack.
  • Streaming ingestion profile (Kafka/Redpanda + stream processing + materialized views).
  • Notebook & model experimentation workspace with GPU-ready containers.
  • Terraform modules to mirror these blueprints in managed cloud environments.

Have an idea or internal stack you want to share? Contributions are welcome—open an issue or PR to propose a new module or enhancement.


Contributing

  1. Fork the repository and work inside a dedicated directory for your stack.
  2. Document your stack thoroughly (architecture, environment variables, health checks, teardown steps).
  3. Reuse existing patterns for Docker Compose profiles, scripts, and monitoring hooks to keep experiences consistent (a sketch of those conventions follows this list).
  4. Submit a PR describing the use case, prerequisites, and any sample data included.
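As one hedged example of compose hygiene that fits these guidelines (the "monitoring" profile name is an assumption for illustration, not an established repo convention):

    docker compose config --quiet               # validate your compose file before opening a PR
    docker compose --profile monitoring up -d   # optional tooling gated behind a profile
    docker compose down -v                      # teardown step your README should document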

License

Unless otherwise stated in a subdirectory, content is provided as-is for educational and production experimentation. Review upstream container licenses before deploying in regulated environments.
