
Simple Data AI Stack

Build and demo modern data and AI platforms without waiting on infrastructure tickets. This repository collects curated, dockerized blueprints that let data engineers, ML teams, and platform builders spin up end-to-end environments—data lake foundations, pipeline orchestration, observability, and AI-friendly tooling—in a few commands.


The Vision

  • Accelerate experimentation: Stand up realistic data/AI environments locally or on a single VM, then iterate on pipelines, models, and dashboards with production-inspired defaults.
  • Stay modular: Each stack is self-contained and composable—pick the lakehouse, orchestration, or monitoring pieces you need today and combine them as your platform grows.
  • Promote best practices: Included services cover security, backups, health checks, and resource monitoring so teams focus on insights, not plumbing.
  • Bridge personas: Empower data engineers, AI engineers, analytics developers, and operators to collaborate against the same sandbox with role-aligned interfaces.

Repository Guide

  • data-Infrastructure/ (Platform foundations): Opinionated essays covering the why behind stack choices; start with the hidden pitfalls that derail data platforms before they scale. Docs: The Hidden Problems in Data Infrastructure
  • datalake/ (Data infrastructure): PostgreSQL-based lake with connection pooling, a Redis cache, no-code access, backups, and uptime monitoring. Docs: Postgres Lake README
  • data_pipeline_orchestration/ (Data & AI engineering): Apache Airflow bundle with MinIO object storage, a customizable ETL worker, resource monitoring, and helper scripts. Docs: Airflow Stack README
  • ducklake-ai-platform/ (Lakehouse + AI workspace): DuckDB + DuckLake core with Marimo notebooks, MinIO object storage, Postgres metadata, and vector-search-ready defaults. Docs: DuckLake README
  • dataengineering-dashboard-vision/ (Observability agent): Conversational Grafana + Prometheus assistant that delivers root-cause context and anomaly summaries via chat. Docs: Dashboard Agent README
  • dwh-rag-framework/ (Warehouse-first RAG lab): DuckDB snapshots feeding LightRAG indexing, with Marimo notebooks and Cronicle automation for agent validation. Docs: RAG Framework README
  • n8n-data-ai-orchestration/ (AI-powered job orchestration): Customer-retention workflow that blends SQL, enrichment, OpenAI strategy generation, Slack/email reporting, and failure alerting in n8n. Docs: n8n Flow README
  • mcp-data-server/ (Universal data loader, MCP): Format-agnostic FastAPI server with auto-detect parsers, DuckDB SQL querying, and REST endpoints for instant file-to-query workflows. Docs: MCP Data Server README
  • data-agent-sdk/ (Data engineering agent SDK): Minimal SDK for building data agents with SQL/Polars tools, governance hooks, lineage tracking, and MCP server support in ~2,000 lines. Docs: Data Agent SDK README
  • python-redis-streaming/ (Streaming ingestion engine): Async Python + Redis Streams + Postgres stack with uv tooling, DLQ handling, and CLI helpers for monitoring and benchmarks. Docs: Python Redis Streaming README
  • redis-postgres-pipeline/ (High-performance pipeline): Production-ready pipeline with Redis queues, deduplication, caching, Postgres 18 async I/O, UNLOGGED staging, materialized views, and Polars; handles 500M records without Spark. Docs: Redis Postgres Pipeline README

Pair the conceptual deep dives with the hands-on stack READMEs: skim data-Infrastructure/ to understand the platform philosophy, then jump into the stack directory that matches your next experiment for deployment steps and credentials.


Getting Started

  1. Install prerequisites: Docker + Docker Compose v2 on a machine with adequate CPU, RAM, and disk (see stack-specific READMEs for sizing).
  2. Clone the repo:
    git clone https://github.com/hottechstack/simple-data-ai-stack.git
    cd simple-data-ai-stack
  3. Choose a stack: Browse the directories above and open the corresponding README for detailed instructions.
  4. Launch locally: Most stacks run with a single command (docker compose up -d, ./start_pipeline.sh start, etc.); a typical first-run sequence is sketched after this list. Scripts expose health checks, sample data loaders, and log helpers to keep you moving.
  5. Compose your platform: Run stacks side-by-side for a fuller platform—pipe object storage into the SQL lake, orchestrate model feature jobs, or layer BI tooling on top.
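For orientation, here is what a first run usually looks like. Treat this as a minimal sketch rather than an exact recipe: the directory and service names are illustrative, and each stack's README documents the authoritative commands.

    cd datalake/                      # any stack directory; datalake/ is just an example
    docker compose up -d              # start all services in the background
    docker compose ps                 # confirm containers report a healthy status
    docker compose logs -f postgres   # follow one service's logs (service name is illustrative)
    docker compose down               # stop the stack when you are done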

Typical Use Cases

  • Prototype a lakehouse with production-grade components before committing to cloud services.
  • Trial ETL & AI feature pipelines with real datasets and observe resource footprints.
  • Provide analysts and business users with a sandbox offering self-service interfaces (NocoDB, pgAdmin, dashboards).
  • Validate monitoring/backup strategies in isolation before promoting to shared environments.

Opinionated Workflow

  1. Land structured/unstructured data via MinIO or direct DB ingestion.
  2. Transform using Airflow-managed ETL jobs powered by DuckDB and Polars.
  3. Serve & explore through PostgreSQL, NocoDB, BI tools, or custom APIs.
  4. Observe everything with built-in uptime checks, metrics dashboards, and automated backups.

The stacks are designed to connect: object storage flows into transformation jobs, refined outputs land back into the data lake, and monitoring tools keep the feedback loop tight.
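To make the transform-and-serve hop concrete, the sketch below reads a raw CSV from MinIO with DuckDB and lands the refined table in the Postgres lake. It assumes MinIO on localhost:9000 with its default credentials and a local Postgres database named lake; the bucket, file, and table names are placeholders, so adapt them to the stacks you actually run.

    duckdb -c "
    INSTALL httpfs; LOAD httpfs;
    SET s3_endpoint='localhost:9000';      -- assumed MinIO endpoint
    SET s3_url_style='path';
    SET s3_use_ssl=false;
    SET s3_access_key_id='minioadmin';     -- assumed default MinIO credentials
    SET s3_secret_access_key='minioadmin';
    INSTALL postgres; LOAD postgres;
    ATTACH 'host=localhost dbname=lake user=postgres' AS lake (TYPE postgres);
    CREATE TABLE lake.public.refined_orders AS
      SELECT * FROM read_csv_auto('s3://raw/orders.csv');  -- placeholder bucket and file
    "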


Roadmap Inspiration

  • Vector databases + retrieval-augmented generation demo stack.
  • Streaming ingestion profile (Kafka/Redpanda + stream processing + materialized views).
  • Notebook & model experimentation workspace with GPU-ready containers.
  • Terraform modules to mirror these blueprints in managed cloud environments.

Have an idea or internal stack you want to share? Contributions are welcome—open an issue or PR to propose a new module or enhancement.


Contributing

  1. Fork the repository and work inside a dedicated directory for your stack.
  2. Document your stack thoroughly (architecture, environment variables, health checks, teardown steps).
  3. Reuse existing patterns for Docker Compose profiles, scripts, and monitoring hooks to keep experiences consistent (a sketch of those conventions follows this list).
  4. Submit a PR describing the use case, prerequisites, and any sample data included.
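As one hedged example of compose hygiene that fits these guidelines (the "monitoring" profile name is an assumption for illustration, not an established repo convention):

    docker compose config --quiet               # validate your compose file before opening a PR
    docker compose --profile monitoring up -d   # optional tooling gated behind a profile
    docker compose down -v                      # teardown step your README should document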

License

Unless otherwise stated in a subdirectory, content is provided as-is for educational and production experimentation. Review upstream container licenses before deploying in regulated environments.
