Data lake pipelines for vector databases and AI: ingest raw documents, run pipelines, and produce embeddings ready for RAG, LLMs, and semantic search.
Website: lake-flow.vercel.app (docs, API reference, deployment)
- Layered Data Lake: `000_inbox` → `100_raw` → `200_staging` → `300_processed` → `400_embeddings` → `500_catalog`
- Idempotent pipelines: re-run safely; deterministic UUIDs for Qdrant
- Semantic search: natural-language queries, cosine similarity
- Embedding API: `POST /search/embed` for text → vector (RAG/LLM ready)
- Streamlit UI: run pipelines, explore data, test search (dev mode)
- Multi-Qdrant: choose the Qdrant URL in the UI
- NAS-friendly: SQLite without WAL; works on Synology/NFS
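The idempotency noted above depends on deterministic point IDs: re-running a pipeline regenerates the same Qdrant ID for the same document chunk, so upserts overwrite instead of duplicating. A minimal sketch using Python's `uuid.uuid5` (the namespace and key format here are illustrative assumptions, not LakeFlow's actual scheme):

```python
import uuid

# Hypothetical namespace; LakeFlow's real namespace/key scheme may differ.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "lakeflow.example")

def point_id(doc_path: str, chunk_index: int) -> str:
    """Derive a stable Qdrant point ID from a document path and chunk index."""
    return str(uuid.uuid5(NAMESPACE, f"{doc_path}#{chunk_index}"))

# The same input always yields the same ID, so re-runs upsert in place.
assert point_id("100_raw/report.pdf", 0) == point_id("100_raw/report.pdf", 0)
```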
Requirements: Docker ≥ 20.x, Docker Compose ≥ 2.x
```bash
git clone https://github.com/Lampx83/LakeFlow.git
cd LakeFlow
cp env.example .env
# Edit .env: HOST_LAKE_PATH, QDRANT_HOST, API_BASE_URL
DOCKER_BUILDKIT=1 docker compose up --build
```

| Service | URL |
|---|---|
| Backend API | http://localhost:8011 |
| API docs | http://localhost:8011/docs |
| Streamlit UI | http://localhost:8012 (login: admin / admin123) |
Data lake: the `lakeflow_data` volume (bound to `HOST_LAKE_PATH` from `.env`). Create zones if needed: `000_inbox`, `100_raw`, …
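Creating the zone folders by hand is easy to script; a short sketch (zone names follow the layered-lake list, and the `HOST_LAKE_PATH` default used here is an assumption):

```python
import os
from pathlib import Path

ZONES = ["000_inbox", "100_raw", "200_staging",
         "300_processed", "400_embeddings", "500_catalog"]

def init_lake(root: str) -> list[Path]:
    """Create each data-lake zone under root if it does not already exist."""
    base = Path(root)
    created = []
    for zone in ZONES:
        p = base / zone
        p.mkdir(parents=True, exist_ok=True)
        created.append(p)
    return created

# Example: initialize under the path from .env, defaulting to ./data.
init_lake(os.environ.get("HOST_LAKE_PATH", "./data"))
```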
Mac M1 GPU: Docker runs Linux, so Metal acceleration is unavailable. For GPU, run the backend in a venv directly on macOS: `pip install torch`, then `pip install -r backend/requirements.txt`.
Copy `env.example` to `.env` and adjust:

| Variable | Description |
|---|---|
| `HOST_LAKE_PATH` | Host path for the data lake (bound to `/data` in the container) |
| `LAKE_ROOT` | Container path (`/data` in Docker) |
| `QDRANT_HOST` | `lakeflow-qdrant` (Docker) or `localhost` (local) |
| `API_BASE_URL` | `http://lakeflow-backend:8011` (Docker) or `http://localhost:8011` (local) |
| `LAKEFLOW_MODE` | `DEV` shows the Pipeline Runner; omit to hide it (production) |
| `LLM_BASE_URL` | Ollama/LLM URL for Q&A and the Admission agent |
| `EMBED_MODEL` | Default: `qwen3-embedding:8b` |
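How the backend might resolve these settings from the environment (a sketch using `os.environ`; the defaults shown are assumptions drawn from the table above, not LakeFlow's actual code):

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    lake_root: str
    qdrant_host: str
    api_base_url: str
    dev_mode: bool
    embed_model: str

def load_settings(env=os.environ) -> Settings:
    """Read LakeFlow settings from the environment, with local-dev defaults."""
    return Settings(
        lake_root=env.get("LAKE_ROOT", "/data"),
        qdrant_host=env.get("QDRANT_HOST", "localhost"),
        api_base_url=env.get("API_BASE_URL", "http://localhost:8011"),
        # Omitting LAKEFLOW_MODE hides the Pipeline Runner (production).
        dev_mode=env.get("LAKEFLOW_MODE") == "DEV",
        embed_model=env.get("EMBED_MODEL", "qwen3-embedding:8b"),
    )

settings = load_settings({"LAKEFLOW_MODE": "DEV"})
```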
```
LakeFlow/
├── backend/              # FastAPI + pipelines (Python)
│   └── src/lakeflow/
├── frontend/streamlit/   # Streamlit control UI
├── website/              # Next.js docs → lake-flow.vercel.app
├── docker-compose.yml
├── env.example           # Env template (copy to .env)
└── portainer-stack.yml
```
```bash
cd backend
python3 -m venv .venv && source .venv/bin/activate
pip install torch && pip install -r requirements.txt && pip install -e .
# .env in repo root with LAKE_ROOT, QDRANT_HOST, API_BASE_URL
python -m uvicorn lakeflow.main:app --reload --port 8011
```

Qdrant: `docker compose up -d qdrant`. Frontend: `streamlit run frontend/streamlit/app.py`.
- `GET /health`: health check
- `POST /auth/login`: demo auth (admin/admin123)
- `POST /search/embed`: `{"text":"..."}` → vector
- `POST /search/semantic`: `{"query":"...", "top_k":5}`
See `backend/docs/API_EMBED.md` and lake-flow.vercel.app/docs.
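A minimal client for the endpoints above using only the standard library (payload shapes follow the endpoint list; auth and error handling are omitted, and the network call is wrapped in a function so nothing runs on import):

```python
import json
import urllib.request

API = "http://localhost:8011"  # or http://lakeflow-backend:8011 inside Docker

def payload(endpoint: str, text: str, top_k: int = 5) -> dict:
    """Build the request body for /search/embed or /search/semantic."""
    if endpoint == "/search/embed":
        return {"text": text}
    if endpoint == "/search/semantic":
        return {"query": text, "top_k": top_k}
    raise ValueError(f"unknown endpoint: {endpoint}")

def post(endpoint: str, body: dict) -> dict:
    """POST JSON to the backend and decode the JSON response."""
    req = urllib.request.Request(
        API + endpoint,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (requires a running backend):
#   vec = post("/search/embed", payload("/search/embed", "hello"))
#   hits = post("/search/semantic", payload("/search/semantic", "admissions"))
```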
- CI (`ci.yml`): lint (Ruff) and Docker build on push/PR
- CD (`cd.yml`): on release tag, push images to GitHub Container Registry
- Docker Hub (`push-dockerhub.yml`): pushes `lakeflow-backend` and `lakeflow-frontend`; needs `DOCKERHUB_USER`, `DOCKERHUB_TOKEN`
- Deploy (`deploy.yml`): SSH to server, `git pull`, `docker compose` on push to `main`
Portainer: use `portainer-stack.yml` with pre-pushed images (no build). See the deployment section in the docs.
| Resource | URL |
|---|---|
| Website & Docs | lake-flow.vercel.app |
| PyPI | lake-flow-pipeline |
Issues, PRs, and doc improvements welcome.