Data lake pipelines for vector databases and AI: ingest raw documents, run pipelines, and produce embeddings ready for RAG, LLMs, and semantic search.
Website: lake-flow.vercel.app (docs, API reference, deployment)
- Layered Data Lake: `000_inbox` → `100_raw` → `200_staging` → `300_processed` → `400_embeddings` → `500_catalog`
- Idempotent pipelines: re-run safely; deterministic UUIDs for Qdrant
- Semantic search: natural-language queries, cosine similarity
- Embedding API: `POST /search/embed` for text → vector (RAG/LLM ready)
- Streamlit UI: run pipelines, explore data, test search (dev mode)
- Multi-Qdrant: choose the Qdrant URL in the UI
- NAS-friendly: SQLite without WAL; works on Synology/NFS
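The idempotency noted above depends on deterministic point IDs: re-running a pipeline regenerates the same Qdrant ID for the same document chunk, so upserts overwrite instead of duplicating. A minimal sketch using Python's `uuid.uuid5` (the namespace and key format here are illustrative assumptions, not LakeFlow's actual scheme):

```python
import uuid

# Hypothetical namespace; LakeFlow's real namespace/key scheme may differ.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "lakeflow.example")

def point_id(doc_path: str, chunk_index: int) -> str:
    """Derive a stable Qdrant point ID from a document path and chunk index."""
    return str(uuid.uuid5(NAMESPACE, f"{doc_path}#{chunk_index}"))

# The same input always yields the same ID, so re-runs upsert in place.
assert point_id("100_raw/report.pdf", 0) == point_id("100_raw/report.pdf", 0)
```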
Requirements: Docker ≥ 20.x, Docker Compose ≥ 2.x
```bash
git clone https://github.com/Lampx83/LakeFlow.git
cd LakeFlow
cp env.example .env
# Edit .env: HOST_LAKE_PATH, QDRANT_HOST, API_BASE_URL
DOCKER_BUILDKIT=1 docker compose up --build
```

| Service | URL |
|---|---|
| Backend API | http://localhost:8011 |
| API docs | http://localhost:8011/docs |
| Streamlit UI | http://localhost:8012 (login: admin / admin123) |
Data lake: the `lakeflow_data` volume (bound to `HOST_LAKE_PATH` from `.env`). Create zones if needed: `000_inbox`, `100_raw`, …
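Creating the zone folders by hand is easy to script; a short sketch (zone names follow the layered-lake list, and the `HOST_LAKE_PATH` default used here is an assumption):

```python
import os
from pathlib import Path

ZONES = ["000_inbox", "100_raw", "200_staging",
         "300_processed", "400_embeddings", "500_catalog"]

def init_lake(root: str) -> list[Path]:
    """Create each data-lake zone under root if it does not already exist."""
    base = Path(root)
    created = []
    for zone in ZONES:
        p = base / zone
        p.mkdir(parents=True, exist_ok=True)
        created.append(p)
    return created

# Example: initialize under the path from .env, defaulting to ./data.
init_lake(os.environ.get("HOST_LAKE_PATH", "./data"))
```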
Mac M1 GPU: Docker runs Linux, so Metal acceleration is unavailable. For GPU, run the backend in a venv directly on macOS: `pip install torch`, then `pip install -r backend/requirements.txt`.
Copy `env.example` to `.env` and adjust:

| Variable | Description |
|---|---|
| `HOST_LAKE_PATH` | Host path for the data lake (bound to `/data` in the container) |
| `LAKE_ROOT` | Container path (`/data` in Docker) |
| `QDRANT_HOST` | `lakeflow-qdrant` (Docker) or `localhost` (local) |
| `API_BASE_URL` | `http://lakeflow-backend:8011` (Docker) or `http://localhost:8011` (local) |
| `LAKEFLOW_MODE` | `DEV` shows the Pipeline Runner; omit to hide it (production) |
| `LLM_BASE_URL` | Ollama/LLM URL for Q&A and the Admission agent |
| `EMBED_MODEL` | Default: `qwen3-embedding:8b` |
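How the backend might resolve these settings from the environment (a sketch using `os.environ`; the defaults shown are assumptions drawn from the table above, not LakeFlow's actual code):

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    lake_root: str
    qdrant_host: str
    api_base_url: str
    dev_mode: bool
    embed_model: str

def load_settings(env=os.environ) -> Settings:
    """Read LakeFlow settings from the environment, with local-dev defaults."""
    return Settings(
        lake_root=env.get("LAKE_ROOT", "/data"),
        qdrant_host=env.get("QDRANT_HOST", "localhost"),
        api_base_url=env.get("API_BASE_URL", "http://localhost:8011"),
        # Omitting LAKEFLOW_MODE hides the Pipeline Runner (production).
        dev_mode=env.get("LAKEFLOW_MODE") == "DEV",
        embed_model=env.get("EMBED_MODEL", "qwen3-embedding:8b"),
    )

settings = load_settings({"LAKEFLOW_MODE": "DEV"})
```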
```
LakeFlow/
├── backend/              # FastAPI + pipelines (Python)
│   └── src/lakeflow/
├── frontend/streamlit/   # Streamlit control UI
├── website/              # Next.js docs → lake-flow.vercel.app
├── docker-compose.yml
├── env.example           # Env template (copy to .env)
└── portainer-stack.yml
```
```bash
cd backend
python3 -m venv .venv && source .venv/bin/activate
pip install torch && pip install -r requirements.txt && pip install -e .
# .env in repo root with LAKE_ROOT, QDRANT_HOST, API_BASE_URL
python -m uvicorn lakeflow.main:app --reload --port 8011
```

Qdrant: `docker compose up -d qdrant`. Frontend: `streamlit run frontend/streamlit/app.py`.
- `GET /health`: health check
- `POST /auth/login`: demo auth (admin/admin123)
- `POST /search/embed`: `{"text":"..."}` → vector
- `POST /search/semantic`: `{"query":"...", "top_k":5}`
See `backend/docs/API_EMBED.md` and lake-flow.vercel.app/docs.
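A minimal client for the endpoints above using only the standard library (payload shapes follow the endpoint list; auth and error handling are omitted, and the network call is wrapped in a function so nothing runs on import):

```python
import json
import urllib.request

API = "http://localhost:8011"  # or http://lakeflow-backend:8011 inside Docker

def payload(endpoint: str, text: str, top_k: int = 5) -> dict:
    """Build the request body for /search/embed or /search/semantic."""
    if endpoint == "/search/embed":
        return {"text": text}
    if endpoint == "/search/semantic":
        return {"query": text, "top_k": top_k}
    raise ValueError(f"unknown endpoint: {endpoint}")

def post(endpoint: str, body: dict) -> dict:
    """POST JSON to the backend and decode the JSON response."""
    req = urllib.request.Request(
        API + endpoint,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (requires a running backend):
#   vec = post("/search/embed", payload("/search/embed", "hello"))
#   hits = post("/search/semantic", payload("/search/semantic", "admissions"))
```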
- CI (`ci.yml`): lint (Ruff) and Docker build on push/PR
- CD (`cd.yml`): on release tag, push images to GitHub Container Registry
- Docker Hub (`push-dockerhub.yml`): pushes `lakeflow-backend` and `lakeflow-frontend`; needs `DOCKERHUB_USER`, `DOCKERHUB_TOKEN`
- Deploy (`deploy.yml`): SSH to server, `git pull`, `docker compose` on push to `main`
Portainer: use `portainer-stack.yml` with pre-pushed images (no build). See the deployment section in the docs.
| Resource | URL |
|---|---|
| Website & Docs | lake-flow.vercel.app |
| PyPI | lake-flow-pipeline |
Issues, PRs, and doc improvements welcome.