Observe, deploy, respond. From a prompt.
Drift is a prompt-driven control plane for time-series systems and edge fleets. You ask questions or give instructions in plain language; an LLM agent picks the right tools, queries your VictoriaMetrics / Prometheus, runs statistical analysis, ships compose bundles to your devices, manages alert rules, and assembles a rich response — markdown, charts, tables, metric cards, timelines — that streams progressively into the UI as the work unfolds.
prompt → agent (tool use → metrics / fleet / alerts) → streaming render blocks → UI
Blog post with details.
📐 ARCHITECTURE.md — data flow, dataRef pattern, agent loop, tool catalog, extension points, file reference.
🚀 DEPLOY.md — Drift Deploy: fleet management, compose-app delivery, scenarios.
🚨 ALERTING.md — vmalert + Alertmanager + the agent's 14 alert tools, end-to-end workflows.
📦 deploy/README.md — single-server bundle (VM stack + Drift CP + Caddy/TLS on one box) with a guided installer.
The fast path is the single-server bundle. One Linux host with Docker, one public domain, two minutes of prompts:
VERSION=v0.1.41
curl -L "https://github.com/Scope-Creep-Labs/drift/releases/download/${VERSION}/drift-deploy-${VERSION#v}.tar.gz" | tar -xz
cd "drift-deploy-${VERSION#v}"
./install.sh
install.sh pulls ghcr.io/kidproquo/drift-agent:latest and drift-frontend:latest, so a fresh install lands directly on the current image versions regardless of which bundle tag you used. See deploy/README.md for the full operator walkthrough (DNS, prompts, day-2 ops) and deploy/UPDATES.md for the bundle-vs-image-only release model.
Want to hack on the code instead? See Quickstart below.
Three pillars, all driven from the same chat. The agent uses ~30 tools across them; you don't pick the tools, you describe the goal.
Ask anything about your telemetry. The agent discovers what metrics exist, picks the right query, fetches the data (which never enters the LLM context — see the dataRef pattern), runs statistics, and assembles a streamed response with charts, tables, summaries, and timelines.
> Which hosts are reporting metrics, and what jobs are scraping?
> Show CPU usage on the host over the last 15 minutes.
> Which containers are using the most memory right now?
> Look for anomalies in network traffic over the last hour.
> Compare p95 request latency between dev-cloud and edge devices last week.
> Plot disk I/O on jetson-002 every 5 seconds. ← live chart
> Now change the refresh rate to 1s. ← mutates the same chart in place
> Pull the last 200 error lines from dev-hetzner. ← log search via VictoriaLogs
Outputs: streaming markdown narration, Plotly charts, sortable tables, metric cards with sparkline trends, event timelines, live-refreshing charts.
Drift Deploy registers each device with a small edge agent that polls the control plane every 30s, applies whatever compose bundles you've assigned, and reports back. You drive the whole thing from the same prompt UI — devices, apps, revisions, tagging, deploy-by-tag, rollback. RBAC + per-group scoping keeps non-admins out of devices that aren't theirs. See DEPLOY.md for the full scenario catalog.
> List devices and their groups.
> Show what's deployed to home-pi4-001 right now.
> Tag pi-riffpod-001 with edge, client-z.
> Fork the reporter app as reporter-jetson.
> Save a new revision of reporter-jetson — here's the compose: <paste>
> Deploy reporter-jetson v3 to all devices tagged edge AND client-z.
> Roll home-pi4-001 back to reporter v2.
> Pull last 50 lines of the edge agent on dev-hetzner.
Outputs: propose-then-apply diffs in markdown, deployment status timelines, retry/conflict surfaces, terminal-action blocks, archive downloads (.tar.gz / .zip) of any revision.
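The poll-and-converge behaviour described above can be pictured as a diff between desired and running state. The following is a minimal sketch, not Drift's edge-agent code: `plan_actions` and the "app name to revision number" dict shape are hypothetical, chosen only to illustrate the idea.

```python
def plan_actions(desired: dict[str, int], running: dict[str, int]) -> list[tuple[str, str, int]]:
    """Return (action, app, revision) steps that move `running` towards `desired`."""
    actions = []
    for app, revision in sorted(desired.items()):
        if app not in running:
            actions.append(("deploy", app, revision))
        elif running[app] != revision:
            # Upgrades and rollbacks look the same: the target is just another revision.
            actions.append(("redeploy", app, revision))
    for app in sorted(set(running) - set(desired)):
        actions.append(("remove", app, running[app]))
    return actions


# Hypothetical state seen at one check-in.
desired = {"reporter": 2, "reporter-jetson": 3}
running = {"reporter": 3, "legacy-app": 1}
steps = plan_actions(desired, running)
print(steps)
```

Because the plan is computed from state rather than from a command history, a device that was offline when the change was made produces the same plan whenever it next checks in.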
Investigations end in action. From the same chat, manage vmalert rules and Alertmanager routing, silence noise during planned work, or jump straight into a host shell. The agent uses the same propose-then-apply pattern as deploys so you see exactly what will change before it does. See ALERTING.md for the alert subsystem details.
> List firing alerts.
> Create an alert when CPU > 90% for 5 minutes on any edge device.
> Silence anything from jetson-002 for 2 hours — I'm rebooting it.
> Wire up a webhook so critical alerts ping https://ntfy.sh/drift-alerts.
> Show the receivers configured in alertmanager.
> Open a terminal to home-pi4-001. ← xterm.js, one click in the UI
Outputs: propose-then-apply rule/receiver diffs, alert state timelines, terminal-action blocks, and the in-browser terminal modal (full pty, mux-friendly with TERM=xterm-256color, audited per session).
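The propose-then-apply pattern mentioned above can be sketched in a few lines. This is an illustration under assumptions, not the project's implementation: `ProposalStore`, its method names, and the flat dict used for an alert rule are all hypothetical.

```python
import uuid


class ProposalStore:
    """Holds pending changes until they are explicitly applied."""

    def __init__(self):
        self._pending = {}

    def propose(self, kind: str, before: dict, after: dict) -> dict:
        # Record the change and return only a diff; nothing is mutated yet.
        proposal_id = str(uuid.uuid4())
        self._pending[proposal_id] = (kind, dict(after))
        changed = sorted(k for k in set(before) | set(after) if before.get(k) != after.get(k))
        diff = {k: (before.get(k), after.get(k)) for k in changed}
        return {"proposal_id": proposal_id, "kind": kind, "diff": diff}

    def apply(self, proposal_id: str, state: dict) -> dict:
        # pop() raises KeyError for unknown or already-applied proposals.
        kind, after = self._pending.pop(proposal_id)
        state[kind] = after
        return state


store = ProposalStore()
state = {"alert_rule": {"expr": "cpu > 80", "for": "5m"}}
proposal = store.propose("alert_rule", state["alert_rule"], {"expr": "cpu > 90", "for": "5m"})
expr_before_apply = state["alert_rule"]["expr"]  # state is untouched at this point
store.apply(proposal["proposal_id"], state)
print(proposal["diff"], state)
```

Splitting the mutation into two calls means the diff can be rendered and reviewed before anything changes, and a proposal can only be applied once.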
I wanted to observe and deploy docker-compose stacks across a fleet of Linux hosts — homelab, edge, cloud, corp — from one place, conversationally. The constraints came first; the architecture fell out of them.
- No inbound ports on target devices. Edge agents poll out to the control plane every 30s; nothing listens on the device side. Works behind NAT, firewalls, residential routers, and corp networks without holepunching, port forwards, or VPNs.
- No SSH after the first install. Once the device is commissioned (one curl | bash over SSH), everything happens through the CP: deploys, updates, tag changes, log queries, and even shell access (in-browser via xterm.js, audited per session). The agent script self-updates from the CP via SHA comparison on each check-in, so there is no per-device upgrade chore. Image-baseline changes are the one exception and remain a deliberate, infrequent per-device step.
- Queue-based deploys, not push. Desired state lives on the CP. Targets can be offline when you make a change; when they come back, they converge. No imperative "ssh-and-run" model that breaks when half the fleet is asleep.
- Compose is the contract. Apps are versioned bundles of plain files (compose.yaml + .env + configs). If docker compose up runs it on your laptop, Drift can ship it. Rollback is "deploy revision v2": no proprietary packaging, no special tooling.
- Groups and tags for dynamic filtering. Groups are the RBAC/multi-tenant boundary (one per device); tags are free-form operational labels (edge, client-z, low-power) that overlap freely. Match-all rollouts (deploy to tags=["edge","client-z"]) handle the cross-cutting cases that groups alone can't.
- Lean on the proven observability stack. VictoriaMetrics + VictoriaLogs + vmalert + Alertmanager + Grafana + node-exporter + cAdvisor + Vector: lightweight, replaceable, no homegrown protocols. Drift builds the interaction layer, not another TSDB.
- PromQL as the query language. The agent generates and runs PromQL; the operator never has to see it. Anything that speaks the Prometheus query API plugs in (VM, Prometheus, Thanos, Mimir).
- Tool calling to extend the agent, not fine-tuning. New capability = a function in app/tools/*.py plus a JSON schema. No retraining, no embeddings store, no RAG. Telemetry data flows through tools and stays out of the LLM context (the dataRef pattern), so analysis is precise (numpy/scipy actually computes) while the model orchestrates. This stops the "LLM hallucinated a p95" failure mode and keeps token cost flat regardless of fleet size.
- Propose-then-apply for every mutation. The LLM never silently changes state. Creating an alert rule, deploying a bundle, editing a route: each goes through a propose_* tool that surfaces the diff before apply_* runs. This is how you let an LLM touch production.
- Watch the investigation, not just the answer. Tool calls, narration, intermediate charts, and results are all painted progressively as the agent works. No 30-second blank wait followed by a wall of text. Trust comes from seeing how the result was reached.
- Self-hosted, self-owned. One Caddy + the Drift CP + a TSDB on a single Linux box. Your devices, your data, your model key. No SaaS phone-home, no per-device subscription, no vendor.
- Bring-your-own model. Claude Opus 4.7 is the default for its quality on agentic loops, but MODEL=… + the engine adapter pattern let you point at Sonnet, Haiku, or anything else. The frontend doesn't know which model is running.
- RBAC + per-group scoping out of the box. Three roles (observe < deploy < admin), per-user group membership scopes which devices a user can see/touch, separate registry credentials per group. Multi-tenant from day one rather than retrofit.
- Host-CA injection for corp networks. install.sh detects the host's combined CA bundle and propagates it to the agent plus every deployed app (mounted at the standard Debian + Alpine paths, plus SSL_CERT_FILE / CURL_CA_BUNDLE in env). Ship to devices sitting behind a TLS-intercepting corp proxy without per-app workarounds.
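The match-all tag rollout from the list above amounts to a subset test on each device's tag set. A minimal sketch follows, assuming a hypothetical in-memory fleet mapping; the tag assignments are illustrative, not taken from a real deployment.

```python
def devices_matching_all(devices: dict[str, set[str]], required: set[str]) -> list[str]:
    """Return the names of devices whose tags include every required tag."""
    return sorted(name for name, tags in devices.items() if required <= tags)


# Hypothetical fleet: device name -> free-form tags.
fleet = {
    "pi-riffpod-001": {"edge", "client-z"},
    "jetson-002": {"edge", "low-power"},
    "dev-hetzner": {"cloud"},
}
targets = devices_matching_all(fleet, {"edge", "client-z"})
print(targets)
```

Note that an empty required set matches every device in this sketch, so a real rollout path would likely want to reject it explicitly.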
The same constraints rule out a lot of common shapes: no PaaS-style "give us your code", no per-device daemon you upgrade by hand, no log-aggregator-as-a-service, no "let the LLM read all your data" RAG, no listening sockets on target devices.
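The self-update check mentioned in the list above (SHA comparison on each check-in) might reduce to something like the following. The source does not say which SHA variant is used, so SHA-256 is an assumption here, and `needs_update` is a hypothetical helper name.

```python
import hashlib


def needs_update(local_script: bytes, advertised_sha256: str) -> bool:
    """True when the local agent script differs from the digest the CP advertises."""
    return hashlib.sha256(local_script).hexdigest() != advertised_sha256.strip().lower()


current = b"#!/bin/sh\necho agent v1\n"
latest = b"#!/bin/sh\necho agent v2\n"
same = needs_update(current, hashlib.sha256(current).hexdigest())
stale = needs_update(current, hashlib.sha256(latest).hexdigest())
print(same, stale)
```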
The agent operates on metadata — names, labels, summaries, configs by reference — not on raw secrets or raw bulk data. The boundary is enforced in code, not by prompting the model to behave.
What the LLM has access to:
- Names and metadata: metric / label / job names, device names + groups + tags + statuses, app / revision metadata, alert rule names + expressions + labels, receiver names + webhook URLs, session metadata.
- File contents of compose bundles when explicitly fetched via get_app_revision (typically ${VAR} references; the actual values come from device-side env).
- Time-series summaries (n, mean, p50, p95, min, max, …) computed server-side from each query. Raw arrays stay server-side under a prom://<uuid> dataRef and are pushed straight to the UI via SSE (the dataRef pattern).
- Log lines returned by query_logs: the same content you'd see in docker logs on the device.
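To make the dataRef idea concrete, here is a rough sketch of "store the raw array, return only a summary". It is not the project's code: `store_and_summarize` and `RAW_SERIES` are hypothetical names, the standard library stands in for numpy/scipy, and the p95 uses a simple nearest-rank rule chosen for brevity.

```python
import statistics
import uuid

# Raw samples stay in this server-side registry, keyed by dataRef.
RAW_SERIES: dict[str, list[float]] = {}


def store_and_summarize(values: list[float]) -> dict:
    """Keep the raw array server-side; return only scalars for the LLM (values must be non-empty)."""
    ref = f"prom://{uuid.uuid4()}"
    RAW_SERIES[ref] = list(values)
    ordered = sorted(values)
    n = len(ordered)
    rank = (95 * n + 99) // 100  # integer ceil(0.95 * n), nearest-rank p95
    return {
        "dataRef": ref,
        "n": n,
        "mean": statistics.fmean(ordered),
        "p50": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "min": ordered[0],
        "max": ordered[-1],
    }


samples = [float(v) for v in range(1, 21)]
summary = store_and_summarize(samples)
print(summary)
```

The returned dict has a fixed size no matter how many samples the query produced, which is why the summary can go into a tool result while the array cannot.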
What the LLM never has access to:
- API keys (ANTHROPIC_API_KEY, etc.) and any other env-var credentials. Env vars don't enter the prompt or the tool-result surface.
- Drift's database password (DRIFT_PG_PASSWORD) and Fernet key (DRIFT_SECRET_KEY).
- Auth secrets for the TSDB / vmalert / Alertmanager (VM_BASIC_AUTH, VMALERT_BASIC_AUTH, ALERTMANAGER_BASIC_AUTH, etc.). Tool handlers attach these directly to outbound httpx calls.
- Registry credentials: encrypted at rest with DRIFT_SECRET_KEY, decrypted only per device check-in, shipped over TLS straight to the edge agent. Operators set them via a UI modal that bypasses the LLM entirely.
- Alertmanager receiver secrets (bearer tokens, webhook auth). The agent only calls Path.exists() on am-secrets/* filenames and emits a path reference (bearer_token_file: /etc/alertmanager/secrets/<name>). Alertmanager opens the file at notify time; the LLM never sees the bytes.
- Raw time-series arrays: kept under server-side dataRefs, streamed to the UI out-of-band.
- Web-terminal bytes: pty stdio flows agent ↔ edge over a dedicated WebSocket and never reaches the LLM.
- User passwords: set + verify happen server-side via passlib; the LLM has no read path to the password column.
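The "reference the secret by path, never read it" rule for Alertmanager receivers can be sketched as follows. `bearer_token_ref` is a hypothetical helper, and the temporary directory stands in for the am-secrets/ folder; only the Path.exists() check and the emitted path shape come from the description above.

```python
import tempfile
from pathlib import Path


def bearer_token_ref(secrets_dir: Path, name: str) -> dict:
    """Check that the secret file exists and return a path reference, without opening it."""
    if not name or Path(name).name != name:
        raise ValueError("secret name must be a bare filename")
    if not (secrets_dir / name).exists():
        raise FileNotFoundError(name)
    return {"bearer_token_file": f"/etc/alertmanager/secrets/{name}"}


with tempfile.TemporaryDirectory() as tmp:
    secrets_dir = Path(tmp)
    (secrets_dir / "ntfy-token").write_text("s3cret")  # placed by the operator, not the agent
    ref = bearer_token_ref(secrets_dir, "ntfy-token")
    try:
        bearer_token_ref(secrets_dir, "../etc/passwd")
        traversal_blocked = False
    except ValueError:
        traversal_blocked = True
print(ref, traversal_blocked)
```

Rejecting anything that is not a bare filename keeps a crafted name from escaping the secrets directory.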
Three places where sensitive content briefly touches the chat surface:
- create_user / reset_user_password return a server-generated password ONCE in the tool response, which renders into the chat trace. Hand it to the user out-of-band and clear the investigation afterwards. The self-service "change my password" sidebar flow keeps the new password off the chat entirely.
- commission_device returns a one-shot bootstrap token in the curl line it generates. The token is single-use (once a device claims it, it's exhausted) and acts as a device-commissioning credential, not a long-lived secret.
- If you paste compose contents with literal secrets in .env into the prompt, the LLM sees what you typed. Use ${VAR} references resolved on the device, or the registry-credentials modal for image-pull tokens; both keep secrets off the chat.
- Frontend: React + Material UI dark theme, Plotly time-series charts, real-time streaming UI that surfaces the agent's thinking and tool calls. Sidebar lists devices and apps in your groups; xterm.js opens a host shell in one click.
- Backend: FastAPI agent powered by Claude Opus 4.7 (default; configurable via MODEL) with adaptive thinking, prompt caching, and ~30 tools across discovery / query / analysis / fleet / alerts / render-block emission.
- Multi-user RBAC: login + cookie sessions, three roles (observe < deploy < admin), user-group scoping for devices, audit log for terminal sessions. Bootstrap an admin via env vars; manage the rest from chat or the admin API.
- Drift Deploy: promote a compose bundle as an "app", push to one device or every device matching a tag set, watch the edge agents reconcile in real time. Per-group registry credentials, edge-agent self-update, retry budgets, conflict detection, host-CA injection for corp PKI.
- Live charts: make_live_chart polls a server-side PromQL passthrough on a timer and Plotly.react-diffs in place; mutating one keeps zoom/hover.
- Compose stack: slim Docker images for both services. Brings its own TSDB? No. Point it at any Prometheus-compatible source via VM_URL. The bundled single-server install adds VictoriaMetrics, VictoriaLogs, vmalert, Alertmanager, Grafana, and Caddy/TLS.
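The ordered roles (observe < deploy < admin) combined with group scoping can be modelled with a comparable enum. This is a sketch under assumptions: `Role` and `can` are hypothetical names, and letting admins bypass group scoping is a guess based on the "keeps non-admins out" wording, not confirmed behaviour.

```python
from enum import IntEnum


class Role(IntEnum):
    OBSERVE = 1
    DEPLOY = 2
    ADMIN = 3


def can(user_role: Role, required: Role, user_groups: set[str], device_group: str) -> bool:
    """Allow an action when the role is high enough and the device is in scope."""
    in_scope = user_role == Role.ADMIN or device_group in user_groups  # admin bypass is assumed
    return user_role >= required and in_scope


checks = [
    can(Role.DEPLOY, Role.OBSERVE, {"homelab"}, "homelab"),  # higher role covers lower
    can(Role.OBSERVE, Role.DEPLOY, {"homelab"}, "homelab"),  # role too low
    can(Role.DEPLOY, Role.DEPLOY, {"homelab"}, "corp"),      # device outside the user's groups
    can(Role.ADMIN, Role.DEPLOY, set(), "corp"),             # admin, assumed unscoped
]
print(checks)
```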
| Tool | Version | Why |
|---|---|---|
| Docker | ≥ 24.0 | Recommended path for running everything. |
| Docker Compose | ≥ 2.20 | Bundled with Docker Desktop. |
| Node.js | ≥ 20 | Local frontend dev (alternative to Docker). |
| Python | ≥ 3.12 | Local backend dev (alternative to Docker). |
| Anthropic API key | — | Required for the agent to actually call the LLM. Get one at https://console.anthropic.com. |
You also need a Prometheus-compatible time-series source the agent can reach:
- Your VictoriaMetrics (single-node or vmselect cluster) via VM_URL.
- Any Prometheus-API-compatible store (Prometheus, Thanos, Grafana Mimir, etc.).
On this host, a VM stack lives at /root/setup/victoria/ (single-node VM on :8428, vmauth basic-auth proxy on :8427, Grafana on :3000) with a vmagent + cadvisor reporter at /root/setup/victoria/reporter/. The shipped .env.example shows how to point Drift at it via the public vmauth URL.
Two paths, pick whichever fits.
git clone <this repo>
cd drift
cp .env.example .env
$EDITOR .env # ANTHROPIC_API_KEY plus VM_URL (and VM_BASIC_AUTH / VM_BEARER_TOKEN if needed)
docker compose up --build
The frontend is exposed on host port 10001 (mapped to nginx :80 in the container). Open http://localhost:10001 for direct access, or wire it up behind a reverse proxy at the path of your choice (this repo's deployment is at https://drift.example.com/drift/). Try:
- "Which hosts are reporting metrics, and what jobs are scraping?"
- "Show CPU usage on the host over the last 15 minutes."
- "Which containers are using the most memory right now?"
- "Look for anomalies in network traffic over the last hour."
For VM cluster (vmselect): set VM_TENANT_PATH=/select/<accountID>/prometheus. For auth: set VM_BASIC_AUTH=user:pass or VM_BEARER_TOKEN=....
If your VM is on the docker host (not in another compose stack on a shared network), use VM_URL=http://host.docker.internal:8428 and add extra_hosts: ["host.docker.internal:host-gateway"] to the drift-agent service.
Run the backend in a venv and the frontend in Vite's dev server. Best for iterating on code.
Backend:
cd drift-agent
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
cp .env.example .env # set ANTHROPIC_API_KEY + VM_URL (must be reachable from your machine)
uvicorn app.main:app --reload --host 127.0.0.1 --port 8000
Frontend (in another terminal):
npm install
cat > .env.local <<EOF
VITE_ENGINE=agent
VITE_AGENT_DEV_URL=http://localhost:8000
EOF
npm run dev
Open http://localhost:5173. Vite's dev proxy forwards /api/* to the backend, so the same code path works in dev as in Docker.
To iterate on the UI without spending API credits or running a backend, set VITE_ENGINE=mock instead. The frontend ships 5 hard-coded scenarios with synthetic data.
All env vars live in two files:
- Root .env: read by docker-compose.yml and substituted into both services.
- drift-agent/.env: read by the agent when running locally with uvicorn (outside Docker).
In Docker, only the root .env matters. In local dev, both can exist independently — the frontend reads drift/.env.local, the agent reads drift-agent/.env.
| Variable | Required | Default | Notes |
|---|---|---|---|
| ANTHROPIC_API_KEY | yes | none | Claude API key. |
| VM_URL | yes | none | Base URL of your VictoriaMetrics / Prometheus. |
| VM_TENANT_PATH | no | "" | /select/<id>/prometheus for vmselect; empty for single-node. |
| VM_BASIC_AUTH | no | "" | user:pass. Sent as Authorization: Basic. |
| VM_BEARER_TOKEN | no | "" | Sent as Authorization: Bearer <token>. |
| MODEL | no | claude-opus-4-7 | Any current Claude model ID. |
| EFFORT | no | high | low / medium / high / xhigh / max. |
| MAX_TOKENS | no | 64000 | Per-iteration max_tokens. |
| ALLOWED_ORIGINS | no | http://localhost:5173,http://127.0.0.1:5173 | Comma-separated CORS allowlist for the FastAPI app. Compose sets this to the frontend origin. |
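As an illustration of how the VM_* settings in the table combine, the sketch below builds a query URL and auth header from an env-style dict. `vm_request` is a hypothetical helper, the host name is made up, and preferring the bearer token when both auth settings are present is an assumption; the agent's real precedence is not documented here.

```python
import base64


def vm_request(env: dict[str, str], api_path: str) -> tuple[str, dict[str, str]]:
    """Return (url, headers) for a Prometheus-API call against VM_URL."""
    base = env["VM_URL"].rstrip("/")
    tenant = env.get("VM_TENANT_PATH", "").rstrip("/")  # empty for single-node
    headers: dict[str, str] = {}
    if env.get("VM_BEARER_TOKEN"):
        headers["Authorization"] = f"Bearer {env['VM_BEARER_TOKEN']}"
    elif env.get("VM_BASIC_AUTH"):
        encoded = base64.b64encode(env["VM_BASIC_AUTH"].encode("utf-8")).decode("ascii")
        headers["Authorization"] = f"Basic {encoded}"
    return f"{base}{tenant}{api_path}", headers


cluster_env = {
    "VM_URL": "http://vmselect.internal:8481/",  # hypothetical host
    "VM_TENANT_PATH": "/select/0/prometheus",
    "VM_BASIC_AUTH": "user:pass",
}
url, headers = vm_request(cluster_env, "/api/v1/labels")
print(url, headers)
```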
| Variable | Default | Notes |
|---|---|---|
| VITE_ENGINE | mock | agent for the real backend, mock for synthetic. |
| VITE_API_BASE | /api | Base URL the AgentAdapter POSTs to. |
| VITE_AGENT_DEV_URL | http://localhost:8000 | Where Vite's dev proxy forwards /api/*. Dev only. |
In Docker, the frontend image is built with VITE_ENGINE=agent and VITE_API_BASE=/api (via build args in the Dockerfile). Override at build time with --build-arg VITE_ENGINE=mock if you want the UI without the backend.
Top level:
drift/
├── README.md this file
├── ARCHITECTURE.md deep dive: data flow, agent loop, dataRef pattern, tool catalog
├── ALERTING.md vmalert + Alertmanager subsystem; alert/silence/receiver tools
├── DEPLOY.md Drift Deploy user guide; deploy/commission/migrate scenarios
├── docker-compose.yml frontend + agent
├── Dockerfile frontend: alpine node builder + nginx alpine runtime
├── nginx.conf SPA + SSE-friendly /api proxy
├── package.json frontend dependencies
├── tsconfig.json
├── vite.config.ts dev proxy /api → VITE_AGENT_DEV_URL
├── index.html
├── src/ React frontend
├── drift-agent/ Python backend (FastAPI + agent + tools)
└── spec/ original product specs (reference only)
For a full file-by-file breakdown, see ARCHITECTURE.md → File reference.
Edit one of drift-agent/app/tools/{metrics,analysis,emit}.py:
- Define an async def my_tool(ctx, args) returning a JSON-serializable dict.
- Add an entry to that file's *_TOOLS list (JSON Schema describing inputs).
- Register the handler in *_HANDLERS.
The agent picks it up automatically on next request — system prompt and tools list rebuild from the registries on import.
See ARCHITECTURE.md → Extension points.
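The three steps for adding a tool might look roughly like this in one file. Everything below is a hypothetical sketch: the `count_series` tool, the `EXAMPLE_TOOLS` / `EXAMPLE_HANDLERS` registry names, the dict-shaped `ctx`, and the `dispatch` helper are invented for illustration. Only the `async def my_tool(ctx, args)` shape and the Anthropic-style `name` / `description` / `input_schema` tool definition are taken as given.

```python
import asyncio


async def count_series(ctx, args):
    """Hypothetical tool: count known series names containing a substring."""
    matches = [name for name in ctx["series_names"] if args["contains"] in name]
    return {"count": len(matches), "matches": matches}  # JSON-serializable dict


# Step 2: schema entry describing the tool's inputs.
EXAMPLE_TOOLS = [{
    "name": "count_series",
    "description": "Count series whose name contains a substring.",
    "input_schema": {
        "type": "object",
        "properties": {"contains": {"type": "string"}},
        "required": ["contains"],
    },
}]

# Step 3: handler registry mapping tool name to coroutine.
EXAMPLE_HANDLERS = {"count_series": count_series}


async def dispatch(name, ctx, args):
    """Look up the handler by tool name and await it."""
    return await EXAMPLE_HANDLERS[name](ctx, args)


ctx = {"series_names": ["node_cpu_seconds_total", "node_memory_MemFree_bytes", "up"]}
result = asyncio.run(dispatch("count_series", ctx, {"contains": "node_"}))
print(result)
```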
- Add the variant to src/types/blocks.ts and drift-agent/app/schemas.py.
- Write a React component under src/components/blocks/.
- Register it in src/components/blocks/BlockRenderer.tsx.
- Add an emit tool in drift-agent/app/tools/emit.py.
Set VITE_ENGINE=mock in .env.local. The Mock adapter synthesizes a fake event stream from 5 hard-coded scenarios (gateway-17 instability, fleet thermal, dispatch optimization, v2.8 regression, latency correlation). The streaming UI works the same.
# Frontend
npx tsc --noEmit
npm run build
# Backend
cd drift-agent && .venv/bin/python -c "from app.main import app; print('OK')"
There are no automated tests yet; verification is manual end-to-end via the UI.
Set MODEL=claude-sonnet-4-6 (or any current Claude ID) in .env. Adjust EFFORT for the cost/quality balance you want. Restart the agent.
To use a different LLM provider entirely, refactor drift-agent/app/agent.py:run_agent. The SSE protocol stays the same, so no frontend changes are needed.
Agent fails to start with "1 validation error for Settings: anthropic_api_key".
You haven't set ANTHROPIC_API_KEY in .env. The Settings class requires it.
Agent starts but /investigate returns an error: "anthropic_api_error: ...".
Either the key is invalid, the model ID is wrong, or you've hit a rate limit. Check the agent's logs (docker compose logs drift-agent or the uvicorn terminal).
Agent runs but every tool call fails with HTTP timeout / connection refused.
The agent can't reach VM_URL. From inside the agent container: docker compose exec drift-agent curl -s "$VM_URL/api/v1/labels". Common causes:
- VM is on the docker host but VM_URL=http://localhost:8428. Containers can't see the host's localhost. Use http://host.docker.internal:8428 with an extra_hosts: host-gateway mapping, or attach drift to the VM stack's docker network.
- Auth required but VM_BASIC_AUTH / VM_BEARER_TOKEN not set (e.g. you're hitting vmauth on :8427, not vm :8428).
- Firewall / Tailscale not connected.
- vmselect cluster but you forgot VM_TENANT_PATH=/select/0/prometheus.
Agent fetches data but charts in the UI show "Chart data is no longer in cache".
You reloaded the page. The dataRegistry is in-memory only, so re-run the prompt to refetch.
cache_read_input_tokens shows 0 across consecutive turns.
Something invalidated the prompt cache prefix. Look in drift-agent/app/agent.py for non-deterministic content in SYSTEM_PROMPT or the tools list (timestamps, UUIDs, varying tool order). The prefix must be byte-stable across calls.
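One way to hunt for an unstable prefix is to fingerprint exactly what is sent and compare across turns. The helpers below are a diagnostic sketch, not part of the codebase: `prefix_fingerprint` and `stable_tools` are hypothetical names, and the tool descriptions are placeholders.

```python
import hashlib
import json


def prefix_fingerprint(system_prompt: str, tools: list[dict]) -> str:
    """Hash the prompt and tools in the order given, so any drift changes the digest."""
    payload = json.dumps({"system": system_prompt, "tools": tools}, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def stable_tools(tools: list[dict]) -> list[dict]:
    """Pin tool order by name so registry iteration order cannot vary between calls."""
    return sorted(tools, key=lambda tool: tool["name"])


query = {"name": "query_range", "description": "Run a PromQL range query."}
logs = {"name": "query_logs", "description": "Search logs."}
prompt = "You are Drift."

# Same tools, different order: the fingerprint changes, as the cache prefix would.
drifted = prefix_fingerprint(prompt, [query, logs]) != prefix_fingerprint(prompt, [logs, query])
# After pinning the order, both calls produce the same fingerprint.
pinned = (prefix_fingerprint(prompt, stable_tools([query, logs]))
          == prefix_fingerprint(prompt, stable_tools([logs, query])))
print(drifted, pinned)
```

Logging the fingerprint once per turn makes it easy to see which request first diverged.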
"Failed to load chart data: dataRef not found: prom://..."
The agent emitted a chart referencing a ref that wasn't pushed via a data event. Check the agent logs; usually means an emit tool fired before the underlying query_range succeeded. File a bug.
Vite dev server won't start with port-in-use error.
lsof -ti:5173 | xargs kill to free the port.
Docker build fails on npm ci.
Delete node_modules/ locally before building (Docker's COPY may have picked up a partial install).
Frontend serves but /api/* returns 502 in nginx.
The agent container isn't healthy. docker compose ps should show it running and healthy. If not, docker compose logs drift-agent.
Agent runs slowly / takes 30+ seconds.
Normal for complex investigations — claude-opus-4-7 with effort=high is thorough. Lower EFFORT=medium if you need faster, less exhaustive responses.
- Persistence: investigation history is in localStorage under key drift.investigations.v2. Chart trace data is in-memory only (see ARCHITECTURE.md → The dataRef pattern). User auth, devices, apps, registry creds, terminal session metadata live in Postgres (the drift-postgres service in compose). Token usage is reported as metrics into VictoriaMetrics so the sidebar's per-user "$X used" survives drift-agent restarts.
- Agent loop cap: the loop is bounded at 20 LLM iterations; most investigations finish in 4–8.
- No automated tests yet. Verification is manual via the UI.
- Bootstrap admin: set DRIFT_ADMIN_USERNAME + DRIFT_ADMIN_PASSWORD in .env for first-run admin creation. Subsequent users are created from chat or the admin API.
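The iteration cap noted above can be pictured as a bounded loop around a "one LLM turn" step. This is a simplified sketch: `run_bounded` and the `step` callback (which returns True when the investigation is finished) are hypothetical stand-ins for the real agent loop.

```python
MAX_ITERATIONS = 20  # cap taken from the note above


def run_bounded(step, max_iterations: int = MAX_ITERATIONS) -> dict:
    """Call step(iteration) until it reports completion or the cap is reached."""
    for iteration in range(1, max_iterations + 1):
        if step(iteration):
            return {"finished": True, "iterations": iteration}
    return {"finished": False, "iterations": max_iterations}


typical = run_bounded(lambda i: i >= 6)  # finishes on the sixth turn
runaway = run_bounded(lambda i: False)   # never finishes, stopped by the cap
print(typical, runaway)
```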
Drift is licensed under the Apache License 2.0. Copyright 2026 Scope Creep Labs LLC.
Contributions are welcome — bug fixes, features, docs, edge-agent ports to new platforms. See CONTRIBUTING.md for the development setup and PR guidelines.
All contributors must sign the Individual Contributor License Agreement. Our CLA Assistant bot posts a one-click signing link on your first pull request; sign once and it covers every future PR. The CLA permits Scope Creep Labs LLC to relicense future versions of the project under different terms — Apache 2.0 on existing releases is permanent.
For security reports, please email support@scopecreeplabs.com rather than opening a public issue.