Fleet

Fleet control plane

The Argus Fleet control plane is a standalone, opt-in service that gives operators a single, readable view across many Argus-instrumented bot processes ("clusters"), grouped into regions ("fleets"). It is a separate process and a separate container image; it is never embedded in a bot.

If you run one bot process, you do not need this: the built-in Dashboard already shows it. Reach for the fleet control plane when you run several processes (shards, regions, or many bots) and want one pane that rolls them up without standing up Grafana and writing PromQL.

The fleet path is operational aggregation only. It still never introduces a guild_id/user_id/channel_id Prometheus label; per-entity questions stay on the analytical path (see History and ClickHouse).

Four tiers

Global   all fleets rolled up         (every cluster)
  |
Fleet    one region, e.g. "asia"      (a group of clusters)
  |
Cluster  one bot process              (a single /metrics owner)
  |
Shard    one gateway shard            (up/down + heartbeat latency)

You drill down Global -> Fleet -> Cluster -> Shard in the UI. Each cluster shows the fixed readable metric set (latency, shards up, guilds, cached users, interactions/sec, error rate, command p95, rate limits/sec, uptime), colour graded against sensible thresholds, plus an inline grid of its shards (id, up/down, latency) and recent trend sparklines. No PromQL or Grafana setup needed.

Optionally the same pane also serves per-guild analytics (see below), so one dashboard covers operational rollups and analytics, fleet-wide and per bot.

Why a separate service

/metrics and the live gauges must stay inside each bot process because they read live bot state (bot.latencies, bot.guilds, ...) at scrape time. That is unchanged. What moves out is the aggregation and the fleet UI, which are resource-heavy and should not compete with the bot. So the control plane is its own deployable, and bots only make a tiny outbound heartbeat to it.

Two data sources

The control plane merges two interchangeable sources behind one interface, both rendering the same UI. Either or both can be active.

Source	Needs	How values arrive
Push	nothing (built in)	members POST a snapshot on each heartbeat; the latest per cluster is the data. NAT friendly: members only call out.
Prometheus	an existing Prometheus	the control plane queries a curated PromQL catalog grouped by the `cluster` label.

Default is push (members always register). Set ARGUS_FLEET_PROMETHEUS_URL to also read from Prometheus; when both are active, Prometheus values take precedence and push fills any gaps.
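The precedence rule can be pictured as a per-metric merge. The sketch below is only an illustration of "Prometheus wins, push fills gaps"; merge_cluster_metrics and the metric keys are hypothetical names, not the control plane's actual code.

```python
from typing import Mapping, Optional

Metrics = Mapping[str, Optional[float]]


def merge_cluster_metrics(prometheus: Metrics, push: Metrics) -> dict[str, float]:
    """Prefer Prometheus values; fall back to push where Prometheus has none."""
    merged: dict[str, float] = {}
    for name in set(prometheus) | set(push):
        value = prometheus.get(name)
        if value is None:
            # Missing or null in Prometheus: let the push snapshot fill the gap.
            value = push.get(name)
        if value is not None:
            merged[name] = value
    return merged


if __name__ == "__main__":
    prom = {"latency_ms": 42.0, "guilds": None}
    push = {"latency_ms": 50.0, "guilds": 1200.0, "uptime_s": 3600.0}
    print(merge_cluster_metrics(prom, push))
```

A metric that neither source reports is simply absent from the result, which leaves it to the UI to render it as unknown rather than as zero.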

A note on rates (push source)

Per-second rates (interactions/sec, rate limits/sec) are computed as the difference between two samples, so they read 0 on the push source until a second heartbeat arrives, and they stay coarse at long heartbeat intervals because each value is an average over one interval. Counts, latency, p95, and error rate are exact from a single snapshot. If you want precise rates, point the control plane at Prometheus, which computes them with rate().
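To make the two-sample requirement concrete, here is a minimal sketch of deriving a rate from consecutive heartbeats. Sample and per_second_rate are hypothetical names, and treating a decreasing counter as a reset is an assumption, not documented behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    timestamp: float  # seconds, taken when the heartbeat was received
    counter: float    # cumulative count, e.g. total interactions so far


def per_second_rate(previous: Optional[Sample], current: Sample) -> float:
    """Average rate between two heartbeats, or 0.0 when it cannot be computed."""
    if previous is None:
        return 0.0  # only one sample so far: nothing to take a difference against
    elapsed = current.timestamp - previous.timestamp
    delta = current.counter - previous.counter
    if elapsed <= 0 or delta < 0:
        return 0.0  # clock oddity or counter reset (assumed: the bot restarted)
    return delta / elapsed


if __name__ == "__main__":
    first = Sample(timestamp=0.0, counter=100.0)
    second = Sample(timestamp=15.0, counter=160.0)
    print(per_second_rate(None, first))    # first heartbeat: no rate yet
    print(per_second_rate(first, second))  # average over the 15 s interval
```

The result is an average over one heartbeat interval, so a short burst inside a long interval is flattened; that is the coarseness described above.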

The registry: identity, numbering, health

The registry is the source of truth for topology, identity, and health; metric values come from a data source and join on the cluster id.

Identity is fleet_id if you set it, else cluster_id if you set that, else an auto-generated UUID persisted to the member's state dir (so a restart keeps the same identity). Falling back to cluster_id means the identity equals the Prometheus cluster label, so the push and Prometheus sources join the same cluster with no extra config.
Fleet (region) is the member's fleet_group (default: "default").
Numbering is per fleet, monotonic, and never reused. The first cluster in asia is asia #1, the next asia #2, and so on. A dead cluster keeps its number and is shown down (it does not free its slot); a brand new cluster gets the next number, not the dead one's. A reconnecting identity reclaims its own number. (A standard lease + heartbeat-TTL + monotonic-token pattern.)
Health is a lease: a cluster is up while now - last_seen is within heartbeat_interval * ttl_factor, else down. A background sweeper updates status on the same interval.
Persistence is a JSON file at ARGUS_FLEET_STATE, so numbers and topology survive a control-plane restart. Mount a volume for it in production.
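The numbering and lease rules above can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the real registry: the class and field names are hypothetical, persistence and the background sweeper are omitted, and "within" is read as inclusive.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    identity: str
    fleet: str
    number: int
    last_seen: float


@dataclass
class Registry:
    heartbeat_interval: float = 15.0
    ttl_factor: float = 3.0
    clusters: dict[str, Cluster] = field(default_factory=dict)  # identity -> cluster
    next_number: dict[str, int] = field(default_factory=dict)   # fleet -> next unused number

    def register(self, identity: str, fleet: str, now: float) -> int:
        known = self.clusters.get(identity)
        if known is not None:
            known.last_seen = now  # a reconnecting identity reclaims its own number
            return known.number
        number = self.next_number.get(fleet, 1)
        self.next_number[fleet] = number + 1  # monotonic: a number is never handed out twice
        self.clusters[identity] = Cluster(identity, fleet, number, now)
        return number

    def heartbeat(self, identity: str, now: float) -> bool:
        cluster = self.clusters.get(identity)
        if cluster is None:
            return False  # unknown member: the caller should re-register
        cluster.last_seen = now
        return True

    def is_up(self, identity: str, now: float) -> bool:
        cluster = self.clusters[identity]
        return now - cluster.last_seen <= self.heartbeat_interval * self.ttl_factor


if __name__ == "__main__":
    reg = Registry()
    print(reg.register("bot-a", "asia", now=0.0))   # first cluster in asia
    print(reg.register("bot-b", "asia", now=0.0))   # second cluster in asia
    print(reg.is_up("bot-a", now=100.0))            # lease expired: shown down
    print(reg.register("bot-c", "asia", now=100.0)) # new cluster gets a fresh number
```

Because the counter only moves forward, a dead cluster keeps its slot and a newcomer never inherits it, which keeps "asia #1" meaning the same process across restarts.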

Cluster-to-fleet grouping lives entirely in the registry, so operational metrics need no extra fleet label (no added Prometheus cardinality, no change to the core catalogue).

Setup wizard and diagnostics

The fastest way to stand up a control plane is the wizard, which mints a token and scaffolds everything:

python -m argus.fleet init      # writes .env + docker-compose.fleet.yml, prints the member snippet
python -m argus.fleet doctor --url http://fleet-host:9190 --token secret   # probe a running plane

init also prints a ready-made Prometheus http_sd scrape config. doctor reads the fleet view, so pass it the viewer token; it checks reachability, auth, cluster health, and (with --namespace) a namespace mismatch. From a bot host, a plain curl .../healthz confirms reachability without any token. To have python -m argus.fleet load the .env automatically, install the extra: pip install 'argus-dpy[fleet]'. Without it, the generated compose env_file or a systemd EnvironmentFile= loads the .env.

Hardening (secure by default)

Refuse-insecure bind: a non-loopback bind with no token refuses to start. Set ARGUS_FLEET_TOKEN (or ARGUS_FLEET_TOKEN_FILE), bind to loopback, or set ARGUS_FLEET_INSECURE=1 (local testing only).
Split tokens (optional): a low-privilege ARGUS_FLEET_INGEST_TOKEN (on every bot) and an ARGUS_FLEET_VIEWER_TOKEN (operators), each falling back to the shared token, so a leaked bot token does not unlock the dashboard. Any token var accepts a comma-separated list, so you can rotate with zero downtime (add the new token, roll it out, drop the old).
Per-identity lease (optional): ARGUS_FLEET_REQUIRE_LEASE=1 makes the plane mint a high-entropy secret at register that the member must present on every heartbeat; a mismatch is 409, so even a leaked ingest token can't take over an existing slot. Stored only as an HMAC-SHA256 digest (optionally peppered via ARGUS_FLEET_SECRET_PEPPER); see Security.
Abuse resistance: request body cap (413); per-IP register, per-identity heartbeat, and per-client read (ARGUS_FLEET_READ_BURST) rate limits (429); and an ARGUS_FLEET_MAX_CLUSTERS cap.
Audit log (optional): ARGUS_FLEET_AUDIT_LOG=1 logs one sanitized INFO line per ingest event (identity, client, outcome); the secret is never logged.
Scanner resistance: the version banner is stripped and security headers (X-Frame-Options, CSP, nosniff) are sent on every response. Front the plane with a TLS reverse proxy (or a tunnel) for any public deploy.
Single writer: an advisory lock on the state file refuses a second instance sharing it. The on-disk state is schema-versioned (unknown versions refuse to load, never truncate).
Self-observability: the plane exposes its own /metrics (register/heartbeat counters, live cluster up/down gauges) and a /readyz.
Retention (optional): ARGUS_FLEET_RETENTION_DAYS prunes long-dead clusters; per-fleet numbers are still never reused.
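As an illustration of the per-identity lease, the sketch below mints a secret, stores only its HMAC-SHA256 digest, and verifies a presented secret in constant time. The function names are hypothetical, and using the pepper as the HMAC key (with an empty key when no pepper is set) is an assumption about how the digest is built.

```python
import hashlib
import hmac
import secrets


def mint_lease_secret() -> str:
    """Create a high-entropy secret to hand to the member at register time."""
    return secrets.token_urlsafe(32)


def lease_digest(secret: str, pepper: bytes = b"") -> str:
    """HMAC-SHA256 of the secret, keyed by an optional server-side pepper."""
    return hmac.new(pepper, secret.encode("utf-8"), hashlib.sha256).hexdigest()


def lease_matches(presented: str, stored_digest: str, pepper: bytes = b"") -> bool:
    """Constant-time check of a presented secret against the stored digest."""
    return hmac.compare_digest(lease_digest(presented, pepper), stored_digest)


if __name__ == "__main__":
    pepper = b"server-side-pepper"          # stands in for ARGUS_FLEET_SECRET_PEPPER
    secret = mint_lease_secret()            # returned to the member once, at register
    stored = lease_digest(secret, pepper)   # only this digest is persisted
    print(lease_matches(secret, stored, pepper))       # the rightful member
    print(lease_matches("guessed", stored, pepper))    # a mismatch maps to a 409
```

Storing a digest rather than the secret means a leaked state file does not by itself let anyone impersonate a member, and the pepper keeps the digests useless without the server-side key.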

Per-guild analytics (one pane for everything)

Set ARGUS_FLEET_CLICKHOUSE_DSN to the same ClickHouse your bots drain per-guild events to (see History and ClickHouse), and the fleet SPA gains an Analytics view: pick a guild to see its top commands and average command duration, fleet-wide or sliced to one bot. Because events now carry a cluster_id, the view can filter per cluster; the analytics API is viewer-token gated and fails closed without a token (invariant 7). This makes the control plane a single pane for both operational rollups and per-guild analytics.

OTLP is deliberately out of scope here: it is a one-way export, so OTLP-shipped metrics are viewed in their own backend, not pulled back into this pane.

Prometheus auto-discovery

Set ARGUS_FLEET_SCRAPE_TARGET=<host:port> on each bot to advertise its metrics address. The control plane then serves GET /api/fleet/targets in Prometheus http_sd format (viewer-gated, with cluster/fleet labels), so one Prometheus pointed at the fleet discovers every bot. python -m argus.fleet init prints the matching scrape config.
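Prometheus http_sd expects a JSON list of target groups, each with a "targets" list and a "labels" map. The sketch below shows how registry entries could be rendered into that shape; http_sd_targets and the input dictionary keys are hypothetical, and the exact label names the control plane emits are an assumption based on the "cluster/fleet labels" wording above.

```python
import json


def http_sd_targets(clusters: list[dict]) -> str:
    """Render clusters that advertise a scrape target as Prometheus http_sd JSON."""
    groups = []
    for cluster in clusters:
        target = cluster.get("scrape_target")
        if not target:
            continue  # members that never advertised a target are skipped
        groups.append({
            "targets": [target],
            "labels": {"cluster": cluster["identity"], "fleet": cluster["fleet"]},
        })
    return json.dumps(groups)


if __name__ == "__main__":
    print(http_sd_targets([
        {"identity": "asia-0", "fleet": "asia", "scrape_target": "10.0.0.5:9100"},
        {"identity": "asia-1", "fleet": "asia", "scrape_target": None},
    ]))
```

Label values in http_sd must be strings, which is why the identity and fleet name are passed through unchanged rather than as numbers.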

Configuration

Server (`python -m argus.fleet`)

env	default	meaning
`ARGUS_FLEET_HOST`	`0.0.0.0`	bind host
`ARGUS_FLEET_PORT`	`9190`	bind port
`ARGUS_FLEET_TOKEN`	(none)	shared secret; gates every route except `/healthz` and `/readyz`
`ARGUS_FLEET_HEARTBEAT_INTERVAL`	`15`	expected member heartbeat seconds
`ARGUS_FLEET_TTL_FACTOR`	`3`	down after `interval * ttl_factor` seconds of silence
`ARGUS_FLEET_STATE`	`argus-fleet-state.json`	registry persistence path
`ARGUS_FLEET_PROMETHEUS_URL`	(none)	also read values from this Prometheus
`ARGUS_FLEET_CLICKHOUSE_DSN`	(none)	shared ClickHouse for the per-guild Analytics view
`ARGUS_NAMESPACE`	`discord`	metric prefix; must match the members

Advanced / hardening (all optional):

env	default	meaning
`ARGUS_FLEET_TOKEN_FILE`	(none)	read the token from a mounted secret file
`ARGUS_FLEET_INGEST_TOKEN` / `_VIEWER_TOKEN`	(shared)	split write/read tokens (+ `*_FILE`)
`ARGUS_FLEET_INSECURE`	`0`	allow a public bind with no token (local testing only)
`ARGUS_FLEET_MAX_BODY_BYTES`	`262144`	request body cap (413 over)
`ARGUS_FLEET_CORS_ORIGINS`	(none)	allowlist for a detached UI (see Clustering/CORS)
`ARGUS_FLEET_VIEW_CACHE_MS`	`1000`	view cache TTL (shared across viewers)
`ARGUS_FLEET_MAX_CLUSTERS`	`5000`	cap on registered clusters
`ARGUS_FLEET_REGISTER_BURST` / `_HEARTBEAT_BURST`	`60`	rate-limit token-bucket burst per 60s
`ARGUS_FLEET_READ_BURST`	`120`	per-client GET (view/api/metrics) rate-limit burst per 60s
`ARGUS_FLEET_REQUIRE_LEASE`	`0`	require the per-identity lease secret on heartbeat/re-register
`ARGUS_FLEET_SECRET_PEPPER` / `_FILE`	(none)	server-side key for the at-rest lease HMAC
`ARGUS_FLEET_AUDIT_LOG`	`0`	log one INFO line per ingest event (identity, client, outcome)
`ARGUS_FLEET_RETENTION_DAYS`	`0`	prune clusters down longer than N days (0 = never)
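For readers unfamiliar with token buckets, the sketch below shows one way a "burst per 60s" limit can behave. It is a generic token bucket, not the control plane's implementation: the class name is hypothetical, and refilling the full burst evenly over the window is an assumption about what the setting means.

```python
class TokenBucket:
    """Allow up to `burst` requests at once, refilled evenly over `window` seconds."""

    def __init__(self, burst: int = 60, window: float = 60.0, now: float = 0.0) -> None:
        self.burst = burst
        self.window = window
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.updated = now

    def allow(self, now: float) -> bool:
        """Spend one token if available; a False result would map to a 429."""
        elapsed = max(0.0, now - self.updated)
        refill = elapsed * self.burst / self.window
        self.tokens = min(float(self.burst), self.tokens + refill)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(burst=3, window=60.0, now=0.0)
    print([bucket.allow(0.0) for _ in range(4)])  # the fourth request is refused
    print(bucket.allow(20.0))                     # one token has refilled by now
```

One bucket per key (client IP for register, identity for heartbeat, client for reads) gives the per-IP, per-identity, and per-client limits listed under Hardening.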

Member (on the bot, opt-in)

These are fields on Argus(bot) and matching env vars. When fleet_url is unset, no fleet code runs at all.

kwarg / env	default	meaning
`fleet_url` / `ARGUS_FLEET_URL`	(none)	opt in; register + heartbeat to this control plane
`fleet_token` / `ARGUS_FLEET_TOKEN`	(none)	the shared token
`fleet_group` / `ARGUS_FLEET_GROUP`	`default`	the region/fleet name
`fleet_id` / `ARGUS_FLEET_ID`	(none)	stable identity; auto-UUID persisted if unset
`fleet_state_dir` / `ARGUS_FLEET_STATE_DIR`	`.`	where the member persists its identity
`fleet_scrape_target` / `ARGUS_FLEET_SCRAPE_TARGET`	(none)	advertise `host:port` for Prometheus `http_sd`

When split tokens are in use, set the bot's fleet_token to the ingest token.
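The identity fallback (fleet_id, then cluster_id, then a persisted UUID) might look roughly like the sketch below on the member side. resolve_identity and the argus-fleet-identity file name are assumptions for illustration; the file Argus actually writes is not named here.

```python
import uuid
from pathlib import Path
from typing import Optional


def resolve_identity(
    fleet_id: Optional[str],
    cluster_id: Optional[str],
    state_dir: str = ".",
) -> str:
    """fleet_id wins, then cluster_id, then a UUID persisted under state_dir."""
    if fleet_id:
        return fleet_id
    if cluster_id:
        return cluster_id
    path = Path(state_dir) / "argus-fleet-identity"  # hypothetical file name
    if path.exists():
        saved = path.read_text(encoding="utf-8").strip()
        if saved:
            return saved  # reuse the identity generated by a previous run
    identity = str(uuid.uuid4())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(identity, encoding="utf-8")
    return identity


if __name__ == "__main__":
    print(resolve_identity("my-fleet-id", "asia-0"))  # explicit fleet_id wins
    print(resolve_identity(None, "asia-0"))           # falls back to cluster_id
```

On hosts with an ephemeral filesystem the persisted UUID is lost on redeploy, so setting fleet_id or cluster_id explicitly is the safer way to keep a stable number.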

The member side is fail-open and bounded: at most one heartbeat is in flight, failures drop the sample and retry on the next tick, and a fleet outage or error never touches the bot loop (mirrors invariant 5). If the control plane forgets a member (a 404 on heartbeat), the member transparently re-registers.
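The fail-open, bounded behaviour can be sketched with asyncio. This is an illustration only: heartbeat_loop and its send callback are hypothetical, and the real member runs indefinitely rather than for a fixed number of ticks.

```python
import asyncio
from typing import Awaitable, Callable


async def heartbeat_loop(
    send: Callable[[], Awaitable[None]],  # hypothetical: POSTs one snapshot
    interval: float,
    timeout: float,
    ticks: int,
) -> int:
    """Run `ticks` heartbeats one after another; return how many were dropped."""
    dropped = 0
    for _ in range(ticks):
        try:
            # Awaiting each send before the next keeps at most one in flight.
            await asyncio.wait_for(send(), timeout=timeout)
        except Exception:
            dropped += 1  # fail open: drop this sample and retry on the next tick
        await asyncio.sleep(interval)
    return dropped


if __name__ == "__main__":
    async def always_down() -> None:
        raise ConnectionError("fleet unreachable")

    # Every heartbeat fails, yet nothing propagates to the caller.
    print(asyncio.run(heartbeat_loop(always_down, interval=0.0, timeout=1.0, ticks=3)))
```

Catching Exception (not BaseException) means task cancellation still propagates, so the loop shuts down cleanly with the bot while ordinary network errors and timeouts are swallowed.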

Because the member only makes outbound calls, it needs no inbound port - the cleanest way to monitor bots on Docker panels (Pterodactyl, PebbleHost, Railway) that can't expose /metrics. See Hosting on bot panels.

HTTP surface

Auth is path-aware: ingest paths need the ingest token, the rest need the viewer token (both default to the shared ARGUS_FLEET_TOKEN), supplied as Authorization: Bearer <token> or ?token=. /healthz and /readyz are open.

Method + path	Who calls it	Token	Body / response
`POST /fleet/register`	member	ingest	`{identity, fleet, version, scrape_target?}` -> `{number}`
`POST /fleet/heartbeat`	member	ingest	`{identity, snapshot?}` -> `204`
`GET /api/fleet/view`	the SPA	viewer	the whole `FleetView` JSON
`GET /api/fleet/cluster?fleet=&number=`	the SPA	viewer	one cluster + history
`GET /api/fleet/targets`	Prometheus	viewer	`http_sd` target list
`GET /api/fleet/analytics/*`	the SPA	viewer	per-guild analytics (only when ClickHouse is configured; fails closed without a token)
`GET /metrics`	Prometheus	viewer	the control plane's own metrics
`GET /api/config`	the SPA	viewer	`{fleet: true, ...}`
`GET /`	browser	viewer	the SPA in fleet mode
`GET /healthz`, `GET /readyz`	anyone	none	liveness / readiness
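For anyone writing their own member or test harness, the two ingest calls in the table can be built with the standard library alone. The helper names below are hypothetical, and the requests are constructed but not sent; the paths, body fields, and bearer header come from the table and the auth note above.

```python
import json
import urllib.request
from typing import Optional


def fleet_request(base_url: str, path: str, token: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST to the control plane."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def register_request(
    base_url: str, token: str, identity: str, fleet: str, version: str,
    scrape_target: Optional[str] = None,
) -> urllib.request.Request:
    body = {"identity": identity, "fleet": fleet, "version": version}
    if scrape_target:
        body["scrape_target"] = scrape_target  # optional field
    return fleet_request(base_url, "/fleet/register", token, body)


def heartbeat_request(
    base_url: str, token: str, identity: str, snapshot: Optional[dict] = None,
) -> urllib.request.Request:
    body: dict = {"identity": identity}
    if snapshot is not None:
        body["snapshot"] = snapshot  # optional field
    return fleet_request(base_url, "/fleet/heartbeat", token, body)


if __name__ == "__main__":
    req = register_request("http://fleet-host:9190", "change-me", "asia-0", "asia", "1.0.0")
    print(req.get_method(), req.full_url)
```

Sending is then a matter of urllib.request.urlopen(req, timeout=5): register answers with the cluster number, a heartbeat answers 204, and a 404 on heartbeat is the cue to register again.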

Deploy

Container (recommended)

docker run -d --name argus-fleet \
  -e ARGUS_FLEET_TOKEN=change-me \
  -p 9190:9190 \
  -v argus-fleet-state:/data \
  ghcr.io/astoristhebrave/argus-fleet:latest

State persists in the /data volume (the image sets ARGUS_FLEET_STATE=/data/argus-fleet-state.json). Pin a version tag in production. To also read from Prometheus, add -e ARGUS_FLEET_PROMETHEUS_URL=http://prometheus:9090.

From the package

pip install argus-dpy
ARGUS_FLEET_TOKEN=change-me python -m argus.fleet

Members opt in

ARGUS_FLEET_URL=http://fleet-host:9190 \
ARGUS_FLEET_TOKEN=change-me \
ARGUS_FLEET_GROUP=asia \
CLUSTER_ID=asia-0 \
    python bot.py

examples/fleet_member_bot.py is a complete, runnable member.

Security

Always set ARGUS_FLEET_TOKEN if the control plane is reachable by anything other than localhost. It gates registration, heartbeats, and the UI/APIs with a constant-time comparison.
Terminate TLS in front of the service (a reverse proxy) for any public deployment; the token is a bearer credential.
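On the operational path the control plane reads aggregate metrics only, so it cannot expose per-guild or per-user data there, by construction. Per-guild analytics exist only when ARGUS_FLEET_CLICKHOUSE_DSN is set, and that API is viewer-token gated and fails closed without a token.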
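A constant-time check against a comma-separated token list (the rotation mechanism described under Hardening) can be sketched as follows. parse_tokens and token_allowed are hypothetical helpers, not the control plane's own functions.

```python
import hmac


def parse_tokens(raw: str) -> list[str]:
    """Split a comma-separated token variable, ignoring blanks and whitespace."""
    return [token.strip() for token in raw.split(",") if token.strip()]


def token_allowed(presented: str, accepted: list[str]) -> bool:
    """Compare against every accepted token; an empty list fails closed."""
    ok = False
    for candidate in accepted:
        # compare_digest does not leak how many leading characters matched.
        if hmac.compare_digest(presented.encode("utf-8"), candidate.encode("utf-8")):
            ok = True  # keep looping so timing does not reveal which token matched
    return ok


if __name__ == "__main__":
    accepted = parse_tokens("old-token, new-token")  # both valid during a rotation
    print(token_allowed("new-token", accepted))
    print(token_allowed("guess", accepted))
```

During a rotation both the old and the new token validate, so bots can be rolled over one at a time before the old value is removed from the list.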

Scaling notes

One control plane can serve a large fleet: heartbeats are tiny and the registry is in memory with a periodic JSON flush. For very large fleets, lengthen ARGUS_FLEET_HEARTBEAT_INTERVAL to reduce write churn.
For multi-region resilience you can run the Prometheus source against a global Prometheus and treat push as the fallback, or run a control plane per region.
See Clustering for the per-process metric/label model the fleet builds on, and the Tutorial Fleet for an end-to-end walkthrough.
