AstorisTheBrave edited this page Jun 21, 2026 · 6 revisions

Fleet control plane

The Argus Fleet control plane is a standalone, opt-in service that gives operators a single, readable view across many Argus-instrumented bot processes ("clusters"), grouped into regions ("fleets"). It is a separate process and a separate container image; it is never embedded in a bot.

If you run one bot process, you do not need this: the built-in Dashboard already shows it. Reach for the fleet control plane when you run several processes (shards, regions, or many bots) and want one pane that rolls them up without standing up Grafana and writing PromQL.

The fleet path is operational aggregation only. It still never introduces a guild_id/user_id/channel_id Prometheus label; per-entity questions stay on the analytical path (see History and ClickHouse).

Four tiers

Global   all fleets rolled up         (every cluster)
  |
Fleet    one region, e.g. "asia"      (a group of clusters)
  |
Cluster  one bot process              (a single /metrics owner)
  |
Shard    one gateway shard            (up/down + heartbeat latency)

You drill down Global -> Fleet -> Cluster -> Shard in the UI. Each cluster shows the fixed readable metric set (latency, shards up, guilds, cached users, interactions/sec, error rate, command p95, rate limits/sec, uptime), colour graded against sensible thresholds, plus an inline grid of its shards (id, up/down, latency) and recent trend sparklines. No PromQL or Grafana setup needed.

Optionally the same pane also serves per-guild analytics (see below), so one dashboard covers operational rollups and analytics, fleet-wide and per bot.

Why a separate service

/metrics and the live gauges must stay inside each bot process because they read live bot state (bot.latencies, bot.guilds, ...) at scrape time. That is unchanged. What moves out is the aggregation and the fleet UI, which are resource heavy and should not compete with the bot. So the control plane is its own deployable, and bots only make a tiny outbound heartbeat to it.

Two data sources

The control plane merges two interchangeable sources behind one interface, both rendering the same UI. Either or both can be active.

| Source | Needs | How values arrive |
|---|---|---|
| Push | nothing (built in) | Members POST a snapshot on each heartbeat; the latest per cluster is the data. NAT friendly: members only call out. |
| Prometheus | an existing Prometheus | The control plane queries a curated PromQL catalogue grouped by the cluster label. |

Default is push (members always register). Set ARGUS_FLEET_PROMETHEUS_URL to also read from Prometheus; when both are active, Prometheus values take precedence and push fills any gaps.
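The precedence rule above can be sketched as a small merge. This is only an illustration, not the control plane's actual code: the function name `merge_sources` and the metric keys are hypothetical, and "gap" is assumed to mean a metric that is missing or `None` on the Prometheus side.

```python
from typing import Dict, Mapping, Optional

Metrics = Mapping[str, Optional[float]]


def merge_sources(prometheus: Metrics, push: Metrics) -> Dict[str, Optional[float]]:
    """Prefer Prometheus values; let push fill anything Prometheus lacks."""
    merged: Dict[str, Optional[float]] = {}
    for name in set(prometheus) | set(push):
        value = prometheus.get(name)
        if value is None:
            # Gap on the Prometheus side: fall back to the pushed snapshot.
            value = push.get(name)
        merged[name] = value
    return merged


# Hypothetical metric names, for illustration only.
view = merge_sources(
    {"latency_ms": 42.0, "guilds": None},
    {"latency_ms": 50.0, "guilds": 120.0, "uptime_s": 3600.0},
)
```

Here `latency_ms` comes from Prometheus, while `guilds` and `uptime_s` are filled from push.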

A note on rates (push source)

Per-second rates (interactions/sec, rate limits/sec) are computed as the difference between two samples, so they read 0 on the push source until a second heartbeat arrives, and may stay coarse at long heartbeat intervals. Counts, latency, p95, and error rate are exact from a single snapshot. For precise rates, point the control plane at Prometheus, which computes them with rate().
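A minimal sketch of why the first heartbeat reads 0: a per-second rate is the counter delta divided by the elapsed time between two snapshots. The `Sample` type and `per_second_rate` function are hypothetical names, and treating a counter reset as a rate of 0 is an assumption made here, not documented behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    timestamp: float  # seconds since some epoch
    counter: float    # running total, e.g. interactions handled so far


def per_second_rate(previous: Optional[Sample], current: Sample) -> float:
    """Rate between two snapshots; 0.0 until a second snapshot exists."""
    if previous is None:
        return 0.0
    elapsed = current.timestamp - previous.timestamp
    delta = current.counter - previous.counter
    if elapsed <= 0 or delta < 0:
        # Assumption: clock skew or a counter reset yields 0 rather than a guess.
        return 0.0
    return delta / elapsed


first = Sample(timestamp=0.0, counter=100.0)
second = Sample(timestamp=15.0, counter=130.0)
```

With a 15 second heartbeat, 30 new interactions between the two samples give 2.0 per second; any burst inside the window is averaged away, which is the coarseness mentioned above.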

The registry: identity, numbering, health

The registry is the source of truth for topology, identity, and health; metric values come from a data source and join on the cluster id.

  • Identity is fleet_id if you set it, else cluster_id if you set that, else an auto-generated UUID persisted to the member's state dir (so a restart keeps the same identity). Falling back to cluster_id means the identity equals the Prometheus cluster label, so the push and Prometheus sources join the same cluster with no extra config.
  • Fleet (region) is the member's fleet_group (default default).
  • Numbering is per fleet, monotonic, and never reused. The first cluster in asia is asia #1, the next asia #2, and so on. A dead cluster keeps its number and is shown down (it does not free its slot); a brand new cluster gets the next number, not the dead one's. A reconnecting identity reclaims its own number. (A standard lease + heartbeat-TTL + monotonic-token pattern.)
  • Health is a lease: a cluster is up while now - last_seen is within heartbeat_interval * ttl_factor, else down. A background sweeper updates status on the same interval.
  • Persistence is a JSON file at ARGUS_FLEET_STATE, so numbers and topology survive a control-plane restart. Mount a volume for it in production.
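The numbering and lease rules in the list above can be sketched as follows. This is an in-memory illustration under stated assumptions, not the real registry: the `Registry` and `Cluster` names are hypothetical, persistence and the sweeper are omitted, and "within" is read as an inclusive bound.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Cluster:
    identity: str
    fleet: str
    number: int
    last_seen: float


@dataclass
class Registry:
    heartbeat_interval: float = 15.0  # documented default
    ttl_factor: float = 3.0           # documented default
    clusters: Dict[str, Cluster] = field(default_factory=dict)
    next_number: Dict[str, int] = field(default_factory=dict)

    def register(self, identity: str, fleet: str, now: float) -> int:
        known = self.clusters.get(identity)
        if known is not None:
            # A reconnecting identity reclaims its own number.
            known.last_seen = now
            return known.number
        # Per-fleet, monotonic, never reused: dead clusters keep their slot.
        number = self.next_number.get(fleet, 1)
        self.next_number[fleet] = number + 1
        self.clusters[identity] = Cluster(identity, fleet, number, now)
        return number

    def is_up(self, identity: str, now: float) -> bool:
        lease = self.heartbeat_interval * self.ttl_factor
        return now - self.clusters[identity].last_seen <= lease


registry = Registry()
a = registry.register("asia-0", "asia", now=0.0)
b = registry.register("asia-1", "asia", now=0.0)
```

With the defaults, a cluster goes down after 45 seconds of silence, and a new cluster registering later still gets the next number rather than a dead cluster's slot.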

Cluster-to-fleet grouping lives entirely in the registry, so operational metrics need no extra fleet label (no added Prometheus cardinality, no change to the core catalogue).

Setup wizard and diagnostics

The fastest way to stand up a control plane is the wizard, which mints a token and scaffolds everything:

python -m argus.fleet init      # writes .env + docker-compose.fleet.yml, prints the member snippet
python -m argus.fleet doctor --url http://fleet-host:9190 --token secret   # probe a running plane

init also prints a ready Prometheus http_sd scrape config. doctor reads the view, so pass the viewer token; it checks reachability, auth, cluster health, and (with --namespace) a namespace mismatch. (From a bot host, a plain curl .../healthz confirms reachability without any token.) For the nicest python -m argus.fleet experience (autoload .env) install the extra: pip install 'argus-dpy[fleet]'. Without it, the generated compose env_file or a systemd EnvironmentFile= loads the .env.

Hardening (secure by default)

  • Refuse-insecure bind: a non-loopback bind with no token refuses to start. Set ARGUS_FLEET_TOKEN (or ARGUS_FLEET_TOKEN_FILE), bind to loopback, or set ARGUS_FLEET_INSECURE=1 for local testing only.
  • Split tokens (optional): a low-privilege ARGUS_FLEET_INGEST_TOKEN (on every bot) and an ARGUS_FLEET_VIEWER_TOKEN (operators), each falling back to the shared token, so a leaked bot token does not unlock the dashboard. Any token var accepts a comma-separated list, so you can rotate with zero downtime (add the new token, roll it out, drop the old).
  • Per-identity lease (optional): ARGUS_FLEET_REQUIRE_LEASE=1 makes the plane mint a high-entropy secret at register that the member must present on every heartbeat; a mismatch is 409, so even a leaked ingest token can't take over an existing slot. Stored only as an HMAC-SHA256 digest (optionally peppered via ARGUS_FLEET_SECRET_PEPPER); see Security.
  • Abuse resistance: request body cap (413); per-IP register, per-identity heartbeat, and per-client read (ARGUS_FLEET_READ_BURST) rate limits (429); and an ARGUS_FLEET_MAX_CLUSTERS cap.
  • Audit log (optional): ARGUS_FLEET_AUDIT_LOG=1 logs one sanitized INFO line per ingest event (identity, client, outcome); the secret is never logged.
  • Scanner resistance: the version banner is stripped and security headers (X-Frame-Options, CSP, nosniff) are sent on every response. Front the plane with a TLS reverse proxy (or a tunnel) for any public deploy.
  • Single writer: an advisory lock on the state file refuses a second instance sharing it. The on-disk state is schema-versioned (unknown versions refuse to load, never truncate).
  • Self-observability: the plane exposes its own /metrics (register/heartbeat counters, live cluster up/down gauges) and a /readyz.
  • Retention (optional): ARGUS_FLEET_RETENTION_DAYS prunes long-dead clusters; per-fleet numbers are still never reused.
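As an illustration of the per-identity lease described in the list above, the sketch below mints a secret, stores only its HMAC-SHA256 digest, and verifies a presented secret in constant time. The function names are hypothetical, the pepper is passed explicitly, and how the real plane keys the HMAC when no pepper is configured is not stated here, so this sketch simply requires one.

```python
import hashlib
import hmac
import secrets


def mint_lease_secret() -> str:
    """High-entropy secret handed to the member once, at register time."""
    return secrets.token_urlsafe(32)


def lease_digest(secret: str, pepper: bytes) -> str:
    """What gets stored at rest: the HMAC-SHA256 digest, never the secret."""
    return hmac.new(pepper, secret.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_lease(presented: str, stored_digest: str, pepper: bytes) -> bool:
    """Constant-time check of the secret a member presents on heartbeat."""
    return hmac.compare_digest(lease_digest(presented, pepper), stored_digest)


pepper = b"example-pepper"  # stands in for ARGUS_FLEET_SECRET_PEPPER
secret = mint_lease_secret()
stored = lease_digest(secret, pepper)
```

A heartbeat whose secret fails `verify_lease` is the case the plane answers with 409.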

Per-guild analytics (one pane for everything)

Set ARGUS_FLEET_CLICKHOUSE_DSN to the same ClickHouse your bots drain per-guild events to (see History and ClickHouse), and the fleet SPA gains an Analytics view: pick a guild to see its top commands and average command duration, fleet-wide or sliced to one bot. Because events now carry a cluster_id, the view can filter per cluster; the analytics API is viewer-token gated and fails closed without a token (invariant 7). This makes the control plane a single pane for both operational rollups and per-guild analytics.

OTLP is deliberately out of scope here: it is a one-way export, so OTLP-shipped metrics are viewed in their own backend, not pulled back into this pane.

Prometheus auto-discovery

Set ARGUS_FLEET_SCRAPE_TARGET=<host:port> on each bot to advertise its metrics address. The control plane then serves GET /api/fleet/targets in Prometheus http_sd format (viewer-gated, with cluster/fleet labels), so one Prometheus pointed at the fleet discovers every bot. python -m argus.fleet init prints the matching scrape config.
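For orientation, Prometheus http_sd expects a JSON list of target groups, each with a `targets` list and a `labels` map. The sketch below builds such a payload from registered members; the function name, the member dictionary shape, and the exact label names `cluster` and `fleet` are assumptions for illustration.

```python
import json
from typing import Dict, List, Optional


def http_sd_targets(members: List[Dict[str, Optional[str]]]) -> List[dict]:
    """Build an http_sd payload: one target group per advertised member."""
    groups = []
    for member in members:
        target = member.get("scrape_target")
        if not target:
            # Members that did not advertise an address are not discoverable.
            continue
        groups.append(
            {
                "targets": [target],
                "labels": {"cluster": member["identity"], "fleet": member["fleet"]},
            }
        )
    return groups


payload = http_sd_targets(
    [
        {"identity": "asia-0", "fleet": "asia", "scrape_target": "10.0.0.5:9100"},
        {"identity": "asia-1", "fleet": "asia", "scrape_target": None},
    ]
)
body = json.dumps(payload)
```

Members behind NAT that never set a scrape target simply do not appear in the list and remain visible through the push source only.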

Configuration

Server (python -m argus.fleet)

| env | default | meaning |
|---|---|---|
| ARGUS_FLEET_HOST | 0.0.0.0 | bind host |
| ARGUS_FLEET_PORT | 9190 | bind port |
| ARGUS_FLEET_TOKEN | (none) | shared secret; gates every route except /healthz and /readyz |
| ARGUS_FLEET_HEARTBEAT_INTERVAL | 15 | expected member heartbeat seconds |
| ARGUS_FLEET_TTL_FACTOR | 3 | down after interval * ttl_factor seconds of silence |
| ARGUS_FLEET_STATE | argus-fleet-state.json | registry persistence path |
| ARGUS_FLEET_PROMETHEUS_URL | (none) | also read values from this Prometheus |
| ARGUS_FLEET_CLICKHOUSE_DSN | (none) | shared ClickHouse for the per-guild Analytics view |
| ARGUS_NAMESPACE | discord | metric prefix; must match the members |

Advanced / hardening (all optional):

| env | default | meaning |
|---|---|---|
| ARGUS_FLEET_TOKEN_FILE | (none) | read the token from a mounted secret file |
| ARGUS_FLEET_INGEST_TOKEN / _VIEWER_TOKEN | (shared) | split write/read tokens (+ *_FILE) |
| ARGUS_FLEET_INSECURE | 0 | allow a public bind with no token (local testing only) |
| ARGUS_FLEET_MAX_BODY_BYTES | 262144 | request body cap (413 over) |
| ARGUS_FLEET_CORS_ORIGINS | (none) | allowlist for a detached UI (see Clustering/CORS) |
| ARGUS_FLEET_VIEW_CACHE_MS | 1000 | view cache TTL (shared across viewers) |
| ARGUS_FLEET_MAX_CLUSTERS | 5000 | cap on registered clusters |
| ARGUS_FLEET_REGISTER_BURST / _HEARTBEAT_BURST | 60 | rate-limit token-bucket burst per 60s |
| ARGUS_FLEET_READ_BURST | 120 | per-client GET (view/api/metrics) rate-limit burst per 60s |
| ARGUS_FLEET_REQUIRE_LEASE | 0 | require the per-identity lease secret on heartbeat/re-register |
| ARGUS_FLEET_SECRET_PEPPER / _FILE | (none) | server-side key for the at-rest lease HMAC |
| ARGUS_FLEET_AUDIT_LOG | 0 | log one INFO line per ingest event (identity, client, outcome) |
| ARGUS_FLEET_RETENTION_DAYS | 0 | prune clusters down longer than N days (0 = never) |
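To make "token-bucket burst per 60s" concrete, here is a generic token bucket. It is a textbook sketch, not the plane's limiter: the `TokenBucket` class is hypothetical, and it assumes the bucket starts full and refills evenly at burst / 60 tokens per second.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    burst: float                   # e.g. 60 for ARGUS_FLEET_REGISTER_BURST
    window_seconds: float = 60.0
    tokens: float = field(init=False)
    updated: float = field(default=0.0, init=False)

    def __post_init__(self) -> None:
        self.tokens = self.burst  # assumption: the bucket starts full

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.updated)
        refill = elapsed * self.burst / self.window_seconds
        self.tokens = min(self.burst, self.tokens + refill)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller would answer 429


bucket = TokenBucket(burst=3)
first_four = [bucket.allow(now=0.0) for _ in range(4)]
```

One bucket per key (client IP for register, identity for heartbeat, client for reads) gives the per-IP, per-identity, and per-client limits listed under Hardening.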

Member (on the bot, opt-in)

These are fields on Argus(bot) and matching env vars. When fleet_url is unset, no fleet code runs at all.

| kwarg / env | default | meaning |
|---|---|---|
| fleet_url / ARGUS_FLEET_URL | (none) | opt in; register + heartbeat to this control plane |
| fleet_token / ARGUS_FLEET_TOKEN | (none) | the shared token |
| fleet_group / ARGUS_FLEET_GROUP | default | the region/fleet name |
| fleet_id / ARGUS_FLEET_ID | (none) | stable identity; if unset, falls back to cluster_id, then to an auto-UUID persisted in the state dir |
| fleet_state_dir / ARGUS_FLEET_STATE_DIR | . | where the member persists its identity |
| fleet_scrape_target / ARGUS_FLEET_SCRAPE_TARGET | (none) | advertise host:port for Prometheus http_sd |

When split tokens are in use, set the bot's fleet_token to the ingest token.
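The identity fallback chain (fleet_id, then cluster_id, then a persisted UUID) might look roughly like this. The function name and the file name `argus-fleet-identity` are hypothetical; the actual on-disk name the member uses is not documented here.

```python
import uuid
from pathlib import Path
from typing import Optional


def resolve_identity(
    fleet_id: Optional[str], cluster_id: Optional[str], state_dir: str = "."
) -> str:
    """Pick a stable identity: explicit id, else cluster id, else saved UUID."""
    if fleet_id:
        return fleet_id
    if cluster_id:
        # Equals the Prometheus cluster label, so both sources join cleanly.
        return cluster_id
    path = Path(state_dir) / "argus-fleet-identity"  # hypothetical file name
    if path.exists():
        saved = path.read_text(encoding="utf-8").strip()
        if saved:
            return saved  # a restart keeps the same identity
    identity = str(uuid.uuid4())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(identity, encoding="utf-8")
    return identity
```

On panels with ephemeral disks the persisted UUID is lost on redeploy, so setting fleet_id or a cluster id explicitly is the safer way to keep a cluster's number.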

The member side is fail-open and bounded: at most one heartbeat is in flight, failures drop the sample and retry on the next tick, and a fleet outage or error never touches the bot loop (mirrors invariant 5). If the control plane forgets a member (a 404 on heartbeat), the member transparently re-registers.
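The fail-open behaviour can be sketched with a single sequential loop, which keeps at most one heartbeat in flight by construction. Everything here is illustrative: `send` is a hypothetical coroutine that performs the HTTP call and returns the status code, and the 5 second timeout is an assumed value.

```python
import asyncio
from typing import Awaitable, Callable, List

Send = Callable[[str], Awaitable[int]]  # path -> HTTP status (hypothetical)


async def heartbeat_once(send: Send, timeout: float = 5.0) -> str:
    """One tick: never raises, so a fleet outage cannot touch the bot loop."""
    try:
        status = await asyncio.wait_for(send("/fleet/heartbeat"), timeout)
        if status == 404:
            # The control plane forgot this member: re-register transparently.
            await asyncio.wait_for(send("/fleet/register"), timeout)
            return "re-registered"
        return "ok" if status == 204 else "dropped"
    except Exception:
        # Drop this sample; the next tick is the retry.
        return "dropped"


async def heartbeat_loop(send: Send, interval: float, ticks: int) -> List[str]:
    outcomes = []
    for _ in range(ticks):
        # Awaiting each tick before sleeping bounds work to one request at a time.
        outcomes.append(await heartbeat_once(send))
        await asyncio.sleep(interval)
    return outcomes
```

A real member would run this as a background task for the lifetime of the bot rather than for a fixed number of ticks.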

Because the member only makes outbound calls, it needs no inbound port - the cleanest way to monitor bots on Docker panels (Pterodactyl, PebbleHost, Railway) that can't expose /metrics. See Hosting on bot panels.

HTTP surface

Auth is path-aware: ingest paths need the ingest token, the rest need the viewer token (both default to the shared ARGUS_FLEET_TOKEN), supplied as Authorization: Bearer <token> or ?token=. /healthz and /readyz are open.

| Method + path | Who calls it | Token | Body / response |
|---|---|---|---|
| POST /fleet/register | member | ingest | {identity, fleet, version, scrape_target?} -> {number} |
| POST /fleet/heartbeat | member | ingest | {identity, snapshot?} -> 204 |
| GET /api/fleet/view | the SPA | viewer | the whole FleetView JSON |
| GET /api/fleet/cluster?fleet=&number= | the SPA | viewer | one cluster + history |
| GET /api/fleet/targets | Prometheus | viewer | http_sd target list |
| GET /api/fleet/analytics/* | the SPA | viewer | per-guild analytics (only when ClickHouse is configured; fails closed without a token) |
| GET /metrics | Prometheus | viewer | the control plane's own metrics |
| GET /api/config | the SPA | viewer | {fleet: true, ...} |
| GET / | browser | viewer | the SPA in fleet mode |
| GET /healthz, GET /readyz | anyone | none | liveness / readiness |
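As a rough guide to calling these routes by hand, the sketch below builds (but does not send) authenticated requests with the standard library. The helper name, the example tokens, the JSON content type, and the version string are assumptions; the paths and body fields are the ones listed in the table.

```python
import json
import urllib.request
from typing import Optional


def fleet_request(
    base_url: str, path: str, token: str, body: Optional[dict] = None
) -> urllib.request.Request:
    """Build a request carrying the bearer token; POST when a body is given."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=data,
        method="GET" if body is None else "POST",
    )
    request.add_header("Authorization", f"Bearer {token}")
    if data is not None:
        request.add_header("Content-Type", "application/json")  # assumed
    return request


register = fleet_request(
    "http://fleet-host:9190",
    "/fleet/register",
    token="ingest-token",  # ingest paths take the ingest token
    body={"identity": "asia-0", "fleet": "asia", "version": "0.0.0"},
)
view = fleet_request("http://fleet-host:9190/", "/api/fleet/view", token="viewer-token")
```

Passing either object to `urllib.request.urlopen` would perform the call; prefer the Authorization header over ?token= where possible, since query strings tend to end up in proxy logs.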

Deploy

Container (recommended)

docker run -d --name argus-fleet \
  -e ARGUS_FLEET_TOKEN=change-me \
  -p 9190:9190 \
  -v argus-fleet-state:/data \
  ghcr.io/astoristhebrave/argus-fleet:latest

State persists in the /data volume (the image sets ARGUS_FLEET_STATE=/data/argus-fleet-state.json). Pin a version tag in production. To also read from Prometheus, add -e ARGUS_FLEET_PROMETHEUS_URL=http://prometheus:9090.

From the package

pip install argus-dpy
ARGUS_FLEET_TOKEN=change-me python -m argus.fleet

Members opt in

ARGUS_FLEET_URL=http://fleet-host:9190 \
ARGUS_FLEET_TOKEN=change-me \
ARGUS_FLEET_GROUP=asia \
CLUSTER_ID=asia-0 \
    python bot.py

examples/fleet_member_bot.py is a complete, runnable member.

Security

  • Always set ARGUS_FLEET_TOKEN if the control plane is reachable by anything other than localhost. It gates registration, heartbeats, and the UI/APIs with a constant-time comparison.
  • Terminate TLS in front of the service (a reverse proxy) for any public deployment; the token is a bearer credential.
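  • By default the control plane reads aggregate metrics only, so it cannot expose per-guild or per-user data. The optional Analytics view (enabled by ARGUS_FLEET_CLICKHOUSE_DSN) does read per-guild data from ClickHouse; it is viewer-token gated and fails closed without a token.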
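A sketch of how a comma-separated token list can be checked with a constant-time comparison, which is what makes zero-downtime rotation possible. The function name is hypothetical, and rejecting everything when no token is configured is a choice made for this sketch only.

```python
import hmac


def token_allowed(presented: str, configured: str) -> bool:
    """True if `presented` matches any token in a comma-separated list."""
    candidates = [t.strip() for t in configured.split(",") if t.strip()]
    allowed = False
    for candidate in candidates:
        # compare_digest avoids leaking how many leading characters matched.
        if hmac.compare_digest(presented.encode("utf-8"), candidate.encode("utf-8")):
            allowed = True  # keep looping so every candidate is always checked
    return allowed


# During a rotation both the old and the new token are accepted.
rotating = "old-token, new-token"
```

Once every bot and operator has moved to the new token, dropping the old entry from the list revokes it.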

Scaling notes

  • One control plane can serve a large fleet: heartbeats are tiny and the registry is in memory with a periodic JSON flush. For very large fleets, lengthen ARGUS_FLEET_HEARTBEAT_INTERVAL to reduce write churn.
  • For multi-region resilience you can run the Prometheus source against a global Prometheus and treat push as the fallback, or run a control plane per region.
  • See Clustering for the per-process metric/label model the fleet builds on, and the Tutorial Fleet for an end-to-end walkthrough.
