Fleet
The Argus Fleet control plane is a standalone, opt-in service that gives operators a single, readable view across many Argus-instrumented bot processes ("clusters"), grouped into regions ("fleets"). It is a separate process and a separate container image; it is never embedded in a bot.
If you run one bot process, you do not need this: the built-in Dashboard already shows it. Reach for the fleet control plane when you run several processes (shards, regions, or many bots) and want one pane that rolls them up without standing up Grafana and writing PromQL.
The fleet path is operational aggregation only. It still never introduces a
`guild_id`/`user_id`/`channel_id` Prometheus label; per-entity questions stay on the analytical path (see History and ClickHouse).
Global all fleets rolled up (every cluster)
|
Fleet one region, e.g. "asia" (a group of clusters)
|
Cluster one bot process (a single /metrics owner)
|
Shard one gateway shard (up/down + heartbeat latency)
You drill down Global -> Fleet -> Cluster -> Shard in the UI. Each cluster shows the fixed readable metric set (latency, shards up, guilds, cached users, interactions/sec, error rate, command p95, rate limits/sec, uptime), colour graded against sensible thresholds, plus an inline grid of its shards (id, up/down, latency) and recent trend sparklines. No PromQL or Grafana setup needed.
Optionally the same pane also serves per-guild analytics (see below), so one dashboard covers operational rollups and analytics, fleet-wide and per bot.
/metrics and the live gauges must stay inside each bot process because they
read live bot state (bot.latencies, bot.guilds, ...) at scrape time. That is
unchanged. What moves out is the aggregation and the fleet UI, which is
resource heavy and should not compete with the bot. So the control plane is its
own deployable, and bots only make a tiny outbound heartbeat to it.
The control plane merges two interchangeable sources behind one interface, both rendering the same UI. Either or both can be active.
| Source | Needs | How values arrive |
|---|---|---|
| Push | nothing (built in) | members POST a snapshot on each heartbeat; the latest per cluster is the data. NAT friendly: members only call out. |
| Prometheus | an existing Prometheus | the control plane queries a curated PromQL catalog grouped by the cluster label. |
Default is push (members always register). Set ARGUS_FLEET_PROMETHEUS_URL to
also read from Prometheus; when both are active, Prometheus values take
precedence and push fills any gaps.
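The precedence rule above can be illustrated with a small sketch. This is not the control plane's code: the function name `merge_sources` and the metric keys are hypothetical, and it assumes each source yields a flat mapping of metric name to value for one cluster.

```python
from typing import Mapping, Optional

Metrics = Mapping[str, Optional[float]]


def merge_sources(prometheus: Metrics, push: Metrics) -> dict:
    """Prefer Prometheus values; let push fill whatever Prometheus lacks."""
    merged = {}
    for name in set(prometheus) | set(push):
        value = prometheus.get(name)
        if value is None:  # missing (or empty) in Prometheus: push fills the gap
            value = push.get(name)
        merged[name] = value
    return merged


# Hypothetical values for one cluster.
push_snapshot = {"guilds": 120.0, "latency_ms": 48.0, "shards_up": 2.0}
prom_values = {"latency_ms": 51.5, "interactions_per_sec": 3.2}

merged = merge_sources(prom_values, push_snapshot)
print(sorted(merged.items()))
```

Here `latency_ms` comes from Prometheus because both sources report it, while `guilds` and `shards_up` come from the push snapshot.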
Per-second rates (interactions/sec, rate limits/sec) need two samples to
differentiate, so they read 0 on the push source until a second heartbeat
arrives, and may stay coarse at long heartbeat intervals. Counts, latency, p95,
and error rate are exact from a single snapshot. If you want precise rates,
point the control plane at Prometheus, which computes them with rate().
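To see why a rate needs two samples, here is a minimal sketch. It assumes the push snapshot carries a cumulative counter and a timestamp; the `Sample` type and `per_second_rate` are illustrative names, not part of Argus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    timestamp: float  # seconds since some fixed epoch
    total: float      # cumulative counter, e.g. interactions handled so far


def per_second_rate(previous: Optional[Sample], current: Sample) -> float:
    """Difference two counter samples; with only one sample the rate is 0."""
    if previous is None:
        return 0.0
    elapsed = current.timestamp - previous.timestamp
    delta = current.total - previous.total
    if elapsed <= 0 or delta < 0:  # clock skew or a counter reset (restart)
        return 0.0
    return delta / elapsed


first = Sample(timestamp=0.0, total=100.0)
second = Sample(timestamp=15.0, total=145.0)

print(per_second_rate(None, first))    # 0.0 (no earlier sample yet)
print(per_second_rate(first, second))  # 3.0 (45 interactions over 15 s)
```

The result is an average over one heartbeat interval, which is why long intervals give coarse rates.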
The registry is the source of truth for topology, identity, and health; metric values come from a data source and join on the cluster id.
- Identity is `fleet_id` if you set it, else `cluster_id` if you set that, else an auto-generated UUID persisted to the member's state dir (so a restart keeps the same identity). Falling back to `cluster_id` means the identity equals the Prometheus `cluster` label, so the push and Prometheus sources join the same cluster with no extra config.
- Fleet (region) is the member's `fleet_group` (default `default`).
- Numbering is per fleet, monotonic, and never reused. The first cluster in `asia` is `asia #1`, the next `asia #2`, and so on. A dead cluster keeps its number and is shown down (it does not free its slot); a brand new cluster gets the next number, not the dead one's. A reconnecting identity reclaims its own number. (A standard lease + heartbeat-TTL + monotonic-token pattern.)
- Health is a lease: a cluster is `up` while `now - last_seen` is within `heartbeat_interval * ttl_factor`, else `down`. A background sweeper updates status on the same interval.
- Persistence is a JSON file at `ARGUS_FLEET_STATE`, so numbers and topology survive a control-plane restart. Mount a volume for it in production.
Cluster-to-fleet grouping lives entirely in the registry, so operational metrics
need no extra fleet label (no added Prometheus cardinality, no change to
the core catalogue).
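The numbering and lease rules above can be sketched as a small in-memory registry. This is an illustration under stated assumptions, not the actual implementation: the `Registry` and `Cluster` classes are hypothetical, and persistence and the background sweeper are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    identity: str
    fleet: str
    number: int
    last_seen: float


@dataclass
class Registry:
    heartbeat_interval: float = 15.0
    ttl_factor: float = 3.0
    clusters: dict = field(default_factory=dict)     # identity -> Cluster
    next_number: dict = field(default_factory=dict)  # fleet -> next free number

    def register(self, identity: str, fleet: str, now: float) -> int:
        known = self.clusters.get(identity)
        if known is not None:  # a reconnecting identity reclaims its own number
            known.last_seen = now
            return known.number
        number = self.next_number.get(fleet, 1)
        self.next_number[fleet] = number + 1  # monotonic: never handed out again
        self.clusters[identity] = Cluster(identity, fleet, number, now)
        return number

    def status(self, identity: str, now: float) -> str:
        ttl = self.heartbeat_interval * self.ttl_factor
        return "up" if now - self.clusters[identity].last_seen <= ttl else "down"


registry = Registry()
first = registry.register("bot-a", "asia", now=0.0)     # asia #1
second = registry.register("bot-b", "asia", now=0.0)    # asia #2
dead = registry.status("bot-a", now=100.0)              # silent for > 45 s
third = registry.register("bot-c", "asia", now=100.0)   # asia #3, not #1
again = registry.register("bot-a", "asia", now=120.0)   # reclaims asia #1
print(first, second, dead, third, again)
```

Keeping the counter separate from the set of live clusters is what prevents a dead cluster's number from being reused.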
The fastest way to stand up a control plane is the wizard, which mints a token and scaffolds everything:
```bash
python -m argus.fleet init      # writes .env + docker-compose.fleet.yml, prints the member snippet
python -m argus.fleet doctor --url http://fleet-host:9190 --token secret   # probe a running plane
```

`init` also prints a ready Prometheus http_sd scrape config. `doctor` reads the
view, so pass the viewer token; it checks reachability, auth, cluster health,
and (with `--namespace`) a namespace mismatch. (From a bot host, a plain
`curl .../healthz` confirms reachability without any token.) For the
nicest `python -m argus.fleet` experience (autoload `.env`) install the extra:
`pip install 'argus-dpy[fleet]'`. Without it, the generated compose `env_file` or
a systemd `EnvironmentFile=` loads the `.env`.
- Refuse-insecure bind: a non-loopback bind with no token refuses to start. Set `ARGUS_FLEET_TOKEN` (or `ARGUS_FLEET_TOKEN_FILE`), bind to loopback, or set `ARGUS_FLEET_INSECURE=1` for local testing only.
- Split tokens (optional): a low-privilege `ARGUS_FLEET_INGEST_TOKEN` (on every bot) and an `ARGUS_FLEET_VIEWER_TOKEN` (operators), each falling back to the shared token, so a leaked bot token does not unlock the dashboard. Any token var accepts a comma-separated list, so you can rotate with zero downtime (add the new token, roll it out, drop the old).
- Per-identity lease (optional): `ARGUS_FLEET_REQUIRE_LEASE=1` makes the plane mint a high-entropy secret at register that the member must present on every heartbeat; a mismatch is `409`, so even a leaked ingest token can't take over an existing slot. Stored only as an HMAC-SHA256 digest (optionally peppered via `ARGUS_FLEET_SECRET_PEPPER`); see Security.
- Abuse resistance: request body cap (413); per-IP register, per-identity heartbeat, and per-client read (`ARGUS_FLEET_READ_BURST`) rate limits (429); and an `ARGUS_FLEET_MAX_CLUSTERS` cap.
- Audit log (optional): `ARGUS_FLEET_AUDIT_LOG=1` logs one sanitized INFO line per ingest event (identity, client, outcome); the secret is never logged.
- Scanner resistance: the version banner is stripped and security headers (`X-Frame-Options`, CSP, `nosniff`) are sent on every response. Front the plane with a TLS reverse proxy (or a tunnel) for any public deploy.
- Single writer: an advisory lock on the state file refuses a second instance sharing it. The on-disk state is schema-versioned (unknown versions refuse to load, never truncate).
- Self-observability: the plane exposes its own `/metrics` (register/heartbeat counters, live cluster up/down gauges) and a `/readyz`.
- Retention (optional): `ARGUS_FLEET_RETENTION_DAYS` prunes long-dead clusters; per-fleet numbers are still never reused.
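The per-identity lease in the list above stores only a digest of the secret. A minimal sketch of that idea with Python's standard library follows; the function names are hypothetical, and using the pepper as the HMAC key is an assumption about how the pepper is applied.

```python
import hashlib
import hmac
import secrets


def digest_lease(secret: str, pepper: bytes = b"") -> str:
    """HMAC-SHA256 of the secret, keyed by the (optional) server-side pepper."""
    return hmac.new(pepper, secret.encode("utf-8"), hashlib.sha256).hexdigest()


def mint_lease(pepper: bytes = b"") -> tuple:
    """Return (secret for the member, digest to persist). The secret is not stored."""
    secret = secrets.token_urlsafe(32)  # high-entropy, URL-safe
    return secret, digest_lease(secret, pepper)


def heartbeat_status(presented: str, stored_digest: str, pepper: bytes = b"") -> int:
    """204 when the presented secret matches, 409 on a mismatch."""
    candidate = digest_lease(presented, pepper)
    return 204 if hmac.compare_digest(candidate, stored_digest) else 409


pepper = b"example-pepper"  # illustrative value only
secret, stored = mint_lease(pepper)
print(heartbeat_status(secret, stored, pepper))
print(heartbeat_status("guessed-secret", stored, pepper))
```

Because only the digest is persisted, reading the state file does not reveal a usable lease secret, and `hmac.compare_digest` keeps the comparison constant-time.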
Set ARGUS_FLEET_CLICKHOUSE_DSN to the same ClickHouse your bots drain per-guild
events to (see History and ClickHouse), and the fleet
SPA gains an Analytics view: pick a guild to see its top commands and average
command duration, fleet-wide or sliced to one bot. Because events now carry a
cluster_id, the view can filter per cluster; the analytics API is viewer-token
gated and fails closed without a token (invariant 7). This makes the control
plane a single pane for both operational rollups and per-guild analytics.
OTLP is deliberately out of scope here: it is a one-way export, so OTLP-shipped metrics are viewed in their own backend, not pulled back into this pane.
Set ARGUS_FLEET_SCRAPE_TARGET=<host:port> on each bot to advertise its metrics
address. The control plane then serves GET /api/fleet/targets in Prometheus
http_sd format (viewer-gated, with cluster/fleet labels), so one Prometheus
pointed at the fleet discovers every bot. argus-fleet init prints the matching
scrape config.
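For orientation, Prometheus http_sd expects a JSON list of target groups, each with a `targets` list and a `labels` map. The sketch below builds such a payload from registry entries; the function name and the input record shape are hypothetical, and the label names `cluster` and `fleet` follow the description above.

```python
import json


def http_sd_targets(clusters: list) -> list:
    """Build a Prometheus http_sd payload from clusters that advertise a target."""
    groups = []
    for cluster in clusters:
        target = cluster.get("scrape_target")
        if not target:  # members without ARGUS_FLEET_SCRAPE_TARGET are skipped
            continue
        groups.append({
            "targets": [target],
            "labels": {"cluster": cluster["identity"], "fleet": cluster["fleet"]},
        })
    return groups


# Hypothetical registry entries.
clusters = [
    {"identity": "asia-0", "fleet": "asia", "scrape_target": "10.0.0.5:9191"},
    {"identity": "asia-1", "fleet": "asia", "scrape_target": None},
]
groups = http_sd_targets(clusters)
print(json.dumps(groups, indent=2))
```

Only `asia-0` appears in the output, since `asia-1` advertises no scrape target.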
| env | default | meaning |
|---|---|---|
| `ARGUS_FLEET_HOST` | `0.0.0.0` | bind host |
| `ARGUS_FLEET_PORT` | `9190` | bind port |
| `ARGUS_FLEET_TOKEN` | (none) | shared secret; gates every route except `/healthz` and `/readyz` |
| `ARGUS_FLEET_HEARTBEAT_INTERVAL` | `15` | expected member heartbeat seconds |
| `ARGUS_FLEET_TTL_FACTOR` | `3` | down after `interval * ttl_factor` seconds of silence |
| `ARGUS_FLEET_STATE` | `argus-fleet-state.json` | registry persistence path |
| `ARGUS_FLEET_PROMETHEUS_URL` | (none) | also read values from this Prometheus |
| `ARGUS_FLEET_CLICKHOUSE_DSN` | (none) | shared ClickHouse for the per-guild Analytics view |
| `ARGUS_NAMESPACE` | `discord` | metric prefix; must match the members |
Advanced / hardening (all optional):
| env | default | meaning |
|---|---|---|
| `ARGUS_FLEET_TOKEN_FILE` | (none) | read the token from a mounted secret file |
| `ARGUS_FLEET_INGEST_TOKEN` / `_VIEWER_TOKEN` | (shared) | split write/read tokens (+ `*_FILE`) |
| `ARGUS_FLEET_INSECURE` | `0` | allow a public bind with no token (local testing only) |
| `ARGUS_FLEET_MAX_BODY_BYTES` | `262144` | request body cap (413 over) |
| `ARGUS_FLEET_CORS_ORIGINS` | (none) | allowlist for a detached UI (see Clustering/CORS) |
| `ARGUS_FLEET_VIEW_CACHE_MS` | `1000` | view cache TTL (shared across viewers) |
| `ARGUS_FLEET_MAX_CLUSTERS` | `5000` | cap on registered clusters |
| `ARGUS_FLEET_REGISTER_BURST` / `_HEARTBEAT_BURST` | `60` | rate-limit token-bucket burst per 60s |
| `ARGUS_FLEET_READ_BURST` | `120` | per-client GET (view/api/metrics) rate-limit burst per 60s |
| `ARGUS_FLEET_REQUIRE_LEASE` | `0` | require the per-identity lease secret on heartbeat/re-register |
| `ARGUS_FLEET_SECRET_PEPPER` / `_FILE` | (none) | server-side key for the at-rest lease HMAC |
| `ARGUS_FLEET_AUDIT_LOG` | `0` | log one INFO line per ingest event (identity, client, outcome) |
| `ARGUS_FLEET_RETENTION_DAYS` | `0` | prune clusters down longer than N days (0 = never) |
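The `*_BURST` settings describe token-bucket limits. As a rough illustration of how a "burst per 60s" bucket behaves, here is a generic sketch; the class is hypothetical, and the continuous refill at `burst / 60` tokens per second is an assumption, not a description of the plane's exact algorithm.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    burst: int                    # bucket capacity, e.g. ARGUS_FLEET_READ_BURST
    window_seconds: float = 60.0  # the burst refills fully over this window
    tokens: float = 0.0
    updated_at: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = float(self.burst)  # start full

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.updated_at)
        refill = elapsed * self.burst / self.window_seconds
        self.tokens = min(float(self.burst), self.tokens + refill)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller would answer 429


bucket = TokenBucket(burst=3)
immediate = [bucket.allow(now=0.0) for _ in range(4)]
later = bucket.allow(now=20.0)  # 20 s at 3 tokens / 60 s refills one token
print(immediate, later)
```

A client can spend the whole burst at once, after which requests are admitted only as fast as the bucket refills.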
These are fields on Argus(bot) and matching env vars. When fleet_url is unset,
no fleet code runs at all.
| kwarg / env | default | meaning |
|---|---|---|
| `fleet_url` / `ARGUS_FLEET_URL` | (none) | opt in; register + heartbeat to this control plane |
| `fleet_token` / `ARGUS_FLEET_TOKEN` | (none) | the shared token |
| `fleet_group` / `ARGUS_FLEET_GROUP` | `default` | the region/fleet name |
| `fleet_id` / `ARGUS_FLEET_ID` | (none) | stable identity; auto-UUID persisted if unset |
| `fleet_state_dir` / `ARGUS_FLEET_STATE_DIR` | `.` | where the member persists its identity |
| `fleet_scrape_target` / `ARGUS_FLEET_SCRAPE_TARGET` | (none) | advertise `host:port` for Prometheus http_sd |
When split tokens are in use, set the bot's fleet_token to the ingest token.
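The identity fallback (`fleet_id`, then `cluster_id`, then a persisted UUID) can be sketched as follows. This is an illustration only: `resolve_identity` and the file name `argus-fleet-identity` are hypothetical, and the real member may store its identity differently.

```python
import tempfile
import uuid
from pathlib import Path
from typing import Optional


def resolve_identity(fleet_id: Optional[str], cluster_id: Optional[str],
                     state_dir: str = ".") -> str:
    """fleet_id wins, then cluster_id, then a UUID persisted in state_dir."""
    if fleet_id:
        return fleet_id
    if cluster_id:
        return cluster_id
    path = Path(state_dir) / "argus-fleet-identity"  # hypothetical file name
    if path.exists():
        saved = path.read_text(encoding="utf-8").strip()
        if saved:
            return saved  # a restart keeps the same identity
    identity = str(uuid.uuid4())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(identity, encoding="utf-8")
    return identity


with tempfile.TemporaryDirectory() as state_dir:
    first_start = resolve_identity(None, None, state_dir)
    after_restart = resolve_identity(None, None, state_dir)
    explicit = resolve_identity(None, "asia-0", state_dir)

print(first_start == after_restart, explicit)
```

If the state dir is not persistent (for example an ephemeral container filesystem), each restart would generate a new UUID and therefore a new cluster number, so setting `fleet_id` or `cluster_id` explicitly is the safer choice there.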
The member side is fail-open and bounded: at most one heartbeat is in flight, failures drop the sample and retry on the next tick, and a fleet outage or error never touches the bot loop (mirrors invariant 5). If the control plane forgets a member (a 404 on heartbeat), the member transparently re-registers.
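The fail-open behaviour described above can be sketched with `asyncio`. The transport is injected as plain callables so the example runs without a network; `heartbeat_once`, `fake_send`, and `fake_register` are hypothetical names, and the 5 second timeout is an assumption.

```python
import asyncio
from typing import Awaitable, Callable

SendHeartbeat = Callable[[dict], Awaitable[int]]  # returns the HTTP status code
Register = Callable[[], Awaitable[None]]


async def heartbeat_once(send: SendHeartbeat, register: Register,
                         snapshot: dict, timeout: float = 5.0) -> str:
    """Send one heartbeat. Never raises: any failure just drops this sample."""
    try:
        status = await asyncio.wait_for(send(snapshot), timeout)
        if status == 404:  # the control plane forgot this member
            await asyncio.wait_for(register(), timeout)
            return "re-registered"
        return "ok" if 200 <= status < 300 else "dropped"
    except Exception:
        return "dropped"  # fail open; the next tick retries


async def demo() -> list:
    responses = iter([ConnectionError("plane unreachable"), 404, 204])
    registrations = []

    async def fake_send(snapshot: dict) -> int:
        outcome = next(responses)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    async def fake_register() -> None:
        registrations.append("registered")

    # Awaiting each heartbeat before starting the next keeps one in flight.
    return [await heartbeat_once(fake_send, fake_register, {"guilds": 120})
            for _ in range(3)]


outcomes = asyncio.run(demo())
print(outcomes)  # ['dropped', 're-registered', 'ok']
```

In a real member this would run in its own background task on a fixed interval, so a slow or failing control plane only ever costs a dropped sample.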
Because the member only makes outbound calls, it needs no inbound port - the
cleanest way to monitor bots on Docker panels (Pterodactyl, PebbleHost, Railway)
that can't expose /metrics. See Hosting on bot panels.
Auth is path-aware: ingest paths need the ingest token, the rest need the viewer
token (both default to the shared ARGUS_FLEET_TOKEN), supplied as
Authorization: Bearer <token> or ?token=. /healthz and /readyz are open.
| Method + path | Who calls it | Token | Body / response |
|---|---|---|---|
| `POST /fleet/register` | member | ingest | `{identity, fleet, version, scrape_target?}` -> `{number}` |
| `POST /fleet/heartbeat` | member | ingest | `{identity, snapshot?}` -> `204` |
| `GET /api/fleet/view` | the SPA | viewer | the whole FleetView JSON |
| `GET /api/fleet/cluster?fleet=&number=` | the SPA | viewer | one cluster + history |
| `GET /api/fleet/targets` | Prometheus | viewer | http_sd target list |
| `GET /api/fleet/analytics/*` | the SPA | viewer | per-guild analytics (only when ClickHouse is configured; fails closed without a token) |
| `GET /metrics` | Prometheus | viewer | the control plane's own metrics |
| `GET /api/config` | the SPA | viewer | `{fleet: true, ...}` |
| `GET /` | browser | viewer | the SPA in fleet mode |
| `GET /healthz`, `GET /readyz` | anyone | none | liveness / readiness |
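The path-aware auth rule can be illustrated with a short sketch. It is not the server's code: the function names are hypothetical, and it assumes that once a split token is set, the shared token is no longer accepted for that role.

```python
import hmac
from typing import Mapping, Optional
from urllib.parse import parse_qs, urlsplit

INGEST_PATHS = {"/fleet/register", "/fleet/heartbeat"}
OPEN_PATHS = {"/healthz", "/readyz"}


def parse_tokens(raw: Optional[str]) -> list:
    """Split a comma-separated token list, ignoring blanks."""
    return [t.strip() for t in (raw or "").split(",") if t.strip()]


def presented_token(url: str, headers: Mapping[str, str]) -> Optional[str]:
    """Read the token from 'Authorization: Bearer <token>' or '?token='."""
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        return auth[len("Bearer "):].strip() or None
    values = parse_qs(urlsplit(url).query).get("token")
    return values[0] if values else None


def is_authorized(url: str, headers: Mapping[str, str], env: Mapping[str, str]) -> bool:
    path = urlsplit(url).path
    if path in OPEN_PATHS:
        return True
    role_var = "ARGUS_FLEET_INGEST_TOKEN" if path in INGEST_PATHS else "ARGUS_FLEET_VIEWER_TOKEN"
    accepted = parse_tokens(env.get(role_var)) or parse_tokens(env.get("ARGUS_FLEET_TOKEN"))
    presented = presented_token(url, headers)
    if not presented:
        return False  # fail closed
    matched = False
    for token in accepted:  # check every token; each comparison is constant-time
        if hmac.compare_digest(presented.encode(), token.encode()):
            matched = True
    return matched


env = {"ARGUS_FLEET_TOKEN": "shared", "ARGUS_FLEET_INGEST_TOKEN": "old-ingest, new-ingest"}
bearer = {"Authorization": "Bearer new-ingest"}
print(is_authorized("/fleet/heartbeat", bearer, env))         # ingest token on an ingest path
print(is_authorized("/api/fleet/view", bearer, env))          # ingest token on a viewer path
print(is_authorized("/api/fleet/view?token=shared", {}, env)) # viewer falls back to shared
```

The comma-separated list is what allows rotation: both `old-ingest` and `new-ingest` are accepted until the old value is removed.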
```bash
docker run -d --name argus-fleet \
  -e ARGUS_FLEET_TOKEN=change-me \
  -p 9190:9190 \
  -v argus-fleet-state:/data \
  ghcr.io/astoristhebrave/argus-fleet:latest
```

State persists in the `/data` volume (the image sets
`ARGUS_FLEET_STATE=/data/argus-fleet-state.json`). Pin a version tag in
production. To also read from Prometheus, add
`-e ARGUS_FLEET_PROMETHEUS_URL=http://prometheus:9090`.
To run the control plane without Docker:

```bash
pip install argus-dpy
ARGUS_FLEET_TOKEN=change-me python -m argus.fleet
```

Then point each bot at it:

```bash
ARGUS_FLEET_URL=http://fleet-host:9190 \
ARGUS_FLEET_TOKEN=change-me \
ARGUS_FLEET_GROUP=asia \
CLUSTER_ID=asia-0 \
python bot.py
```

`examples/fleet_member_bot.py` is a complete, runnable member.
- Always set `ARGUS_FLEET_TOKEN` if the control plane is reachable by anything other than localhost. It gates registration, heartbeats, and the UI/APIs with a constant-time comparison.
- Terminate TLS in front of the service (a reverse proxy) for any public deployment; the token is a bearer credential.
- The operational (metrics) path reads aggregate metrics only. It cannot expose per-guild or per-user data, by construction; per-guild data appears only if you opt in to the ClickHouse-backed Analytics view.
- One control plane can serve a large fleet: heartbeats are tiny and the registry is in memory with a periodic JSON flush. For very large fleets, lengthen `ARGUS_FLEET_HEARTBEAT_INTERVAL` to reduce write churn.
- For multi-region resilience you can run the Prometheus source against a global Prometheus and treat push as the fallback, or run a control plane per region.
- See Clustering for the per-process metric/label model the fleet builds on, and the Tutorial Fleet for an end-to-end walkthrough.