Fleet

Fleet control plane

The Argus Fleet control plane is a standalone, opt-in service that gives operators a single, readable view across many Argus-instrumented bot processes ("clusters"), grouped into regions ("fleets"). It is a separate process and a separate container image; it is never embedded in a bot.

If you run one bot process, you do not need this: the built-in Dashboard already shows it. Reach for the fleet control plane when you run several processes (shards, regions, or many bots) and want one pane that rolls them up without standing up Grafana and writing PromQL.

The fleet path is operational aggregation only. It still never introduces a guild_id/user_id/channel_id Prometheus label; per-entity questions stay on the analytical path (see History and ClickHouse).

Three tiers

Global   all fleets rolled up         (every cluster)
  |
Fleet    one region, e.g. "asia"      (a group of clusters)
  |
Cluster  one bot process              (a single /metrics owner)

You drill down Global -> Fleet -> Cluster in the UI. Each tier shows the same fixed set of readable metrics (latency, shards up, guilds, cached users, interactions/sec, error rate, command p95, rate limits/sec, uptime), colour-graded against sensible thresholds. No PromQL or Grafana setup is required.
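To make the three tiers concrete, here is a minimal sketch of how per-cluster rows could roll up into fleet and global rows. The field names and the aggregation rules (sum the counts, average the latency) are assumptions for illustration; the control plane's actual roll-up rules are not specified on this page.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-cluster rows; field names are illustrative only.
clusters = [
    {"fleet": "asia", "number": 1, "guilds": 1200, "latency_ms": 80.0},
    {"fleet": "asia", "number": 2, "guilds": 800, "latency_ms": 120.0},
    {"fleet": "eu", "number": 1, "guilds": 500, "latency_ms": 40.0},
]


def roll_up(rows):
    """Assumed rule: sum the counts, average the latency."""
    return {
        "clusters": len(rows),
        "guilds": sum(r["guilds"] for r in rows),
        "latency_ms": mean(r["latency_ms"] for r in rows),
    }


# Fleet tier: group cluster rows by their fleet (region) name.
by_fleet = defaultdict(list)
for row in clusters:
    by_fleet[row["fleet"]].append(row)

fleet_view = {name: roll_up(rows) for name, rows in by_fleet.items()}

# Global tier: every cluster, regardless of fleet.
global_view = roll_up(clusters)
```

The same `roll_up` function serves both upper tiers, which is one way to keep the metric set identical at every level.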

Why a separate service

/metrics and the live gauges must stay inside each bot process because they read live bot state (bot.latencies, bot.guilds, ...) at scrape time. That is unchanged. What moves out is the aggregation and the fleet UI, which are resource-heavy and should not compete with the bot for resources. So the control plane is its own deployable, and bots only make a tiny outbound heartbeat to it.

Two data sources

The control plane merges two interchangeable sources behind one interface, both rendering the same UI. Either or both can be active.

Source	Needs	How values arrive
Push	nothing (built in)	members POST a snapshot on each heartbeat; the latest per cluster is the data. NAT friendly: members only call out.
Prometheus	an existing Prometheus	the control plane queries a curated PromQL catalog grouped by the `cluster` label.

Default is push (members always register). Set ARGUS_FLEET_PROMETHEUS_URL to also read from Prometheus; when both are active, Prometheus values take precedence and push fills any gaps.
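The precedence rule can be sketched as a dictionary merge. This is an illustration only: `merge_sources`, the metric names, and the choice to fill gaps per metric (rather than per cluster) are assumptions, not the control plane's actual code.

```python
def merge_sources(prometheus, push):
    """Merge {cluster_id: {metric: value}} maps.

    Prometheus wins wherever it has a value; push fills whatever is
    missing or None. Gap-filling per metric is an assumption.
    """
    merged = {cluster_id: dict(values) for cluster_id, values in push.items()}
    for cluster_id, values in prometheus.items():
        row = merged.setdefault(cluster_id, {})
        row.update({k: v for k, v in values.items() if v is not None})
    return merged


# Hypothetical values: Prometheus knows cluster "a" but has no latency for it.
push = {"a": {"guilds": 10, "latency_ms": 50.0}, "b": {"guilds": 5}}
prom = {"a": {"guilds": 12, "latency_ms": None}}
merged = merge_sources(prom, push)
```

Here cluster "a" takes its guild count from Prometheus and its latency from push, while cluster "b", which Prometheus does not report, is served entirely from push.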

A note on rates (push source)

Per-second rates (interactions/sec, rate limits/sec) are computed from the difference between two samples, so they read 0 on the push source until a second heartbeat arrives, and may stay coarse at long heartbeat intervals. Counts, latency, p95, and error rate are exact from a single snapshot. If you want precise rates, point the control plane at Prometheus, which computes them with rate().
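A minimal sketch of why the first heartbeat yields 0: a rate is the change in a counter divided by the elapsed time, and with one sample there is nothing to subtract. The function name and the `(timestamp, counter)` tuple shape are hypothetical, and treating a counter reset as 0 is an assumption.

```python
def per_second_rate(prev, curr):
    """Rate between two (timestamp_seconds, counter_value) samples.

    Returns 0.0 when either sample is missing, which is the state
    between the first and second heartbeat.
    """
    if prev is None or curr is None:
        return 0.0
    elapsed = curr[0] - prev[0]
    delta = curr[1] - prev[1]
    if elapsed <= 0 or delta < 0:
        # Clock skew or a counter reset (e.g. a bot restart): report 0
        # rather than a negative or infinite rate.
        return 0.0
    return delta / elapsed


first = per_second_rate(None, (0.0, 100))          # only one sample so far
second = per_second_rate((0.0, 100), (15.0, 250))  # 150 events over 15 s
```

With a 15-second heartbeat the result is an average over those 15 seconds, so short bursts are smoothed out; longer intervals make the figure coarser still.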

The registry: identity, numbering, health

The registry is the source of truth for topology, identity, and health; metric values come from a data source and join on the cluster id.

Identity is fleet_id if you set it, otherwise an auto-generated UUID that is persisted to the member's state dir, so a restart keeps the same identity.
Fleet (region) is the member's fleet_group (default: `default`).
Numbering is per fleet, monotonic, and never reused. The first cluster in asia is asia #1, the next asia #2, and so on. A dead cluster keeps its number and is shown down (it does not free its slot); a brand new cluster gets the next number, not the dead one's. A reconnecting identity reclaims its own number. (A standard lease + heartbeat-TTL + monotonic-token pattern.)
Health is a lease: a cluster is up while now - last_seen is within heartbeat_interval * ttl_factor, else down. A background sweeper updates status on the same interval.
Persistence is a JSON file at ARGUS_FLEET_STATE, so numbers and topology survive a control-plane restart. Mount a volume for it in production.

Cluster-to-fleet grouping lives entirely in the registry, so operational metrics need no extra fleet label (no added Prometheus cardinality, no change to the core catalogue).
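The numbering and lease rules above can be sketched in a few lines. This is an illustrative in-memory model, not the actual registry: the class and field names are hypothetical, persistence and the background sweeper are omitted, and an identity that changes fleet is not handled.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Member:
    fleet: str
    number: int
    last_seen: float


@dataclass
class Registry:
    heartbeat_interval: float = 15.0
    ttl_factor: float = 3.0
    members: dict = field(default_factory=dict)      # identity -> Member
    next_number: dict = field(default_factory=dict)  # fleet -> next free number

    def register(self, identity, fleet, now=None):
        now = time.time() if now is None else now
        member = self.members.get(identity)
        if member is not None:
            # A reconnecting identity reclaims its own number.
            member.last_seen = now
            return member.number
        # Monotonic per fleet: numbers are only ever handed out upward,
        # so a dead cluster's slot is never reused.
        number = self.next_number.get(fleet, 1)
        self.next_number[fleet] = number + 1
        self.members[identity] = Member(fleet, number, now)
        return number

    def status(self, identity, now=None):
        now = time.time() if now is None else now
        age = now - self.members[identity].last_seen
        limit = self.heartbeat_interval * self.ttl_factor
        return "up" if age <= limit else "down"
```

With the defaults, a cluster that registered at time 0 and then went silent is reported down from 45 seconds onward, yet a new identity joining the same fleet still receives the next number rather than the silent cluster's.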

Configuration

Server (`python -m argus.fleet`)

env	default	meaning
`ARGUS_FLEET_HOST`	`0.0.0.0`	bind host
`ARGUS_FLEET_PORT`	`9190`	bind port
`ARGUS_FLEET_TOKEN`	(none)	shared secret; gates every route except `/healthz`
`ARGUS_FLEET_HEARTBEAT_INTERVAL`	`15`	expected member heartbeat seconds
`ARGUS_FLEET_TTL_FACTOR`	`3`	down after `interval * ttl_factor` seconds of silence
`ARGUS_FLEET_STATE`	`argus-fleet-state.json`	registry persistence path
`ARGUS_FLEET_PROMETHEUS_URL`	(none)	also read values from this Prometheus
`ARGUS_NAMESPACE`	`discord`	metric prefix; must match the members
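As a reading aid for the table above, the following sketch loads the same variables with the same defaults and derives the silence threshold. `FleetSettings` and `load_settings` are hypothetical names; the real server's configuration code may look different.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class FleetSettings:
    host: str
    port: int
    token: Optional[str]
    heartbeat_interval: float
    ttl_factor: float
    state_path: str
    prometheus_url: Optional[str]
    namespace: str

    @property
    def down_after(self) -> float:
        """Seconds of silence after which a cluster is marked down."""
        return self.heartbeat_interval * self.ttl_factor


def load_settings(env: Mapping[str, str] = os.environ) -> FleetSettings:
    # Defaults mirror the table; empty strings are treated as unset.
    return FleetSettings(
        host=env.get("ARGUS_FLEET_HOST", "0.0.0.0"),
        port=int(env.get("ARGUS_FLEET_PORT", "9190")),
        token=env.get("ARGUS_FLEET_TOKEN") or None,
        heartbeat_interval=float(env.get("ARGUS_FLEET_HEARTBEAT_INTERVAL", "15")),
        ttl_factor=float(env.get("ARGUS_FLEET_TTL_FACTOR", "3")),
        state_path=env.get("ARGUS_FLEET_STATE", "argus-fleet-state.json"),
        prometheus_url=env.get("ARGUS_FLEET_PROMETHEUS_URL") or None,
        namespace=env.get("ARGUS_NAMESPACE", "discord"),
    )


defaults = load_settings({})
```

With the defaults, `down_after` is 15 * 3 = 45 seconds; raising the heartbeat interval to 30 without touching the TTL factor stretches it to 90.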

Member (on the bot, opt-in)

These are keyword arguments to Argus(bot), each with a matching environment variable. When fleet_url is unset, no fleet code runs at all.

kwarg / env	default	meaning
`fleet_url` / `ARGUS_FLEET_URL`	(none)	opt in; register + heartbeat to this control plane
`fleet_token` / `ARGUS_FLEET_TOKEN`	(none)	the shared token
`fleet_group` / `ARGUS_FLEET_GROUP`	`default`	the region/fleet name
`fleet_id` / `ARGUS_FLEET_ID`	(none)	stable identity; auto-UUID persisted if unset
`fleet_state_dir` / `ARGUS_FLEET_STATE_DIR`	`.`	where the member persists its identity

The member side is fail-open and bounded: at most one heartbeat is in flight, failures drop the sample and retry on the next tick, and a fleet outage or error never touches the bot loop (mirrors invariant 5). If the control plane forgets a member (a 404 on heartbeat), the member transparently re-registers.
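The fail-open behaviour described above can be sketched as a single tick function. Everything here is illustrative: `heartbeat_tick` and the injected `post(path, body)` callable (assumed to return an HTTP status code) are hypothetical stand-ins for the member's real transport, and a non-blocking lock is just one way to keep at most one heartbeat in flight.

```python
import threading

_in_flight = threading.Lock()


def heartbeat_tick(post, identity, fleet, version, snapshot):
    """Run one heartbeat tick. Never raises into the caller."""
    if not _in_flight.acquire(blocking=False):
        return "skipped"  # a previous heartbeat is still in flight
    try:
        status = post("/fleet/heartbeat", {"identity": identity, "snapshot": snapshot})
        if status == 404:
            # The control plane forgot this member: register again.
            post("/fleet/register", {"identity": identity, "fleet": fleet, "version": version})
            return "re-registered"
        return "ok" if status == 204 else "dropped"
    except Exception:
        # Fail open: drop this sample and let the next tick retry.
        return "dropped"
    finally:
        _in_flight.release()
```

Because every outcome is a return value rather than an exception, the caller (for example a periodic background task) has nothing to handle, and a control-plane outage costs only the dropped samples.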

HTTP surface

Every route except /healthz requires the token (as Authorization: Bearer <token> or ?token=).

Method + path	Who calls it	Body / response
`POST /fleet/register`	member	`{identity, fleet, version}` -> `{number}`
`POST /fleet/heartbeat`	member	`{identity, snapshot?}` -> `204`
`GET /api/fleet/view`	the SPA	the whole `FleetView` JSON
`GET /api/fleet/cluster?fleet=&number=`	the SPA	one cluster + history
`GET /api/config`	the SPA	`{fleet: true, ...}`
`GET /`	browser	the SPA in fleet mode
`GET /healthz`	anyone	`ok` (never gated)
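For orientation, a heartbeat request matching the table above could be built with the standard library as follows. The helper name is hypothetical, the JSON content type is an assumption, and the snapshot's internal shape is not documented here, so it is passed through as an opaque dict.

```python
import json
import urllib.request


def build_heartbeat(base_url, token, identity, snapshot=None):
    """Build (but do not send) a POST /fleet/heartbeat request."""
    body = {"identity": identity}
    if snapshot is not None:
        body["snapshot"] = snapshot  # optional, per the `snapshot?` field
    return urllib.request.Request(
        base_url.rstrip("/") + "/fleet/heartbeat",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumed
        },
    )


req = build_heartbeat("http://fleet-host:9190/", "change-me", "asia-0")
```

Sending it would be `urllib.request.urlopen(req, timeout=5)`. The header form keeps the token out of URLs and access logs, which is a reason to prefer it over `?token=` where both are possible.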

Deploy

Container (recommended)

docker run -d --name argus-fleet \
  -e ARGUS_FLEET_TOKEN=change-me \
  -p 9190:9190 \
  -v argus-fleet-state:/data \
  ghcr.io/astoristhebrave/argus-fleet:latest

State persists in the /data volume (the image sets ARGUS_FLEET_STATE=/data/argus-fleet-state.json). Pin a version tag in production. To also read from Prometheus, add -e ARGUS_FLEET_PROMETHEUS_URL=http://prometheus:9090.

From the package

pip install argus-dpy
ARGUS_FLEET_TOKEN=change-me python -m argus.fleet

Members opt in

ARGUS_FLEET_URL=http://fleet-host:9190 \
ARGUS_FLEET_TOKEN=change-me \
ARGUS_FLEET_GROUP=asia \
CLUSTER_ID=asia-0 \
    python bot.py

examples/fleet_member_bot.py is a complete, runnable member.

Security

Always set ARGUS_FLEET_TOKEN if the control plane is reachable by anything other than localhost. It gates registration, heartbeats, and the UI/APIs with a constant-time comparison.
Terminate TLS in front of the service (a reverse proxy) for any public deployment; the token is a bearer credential.
The control plane reads aggregate metrics only. It cannot expose per-guild or per-user data, by construction.
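A constant-time token check of the kind mentioned above can be written with the standard library's `hmac.compare_digest`. This is a sketch, not the control plane's code: the function names are hypothetical, and treating an unset server token as "no gating" is an assumption drawn from the `(none)` default.

```python
import hmac


def presented_token(authorization, query_token):
    """Pick the token from an Authorization header or the ?token= parameter."""
    if authorization and authorization.startswith("Bearer "):
        return authorization[len("Bearer "):]
    return query_token


def token_ok(expected, presented):
    if expected is None:
        return True  # assumption: no token configured means routes are open
    if presented is None:
        return False
    # compare_digest's running time does not depend on where the inputs
    # first differ, so response timing leaks nothing about the secret.
    return hmac.compare_digest(expected.encode("utf-8"), presented.encode("utf-8"))


allowed = token_ok("change-me", presented_token("Bearer change-me", None))
denied = token_ok("change-me", presented_token(None, "wrong"))
```

A plain `==` comparison can return as soon as the first differing character is found, which is the timing side channel a constant-time comparison avoids.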

Scaling notes

One control plane can serve a large fleet: heartbeats are tiny and the registry is in memory with a periodic JSON flush. For very large fleets, lengthen ARGUS_FLEET_HEARTBEAT_INTERVAL to reduce write churn.
For multi-region resilience you can run the Prometheus source against a global Prometheus and treat push as the fallback, or run a control plane per region.
See Clustering for the per-process metric/label model the fleet builds on, and the Tutorial Fleet for an end-to-end walkthrough.
