Fleet
The Argus Fleet control plane is a standalone, opt-in service that gives operators a single, readable view across many Argus-instrumented bot processes ("clusters"), grouped into regions ("fleets"). It is a separate process and a separate container image; it is never embedded in a bot.
If you run one bot process, you do not need this: the built-in Dashboard already shows it. Reach for the fleet control plane when you run several processes (shards, regions, or many bots) and want one pane that rolls them up without standing up Grafana and writing PromQL.
The fleet path is operational aggregation only. It still never introduces a guild_id/user_id/channel_id Prometheus label; per-entity questions stay on the analytical path (see History and ClickHouse).
Global all fleets rolled up (every cluster)
|
Fleet one region, e.g. "asia" (a group of clusters)
|
Cluster one bot process (a single /metrics owner)
You drill down Global -> Fleet -> Cluster in the UI. Each tier shows the same fixed set of readable metrics (latency, shards up, guilds, cached users, interactions/sec, error rate, command p95, rate limits/sec, uptime), colour graded against sensible thresholds. No PromQL or Grafana setup is required.
/metrics and the live gauges must stay inside each bot process because they
read live bot state (bot.latencies, bot.guilds, ...) at scrape time. That is
unchanged. What moves out is the aggregation and the fleet UI, which is
resource heavy and should not compete with the bot. So the control plane is its
own deployable, and bots only make a tiny outbound heartbeat to it.
The control plane merges two interchangeable sources behind one interface, both rendering the same UI. Either or both can be active.
| Source | Needs | How values arrive |
|---|---|---|
| Push | nothing (built in) | members POST a snapshot on each heartbeat; the latest per cluster is the data. NAT friendly: members only call out. |
| Prometheus | an existing Prometheus | the control plane queries a curated PromQL catalog grouped by the cluster label. |
Default is push (members always register). Set ARGUS_FLEET_PROMETHEUS_URL to
also read from Prometheus; when both are active, Prometheus values take
precedence and push fills any gaps.
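The precedence rule above can be sketched as a small merge over per-cluster metric maps. This is a minimal illustration, not the control plane's code: the function name `merge_sources`, the dictionary shapes, and the use of `None` for "no value" are all assumptions.

```python
from typing import Dict, Optional

# Hypothetical shape: {cluster_id: {metric_name: value or None}}
Metrics = Dict[str, Optional[float]]


def merge_sources(
    prometheus: Dict[str, Metrics], push: Dict[str, Metrics]
) -> Dict[str, Metrics]:
    """Merge per-cluster values: Prometheus wins, push fills the gaps."""
    merged: Dict[str, Metrics] = {}
    for cluster_id in set(prometheus) | set(push):
        # Start from the pushed snapshot, then overlay Prometheus values.
        values = dict(push.get(cluster_id, {}))
        for name, value in prometheus.get(cluster_id, {}).items():
            if value is not None:  # a missing Prometheus value leaves push in place
                values[name] = value
        merged[cluster_id] = values
    return merged
```

A cluster that only pushes (for example one behind NAT that Prometheus cannot scrape) still appears, with push values only.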
Per-second rates (interactions/sec, rate limits/sec) need two samples to
differentiate, so they read 0 on the push source until a second heartbeat
arrives, and may stay coarse at long heartbeat intervals. Counts, latency, p95,
and error rate are exact from a single snapshot. If you want precise rates,
point the control plane at Prometheus, which computes them with rate().
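To see why a rate reads 0 until the second heartbeat, here is a sketch of deriving a per-second rate from two consecutive samples. It assumes the snapshot carries cumulative counters; the `Sample` tuple, the function name, and the handling of counter resets are illustrative assumptions rather than the control plane's actual logic.

```python
from typing import Optional, Tuple

# Hypothetical sample: (unix_time_seconds, cumulative_counter_value)
Sample = Tuple[float, float]


def per_second_rate(previous: Optional[Sample], current: Sample) -> float:
    """Rate between two samples; 0.0 when there is nothing to differentiate."""
    if previous is None:
        return 0.0  # first heartbeat: only one sample so far
    elapsed = current[0] - previous[0]
    delta = current[1] - previous[1]
    if elapsed <= 0 or delta < 0:
        return 0.0  # clock skew or a counter reset (assumed handling)
    return delta / elapsed
```

With a 15 second heartbeat, 30 new interactions between two snapshots give 2.0 per second, averaged over the whole interval. That averaging is why long intervals look coarse.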
The registry is the source of truth for topology, identity, and health; metric values come from a data source and join on the cluster id.
- Identity is fleet_id if you set it, otherwise an auto-generated UUID that is persisted to the member's state dir, so a restart keeps the same identity.
- Fleet (region) is the member's fleet_group (default: default).
- Numbering is per fleet, monotonic, and never reused. The first cluster in asia is asia #1, the next asia #2, and so on. A dead cluster keeps its number and is shown down (it does not free its slot); a brand new cluster gets the next number, not the dead one's. A reconnecting identity reclaims its own number. (A standard lease + heartbeat-TTL + monotonic-token pattern.)
- Health is a lease: a cluster is up while now - last_seen is within heartbeat_interval * ttl_factor, else down. A background sweeper updates status on the same interval.
- Persistence is a JSON file at ARGUS_FLEET_STATE, so numbers and topology survive a control-plane restart. Mount a volume for it in production.
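The numbering and lease rules in the list above can be sketched as a small in-memory registry. This is an illustration under assumptions: the `Registry` and `Cluster` classes and their method names are hypothetical, persistence and the background sweeper are left out, and a reconnecting identity is assumed to stay in its original fleet.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Cluster:
    identity: str
    fleet: str
    number: int
    last_seen: float


@dataclass
class Registry:
    heartbeat_interval: float = 15.0  # mirrors ARGUS_FLEET_HEARTBEAT_INTERVAL
    ttl_factor: float = 3.0           # mirrors ARGUS_FLEET_TTL_FACTOR
    clusters: Dict[str, Cluster] = field(default_factory=dict)
    next_number: Dict[str, int] = field(default_factory=dict)

    def register(self, identity: str, fleet: str, now: float) -> int:
        known = self.clusters.get(identity)
        if known is not None:
            known.last_seen = now  # a reconnecting identity reclaims its number
            return known.number
        # Per-fleet monotonic counter: numbers are handed out once, never reused.
        number = self.next_number.get(fleet, 1)
        self.next_number[fleet] = number + 1
        self.clusters[identity] = Cluster(identity, fleet, number, now)
        return number

    def status(self, identity: str, now: float) -> str:
        cluster = self.clusters[identity]
        ttl = self.heartbeat_interval * self.ttl_factor
        return "up" if now - cluster.last_seen <= ttl else "down"
```

Because a dead cluster is never removed from `clusters`, its number stays taken, and the counter in `next_number` only moves forward.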
Cluster-to-fleet grouping lives entirely in the registry, so operational metrics
need no extra fleet label (no added Prometheus cardinality, no change to
the core catalogue).
| env | default | meaning |
|---|---|---|
| ARGUS_FLEET_HOST | 0.0.0.0 | bind host |
| ARGUS_FLEET_PORT | 9190 | bind port |
| ARGUS_FLEET_TOKEN | (none) | shared secret; gates every route except /healthz |
| ARGUS_FLEET_HEARTBEAT_INTERVAL | 15 | expected member heartbeat seconds |
| ARGUS_FLEET_TTL_FACTOR | 3 | down after interval * ttl_factor seconds of silence |
| ARGUS_FLEET_STATE | argus-fleet-state.json | registry persistence path |
| ARGUS_FLEET_PROMETHEUS_URL | (none) | also read values from this Prometheus |
| ARGUS_NAMESPACE | discord | metric prefix; must match the members |
On the member side, the fleet settings are fields on Argus(bot) with matching env vars. When fleet_url is unset, no fleet code runs at all.
| kwarg / env | default | meaning |
|---|---|---|
| fleet_url / ARGUS_FLEET_URL | (none) | opt in; register + heartbeat to this control plane |
| fleet_token / ARGUS_FLEET_TOKEN | (none) | the shared token |
| fleet_group / ARGUS_FLEET_GROUP | default | the region/fleet name |
| fleet_id / ARGUS_FLEET_ID | (none) | stable identity; auto-UUID persisted if unset |
| fleet_state_dir / ARGUS_FLEET_STATE_DIR | . | where the member persists its identity |
The member side is fail-open and bounded: at most one heartbeat is in flight, failures drop the sample and retry on the next tick, and a fleet outage or error never touches the bot loop (mirrors invariant 5). If the control plane forgets a member (a 404 on heartbeat), the member transparently re-registers.
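The fail-open behaviour described above can be sketched as a single `tick` method that never raises. Everything here is illustrative: the `FleetMember` class, the injected `post` transport (returning an HTTP status code), the placeholder version string, and the choice to re-register on the next tick after a 404 are assumptions, not the library's implementation.

```python
from typing import Callable, Optional

# Hypothetical transport: post(path, body) -> HTTP status; may raise on network errors.
Post = Callable[[str, dict], int]


class FleetMember:
    def __init__(self, identity: str, fleet: str, post: Post) -> None:
        self.identity = identity
        self.fleet = fleet
        self._post = post
        self._registered = False
        self._in_flight = False

    def tick(self, snapshot: Optional[dict] = None) -> bool:
        """Send one heartbeat. Returns True on success and never raises."""
        if self._in_flight:
            return False  # bounded: at most one heartbeat in flight
        self._in_flight = True
        try:
            if not self._registered:
                body = {"identity": self.identity, "fleet": self.fleet,
                        "version": "0.0.0"}  # placeholder version
                if self._post("/fleet/register", body) >= 300:
                    return False
                self._registered = True
            status = self._post(
                "/fleet/heartbeat",
                {"identity": self.identity, "snapshot": snapshot},
            )
            if status == 404:
                self._registered = False  # forgotten: re-register on the next tick
                return False
            return status == 204
        except Exception:
            return False  # fail open: drop this sample, retry on the next tick
        finally:
            self._in_flight = False
```

The in-flight flag only matters when ticks can overlap (a timer or task scheduler); a production version would need a proper lock or task guard rather than a plain boolean.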
Every route except /healthz requires the token (as Authorization: Bearer <token> or ?token=).
| Method + path | Who calls it | Body / response |
|---|---|---|
| POST /fleet/register | member | {identity, fleet, version} -> {number} |
| POST /fleet/heartbeat | member | {identity, snapshot?} -> 204 |
| GET /api/fleet/view | the SPA | the whole FleetView JSON |
| GET /api/fleet/cluster?fleet=&number= | the SPA | one cluster + history |
| GET /api/config | the SPA | {fleet: true, ...} |
| GET / | browser | the SPA in fleet mode |
| GET /healthz | anyone | ok (never gated) |
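As a rough illustration of the member-facing routes in the table, the sketch below builds (but does not send) a register request with the standard library. The helper name `build_request`, the JSON content type, and the example identity and version values are assumptions; only the paths, body fields, and bearer header come from the table and the sentence before it.

```python
import json
from typing import Optional
from urllib.request import Request


def build_request(base_url: str, path: str, body: dict,
                  token: Optional[str]) -> Request:
    """Build a POST request for a fleet route, attaching the bearer token."""
    headers = {"Content-Type": "application/json"}  # assumed content type
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Example values only; "asia-0" and "0.0.0" are placeholders.
register = build_request(
    "http://fleet-host:9190",
    "/fleet/register",
    {"identity": "asia-0", "fleet": "asia", "version": "0.0.0"},
    "change-me",
)
```

Passing the result to `urllib.request.urlopen` would perform the call; it is left out here so the sketch has no network side effects.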
Run the control plane with Docker:

    docker run -d --name argus-fleet \
      -e ARGUS_FLEET_TOKEN=change-me \
      -p 9190:9190 \
      -v argus-fleet-state:/data \
      ghcr.io/astoristhebrave/argus-fleet:latest

State persists in the /data volume (the image sets ARGUS_FLEET_STATE=/data/argus-fleet-state.json). Pin a version tag in production. To also read from Prometheus, add -e ARGUS_FLEET_PROMETHEUS_URL=http://prometheus:9090.
To run the control plane from the pip package:

    pip install argus-dpy
    ARGUS_FLEET_TOKEN=change-me python -m argus.fleet

To point a bot at the control plane:

    ARGUS_FLEET_URL=http://fleet-host:9190 \
    ARGUS_FLEET_TOKEN=change-me \
    ARGUS_FLEET_GROUP=asia \
    CLUSTER_ID=asia-0 \
    python bot.py

examples/fleet_member_bot.py is a complete, runnable member.
- Always set ARGUS_FLEET_TOKEN if the control plane is reachable by anything other than localhost. It gates registration, heartbeats, and the UI/APIs with a constant-time comparison.
- Terminate TLS in front of the service (a reverse proxy) for any public deployment; the token is a bearer credential.
- The control plane reads aggregate metrics only. It cannot expose per-guild or per-user data, by construction.
- One control plane can serve a large fleet: heartbeats are tiny and the registry is in memory with a periodic JSON flush. For very large fleets, lengthen ARGUS_FLEET_HEARTBEAT_INTERVAL to reduce write churn.
- For multi-region resilience you can run the Prometheus source against a global Prometheus and treat push as the fallback, or run a control plane per region.
- See Clustering for the per-process metric/label model the fleet builds on, and the Tutorial Fleet for an end-to-end walkthrough.
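The token gate mentioned in the notes above can be sketched with the standard library's constant-time comparison. This is an assumption-laden illustration: the function names are hypothetical, headers are taken as a plain mapping, and treating an unset token as "ungated" is inferred from the (none) default rather than confirmed.

```python
import hmac
from typing import Mapping, Optional
from urllib.parse import parse_qs, urlsplit


def extract_token(headers: Mapping[str, str], url: str) -> Optional[str]:
    """Read the token from 'Authorization: Bearer <token>' or '?token='."""
    scheme, _, value = headers.get("Authorization", "").partition(" ")
    if scheme.lower() == "bearer" and value:
        return value.strip()
    values = parse_qs(urlsplit(url).query).get("token")
    return values[0] if values else None


def is_authorized(expected: Optional[str], headers: Mapping[str, str],
                  url: str) -> bool:
    if urlsplit(url).path == "/healthz":
        return True  # never gated
    if not expected:
        return True  # assumed: no token configured means no gate
    supplied = extract_token(headers, url)
    if supplied is None:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(supplied.encode(), expected.encode())
```

Since the token can also travel in the query string, it may end up in proxy and access logs, which is one more reason to terminate TLS in front of the service and prefer the Authorization header.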