Tutorial Fleet
A complete walkthrough for running the Fleet control plane across many bot processes and regions, with one readable Global -> Fleet -> Cluster view. Start with the Tutorial Single Bot if you have not instrumented a single process yet; this builds on it.
- You run several processes (shards split across processes, or many bots).
- You want one pane that rolls them up by region without writing PromQL or standing up Grafana.
- You want stable, human-friendly identifiers (asia #1, asia #2, ...) that survive restarts and clearly show which clusters are down.
If you only run one process, you do not need the control plane: the built-in Dashboard already covers it.
- The same requirements as a single bot (Python 3.10+, discord.py 2.4+, a token).
- One extra host (or container) to run the control plane. It needs no database: the push source is built in. A persistent volume for its state file is recommended so per-fleet numbers survive restarts.
- Optional: an existing Prometheus, if you want exact per-second rates.
pip install argus-dpy   # the same package provides the control plane
bot (asia-0) --\
bot (asia-1) ---+--register/heartbeat--> [ control plane ] <-- you open this
bot (eu-0)   --/     (tiny outbound)     :9190 Global/Fleet/Cluster UI
                                                 |
                               (optional) query an existing Prometheus
Members make a small outbound heartbeat. The control plane owns topology (numbers, regions, health) and renders the view. Bots stay light.
Run the control plane on its own host or container. Always set a token if it is reachable from other hosts.
# container (recommended): state persists in the volume
docker run -d --name argus-fleet \
-e ARGUS_FLEET_TOKEN=change-me \
-p 9190:9190 \
-v argus-fleet-state:/data \
ghcr.io/astoristhebrave/argus-fleet:latest
Or from the package:
ARGUS_FLEET_TOKEN=change-me python -m argus.fleet
Open http://fleet-host:9190/?token=change-me. It is empty until members register.
Add the fleet fields to your existing Argus(bot). They also read from env vars,
so you can configure entirely from the environment:
# bot.py (same as the single-bot tutorial, plus the fleet fields)
import os
import discord
from discord.ext import commands
from argus import Argus
intents = discord.Intents.default()
intents.members = True
bot = commands.Bot(command_prefix="!", intents=intents)
Argus(
    bot,
    cluster_id=os.environ.get("CLUSTER_ID", "node-0"),  # the Prometheus cluster label / join key
    fleet_url=os.environ.get("ARGUS_FLEET_URL"),  # opt in
    fleet_token=os.environ.get("ARGUS_FLEET_TOKEN"),
    fleet_group=os.environ.get("ARGUS_FLEET_GROUP", "default"),  # the region
)
bot.run(os.environ["DISCORD_TOKEN"])
Run a few members, each in its region with a distinct CLUSTER_ID:
# asia
DISCORD_TOKEN=... ARGUS_FLEET_URL=http://fleet-host:9190 ARGUS_FLEET_TOKEN=change-me \
ARGUS_FLEET_GROUP=asia CLUSTER_ID=asia-0 python bot.py
DISCORD_TOKEN=... ARGUS_FLEET_URL=http://fleet-host:9190 ARGUS_FLEET_TOKEN=change-me \
ARGUS_FLEET_GROUP=asia CLUSTER_ID=asia-1 python bot.py
# europe
DISCORD_TOKEN=... ARGUS_FLEET_URL=http://fleet-host:9190 ARGUS_FLEET_TOKEN=change-me \
ARGUS_FLEET_GROUP=europe CLUSTER_ID=eu-0 python bot.py
examples/fleet_member_bot.py is exactly this, ready to run.
Open the control plane and you will see:
- Global: rollup cards across everything, plus a grid of fleets with health (e.g. asia 2/2 up, europe 1/1 up). Click a fleet to drill in.
- Fleet: the region's rollup and its clusters (asia #1, asia #2) with up/down. Click a cluster to drill in.
- Cluster: one process's readable metrics.
Numbers are assigned per region in order and never reused. Kill asia-0 and,
after heartbeat_interval * ttl_factor seconds, it shows down but keeps
asia #1; a new process becomes asia #3, and if asia-0 comes back it
reclaims asia #1.
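The numbering rules above can be sketched in a few lines. This is a minimal illustration, not the Argus implementation: the Registry and Member classes are hypothetical, and the 15-second interval and ttl_factor of 3 are assumed values chosen only to make the arithmetic visible.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    number: int        # per-region number, e.g. the 1 in "asia #1"
    last_seen: float   # timestamp of the most recent heartbeat


@dataclass
class Registry:
    """Hypothetical stand-in for the control plane's numbering rules."""
    heartbeat_interval: float = 15.0   # assumed value, in seconds
    ttl_factor: float = 3.0            # assumed value
    members: dict = field(default_factory=dict)      # (group, id) -> Member
    next_number: dict = field(default_factory=dict)  # group -> next unused number

    def heartbeat(self, group: str, member_id: str, now: float) -> str:
        key = (group, member_id)
        if key not in self.members:
            # First sighting: take the next number; numbers are never reused.
            number = self.next_number.get(group, 1)
            self.next_number[group] = number + 1
            self.members[key] = Member(number, now)
        else:
            # A returning member keeps (reclaims) its original number.
            self.members[key].last_seen = now
        return f"{group} #{self.members[key].number}"

    def is_up(self, group: str, member_id: str, now: float) -> bool:
        ttl = self.heartbeat_interval * self.ttl_factor
        return now - self.members[(group, member_id)].last_seen <= ttl


reg = Registry()
first = reg.heartbeat("asia", "asia-0", now=0.0)
second = reg.heartbeat("asia", "asia-1", now=0.0)
# asia-0 is killed; asia-1 keeps beating and a new process joins later.
reg.heartbeat("asia", "asia-1", now=60.0)
third = reg.heartbeat("asia", "asia-2", now=60.0)
down = not reg.is_up("asia", "asia-0", now=60.0)       # 60 s of silence > 15 * 3
reclaimed = reg.heartbeat("asia", "asia-0", now=90.0)  # same id, same number
```

Keying on the member's id rather than on its connection is what lets a restarted process reclaim its number, which is why stable ids matter.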
The push source reports counts, latency, p95, and error rate exactly, but per-second rates need two samples and are coarse. If you already run Prometheus scraping your bots (see Clustering), point the control plane at it:
docker run -d --name argus-fleet \
-e ARGUS_FLEET_TOKEN=change-me \
-e ARGUS_FLEET_PROMETHEUS_URL=http://prometheus:9090 \
-p 9190:9190 -v argus-fleet-state:/data \
ghcr.io/astoristhebrave/argus-fleet:latest
Now Prometheus supplies values (with real rate()-based throughput) and push fills any gaps. The registry still owns topology, so no new fleet label is added to your metrics. The join is on the cluster id: the member's CLUSTER_ID must equal its fleet identity, which it does when you set cluster_id and let the fleet identity default to it, or set fleet_id to match.
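To make the join concrete, here is a rough sketch of how two sources could be combined per cluster id, and why a push-side rate needs two samples. The function names and the metric keys (commands_per_s, p95_ms) are invented for illustration; the actual control plane may merge differently.

```python
def rate_per_second(prev_count, prev_time, curr_count, curr_time):
    """Coarse push-side rate: it needs two samples of a cumulative counter."""
    elapsed = curr_time - prev_time
    if elapsed <= 0 or curr_count < prev_count:
        return None  # no usable pair yet (first sample, or the counter reset)
    return (curr_count - prev_count) / elapsed


def merge_sources(push, prometheus):
    """Join per-cluster values on cluster id: Prometheus wins, push fills gaps."""
    merged = {}
    for cluster_id, push_values in push.items():
        values = dict(push_values)
        # Only a Prometheus series whose cluster label equals this id applies.
        values.update(prometheus.get(cluster_id, {}))
        merged[cluster_id] = values
    return merged


push = {
    "asia-0": {"commands_per_s": rate_per_second(100, 0.0, 130, 15.0), "p95_ms": 120},
    "eu-0": {"commands_per_s": None, "p95_ms": 95},  # only one sample so far
}
prometheus = {
    "asia-0": {"commands_per_s": 2.4},    # cluster label matches the fleet identity
    "europe-0": {"commands_per_s": 1.1},  # mismatched label: never joins eu-0
}
merged = merge_sources(push, prometheus)
```

In this sketch eu-0 keeps its push values only, which mirrors the symptom of a CLUSTER_ID that does not match the fleet identity.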
- Always set ARGUS_FLEET_TOKEN and terminate TLS in front of the control plane for any non-localhost deployment. The token gates registration, heartbeats, and the UI alike.
- Persist the state file (ARGUS_FLEET_STATE, the /data volume in the image). Without it, per-fleet numbers reset on restart.
- Use stable ids in orchestrators. On Kubernetes, a StatefulSet gives each pod a stable name; set ARGUS_FLEET_ID (or a stable CLUSTER_ID) from it so a rescheduled pod reclaims its number instead of taking a new one.
- Keep ARGUS_NAMESPACE identical on members and the control plane, otherwise the Prometheus source queries the wrong metric names.
- One Argus per process, same port across hosts. Do not hand-allocate a port range; separate hosts/pods can all use 9191 (see Clustering). The fleet control plane is the view; collection still lives in each process.
- Tune the heartbeat to your fleet size. A longer ARGUS_FLEET_HEARTBEAT_INTERVAL reduces traffic and state-file churn for large fleets; a shorter one detects death faster. ttl_factor controls how many missed beats mark a cluster down.
- The member is fail-open. A control-plane outage never affects your bots; they keep serving /metrics and retry the heartbeat quietly. You can deploy or restart the control plane any time.
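For the stable-id advice, one possible approach is to derive the id from the pod hostname, since a StatefulSet names its pods `<statefulset-name>-<ordinal>` and keeps that name across rescheduling. The helper below is a hypothetical sketch, not part of Argus; the node-0 fallback simply mirrors the default used in bot.py.

```python
import os
import re
import socket


def stable_cluster_id(hostname: str, fallback: str = "node-0") -> str:
    """Return the hostname if it looks like `<name>-<ordinal>`, else the fallback."""
    if re.fullmatch(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?-\d+", hostname):
        return hostname
    return fallback


# An explicit CLUSTER_ID always wins; otherwise fall back to the pod hostname,
# which inside a StatefulSet pod is the stable pod name.
cluster_id = os.environ.get("CLUSTER_ID") or stable_cluster_id(socket.gethostname())
```

The resulting value can then be passed as cluster_id to Argus(bot, ...). Names with a random suffix, such as those a Deployment generates, fall through to the fallback rather than registering under a new id on every restart.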
- A bot does not appear: check ARGUS_FLEET_URL is reachable from the bot and the token matches; registration is fail-open, so failures are silent by design (raise the argus logger to DEBUG to see them).
- A cluster shows down but is alive: its heartbeat is not reaching the control plane, or heartbeat_interval/ttl_factor are too tight for its network.
- Prometheus values are missing for a cluster: its CLUSTER_ID (the cluster label) does not match its fleet identity; align them.
- Numbers changed after a restart: the state file was not persisted; mount a volume for ARGUS_FLEET_STATE.
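Raising the argus logger to DEBUG, as suggested for a bot that does not appear, uses only the standard library. A minimal sketch, assuming it is placed in bot.py before bot.run(); the log format string is an arbitrary choice.

```python
import logging

# Send log records to stderr with timestamps. This call does nothing if the
# root logger already has handlers configured.
logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s %(message)s")

# "argus" is the logger name given in the troubleshooting list; DEBUG should
# surface the registration and heartbeat failures that are otherwise silent.
logging.getLogger("argus").setLevel(logging.DEBUG)
```

Because child loggers inherit the level, any logger under the argus name picks up DEBUG as well.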
- Fleet - the reference (registry semantics, HTTP API, config tables).
- Clustering - the per-process metric/label model this builds on.
- Configuration - every option and its precedence.