
Tutorial Fleet

AstorisTheBrave edited this page Jun 20, 2026 · 3 revisions

Tutorial: a fleet at scale

A complete walkthrough for running the Fleet control plane across many bot processes and regions, with one readable Global -> Fleet -> Cluster view. Start with the Tutorial Single Bot if you have not instrumented a single process yet; this builds on it.

When you need this

  • You run several processes (shards split across processes, or many bots).
  • You want one pane that rolls them up by region without writing PromQL or standing up Grafana.
  • You want stable, human-friendly identifiers (asia #1, asia #2, ...) that survive restarts and clearly show which clusters are down.

If you only run one process, you do not need the control plane: the built-in Dashboard already covers it.

Bare minimum requirements

  • The same as a single bot (Python 3.10+, discord.py 2.4+, a token).
  • One extra host (or container) to run the control plane. It needs no database: the push source is built in. A persistent volume for its state file is recommended so per-fleet numbers survive restarts.
  • Optional: an existing Prometheus, if you want exact per-second rates.

pip install argus-dpy   # the same package provides the control plane

Architecture in one picture

   bot (asia-0) --\
   bot (asia-1) ---+--register/heartbeat--> [ control plane ]  <-- you open this
   bot (eu-0)   --/        (tiny outbound)     :9190  Global/Fleet/Cluster UI
                                                  |
                          (optional) query an existing Prometheus

Each member sends a small outbound heartbeat to the control plane. The control plane owns topology (numbers, regions, health) and renders the view, so the bots stay light.

Step 1: run the control plane

Run the control plane on its own host or container. Always set a token if anything other than localhost can reach it.

# container (recommended): state persists in the volume
docker run -d --name argus-fleet \
  -e ARGUS_FLEET_TOKEN=change-me \
  -p 9190:9190 \
  -v argus-fleet-state:/data \
  ghcr.io/astoristhebrave/argus-fleet:latest

Or from the package:

ARGUS_FLEET_TOKEN=change-me python -m argus.fleet

Open http://fleet-host:9190/?token=change-me. The view stays empty until members register.
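
The change-me value used throughout this tutorial is a placeholder. As a minimal sketch using only the Python standard library, the snippet below generates a random token and builds the UI URL in the ?token=... form shown above. The helper name fleet_ui_url is hypothetical, and fleet-host is the tutorial's placeholder host name.

```python
import secrets
from urllib.parse import urlencode


def fleet_ui_url(host: str, token: str, port: int = 9190) -> str:
    """Build the control-plane UI URL in the ?token=... form shown above.

    Hypothetical helper for illustration; it is not part of Argus.
    """
    # urlencode escapes characters that are unsafe in a query string.
    return f"http://{host}:{port}/?{urlencode({'token': token})}"


if __name__ == "__main__":
    # 32 random bytes, encoded as URL-safe text.
    token = secrets.token_urlsafe(32)
    print(f"ARGUS_FLEET_TOKEN={token}")
    print(fleet_ui_url("fleet-host", token))
```

Use the same generated value for ARGUS_FLEET_TOKEN on the control plane and on every member.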

Step 2: opt each bot in

Add the fleet fields to your existing Argus(bot) call. The fields can also be read from environment variables, so you can configure a member entirely from the environment:

# bot.py (same as the single-bot tutorial, plus the fleet fields)
import os

import discord
from discord.ext import commands

from argus import Argus

intents = discord.Intents.default()
intents.members = True
bot = commands.Bot(command_prefix="!", intents=intents)

Argus(
    bot,
    cluster_id=os.environ.get("CLUSTER_ID", "node-0"),  # the Prometheus cluster label / join key
    fleet_url=os.environ.get("ARGUS_FLEET_URL"),         # opt in
    fleet_token=os.environ.get("ARGUS_FLEET_TOKEN"),
    fleet_group=os.environ.get("ARGUS_FLEET_GROUP", "default"),  # the region
)

bot.run(os.environ["DISCORD_TOKEN"])

Run a few members, each in its region with a distinct CLUSTER_ID:

# asia
DISCORD_TOKEN=... ARGUS_FLEET_URL=http://fleet-host:9190 ARGUS_FLEET_TOKEN=change-me \
  ARGUS_FLEET_GROUP=asia CLUSTER_ID=asia-0 python bot.py
DISCORD_TOKEN=... ARGUS_FLEET_URL=http://fleet-host:9190 ARGUS_FLEET_TOKEN=change-me \
  ARGUS_FLEET_GROUP=asia CLUSTER_ID=asia-1 python bot.py

# europe
DISCORD_TOKEN=... ARGUS_FLEET_URL=http://fleet-host:9190 ARGUS_FLEET_TOKEN=change-me \
  ARGUS_FLEET_GROUP=europe CLUSTER_ID=eu-0 python bot.py

examples/fleet_member_bot.py is exactly this, ready to run.

Step 3: read the view

Open the control plane and you will see:

  • Global: rollup cards across everything, plus a grid of fleets with health (e.g. asia 2/2 up, europe 1/1 up). Click a fleet to drill in.
  • Fleet: the region's rollup and its clusters (asia #1, asia #2) with up/down. Click a cluster to drill in.
  • Cluster: one process's readable metrics.

Numbers are assigned per region in order and are never reused. Kill asia-0 and, after heartbeat_interval * ttl_factor seconds without a heartbeat, it shows as down but keeps asia #1. A new process becomes asia #3 (asia #2 still belongs to asia-1), and if asia-0 comes back it reclaims asia #1.
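
The numbering rules can be modelled in a few lines. The sketch below is a toy model of the behaviour described in this step, not the Argus registry itself: the class ToyRegistry, its method names, and the interval values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    number: int        # display number within its group, e.g. 1 for "asia #1"
    last_seen: float   # time of the most recent heartbeat, in seconds


@dataclass
class ToyRegistry:
    """Toy model of per-region numbering; not the real Argus registry."""

    heartbeat_interval: float = 15.0   # illustrative values, not Argus defaults
    ttl_factor: float = 3.0
    next_number: dict = field(default_factory=dict)   # group -> next unused number
    members: dict = field(default_factory=dict)       # (group, member id) -> Member

    def heartbeat(self, group: str, member_id: str, now: float) -> str:
        """Record a heartbeat and return the member's display name."""
        key = (group, member_id)
        member = self.members.get(key)
        if member is None:
            # First contact: take the next number for this group.
            # Numbers only ever increase, so none is reused.
            number = self.next_number.get(group, 1)
            self.next_number[group] = number + 1
            member = self.members[key] = Member(number, now)
        member.last_seen = now
        return f"{group} #{member.number}"

    def is_up(self, group: str, member_id: str, now: float) -> bool:
        """A member is up if it was heard from within interval * ttl_factor."""
        member = self.members[(group, member_id)]
        return now - member.last_seen <= self.heartbeat_interval * self.ttl_factor


if __name__ == "__main__":
    reg = ToyRegistry()
    print(reg.heartbeat("asia", "asia-0", now=0.0))     # asia #1
    print(reg.heartbeat("asia", "asia-1", now=0.0))     # asia #2
    print(reg.is_up("asia", "asia-0", now=100.0))       # False: silent for > 45 s
    print(reg.heartbeat("asia", "asia-2", now=100.0))   # asia #3
    print(reg.heartbeat("asia", "asia-0", now=120.0))   # asia #1 (reclaimed)
```

Keying members by a stable id is what lets a returning process reclaim its number, which is why the state file and stable ids matter later in this tutorial.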

Step 4 (optional): add Prometheus for exact rates

The push source reports counts, latency, p95, and error rate exactly, but per-second rates need two samples and are coarse. If you already run Prometheus scraping your bots (see Clustering), point the control plane at it:

docker run -d --name argus-fleet \
  -e ARGUS_FLEET_TOKEN=change-me \
  -e ARGUS_FLEET_PROMETHEUS_URL=http://prometheus:9090 \
  -p 9190:9190 -v argus-fleet-state:/data \
  ghcr.io/astoristhebrave/argus-fleet:latest

Now Prometheus supplies the values (including throughput computed with a real rate()) and push fills any gaps. The registry still owns topology, so no new fleet label is added to your metrics. The join is on the cluster id: the cluster label on a member's metrics (its CLUSTER_ID) must equal its fleet identity. That holds when you set cluster_id and let the fleet identity default to it, or when you set fleet_id to the same value.
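
To make the two ideas in this step concrete, here is a rough sketch of (a) why a push-only rate needs two samples and (b) how a join on the cluster id with Prometheus taking precedence might look. The function names and metric names are hypothetical, and the code only illustrates the described behaviour, not how the control plane implements it.

```python
def coarse_rate(prev_count, prev_time, curr_count, curr_time):
    """Per-second rate from two counter samples, or None if it cannot be computed."""
    elapsed = curr_time - prev_time
    if elapsed <= 0 or curr_count < prev_count:
        # No time has passed, or the counter was reset (e.g. a restart).
        return None
    return (curr_count - prev_count) / elapsed


def merge_sources(push, prometheus):
    """Join per-cluster metrics on the cluster id.

    Both arguments map cluster id -> {metric name -> value}. Only clusters
    known from push (the registered members) appear in the result, mirroring
    the idea that the registry owns topology. Prometheus wins on overlap.
    """
    merged = {}
    for cluster_id, pushed in push.items():
        values = dict(pushed)
        values.update(prometheus.get(cluster_id, {}))
        merged[cluster_id] = values
    return merged


if __name__ == "__main__":
    # Two heartbeats 15 s apart that saw 100 and then 130 commands.
    print(coarse_rate(100, 0.0, 130, 15.0))   # 2.0

    push = {
        "asia-0": {"commands_total": 130, "commands_per_second": 1.5},
        "eu-0": {"commands_total": 40},
    }
    prometheus = {
        "asia-0": {"commands_per_second": 2.0},
        # A label that matches no fleet identity is never joined.
        "node-7": {"commands_per_second": 9.0},
    }
    print(merge_sources(push, prometheus))
```

In this model a member whose cluster label differs from its fleet identity falls back to push values only, which is the symptom described under Troubleshooting.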

Best practices

  • Always set ARGUS_FLEET_TOKEN and terminate TLS in front of the control plane for any non-localhost deployment. The token gates registration, heartbeats, and the UI alike.
  • Persist the state file (ARGUS_FLEET_STATE, the /data volume in the image). Without it, per-fleet numbers reset on restart.
  • Use stable ids in orchestrators. On Kubernetes, a StatefulSet gives each pod a stable name; set ARGUS_FLEET_ID (or a stable CLUSTER_ID) from it so a rescheduled pod reclaims its number instead of taking a new one.
  • Keep ARGUS_NAMESPACE identical on members and the control plane, otherwise the Prometheus source queries the wrong metric names.
  • One Argus per process, same port across hosts. Do not hand-allocate a port range; separate hosts/pods can all use 9191 (see Clustering). The fleet control plane is the view; collection still lives in each process.
  • Tune the heartbeat to your fleet size. A longer ARGUS_FLEET_HEARTBEAT_INTERVAL reduces traffic and state-file churn for large fleets; a shorter one detects death faster. ttl_factor controls how many missed beats mark a cluster down.
  • The member is fail-open. A control-plane outage never affects your bots; they keep serving /metrics and retry the heartbeat quietly. You can deploy or restart the control plane any time.
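
The heartbeat trade-off in the list above is simple arithmetic. The sketch below assumes, as stated in Step 3, that a cluster is shown as down after heartbeat_interval * ttl_factor seconds, and that each member sends one request per interval. The function names, the fleet size of 500, and the ttl_factor of 3 are illustrative assumptions, not Argus defaults.

```python
def detection_delay(heartbeat_interval: float, ttl_factor: float) -> float:
    """Seconds of silence before a cluster is shown as down."""
    return heartbeat_interval * ttl_factor


def heartbeats_per_second(members: int, heartbeat_interval: float) -> float:
    """Average heartbeat requests per second arriving at the control plane."""
    return members / heartbeat_interval


if __name__ == "__main__":
    members = 500      # hypothetical fleet size
    ttl_factor = 3.0   # hypothetical value
    for interval in (5.0, 15.0, 60.0):
        delay = detection_delay(interval, ttl_factor)
        load = heartbeats_per_second(members, interval)
        print(f"interval={interval:>4.0f}s  down after {delay:>4.0f}s  "
              f"~{load:.1f} heartbeats/s")
```

A longer interval lowers the request rate proportionally but delays down detection by the same factor, so pick the largest interval whose detection delay you can tolerate.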

Troubleshooting

  • A bot does not appear: check ARGUS_FLEET_URL is reachable from the bot and the token matches; registration is fail-open, so failures are silent by design (raise the argus logger to DEBUG to see them).
  • A cluster shows down but is alive: its heartbeat is not reaching the control plane, or heartbeat_interval/ttl_factor are too tight for its network.
  • Prometheus values are missing for a cluster: its CLUSTER_ID (the cluster label) does not match its fleet identity; align them.
  • Numbers changed after a restart: the state file was not persisted; mount a volume for ARGUS_FLEET_STATE.
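
For the first troubleshooting item, a sketch of two checks you could run from the bot's host using only the standard library. It assumes the logger is named argus, as the text above suggests, and the helper names enable_argus_debug_logging and can_reach are hypothetical. The reachability check treats any HTTP response as proof that the control plane is reachable, including an error status.

```python
import logging
import os
import urllib.error
import urllib.request


def enable_argus_debug_logging() -> logging.Logger:
    """Show DEBUG records from the 'argus' logger (name assumed from the text)."""
    logging.basicConfig(level=logging.INFO)   # other libraries stay at INFO
    logger = logging.getLogger("argus")
    logger.setLevel(logging.DEBUG)
    return logger


def can_reach(url: str, timeout: float = 5.0) -> bool:
    """Return True if any HTTP response comes back from url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status, so the network path works.
        return True
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or a malformed URL.
        return False


if __name__ == "__main__":
    enable_argus_debug_logging()
    url = os.environ.get("ARGUS_FLEET_URL")
    if url is None:
        print("ARGUS_FLEET_URL is not set")
    else:
        print(f"{url} reachable: {can_reach(url)}")
```

If the URL is reachable but the bot still does not appear, compare ARGUS_FLEET_TOKEN on the member and the control plane next.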

See also

  • Fleet - the reference (registry semantics, HTTP API, config tables).
  • Clustering - the per-process metric/label model this builds on.
  • Configuration - every option and its precedence.
