Clustering
Argus supports single-process and clustered (multi-process) deployments. The
cluster label keeps them apart and is the only thing you must set per process.
One AutoShardedBot, one Argus, one endpoint exposing all shards. cluster_id is
optional (defaults to default).

    Argus(bot)

Run one Argus per process. Two separate questions, often conflated:
- cluster_id - always distinct, no exceptions. It is the value of the cluster
  label that keeps processes apart in Prometheus.
- Port - distinct only when processes share a host. A port can only be bound
  once per host, so co-located processes need 9191, 9192, ... But processes on
  separate hosts / containers / pods each have their own IP, so they can (and
  should) all keep the default 9191 - there is no collision, and Prometheus
  scrapes host-a:9191, host-b:9191, ... This is the normal production case; see
  Hosting at scale below.
The example below co-locates two processes on one host, so it uses distinct ports. On separate hosts you would leave both at 9191:
    Argus(bot, cluster_id="0", port=9191)  # process 0, shards 0..n
    Argus(bot, cluster_id="1", port=9192)  # process 1, shards n+1..m (same host -> distinct port)

State gauges carry the distinct cluster label; every counter and the duration
histogram carry it too, so per-cluster breakdowns work directly:

    sum by (cluster) (rate(discord_interactions_total[5m]))
    sum by (cluster) (discord_guilds)

Counter rates aggregate across the fleet by simply dropping the by (cluster).
List every process; do not also set a cluster target label, or Prometheus
renames Argus's own cluster label to exported_cluster to avoid the clash:

    scrape_configs:
      - job_name: argus
        static_configs:
          - targets:
              - "host.docker.internal:9191"
              - "host.docker.internal:9192"

examples/clustered_bot.py shows the per-process pattern driven by env vars
(CLUSTER_ID, ARGUS_PORT, SHARD_IDS, SHARD_COUNT).
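
As a rough illustration of that env-var pattern, the sketch below reads the four variables into one config object. It is not the contents of examples/clustered_bot.py. The comma-separated SHARD_IDS format, the fallbacks for SHARD_IDS and SHARD_COUNT, and the names ClusterConfig and load_cluster_config are assumptions made for this example. Only the cluster_id default (default) and the port default (9191) come from the text above.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class ClusterConfig:
    cluster_id: str
    port: int
    shard_ids: List[int]
    shard_count: int


def load_cluster_config(env: Mapping[str, str] = os.environ) -> ClusterConfig:
    """Read the per-process settings from env vars (formats are assumed)."""
    # Assumption: SHARD_IDS is a comma-separated list such as "4,5,6,7".
    shard_ids = [int(s) for s in env.get("SHARD_IDS", "0").split(",") if s.strip()]
    shard_count = int(env.get("SHARD_COUNT", "1"))
    # Every shard id this process owns must be below the global shard count.
    out_of_range = [s for s in shard_ids if not 0 <= s < shard_count]
    if out_of_range:
        raise ValueError(f"shard ids {out_of_range} outside 0..{shard_count - 1}")
    return ClusterConfig(
        cluster_id=env.get("CLUSTER_ID", "default"),
        port=int(env.get("ARGUS_PORT", "9191")),
        shard_ids=shard_ids,
        shard_count=shard_count,
    )
```

The resulting values would then be passed on, for example as Argus(bot, cluster_id=cfg.cluster_id, port=cfg.port), with shard_ids and shard_count going to the bot constructor.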
discord_shard_latency_seconds{shard} and discord_shard_up{shard} are
per-shard; shard ids are globally unique across a clustered deploy (each process
owns a disjoint range), so they need no cluster qualifier to disambiguate.
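
To make the disjoint-range idea concrete, here is a minimal sketch of one way to split a global shard count across processes. The helper name shard_ranges and the even-split policy are assumptions for illustration; any assignment works as long as no shard id is owned by two processes.

```python
from typing import List


def shard_ranges(shard_count: int, processes: int) -> List[range]:
    """Split shard ids 0..shard_count-1 into contiguous, disjoint ranges."""
    if processes < 1 or shard_count < processes:
        raise ValueError("need at least one shard per process")
    base, extra = divmod(shard_count, processes)
    ranges: List[range] = []
    start = 0
    for i in range(processes):
        # The first `extra` processes take one additional shard each.
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


print(shard_ranges(10, 3))  # [range(0, 4), range(4, 7), range(7, 10)]
```

Because every id appears in exactly one range, a query on discord_shard_up{shard="5"} identifies a single shard regardless of which cluster hosts it.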
The metrics endpoint must live inside each bot process: gauges read live bot
state (bot.latencies, bot.guilds, ...) at scrape time, so /metrics cannot
be moved to a separate process. You run one Argus per process; what you
centralise is the view and the storage, not the collection.
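
The reason the endpoint cannot move out of the process can be shown with a small stand-alone sketch of a gauge that is evaluated at render (scrape) time. This is not Argus's implementation. LiveGauge and FakeBot are hypothetical names, and the output is only a simplified single-sample line in Prometheus text style.

```python
from typing import Callable, Dict, List


class LiveGauge:
    """A gauge whose value is computed when rendered, not stored ahead of time."""

    def __init__(self, name: str, read: Callable[[], float], labels: Dict[str, str]):
        self.name = name
        self._read = read
        self._labels = labels

    def render(self) -> str:
        labels = ",".join(f'{k}="{v}"' for k, v in sorted(self._labels.items()))
        # The callback runs here, so it needs access to the live bot object.
        return f"{self.name}{{{labels}}} {self._read()}"


class FakeBot:
    """Stand-in for a bot; only a guilds list is modelled."""

    def __init__(self) -> None:
        self.guilds: List[str] = []


bot = FakeBot()
gauge = LiveGauge("discord_guilds", lambda: len(bot.guilds), {"cluster": "0"})
bot.guilds.extend(["guild-a", "guild-b"])  # state changes after registration
print(gauge.render())  # discord_guilds{cluster="0"} 2
```

A separate process would have no bot object to call into, which is why collection stays local and only storage and viewing are centralised.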
Ports. Do not hand-allocate a contiguous range like 9191..9290.
- Separate hosts/containers/pods (the normal case at this scale): keep the
  same port (9191) on every process. They do not collide because each has its
  own network namespace/IP. Prometheus scrapes each at host:9191.
- Co-located on one host (not recommended at 100): use distinct ports, assigned
  by the orchestrator via ARGUS_PORT per process, and discover targets with
  Prometheus service discovery, not a hand-written 100-target config.
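
The port rule above (same default port on separate hosts, distinct ports only among co-located processes) could be computed on the orchestrator side roughly as follows. plan_ports and the host names are hypothetical. A real orchestrator would usually do this itself and pass the result to each process as ARGUS_PORT.

```python
from collections import defaultdict
from typing import Dict

DEFAULT_PORT = 9191


def plan_ports(cluster_hosts: Dict[str, str], base: int = DEFAULT_PORT) -> Dict[str, int]:
    """Map cluster_id -> port, incrementing only among clusters on the same host."""
    used_on_host: Dict[str, int] = defaultdict(int)
    ports: Dict[str, int] = {}
    for cluster_id, host in cluster_hosts.items():
        # The first cluster on a host gets the base port; later ones get base+1, base+2, ...
        ports[cluster_id] = base + used_on_host[host]
        used_on_host[host] += 1
    return ports


print(plan_ports({"0": "host-a", "1": "host-b", "2": "host-a"}))
# {'0': 9191, '1': 9191, '2': 9192}
```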
Scrape config. Use service discovery rather than a static list at this size.
- Kubernetes: one pod per cluster, containerPort: 9191, a
  PodMonitor/ServiceMonitor selecting them. The cluster label is already on the
  metrics.
- VMs/bare metal: Prometheus file_sd or DNS SD listing the hosts.
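
For the file_sd route, a target file is a JSON list of groups, each with a targets array and optional labels. The sketch below generates one from a host list. The function name, host names, and the env label are illustrative assumptions. It refuses a cluster target label, since that would clash with the label Argus already exports.

```python
import json
from typing import Dict, Iterable, Optional


def file_sd_targets(
    hosts: Iterable[str],
    port: int = 9191,
    labels: Optional[Dict[str, str]] = None,
) -> str:
    """Return file_sd JSON listing host:port for every bot process."""
    labels = dict(labels or {})
    if "cluster" in labels:
        # A target-level cluster label would make Prometheus rename Argus's own
        # label to exported_cluster.
        raise ValueError("do not set a 'cluster' target label")
    group: Dict[str, object] = {"targets": [f"{host}:{port}" for host in hosts]}
    if labels:
        group["labels"] = labels
    return json.dumps([group], indent=2)


print(file_sd_targets(["host-a", "host-b"], labels={"env": "prod"}))
```

The output would be written to a file that a file_sd_configs entry in the Prometheus scrape job points at, so adding a host means regenerating the file rather than editing the scrape config.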
A single pane across all clusters. The built-in dashboard is per-process (it
shows the one process it is attached to). For a fleet view across 100 clusters,
use Grafana: Prometheus scrapes every process and Grafana aggregates with
PromQL (sum by (cluster) (...)). The repo ships provisioned Grafana dashboards;
set grafana_url so each process's dashboard links to them. A common production
setup is dashboard=False on the bots (rely on Grafana for the fleet) while
/metrics stays scraped.
Already separate. ClickHouse analytics is external and shared: all processes write to one ClickHouse and queries run anywhere (see History and ClickHouse).
The Grafana approach above works, but if you want a readable Global -> Fleet -> Cluster view with no PromQL or Grafana setup, use the Fleet control plane: a standalone, opt-in service that aggregates every process into one pane, fed by a zero-infra push path and/or your existing Prometheus. It assigns stable per-region numbers and shows which clusters are down. Walkthrough: Tutorial Fleet.