Clustering
Argus supports single-process and clustered (multi-process) deployments. The
cluster label keeps them apart and is the only thing you must set per process.
Single process. One AutoShardedBot, one Argus, one endpoint exposing all shards.
cluster_id is optional (defaults to "default").
Argus(bot)
Clustered. Run one Argus per process, each with a distinct cluster_id and port:
Argus(bot, cluster_id="0", port=9191) # process 0, shards 0..n
Argus(bot, cluster_id="1", port=9192) # process 1, shards n+1..m
State gauges carry the distinct cluster label; every counter and the duration
histogram carry it too, so per-cluster breakdowns work directly:
sum by (cluster) (rate(discord_interactions_total[5m]))
sum by (cluster) (discord_guilds)
To aggregate counter rates across the whole fleet, drop the by (cluster) clause.
In the Prometheus scrape config, list every process as a target. Do not also
set a cluster target label: with the default honor_labels: false, Prometheus
renames Argus's own cluster label to exported_cluster to avoid the clash.
scrape_configs:
  - job_name: argus
    static_configs:
      - targets:
          - "host.docker.internal:9191"
          - "host.docker.internal:9192"
examples/clustered_bot.py shows the per-process pattern driven by env vars
(CLUSTER_ID, ARGUS_PORT, SHARD_IDS, SHARD_COUNT).
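As a rough illustration of that env-var pattern (not the contents of examples/clustered_bot.py), the sketch below reads the four variables into a small config object. The comma-separated SHARD_IDS format, the 9191 port fallback, and the names ClusterConfig and load_cluster_config are assumptions made for this example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical container for the per-process settings."""
    cluster_id: str
    port: int
    shard_ids: list[int]
    shard_count: int


def load_cluster_config(env=os.environ) -> ClusterConfig:
    """Read CLUSTER_ID, ARGUS_PORT, SHARD_IDS and SHARD_COUNT from a mapping."""
    # Assumption: SHARD_IDS is a comma-separated list such as "4,5,6,7".
    shard_ids = [int(s) for s in env["SHARD_IDS"].split(",") if s.strip()]
    return ClusterConfig(
        cluster_id=env.get("CLUSTER_ID", "default"),
        port=int(env.get("ARGUS_PORT", "9191")),  # assumed fallback port
        shard_ids=shard_ids,
        shard_count=int(env["SHARD_COUNT"]),
    )


# A plain dict stands in for os.environ so the example runs anywhere.
cfg = load_cluster_config(
    {"CLUSTER_ID": "1", "ARGUS_PORT": "9192", "SHARD_IDS": "4,5,6,7", "SHARD_COUNT": "8"}
)
# The values would then be handed to the bot and to Argus, for example:
#   bot = AutoShardedBot(..., shard_ids=cfg.shard_ids, shard_count=cfg.shard_count)
#   Argus(bot, cluster_id=cfg.cluster_id, port=cfg.port)
```

Keeping the parsing in one function means the orchestrator only has to set four variables per process, and the bot code stays identical across clusters.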
discord_shard_latency_seconds{shard} and discord_shard_up{shard} are
per-shard; shard ids are globally unique across a clustered deploy (each process
owns a disjoint range), so they need no cluster qualifier to disambiguate.
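One way to produce such disjoint ranges is to split the shard ids into contiguous blocks, one per process. This is only a sketch of that idea; shard_ranges is a hypothetical helper, not part of Argus, and any assignment works as long as no shard id is owned twice.

```python
def shard_ranges(shard_count: int, cluster_count: int) -> list[range]:
    """Split shard ids 0..shard_count-1 into contiguous, disjoint ranges."""
    if cluster_count <= 0 or shard_count < cluster_count:
        raise ValueError("need at least one shard per cluster")
    base, extra = divmod(shard_count, cluster_count)
    ranges, start = [], 0
    for i in range(cluster_count):
        # The first `extra` clusters take one additional shard each.
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


ranges = shard_ranges(shard_count=10, cluster_count=3)
# Flattening the ranges yields every shard id exactly once.
owned = [shard for r in ranges for shard in r]
```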
The metrics endpoint must live inside each bot process: gauges read live bot
state (bot.latencies, bot.guilds, ...) at scrape time, so /metrics cannot
be moved to a separate process. You run one Argus per process; what you
centralise is the view and the storage, not the collection.
Ports. Do not hand-allocate a contiguous range like 9191..9290.
- Separate hosts/containers/pods (the normal case at this scale): keep the
  same port (9191) on every process. They do not collide because each has its
  own network namespace/IP. Prometheus scrapes each at host:9191.
- Co-located on one host (not recommended at 100 clusters): use distinct ports,
  assigned by the orchestrator via ARGUS_PORT per process, and discover targets
  with Prometheus service discovery, not a hand-written 100-target config.
Scrape config. Use service discovery rather than a static list at this size.
- Kubernetes: one pod per cluster, containerPort: 9191, a
  PodMonitor/ServiceMonitor selecting them. The cluster label is already on
  the metrics.
- VMs/bare metal: Prometheus file_sd or DNS SD listing the hosts.
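For the file_sd route, the target file can be generated rather than written by hand. The sketch below renders the JSON structure that Prometheus file-based discovery reads (a list of target groups). The host names and the file_sd_targets helper are made up for the example, and it deliberately attaches no cluster label because the metrics already carry one.

```python
import json


def file_sd_targets(hosts: list[str], port: int = 9191) -> str:
    """Render a Prometheus file_sd JSON document for the given hosts."""
    # One target group with no extra labels; in particular no `cluster`
    # label, so Argus's own label is not renamed to exported_cluster.
    groups = [{"targets": [f"{host}:{port}" for host in hosts]}]
    return json.dumps(groups, indent=2)


# Hypothetical host names; in practice these come from an inventory.
doc = file_sd_targets(["bot-0.internal", "bot-1.internal"])
```

Writing `doc` to a path listed under file_sd_configs in the scrape job lets Prometheus pick up changes to the target list without a restart.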
A single pane across all clusters. The built-in dashboard is per-process (it
shows the one process it is attached to). For a fleet view across 100 clusters,
use Grafana: Prometheus scrapes every process and Grafana aggregates the series
with PromQL (sum by (cluster) (...)). The repo ships provisioned Grafana dashboards;
set grafana_url so each process's dashboard links to them. A common production
setup is dashboard=False on the bots (rely on Grafana for the fleet) while
/metrics stays scraped.
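The same fleet-wide aggregation can be checked outside Grafana through the Prometheus HTTP API, whose instant-query endpoint is /api/v1/query. The sketch below only builds the request URL; the base URL is a placeholder and instant_query_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Placeholder Prometheus address; the query is the per-cluster guild count.
url = instant_query_url(
    "http://prometheus.example:9090",
    "sum by (cluster) (discord_guilds)",
)
```

Fetching that URL returns one series per cluster label value, which is a quick way to confirm that every process is being scraped.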
Already separate. ClickHouse analytics is external and shared: all processes write to one ClickHouse and queries run anywhere (see History and ClickHouse).
Roadmap idea (not shipped): a mode where the built-in SPA reads from a Prometheus base URL (PromQL) instead of a single /metrics, making it a standalone, separately-hostable fleet dashboard. Open an issue if you want it.