Clustering

AstorisTheBrave edited this page Jun 21, 2026 · 5 revisions

Argus supports single-process and clustered (multi-process) deployments. In a clustered deployment the cluster label tells the processes apart in Prometheus; its value, cluster_id, is the only setting that must always differ per process.

Single process

One AutoShardedBot, one Argus, and one endpoint exposing all shards. cluster_id is optional; it defaults to "default".

Argus(bot)

Clustered

Run one Argus per process. Two separate questions, often conflated:

  • cluster_id - always distinct, no exceptions. It is the value of the cluster label that keeps processes apart in Prometheus.
  • Port - distinct only when processes share a host. A port can only be bound once per host, so co-located processes need 9191, 9192, ... But processes on separate hosts / containers / pods each have their own IP, so they can (and should) all keep the default 9191 - there is no collision, and Prometheus scrapes host-a:9191, host-b:9191, ... This is the normal production case; see Hosting at scale below.

The example below co-locates two processes on one host, so it uses distinct ports. On separate hosts you would leave both at 9191:

Argus(bot, cluster_id="0", port=9191)   # process 0, shards 0..n
Argus(bot, cluster_id="1", port=9192)   # process 1, shards n+1..m (same host -> distinct port)
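
The port rule above can be sketched as a small helper. This is only an illustration, not part of Argus: assign_ports and its (host, cluster_id) input shape are hypothetical names chosen for this sketch. It hands out 9191, 9192, ... per host, so co-located processes get distinct ports while processes on separate hosts all keep 9191, and it rejects a repeated cluster_id.

```python
from collections import defaultdict

DEFAULT_PORT = 9191  # the default port named on this page


def assign_ports(processes):
    """Map each cluster_id to a port (hypothetical helper, not an Argus API).

    `processes` is a list of (host, cluster_id) pairs. Ports only need to
    differ between processes that share a host, so numbering restarts at
    DEFAULT_PORT for every host.
    """
    seen_per_host = defaultdict(int)
    ports = {}
    for host, cluster_id in processes:
        if cluster_id in ports:
            # cluster_id must always be distinct, regardless of host.
            raise ValueError(f"duplicate cluster_id: {cluster_id!r}")
        ports[cluster_id] = DEFAULT_PORT + seen_per_host[host]
        seen_per_host[host] += 1
    return ports


co_located = assign_ports([("host-a", "0"), ("host-a", "1")])
separate = assign_ports([("host-a", "0"), ("host-b", "1")])
```

Here co_located gives the two processes different ports, while separate leaves both on the default.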

State gauges carry the distinct cluster label; every counter and the duration histogram carry it too, so per-cluster breakdowns work directly:

sum by (cluster) (rate(discord_interactions_total[5m]))
sum by (cluster) (discord_guilds)

To aggregate counter rates across the whole fleet, drop the by (cluster) clause.

Prometheus scrape config

List every process; do not also set a cluster target label, or Prometheus renames Argus's own cluster label to exported_cluster to avoid the clash:

scrape_configs:
  - job_name: argus
    static_configs:
      - targets:
          - "host.docker.internal:9191"
          - "host.docker.internal:9192"

examples/clustered_bot.py shows the per-process pattern driven by env vars (CLUSTER_ID, ARGUS_PORT, SHARD_IDS, SHARD_COUNT).
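
A minimal sketch of the env-var pattern, assuming CLUSTER_ID and ARGUS_PORT carry a plain string and an integer. The helper name argus_kwargs_from_env is hypothetical and the shipped examples/clustered_bot.py may be written differently. SHARD_IDS and SHARD_COUNT are left out because their format is not described here.

```python
import os


def argus_kwargs_from_env(env=None):
    """Build the per-process Argus keyword arguments from env vars.

    Hypothetical helper for illustration. CLUSTER_ID is required because
    every clustered process needs a distinct cluster label; ARGUS_PORT
    falls back to 9191, which is correct whenever processes do not share
    a host.
    """
    env = os.environ if env is None else env
    cluster_id = env.get("CLUSTER_ID")
    if not cluster_id:
        raise ValueError("CLUSTER_ID must be set for every clustered process")
    port = int(env.get("ARGUS_PORT", "9191"))
    return {"cluster_id": cluster_id, "port": port}


# Intended use inside the bot process (Argus and bot come from your own code):
#     Argus(bot, **argus_kwargs_from_env())
example = argus_kwargs_from_env({"CLUSTER_ID": "1", "ARGUS_PORT": "9192"})
defaulted = argus_kwargs_from_env({"CLUSTER_ID": "7"})
```

Failing fast on a missing CLUSTER_ID avoids two processes silently sharing the default label.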

Per-shard metrics

discord_shard_latency_seconds{shard} and discord_shard_up{shard} are per-shard; shard ids are globally unique across a clustered deploy (each process owns a disjoint range), so they need no cluster qualifier to disambiguate.
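
One way to give each process a disjoint shard range is an even contiguous split. The sketch below is an assumption about how a deployment might divide shards, not something Argus does for you; shard_range is a hypothetical helper.

```python
def shard_range(cluster_index, cluster_count, shard_count):
    """Return the shard ids owned by one cluster (hypothetical helper).

    Shards are split into contiguous blocks; when the division is uneven,
    the first `extra` clusters take one additional shard each.
    """
    base, extra = divmod(shard_count, cluster_count)
    start = cluster_index * base + min(cluster_index, extra)
    size = base + (1 if cluster_index < extra else 0)
    return list(range(start, start + size))


# 10 shards over 3 clusters.
ranges = [shard_range(i, 3, 10) for i in range(3)]
all_shards = [shard for block in ranges for shard in block]
```

Because the blocks never overlap and together cover every shard id exactly once, a shard label alone identifies a shard anywhere in the fleet.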

Hosting at scale (e.g. 100 clusters) and ports

The metrics endpoint must live inside each bot process: gauges read live bot state (bot.latencies, bot.guilds, ...) at scrape time, so /metrics cannot be moved to a separate process. You run one Argus per process; what you centralise is the view and the storage, not the collection.
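
To illustrate why collection has to stay in-process, here is a toy model of a gauge that is evaluated at scrape time. It is not Argus's implementation: FakeBot and render_guild_gauge are hypothetical stand-ins, and only the discord_guilds metric name and the cluster label come from this page.

```python
class FakeBot:
    """Stand-in for the bot; only the attribute read below is modelled."""

    def __init__(self):
        self.guilds = []


def render_guild_gauge(bot, cluster_id):
    # The value is computed when this function runs, i.e. at scrape time,
    # so it needs direct access to the live bot object in the same process.
    return f'discord_guilds{{cluster="{cluster_id}"}} {len(bot.guilds)}'


bot = FakeBot()
before = render_guild_gauge(bot, "0")
bot.guilds.append("guild-1")  # bot state changes between two scrapes
after = render_guild_gauge(bot, "0")
```

The second render reflects the new state without any explicit update call, which is only possible when the renderer shares memory with the bot.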

Ports. Do not hand-allocate a contiguous range like 9191..9290.

  • Separate hosts/containers/pods (the normal case at this scale): keep the same port (9191) on every process. They do not collide because each has its own network namespace/IP. Prometheus scrapes each at host:9191.
  • Co-located on one host (not recommended at 100): use distinct ports, assigned by the orchestrator via ARGUS_PORT per process, and discover targets with Prometheus service discovery, not a hand-written 100-target config.

Scrape config. Use service discovery rather than a static list at this size.

  • Kubernetes: one pod per cluster, containerPort: 9191, a PodMonitor/ServiceMonitor selecting them. The cluster label is already on the metrics.
  • VMs/bare metal: Prometheus file_sd or DNS SD listing the hosts.
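
For the file_sd route, the target file can be generated rather than written by hand. The sketch below assumes you already have a list of hostnames; file_sd_targets is a hypothetical helper and the hostnames are placeholders. It emits the JSON shape that Prometheus file-based service discovery reads (a list of target groups) and sets no cluster target label, so Argus's own cluster label is not renamed to exported_cluster.

```python
import json


def file_sd_targets(hosts, port=9191):
    """Build one file_sd target group for the given hosts.

    Hypothetical helper. Every process keeps the same port because each
    host has its own IP.
    """
    return [{"targets": [f"{host}:{port}" for host in hosts]}]


# Placeholder hostnames; in practice these come from your inventory.
groups = file_sd_targets(["host-a", "host-b"])
rendered = json.dumps(groups, indent=2)
```

Writing rendered to the path named in your file_sd_configs entry lets Prometheus pick up new hosts without a config reload.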

A single pane across all clusters. The built-in dashboard is per-process (it shows only the process it is attached to). For a fleet view across 100 clusters, use Grafana: Prometheus scrapes every process, and Grafana queries Prometheus and aggregates with PromQL (sum by (cluster) (...)). The repo ships provisioned Grafana dashboards; set grafana_url so each process's dashboard links to them. A common production setup is dashboard=False on the bots (relying on Grafana for the fleet view) while /metrics continues to be scraped.

Already separate. ClickHouse analytics is external and shared: all processes write to one ClickHouse and queries run anywhere (see History and ClickHouse).

A built-in fleet view (the control plane)

The Grafana approach above works, but if you want a readable Global -> Fleet -> Cluster view with no PromQL or Grafana setup, use the Fleet control plane: a standalone, opt-in service that aggregates every process into one pane, fed by a zero-infra push path and/or your existing Prometheus. It assigns stable per-region numbers and shows which clusters are down. Walkthrough: Tutorial Fleet.
