Clustering

Argus supports single-process and clustered (multi-process) deployments. The cluster label, set through cluster_id, keeps processes apart in Prometheus and is the only value that must differ for every process.

Single process

One AutoShardedBot, one Argus, one endpoint exposing all shards. cluster_id is optional and defaults to "default".

Argus(bot)

Clustered

Run one Argus per process. Two separate questions, often conflated:

cluster_id - always distinct, no exceptions. It is the value of the cluster label that keeps processes apart in Prometheus.
Port - distinct only when processes share a host. A port can only be bound once per host, so co-located processes need 9191, 9192, ... But processes on separate hosts / containers / pods each have their own IP, so they can (and should) all keep the default 9191 - there is no collision, and Prometheus scrapes host-a:9191, host-b:9191, ... This is the normal production case; see Hosting at scale below.

The example below co-locates two processes on one host, so it uses distinct ports. On separate hosts you would leave both at 9191:

Argus(bot, cluster_id="0", port=9191)   # process 0, shards 0..n
Argus(bot, cluster_id="1", port=9192)   # process 1, shards n+1..m (same host -> distinct port)

Every metric carries the cluster label: the state gauges, every counter, and the duration histogram. Per-cluster breakdowns therefore work directly:

sum by (cluster) (rate(discord_interactions_total[5m]))
sum by (cluster) (discord_guilds)

To aggregate counter rates across the whole fleet, drop the by (cluster) clause, for example sum(rate(discord_interactions_total[5m])).

Prometheus scrape config

List every process as a target. Do not also set a cluster target label: with Prometheus's default honor_labels: false, the target label wins and Argus's own cluster label is renamed to exported_cluster to avoid the clash:

scrape_configs:
  - job_name: argus
    static_configs:
      - targets:
          - "host.docker.internal:9191"
          - "host.docker.internal:9192"

examples/clustered_bot.py shows the per-process pattern driven by env vars (CLUSTER_ID, ARGUS_PORT, SHARD_IDS, SHARD_COUNT).
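The env-driven pattern can be sketched roughly as follows. This is not the contents of examples/clustered_bot.py; the helper name argus_settings_from_env is hypothetical, and the comma-separated format for SHARD_IDS is an assumption that the real example may not share.

```python
import os


def argus_settings_from_env(env=None):
    """Collect per-process settings from environment variables.

    Assumption: SHARD_IDS is a comma-separated list such as "4,5,6,7".
    Missing values fall back to the single-process defaults.
    """
    env = os.environ if env is None else env
    raw_ids = env.get("SHARD_IDS", "")
    shard_ids = [int(part) for part in raw_ids.split(",") if part.strip()]
    shard_count = env.get("SHARD_COUNT")
    return {
        "cluster_id": env.get("CLUSTER_ID", "default"),  # must differ per process
        "port": int(env.get("ARGUS_PORT", "9191")),      # differs only on a shared host
        "shard_ids": shard_ids or None,
        "shard_count": int(shard_count) if shard_count else None,
    }


# Illustrative values for the second process on a shared host.
settings = argus_settings_from_env(
    {"CLUSTER_ID": "1", "ARGUS_PORT": "9192", "SHARD_IDS": "4,5,6,7", "SHARD_COUNT": "8"}
)
print(settings)
```

In a real process, cluster_id and port would be passed to Argus(bot, cluster_id=..., port=...), while the shard values would go to the bot's own constructor.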

Per-shard metrics

discord_shard_latency_seconds{shard} and discord_shard_up{shard} are per-shard; shard ids are globally unique across a clustered deploy (each process owns a disjoint range), so they need no cluster qualifier to disambiguate.
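The disjoint-range property is what makes the shard label unambiguous on its own. As a minimal sketch of one way to produce such ranges (the shard_ranges helper is hypothetical and not part of Argus; any assignment works as long as no shard id appears in two processes):

```python
def shard_ranges(shard_count, cluster_count):
    """Split shard ids 0..shard_count-1 into contiguous, disjoint blocks.

    Earlier clusters receive one extra shard when the split is uneven.
    """
    if cluster_count < 1 or shard_count < cluster_count:
        raise ValueError("need at least one shard per cluster")
    base, extra = divmod(shard_count, cluster_count)
    ranges, start = [], 0
    for cluster in range(cluster_count):
        size = base + (1 if cluster < extra else 0)
        ranges.append(list(range(start, start + size)))
        start += size
    return ranges


plan = shard_ranges(shard_count=10, cluster_count=3)
for cluster_id, shard_ids in enumerate(plan):
    print(f"cluster {cluster_id}: shards {shard_ids}")
```

Because every shard id lands in exactly one block, a query on discord_shard_up{shard="5"} matches a single series across the whole deployment.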

Hosting at scale (e.g. 100 clusters) and ports

The metrics endpoint must live inside each bot process: gauges read live bot state (bot.latencies, bot.guilds, ...) at scrape time, so /metrics cannot be moved to a separate process. You run one Argus per process; what you centralise is the view and the storage, not the collection.

Ports. Do not hand-allocate a contiguous range like 9191..9290.

Separate hosts/containers/pods (the normal case at this scale): keep the same port (9191) on every process. They do not collide because each has its own network namespace/IP. Prometheus scrapes each at host:9191.
Co-located on one host (not recommended at 100): use distinct ports, assigned by the orchestrator via ARGUS_PORT per process, and discover targets with Prometheus service discovery, not a hand-written 100-target config.

Scrape config. Use service discovery rather than a static list at this size.

Kubernetes: one pod per cluster, containerPort: 9191, and a PodMonitor or ServiceMonitor selecting them. The cluster label is already on the metrics.
VMs/bare metal: Prometheus file_sd or DNS SD listing the hosts.
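For the file_sd option, the target file is a JSON list of groups, each with a targets array and optional labels. A minimal sketch of generating it from a host inventory follows; the hostnames and the file_sd_targets helper are hypothetical, and the script only prints the JSON rather than writing it to wherever your Prometheus file_sd_configs entry points.

```python
import json


def file_sd_targets(hosts, port=9191, labels=None):
    """Render a Prometheus file_sd document with one target per host.

    Every host keeps the same port, since separate hosts do not collide.
    A "cluster" target label is rejected because it would clash with the
    cluster label Argus already exports.
    """
    labels = dict(labels or {})
    if "cluster" in labels:
        raise ValueError("do not set a cluster target label")
    group = {"targets": [f"{host}:{port}" for host in hosts]}
    if labels:
        group["labels"] = labels
    return json.dumps([group], indent=2)


# Hypothetical inventory of three bot hosts.
document = file_sd_targets(["bot-0.internal", "bot-1.internal", "bot-2.internal"])
print(document)
```

Regenerating this file when hosts change lets Prometheus pick up new targets without a restart, which is the point of using discovery over a hand-written list.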

A single pane across all clusters. The built-in dashboard is per-process (it shows the one process it is attached to). For a fleet view across 100 clusters, use Grafana: Prometheus scrapes every process, and Grafana queries it and aggregates with PromQL (sum by (cluster) (...)). The repo ships provisioned Grafana dashboards; set grafana_url so each process's dashboard links to them. A common production setup is dashboard=False on the bots (relying on Grafana for the fleet view) while /metrics continues to be scraped.

Already separate. ClickHouse analytics is external and shared: all processes write to one ClickHouse and queries run anywhere (see History and ClickHouse).

A built-in fleet view (the control plane)

The Grafana approach above works, but if you want a readable Global -> Fleet -> Cluster view with no PromQL or Grafana setup, use the Fleet control plane: a standalone, opt-in service that aggregates every process into one pane, fed by a zero-infra push path and/or your existing Prometheus. It assigns stable per-region numbers and shows which clusters are down. Walkthrough: Tutorial Fleet.
