AstorisTheBrave edited this page Jun 21, 2026 · 5 revisions

FAQ

Short answers to the things people hit first. See Configuration and Dashboard for the full detail.

Do I need to configure anything?

No. Argus(bot) is the whole integration: metrics at /metrics, the dashboard at /, on port 9191. Everything else is opt-in.

How do I protect the dashboard? Where does the token go?

Set one environment variable on the host/process that runs your bot:

ARGUS_DASHBOARD_AUTH_TOKEN=your-secret

Argus reads it automatically (no kwarg needed) and applies it to the dashboard it serves. There is nothing else to host: the dashboard is served by Argus inside your bot process, not as a separate app. The token gates / and every /api/* route; /metrics and /healthz stay open so a Prometheus scraper does not need it.

You can also pass it in code: Argus(bot, dashboard_auth_token="..."). The kwarg wins over the env var if both are set.
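The precedence described above (kwarg first, then environment) can be sketched as follows. This is a minimal illustration, not Argus's actual code: resolve_dashboard_token is a hypothetical helper, and treating an empty variable as "no token" is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_dashboard_token(
    kwarg_token: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the effective token: explicit kwarg first, then the env var."""
    env = os.environ if env is None else env
    if kwarg_token is not None:
        # The kwarg wins when both are set.
        return kwarg_token
    # Assumption: an unset or empty variable means "no token configured".
    return env.get("ARGUS_DASHBOARD_AUTH_TOKEN") or None
```

Passing env explicitly keeps the helper easy to test without touching the real process environment.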

How do viewers log in?

Open the dashboard once with the token in the URL and it is remembered in that browser (localStorage):

http://your-host:9191/?token=your-secret

After that, plain http://your-host:9191/ works. Programmatic clients send Authorization: Bearer your-secret.
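For a programmatic client, the Bearer header can be attached with the standard library alone. The sketch below only builds the request; build_dashboard_request is a hypothetical helper and /api/example is a placeholder path, not a documented Argus route.

```python
import urllib.request


def build_dashboard_request(base_url: str, path: str, token: str) -> urllib.request.Request:
    """Build a GET request that carries the dashboard token as a Bearer header."""
    # Normalise slashes so "http://host:9191/" + "/api/x" does not double up.
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


# Placeholder host, path, and token; substitute your own values.
request = build_dashboard_request("http://your-host:9191", "/api/example", "your-secret")
```

To send it, pass the request to urllib.request.urlopen(request, timeout=5); a 401 response means the token is missing or wrong.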

"I host the dashboard separately" — how?

You do not. Argus serves the SPA from the same aiohttp server as /metrics, on your bot's event loop. If you want it on a public URL, put a reverse proxy in front of the bot's port and set the token. If you want it on a different path or port, use dashboard_path / port.

How do I turn the dashboard off?

Argus(bot, dashboard=False) (or ARGUS_DASHBOARD=false). /metrics still serves.

Why don't I see per-guild numbers in Prometheus?

By design. guild_id/user_id/channel_id are unbounded and would blow up Prometheus label cardinality, so they are never labels (invariant 2). Per-guild figures live in the analytical path: set enable_per_guild=true + clickhouse_dsn, and use the dashboard's Analytics section. See History and ClickHouse.

discord_cached_users looks wrong / is zero

It reflects the cache, which needs the members intent enabled on your bot to be meaningful.

I run multiple processes (clustering)

Run one Argus per process with a distinct cluster_id; the cluster label separates them and counter rates aggregate across the fleet. See Clustering.

Can I host the dashboard/metrics completely separately from the bot?

The /metrics endpoint cannot be moved out of the bot process: gauges read live bot state at scrape time. So collection is always in-process. What you centralise is the view (Grafana, which aggregates all processes via Prometheus) and the storage (one shared ClickHouse for analytics). The built-in per-process dashboard is for a single process; use Grafana for a fleet view (set grafana_url).

100 clusters - should I use ports 9191..9290?

No. If each cluster is on its own host/container/pod (the normal case), keep the same port 9191 everywhere - they do not collide across hosts. Only co-located processes need distinct ports, and then assign them via ARGUS_PORT from your orchestrator and use Prometheus service discovery (Kubernetes PodMonitor, file_sd, DNS) instead of a hand-written 100-target scrape config. Full guidance: Clustering.
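If you use file_sd, the target file can be generated rather than written by hand. The sketch below assumes your orchestrator can give you a list of hostnames (bot-0, bot-1 are placeholders) and emits the JSON shape Prometheus file-based discovery reads. It deliberately attaches no cluster target label, since Argus exports its own.

```python
import json
from typing import Iterable, List


def file_sd_targets(hosts: Iterable[str], port: int = 9191) -> str:
    """Render a Prometheus file_sd JSON document for the given hosts."""
    targets: List[str] = [f"{host}:{port}" for host in hosts]
    # One target group, no extra labels: Argus already sets the cluster label.
    return json.dumps([{"targets": targets}], indent=2)


# Hypothetical hostnames; every host keeps the same port.
print(file_sd_targets(["bot-0", "bot-1"]))
```

Write the result to a file referenced from a file_sd_configs entry in your scrape config; Prometheus picks up changes to that file without a restart.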

Does it slow my bot down?

Not measurably. Hooks are O(1), non-blocking, and fail-open (an instrumentation error is counted and swallowed, never raised into your bot). The metrics server runs on the bot's existing loop.

How do I send metrics to Datadog / an OTLP collector?

pip install "argus-dpy[otlp]" and set otlp_endpoint. Argus pushes via OpenTelemetry in addition to the Prometheus endpoint. See OTLP.

Where do I get Grafana dashboards?

docker compose up -d in the repo brings up a provisioned Prometheus + Grafana with three dashboards. Point the dashboard's Grafana section at them with grafana_url.

Is the bundled image safe to deploy?

ghcr.io/astoristhebrave/argus:<version> ships the released SDK + the example bot. Pin a version tag so a mid-development change can never reach your deployment; :latest tracks the newest release. See Releasing.


Troubleshooting

General

The dashboard is blank / "waiting for the first sample". The bot has not logged in yet, or it just started. The first snapshot appears within a few seconds of the bot connecting. If it never appears, check the browser console and that /metrics returns data.

Port already in use / OSError: address already in use. Another process holds the port. Argus is fail-open here too: it logs the failure, sets argus_subsystem_up{subsystem="server"} 0, and your bot keeps running normally - it just serves no metrics until you fix the bind. Change ARGUS_PORT (or port=), or stop the other process. In a clustered single-host deploy give each process a distinct ARGUS_PORT. Alert on argus_subsystem_up == 0 to catch this.
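Outside of Prometheus alerting, the same check can be run ad hoc against scraped text. The sketch below parses exposition-format lines for argus_subsystem_up and lists subsystems reporting 0. subsystems_down is a hypothetical helper, and the regexes assume simple label values without escaped quotes or braces.

```python
import re
from typing import List

# Matches e.g.: argus_subsystem_up{subsystem="server"} 0
_SAMPLE = re.compile(r'^argus_subsystem_up\{([^}]*)\}\s+(\S+)')
_SUBSYSTEM = re.compile(r'subsystem="([^"]*)"')


def subsystems_down(metrics_text: str) -> List[str]:
    """Return the subsystem names whose argus_subsystem_up sample is 0."""
    down: List[str] = []
    for line in metrics_text.splitlines():
        sample = _SAMPLE.match(line.strip())
        if not sample:
            continue  # comments, other metrics, blank lines
        label = _SUBSYSTEM.search(sample.group(1))
        if label and float(sample.group(2)) == 0.0:
            down.append(label.group(1))
    return down
```

Note that when the bind itself failed there is no /metrics endpoint on that process to scrape, so treat a failed scrape as a signal as well.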

Prometheus shows exported_cluster instead of cluster. You set a cluster target label in your scrape config that clashes with Argus's own label; Prometheus renames the conflicting one. Remove the target label. See Clustering.

argus_instrumentation_errors_total is non-zero. Instrumentation is fail-open: an error in a hook was counted and swallowed (your bot was never affected). A rising counter is a signal to investigate (raise the argus logger to DEBUG), not an outage.

discord_cached_users is 0 or wrong. Enable the members intent on the bot (and in the Discord developer portal); the cache is empty without it.

401 on the dashboard. A token is set; open with ?token=... once, or send Authorization: Bearer <token>. /metrics and /healthz stay open.

Behind a reverse proxy, client IPs look wrong. aiohttp does not trust X-Forwarded-For by default. Terminate TLS and set real-IP headers at the proxy; do not expose the bot/control-plane port directly.

Fleet control plane

It refuses to start: "refusing to bind ... without a token". Secure by default: a non-loopback bind needs a token. Set ARGUS_FLEET_TOKEN (or ARGUS_FLEET_TOKEN_FILE, or split ARGUS_FLEET_INGEST_TOKEN + ARGUS_FLEET_VIEWER_TOKEN), bind to 127.0.0.1, or set ARGUS_FLEET_INSECURE=1 for local testing only.

A bot does not appear in the fleet. Check the bot's ARGUS_FLEET_URL is reachable from the bot host and the token matches (the bot uses the ingest token when split tokens are in use). Registration is fail-open and silent by design; raise the argus/argus.fleet logger to DEBUG to see why. From the bot host you can confirm reachability with curl http://fleet-host:9190/healthz (open, no token); doctor is an operator tool and reads the view, so it needs the viewer token: python -m argus.fleet doctor --url <fleet> --token <viewer>.

A cluster shows "down" but it is alive. Its heartbeat is not reaching the control plane, or ARGUS_FLEET_HEARTBEAT_INTERVAL / ARGUS_FLEET_TTL_FACTOR are too tight for its network. A cluster is up while now - last_seen <= interval * ttl_factor.
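The liveness rule above is a single comparison. A minimal sketch, with cluster_is_up as a hypothetical helper and the interval and factor values in the example chosen for illustration rather than taken from Argus defaults:

```python
def cluster_is_up(now: float, last_seen: float, interval: float, ttl_factor: float) -> bool:
    """A cluster is up while now - last_seen <= interval * ttl_factor."""
    return (now - last_seen) <= interval * ttl_factor


# Illustrative values: 10 s heartbeats with a factor of 3 tolerate 30 s of silence.
print(cluster_is_up(now=100.0, last_seen=70.0, interval=10.0, ttl_factor=3.0))
print(cluster_is_up(now=100.5, last_seen=70.0, interval=10.0, ttl_factor=3.0))
```

If a healthy cluster on a slow link keeps flapping, raising either value widens the tolerated gap.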

Per-fleet numbers reset after a restart. The state file was not persisted. Set ARGUS_FLEET_STATE to durable storage (the container image mounts /data; the wizard sets it). Without persistence, numbers restart from 1.

A second control plane will not start ("another argus-fleet process holds..."). By design: two processes must not share one state file. Run one control plane per state file, or give the second its own ARGUS_FLEET_STATE.

The view shows clusters up but all metrics are 0. Likely a namespace mismatch: the members' ARGUS_NAMESPACE must equal the control plane's. Confirm with python -m argus.fleet doctor --url <fleet> --token <viewer> --namespace <expected>. With the push source, per-second rates are also 0 until a second heartbeat arrives; for exact rates use the Prometheus source.
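The zero rate before a second heartbeat follows from how a rate is derived from counters: it needs two samples. The sketch below illustrates that idea only; it is an assumption about the general technique, not the control plane's actual code, and per_second_rate is a hypothetical helper.

```python
from typing import Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def per_second_rate(previous: Optional[Sample], current: Sample) -> float:
    """Derive a per-second rate from two consecutive counter samples."""
    if previous is None:
        return 0.0  # only one heartbeat so far: nothing to difference against
    elapsed = current[0] - previous[0]
    delta = current[1] - previous[1]
    if elapsed <= 0 or delta < 0:
        return 0.0  # clock skew or a counter reset; report 0 rather than guess
    return delta / elapsed
```

A rate computed this way is only as precise as the heartbeat spacing, which is why the Prometheus source is the better choice for exact rates.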

429 Too Many Requests from register/heartbeat. Rate limiting kicked in (per-IP for register, per-identity for heartbeat). Raise ARGUS_FLEET_REGISTER_BURST / ARGUS_FLEET_HEARTBEAT_BURST, or slow the caller.

413 Request Entity Too Large on heartbeat. The snapshot exceeded ARGUS_FLEET_MAX_BODY_BYTES (256 KiB default). Raise it if you have a legitimate reason; otherwise something is sending an oversized body.
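To see how close a payload is to the limit, you can measure its encoded size. The sketch below assumes the heartbeat body is compact JSON, which may not match the real wire format, so treat the result as an estimate. snapshot_size_bytes and fits_limit are hypothetical helpers.

```python
import json

DEFAULT_MAX_BODY_BYTES = 256 * 1024  # 256 KiB, the default stated above


def snapshot_size_bytes(snapshot: dict) -> int:
    """Size of the snapshot when encoded as compact UTF-8 JSON (assumed encoding)."""
    return len(json.dumps(snapshot, separators=(",", ":")).encode("utf-8"))


def fits_limit(snapshot: dict, limit: int = DEFAULT_MAX_BODY_BYTES) -> bool:
    """True if the encoded snapshot is within the body limit."""
    return snapshot_size_bytes(snapshot) <= limit
```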

403 fleet cluster cap reached. You hit ARGUS_FLEET_MAX_CLUSTERS. Raise the cap (and check you are not leaking new identities; reuse a stable CLUSTER_ID / fleet_id).

argus_fleet_identity_conflicts_total is climbing. Two processes are registering under the same identity from different hosts (a duplicate CLUSTER_ID/fleet_id), so the number/health flaps. Give each process a unique identity (e.g. a StatefulSet pod name).

The fleet dashboard 401s but the bots register fine (or vice versa). You are using split tokens: the viewer token gates the UI and /api/*, and the ingest token gates register/heartbeat. Use the right one for each.

python -m argus.fleet ignores my .env. Autoload needs the extra: pip install 'argus-dpy[fleet]' (or point ARGUS_FLEET_ENV_FILE at it). Without it, load the .env via the generated compose env_file: or a systemd EnvironmentFile=.

Do I need CORS for the fleet dashboard? No, unless you serve the SPA from a different origin than the API (e.g. the UI on a CDN). The bundled same-origin UI needs nothing. For a detached UI set ARGUS_FLEET_CORS_ORIGINS to the explicit origin(s). Hosting the bot and the dashboard on different providers does not require CORS.

OpenTelemetry (OTLP)

No data at my backend. Check the collector first (a debug exporter proves the bot reached it). If the collector sees data but the backend does not, the problem is the collector's exporter/credentials, not Argus.

Connection refused / handshake errors. The endpoint must be the gRPC receiver (:4317), reachable from the bot; use https:// only if the collector terminates TLS. See the OTLP tutorial.

ImportError on start. Install the extra: pip install "argus-dpy[otlp]".

Per-guild analytics (ClickHouse)

Analytics section is empty or returns 403. The analytics API fails closed without dashboard_auth_token; set it and open with ?token=.

No rows appear. You need both enable_per_guild=true and a valid clickhouse_dsn; with either missing, the sink is a no-op. Check connectivity to ClickHouse's HTTP port (8123). See the analytics tutorial.

The events table is growing without bound. Add a ClickHouse TTL to argus_events (e.g. drop rows older than 90 days).
