Metrics Reference

Metric names use the configured namespace prefix (default `discord`); the `argus_*` internals are never prefixed. Every counter and both histograms carry a `cluster` label, whose value is the `cluster_id`, or `default`. No metric carries `guild_id`, `user_id`, or `channel_id` (invariant 2); per-guild figures live in the analytical path.
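
The naming and label rules above can be sketched in a few lines. This is an illustration only: `metric_name`, `check_labels`, and `cluster_label` are hypothetical helpers, not Argus's actual API, and the reading that an unset `cluster_id` yields the literal `default` is an assumption.

```python
from typing import Dict, Optional

# Labels ruled out by invariant 2 (per-entity, unbounded cardinality).
FORBIDDEN_LABELS = {"guild_id", "user_id", "channel_id"}


def metric_name(base: str, namespace: str = "discord") -> str:
    """Prefix a metric with the namespace, leaving argus_* internals untouched."""
    if base.startswith("argus_"):
        return base
    return f"{namespace}_{base}"


def check_labels(labels: Dict[str, str]) -> Dict[str, str]:
    """Reject label sets that include a forbidden per-entity label."""
    bad = FORBIDDEN_LABELS.intersection(labels)
    if bad:
        raise ValueError(f"high-cardinality labels not allowed: {sorted(bad)}")
    return labels


def cluster_label(cluster_id: Optional[str]) -> str:
    """Assumed rule: use cluster_id when set, else the literal 'default'."""
    return cluster_id if cluster_id else "default"
```

With these helpers, `metric_name("guilds")` gives `discord_guilds`, while `metric_name("argus_up")` is returned unchanged.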

State gauges (read live at scrape time, invariant 4)

All gauge reads are O(1) off the discord.py cache; none iterate guilds.

| Metric | Labels | Source |
| --- | --- | --- |
| `discord_shard_latency_seconds` | `shard` | `bot.latencies` (NaN before ready is dropped) |
| `discord_shard_up` | `shard` | `1` if `not ShardInfo.is_closed()` else `0` |
| `discord_shards_connected` | `cluster` | count of open shards |
| `discord_shards_configured` | `cluster` | `bot.shard_count` |
| `discord_guilds` | `cluster` | `len(bot.guilds)` |
| `discord_cached_users` | `cluster` | `len(bot.users)` (needs the members intent to be meaningful) |
| `discord_voice_clients` | `cluster` | `len(bot.voice_clients)` |
| `discord_emojis` | `cluster` | `len(bot.emojis)` |
| `discord_stickers` | `cluster` | `len(bot.stickers)` |
| `discord_private_channels` | `cluster` | `len(bot.private_channels)` |
| `discord_app_commands_registered` | `cluster` | `len(bot.tree.get_commands())` |
| `discord_uptime_seconds` | `cluster` | seconds since the collector started |
| `discord_bot_info` | `discord_py_version`, `argus_version` | a prometheus `Info` (value always 1) |
| `argus_up` | (none) | `1` while the collector is alive |
| `argus_subsystem_up` | `subsystem` | `1` if an Argus subsystem is healthy, else `0`. Only configured subsystems are reported: `server` (metrics server bound), `fleet` (when `fleet_url` is set), `sink` (when per-guild analytics is on; goes `0` when the sink circuit breaker opens after repeated ClickHouse flush failures, and recovers on the next success). Alert on `argus_subsystem_up == 0` to catch Argus degrading while the bot itself is fine (e.g. the metrics port was already in use, or ClickHouse is down). |
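
As a rough illustration of reading gauges live at scrape time, the sketch below builds the shard samples from current state on each call instead of keeping counters up to date. It is not Argus's collector: `FakeShard` stands in for discord.py's `ShardInfo`, `shard_samples` is a hypothetical function, and the `(name, labels, value)` tuple is an assumed sample shape.

```python
import math
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[str, Dict[str, str], float]  # (metric name, labels, value)


class FakeShard:
    """Stand-in for discord.py's ShardInfo, used only in this sketch."""

    def __init__(self, closed: bool) -> None:
        self._closed = closed

    def is_closed(self) -> bool:
        return self._closed


def shard_samples(
    latencies: Iterable[Tuple[int, float]],
    shards: Dict[int, FakeShard],
    cluster: str = "default",
) -> List[Sample]:
    """Build shard gauge samples from live state; meant to run once per scrape."""
    out: List[Sample] = []
    for shard_id, latency in latencies:
        if math.isnan(latency):  # shard not ready yet: drop rather than export NaN
            continue
        out.append(("discord_shard_latency_seconds", {"shard": str(shard_id)}, latency))
    open_count = 0
    for shard_id, shard in shards.items():
        up = 0.0 if shard.is_closed() else 1.0
        open_count += int(up)
        out.append(("discord_shard_up", {"shard": str(shard_id)}, up))
    out.append(("discord_shards_connected", {"cluster": cluster}, float(open_count)))
    return out
```

The `latencies` argument mirrors the `(shard_id, latency)` pairs that `bot.latencies` returns on a sharded client. The work is proportional to the number of shards and never touches the guild list.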

Counters (event-driven, invariant 3)

| Metric | Labels | Hook |
| --- | --- | --- |
| `discord_interactions_total` | `type`, `status`, `cluster` | `on_interaction` (`status=received`) |
| `discord_app_commands_total` | `command`, `status`, `cluster` | app command completion / tree error |
| `discord_commands_total` | `command`, `status`, `cluster` | prefix command completion / error |
| `discord_command_errors_total` | `command`, `error_type`, `cluster` | app + prefix command errors |
| `discord_gateway_events_total` | `event`, `cluster` | `on_socket_event_type` |
| `discord_shard_disconnects_total` | `shard`, `cluster` | `on_shard_disconnect` |
| `discord_shard_reconnects_total` | `shard`, `cluster` | `on_shard_connect`/`on_shard_resumed` |
| `discord_log_records_total` | `logger`, `level`, `cluster` | handler on the `discord` logger |
| `discord_ratelimits_total` | `cluster` | rate-limit warnings on `discord.http` |
| `argus_instrumentation_errors_total` | `hook`, `cluster` | fail-open catch (invariant 5) |
| `argus_history_events_dropped_total` | `cluster` | per-guild analytical events dropped on sink-queue overflow (backpressure signal; only moves when `enable_per_guild` is on) |
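
The fail-open catch behind `argus_instrumentation_errors_total` might look roughly like the sketch below: an instrumentation hook that raises is counted and swallowed so the bot's own handling is unaffected. The `fail_open` decorator, the `Counter` stand-in, and `record_interaction` are all hypothetical, and the hooks here are synchronous for brevity, whereas discord.py event handlers are coroutines.

```python
import functools
from collections import Counter

# Stand-in for argus_instrumentation_errors_total, keyed by hook name.
instrumentation_errors = Counter()


def fail_open(hook_name):
    """Wrap an instrumentation hook so its failures never reach the bot."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Count the failure and carry on instead of propagating it.
                instrumentation_errors[hook_name] += 1
                return None
        return wrapper
    return decorator


@fail_open("on_interaction")
def record_interaction(interaction_type):
    """Toy hook that fails on bad input, to exercise the catch."""
    if interaction_type is None:
        raise ValueError("missing interaction type")
    return interaction_type
```

A steadily rising error count for one `hook` value would point at a bug in that hook rather than in the bot.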

Histograms

| Metric | Labels | Buckets (seconds) |
| --- | --- | --- |
| `discord_app_command_duration_seconds` | `command`, `cluster` | 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 |
| `discord_command_duration_seconds` | `command`, `cluster` | 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 |
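
To make the bucket boundaries concrete, the sketch below shows how observations map onto them. Prometheus buckets are cumulative and inclusive on the upper bound (`le`), with an implicit `+Inf` bucket on top. `first_bucket` and `cumulative_counts` are hypothetical helper names used only for illustration.

```python
import bisect
import math

BUCKETS = (0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10)  # upper bounds in seconds


def first_bucket(duration_seconds):
    """Return the smallest upper bound (le) that contains the observation."""
    idx = bisect.bisect_left(BUCKETS, duration_seconds)
    return BUCKETS[idx] if idx < len(BUCKETS) else math.inf


def cumulative_counts(durations):
    """Count observations per le bound, cumulatively, including +Inf."""
    bounds = BUCKETS + (math.inf,)
    return {le: sum(1 for d in durations if d <= le) for le in bounds}
```

A 0.3 s command first lands in the `le="0.5"` bucket and is also counted in every larger one; anything slower than 10 s is counted only in `+Inf`.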

App-command duration is timed from interaction receipt to completion; prefix (text) command duration is timed from invocation (`on_command`) to completion. Each uses its own bounded start-time map (capped at 10k in-flight entries) and falls back to the interaction or message timestamp; neither carries a per-guild label.
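
A bounded start-time map of this kind could be sketched as follows. The `StartTimes` class is hypothetical, and two details are assumptions not stated above: the oldest entry is evicted once the cap is reached, and the caller passes in the fallback duration (computed from the interaction or message timestamp).

```python
import time
from collections import OrderedDict

MAX_IN_FLIGHT = 10_000  # cap on in-flight entries, per the text above


class StartTimes:
    """Bounded map of command key -> monotonic start time (hypothetical)."""

    def __init__(self, cap=MAX_IN_FLIGHT):
        self._cap = cap
        self._starts = OrderedDict()

    def start(self, key, now=None):
        """Record a start time, evicting the oldest entry when full (assumed policy)."""
        if key not in self._starts and len(self._starts) >= self._cap:
            self._starts.popitem(last=False)
        self._starts[key] = time.monotonic() if now is None else now

    def finish(self, key, fallback_elapsed, now=None):
        """Return elapsed seconds, or the fallback if no start time is stored."""
        started = self._starts.pop(key, None)
        if started is None:
            return fallback_elapsed
        return (time.monotonic() if now is None else now) - started
```

The cap keeps memory bounded even if completions never arrive, for example when a command handler hangs or its completion event is missed.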

Useful PromQL

```promql
# command error rate
sum(rate(discord_app_commands_total{status="error"}[5m]))
  / sum(rate(discord_app_commands_total[5m]))

# p95 command duration
histogram_quantile(0.95, sum by (le) (rate(discord_app_command_duration_seconds_bucket[5m])))

# average command duration
sum(rate(discord_app_command_duration_seconds_sum[5m]))
  / sum(rate(discord_app_command_duration_seconds_count[5m]))

# per-cluster interaction rate
sum by (cluster) (rate(discord_interactions_total[5m]))

# worst shard latency
max(discord_shard_latency_seconds)
```
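
For intuition about what the p95 query returns, here is a simplified Python sketch of the estimate `histogram_quantile` makes: find the bucket that contains the target rank, then interpolate linearly inside it. It mirrors the documented behaviour only in outline and skips edge cases the real function handles; the bucket counts in the usage note are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (le, count) pairs.

    `buckets` must be sorted by le and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le  # rank falls in +Inf: report the highest finite bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le
```

With cumulative counts of 10, 50, 90, and 100 at `le` 0.05, 0.1, 0.25, and 0.5, the 0.95 quantile falls halfway through the 0.25 to 0.5 bucket, so the estimate is 0.375 s. The result is only as precise as the bucket layout allows.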
