Metrics Reference

AstorisTheBrave edited this page Jun 21, 2026 · 7 revisions

Metric names carry the namespace prefix (default: discord); the argus_* internals are never prefixed. Every counter and histogram carries a cluster label, whose value is cluster_id or default. No metric carries guild_id, user_id, or channel_id (invariant 2); per-guild figures live in the analytical path.
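
The naming and label rules above can be sketched in a few lines of Python. This is an illustration only, not Argus's actual code: `metric_name`, `cluster_label`, and `check_labels` are hypothetical helper names, and the sketch assumes the cluster label falls back to the literal string default when no cluster_id is configured.

```python
from typing import Dict, Optional

# Labels that invariant 2 forbids on any metric (unbounded cardinality).
FORBIDDEN_LABELS = frozenset({"guild_id", "user_id", "channel_id"})


def metric_name(base: str, namespace: str = "discord") -> str:
    """Prefix a metric with the namespace; argus_* internals stay as-is."""
    if base.startswith("argus_"):
        return base
    return f"{namespace}_{base}"


def cluster_label(cluster_id: Optional[str]) -> str:
    """Assumed behaviour: use cluster_id when set, else the string 'default'."""
    return cluster_id if cluster_id is not None else "default"


def check_labels(labels: Dict[str, str]) -> Dict[str, str]:
    """Raise if a label set contains a per-entity label."""
    bad = FORBIDDEN_LABELS.intersection(labels)
    if bad:
        raise ValueError(f"forbidden labels: {sorted(bad)}")
    return labels


print(metric_name("guilds"))                      # discord_guilds
print(metric_name("guilds", namespace="mybot"))   # mybot_guilds
print(metric_name("argus_up"))                    # argus_up
```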

State gauges (read live at scrape time, invariant 4)

All gauge reads are O(1) off the discord.py cache; none iterate guilds.

| Metric | Labels | Source |
| --- | --- | --- |
| discord_shard_latency_seconds | shard | bot.latencies (NaN before ready is dropped) |
| discord_shard_up | shard | 1 if not ShardInfo.is_closed() else 0 |
| discord_shards_connected | cluster | count of open shards |
| discord_shards_configured | cluster | bot.shard_count |
| discord_guilds | cluster | len(bot.guilds) |
| discord_cached_users | cluster | len(bot.users) (needs the members intent to be meaningful) |
| discord_voice_clients | cluster | len(bot.voice_clients) |
| discord_emojis | cluster | len(bot.emojis) |
| discord_stickers | cluster | len(bot.stickers) |
| discord_private_channels | cluster | len(bot.private_channels) |
| discord_app_commands_registered | cluster | len(bot.tree.get_commands()) |
| discord_uptime_seconds | cluster | seconds since the collector started |
| discord_bot_info | discord_py_version, argus_version | a Prometheus Info metric (value always 1) |
| argus_up | | 1 while the collector is alive |
| argus_subsystem_up | subsystem | 1 if an Argus subsystem is healthy, else 0. Only configured subsystems are reported: server (metrics server bound), fleet (when fleet_url is set), sink (when per-guild analytics is on). Alert on argus_subsystem_up == 0 to catch Argus degrading while the bot itself is fine (e.g. the metrics port was already in use). |
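
To illustrate what "read live at scrape time" means, here is a minimal sketch using `prometheus_client`. It is not Argus's implementation: `FakeBot` is a hypothetical stand-in for a discord.py client, and the sketch assumes a callback-based gauge (`Gauge.set_function`) is an acceptable way to defer the read until a scrape.

```python
from prometheus_client import CollectorRegistry, Gauge, generate_latest


class FakeBot:
    """Hypothetical stand-in for a discord.py client; only .guilds is used."""

    def __init__(self):
        self.guilds = []


bot = FakeBot()
registry = CollectorRegistry()

guilds = Gauge("discord_guilds", "Guilds in the cache", ["cluster"], registry=registry)
# set_function stores a callback; the value is computed on every scrape,
# so nothing needs to update the gauge when the cache changes.
guilds.labels(cluster="default").set_function(lambda: len(bot.guilds))


def scrape() -> str:
    """Render the registry the way a /metrics endpoint would."""
    return generate_latest(registry).decode()


bot.guilds.extend(["a", "b", "c"])
first = scrape()    # reports 3 guilds
bot.guilds.append("d")
second = scrape()   # reports 4 guilds, with no explicit gauge update
```

Because the callback is a single `len()` on an in-memory list, the read stays O(1) and never iterates guilds.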

Counters (event-driven, invariant 3)

| Metric | Labels | Hook |
| --- | --- | --- |
| discord_interactions_total | type, status, cluster | on_interaction (status=received) |
| discord_app_commands_total | command, status, cluster | app command completion / tree error |
| discord_commands_total | command, status, cluster | prefix command completion / error |
| discord_command_errors_total | command, error_type, cluster | app + prefix command errors |
| discord_gateway_events_total | event, cluster | on_socket_event_type |
| discord_shard_disconnects_total | shard, cluster | on_shard_disconnect |
| discord_shard_reconnects_total | shard, cluster | on_shard_connect/on_shard_resumed |
| discord_log_records_total | logger, level, cluster | handler on the discord logger |
| discord_ratelimits_total | cluster | rate-limit warnings on discord.http |
| argus_instrumentation_errors_total | hook, cluster | fail-open catch (invariant 5) |
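
The fail-open catch behind argus_instrumentation_errors_total can be pictured as a wrapper around each instrumentation hook: an exception inside the hook is counted and swallowed, so it never reaches the bot. The sketch below is an assumption about the general shape, not Argus's code. `fail_open` is a hypothetical decorator, and a plain `collections.Counter` stands in for the real Prometheus counter to keep the example dependency-free.

```python
import asyncio
import functools
from collections import Counter

# Stand-in for argus_instrumentation_errors_total, keyed by (hook, cluster).
instrumentation_errors = Counter()


def fail_open(hook: str, cluster: str = "default"):
    """Wrap an async hook so its failures are counted, never raised."""

    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            try:
                return await func(*args, **kwargs)
            except Exception:
                # Count the failure and carry on; the bot must not notice.
                instrumentation_errors[(hook, cluster)] += 1
                return None

        return wrapper

    return decorator


@fail_open("on_interaction")
async def record_interaction(interaction):
    raise RuntimeError("simulated bug in the metrics code")


@fail_open("on_socket_event_type")
async def record_gateway_event(event_type):
    return event_type


result = asyncio.run(record_interaction(object()))  # no exception escapes
```

A non-zero rate on this counter means metrics are being lost for the named hook even though the bot keeps running.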

Histograms

| Metric | Labels | Buckets (seconds) |
| --- | --- | --- |
| discord_app_command_duration_seconds | command, cluster | 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 |
| discord_command_duration_seconds | command, cluster | 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 |

App-command duration is timed from interaction receipt to completion; prefix (text) command duration is timed from invocation (on_command) to completion. Both use a bounded start-time map (cap 10k in-flight each), falling back to the interaction/message timestamp, and never carry a per-guild label.
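
A bounded start-time map with a timestamp fallback might look like the sketch below. It is written under stated assumptions rather than taken from Argus: `BoundedStartTimes` is a hypothetical class, the eviction policy (drop the oldest entry once the cap is exceeded) is a guess, and wall-clock time is used so that recorded starts are comparable with the interaction/message timestamp.

```python
import time
from collections import OrderedDict
from typing import Hashable, Optional


class BoundedStartTimes:
    """Track in-flight start times without growing past a fixed cap."""

    def __init__(self, cap: int = 10_000):
        self._cap = cap
        self._starts: "OrderedDict[Hashable, float]" = OrderedDict()

    def __len__(self) -> int:
        return len(self._starts)

    def start(self, key: Hashable, now: Optional[float] = None) -> None:
        self._starts[key] = time.time() if now is None else now
        self._starts.move_to_end(key)
        # Assumed policy: evict the oldest entries once over the cap.
        while len(self._starts) > self._cap:
            self._starts.popitem(last=False)

    def finish(self, key: Hashable, fallback_start: float,
               now: Optional[float] = None) -> float:
        """Return the duration, using fallback_start if the key was evicted."""
        end = time.time() if now is None else now
        started = self._starts.pop(key, None)
        if started is None:
            started = fallback_start  # e.g. the interaction/message timestamp
        return max(0.0, end - started)


timings = BoundedStartTimes(cap=2)
timings.start("a", now=100.0)
timings.start("b", now=101.0)
timings.start("c", now=102.0)  # over the cap: "a" is evicted
```

The cap keeps memory flat if completions are never observed (for example, a command that hangs), at the cost of a less precise duration for evicted entries.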

Useful PromQL

```promql
# command error rate
sum(rate(discord_app_commands_total{status="error"}[5m]))
  / sum(rate(discord_app_commands_total[5m]))

# p95 command duration
histogram_quantile(0.95, sum by (le) (rate(discord_app_command_duration_seconds_bucket[5m])))

# average command duration
sum(rate(discord_app_command_duration_seconds_sum[5m]))
  / sum(rate(discord_app_command_duration_seconds_count[5m]))

# per-cluster interaction rate
sum by (cluster) (rate(discord_interactions_total[5m]))

# worst shard latency
max(discord_shard_latency_seconds)
```
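
To make the p95 query easier to interpret, the following Python sketch approximates how `histogram_quantile` estimates a quantile from cumulative bucket counts: it finds the bucket containing the target rank and interpolates linearly inside it. This is a simplified illustration, not Prometheus's source. The function name and the sample counts are made up, and Prometheus applies the same arithmetic to per-second rates rather than raw counts.

```python
import math

# Upper bounds of the duration buckets listed above, plus the implicit +Inf.
BUCKETS = [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, math.inf]


def estimate_quantile(q, cumulative_counts, bounds=BUCKETS):
    """Estimate the q-quantile from cumulative (le-style) bucket counts."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, c in enumerate(cumulative_counts) if c >= rank)
    upper = bounds[index]
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return bounds[-2]
    lower = bounds[index - 1] if index > 0 else 0.0
    below = cumulative_counts[index - 1] if index > 0 else 0
    in_bucket = cumulative_counts[index] - below
    return lower + (upper - lower) * (rank - below) / in_bucket


# Made-up sample: 100 commands, 94 finished within 0.5 s, 97 within 1 s.
counts = [50, 80, 90, 94, 97, 99, 100, 100, 100]
p95 = estimate_quantile(0.95, counts)  # lands inside the (0.5, 1] bucket
```

Because the estimate is interpolated within a bucket, its precision is limited by the bucket boundaries: a reported p95 between 0.5 and 1 only says the true value lies somewhere in that range.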
