Metrics Reference
AstorisTheBrave edited this page Jun 21, 2026
·
7 revisions
Names use the namespace prefix (default `discord`); the `argus_*` internals are
never prefixed. Every counter and histogram carries a `cluster` label (its
value is `cluster_id` or `default`). No metric carries
`guild_id`/`user_id`/`channel_id` (invariant 2); per-guild figures live in the
analytical path.
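The naming and label rules above can be sketched as a small check. This is an illustration only: `metric_name`, `check_labels`, and `FORBIDDEN_LABELS` are hypothetical names, not part of Argus, and the real enforcement may look different.

```python
# Hypothetical helpers illustrating the naming and label rules; not Argus API.
FORBIDDEN_LABELS = frozenset({"guild_id", "user_id", "channel_id"})


def metric_name(base: str, namespace: str = "discord") -> str:
    """Prefix a metric with the namespace; argus_* internals are left as-is."""
    if base.startswith("argus_"):
        return base
    return f"{namespace}_{base}"


def check_labels(labels: dict) -> dict:
    """Reject per-guild/user/channel identifiers before they become labels."""
    bad = FORBIDDEN_LABELS.intersection(labels)
    if bad:
        raise ValueError(f"forbidden labels: {sorted(bad)}")
    return labels
```

With these assumptions, `metric_name("guilds")` gives `discord_guilds`, while `metric_name("argus_up")` is returned unchanged.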
All gauge reads are O(1) off the discord.py cache; none iterate guilds.
| Metric | Labels | Source |
|---|---|---|
| `discord_shard_latency_seconds` | `shard` | `bot.latencies` (NaN before ready is dropped) |
| `discord_shard_up` | `shard` | `1 if not ShardInfo.is_closed() else 0` |
| `discord_shards_connected` | `cluster` | count of open shards |
| `discord_shards_configured` | `cluster` | `bot.shard_count` |
| `discord_guilds` | `cluster` | `len(bot.guilds)` |
| `discord_cached_users` | `cluster` | `len(bot.users)` (needs the members intent to be meaningful) |
| `discord_voice_clients` | `cluster` | `len(bot.voice_clients)` |
| `discord_emojis` | `cluster` | `len(bot.emojis)` |
| `discord_stickers` | `cluster` | `len(bot.stickers)` |
| `discord_private_channels` | `cluster` | `len(bot.private_channels)` |
| `discord_app_commands_registered` | `cluster` | `len(bot.tree.get_commands())` |
| `discord_uptime_seconds` | `cluster` | seconds since the collector started |
| `discord_bot_info` | `discord_py_version`, `argus_version` | a Prometheus Info metric (value always 1) |
| `argus_up` | (none) | 1 while the collector is alive |
| `argus_subsystem_up` | `subsystem` | 1 if an Argus subsystem is healthy, else 0. Only configured subsystems are reported: `server` (metrics server bound), `fleet` (when `fleet_url` is set), `sink` (when per-guild analytics is on; goes 0 when the sink circuit breaker opens after repeated ClickHouse flush failures, and recovers on the next success). Alert on `argus_subsystem_up == 0` to catch Argus degrading while the bot itself is fine (e.g. the metrics port was already in use, or ClickHouse is down). |
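As a rough sketch of how the two per-shard gauges could be derived, the function below takes the `(shard_id, latency)` pairs that discord.py exposes as `bot.latencies` and a mapping of shard IDs to `ShardInfo`-like objects (as in `bot.shards`). The function name `shard_samples` is hypothetical, and the real collector may be structured differently.

```python
import math


def shard_samples(latencies, shards):
    """Build per-shard gauge values without touching guild data.

    latencies: iterable of (shard_id, latency_seconds), as in bot.latencies.
    shards: mapping of shard_id -> object with is_closed(), as in bot.shards.
    """
    # Latency is NaN until a shard is ready; those samples are dropped.
    latency = {sid: value for sid, value in latencies if not math.isnan(value)}
    # Mirrors the table: 1 if not ShardInfo.is_closed() else 0.
    up = {sid: 0 if info.is_closed() else 1 for sid, info in shards.items()}
    return latency, up
```

Under the same assumptions, `sum(up.values())` would correspond to the count of open shards reported by `discord_shards_connected`.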
| Metric | Labels | Hook |
|---|---|---|
| `discord_interactions_total` | `type`, `status`, `cluster` | `on_interaction` (`status=received`) |
| `discord_app_commands_total` | `command`, `status`, `cluster` | app command completion / tree error |
| `discord_commands_total` | `command`, `status`, `cluster` | prefix command completion / error |
| `discord_command_errors_total` | `command`, `error_type`, `cluster` | app + prefix command errors |
| `discord_gateway_events_total` | `event`, `cluster` | `on_socket_event_type` |
| `discord_shard_disconnects_total` | `shard`, `cluster` | `on_shard_disconnect` |
| `discord_shard_reconnects_total` | `shard`, `cluster` | `on_shard_connect`/`on_shard_resumed` |
| `discord_log_records_total` | `logger`, `level`, `cluster` | handler on the `discord` logger |
| `discord_ratelimits_total` | `cluster` | rate-limit warnings on `discord.http` |
| `argus_instrumentation_errors_total` | `hook`, `cluster` | fail-open catch (invariant 5) |
| `argus_history_events_dropped_total` | `cluster` | per-guild analytical events dropped on sink-queue overflow (backpressure signal; only moves when `enable_per_guild` is on) |
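The fail-open catch behind `argus_instrumentation_errors_total` could look roughly like the decorator below. This is a sketch under assumptions: `fail_open` and the `instrumentation_errors` counter are hypothetical stand-ins, and the wrapper is synchronous for brevity even though discord.py event hooks are coroutines.

```python
import functools
from collections import Counter

# Stand-in for argus_instrumentation_errors_total, keyed by (hook, cluster).
instrumentation_errors = Counter()


def fail_open(hook, cluster="default"):
    """Wrap an instrumentation hook so its failures never reach the bot."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Count the failure and swallow it: metrics must not break the bot.
                instrumentation_errors[(hook, cluster)] += 1
                return None
        return wrapper
    return decorator
```

A rising count for one `hook` label then points at the instrumentation path that is failing, while the bot keeps handling events.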
| Metric | Labels | Buckets (seconds) |
|---|---|---|
| `discord_app_command_duration_seconds` | `command`, `cluster` | 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 |
| `discord_command_duration_seconds` | `command`, `cluster` | 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 |
App-command duration is timed from interaction receipt to completion; prefix
(text) command duration is timed from invocation (on_command) to completion.
Both use a bounded start-time map (cap 10k in-flight each), falling back to the
interaction/message timestamp, and never carry a per-guild label.
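The bounded start-time map described above might be sketched as follows. `StartTimes` is a hypothetical class, and evicting the oldest entry once the cap is reached is an assumption; the page only states that the map is capped at 10k in-flight entries and that the interaction/message timestamp is the fallback.

```python
import time
from collections import OrderedDict


class StartTimes:
    """Bounded map of in-flight command start times (wall-clock seconds)."""

    def __init__(self, cap=10_000):
        self._cap = cap
        self._starts = OrderedDict()

    def __len__(self):
        return len(self._starts)

    def start(self, key, now=None):
        self._starts[key] = time.time() if now is None else now
        # Assumed policy: evict the oldest entries once over the cap.
        while len(self._starts) > self._cap:
            self._starts.popitem(last=False)

    def elapsed(self, key, fallback_start, now=None):
        """Duration since start, or since fallback_start if the key was evicted."""
        now = time.time() if now is None else now
        started = self._starts.pop(key, fallback_start)
        return max(0.0, now - started)
```

Here `fallback_start` stands for the interaction or message creation time in epoch seconds, so a command whose entry was evicted still yields a duration rather than being lost.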
```promql
# command error rate
sum(rate(discord_app_commands_total{status="error"}[5m]))
  / sum(rate(discord_app_commands_total[5m]))

# p95 command duration
histogram_quantile(0.95, sum by (le) (rate(discord_app_command_duration_seconds_bucket[5m])))

# average command duration
sum(rate(discord_app_command_duration_seconds_sum[5m]))
  / sum(rate(discord_app_command_duration_seconds_count[5m]))

# per-cluster interaction rate
sum by (cluster) (rate(discord_interactions_total[5m]))

# worst shard latency
max(discord_shard_latency_seconds)
```
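To make the p95 query easier to interpret, here is a simplified Python sketch of the linear interpolation that `histogram_quantile` performs over cumulative buckets. It is an approximation for illustration, not Prometheus's implementation, and the bucket counts in the usage note are made up.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    with at least two entries and the last bound being +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == lower_count:
                return lower_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper - lower_bound) * fraction
        lower_bound, lower_count = upper, count
    return buckets[-2][0]
```

For example, with hypothetical cumulative counts of 10, 50, 90, 98, and 100 for the 0.05, 0.1, 0.25, 0.5, and 1 second buckets (and 100 for every larger bucket), the p95 rank falls in the 0.25 to 0.5 bucket and the estimate is about 0.41 seconds. The result can never be more precise than the bucket boundaries allow.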