Architecture and Invariants

AstorisTheBrave edited this page Jun 20, 2026 · 1 revision

Layers

discord.py bot
   │  events, state
   ▼
core        hooks -> instrumentation -> MetricRegistry (neutral model)
   │  reads from
   ├───────────────┬───────────────────┐
   ▼               ▼                   ▼
adapters/        adapters/           history/      (optional)
prometheus       otlp                EventSink -> ClickHouse
   │
   ▼
exposition (aiohttp): /metrics, /healthz, dashboard, /api/*
  • core (src/argus/core/): collector.py (the neutral MetricRegistry, MetricDef, MetricKind, the MetricBackend protocol), metrics.py (the catalogue + scrape-time gauge callbacks), hooks.py (listener registration + teardown), instrumentation.py (the hook bodies + the fail-open wrapper).
  • adapters (src/argus/adapters/): base.py (Adapter ABC), prometheus.py (custom collector + held Counter/Histogram/Info on one CollectorRegistry), otlp.py (OpenTelemetry push).
  • exposition (src/argus/exposition/server.py): build_app(registry, ...) factory + start_server (AppRunner/TCPSite, no web.run_app).
  • dashboard (src/argus/dashboard/): route registration, auth middleware, the JSON snapshot, the built SPA under static/.
  • history (src/argus/history/): EventSink ABC, NullSink, BatchingSink, ClickHouseSink, and AnalyticsQuery.
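
To illustrate how core can stay independent of any adapter, here is a minimal sketch of a neutral registry that fans writes out through a backend protocol. Only the names MetricRegistry, MetricDef, MetricKind, and MetricBackend come from this page. The method names (attach, define, increment, on_define, on_increment), the RecordingBackend class, and the metric name demo_events_total are assumptions made for illustration and may not match the real collector.py.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Protocol


class MetricKind(Enum):
    COUNTER = "counter"
    GAUGE = "gauge"
    HISTOGRAM = "histogram"


@dataclass(frozen=True)
class MetricDef:
    name: str
    kind: MetricKind
    labels: tuple[str, ...] = ()


class MetricBackend(Protocol):
    """What core expects from an adapter (hypothetical method names)."""

    def on_define(self, definition: MetricDef) -> None: ...

    def on_increment(self, name: str, labels: dict[str, str], amount: float) -> None: ...


class MetricRegistry:
    """Neutral model: stores definitions and forwards writes to attached backends."""

    def __init__(self) -> None:
        self._defs: dict[str, MetricDef] = {}
        self._backends: list[MetricBackend] = []

    def attach(self, backend: MetricBackend) -> None:
        # Replay known definitions so a late-attached backend sees the whole catalogue.
        for definition in self._defs.values():
            backend.on_define(definition)
        self._backends.append(backend)

    def define(self, definition: MetricDef) -> None:
        self._defs[definition.name] = definition
        for backend in self._backends:
            backend.on_define(definition)

    def increment(self, name: str, labels: dict[str, str] | None = None, amount: float = 1.0) -> None:
        definition = self._defs[name]
        labels = labels or {}
        if set(labels) != set(definition.labels):
            raise ValueError(f"labels for {name} must be exactly {definition.labels}")
        for backend in self._backends:
            backend.on_increment(name, labels, amount)


class RecordingBackend:
    """Toy adapter standing in for a real one; it satisfies MetricBackend structurally."""

    def __init__(self) -> None:
        self.defined: list[str] = []
        self.increments: list[tuple[str, dict[str, str], float]] = []

    def on_define(self, definition: MetricDef) -> None:
        self.defined.append(definition.name)

    def on_increment(self, name: str, labels: dict[str, str], amount: float) -> None:
        self.increments.append((name, labels, amount))


registry = MetricRegistry()
registry.define(MetricDef("demo_events_total", MetricKind.COUNTER, ("event",)))
backend = RecordingBackend()
registry.attach(backend)
registry.increment("demo_events_total", {"event": "message"})
```

In this sketch the dependency points one way: the registry knows only the protocol, so an adapter can be added or removed without touching core.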

The seven invariants

These are enforced by structure and tests, not convention.

  1. Collection is decoupled from exposition. core never imports an adapter. Adapters depend on core via the MetricBackend protocol. A test parses the core package AST and fails if it imports prometheus_client, opentelemetry, clickhouse, or argus.adapters.
  2. No unbounded-cardinality label. guild_id/user_id/channel_id are never Prometheus labels. Allowed labels are bounded: shard, cluster, command, event, status, type, error_type, logger, level, hook. A test asserts the forbidden set appears in no metric definition or exposition.
  3. Hooks are O(1) and non-blocking. Hook bodies are synchronous, do only counter increments and label lookups, and never await I/O. An overhead benchmark fires 20k events and asserts a cost below 0.5 ms per event.
  4. Gauges are read at scrape time. Live state (latency, counts, shard state) is read by a custom collector each scrape via the neutral gauge callback. No background poller caches values.
  5. The SDK fails open. Every hook runs inside _safe; an exception is counted (argus_instrumentation_errors_total), logged, and swallowed. It is never re-raised into the bot loop.
  6. One config object. Everything reads a single resolved ArgusConfig. No module reads env vars or defaults independently.
  7. Operational and analytical paths are separate. enable_per_guild only affects the history sink; it can never add a guild_id label to Prometheus. The analytics API fails closed without a token.
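
The fail-open behaviour in invariant 5 can be sketched as a small decorator. This is an illustration under stated assumptions, not the actual instrumentation.py: the in-memory Counter stands in for argus_instrumentation_errors_total, the logger name argus.sketch is hypothetical, and keying errors by hook and error_type is a guess based on the allowed label set listed above.

```python
import functools
import logging
from collections import Counter

log = logging.getLogger("argus.sketch")  # hypothetical logger name

# Stand-in for argus_instrumentation_errors_total, keyed by (hook, error_type).
instrumentation_errors = Counter()


def _safe(hook):
    """Wrap a synchronous hook body so that failures are counted, logged, and swallowed."""

    @functools.wraps(hook)
    def wrapper(*args, **kwargs):
        try:
            hook(*args, **kwargs)
        except Exception as exc:  # deliberately broad: the wrapper must fail open
            instrumentation_errors[(hook.__name__, type(exc).__name__)] += 1
            log.warning("instrumentation hook %s failed", hook.__name__, exc_info=True)
        # Nothing is re-raised, so the caller (the bot loop) never sees the error.
        return None

    return wrapper


events = Counter()


@_safe
def count_event(name):
    # A well-behaved hook body: one counter increment, no I/O.
    events[name] += 1


@_safe
def broken_hook(name):
    # A buggy hook body: the exception is absorbed by _safe.
    raise KeyError(name)


count_event("message")
broken_hook("message")
```

After these two calls the event counter holds one increment and the error counter holds one entry for broken_hook, while no exception reaches the caller.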

Attachment lifecycle

Argus(bot) constructs an ArgusCog. In doing so it builds the registry, defines the catalogue, attaches the Prometheus adapter (and the OTLP adapter if otlp_endpoint is set), builds the sink, registers listeners synchronously, and sets the info metric. It then chains bot.setup_hook so that the hook awaits bot.add_cog(cog). The cog's cog_load starts the aiohttp server; cog_unload removes the listeners, cleans up the runner, and closes the sink and any analytics client.
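
The setup_hook chaining described above might look roughly like the following. To keep the sketch runnable without discord.py, FakeBot is a hypothetical stand-in exposing only setup_hook and add_cog, and chain_setup_hook is an illustrative helper name. Running the original hook before add_cog is also an assumption; this page does not state the order.

```python
import asyncio


class FakeBot:
    """Hypothetical stand-in for a discord.py bot; only the members used here."""

    def __init__(self):
        self.cogs = []
        self.calls = []

    async def setup_hook(self):
        # Represents whatever setup the bot author already defined.
        self.calls.append("original setup_hook")

    async def add_cog(self, cog):
        self.cogs.append(cog)
        self.calls.append("add_cog")


def chain_setup_hook(bot, cog):
    """Replace bot.setup_hook with a wrapper that keeps the original and adds the cog."""
    original = bot.setup_hook  # bound method captured before it is replaced

    async def chained():
        await original()        # assumption: the original hook runs first
        await bot.add_cog(cog)  # then the cog is added, which would trigger cog_load

    bot.setup_hook = chained


bot = FakeBot()
chain_setup_hook(bot, cog="argus-cog")
asyncio.run(bot.setup_hook())
```

Because the original hook is captured and awaited rather than overwritten, any setup the bot author wrote still runs.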

App command errors have no bot event, so CommandTree.on_error is chained: the original handler is preserved and still runs. Rate-limit and log signals come from a logging.Handler attached to the discord logger hierarchy, never from monkeypatching the HTTP client.
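
As a rough sketch of the logging-based approach, a handler attached to the "discord" logger receives records from every child logger through normal propagation. LogSignalHandler and attach are hypothetical names, and recognising rate-limit warnings by the substring "rate limited" is an assumption about the log text, not a documented contract.

```python
import logging
from collections import Counter


class LogSignalHandler(logging.Handler):
    """Counts records by (logger, level) and flags likely rate-limit warnings."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.records = Counter()
        self.rate_limits = 0

    def emit(self, record):
        try:
            self.records[(record.name, record.levelname)] += 1
            # Assumption: rate-limit warnings can be recognised by this substring.
            if "rate limited" in record.getMessage().lower():
                self.rate_limits += 1
        except Exception:
            # Standard logging convention: never let a handler break the caller.
            self.handleError(record)


def attach(handler):
    """Attach to the top of the discord logger hierarchy; return a teardown callable."""
    root = logging.getLogger("discord")
    root.addHandler(handler)

    def detach():
        root.removeHandler(handler)

    return detach


handler = LogSignalHandler()
detach = attach(handler)

# Simulated library log line; the wording is illustrative only.
logging.getLogger("discord.http").warning(
    "We are being rate limited. Retrying in %.2f seconds", 1.5
)
detach()
```

The HTTP client is never touched: the handler only observes what the library already logs, and detach restores the logger on teardown.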