fix(sentry): stop reporting transient redis blips and client disconnects #3018
Pull request overview
This PR reduces Sentry noise by treating expected transient conditions (brief Redis unavailability and ASGI client disconnect cancellations) as non-actionable, while keeping real exceptions reportable. It also refactors the viewer’s display-resolution reporter tick into a helper to make the behavior testable.
Changes:
- Added a Sentry `before_send` hook that drops events whose exception chain includes `redis.exceptions.ConnectionError` or `asyncio.CancelledError`.
- Silenced Celery consumer reconnect “ERROR” log noise in Sentry via `ignore_logger('celery.worker.consumer.consumer')`.
- Refactored the viewer display-resolution reporter into `_publish_display_resolution_once()` and downgraded Redis-connection blips to a WARNING (no traceback), with unit tests covering the behavior.
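The single-tick helper described in the last bullet might look roughly like the sketch below. It is not the PR's code: the function name, the key name, the TTL, and the `TransientRedisError` stand-in (used in place of `redis.exceptions.ConnectionError` so the example needs no redis install) are all assumptions made for illustration.

```python
import logging

logger = logging.getLogger("anthias_viewer.sketch")  # hypothetical logger name


class TransientRedisError(Exception):
    """Stand-in for redis.exceptions.ConnectionError (assumption, keeps the sketch stdlib-only)."""


def publish_display_resolution_once(client, resolution, ttl_seconds=60):
    """Run one reporter tick; return True if the value was written."""
    try:
        # Write with an expiry so a stale value ages out if the viewer dies.
        client.set("display_resolution", resolution, ex=ttl_seconds)
    except TransientRedisError as exc:
        # Expected transient state: WARNING, no traceback, retry on the next tick.
        logger.warning("Redis unavailable, retrying next tick: %s", exc)
        return False
    return True


class RecordingClient:
    """Hypothetical healthy client that records each write."""

    def __init__(self):
        self.calls = []

    def set(self, key, value, ex=None):
        self.calls.append((key, value, ex))


class DownClient:
    """Hypothetical client whose redis is unreachable."""

    def set(self, key, value, ex=None):
        raise TransientRedisError("connection refused")
```

Pulling the tick body out of the loop is what makes it testable: a test can call the helper once with a fake client and inspect the log record, without driving the reporter's timer.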
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/test_viewer.py | Adds regression tests ensuring Redis blips in the viewer resolution reporter log as WARNING (not ERROR) and that TTL writes still occur. |
| tests/test_sentry.py | Adds regression tests for _sentry_before_send dropping transient Redis/CancelledError noise, and confirms the Celery consumer logger is ignored. |
| src/anthias_viewer/__init__.py | Extracts a single-tick helper for publishing display resolution; logs transient Redis connection errors at WARNING without traceback. |
| src/anthias_server/django_project/settings.py | Implements exception-chain walking + before_send filter; ignores Celery reconnect logger for Sentry. |
Good call — added the orchestration layer: redis now has a Kept the Sentry-side handling alongside it, for three reasons:
🤖 Generated with Claude Code
8d3be97 to b95a8c4
Balena compatibility check on the compose changes:
🤖 Generated with Claude Code
- Redis restarting (container recycle, compose startup before DNS resolves) produced an error event per process per blip even though every consumer self-heals: celery reconnects with backoff, the viewer's resolution reporter retries next tick, Channels re-establishes on the next frame (Sentry ANTHIAS-M, ANTHIAS-K, ANTHIAS-H, ANTHIAS-J)
- Add a before_send hook that drops events whose exception chain contains redis.exceptions.ConnectionError or asyncio.CancelledError (an HTTP client hanging up mid-request under ASGI, ANTHIAS-N)
- Silence celery's per-reconnect-attempt ERROR log at the logger (it arrives as a log message, not an exception)
- Downgrade the viewer reporter's redis-down log to a warning and extract the tick body into a testable helper
- Add regression tests for the filter and the reporter tick
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Annotate the hook with sentry_sdk.types Event/Hint for strict mypy
- Build exc_info triples directly in tests instead of catching BaseException (Sonar S5754) and compare events by equality (Sonar S5796)
- Use record.getMessage() in the caplog assertion (Copilot)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
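As an illustration of the test cleanup in this commit, an `exc_info` triple can be assembled by hand rather than by raising and catching. The helper name below is hypothetical and the sketch is not taken from the PR's test suite.

```python
def make_exc_info(exc, cause=None):
    """Build a (type, value, traceback) triple without raising or catching anything."""
    if cause is not None:
        # Sets the same link that `raise exc from cause` would create.
        exc.__cause__ = cause
    # The traceback slot stays None because the exception was never raised.
    return (type(exc), exc, exc.__traceback__)


# Usage: a wrapper exception whose underlying cause is a connection failure.
inner = ConnectionError("connection refused")
exc_info = make_exc_info(RuntimeError("wrapped"), cause=inner)
```

A triple built this way can go straight into the `hint` dict handed to a `before_send` hook, so the test never needs a broad `except BaseException` block.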
…endent Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… caplog tests Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- The embedded beat scheduler logs every broker reconnect attempt at
ERROR ("beat: Connection error ... Trying again"), the same
expected-transient noise as the consumer logger (Sentry ANTHIAS-P)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…in walk Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- depends_on with bare service_started only orders container creation; uvicorn/celery/viewer could still race a redis that hadn't finished loading its RDB, producing the startup connection-refused noise (review feedback on this PR)
- Add a redis-cli ping healthcheck to the prod template, dev, and test composes, and gate anthias-server / anthias-viewer / anthias-celery on service_healthy
- compose-only: the balena supervisor doesn't support depends_on conditions, and a redis container recycling mid-life is gated by nothing, so the Sentry-side handling of transient redis errors stays
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
b95a8c4 to e142df7
- CalVer (YYYY.0M.MICRO); still June 2026, micro 2 -> 3
- Gives Sentry a real release boundary: every build since 2026.6.2 reported the same base version (only the +git-hash differed), so resolved-in-next-release never stuck and fixed issues kept reopening on the next event. A version bump lets the deployed fixes actually clear from the board.
- Ships the crash/noise fixes merged since 2026.6.2: SQLite WAL + busy timeout (#3015), celery migration-gate (#3016) and asset-probe soft limits (#3017), transient-redis/CancelledError Sentry filtering + redis healthcheck (#3018/#3028), GitHub update-check log level (#3019), webview respawn on D-Bus death at setup and mid-play (#3020/#3031), resilient static-file scan (#3026), Wayland-socket wait (#3030), and Sentry release/board triage tags (#3021/#3025)
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>



Issues Fixed
Sentry: ANTHIAS-M / ANTHIAS-H (viewer resolution reporter, redis refused / DNS not yet resolvable), ANTHIAS-K (celery consumer reconnect log), ANTHIAS-J (ASGI redis connection reset), ANTHIAS-N (`CancelledError` on client disconnect), ANTHIAS-P (celery beat reconnect log).
Description
Two layers, per review feedback:
- Orchestration layer: redis gets a `redis-cli ping` healthcheck, and `anthias-server` / `anthias-viewer` / `anthias-celery` are gated on `condition: service_healthy` in the prod template, dev, and test composes. Bare `service_started` only orders container creation, so consumers could race a redis still loading its RDB.
- Sentry-side handling: the balena supervisor doesn't support `depends_on` conditions (fleet devices get no ordering guarantee), and nothing gates a redis container recycling mid-life (OOM, upgrade), so long-running services will always see a window of `ConnectionError` then. Every consumer already self-heals, so those windows are filtered rather than reported.

Redis being briefly unreachable (its container recycling, or compose startup before the `redis` DNS name resolves) is an expected state on a signage device, and every consumer already self-heals: celery's consumer reconnects with backoff, the viewer's reporter retries on its next tick, Channels re-establishes on the next WebSocket frame. The same goes for an HTTP client hanging up mid-request under ASGI: Django/uvicorn cancel the handler by design. Neither is a code bug, but together they produced 6 noise events per redis blip / disconnect.

- A `before_send` hook drops events whose exception chain (walking `__cause__` / `__context__`, since channels-redis/kombu wrap the underlying error) contains `redis.exceptions.ConnectionError` or `asyncio.CancelledError`
- `ignore_logger('celery.worker.consumer.consumer')`: the reconnect retry arrives as an ERROR log line, not an exception

A persistent redis outage still surfaces (the device visibly stops working and the restart loop shows in balena), but a 5-second blip no longer fans out into Sentry issues.
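The chain walk behind the `before_send` filter can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `settings.py`: the function names are hypothetical, and `RedisConnectionError` is a local stand-in for `redis.exceptions.ConnectionError` so the example runs without redis or sentry_sdk installed. It relies only on the documented `before_send(event, hint)` contract, where returning `None` drops the event and `hint["exc_info"]` carries the exception triple when one exists.

```python
import asyncio


class RedisConnectionError(Exception):
    """Stand-in for redis.exceptions.ConnectionError (assumption, keeps the sketch stdlib-only)."""


TRANSIENT_TYPES = (RedisConnectionError, asyncio.CancelledError)


def iter_exception_chain(exc):
    """Yield exc and every exception reachable via __cause__ / __context__."""
    seen = set()
    stack = [exc]
    while stack:
        current = stack.pop()
        # Skip missing links and guard against reference cycles.
        if current is None or id(current) in seen:
            continue
        seen.add(id(current))
        yield current
        stack.extend((current.__cause__, current.__context__))


def before_send(event, hint):
    """Return None to drop the event, or the event unchanged to send it."""
    exc_info = hint.get("exc_info")
    if exc_info is not None:
        _, exc, _ = exc_info
        if any(isinstance(e, TRANSIENT_TYPES) for e in iter_exception_chain(exc)):
            return None
    return event
```

Walking both links matters because a wrapping library may re-raise with `raise ... from err` (sets `__cause__`) or raise a new error inside an `except` block (sets `__context__`); checking only the outermost exception type would miss both cases.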
Checklist
🤖 Generated with Claude Code