fix(sentry): stop reporting transient redis blips and client disconnects by vpetersson · Pull Request #3018 · Screenly/Anthias

vpetersson · 2026-06-07T11:16:43Z

Issues Fixed

Sentry: ANTHIAS-M / ANTHIAS-H (viewer resolution reporter, redis refused / DNS not yet resolvable), ANTHIAS-K (celery consumer reconnect log), ANTHIAS-J (ASGI redis connection reset), ANTHIAS-N (CancelledError on client disconnect), ANTHIAS-P (celery beat reconnect log).

Description

Two layers, per review feedback:

- Orchestration (root cause for startup ordering): redis now has a redis-cli ping healthcheck, and anthias-server / anthias-viewer / anthias-celery are gated on condition: service_healthy in the prod template, dev, and test composes. Bare service_started only orders container creation, so consumers could race a redis still loading its RDB.
- Sentry-side handling for what compose can't gate: the balena supervisor doesn't support depends_on conditions (fleet devices get no ordering guarantee), and nothing gates a redis container recycling mid-life (OOM, upgrade); long-running services will always see a window of ConnectionError when that happens. Every consumer already self-heals, so those windows are filtered rather than reported.

Redis being briefly unreachable (its container recycling, or compose startup before the redis DNS name resolves) is an expected state on a signage device, and every consumer already self-heals — celery's consumer reconnects with backoff, the viewer's reporter retries on its next tick, Channels re-establishes on the next WebSocket frame. Same for an HTTP client hanging up mid-request under ASGI: Django/uvicorn cancel the handler by design. Neither is a code bug, but together they produced 6 noise events per redis blip / disconnect.

- A before_send hook drops events whose exception chain (walking __cause__/__context__, since channels-redis/kombu wrap the underlying error) contains redis.exceptions.ConnectionError or asyncio.CancelledError.
- ignore_logger('celery.worker.consumer.consumer'): the reconnect retry arrives as an ERROR log line, not an exception.
- The viewer's resolution reporter now logs a redis blip at WARNING (retry next tick) instead of ERROR-with-traceback; the tick body is extracted into a testable helper.

A persistent redis outage still surfaces — the device visibly stops working and the restart loop shows in balena — but a 5-second blip no longer fans out into Sentry issues.
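The chain-walking filter described above can be sketched roughly as follows. This is a minimal illustration, not the code in settings.py: `RedisConnectionError` is a stand-in for redis.exceptions.ConnectionError so the sketch has no third-party dependency, and the names `iter_chain` and `before_send` are hypothetical. It relies only on the documented Sentry contract that a before_send callback receives `(event, hint)` and drops the event by returning `None`.

```python
import asyncio
from typing import Iterator, Optional


class RedisConnectionError(Exception):
    """Stand-in for redis.exceptions.ConnectionError (assumption for this sketch)."""


# Exception types treated as expected, self-healing noise.
TRANSIENT = (RedisConnectionError, asyncio.CancelledError)


def iter_chain(exc: Optional[BaseException]) -> Iterator[BaseException]:
    """Yield exc plus everything reachable through __cause__ / __context__."""
    seen = set()
    stack = [exc]
    while stack:
        current = stack.pop()
        if current is None or id(current) in seen:
            continue  # skip missing links and guard against cycles
        seen.add(id(current))
        yield current
        stack.extend((current.__cause__, current.__context__))


def before_send(event: dict, hint: dict) -> Optional[dict]:
    """Return None (drop) when the chain contains a transient error."""
    exc_info = hint.get("exc_info")
    if exc_info and any(isinstance(e, TRANSIENT) for e in iter_chain(exc_info[1])):
        return None
    return event


# A wrapper exception whose underlying cause is the redis error is dropped.
wrapped = RuntimeError("channel layer send failed")
wrapped.__cause__ = RedisConnectionError("connection refused")
dropped = before_send({"message": "x"}, {"exc_info": (RuntimeError, wrapped, None)})
```

Walking both `__cause__` and `__context__` matters because a library that re-raises inside an `except` block without `from` only sets `__context__`.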

Checklist

I have performed a self-review of my own code.
New and existing unit tests pass locally and on CI with my changes.
I have done an end-to-end test for Raspberry Pi devices.
I have tested my changes for x86 devices.
I added documentation for the changes I have made (when necessary).

🤖 Generated with Claude Code

Copilot

Pull request overview

This PR reduces Sentry noise by treating expected transient conditions (brief Redis unavailability and ASGI client disconnect cancellations) as non-actionable, while keeping real exceptions reportable. It also refactors the viewer’s display-resolution reporter tick into a helper to make the behavior testable.

Changes:

- Added a Sentry before_send hook that drops events whose exception chain includes redis.exceptions.ConnectionError or asyncio.CancelledError.
- Silenced Celery consumer reconnect “ERROR” log noise in Sentry via ignore_logger('celery.worker.consumer.consumer').
- Refactored the viewer display-resolution reporter into _publish_display_resolution_once() and downgraded Redis-connection blips to a WARNING (no traceback), with unit tests covering the behavior.
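The single-tick reporter pattern might look roughly like the sketch below. It is an assumption-laden illustration rather than the PR's _publish_display_resolution_once(): the function signature, the key name `display:resolution`, the TTL default, and the `RedisConnectionError` stand-in are all hypothetical. The only client call used is `set(name, value, ex=...)`, which matches redis-py's signature.

```python
import logging

logger = logging.getLogger("viewer.resolution")


class RedisConnectionError(Exception):
    """Stand-in for redis.exceptions.ConnectionError (assumption for this sketch)."""


def publish_display_resolution_once(client, resolution: str, ttl_seconds: int = 60) -> bool:
    """Run one reporter tick; return True if the value was written."""
    try:
        # Hypothetical key name; the TTL lets a stale value expire on its own.
        client.set("display:resolution", resolution, ex=ttl_seconds)
    except RedisConnectionError as exc:
        # Expected transient state: log at WARNING without a traceback
        # and let the caller's loop retry on the next tick.
        logger.warning("redis unavailable, retrying next tick: %s", exc)
        return False
    return True
```

Pulling the tick body out of the loop means a test can call it once with a fake client and assert on the log level, instead of driving a long-running thread.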

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/test_viewer.py | Adds regression tests ensuring Redis blips in the viewer resolution reporter log as WARNING (not ERROR) and that TTL writes still occur. |
| tests/test_sentry.py | Adds regression tests for `_sentry_before_send` dropping transient Redis/CancelledError noise, and confirms the Celery consumer logger is ignored. |
| src/anthias_viewer/__init__.py | Extracts a single-tick helper for publishing display resolution; logs transient Redis connection errors at WARNING without traceback. |
| src/anthias_server/django_project/settings.py | Implements exception-chain walking + `before_send` filter; ignores Celery reconnect logger for Sentry. |


Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

vpetersson · 2026-06-07T12:43:42Z

Good call — added the orchestration layer: redis now has a redis-cli ping healthcheck and server/viewer/celery are gated on condition: service_healthy (prod template + dev + test composes).

Kept the Sentry-side handling alongside it, for three reasons:

- balena: the supervisor doesn't support depends_on conditions, so fleet devices get no startup ordering from compose; their first-boot reconnects would keep generating events.
- Mid-life restarts: depends_on only gates first start. A redis container recycling later (OOM, upgrade roll) still throws ConnectionError into the long-running services, and each of them already self-heals.
- CancelledError (ANTHIAS-N) is client-disconnect noise unrelated to redis ordering.

🤖 Generated with Claude Code

vpetersson · 2026-06-07T12:46:29Z

Balena compatibility check on the compose changes:

- The balena deploy paths (bin/balena_ota_deploy.sh, bin/deploy_to_balena.sh) render docker-compose.balena.yml.tmpl / docker-compose.balena.dev.yml.tmpl, which this PR leaves untouched, so fleet devices never see the new syntax.
- The condition: service_healthy gating is only in docker-compose.yml.tmpl (plain compose installs via upgrade_containers.sh), docker-compose.dev.yml, and docker-compose.test.yml.
- Confirmed against the balena supervisor compose reference: depends_on supports "only array form and service_started condition", which is exactly why the balena templates stay on the supervisor's defaults and the Sentry-side filter carries the fleet.

🤖 Generated with Claude Code

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

- Redis restarting (container recycle, compose startup before DNS resolves) produced an error event per process per blip even though every consumer self-heals: celery reconnects with backoff, the viewer's resolution reporter retries next tick, Channels re-establishes on the next frame (Sentry ANTHIAS-M, ANTHIAS-K, ANTHIAS-H, ANTHIAS-J)
- Add a before_send hook that drops events whose exception chain contains redis.exceptions.ConnectionError or asyncio.CancelledError (an HTTP client hanging up mid-request under ASGI; ANTHIAS-N)
- Silence celery's per-reconnect-attempt ERROR log at the logger (it arrives as a log message, not an exception)
- Downgrade the viewer reporter's redis-down log to a warning and extract the tick body into a testable helper
- Add regression tests for the filter and the reporter tick

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Annotate the hook with sentry_sdk.types Event/Hint for strict mypy
- Build exc_info triples directly in tests instead of catching BaseException (Sonar S5754) and compare events by equality (Sonar S5796)
- Use record.getMessage() in the caplog assertion (Copilot)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
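Building an exc_info triple directly, as the commit above mentions, could look something like this. It is a minimal sketch with a hypothetical helper name (`exc_info_for`), not the code in tests/test_sentry.py; the built-in `ConnectionError` is used here only as a convenient placeholder cause.

```python
def exc_info_for(exc, cause=None):
    """Build a (type, value, traceback) triple without raise/except."""
    if cause is not None:
        exc.__cause__ = cause  # emulate `raise exc from cause`
    # An exception that was never raised has no traceback, so this is None.
    return (type(exc), exc, exc.__traceback__)


# A wrapper error with an underlying cause, ready to pass as hint["exc_info"].
triple = exc_info_for(RuntimeError("wrapper"), cause=ConnectionError("refused"))
hint = {"exc_info": triple}
```

Constructing the triple this way avoids a broad `except BaseException` in the test body, which is the pattern the Sonar rule cited in the commit flags.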

…endent

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… caplog tests

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- The embedded beat scheduler logs every broker reconnect attempt at ERROR ("beat: Connection error ... Trying again"), the same expected-transient noise as the consumer logger (Sentry ANTHIAS-P)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…in walk

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- depends_on with bare service_started only orders container creation; uvicorn/celery/viewer could still race a redis that hadn't finished loading its RDB, producing the startup connection-refused noise (review feedback on this PR)
- Add a redis-cli ping healthcheck to the prod template, dev, and test composes, and gate anthias-server / anthias-viewer / anthias-celery on service_healthy
- compose-only: the balena supervisor doesn't support depends_on conditions, and a redis container recycling mid-life is gated by nothing, so the Sentry-side handling of transient redis errors stays

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

sonarqubecloud · 2026-06-07T12:57:17Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code


- CalVer (YYYY.0M.MICRO); still June 2026, micro 2 -> 3
- Gives Sentry a real release boundary: every build since 2026.6.2 reported the same base version (only the +git-hash differed), so resolved-in-next-release never stuck and fixed issues kept reopening on the next event. A version bump lets the deployed fixes actually clear from the board.
- Ships the crash/noise fixes merged since 2026.6.2: SQLite WAL + busy timeout (#3015), celery migration-gate (#3016) and asset-probe soft limits (#3017), transient-redis/CancelledError Sentry filtering + redis healthcheck (#3018/#3028), GitHub update-check log level (#3019), webview respawn on D-Bus death at setup and mid-play (#3020/#3031), resilient static-file scan (#3026), Wayland-socket wait (#3030), and Sentry release/board triage tags (#3021/#3025)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vpetersson requested a review from a team as a code owner June 7, 2026 11:16

vpetersson self-assigned this Jun 7, 2026

vpetersson requested a review from Copilot June 7, 2026 11:16

Copilot started reviewing on behalf of vpetersson June 7, 2026 11:16