
fix(sentry): stop reporting transient redis blips and client disconnects #3018

Merged
vpetersson merged 7 commits into master from fix/sentry-transient-noise
Jun 7, 2026

Conversation

@vpetersson

@vpetersson vpetersson commented Jun 7, 2026

Contributor

Issues Fixed

Sentry: ANTHIAS-M / ANTHIAS-H (viewer resolution reporter, redis refused / DNS not yet resolvable), ANTHIAS-K (celery consumer reconnect log), ANTHIAS-J (ASGI redis connection reset), ANTHIAS-N (CancelledError on client disconnect), ANTHIAS-P (celery beat reconnect log).

Description

Two layers, per review feedback:

  1. Orchestration (root cause for startup ordering): redis now has a redis-cli ping healthcheck, and anthias-server / anthias-viewer / anthias-celery are gated on condition: service_healthy in the prod template, dev, and test composes. Bare service_started only orders container creation, so consumers could race a redis still loading its RDB.
  2. Sentry-side handling for what compose can't gate: the balena supervisor doesn't support depends_on conditions (fleet devices get no ordering guarantee), and nothing gates a redis container recycling mid-life (OOM, upgrade) — long-running services will always see a window of ConnectionError then. Every consumer already self-heals, so those windows are filtered rather than reported.
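
As a rough illustration of what the healthcheck gate buys on plain compose installs, the sketch below mimics it in Python: probe redis with an inline PING and let dependents start only once it answers +PONG. The function names, the retry defaults, and the raw-socket probe are assumptions made for illustration; the PR itself relies on redis-cli ping in the compose healthcheck, not on this code.

```python
import socket
import time
from typing import Callable


def redis_ping(host: str, port: int = 6379, timeout: float = 1.0) -> bool:
    """Return True only if the server answers an inline PING with +PONG.

    Any socket-level failure (connection refused, DNS name not yet
    resolvable, timeout) is treated as "not ready" rather than raised.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"PING\r\n")
            return sock.recv(64).startswith(b"+PONG")
    except OSError:
        return False


def wait_until_healthy(
    probe: Callable[[], bool], retries: int = 5, interval: float = 1.0
) -> bool:
    """Call the probe up to `retries` times, sleeping between failed attempts."""
    for attempt in range(retries):
        if probe():
            return True
        if attempt < retries - 1:
            time.sleep(interval)
    return False
```

A dependent service could call wait_until_healthy(lambda: redis_ping("redis")) before connecting, where "redis" is an assumed service hostname. This only covers first start; a redis container recycling later still needs the Sentry-side handling described in item 2.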

Redis being briefly unreachable (its container recycling, or compose startup before the redis DNS name resolves) is an expected state on a signage device, and every consumer already self-heals — celery's consumer reconnects with backoff, the viewer's reporter retries on its next tick, Channels re-establishes on the next WebSocket frame. Same for an HTTP client hanging up mid-request under ASGI: Django/uvicorn cancel the handler by design. Neither is a code bug, but together they produced 6 noise events per redis blip / disconnect.

  • before_send hook drops events whose exception chain (walking __cause__/__context__, since channels-redis/kombu wrap the underlying error) contains redis.exceptions.ConnectionError or asyncio.CancelledError
  • ignore_logger('celery.worker.consumer.consumer') — the reconnect retry arrives as an ERROR log line, not an exception
  • The viewer's resolution reporter now logs a redis blip at WARNING (retry next tick) instead of ERROR-with-traceback; tick body extracted into a testable helper
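
A minimal sketch of the chain-walking filter described in the first bullet, assuming the standard Sentry before_send contract (an event and a hint carrying exc_info, with None returned to drop the event). RedisConnectionError here is a stand-in for redis.exceptions.ConnectionError so the example runs without redis installed, and the function names are hypothetical rather than the ones in settings.py.

```python
import asyncio
from typing import Any, Dict, Iterator, Optional


class RedisConnectionError(Exception):
    """Stand-in for redis.exceptions.ConnectionError (assumption)."""


# Exception types treated as expected, self-healing noise.
TRANSIENT_TYPES = (RedisConnectionError, asyncio.CancelledError)


def iter_exception_chain(exc: Optional[BaseException]) -> Iterator[BaseException]:
    """Yield exc and everything reachable via __cause__ / __context__."""
    seen = set()
    stack = [exc]
    while stack:
        current = stack.pop()
        if current is None or id(current) in seen:
            continue  # skip missing links and guard against cycles
        seen.add(id(current))
        yield current
        stack.append(current.__cause__)
        stack.append(current.__context__)


def drop_transient(event: Dict[str, Any], hint: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return None (drop) when any exception in the chain is transient."""
    exc_info = hint.get("exc_info")
    if not exc_info:
        return event  # log-only events carry no exception to inspect
    if any(isinstance(e, TRANSIENT_TYPES) for e in iter_exception_chain(exc_info[1])):
        return None
    return event
```

Walking both __cause__ and __context__ matters because a wrapper raised inside an except block without "from" only sets __context__, so checking the outermost exception type alone would miss the underlying connection error.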

A persistent redis outage still surfaces — the device visibly stops working and the restart loop shows in balena — but a 5-second blip no longer fans out into Sentry issues.

Checklist

  • I have performed a self-review of my own code.
  • New and existing unit tests pass locally and on CI with my changes.
  • I have done an end-to-end test for Raspberry Pi devices.
  • I have tested my changes for x86 devices.
  • I added documentation for the changes I have made (when necessary).

🤖 Generated with Claude Code

@vpetersson vpetersson requested a review from a team as a code owner June 7, 2026 11:16
@vpetersson vpetersson self-assigned this Jun 7, 2026
@vpetersson vpetersson requested a review from Copilot June 7, 2026 11:16

Copilot AI left a comment

Pull request overview

This PR reduces Sentry noise by treating expected transient conditions (brief Redis unavailability and ASGI client disconnect cancellations) as non-actionable, while keeping real exceptions reportable. It also refactors the viewer’s display-resolution reporter tick into a helper to make the behavior testable.

Changes:

  • Added a Sentry before_send hook that drops events whose exception chain includes redis.exceptions.ConnectionError or asyncio.CancelledError.
  • Silenced Celery consumer reconnect “ERROR” log noise in Sentry via ignore_logger('celery.worker.consumer.consumer').
  • Refactored the viewer display-resolution reporter into _publish_display_resolution_once() and downgraded Redis-connection blips to a WARNING (no traceback), with unit tests covering the behavior.
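
The single-tick refactor in the last bullet can be sketched roughly as follows. This is not the body of _publish_display_resolution_once(): the function name, key, TTL default, and logger name are hypothetical, and RedisConnectionError stands in for redis.exceptions.ConnectionError. The client only needs a redis-py style set(name, value, ex=...) method.

```python
import logging

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("viewer.resolution_reporter")


class RedisConnectionError(Exception):
    """Stand-in for redis.exceptions.ConnectionError (assumption)."""


def publish_resolution_once(client, resolution, key="display_resolution", ttl_seconds=60):
    """Run one reporter tick; return True if the value was written.

    A connection failure is logged at WARNING without a traceback,
    because the caller's loop simply retries on its next tick.
    """
    try:
        client.set(key, resolution, ex=ttl_seconds)  # write with a TTL
    except RedisConnectionError as exc:
        logger.warning("redis unreachable, retrying next tick: %s", exc)
        return False
    return True
```

Pulling the tick body out of the loop is what makes it testable: a test can pass a fake client that raises and assert on the log level without running the reporter thread.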

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/test_viewer.py | Adds regression tests ensuring Redis blips in the viewer resolution reporter log as WARNING (not ERROR) and that TTL writes still occur. |
| tests/test_sentry.py | Adds regression tests for _sentry_before_send dropping transient Redis/CancelledError noise, and confirms the Celery consumer logger is ignored. |
| src/anthias_viewer/__init__.py | Extracts a single-tick helper for publishing display resolution; logs transient Redis connection errors at WARNING without traceback. |
| src/anthias_server/django_project/settings.py | Implements exception-chain walking + before_send filter; ignores Celery reconnect logger for Sentry. |

Comment thread tests/test_viewer.py

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread tests/test_sentry.py

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread tests/test_viewer.py

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread src/anthias_server/django_project/settings.py Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

@vpetersson

Contributor Author

Good call — added the orchestration layer: redis now has a redis-cli ping healthcheck and server/viewer/celery are gated on condition: service_healthy (prod template + dev + test composes).

Kept the Sentry-side handling alongside it, for three reasons:

  1. balena — the supervisor doesn't support depends_on conditions, so fleet devices get no startup ordering from compose; their first-boot reconnects would keep generating events.
  2. mid-life restarts: depends_on only gates first start. A redis container recycling later (OOM, upgrade roll) still throws ConnectionError into the long-running services, and each of them already self-heals.
  3. CancelledError (ANTHIAS-N) is client-disconnect noise unrelated to redis ordering.

🤖 Generated with Claude Code

@vpetersson

Contributor Author

Balena compatibility check on the compose changes:

  • The balena deploy paths (bin/balena_ota_deploy.sh, bin/deploy_to_balena.sh) render docker-compose.balena.yml.tmpl / docker-compose.balena.dev.yml.tmpl, both untouched by this PR, so fleet devices never see the new syntax.
  • The condition: service_healthy gating is only in docker-compose.yml.tmpl (plain compose installs via upgrade_containers.sh), docker-compose.dev.yml, and docker-compose.test.yml.
  • Confirmed against the balena supervisor compose reference: depends_on supports "only array form and service_started condition" — which is exactly why the balena templates stay on the supervisor's defaults and the Sentry-side filter carries the fleet.

🤖 Generated with Claude Code

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

vpetersson and others added 4 commits June 7, 2026 12:55
- Redis restarting (container recycle, compose startup before DNS
  resolves) produced an error event per process per blip even though
  every consumer self-heals: celery reconnects with backoff, the
  viewer's resolution reporter retries next tick, Channels
  re-establishes on the next frame (Sentry ANTHIAS-M, ANTHIAS-K,
  ANTHIAS-H, ANTHIAS-J)
- Add a before_send hook that drops events whose exception chain
  contains redis.exceptions.ConnectionError or asyncio.CancelledError
  (an HTTP client hanging up mid-request under ASGI — ANTHIAS-N)
- Silence celery's per-reconnect-attempt ERROR log at the logger
  (it arrives as a log message, not an exception)
- Downgrade the viewer reporter's redis-down log to a warning and
  extract the tick body into a testable helper
- Add regression tests for the filter and the reporter tick

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Annotate the hook with sentry_sdk.types Event/Hint for strict mypy
- Build exc_info triples directly in tests instead of catching
  BaseException (Sonar S5754) and compare events by equality
  (Sonar S5796)
- Use record.getMessage() in the caplog assertion (Copilot)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…endent

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… caplog tests

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vpetersson and others added 3 commits June 7, 2026 12:55
- The embedded beat scheduler logs every broker reconnect attempt at
  ERROR ("beat: Connection error ... Trying again"), the same
  expected-transient noise as the consumer logger (Sentry ANTHIAS-P)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…in walk

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- depends_on with bare service_started only orders container
  creation; uvicorn/celery/viewer could still race a redis that
  hadn't finished loading its RDB, producing the startup
  connection-refused noise (review feedback on this PR)
- Add a redis-cli ping healthcheck to the prod template, dev, and
  test composes, and gate anthias-server / anthias-viewer /
  anthias-celery on service_healthy
- compose-only: the balena supervisor doesn't support depends_on
  conditions, and a redis container recycling mid-life is gated by
  nothing — so the Sentry-side handling of transient redis errors
  stays

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@vpetersson vpetersson force-pushed the fix/sentry-transient-noise branch from b95a8c4 to e142df7 Compare June 7, 2026 12:56
@sonarqubecloud

sonarqubecloud Bot commented Jun 7, 2026


@vpetersson vpetersson merged commit 43b9375 into master Jun 7, 2026
10 checks passed
vpetersson added a commit that referenced this pull request Jun 9, 2026
- CalVer (YYYY.0M.MICRO); still June 2026, micro 2 -> 3
- Gives Sentry a real release boundary: every build since 2026.6.2
  reported the same base version (only the +git-hash differed), so
  resolved-in-next-release never stuck and fixed issues kept
  reopening on the next event. A version bump lets the deployed
  fixes actually clear from the board.
- Ships the crash/noise fixes merged since 2026.6.2: SQLite WAL +
  busy timeout (#3015), celery migration-gate (#3016) and
  asset-probe soft limits (#3017), transient-redis/CancelledError
  Sentry filtering + redis healthcheck (#3018/#3028), GitHub
  update-check log level (#3019), webview respawn on D-Bus death at
  setup and mid-play (#3020/#3031), resilient static-file scan
  (#3026), Wayland-socket wait (#3030), and Sentry release/board
  triage tags (#3021/#3025)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>