fix(celery): catch the asset-probe timeout instead of hard-killing the worker #3017

Merged
vpetersson merged 2 commits into master from fix/revalidate-soft-time-limit on Jun 7, 2026

@vpetersson

Contributor

Issues Fixed

Sentry: ANTHIAS-A (Hard time limit (30s) exceeded for revalidate_asset_url), ANTHIAS-9 (TimeLimitExceeded), ANTHIAS-B (ForkPoolWorker exited with signal 9 (SIGKILL)) — all three are the same hard-kill.

Description

revalidate_asset_url's 30s hard time limit was reachable by a legitimately slow probe: url_fails can stall in a hanging getaddrinfo against a broken resolver (the call has no timeout knob), then spend up to 10s on the HTTP HEAD and another 10s on the GET fallback. Tripping the hard limit SIGKILLs the pool child, which surfaces as three separate Sentry issues per occurrence.
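
To make the arithmetic concrete, here is a rough budget sketch. The 30s limit and the two 10s HTTP timeouts come from the description above; the DNS stall durations passed in are hypothetical inputs, since the resolver call is unbounded.

```python
# Figures taken from the PR description.
HARD_LIMIT_S = 30
HEAD_TIMEOUT_S = 10
GET_TIMEOUT_S = 10


def dns_headroom(hard_limit_s: float = HARD_LIMIT_S) -> float:
    """Seconds left for name resolution when HEAD and GET both time out."""
    return hard_limit_s - (HEAD_TIMEOUT_S + GET_TIMEOUT_S)


def trips_hard_limit(dns_stall_s: float, hard_limit_s: float = HARD_LIMIT_S) -> bool:
    """True if a probe with this (hypothetical) DNS stall exceeds the limit."""
    return dns_stall_s + HEAD_TIMEOUT_S + GET_TIMEOUT_S > hard_limit_s


print(dns_headroom())            # 10
print(trips_hard_limit(12))      # True
print(trips_hard_limit(12, 90))  # False
```

With the old 30s limit, any resolver stall longer than 10s is enough to get the pool child killed when both HTTP requests also time out.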

  • soft_time_limit=60 / time_limit=90 on the on-demand probe — the soft limit raises SoftTimeLimitExceeded inside the task, which now records the same verdict an HTTP timeout gets (unreachable, last_reachability_check stamped) instead of dying
  • The periodic sweep gets the same treatment: it aborts cleanly one minute before its 30-min hard limit, releasing its Redis singleton lock so the next beat tick starts fresh
  • The hard limits stay as the backstop for a probe stuck in C code where the soft signal can't be delivered
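
The pattern in the bullets above can be sketched roughly as follows. This is an illustration, not the code from this PR: `revalidate`, the `asset` dict and its keys, and the injected `url_fails` callable are hypothetical stand-ins, and the Celery decorator is shown only as a comment so the snippet runs without a worker.

```python
from datetime import datetime, timezone

try:
    from celery.exceptions import SoftTimeLimitExceeded
except ImportError:  # keeps the sketch runnable without Celery installed
    class SoftTimeLimitExceeded(Exception):
        """Stand-in for celery.exceptions.SoftTimeLimitExceeded."""

PROBE_SOFT_TIME_LIMIT = 60  # raises SoftTimeLimitExceeded inside the task
PROBE_TIME_LIMIT = 90       # backstop: the pool child is killed


# In a real worker this would carry something like:
# @app.task(soft_time_limit=PROBE_SOFT_TIME_LIMIT, time_limit=PROBE_TIME_LIMIT)
def revalidate(asset: dict, url_fails) -> dict:
    """Probe the asset and record a verdict, even if the soft limit fires."""
    try:
        unreachable = url_fails(asset["uri"])
    except SoftTimeLimitExceeded:
        # Same verdict an HTTP timeout would get.
        unreachable = True
    asset["is_reachable"] = not unreachable
    asset["last_reachability_check"] = datetime.now(timezone.utc)
    return asset


def hanging_probe(uri: str) -> bool:
    # Simulates the soft signal arriving while the probe is stuck.
    raise SoftTimeLimitExceeded()


result = revalidate({"uri": "http://example.invalid"}, hanging_probe)
print(result["is_reachable"])  # False
```

The task returns normally after a soft timeout, so the worker process survives and the only trace is the unreachable verdict on the asset.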

Checklist

  • I have performed a self-review of my own code.
  • New and existing unit tests pass locally and on CI with my changes.
  • I have done an end-to-end test for Raspberry Pi devices.
  • I have tested my changes for x86 devices.
  • I have added documentation for the changes I have made (when necessary).

🤖 Generated with Claude Code

…e worker

- revalidate_asset_url's 30s hard limit was reachable by a legitimate
  probe (DNS stall + HEAD 10s + GET 10s), and tripping it SIGKILLs the
  pool child — three Sentry issues per occurrence (ANTHIAS-A,
  ANTHIAS-9, ANTHIAS-B)
- Add soft_time_limit=60 / time_limit=90: the soft limit raises inside
  the task, which records the verdict an HTTP timeout gets
  (unreachable) instead of dying
- Give the periodic sweep the same treatment: abort cleanly a minute
  before its hard limit, releasing the singleton lock
- Add regression tests for limits and soft-timeout behaviour

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@vpetersson vpetersson requested a review from a team as a code owner June 7, 2026 11:12
@vpetersson vpetersson self-assigned this Jun 7, 2026
@vpetersson vpetersson requested a review from Copilot June 7, 2026 11:12

Copilot AI left a comment

Pull request overview

This PR adjusts Celery task time limits and handling for asset reachability probes so that slow/hung probes are handled via soft time limits (caught in-task) instead of reaching the hard time limit that SIGKILLs the pool worker process, reducing the related Sentry noise and improving operational stability.

Changes:

  • Introduces explicit soft/hard time-limit constants for the on-demand asset probe and the periodic sweep.
  • Catches SoftTimeLimitExceeded inside both tasks to abort/record outcomes cleanly (instead of worker SIGKILL).
  • Adds unit tests asserting time-limit configuration and verifying soft-limit behavior (unreachable verdict / sweep abort + lock release).
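
A configuration test of this kind might look roughly like the sketch below. It is an assumption about the shape of such a test, not the test added in this PR: `check_limits` is a hypothetical helper, and a `SimpleNamespace` stands in for the Celery task object, whose `soft_time_limit` and `time_limit` attributes may be `None`.

```python
from types import SimpleNamespace
from typing import Optional


def check_limits(task, min_gap_s: float = 0) -> None:
    """Assert both limits are set and the soft one fires first."""
    soft: Optional[float] = task.soft_time_limit
    hard: Optional[float] = task.time_limit
    # Narrow the Optionals before comparing, which also keeps strict
    # type checkers satisfied.
    assert soft is not None and hard is not None
    assert soft + min_gap_s < hard


# Stand-in for the on-demand probe task with the limits from this PR.
probe_task = SimpleNamespace(soft_time_limit=60, time_limit=90)
check_limits(probe_task)

# Stand-in for the sweep: soft limit one minute before the 30-min hard limit.
sweep_task = SimpleNamespace(soft_time_limit=29 * 60, time_limit=30 * 60)
check_limits(sweep_task)
```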

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/anthias_server/celery_tasks.py | Adds soft/hard time limits and catches SoftTimeLimitExceeded in asset revalidation tasks. |
| tests/test_celery_tasks.py | Adds tests validating time-limit configuration and soft-limit handling behavior. |

Comment thread tests/test_celery_tasks.py Outdated
Comment thread src/anthias_server/celery_tasks.py Outdated
Comment thread src/anthias_server/celery_tasks.py Outdated
… catch

- The soft signal is delivered asynchronously, so it can land during
  the row UPDATE as well as the probe; cover the whole task body in
  both the on-demand recheck and the sweep
- Re-raise SoftTimeLimitExceeded past the sweep's blanket per-asset
  handler so the outer abort path sees it
- Satisfy strict mypy on the Optional time-limit comparisons; reword
  a misleading test comment

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
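
The re-raise described in this commit can be sketched as follows. This is a simplified illustration under assumed names (`sweep`, `probe`, `release_lock` are hypothetical), not the project's sweep task: a per-asset blanket handler would otherwise swallow the soft-limit exception, so it is re-raised explicitly and handled once around the whole loop.

```python
try:
    from celery.exceptions import SoftTimeLimitExceeded
except ImportError:  # keeps the sketch runnable without Celery installed
    class SoftTimeLimitExceeded(Exception):
        """Stand-in for celery.exceptions.SoftTimeLimitExceeded."""


def sweep(assets, probe, release_lock) -> dict:
    """Probe every asset; abort cleanly if the soft limit fires anywhere."""
    checked = []
    try:
        for asset in assets:
            try:
                probe(asset)  # probe plus row update in the real task
                checked.append(asset)
            except SoftTimeLimitExceeded:
                raise  # must escape the blanket handler below
            except Exception:
                continue  # one bad asset should not stop the sweep
    except SoftTimeLimitExceeded:
        return {"aborted": True, "checked": checked}
    finally:
        release_lock()  # runs on both the normal and the aborted path
    return {"aborted": False, "checked": checked}


def probe(asset: str) -> None:
    if asset == "b":
        raise ValueError("bad asset")
    if asset == "c":
        raise SoftTimeLimitExceeded()


released = []
print(sweep(["a", "b", "c", "d"], probe, lambda: released.append(True)))
# {'aborted': True, 'checked': ['a']}
```

Because the outer `try` wraps the entire loop, a signal that lands during the update step rather than the probe itself takes the same abort path.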
@sonarqubecloud

sonarqubecloud Bot commented Jun 7, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@vpetersson vpetersson merged commit 633b41e into master Jun 7, 2026
10 checks passed
@vpetersson vpetersson mentioned this pull request Jun 9, 2026
5 tasks
vpetersson added a commit that referenced this pull request Jun 9, 2026
- CalVer (YYYY.0M.MICRO); still June 2026, micro 2 -> 3
- Gives Sentry a real release boundary: every build since 2026.6.2
  reported the same base version (only the +git-hash differed), so
  resolved-in-next-release never stuck and fixed issues kept
  reopening on the next event. A version bump lets the deployed
  fixes actually clear from the board.
- Ships the crash/noise fixes merged since 2026.6.2: SQLite WAL +
  busy timeout (#3015), celery migration-gate (#3016) and
  asset-probe soft limits (#3017), transient-redis/CancelledError
  Sentry filtering + redis healthcheck (#3018/#3028), GitHub
  update-check log level (#3019), webview respawn on D-Bus death at
  setup and mid-play (#3020/#3031), resilient static-file scan
  (#3026), Wayland-socket wait (#3030), and Sentry release/board
  triage tags (#3021/#3025)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>