Skip to content

Improve celery task dispatch and cancellation to prevent stuck jobs#1324

Open
mihow wants to merge 2 commits into
mainfrom
fix/celery-stuck-jobs
Open

Improve celery task dispatch and cancellation to prevent stuck jobs#1324
mihow wants to merge 2 commits into
mainfrom
fix/celery-stuck-jobs

Conversation

@mihow
Copy link
Copy Markdown
Collaborator

@mihow mihow commented May 27, 2026

Closes #1323.

Last week we hit the symptom Issue #1323 describes: a run_job was sitting inside filter_processed_images() for ~9 minutes against a huge collection, and a fresh run_job queued behind it sat in RESERVED on the same worker container the entire time — even though 15 of 16 children on that container were idle and the entire sibling container was idle. SIGKILL'ing the blocker let the queued job start in the same second. The job that finally ran was almost certainly fine, but it wasted nine minutes of wall clock for no good reason, and any user clicking "run" during that window saw nothing happen.

This PR fixes the three reinforcing causes the issue identified, plus an orthogonal cancel-path bug I noticed while reading the code paths. Each one is a small config or decorator change; the value is in stacking them.

What changes for users

  • Stuck run_job no longer blocks other jobs in the queue. A slow first job releases the worker slot for the next one as soon as a sibling child is idle, rather than holding 15 idle slots hostage.
  • A worker crash mid-job no longer silently drops the job. Before: the celery message was already acked at delivery, so a SIGKILL/OOM/deploy roll meant the job stayed in STARTED forever until the reaper found it. After: broker holds the message, redelivers it when a worker comes back. The job either resumes or — if it had already settled — exits cleanly.
  • Cancelling an async ML job actually cancels. Today's Job.cancel calls revoke(terminate=True) on the local run_job task. For ASYNC_API that task has almost always already finished (queue_images_to_nats returns fast) — terminating it does nothing about the actual work running on the remote ADC worker, and on the rare occasion the bootstrap is still running, the SIGTERM kills it without redelivery. The real cancel mechanism for async is tearing down the NATS stream + Redis state, which cleanup_async_job_if_needed already does. We now skip terminate for ASYNC_API.

Plain-language summary

Behavior Before After
Long run_job blocking sibling jobs on same container Yes — pre-assignment pins messages to busy child No — fair scheduler hands messages only to idle children
Long run_job blocking sibling jobs on sibling container Sometimes — broker spills late Reduced — pairs naturally with smaller per-container reservation window (see below)
Worker SIGKILL/OOM mid-run_job Message lost; job stranded in STARTED Broker redelivers; early-guard handles redelivery cleanly
Cancel of ASYNC_API job Terminates local bootstrap (mostly a no-op, occasionally a SIGTERM mid-flight); remote workers keep going Tears down NATS stream + Redis state; remote work stops naturally
Cancel of SYNC_API / INTERNAL job Terminates celery task Unchanged — terminate is still the only way

What's in this PR

  1. config/settings/base.pyCELERY_WORKER_POOL_OPTIMIZATION = "fair". One line. Applies to all queues; the value is largely on jobs.
  2. ami/jobs/tasks.pyacks_late=True, reject_on_worker_lost=True on run_job, plus an early-guard at the top of the task body that returns cleanly when job.status is in final_states() or CANCELING. The guard is what makes redelivery and cancel-race safe.
  3. ami/jobs/models.pyJob.cancel no longer passes terminate=True when dispatch_mode == ASYNC_API. For other dispatch modes, behavior is unchanged.
  4. docker-compose.worker.yml — added a commented-out CELERY_WORKER_CONCURRENCY: "4" on celeryworker_jobs with a TODO referencing this issue, in case we want to enable cause C later.

A note on the counter-intuitive concurrency knob

The issue suggests lowering per-container concurrency on the jobs queue as a third fix. I've left this commented out for now (just a discoverability hint in docker-compose.worker.yml) because it reads as backwards and I'd rather watch the first two fixes in production before pulling this lever too.

The reasoning, briefly: celery's prefetch reserves concurrency × prefetch_multiplier(=1) messages per container at the broker level. With concurrency=16, that's 16 messages held in the container's local buffer. When one of those messages is a stuck task, the container still tells the broker "I have 15 free slots" — and the broker keeps offering new messages to that container instead of spilling to a fully-idle sibling. Lowering concurrency to 4 shrinks the reservation window so the broker spills sooner.

The reason this isn't a meaningful capacity cut: run_job spends nearly all its time waiting on NATS results to come back, not burning CPU. The 16 was originally raised (#1228) for ml_results and antenna, which are DB/Redis-bound and benefit from oversubscription. The jobs queue inherited the high number incidentally.

What the three fixes are actually doing

There are three different head-of-line problems, with three different mechanisms:

  • Inside a child (one slow task blocks its own pipe): can't be fixed, that's just how prefork works.
  • Inside a container, across children (default pre-assignment scheduler pins messages to specific child pipes regardless of which child becomes idle first): fixed by -O fair.
  • Across containers (broker prefetch reservation: a 1-task container still looks like it has 15 free slots): mitigated by lower per-container concurrency (the deferred knob).

acks_late is orthogonal to all three — it's about surviving worker death, not about scheduling. But it's a precondition for both the cancel fix to be safe (cancellation can now terminate a worker without losing the redelivered message that the early-guard will short-circuit) and for the deferred concurrency change to be safe (smaller pools mean each child handles more tasks, and a single SIGKILL hurts more).

What's verified

  • New unit tests on Job.cancel for ASYNC_API / SYNC_API / no-task-id paths.
  • New tests on the run_job early-guard covering REVOKED, CANCELING, SUCCESS, and the contract pair (PENDING still runs).
  • Full ami.jobs.tests (118 tests) and ami.ml.tests + ami.ml.orchestration.tests (81 tests) green.

What still needs verifying in staging/prod

From the issue's "what we still need to verify" section, two of three are now testable:

  1. Re-running the symptom against a real -O fair worker — confirm RESERVED → ACTIVE happens immediately when a sibling child is idle. I'd want to do this on dev box (queue two long run_jobs back-to-back, sleep one).
  2. -O fair interaction with max-tasks-per-child=100 — should be fine but worth watching for the first day after deploy.
  3. acks_late redelivery in practice — the early-guard makes a redelivered run_job a no-op when the job is already settled, so this is covered by the tests, but worth eyeballing the celery logs after deploy for unexpected Skipping run_job messages.

The cancel fix is the one I'd most like a second pair of eyes on — the docstring is the long version, but the gist is "for ASYNC_API the celery task isn't where the work is, so terminating it doesn't cancel; cleanup does."

Related

Co-Authored-By: Claude noreply@anthropic.com

Summary by CodeRabbit

  • Bug Fixes

    • Job cancellation now handles different job types more reliably, ensuring consistent state transitions and proper cleanup.
    • Task execution is more resilient when workers are unexpectedly lost, with automatic re-delivery of in-flight tasks to maintain reliability.
  • Chores

    • Optimized worker pool scheduling configuration to enable fair task distribution and reduce processing delays.

Review Change Stack

Addresses three reinforcing causes of run_job head-of-line blocking on the
jobs queue (#1323) plus an orthogonal cancel bug exposed
by the same investigation.

- Enable fair scheduling (CELERY_WORKER_POOL_OPTIMIZATION = "fair") so the
  master process holds prefetched messages in a shared buffer instead of
  pre-assigning them to specific prefork children. Long heterogeneous
  tasks (notably run_job inside filter_processed_images) no longer block
  newer messages stuck behind them on the same child.

- Add acks_late=True + reject_on_worker_lost=True to run_job so a worker
  SIGKILL/OOM mid-task triggers broker redelivery instead of silently
  dropping the job. Pairs with an early-guard at the top of run_job that
  returns cleanly if the Job is already in a terminal state or being
  cancelled, so redelivery never re-runs side effects.

- Fix Job.cancel for ASYNC_API: skip terminate=True on the (likely-done)
  run_job task — the actual work runs on remote ADC workers via NATS, and
  cleanup_async_job_if_needed is what stops it. Terminating the local
  bootstrap was a no-op at best and SIGTERM'd a still-bootstrapping child
  at worst. INTERNAL / SYNC_API keep terminate=True since their celery
  task body owns the entire job lifecycle.

- Document the optional CELERY_WORKER_CONCURRENCY=4 override on the
  celeryworker_jobs container (commented out for now) so operators can
  opt in once -O fair is observed in production.

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 27, 2026 23:16
@netlify
Copy link
Copy Markdown

netlify Bot commented May 27, 2026

Deploy Preview for antenna-ssec canceled.

Name Link
🔨 Latest commit c8e6c8a
🔍 Latest deploy log https://app.netlify.com/projects/antenna-ssec/deploys/6a17971be3a452000810b632

@netlify
Copy link
Copy Markdown

netlify Bot commented May 27, 2026

Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit c8e6c8a
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/6a17971b5a4ff00008e75cd4

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 27, 2026

Warning

Review limit reached

@mihow, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 46 minutes and 41 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9597a54d-4d1a-4e33-b4f8-6d679c8a0109

📥 Commits

Reviewing files that changed from the base of the PR and between f772b71 and c8e6c8a.

📒 Files selected for processing (4)
  • ami/jobs/models.py
  • ami/jobs/tasks.py
  • ami/jobs/tests/test_jobs.py
  • ami/jobs/tests/test_tasks.py
📝 Walkthrough

Walkthrough

This PR addresses blocking behavior in the Celery jobs queue by unifying job cancellation semantics across dispatch modes, adding broker-safe redelivery handling to the run_job task with early-return guards, and configuring fair worker pool scheduling to prevent head-of-line blocking.

Changes

Job Cancellation and Celery Task Resilience

Layer / File(s) Summary
Job cancellation refactor for dispatch modes
ami/jobs/models.py, ami/jobs/tests/test_jobs.py
Job.cancel() now revokes Celery tasks with terminate=not is_async_api, meaning ASYNC_API jobs skip worker termination while SYNC_API/INTERNAL jobs use SIGTERM. Status unconditionally transitions to REVOKED and cleanup is called. Three test cases verify behavior for ASYNC_API, SYNC_API, and jobs without task_id.
Celery task reliability: late acks and early guards
ami/jobs/tasks.py, ami/jobs/tests/test_tasks.py
run_job task now uses acks_late=True and reject_on_worker_lost=True for broker redelivery safety. An early-return guard skips execution if the job is already terminal (REVOKED, SUCCESS) or canceling (CANCELING), preventing side-effect re-runs on redelivered messages. Post-execution refreshes the job and logs conditionally by dispatch mode. Regression tests confirm the guard prevents Job.run() from executing in terminal/canceling states.
Worker pool scheduling and configuration
config/settings/base.py, docker-compose.worker.yml
Introduces CELERY_WORKER_POOL_OPTIMIZATION = "fair" to enable fair scheduling in prefork pools, reducing head-of-line blocking from long heterogeneous tasks. Documentation comments in docker-compose.worker.yml explain CELERY_WORKER_CONCURRENCY tuning rationale for the jobs queue.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • RolnickLab/antenna#1118: Introduced JobDispatchMode enum and dispatch_mode field on Job, which this PR's cancellation logic now uses to determine Celery task termination behavior.

Suggested labels

backend

Suggested reviewers

  • carlosgjs

Poem

🐰 A job was stuck, and workers sat idle,

Fair scheduling and acks_late made the queue less spiteful.

With early guards and redelivery's care,

No more head-of-line blocking—freedom in the air! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 70.59% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title accurately summarizes the main objective: improving celery task dispatch and cancellation to prevent stuck jobs, which directly addresses the root causes identified in issue #1323.
Linked Issues check ✅ Passed The code changes directly implement all three primary objectives from issue #1323: (A) adds CELERY_WORKER_POOL_OPTIMIZATION='fair' for fair scheduling, (B) adds acks_late/reject_on_worker_lost with status guards for safe redelivery, and (C) provides commented concurrency guidance for future tuning.
Out of Scope Changes check ✅ Passed All code changes are tightly scoped to addressing the three causes of stuck jobs identified in #1323: scheduler config, celery task decorators/guards, cancel semantics, and worker config comments. No unrelated refactoring or feature additions present.
Description check ✅ Passed The PR description is comprehensive, well-structured, and covers all required sections from the template.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/celery-stuck-jobs

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves Celery run_job scheduling and cancellation behavior to reduce stuck-job blast radius and make worker-loss redelivery safer.

Changes:

  • Enables fair Celery prefork scheduling globally.
  • Adds late acknowledgement/redelivery settings and an early status guard to run_job.
  • Changes Job.cancel() behavior for ASYNC_API jobs and adds regression tests.
  • Adds a documented optional jobs-worker concurrency override.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
config/settings/base.py Adds fair worker pool optimization for Celery workers.
ami/jobs/tasks.py Adds late ack/reject-on-worker-lost and early short-circuit logic for run_job.
ami/jobs/models.py Updates cancellation behavior for async vs sync/internal jobs.
ami/jobs/tests/test_tasks.py Adds early-guard regression tests for run_job.
ami/jobs/tests/test_jobs.py Adds cancellation behavior tests.
docker-compose.worker.yml Documents a possible future jobs-worker concurrency override.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ami/jobs/tasks.py Outdated
Comment on lines +165 to +170
if job.status in JobState.final_states() or job.status == JobState.CANCELING:
job.logger.info(
f"Skipping run_job for job {job.pk}: already in status {job.status} "
f"(redelivery or cancellation in flight)"
)
return
Comment thread ami/jobs/models.py
Comment on lines 1097 to +1101
task = run_job.AsyncResult(self.task_id)
if task:
task.revoke(terminate=True)
if self.dispatch_mode == JobDispatchMode.ASYNC_API:
# For async jobs we need to set the status to revoked here since the task already
# finished (it only queues the images).
self.status = JobState.REVOKED
self.save()
else:
self.status = JobState.REVOKED
self.save()
task.revoke(terminate=not is_async_api)

self.status = JobState.REVOKED
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ami/jobs/tasks.py`:
- Around line 161-170: The pre-run guard using job.status /
JobState.final_states() is insufficient for ASYNC_API jobs because cancellation
may occur after the initial check but before dispatch; to fix, add a second
status refresh and guard immediately before the async dispatch call (right
before queue_images_to_nats) by reloading the Job from the DB (e.g., call the
model refresh/get by PK) and aborting the task (return) if the reloaded
job.status is JobState.CANCELING or in JobState.final_states(), logging a
similar skip message; ensure you reference the same job PK/logger and perform
this check right before queue_images_to_nats to avoid enqueuing work for
canceled jobs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ed9ca82f-4b76-42e9-ae6e-290f6abf61b2

📥 Commits

Reviewing files that changed from the base of the PR and between f585ddc and f772b71.

📒 Files selected for processing (6)
  • ami/jobs/models.py
  • ami/jobs/tasks.py
  • ami/jobs/tests/test_jobs.py
  • ami/jobs/tests/test_tasks.py
  • config/settings/base.py
  • docker-compose.worker.yml

Comment thread ami/jobs/tasks.py
Comment on lines +161 to +170
# Early-guard: under acks_late, the broker may redeliver this message after a
# worker SIGKILL/OOM, and Job.cancel() may also flip status to CANCELING /
# REVOKED while the message sits in the prefetch buffer. Don't re-run a job
# that's already settled or being torn down.
if job.status in JobState.final_states() or job.status == JobState.CANCELING:
job.logger.info(
f"Skipping run_job for job {job.pk}: already in status {job.status} "
f"(redelivery or cancellation in flight)"
)
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Entry-only cancel guard is race-prone for ASYNC_API jobs.

Lines 165-170 guard only before job.run(). If cancel happens after that check, the task can still reach async dispatch and enqueue work under a canceled job because ASYNC_API cancel no longer terminates the worker process. Add a second DB refresh/status check immediately before async dispatch (e.g., right before queue_images_to_nats) and abort when status is CANCELING/terminal.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ami/jobs/tasks.py` around lines 161 - 170, The pre-run guard using job.status
/ JobState.final_states() is insufficient for ASYNC_API jobs because
cancellation may occur after the initial check but before dispatch; to fix, add
a second status refresh and guard immediately before the async dispatch call
(right before queue_images_to_nats) by reloading the Job from the DB (e.g., call
the model refresh/get by PK) and aborting the task (return) if the reloaded
job.status is JobState.CANCELING or in JobState.final_states(), logging a
similar skip message; ensure you reference the same job PK/logger and perform
this check right before queue_images_to_nats to avoid enqueuing work for
canceled jobs.

Review on #1324 surfaced two races that left the early-guard non-functional
in production:

1. ``task_prerun`` (``pre_update_job_status``) wrote PENDING to the row
   before the ``run_job`` body inspected status. A canceled or redelivered
   message therefore had its REVOKED/CANCELING overwritten with PENDING,
   and the early-guard added in the parent commit never tripped. The
   existing tests passed only because they invoked ``run_job.apply(args=[…])``
   while production uses ``kwargs={"job_id": …}`` — under args, the prerun
   handler raised ``KeyError`` and exited silently. Switching the tests to
   ``kwargs=`` reproduces the production code path; the prerun handler now
   short-circuits when ``Job.is_settled()`` is true, preserving the status
   the early-guard reads next.

2. For ASYNC_API jobs ``Job.cancel()`` revokes without ``terminate=True``,
   marks the row REVOKED, and tears down the NATS stream + Redis state.
   ``MLJob.run`` running in a worker that's still inside ``collect_images``
   (slow for large collections) would then proceed to ``queue_images_to_nats``
   and recreate the stream the cancel just deleted, dispatching real GPU
   work to ADC for a revoked job; the results came back to no Redis state
   and ``_fail_job`` silently overwrote REVOKED with FAILURE. The bootstrap
   now checks ``Job.status`` (via a values-only read so the in-memory
   ``progress`` mutations don't clobber the cancel's REVOKED) right after
   the collect stage and bails out before any dispatch.

Adds ``Job.is_settled()`` to centralize the "terminal or being torn down"
predicate that ``run_job``'s early-guard, the prerun handler, ``_fail_job``,
and the bootstrap guard all needed. Adds two regression tests: one for the
prerun-then-guard chain, one for the cancel-during-bootstrap race.

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Stuck job blocks other jobs even when workers are free

2 participants