Improve celery task dispatch and cancellation to prevent stuck jobs by mihow · Pull Request #1324 · RolnickLab/antenna

mihow · 2026-05-27T23:16:29Z

Closes #1323.

Last week we hit the symptom Issue #1323 describes: a run_job was sitting inside filter_processed_images() for ~9 minutes against a huge collection, and a fresh run_job queued behind it sat in RESERVED on the same worker container the entire time — even though 15 of 16 children on that container were idle and the entire sibling container was idle. SIGKILL'ing the blocker let the queued job start in the same second. The job that finally ran was almost certainly fine, but it wasted nine minutes of wall clock for no good reason, and any user clicking "run" during that window saw nothing happen.

This PR fixes the three reinforcing causes the issue identified, plus an orthogonal cancel-path bug I noticed while reading the code paths. Each one is a small config or decorator change; the value is in stacking them.

What changes for users

Stuck run_job no longer blocks other jobs in the queue. A slow first job releases the worker slot for the next one as soon as a sibling child is idle, rather than holding 15 idle slots hostage.
A worker crash mid-job no longer silently drops the job. Before: the celery message was already acked at delivery, so a SIGKILL/OOM/deploy roll meant the job stayed in STARTED forever until the reaper found it. After: broker holds the message, redelivers it when a worker comes back. The job either resumes or — if it had already settled — exits cleanly.
Cancelling an async ML job actually cancels. Today's Job.cancel calls revoke(terminate=True) on the local run_job task. For ASYNC_API that task has almost always already finished (queue_images_to_nats returns fast) — terminating it does nothing about the actual work running on the remote ADC worker, and on the rare occasion the bootstrap is still running, the SIGTERM kills it without redelivery. The real cancel mechanism for async is tearing down the NATS stream + Redis state, which cleanup_async_job_if_needed already does. We now skip terminate for ASYNC_API.

Plain-language summary

Behavior	Before	After
Long `run_job` blocking sibling jobs on same container	Yes — pre-assignment pins messages to busy child	No — fair scheduler hands messages only to idle children
Long `run_job` blocking sibling jobs on sibling container	Sometimes — broker spills late	Reduced — pairs naturally with smaller per-container reservation window (see below)
Worker SIGKILL/OOM mid-`run_job`	Message lost; job stranded in STARTED	Broker redelivers; early-guard handles redelivery cleanly
Cancel of ASYNC_API job	Terminates local bootstrap (mostly a no-op, occasionally a SIGTERM mid-flight); remote workers keep going	Tears down NATS stream + Redis state; remote work stops naturally
Cancel of SYNC_API / INTERNAL job	Terminates celery task	Unchanged — terminate is still the only way

What's in this PR

config/settings/base.py — CELERY_WORKER_POOL_OPTIMIZATION = "fair". One line. Applies to all queues; the value is largely on jobs.
ami/jobs/tasks.py — acks_late=True, reject_on_worker_lost=True on run_job, plus an early-guard at the top of the task body that returns cleanly when job.status is in final_states() or CANCELING. The guard is what makes redelivery and cancel-race safe.
ami/jobs/models.py — Job.cancel no longer passes terminate=True when dispatch_mode == ASYNC_API. For other dispatch modes, behavior is unchanged.
docker-compose.worker.yml — added a commented-out CELERY_WORKER_CONCURRENCY: "4" on celeryworker_jobs with a TODO referencing this issue, in case we want to enable cause C later.

A note on the counter-intuitive concurrency knob

The issue suggests lowering per-container concurrency on the jobs queue as a third fix. I've left this commented out for now (just a discoverability hint in docker-compose.worker.yml) because it reads as backwards and I'd rather watch the first two fixes in production before pulling this lever too.

The reasoning, briefly: celery's prefetch reserves concurrency × prefetch_multiplier(=1) messages per container at the broker level. With concurrency=16, that's 16 messages held in the container's local buffer. When one of those messages is a stuck task, the container still tells the broker "I have 15 free slots" — and the broker keeps offering new messages to that container instead of spilling to a fully-idle sibling. Lowering concurrency to 4 shrinks the reservation window so the broker spills sooner.

The reason this isn't a meaningful capacity cut: run_job spends nearly all its time waiting on NATS results to come back, not burning CPU. The 16 was originally raised (#1228) for ml_results and antenna, which are DB/Redis-bound and benefit from oversubscription. The jobs queue inherited the high number incidentally.

What the three fixes are actually doing

There are three different head-of-line problems, with three different mechanisms:

Inside a child (one slow task blocks its own pipe): can't be fixed, that's just how prefork works.
Inside a container, across children (default pre-assignment scheduler pins messages to specific child pipes regardless of which child becomes idle first): fixed by -O fair.
Across containers (broker prefetch reservation: a 1-task container still looks like it has 15 free slots): mitigated by lower per-container concurrency (the deferred knob).

acks_late is orthogonal to all three — it's about surviving worker death, not about scheduling. But it's a precondition for both the cancel fix to be safe (cancellation can now terminate a worker without losing the redelivered message that the early-guard will short-circuit) and for the deferred concurrency change to be safe (smaller pools mean each child handles more tasks, and a single SIGKILL hurts more).

What's verified

New unit tests on Job.cancel for ASYNC_API / SYNC_API / no-task-id paths.
New tests on the run_job early-guard covering REVOKED, CANCELING, SUCCESS, and the contract pair (PENDING still runs).
Full ami.jobs.tests (118 tests) and ami.ml.tests + ami.ml.orchestration.tests (81 tests) green.

What still needs verifying in staging/prod

From the issue's "what we still need to verify" section, two of three are now testable:

Re-running the symptom against a real -O fair worker — confirm RESERVED → ACTIVE happens immediately when a sibling child is idle. I'd want to do this on dev box (queue two long run_jobs back-to-back, sleep one).
-O fair interaction with max-tasks-per-child=100 — should be fine but worth watching for the first day after deploy.
acks_late redelivery in practice — the early-guard makes a redelivered run_job a no-op when the job is already settled, so this is covered by the tests, but worth eyeballing the celery logs after deploy for unexpected Skipping run_job messages.

The cancel fix is the one I'd most like a second pair of eyes on — the docstring is the long version, but the gist is "for ASYNC_API the celery task isn't where the work is, so terminating it doesn't cancel; cleanup does."

Stuck job blocks other jobs even when workers are free #1323 — this issue.
Preparing jobs takes too long preparing large collections #1321 — filter_processed_images slowness, the upstream cause of the long run_job that exposed this. Fixing Preparing jobs takes too long preparing large collections #1321 reduces the frequency of stuck jobs; this PR reduces the blast radius when they happen.
fix(celery): update worker concurrency defaults #1228 — the PR that raised CELERY_WORKER_CONCURRENCY to 16 for ml/antenna queues. Context for why the jobs queue inherited the same number.
fix(jobs): prevent jobs from hanging in STARTED state with no progress #1234 / fix(jobs): don't fail jobs prematurely, but fail stalled jobs sooner (3 days -> 10 min) #1235 — the prior async-job fix pair (ACK/SREM ordering, Bug C guard, stale-job cutoff). The cancel fix here is in similar territory.

Co-Authored-By: Claude noreply@anthropic.com

Summary by CodeRabbit

Bug Fixes
- Job cancellation now handles different job types more reliably, ensuring consistent state transitions and proper cleanup.
- Task execution is more resilient when workers are unexpectedly lost, with automatic re-delivery of in-flight tasks to maintain reliability.
Chores
- Optimized worker pool scheduling configuration to enable fair task distribution and reduce processing delays.

Addresses three reinforcing causes of run_job head-of-line blocking on the jobs queue (#1323) plus an orthogonal cancel bug exposed by the same investigation. - Enable fair scheduling (CELERY_WORKER_POOL_OPTIMIZATION = "fair") so the master process holds prefetched messages in a shared buffer instead of pre-assigning them to specific prefork children. Long heterogeneous tasks (notably run_job inside filter_processed_images) no longer block newer messages stuck behind them on the same child. - Add acks_late=True + reject_on_worker_lost=True to run_job so a worker SIGKILL/OOM mid-task triggers broker redelivery instead of silently dropping the job. Pairs with an early-guard at the top of run_job that returns cleanly if the Job is already in a terminal state or being cancelled, so redelivery never re-runs side effects. - Fix Job.cancel for ASYNC_API: skip terminate=True on the (likely-done) run_job task — the actual work runs on remote ADC workers via NATS, and cleanup_async_job_if_needed is what stops it. Terminating the local bootstrap was a no-op at best and SIGTERM'd a still-bootstrapping child at worst. INTERNAL / SYNC_API keep terminate=True since their celery task body owns the entire job lifecycle. - Document the optional CELERY_WORKER_CONCURRENCY=4 override on the celeryworker_jobs container (commented out for now) so operators can opt in once -O fair is observed in production. Co-Authored-By: Claude <noreply@anthropic.com>

netlify · 2026-05-27T23:16:34Z

✅ Deploy Preview for antenna-ssec canceled.

Name	Link
🔨 Latest commit	`c8e6c8a`
🔍 Latest deploy log	https://app.netlify.com/projects/antenna-ssec/deploys/6a17971be3a452000810b632

netlify · 2026-05-27T23:16:34Z

✅ Deploy Preview for antenna-preview canceled.

Name	Link
🔨 Latest commit	`c8e6c8a`
🔍 Latest deploy log	https://app.netlify.com/projects/antenna-preview/deploys/6a17971b5a4ff00008e75cd4

coderabbitai · 2026-05-27T23:16:56Z

Warning

Review limit reached

@mihow, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 46 minutes and 41 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9597a54d-4d1a-4e33-b4f8-6d679c8a0109

📥 Commits

Reviewing files that changed from the base of the PR and between f772b71 and c8e6c8a.

📒 Files selected for processing (4)

ami/jobs/models.py
ami/jobs/tasks.py
ami/jobs/tests/test_jobs.py
ami/jobs/tests/test_tasks.py

📝 Walkthrough

Walkthrough

This PR addresses blocking behavior in the Celery jobs queue by unifying job cancellation semantics across dispatch modes, adding broker-safe redelivery handling to the run_job task with early-return guards, and configuring fair worker pool scheduling to prevent head-of-line blocking.

Changes

Job Cancellation and Celery Task Resilience

Layer / File(s)	Summary
Job cancellation refactor for dispatch modes `ami/jobs/models.py`, `ami/jobs/tests/test_jobs.py`	`Job.cancel()` now revokes Celery tasks with `terminate=not is_async_api`, meaning ASYNC_API jobs skip worker termination while SYNC_API/INTERNAL jobs use SIGTERM. Status unconditionally transitions to REVOKED and cleanup is called. Three test cases verify behavior for ASYNC_API, SYNC_API, and jobs without task_id.
Celery task reliability: late acks and early guards `ami/jobs/tasks.py`, `ami/jobs/tests/test_tasks.py`	`run_job` task now uses `acks_late=True` and `reject_on_worker_lost=True` for broker redelivery safety. An early-return guard skips execution if the job is already terminal (REVOKED, SUCCESS) or canceling (CANCELING), preventing side-effect re-runs on redelivered messages. Post-execution refreshes the job and logs conditionally by dispatch mode. Regression tests confirm the guard prevents `Job.run()` from executing in terminal/canceling states.
Worker pool scheduling and configuration `config/settings/base.py`, `docker-compose.worker.yml`	Introduces `CELERY_WORKER_POOL_OPTIMIZATION = "fair"` to enable fair scheduling in prefork pools, reducing head-of-line blocking from long heterogeneous tasks. Documentation comments in docker-compose.worker.yml explain CELERY_WORKER_CONCURRENCY tuning rationale for the jobs queue.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

RolnickLab/antenna#1118: Introduced JobDispatchMode enum and dispatch_mode field on Job, which this PR's cancellation logic now uses to determine Celery task termination behavior.

Suggested labels

backend

Suggested reviewers

carlosgjs

Poem

🐰 A job was stuck, and workers sat idle,

Fair scheduling and acks_late made the queue less spiteful.

With early guards and redelivery's care,

No more head-of-line blocking—freedom in the air! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 70.59% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title accurately summarizes the main objective: improving celery task dispatch and cancellation to prevent stuck jobs, which directly addresses the root causes identified in issue `#1323`.
Linked Issues check	✅ Passed	The code changes directly implement all three primary objectives from issue `#1323`: (A) adds CELERY_WORKER_POOL_OPTIMIZATION='fair' for fair scheduling, (B) adds acks_late/reject_on_worker_lost with status guards for safe redelivery, and (C) provides commented concurrency guidance for future tuning.
Out of Scope Changes check	✅ Passed	All code changes are tightly scoped to addressing the three causes of stuck jobs identified in `#1323`: scheduler config, celery task decorators/guards, cancel semantics, and worker config comments. No unrelated refactoring or feature additions present.
Description check	✅ Passed	The PR description is comprehensive, well-structured, and covers all required sections from the template.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/celery-stuck-jobs

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

This PR improves Celery run_job scheduling and cancellation behavior to reduce stuck-job blast radius and make worker-loss redelivery safer.

Changes:

Enables fair Celery prefork scheduling globally.
Adds late acknowledgement/redelivery settings and an early status guard to run_job.
Changes Job.cancel() behavior for ASYNC_API jobs and adds regression tests.
Adds a documented optional jobs-worker concurrency override.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`config/settings/base.py`	Adds fair worker pool optimization for Celery workers.
`ami/jobs/tasks.py`	Adds late ack/reject-on-worker-lost and early short-circuit logic for `run_job`.
`ami/jobs/models.py`	Updates cancellation behavior for async vs sync/internal jobs.
`ami/jobs/tests/test_tasks.py`	Adds early-guard regression tests for `run_job`.
`ami/jobs/tests/test_jobs.py`	Adds cancellation behavior tests.
`docker-compose.worker.yml`	Documents a possible future jobs-worker concurrency override.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    if job.status in JobState.final_states() or job.status == JobState.CANCELING:
+        job.logger.info(
+            f"Skipping run_job for job {job.pk}: already in status {job.status} "
+            f"(redelivery or cancellation in flight)"
+        )
+        return


            task = run_job.AsyncResult(self.task_id)
            if task:
-                task.revoke(terminate=True)
-            if self.dispatch_mode == JobDispatchMode.ASYNC_API:
-                # For async jobs we need to set the status to revoked here since the task already
-                # finished (it only queues the images).
-                self.status = JobState.REVOKED
-                self.save()
-        else:
-            self.status = JobState.REVOKED
-            self.save()
+                task.revoke(terminate=not is_async_api)
+
+        self.status = JobState.REVOKED


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ami/jobs/tasks.py`:
- Around line 161-170: The pre-run guard using job.status /
JobState.final_states() is insufficient for ASYNC_API jobs because cancellation
may occur after the initial check but before dispatch; to fix, add a second
status refresh and guard immediately before the async dispatch call (right
before queue_images_to_nats) by reloading the Job from the DB (e.g., call the
model refresh/get by PK) and aborting the task (return) if the reloaded
job.status is JobState.CANCELING or in JobState.final_states(), logging a
similar skip message; ensure you reference the same job PK/logger and perform
this check right before queue_images_to_nats to avoid enqueuing work for
canceled jobs.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ed9ca82f-4b76-42e9-ae6e-290f6abf61b2

📥 Commits

Reviewing files that changed from the base of the PR and between f585ddc and f772b71.

📒 Files selected for processing (6)

ami/jobs/models.py
ami/jobs/tasks.py
ami/jobs/tests/test_jobs.py
ami/jobs/tests/test_tasks.py
config/settings/base.py
docker-compose.worker.yml

coderabbitai · 2026-05-27T23:22:27Z

+    # Early-guard: under acks_late, the broker may redeliver this message after a
+    # worker SIGKILL/OOM, and Job.cancel() may also flip status to CANCELING /
+    # REVOKED while the message sits in the prefetch buffer. Don't re-run a job
+    # that's already settled or being torn down.
+    if job.status in JobState.final_states() or job.status == JobState.CANCELING:
+        job.logger.info(
+            f"Skipping run_job for job {job.pk}: already in status {job.status} "
+            f"(redelivery or cancellation in flight)"
+        )
+        return


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Entry-only cancel guard is race-prone for ASYNC_API jobs.

Lines 165-170 guard only before job.run(). If cancel happens after that check, the task can still reach async dispatch and enqueue work under a canceled job because ASYNC_API cancel no longer terminates the worker process. Add a second DB refresh/status check immediately before async dispatch (e.g., right before queue_images_to_nats) and abort when status is CANCELING/terminal.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ami/jobs/tasks.py` around lines 161 - 170, The pre-run guard using job.status / JobState.final_states() is insufficient for ASYNC_API jobs because cancellation may occur after the initial check but before dispatch; to fix, add a second status refresh and guard immediately before the async dispatch call (right before queue_images_to_nats) by reloading the Job from the DB (e.g., call the model refresh/get by PK) and aborting the task (return) if the reloaded job.status is JobState.CANCELING or in JobState.final_states(), logging a similar skip message; ensure you reference the same job PK/logger and perform this check right before queue_images_to_nats to avoid enqueuing work for canceled jobs.

Review on #1324 surfaced two races that left the early-guard non-functional in production: 1. ``task_prerun`` (``pre_update_job_status``) wrote PENDING to the row before the ``run_job`` body inspected status. A canceled or redelivered message therefore had its REVOKED/CANCELING overwritten with PENDING, and the early-guard added in the parent commit never tripped. The existing tests passed only because they invoked ``run_job.apply(args=[…])`` while production uses ``kwargs={"job_id": …}`` — under args, the prerun handler raised ``KeyError`` and exited silently. Switching the tests to ``kwargs=`` reproduces the production code path; the prerun handler now short-circuits when ``Job.is_settled()`` is true, preserving the status the early-guard reads next. 2. For ASYNC_API jobs ``Job.cancel()`` revokes without ``terminate=True``, marks the row REVOKED, and tears down the NATS stream + Redis state. ``MLJob.run`` running in a worker that's still inside ``collect_images`` (slow for large collections) would then proceed to ``queue_images_to_nats`` and recreate the stream the cancel just deleted, dispatching real GPU work to ADC for a revoked job; the results came back to no Redis state and ``_fail_job`` silently overwrote REVOKED with FAILURE. The bootstrap now checks ``Job.status`` (via a values-only read so the in-memory ``progress`` mutations don't clobber the cancel's REVOKED) right after the collect stage and bails out before any dispatch. Adds ``Job.is_settled()`` to centralize the "terminal or being torn down" predicate that ``run_job``'s early-guard, the prerun handler, ``_fail_job``, and the bootstrap guard all needed. Adds two regression tests: one for the prerun-then-guard chain, one for the cancel-during-bootstrap race. Co-Authored-By: Claude <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 27, 2026 23:16

Copilot started reviewing on behalf of mihow May 27, 2026 23:16 View session

Copilot AI reviewed May 27, 2026

View reviewed changes

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve celery task dispatch and cancellation to prevent stuck jobs#1324

Improve celery task dispatch and cancellation to prevent stuck jobs#1324
mihow wants to merge 2 commits into
mainfrom
fix/celery-stuck-jobs

mihow commented May 27, 2026 •

edited

Loading

Uh oh!

netlify Bot commented May 27, 2026 •

edited

Loading

Uh oh!

netlify Bot commented May 27, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 27, 2026 •

edited

Loading

Review limit reached

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

Copilot AI left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mihow commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes for users

Plain-language summary

What's in this PR

A note on the counter-intuitive concurrency knob

What the three fixes are actually doing

What's verified

What still needs verifying in staging/prod

Related

Summary by CodeRabbit

Uh oh!

netlify Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for antenna-ssec canceled.

Uh oh!

netlify Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for antenna-preview canceled.

Uh oh!

coderabbitai Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mihow commented May 27, 2026 •

edited

Loading

netlify Bot commented May 27, 2026 •

edited

Loading

netlify Bot commented May 27, 2026 •

edited

Loading

coderabbitai Bot commented May 27, 2026 •

edited

Loading