Clean up task queue when job is revoked or re-started, fix duplicate tasks #1283

@mihow

Description

Bug

The NATS task queue (stream + consumer) is not cleaned up when an async_api job is revoked or re-started. There are two distinct lifecycle gaps:

1. Job marked REVOKED → stream + consumer left behind

The job-watchdog / healthcheck flips a stuck job to REVOKED, but the per-job NATS stream and consumer are untouched. Messages sit in the stream indefinitely.

2. User clicks "Start" on a non-running job → tasks re-published, stream not purged

Re-starting (from REVOKED, FAILURE, or CANCELLED) re-runs the collect stage and publishes a fresh round of tasks on top of whatever is still in the stream. The existing consumer's delivered_seq continues from where it was, so:

  • Stream message count roughly doubles (or worse after multiple restarts); the inspection sketch after this list shows how that surfaces in the JetStream counters.
  • Tasks that were already ACKed in the previous attempt are re-published as duplicates that workers process again, producing duplicate detections / classifications.
  • Stage params are overwritten with fresh values, which manifests in the UI as the "Processed" counter visibly resetting to 0 while the user is watching.
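
For context, this duplication is visible directly in the JetStream counters. A rough inspection sketch, assuming the nats-py client and a job_<id> naming scheme for both the stream and the durable consumer (the actual names in the codebase may differ):

```python
import asyncio

import nats


async def inspect_job_queue(job_id: int, nats_url: str = "nats://localhost:4222") -> None:
    """Print the JetStream counters for one async_api job's task queue.

    Assumes the per-job stream and durable consumer are both named
    f"job_{job_id}"; adjust to whatever naming the publish code actually uses.
    """
    nc = await nats.connect(nats_url)
    try:
        jsm = nc.jsm()  # JetStream management context

        stream = await jsm.stream_info(f"job_{job_id}")
        consumer = await jsm.consumer_info(f"job_{job_id}", f"job_{job_id}")

        # After a restart without cleanup, `messages` holds both rounds of
        # published tasks, while the consumer's delivery/ack counters still
        # continue from wherever the first attempt reached.
        print("stream messages:", stream.state.messages)
        print("delivered stream seq:", consumer.delivered.stream_seq)
        print("undelivered (num_pending):", consumer.num_pending)
        print("unacked (num_ack_pending):", consumer.num_ack_pending)
    finally:
        await nc.close()


if __name__ == "__main__":
    asyncio.run(inspect_job_queue(123))
```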

Concrete observation

A real job hit both gaps in sequence today:

  1. Job started, collect completed with 44,286 tasks → stream job_<N> published 44,286 messages.
  2. Workers were starved by an unrelated /next filter bug (separate ticket: CANCELLED jobs leak through /next filter, starve newer async_api jobs #1282), so nothing was ever delivered. The healthcheck flipped the job to REVOKED after ~18 min. Stream + consumer were still present, with all 44,286 messages pending.
  3. After the upstream filter bug was manually worked around, the user clicked "Start". A second collect ran, found 43,597 tasks (689 source images had been soft-deleted in the meantime), and published them on top of the existing 44,286.
  4. Resulting NATS state: messages = 87,883, delivered_seq = wherever the consumer had reached, ~half the workload now duplicated across the two rounds.
  5. The UI showed the process-stage counter snap from ~3,000 back to 0 and start counting up again, while detection / classification counts ran ahead of the processed-image count (reflecting re-detection of already-processed images).

Suggested cleanup hooks

  • On status transition into REVOKED / FAILURE / CANCELLED (whether by user, watchdog, or admin tooling): delete the consumer and purge or delete the stream. After that point the job is terminal; nothing should still be drainable.
  • On Start action for a non-running job: idempotently delete any leftover stream + consumer for that job ID before publishing the new round, so the new run begins from a known-empty queue. Reset stage params to fresh defaults at the same point so the UI counter starts cleanly at 0 rather than briefly reflecting stale state.
  • Both code paths should tolerate a missing stream or consumer without raising; the cleanup must be safe to run when there is nothing to clean. A rough sketch of what that shared cleanup could look like follows below.
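
A minimal sketch of the shared cleanup both hooks could call, assuming the nats-py client and the same job_<id> naming as above (purge_job_queue and its placement are illustrative, not existing code):

```python
from nats.js.errors import NotFoundError


async def purge_job_queue(nc, job_id: int, delete_stream: bool = True) -> None:
    """Idempotently remove the per-job NATS consumer and stream.

    Intended to be called from both hooks: the transition into a terminal
    status (REVOKED / FAILURE / CANCELLED) and the Start action on a
    non-running job, right before the new round of tasks is published.
    `nc` is a connected nats-py client; the stream and durable consumer
    are assumed to both be named f"job_{job_id}".
    """
    jsm = nc.jsm()
    name = f"job_{job_id}"

    # Delete the consumer first so nothing keeps draining the stream
    # while it is being removed.
    try:
        await jsm.delete_consumer(name, name)
    except NotFoundError:
        pass  # already gone: nothing to clean up

    # Then drop (or just purge) the stream so leftover messages from the
    # previous attempt can never be re-delivered.
    try:
        if delete_stream:
            await jsm.delete_stream(name)
        else:
            await jsm.purge_stream(name)
    except NotFoundError:
        pass
```

Called at the top of the Start handler, immediately before resetting stage params and publishing the new round, this gives the restart a known-empty queue; wired into the terminal status transition, it keeps dead jobs from accumulating streams and consumers on the server.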

Why this matters

  • Duplicate task processing wastes GPU time and produces duplicate Detection / Classification rows that have to be reconciled downstream.
  • The visible counter reset confuses operators about whether work was lost.
  • Long-lived REVOKED streams + consumers accumulate on the NATS server and obscure real triage signal (num_pending counts from dead jobs).

Out of scope

A separate "resume" action that does keep the existing stream + consumer would be a useful follow-up, but the current "Start" button has restart-from-scratch semantics and the cleanup-then-publish sequence matches that intent.
