NATS task queue (stream + consumer) is not cleaned up when an async_api job is revoked or re-started. Two distinct lifecycle gaps:
1. Job marked REVOKED → stream + consumer left behind
The job-watchdog / healthcheck flips a stuck job to REVOKED, but the per-job NATS stream and consumer are untouched. Messages sit in the stream indefinitely.
2. User clicks "Start" on a non-running job → tasks re-published, stream not purged
Re-starting (from REVOKED, FAILURE, or CANCELLED) re-runs the collect stage and publishes a fresh round of tasks on top of whatever is still in the stream. The existing consumer's delivered_seq continues from where it was, so:
Stream message count roughly doubles (or worse on multiple restarts).
Already-ACKed tasks from the previous attempt that were re-published become duplicates that workers will process again, producing duplicate detections / classifications.
Stage params are overwritten with fresh values, which manifests in the UI as the "Processed" counter visibly resetting to 0 while the user is watching.
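The restart-without-purge failure mode can be sketched with a toy in-memory model (an append-only log plus a consumer cursor — not the real async_api code; task ids and counts are illustrative):

```python
# Toy model: a stream is an append-only message log, the consumer just
# advances delivered_seq. Re-publishing without purging means everything
# already in the log gets delivered again once the consumer catches up.
stream = []                 # append-only message log
delivered_seq = 0           # consumer's position in the log
processed = []              # task ids handed to workers

def publish(tasks):
    stream.extend(tasks)

def drain(n):
    """Deliver up to n messages from the consumer's current position."""
    global delivered_seq
    for _ in range(n):
        if delivered_seq < len(stream):
            processed.append(stream[delivered_seq])
            delivered_seq += 1

publish([f"task_{i}" for i in range(5)])   # first collect round
drain(3)                                    # workers ACK 3 tasks, job stalls

publish([f"task_{i}" for i in range(5)])   # "Start" re-publishes on top
drain(10)                                   # consumer continues from seq 3

assert len(stream) == 10                    # message count doubled
# Every task — including the three already ACKed in round one — is
# processed twice, because the round-two copies sit past delivered_seq.
assert all(processed.count(t) == 2 for t in set(processed))
```

The ACK state of round one offers no protection: the re-published copies are new messages with new sequence numbers, so the consumer delivers them regardless.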
Concrete observation
A real job hit both gaps in sequence today:
Job started; collect completed with 44,286 tasks → stream job_<N> received 44,286 published messages.
The tasks were never delivered because of the /next filter bug (separate ticket: "CANCELLED jobs leak through /next filter, starve newer async_api jobs" #1282). The healthcheck flipped the job to REVOKED after ~18 min, with the stream + consumer still present and 44,286 pending.
After the filter bug was unblocked manually, the user clicked "Start". A second collect ran, found 43,597 tasks (689 source images had been soft-deleted in between), and published them on top of the existing 44,286.
Resulting NATS state: messages = 87,883, delivered_seq = wherever the consumer had reached; roughly half the workload was now duplicated across the two rounds.
The UI showed the process stage counter snap from ~3,000 back to 0 and start counting up again, while detection / classification counts ran ahead of the processed-image count (reflecting re-detection of already-processed images).
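The reported numbers are internally consistent — the second round is the first minus the soft-deleted images, and the stream total is the sum of the two rounds:

```python
# Sanity check of the figures in the observation above.
first_round = 44_286    # tasks from the first collect
soft_deleted = 689      # source images soft-deleted between the two runs
second_round = first_round - soft_deleted

assert second_round == 43_597                  # tasks from the second collect
assert first_round + second_round == 87_883    # messages left in the stream
```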
Suggested cleanup hooks
On status transition into REVOKED / FAILURE / CANCELLED (whether by user, watchdog, or admin tooling): delete the consumer, purge or delete the stream. After that point the job is terminal; nothing should still be drainable.
On Start action for a non-running job: idempotently delete any leftover stream + consumer for that job ID before publishing the new round, so the new run begins from a known-empty queue. Reset stage params to fresh defaults at the same point so the UI counter starts cleanly at 0 rather than briefly reflecting stale state.
Both code paths should tolerate "stream/consumer doesn't exist" without raising — the cleanup must be safe to run when there's nothing to clean.
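Both hooks reduce to one idempotent helper. A minimal sketch, assuming nats-py-style management calls (delete_consumer, delete_stream) and the job_<N> stream naming from the observation above; the consumer name "workers", the job id, and the FakeJS test double are placeholders, not the real async_api code:

```python
# Sketch of the idempotent cleanup hook. In real code, `js` would be nats-py's
# JetStreamManager and NotFoundError would be nats.js.errors.NotFoundError;
# local stand-ins keep the sketch self-contained.
import asyncio

class NotFoundError(Exception):
    """Stand-in for nats.js.errors.NotFoundError."""

async def cleanup_job_queue(js, job_id, consumer="workers"):
    """Delete the per-job consumer and stream, tolerating absence.

    Safe to run on every terminal transition (REVOKED / FAILURE / CANCELLED)
    and again right before Start re-publishes: "nothing to clean" is not an
    error, so repeated or concurrent invocations are harmless.
    """
    stream = f"job_{job_id}"
    try:
        await js.delete_consumer(stream, consumer)
    except NotFoundError:
        pass  # consumer (or the whole stream) is already gone
    try:
        await js.delete_stream(stream)
    except NotFoundError:
        pass  # stream is already gone

class FakeJS:
    """Minimal in-memory double for the two management calls used above."""
    def __init__(self, streams):
        self.streams = {s: set(cs) for s, cs in streams.items()}
    async def delete_consumer(self, stream, consumer):
        if stream not in self.streams or consumer not in self.streams[stream]:
            raise NotFoundError()
        self.streams[stream].discard(consumer)
    async def delete_stream(self, name):
        if name not in self.streams:
            raise NotFoundError()
        del self.streams[name]

js = FakeJS({"job_41": {"workers"}})
asyncio.run(cleanup_job_queue(js, 41))  # removes the leftovers
asyncio.run(cleanup_job_queue(js, 41))  # second run: nothing left, no raise
```

If the publish path recreates the stream anyway, deleting rather than purging lets the terminal-transition hook and the Start hook share this one helper.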
Why this matters
Duplicate task processing wastes GPU time and produces duplicate Detection / Classification rows that have to be reconciled downstream.
The visible counter reset confuses operators about whether work was lost.
Long-lived REVOKED streams + consumers accumulate on the NATS server and obscure real triage signal (num_pending counts from dead jobs).
Out of scope
A separate "resume" action that does keep the existing stream + consumer would be a useful follow-up, but the current "Start" button has restart-from-scratch semantics and the cleanup-then-publish sequence matches that intent.