PSv2: Implement queue clean-up upon job completion #1113
mihow merged 7 commits into RolnickLab:main
Walkthrough
This pull request implements cleanup of async job resources (NATS streams/consumers and Redis keys) when ML jobs complete, fail, or are revoked. The core cleanup logic is refactored into a unified function, integrated into task status handlers, and validated with comprehensive tests covering all completion scenarios.
Sequence Diagram
sequenceDiagram
participant Job as Job Completion Event
participant Tasks as tasks.py<br/>(Orchestration)
participant Cleanup as jobs.py<br/>(cleanup_async_job_resources)
participant Redis as TaskStateManager<br/>(Redis)
participant NATS as TaskQueueManager<br/>(NATS)
Note over Job,NATS: Completion triggered by: progress==100% OR failure OR revocation
Job->>Tasks: _update_job_progress() / update_job_status() / update_job_failure()
Tasks->>Tasks: Check if job_type=="ml" & async_pipeline_workers enabled
alt Cleanup needed
Tasks->>Cleanup: _cleanup_job_if_needed(job)
Cleanup->>Redis: cleanup() - remove task state keys
activate Redis
Redis-->>Cleanup: redis_success (bool)
deactivate Redis
Cleanup->>NATS: TaskQueueManager context - delete streams/consumers
activate NATS
NATS-->>Cleanup: nats_success (bool)
deactivate NATS
Cleanup-->>Tasks: return redis_success AND nats_success
else No cleanup needed
Tasks->>Tasks: Skip cleanup
end
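
The flow in the diagram can be sketched roughly as follows. This is a minimal illustration, not the code from `ami/ml/orchestration/jobs.py`: `FakeStateManager` and `FakeQueueManager` are hypothetical stand-ins for `TaskStateManager` and `TaskQueueManager`, and the method name `cleanup_job_resources` is an assumption. What it tries to show is that the two cleanups are attempted independently and the result is the logical AND of both.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class FakeStateManager:
    """Hypothetical stand-in for TaskStateManager (Redis)."""

    def __init__(self, fail: bool = False):
        self.fail = fail
        self.cleaned = False

    def cleanup(self) -> None:
        if self.fail:
            raise ConnectionError("redis unavailable")
        self.cleaned = True


class FakeQueueManager:
    """Hypothetical stand-in for TaskQueueManager (NATS), used as an async context manager."""

    def __init__(self, fail: bool = False):
        self.fail = fail
        self.deleted = []

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False  # do not swallow exceptions

    async def cleanup_job_resources(self, job_id: int) -> bool:
        # Assumed method name; stands for deleting the job's stream and consumer.
        if self.fail:
            raise ConnectionError("nats unavailable")
        self.deleted.append(job_id)
        return True


def cleanup_async_job_resources_sketch(job_id, state, queue) -> bool:
    """Attempt Redis cleanup, then NATS cleanup; report True only if both succeed."""
    redis_success = False
    nats_success = False

    try:
        state.cleanup()
        redis_success = True
    except Exception:
        logger.exception("Error cleaning up Redis state for job %s", job_id)

    async def _cleanup_nats() -> bool:
        async with queue as manager:
            return await manager.cleanup_job_resources(job_id)

    # A Redis failure above does not prevent the NATS cleanup from running.
    try:
        nats_success = asyncio.run(_cleanup_nats())
    except Exception:
        logger.exception("Error cleaning up NATS resources for job %s", job_id)

    return redis_success and nats_success
```

Keeping the two `try` blocks separate means a failure on one backend still lets the other be cleaned, at the cost of the caller only seeing a single combined boolean.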
Estimated code review effort: 3 (Moderate), ~25 minutes
Pull request overview
This PR implements automatic cleanup of NATS JetStream and Redis resources when async ML jobs complete, fail, or are cancelled. This addresses issue #1083 by ensuring that temporary resources used for job orchestration are properly removed after jobs finish.
Changes:
- Renamed `cleanup_nats_resources` to `cleanup_async_job_resources` to handle both NATS and Redis cleanup
- Integrated cleanup into job lifecycle at three points: completion, failure, and revocation
- Added comprehensive integration tests covering all three cleanup scenarios
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| ami/ml/orchestration/jobs.py | Enhanced cleanup function to handle both Redis and NATS resources, with proper error handling and logging |
| ami/jobs/tasks.py | Integrated cleanup calls in job completion, failure, and revocation handlers with feature flag checks |
| ami/ml/orchestration/test_cleanup.py | Added comprehensive integration tests verifying cleanup works correctly in all three scenarios |
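
As a rough illustration of the feature flag checks mentioned for `ami/jobs/tasks.py`, the gating could look like the sketch below. `JobStub`, `should_cleanup`, and the boolean `async_pipeline_workers` attribute are hypothetical; how the real flag is looked up is not shown in this PR page, so that part is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class JobStub:
    """Hypothetical minimal job object; the real Job model has many more fields."""

    pk: int
    job_type: str
    async_pipeline_workers: bool  # assumed representation of the feature flag


def should_cleanup(job: JobStub) -> bool:
    # Cleanup only applies to ML jobs that ran through the async pipeline.
    return job.job_type == "ml" and job.async_pipeline_workers


def cleanup_job_if_needed(job: JobStub, cleanup: Callable[[JobStub], object]) -> bool:
    """Call `cleanup(job)` when the job qualifies; return whether it was called."""
    if not should_cleanup(job):
        return False
    cleanup(job)
    return True
```

Routing the completion, failure, and revocation handlers through one helper like this keeps the eligibility check in a single place.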
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@ami/ml/orchestration/test_cleanup.py`:
- Around line 118-159: In _verify_resources_cleaned, change the broad exception
handling inside the async check_nats_resources (which calls
manager.js.stream_info and manager.js.consumer_info via TaskQueueManager) to
only treat nats.js.errors.NotFoundError as "not found" (set
stream_exists/consumer_exists = False) and re-raise any other exceptions so
connection/infra errors fail the test; import or reference NotFoundError from
nats.js.errors and use it in the except clauses for the respective stream and
consumer checks.
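
The pattern being requested can be sketched as below. To keep the example runnable without a NATS server, `NotFoundError` here is a local stand-in for `nats.js.errors.NotFoundError`, and `resource_exists` and the `lookup` callables are hypothetical helpers, not code from the test file.

```python
import asyncio


class NotFoundError(Exception):
    """Local stand-in for nats.js.errors.NotFoundError (assumption for this sketch)."""


async def resource_exists(lookup, name: str) -> bool:
    """Return False only when the lookup reports 'not found'; let other errors propagate."""
    try:
        await lookup(name)
    except NotFoundError:
        return False
    return True


async def found(name):
    return {"name": name}


async def missing(name):
    raise NotFoundError(name)


async def broken(name):
    # Simulates an infrastructure problem that should fail the test, not pass it.
    raise ConnectionError("nats unavailable")
```

With a bare `except Exception`, the `broken` case would be reported as "resource cleaned up", which is exactly the false positive the review comment is trying to avoid.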
🧹 Nitpick comments (1)
ami/ml/orchestration/jobs.py (1)
33-53: Capture stack traces on cleanup failures for easier diagnosis.
`job.logger.error` drops the traceback; `job.logger.exception` preserves context without changing behavior.

🔧 Suggested update

    -    except Exception as e:
    -        job.logger.error(f"Error cleaning up Redis state for job {job.pk}: {e}")
    +    except Exception:
    +        job.logger.exception(f"Error cleaning up Redis state for job {job.pk}")
     ...
    -    except Exception as e:
    -        job.logger.error(f"Error cleaning up NATS resources for job {job.pk}: {e}")
    +    except Exception:
    +        job.logger.exception(f"Error cleaning up NATS resources for job {job.pk}")
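
The difference between the two logging calls can be checked with the standard library alone. The sketch below uses a plain `logging.Logger` writing to an in-memory stream; `capture` is a hypothetical helper, and `job.logger` is assumed to behave like a standard logger.

```python
import io
import logging


def capture(use_exception: bool) -> str:
    """Log a caught error with .exception() or .error() and return the emitted text."""
    stream = io.StringIO()
    logger = logging.getLogger(f"cleanup_demo_{use_exception}")
    logger.setLevel(logging.ERROR)
    logger.propagate = False  # keep output confined to our in-memory handler
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    try:
        raise RuntimeError("redis unavailable")
    except Exception as exc:
        if use_exception:
            # Logs at ERROR level and attaches the active traceback.
            logger.exception("Error cleaning up Redis state for job 1")
        else:
            # Logs at ERROR level with only the stringified exception.
            logger.error(f"Error cleaning up Redis state for job 1: {exc}")
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()
```

Both calls log at the same level, so handlers and alerting thresholds are unaffected; only the amount of diagnostic context changes.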

Summary
Performs clean-up of NATS and Redis resources used by async jobs
Related Issues
Closes #1083
Testing
NATS dashboard with job running:
After job finished:
Redis with job running:
With job complete:
Job logs: