Three fixes for reopened issues #90, #91, #92:

**#90: Jobs stuck PENDING with Reason=Priority despite idle nodes**

Root cause: Job::new() set the initial pending_reason to Priority before the scheduler evaluated the job. If the scheduler interval was > 1s, users saw a misleading "Priority" reason.

Fix: Set the initial pending_reason to None. The scheduler loop's update_pending_reasons() sets the real reason (Priority, Resources, etc.) on the first evaluation cycle.

Before: squeue shows (Priority) immediately after sbatch, even with idle nodes. This misleads users into thinking there is a priority issue.
After: squeue shows (None) briefly, then the job transitions to Running once the scheduler evaluates and dispatches it.

**#91: Container jobs stuck in pending (Resources)**

Root cause: When dispatch to the agent failed (e.g., container image not found), the job was immediately marked Failed. Users had no chance to fix the problem (e.g., import the image) and retry.

Fix: Requeue the job back to Pending on dispatch failure instead of marking it Failed. Added a requeue_job() method that unconditionally returns the job to the Pending state.

Before: Container image not found → job immediately Failed.
After: Container image not found → job requeued to Pending, scheduler retries on the next cycle. The user can import the image in the meantime.

**#92: salloc hangs with no interactive I/O**

Root cause: salloc polled indefinitely for the job's RUNNING state with no timeout or progress feedback. If the scheduler couldn't place the job (related to #90), salloc hung forever with no indication why.

Fix: Added a 5-minute timeout with automatic job cancellation, and a pending reason display so users see why the job isn't starting.

Before: salloc hangs indefinitely with no output if the job can't start.
After: salloc shows "job N pending (Resources)" progress messages and times out after 5 minutes with a clear error.

Also fixed maybe_requeue() to set PendingReason::None instead of Priority on requeue (same root cause as #90).
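The requeue-on-dispatch-failure behavior described for #91 can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Job` and `JobState` shapes, the `dispatch` helper, and the closure standing in for the agent are all hypothetical, and resetting the reason to `PendingReason::None` inside `requeue_job()` is an assumption that mirrors the `maybe_requeue()` change.

```rust
// Hypothetical, simplified types; the real definitions are not shown in this PR text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JobState {
    Pending,
    Running,
    Failed, // what a dispatch failure used to produce
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PendingReason {
    None,
    Priority,
}

#[derive(Debug)]
struct Job {
    id: u32,
    state: JobState,
    pending_reason: PendingReason,
}

impl Job {
    /// Unconditionally return the job to Pending.
    /// Assumption: the reason is reset so the scheduler recomputes it.
    fn requeue_job(&mut self) {
        self.state = JobState::Pending;
        self.pending_reason = PendingReason::None;
    }
}

/// Try to hand the job to an agent (modelled here as a closure).
/// On failure the job is requeued instead of being marked Failed.
fn dispatch(job: &mut Job, agent: impl FnOnce(&Job) -> Result<(), String>) {
    match agent(&*job) {
        Ok(()) => job.state = JobState::Running,
        Err(_) => job.requeue_job(),
    }
}
```

With this shape, a failed dispatch (for example, a missing container image) leaves the job eligible for the next scheduler cycle, and a later successful dispatch moves it to Running.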
Tests added:
- t07_50: Initial pending reason is None (not Priority)
- t07_51: Held jobs still get PendingReason::Held
- t07_52: Simple job schedules immediately on idle nodes
- t07_53: Container jobs schedule same as bare-process jobs

808 tests, 0 failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This was referenced Apr 18, 2026
Summary
Fixes three reopened bugs: #90, #91, #92.
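Issue #90: Jobs stuck PENDING with Reason=Priority
Root cause: `Job::new()` set the initial `pending_reason` to `Priority` before the scheduler evaluated the job.
Before: `squeue` shows `(Priority)` immediately after `sbatch`, even with idle nodes.
After: `squeue` shows `(None)` briefly, then the job transitions to Running once the scheduler evaluates and dispatches it.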
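A rough sketch of the #90 idea: a new job starts with no pending reason, and only a scheduler pass assigns a real one. The types and the FIFO policy here are assumptions for illustration only. In particular, the rule that the first job that does not fit gets `Resources` and every job behind it gets `Priority` is a simplification, not the policy implemented in this PR.

```rust
// Hypothetical, simplified types; the real definitions are not shown in this PR text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PendingReason {
    None,
    Priority,
    Resources,
}

#[derive(Debug)]
struct Job {
    id: u32,
    cpus: u32,
    pending_reason: PendingReason,
}

impl Job {
    fn new(id: u32, cpus: u32) -> Self {
        // No reason is claimed until the scheduler has actually evaluated the job.
        Job { id, cpus, pending_reason: PendingReason::None }
    }
}

/// Walk the queue in order and assign a reason to each job.
/// Assumed policy: jobs that fit in the remaining idle CPUs keep `None`
/// (they would be dispatched); the first job that does not fit is marked
/// `Resources`; everything queued behind it is marked `Priority`.
fn update_pending_reasons(queue: &mut [Job], mut idle_cpus: u32) {
    let mut blocked = false;
    for job in queue.iter_mut() {
        job.pending_reason = if blocked {
            PendingReason::Priority
        } else if job.cpus > idle_cpus {
            blocked = true;
            PendingReason::Resources
        } else {
            idle_cpus -= job.cpus;
            PendingReason::None
        };
    }
}
```

The point of the design is that `Priority` only ever appears as the result of an evaluation, never as a default that users might see before the first scheduler cycle.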
Issue #91: Container jobs stuck in PENDING (Resources)
Root cause: When the agent rejected a dispatch (e.g., container image not found), the job was immediately marked `Failed`, with no retry.
Before: Container image not found -> job immediately `Failed`.
After: Job requeued to `Pending` so the user can fix the issue (e.g., import the image) and the scheduler retries.
Issue #92: salloc hangs with no interactive I/O
Root cause: `salloc` polled indefinitely for the RUNNING state with no timeout or feedback.
Before: `salloc` hangs forever with no output.
After: `salloc` shows the pending reason and times out after 5 minutes with a clear error.
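The bounded wait described for #92 might look something like the sketch below: poll until the job is running, print the pending reason as progress, and cancel the job once a deadline passes. Everything named here (`JobStatus`, `WaitError`, `wait_for_running`, and the `query` and `cancel` closures standing in for controller calls) is hypothetical. Printing a progress line only when the reason changes is also an assumption, not something this PR states.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Hypothetical status as reported by the controller.
#[derive(Debug, Clone, PartialEq, Eq)]
enum JobStatus {
    Pending { reason: String },
    Running,
}

#[derive(Debug, PartialEq, Eq)]
enum WaitError {
    TimedOut { job_id: u32 },
}

/// Poll until the job is running or `timeout` elapses.
/// On timeout the job is cancelled and an error is returned.
fn wait_for_running(
    job_id: u32,
    timeout: Duration,
    poll_interval: Duration,
    mut query: impl FnMut(u32) -> JobStatus,
    mut cancel: impl FnMut(u32),
) -> Result<(), WaitError> {
    let deadline = Instant::now() + timeout;
    let mut last_reason: Option<String> = None;
    loop {
        match query(job_id) {
            JobStatus::Running => return Ok(()),
            JobStatus::Pending { reason } => {
                // Report progress, but only when the reason changes.
                if last_reason.as_deref() != Some(reason.as_str()) {
                    eprintln!("job {} pending ({})", job_id, reason);
                    last_reason = Some(reason);
                }
            }
        }
        if Instant::now() >= deadline {
            // Avoid leaving an orphaned allocation request behind.
            cancel(job_id);
            return Err(WaitError::TimedOut { job_id });
        }
        sleep(poll_interval);
    }
}
```

A caller matching the 5-minute limit above would pass `Duration::from_secs(300)` as `timeout`; the poll interval is left to the caller because this PR does not state one.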
Test plan
@amd-kmundiga, please verify these fixes resolve the issues you reported. The key changes:
- Pending reason shows `(None)` initially instead of `(Priority)`; the scheduler sets the real reason after evaluating
- `salloc` shows progress and times out instead of hanging

🤖 Generated with Claude Code