Fix scheduler pending reason, dispatch requeue, and salloc timeout (#90, #91, #92)#94

Merged
powderluv merged 3 commits into main from
users/powderluv/fix-issues-90-91-92
Apr 16, 2026

@powderluv
Collaborator

Summary

Fixes three reopened bugs: #90, #91, #92.

Issue #90: Jobs stuck PENDING with Reason=Priority

Root cause: Job::new() set initial pending_reason to Priority before the scheduler evaluated the job.

Before:

$ sbatch -N 1 test.sh
Submitted batch job 1
$ squeue
  JOBID  PARTITION  NAME  USER  ST  TIME  NODES  NODELIST(REASON)
      1    default  test   nod  PD  0:00      1  (Priority)      <-- WRONG

After:

$ sbatch -N 1 test.sh
Submitted batch job 1
$ squeue
  JOBID  PARTITION  NAME  USER  ST  TIME  NODES  NODELIST(REASON)
      1    default  test   nod  PD  0:00      1  (None)           <-- correct
# Next scheduler tick: job transitions to Running
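
A minimal sketch of the idea behind this fix, not the actual spur code: a new job starts with no pending reason, and a later scheduler pass assigns the real one. The struct fields, the `held` flag, and the signature of `update_pending_reasons` here are assumptions made for illustration; only the names `Job::new()`, `pending_reason`, `PendingReason`, and `update_pending_reasons()` come from this PR.

```rust
// Hypothetical, simplified types; the real spur definitions may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PendingReason {
    None,
    Priority,
    Resources,
    Held,
}

#[derive(Debug)]
pub struct Job {
    pub id: u32,
    pub nodes_requested: u32,
    pub held: bool,
    pub pending_reason: PendingReason,
}

impl Job {
    /// A freshly submitted job has not been evaluated yet, so it starts
    /// with `None` (or `Held` if submitted on hold), never `Priority`.
    pub fn new(id: u32, nodes_requested: u32, held: bool) -> Self {
        let pending_reason = if held {
            PendingReason::Held
        } else {
            PendingReason::None
        };
        Job { id, nodes_requested, held, pending_reason }
    }
}

/// Assumed shape of the scheduler pass: walk the queue in order and set
/// the real reason for each pending job, given the number of idle nodes.
pub fn update_pending_reasons(queue: &mut [Job], idle_nodes: u32) {
    let mut free = idle_nodes;
    let mut blocked = false;
    for job in queue.iter_mut() {
        job.pending_reason = if job.held {
            PendingReason::Held
        } else if blocked {
            // An earlier job is already waiting for nodes.
            PendingReason::Priority
        } else if job.nodes_requested > free {
            blocked = true;
            PendingReason::Resources
        } else {
            // The job fits and would be dispatched on this cycle.
            free -= job.nodes_requested;
            PendingReason::None
        };
    }
}
```

In this sketch, `squeue` run between submission and the first scheduler pass would see `None`, which matches the "After" output above.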

Issue #91: Container jobs stuck in PENDING (Resources)

Root cause: When the agent rejected a dispatch (e.g., container image not found), the job was immediately marked Failed, with no retry.

Before: Container image not found -> job immediately Failed
After: Job is requeued to Pending so the user can fix the issue (e.g., import the image); the scheduler retries on the next cycle
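
The requeue-on-rejection behavior can be sketched roughly as follows. This is an illustration under stated assumptions, not the spur implementation: `JobState`, `DispatchError`, the `dispatch` function, and the closure standing in for the agent call are hypothetical, and placing `requeue_job()` on `Job` is a guess (the PR only says the method unconditionally returns the job to Pending). Resetting the reason to `None` in `requeue_job()` is also an assumption, mirroring what the PR does in `maybe_requeue()`.

```rust
// Hypothetical, simplified types; the real spur definitions may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobState {
    Pending,
    Running,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PendingReason {
    None,
    Priority,
    Resources,
    Held,
}

#[derive(Debug)]
pub struct Job {
    pub id: u32,
    pub state: JobState,
    pub pending_reason: PendingReason,
}

impl Job {
    /// Unconditionally return the job to Pending. The reason is reset to
    /// `None` (an assumption) so the next scheduler pass sets the real one.
    pub fn requeue_job(&mut self) {
        self.state = JobState::Pending;
        self.pending_reason = PendingReason::None;
    }
}

/// Hypothetical error reported when the agent rejects a dispatch.
#[derive(Debug, PartialEq, Eq)]
pub enum DispatchError {
    ImageNotFound(String),
}

/// Try to hand the job to an agent. On rejection the job is requeued
/// rather than marked Failed, so a later cycle can retry it.
pub fn dispatch<F>(job: &mut Job, send_to_agent: F) -> Result<(), DispatchError>
where
    F: FnOnce(&Job) -> Result<(), DispatchError>,
{
    match send_to_agent(&*job) {
        Ok(()) => {
            job.state = JobState::Running;
            Ok(())
        }
        Err(err) => {
            job.requeue_job();
            Err(err)
        }
    }
}
```

The caller still receives the error, so it can log why the dispatch was rejected while the job waits in the queue.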

Issue #92: salloc hangs with no interactive I/O

Root cause: salloc polled indefinitely for RUNNING state with no timeout or feedback.

Before: salloc hangs forever with no output
After: salloc shows the pending reason and times out after 5 minutes with a clear error
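
A rough sketch of a bounded polling loop of this kind, assuming the controller query and the cancel call are passed in as closures. `JobStatus`, `WaitError`, and `wait_for_running` are hypothetical names, not the salloc code in this PR; printing the reason only when it changes is also an assumption, and the message format follows the "job N pending (Resources)" wording from the commit message.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Hypothetical status returned by a controller query.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum JobStatus {
    Pending { reason: String },
    Running,
}

#[derive(Debug, PartialEq, Eq)]
pub enum WaitError {
    TimedOut { job_id: u32 },
}

/// Poll until the job is Running, printing the pending reason whenever it
/// changes. If the deadline passes, cancel the job and return an error.
pub fn wait_for_running<P, C>(
    job_id: u32,
    timeout: Duration,
    interval: Duration,
    mut poll: P,
    cancel: C,
) -> Result<(), WaitError>
where
    P: FnMut(u32) -> JobStatus,
    C: FnOnce(u32),
{
    let deadline = Instant::now() + timeout;
    let mut last_reason: Option<String> = None;
    loop {
        match poll(job_id) {
            JobStatus::Running => return Ok(()),
            JobStatus::Pending { reason } => {
                if last_reason.as_deref() != Some(reason.as_str()) {
                    eprintln!("job {} pending ({})", job_id, reason);
                    last_reason = Some(reason);
                }
            }
        }
        if Instant::now() >= deadline {
            // Do not leave an orphaned allocation request behind.
            cancel(job_id);
            return Err(WaitError::TimedOut { job_id });
        }
        sleep(interval);
    }
}
```

With the 5-minute limit described in this PR, a caller would pass `Duration::from_secs(300)` as `timeout`; the polling interval is left to the caller because the PR does not state one.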

Test plan

  • t07_50: Initial pending reason is None
  • t07_51: Held jobs keep PendingReason::Held
  • t07_52: Job schedules immediately on idle nodes
  • t07_53: Container jobs schedule same as bare-process
  • Full suite: 808 tests, 0 failures

@amd-kmundiga — please verify these fixes resolve the issues you reported. The key changes:

  1. Pending reason now shows (None) initially instead of (Priority) — the scheduler sets the real reason after evaluating
  2. Container dispatch failures requeue instead of failing immediately
  3. salloc shows progress and times out instead of hanging

🤖 Generated with Claude Code

powderluv and others added 3 commits April 16, 2026 00:21
Three fixes for reopened issues #90, #91, #92:

**#90 — Jobs stuck PENDING with Reason=Priority despite idle nodes**
Root cause: Job::new() set initial pending_reason to Priority before
the scheduler evaluated the job. If the scheduler interval was > 1s,
users saw a misleading "Priority" reason.
Fix: Set initial pending_reason to None. The scheduler loop's
update_pending_reasons() sets the real reason (Priority, Resources,
etc.) on the first evaluation cycle.

Before: squeue shows (Priority) immediately after sbatch, even with
        idle nodes. Misleads users into thinking a priority issue.
After:  squeue shows (None) briefly, then transitions to Running
        once the scheduler evaluates and dispatches.

**#91 — Container jobs stuck in pending (Resources)**
Root cause: When dispatch to agent failed (e.g., container image not
found), the job was immediately marked Failed. Users had no chance to
fix the problem (e.g., import the image) and retry.
Fix: Requeue the job back to Pending on dispatch failure instead of
marking it Failed. Added requeue_job() method that unconditionally
returns the job to Pending state.

Before: Container image not found → job immediately Failed.
After:  Container image not found → job requeued to Pending,
        scheduler retries on next cycle. User can import image.

**#92 — salloc hangs with no interactive I/O**
Root cause: salloc polled indefinitely for job RUNNING state with no
timeout or progress feedback. If the scheduler couldn't place the job
(related to #90), salloc hung forever with no indication why.
Fix: Added 5-minute timeout with automatic job cancellation, and
pending reason display so users see why the job isn't starting.

Before: salloc hangs indefinitely with no output if job can't start.
After:  salloc shows "job N pending (Resources)" progress messages
        and times out after 5 minutes with a clear error.

Also fixed maybe_requeue() to set PendingReason::None instead of
Priority on requeue (same root cause as #90).

Tests added:
- t07_50: Initial pending reason is None (not Priority)
- t07_51: Held jobs still get PendingReason::Held
- t07_52: Simple job schedules immediately on idle nodes
- t07_53: Container jobs schedule same as bare-process jobs

808 tests, 0 failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@powderluv powderluv requested a review from shiv-tyagi April 16, 2026 08:20
@powderluv powderluv merged commit 5753440 into main Apr 16, 2026
6 checks passed
shiv-tyagi added a commit to shiv-tyagi/spur that referenced this pull request Apr 16, 2026
PR ROCm#94 changed requeue pending_reason from Priority to None (issue
ROCm#90). Update the requeue_resets_fields_via_apply test assertion to
match.

Made-with: Cursor
shiv-tyagi added a commit to shiv-tyagi/spur that referenced this pull request Apr 18, 2026
PR ROCm#94 changed requeue pending_reason from Priority to None (issue
ROCm#90). Update the requeue_resets_fields_via_apply test assertion to
match.

Made-with: Cursor