
Fix GPU job starvation in resource-aware claims #256

Merged
nkeilbart merged 2 commits into main from fix/gpu-job-priority
Apr 7, 2026

Conversation

@nkeilbart
Collaborator

Summary

Fix resource-aware job claiming so equal-priority GPU jobs are considered before less constrained CPU-only jobs, and stop truncating the candidate set before resource packing.

What changed

  • order equal-priority resource-aware candidates by GPUs, runtime, memory, CPUs, then job id
  • remove SQL-side candidate limiting before the packing loop so later runnable jobs are still considered
  • add explicit regression tests for mixed CPU/GPU starvation and the pre-pack limit case
  • update docs to clarify equal-priority tie-break behavior
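The ordering described above can be sketched as a comparator over claim candidates. This is a minimal illustration, not the actual torc code: the `Candidate` struct and `order_candidates` function are hypothetical names, assuming the real implementation expresses the same ordering in SQL (as the quoted `ORDER BY` fragment below suggests).

```rust
// Hypothetical sketch of the equal-priority tie-break described above.
// Field and type names are illustrative, not the actual torc schema.
#[derive(Debug, Clone)]
struct Candidate {
    id: u64,
    priority: i64,
    num_gpus: u32,
    runtime_s: u64,
    memory_bytes: u64,
    num_cpus: u32,
}

fn order_candidates(cands: &mut Vec<Candidate>) {
    // Higher priority first; among equal priorities, more constrained jobs
    // (more GPUs, longer runtime, more memory, more CPUs) come first, with
    // job id as the final deterministic tie-break.
    cands.sort_by(|a, b| {
        b.priority
            .cmp(&a.priority)
            .then(b.num_gpus.cmp(&a.num_gpus))
            .then(b.runtime_s.cmp(&a.runtime_s))
            .then(b.memory_bytes.cmp(&a.memory_bytes))
            .then(b.num_cpus.cmp(&a.num_cpus))
            .then(a.id.cmp(&b.id))
    });
}

fn main() {
    let mut cands = vec![
        Candidate { id: 1, priority: 5, num_gpus: 0, runtime_s: 3600, memory_bytes: 1 << 30, num_cpus: 4 },
        Candidate { id: 2, priority: 5, num_gpus: 2, runtime_s: 600, memory_bytes: 1 << 30, num_cpus: 4 },
        Candidate { id: 3, priority: 9, num_gpus: 0, runtime_s: 60, memory_bytes: 1 << 20, num_cpus: 1 },
    ];
    order_candidates(&mut cands);
    let ids: Vec<u64> = cands.iter().map(|c| c.id).collect();
    println!("{:?}", ids); // prints [3, 2, 1]: highest priority first, then GPU job before CPU-only
}
```

With this ordering, an equal-priority GPU job (id 2) sorts ahead of a less constrained CPU-only job (id 1), which is the starvation fix.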

Verification

running 6 tests
Setting up database with url: sqlite:/var/folders/np/bd58hgy13dz8nrh6fr1ll880000z6p/T/.tmpIepSWz
Applied 20250101000000/migrate initial schema (1.802708ms)
Applied 20251213000000/migrate event timestamp to integer (1.667958ms)
Applied 20251223000000/migrate add active compute node id (864.917µs)
Applied 20251225000000/migrate add is recovery to workflow action (904.083µs)
Applied 20251226000000/migrate add remote worker (656.625µs)
Applied 20251227000000/migrate add compute node min time for new jobs (925.625µs)
Applied 20251229000000/migrate add access groups (1.158083ms)
Applied 20251230000000/migrate add is system to access group (815.25µs)
Applied 20260109000000/migrate add slurm defaults (832.042µs)
Applied 20260110000000/migrate add failure handlers (2.174417ms)
Applied 20260112000000/migrate add use pending failed (712µs)
Applied 20260206000000/migrate add metadata and project (932.75µs)
Applied 20260214000000/migrate add result indexes (1.252708ms)
Applied 20260222000001/migrate add workflow id to workflow status (941.667µs)
Applied 20260223000000/migrate add ro crate (687.459µs)
Applied 20260227000000/migrate add limit resources to workflow (654.25µs)
Applied 20260228000000/migrate create slurm stats table (621.833µs)
Applied 20260301000000/migrate add step nodes to resource requirements (706.709µs)
Applied 20260302000000/migrate add use srun to workflow (699.959µs)
Applied 20260304000000/migrate add enable ro crate to workflow (626.042µs)
Applied 20260309000000/migrate consolidate slurm config (1.847959ms)
Applied 20260310000000/migrate remove step nodes from resource requirements (1.568541ms)
Applied 20260311000000/migrate add ro crate entity composite index (645.042µs)
Applied 20260312000000/migrate remove redundant job status index (534.834µs)
Applied 20260313000000/migrate add execution config (653µs)
Applied 20260317000000/migrate add job priority (848.541µs)
Applied 20260318000000/migrate remove jobs sort method (1.552792ms)
TORC_SERVER_PORT=55906
test test_claim_jobs_based_on_resources_strict_scheduler_match_controls_fallback ... ok
test test_claim_jobs_based_on_resources_honors_limit ... ok
test test_claim_jobs_based_on_resources_priority_ordering ... ok
test test_claim_jobs_based_on_resources_skips_high_priority_job_that_does_not_fit ... ok
test test_claim_jobs_based_on_resources_prefers_gpu_jobs_with_equal_priority ... ok
test test_claim_jobs_based_on_resources_scans_past_limit_for_runnable_jobs ... ok

test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 10.44s
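The `scans_past_limit_for_runnable_jobs` regression above exercises the second half of the fix: the packing loop must scan the full ordered candidate list rather than a SQL-truncated prefix. A minimal sketch of that idea, with hypothetical `Job` and `pack` names standing in for the real torc types:

```rust
// Hypothetical sketch of why the pre-pack SQL LIMIT was removed: if the
// candidate set is truncated before packing, a later job that would fit
// the available resources can be starved. Names are illustrative.
#[derive(Clone)]
struct Job { id: u64, cpus: u32, gpus: u32 }

/// Pack jobs against available resources, scanning the full ordered
/// candidate list instead of a truncated prefix.
fn pack(candidates: &[Job], mut free_cpus: u32, mut free_gpus: u32, claim_limit: usize) -> Vec<u64> {
    let mut claimed = Vec::new();
    for job in candidates {
        if claimed.len() == claim_limit { break; }
        if job.cpus <= free_cpus && job.gpus <= free_gpus {
            free_cpus -= job.cpus;
            free_gpus -= job.gpus;
            claimed.push(job.id);
        }
        // Jobs that do not fit are skipped, but the scan continues.
    }
    claimed
}

fn main() {
    // A node with 4 CPUs and no GPUs; the first two candidates need GPUs.
    let candidates = vec![
        Job { id: 1, cpus: 2, gpus: 1 },
        Job { id: 2, cpus: 2, gpus: 1 },
        Job { id: 3, cpus: 2, gpus: 0 },
    ];
    // A pre-pack LIMIT 2 would have dropped job 3 before packing ever ran;
    // scanning the full list still claims it.
    let claimed = pack(&candidates, 4, 0, 2);
    println!("{:?}", claimed); // prints [3]
}
```

The `claim_limit` still bounds how many jobs one claim returns; only the truncation that happened *before* resource packing is removed.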

running 23 tests
Setting up database with url: sqlite:/var/folders/np/bd58hgy13dz8nrh6fr1ll880000z6p/T/.tmpa7ZaW3
Applied 20250101000000/migrate initial schema (2.071625ms)
Applied 20251213000000/migrate event timestamp to integer (1.413834ms)
Applied 20251223000000/migrate add active compute node id (696.708µs)
Applied 20251225000000/migrate add is recovery to workflow action (653µs)
Applied 20251226000000/migrate add remote worker (559.875µs)
Applied 20251227000000/migrate add compute node min time for new jobs (648.625µs)
Applied 20251229000000/migrate add access groups (797.5µs)
Applied 20251230000000/migrate add is system to access group (586.5µs)
Applied 20260109000000/migrate add slurm defaults (561.417µs)
Applied 20260110000000/migrate add failure handlers (1.80925ms)
Applied 20260112000000/migrate add use pending failed (523.416µs)
Applied 20260206000000/migrate add metadata and project (668.041µs)
Applied 20260214000000/migrate add result indexes (874.167µs)
Applied 20260222000001/migrate add workflow id to workflow status (617.708µs)
Applied 20260223000000/migrate add ro crate (520.584µs)
Applied 20260227000000/migrate add limit resources to workflow (577.083µs)
Applied 20260228000000/migrate create slurm stats table (512µs)
Applied 20260301000000/migrate add step nodes to resource requirements (639.291µs)
Applied 20260302000000/migrate add use srun to workflow (687.375µs)
Applied 20260304000000/migrate add enable ro crate to workflow (792.416µs)
Applied 20260309000000/migrate consolidate slurm config (1.679084ms)
Applied 20260310000000/migrate remove step nodes from resource requirements (1.517292ms)
Applied 20260311000000/migrate add ro crate entity composite index (877.5µs)
Applied 20260312000000/migrate remove redundant job status index (469.167µs)
Applied 20260313000000/migrate add execution config (630.708µs)
Applied 20260317000000/migrate add job priority (683.375µs)
Applied 20260318000000/migrate remove jobs sort method (1.402167ms)
TORC_SERVER_PORT=55946
test test_prepare_next_jobs_invalid_workflow ... ok
test test_prepare_next_jobs_no_ready_jobs ... ok
test test_claim_next_jobs_returns_invocation_script ... ok
test test_prepare_next_jobs_response_structure ... ok
test test_prepare_next_jobs_basic ... ok
test test_prepare_next_jobs_limit_exceeds_available ... ok
test test_claim_next_jobs_priority_ordering ... ok
test test_prepare_next_jobs_canceled_workflow ... ok
test test_prepare_next_jobs_marks_jobs_pending ... ok
test test_prepare_next_jobs_various_counts::case_2 ... ok
test test_prepare_next_jobs_various_counts::case_1 ... ok
test test_prepare_next_jobs_ignores_resources ... ok
test test_prepare_next_jobs_various_counts::case_3 ... ok
test test_prepare_next_jobs_zero_limit ... ok
No more jobs available after 4 iterations
test test_prepare_next_jobs_exhaust_all_jobs ... ok
Successfully allocated 40 jobs across 8 threads requesting 1 job at a time
test test_prepare_next_jobs_concurrent_small_batches ... ok
test test_prepare_next_jobs_no_double_allocation ... ok
test test_prepare_next_jobs_with_limit::case_1 ... ok
test test_prepare_next_jobs_with_limit::case_2 ... ok
test test_prepare_next_jobs_returns_full_limit ... ok
test test_prepare_next_jobs_with_limit::case_4 ... ok
test test_prepare_next_jobs_with_limit::case_3 ... ok
Thread 0 received 10 jobs
Thread 1 received 10 jobs
Thread 2 received 5 jobs
Thread 3 received 10 jobs
Thread 4 received 10 jobs
Thread 5 received 10 jobs
Thread 6 received 5 jobs
Thread 7 received 10 jobs
Thread 8 received 5 jobs
Thread 9 received 5 jobs
Thread 10 received 5 jobs
Thread 11 received 5 jobs
Thread 12 received 5 jobs
Thread 13 received 5 jobs
Successfully allocated 100 jobs across 14 threads with no race conditions
test test_prepare_next_jobs_concurrent_allocation ... ok

test result: ok. 23 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 1.61s

Prefer more constrained jobs when priorities are equal in the
resource-aware claim path, and stop truncating candidates before
resource packing so later runnable jobs are still considered.

Add regressions for equal-priority CPU/GPU mixes and for the
pre-pack limit case, and document the tie-break behavior for both
claim paths.
Comment on lines +1052 to +1053
rr.runtime_s DESC, \
rr.memory_bytes DESC, \
Collaborator


I'm wondering about these two. This is what the original prioritization scheme tried to address, and it let the user choose. How confident are you that this is correct? On one hand, I agree. On the other, maybe the user wants num_cpus first. There is also the point that the user will want the priority based on the slurm allocation (long walltime -> runtime, bigmem -> memory). Most nodes have the same number of CPUs. Would it be better to leave runtime and memory to user-defined priority so that there is never surprising behavior?

Collaborator Author


Good point. I agree the runtime, memory, and CPU tie-break was too opinionated and not necessary for the actual bug fix. I narrowed the equal-priority ordering to prefer GPU jobs only, which preserves the starvation fix without overriding user intent for the other resource dimensions.

Limit equal-priority resource-based ordering to GPU count so the
starvation fix does not impose implicit runtime, memory, or CPU
preferences.

Update the API comments and user docs to match the narrowed
behavior.
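The narrowed tie-break described in the follow-up commit can be sketched as follows. As before, the struct and function names are hypothetical stand-ins for the real torc code, assuming the final ordering is priority, then GPU count, then job id:

```rust
// Hypothetical sketch of the narrowed tie-break: among equal-priority
// candidates, only GPU count is compared before the job-id tie-break, so
// runtime, memory, and CPU ordering stay under user-defined priority.
#[derive(Debug)]
struct Candidate { id: u64, priority: i64, num_gpus: u32 }

fn order_equal_priority(cands: &mut Vec<Candidate>) {
    cands.sort_by(|a, b| {
        b.priority
            .cmp(&a.priority)
            .then(b.num_gpus.cmp(&a.num_gpus))
            .then(a.id.cmp(&b.id))
    });
}

fn main() {
    let mut cands = vec![
        Candidate { id: 1, priority: 5, num_gpus: 0 },
        Candidate { id: 2, priority: 5, num_gpus: 2 },
        Candidate { id: 3, priority: 5, num_gpus: 0 },
    ];
    order_equal_priority(&mut cands);
    let ids: Vec<u64> = cands.iter().map(|c| c.id).collect();
    println!("{:?}", ids); // prints [2, 1, 3]: GPU job first, then ascending id
}
```

This keeps the starvation fix (GPU jobs are not shadowed by equal-priority CPU-only jobs) while leaving every other resource dimension to the user's explicit priorities.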
Collaborator

@daniel-thom left a comment


Thanks!

@nkeilbart nkeilbart merged commit 2cda144 into main Apr 7, 2026
9 checks passed
@nkeilbart nkeilbart deleted the fix/gpu-job-priority branch April 7, 2026 20:48
