Fix GPU job starvation in resource-aware claims #256
Prefer more constrained jobs when priorities are equal in the resource-aware claim path, and stop truncating candidates before resource packing so later runnable jobs are still considered.

Add regressions for equal-priority CPU/GPU mixes and for the pre-pack limit case, and document the tie-break behavior for both claim paths.
rr.runtime_s DESC, \
rr.memory_bytes DESC, \
I'm wondering about these two. This is what the original prioritization scheme tried to address, and it let the user choose. How confident are you that this is correct? On one hand, I agree. On the other, maybe the user wants num_cpus first. There is also the point that the user will want the priority based on the Slurm allocation (long walltime -> runtime, bigmem -> memory). Most nodes have the same number of CPUs. Would it be better to leave runtime and memory to user-defined priority so that there is never surprising behavior?
Good point. I agree the runtime, memory, and CPU tie-break was too opinionated and not necessary for the actual bug fix. I narrowed the equal-priority ordering to prefer GPU jobs only, which preserves the starvation fix without overriding user intent for the other resource dimensions.
Limit equal-priority resource-based ordering to GPU count so the starvation fix does not impose implicit runtime, memory, or CPU preferences. Update the API comments and user docs to match the narrowed behavior.
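For illustration only, the narrowed tie-break can be sketched as a comparator. This is not the code in this PR (the real ordering lives in the SQL `ORDER BY` clause under review); the `Candidate` struct, its field names, the "higher priority value wins" direction, and the final tie-break on `id` are all assumptions made for the example.

```rust
use std::cmp::Ordering;

/// Hypothetical, simplified view of a ready job. Field names are
/// illustrative and do not come from the project's schema.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    id: i64,
    priority: i64,
    num_gpus: u32,
}

/// Assumed ordering: higher priority first; on equal priority, jobs that
/// request more GPUs first; then by id so the result is deterministic.
/// Runtime, memory, and CPU count are deliberately not compared, so they
/// stay under the control of the user-defined priority.
fn claim_order(a: &Candidate, b: &Candidate) -> Ordering {
    b.priority
        .cmp(&a.priority)
        .then(b.num_gpus.cmp(&a.num_gpus))
        .then(a.id.cmp(&b.id))
}

/// Sort candidates into claim order and return their ids.
fn ordered_ids(mut jobs: Vec<Candidate>) -> Vec<i64> {
    jobs.sort_by(claim_order);
    jobs.iter().map(|j| j.id).collect()
}
```

With this ordering, a GPU job moves ahead of CPU-only jobs only when their priorities are equal; a CPU-only job with a higher priority still goes first.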
Summary
Fix resource-aware job claiming so equal-priority GPU jobs are considered before less constrained CPU-only jobs, and stop truncating the candidate set before resource packing.
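As a rough sketch of the second half of the fix, the loop below applies the limit to the number of jobs claimed rather than to the number of candidates examined. It is a simplified model, not the server's implementation: `Resources`, `ReadyJob`, `fits`, and `pack_jobs` are hypothetical names, and the input is assumed to be already sorted in claim order.

```rust
/// Hypothetical resource description. Names are illustrative only.
#[derive(Debug, Clone, Copy)]
struct Resources {
    cpus: u32,
    gpus: u32,
    memory_bytes: u64,
}

/// Hypothetical ready job with its resource request.
#[derive(Debug, Clone)]
struct ReadyJob {
    id: i64,
    needs: Resources,
}

/// True when the request fits in what is still free on the node.
fn fits(needs: &Resources, free: &Resources) -> bool {
    needs.cpus <= free.cpus
        && needs.gpus <= free.gpus
        && needs.memory_bytes <= free.memory_bytes
}

/// Walk every candidate in claim order, skip jobs that do not fit, and
/// stop only once `limit` jobs have been claimed. Truncating `ordered`
/// to `limit` entries before this loop would hide later runnable jobs.
fn pack_jobs(ordered: &[ReadyJob], mut free: Resources, limit: usize) -> Vec<i64> {
    let mut claimed = Vec::new();
    for job in ordered {
        if claimed.len() >= limit {
            break;
        }
        if fits(&job.needs, &free) {
            // Reserve the resources; `fits` guarantees no underflow.
            free.cpus -= job.needs.cpus;
            free.gpus -= job.needs.gpus;
            free.memory_bytes -= job.needs.memory_bytes;
            claimed.push(job.id);
        }
    }
    claimed
}
```

If the first `limit` candidates are all too large for the node, this loop still reaches the smaller jobs behind them, which is the case the pre-pack limit regression is meant to cover.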
What changed
Verification
running 6 tests
Setting up database with url: sqlite:/var/folders/np/bd58hgy13dz8nrh6fr1ll880000z6p/T/.tmpIepSWz
Applied 20250101000000/migrate initial schema (1.802708ms)
Applied 20251213000000/migrate event timestamp to integer (1.667958ms)
Applied 20251223000000/migrate add active compute node id (864.917µs)
Applied 20251225000000/migrate add is recovery to workflow action (904.083µs)
Applied 20251226000000/migrate add remote worker (656.625µs)
Applied 20251227000000/migrate add compute node min time for new jobs (925.625µs)
Applied 20251229000000/migrate add access groups (1.158083ms)
Applied 20251230000000/migrate add is system to access group (815.25µs)
Applied 20260109000000/migrate add slurm defaults (832.042µs)
Applied 20260110000000/migrate add failure handlers (2.174417ms)
Applied 20260112000000/migrate add use pending failed (712µs)
Applied 20260206000000/migrate add metadata and project (932.75µs)
Applied 20260214000000/migrate add result indexes (1.252708ms)
Applied 20260222000001/migrate add workflow id to workflow status (941.667µs)
Applied 20260223000000/migrate add ro crate (687.459µs)
Applied 20260227000000/migrate add limit resources to workflow (654.25µs)
Applied 20260228000000/migrate create slurm stats table (621.833µs)
Applied 20260301000000/migrate add step nodes to resource requirements (706.709µs)
Applied 20260302000000/migrate add use srun to workflow (699.959µs)
Applied 20260304000000/migrate add enable ro crate to workflow (626.042µs)
Applied 20260309000000/migrate consolidate slurm config (1.847959ms)
Applied 20260310000000/migrate remove step nodes from resource requirements (1.568541ms)
Applied 20260311000000/migrate add ro crate entity composite index (645.042µs)
Applied 20260312000000/migrate remove redundant job status index (534.834µs)
Applied 20260313000000/migrate add execution config (653µs)
Applied 20260317000000/migrate add job priority (848.541µs)
Applied 20260318000000/migrate remove jobs sort method (1.552792ms)
TORC_SERVER_PORT=55906
test test_claim_jobs_based_on_resources_strict_scheduler_match_controls_fallback ... ok
test test_claim_jobs_based_on_resources_honors_limit ... ok
test test_claim_jobs_based_on_resources_priority_ordering ... ok
test test_claim_jobs_based_on_resources_skips_high_priority_job_that_does_not_fit ... ok
test test_claim_jobs_based_on_resources_prefers_gpu_jobs_with_equal_priority ... ok
test test_claim_jobs_based_on_resources_scans_past_limit_for_runnable_jobs ... ok
test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 10.44s
running 23 tests
Setting up database with url: sqlite:/var/folders/np/bd58hgy13dz8nrh6fr1ll880000z6p/T/.tmpa7ZaW3
Applied 20250101000000/migrate initial schema (2.071625ms)
Applied 20251213000000/migrate event timestamp to integer (1.413834ms)
Applied 20251223000000/migrate add active compute node id (696.708µs)
Applied 20251225000000/migrate add is recovery to workflow action (653µs)
Applied 20251226000000/migrate add remote worker (559.875µs)
Applied 20251227000000/migrate add compute node min time for new jobs (648.625µs)
Applied 20251229000000/migrate add access groups (797.5µs)
Applied 20251230000000/migrate add is system to access group (586.5µs)
Applied 20260109000000/migrate add slurm defaults (561.417µs)
Applied 20260110000000/migrate add failure handlers (1.80925ms)
Applied 20260112000000/migrate add use pending failed (523.416µs)
Applied 20260206000000/migrate add metadata and project (668.041µs)
Applied 20260214000000/migrate add result indexes (874.167µs)
Applied 20260222000001/migrate add workflow id to workflow status (617.708µs)
Applied 20260223000000/migrate add ro crate (520.584µs)
Applied 20260227000000/migrate add limit resources to workflow (577.083µs)
Applied 20260228000000/migrate create slurm stats table (512µs)
Applied 20260301000000/migrate add step nodes to resource requirements (639.291µs)
Applied 20260302000000/migrate add use srun to workflow (687.375µs)
Applied 20260304000000/migrate add enable ro crate to workflow (792.416µs)
Applied 20260309000000/migrate consolidate slurm config (1.679084ms)
Applied 20260310000000/migrate remove step nodes from resource requirements (1.517292ms)
Applied 20260311000000/migrate add ro crate entity composite index (877.5µs)
Applied 20260312000000/migrate remove redundant job status index (469.167µs)
Applied 20260313000000/migrate add execution config (630.708µs)
Applied 20260317000000/migrate add job priority (683.375µs)
Applied 20260318000000/migrate remove jobs sort method (1.402167ms)
TORC_SERVER_PORT=55946
test test_prepare_next_jobs_invalid_workflow ... ok
test test_prepare_next_jobs_no_ready_jobs ... ok
test test_claim_next_jobs_returns_invocation_script ... ok
test test_prepare_next_jobs_response_structure ... ok
test test_prepare_next_jobs_basic ... ok
test test_prepare_next_jobs_limit_exceeds_available ... ok
test test_claim_next_jobs_priority_ordering ... ok
test test_prepare_next_jobs_canceled_workflow ... ok
test test_prepare_next_jobs_marks_jobs_pending ... ok
test test_prepare_next_jobs_various_counts::case_2 ... ok
test test_prepare_next_jobs_various_counts::case_1 ... ok
test test_prepare_next_jobs_ignores_resources ... ok
test test_prepare_next_jobs_various_counts::case_3 ... ok
test test_prepare_next_jobs_zero_limit ... ok
No more jobs available after 4 iterations
test test_prepare_next_jobs_exhaust_all_jobs ... ok
Successfully allocated 40 jobs across 8 threads requesting 1 job at a time
test test_prepare_next_jobs_concurrent_small_batches ... ok
test test_prepare_next_jobs_no_double_allocation ... ok
test test_prepare_next_jobs_with_limit::case_1 ... ok
test test_prepare_next_jobs_with_limit::case_2 ... ok
test test_prepare_next_jobs_returns_full_limit ... ok
test test_prepare_next_jobs_with_limit::case_4 ... ok
test test_prepare_next_jobs_with_limit::case_3 ... ok
Thread 0 received 10 jobs
Thread 1 received 10 jobs
Thread 2 received 5 jobs
Thread 3 received 10 jobs
Thread 4 received 10 jobs
Thread 5 received 10 jobs
Thread 6 received 5 jobs
Thread 7 received 10 jobs
Thread 8 received 5 jobs
Thread 9 received 5 jobs
Thread 10 received 5 jobs
Thread 11 received 5 jobs
Thread 12 received 5 jobs
Thread 13 received 5 jobs
Successfully allocated 100 jobs across 14 threads with no race conditions
test test_prepare_next_jobs_concurrent_allocation ... ok
test result: ok. 23 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 1.61s