Avoid rayon priority inversion in step_resolution scheduling#230
Avoid rayon priority inversion in step_resolution scheduling#230JoeyBF merged 1 commit intoSpectralSequences:masterfrom
Conversation
When rayon workers finish inner parallel work (par_iter_mut in get_partial_matrix/get_matrix) and wait at join points, work-stealing can cause them to pick up another step_resolution job, blocking the original from completing. This crosses thread boundaries: any thread in the join tree can steal a long job. Fix: track active parallel sections with a global atomic counter (PARALLEL_DEPTH). Before each get_partial_matrix/get_matrix call, a ParallelGuard increments the counter; it decrements on drop. Spawned step_resolution closures check is_in_parallel() and send a retry message if true. The scheduler re-queues retried jobs, and the freed thread returns to rayon's pool where it helps finish the active parallel section -- turning the retry into productive work. See rayon-rs/rayon#957 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis pull request introduces a parallel execution tracking mechanism via Changes
Sequence DiagramsequenceDiagram
participant Worker as Worker Thread
participant Scheduler as Scheduler Loop
participant MatrixOp as Matrix Operations
participant Queue as Sender Queue
Worker->>Scheduler: step_resolution(bidegree)
Scheduler->>Scheduler: Check is_in_parallel()?
alt Parallel execution detected
Scheduler->>Queue: send_retry(bidegree)
Queue->>Scheduler: SenderData{retry: true}
Scheduler->>Scheduler: Re-invoke scheduling closure f(b, sender)
else No parallel execution
Scheduler->>MatrixOp: get_matrix() within ParallelGuard
MatrixOp->>MatrixOp: Update counter
MatrixOp->>Scheduler: Return matrix
MatrixOp->>MatrixOp: Drop guard, decrement counter
Scheduler->>Scheduler: Update progress
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ext/src/resolution.rs`:
- Around line 836-839: The retry gate currently protects step_resolution jobs
but not the stem-edge kernel task spawned via scope.spawn; instead of calling
self.get_kernel(next_b) directly inside the scope.spawn closure, wrap that path
with the same parallel-check + retry defer used by
SenderData::send_retry/is_in_parallel so the closure defers and re-queues the
task when running in a parallel worker (or otherwise retries via
SenderData::send_retry), ensuring get_kernel/get_matrix calls are not executed
on a worker that can block other threads; update the code around the scope.spawn
that references self.get_kernel(next_b) to use the deferred/retry mechanism used
for step_resolution so the stem-edge kernel path is covered.
In `@ext/src/utils.rs`:
- Around line 563-565: The increment of PARALLEL_DEPTH in the constructor new()
uses Ordering::Release which doesn't provide acquire semantics for subsequent
work; change the atomic increment to use at least Ordering::AcqRel (or
Ordering::SeqCst) so that the publication of PARALLEL_DEPTH is visible before
this thread proceeds to call get_matrix / get_partial_matrix and prevent other
workers from missing the active-parallel marker and running step_resolution;
update the fetch_add call on PARALLEL_DEPTH in new() accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4eca9b8a-e45c-4d8d-9ba6-86eada00a550
📒 Files selected for processing (3)
ext/src/nassau.rsext/src/resolution.rsext/src/utils.rs
When rayon workers finish inner parallel work (par_iter_mut in get_partial_matrix/get_matrix) and wait at join points, work-stealing can cause them to pick up another step_resolution job, blocking the original from completing. This crosses thread boundaries: any thread in the join tree can steal a long job. Fix: track active parallel sections with a global atomic counter (PARALLEL_DEPTH). Before each get_partial_matrix/get_matrix call, a ParallelGuard increments the counter; it decrements on drop. Spawned step_resolution closures check is_in_parallel() and send a retry message if true. The scheduler re-queues retried jobs, and the freed thread returns to rayon's pool where it helps finish the active parallel section -- turning the retry into productive work. See rayon-rs/rayon#957
…lSequences#230) When rayon workers finish inner parallel work (par_iter_mut in get_partial_matrix/get_matrix) and wait at join points, work-stealing can cause them to pick up another step_resolution job, blocking the original from completing. This crosses thread boundaries: any thread in the join tree can steal a long job. Fix: track active parallel sections with a global atomic counter (PARALLEL_DEPTH). Before each get_partial_matrix/get_matrix call, a ParallelGuard increments the counter; it decrements on drop. Spawned step_resolution closures check is_in_parallel() and send a retry message if true. The scheduler re-queues retried jobs, and the freed thread returns to rayon's pool where it helps finish the active parallel section -- turning the retry into productive work. See rayon-rs/rayon#957
When rayon workers finish inner parallel work (par_iter_mut in get_partial_matrix/get_matrix) and wait at join points, work-stealing can cause them to pick up another step_resolution job, blocking the original from completing. This crosses thread boundaries: any thread in the join tree can steal a long job.
Fix: track active parallel sections with a global atomic counter (PARALLEL_DEPTH). Before each get_partial_matrix/get_matrix call, a ParallelGuard increments the counter; it decrements on drop. Spawned step_resolution closures check is_in_parallel() and send a retry message if true. The scheduler re-queues retried jobs, and the freed thread returns to rayon's pool where it helps finish the active parallel section -- turning the retry into productive work.
See rayon-rs/rayon#957
Summary by CodeRabbit