Avoid rayon priority inversion in step_resolution scheduling by JoeyBF · Pull Request #230 · SpectralSequences/sseq

JoeyBF · 2026-04-12T00:54:56Z

When rayon workers finish inner parallel work (par_iter_mut in get_partial_matrix/get_matrix) and wait at join points, work-stealing can cause them to pick up another step_resolution job, blocking the original from completing. This crosses thread boundaries: any thread in the join tree can steal a long job.

Fix: track active parallel sections with a global atomic counter (PARALLEL_DEPTH). Before each get_partial_matrix/get_matrix call, a ParallelGuard increments the counter; it decrements on drop. Spawned step_resolution closures check is_in_parallel() and send a retry message if true. The scheduler re-queues retried jobs, and the freed thread returns to rayon's pool where it helps finish the active parallel section -- turning the retry into productive work.

See rayon-rs/rayon#957
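
The counter-and-guard mechanism described above can be sketched roughly as follows. This is a minimal illustration rather than the code merged in this PR: the names `PARALLEL_DEPTH`, `ParallelGuard` and `is_in_parallel` come from the description, but the struct layout and the choice of `Ordering::SeqCst` are assumptions.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Number of parallel sections currently active anywhere in the process.
static PARALLEL_DEPTH: AtomicUsize = AtomicUsize::new(0);

/// RAII guard: increments the counter on creation, decrements it on drop.
pub struct ParallelGuard(());

impl ParallelGuard {
    pub fn new() -> Self {
        // SeqCst is a conservative assumption, not necessarily what the PR uses.
        PARALLEL_DEPTH.fetch_add(1, Ordering::SeqCst);
        ParallelGuard(())
    }
}

impl Drop for ParallelGuard {
    fn drop(&mut self) {
        PARALLEL_DEPTH.fetch_sub(1, Ordering::SeqCst);
    }
}

/// True while at least one guard is alive on any thread.
pub fn is_in_parallel() -> bool {
    PARALLEL_DEPTH.load(Ordering::SeqCst) > 0
}

fn main() {
    assert!(!is_in_parallel());
    {
        let _guard = ParallelGuard::new();
        // In the PR, get_matrix / get_partial_matrix would run here.
        assert!(is_in_parallel());
        // Guards nest: the counter is a depth, not a flag.
        let _nested = ParallelGuard::new();
        assert!(is_in_parallel());
    }
    // Both guards were dropped at the end of the scope.
    assert!(!is_in_parallel());
}
```

Because the guard decrements in `Drop`, the counter is restored even if the guarded call returns early or panics and unwinds.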

Summary by CodeRabbit

Refactor
- Enhanced parallel execution control and worker scheduling to improve reliability during concurrent computations.
- Implemented retry mechanisms for work queue management to ensure robust task processing in parallel environments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-04-12T00:55:10Z

📝 Walkthrough

Walkthrough

This pull request introduces a mechanism for tracking active parallel sections, ParallelGuard, and extends the worker scheduling system with a retry capability. When a spawned step_resolution job starts while a matrix-construction parallel section is active, it is re-queued for later execution instead of running immediately, so a worker waiting at a join point does not pick up a long job and block the inner parallel work from completing.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Parallel Execution Tracking `ext/src/utils.rs` | Introduced new `parallel` module with `ParallelGuard` RAII type and `is_in_parallel()` function to track active parallel work via a global atomic counter. |
| Retry Logic in Nassau `ext/src/nassau.rs` | Extended `SenderData` with `retry` field, added `send_retry()` method, wrapped matrix-construction calls in `ParallelGuard` scope, and updated scheduling loop to detect parallel execution and enqueue retry messages instead of proceeding immediately. |
| Retry Logic in Resolution `ext/src/resolution.rs` | Extended `SenderData` with `retry` field, added `send_retry()` method, wrapped `get_matrix()` calls in `ParallelGuard` scope, updated worker scheduling loops to detect parallel execution and re-invoke scheduling closure on retry instead of direct execution. |
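
The retry path summarized in the table might look roughly like the single-threaded simulation below. It is a sketch under stated assumptions: the `SenderData` here has only two hypothetical fields, `schedule` and its `in_parallel` closure are invented for illustration, and `std::sync::mpsc` stands in for whatever channel the real scheduler uses.

```rust
use std::collections::VecDeque;
use std::sync::mpsc;

/// Hypothetical message type; the real SenderData carries more fields.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SenderData {
    bidegree: (u32, i32),
    retry: bool,
}

/// Runs every job to completion, re-queueing any job that starts while
/// `in_parallel()` reports an active parallel section.
/// Returns the completion order and the number of retries.
fn schedule(
    jobs: Vec<(u32, i32)>,
    mut in_parallel: impl FnMut() -> bool,
) -> (Vec<(u32, i32)>, usize) {
    let (tx, rx) = mpsc::channel::<SenderData>();
    let mut pending = VecDeque::from(jobs);
    let mut done = Vec::new();
    let mut retries = 0;

    while let Some(bidegree) = pending.pop_front() {
        // Stand-in for a spawned step_resolution closure.
        if in_parallel() {
            // Give the thread back instead of starting a long job.
            tx.send(SenderData { bidegree, retry: true }).unwrap();
        } else {
            // The real work would happen here.
            tx.send(SenderData { bidegree, retry: false }).unwrap();
        }

        // Scheduler side: re-queue retried jobs, record finished ones.
        let msg = rx.recv().unwrap();
        if msg.retry {
            retries += 1;
            pending.push_back(msg.bidegree);
        } else {
            done.push(msg.bidegree);
        }
    }
    (done, retries)
}

fn main() {
    // Pretend a parallel section is active for the first two checks only.
    let mut checks = 0;
    let (done, retries) = schedule(vec![(0, 0), (1, 1)], || {
        checks += 1;
        checks <= 2
    });
    assert_eq!(retries, 2);
    assert_eq!(done, vec![(0, 0), (1, 1)]);
}
```

The loop terminates only if `in_parallel()` eventually returns false. In the PR that is expected because the freed thread goes back to rayon's pool and helps finish the active parallel section.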

Sequence Diagram

sequenceDiagram
    participant Worker as Worker Thread
    participant Scheduler as Scheduler Loop
    participant MatrixOp as Matrix Operations
    participant Queue as Sender Queue
    
    Worker->>Scheduler: step_resolution(bidegree)
    Scheduler->>Scheduler: Check is_in_parallel()?
    alt Parallel execution detected
        Scheduler->>Queue: send_retry(bidegree)
        Queue->>Scheduler: SenderData{retry: true}
        Scheduler->>Scheduler: Re-invoke scheduling closure f(b, sender)
    else No parallel execution
        Scheduler->>MatrixOp: get_matrix() within ParallelGuard
        MatrixOp->>MatrixOp: Update counter
        MatrixOp->>Scheduler: Return matrix
        MatrixOp->>MatrixOp: Drop guard, decrement counter
        Scheduler->>Scheduler: Update progress
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hark! A guard doth mark when threads run fast,
And jobs deemed unsafe wait to execute at last,
With retry flags a-flying, they queue once more,
Until the parallel dance is done and finished for sure! 🌲✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and specifically describes the main change: introducing a mechanism to avoid Rayon priority inversion during step_resolution scheduling, which aligns with the core objective and code changes (ParallelGuard, retry logic, is_in_parallel checks). |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ext/src/resolution.rs`:
- Around line 836-839: The retry gate protects step_resolution jobs but not the
stem-edge kernel task spawned via scope.spawn, which calls
self.get_kernel(next_b) directly. Wrap that path in the same is_in_parallel
check used for step_resolution, and defer and re-queue the task via
SenderData::send_retry when a parallel section is active, so that
get_kernel/get_matrix never runs on a worker that can block other threads.
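
One way to apply the same gate to an arbitrary spawned task is a small helper like the one below. Everything in it is hypothetical (`Gate`, `run_or_defer`, and passing the check result as a `bool`); it only illustrates the shape of the suggested change and makes no claim about how the repository implements it.

```rust
/// Outcome of attempting to run a gated task.
#[derive(Debug, PartialEq)]
enum Gate<T> {
    Ran(T),
    Deferred,
}

/// Runs `task` unless a parallel section is active, in which case `defer`
/// is called instead (for example, to send a retry message).
fn run_or_defer<T>(
    in_parallel: bool,
    task: impl FnOnce() -> T,
    defer: impl FnOnce(),
) -> Gate<T> {
    if in_parallel {
        defer();
        Gate::Deferred
    } else {
        Gate::Ran(task())
    }
}

fn main() {
    // Stand-in for the scheduler's retry queue.
    let mut requeued = Vec::new();

    // A parallel section is active: the kernel task is deferred.
    let r = run_or_defer(true, || "kernel", || requeued.push((3u32, 7i32)));
    assert_eq!(r, Gate::Deferred);
    assert_eq!(requeued, vec![(3, 7)]);

    // No parallel section: the task runs and nothing is re-queued.
    let r = run_or_defer(false, || "kernel", || requeued.push((3, 7)));
    assert_eq!(r, Gate::Ran("kernel"));
    assert_eq!(requeued.len(), 1);
}
```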

In `@ext/src/utils.rs`:
- Around line 563-565: new() increments PARALLEL_DEPTH with Ordering::Release,
which gives no acquire semantics for the work that follows. Change the fetch_add
to at least Ordering::AcqRel (or Ordering::SeqCst) so the increment is published
before this thread calls get_matrix / get_partial_matrix; otherwise other
workers can miss the active-parallel marker and run step_resolution.
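
As a sketch of the suggested ordering change, the counter operations could use `Ordering::SeqCst` throughout, which places all of them in a single total order. The helper names `enter` and `leave` are hypothetical, and the example shows where the ordering argument goes rather than demonstrating a race: spawning and joining a thread already synchronizes with it.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static PARALLEL_DEPTH: AtomicUsize = AtomicUsize::new(0);

/// Marks the start of a parallel section and returns the previous depth.
/// With Ordering::Release the load half of this read-modify-write would be
/// Relaxed; SeqCst (or AcqRel) gives it acquire semantics as well.
fn enter() -> usize {
    PARALLEL_DEPTH.fetch_add(1, Ordering::SeqCst)
}

/// Marks the end of a parallel section and returns the previous depth.
fn leave() -> usize {
    PARALLEL_DEPTH.fetch_sub(1, Ordering::SeqCst)
}

fn is_in_parallel() -> bool {
    PARALLEL_DEPTH.load(Ordering::SeqCst) > 0
}

fn main() {
    assert_eq!(enter(), 0);
    // Another thread observes the marker before any guarded work starts.
    let seen = thread::spawn(is_in_parallel).join().unwrap();
    assert!(seen);
    assert_eq!(leave(), 1);
    assert!(!is_in_parallel());
}
```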

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4eca9b8a-e45c-4d8d-9ba6-86eada00a550

📥 Commits

Reviewing files that changed from the base of the PR and between bb6dafc and 0ec0e20.

📒 Files selected for processing (3)

ext/src/nassau.rs
ext/src/resolution.rs
ext/src/utils.rs

JoeyBF force-pushed the prio_inv branch from cd2c7bc to 0ec0e20 Compare April 12, 2026 00:55

coderabbitai bot requested changes Apr 12, 2026

coderabbitai bot approved these changes Apr 12, 2026

JoeyBF merged commit 775784d into SpectralSequences:master Apr 12, 2026
24 checks passed

JoeyBF deleted the prio_inv branch April 12, 2026 01:51
