Avoid rayon priority inversion in step_resolution scheduling#230

Merged

JoeyBF merged 1 commit into SpectralSequences:master from JoeyBF:prio_inv on Apr 12, 2026


Conversation

Collaborator

@JoeyBF JoeyBF commented Apr 12, 2026

When rayon workers finish inner parallel work (par_iter_mut in get_partial_matrix/get_matrix) and wait at join points, work-stealing can cause them to pick up another step_resolution job, blocking the original from completing. This crosses thread boundaries: any thread in the join tree can steal a long job.

Fix: track active parallel sections with a global atomic counter (PARALLEL_DEPTH). Before each get_partial_matrix/get_matrix call, a ParallelGuard increments the counter; it decrements on drop. Spawned step_resolution closures check is_in_parallel() and send a retry message if true. The scheduler re-queues retried jobs, and the freed thread returns to rayon's pool where it helps finish the active parallel section -- turning the retry into productive work.
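A minimal sketch of the guard mechanism described above, following the names in the PR description (`PARALLEL_DEPTH`, `ParallelGuard`, `is_in_parallel`); the exact signatures in `ext/src/utils.rs` may differ:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Global counter of currently active parallel sections.
static PARALLEL_DEPTH: AtomicUsize = AtomicUsize::new(0);

/// RAII guard: increments the counter while an inner parallel
/// section (e.g. a par_iter_mut) runs, decrements when dropped.
pub struct ParallelGuard;

impl ParallelGuard {
    pub fn new() -> Self {
        // AcqRel so the increment is published before the guarded
        // matrix work begins on this thread.
        PARALLEL_DEPTH.fetch_add(1, Ordering::AcqRel);
        ParallelGuard
    }
}

impl Drop for ParallelGuard {
    fn drop(&mut self) {
        PARALLEL_DEPTH.fetch_sub(1, Ordering::AcqRel);
    }
}

/// True if any thread is inside a guarded parallel section.
pub fn is_in_parallel() -> bool {
    PARALLEL_DEPTH.load(Ordering::Acquire) > 0
}

fn main() {
    assert!(!is_in_parallel());
    let guard = ParallelGuard::new();
    assert!(is_in_parallel());
    drop(guard);
    assert!(!is_in_parallel());
    println!("guard ok");
}
```

A spawned `step_resolution` closure then only needs `if is_in_parallel() { /* send retry */ }` before committing to the long job.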

See rayon-rs/rayon#957

Summary by CodeRabbit

  • Refactor
    • Enhanced parallel execution control and worker scheduling to improve reliability during concurrent computations.
    • Implemented retry mechanisms for work queue management to ensure robust task processing in parallel environments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai bot commented Apr 12, 2026

📝 Walkthrough

This pull request introduces a parallel execution tracking mechanism via ParallelGuard and extends the worker scheduling system with a retry capability. When matrix operations are detected during active parallel work, jobs are re-queued for later execution instead of proceeding immediately, preventing potential conflicts.

Changes

Cohort / File(s) Summary
Parallel Execution Tracking
ext/src/utils.rs
Introduced new parallel module with ParallelGuard RAII type and is_in_parallel() function to track active parallel work via a global atomic counter.
Retry Logic in Nassau
ext/src/nassau.rs
Extended SenderData with retry field, added send_retry() method, wrapped matrix-construction calls in ParallelGuard scope, and updated scheduling loop to detect parallel execution and enqueue retry messages instead of proceeding immediately.
Retry Logic in Resolution
ext/src/resolution.rs
Extended SenderData with retry field, added send_retry() method, wrapped get_matrix() calls in ParallelGuard scope, updated worker scheduling loops to detect parallel execution and re-invoke scheduling closure on retry instead of direct execution.

Sequence Diagram

sequenceDiagram
    participant Worker as Worker Thread
    participant Scheduler as Scheduler Loop
    participant MatrixOp as Matrix Operations
    participant Queue as Sender Queue
    
    Worker->>Scheduler: step_resolution(bidegree)
    Scheduler->>Scheduler: Check is_in_parallel()?
    alt Parallel execution detected
        Scheduler->>Queue: send_retry(bidegree)
        Queue->>Scheduler: SenderData{retry: true}
        Scheduler->>Scheduler: Re-invoke scheduling closure f(b, sender)
    else No parallel execution
        Scheduler->>MatrixOp: get_matrix() within ParallelGuard
        MatrixOp->>MatrixOp: Update counter
        MatrixOp->>Scheduler: Return matrix
        MatrixOp->>MatrixOp: Drop guard, decrement counter
        Scheduler->>Scheduler: Update progress
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hark! A guard doth mark when threads run fast,
And jobs deemed unsafe wait to execute at last,
With retry flags a-flying, they queue once more,
Until the parallel dance is done and finished for sure! 🌲✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and specifically describes the main change: introducing a mechanism to avoid Rayon priority inversion during step_resolution scheduling, which aligns with the core objective and code changes (ParallelGuard, retry logic, is_in_parallel checks).
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.




coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ext/src/resolution.rs`:
- Around lines 836-839: The retry gate currently protects step_resolution jobs
but not the stem-edge kernel task spawned via scope.spawn. Instead of calling
self.get_kernel(next_b) directly inside the scope.spawn closure, wrap that path
in the same parallel-check and retry deferral used for step_resolution
(is_in_parallel() plus SenderData::send_retry), so that the closure defers and
re-queues the task when it runs during an active parallel section. This ensures
get_kernel/get_matrix calls are never executed on a worker that could block
other threads.

In `@ext/src/utils.rs`:
- Around lines 563-565: The increment of PARALLEL_DEPTH in the constructor
new() uses Ordering::Release, which provides no acquire semantics for the work
that follows. Change the fetch_add on PARALLEL_DEPTH to at least
Ordering::AcqRel (or Ordering::SeqCst) so the counter update is published
before this thread proceeds to get_matrix/get_partial_matrix; otherwise other
workers can miss the active-parallel marker and run step_resolution anyway.
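A minimal illustration of the suggested ordering change. The helper name is hypothetical (the actual increment lives in the guard's constructor); only the `Ordering` argument is the point:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static PARALLEL_DEPTH: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for the constructor discussed above.
fn enter_parallel_section() {
    // Was: PARALLEL_DEPTH.fetch_add(1, Ordering::Release);
    // Per the review comment, Release alone lacks acquire semantics
    // for the work that follows; AcqRel (or SeqCst) orders the
    // increment before the subsequent matrix work.
    PARALLEL_DEPTH.fetch_add(1, Ordering::AcqRel);
}

fn main() {
    enter_parallel_section();
    assert_eq!(PARALLEL_DEPTH.load(Ordering::Acquire), 1);
    println!("ok");
}
```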

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4eca9b8a-e45c-4d8d-9ba6-86eada00a550

📥 Commits

Reviewing files that changed from the base of the PR and between bb6dafc and 0ec0e20.

📒 Files selected for processing (3)
  • ext/src/nassau.rs
  • ext/src/resolution.rs
  • ext/src/utils.rs

@JoeyBF JoeyBF merged commit 775784d into SpectralSequences:master Apr 12, 2026
24 checks passed
@JoeyBF JoeyBF deleted the prio_inv branch April 12, 2026 01:51
github-actions bot added a commit that referenced this pull request Apr 12, 2026
github-actions bot added a commit to JoeyBF/sseq that referenced this pull request Apr 12, 2026
…lSequences#230)

