
Lazy Heartbeat Scheduling #44

Merged
NthTensor merged 3 commits into main from instruction_counters_lazy_scheduling on May 12, 2026

Conversation

@NthTensor
Owner

@NthTensor NthTensor commented Apr 24, 2026

At the start of the year, I started circulating this crate around and soliciting feedback. The general response was: it doesn't perform well. Womp womp. Sad trumpet noises.

Over the last few months, I built several prototypes aimed at improving performance. This PR is the culmination of those efforts. Some of the intermediate experiments got pretty messy, so for simplicity I've chosen to just pull the working experiments together into this Mega-PR.

There are three major changes:

  • I've added a new lock-free worker registration and sleep-tracking system (basically, we use an atomic bit-set for tracking seat information). This has significantly reduced the overhead of worker wakeups and allows us to wake workers much more eagerly, but it also means thread pools now have a hard limit of 32 members. I think this is a reasonable constraint (and we can theoretically expand it to 64 on platforms that support 64-bit atomics).

  • There's a totally new (potentially novel) approach to workload balancing that combines work-stealing, heartbeat scheduling, and lazy scheduling. This is detailed in the new readme, so I won't repeat it here.

  • I've also done another comprehensive pass on docs and safety comments.

This code will form the basis for 1.0.0-alpha.5 and (depending on how the next round of feedback goes) may end up being released as 1.0. Hopefully changes will be more incremental going forwards.

@CorvusPrudens

Here are my results. My system isn't particularly quiet, so take it with a grain of salt.

https://gist.github.com/CorvusPrudens/16c4f6dc26770c8578a869978eb85ab8

@NthTensor
Owner Author

There seems to be a potential segfault within join on x86 Windows, which I am trying to diagnose.

Comment thread on src/thread_pool.rs:

if sleeping == 0 {
    return;
}
cold_path();
Owner Author

How odd, this cold_path hint seems to be the source of the segfault...


@giant-chipmunk giant-chipmunk May 1, 2026

Removing that cold_path() still leads to "exit code: 0xc0000005, STATUS_ACCESS_VIOLATION" on my machine (Win11 on a Ryzen 7600x) once the fork_join bench hits forte.

Enabling any LTO (fat or thin) fixes the issue

Owner Author

There's some UB somewhere in the promotion code which I am still tracking down.


@giant-chipmunk giant-chipmunk May 2, 2026

I looked through your code and couldn't find anything, for the life of me.

So I chucked some Codex GPT5.5-High at it while doing the dishes, and it seems to have found an upstream issue in tick_counter.

They do:

/// Returns a current value of the tick counter to use as a staring point
#[cfg(target_arch = "x86_64")]
#[inline]
pub fn start() -> u64 {
    let rax: u64;
    unsafe {
        asm!(
            "mfence",
            "lfence",
            "rdtsc",
            "shl rdx, 32",
            "or rax, rdx",
            out("rax") rax
        );
    }
    rax
}

in https://github.com/sheroz/tick_counter/blob/7511cf114ca536f94c35b675b5a12d5b9a762e96/src/lib.rs#L143

rdx gets clobbered (it is left-shifted within this block) but is not declared as a clobber or an output. Rust assumes that nothing but the declared outputs changes within an asm! block, so the program might rely on an existing value in rdx that has silently been overwritten and run into UB elsewhere.

Owner Author

Interesting! This would be consistent with my observations; the issues started when tick-counter was introduced.

I'll report this upstream and try switching to cputicks. I'd really rather not have to maintain my own inline asm for this.

Thanks for investigating!!

Owner Author

Ah, this seems to have already been reported. sheroz/tick_counter#15 (comment)

What fun, I really should have audited that dep more closely.

@giant-chipmunk

giant-chipmunk commented May 1, 2026

Edit: Unified comments and added more benches


Ran the bench with the tick_counter issue fixed locally:

Default:
https://gist.github.com/giant-chipmunk/2de494bb10dd84c24098eca591a1e0ee

Default + mimalloc:
https://gist.github.com/giant-chipmunk/a2f2898fe4df6e4f9e8e9dfe8acf8900

My Fix: Replaced tick_counter usage with the following code:

#[inline(always)]
fn current_tick() -> u64 {
    let low: u64;
    let high: u64;
    // SAFETY: This reads the CPU timestamp counter and declares both registers written by `rdtsc` as outputs.
    unsafe {
        core::arch::asm!(
            "mfence",
            "lfence",
            "rdtsc",
            out("rax") low,
            out("rdx") high,
            options(nostack, preserves_flags),
        );
    }
    low | (high << 32)
}

Old Benches:

Gist of the bench as is:
https://gist.github.com/giant-chipmunk/d34aa54ff7d41da6082edd6d672202c6

Gist using thin LTO to avoid STATUS_ACCESS_VIOLATION on fork_join bench:
https://gist.github.com/giant-chipmunk/f48ee5cdb1dfd505b73e7819a3216fea

Gist using fat LTO + misc build flags on nightly (see notes inside gist for details):
https://gist.github.com/giant-chipmunk/c8359760702e85188b5bd5a51cd741be

Gist using config above + enabling -Cpanic=abort for benches:
https://gist.github.com/giant-chipmunk/869e0ebcc33c8db1a98cdc8db6e64a99

Got curious about the impact of some misc. build flags so I chucked them in there

The panic=abort results are quite interesting. Some select benchmarks show significant improvements while others (most) barely changed. And on fork_join - throughput_forte there is a significant speedup on smaller workloads but a regression from (11, 2047) onwards.


Also ran with mimalloc out of curiosity:

Gist of the bench as is + mimalloc:
https://gist.github.com/giant-chipmunk/85196dc62eec515d98f5957d4c936e14

Gist using thin LTO + mimalloc:
https://gist.github.com/giant-chipmunk/9dd20ed0f032f5738f565e9ce83e79f9 (overhead - baseline sucks in this case)

Gist using nightly + misc flags + panic=abort + mimalloc:
https://gist.github.com/giant-chipmunk/6c94c820759c60ce8170db3c6b36c733


@NthTensor
Owner Author

Switched to hotclock (new version of cputicks) due to UB in tick_counter. Previous version of cputicks seemed to be much slower than tick_counter, but hotclock is just as fast and doesn't have trivial UB.

This PR is mostly waiting for a 0.2.0 release of hotclock now.

@giant-chipmunk

Reran cargo bench under Win11_x64 with latest stable on two different CPUs

Bench with Ryzen 5 7600x
https://gist.github.com/giant-chipmunk/0dfc8ed85a328550a31843a084612e3e

Bench with Intel i5-9600k
https://gist.github.com/giant-chipmunk/ad5dd3a5200e1e3e1416084df34d7534

No real surprises, except that Rayon's performance degraded significantly more than the other crates' when running fork_join (20, 1048575) on the older, cache-poor CPU.

@NthTensor
Owner Author

Sick! I'm reasonably happy with those results. Chili is still beating us on x86 Windows, and especially on resource-constrained x86 Windows, but we are mostly matching Rayon. There are probably a few more things we can steal from chili (and enable conditionally on Windows) which could close that gap.

I'm going to merge this, I think, and then I'll put out the next beta release as soon as the git-dep is no longer necessary.

@NthTensor NthTensor merged commit e983f85 into main May 12, 2026
4 checks passed
@NthTensor NthTensor deleted the instruction_counters_lazy_scheduling branch May 12, 2026 20:08