
Lazy Heartbeat Scheduling #44

Merged
NthTensor merged 3 commits into main from instruction_counters_lazy_scheduling on May 12, 2026

Conversation

@NthTensor
Owner

@NthTensor NthTensor commented Apr 24, 2026

At the start of the year, I started circulating this crate around and soliciting feedback. The general response was: it doesn't perform well. Womp womp. Sad trumpet noises.

Over the last few months, I built several prototypes aimed at improving performance. This PR is the culmination of those efforts. Some of the intermediate experiments got pretty messy, so for simplicity I've chosen to just pull the working experiments together into this Mega-PR.

There are three major changes:

  • I've added a new lock-free worker registration and sleep-tracking system (basically, we use an atomic bit-set for tracking seat information). This has significantly reduced the overhead of worker wakeups and allows us to wake workers much more eagerly, but it also means thread pools now have a hard limit of 32 members. I think this is a reasonable constraint (and we can theoretically expand it to 64 on platforms that support 64-bit atomics).

  • There's a totally new (potentially novel) approach to workload balancing that combines work-stealing, heartbeat scheduling, and lazy scheduling. This is detailed in the new readme, so I won't repeat it here.

  • I've also done another comprehensive pass on docs and safety comments.

This code will form the basis for 1.0.0-alpha.5 and (depending on how the next round of feedback goes) may end up being released as 1.0. Hopefully changes will be more incremental going forwards.

@CorvusPrudens

Here are my results. My system isn't particularly quiet, so take it with a grain of salt.

https://gist.github.com/CorvusPrudens/16c4f6dc26770c8578a869978eb85ab8

@NthTensor
Owner Author

There seems to be a potential segfault within join on x86 Windows, which I am trying to diagnose.

Comment thread on src/thread_pool.rs:

if sleeping == 0 {
    return;
}
cold_path();
Owner Author

How odd, this cold_path hint seems to be the source of the segfault...


@giant-chipmunk giant-chipmunk May 1, 2026

Removing that cold_path() still leads to "exit code: 0xc0000005, STATUS_ACCESS_VIOLATION" on my machine (Win11 on a Ryzen 7600x) once the fork_join bench hits forte.

Enabling any LTO (fat or thin) fixes the issue

Owner Author

There's some UB somewhere in the promotion code which I am still tracking down.


@giant-chipmunk giant-chipmunk May 2, 2026

I looked through your code and couldn't find anything, for the life of me.

So I chucked some Codex GPT5.5-High at it while doing the dishes, and it seems to have found an upstream issue in tick_counter.

They do:

/// Returns a current value of the tick counter to use as a staring point
#[cfg(target_arch = "x86_64")]
#[inline]
pub fn start() -> u64 {
    let rax: u64;
    unsafe {
        asm!(
            "mfence",
            "lfence",
            "rdtsc",
            "shl rdx, 32",
            "or rax, rdx",
            out("rax") rax
        );
    }
    rax
}

in https://github.com/sheroz/tick_counter/blob/7511cf114ca536f94c35b675b5a12d5b9a762e96/src/lib.rs#L143

rdx gets clobbered (it is left-shifted within this block) but is not declared as a clobber or an output. Rust assumes that nothing but the declared outputs changes within an asm! block, so the program might rely on an existing value in rdx that has silently been overwritten and run into UB elsewhere.

Owner Author

Interesting! This would be consistent with my observations; the issues started when tick-counter was introduced.

I'll report this upstream and try switching to cputicks. I'd really rather not have to maintain my own inline asm for this.

Thanks for investigating!!

Owner Author

Ah, this seems to have already been reported. sheroz/tick_counter#15 (comment)

What fun, I really should have audited that dep more closely.

@giant-chipmunk

giant-chipmunk commented May 1, 2026

Edit: Unified comments and added more benches


Ran the bench with the tick_counter issue fixed locally:

Default:
https://gist.github.com/giant-chipmunk/2de494bb10dd84c24098eca591a1e0ee

Default + mimalloc:
https://gist.github.com/giant-chipmunk/a2f2898fe4df6e4f9e8e9dfe8acf8900

My Fix: Replaced tick_counter usage with the following code:

#[inline(always)]
fn current_tick() -> u64 {
    let low: u64;
    let high: u64;
    // SAFETY: This reads the CPU timestamp counter and declares both registers written by `rdtsc` as outputs.
    unsafe {
        core::arch::asm!(
            "mfence",
            "lfence",
            "rdtsc",
            out("rax") low,
            out("rdx") high,
            options(nostack, preserves_flags),
        );
    }
    low | (high << 32)
}

Old Benches:

Gist of the bench as is:
https://gist.github.com/giant-chipmunk/d34aa54ff7d41da6082edd6d672202c6

Gist using thin LTO to avoid STATUS_ACCESS_VIOLATION on fork_join bench:
https://gist.github.com/giant-chipmunk/f48ee5cdb1dfd505b73e7819a3216fea

Gist using fat LTO + misc build flags on nightly (see notes inside gist for details):
https://gist.github.com/giant-chipmunk/c8359760702e85188b5bd5a51cd741be

Gist using config above + enabling -Cpanic=abort for benches:
https://gist.github.com/giant-chipmunk/869e0ebcc33c8db1a98cdc8db6e64a99

Got curious about the impact of some misc. build flags so I chucked them in there

The panic=abort results are quite interesting. Some select benchmarks show significant improvements while others (most) barely changed. And on fork_join - throughput_forte there is a significant speedup on smaller workloads but a regression from (11, 2047) onwards.


Also ran with mimalloc out of curiosity:

Gist of the bench as is + mimalloc:
https://gist.github.com/giant-chipmunk/85196dc62eec515d98f5957d4c936e14

Gist using thin LTO + mimalloc:
https://gist.github.com/giant-chipmunk/9dd20ed0f032f5738f565e9ce83e79f9 (overhead - baseline sucks in this case)

Gist using nightly + misc flags + panic=abort + mimalloc:
https://gist.github.com/giant-chipmunk/6c94c820759c60ce8170db3c6b36c733


@NthTensor
Owner Author

Switched to hotclock (new version of cputicks) due to UB in tick_counter. Previous version of cputicks seemed to be much slower than tick_counter, but hotclock is just as fast and doesn't have trivial UB.

This PR is mostly waiting for a 0.2.0 release of hotclock now.

@giant-chipmunk

Reran cargo bench under Win11_x64 with latest stable on two different CPUs

Bench with Ryzen 5 7600x
https://gist.github.com/giant-chipmunk/0dfc8ed85a328550a31843a084612e3e

Bench with Intel i5-9600k
https://gist.github.com/giant-chipmunk/ad5dd3a5200e1e3e1416084df34d7534

No real surprises, except that Rayon's performance degraded significantly more than the other crates' when running fork_join (20, 1048575) on the older, cache-poor CPU.

@NthTensor
Owner Author

Sick! I'm reasonably happy with those results. Chili is still beating us on x86 Windows, and especially on resource-constrained x86 Windows, but we are mostly matching Rayon. There are probably a few more things we can steal from chili (and enable conditionally on Windows) which could close that gap.

I'm going to merge this, I think, and then I'll put out the next beta release as soon as the git-dep is no longer necessary.

@NthTensor NthTensor merged commit e983f85 into main May 12, 2026
4 checks passed
@NthTensor NthTensor deleted the instruction_counters_lazy_scheduling branch May 12, 2026 20:08