Notify execution complete by chrisstaite-menlo · Pull Request #1975 · TraceMachina/nativelink

chrisstaite-menlo · 2025-10-14T11:00:17Z

Description

When execution is complete, there's a large amount of IO still to be done. In the meantime, a new action could be starting.

A previous attempt to implement this was quite complex and caused panics. This implementation uses a very simple mechanism that only executes on success and keeps track, on the scheduler, of which operations have been notified. This massively simplifies things.
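A minimal sketch of the idea, not the code in this PR: the scheduler remembers which operations have already sent an "execution complete" notification, so a repeated notification is ignored and the final result can tell whether the worker's capacity was already handed back. `NotifiedOperations`, `OperationId`, and both method names are hypothetical and chosen only for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical operation identifier; the real scheduler has its own ID type.
type OperationId = u64;

/// Hypothetical tracker for operations that have reported "execution complete".
#[derive(Default)]
struct NotifiedOperations {
    notified: HashSet<OperationId>,
}

impl NotifiedOperations {
    /// Record a notification sent after a successful execution.
    /// Returns true only the first time an operation is seen, which is the
    /// point at which a new action could be handed to the worker.
    fn notify_execution_complete(&mut self, id: OperationId) -> bool {
        self.notified.insert(id)
    }

    /// Called when the final result arrives, after the remaining IO is done.
    /// Returns true if capacity was already released early for this operation.
    fn finish(&mut self, id: OperationId) -> bool {
        self.notified.remove(&id)
    }
}

fn main() {
    let mut tracker = NotifiedOperations::default();
    // Execution succeeded; outputs are still being uploaded.
    assert!(tracker.notify_execution_complete(1));
    // A duplicate notification changes nothing.
    assert!(!tracker.notify_execution_complete(1));
    // The final result arrives; capacity had already been released.
    assert!(tracker.finish(1));
    // An operation that never notified (for example, a failure) was not released early.
    assert!(!tracker.finish(2));
}
```

Because nothing is recorded for failed executions, the failure path stays exactly as it was before the notification existed.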

Fixes #1903

Type of change

New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Ran it on my test cluster, which was running at 100% utilisation constantly.

Checklist

Updated documentation if needed
Tests added/amended
bazel test //... passes locally
PR is contained in a single commit, using git amend (see some docs)

amankrx · 2025-10-14T20:23:05Z

I was running the basic_cas.json5 example locally and this failed with a few errors.

  2025-10-14T20:17:44.619643Z ERROR nativelink_service::worker_api_server: error: status: Internal, message: "Worker 1f0a93a9-23c3-642c-93c3-fad1453f631e does not exist in SimpleScheduler::update_action : Failed to operation Uuid(d38a2e62-3793-428f-a8db-5d58c5b7d846)", details: [], metadata: MetadataMap { headers: {} }                                                                                                  

  2025-10-14T20:17:44.626726Z  WARN nativelink_service::worker_api_server: UpdateForWorker channel was closed, thus closing connection to worker node, worker_id: 1f0a93a9-23c3-642c-93c3-fad1453f631e

  2025-10-14T20:17:44.628023Z ERROR nativelink_service::worker_api_server: error: status: InvalidArgument, message: "Worker not found in worker map in refresh_lifetime() 1f0a93a9-23c3-642c-93c3-fad1453f631e : Error refreshing lifetime in worker_keep_alive_received() : Could not process keep_alive from worker in inner_keep_alive()", details: [], metadata: MetadataMap { headers: {} }                                  

  2025-10-14T20:17:44.630005Z ERROR nativelink_worker::running_actions_manager: RunningActionImpl did not cleanup. This is a violation of the requirements, will attempt to do it in the background., operation_id: Uuid(31c42711-07cc-448a-ba1e-5de58f3028d6)

Steps to Reproduce:

Run the Basic CAS Example:

bazel run nativelink -- \
    $(pwd)/nativelink-config/examples/basic_cas.json5

Run the Nativelink tests:

bazel test //... --verbose_failures --remote_instance_name=main --remote_cache=grpc://127.0.0.1:50051 --remote_executor=grpc://127.0.0.1:50051

Can you please check if you can reproduce this error?

chrisstaite-menlo · 2025-10-15T07:24:47Z

That sounds like a standard overloaded system with a worker keep alive set too low. Notably #1977 should actually help with this as it doesn't require setting up a separate channel.

You may be seeing this with this change as it's specifically designed to keep worker utilisation high.
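As a rough illustration of the failure mode described here, and not NativeLink's actual implementation: if a worker's keep-alive is delayed past the timeout, the scheduler drops the worker, and any later keep-alive or action update for that worker fails with a "not found" style error like those in the logs above. `WorkerRegistry` and its methods are hypothetical, and the 5-second timeout is an arbitrary value for the example.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical registry tracking when each worker was last heard from.
struct WorkerRegistry {
    timeout: Duration,
    last_seen: HashMap<String, Instant>,
}

impl WorkerRegistry {
    fn new(timeout: Duration) -> Self {
        Self { timeout, last_seen: HashMap::new() }
    }

    fn add_worker(&mut self, worker_id: &str, now: Instant) {
        self.last_seen.insert(worker_id.to_string(), now);
    }

    /// Refresh a worker's lifetime; fails if the worker was already evicted.
    fn keep_alive(&mut self, worker_id: &str, now: Instant) -> Result<(), String> {
        match self.last_seen.get_mut(worker_id) {
            Some(seen) => {
                *seen = now;
                Ok(())
            }
            None => Err(format!("worker not found in worker map: {worker_id}")),
        }
    }

    /// Drop every worker whose last keep-alive is older than the timeout.
    fn evict_timed_out(&mut self, now: Instant) -> Vec<String> {
        let timeout = self.timeout;
        let mut evicted = Vec::new();
        self.last_seen.retain(|id, seen| {
            let alive = now.duration_since(*seen) <= timeout;
            if !alive {
                evicted.push(id.clone());
            }
            alive
        });
        evicted
    }
}

fn main() {
    let start = Instant::now();
    let mut registry = WorkerRegistry::new(Duration::from_secs(5));
    registry.add_worker("w1", start);
    // A keep-alive that arrives in time refreshes the worker.
    assert!(registry.keep_alive("w1", start + Duration::from_secs(3)).is_ok());
    // On an overloaded machine the next keep-alive is late, so the worker is evicted.
    let evicted = registry.evict_timed_out(start + Duration::from_secs(10));
    assert_eq!(evicted, vec!["w1".to_string()]);
    // The late keep-alive now fails because the worker is gone.
    assert!(registry.keep_alive("w1", start + Duration::from_secs(11)).is_err());
}
```

Under this model, raising the timeout or reducing load on the machine avoids the eviction.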

palfrey

Couple of very minor items, but otherwise happy. Ran this against the tests from #1971 as well, and worked fine.

nativelink-proto/com/github/trace_machina/nativelink/remote_execution/worker_api.proto

nativelink-scheduler/src/simple_scheduler_state_manager.rs

chrisstaite-menlo · 2025-10-16T16:58:18Z

I think we're all good to merge now that I've run all the formatting suites again. I should add a pre-commit hook for that! If you could approve please, @palfrey.

palfrey

@palfrey reviewed 7 of 10 files at r1, 2 of 3 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 1 LGTMs obtained, and all files reviewed

When execution is complete, there's a large amount of IO still to be done. In the mean time a new action could be starting. Previously an attempt to implement this was quite complex and caused panics. In this implementation a very simple mechanism is used which only executes on success and keeps track of which operations have been notified on the scheduler. This massively simplifies things.

Fixes TraceMachina#1903

Co-authored-by: Chris Staite <chris@yourdreamnet.co.uk>

chrisstaite-menlo requested a review from amankrx October 14, 2025 11:00

chrisstaite-menlo temporarily deployed to production October 14, 2025 11:00 — with GitHub Actions Inactive

chrisstaite-menlo had a problem deploying to production October 14, 2025 11:00 — with GitHub Actions Failure

chrisstaite-menlo force-pushed the feature/Execution-complete branch from 84f72e6 to ef3458a Compare October 14, 2025 11:13

chrisstaite-menlo temporarily deployed to production October 14, 2025 11:13 — with GitHub Actions Inactive

chrisstaite-menlo force-pushed the feature/Execution-complete branch from ef3458a to 2246ee2 Compare October 14, 2025 12:45

chrisstaite-menlo temporarily deployed to production October 14, 2025 12:45 — with GitHub Actions Inactive

palfrey requested changes Oct 16, 2025

nativelink-proto/com/github/trace_machina/nativelink/remote_execution/worker_api.proto (outdated, resolved)

nativelink-scheduler/src/simple_scheduler_state_manager.rs (outdated, resolved)

chrisstaite-menlo force-pushed the feature/Execution-complete branch from 2246ee2 to 6b61f0c Compare October 16, 2025 14:38

chrisstaite-menlo temporarily deployed to production October 16, 2025 14:40 — with GitHub Actions Inactive

chrisstaite-menlo had a problem deploying to production October 16, 2025 14:40 — with GitHub Actions Error

palfrey reviewed Oct 16, 2025

nativelink-scheduler/src/simple_scheduler_state_manager.rs (outdated, resolved)

chrisstaite-menlo force-pushed the feature/Execution-complete branch from 6b61f0c to f1e1e05 Compare October 16, 2025 14:54

chrisstaite-menlo had a problem deploying to production October 16, 2025 14:54 — with GitHub Actions Error

chrisstaite-menlo enabled auto-merge (squash) October 16, 2025 14:56

chrisstaite-menlo force-pushed the feature/Execution-complete branch from f1e1e05 to ecd52ac Compare October 16, 2025 14:56

chrisstaite-menlo temporarily deployed to production October 16, 2025 14:56 — with GitHub Actions Inactive

chrisstaite-menlo had a problem deploying to production October 16, 2025 14:59 — with GitHub Actions Failure

chrisstaite-menlo force-pushed the feature/Execution-complete branch from ecd52ac to ae25509 Compare October 16, 2025 15:50

chrisstaite-menlo temporarily deployed to production October 16, 2025 15:50 — with GitHub Actions Inactive

palfrey approved these changes Oct 17, 2025

chrisstaite-menlo merged commit 8527f25 into main Oct 17, 2025
28 of 29 checks passed

chrisstaite-menlo deleted the feature/Execution-complete branch October 17, 2025 10:59

palfrey mentioned this pull request Oct 17, 2025

Single worker stream #1977

Merged

5 tasks

Notify execution complete #1975: chrisstaite-menlo merged 1 commit into main from feature/Execution-complete
