Skip to content

Notify execution complete#1975

Merged
chrisstaite-menlo merged 1 commit intomainfrom
feature/Execution-complete
Oct 17, 2025
Merged

Notify execution complete#1975
chrisstaite-menlo merged 1 commit intomainfrom
feature/Execution-complete

Conversation

@chrisstaite-menlo
Copy link
Collaborator

@chrisstaite-menlo chrisstaite-menlo commented Oct 14, 2025

Description

When execution is complete, there's a large amount of IO still to be done. In the mean time a new action could be starting.

Previously an attempt to implement this was quite complex and caused panics. In this implementation a very simple mechanism is used which only executes on success and keeps track of which operations have been notified on the scheduler. This massively simplifies things.

Fixes #1903

Type of change

Please delete options that aren't relevant.

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Ran up on my test cluster and it was running at 100% constantly.

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

@amankrx
Copy link
Collaborator

amankrx commented Oct 14, 2025

I was running the basic_cas.json5 example locally and this failed with a few errors.

  2025-10-14T20:17:44.619643Z ERROR nativelink_service::worker_api_server: error: status: Internal, message: "Worker 1f0a93a9-23c3-642c-93c3-fad1453f631e does not exist in SimpleScheduler::update_action : Failed to operation Uuid(d38a2e62-3793-428f-a8db-5d58c5b7d846)", details: [], metadata: MetadataMap { headers: {} }                                                                                                  

  2025-10-14T20:17:44.626726Z  WARN nativelink_service::worker_api_server: UpdateForWorker channel was closed, thus closing connection to worker node, worker_id: 1f0a93a9-23c3-642c-93c3-fad1453f631e

  2025-10-14T20:17:44.628023Z ERROR nativelink_service::worker_api_server: error: status: InvalidArgument, message: "Worker not found in worker map in refresh_lifetime() 1f0a93a9-23c3-642c-93c3-fad1453f631e : Error refreshing lifetime in worker_keep_alive_received() : Could not process keep_alive from worker in inner_keep_alive()", details: [], metadata: MetadataMap { headers: {} }                                  

  2025-10-14T20:17:44.630005Z ERROR nativelink_worker::running_actions_manager: RunningActionImpl did not cleanup. This is a violation of the requirements, will attempt to do it in the background., operation_id: Uuid(31c42711-07cc-448a-ba1e-5de58f3028d6)                                                                                                                                                                    

Steps to Reproduce:

Run the Basic CAS Example:

bazel run nativelink -- \
    $(pwd)/nativelink-config/examples/basic_cas.json5

Run the Nativelink tests:

bazel test //... --verbose_failures --remote_instance_name=main --remote_cache=grpc://127.0.0.1:50051 --remote_executor=grpc://127.0.0.1:50051

Can you please check if you can reproduce this error?

@chrisstaite-menlo
Copy link
Collaborator Author

chrisstaite-menlo commented Oct 15, 2025

That sounds like a standard overloaded system with a worker keep alive set too low. Notably #1977 should actually help with this as it doesn't require setting up a separate channel.

You may be seeing this with this change as it's specifically designed to keep worker utilisation high.

Copy link
Member

@palfrey palfrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of very minor items, but otherwise happy. Ran this against the tests from #1971 as well, and worked fine.

When execution is complete, there's a large amount of IO still to be done.  In the mean time a new action could be starting.

Previously an attempt to implement this was quite complex and caused panics.  In this implementation a very simple mechanism is used which only executes on success and keeps track of which operations have been notified on the scheduler.  This massively simplifies things.

Fixes #1903
@chrisstaite-menlo
Copy link
Collaborator Author

Think we're all good to merge now I've run all the formatting suites again, should add a pre-commit hook for that! If you could approve please @palfrey

Copy link
Member

@palfrey palfrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@palfrey reviewed 7 of 10 files at r1, 2 of 3 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 1 LGTMs obtained, and all files reviewed

@chrisstaite-menlo chrisstaite-menlo merged commit 8527f25 into main Oct 17, 2025
28 of 29 checks passed
@chrisstaite-menlo chrisstaite-menlo deleted the feature/Execution-complete branch October 17, 2025 10:59
@palfrey palfrey mentioned this pull request Oct 17, 2025
5 tasks
MarcusSorealheis pushed a commit to MarcusSorealheis/nativelink that referenced this pull request Nov 3, 2025
When execution is complete, there's a large amount of IO still to be done.  In the mean time a new action could be starting.

Previously an attempt to implement this was quite complex and caused panics.  In this implementation a very simple mechanism is used which only executes on success and keeps track of which operations have been notified on the scheduler.  This massively simplifies things.

Fixes TraceMachina#1903

Co-authored-by: Chris Staite <chris@yourdreamnet.co.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Report operation complete while uploading results

4 participants