84f72e6 to ef3458a
ef3458a to 2246ee2
I was running the
Steps to Reproduce:
Run the Basic CAS Example:
Run the Nativelink tests:
Can you please check if you can reproduce this error?
That sounds like a standard overloaded system with the worker keep-alive set too low. Notably, #1977 should actually help with this, as it doesn't require setting up a separate channel. You may be seeing this with this change because it's specifically designed to keep worker utilisation high.
nativelink-proto/com/github/trace_machina/nativelink/remote_execution/worker_api.proto
2246ee2 to 6b61f0c
6b61f0c to f1e1e05
f1e1e05 to ecd52ac
When execution is complete, there is still a large amount of IO to be done. In the meantime, a new action could be starting. A previous attempt to implement this was quite complex and caused panics. This implementation uses a very simple mechanism which only executes on success and keeps track of which operations have been notified on the scheduler. This massively simplifies things. Fixes #1903
ecd52ac to ae25509
I think we're all good to merge now that I've run all the formatting suites again. I should add a pre-commit hook for that! If you could approve please, @palfrey.
palfrey
left a comment
@palfrey reviewed 7 of 10 files at r1, 2 of 3 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 1 LGTMs obtained, and all files reviewed
When execution is complete, there is still a large amount of IO to be done. In the meantime, a new action could be starting. A previous attempt to implement this was quite complex and caused panics. This implementation uses a very simple mechanism which only executes on success and keeps track of which operations have been notified on the scheduler. This massively simplifies things.
Fixes TraceMachina#1903
Co-authored-by: Chris Staite <chris@yourdreamnet.co.uk>
Description
When execution is complete, there is still a large amount of IO to be done. In the meantime, a new action could be starting.
A previous attempt to implement this was quite complex and caused panics. This implementation uses a very simple mechanism which only executes on success and keeps track of which operations have been notified on the scheduler. This massively simplifies things.
Fixes #1903
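For readers skimming the thread, here is a minimal sketch of the kind of bookkeeping described above. It is not the code in this PR: `WorkerSlots`, `execution_complete`, `finish` and the string operation ids are hypothetical names, and it assumes the tracking lives on the scheduler side and that the early notification is only ever sent when execution succeeded.

```rust
use std::collections::HashSet;

/// Hypothetical scheduler-side view of a single worker (illustration only).
#[derive(Debug)]
struct WorkerSlots {
    /// Operations assigned to the worker that have not fully finished.
    running: HashSet<String>,
    /// Subset of `running` whose execution succeeded and was reported early,
    /// while their remaining IO is still in progress.
    notified_complete: HashSet<String>,
    /// Maximum number of operations allowed to execute at once.
    capacity: usize,
}

impl WorkerSlots {
    fn new(capacity: usize) -> Self {
        Self {
            running: HashSet::new(),
            notified_complete: HashSet::new(),
            capacity,
        }
    }

    /// Operations that still occupy an execution slot.
    /// `notified_complete` is always a subset of `running`, so this cannot underflow.
    fn executing(&self) -> usize {
        self.running.len() - self.notified_complete.len()
    }

    fn can_accept(&self) -> bool {
        self.executing() < self.capacity
    }

    /// Assign a new operation; returns false if the worker has no free slot
    /// or the operation is already assigned.
    fn start(&mut self, op: &str) -> bool {
        if !self.can_accept() {
            return false;
        }
        self.running.insert(op.to_string())
    }

    /// Early notification, assumed to be sent only on successful execution.
    /// Returns true only the first time a running operation is reported.
    fn execution_complete(&mut self, op: &str) -> bool {
        self.running.contains(op) && self.notified_complete.insert(op.to_string())
    }

    /// Final result received (remaining IO done, or the action failed):
    /// forget the operation entirely.
    fn finish(&mut self, op: &str) {
        self.running.remove(op);
        self.notified_complete.remove(op);
    }
}
```

In this sketch, a worker with a capacity of one can be handed a second operation as soon as `execution_complete` has been recorded for the first, even though `finish` for the first has not arrived yet. Repeated notifications for the same operation are ignored, so a slot is never freed twice.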
Type of change
Please delete options that aren't relevant.
How Has This Been Tested?
Ran it up on my test cluster, and it was running at 100% constantly.
Checklist
bazel test //... passes locally
git amend
see some docs