Skip to content

Single worker stream#1977

Merged
chrisstaite-menlo merged 1 commit intomainfrom
feature/Single-worker-stream
Oct 17, 2025
Merged

Single worker stream#1977
chrisstaite-menlo merged 1 commit intomainfrom
feature/Single-worker-stream

Conversation

@chrisstaite-menlo
Copy link
Collaborator

@chrisstaite-menlo chrisstaite-menlo commented Oct 14, 2025

Description

When there are multiple schedulers in high availability mode then the workers can end up communicating with the wrong scheduler and state can get confused.

Modify the communications such that there is a single bi-directional gRPC stream between the client and the worker. This way they will always be one-to-one mapped even with a load balancer in front.

Type of change

Please delete options that aren't relevant.

  • Breaking change (fix or feature that would cause existing functionality to
    not work as expected)

How Has This Been Tested?

Running on my cluster this now functions perfectly with a one-to-one worker to scheduler and removes the races.

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

@chrisstaite-menlo
Copy link
Collaborator Author

I could in theory make this backwards compatible, but I think it's way nicer to just bump the major and require workers and schedulers to be updated at the same time.

@chrisstaite-menlo chrisstaite-menlo mentioned this pull request Oct 15, 2025
5 tasks
@palfrey
Copy link
Member

palfrey commented Oct 16, 2025

I could in theory make this backwards compatible, but I think it's way nicer to just bump the major and require workers and schedulers to be updated at the same time.

Wondering if we need to bump the major here. This will make schedulers and workers using different versions incompatible but IIRC we've never said that different versions are compatible with each other. We've generally done compatibility bumps only for client config breaks.

@chrisstaite-menlo chrisstaite-menlo force-pushed the feature/Execution-complete branch from 2246ee2 to 6b61f0c Compare October 16, 2025 14:38
@chrisstaite-menlo chrisstaite-menlo force-pushed the feature/Single-worker-stream branch from ce7b882 to 5c1e9dd Compare October 16, 2025 14:39
Copy link
Member

@palfrey palfrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. LGTM, tested against #1971 so happy with those tests as well

@chrisstaite-menlo chrisstaite-menlo force-pushed the feature/Execution-complete branch from 6b61f0c to f1e1e05 Compare October 16, 2025 14:54
@chrisstaite-menlo chrisstaite-menlo force-pushed the feature/Single-worker-stream branch from 5c1e9dd to 2d33b5d Compare October 16, 2025 14:54
@chrisstaite-menlo chrisstaite-menlo force-pushed the feature/Execution-complete branch from f1e1e05 to ecd52ac Compare October 16, 2025 14:56
@chrisstaite-menlo chrisstaite-menlo force-pushed the feature/Single-worker-stream branch from 2d33b5d to 758c63f Compare October 16, 2025 14:56
Copy link
Member

@palfrey palfrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@palfrey reviewed 8 of 8 files at r1, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and all files reviewed

Copy link
Member

@palfrey palfrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 1 LGTMs obtained, and all files reviewed

@chrisstaite-menlo chrisstaite-menlo force-pushed the feature/Execution-complete branch from ecd52ac to ae25509 Compare October 16, 2025 15:50
@chrisstaite-menlo chrisstaite-menlo force-pushed the feature/Single-worker-stream branch from 758c63f to 77e8307 Compare October 16, 2025 15:50
Base automatically changed from feature/Execution-complete to main October 17, 2025 10:59
@palfrey
Copy link
Member

palfrey commented Oct 17, 2025

#1975 is now merged, and I'm guessing that's the source of conflicts here

@chrisstaite-menlo
Copy link
Collaborator Author

Rebased which sorted it out. Happy to merge this if you are.

@palfrey
Copy link
Member

palfrey commented Oct 17, 2025

Rebased which sorted it out. Happy to merge this if you are.

Seeing build failures now

@chrisstaite-menlo chrisstaite-menlo force-pushed the feature/Single-worker-stream branch from e9a793e to 077070b Compare October 17, 2025 14:32
@chrisstaite-menlo chrisstaite-menlo force-pushed the feature/Single-worker-stream branch from 077070b to e116664 Compare October 17, 2025 14:37
@chrisstaite-menlo chrisstaite-menlo force-pushed the feature/Single-worker-stream branch from e116664 to 389a5e1 Compare October 17, 2025 14:38
@chrisstaite-menlo chrisstaite-menlo force-pushed the feature/Single-worker-stream branch from 389a5e1 to c5d6166 Compare October 17, 2025 14:38
@chrisstaite-menlo chrisstaite-menlo force-pushed the feature/Single-worker-stream branch from c5d6166 to 81dc012 Compare October 17, 2025 14:53
@chrisstaite-menlo chrisstaite-menlo force-pushed the feature/Single-worker-stream branch from 81dc012 to eb301cd Compare October 17, 2025 14:59
When there are multiple schedulers in high availability mode then the workers can end up communicating with the wrong scheduler and state can get confused.

Modify the communications such that there is a single bi-directional gRPC stream between the client and the worker.  This way they will always be one-to-one mapped even with a load balancer in front.
@chrisstaite-menlo chrisstaite-menlo force-pushed the feature/Single-worker-stream branch from eb301cd to ac8f8ad Compare October 17, 2025 15:10
@chrisstaite-menlo chrisstaite-menlo merged commit e9250ee into main Oct 17, 2025
29 of 30 checks passed
@chrisstaite-menlo chrisstaite-menlo deleted the feature/Single-worker-stream branch October 17, 2025 16:05
MarcusSorealheis pushed a commit to MarcusSorealheis/nativelink that referenced this pull request Nov 3, 2025
When there are multiple schedulers in high availability mode then the workers can end up communicating with the wrong scheduler and state can get confused.

Modify the communications such that there is a single bi-directional gRPC stream between the client and the worker.  This way they will always be one-to-one mapped even with a load balancer in front.

Co-authored-by: Chris Staite <chris@yourdreamnet.co.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants