Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify RunningAction and AwaitedAction #782

Merged
merged 1 commit into from
Apr 1, 2024

Conversation

zbirenbaum
Copy link
Member

@zbirenbaum zbirenbaum commented Mar 20, 2024

Refactor scheduler to use a single struct for state management of all incomplete actions. Removes RunningAction entirely and adds an optional worker_id field to AwaitedAction

closes: #781

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

All existing tests pass

Checklist

  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

@zbirenbaum zbirenbaum force-pushed the unify-running-action branch 2 times, most recently from 83019ef to 466fee6 Compare March 20, 2024 21:44
Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained


nativelink-scheduler/src/simple_scheduler.rs line 477 at r2 (raw file):

            );
            Arc::make_mut(&mut awaited_action.current_state).stage = ActionStage::Executing;
            awaited_action.worker_id = Some(worker_id);

Is it possible for this to be previously set? If so what happens if it is?

Copy link
Member Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained


nativelink-scheduler/src/simple_scheduler.rs line 477 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

Is it possible for this to be previously set? If so what happens if it is?

It might be possible for it to be previously set, but it shouldn't be an issue. If an action info in queued_actions has an assigned worker_id, that would mean it was previously running, but got dropped and added back to the queued actions list. If this occurs, then we need to assign it a new worker anyway.

Copy link
Member Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained


nativelink-scheduler/src/simple_scheduler.rs line 477 at r2 (raw file):

Previously, zbirenbaum (Zach Birenbaum) wrote…

It might be possible for it to be previously set, but it shouldn't be an issue. If an action info in queued_actions has an assigned worker_id, that would mean it was previously running, but got dropped and added back to the queued actions list. If this occurs, then we need to assign it a new worker anyway.

It looks like there's a requested change for removing the borrow from &mut awaited_action.current_state but doing so causes a compiler error.

Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 0 of 1 LGTMs obtained


nativelink-scheduler/src/simple_scheduler.rs line 595 at r2 (raw file):

                Code::Internal,
                "Got a result from a worker that should not be running the action, Removing worker. Expected worker {} got worker {worker_id}",
                running_action.worker_id.unwrap(),

Unwrapping a None here would cause a panic, are we assuming all code paths have properly set this field to an Option?

I've seen this pattern with rust, where Options inner values are reached for but disregarding the None case.

@zbirenbaum zbirenbaum force-pushed the unify-running-action branch 2 times, most recently from 9d017f9 to c7ccaf9 Compare March 25, 2024 21:21
Copy link
Member Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, publish-image, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04


nativelink-scheduler/src/simple_scheduler.rs line 595 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

Unwrapping a None here would cause a panic, are we assuming all code paths have properly set this field to an Option?

I've seen this pattern with rust, where Options inner values are reached for but disregarding the None case.

Done.

Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, publish-image, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04

@zbirenbaum zbirenbaum requested a review from allada March 25, 2024 21:23
Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, publish-image, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04

Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, publish-image, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04

@allada allada force-pushed the main branch 2 times, most recently from a2a06c2 to 09defc6 Compare March 26, 2024 18:19
Copy link

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 2 of 1 LGTMs obtained


nativelink-scheduler/src/simple_scheduler.rs line 529 at r4 (raw file):

                .update_action_with_internal_error_from_wrong_worker
                .inc();
            match running_action.worker_id {

nit: This ordering is a bit weird, I suggest something more like:

let Some(running_action_worker_id) = running_action.worker_id else {
  error!(
    "Got a result from a worker that should not be running the action, Removing worker. Expected action to be unassigned got worker {worker_id}"
  );
};

nativelink-scheduler/src/simple_scheduler.rs line 597 at r4 (raw file):

                    Code::Internal,
                    "Got a result from a worker that should not be running the action, Removing worker. Expected worker {} got worker {worker_id}",
                    running_action_worker_id

nit: inline this variable.

@zbirenbaum zbirenbaum force-pushed the unify-running-action branch 2 times, most recently from 1dbe041 to 4aa4086 Compare March 27, 2024 04:11
Copy link
Member Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-lre-cc, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04


nativelink-scheduler/src/simple_scheduler.rs line 529 at r4 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: This ordering is a bit weird, I suggest something more like:

let Some(running_action_worker_id) = running_action.worker_id else {
  error!(
    "Got a result from a worker that should not be running the action, Removing worker. Expected action to be unassigned got worker {worker_id}"
  );
};

Done.


nativelink-scheduler/src/simple_scheduler.rs line 597 at r4 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: inline this variable.

Done.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 2 of 1 LGTMs obtained


nativelink-scheduler/src/simple_scheduler.rs line 529 at r4 (raw file):

Previously, zbirenbaum (Zach Birenbaum) wrote…

Done.

nit: Do this the other way and put it above the outer if-statement. It'll make the code much more readable and de-uglify if running_action.worker_id == Some(*worker_id)

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 1 LGTMs obtained, and 1 discussions need to be resolved

Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r6, all commit messages.
Reviewable status: 2 of 1 LGTMs obtained, and 1 discussions need to be resolved

Copy link
Member Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 1 LGTMs obtained, and 1 discussions need to be resolved


nativelink-scheduler/src/simple_scheduler.rs line 529 at r4 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Do this the other way and put it above the outer if-statement. It'll make the code much more readable and de-uglify if running_action.worker_id == Some(*worker_id)

By other way do you mean the match statement?

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 1 LGTMs obtained, and 1 discussions need to be resolved


nativelink-scheduler/src/simple_scheduler.rs line 529 at r4 (raw file):

Previously, zbirenbaum (Zach Birenbaum) wrote…

By other way do you mean the match statement?

No, I just mean, move this if-statement before this one:

if running_action.worker_id == Some(*worker_id) {

Refactor scheduler to use a single struct for state management of all
incomplete actions. Removes `RunningAction` entirely and adds an
optional `worker_id` field to `AwaitedAction`

closes: TraceMachina#781
Copy link
Member Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-lre-cc, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, vale, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04, and 1 discussions need to be resolved


nativelink-scheduler/src/simple_scheduler.rs line 529 at r4 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

No, I just mean, move this if-statement before this one:

if running_action.worker_id == Some(*worker_id) {

Done.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r7, all commit messages.
Reviewable status: :shipit: complete! 2 of 1 LGTMs obtained

@zbirenbaum zbirenbaum merged commit 7997f03 into TraceMachina:main Apr 1, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simplify action tracking on the scheduler
3 participants