Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Breaking] Generalize LRE to arbitrary toolchains #728

Merged
merged 1 commit into from
Mar 19, 2024

Conversation

aaronmondal
Copy link
Member

@aaronmondal aaronmondal commented Mar 6, 2024

This refactors the entire remote execution setup.

We now use "base" images to supply toolchains and have wrappers to create nativelink workers from those base images. This allows us to "enrich" arbitrary toolchain containers to turn them into nativelink workers.

In other words, we now have a framework to import non-Nix containers into our Nix infrastructure, such as "classic" Ubuntu-based toolchain containers.

Toolchain generation is now arbitrarily fine-grained. In practice, this means that for instance the Java and C++ toolchains are now separate entities. This has a large impact on the efficiency of multi-toolchain deployments. The Kubernetes example has been updated accordingly.

The LRE infrastructure is now treated as a special case of the new toolchain setup process. The rbe-configs-gen logic is now an implementation detail and the generator logic is no longer carried over into the final worker images. This brings down the image size for the LRE containers from ~2.5GB to ~1.7GB for C++ and ~600MB for Java. The slight overall reduction in container sizes is due to the omission of the Bazel executable. Bazel is required to generate the Starlark toolchain configurations but doesn't have to be present in the final worker images.


This change is Reviewable

@aaronmondal aaronmondal force-pushed the generalize-lre-approach branch 2 times, most recently from 98990b0 to 633e427 Compare March 6, 2024 22:20
Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 35 files at r1, 1 of 12 files at r2.
Reviewable status: 0 of 1 LGTMs obtained


flake.nix line 141 at r2 (raw file):

            name = "nativelink";
            config = {
              Entrypoint = [(pkgs.lib.getExe' nativelink "nativelink")];

til: function names can have ' https://github.com/NixOS/nixpkgs/blob/master/lib/meta.nix#L174


deployment-examples/kubernetes/worker-lre-cc.json.template line 27 at r2 (raw file):

        "fast": {
          "filesystem": {
            "content_path": "/tmp/.cache/nativelink/data-worker-test/content_path-cas",

Is tmp the intended place for nativelink data? If possible we should be uniformed in examples of where to expect the nativelink directory to be rooted at or maybe not hidden in a dot dir.

Copy link
Member Author

@aaronmondal aaronmondal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained


deployment-examples/kubernetes/worker-lre-cc.json.template line 27 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

Is tmp the intended place for nativelink data? If possible we should be uniformed in examples of where to expect the nativelink directory to be rooted at or maybe not hidden in a dot dir.

I'm not sure what the optimal approach is.

We probably want to move away from /root to better support rootless setups with our recommended defaults. I used /tmp here because it seemed like a fairly standard choice and is usually given permissions 1777, i.e. this works without any additional user (shadow, passwd etc) setup.

An alternative would be to explicitly configure a user and a home directory.

cc @allada @kubevalet

@aaronmondal aaronmondal marked this pull request as ready for review March 7, 2024 01:25
@aaronmondal aaronmondal force-pushed the generalize-lre-approach branch 3 times, most recently from d909587 to 7dc5765 Compare March 7, 2024 01:29
Copy link
Member Author

@aaronmondal aaronmondal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+@allada +@adam-singer +@zbirenbaum +@MarcusSorealheis +@blakehatch cc @kubevalet

FYI Parts of the new docs here might seem a bit out of place. I'll change this with to the upcoming Ubuntu Remote exec example which I'll add in a separate commit.

Reviewable status: 0 of 5 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, publish-image, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, vale, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04 (waiting on @adam-singer, @allada, @blakehatch, @MarcusSorealheis, and @zbirenbaum)

Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 22 of 35 files at r1, 9 of 12 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: 0 of 5 LGTMs obtained (waiting on @allada, @blakehatch, @MarcusSorealheis, and @zbirenbaum)


local-remote-execution/lre-java.nix line 18 at r3 (raw file):

        "${pkgs.gnutar}/bin"
      ]))
    "JAVA_HOME=${pkgs.jdk17_headless}/lib/openjdk"

Do we also need openjdk/bin within the PATH?


local-remote-execution/rbe_configs_gen_adjustments.diff line 14 at r3 (raw file):

+		return "/tmp/workdir"
 	case OSWindows:
 		return "C:/workdir"

Isn't windows C:\ or has / and \ been universally understood by windows now as the same thing?

Copy link
Member Author

@aaronmondal aaronmondal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 5 LGTMs obtained (waiting on @adam-singer, @allada, @blakehatch, @MarcusSorealheis, and @zbirenbaum)


local-remote-execution/rbe_configs_gen_adjustments.diff line 14 at r3 (raw file):

Previously, adam-singer (Adam Singer) wrote…

Isn't windows C:\ or has / and \ been universally understood by windows now as the same thing?

This indeed looks wrong. But the entire thing falls apart on windows anyways as we can't even build the rbe-configs-gen tool on windows in the first place lol.

Copy link
Member Author

@aaronmondal aaronmondal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 5 LGTMs obtained (waiting on @adam-singer, @allada, @blakehatch, @MarcusSorealheis, and @zbirenbaum)


local-remote-execution/lre-java.nix line 18 at r3 (raw file):

Previously, adam-singer (Adam Singer) wrote…

Do we also need openjdk/bin within the PATH?

It doesn't affect toolchain generation. However, this seems like a bug in rbe-configs-gen. It "should" affect toolchain generation.

I believe we should get a generated java_toolchain. Only having a local_java_runtime setup generated looks wrong: https://github.com/TraceMachina/nativelink/blob/main/local-remote-execution/generated/java/BUILD

Another reason to rewrite rbe-configs-gen 😅

Copy link
Contributor

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 12 files at r2, all commit messages.
Reviewable status: 0 of 5 LGTMs obtained (waiting on @adam-singer, @allada, @blakehatch, and @MarcusSorealheis)


local-remote-execution/rbe_configs_gen_adjustments.diff line 14 at r3 (raw file):

Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…

This indeed looks wrong. But the entire thing falls apart on windows anyways as we can't even build the rbe-configs-gen tool on windows in the first place lol.

I think that it should still be set to the correct string for windows since it looks weird to anyone exploring the code. They might think the reason it doesn't work is because of this. I think a note should be added stating windows is unsupported and why in the commit message, as well as here or in a README since having this check gives the impression it is supported.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 5 LGTMs obtained (waiting on @adam-singer, @blakehatch, @MarcusSorealheis, and @zbirenbaum)


-- commits line 20 at r3:
nit: side effect

Code quote:

sideeffect

deployment-examples/kubernetes/worker-lre-cc.json.template line 27 at r2 (raw file):

Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…

I'm not sure what the optimal approach is.

We probably want to move away from /root to better support rootless setups with our recommended defaults. I used /tmp here because it seemed like a fairly standard choice and is usually given permissions 1777, i.e. this works without any additional user (shadow, passwd etc) setup.

An alternative would be to explicitly configure a user and a home directory.

cc @allada @kubevalet

I'm personally not a fan of /root or /tmp. Maybe:
${HOME}/.cache/ instead? I believe this works on windows, but not sure.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 5 LGTMs obtained (waiting on @adam-singer, @blakehatch, @MarcusSorealheis, and @zbirenbaum)


deployment-examples/kubernetes/worker-lre-java.yaml line 21 at r3 (raw file):

          env:
            - name: RUST_LOG
              value: info

nit: This is super verbose, maybe warn instead?


deployment-examples/kubernetes/worker-lre-java.json.template line 31 at r3 (raw file):

            "eviction_policy": {
              // 10gb.
              "max_bytes": 10000000000,

nit: Maybe make this a variable that is set in the yaml?

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 1 of 5 LGTMs obtained (waiting on @adam-singer, @blakehatch, @MarcusSorealheis, and @zbirenbaum)

@MarcusSorealheis
Copy link
Collaborator

MarcusSorealheis commented Mar 10, 2024 via email

@aaronmondal aaronmondal force-pushed the generalize-lre-approach branch 3 times, most recently from 86107ad to 02e0fc3 Compare March 18, 2024 19:34
Copy link
Member Author

@aaronmondal aaronmondal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dismissed @adam-singer from 3 discussions.
Reviewable status: 1 of 5 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Vercel, pre-commit-checks, publish-image, ubuntu-20.04 / stable (waiting on @adam-singer, @blakehatch, @MarcusSorealheis, and @zbirenbaum)


deployment-examples/kubernetes/worker-lre-java.json.template line 31 at r3 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Maybe make this a variable that is set in the yaml?

Agreed. I'll defer this to a second pass where I go over all values again to see whether there are other things we want to set as environment variables such as paths.


deployment-examples/kubernetes/worker-lre-cc.json.template line 27 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

I'm personally not a fan of /root or /tmp. Maybe:
${HOME}/.cache/ instead? I believe this works on windows, but not sure.

'HOME/.cache`seems like the most intuitive approach. It's a dotdir, but this is also the path that other caches use.

However, I'll hold off on this for now since it's unclear to me how to implement the user properly. The current setup doesn't set up users at all which makes it a bit simpler and more generic. Once we add a user we potentially need to add templating etc for the username as well. This is probably something we want to do but it seems better to defer such efforts to when we have Helm charts to handle such templating properly. We could also investigate approaches where we add the user accounts via K8s directly rather than baking them into the container.


local-remote-execution/rbe_configs_gen_adjustments.diff line 14 at r3 (raw file):

Previously, zbirenbaum (Zach Birenbaum) wrote…

I think that it should still be set to the correct string for windows since it looks weird to anyone exploring the code. They might think the reason it doesn't work is because of this. I think a note should be added stating windows is unsupported and why in the commit message, as well as here or in a README since having this check gives the impression it is supported.

This isn't visible on Reviewable, but this code is actually just a patch. It seems better to not change anything that doesn't have any relevant influence on how rbe-configs-gen behaved before this patch.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 5 LGTMs obtained, and pending CI: Remote / large-ubuntu-22.04 (waiting on @adam-singer, @blakehatch, @MarcusSorealheis, and @zbirenbaum)


deployment-examples/kubernetes/worker-lre-cc.json.template line 27 at r2 (raw file):

Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…

'HOME/.cache`seems like the most intuitive approach. It's a dotdir, but this is also the path that other caches use.

However, I'll hold off on this for now since it's unclear to me how to implement the user properly. The current setup doesn't set up users at all which makes it a bit simpler and more generic. Once we add a user we potentially need to add templating etc for the username as well. This is probably something we want to do but it seems better to defer such efforts to when we have Helm charts to handle such templating properly. We could also investigate approaches where we add the user accounts via K8s directly rather than baking them into the container.

In that case lets make a /nativelink directory. /tmp is certainly not the right approach.

This refactors the entire remote execution setup.

We now use "base" images to supply toolchains and have wrappers to
create nativelink workers from those base images. This allows us to
"enrich" arbitrary toolchain containers to turn them into nativelink
workers.

In other words, we now have a framework to import non-Nix containers
into our Nix infrastructure, such as "classic" Ubuntu-based toolchain
containers.

Toolchain generation is now arbitrarily fine-grained. In practice, this
means that for instance the Java and C++ toolchains are now separate
entities. This has a large impact on the efficiency of multi-toolchain
deployments. The Kubernetes example has been updated accordingly.

As a side effect of the new container structures the K8s deployment now
works without root permissions in the nativelink containers.

The LRE infrastructure is now treated as a special case of the new
toolchain setup process. The `rbe-configs-gen` logic is now an
implementation detail and the generator logic is no longer carried over
into the final worker images. This brings down the image size for the
LRE containers from ~2.5GB to ~1.7GB for C++ and ~600MB for Java. The
slight overall reduction in container sizes is due to the omission of
the Bazel executable. Bazel is required to generate the Starlark
toolchain configurations but doesn't have to be present in the final
worker images.
Copy link
Member Author

@aaronmondal aaronmondal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 5 LGTMs obtained, and pending CI: docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), windows-2022 / stable (waiting on @adam-singer, @blakehatch, @MarcusSorealheis, and @zbirenbaum)


deployment-examples/kubernetes/worker-lre-cc.json.template line 27 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

In that case lets make a /nativelink directory. /tmp is certainly not the right approach.

Found that not creating a user runs the worker with root permissions. Added a user to the worker wrapper to support rootless usage by default and using the ~/.cache path by default. We might want to change this to HOME/.cache when implement this in a more templated fashion.

Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 12 files at r2, 10 of 10 files at r4, 2 of 2 files at r5, 3 of 3 files at r6, all commit messages.
Reviewable status: 2 of 5 LGTMs obtained (waiting on @blakehatch, @MarcusSorealheis, and @zbirenbaum)

Copy link
Member Author

@aaronmondal aaronmondal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-@MarcusSorealheis

Reviewable status: 2 of 4 LGTMs obtained (waiting on @blakehatch and @zbirenbaum)

Copy link
Member Author

@aaronmondal aaronmondal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-@blakehatch -@zbirenbaum

Reviewable status: :shipit: complete! 2 of 2 LGTMs obtained

@aaronmondal aaronmondal merged commit 1a43ef9 into TraceMachina:main Mar 19, 2024
25 checks passed
Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 2 LGTMs obtained, and 1 discussions need to be resolved


flake.nix line 142 at r6 (raw file):

            contents = [
              nativelink
              pkgs.dockerTools.caCertificates

here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants