[Breaking] Generalize LRE to arbitrary toolchains #728

aaronmondal · 2024-03-06T05:58:38Z

This refactors the entire remote execution setup.

We now use "base" images to supply toolchains and have wrappers to create nativelink workers from those base images. This allows us to "enrich" arbitrary toolchain containers to turn them into nativelink workers.

In other words, we now have a framework to import non-Nix containers into our Nix infrastructure, such as "classic" Ubuntu-based toolchain containers.

Toolchain generation is now arbitrarily fine-grained. In practice, this means that for instance the Java and C++ toolchains are now separate entities. This has a large impact on the efficiency of multi-toolchain deployments. The Kubernetes example has been updated accordingly.

The LRE infrastructure is now treated as a special case of the new toolchain setup process. The rbe-configs-gen logic is now an implementation detail and the generator logic is no longer carried over into the final worker images. This brings down the image size for the LRE containers from ~2.5GB to ~1.7GB for C++ and ~600MB for Java. The slight overall reduction in container sizes is due to the omission of the Bazel executable. Bazel is required to generate the Starlark toolchain configurations but doesn't have to be present in the final worker images.
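As a rough check of the size figures above, here is a minimal sketch in Python. It assumes decimal units (1 GB = 1000 MB), assumes the previous ~2.5GB image carried both toolchains, and takes the approximate numbers from this description at face value. All names are illustrative only.

```python
# Approximate image sizes quoted in the description, in MB.
# Assumption: 1 GB = 1000 MB, and the old image contained both toolchains.
OLD_COMBINED_MB = 2500  # previous LRE image
NEW_CC_MB = 1700        # new C++ worker image
NEW_JAVA_MB = 600       # new Java worker image


def combined_saving_mb() -> int:
    """Saving when both new images are deployed instead of the old one."""
    return OLD_COMBINED_MB - (NEW_CC_MB + NEW_JAVA_MB)


def single_toolchain_saving_mb(new_image_mb: int) -> int:
    """Saving for a worker that only needs one of the new images."""
    return OLD_COMBINED_MB - new_image_mb
```

Under these assumptions, deploying both images saves roughly 200 MB overall, while a worker that only needs C++ or only needs Java pulls about 800 MB or 1900 MB less, respectively.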


adam-singer

Reviewed 3 of 35 files at r1, 1 of 12 files at r2.
Reviewable status: 0 of 1 LGTMs obtained

flake.nix line 141 at r2 (raw file):

            name = "nativelink";
            config = {
              Entrypoint = [(pkgs.lib.getExe' nativelink "nativelink")];

til: function names can have ' https://github.com/NixOS/nixpkgs/blob/master/lib/meta.nix#L174

deployment-examples/kubernetes/worker-lre-cc.json.template line 27 at r2 (raw file):

        "fast": {
          "filesystem": {
            "content_path": "/tmp/.cache/nativelink/data-worker-test/content_path-cas",

Is /tmp the intended place for nativelink data? If possible, we should be uniform in our examples about where the nativelink directory is rooted, and perhaps not hide it in a dot dir.

aaronmondal

Reviewable status: 0 of 1 LGTMs obtained

deployment-examples/kubernetes/worker-lre-cc.json.template line 27 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

Is /tmp the intended place for nativelink data? If possible, we should be uniform in our examples about where the nativelink directory is rooted, and perhaps not hide it in a dot dir.

I'm not sure what the optimal approach is.

We probably want to move away from /root to better support rootless setups with our recommended defaults. I used /tmp here because it seemed like a fairly standard choice and is usually given permissions 1777, i.e. this works without any additional user (shadow, passwd etc) setup.
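For reference, mode 1777 means readable, writable and executable by everyone, with the sticky bit set. A minimal sketch of checking for it in Python, using only the standard library; the helper names are made up for illustration and aren't part of nativelink.

```python
import os
import stat


def is_shared_scratch_mode(mode: int) -> bool:
    """True if the permission bits are exactly 1777 (rwx for all plus sticky bit)."""
    # S_IMODE strips the file-type bits and keeps only the permission bits.
    return stat.S_IMODE(mode) == 0o1777


def dir_is_shared_scratch(path: str) -> bool:
    """Check an existing directory, e.g. dir_is_shared_scratch("/tmp")."""
    return is_shared_scratch_mode(os.stat(path).st_mode)
```

Whether /tmp actually has this mode depends on the base image, so the result of `dir_is_shared_scratch("/tmp")` isn't guaranteed.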

An alternative would be to explicitly configure a user and a home directory.

cc @allada @kubevalet

aaronmondal

+@allada +@adam-singer +@zbirenbaum +@MarcusSorealheis +@blakehatch cc @kubevalet

FYI: Parts of the new docs here might seem a bit out of place. I'll change this with the upcoming Ubuntu remote execution example, which I'll add in a separate commit.

Reviewable status: 0 of 5 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, publish-image, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, vale, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04 (waiting on @adam-singer, @allada, @blakehatch, @MarcusSorealheis, and @zbirenbaum)

adam-singer

Reviewed 22 of 35 files at r1, 9 of 12 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: 0 of 5 LGTMs obtained (waiting on @allada, @blakehatch, @MarcusSorealheis, and @zbirenbaum)

local-remote-execution/lre-java.nix line 18 at r3 (raw file):

        "${pkgs.gnutar}/bin"
      ]))
    "JAVA_HOME=${pkgs.jdk17_headless}/lib/openjdk"

Do we also need openjdk/bin within the PATH?

local-remote-execution/rbe_configs_gen_adjustments.diff line 14 at r3 (raw file):

+		return "/tmp/workdir"
 	case OSWindows:
 		return "C:/workdir"

Isn't windows C:\ or has / and \ been universally understood by windows now as the same thing?

aaronmondal

Reviewable status: 0 of 5 LGTMs obtained (waiting on @adam-singer, @allada, @blakehatch, @MarcusSorealheis, and @zbirenbaum)

local-remote-execution/rbe_configs_gen_adjustments.diff line 14 at r3 (raw file):

Previously, adam-singer (Adam Singer) wrote…

Isn't windows C:\ or has / and \ been universally understood by windows now as the same thing?

This indeed looks wrong. But the entire thing falls apart on windows anyways as we can't even build the rbe-configs-gen tool on windows in the first place lol.
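On the separator question: as far as I'm aware, most Windows file APIs accept both `/` and `\`, although some tools and `\\?\`-prefixed paths don't. A minimal sketch using Python's `pathlib`, which models Windows path semantics on any host; the helper name is illustrative only and says nothing about how rbe-configs-gen itself handles the string.

```python
from pathlib import PureWindowsPath


def same_windows_path(a: str, b: str) -> bool:
    """Compare two strings under Windows path rules (both separators accepted)."""
    return PureWindowsPath(a) == PureWindowsPath(b)


# Forward slashes are normalized to backslashes when the path is rendered.
normalized = str(PureWindowsPath("C:/workdir"))
```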

aaronmondal

Reviewable status: 0 of 5 LGTMs obtained (waiting on @adam-singer, @allada, @blakehatch, @MarcusSorealheis, and @zbirenbaum)

local-remote-execution/lre-java.nix line 18 at r3 (raw file):

Previously, adam-singer (Adam Singer) wrote…

Do we also need openjdk/bin within the PATH?

It doesn't affect toolchain generation. However, this seems like a bug in rbe-configs-gen. It "should" affect toolchain generation.
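To make the question concrete, here is a small sketch that checks whether `$JAVA_HOME/bin` appears on `PATH` for a given environment. It assumes a POSIX-style, colon-separated `PATH`; the function name and the example paths are hypothetical and not taken from lre-java.nix.

```python
def java_bin_on_path(env: dict) -> bool:
    """True if $JAVA_HOME/bin is one of the PATH entries in `env`."""
    java_home = env.get("JAVA_HOME", "")
    if not java_home:
        return False
    java_bin = java_home.rstrip("/") + "/bin"
    # Assumption: colon-separated PATH, as in a Linux container.
    entries = [p.rstrip("/") for p in env.get("PATH", "").split(":") if p]
    return java_bin in entries
```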

I believe we should get a generated java_toolchain. Only having a local_java_runtime setup generated looks wrong: https://github.com/TraceMachina/nativelink/blob/main/local-remote-execution/generated/java/BUILD

Another reason to rewrite rbe-configs-gen 😅

zbirenbaum

Reviewed 1 of 12 files at r2, all commit messages.
Reviewable status: 0 of 5 LGTMs obtained (waiting on @adam-singer, @allada, @blakehatch, and @MarcusSorealheis)

local-remote-execution/rbe_configs_gen_adjustments.diff line 14 at r3 (raw file):

Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…

This indeed looks wrong. But the entire thing falls apart on windows anyways as we can't even build the rbe-configs-gen tool on windows in the first place lol.

I think that it should still be set to the correct string for windows since it looks weird to anyone exploring the code. They might think the reason it doesn't work is because of this. I think a note should be added stating windows is unsupported and why in the commit message, as well as here or in a README since having this check gives the impression it is supported.

allada

Reviewable status: 0 of 5 LGTMs obtained (waiting on @adam-singer, @blakehatch, @MarcusSorealheis, and @zbirenbaum)

-- commits line 20 at r3:
nit: side effect

Code quote:

sideeffect

deployment-examples/kubernetes/worker-lre-cc.json.template line 27 at r2 (raw file):

Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…

I'm not sure what the optimal approach is.

We probably want to move away from /root to better support rootless setups with our recommended defaults. I used /tmp here because it seemed like a fairly standard choice and is usually given permissions 1777, i.e. this works without any additional user (shadow, passwd etc) setup.

An alternative would be to explicitly configure a user and a home directory.

cc @allada @kubevalet

I'm personally not a fan of /root or /tmp. Maybe:
${HOME}/.cache/ instead? I believe this works on windows, but not sure.

allada

Reviewable status: 0 of 5 LGTMs obtained (waiting on @adam-singer, @blakehatch, @MarcusSorealheis, and @zbirenbaum)

deployment-examples/kubernetes/worker-lre-java.yaml line 21 at r3 (raw file):

          env:
            - name: RUST_LOG
              value: info

nit: This is super verbose, maybe warn instead?

deployment-examples/kubernetes/worker-lre-java.json.template line 31 at r3 (raw file):

            "eviction_policy": {
              // 10gb.
              "max_bytes": 10000000000,

nit: Maybe make this a variable that is set in the yaml?

allada

Reviewable status: 1 of 5 LGTMs obtained (waiting on @adam-singer, @blakehatch, @MarcusSorealheis, and @zbirenbaum)

MarcusSorealheis · 2024-03-10T03:50:54Z

I think this one should also involve Brian and Kube Valet.


-- Marcus Eagan

aaronmondal

Dismissed @adam-singer from 3 discussions.
Reviewable status: 1 of 5 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Vercel, pre-commit-checks, publish-image, ubuntu-20.04 / stable (waiting on @adam-singer, @blakehatch, @MarcusSorealheis, and @zbirenbaum)

deployment-examples/kubernetes/worker-lre-java.json.template line 31 at r3 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Maybe make this a variable that is set in the yaml?

Agreed. I'll defer this to a second pass where I go over all values again to see whether there are other things we want to set as environment variables such as paths.
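One possible shape for that second pass, sketched in Python rather than in the actual template tooling. The `MAX_BYTES` variable name, the default and the rendering function are assumptions for illustration; the real `.json.template` files may be rendered differently.

```python
import json
from string import Template

# Hypothetical fragment of a worker config template with a placeholder.
WORKER_TEMPLATE = Template("""{
  "eviction_policy": {
    "max_bytes": ${MAX_BYTES}
  }
}""")

# 10gb, matching the value currently hard-coded in the example.
DEFAULT_MAX_BYTES = 10_000_000_000


def render_worker_config(env: dict) -> dict:
    """Fill the placeholder from an environment mapping and parse the result."""
    max_bytes = int(env.get("MAX_BYTES", DEFAULT_MAX_BYTES))
    return json.loads(WORKER_TEMPLATE.substitute(MAX_BYTES=max_bytes))
```

In a Kubernetes deployment the value would then come from the `env:` section of the worker yaml, e.g. by calling `render_worker_config(os.environ)`.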

deployment-examples/kubernetes/worker-lre-cc.json.template line 27 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

I'm personally not a fan of /root or /tmp. Maybe:
${HOME}/.cache/ instead? I believe this works on windows, but not sure.

`$HOME/.cache` seems like the most intuitive approach. It's a dotdir, but this is also the path that other caches use.

However, I'll hold off on this for now since it's unclear to me how to implement the user properly. The current setup doesn't set up users at all which makes it a bit simpler and more generic. Once we add a user we potentially need to add templating etc for the username as well. This is probably something we want to do but it seems better to defer such efforts to when we have Helm charts to handle such templating properly. We could also investigate approaches where we add the user accounts via K8s directly rather than baking them into the container.

local-remote-execution/rbe_configs_gen_adjustments.diff line 14 at r3 (raw file):

Previously, zbirenbaum (Zach Birenbaum) wrote…

I think that it should still be set to the correct string for windows since it looks weird to anyone exploring the code. They might think the reason it doesn't work is because of this. I think a note should be added stating windows is unsupported and why in the commit message, as well as here or in a README since having this check gives the impression it is supported.

This isn't visible on Reviewable, but this code is actually just a patch. It seems better to not change anything that doesn't have any relevant influence on how rbe-configs-gen behaved before this patch.

allada

Reviewable status: 1 of 5 LGTMs obtained, and pending CI: Remote / large-ubuntu-22.04 (waiting on @adam-singer, @blakehatch, @MarcusSorealheis, and @zbirenbaum)

deployment-examples/kubernetes/worker-lre-cc.json.template line 27 at r2 (raw file):

Previously, aaronmondal (Aaron Siddhartha Mondal) wrote…

`$HOME/.cache` seems like the most intuitive approach. It's a dotdir, but this is also the path that other caches use.

However, I'll hold off on this for now since it's unclear to me how to implement the user properly. The current setup doesn't set up users at all which makes it a bit simpler and more generic. Once we add a user we potentially need to add templating etc for the username as well. This is probably something we want to do but it seems better to defer such efforts to when we have Helm charts to handle such templating properly. We could also investigate approaches where we add the user accounts via K8s directly rather than baking them into the container.

In that case let's make a /nativelink directory. /tmp is certainly not the right approach.

This refactors the entire remote execution setup.

We now use "base" images to supply toolchains and have wrappers to create nativelink workers from those base images. This allows us to "enrich" arbitrary toolchain containers to turn them into nativelink workers.

In other words, we now have a framework to import non-Nix containers into our Nix infrastructure, such as "classic" Ubuntu-based toolchain containers.

Toolchain generation is now arbitrarily fine-grained. In practice, this means that for instance the Java and C++ toolchains are now separate entities. This has a large impact on the efficiency of multi-toolchain deployments. The Kubernetes example has been updated accordingly.

As a side effect of the new container structures the K8s deployment now works without root permissions in the nativelink containers.

The LRE infrastructure is now treated as a special case of the new toolchain setup process. The `rbe-configs-gen` logic is now an implementation detail and the generator logic is no longer carried over into the final worker images. This brings down the image size for the LRE containers from ~2.5GB to ~1.7GB for C++ and ~600MB for Java. The slight overall reduction in container sizes is due to the omission of the Bazel executable. Bazel is required to generate the Starlark toolchain configurations but doesn't have to be present in the final worker images.

aaronmondal

Reviewable status: 1 of 5 LGTMs obtained, and pending CI: docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), windows-2022 / stable (waiting on @adam-singer, @blakehatch, @MarcusSorealheis, and @zbirenbaum)

deployment-examples/kubernetes/worker-lre-cc.json.template line 27 at r2 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

In that case let's make a /nativelink directory. /tmp is certainly not the right approach.

Found that not creating a user runs the worker with root permissions. Added a user to the worker wrapper to support rootless usage by default, and switched to the ~/.cache path by default. We might want to change this to $HOME/.cache when we implement this in a more templated fashion.
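A minimal sketch of the resulting layout, assuming the store subpath from the earlier `/tmp/.cache/nativelink/data-worker-test/content_path-cas` example simply moves under the user's home directory. The function names and the `/home/worker` example are hypothetical; the actual worker wrapper may choose different paths.

```python
from pathlib import PurePosixPath


def content_path(home: str, store: str = "data-worker-test") -> PurePosixPath:
    """Build the CAS content path below <home>/.cache/nativelink."""
    return PurePosixPath(home) / ".cache" / "nativelink" / store / "content_path-cas"


def running_as_root(uid: int) -> bool:
    """UID 0 is root; a rootless worker should see a non-zero UID."""
    return uid == 0
```

A worker could call `running_as_root(os.geteuid())` at startup to confirm that the container really runs without root permissions.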

adam-singer

Reviewed 1 of 12 files at r2, 10 of 10 files at r4, 2 of 2 files at r5, 3 of 3 files at r6, all commit messages.
Reviewable status: 2 of 5 LGTMs obtained (waiting on @blakehatch, @MarcusSorealheis, and @zbirenbaum)

aaronmondal

-@MarcusSorealheis

Reviewable status: 2 of 4 LGTMs obtained (waiting on @blakehatch and @zbirenbaum)

aaronmondal

-@blakehatch -@zbirenbaum

Reviewable status: complete! 2 of 2 LGTMs obtained

allada

Reviewable status: 2 of 2 LGTMs obtained, and 1 discussions need to be resolved

flake.nix line 142 at r6 (raw file):

            contents = [
              nativelink
              pkgs.dockerTools.caCertificates

here

aaronmondal force-pushed the generalize-lre-approach branch 2 times, most recently from 98990b0 to 633e427 March 6, 2024 22:20

adam-singer reviewed Mar 6, 2024

aaronmondal commented Mar 7, 2024

aaronmondal force-pushed the generalize-lre-approach branch from 633e427 to 3f1d704 March 7, 2024 01:24

aaronmondal marked this pull request as ready for review March 7, 2024 01:25

aaronmondal force-pushed the generalize-lre-approach branch 3 times, most recently from d909587 to 7dc5765 March 7, 2024 01:29

aaronmondal assigned adam-singer, allada, blakehatch, MarcusSorealheis and zbirenbaum Mar 7, 2024

aaronmondal commented Mar 7, 2024

aaronmondal mentioned this pull request Mar 8, 2024 in Custom containerized workers #313 (Closed)

adam-singer requested changes Mar 8, 2024

aaronmondal commented Mar 8, 2024

zbirenbaum requested review from allada and MarcusSorealheis March 8, 2024 20:02

zbirenbaum reviewed Mar 8, 2024

allada reviewed Mar 10, 2024

allada approved these changes Mar 10, 2024

aaronmondal force-pushed the generalize-lre-approach branch 3 times, most recently from 86107ad to 02e0fc3 March 18, 2024 19:34

aaronmondal commented Mar 18, 2024

allada reviewed Mar 18, 2024

aaronmondal force-pushed the generalize-lre-approach branch from 02e0fc3 to 1ca2a24 March 18, 2024 21:14

aaronmondal force-pushed the generalize-lre-approach branch from 1ca2a24 to 81295ef March 19, 2024 03:59

aaronmondal commented Mar 19, 2024

adam-singer approved these changes Mar 19, 2024

aaronmondal unassigned MarcusSorealheis Mar 19, 2024

aaronmondal commented Mar 19, 2024

aaronmondal removed the request for review from MarcusSorealheis March 19, 2024 13:40

aaronmondal unassigned zbirenbaum and blakehatch Mar 19, 2024

aaronmondal commented Mar 19, 2024

aaronmondal merged commit 1a43ef9 into TraceMachina:main Mar 19, 2024
25 checks passed

allada reviewed Mar 28, 2024
