Setuid root does not work in sandbox #2522

Open

lukego opened this issue Nov 7, 2018 · 36 comments
@lukego

lukego commented Nov 7, 2018

Running a setuid-root binary inside a Nix sandbox does not actually set the uid to 0. For example, whoami will report nixbld, and security.wrappers programs will crash with an assertion failure when they detect that the effective uid does not match the file owner's uid.

I can reproduce this either by calling /run/wrappers/bin/sudo in the sandbox (it has to be made visible first) or by running /run/as/root $(which id) with my little asroot NixOS module.

It seems to me that the allow-new-privileges option was added to make this work (#1429), but it is not working with the Nix shipped in NixOS 18.09.

Notes:

  • NixOS 18.09.
  • The setuid binary is on a filesystem that is not mounted with the nosuid flag.
  • nix.conf includes allow-new-privileges = true.

Wild guess: Could this be due to userns sandboxing not providing a usable root user?
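For anyone trying to reproduce this, here is a rough, untested sketch of the symptom. It assumes /run/wrappers/bin has been made visible inside the sandbox (e.g. via the sandbox-paths setting) and that allow-new-privileges = true is set as above:

$ nix-build --option allow-new-privileges true -E 'with import <nixpkgs> {}; runCommand "setuid-test" {} "whoami > $out; /run/wrappers/bin/sudo whoami >> $out || true"'
$ cat result
# whoami reports nixbld; the setuid sudo wrapper does not give uid 0
# (it may instead abort with the assertion failure mentioned above)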

@edolstra
Member

edolstra commented Nov 7, 2018

Wild guess: Could this be due to userns sandboxing not providing a usable root user?

Yes, that sounds likely. From the user_namespaces manpage:

However, if either the user or the group ID of the file has no mapping inside the namespace, the set-user-ID (set-group-ID) bit is silently ignored: the new program is executed, but the process's effective user (group) ID is left unchanged.

So for this to work, there would have to be a uid mapping to the root user in the parent namespace.
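For concreteness, the mapping in question is visible in /proc/self/uid_map inside the build. The numbers below are illustrative, not taken from a real run:

$ cat /proc/self/uid_map          # format: <uid inside ns>  <uid on host>  <count>
      1000       30001          1 # the existing nixbld mapping (30001 stands in for a host nixbldN uid)
# for a setuid-root binary to take effect, an extra line like this would be needed:
#        0           0          1 # map uid 0 in the sandbox to real root on the host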

@lukego
Author

lukego commented Nov 7, 2018

So for this to work, there would have to be a uid mapping to the root user in the parent namespace.

This would only be set when allow-new-privileges is enabled, I suppose?

@copumpkin
Member

Curious what the use case is here; can you elaborate? It feels slightly wrong to be doing setuid things inside a build sandbox, but I'm probably missing something.

@lukego
Author

lukego commented Nov 7, 2018

Curious what the use case is here, can you elaborate?

I should write a FAQ on this :-)

We have a Hydra cluster running CI benchmarks on an open-source telecom network stack called Snabb. The Nix job builds a benchmark report with R Markdown, and that involves running many (thousands of) benchmark jobs. The benchmark jobs need root privileges due to the nature of the software, e.g. running userspace 10G/100G Ethernet device drivers, resolving physical memory addresses, performing DMA, etc. This cannot be wrapped in virtualization because the workloads are performance-sensitive, e.g. enabling the IOMMU could impact performance.

Nix has been working amazingly well for us, based on the setup that @domenkozar cooked up. He has even had it running whole OpenStack clusters within the test environment. Have to do a NixCon talk about it one day. Meanwhile: got to port the %!#@ thing to Nix 2.0 ;)

@lukego
Author

lukego commented Nov 7, 2018

Nix is also used to test pull requests including checking for breakage in our bespoke device drivers etc. So being able to run applications with superuser privileges inside Nix is really crucial to our testing strategy.

@edolstra
Member

edolstra commented Nov 8, 2018

This would only be set when allow-new-privileges is enabled, I suppose?

Yes. Having the mapping unconditionally shouldn't hurt, but it's best to be on the safe side.

lukego added a commit to lukego/nix that referenced this issue Nov 8, 2018
If Nix is run with the --allow-new-privileges option then it should be
possible to run setuid binaries to gain root access. However, this did
not work in practice because the root user/group did not exist in the
sandbox due to exclusion from the kernel user-namespace. The uid would
always be nixbld (1000) even when executing a setuid-root binary.

This change adds uid/gid 0 to the sandbox userns only when
--allow-new-privileges is enabled and makes setuid work as expected.

Note: This makes running setuid executables take effect, but it does not make
it possible to create them. In practice, to gain root in the sandbox you must
both provide --allow-new-privileges and also add a suitable setuid binary to
the Nix sandbox path.

Resolves NixOS#2522.
@lukego
Author

lukego commented Nov 8, 2018

Related problem: Once I acquire uid/gid 0 inside the sandbox I still can't do other things like mount a hugetlbfs filesystem. Is this another kernel namespace causing hassle?

(Can I be rid of these restrictions once and for all, or is it better to take them one at a time?)

@lukego
Author

lukego commented Nov 8, 2018

Specifically the mount() system call is failing with EPERM even though I am root:

mount("none", "/hugetlbfs", "hugetlbfs", MS_MGC_VAL, NULL) = -1 EPERM (Operation not permitted)
geteuid()                               = 0

@Mic92
Member

Mic92 commented Nov 8, 2018

This is a limitation that the kernel puts in place. In a user namespace, processes, even if they are root, are only allowed to mount filesystems that have been white-listed. For that reason Ubuntu, for example, carried a patch that allowed overlayfs. Bind mounts should be fine, though.
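A quick way to see the whitelist in action from an ordinary shell (a sketch, assuming util-linux's unshare, an existing /mnt, and a kernel with unprivileged user namespaces enabled):

$ unshare --user --map-root-user --mount mount -t tmpfs none /mnt && echo ok
ok
$ unshare --user --map-root-user --mount mount -t hugetlbfs none /mnt && echo ok
# fails: mount reports permission denied (EPERM), because hugetlbfs is not
# flagged FS_USERNS_MOUNT while tmpfs is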

@lukego
Author

lukego commented Nov 8, 2018

@Mic92 Can I relax this restriction somehow? How does white-listing work?

I am running a program that expects to be able to mount filesystems within the sandbox. These are not mounted on the host and I don't want to mount them there either.

@Mic92
Member

Mic92 commented Nov 8, 2018

It is set at compile time; for example, procfs sets the FS_USERNS_MOUNT flag:
https://github.com/torvalds/linux/blob/b00d209241ff7ac336543f92ea764b9d6b03032a/fs/proc/root.c#L121

@Mic92
Member

Mic92 commented Nov 8, 2018

The easiest way would be to ask Nix to mount those from somewhere on the host into the sandbox.

@Mic92
Member

Mic92 commented Nov 8, 2018

Though even in the Nix case this is still not ideal, because hugetlbfs in particular would be shared across builds.

@lukego
Author

lukego commented Nov 8, 2018

Tricky. On the one hand, I don't really mind sharing the directory between sandboxes. On the other hand, I want the application to have superuser privileges and not have to make special-case workarounds for every single thing that it wants to do.

Right now I lean towards patching the kernel to whitelist hugetlbfs, but then I'll never get binary cache hits on the kernels :)

@lukego
Author

lukego commented Nov 8, 2018

On further reflection it's probably not safe to share one hugetlbfs mount between all sandboxes. I suspect that PIDs will not be unique across sandboxes and this will lead to collisions on the filesystem if two processes with the same PID try to create their own directories in the same bind-mounted folder.

So it might have to be the kernel patching route.

@Mic92
Member

Mic92 commented Nov 8, 2018

You have to be aware that this flag is only set when it is safe to do so.
It is allowed for tmpfs because memory cgroups also account for those memory allocations.
hugetlbfs might not handle this.
There is a sandbox-paths option in Nix that allows bind-mounting filesystems into a build.
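For the bind-mount route, a rough sketch (untested; the host path is made up, and the exact option name and syntax should be checked against the Nix manual):

# on the host, once:
$ sudo mkdir -p /var/lib/hugetlbfs
$ sudo mount -t hugetlbfs none /var/lib/hugetlbfs
# in nix.conf, bind the host mount to a path inside every build sandbox:
#   sandbox-paths = /hugetlbfs=/var/lib/hugetlbfs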

@lukego
Author

lukego commented Nov 8, 2018

For what definition of "safe", though? I am not concerned about security on the local machine here, e.g. privilege escalation within the sandbox. I am concerned about machine crashes, though :).

If I bind-mount a shared directory from the host, then I believe I'll also need to either prevent sandboxes from using the same PIDs (I'm assuming that they do now) or else patch the application I'm working with to protect against collisions with other processes having the same PID. This seems messy compared with a one-line kernel patch, if that would work.

@lukego
Author

lukego commented Nov 8, 2018

... Could also be that updating the application to support running multiple instances inside user namespaces with a shared hugetlbfs folder [pause for breath] is a legitimate enhancement that could be useful in other deployment contexts in the future. Today I can bypass userns if I really want to, but maybe tomorrow I can't, because a user wants to deploy in a more restricted environment. Have to ponder...

@Mic92
Member

Mic92 commented Nov 8, 2018

By safe I mean actually secure.
Your change would be applied here:
https://github.com/torvalds/linux/blob/b00d209241ff7ac336543f92ea764b9d6b03032a/fs/hugetlbfs/inode.c#L1287

@lukego
Author

lukego commented Nov 8, 2018

Thank you very much for the information. I'll consider my options until the next time I'm back at the keys.

@Mic92
Member

Mic92 commented Nov 8, 2018

It might even be possible to compile a second hugetlbfs module out-of-tree, as a single module that does not have this restriction, to avoid compiling the whole kernel.

@lukego
Author

lukego commented Nov 8, 2018

Reflection: This all seems to "just work" with docker --privileged. I have not seen any problems doing superuser things like mounting hugetlbfs in that context. How come it is different with Nix?

@Mic92
Member

Mic92 commented Nov 8, 2018

Docker does not have user namespaces enabled by default.
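For comparison, a hypothetical one-liner showing why the Docker case behaves differently (requires Docker; the image name is just an example):

$ docker run --rm --privileged ubuntu sh -c 'mkdir -p /hp && mount -t hugetlbfs none /hp && echo mounted as uid $(id -u)'
# --privileged containers stay in the initial user namespace with full capabilities,
# so the FS_USERNS_MOUNT whitelist check does not apply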

@lukego
Author

lukego commented Nov 9, 2018

Could it be that what I need to do is disable user namespaces?

Have to take a step back and think about the goals/requirements for benchmarking system software with Nix, where running on full bare metal takes priority over local privilege management on the build host. Otherwise it is just whack-a-mole.

@Mic92
Member

Mic92 commented Nov 9, 2018

Well, you would need to rewrite some of Nix's sandbox logic for that; at the moment it depends on user namespaces to map the builder's uid/gid on the host to uid 1000/gid 1000 in the build sandbox. Otherwise you could disable the sandbox as a whole.
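To see that mapping from inside a sandboxed build, something like this should work (a sketch; the output described in the comments is what I'd expect, not verified):

$ nix-build -E 'with import <nixpkgs> {}; runCommand "show-ids" {} "{ id; cat /proc/self/uid_map; } > $out"'
$ cat result
# expect something like uid=1000(nixbld) plus a single uid_map line mapping
# uid 1000 in the sandbox to a nixbldN build user on the host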

@lukego
Author

lukego commented Nov 9, 2018

Otherwise you could disable the sandbox as a whole

This may be the right solution. The main downside I see is that untracked dependencies ("impurities") can creep into the builds. However, compared with fighting kernel features like userns, that starts to seem pretty minor.

Are there other pros/cons that I'm overlooking?

@edolstra
Member

edolstra commented Nov 9, 2018

Yeah, disabling user namespaces sounds like the way to go. It could be set per derivation using requiredSystemFeatures, e.g.

requiredSystemFeatures = [ "no-userns" ];

Alternatively, you could just disable sandboxing, since you might not need it for benchmarking anyway?
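As a sketch of how the requiredSystemFeatures route could look (note that a "no-userns" feature does not exist yet; the feature name and the nix.conf line are hypothetical):

$ nix-build -E 'with import <nixpkgs> {}; runCommand "bench" { requiredSystemFeatures = [ "no-userns" ]; } "id -u > $out"'
# builders willing to run such derivations would advertise the feature in nix.conf:
#   system-features = no-userns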

@lukego
Author

lukego commented Nov 9, 2018

Can I keep just the chroot part of the sandbox? Is that compatible with the overall approach that Nix takes?

@7c6f434c
Member

7c6f434c commented Nov 9, 2018 via email

@Mic92
Member

Mic92 commented Nov 9, 2018

Not sure how your setup works, but you could also disable the sandbox just for the benchmark itself and not for the build. You could even build a Docker container with Nix that contains the benchmark, to get some test isolation: https://nixos.org/nixpkgs/manual/#sec-pkgs-dockerTools or a nixos-container based one: https://nixos.org/nixos/manual/index.html#ch-containers
At the moment we delegate most of our privilege-requiring tests to VM-based tests. There was once a plan to also extend this with container-based tests, but there we would probably also want user-namespace-based, unprivileged containers for the nixpkgs use case.
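A minimal sketch of the dockerTools route (the image name and contents are placeholders, and the benchmark entry point would still have to be provided):

$ nix-build -E 'with import <nixpkgs> {}; dockerTools.buildImage { name = "snabb-bench"; tag = "latest"; contents = [ coreutils ]; }'
$ docker load < result
$ docker run --rm --privileged snabb-bench:latest id -u    # runs as uid 0, no userns in the way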

@lukego
Author

lukego commented Nov 9, 2018

This is much trickier than I had anticipated :-) thanks for all the feedback everybody!

Suppose we follow the route of running the build without a Nix sandbox but then having the build/test script construct a sandbox of its own, for example creating a chroot that shadows /var with a scratch folder so that each test can do whatever it likes there without interfering with other tests.

Is that a viable approach, and is it what some of these suggestions are hinting at? This way we would still have a modest level of test isolation, we would avoid advanced kernel features like user namespaces, and we would avoid depending on the intricacies of build.cc internals with seccomp and userns, which are evolving in a different direction than our use case anyway. Or?
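To make that concrete, a rough sketch of what the test script could do (untested; it assumes the build runs as root with sandboxing disabled, and run-test.sh is a placeholder for the actual benchmark):

#!/bin/sh
# give this test a private mount namespace with a scratch /var
scratch=$(mktemp -d)
unshare --mount sh -c "
  mount --make-rprivate /       # keep the bind mount invisible to other tests
  mount --bind $scratch /var    # shadow /var with the per-test scratch dir
  exec ./run-test.sh            # placeholder for the real test entry point
"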

@lukego
Author

lukego commented Nov 9, 2018

disabling user namespaces sounds like the way to go

@edolstra How would that be implemented? Just conditionalize some parts of build.cc to suppress some of the sandbox setup?

@edolstra
Member

edolstra commented Nov 9, 2018

Yes, basically make passing CLONE_NEWUSER and associated setup conditional, in the same way that CLONE_NEWNET is optional.

@lukego
Author

lukego commented Nov 13, 2018

@edolstra I made a naive attempt at conditionalizing the userns, but nothing works with this attempt:

$ nix-build -E 'with import <nixpkgs> {}; runCommand "x" {} "echo ok | tee $out"'
these derivations will be built:
  /nix/store/39x3acks4hq3sp5lgwnw4lcdzw5xi72d-x.drv
error: writing to file: Operation not permitted

Maybe I disabled the user namespace but other parts of build.cc are assuming that the uids have been remapped, or something. Hints would be warmly appreciated if anybody has time to peek at the patch and daemon strace at https://gist.github.com/992d195be54c68cfe501b20727c37c16

@stale

stale bot commented Feb 21, 2021

I marked this as stale due to inactivity. → More info

@stale stale bot added the stale label Feb 21, 2021
@stale

stale bot commented Apr 28, 2022

I closed this issue due to inactivity. → More info

@stale stale bot closed this as completed Apr 28, 2022
@thufschmitt thufschmitt reopened this Feb 24, 2023