Setuid root does not work in sandbox #2522

Open

lukego opened this issue Nov 7, 2018 · 36 comments
@lukego

lukego commented Nov 7, 2018

Running a setuid-root binary inside a Nix sandbox does not actually set the uid to 0. For example, whoami will report nixbld, and security.wrappers programs will crash with an assertion failure when they detect that the effective uid does not match the file owner's uid.

I can reproduce this either by calling /run/wrappers/bin/sudo in the sandbox (it has to be made visible first) or by running /run/as/root $(which id) with my little asroot NixOS module.

It seems to me that the allow-new-privileges option was added to make this work (#1429), but it is not working with the Nix shipped in NixOS 18.09.

Notes:

  • NixOS 18.09.
  • The setuid binary is on a filesystem that is not mounted with the nosuid flag.
  • nix.conf includes allow-new-privileges = true.

Wild guess: Could this be due to userns sandboxing not providing a usable root user?
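For anyone trying to reproduce this, here is a rough, untested sketch of the symptom. It assumes /run/wrappers/bin has been made visible inside the sandbox (e.g. via the sandbox-paths setting) and that allow-new-privileges = true is set as above:

$ nix-build --option allow-new-privileges true -E 'with import <nixpkgs> {}; runCommand "setuid-test" {} "whoami > $out; /run/wrappers/bin/sudo whoami >> $out || true"'
$ cat result
# whoami reports nixbld; the setuid sudo wrapper does not give uid 0
# (it may instead abort with the assertion failure mentioned above)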

@edolstra
Member

edolstra commented Nov 7, 2018

Wild guess: Could this be due to userns sandboxing not providing a usable root user?

Yes, that sounds likely. From the user_namespaces manpage:

However, if either the user or the group ID of the file has no mapping inside the namespace, the set-user-ID (set-group-ID) bit is silently ignored: the new program is executed, but the process's effective user (group) ID is left unchanged.

So for this to work, there would have to be a uid mapping to the root user in the parent namespace.
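For concreteness, the mapping in question is visible in /proc/self/uid_map inside the build. The numbers below are illustrative, not taken from a real run:

$ cat /proc/self/uid_map          # format: <uid inside ns>  <uid on host>  <count>
      1000       30001          1 # the existing nixbld mapping (30001 stands in for a host nixbldN uid)
# for a setuid-root binary to take effect, an extra line like this would be needed:
#        0           0          1 # map uid 0 in the sandbox to real root on the host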

@lukego
Author

lukego commented Nov 7, 2018

So for this to work, there would have to be a uid mapping to the root user in the parent namespace.

This would only be set when allow-new-privileges is enabled, I suppose?

@copumpkin
Member

Curious what the use case is here; can you elaborate? It feels slightly wrong to be doing setuid things inside a build sandbox, but I'm probably missing something.

@lukego
Author

lukego commented Nov 7, 2018

Curious what the use case is here, can you elaborate?

I should write a FAQ on this :-)

We have a Hydra cluster running CI benchmarks on an open-source telecom network stack called Snabb. The Nix job builds a benchmark report with R Markdown, and that involves running many (thousands of) benchmark jobs. The benchmark jobs need root privileges due to the nature of the software, e.g. running userspace 10G/100G Ethernet device drivers, resolving physical memory addresses, performing DMA, etc. This cannot be wrapped in virtualization because the workloads are performance-sensitive, e.g. enabling the IOMMU could impact performance.

Nix has been working amazingly well for us, based on the setup that @domenkozar cooked up. He has even had it running whole OpenStack clusters within the test environment. Have to do a NixCon talk about it one day. Meanwhile: got to port the %!#@ thing to Nix 2.0 ;)

@lukego
Author

lukego commented Nov 7, 2018

Nix is also used to test pull requests including checking for breakage in our bespoke device drivers etc. So being able to run applications with superuser privileges inside Nix is really crucial to our testing strategy.

@edolstra
Member

edolstra commented Nov 8, 2018

This would only be set when allow-new-privileges is enabled, I suppose?

Yes. Having the mapping unconditionally shouldn't hurt, but it's best to be on the safe side.

lukego added a commit to lukego/nix that referenced this issue Nov 8, 2018
If Nix is run with the --allow-new-privileges option then it should be
possible to run setuid binaries to gain root access. However, this did
not work in practice because the root user/group did not exist in the
sandbox due to exclusion from the kernel user-namespace. The uid would
always be nixbld (1000) even when executing a setuid-root binary.

This change adds uid/gid 0 to the sandbox userns only when
--allow-new-privileges is enabled and makes setuid work as expected.

Note: This makes running setuid executables take effect, but it does not make
it possible to create them. In practice, to gain root in the sandbox you must
both provide --allow-new-privileges and also add a suitable setuid binary to
the Nix sandbox path.

Resolves NixOS#2522.
@lukego
Author

lukego commented Nov 8, 2018

Related problem: Once I acquire uid/gid 0 inside the sandbox I still can't do other things like mount a hugetlbfs filesystem. Is this another kernel namespace causing hassle?

(Can I be rid of these restrictions once and for all, or is it better to take them one at a time?)

@lukego
Author

lukego commented Nov 8, 2018

Specifically the mount() system call is failing with EPERM even though I am root:

mount("none", "/hugetlbfs", "hugetlbfs", MS_MGC_VAL, NULL) = -1 EPERM (Operation not permitted)
geteuid()                               = 0

@Mic92
Member

Mic92 commented Nov 8, 2018

This is a limitation that the kernel puts in place. In a user namespace, processes, even if they are root, are only allowed to mount filesystems that have been white-listed. For that reason Ubuntu, for example, carried a patch that allowed overlayfs. Bind mounts should be fine, though.
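A quick way to see the whitelist in action from an ordinary shell (a sketch, assuming util-linux's unshare, an existing /mnt, and a kernel with unprivileged user namespaces enabled):

$ unshare --user --map-root-user --mount mount -t tmpfs none /mnt && echo ok
ok
$ unshare --user --map-root-user --mount mount -t hugetlbfs none /mnt && echo ok
# fails: mount reports permission denied (EPERM), because hugetlbfs is not
# flagged FS_USERNS_MOUNT while tmpfs is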

@lukego
Author

lukego commented Nov 8, 2018

@Mic92 Can I relax this restriction somehow? How does white-listing work?

I am running a program that expects to be able to mount filesystems within the sandbox. These are not mounted on the host and I don't want to mount them there either.

@Mic92
Member

Mic92 commented Nov 8, 2018

It is set at compile time; for example, procfs sets the FS_USERNS_MOUNT flag:
https://github.com/torvalds/linux/blob/b00d209241ff7ac336543f92ea764b9d6b03032a/fs/proc/root.c#L121

@Mic92
Member

Mic92 commented Nov 8, 2018

The easiest way would be to ask Nix to mount those from somewhere on the host into the sandbox.

@Mic92
Member

Mic92 commented Nov 8, 2018

Though even in the Nix case this is still not ideal, because hugetlbfs in particular would be shared across builds.

@lukego
Author

lukego commented Nov 8, 2018

Tricky. On the one hand, I don't really mind sharing the directory between sandboxes. On the other hand, I want the application to have superuser privileges and not have to make special-case workarounds for every single thing that it wants to do.

Right now I lean towards patching the kernel to whitelist hugetlbfs, but then I'll never get binary cache hits on the kernels :)

@lukego
Author

lukego commented Nov 8, 2018

On further reflection it's probably not safe to share one hugetlbfs mount between all sandboxes. I suspect that PIDs will not be unique across sandboxes and this will lead to collisions on the filesystem if two processes with the same PID try to create their own directories in the same bind-mounted folder.

So it might have to be the kernel patching route.

@Mic92
Member

Mic92 commented Nov 8, 2018

You have to be aware that this flag is only set when it is safe to do so.
It is allowed for tmpfs because memory cgroups also account for those memory allocations.
hugetlbfs might not handle this.
There is a sandbox-paths option in Nix that allows bind-mounting filesystems into a build.
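For the bind-mount route, a rough sketch (untested; the host path is made up, and the exact option name and syntax should be checked against the Nix manual):

# on the host, once:
$ sudo mkdir -p /var/lib/hugetlbfs
$ sudo mount -t hugetlbfs none /var/lib/hugetlbfs
# in nix.conf, bind the host mount to a path inside every build sandbox:
#   sandbox-paths = /hugetlbfs=/var/lib/hugetlbfs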

@lukego
Author

lukego commented Nov 8, 2018

For what definition of "safe", though? I am not concerned about security on the local machine here, e.g. privilege escalation within the sandbox. I am concerned about machine crashes, though :).

If I bind-mount a shared directory from the host, then I believe I'll also need to either prevent sandboxes from using the same PIDs (I'm assuming that they do now) or else patch the application I'm working with to protect against collisions with other processes having the same PID. This seems messy compared with a one-line kernel patch, if that would work.

@lukego
Author

lukego commented Nov 8, 2018

... Could also be that updating the application to support running multiple instances inside user namespaces with a shared hugetlbfs folder [pause for breath] is a legitimate enhancement that could be useful in other deployment contexts in the future. Today I can bypass userns if I really want to, but maybe tomorrow I can't, because a user wants to deploy in a more restricted environment. Have to ponder...

@Mic92
Member

Mic92 commented Nov 8, 2018

By safe I mean actually secure.
Your change would be applied here:
https://github.com/torvalds/linux/blob/b00d209241ff7ac336543f92ea764b9d6b03032a/fs/hugetlbfs/inode.c#L1287

@lukego
Author

lukego commented Nov 8, 2018

Thank you very much for the information. I'll consider my options until the next time I'm back at the keys.

@Mic92
Member

Mic92 commented Nov 8, 2018

It might even be possible to compile a second hugetlbfs module out-of-tree, as a single module that does not have this restriction, to avoid compiling the whole kernel.

@lukego
Author

lukego commented Nov 8, 2018

Reflection: This all seems to "just work" with docker --privileged. I have not seen any problems doing superuser things like mounting hugetlbfs in that context. How come it is different with Nix?

@Mic92
Member

Mic92 commented Nov 8, 2018

Docker does not have user namespaces enabled by default.
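For comparison, a hypothetical one-liner showing why the Docker case behaves differently (requires Docker; the image name is just an example):

$ docker run --rm --privileged ubuntu sh -c 'mkdir -p /hp && mount -t hugetlbfs none /hp && echo mounted as uid $(id -u)'
# --privileged containers stay in the initial user namespace with full capabilities,
# so the FS_USERNS_MOUNT whitelist check does not apply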

@lukego
Author

lukego commented Nov 9, 2018

Could it be that what I need to do is disable user namespaces?

Have to take a step back and think about the goals/requirements for benchmarking system software with Nix, where running on full bare metal takes priority over local privilege management on the build host. Otherwise it is just whack-a-mole.

@Mic92
Member

Mic92 commented Nov 9, 2018

Well, you would need to rewrite some of Nix's sandbox logic for that; at the moment it depends on user namespaces to map the builder's uid/gid on the host to uid 1000/gid 1000 in the build sandbox. Otherwise you could disable the sandbox as a whole.
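To see that mapping from inside a sandboxed build, something like this should work (a sketch; the output described in the comments is what I'd expect, not verified):

$ nix-build -E 'with import <nixpkgs> {}; runCommand "show-ids" {} "{ id; cat /proc/self/uid_map; } > $out"'
$ cat result
# expect something like uid=1000(nixbld) plus a single uid_map line mapping
# uid 1000 in the sandbox to a nixbldN build user on the host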

@lukego
Author

lukego commented Nov 9, 2018

Otherwise you could disable the sandbox as a whole

This may be the right solution. The main downside I see is that untracked dependencies ("impurities") can creep into the builds. However, compared with fighting kernel features like userns, that starts to seem pretty minor.

Are there other pros/cons that I'm overlooking?

@edolstra
Member

edolstra commented Nov 9, 2018

Yeah, disabling user namespaces sounds like the way to go. It could be set per derivation using requiredSystemFeatures, e.g.

requiredSystemFeatures = [ "no-userns" ];

Alternatively, you could just disable sandboxing, since you might not need it for benchmarking anyway?
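As a sketch of how the requiredSystemFeatures route could look (note that a "no-userns" feature does not exist yet; the feature name and the nix.conf line are hypothetical):

$ nix-build -E 'with import <nixpkgs> {}; runCommand "bench" { requiredSystemFeatures = [ "no-userns" ]; } "id -u > $out"'
# builders willing to run such derivations would advertise the feature in nix.conf:
#   system-features = no-userns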

@lukego
Author

lukego commented Nov 9, 2018

Can I keep just the chroot part of the sandbox? Is that compatible with the overall approach that Nix takes?

@7c6f434c
Member

7c6f434c commented Nov 9, 2018 via email

@Mic92
Member

Mic92 commented Nov 9, 2018

Not sure how your setup works, but you could also disable the sandbox just for the benchmark itself and not for the build. You could even build a Docker container with Nix that contains the benchmark, to get some test isolation: https://nixos.org/nixpkgs/manual/#sec-pkgs-dockerTools or a nixos-container based one: https://nixos.org/nixos/manual/index.html#ch-containers
At the moment we delegate most of our privilege-requiring tests to VM-based tests. There was once a plan to also extend this with container-based tests, but there we would probably also want user-namespace-based, unprivileged containers for the nixpkgs use case.
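A minimal sketch of the dockerTools route (the image name and contents are placeholders, and the benchmark entry point would still have to be provided):

$ nix-build -E 'with import <nixpkgs> {}; dockerTools.buildImage { name = "snabb-bench"; tag = "latest"; contents = [ coreutils ]; }'
$ docker load < result
$ docker run --rm --privileged snabb-bench:latest id -u    # runs as uid 0, no userns in the way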

@lukego
Author

lukego commented Nov 9, 2018

This is much trickier than I had anticipated :-) thanks for all the feedback everybody!

Suppose we follow the route of running the build without a Nix sandbox but then having the build/test script construct a sandbox of its own, for example creating a chroot that shadows /var with a scratch folder so that each test can do whatever it likes there without interfering with other tests.

Is that a viable approach, and is it what some of these suggestions are hinting at? This way we would still have a modest level of test isolation, we would avoid advanced kernel features like user namespaces, and we would avoid depending on the intricacies of build.cc internals with seccomp and userns, which are evolving in a different direction than our use case anyway. Or?
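To make that concrete, a rough sketch of what the test script could do (untested; it assumes the build runs as root with sandboxing disabled, and run-test.sh is a placeholder for the actual benchmark):

#!/bin/sh
# give this test a private mount namespace with a scratch /var
scratch=$(mktemp -d)
unshare --mount sh -c "
  mount --make-rprivate /       # keep the bind mount invisible to other tests
  mount --bind $scratch /var    # shadow /var with the per-test scratch dir
  exec ./run-test.sh            # placeholder for the real test entry point
"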

@lukego
Author

lukego commented Nov 9, 2018

disabling user namespaces sounds like the way to go

@edolstra How would that be implemented? Just conditionalize some parts of build.cc to suppress some of the sandbox setup?

@edolstra
Member

edolstra commented Nov 9, 2018

Yes, basically make passing CLONE_NEWUSER and associated setup conditional, in the same way that CLONE_NEWNET is optional.

@lukego
Author

lukego commented Nov 13, 2018

@edolstra I made a naive attempt at conditionalizing the userns, but nothing works with this attempt:

$ nix-build -E 'with import <nixpkgs> {}; runCommand "x" {} "echo ok | tee $out"'
these derivations will be built:
  /nix/store/39x3acks4hq3sp5lgwnw4lcdzw5xi72d-x.drv
error: writing to file: Operation not permitted

Maybe I disabled the user namespace but other parts of build.cc are assuming that the uids have been remapped, or something. Hints would be warmly appreciated if anybody has time to peek at the patch and daemon strace at https://gist.github.com/992d195be54c68cfe501b20727c37c16

@stale

stale bot commented Feb 21, 2021

I marked this as stale due to inactivity. → More info

@stale stale bot added the stale label Feb 21, 2021
@stale

stale bot commented Apr 28, 2022

I closed this issue due to inactivity. → More info

@stale stale bot closed this as completed Apr 28, 2022
@thufschmitt thufschmitt reopened this Feb 24, 2023