[WIP] nixos/containers: add unprivileged option #67336
Here are some errors we should address before merging this. Any input is welcome :)
nscd.service inside an unprivileged container fails with:
Starting/reloading a container results in the following log output:
Those errors do not seem critical, since I've been successfully running and reloading unprivileged containers for more than half a year now. My understanding is that we should just skip the mount step when running inside a container.
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/nixos-container-limitations/1835/7 |
Would it help to disable |
@arianvp Yes, I guess reverting |
Just as a reminder: If we can't make this work for 20.03 we have to fix the documentation from #67232. |
c2b042c
to
381373f
I disabled special-fs mounts inside nixos containers which fixes mount errors. Let me know if you are aware of cases when it might break things. |
381373f
to
aba55d1
Also, creating
This is because |
Seems like the blocking issue has been fixed (systemd/systemd#13622), so as long as we get the new systemd version we can continue to work on this :)
Hello, I'm a bot and I thank you in the name of the community for your contributions. Nixpkgs is a busy repository, and unfortunately sometimes PRs get left behind for too long. Nevertheless, we'd like to help committers reach the PRs that are still important. This PR has had no activity for 180 days, and so I marked it as stale, but you can rest assured it will never be closed by a non-human. If this is still important to you and you'd like to remove the stale label, we ask that you leave a comment. Your comment can be as simple as "still important to me". But there's a bit more you can do: If you received an approval by an unprivileged maintainer and you are just waiting for a merge, you can @ mention someone with merge permissions and ask them to help. You might be able to find someone relevant by using Git blame on the relevant files, or via GitHub's web interface. You can see if someone's a member of the nixpkgs-committers team, by hovering with the mouse over their username on the web interface, or by searching them directly on the list. If your PR wasn't reviewed at all, it might help to find someone who's perhaps a user of the package or module you are changing, or alternatively, ask once more for a review by the maintainer of the package/module this is about. If you don't know any, you can use Git blame on the relevant files, or GitHub's web interface to find someone who touched the relevant files in the past. If your PR has had reviews and nevertheless got stale, make sure you've responded to all of the reviewer's requests / questions. Usually when PR authors show responsibility and dedication, reviewers (privileged or not) show dedication as well. If you've pushed a change, it's possible the reviewer wasn't notified about your push via email, so you can always officially request them for a review, or just @ mention them and say you've addressed their comments. 
Lastly, you can always ask for help at our Discourse Forum, or more specifically, at this thread or at #nixos' IRC channel. |
The corresponding PR was merged in January, so we probably have the right systemd version now?
@uvNikita any news on this? |
@aanderse I think the best path would be to implement containers module v2.0 (see #69414), where we would use systemd-networkd and nspawn files, which would reduce the amount of scripts and workarounds necessary. Adding support for the unprivileged and ephemeral options there should be a trivial task, I think. In fact, this is exactly the way I'm currently using unprivileged, ephemeral containers: a custom stripped-down nixos containers module similar to the one developed in #69414.
@uvNikita great. I'm looking forward to it. Thanks for the reply! |
I marked this as stale due to inactivity. → More info |
Now we're doing correct user-namespacing here as well; for that, a few filesystem-fixes had to be applied. For more context, please refer to NixOS#67336. Credit also goes to the author of the aforementioned PR; I basically pulled these changes into this branch.
Sometimes a configuration for systemd units needs to be built within a `nix-build`. While this is fairly easy for .service-units (where you can easily define overrides), it's not possible for `systemd-nspawn(1)`. This is mostly a hack to get dedicated bind-mounts of store paths from `pkgs.closureInfo` into the configuration without IFD. In the long term we either want to fix this in systemd or find a more suitable solution, though.

nixos/containers-next: initialize first draft for new NixOS containers w/networkd

This is the first batch of changes for a new container-module replacing the current `nixos-container`-subsystem in the long term. The state in here is still strongly inspired by the `containers`[1]-module to declare declarative nspawn-instances by using NixOS config for the host and the container itself. For now, this module uses the tentative namespace `nixos.containers`, but that's subject to change.

This new module will also contain the following key-differences:

* Rather than writing a big abstraction-layer on top, we'll rely on `.nspawn`-units[2]. This has the benefit that we can stop adding options for each new nspawn-feature (such as MACVLANs, ephemeral instances, etc.) because it can be directly written into the `.nspawn`-unit using the module system like

      systemd.nspawn.foo.filesConfig = { BindReadOnly = /* ... */ };

  Also, administrators don't need to learn too much about our abstractions; they only need to know a few basics about the module-system and how to write systemd units.

* This feature strictly enforces `systemd-networkd` on both the container & the host. It can be turned off for containers in the host-namespace without a private network though.

  The reason for this is that the current `nixos-container` implementation has the long-standing bug that the container's uplink is broken *until* the container has booted, since the host-side of the veth-pair is configured in `ExecStartPost=`[3]. This is because there's no proper way to take care of it in an earlier stage, since `systemd-nspawn` creates the interface itself. This has e.g. the implication that services inside the container wrongly assume that they can connect to e.g. an external database via network (since `network{,-online}.target` was reached), however this is not the case due to the unconfigured host-side veth interface.

  However, when using `systemd-networkd(8)` on both sides, this is not the case anymore since systemd will automatically take care of configuring the network correctly when an nspawn unit starts and `networkd` is active.

Apart from a basic draft, this also contains support for RFC1918 IPv4-addresses configured via DHCP and ULA-IPv6 addresses configured via SLAAC and `radvd(8)`, including support for ephemeral containers.

Further additions such as a better config-activation mechanism and a tool to manage containers imperatively will follow.

[1] https://nixos.org/manual/nixos/stable/options.html#opt-containers
[2] https://www.freedesktop.org/software/systemd/man/systemd.nspawn.html#
[3] https://github.com/NixOS/nixpkgs/blob/8b0f315b7691adcee291b2ff139a1beed7c50d94/nixos/modules/virtualisation/nixos-containers.nix#L189-L240

nixos/containers-next: implement small wrapper for nspawn port-forwards

This exposes a given `containerPort` to the host address. So if port 80 from the container is forwarded to the host's port 8080 and the container uses `2001:DB8::42` and the host-side uses `2001:DB8::23` on the veth-interface, then `[2001:DB8::42]:80` will be available on the host as `[2001:DB8::23]:8080`.

nixos/containers-next: implement more advanced networking tests

This change tests various combinations of static & dynamic addressing and also fixes a bug where `radvd(8)` was erroneously configured for veth-pairs where it's actually not needed.

This test is also supposed to show how to use `systemd`-configuration to implement most of the features (for instance there's no custom set of options to implement MACVLANs) and serves as regression-test for future `systemd`-updates in NixOS.

Please note that the `ndppd`-hack is only here because QEMU doesn't do proper IPv6 neighbour resolution. In fact, I left comments whenever some workarounds were needed for the testing-facility.

nixos/tests/container-migration: init

This test is supposed to demonstrate how to migrate a single container to the new subsystem. Of course, docs on how to rewrite config aren't written yet; this is mainly a POC to show that it's generally possible by

* Deploying a new configuration (using `nixos.containers`) being equivalent to the old one.
* Moving the state from `/var/lib/containers` to `/var/lib/machines`.
* Rebooting the host - unfortunately - because otherwise `systemd-networkd` will reach an inconsistent state - at least with v247.
For the reboot-part I also had to change the QEMU vm-builder a bit to actually support persistent boot-disks.

nixos/containers-next: allow static configuration for a virtual zone as well

This is already the case for dynamically assigned addresses (e.g. via SLAAC or DHCPv4) where `0.0.0.0/24` and `::/64` provide a pool of private IPs. However, if such a zone is supposed to be fully static, the same should be possible as well.

nixos/switch-to-configuration: import old config activation changes

This is basically what I tried in NixOS#84608 at first - being able to reload or restart a container based on the NixOS-specific `re{load,start}IfChanged` options for systemd units, but with a few differences:

* I switched back to using `nsenter(1)` from util-linux for the same rationale as in ebb6e38: without this, the activation would hang until a timeout is exceeded if the service-manager inside the container is reloaded.
* I also disabled `systemd-networkd-wait-online.service` inside the container because it'd also hang even if the interfaces are configured properly. We should investigate how to fix it / if it was already fixed at some point.

Also implemented a small test to ensure that a config-activation works fine, even with networking.

nixos/containers-next: fix broken machinectl reboot and probably more

It seems as if systemd ignores `systemd-nspawn@` (the template unit) if both an override and a custom unit for the service (i.e. `systemd-nspawn@containername.service`) exist:

    [root@server:~]# systemctl status systemd-nspawn@ldap
    ● systemd-nspawn@ldap.service
         Loaded: loaded (/nix/store/rm4kigdbzl78iai8jfbgxbslvalk8bwa-unit-systemd-nspawn-ldap.service/systemd-nspawn@ldap.service; linked; vendor preset: enabled)
        Drop-In: /nix/store/fr9zabpvp3077cbb6jnpxm42qxqw9yk2-system-units/systemd-nspawn@.service.d
                 └─overrides.conf
         Active: active (running) since Tue 2021-03-16 15:01:32 UTC; 23min ago

This breaks at least `machinectl reboot`, which needs `RestartForceExitStatus = 133` as setting. For now, I've added all settings to the module itself.

nixos/switch-to-configuration: Implement more generic decisions for config activations in containers

Actually, using `re{load,start}IfChanged` isn't the best decision for containers because some containers have to be reloaded or restarted depending on what has changed. For instance, a new bind-mount requires a `machinectl reboot`, but a change in the NixOS config only needs a `systemctl reload` (which runs `switch-to-configuration` inside the container).

To model this, I decided to add four keywords and an option `activation.strategy` to declarative containers:

* `strategy = "none"` means that the container will be entirely ignored by `switch-to-configuration`.
* `strategy = "restart"` will always `machinectl reboot` the container if a change was detected.
* `strategy = "reload"` will always `systemctl reload` the container if a change was detected.
* `strategy = "dynamic"` will check what has changed inside the container. If only the NixOS config inside the container has changed, a reload will be scheduled, otherwise a restart.

Also did a nearly full rewrite of the activation test to cover several corner-cases and combinations of such settings.
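As a rough illustration of the four strategies listed above, a declarative container might select one as follows. This is only a sketch: the full option path `nixos.containers.instances.<name>.activation.strategy` is an assumption pieced together from the option names mentioned in these commit messages, and `web` is a hypothetical container name.

```nix
{
  # Hypothetical container called `web`. The option path is assumed from the
  # commit messages and has not been checked against the module source.
  nixos.containers.instances.web = {
    # One of "none", "restart", "reload" or "dynamic".
    # "dynamic" schedules a reload if only the container's NixOS config
    # changed and a restart otherwise.
    activation.strategy = "dynamic";
  };
}
```

With `"dynamic"`, adding a bind-mount would be expected to result in a `machinectl reboot`, while a change limited to the container's NixOS config would only trigger a `systemctl reload`.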
nixos/containers-next: add read-only `nixos.containers.rendered` option

This option is an attr-set that maps containers to their NixOS configuration since `nixos.containers.instances` directly transforms the config to a NixOS derivation. Also, the raw `nixos.containers.instances` isn't really usable since it usually contains a list of chunks that are evaluated by the module-system.

This is actually useful to introspect the configuration just as it's done with e.g. `resources.machines`[1] in nixops. For instance, I'm configuring my Prometheus scraping targets like this by gathering all active exporters in my machines and their containers:

    { config, lib, ... }:
    with lib;
    let
      containers = flip mapAttrsToList machine.nixos.containers.rendered (const (x: x.config));
    in
      flip concatMap (attrValues containers)
        (c: flip concatMap (attrValues c.services.prometheus.exporters)
          (exporter: (optional exporter.enable "${config.networking.fqdn}:${toString exporter.port}")))

[1] https://nixos.mayflower.consulting/blog/2018/10/26/nixops-machine-configs/

nixos/all-tests: register tests

Also add a `jobset.nix` to test this on my self-hosted Hydra (which btw uses this feature already :p).

nixos/containers-next: make sure that the module works fine with `restrictedEval` being active

This is necessary to get it running on my Hydra.

nixos/containers-next: add test for SSH inside a nspawn machine

Just another small testcase to confirm that the container's network works fine.

nixos/containers-next: enable private users by default

nixos/systemd-nspawn: make `/etc/systemd/nspawn` mutable

Now only `/etc/systemd/nspawn/<name>.nspawn` will be a symlink rather than having the full directory as a symlink. This is actually consistent with `networkd` (both don't have alternate locations for transient units) and will become necessary when implementing imperative containers since these should also use nspawn units.
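To make the `.nspawn`-unit approach more concrete, here is a minimal sketch of a declaratively rendered unit that would end up as `/etc/systemd/nspawn/foo.nspawn`. The container name `foo` and the chosen settings are illustrative assumptions; only `systemd.nspawn.<name>.filesConfig` and `BindReadOnly` are taken from the commit messages, and `execConfig.PrivateUsers` is assumed to map to the `PrivateUsers=` setting of the `[Exec]` section.

```nix
{
  # Hypothetical unit for a container called `foo`.
  systemd.nspawn.foo = {
    # [Files] section: bind-mount the host's store read-only into the container.
    filesConfig.BindReadOnly = [ "/nix/store" ];
    # [Exec] section: let systemd pick a free UID/GID range for the container
    # (assumed spelling of the user-namespacing setting discussed in this PR).
    execConfig.PrivateUsers = "pick";
  };
}
```

Because only the single `foo.nspawn` file is a symlink, other units in `/etc/systemd/nspawn` could still be written imperatively next to it.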
nixos/containers-next: fix eval after 21.05 breaking changes

`stdenv.lib` and `pkgs.utillinux` are deprecated now and cause an error when disallowing aliases (which is the default when evaluating nixpkgs).

nixos-nspawn: init

This is a first draft for imperative containers - basically a replacement for `nixos-container` - based on Python. It's still missing a few features, but is actually a working POC with the following key-differences:

* Rather than Perl, Python is used now. While the choice of a language is always debatable, I'm pretty convinced that Python is easier to access than Perl and a lot more people are willing to write Python code (that's for instance the reason why the test-driver was eventually ported to Python).
* Similar to `extra-container`[1], this also contains way more features than the stock `nixos-container` implementation. This is because we basically provide all options from `nixos.containers` and evaluate them after that. The additional configs (such as `activation`/`network`/etc) are rendered into JSON and can be read by the script to imperatively create `.nspawn` & `.network` units.

[1] https://github.com/erikarvstedt/extra-container

nixos/containers-next: implement proper user-namespacing support

Now we're doing correct user-namespacing here as well; for that, a few filesystem-fixes had to be applied. For more context, please refer to NixOS#67336. Credit also goes to the author of the aforementioned PR; I basically pulled these changes into this branch.

nixos/containers-next: add support for `LoadCredential=`

With user-namespacing set to `pick`[1], bind-mounts will always be owned by `nouser:nogroup`. This is a problem for secrets since these shouldn't be world-readable and with a `nouser:nogroup` from another user-namespace (the `root` inside the container isn't an actual `root` anymore) the secrets would be unreadable.

To work around this, `LoadCredential=` can be used.
In fact, using `--load-credential` - unfortunately there's no switch for `.nspawn`-units - passes a secret into a container where it can be re-used by using the host's credential-ID as `path` in a `.service`-file inside the container. So basically

    {
      nixos.containers.instances.foo.credentials = [
        { id = "foo"; path = "/run/secrets/foo"; }
      ];
    }

makes the secret available as `/run/host/credentials/foo` and by specifying

    LoadCredential=foo:foo

in `example.service`, the credential will be readable by the `ExecStart=` inside `example.service` from `/run/credentials/example.service/foo`.

[1] https://www.freedesktop.org/software/systemd/man/systemd-nspawn.html#--private-users=

nixos/containers-next-imperative: init

sudo-nspawn: init

This is a slightly modified sudo enabling `--enable-static-sudoers` which ensures that `sudoers.so` is linked statically into the executable[1]:

> --enable-static-sudoers
> By default, the sudoers plugin is built and installed as a
> dynamic shared object. When the --enable-static-sudoers
> option is specified, the sudoers plugin is compiled directly
> into the sudo binary. Unlike --disable-shared, this does
> not prevent other plugins from being used and the intercept
> and noexec options will continue to function.

This is necessary here because of user-namespaced `nspawn`-instances: these have their own UID/GID-range. If a container called `ldap` has `PrivateUsers=pick` enabled, this may look like this:

    $ ls /var/lib/machines
    drwxr-xr-x 15 vu-ldap-0 vg-ldap-0 15 Mar 11  2021 ldap
    -rw-------  1 root      root       0 Sep 12 16:13 .#ldap.lck
    $ id vu-ldap-0
    uid=1758003200(vu-ldap-0) gid=65534(nogroup) groups=65534(nogroup)

However, this means that bind-mounts (such as `/nix/store`) will be owned by `nobody:nogroup` which is a problem for `sudo(8)` which expects `sudoers.so` being owned by `root`. To work around this, the aforementioned configure-flag will be used to ensure that this library is statically linked into `bin/sudo` itself.
We cannot do a full static build though since `sudo(8)` still needs to `dlopen(3)` various other libraries to function properly with PAM.

[1] https://www.sudo.ws/install.html

nixos/switch-to-configuration: fix a few problems with nspawn instances

Config activation of declarative containers used to be error-prone in some cases:

* If a machine was powered off and had its config changed, the activation broke like this:

      systemd-nspawn@ldap.service is not active, cannot reload.

  The easiest workaround is to just skip inactive containers. The host-side configuration - i.e. the `nspawn`-unit and (optionally) the network configuration - is still activated and will be used on the next start.

* Sometimes, `systemd-nspawn@`-instances are marked to be started by the diffing-code. This should not happen since `systemd-nspawn@`-instances are now treated specially which means that these will only be started if they're newly added.

* If both `dbus.service` and an arbitrary container will be reloaded in the same transaction (i.e. in the same `systemctl reload`-call) this will freeze the system making it unreachable even via `ssh(1)` for about two minutes and leaving the following errors in the log:

      Sep 11 21:32:16 roflmayr systemd[1]: Reloading D-Bus System Message Bus.
      Sep 11 21:32:41 roflmayr dbus-send[1868379]: Error org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
      Sep 11 21:32:41 roflmayr systemd[1]: dbus.service: Control process exited, code=exited, status=1/FAILURE
      Sep 11 21:32:41 roflmayr systemd[1]: Reload failed for D-Bus System Message Bus.

  While I'm not entirely sure what's going on here, I realized that this issue disappears if all services that are scheduled for reload are processed before the containers.
  I guess that this avoids host-side system-services interfering with a container's system-manager.

nixos-nspawn: misc improvements & cleanups

This enhances the test-coverage of the script significantly and also adds fixes for a few existing problems such as

* missing call-traces
* a spurious error when invoking the command without arguments

and cleans the code up a bit.

nixos/containers-next: move to subdir and factor out defaults for containers

This was done because imperative & declarative containers have a common base configuration that was duplicated before, so moving it into a file used by both facilities is better here.

To avoid cluttering the `virtualisation/`-subtree of NixOS too much, I decided to create a new subdir for this.

nixos-nspawn: implement activation & networking

However only in a simplified manner - my main intention was to write a replacement for the `containers`-module and this was just a side-effect, so further features should be implemented by the community.

Basically, `nixos-nspawn update` now activates the config on its own, but without support for `strategy = "dynamic";` to avoid having to duplicate the Perl implementation here. Instead, either `reload`/`restart`/`none` is the default and can be overridden with `nixos-nspawn --reload` / `nixos-nspawn --restart`. Since this is a completely manual change anyways, this is IMHO good-enough for now. The same applies to `nixos-nspawn rollback`.

Also, the rendered `.network`-units now support addresses just like declarative containers do, with the exception of IPv6 SLAAC because I'd have to imperatively change `radvd` for this which is out of scope[1].

Finally, the test was enhanced to cover more cases related to the new features.

[1] Actually, this would introduce too much impurity anyways. Instead, `networkd` should implement IPv6 SLAAC for nspawn on its own so we can remove `radvd` and properly implement this here.
nixos/activation-scripts: turn off `var`-script for containers

It's already taken care of and only causes `permission denied`-errors that make config activations seem failed even though they aren't.

Revert "nixos/activation-scripts: turn off `var`-script for containers"

This reverts commit 6f281b9ad31cf6d9ef396de788d06ea4e35f8112.

This is actually not a good idea since the `var`-activation-script is actually the component that ensures that `/var/empty` exists which is `$HOME` for quite a number of services.

nixos/containers-next: only create OS structure in `/var/lib/machines` if it doesn't exist

Because after that, this can screw with permissions if the container is using a private user-namespace. This actually solves the activation issues and the `var`-script can still be used in here.

nixos/tests/containers-next: add testcase for custom `ResolvConf`-setting

nixos/container-migration-test: confirm that nixos-container is still usable after switching to the new API

nixos/containers-next: assert that networkd is used

nixos/tests/containers-next-imperative: ensure that imperative containers can be powered off without state issues

nixos/tests/container-migration: fix eval

nixos/containers-next: fix eval

nixos/qemu-vm: increase /boot to 120M

Otherwise test-cases that install several NixOS generations into `/boot` will fail with `No space left on device`.

nixos/container-migration: actually move state of containers

nixos/containers-next: fix test

nixos/containers-next: s/literalExample/literalExpression/g

nixos/useHostResolvConf: deprecate option

nixos/containers-next-imperative: fix test

* Don't use underscores in hostnames, this appears to break systemd-resolved now.
* Minor fixes for the test.

nixos/containers-next: fix `systemd-networkd-wait-online.service` hanging indefinitely

See NixOS#140669 (comment) for further context.
Co-authored-by: Franz Pletz <fpletz@fnordicwalking.de>
Co-authored-by: zseri <zseri.devel@ytrizja.de>

nixos/containers-next: config -> system-config

nixos/containers-next: confirm that exposed hostnames also work for services like nginx

nixos/containers-next: review fixes

* Fix naming of migration test.
* Explain why `persistentBookDisk` is needed.
* Document that `jobset.nix` is only temporary and should be removed before merging.
* Remove superfluous `touch $out`.

sudo-nspawn: merge with `pkgs.sudo`

The feature can now be activated via `withStaticSudoers`. Also, the patches aren't needed anymore since these are part of the current `sudo`-release that's also in `nixpkgs`.

nixos-nspawn: refactor python setup

* Simplify shebangs
* Fix `python3`-inclusion on `nix-shell`-shebang
* Don't `flake8` the code on build.

Co-authored-by: Sandro <sandro.jaeckel@gmail.com>

nixos/qemu-vm: fix manual evaluation

containers-next: Support independent use of container-options.nix

containers-next: Add bindMounts option

containers-next: Don't shut down imperative containers during rebuild
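For readers wondering how `withStaticSudoers` from the `sudo-nspawn` merge noted above might be switched on, one plausible approach is a package override inside the container's configuration. This is a sketch under two assumptions that the text here does not confirm: that `withStaticSudoers` is exposed as an override argument of `pkgs.sudo`, and that `security.sudo.package` is the option selecting which sudo build is used.

```nix
{ pkgs, ... }:
{
  # Assumption: `withStaticSudoers` is an argument of the pkgs.sudo derivation.
  # It compiles the sudoers plugin into the sudo binary, so sudo no longer
  # relies on the ownership of a bind-mounted sudoers.so.
  security.sudo.package = pkgs.sudo.override { withStaticSudoers = true; };
}
```

This would only matter for containers running with a private user-namespace, where store paths appear as owned by `nobody:nogroup`.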
Motivation for this change
Depends on #67332. Fixes #57087.
Notify maintainers
cc @mmahut @danbst @Mic92 @fpletz @arianvp