Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nixos/systemd-sandbox: A generic sandboxing module #87661

Open
wants to merge 1 commit into
base: master
from

Conversation

@dasJ
Copy link
Member

dasJ commented May 12, 2020

This can be enabled using systemd.services.<name>.sandbox and supports
different levels of confinement, allowing for new new confinement
options to be added in the future.

We have been using a similar module for about half a year with extreme success, being able to sandbox stuff quickly and efficently.

cc @ajs124 As you also wrote parts of the module iirc wanted me to write the module back then
cc @Izorkin Maybe this can make #85567 more generic
cc @flokli @Mic92 @Ma27 because you might care about this

Motivation for this change
Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.
2 is the highest amount of sandboxing this module supports right now.
'';
default = 0;
type = enum [ 0 1 2 ];

This comment has been minimized.

Copy link
@sorki

sorki May 12, 2020

Member

I think more descriptive strings would be better - something like

enum [ "unsafe" "basic" "full" ];

What happens when you set it to default 1 or 2 btw? :)

This comment has been minimized.

Copy link
@dasJ

dasJ May 12, 2020

Author Member

Yeah, the reason for the numbers is forward compatibility. As I have seen over the past systemd releases, more confinement options are added over time (for example ProtectKernelLogs and ProtectClock). Enabling them for all services which have confinement enabled would most likely break some of them. This way, we can add more options (and numeric values) in the future without having to do someting like [ "unsafe" "basic" "full" "fuller" "fullest" "full2023" "full_newer" ]

I also thought about setting some of the options by default (like PrivateTmp since postgres has its socket in /run now), but I suspect a lot of stuff to break and I cannot test all nixos modules ;) The first that comes to mind is Docker which really needs cgroups, for example.
Also, this affects ALL units, including systemd-shipped units like udev and this PR would most likely break all NixOS systems on the unstable channel.

This comment has been minimized.

Copy link
@sorki

sorki May 12, 2020

Member

Thanks, that makes sense. I will play with it in VM.

@Izorkin
Copy link
Contributor

Izorkin commented May 12, 2020

How to replace

CapabilityBoundingSet = "";
ProtectHome = true;
MemoryDenyWriteExecute = true;

to

CapabilityBoundingSet = [ "CAP_NET_BIND_SERVICE" "CAP_SYS_RESOURCE" ];
ProtectHome = mkDefault true;
MemoryDenyWriteExecute = !(builtins.any (mod: (mod.allowMemoryWriteExecute or false)) pkgs.nginx.modules);
@dasJ dasJ force-pushed the helsinki-systems:systemd-sandbox branch from 00a04c1 to 1bb1719 May 12, 2020
@dasJ
Copy link
Member Author

dasJ commented May 12, 2020

@Izorkin Since I use mkOverride 900, you can just set sandbox = 2 and overwrite all changes you don't need in your module by just assigning as you did before.

This can be enabled using `systemd.services.<name>.sandbox` and supports
different levels of confinement, allowing for new new confinement
options to be added in the future.
Copy link
Contributor

Mic92 left a comment

I think this abstraction goes a bit too far. The options provided by systemd are already quite abstract and it is not quite clear what should go in each level. What if I have for example a python program that could run with all sandboxing options except the MemoryDenyWriteExecute one. I rather would prefer to have these options in every service set since they are quite service specific.

@emilazy
Copy link
Member

emilazy commented May 12, 2020

FWIW, I think the top tier could use splitting up; ProtectSystem, ProtectHome etc. are useful drop-in hardening options for network-exposed services, but PrivateUsers, PrivateNetwork, CapabilityBoundingSet = [] and so on prevent it from being useful for the majority of services. Unfortunately, PrivateUsers prevents giving any additional host-level privileges (such as binding low ports) to the process at all; I'm not sure if this is an inherent limitation or just an infelicity of the current systemd implementation, but it prevented me from using it in #83474.

I support some kind of abstraction for hardening because I don't think individually setting 10+ options on each systemd service is going to scale, especially as new hardening options are introduced over time, and the status quo is very bad:

emily@renko ~> systemd-analyze security
UNIT                                 EXPOSURE PREDICATE HAPPY
ModemManager.service                      5.8 MEDIUM    😐    
NetworkManager-dispatcher.service         9.6 UNSAFE    😨    
NetworkManager.service                    7.8 EXPOSED   🙁    
accounts-daemon.service                   5.4 MEDIUM    😐    
avahi-daemon.service                      9.6 UNSAFE    😨    
bluetooth.service                         6.8 MEDIUM    😐    
colord.service                            8.8 EXPOSED   🙁    
dbus.service                              9.6 UNSAFE    😨    
display-manager.service                   9.8 UNSAFE    😨    
dnscrypt-proxy2.service                   8.3 EXPOSED   🙁    
emergency.service                         9.5 UNSAFE    😨    
geoclue.service                           7.4 MEDIUM    😐    
getty@tty1.service                        9.6 UNSAFE    😨    
getty@tty7.service                        9.6 UNSAFE    😨    
iwd.service                               5.9 MEDIUM    😐    
nix-daemon.service                        9.6 UNSAFE    😨    
nscd.service                              8.2 EXPOSED   🙁    
pcscd.service                             9.6 UNSAFE    😨    
polkit.service                            9.6 UNSAFE    😨    
rescue.service                            9.5 UNSAFE    😨    
rtkit-daemon.service                      7.1 MEDIUM    😐    
systemd-ask-password-console.service      9.3 UNSAFE    😨    
systemd-ask-password-wall.service         9.4 UNSAFE    😨    
systemd-initctl.service                   9.3 UNSAFE    😨    
systemd-journald.service                  4.4 OK        🙂    
systemd-logind.service                    2.8 OK        🙂    
systemd-machined.service                  6.1 MEDIUM    😐    
systemd-timesyncd.service                 2.1 OK        🙂    
systemd-udevd.service                     7.1 MEDIUM    😐    
udisks2.service                           9.6 UNSAFE    😨    
upower.service                            2.3 OK        🙂    
user@1000.service                         9.4 UNSAFE    😨    
zfs-zed.service                           9.6 UNSAFE    😨    
zpool-trim.service                        9.6 UNSAFE    😨    
emily@patchouli ~> systemd-analyze security
UNIT                                 EXPOSURE PREDICATE HAPPY
acme-dns.service                          1.5 OK        🙂    
dbus.service                              9.6 UNSAFE    😨    
dnscrypt-proxy2.service                   8.3 EXPOSED   🙁    
emergency.service                         9.5 UNSAFE    😨    
getty@tty1.service                        9.6 UNSAFE    😨    
nginx.service                             9.3 UNSAFE    😨    
nix-daemon.service                        9.6 UNSAFE    😨    
nscd.service                              8.2 EXPOSED   🙁    
rescue.service                            9.5 UNSAFE    😨    
sshd.service                              9.6 UNSAFE    😨    
systemd-ask-password-console.service      9.3 UNSAFE    😨    
systemd-ask-password-wall.service         9.4 UNSAFE    😨    
systemd-coredump@0.service                3.2 OK        🙂    
systemd-initctl.service                   9.3 UNSAFE    😨    
systemd-journald.service                  4.4 OK        🙂    
systemd-logind.service                    2.8 OK        🙂    
systemd-rfkill.service                    9.3 UNSAFE    😨    
systemd-timesyncd.service                 2.1 OK        🙂    
systemd-udevd.service                     7.1 MEDIUM    😐    
user@1000.service                         9.4 UNSAFE    😨    
zfs-zed.service                           9.6 UNSAFE    😨    
zpool-trim.service                        9.6 UNSAFE    😨    

But I do think it needs some careful consideration as to exactly what interface to expose. I think that rather than monotonic levels of sandboxing, several discrete named categories of sandboxing, plus some convenience bundles that contain common combinations, would be the best approach. Of course, some thought needs to be put in to avoid a proliferation of categories that don't reduce the complexity over the raw systemd options, but e.g. "protect the kernel stuff and devices", "reduce syscall attack surface", and "isolate filesystem with just the things I need" are viable categories that I think would reduce the complexity in hardening significantly.

Some options you haven't included that #83474 mentions / systemd-analyze security looks at: UMask, ProtectClock, ProtectKernelLogs, RestrictNamespaces.

By the way, it would be good to integrate the existing systemd-confinement.nix functionality, although it could use a rewrite to not use RootDirectory and avoid writeable mounts (see #64405 for some progress on that, though DynamicUser itself undoes a lot of mount sandboxing options so you unfortunately probably don't actually want to use it for hardened services in its current state).

@Mic92
Copy link
Contributor

Mic92 commented May 12, 2020

An alternative to this module would be to set some hardening flags by default and only unset them in the services that actually need them. This will also remove the over all bloat of flags we need to set in every module.

@dasJ
Copy link
Member Author

dasJ commented May 12, 2020

@Mic92 this is actually what this module does. It does mkOverride 900 so services can override the sandboxing, for example Docker could use

{
  sandbox = 2;
  serviceConfig.ProtectControlGroups = false;
}
@martinetd
Copy link
Contributor

martinetd commented May 13, 2020

I also think sandbox = 2 is a bit optimistic and won't play nice with other security mechanisms we have (ProtectSystem and DynamicUser don't play well with systemd.services._name_.confinement); I'd rather see something more integrated if possible...

For example PrivateDevice=yes will mount a new /dev within the service chroot, but with confinement set to full it will actually lower the service security because there is now a /dev that didn't exist before! this stuff really isn't obvious to compose. And yes it's not reflected in systemd-analyse security as they don't know about what we've done.
So maybe we can do something more well thought out with the existing confinement module e.g. extend confinement.mode to have intermediate levels maybe?

For some other missing options though (mostly sandbox 1, ProtectHostname or RestrictRealtime really make sense to have always set by default... Even RestrictAddressFamilies...), I like the idea but I think it will be tricky.. Maybe a 3-steps process, like:

  • add option
  • let services use it and set workarounds if required
  • make option default (conditioned on stateVersion)
    I'm not too sure how nixos deals with this kind of changes usually, just setting it by default based on stateVersion might work?
    But just adding the option isn't enough for me, many system modules that could benefit from it will never use it.

cc @aszlig for further opinion on this, I'm just a user of his superb confinement module :)

@Mic92
Copy link
Contributor

Mic92 commented May 13, 2020

@Mic92 this is actually what this module does. It does mkOverride 900 so services can override the sandboxing, for example Docker could use

{
  sandbox = 2;
  serviceConfig.ProtectControlGroups = false;
}

I mean we could enable some sane default options all systemd services without having to enable sandbox explicitly.

@flokli flokli mentioned this pull request May 22, 2020
6 of 10 tasks complete
@aanderse aanderse mentioned this pull request Jul 6, 2020
12 of 18 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants
You can’t perform that action at this time.