
SELinux Module for NVIDIA containers #42

Closed
zvonkok opened this issue Sep 25, 2018 · 13 comments

Comments

zvonkok commented Sep 25, 2018

When we run NVIDIA containers on an SELinux-enabled distribution, we need a separate SELinux module to run the container confined. Without an SELinux module we have to run the container privileged, as this is the only way to allow specific SELinux contexts to interact (read, write, chattr, ...) with the files mounted into the container.

A container running privileged gets the spc_t label, which is allowed to read, write, and chattr files of base types. The base types (device_t, bin_t, proc_t, ...) are introduced by the bind mounts of the hook. A bind mount cannot have two different SELinux contexts, since SELinux operates at the inode level.

I have created the following SELinux module, nvidia-container.te, which works with podman/CRI-O/docker.
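For reference, a module like this is typically compiled and loaded with the selinux-policy development Makefile (a minimal sketch, assuming the selinux-policy-devel package is installed and nvidia-container.te is in the current directory):

make -f /usr/share/selinux/devel/Makefile nvidia-container.pp
semodule -i nvidia-container.pp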

A prerequisite for the SELinux module to work correctly is to ensure that the labels of the mounted files are correct. Therefore I have added an additional line to the oci-nvidia-hook where I run:

nvidia-container-cli -k list | restorecon -v -f -

With this, every time a container is started the files to be mounted will have the correct SELinux label and the SELinux module will work.
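A minimal sketch of what such a wrapper hook could look like (the path and arguments of the real prestart hook binary are illustrative):

#!/bin/sh
# Relabel the driver files the hook is about to bind-mount, then hand over to the
# actual prestart hook; the OCI state JSON on the wrapper's stdin is left untouched.
nvidia-container-cli -k list | restorecon -v -f -
exec /usr/bin/nvidia-container-runtime-hook "$@"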

Now I can run NVIDIA containers without --privileged, with --cap-drop=ALL and --security-opt=no-new-privileges:

podman run  --security-opt=no-new-privileges --cap-drop=ALL --security-opt label=type:nvidia_container_t \
            --rm -it docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1
docker run  --security-opt=no-new-privileges --cap-drop=ALL --security-opt label=type:nvidia_container_t \
            --rm -it docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1

podman run  --user 1000:1000 --security-opt=no-new-privileges --cap-drop=ALL --security-opt label=type:nvidia_container_t \
            --rm -it docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1
docker run  --user 1000:1000 --security-opt=no-new-privileges --cap-drop=ALL --security-opt label=type:nvidia_container_t \
            --rm -it docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1

zvonkok (Author) commented Oct 8, 2018

Besides restoring the context of the NVIDIA files for mounting, one crucial part of the story is the correct label of /var/lib/kubelet/.*: the label has to be container_file_t, since the device plugin reads from and communicates with kubelet.sock and kubelet_internal_checkpoint.
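One way to apply that label persistently on the host (a sketch; adjust the path to your kubelet installation):

semanage fcontext -a -t container_file_t '/var/lib/kubelet(/.*)?'
restorecon -Rv /var/lib/kubelet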

With the above-mentioned module it is possible to run the device plugin with a restricted SCC and with:

        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
          seLinuxOptions:
            type: nvidia_container_t

There is no need to run the device plugin or the GPU workload privileged in the SELinux context.

3XX0 (Member) commented Oct 24, 2018

I don't understand why you need more than dev_rw_xserver_misc(nvidia_container_t). The runtime hook shouldn't be affected by the container policy, so why does your policy need rules for the runtime?

zvonkok (Author) commented Oct 24, 2018

# /usr/share/selinux/devel/include

policy_module(nvidia-container, 0.1)

gen_require(`
        type container_runtime_tmpfs_t;
        type xserver_exec_t;
')

I am basing nvidia_container_t on container_t, and the next rule allows the NVIDIA container to exit cleanly.

container_domain_template(nvidia_container)
allow nvidia_container_t container_runtime_t:process sigchld;

The hook mounts /proc/driver/nvidia/gpus/0000:00:1d as a tmpfs in the container, which gets a container_runtime_tmpfs_t label, so one has to allow nvidia_container_t to get attributes of, list, and read directories and files that carry this label.

# --- podman/docker
getattr_dirs_pattern(nvidia_container_t, container_runtime_tmpfs_t, container_runtime_tmpfs_t)
list_dirs_pattern(nvidia_container_t, container_runtime_tmpfs_t, container_runtime_tmpfs_t)
read_files_pattern(nvidia_container_t, container_runtime_tmpfs_t, container_runtime_tmpfs_t)

The bin files mounted by the hook carry the xserver_exec_t label; the next rule allows nvidia_container_t to access files with this label and execute them.

# --- running nvidia-smi
allow nvidia_container_t xserver_exec_t:file { entrypoint execute getattr };

This should be clear: the devices carry xserver_misc_device_t, so again we allow nvidia_container_t to access the devices.

# --- allow nvidia_container_t xserver_misc_device_t:chr_file { getattr ioctl open read write };
# --- alloc mem, ... /dev/nvidia*
dev_rw_xserver_misc(nvidia_container_t)

There is currently no problem with the libraries; container_t or nvidia_container_t can read lib_t or container_file_t.

The hook creates a symlink for each library, and the symlinks get the correct label (container_file_t) inherited from the parent folder.

A symlink has its own inode and hence gets its own SELinux label.

root@e67d1214d198:/usr/lib/x86_64-linux-gnu# ls -lZ libcuda.so.1 
lrwxrwxrwx. 1 root root system_u:object_r:container_file_t:s0:c301,c422 17 Oct 24 18:25 libcuda.so.1 -> libcuda.so.410.48

This does not mean that you can create correctly labelled symlinks to a file that you're not able to read; the type following the symlink must have permission to read both the symlink source and destination.

flx42 (Member) commented Oct 24, 2018

I believe @3XX0 expected the hook to run with context unconfined_u:unconfined_r:unconfined_t. But I just checked and it's indeed system_u:system_r:container_runtime_t.

zvonkok (Author) commented Oct 24, 2018

Beware: if you're running with podman, you will need at least container-selinux >= 2.73. Prior to that version podman runs the hook with the context unconfined_u:unconfined_r:xserver_t; this is fixed by recent container-selinux packages.

This will mount e.g. /etc/nvidia/nvidia-application-profiles-rc.d/ as unconfined_u:object_r:xserver_tmpfs_t:s0, so the policy will not work.
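A quick check on an rpm-based host (a hedged sketch):

rpm -q container-selinux    # should report 2.73 or newer before relying on this policy with podman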

3XX0 (Member) commented Oct 24, 2018

No, what I would like to know is why we need the container_runtime_* rules (i.e. all the non-xserver_* ones) in the first place.
Tmpfs/sigchld handling is pretty basic and should have been inherited from the container domain.

But really, the problem I have with this patch is that you are assuming that the host driver files carry xserver contexts, and we can't really be opinionated about that.
So I see three options:

  • We upstream to container-selinux a tunable or optional policy that allows containers to access X11 assets.
  • We mandate specific NVIDIA file contexts and use these in the policy you proposed.
  • We provide an interface instead, so that users can do whatever they want.

3XX0 (Member) commented Oct 25, 2018

So I just looked into it and I think this is due to the fact that your policy doesn't have the svirt_sandbox_domain attribute:

   container_t
      corenet_unlabeled_type
      domain
      kernel_system_state_reader
      mcs_constrained_type
      process_user_target
      container_domain
      container_net_domain
      syslog_client_type
      pcmcia_typeattr_7
      pcmcia_typeattr_6
      pcmcia_typeattr_5
      pcmcia_typeattr_4
      pcmcia_typeattr_3
      pcmcia_typeattr_2
      pcmcia_typeattr_1
      sandbox_net_domain
      sandbox_caps_domain
      svirt_sandbox_domain
   Aliases
      svirt_lxc_net_t
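For reference, output in this form comes from the setools seinfo utility (the exact flags vary between setools versions), e.g.:

seinfo -tcontainer_t -x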

zvonkok (Author) commented Oct 25, 2018

You're right. I've updated the policy with the missing attributes; see the updated nvidia-container.te.

But the problem with the tmpfs remains, because container_domain is only allowed dir read:

 container_domain container_runtime_tmpfs_t:dir read;

But the container is doing more than just dir read; that's why I have added the other rules. container_runtime_t can do getattr, list_dirs and read_files_pattern, but container_domain cannot.
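One way to compare what the two are actually allowed on the tmpfs type, assuming the setools sesearch utility is installed:

sesearch -A -s container_domain -t container_runtime_tmpfs_t
sesearch -A -s container_runtime_t -t container_runtime_tmpfs_t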

Does /proc/driver/nvidia/gpus/... need to be mounted as a tmpfs at all? proc_t can be read by container_t just fine.

The host labels are another point of discussion. The default selinux-policy labels the bin files as xserver_exec_t and the devices as xserver_misc_device_t. The reason I rely on these types is that I do not want to give the NVIDIA container access to base types; I want to contain it by the default rules and only allow access to xserver_*.

3XX0 (Member) commented Oct 25, 2018

Oh right, I assumed that was already the case, but thinking more about it, runc always bind-mounts on top of its container_runtime_tmpfs_t files, so I guess it never needs read access like we do.
That's what I have on my end:

   allow container_domain container_runtime_tmpfs_t : sock_file { write getattr append open } ; 
   allow container_domain container_runtime_tmpfs_t : lnk_file { read getattr } ; 
   allow svirt_sandbox_domain file_type : file entrypoint ; 
   allow svirt_sandbox_domain file_type : dir { getattr search open } ; 
   allow container_domain file_type : filesystem getattr ; 
   allow container_domain container_runtime_tmpfs_t : dir { getattr search open } ; 
   allow svirt_sandbox_domain file_type : filesystem getattr ; 

So we might just be missing file { open read } and dir { read } unless search is sufficient. It would be nice to have that by default in container-selinux though.
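For completeness, the usual way to find out exactly which permissions are missing is to let the container hit the denial and then feed the AVC messages to audit2allow (a sketch, assuming auditd is logging the denials; the module name is arbitrary):

ausearch -m avc -ts recent | audit2allow -m nvidia_container_extra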

I understand why you do it, but coming back to our options, I would prefer we provide better file contexts than these ones in the first place (e.g. nvidia_device_t, nvidia_exec_t):

/usr/(.*/)?nvidia/.+\.so(\..*)?                    regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/(.*/)?lib(64)?(/.*)?/nvidia/.+\.so(\..*)?     regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/lib.*/libnvidia\.so(\.[^/]*)*                 regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/lib(/.*)?/nvidia/.+\.so(\..*)?                regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/lib(/.*)?/nvidia_drv.*\.so(\.[^/]*)*          regular file       system_u:object_r:textrel_shlib_t:s0 
/dev/nvidia.*                                      character device   system_u:object_r:xserver_misc_device_t:s0 
/usr/bin/nvidia.*                                  regular file       system_u:object_r:xserver_exec_t:s0 
/usr/lib/nvidia.*\.so(\.[^/]*)*                    regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/lib/libnvidia\.so(\.[^/]*)*                   regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/lib/nvidia-graphics(-[^/]*/)?libXvMCNVIDIA\.so.* regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/lib/nvidia-graphics(-[^/]*/)?libnvidia.*\.so(\.[^/]*)* regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/lib/nvidia-graphics(-[^/]*/)?libGL(core)?\.so(\.[^/]*)* regular file       system_u:object_r:textrel_shlib_t:s0 
/var/log/nvidia-installer\.log.*                   regular file       system_u:object_r:xserver_log_t:s0 
/usr/lib/vdpau/libvdpau_nvidia\.so.*               regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/lib/xorg/modules/extensions/nvidia(-[^/]*)?/libglx\.so(\.[^/]*)* regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/lib/xorg/modules/drivers/nvidia_drv\.o        regular file       system_u:object_r:textrel_shlib_t:s0 
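If dedicated types like these existed, the corresponding file contexts could be registered along the following lines (nvidia_exec_t and nvidia_device_t are hypothetical until a policy module actually defines them):

semanage fcontext -a -t nvidia_exec_t '/usr/bin/nvidia.*'
semanage fcontext -a -t nvidia_device_t '/dev/nvidia.*'
restorecon -Rv /usr/bin /dev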

If we were to rely on the default xserver ones, I suggest we write an extension to container-selinux, which may prove useful for other people as well (e.g. for forwarding X inside containers).

zvonkok (Author) commented Oct 25, 2018

I'm totally on your side; I do not like the xserver_* types either, it was just a first step to have something "working".

We're currently in a position to create an example workflow for enabling hardware accelerators in general on a system with SELinux. If other hardware vendors follow this path, with a similar method of providing the needed libraries (prestart hook, bind mounts), we could create "generic" rules for labelling and a policy for accelerators:

  1. We need to take care of the "correct" labelling of the host.
  2. On top of that, create a policy that enables containers to interact with these labels (be they nvidia_* or xserver_*, it does not matter).

wzhanw commented Aug 12, 2019

@3XX0 and @zvonkok, I am trying to set up an OpenShift + NVIDIA docker hook + SELinux environment for AI training jobs. I found that some AI training frameworks (PyTorch) want to write to /dev/shm in the GPU container, but after running the container with container_t or @zvonkok's nvidia_container_t, /dev/shm in the container is not accessible by the training code.
I'm new to SELinux; do you know how to configure the rule? Thank you.

RenaudWasTaken (Contributor) commented:

I think this is done; there is an example SELinux policy for DGX available here: https://github.com/NVIDIA/dgx-selinux

qhaas commented Aug 18, 2020

> I think this is done; there is an example SELinux policy for DGX available here: https://github.com/NVIDIA/dgx-selinux

That should likely be generalized to non-DGX EL7 / EL8 environments and made part of this project's packages.
