
GPU Operator with RHEL8/SELinux: Driver container fails to deploy if SELinux enforcing mode is activated. Error message: modprobe: ERROR: could not insert 'nvidia': Permission denied #553

Open
francisguillier opened this issue Jul 17, 2023 · 6 comments

Comments

@francisguillier
Contributor

francisguillier commented Jul 17, 2023

Steps to reproduce the issue:

1/ Set the RHEL 8 server to SELinux enforcing mode:

[nvidia@ipp1-0686 ~]$ sestatus
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   enforcing
Mode from config file:          enforcing
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Memory protection checking:     actual (secure)
Max kernel policy version:      33
[nvidia@ipp1-0686 ~]$

2/ Install GPU Operator

Result:

[nvidia@ipp1-0686 ~]$ kubectl get pod -A
NAMESPACE             NAME                                                              READY   STATUS     RESTARTS         AGE
kube-system           calico-kube-controllers-674fff74c8-l9jh5                          1/1     Running    0                125m
kube-system           calico-node-2nnx6                                                 1/1     Running    0                125m
kube-system           coredns-5d78c9869d-5sl7f                                          1/1     Running    0                124m
kube-system           coredns-5d78c9869d-82pzw                                          1/1     Running    0                124m
kube-system           etcd-ipp1-0686.nvidia.com                                         1/1     Running    0                125m
kube-system           kube-apiserver-ipp1-0686.nvidia.com                               1/1     Running    0                125m
kube-system           kube-controller-manager-ipp1-0686.nvidia.com                      1/1     Running    1 (72m ago)      125m
kube-system           kube-proxy-f65pp                                                  1/1     Running    0                125m
kube-system           kube-scheduler-ipp1-0686.nvidia.com                               1/1     Running    1 (72m ago)      125m
nvidia-gpu-operator   gpu-feature-discovery-jvcvr                                       0/1     Init:0/1   0                124m
nvidia-gpu-operator   gpu-operator-1688750998-node-feature-discovery-master-6fc4q9qw7   1/1     Running    0                125m
nvidia-gpu-operator   gpu-operator-1688750998-node-feature-discovery-worker-5rbgt       1/1     Running    0                123m
nvidia-gpu-operator   gpu-operator-794c4dd5c4-5mvs8                                     1/1     Running    0                125m
nvidia-gpu-operator   nvidia-container-toolkit-daemonset-bjfts                          0/1     Init:0/1   0                124m
nvidia-gpu-operator   nvidia-dcgm-exporter-69bcl                                        0/1     Init:0/1   0                124m
nvidia-gpu-operator   nvidia-device-plugin-daemonset-lx4jb                              0/1     Init:0/1   0                124m
nvidia-gpu-operator   nvidia-driver-daemonset-sjhb7                                     0/1     Running    18 (8m14s ago)   124m
nvidia-gpu-operator   nvidia-operator-validator-jsxw9                                   0/1     Init:0/4   0                124m
[nvidia@ipp1-0686 ~]$


Driver container logs:

Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 525.105.17) is now complete.

+ _load
+ _load_driver
+ echo 'Parsing kernel module parameters...'
Parsing kernel module parameters...
+ _get_module_params
+ local base_path=/drivers
+ '[' -f /drivers/nvidia.conf ']'
+ '[' -f /drivers/nvidia-uvm.conf ']'
+ '[' -f /drivers/nvidia-modeset.conf ']'
+ '[' -f /drivers/nvidia-peermem.conf ']'
Loading ipmi and i2c_core kernel modules...
+ echo 'Loading ipmi and i2c_core kernel modules...'
+ modprobe -a i2c_core ipmi_msghandler ipmi_devintf
+ echo 'Loading NVIDIA driver kernel modules...'
+ set -o xtrace +o nounset
Loading NVIDIA driver kernel modules...
+ modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': Permission denied
+ _shutdown
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
+ local nvidia_peermem_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'

@anoopsinghnegi

We are also facing the same issue: with SELinux enforcing, the gpu-operator driver daemonset pod fails to install the nvidia module with a "permission denied" error.

With SELinux disabled or permissive, the gpu-operator deploys successfully without any issue.

The Kubernetes cluster was created with kubeadm (1.27.0); setup details:

  • OS version: RHEL 8.8
  • containerd version: v1.7.0
  • kernel version: 4.18.0-477.15.1.el8_8.x86_64
  • kubelet-1.27.0, kubeadm-1.27.0, kubectl-1.27.0
  • kubernetes version: v1.27.0
  • helm version: v3.12.1
  • Nvidia GPU operator version: 23.3.2

Steps to reproduce the issue:

  1. Set up a Kubernetes cluster with 2 nodes on RHEL 8.8 (1 master + 1 worker with a GPU, Tesla P4).
  2. With SELinux disabled, the GPU Operator installs successfully (all pods are up).
  3. Switch SELinux to enforcing (edit /etc/selinux/config) and reboot the GPU host for the change to take effect.
  4. The GPU Operator pods restart once the GPU node comes back up.
  5. The GPU Operator driver daemonset pod fails to come up (with the error "permission denied").

Used helm to install gpu-operator

helm install --wait --generate-name -n gpu --create-namespace nvidia/gpu-operator -f ./custom-values.yaml

Custom values passed to gpu-operator:

operator:
  defaultRuntime: containerd
mig:
  strategy: single
driver:
  upgradePolicy:
    autoUpgrade: false
  env: 
  - name: HTTPS_PROXY
    value: http://gamma123.proxy.net:80
  - name: HTTP_PROXY
    value: http://gamma123.proxy.net:80
  - name: NO_PROXY
    value: ""
  - name: https_proxy
    value: http://gamma123.proxy.net:80
  - name: http_proxy
    value: http://gamma123.proxy.net:80
  - name: no_proxy
    value: ""
dcgmExporter:
  env:
  - name: DCGM_EXPORTER_LISTEN
    value: ":9400"
  - name: DCGM_EXPORTER_KUBERNETES
    value: "true"
  - name: DCGM_EXPORTER_COLLECTORS
    value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
  - name: DCGM_EXPORTER_INTERVAL
    value: "1000"  
  serviceMonitor:
    enabled: true
    interval: 5s

GPU Operator driver pod log after successful installation (with SELinux disabled):

kubectl logs nvidia-driver-daemonset-bpqdx -n gpu


Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 525.105.17) is now complete.

+ _load
+ _load_driver
+ echo 'Parsing kernel module parameters...'
+ _get_module_params
Parsing kernel module parameters...
+ local base_path=/drivers
+ '[' -f /drivers/nvidia.conf ']'
+ '[' -f /drivers/nvidia-uvm.conf ']'
+ '[' -f /drivers/nvidia-modeset.conf ']'
+ '[' -f /drivers/nvidia-peermem.conf ']'
+ echo 'Loading ipmi and i2c_core kernel modules...'
+ modprobe -a i2c_core ipmi_msghandler ipmi_devintf
Loading ipmi and i2c_core kernel modules...
Loading NVIDIA driver kernel modules...
+ echo 'Loading NVIDIA driver kernel modules...'
+ set -o xtrace +o nounset
+ modprobe nvidia
+ modprobe nvidia-uvm
+ modprobe nvidia-modeset
+ set +o xtrace -o nounset
Starting NVIDIA persistence daemon...
ls: cannot access '/proc/driver/nvidia-nvswitch/devices/*': No such file or directory
Mounting NVIDIA driver rootfs...
Check SELinux status
SELinux is disabled, skipping...
Done, now waiting for signal

GPU Operator driver pod failed with an error (SELinux enforcing):

# sestatus 
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   enforcing
Mode from config file:          enforcing
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Memory protection checking:     actual (secure)
Max kernel policy version:      33

kubectl logs nvidia-driver-daemonset-bpqdx -n gpu

Post-install sanity check passed.

Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 525.105.17) is now complete.

+ _load
+ _load_driver
+ echo 'Parsing kernel module parameters...'
+ _get_module_params
Parsing kernel module parameters...
+ local base_path=/drivers
+ '[' -f /drivers/nvidia.conf ']'
+ '[' -f /drivers/nvidia-uvm.conf ']'
+ '[' -f /drivers/nvidia-modeset.conf ']'
+ '[' -f /drivers/nvidia-peermem.conf ']'
Loading ipmi and i2c_core kernel modules...
+ echo 'Loading ipmi and i2c_core kernel modules...'
+ modprobe -a i2c_core ipmi_msghandler ipmi_devintf
Loading NVIDIA driver kernel modules...
+ echo 'Loading NVIDIA driver kernel modules...'
+ set -o xtrace +o nounset
+ modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': Permission denied
+ _shutdown
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
Stopping NVIDIA persistence daemon...
+ local nvidia_modeset_refs=0
+ local nvidia_peermem_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
+ '[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
+ '[' -f /var/run/nvidia-fabricmanager/nv-fabricmanager.pid ']'
+ echo 'Unloading NVIDIA driver kernel modules...'
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
Unloading NVIDIA driver kernel modules...
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ '[' -f /sys/module/nvidia/refcnt ']'
+ '[' -f /sys/module/nvidia_peermem/refcnt ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ return 0
+ _unmount_rootfs
+ echo 'Unmounting NVIDIA driver rootfs...'
Unmounting NVIDIA driver rootfs...
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ rm -f /run/nvidia/nvidia-driver.pid /run/kernel/postinst.d/update-nvidia-driver
+ return 0

GPU host audit logs (grep "avc" /var/log/audit/audit*) (SELinux enforcing)

/var/log/audit/audit.log.2:type=AVC msg=audit(1691186980.587:9501): avc:  denied  { module_load } for  pid=2431660 comm="modprobe" path="/usr/lib/modules/4.18.0-477.15.1.el8_8.x86_64/kernel/drivers/video/nvidia.ko" dev="overlay" ino=1678053642 scontext=system_u:system_r:unconfined_service_t:s0 tcontext=system_u:object_r:var_lib_t:s0 tclass=system permissive=0

/var/log/audit/audit.log.2:type=AVC msg=audit(1691187405.325:9647): avc:  denied  { module_load } for  pid=2454548 comm="modprobe" path="/usr/lib/modules/4.18.0-477.15.1.el8_8.x86_64/kernel/drivers/video/nvidia.ko" dev="overlay" ino=1208384829 scontext=system_u:system_r:unconfined_service_t:s0 tcontext=system_u:object_r:var_lib_t:s0 tclass=system permissive=0

/var/log/audit/audit.log.2:type=AVC msg=audit(1691187788.936:9732): avc:  denied  { module_load } for  pid=2476670 comm="modprobe" path="/usr/lib/modules/4.18.0-477.15.1.el8_8.x86_64/kernel/drivers/video/nvidia.ko" dev="overlay" ino=2081018738 scontext=system_u:system_r:unconfined_service_t:s0 tcontext=system_u:object_r:var_lib_t:s0 tclass=system permissive=0
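The AVC records above contain the key detail: modprobe runs in the system_u:system_r:unconfined_service_t:s0 domain, while the nvidia.ko file it tries to load carries the var_lib_t label, and the targeted policy denies module_load on that combination. As a small, self-contained illustration (the record is hardcoded from the logs above, so this runs without access to the host's audit log), the two contexts can be pulled out of a record like so:

```shell
#!/bin/sh
# Extract the source (process) and target (file) SELinux contexts from an
# AVC record. The record is copied verbatim from the audit logs above.
avc='type=AVC msg=audit(1691186980.587:9501): avc:  denied  { module_load } for  pid=2431660 comm="modprobe" path="/usr/lib/modules/4.18.0-477.15.1.el8_8.x86_64/kernel/drivers/video/nvidia.ko" dev="overlay" ino=1678053642 scontext=system_u:system_r:unconfined_service_t:s0 tcontext=system_u:object_r:var_lib_t:s0 tclass=system permissive=0'

scontext=$(printf '%s\n' "$avc" | grep -o 'scontext=[^ ]*' | cut -d= -f2)
tcontext=$(printf '%s\n' "$avc" | grep -o 'tcontext=[^ ]*' | cut -d= -f2)

echo "process context: $scontext"
echo "file context:    $tcontext"
```

On a live host the same records can be found with `grep avc /var/log/audit/audit.log*` as shown above; the tcontext field is what the fix discussed below changes.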

GPU host: cat /etc/containerd/config.toml

disabled_plugins = []
imports = []
oom_score = 0
plugin_dir = ""
required_plugins = []
root = "/var/lib/containerd"
state = "/run/containerd"
temp = ""
version = 2

[cgroup]
  path = ""

[debug]
  address = ""
  format = ""
  gid = 0
  level = ""
  uid = 0

[grpc]
  address = "/run/containerd/containerd.sock"
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216
  tcp_address = ""
  tcp_tls_ca = ""
  tcp_tls_cert = ""
  tcp_tls_key = ""
  uid = 0

[metrics]
  address = ""
  grpc_histogram = false

[plugins]

  [plugins."io.containerd.gc.v1.scheduler"]
    deletion_threshold = 0
    mutation_threshold = 100
    pause_threshold = 0.02
    schedule_delay = "0s"
    startup_delay = "100ms"

  [plugins."io.containerd.grpc.v1.cri"]
    cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
    device_ownership_from_security_context = false
    disable_apparmor = false
    disable_cgroup = false
    disable_hugetlb_controller = true
    disable_proc_mount = false
    disable_tcp_service = true
    drain_exec_sync_io_timeout = "0s"
    enable_cdi = false
    enable_selinux = false
    enable_tls_streaming = false
    enable_unprivileged_icmp = false
    enable_unprivileged_ports = false
    ignore_image_defined_volumes = false
    image_pull_progress_timeout = "1m0s"
    max_concurrent_downloads = 3
    max_container_log_line_size = 16384
    netns_mounts_under_state_dir = false
    restrict_oom_score_adj = false
    sandbox_image = "registry.k8s.io/pause:3.8"
    selinux_category_range = 1024
    stats_collect_period = 10
    stream_idle_timeout = "4h0m0s"
    stream_server_address = "127.0.0.1"
    stream_server_port = "0"
    systemd_cgroup = false
    tolerate_missing_hugetlb_controller = true
    unset_seccomp_profile = ""

    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
      conf_template = ""
      ip_pref = ""
      max_conf_num = 1
      setup_serially = false

    [plugins."io.containerd.grpc.v1.cri".containerd]
      disable_snapshot_annotations = true
      discard_unpacked_layers = false
      ignore_blockio_not_enabled_errors = false
      ignore_rdt_not_enabled_errors = false
      no_pivot = false
      snapshotter = "overlayfs"

      [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
        base_runtime_spec = ""
        cni_conf_dir = ""
        cni_max_conf_num = 0
        container_annotations = []
        pod_annotations = []
        privileged_without_host_devices = false
        privileged_without_host_devices_all_devices_allowed = false
        runtime_engine = ""
        runtime_path = ""
        runtime_root = ""
        runtime_type = ""
        sandbox_mode = ""
        snapshotter = ""

        [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          base_runtime_spec = ""
          cni_conf_dir = ""
          cni_max_conf_num = 0
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          privileged_without_host_devices_all_devices_allowed = false
          runtime_engine = ""
          runtime_path = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          sandbox_mode = "podsandbox"
          snapshotter = ""

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            BinaryName = ""
            CriuImagePath = ""
            CriuPath = ""
            CriuWorkPath = ""
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            NoPivotRoot = false
            Root = ""
            ShimCgroup = ""
            SystemdCgroup = true

      [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
        base_runtime_spec = ""
        cni_conf_dir = ""
        cni_max_conf_num = 0
        container_annotations = []
        pod_annotations = []
        privileged_without_host_devices = false
        privileged_without_host_devices_all_devices_allowed = false
        runtime_engine = ""
        runtime_path = ""
        runtime_root = ""
        runtime_type = ""
        sandbox_mode = ""
        snapshotter = ""

        [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime.options]

    [plugins."io.containerd.grpc.v1.cri".image_decryption]
      key_model = "node"

    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = ""

      [plugins."io.containerd.grpc.v1.cri".registry.auths]

      [plugins."io.containerd.grpc.v1.cri".registry.configs]

      [plugins."io.containerd.grpc.v1.cri".registry.headers]

      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]

    [plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
      tls_cert_file = ""
      tls_key_file = ""

  [plugins."io.containerd.internal.v1.opt"]
    path = "/opt/containerd"

  [plugins."io.containerd.internal.v1.restart"]
    interval = "10s"

  [plugins."io.containerd.internal.v1.tracing"]
    sampling_ratio = 1.0
    service_name = "containerd"

  [plugins."io.containerd.metadata.v1.bolt"]
    content_sharing_policy = "shared"

  [plugins."io.containerd.monitor.v1.cgroups"]
    no_prometheus = false

  [plugins."io.containerd.nri.v1.nri"]
    disable = true
    disable_connections = false
    plugin_config_path = "/etc/nri/conf.d"
    plugin_path = "/opt/nri/plugins"
    plugin_registration_timeout = "5s"
    plugin_request_timeout = "2s"
    socket_path = "/var/run/nri/nri.sock"

  [plugins."io.containerd.runtime.v1.linux"]
    no_shim = false
    runtime = "runc"
    runtime_root = ""
    shim = "containerd-shim"
    shim_debug = false

  [plugins."io.containerd.runtime.v2.task"]
    platforms = ["linux/amd64"]
    sched_core = false

  [plugins."io.containerd.service.v1.diff-service"]
    default = ["walking"]

  [plugins."io.containerd.service.v1.tasks-service"]
    blockio_config_file = ""
    rdt_config_file = ""

  [plugins."io.containerd.snapshotter.v1.aufs"]
    root_path = ""

  [plugins."io.containerd.snapshotter.v1.btrfs"]
    root_path = ""

  [plugins."io.containerd.snapshotter.v1.devmapper"]
    async_remove = false
    base_image_size = ""
    discard_blocks = false
    fs_options = ""
    fs_type = ""
    pool_name = ""
    root_path = ""

  [plugins."io.containerd.snapshotter.v1.native"]
    root_path = ""

  [plugins."io.containerd.snapshotter.v1.overlayfs"]
    root_path = ""
    upperdir_label = false

  [plugins."io.containerd.snapshotter.v1.zfs"]
    root_path = ""

  [plugins."io.containerd.tracing.processor.v1.otlp"]
    endpoint = ""
    insecure = false
    protocol = ""

  [plugins."io.containerd.transfer.v1.local"]

[proxy_plugins]

[stream_processors]

  [stream_processors."io.containerd.ocicrypt.decoder.v1.tar"]
    accepts = ["application/vnd.oci.image.layer.v1.tar+encrypted"]
    args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
    env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
    path = "ctd-decoder"
    returns = "application/vnd.oci.image.layer.v1.tar"

  [stream_processors."io.containerd.ocicrypt.decoder.v1.tar.gzip"]
    accepts = ["application/vnd.oci.image.layer.v1.tar+gzip+encrypted"]
    args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
    env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
    path = "ctd-decoder"
    returns = "application/vnd.oci.image.layer.v1.tar+gzip"

[timeouts]
  "io.containerd.timeout.bolt.open" = "0s"
  "io.containerd.timeout.metrics.shimstats" = "2s"
  "io.containerd.timeout.shim.cleanup" = "5s"
  "io.containerd.timeout.shim.load" = "5s"
  "io.containerd.timeout.shim.shutdown" = "3s"
  "io.containerd.timeout.task.state" = "2s"

[ttrpc]
  address = ""
  gid = 0
  uid = 0

@shivamerla
Contributor

shivamerla commented Aug 29, 2023

After changing the label of /lib/modules/$KERNEL_VERSION/kernel/drivers/video to modules_object_t, the driver load works with SELinux enabled. I have built a private image quay.io/shivamerla/driver:535.86.10-rhel8.6 so others can try it out. We will apply this change with GPU Operator release v23.9.0 and the drivers that will be released with it.

drwxr-xr-x.  4 root root system_u:object_r:container_file_t:s0  102 Aug 29 00:22 vfio
drwxr-xr-x.  2 root root system_u:object_r:container_file_t:s0  142 Aug 29 00:22 vhost
drwxr-xr-x.  4 root root system_u:object_r:modules_object_t:s0  124 Aug 29 00:26 video
drwxr-xr-x.  3 root root system_u:object_r:container_file_t:s0   28 Aug 29 00:22 virt
[root@node1 ~]# lsmod | grep nvidia
nvidia_modeset       1159168  0
nvidia_uvm           1179648  0
nvidia              39055360  51 nvidia_uvm,nvidia_modeset
drm                   585728  6 vmwgfx,drm_kms_helper,nvidia,ttm

[root@node1 ~]# kubectl  get pods -n gpu-operator
NAME                                                          READY   STATUS      RESTARTS        AGE
gpu-feature-discovery-29z7l                                   1/1     Running     0               9m41s
gpu-operator-8597b78788-57xsw                                 1/1     Running     0               25m
gpu-operator-node-feature-discovery-master-5678c7dbb4-mpfr5   1/1     Running     0               25m
gpu-operator-node-feature-discovery-worker-k5d59              1/1     Running     0               25m
nvidia-container-toolkit-daemonset-dxm7d                      1/1     Running     1 (2m46s ago)   9m41s
nvidia-cuda-validator-rll46                                   0/1     Completed   0               2m22s
nvidia-dcgm-exporter-rk2rc                                    1/1     Running     0               9m41s
nvidia-device-plugin-daemonset-92mrh                          1/1     Running     0               9m41s
nvidia-driver-daemonset-678jf                                 1/1     Running     0               10m
nvidia-operator-validator-7pk5g                               1/1     Running     0               9m41s

[root@node1 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.6 (Ootpa)
[root@node1 ~]# getenforce 
Enforcing
[root@node1 ~]#

@anoopsinghnegi

@shivamerla - We tried the solution on our RHEL 8.8 setup using your private test image. You added chcon -t modules_object_t /usr/lib/modules/${KERNEL_VERSION}/kernel/drivers/video/ to change the file context, but it didn't work; the permission error still occurs during driver load.

We did some patching to understand the issue on our setup. We could see that the context type of the compiled modules still shows var_lib_t, even after chcon -t modules_object_t /usr/lib/modules/${KERNEL_VERSION}/kernel/drivers/video/:

-rw-r--r--. 1 root root system_u:object_r:var_lib_t:s0  2502576 Sep 11 15:01 /usr/lib/modules/4.18.0-477.21.1.el8_8.x86_64/kernel/drivers/video/nvidia-modeset.ko
-rw-r--r--. 1 root root system_u:object_r:var_lib_t:s0   467240 Sep 11 15:01 /usr/lib/modules/4.18.0-477.21.1.el8_8.x86_64/kernel/drivers/video/nvidia-peermem.ko
-rw-r--r--. 1 root root system_u:object_r:var_lib_t:s0 71707768 Sep 11 15:01 /usr/lib/modules/4.18.0-477.21.1.el8_8.x86_64/kernel/drivers/video/nvidia-uvm.ko
-rw-r--r--. 1 root root system_u:object_r:var_lib_t:s0 79858296 Sep 11 15:01 /usr/lib/modules/4.18.0-477.21.1.el8_8.x86_64/kernel/drivers/video/nvidia.ko

I think these are copied from the source /usr/src/nvidia-525.105.17/kernel, where the context type is var_lib_t.

Then we tried chcon -R -t modules_object_t /usr/lib/modules/${KERNEL_VERSION}/kernel/drivers/video/ (before driver load) to relabel the files inside the directory as well, and this worked (the driver load executed successfully):

-rw-r--r--. 1 root root system_u:object_r:modules_object_t:s0  2502576 Sep 11 15:01 /usr/lib/modules/4.18.0-477.21.1.el8_8.x86_64/kernel/drivers/video/nvidia-modeset.ko
-rw-r--r--. 1 root root system_u:object_r:modules_object_t:s0   467240 Sep 11 15:01 /usr/lib/modules/4.18.0-477.21.1.el8_8.x86_64/kernel/drivers/video/nvidia-peermem.ko
-rw-r--r--. 1 root root system_u:object_r:modules_object_t:s0 71707768 Sep 11 15:01 /usr/lib/modules/4.18.0-477.21.1.el8_8.x86_64/kernel/drivers/video/nvidia-uvm.ko
-rw-r--r--. 1 root root system_u:object_r:modules_object_t:s0 79858296 Sep 11 15:01 /usr/lib/modules/4.18.0-477.21.1.el8_8.x86_64/kernel/drivers/video/nvidia.ko

So can we provide the fix with chcon -R -t modules_object_t /usr/lib/modules/${KERNEL_VERSION}/kernel/drivers/video when SELinux is enabled?
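A hedged sketch of what that fix could look like in the driver container's load path. The path and the recursive relabel are from the discussion above; the getenforce guard and the KERNEL_VERSION fallback are my assumptions for illustration, not the operator's actual code:

```shell
#!/bin/sh
# Sketch: relabel the built NVIDIA modules before modprobe so module_load
# is permitted under the targeted policy. Assumes it runs as root inside
# the driver container, where KERNEL_VERSION is normally already set.
KERNEL_VERSION="${KERNEL_VERSION:-$(uname -r)}"
MODULE_DIR="/usr/lib/modules/${KERNEL_VERSION}/kernel/drivers/video"

if command -v getenforce >/dev/null 2>&1 && [ "$(getenforce)" = "Enforcing" ]; then
    # -R is the important part: relabeling only the directory leaves the
    # copied .ko files with their inherited var_lib_t label, and modprobe
    # is still denied module_load.
    chcon -R -t modules_object_t "${MODULE_DIR}"
fi
```

On hosts without SELinux tooling the guard makes this a no-op, so the same entrypoint stays safe on permissive and disabled systems.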

@shivamerla
Contributor

shivamerla commented Sep 19, 2023

@anoopsinghnegi I have updated the image; please re-pull and verify with quay.io/shivamerla/driver:535.104.05-rhel8.6 and quay.io/shivamerla/driver:535.104.05-rhel8.8. MR here: https://gitlab.com/nvidia/container-images/driver/-/merge_requests/269

@anoopsinghnegi

@shivamerla, it's working: the driver loaded successfully with SELinux enforcing using image quay.io/shivamerla/driver:535.104.05-rhel8.8. Thanks for the fix.

@anoopsinghnegi

@shivamerla - any update on this issue? Even the latest version of gpu-operator, v23.9.1, is failing with SELinux enforcing.
