CentOS 7.8 Support - GLIBC_2.27 #72

@jtm5044

Description

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
    -- No - CentOS 7.8
  • Are you running Kubernetes v1.13+?
    -- Yes - v1.18.6
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
    -- Yes - v19.03.12
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
    -- No - apparently this is N/A now
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?

1. Issue or feature description

I'm trying to run the GPU operator on CentOS 7.8 with a single-node Kubernetes cluster that I set up with kubeadm (no OpenShift).

I'm getting errors on the nvidia-driver-validation, nvidia-device-plugin-daemonset, and nvidia-dcgm-exporter pods, each complaining that "GLIBC_2.27 not found". It looks like the toolkit binaries are using the host glibc, which on CentOS 7 is 2.17, while they require 2.27.
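
For reference, the mismatch can be confirmed on the host with standard glibc tooling (paths taken from the toolkit directory listed further down; output will obviously vary):

$ ldd --version | head -1
# host glibc; reports 2.17 on CentOS 7
$ objdump -T /usr/local/nvidia/toolkit/libnvidia-container.so.1 | grep GLIBC_2.2
# symbol versions the library was linked against; given the error below, this
# should include GLIBC_2.27 entries that the 2.17 loader cannot satisfy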

From looking at the commits, it seems that CentOS support is a recent addition, so perhaps there is some flag or configuration I need to provide to run on CentOS 7 that hasn't been documented yet.

Any ideas? Thank you!
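
In case it helps frame the question: if the chart exposes an override for the container-toolkit image, I would expect the fix to look something like the line below. The value name and tag here are guesses on my part and not verified against this chart version:

$ helm install --devel nvidia/gpu-operator --wait --generate-name \
    --set toolkit.version=<some-centos7-compatible-toolkit-tag>   # hypothetical value and tag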

2. Steps to reproduce the issue

  • Start with a clean install of CentOS 7.8
  • Install Docker and initialize the Kubernetes cluster with kubeadm
  • Set up Helm and add the NVIDIA Helm repo (commands sketched after this list)
  • helm install --devel nvidia/gpu-operator --wait --generate-name
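
For completeness, the Helm setup in the third step followed the gpu-operator docs; roughly (repo URL as documented at the time, adjust if it has moved):

$ helm repo add nvidia https://nvidia.github.io/gpu-operator
$ helm repo update
$ helm install --devel nvidia/gpu-operator --wait --generate-name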

All three failing pods (nvidia-driver-validation, nvidia-device-plugin-daemonset, and nvidia-dcgm-exporter) report the same error:

Error: failed to start container "cuda-vector-add": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: /usr/local/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/nvidia/toolkit/libnvidia-container.so.1)\\n\""": unknown
Back-off restarting failed container

3. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods --all-namespaces
$ kubectl get pods --all-namespaces
NAMESPACE                NAME                                                              READY   STATUS             RESTARTS   AGE
gpu-operator-resources   nvidia-container-toolkit-daemonset-79427                          1/1     Running            0          38m
gpu-operator-resources   nvidia-dcgm-exporter-s6gxj                                        0/1     CrashLoopBackOff   12         38m
gpu-operator-resources   nvidia-device-plugin-daemonset-bpwt7                              0/1     CrashLoopBackOff   12         38m
gpu-operator-resources   nvidia-device-plugin-validation                                   0/1     Pending            0          38m
gpu-operator-resources   nvidia-driver-daemonset-kqvdt                                     0/1     CrashLoopBackOff   5          38m
gpu-operator-resources   nvidia-driver-validation                                          0/1     CrashLoopBackOff   14         38m
kube-system              calico-kube-controllers-578894d4cd-wb6bm                          1/1     Running            3          43h
kube-system              calico-node-j8qmb                                                 1/1     Running            3          43h
kube-system              coredns-66bff467f8-28q78                                          1/1     Running            3          2d18h
kube-system              coredns-66bff467f8-dg55z                                          1/1     Running            3          2d18h
kube-system              etcd-pho-test-4.mitre.org                                         1/1     Running            3          2d18h
kube-system              kube-apiserver-pho-test-4.mitre.org                               1/1     Running            3          2d18h
kube-system              kube-controller-manager-pho-test-4.mitre.org                      1/1     Running            4          2d18h
kube-system              kube-proxy-vwm9b                                                  1/1     Running            3          2d18h
kube-system              kube-scheduler-pho-test-4.mitre.org                               1/1     Running            5          2d18h
kube-system              metrics-server-f7cdcc99-mkvfb                                     1/1     Running            3          24h
kubernetes-dashboard     dashboard-metrics-scraper-6b4884c9d5-r6prr                        1/1     Running            3          43h
kubernetes-dashboard     kubernetes-dashboard-7b544877d5-c2b66                             1/1     Running            3          43h
photonapi                gpu-operator-1597413719-node-feature-discovery-master-f76b2nd78   1/1     Running            0          38m
photonapi                gpu-operator-1597413719-node-feature-discovery-worker-t84dc       1/1     Running            0          38m
photonapi                gpu-operator-774ff7994c-bptf8                                     1/1     Running            0          38m
  • kubernetes daemonset status: kubectl get ds --all-namespaces
$ kubectl get ds --all-namespaces
NAMESPACE                NAME                                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-operator-resources   nvidia-container-toolkit-daemonset                      1         1         1       1            1           feature.node.kubernetes.io/pci-10de.present=true   39m
gpu-operator-resources   nvidia-dcgm-exporter                                    1         1         0       1            0           feature.node.kubernetes.io/pci-10de.present=true   39m
gpu-operator-resources   nvidia-device-plugin-daemonset                          1         1         0       1            0           feature.node.kubernetes.io/pci-10de.present=true   39m
gpu-operator-resources   nvidia-driver-daemonset                                 1         1         0       1            0           feature.node.kubernetes.io/pci-10de.present=true   39m
kube-system              calico-node                                             1         1         1       1            1           kubernetes.io/os=linux                             43h
kube-system              kube-proxy                                              1         1         1       1            1           kubernetes.io/os=linux                             2d18h
photonapi                gpu-operator-1597413719-node-feature-discovery-worker   1         1         1       1            1           <none>                                             39m
  • If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME

Here is the output for nvidia-driver-validation; the others are all similar:

$ kubectl describe pod -n gpu-operator-resources nvidia-driver-validation
Name:         nvidia-driver-validation
Namespace:    gpu-operator-resources
Priority:     0
Node:         pho-test-4.mitre.org/10.128.210.23
Start Time:   Fri, 14 Aug 2020 10:02:09 -0400
Labels:       app=nvidia-driver-validation
Annotations:  cni.projectcalico.org/podIP: 10.244.202.208/32
              cni.projectcalico.org/podIPs: 10.244.202.208/32
Status:       Running
IP:           10.244.202.208
IPs:
  IP:           10.244.202.208
Controlled By:  ClusterPolicy/cluster-policy
Containers:
  cuda-vector-add:
    Container ID:   docker://88a469dcded3a1afc3da4dde9a4b383965ec7e9df809bf7e96f00a1846db5202
    Image:          nvidia/samples:cuda10.2-vectorAdd
    Image ID:       docker-pullable://nvidia/samples@sha256:19f202eecbfa5211be8ce0f0dc02deb44a25a7e3584c993b22b1654f3569ffaf
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: /usr/local/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/nvidia/toolkit/libnvidia-container.so.1)\\\\n\\\"\"": unknown
      Exit Code:    128
      Started:      Fri, 14 Aug 2020 10:39:16 -0400
      Finished:     Fri, 14 Aug 2020 10:39:16 -0400
    Ready:          False
    Restart Count:  14
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-j58hc (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-j58hc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-j58hc
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  feature.node.kubernetes.io/pci-10de.present=true
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 nvidia.com/gpu:NoSchedule
Events:
  Type     Reason     Age                  From                           Message
  ----     ------     ----                 ----                           -------
  Normal   Scheduled  41m                  default-scheduler              Successfully assigned gpu-operator-resources/nvidia-driver-validation to pho-test-4.mitre.org
  Normal   Pulled     39m (x5 over 41m)    kubelet, pho-test-4.mitre.org  Container image "nvidia/samples:cuda10.2-vectorAdd" already present on machine
  Normal   Created    39m (x5 over 41m)    kubelet, pho-test-4.mitre.org  Created container cuda-vector-add
  Warning  Failed     39m (x5 over 41m)    kubelet, pho-test-4.mitre.org  Error: failed to start container "cuda-vector-add": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: /usr/local/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/nvidia/toolkit/libnvidia-container.so.1)\\\\n\\\"\"": unknown
  Warning  BackOff    86s (x178 over 40m)  kubelet, pho-test-4.mitre.org  Back-off restarting failed container
  • Output of running a container on the GPU machine: docker run -it alpine echo foo
$ docker run -it alpine echo foo
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
df20fa9351a1: Pull complete 
Digest: sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
Status: Downloaded newer image for alpine:latest
foo
  • Docker configuration file: cat /etc/docker/daemon.json
$ cat /etc/docker/daemon.json
{
  "exec-opts": [
    "native.cgroupdriver=systemd"
  ],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true"
  ],
  "runtimes": {
    "nvidia": {
      "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime"
    }
  },
  "default-runtime": "nvidia"
}
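
As a quick sanity check that Docker actually picked up this configuration, the runtime registration can be inspected with the standard docker CLI (nothing gpu-operator specific):

$ docker info 2>/dev/null | grep -i runtime
# expect "nvidia" to appear both in the Runtimes list and as the Default Runtime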
  • NVIDIA shared directory: ls -la /run/nvidia
$ ls -la /run/nvidia
total 8
drwxr-xr-x.  2 root root  80 Aug 14 10:42 .
drwxr-xr-x. 32 root root 980 Aug 14 02:06 ..
-rw-r--r--.  1 root root   5 Aug 14 10:42 nvidia-driver.pid
-rw-r--r--.  1 root root   5 Aug 14 10:02 toolkit.pid
  • NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit
$ ls -la /usr/local/nvidia/toolkit
total 3992
drwxr-xr-x. 3 root root    4096 Aug 14 10:02 .
drwxr-xr-x. 3 root root      21 Aug 14 10:02 ..
drwxr-xr-x. 3 root root      38 Aug 14 10:02 .config
lrwxrwxrwx. 1 root root      30 Aug 14 10:02 libnvidia-container.so.1 -> ./libnvidia-container.so.1.0.7
-rwxr-xr-x. 1 root root  151088 Aug 14 10:02 libnvidia-container.so.1.0.7
-rwxr-xr-x. 1 root root     154 Aug 14 10:02 nvidia-container-cli
-rwxr-xr-x. 1 root root   34832 Aug 14 10:02 nvidia-container-cli.real
-rwxr-xr-x. 1 root root     166 Aug 14 10:02 nvidia-container-runtime
lrwxrwxrwx. 1 root root      26 Aug 14 10:02 nvidia-container-runtime-hook -> ./nvidia-container-toolkit
-rwxr-xr-x. 1 root root 2008936 Aug 14 10:02 nvidia-container-runtime.real
-rwxr-xr-x. 1 root root     195 Aug 14 10:02 nvidia-container-toolkit
-rwxr-xr-x. 1 root root 1871848 Aug 14 10:02 nvidia-container-toolkit.real
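
Invoking the toolkit CLI wrapper from this directory directly on the host should reproduce the failure outside of Kubernetes, which would confirm it is purely a binary/glibc mismatch rather than anything pod-specific (just a sketch; it should die at the dynamic loader before any CLI logic runs):

$ /usr/local/nvidia/toolkit/nvidia-container-cli --version
# expected to fail with the same "GLIBC_2.27 not found" message as the prestart hook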
  • NVIDIA driver directory: ls -la /run/nvidia/driver
$ ls -la /run/nvidia/driver
ls: cannot access /run/nvidia/driver: No such file or directory
