nvidia-device-plugin-daemonset stuck in crash loop #250

@flaker

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node? -> 18.04.5
  • Are you running Kubernetes v1.13+? -> v1.16.9
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? -> 18.09.9
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes? -> Yes, both
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces) -> Yes

1. Issue or feature description

The nvidia-device-plugin-daemonset pod is stuck in Init:CrashLoopBackOff.

2. Steps to reproduce the issue

Install the Helm chart:

 helm install loadgen-gpu --wait nvidia/gpu-operator -f values.yaml --version 1.6.2 --set operator.defaultRuntime=containerd --debug

Wait for the install to finish. All the pods come up except for nvidia-device-plugin-daemonset.

One detail that may be relevant: I replaced the containerd.io package that shipped with the Ubuntu image with the containerd package from the standard repos, since it seemed to provide a newer version.

3. Information to attach (optional if deemed irrelevant)

Describe output for the problematic pod

(base) ➜  helm kc describe pod nvidia-device-plugin-daemonset-vlkm2 -n  gpu-operator-resources
Name:         nvidia-device-plugin-daemonset-vlkm2
Namespace:    gpu-operator-resources
Priority:     0
Node:         ip-172-32-62-232.us-east-2.compute.internal/172.32.62.232
Start Time:   Fri, 27 Aug 2021 17:11:04 -0400
Labels:       app=nvidia-device-plugin-daemonset
              controller-revision-hash=7ff4fb4c4b
              pod-template-generation=1
Annotations:  scheduler.alpha.kubernetes.io/critical-pod:
Status:       Pending
IP:           172.32.33.219
IPs:
  IP:           172.32.33.219
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Init Containers:
  toolkit-validation:
    Container ID:  docker://6893451d063f1e493ba0f20a94d53455f9b82d7deee91808d9bb515634760cbc
    Image:         nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    Image ID:      docker-pullable://nvcr.io/nvidia/k8s/cuda-sample@sha256:4593078cdb8e786d35566faa2b84da1123acea42f0d4099e84e2af0448724af1
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      /tmp/vectorAdd
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable
      Exit Code:    128
      Started:      Fri, 27 Aug 2021 17:17:22 -0400
      Finished:     Fri, 27 Aug 2021 17:17:22 -0400
    Ready:          False
    Restart Count:  6
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from nvidia-device-plugin-token-j5wdj (ro)
Containers:
  nvidia-device-plugin-ctr:
    Container ID:
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.8.2-ubi8
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --mig-strategy=single
      --pass-device-specs=true
      --fail-on-init-error=true
      --device-list-strategy=envvar
      --nvidia-driver-root=/run/nvidia/driver
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      NVIDIA_VISIBLE_DEVICES:      all
      NVIDIA_DRIVER_CAPABILITIES:  all
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from nvidia-device-plugin-token-j5wdj (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  nvidia-device-plugin-token-j5wdj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nvidia-device-plugin-token-j5wdj
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  kops.k8s.io/instancegroup=gpu-nodes
                 nvidia.com/gpu.present=true
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
                 nvidia.com/gpu
Events:
  Type     Reason                  Age                     From                                                  Message
  ----     ------                  ----                    ----                                                  -------
  Normal   Scheduled               9m43s                   default-scheduler                                     Successfully assigned gpu-operator-resources/nvidia-device-plugin-daemonset-vlkm2 to ip-172-32-62-232.us-east-2.compute.internal
  Warning  FailedCreatePodSandBox  9m42s                   kubelet, ip-172-32-62-232.us-east-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "3658bb95ae86ade79a31794b6f7e5a20f319549170ccca2b0d33dacf0918a55b" network for pod "nvidia-device-plugin-daemonset-vlkm2": networkPlugin cni failed to set up pod "nvidia-device-plugin-daemonset-vlkm2_gpu-operator-resources" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "3658bb95ae86ade79a31794b6f7e5a20f319549170ccca2b0d33dacf0918a55b" network for pod "nvidia-device-plugin-daemonset-vlkm2": networkPlugin cni failed to teardown pod "nvidia-device-plugin-daemonset-vlkm2_gpu-operator-resources" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
  Normal   SandboxChanged          9m25s (x2 over 9m41s)   kubelet, ip-172-32-62-232.us-east-2.compute.internal  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulling                 9m24s                   kubelet, ip-172-32-62-232.us-east-2.compute.internal  Pulling image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2"
  Normal   Pulled                  9m20s                   kubelet, ip-172-32-62-232.us-east-2.compute.internal  Successfully pulled image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2"
  Normal   Started                 8m33s (x4 over 9m20s)   kubelet, ip-172-32-62-232.us-east-2.compute.internal  Started container toolkit-validation
  Normal   Created                 7m50s (x5 over 9m20s)   kubelet, ip-172-32-62-232.us-east-2.compute.internal  Created container toolkit-validation
  Normal   Pulled                  7m50s (x4 over 9m19s)   kubelet, ip-172-32-62-232.us-east-2.compute.internal  Container image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2" already present on machine
  Warning  BackOff                 4m36s (x22 over 9m18s)  kubelet, ip-172-32-62-232.us-east-2.compute.internal  Back-off restarting failed container

Logs for the problematic pod

helm kc logs -f nvidia-device-plugin-daemonset-vlkm2 -n gpu-operator-resources
Error from server (BadRequest): container "nvidia-device-plugin-ctr" in pod "nvidia-device-plugin-daemonset-vlkm2" is waiting to start: PodInitializing
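
One thing that may be worth comparing: the Container ID of the toolkit-validation init container in the describe output starts with docker://, while the chart was installed with operator.defaultRuntime=containerd. Below is a minimal sketch that pulls the runtime name out of such an ID and compares it with the configured value. The helper name runtime_of is hypothetical, the values are copied from this report, and whether a mismatch is the actual cause of the crash loop is only an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print the scheme in front of "://" in a container ID
# or runtime version string, e.g. "docker://6893..." -> "docker".
runtime_of() {
    printf '%s\n' "${1%%://*}"
}

# Values copied from this report; adjust them for your own cluster.
configured="containerd"   # from --set operator.defaultRuntime=containerd
container_id="docker://6893451d063f1e493ba0f20a94d53455f9b82d7deee91808d9bb515634760cbc"

actual="$(runtime_of "$container_id")"
if [ "$actual" != "$configured" ]; then
    echo "runtime mismatch: container started by $actual, chart configured for $configured"
else
    echo "runtime matches: $actual"
fi
```

On a live cluster, a string of the same form can be read from the node object with kubectl get node <node-name> -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}' and passed to the helper instead of the hard-coded ID.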
