nvidia-container-toolkit-daemonset-hrncd goes in CrashLoopBackOff after a Completed state indefinitely #456

@nneram

1. Issue or feature description

The nvidia-container-toolkit-daemonset-hrncd pod goes into CrashLoopBackOff indefinitely after reaching a Completed state.

2. Steps to reproduce the issue

Create a cluster with Ubuntu 20.04, kubectl server version 1.23.13, and containerd version 1.6.10.

Install nvidia-gpu-operator:

helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator
NAME: gpu-operator-1670258596
LAST DEPLOYED: Mon Dec  5 17:43:19 2022
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

The nvidia-container-toolkit-daemonset-hrncd pod goes into CrashLoopBackOff after reaching a Completed state (due to the DaemonSet).
During my investigation, I saw that the container changes the configuration in /etc/containerd/config.toml and restarts containerd. My understanding is that it should then stay in the "Waiting for signal" loop, but instead it proceeds to clean up the configuration. I can see that the containerd service was restarted 3 times during the running phase. I don't know what is happening: even after looking at the code, I don't understand why it receives a signal, or where that signal comes from.
https://github.com/NVIDIA/nvidia-container-toolkit/blob/a9fb7a4a8807f1fa7e09e63138a928a410bb15a7/tools/container/nvidia-toolkit/run.go#L261
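
To make the lifecycle I expected concrete, here is a minimal shell sketch of a "configure, wait for signal, clean up" loop. It is not the toolkit's actual code: the functions `setup`, `cleanup`, and `run_lifecycle`, the demo config path, the config line, and the choice of signals are all hypothetical stand-ins used only for illustration.

```shell
#!/bin/sh
# Hypothetical sketch only; this is NOT the nvidia-toolkit implementation.
# CONFIG is a throwaway demo file standing in for /etc/containerd/config.toml.
CONFIG="${CONFIG:-/tmp/demo-config.toml}"

setup() {
    # Stand-in for "register the runtime in the config": append one marker line.
    echo 'runtime = "nvidia"' >> "$CONFIG"
}

cleanup() {
    # Undo setup by dropping the marker line.
    # grep -v exits non-zero when nothing is left to print, hence "|| true".
    grep -v 'runtime = "nvidia"' "$CONFIG" > "$CONFIG.tmp" || true
    mv "$CONFIG.tmp" "$CONFIG"
}

run_lifecycle() {
    setup
    # On any of these (assumed) signals, revert the config and exit with status 0.
    trap 'cleanup; exit 0' TERM INT HUP
    echo "Waiting for signal"
    # Sleep in short steps; a pending trap runs once the current sleep returns.
    while :; do
        sleep 1
    done
}
```

Running `run_lifecycle` in the background and sending it `kill -TERM <pid>` reverts the demo file and exits with status 0. A container that exits with status 0 is reported by Kubernetes as Completed, which would match what I observe if something is signalling the toolkit container.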

Do you have any clue about this? Thank you!

3. Information to attach (optional if deemed irrelevant)

  • nvidia-container-toolkit-ctr log

nvidia-container-toolkit-ctr.txt

  • kubernetes pods status: kubectl get pods --all-namespaces

pods_overview.txt

  • Containerd configuration file: cat /etc/containerd/config.toml

coontainerd_config.txt

  • NVIDIA shared directory: ls -la /run/nvidia

nvidia_shared_directory.txt

  • NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit

NVIDIA_packages_directory.txt

  • NVIDIA driver directory: ls -la /run/nvidia/driver

NVIDIA_driver_directory.txt

  • kubelet logs: journalctl -u kubelet > kubelet.logs

kubelet_logs.txt
