DCGM initialization error #222

Open · AnkitPurohit01 opened this issue Jul 7, 2021 · 5 comments

AnkitPurohit01 commented Jul 7, 2021

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes? (see the check sketched just below this list)
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
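
For the last two checklist items, a quick check might look like the following; this is only a sketch, using the module and resource names given in the checklist above (module check on a worker node, CRD check against the cluster):

$ lsmod | grep -E 'i2c_core|ipmi_msghandler'   # modules named in the checklist above
$ kubectl describe clusterpolicies --all-namespaces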

1. Issue or feature description

We installed NVIDIA GPU Operator version 1.7.1 on our Kubernetes cluster using Helm, but the DCGM exporter fails with a DCGM initialization error and GPU resources are never discovered by the nodes.

Please check the following logs.

2. Steps to reproduce the issue
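
As noted above, the operator was installed with Helm. The exact chart values used were not recorded, so the following is only an assumed minimal reproduction sketch (NVIDIA Helm repo URL, chart name, and version flag; actual flags may have differed):

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
$ helm install --wait --generate-name nvidia/gpu-operator --version 1.7.1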

3. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods --all-namespaces
$ k get pod
NAME                                       READY   STATUS             RESTARTS   AGE
gpu-feature-discovery-5jjwl                1/1     Running            3          20h
gpu-feature-discovery-jfxq8                1/1     Running            0          20h
gpu-feature-discovery-kcr2p                1/1     Running            3          20h
nvidia-container-toolkit-daemonset-8r4df   1/1     Running            0          20h
nvidia-container-toolkit-daemonset-c2lw8   1/1     Running            0          20h
nvidia-container-toolkit-daemonset-mmvzk   1/1     Running            0          20h
nvidia-cuda-validator-fcffx                0/1     Completed          0          20h
nvidia-cuda-validator-j8x8w                0/1     Completed          0          20h
nvidia-cuda-validator-q79nf                0/1     Completed          0          20h
nvidia-dcgm-exporter-5kc4x                 0/1     CrashLoopBackOff   242        20h
nvidia-dcgm-exporter-98kbb                 0/1     CrashLoopBackOff   242        20h
nvidia-dcgm-exporter-fdqgd                 0/1     CrashLoopBackOff   242        20h
nvidia-device-plugin-daemonset-jwsm4       1/1     Running            0          20h
nvidia-device-plugin-daemonset-rsjs8       1/1     Running            3          20h
nvidia-device-plugin-daemonset-tz4z9       1/1     Running            3          20h
nvidia-driver-daemonset-rx22m              1/1     Running            0          20h
nvidia-driver-daemonset-t8tkj              1/1     Running            0          20h
nvidia-driver-daemonset-vb6hh              1/1     Running            0          20h
nvidia-operator-validator-rkpqf            0/1     Init:3/4           163        20h
nvidia-operator-validator-tft4t            0/1     Init:3/4           165        20h
nvidia-operator-validator-xdjk8            0/1     Init:3/4           165        20h
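
Given the CrashLoopBackOff on the exporter and the stuck validator pods above, it may also be worth confirming whether the nodes advertise the GPU resource at all; a rough check (node name is a placeholder):

$ kubectl describe node <node-name> | grep nvidia.com/gpu   # <node-name>: any GPU worker node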
  • kubernetes daemonset status: kubectl get ds -n gpu-operator-resources
NAME                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery                3         3         3       3            3           nvidia.com/gpu.deploy.gpu-feature-discovery=true   5d2h
nvidia-container-toolkit-daemonset   3         3         3       3            3           nvidia.com/gpu.deploy.container-toolkit=true       5d2h
nvidia-dcgm-exporter                 3         3         0       3            0           nvidia.com/gpu.deploy.dcgm-exporter=true           5d2h
nvidia-device-plugin-daemonset       3         3         3       3            3           nvidia.com/gpu.deploy.device-plugin=true           5d2h
nvidia-driver-daemonset              3         3         3       3            3           nvidia.com/gpu.deploy.driver=true                  5d2h
nvidia-mig-manager                   0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             5d2h
nvidia-operator-validator            3         3         0       3            0           nvidia.com/gpu.deploy.operator-validator=true      5d2h
  • kubectl describe daemonsets -n gpu-operator-resources
Events:
  Type     Reason            Age                   From                  Message
  ----     ------            ----                  ----                  -------
  Normal   SuccessfulCreate  27s                   daemonset-controller  Created pod: nvidia-operator-validator-tft4t
  Normal   SuccessfulCreate  27s                   daemonset-controller  Created pod: nvidia-operator-validator-rkpqf
  Normal   SuccessfulCreate  27s                   daemonset-controller  Created pod: nvidia-operator-validator-xdjk8
  • If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME
  • DCGM exporter error state: k logs nvidia-dcgm-exporter-5kc4x
time="2021-07-06T00:15:36Z" level=info msg="Starting dcgm-exporter"
DCGM Failed to find any GPUs on the node.
time="2021-07-06T00:15:36Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"
  • nvidia-operator-validator pods:
  1. cuda-validation init container
time="2021-07-05T04:00:45Z" level=info msg="pod nvidia-cuda-validator-q79nf is curently in Pending phase"
time="2021-07-05T04:00:50Z" level=info msg="pod nvidia-cuda-validator-q79nf have run successfully"
  2. driver-validation init container
running command chroot with args [/run/nvidia/driver nvidia-smi]
Mon Jul  5 04:00:24 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:84:00.0 Off |                    0 |
| N/A   39C    P0    26W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
  3. nvidia-operator-validator init container
Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-xdjk8" is waiting to start: PodInitializing
  4. plugin-validation init container
time="2021-07-06T01:25:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 1"
time="2021-07-06T01:25:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 2"
time="2021-07-06T01:25:32Z" level=info msg="GPU resources are not yet discovered by the node, retry: 3"
...
time="2021-07-06T01:27:42Z" level=info msg="GPU resources are not yet discovered by the node, retry: 29"
time="2021-07-06T01:27:47Z" level=info msg="GPU resources are not yet discovered by the node, retry: 30"
time="2021-07-06T01:27:52Z" level=info msg="Error: error validating plugin installation: GPU resources are not discovered by the node"
  5. toolkit-validation init container
Mon Jul  5 04:00:43 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:84:00.0 Off |                    0 |
| N/A   37C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
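
For completeness, the previous-container logs and events of one of the crashing exporter pods can be collected as well (pod name taken from the listing above):

$ kubectl logs -n gpu-operator-resources nvidia-dcgm-exporter-5kc4x --previous
$ kubectl describe pod -n gpu-operator-resources nvidia-dcgm-exporter-5kc4x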
@shivamerla (Contributor) commented:

@AnkitPurohit01 Can you get logs from the device-plugin-daemonset pods? I see that some of those pods have restarted, so the logs might help. Also, please run the following command from any of the worker nodes and attach the output along with syslog: /run/nvidia/driver/usr/bin/nvidia-bug-report.sh

@AnkitPurohit01 (Author) commented:

  1. Logs from device-plugin-daemonset pods
$ kubectl logs nvidia-device-plugin-daemonset-rsjs8 -n gpu-operator-resources
2021/07/05 04:00:38 Loading NVML
2021/07/05 04:00:38 Starting FS watcher.
2021/07/05 04:00:38 Starting OS watcher.
2021/07/05 04:00:38 Retreiving plugins.
2021/07/05 04:00:38 No MIG devices found. Falling back to mig.strategy=&{}
2021/07/05 04:00:38 No devices found. Waiting indefinitely.
$ kubectl logs nvidia-device-plugin-daemonset-jwsm4 -n gpu-operator-resources
2021/07/05 04:00:34 Loading NVML
2021/07/05 04:00:34 Starting FS watcher.
2021/07/05 04:00:34 Starting OS watcher.
2021/07/05 04:00:34 Retreiving plugins.
2021/07/05 04:00:34 No MIG devices found. Falling back to mig.strategy=&{}
2021/07/05 04:00:34 No devices found. Waiting indefinitely.
$ kubectl logs nvidia-device-plugin-daemonset-tz4z9 -n gpu-operator-resources
2021/07/06 01:38:57 Loading NVML
2021/07/06 01:38:57 Starting FS watcher.
2021/07/06 01:38:57 Starting OS watcher.
2021/07/06 01:38:57 Retreiving plugins.
2021/07/06 01:38:57 No MIG devices found. Falling back to mig.strategy=&{}
2021/07/06 01:38:57 No devices found. Waiting indefinitely.
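
The plugin sees no devices even though nvidia-smi on the node does. One way to narrow this down would be to run nvidia-smi from inside one of the plugin pods; this is a sketch, assuming nvidia-smi is available on the container's PATH via the injected driver:

$ kubectl exec -n gpu-operator-resources nvidia-device-plugin-daemonset-rsjs8 -- nvidia-smi -L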

@AnkitPurohit01 (Author) commented:

The output of /run/nvidia/driver/usr/bin/nvidia-bug-report.sh can be found here:
https://drive.google.com/file/d/1CxLqEZNxH3aBCVTwAH622ddkoBnMfYYO/view?usp=sharing

@dvaldivia commented:

Did you find a solution or cause for this problem, @AnkitPurohit01?

@francescov1 commented:

For anyone still looking, this solution worked for me: NVIDIA/dcgm-exporter#59 (comment). There is additional background on the fix in NVIDIA/gpu-monitoring-tools#96 (comment).
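
After applying the change from those linked comments, recovery can be verified with something like the following (the app label selector is an assumption, and the node name is a placeholder):

$ kubectl get pods -n gpu-operator-resources -l app=nvidia-dcgm-exporter   # label selector is assumed
$ kubectl describe node <node-name> | grep nvidia.com/gpu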
