DCGM initialization error #222

Open · AnkitPurohit01 opened this issue Jul 7, 2021 · 5 comments

AnkitPurohit01 commented Jul 7, 2021

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes? (see the check sketched just below this list)
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
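
For the last two checklist items, a quick check might look like the following; this is only a sketch, using the module and resource names given in the checklist above (module check on a worker node, CRD check against the cluster):

$ lsmod | grep -E 'i2c_core|ipmi_msghandler'   # modules named in the checklist above
$ kubectl describe clusterpolicies --all-namespaces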

1. Issue or feature description

We installed NVIDIA GPU Operator version 1.7.1 on our Kubernetes cluster using Helm, but the DCGM exporter fails with a DCGM initialization error and GPU resources are never discovered by the nodes.

Please check the following logs.

2. Steps to reproduce the issue
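
As noted above, the operator was installed with Helm. The exact chart values used were not recorded, so the following is only an assumed minimal reproduction sketch (NVIDIA Helm repo URL, chart name, and version flag; actual flags may have differed):

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
$ helm install --wait --generate-name nvidia/gpu-operator --version 1.7.1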

3. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods --all-namespaces
$ k get pod
NAME                                       READY   STATUS             RESTARTS   AGE
gpu-feature-discovery-5jjwl                1/1     Running            3          20h
gpu-feature-discovery-jfxq8                1/1     Running            0          20h
gpu-feature-discovery-kcr2p                1/1     Running            3          20h
nvidia-container-toolkit-daemonset-8r4df   1/1     Running            0          20h
nvidia-container-toolkit-daemonset-c2lw8   1/1     Running            0          20h
nvidia-container-toolkit-daemonset-mmvzk   1/1     Running            0          20h
nvidia-cuda-validator-fcffx                0/1     Completed          0          20h
nvidia-cuda-validator-j8x8w                0/1     Completed          0          20h
nvidia-cuda-validator-q79nf                0/1     Completed          0          20h
nvidia-dcgm-exporter-5kc4x                 0/1     CrashLoopBackOff   242        20h
nvidia-dcgm-exporter-98kbb                 0/1     CrashLoopBackOff   242        20h
nvidia-dcgm-exporter-fdqgd                 0/1     CrashLoopBackOff   242        20h
nvidia-device-plugin-daemonset-jwsm4       1/1     Running            0          20h
nvidia-device-plugin-daemonset-rsjs8       1/1     Running            3          20h
nvidia-device-plugin-daemonset-tz4z9       1/1     Running            3          20h
nvidia-driver-daemonset-rx22m              1/1     Running            0          20h
nvidia-driver-daemonset-t8tkj              1/1     Running            0          20h
nvidia-driver-daemonset-vb6hh              1/1     Running            0          20h
nvidia-operator-validator-rkpqf            0/1     Init:3/4           163        20h
nvidia-operator-validator-tft4t            0/1     Init:3/4           165        20h
nvidia-operator-validator-xdjk8            0/1     Init:3/4           165        20h
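
Given the CrashLoopBackOff on the exporter and the stuck validator pods above, it may also be worth confirming whether the nodes advertise the GPU resource at all; a rough check (node name is a placeholder):

$ kubectl describe node <node-name> | grep nvidia.com/gpu   # <node-name>: any GPU worker node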
  • kubernetes daemonset status: kubectl get ds -n gpu-operator-resources
NAME                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery                3         3         3       3            3           nvidia.com/gpu.deploy.gpu-feature-discovery=true   5d2h
nvidia-container-toolkit-daemonset   3         3         3       3            3           nvidia.com/gpu.deploy.container-toolkit=true       5d2h
nvidia-dcgm-exporter                 3         3         0       3            0           nvidia.com/gpu.deploy.dcgm-exporter=true           5d2h
nvidia-device-plugin-daemonset       3         3         3       3            3           nvidia.com/gpu.deploy.device-plugin=true           5d2h
nvidia-driver-daemonset              3         3         3       3            3           nvidia.com/gpu.deploy.driver=true                  5d2h
nvidia-mig-manager                   0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             5d2h
nvidia-operator-validator            3         3         0       3            0           nvidia.com/gpu.deploy.operator-validator=true      5d2h
  • kubectl describe daemonsets -n gpu-operator-resources
Events:
  Type     Reason            Age                   From                  Message
  ----     ------            ----                  ----                  -------
  Normal   SuccessfulCreate  27s                   daemonset-controller  Created pod: nvidia-operator-validator-tft4t
  Normal   SuccessfulCreate  27s                   daemonset-controller  Created pod: nvidia-operator-validator-rkpqf
  Normal   SuccessfulCreate  27s                   daemonset-controller  Created pod: nvidia-operator-validator-xdjk8
  • If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME
  • DCGM exporter error state: k logs nvidia-dcgm-exporter-5kc4x
time="2021-07-06T00:15:36Z" level=info msg="Starting dcgm-exporter"
DCGM Failed to find any GPUs on the node.
time="2021-07-06T00:15:36Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"
  • nvidia-operator-validator pods:
  1. cuda-validation init container
time="2021-07-05T04:00:45Z" level=info msg="pod nvidia-cuda-validator-q79nf is curently in Pending phase"
time="2021-07-05T04:00:50Z" level=info msg="pod nvidia-cuda-validator-q79nf have run successfully"
  2. driver-validation init container
running command chroot with args [/run/nvidia/driver nvidia-smi]
Mon Jul  5 04:00:24 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:84:00.0 Off |                    0 |
| N/A   39C    P0    26W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
  3. nvidia-operator-validator init container
Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-xdjk8" is waiting to start: PodInitializing
  4. plugin-validation init container
time="2021-07-06T01:25:22Z" level=info msg="GPU resources are not yet discovered by the node, retry: 1"
time="2021-07-06T01:25:27Z" level=info msg="GPU resources are not yet discovered by the node, retry: 2"
time="2021-07-06T01:25:32Z" level=info msg="GPU resources are not yet discovered by the node, retry: 3"
...
time="2021-07-06T01:27:42Z" level=info msg="GPU resources are not yet discovered by the node, retry: 29"
time="2021-07-06T01:27:47Z" level=info msg="GPU resources are not yet discovered by the node, retry: 30"
time="2021-07-06T01:27:52Z" level=info msg="Error: error validating plugin installation: GPU resources are not discovered by the node"
  5. toolkit-validation init container
Mon Jul  5 04:00:43 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:84:00.0 Off |                    0 |
| N/A   37C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
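
For completeness, the previous-container logs and events of one of the crashing exporter pods can be collected as well (pod name taken from the listing above):

$ kubectl logs -n gpu-operator-resources nvidia-dcgm-exporter-5kc4x --previous
$ kubectl describe pod -n gpu-operator-resources nvidia-dcgm-exporter-5kc4x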
@shivamerla (Contributor) commented:

@AnkitPurohit01 Can you get logs from the device-plugin-daemonset pods? I see that some of those pods have restarted, so the logs might help. Also, please run the following command from any of the worker nodes and attach the output along with syslog: /run/nvidia/driver/usr/bin/nvidia-bug-report.sh

@AnkitPurohit01 (Author) commented:

  1. Logs from device-plugin-daemonset pods
$ kubectl logs nvidia-device-plugin-daemonset-rsjs8 -n gpu-operator-resources
2021/07/05 04:00:38 Loading NVML
2021/07/05 04:00:38 Starting FS watcher.
2021/07/05 04:00:38 Starting OS watcher.
2021/07/05 04:00:38 Retreiving plugins.
2021/07/05 04:00:38 No MIG devices found. Falling back to mig.strategy=&{}
2021/07/05 04:00:38 No devices found. Waiting indefinitely.
$ kubectl logs nvidia-device-plugin-daemonset-jwsm4 -n gpu-operator-resources
2021/07/05 04:00:34 Loading NVML
2021/07/05 04:00:34 Starting FS watcher.
2021/07/05 04:00:34 Starting OS watcher.
2021/07/05 04:00:34 Retreiving plugins.
2021/07/05 04:00:34 No MIG devices found. Falling back to mig.strategy=&{}
2021/07/05 04:00:34 No devices found. Waiting indefinitely.
$ kubectl logs nvidia-device-plugin-daemonset-tz4z9 -n gpu-operator-resources
2021/07/06 01:38:57 Loading NVML
2021/07/06 01:38:57 Starting FS watcher.
2021/07/06 01:38:57 Starting OS watcher.
2021/07/06 01:38:57 Retreiving plugins.
2021/07/06 01:38:57 No MIG devices found. Falling back to mig.strategy=&{}
2021/07/06 01:38:57 No devices found. Waiting indefinitely.
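
The plugin sees no devices even though nvidia-smi on the node does. One way to narrow this down would be to run nvidia-smi from inside one of the plugin pods; this is a sketch, assuming nvidia-smi is available on the container's PATH via the injected driver:

$ kubectl exec -n gpu-operator-resources nvidia-device-plugin-daemonset-rsjs8 -- nvidia-smi -L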

@AnkitPurohit01 (Author) commented:

The output of /run/nvidia/driver/usr/bin/nvidia-bug-report.sh can be found here:
https://drive.google.com/file/d/1CxLqEZNxH3aBCVTwAH622ddkoBnMfYYO/view?usp=sharing

@dvaldivia commented:

Did you find a solution or cause for this problem, @AnkitPurohit01?

@francescov1 commented:

For anyone still looking, this solution worked for me: NVIDIA/dcgm-exporter#59 (comment). There is additional background on the fix in NVIDIA/gpu-monitoring-tools#96 (comment).
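
After applying the change from those linked comments, recovery can be verified with something like the following (the app label selector is an assumption, and the node name is a placeholder):

$ kubectl get pods -n gpu-operator-resources -l app=nvidia-dcgm-exporter   # label selector is assumed
$ kubectl describe node <node-name> | grep nvidia.com/gpu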
