DCGM initialization error #222
Comments
@AnkitPurohit01 Can you get logs from the device-plugin-daemonset pods? I see that some of those pods have restarted, so the logs might help. Also, please run this command from any of the worker nodes and attach the output along with the syslog.
The result of
Did you find a solution to, or the cause of, this problem, @AnkitPurohit01?
For anyone still looking, I found this solution worked for me: NVIDIA/dcgm-exporter#59 (comment). There's some additional info about the solution here: NVIDIA/gpu-monitoring-tools#96 (comment)
1. Quick Debug Checklist
Are the i2c_core and ipmi_msghandler kernel modules loaded on the nodes?
kubectl describe clusterpolicies --all-namespaces
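For the kernel module item in the checklist, a minimal sketch of one way to check is shown here. The helper name `check_modules` is hypothetical (it is not part of the GPU Operator or the issue template); it simply reads `lsmod`-style output on stdin and reports whether each of the two modules is listed.

```shell
# Sketch only: check_modules is an illustrative helper, not an NVIDIA tool.
# It reads lsmod-style output on stdin and reports on the two checklist modules.
check_modules() {
  local input mod
  input="$(cat)"
  for mod in i2c_core ipmi_msghandler; do
    # The first column of lsmod output is the module name; match it exactly.
    if printf '%s\n' "$input" | awk '{print $1}' | grep -qx "$mod"; then
      echo "$mod: loaded"
    else
      echo "$mod: not listed"
    fi
  done
}
```

On a worker node this would be run as `lsmod | check_modules`. A module that is built into the kernel does not appear in `lsmod` output, so "not listed" is a hint to look further and not proof that the module is unavailable.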
1. Issue or feature description
We installed NVIDIA GPU Operator version 1.7.1 on our Kubernetes cluster using Helm, but there seems to be a DCGM initialization error, and GPU resources are not discovered by the node.
Please check the following logs
2. Steps to reproduce the issue
3. Information to attach (optional if deemed irrelevant)
kubectl get pods --all-namespaces
kubectl get ds -n gpu-operator-resources
kubectl describe daemonsets -n gpu-operator-resources
kubectl describe pod -n NAMESPACE POD_NAME
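The commands above can be gathered into a single attachment with a small wrapper. This is a sketch under assumptions: the function name `collect_gpu_operator_info` and the `KUBECTL` and `NAMESPACE` variables are illustrative, and the namespace default is taken from the commands listed above. The per-pod `kubectl describe pod` step is left out because it needs a specific pod name.

```shell
# Sketch only: collect_gpu_operator_info, KUBECTL and NAMESPACE are
# illustrative names, not part of the issue template.
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="${NAMESPACE:-gpu-operator-resources}"

collect_gpu_operator_info() {
  echo "== pods (all namespaces) =="
  "$KUBECTL" get pods --all-namespaces
  echo "== daemonsets in $NAMESPACE =="
  "$KUBECTL" get ds -n "$NAMESPACE"
  echo "== daemonset details in $NAMESPACE =="
  "$KUBECTL" describe daemonsets -n "$NAMESPACE"
}
```

For example, `collect_gpu_operator_info > gpu-operator-info.txt 2>&1` writes everything to one file that can be attached to the issue.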