Errors in nv-hostengine log #141
Comments
Can you please check the dmesg messages and confirm if you are using the GSP driver? |
The installed drivers are 535.129.03, based on https://download.nvidia.com/XFree86/Linux-x86_64/535.129.03/README/gsp.html, but I don't see GSP mentioned in dmesg. Could you provide more details on what to look for in dmesg? |
We see the same errors on DGX-H100. Same nv-hostengine, driver, (GSP) firmware, etc.
Not sure if this is relevant, but here's the metrics collected by dcgm-exporter.
@nikkon-dev please let us know if you need additional information, and if this is a DCGM or dcgm-exporter issue. |
@nikkon-dev Any news on this? Upgraded to dcgm 3.3.3 and dcgm-exporter 3.3.3-3.2.0
Still seeing the errors, but found also
The system has the latest DGXOS 6.1, latest fw 1.1.3, and all Ubuntu package upgrades applied; the driver is 535.154.05. |
FWIW I'm seeing these same messages using libraries from the
For the GPU firmware: cat /proc/driver/nvidia/gpus/0000:00:03.0/information
Model: NVIDIA L4
IRQ: 11
GPU UUID: GPU-7109a529-1f03-bf39-329f-0c4b56c7e6e3
Video BIOS: 95.04.29.00.07
Bus Type: PCI
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:00:03.0
Device Minor: 0
GPU Firmware: 535.129.03
GPU Excluded: No |
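For anyone who wants to compare this field across several GPUs or hosts, here is a minimal Python sketch that parses the `information` output shown above into key/value pairs. The function names and the `root` default are hypothetical helpers written for illustration, not part of any NVIDIA tool, and the sketch only extracts the fields; whether a populated "GPU Firmware" value by itself confirms that the GSP driver path is in use is an assumption that the maintainers would need to confirm.

```python
from pathlib import Path

# Sample taken from the output posted in this thread.
SAMPLE = """\
Model: NVIDIA L4
IRQ: 11
GPU UUID: GPU-7109a529-1f03-bf39-329f-0c4b56c7e6e3
Video BIOS: 95.04.29.00.07
Bus Type: PCI
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:00:03.0
Device Minor: 0
GPU Firmware: 535.129.03
GPU Excluded: No
"""


def parse_gpu_information(text: str) -> dict:
    """Turn 'Key: Value' lines into a dict, splitting on the first colon only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon at all
            fields[key.strip()] = value.strip()
    return fields


def read_all_gpus(root: str = "/proc/driver/nvidia/gpus") -> dict:
    """Map each GPU bus directory name to its parsed information file.

    Returns an empty dict when the directory does not exist
    (for example on a machine without the NVIDIA driver loaded).
    """
    result = {}
    for info_file in sorted(Path(root).glob("*/information")):
        result[info_file.parent.name] = parse_gpu_information(info_file.read_text())
    return result


if __name__ == "__main__":
    gpus = read_all_gpus() or {"sample": parse_gpu_information(SAMPLE)}
    for bus, fields in gpus.items():
        print(bus, fields.get("Model"), fields.get("GPU Firmware"))
```

Splitting on the first colon only keeps values such as the bus location (which itself contains colons) intact.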
Though I want to point out, I'm deploying
So I dunno if dcgm-exporter is elegantly handling/defaulting here in the case of errors, or if it's reporting an incorrect metric because of real errors from DCGM. |
I am also facing the same issue, with errors in the logs. When I change the dcgm-exporter image tag to latest (nvcr.io/nvidia/k8s/dcgm-exporter:latest), I see the following log line: time="2024-05-24T08:48:21Z" level=info msg="Starting dcgm-exporter". However, I am not getting the metrics in Grafana. I can see the nv-hostengine logs, which do not look good.
|
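Since the exporter line quoted above is in key=value form (time=, level=, msg=), one way to check whether the exporter itself logs anything beyond info is to parse each line and filter on the level. This is a rough sketch only: the function names are hypothetical, the set of level names treated as problems is an assumption about the exporter's logger, and it does not apply to the nv-hostengine log, which uses a different format.

```python
import re

# key=value pairs where the value is either a double-quoted string or a bare token
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S*)')

# Assumption: these are the level names the exporter emits for problems.
PROBLEM_LEVELS = {"warning", "error", "fatal", "panic"}


def parse_logfmt(line: str) -> dict:
    """Parse one key=value log line into a dict, unquoting quoted values."""
    fields = {}
    for key, raw in _PAIR.findall(line):
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            raw = raw[1:-1].replace('\\"', '"')
        fields[key] = raw
    return fields


def is_problem(line: str) -> bool:
    """True when the line's level field is one of PROBLEM_LEVELS."""
    return parse_logfmt(line).get("level") in PROBLEM_LEVELS


if __name__ == "__main__":
    import sys

    # Usage: pipe the container log into this script on stdin.
    for log_line in sys.stdin:
        if is_problem(log_line):
            print(log_line, end="")
```

If this prints nothing while Grafana still shows no data, the problem is more likely on the scrape or nv-hostengine side than in the exporter's own logging.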
We use dcgm-exporter as described in https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html#connecting-to-an-existing-dcgm-agent. The nv-hostengine is version 3.1.8, the dcgm-exporter container is nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04. We use a custom metrics file with the following metrics:
On a DGX-H100 system, with DGXOS6 installed and latest FW updates, I have noticed the following errors in the nv-hostengine logs. Any ideas what these are?
In addition, if we enable DCGM_FI_DEV_XID_ERRORS then the logs get filled quite quickly by the following ERROR: