This repository has been archived by the owner on Nov 2, 2021. It is now read-only.

dcgm-exporter missing many metrics after upgrade #143

Open

huww98 opened this issue Dec 21, 2020 · 9 comments


huww98 commented Dec 21, 2020

I've updated our dcgm-exporter, deployed directly in Docker, to tag 2.0.13-2.1.2-ubuntu20.04, but many metrics are now missing.

It only exports 18 metrics, compared with 34 in tag 1.7.2. Is this expected, or is it a bug?

This is the command we use:

docker run -d --gpus all -p 9400:9400 --name dcgm-exporter --restart unless-stopped nvidia/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04
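For reference, a quick way to reproduce the metric counts above (a sketch; assumes the exporter is reachable on localhost:9400 as in the command):

# Count distinct DCGM metric names exposed on the /metrics endpoint.
curl -s http://localhost:9400/metrics | grep '^DCGM_' | cut -d'{' -f1 | sort -u | wc -l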

The following metrics are missing. I do see them enabled in default-counters.csv though.

DCGM_FI_DEV_MEMORY_TEMP
DCGM_FI_DEV_PCIE_TX_THROUGHPUT
DCGM_FI_DEV_PCIE_RX_THROUGHPUT
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION
DCGM_FI_DEV_LOW_UTIL_VIOLATION
DCGM_FI_DEV_RELIABILITY_VIOLATION
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL
DCGM_FI_DEV_RETIRED_SBE
DCGM_FI_DEV_RETIRED_DBE
DCGM_FI_DEV_RETIRED_PENDING
DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL
DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL
DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL
DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL

shatil commented Feb 12, 2021

After upgrading from 2.0.0-rc12 to 2.1.2 (building from source using the tags in the Git repo), I'm missing these:

DCGM_FI_DEV_PCIE_RX_THROUGHPUT
DCGM_FI_DEV_PCIE_TX_THROUGHPUT

The rest appear to be there, but I haven't really compared the values to see if they end up in the same ballpark.


asaulys commented Feb 17, 2021

Based on the 2.0.0-rc.12...master comparison, there are some changes related to metrics: filtering zero values and masking others "based on constant". These are worth looking into to see if they're causing the missing metrics. IIRC there were a few metrics that would never display real values.


jfolz commented Mar 7, 2021

One of our machines "involuntarily" updated the dcgm-exporter Docker image, and we're now missing some metrics like DCGM_FI_DEV_GPU_UTIL, which is kind of crucial.

Here's the full list:

DCGM_FI_DEV_GPU_UTIL
DCGM_FI_DEV_POWER_VIOLATION
DCGM_FI_DEV_THERMAL_VIOLATION
DCGM_FI_DEV_SYNC_BOOST_VIOLATION
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION
DCGM_FI_DEV_LOW_UTIL_VIOLATION
DCGM_FI_DEV_RELIABILITY_VIOLATION
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL
DCGM_FI_DEV_RETIRED_SBE
DCGM_FI_DEV_RETIRED_DBE
DCGM_FI_DEV_RETIRED_PENDING
DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL
DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL
DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL
DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL

We also gained these:

DCGM_FI_DEV_VGPU_LICENSE_STATUS
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS
DCGM_FI_DEV_ROW_REMAP_FAILURE

@nikkon-dev

Thank you for using the dcgm-exporter project and for reporting this issue. We are sorry to hear your scenarios were negatively affected by our changes. Unfortunately, we deliberately changed the set of metrics that are enabled by default. I'd recommend providing your own .csv configuration file with only those metrics that you need and use.
There were several reasons behind this change:
1) The previous set of metrics included ECC metrics that are very expensive to collect. We got multiple complaints about 100% CPU core utilization, mainly because of those ECC metrics. We also found that most users do not provide a .csv configuration file with only the necessary metrics; instead, they filter out or simply ignore the metrics they are not interested in. That behavior leads to significant performance hits, as collecting each metric is not free and consumes GPU resources.
2) Some metrics that were previously enabled by default are deprecated and should be replaced with new ones. For example, DCGM_FI_DEV_GPU_UTIL should be replaced with DCGM_FI_PROF_GR_ENGINE_ACTIVE, DCGM_FI_PROF_SM_ACTIVE, or DCGM_FI_PROF_SM_OCCUPANCY, depending on your needs; DCGM_FI_DEV_PCIE_{RX,TX}_THROUGHPUT may be replaced with DCGM_FI_PROF_PCIE_{RX,TX}_BYTES; and the very CPU-heavy ECC metrics may be replaced with DCGM_FI_DEV_XID_ERRORS.
3) The previous default metrics set had almost all DCGM_FI_PROF_* metrics enabled, which created unnecessary load on GPUs: not all PROF metrics can be collected together in a single pass.

Considering all of the above, we changed the default .csv configuration file to keep only a basic set of metrics that does not put unnecessary load on users' systems. We urge you to provide your own .csv configuration file with carefully selected metrics that you need to monitor. We have not deleted the metrics themselves, so you can still enable the previous metrics if you choose to ignore my recommendations about the deprecated ones.
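For example, a minimal custom configuration could look like the sketch below. The .csv format is the same as in default-counters.csv (DCGM field, prometheus metric type, help string); the file name custom-counters.csv is just an example, and this assumes the container passes extra arguments through to the dcgm-exporter binary, which accepts -f to point at a collectors file:

# Write a minimal collectors file with only the metrics you need.
cat > custom-counters.csv <<'EOF'
# Format: DCGM field, prometheus metric type, help message
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
EOF

# Mount it into the container and point the exporter at it.
docker run -d --gpus all -p 9400:9400 \
  -v "$(pwd)/custom-counters.csv:/etc/dcgm-exporter/custom-counters.csv" \
  nvidia/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04 \
  -f /etc/dcgm-exporter/custom-counters.csv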


jfolz commented Mar 10, 2021

@nikkon-dev thanks for the update. In the meantime we re-enabled DCGM_FI_DEV_GPU_UTIL via the collectors config (excerpt below), but I fully agree that it's a bad metric that doesn't reflect actual utilization (the GPU could be computing 1+1 over and over for all we know). Ideally I would like to transition to DCGM_FI_PROF_SM_OCCUPANCY, as long as that does not incur a performance hit. Some advice on the performance impact of individual metrics would be very welcome :)
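Concretely, the lines we keep in our collectors .csv during the transition look roughly like this (a sketch; the SM occupancy line is the candidate replacement, not something we have fully validated yet):

# Collectors file excerpt: old metric kept alongside its candidate replacement.
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_PROF_SM_OCCUPANCY, gauge, Ratio of warps resident on an SM to the maximum.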

Effectively, the issue we had was one of documentation. We deploy the dcgm-exporter Docker image as a systemd service as defined by deepops. It pulls the newest image whenever the service starts; that's the root problem in my book, and we're looking at options for how to fix it (one sketch below). From our point of view, metrics we needed just suddenly disappeared, and we couldn't figure out on our own how to get them back. Looking through the commits, it was version 2.3.0 that disabled DCGM_FI_DEV_GPU_UTIL in the default config. The release notes only say "Enable 1.x backwards compatibility, refactor default watch fields". A changelog, or more complete release notes that actually reflect what was changed, would have helped a lot.
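One option we are considering is pinning the image by digest instead of a floating tag (standard Docker commands; a sketch, not something we have rolled out yet):

# Resolve the tag we validated to an immutable digest...
docker pull nvidia/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04
docker inspect --format '{{index .RepoDigests 0}}' nvidia/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04
# ...then reference the printed nvidia/dcgm-exporter@sha256:... in the
# systemd unit's ExecStart so the service stops tracking a moving tag.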


mattf commented Mar 29, 2021

@nikkon-dev what recommendation do you have for people using https://grafana.com/grafana/dashboards/12239 ?


nikkon-dev commented Mar 30, 2021

@mattf, thank you for pointing to that Grafana dashboard. I reached out to the author, and we will update the dashboard to match the current set of enabled-by-default metrics. Going forward, we want to investigate whether such dashboards could be autogenerated from the dcgm-exporter configuration.
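As a crude stopgap until the dashboard is updated, one could rewrite an exported copy of the dashboard JSON (purely illustrative; dashboard-12239.json is a hypothetical local export, and DCGM_FI_PROF_GR_ENGINE_ACTIVE reports a 0-1 ratio, hence the * 100 to keep percent-scaled panels working):

# Swap the deprecated utilization metric for the profiling-based one
# in a locally exported copy of the dashboard.
sed -i 's/DCGM_FI_DEV_GPU_UTIL/(DCGM_FI_PROF_GR_ENGINE_ACTIVE * 100)/g' dashboard-12239.json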

@anannaya

@nikkon-dev Do you have the updated dashboard?

@nikkon-dev

@anannaya,

We updated the dashboard to reflect the current state of the default dcgm-exporter configuration.
Keep in mind that this is not a robust solution: any change in the set of enabled metrics may break the dashboard.
We are considering a long-term solution, but for now I would recommend specifying the CSV config explicitly and not relying on the default one we provide as an example; we may decide to change it at any moment in the future.

WBR,
Nik
