This repository has been archived by the owner on Nov 2, 2021. It is now read-only.

dcgm-exporter missing many metrics after upgrade #143

Open

huww98 opened this issue Dec 21, 2020 · 9 comments


huww98 commented Dec 21, 2020

I've updated our dcgm-exporter, deployed directly in Docker, to tag 2.0.13-2.1.2-ubuntu20.04, but many metrics are now missing.

It only exports 18 metrics, compared with 34 in tag 1.7.2. Is this expected, or is it a bug?

This is the command we use:

docker run -d --gpus all -p 9400:9400 --name dcgm-exporter --restart unless-stopped nvidia/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04
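For reference, a quick way to reproduce the metric counts above (a sketch; assumes the exporter is reachable on localhost:9400 as in the command):

# Count distinct DCGM metric names exposed on the /metrics endpoint.
curl -s http://localhost:9400/metrics | grep '^DCGM_' | cut -d'{' -f1 | sort -u | wc -l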

The following metrics are missing. I do see them enabled in default-counters.csv though.

DCGM_FI_DEV_MEMORY_TEMP
DCGM_FI_DEV_PCIE_TX_THROUGHPUT
DCGM_FI_DEV_PCIE_RX_THROUGHPUT
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION
DCGM_FI_DEV_LOW_UTIL_VIOLATION
DCGM_FI_DEV_RELIABILITY_VIOLATION
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL
DCGM_FI_DEV_RETIRED_SBE
DCGM_FI_DEV_RETIRED_DBE
DCGM_FI_DEV_RETIRED_PENDING
DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL
DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL
DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL
DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL

shatil commented Feb 12, 2021

After upgrading from 2.0.0-rc12 to 2.1.2 (building from source using the tags in the Git repo), I'm missing these:

DCGM_FI_DEV_PCIE_RX_THROUGHPUT
DCGM_FI_DEV_PCIE_TX_THROUGHPUT

The rest appear to be there, but I haven't really compared the values to see if they end up in the same ballpark.


asaulys commented Feb 17, 2021

Based on the 2.0.0-rc.12...master comparison, there are some changes related to metrics: filtering zero values and masking others "based on constant". These are worth looking into to see if they're causing the missing metrics. IIRC there were a few metrics that would never display real values.


jfolz commented Mar 7, 2021

One of our machines "involuntarily" updated the dcgm-exporter Docker image, and we're now missing some metrics like DCGM_FI_DEV_GPU_UTIL, which is kind of crucial.

Here's the full list:

DCGM_FI_DEV_GPU_UTIL
DCGM_FI_DEV_POWER_VIOLATION
DCGM_FI_DEV_THERMAL_VIOLATION
DCGM_FI_DEV_SYNC_BOOST_VIOLATION
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION
DCGM_FI_DEV_LOW_UTIL_VIOLATION
DCGM_FI_DEV_RELIABILITY_VIOLATION
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL
DCGM_FI_DEV_RETIRED_SBE
DCGM_FI_DEV_RETIRED_DBE
DCGM_FI_DEV_RETIRED_PENDING
DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL
DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL
DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL
DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL

We also gained these:

DCGM_FI_DEV_VGPU_LICENSE_STATUS
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS
DCGM_FI_DEV_ROW_REMAP_FAILURE

@nikkon-dev

Thank you for using the dcgm-exporter project and for reporting this issue. We are sorry to hear your scenarios were negatively affected by our changes. Unfortunately, we deliberately changed the set of metrics that are enabled by default. I'd recommend providing your own .csv configuration file with only those metrics that you need and use.
There were several reasons behind this change:
1) The previous set of metrics included ECC metrics that are very expensive to collect. We got multiple complaints about 100% CPU core utilization, mainly because of those ECC metrics. We also found that most users do not provide a .csv configuration file with only the necessary metrics; instead, they filter out or simply ignore the metrics they are not interested in. That behavior leads to significant performance hits, as collecting each metric is not free and consumes GPU resources.
2) Some metrics that were previously enabled by default are deprecated and should be replaced with new ones. For example, DCGM_FI_DEV_GPU_UTIL should be replaced with DCGM_FI_PROF_GR_ENGINE_ACTIVE, DCGM_FI_PROF_SM_ACTIVE, or DCGM_FI_PROF_SM_OCCUPANCY, depending on your needs; DCGM_FI_DEV_PCIE_{RX,TX}_THROUGHPUT may be replaced with DCGM_FI_PROF_PCIE_{RX,TX}_BYTES; and the very CPU-heavy ECC metrics may be replaced with DCGM_FI_DEV_XID_ERRORS.
3) The previous default metrics set had almost all DCGM_FI_PROF_* metrics enabled, which created unnecessary load on GPUs: not all PROF metrics can be collected together in a single pass.

Considering all of the above, we changed the default .csv configuration file to keep only a basic set of metrics that does not put unnecessary load on users' systems. We urge you to provide your own .csv configuration file with carefully selected metrics that you need to monitor. We have not deleted the metrics themselves, so you can still enable the previous metrics if you choose to ignore my recommendations about the deprecated ones.
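For example, a minimal custom configuration could look like the sketch below. The .csv format is the same as in default-counters.csv (DCGM field, prometheus metric type, help string); the file name custom-counters.csv is just an example, and this assumes the container passes extra arguments through to the dcgm-exporter binary, which accepts -f to point at a collectors file:

# Write a minimal collectors file with only the metrics you need.
cat > custom-counters.csv <<'EOF'
# Format: DCGM field, prometheus metric type, help message
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
EOF

# Mount it into the container and point the exporter at it.
docker run -d --gpus all -p 9400:9400 \
  -v "$(pwd)/custom-counters.csv:/etc/dcgm-exporter/custom-counters.csv" \
  nvidia/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04 \
  -f /etc/dcgm-exporter/custom-counters.csv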


jfolz commented Mar 10, 2021

@nikkon-dev thanks for the update. In the meantime we re-enabled DCGM_FI_DEV_GPU_UTIL via the collectors config (excerpt below), but I fully agree that it's a bad metric that doesn't reflect actual utilization (the GPU could be computing 1+1 over and over for all we know). Ideally I would like to transition to DCGM_FI_PROF_SM_OCCUPANCY, as long as that does not incur a performance hit. Some advice on the performance impact of individual metrics would be very welcome :)
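Concretely, the lines we keep in our collectors .csv during the transition look roughly like this (a sketch; the SM occupancy line is the candidate replacement, not something we have fully validated yet):

# Collectors file excerpt: old metric kept alongside its candidate replacement.
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_PROF_SM_OCCUPANCY, gauge, Ratio of warps resident on an SM to the maximum.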

Effectively, the issue we had was one of documentation. We deploy the dcgm-exporter Docker image as a systemd service as defined by deepops. It pulls the newest image whenever the service starts; that's the root problem in my book, and we're looking at options for how to fix it (one sketch below). From our point of view, metrics we needed just suddenly disappeared, and we couldn't figure out on our own how to get them back. Looking through the commits, it was version 2.3.0 that disabled DCGM_FI_DEV_GPU_UTIL in the default config. The release notes only say "Enable 1.x backwards compatibility, refactor default watch fields". A changelog, or more complete release notes that actually reflect what was changed, would have helped a lot.
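One option we are considering is pinning the image by digest instead of a floating tag (standard Docker commands; a sketch, not something we have rolled out yet):

# Resolve the tag we validated to an immutable digest...
docker pull nvidia/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04
docker inspect --format '{{index .RepoDigests 0}}' nvidia/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04
# ...then reference the printed nvidia/dcgm-exporter@sha256:... in the
# systemd unit's ExecStart so the service stops tracking a moving tag.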


mattf commented Mar 29, 2021

@nikkon-dev what recommendation do you have for people using https://grafana.com/grafana/dashboards/12239 ?


nikkon-dev commented Mar 30, 2021

@mattf, thank you for pointing to that Grafana dashboard. I reached out to the author, and we will update the dashboard to match the current set of enabled-by-default metrics. Going forward, we want to investigate whether such dashboards could be autogenerated from the dcgm-exporter configuration.
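As a crude stopgap until the dashboard is updated, one could rewrite an exported copy of the dashboard JSON (purely illustrative; dashboard-12239.json is a hypothetical local export, and DCGM_FI_PROF_GR_ENGINE_ACTIVE reports a 0-1 ratio, hence the * 100 to keep percent-scaled panels working):

# Swap the deprecated utilization metric for the profiling-based one
# in a locally exported copy of the dashboard.
sed -i 's/DCGM_FI_DEV_GPU_UTIL/(DCGM_FI_PROF_GR_ENGINE_ACTIVE * 100)/g' dashboard-12239.json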

@anannaya

@nikkon-dev Do you have the updated dashboard?

@nikkon-dev

@anannaya,

We updated the dashboard to reflect the current state of the default dcgm-exporter configuration.
Keep in mind that this is not a robust solution: any change in the set of enabled metrics may break the dashboard.
We are considering a long-term solution, but for now I would recommend specifying the CSV config explicitly and not relying on the default one we provide as an example; we may decide to change it at any moment in the future.

WBR,
Nik
