Metric DCGM_FI_DEV_FB_FREE is not exported as part of 2.4.5-2.6.7-ubuntu20.04 image #103

Closed
omer-dayan opened this issue Sep 20, 2022 · 2 comments

Comments

omer-dayan commented Sep 20, 2022

I upgraded to gpu-operator v1.11.1, which ships with the dcgm-exporter:2.4.5-2.6.7-ubuntu20.04 image.

After the upgrade I stopped seeing the metrics DCGM_FI_DEV_FB_FREE and DCGM_FI_DEV_FB_USED.
However, when I changed the dcgm-exporter image to the 2.4.6-2.6.8-ubuntu20.04 tag, they appeared again.
Is it supposed to be like that?

k8s: v1.20.5
GPU: A100 (MIG Enabled, mixed strategy, single device of 1g.5gb)
GPU Operator: v1.11.1 (driver disabled, as the drivers are installed on the host)
Driver: 470.141.03
CUDA: 11.4
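
One way to confirm which of the two framebuffer metrics are actually being served is to scan the exporter's Prometheus text output for their names. The following is a minimal sketch, not part of the exporter: the helper names `metric_names` and `missing_metrics` are hypothetical, and `SAMPLE` is made-up text for illustration rather than real exporter output.

```python
import re

# Metric names taken from the issue; everything else here is illustrative.
EXPECTED = ("DCGM_FI_DEV_FB_FREE", "DCGM_FI_DEV_FB_USED")


def metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line starts with the metric name, followed by labels or a value.
        match = re.match(r"[A-Za-z_:][A-Za-z0-9_:]*", line)
        if match:
            names.add(match.group(0))
    return names


def missing_metrics(exposition_text, expected=EXPECTED):
    """Return the expected metric names that do not appear in the output."""
    present = metric_names(exposition_text)
    return [name for name in expected if name not in present]


# Hypothetical sample, not real exporter output.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 35
DCGM_FI_DEV_FB_USED{gpu="0"} 0
"""

print(missing_metrics(SAMPLE))  # → ['DCGM_FI_DEV_FB_FREE']
```

To use it against a live pod, pass in the text returned by the exporter's metrics endpoint instead of `SAMPLE`. Note that this only checks presence; a metric that is exported with a value of 0 still counts as present.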

glowkey (Collaborator) commented Sep 20, 2022

There was a regression in DCGM 2.4.5 that caused the framebuffer metrics to be 0. This was fixed in 2.4.6.
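
For anyone auditing which image tags are affected, the check can be sketched as below. This assumes the exporter tag has the shape `<DCGM version>-<exporter version>-<OS>`, which matches the tags quoted in this thread but is an assumption here, and it flags only DCGM 2.4.5 because that is the only version named as affected. The helper names `parse_tag` and `has_fb_regression` are hypothetical.

```python
def parse_tag(tag):
    """Split an exporter image tag into (dcgm_version, exporter_version, os).

    Assumes the form '<dcgm>-<exporter>-<os>', e.g. '2.4.5-2.6.7-ubuntu20.04'.
    """
    dcgm, exporter, os_name = tag.split("-", 2)
    # Turn '2.4.5' into (2, 4, 5) so versions compare numerically.
    dcgm_version = tuple(int(part) for part in dcgm.split("."))
    return dcgm_version, exporter, os_name


def has_fb_regression(tag):
    """True if the tag bundles DCGM 2.4.5, the version reported as affected."""
    dcgm_version, _, _ = parse_tag(tag)
    return dcgm_version == (2, 4, 5)


for tag in ("2.4.5-2.6.7-ubuntu20.04", "2.4.6-2.6.8-ubuntu20.04"):
    print(tag, has_fb_regression(tag))
```

The exact-match comparison is deliberate: nothing here says whether versions before 2.4.5 were affected, so the sketch does not guess at a range.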

faiq commented Sep 23, 2022

@glowkey I'm also seeing missing metrics, and changing DCGM to an older version was how I got them to show up.

@omer-dayan I was able to get those metrics with the following cluster policy modifications for the operator:

  values.yaml: |
    nfd:
      enabled: false
    driver:
      enabled: false
    toolkit:
      version: v1.11.0-ubuntu20.04
    gfd:
      enabled: true
    dcgm:
      enabled: true
      version: 2.3.6-1-ubuntu20.04
    dcgmExporter:
      enabled: true
      version: 2.4.6-2.6.10-ubuntu20.04
    validator:
      image: mesosphere/gpu-operator-validator
      repository: docker.io
      version: v1.11.0
