Are custom DCGM Exporter Metrics supported anymore? #22

Open
weakcamel opened this issue Feb 15, 2024 · 6 comments

@weakcamel

weakcamel commented Feb 15, 2024

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
  • Kernel Version: n/a
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K3s
  • GPU Operator Version: 23.6.0

2. Issue or feature description

This is potentially a documentation issue. If the feature is in fact no longer supported, then its deprecation is also missing from the changelogs.

The 23.3.2 version of GPU Operator supported customization of the DCGM Exporter config via a config map:

The docs for 23.5.0 and later, however, are missing this section entirely, e.g.:

Does this mean that it is no longer supported? Or is it still allowed and was just missed while reorganizing the docs?

3. Steps to reproduce the issue

See the documentation links above.

4. Information to attach (optional if deemed irrelevant)

n/a

@weakcamel
Author

Note: I found the old (now missing) documentation section in this issue: NVIDIA/gpu-operator#648

@mikemckiernan
Member

@weakcamel, I'm the guilty party for reorganizing the docs. I believe that section from 23.3.2 is still supported; none of the engineers said it wasn't.

I'll work to confirm that it still applies. When I restore the content, I'll locate it somewhere in https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/index.html. I apologize for the confusion. Your report suggests that you might be sympathetic to the idea that customizing the metrics might not be a "getting started" task.

I'm OK to work with this issue from here, but won't object if someone scoots by and moves it to github.com/nvidia/cloud-native-docs.

@weakcamel
Author

weakcamel commented Feb 17, 2024

@mikemckiernan No worries at all!

It's really good news that this wasn't a deprecated feature and just a side effect of housekeeping :)

As for the location, I agree it's not necessarily "Getting Started", but I would think it should probably be a part of gpu-operator (not gpu-telemetry)? I personally would never have made the connection to look in gpu-telemetry. Also, setting up the exporter as part of gpu-operator is significantly different from running it on its own.

Maybe a subsection under Advanced Operator Configuration? There doesn't seem to be any part related to metrics in the current docs at all.


@cdesiniotis
Contributor

@weakcamel since this is a documentation issue, would you mind moving this issue to github.com/nvidia/cloud-native-docs?

@elezar elezar transferred this issue from NVIDIA/gpu-operator Feb 28, 2024
@elezar
Member

elezar commented Feb 28, 2024

I have transferred the issue.

@weakcamel
Author

Thanks for moving the ticket - yes, it's perfectly fine.

On a related note, one thing got me thinking: the old documentation snippet explains which options to override (e.g. dcgmExporter.config.name), yet those options aren't actually documented in the Helm chart docs or in the values file itself. Shouldn't they be?
