Does this GPU sharing plugin support monitoring with dcgm-exporter? #211

Open
db-root opened this issue Jun 26, 2023 · 4 comments

Comments

db-root commented Jun 26, 2023

Kubernetes version: v1.23.16

nvidia-docker info

Client: Docker Engine - Community
 Version: 24.0.2
 Context: default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
   Version: v0.10.5
   Path: /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
   Version: v2.18.1
   Path: /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 110
  Running: 56
  Paused: 0
  Stopped: 54
 Images: 40
 Server Version: 20.10.24
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: nvidia
 Init Binary: docker-init
 containerd version: 3dce8...
 runc version: v1.1.7-0-g860f061
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-150-generic
 Operating System: Ubuntu 20.04.6 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 48
 Total Memory: 125.6GiB
 Name: node01
 ID:
 Docker Root Dir: /app/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

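I deployed gpushare on the cluster for GPU sharing, and I use dcgm-exporter for monitoring: https://github.com/NVIDIA/dcgm-exporter
However, Prometheus shows no values for the GPU utilization metrics, and I cannot monitor GPU resource utilization per pod.
Has anyone used this setup? Any help would be appreciated.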
image
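
One way to narrow this down is to check whether the exporter itself emits the utilization metric and whether its samples carry a pod label, before looking at the Prometheus side. The following is only a minimal sketch: it parses text in the Prometheus exposition format and counts samples of `DCGM_FI_DEV_GPU_UTIL`. The `SAMPLE` text is made up for illustration and is not output from this cluster, and the label name `pod` is an assumption that may differ between dcgm-exporter versions.

```python
import re

# One sample line of the Prometheus text format: name{labels} value
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Illustrative text only (hypothetical values, not taken from the issue).
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node01",pod="",namespace="",container=""} 37
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node01"} 54
'''


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m is None:
            continue  # ignore lines that do not look like samples
        name, raw_labels, value = m.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def summarize(text, metric="DCGM_FI_DEV_GPU_UTIL"):
    """Count samples of `metric`, and how many have a non-empty pod label.

    The label name "pod" is an assumption; adjust it to what the exporter emits.
    """
    rows = [s for s in parse_samples(text) if s[0] == metric]
    with_pod = [s for s in rows if s[1].get("pod")]
    return {"series": len(rows), "with_pod_label": len(with_pod)}


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

To run it against a real node, the text could be fetched with `urllib.request.urlopen` from the exporter's `/metrics` endpoint (port 9400 by default) and passed to `summarize`. If `series` is zero, the metric is already missing at the exporter. If it is non-zero but `with_pod_label` is zero, the exporter sees the GPU but is not attributing it to pods.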

binz123 commented Jul 14, 2023

Same question, +1.

fenwuyaoji commented Jul 14, 2023 via email

@ZhangSetSail

Same question, +1.

@ZhangSetSail

Too few metrics are being collected at the moment. How can I get the temperature and power consumption metrics?
image
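
For the temperature and power question, a first step could be to check whether those fields are exported at all. The sketch below assumes the relevant names are `DCGM_FI_DEV_GPU_TEMP` and `DCGM_FI_DEV_POWER_USAGE`, as in dcgm-exporter's default counters list, and reports which of them are absent from a blob of `/metrics` text. The helper names are hypothetical.

```python
import re

# Assumed field names for GPU temperature and power draw.
WANTED = ("DCGM_FI_DEV_GPU_TEMP", "DCGM_FI_DEV_POWER_USAGE")


def exported_names(text):
    """Collect the distinct metric names found in Prometheus exposition text."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = re.match(r"[A-Za-z_:][A-Za-z0-9_:]*", line)
        if m:
            names.add(m.group(0))
    return names


def missing_metrics(text, wanted=WANTED):
    """Return the wanted metric names that do not appear in `text`."""
    present = exported_names(text)
    return [name for name in wanted if name not in present]
```

If a name comes back as missing, the counters CSV file that the exporter loads is the likely place to enable it. If nothing is missing, the gap is more likely in the Prometheus scrape configuration or in the dashboard queries.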
