The GPU capacity of the host and kubectl describe node do not match #344

Open

Yancey1989 opened this issue Aug 23, 2017 · 5 comments

Yancey1989 (Collaborator) commented Aug 23, 2017

  • kubectl describe node
Capacity:
 alpha.kubernetes.io/nvidia-gpu:	8
 cpu:					24
 memory:				264042760Ki
 pods:					110
Allocatable:
 alpha.kubernetes.io/nvidia-gpu:	8
 cpu:					24
 memory:				263940360Ki
 pods:					110
  • Host info
$ nvidia-smi
Wed Aug 23 23:26:17 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 0000:04:00.0     Off |                    0 |
| N/A   34C    P0    53W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P40           Off  | 0000:05:00.0     Off |                    0 |
| N/A   29C    P0    50W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P40           Off  | 0000:06:00.0     Off |                    0 |
| N/A   33C    P0    51W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P40           Off  | 0000:07:00.0     Off |                    0 |
| N/A   35C    P0    51W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P40           Off  | 0000:0B:00.0     Off |                    0 |
| N/A   33C    P0    51W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P40           Off  | 0000:0C:00.0     Off |                    0 |
| N/A   32C    P0    50W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P40           Off  | 0000:0E:00.0     Off |                    0 |
| N/A   29C    P0    49W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
  • Device info
$ ls /dev/nvidia*
/dev/nvidia0  /dev/nvidia1  /dev/nvidia2  /dev/nvidia3  /dev/nvidia4  /dev/nvidia5  /dev/nvidia6  /dev/nvidia7  /dev/nvidiactl  /dev/nvidia-uv
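
Note the mismatch: there are eight /dev/nvidia[0-9]* device files (nvidia0 through nvidia7), but nvidia-smi only enumerates seven GPUs (indices 0 to 6). As far as I understand, the alpha GPU support behind alpha.kubernetes.io/nvidia-gpu discovers GPUs by counting the /dev/nvidiaN device files rather than querying the driver, so a card that failed driver initialization can still be counted in Capacity. A quick way to compare the two counts on the host (a rough sketch using standard tools, not commands taken from this issue):

$ ls /dev/nvidia[0-9]* | wc -l   # device files a kubelet-style scan would count (8 here)
$ nvidia-smi -L | wc -l          # GPUs the driver actually initialized (7 here)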

pineking commented Aug 24, 2017 via email


pineking commented Aug 24, 2017 via email

Yancey1989 (Collaborator, Author) commented:

Thanks @pineking, dmesg does indeed contain initialization failure logs:

[1903878.128627] NVRM: RmInitAdapter failed! (0x26:0xffff:1096)
[1903878.128678] NVRM: rm_init_adapter failed for device bearing minor number 6

Have you run into a similar problem?
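
One way to cross-check which card dropped out is to match the failing minor number from dmesg against the GPUs the driver did bring up; since nvidia-smi renumbers the surviving cards 0 to 6, the PCI bus IDs are the more reliable identifier. A rough sketch (standard dmesg/nvidia-smi options, not commands from this thread):

$ dmesg | grep -i 'NVRM.*failed'                              # driver init failures like the one above
$ nvidia-smi --query-gpu=index,pci.bus_id,name --format=csv   # bus IDs of the GPUs that did come up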

pineking commented:

We hit this before; a reboot fixed it, and we haven't seen a card go missing recently.
We never dug into the root cause...
