nvidia.com/gpu: 0 #314

Open
mama0512 opened this issue May 21, 2024 · 2 comments

@mama0512

1. kubectl describe node 416a100

Name:               416a100
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=k3s
                    beta.kubernetes.io/os=linux
                    gpu=on
                    k3s.io/hostname=416a100
                    k3s.io/internal-ip=192.168.2.145
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=416a100
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=true
                    node.kubernetes.io/instance-type=k3s
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"a2:0a:a5:6d:d7:7e"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 192.168.2.145
                    hami.io/mutex.lock: 2024-05-13T13:04:17Z
                    hami.io/node-handshake: Requesting_2024.05.20 11:44:46
                    hami.io/node-nvidia-register:
                      GPU-b7c4eb59-dd76-ca5a-8482-56fd796b0a75,10,40960,100,NVIDIA-NVIDIA A100-PCIE-40GB,0,true:GPU-ec7d894f-bb24-dc73-1adb-17806ec68749,10,4096...
                    k3s.io/node-args: ["server","--docker"]
                    k3s.io/node-config-hash: 6DWNFWQMIJPJNOOSYKGNXGNN7DPG53Z77PAPIQ56XNVT2UPS3TFA====
                    k3s.io/node-env: {"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/cba07c8500bccabd42d9215a6af6b01181cb6ca5755d12ae1e4e02b27b50bafa"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 09 May 2024 15:48:32 +0800
Taints:
Unschedulable:      false
Lease:
  HolderIdentity:   416a100
  AcquireTime:
  RenewTime:        Tue, 21 May 2024 11:41:41 +0800
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                      Message
  ----                 ------  -----------------                 ------------------                ------                      -------
  NetworkUnavailable   False   Tue, 21 May 2024 00:17:41 +0800   Tue, 21 May 2024 00:17:41 +0800   FlannelIsUp                 Flannel is running on this node
  MemoryPressure       False   Tue, 21 May 2024 11:41:24 +0800   Thu, 09 May 2024 15:48:31 +0800   KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure         False   Tue, 21 May 2024 11:41:24 +0800   Thu, 09 May 2024 15:48:31 +0800   KubeletHasNoDiskPressure    kubelet has no disk pressure
  PIDPressure          False   Tue, 21 May 2024 11:41:24 +0800   Thu, 09 May 2024 15:48:31 +0800   KubeletHasSufficientPID     kubelet has sufficient PID available
  Ready                True    Tue, 21 May 2024 11:41:24 +0800   Tue, 21 May 2024 00:17:52 +0800   KubeletReady                kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  192.168.2.145
  Hostname:    416a100
Capacity:
  cpu:                80
  ephemeral-storage:  459819088Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263739228Ki
  nvidia.com/gpu:     0
  pods:               110
Allocatable:
  cpu:                80
  ephemeral-storage:  447312008456
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263739228Ki
  nvidia.com/gpu:     0
  pods:               110
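
Note: the hami.io/node-nvidia-register annotation above shows that HAMi has already discovered all four GPUs, yet Capacity and Allocatable still report nvidia.com/gpu: 0. That gap usually means the device plugin never completed its registration with the kubelet. A minimal way to cross-check (a sketch, not part of the original report; the pod name is taken from step 4 below and will differ per install):

kubectl get node 416a100 -o jsonpath='{.status.allocatable}'
kubectl logs -n kube-system hami-device-plugin-nv5gs --all-containers | grep -iE 'register|error'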
2. nvidia-smi
Tue May 21 11:44:30 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB Off | 00000000:36:00.0 Off | 0 |
| N/A 42C P0 46W / 250W | 13MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-PCIE-40GB Off | 00000000:37:00.0 Off | 0 |
| N/A 42C P0 45W / 250W | 13MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA RTX A6000 Off | 00000000:9D:00.0 Off | Off |
| 30% 37C P8 22W / 300W | 14MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA RTX A6000 Off | 00000000:9E:00.0 Off | Off |
| 30% 36C P8 28W / 300W | 14MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2480 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2480 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 2480 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 2480 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+
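
Step 2 shows the driver itself is healthy on the host. A follow-up check worth running (a sketch, not from the original report) is whether the NVIDIA runtime is registered with Docker at all, since this k3s node delegates containers to Docker:

docker info | grep -i runtime

Expect nvidia to appear in the Runtimes list, and ideally as the Default Runtime (see the note after step 3).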

3. sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Tue May 21 03:44:52 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB Off | 00000000:36:00.0 Off | 0 |
| N/A 42C P0 46W / 250W | 13MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-PCIE-40GB Off | 00000000:37:00.0 Off | 0 |
| N/A 42C P0 45W / 250W | 13MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA RTX A6000 Off | 00000000:9D:00.0 Off | Off |
| 30% 37C P8 22W / 300W | 14MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA RTX A6000 Off | 00000000:9E:00.0 Off | Off |
| 30% 35C P8 27W / 300W | 14MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
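
The docker test in step 3 succeeds only because --gpus all is passed explicitly. Pods launched by the kubelet, including the device-plugin pod itself, get no such flag, so with k3s started as ["server","--docker"] (see the k3s.io/node-args annotation in step 1) the plugin can only see the GPUs if nvidia is Docker's default runtime. A sketch of the expected /etc/docker/daemon.json (the runtime path is the standard nvidia-container-toolkit install location; verify it on your system):

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

After editing, restart Docker (sudo systemctl restart docker) and delete the device-plugin pods so they are recreated under the new runtime.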

4. HAMi pods:
kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
helm-install-traefik-nzsh4 0/1 Completed 0 11d
svclb-traefik-cpwxf 2/2 Running 40 11d
metrics-server-7b4f8b595-5kn69 1/1 Running 21 11d
local-path-provisioner-64d457c485-nccpm 1/1 Running 20 11d
coredns-5d69dc75db-q7rxn 1/1 Running 20 11d
traefik-5dd496474-rxmr2 1/1 Running 20 11d
nvidia-device-plugin-daemonset-jg762 1/1 Running 0 5m50s
hami-device-plugin-nv5gs 2/2 Running 0 4m43s
hami-scheduler-757847d79f-n7dbf 2/2 Running 0 4m43s
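
One more thing visible in this listing (an observation, not from the original report): both nvidia-device-plugin-daemonset and hami-device-plugin are running at the same time. HAMi's device plugin registers nvidia.com/gpu itself, and the kubelet lets the most recent registration for a resource name replace the previous one, so two plugins claiming nvidia.com/gpu can overwrite each other. A sketch of checks, assuming the default kubelet root directory:

ls -l /var/lib/kubelet/device-plugins/
kubectl -n kube-system delete daemonset nvidia-device-plugin-daemonset

The first command lists the registration sockets each plugin created on the node; the second removes the stock NVIDIA plugin if HAMi is meant to own nvidia.com/gpu, which is what the HAMi install guide asks for, as I understand it.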

@lengrongfu
Member

What problem is this issue about?

@wawa0210
Member

wawa0210 commented Jul 9, 2024

@mama0512 Can you provide more information on this issue?
