nvidia.com/gpu: 0 #314

Open
mama0512 opened this issue May 21, 2024 · 2 comments

@mama0512

1. kubectl describe node 416a100

Name:               416a100
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=k3s
                    beta.kubernetes.io/os=linux
                    gpu=on
                    k3s.io/hostname=416a100
                    k3s.io/internal-ip=192.168.2.145
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=416a100
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=true
                    node.kubernetes.io/instance-type=k3s
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"a2:0a:a5:6d:d7:7e"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 192.168.2.145
                    hami.io/mutex.lock: 2024-05-13T13:04:17Z
                    hami.io/node-handshake: Requesting_2024.05.20 11:44:46
                    hami.io/node-nvidia-register:
                      GPU-b7c4eb59-dd76-ca5a-8482-56fd796b0a75,10,40960,100,NVIDIA-NVIDIA A100-PCIE-40GB,0,true:GPU-ec7d894f-bb24-dc73-1adb-17806ec68749,10,4096...
                    k3s.io/node-args: ["server","--docker"]
                    k3s.io/node-config-hash: 6DWNFWQMIJPJNOOSYKGNXGNN7DPG53Z77PAPIQ56XNVT2UPS3TFA====
                    k3s.io/node-env: {"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/cba07c8500bccabd42d9215a6af6b01181cb6ca5755d12ae1e4e02b27b50bafa"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 09 May 2024 15:48:32 +0800
Taints:
Unschedulable:      false
Lease:
  HolderIdentity:   416a100
  AcquireTime:
  RenewTime:        Tue, 21 May 2024 11:41:41 +0800
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                      Message
  ----                 ------  -----------------                 ------------------                ------                      -------
  NetworkUnavailable   False   Tue, 21 May 2024 00:17:41 +0800   Tue, 21 May 2024 00:17:41 +0800   FlannelIsUp                 Flannel is running on this node
  MemoryPressure       False   Tue, 21 May 2024 11:41:24 +0800   Thu, 09 May 2024 15:48:31 +0800   KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure         False   Tue, 21 May 2024 11:41:24 +0800   Thu, 09 May 2024 15:48:31 +0800   KubeletHasNoDiskPressure    kubelet has no disk pressure
  PIDPressure          False   Tue, 21 May 2024 11:41:24 +0800   Thu, 09 May 2024 15:48:31 +0800   KubeletHasSufficientPID     kubelet has sufficient PID available
  Ready                True    Tue, 21 May 2024 11:41:24 +0800   Tue, 21 May 2024 00:17:52 +0800   KubeletReady                kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  192.168.2.145
  Hostname:    416a100
Capacity:
  cpu:                80
  ephemeral-storage:  459819088Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263739228Ki
  nvidia.com/gpu:     0
  pods:               110
Allocatable:
  cpu:                80
  ephemeral-storage:  447312008456
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263739228Ki
  nvidia.com/gpu:     0
  pods:               110
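
Note: the hami.io/node-nvidia-register annotation above shows that HAMi has already discovered all four GPUs, yet Capacity and Allocatable still report nvidia.com/gpu: 0. That gap usually means the device plugin never completed its registration with the kubelet. A minimal way to cross-check (a sketch, not part of the original report; the pod name is taken from step 4 below and will differ per install):

kubectl get node 416a100 -o jsonpath='{.status.allocatable}'
kubectl logs -n kube-system hami-device-plugin-nv5gs --all-containers | grep -iE 'register|error'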
2. nvidia-smi
Tue May 21 11:44:30 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB Off | 00000000:36:00.0 Off | 0 |
| N/A 42C P0 46W / 250W | 13MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-PCIE-40GB Off | 00000000:37:00.0 Off | 0 |
| N/A 42C P0 45W / 250W | 13MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA RTX A6000 Off | 00000000:9D:00.0 Off | Off |
| 30% 37C P8 22W / 300W | 14MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA RTX A6000 Off | 00000000:9E:00.0 Off | Off |
| 30% 36C P8 28W / 300W | 14MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2480 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2480 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 2480 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 2480 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+
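
Step 2 shows the driver itself is healthy on the host. A follow-up check worth running (a sketch, not from the original report) is whether the NVIDIA runtime is registered with Docker at all, since this k3s node delegates containers to Docker:

docker info | grep -i runtime

Expect nvidia to appear in the Runtimes list, and ideally as the Default Runtime (see the note after step 3).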

3. sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Tue May 21 03:44:52 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB Off | 00000000:36:00.0 Off | 0 |
| N/A 42C P0 46W / 250W | 13MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-PCIE-40GB Off | 00000000:37:00.0 Off | 0 |
| N/A 42C P0 45W / 250W | 13MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA RTX A6000 Off | 00000000:9D:00.0 Off | Off |
| 30% 37C P8 22W / 300W | 14MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA RTX A6000 Off | 00000000:9E:00.0 Off | Off |
| 30% 35C P8 27W / 300W | 14MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
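
The docker test in step 3 succeeds only because --gpus all is passed explicitly. Pods launched by the kubelet, including the device-plugin pod itself, get no such flag, so with k3s started as ["server","--docker"] (see the k3s.io/node-args annotation in step 1) the plugin can only see the GPUs if nvidia is Docker's default runtime. A sketch of the expected /etc/docker/daemon.json (the runtime path is the standard nvidia-container-toolkit install location; verify it on your system):

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

After editing, restart Docker (sudo systemctl restart docker) and delete the device-plugin pods so they are recreated under the new runtime.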

4. HAMi pods:
kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
helm-install-traefik-nzsh4 0/1 Completed 0 11d
svclb-traefik-cpwxf 2/2 Running 40 11d
metrics-server-7b4f8b595-5kn69 1/1 Running 21 11d
local-path-provisioner-64d457c485-nccpm 1/1 Running 20 11d
coredns-5d69dc75db-q7rxn 1/1 Running 20 11d
traefik-5dd496474-rxmr2 1/1 Running 20 11d
nvidia-device-plugin-daemonset-jg762 1/1 Running 0 5m50s
hami-device-plugin-nv5gs 2/2 Running 0 4m43s
hami-scheduler-757847d79f-n7dbf 2/2 Running 0 4m43s
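
One more thing visible in this listing (an observation, not from the original report): both nvidia-device-plugin-daemonset and hami-device-plugin are running at the same time. HAMi's device plugin registers nvidia.com/gpu itself, and the kubelet lets the most recent registration for a resource name replace the previous one, so two plugins claiming nvidia.com/gpu can overwrite each other. A sketch of checks, assuming the default kubelet root directory:

ls -l /var/lib/kubelet/device-plugins/
kubectl -n kube-system delete daemonset nvidia-device-plugin-daemonset

The first command lists the registration sockets each plugin created on the node; the second removes the stock NVIDIA plugin if HAMi is meant to own nvidia.com/gpu, which is what the HAMi install guide asks for, as I understand it.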

@lengrongfu
Member

What problem is this issue about?

@wawa0210
Member

wawa0210 commented Jul 9, 2024

@mama0512 Can you provide more information on this issue?
