Closed
Description
I am following the EGX Stack v3.1 for AWS - Install Guide for Ubuntu Server x86-64 to the letter and am running into an issue when validating the installation with nvidia-smi.
I'm running on a g4dn.2xlarge, Ubuntu 20.04.
In summary, the nvidia-smi pod stays in Pending indefinitely. The output of kubectl describe pod nvidia-smi is:
Name: nvidia-smi
Namespace: default
Priority: 0
Node: <none>
Labels: run=nvidia-smi
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Containers:
nvidia-smi:
Image: nvidia/cuda:11.1.1-base
Port: <none>
Host Port: <none>
Args:
nvidia-smi
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-v5g2r (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-v5g2r:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-v5g2r
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 35s (x103 over 151m) default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
I think the last line is the most relevant: the scheduler reports that the only node has insufficient nvidia.com/gpu.
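For anyone triaging a similar report, the FailedScheduling message can be split into its parts with a small helper. This is a minimal sketch and not part of the EGX guide: the function name `parse_failed_scheduling` is hypothetical, and it assumes the message has the `A/T nodes are available: N reason, N reason.` shape seen in the event above.

```python
import re


def parse_failed_scheduling(message: str) -> dict:
    """Split a scheduler message into node counts and per-reason counts.

    Assumes the shape "A/T nodes are available: N reason, N reason."
    (an assumption based on the event shown in this issue).
    """
    match = re.match(r"(\d+)/(\d+) nodes are available: (.*)", message.strip())
    if match is None:
        raise ValueError(f"unrecognised scheduler message: {message!r}")
    available, total, detail = match.groups()

    reasons = {}
    last = None
    for part in detail.rstrip(".").split(", "):
        count, _, reason = part.partition(" ")
        if count.isdigit():
            reasons[reason] = int(count)
            last = reason
        elif last is not None:
            # The previous reason itself contained a comma; glue it back together.
            merged = f"{last}, {part}"
            reasons[merged] = reasons.pop(last)
            last = merged
    return {"available": int(available), "total": int(total), "reasons": reasons}


# The message copied from the Events section above.
MESSAGE = "0/1 nodes are available: 1 Insufficient nvidia.com/gpu."
print(parse_failed_scheduling(MESSAGE))
```

Here the single node is rejected for one reason only, which points at the node's advertised resources rather than at taints or selectors.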
Describe Node
The output of kubectl describe node is:
Name: ip-172-31-21-241
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
feature.node.kubernetes.io/cpu-cpuid.ADX=true
feature.node.kubernetes.io/cpu-cpuid.AESNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX=true
feature.node.kubernetes.io/cpu-cpuid.AVX2=true
feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
feature.node.kubernetes.io/cpu-cpuid.FMA3=true
feature.node.kubernetes.io/cpu-cpuid.MPX=true
feature.node.kubernetes.io/cpu-hardware_multithreading=true
feature.node.kubernetes.io/kernel-config.NO_HZ=true
feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
feature.node.kubernetes.io/kernel-version.full=5.11.0-1020-aws
feature.node.kubernetes.io/kernel-version.major=5
feature.node.kubernetes.io/kernel-version.minor=11
feature.node.kubernetes.io/kernel-version.revision=0
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-1d0f.present=true
feature.node.kubernetes.io/storage-nonrotationaldisk=true
feature.node.kubernetes.io/system-os_release.ID=ubuntu
feature.node.kubernetes.io/system-os_release.VERSION_ID=20.04
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=20
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-172-31-21-241
kubernetes.io/os=linux
node-role.kubernetes.io/master=
nvidia.com/gpu.present=true
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
nfd.node.kubernetes.io/extended-resources:
nfd.node.kubernetes.io/feature-labels:
cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BW,cpu-cpuid.AVX512CD,cpu-cpuid.AVX512DQ,cpu-cpuid.AVX512F,cpu-...
nfd.node.kubernetes.io/master.version: v0.6.0
nfd.node.kubernetes.io/worker.version: v0.6.0
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 172.31.21.241/20
projectcalico.org/IPv4IPIPTunnelAddr: 192.168.198.128
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 09 Dec 2021 19:00:28 +0000
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: ip-172-31-21-241
AcquireTime: <unset>
RenewTime: Thu, 09 Dec 2021 22:04:22 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Thu, 09 Dec 2021 19:05:06 +0000 Thu, 09 Dec 2021 19:05:06 +0000 CalicoIsUp Calico is running on this node
MemoryPressure False Thu, 09 Dec 2021 22:01:00 +0000 Thu, 09 Dec 2021 19:00:27 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 09 Dec 2021 22:01:00 +0000 Thu, 09 Dec 2021 19:00:27 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 09 Dec 2021 22:01:00 +0000 Thu, 09 Dec 2021 19:00:27 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 09 Dec 2021 22:01:00 +0000 Thu, 09 Dec 2021 19:05:00 +0000 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 172.31.21.241
Hostname: ip-172-31-21-241
Capacity:
cpu: 8
ephemeral-storage: 64989720Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32407904Ki
pods: 110
Allocatable:
cpu: 8
ephemeral-storage: 59894525853
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32305504Ki
pods: 110
System Info:
Machine ID: ec218707683e25881184af68445ebd87
System UUID: ec218707-683e-2588-1184-af68445ebd87
Boot ID: d59c4692-4971-43f2-a237-2d2fe49434cc
Kernel Version: 5.11.0-1020-aws
OS Image: Ubuntu 20.04.3 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.13
Kubelet Version: v1.18.14
Kube-Proxy Version: v1.18.14
PodCIDR: 192.168.0.0/24
PodCIDRs: 192.168.0.0/24
Non-terminated Pods: (14 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
default gpu-operator-1639077567-node-feature-discovery-master-84485st4x 0 (0%) 0 (0%) 0 (0%) 0 (0%) 164m
default gpu-operator-1639077567-node-feature-discovery-worker-ffljl 0 (0%) 0 (0%) 0 (0%) 0 (0%) 164m
default gpu-operator-76fb8d5c55-rq7jj 0 (0%) 0 (0%) 0 (0%) 0 (0%) 164m
gpu-operator-resources nvidia-container-toolkit-daemonset-g4rzv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 164m
gpu-operator-resources nvidia-driver-daemonset-gsl5l 0 (0%) 0 (0%) 0 (0%) 0 (0%) 164m
kube-system calico-kube-controllers-7f94cf5997-zr46g 0 (0%) 0 (0%) 0 (0%) 0 (0%) 179m
kube-system calico-node-8w9jt 250m (3%) 0 (0%) 0 (0%) 0 (0%) 179m
kube-system coredns-66bff467f8-hnbb6 100m (1%) 0 (0%) 70Mi (0%) 170Mi (0%) 3h3m
kube-system coredns-66bff467f8-msqdc 100m (1%) 0 (0%) 70Mi (0%) 170Mi (0%) 3h3m
kube-system etcd-ip-172-31-21-241 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3h3m
kube-system kube-apiserver-ip-172-31-21-241 250m (3%) 0 (0%) 0 (0%) 0 (0%) 3h3m
kube-system kube-controller-manager-ip-172-31-21-241 200m (2%) 0 (0%) 0 (0%) 0 (0%) 3h3m
kube-system kube-proxy-mssxq 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3h3m
kube-system kube-scheduler-ip-172-31-21-241 100m (1%) 0 (0%) 0 (0%) 0 (0%) 3h3m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1 (12%) 0 (0%)
memory 140Mi (0%) 340Mi (1%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
I notice that nvidia.com/gpu is not listed under Capacity, Allocatable, or Allocated resources.
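The same check can be scripted against saved `kubectl describe node` text. The sketch below is illustrative only: the helper names `read_section` and `advertises_gpu` are hypothetical, and it assumes each resource appears as a `name: value` line directly under the `Capacity:` and `Allocatable:` headers, as in the output above.

```python
import re

# A resource entry looks like "cpu: 8" or "nvidia.com/gpu: 1": a key without
# spaces, a colon, then a single value. Section headers do not match this.
ENTRY = re.compile(r"^\s*([\w./-]+):\s+(\S+)\s*$")


def read_section(describe_output: str, header: str) -> dict:
    """Collect the name/value lines that directly follow a section header."""
    resources = {}
    in_section = False
    for line in describe_output.splitlines():
        if not in_section:
            in_section = line.strip() == header
            continue
        entry = ENTRY.match(line)
        if entry is None:
            break  # the next header (or any other line) ends the section
        resources[entry.group(1)] = entry.group(2)
    return resources


def advertises_gpu(describe_output: str, resource: str = "nvidia.com/gpu") -> bool:
    """True only if the resource is listed under both Capacity and Allocatable."""
    return all(
        resource in read_section(describe_output, header)
        for header in ("Capacity:", "Allocatable:")
    )


# Excerpt copied from the node description in this issue.
NODE_EXCERPT = """\
Capacity:
cpu: 8
ephemeral-storage: 64989720Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32407904Ki
pods: 110
Allocatable:
cpu: 8
ephemeral-storage: 59894525853
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32305504Ki
pods: 110
System Info:
"""

print(advertises_gpu(NODE_EXCERPT))  # False: the node exposes no nvidia.com/gpu
```

A node that passes this check would list an `nvidia.com/gpu` line in both sections. Since that line is absent here, the scheduler has nothing to allocate.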
Thanks for your support.