Cannot validate install; GPU not available #295

@joshwyatt

I am following the instructions in EGX Stack v3.1 for AWS - Install Guide for Ubuntu Server x86-64 to the letter, and I am running into an issue when validating the installation with nvidia-smi.

I'm running on a g4dn.2xlarge, Ubuntu 20.04.

In summary, the nvidia-smi pod stays in Pending indefinitely. The output of kubectl describe pod nvidia-smi is:

Name:         nvidia-smi
Namespace:    default
Priority:     0
Node:         <none>
Labels:       run=nvidia-smi
Annotations:  <none>
Status:       Pending
IP:           
IPs:          <none>
Containers:
  nvidia-smi:
    Image:      nvidia/cuda:11.1.1-base
    Port:       <none>
    Host Port:  <none>
    Args:
      nvidia-smi
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-v5g2r (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  default-token-v5g2r:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-v5g2r
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  35s (x103 over 151m)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.

I think the last line is the most relevant: the scheduler reports Insufficient nvidia.com/gpu on the only node.

Describe Node

The output of kubectl describe node is:

Name:               ip-172-31-21-241
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.MPX=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
                    feature.node.kubernetes.io/kernel-version.full=5.11.0-1020-aws
                    feature.node.kubernetes.io/kernel-version.major=5
                    feature.node.kubernetes.io/kernel-version.minor=11
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-1d0f.present=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=ubuntu
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=20.04
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=20
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-172-31-21-241
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
                    nvidia.com/gpu.present=true
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    nfd.node.kubernetes.io/extended-resources: 
                    nfd.node.kubernetes.io/feature-labels:
                      cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BW,cpu-cpuid.AVX512CD,cpu-cpuid.AVX512DQ,cpu-cpuid.AVX512F,cpu-...
                    nfd.node.kubernetes.io/master.version: v0.6.0
                    nfd.node.kubernetes.io/worker.version: v0.6.0
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 172.31.21.241/20
                    projectcalico.org/IPv4IPIPTunnelAddr: 192.168.198.128
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 09 Dec 2021 19:00:28 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-172-31-21-241
  AcquireTime:     <unset>
  RenewTime:       Thu, 09 Dec 2021 22:04:22 +0000
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Thu, 09 Dec 2021 19:05:06 +0000   Thu, 09 Dec 2021 19:05:06 +0000   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Thu, 09 Dec 2021 22:01:00 +0000   Thu, 09 Dec 2021 19:00:27 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Thu, 09 Dec 2021 22:01:00 +0000   Thu, 09 Dec 2021 19:00:27 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Thu, 09 Dec 2021 22:01:00 +0000   Thu, 09 Dec 2021 19:00:27 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Thu, 09 Dec 2021 22:01:00 +0000   Thu, 09 Dec 2021 19:05:00 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  172.31.21.241
  Hostname:    ip-172-31-21-241
Capacity:
  cpu:                8
  ephemeral-storage:  64989720Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32407904Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  59894525853
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32305504Ki
  pods:               110
System Info:
  Machine ID:                 ec218707683e25881184af68445ebd87
  System UUID:                ec218707-683e-2588-1184-af68445ebd87
  Boot ID:                    d59c4692-4971-43f2-a237-2d2fe49434cc
  Kernel Version:             5.11.0-1020-aws
  OS Image:                   Ubuntu 20.04.3 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.13
  Kubelet Version:            v1.18.14
  Kube-Proxy Version:         v1.18.14
PodCIDR:                      192.168.0.0/24
PodCIDRs:                     192.168.0.0/24
Non-terminated Pods:          (14 in total)
  Namespace                   Name                                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                                               ------------  ----------  ---------------  -------------  ---
  default                     gpu-operator-1639077567-node-feature-discovery-master-84485st4x    0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  default                     gpu-operator-1639077567-node-feature-discovery-worker-ffljl        0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  default                     gpu-operator-76fb8d5c55-rq7jj                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  gpu-operator-resources      nvidia-container-toolkit-daemonset-g4rzv                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  gpu-operator-resources      nvidia-driver-daemonset-gsl5l                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  kube-system                 calico-kube-controllers-7f94cf5997-zr46g                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         179m
  kube-system                 calico-node-8w9jt                                                  250m (3%)     0 (0%)      0 (0%)           0 (0%)         179m
  kube-system                 coredns-66bff467f8-hnbb6                                           100m (1%)     0 (0%)      70Mi (0%)        170Mi (0%)     3h3m
  kube-system                 coredns-66bff467f8-msqdc                                           100m (1%)     0 (0%)      70Mi (0%)        170Mi (0%)     3h3m
  kube-system                 etcd-ip-172-31-21-241                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h3m
  kube-system                 kube-apiserver-ip-172-31-21-241                                    250m (3%)     0 (0%)      0 (0%)           0 (0%)         3h3m
  kube-system                 kube-controller-manager-ip-172-31-21-241                           200m (2%)     0 (0%)      0 (0%)           0 (0%)         3h3m
  kube-system                 kube-proxy-mssxq                                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h3m
  kube-system                 kube-scheduler-ip-172-31-21-241                                    100m (1%)     0 (0%)      0 (0%)           0 (0%)         3h3m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                1 (12%)     0 (0%)
  memory             140Mi (0%)  340Mi (1%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events: 

I notice that nvidia.com/gpu is not listed under Capacity, Allocatable, or Allocated resources.
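
For anyone reproducing this, here is a minimal sketch of how the same check could be scripted rather than read off the describe output. It assumes kubectl is on the PATH and configured for the cluster. The helper names gpu_counts and fetch_nodes are hypothetical, not part of any NVIDIA or Kubernetes tooling. It reads status.allocatable from kubectl get nodes -o json and treats a missing nvidia.com/gpu entry as zero.

```python
import json
import subprocess

# Extended resource name requested by the nvidia-smi pod in this issue.
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_counts(node_list):
    """Map node name -> allocatable GPU count.

    `node_list` is the parsed output of `kubectl get nodes -o json`.
    A node that does not advertise the resource is reported as 0.
    """
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        counts[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts


def fetch_nodes():
    """Hypothetical helper: shell out to kubectl and parse the node list."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling gpu_counts(fetch_nodes()) on the node shown above would be expected to report 0 for ip-172-31-21-241, since nvidia.com/gpu does not appear under Allocatable, which matches the scheduler's Insufficient nvidia.com/gpu message.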

Thanks for your support.
