Following the QuickStart but my pod is stuck in pending state #176

dwschulze · 2020-05-31T16:59:27Z

I've followed the QuickStart instructions down to the Docs section on my two nodes. I took the .yaml shown in the Running GPU Jobs section and ran it from the master with

kubectl apply -f nvidia.cuda.yaml

I had modified the .yaml to set nvidia.com/gpu: 1 because I only have one gpu on each of my nodes. However my pod stays in the pending state:

$ kubectl get pods
NAME      READY   STATUS    RESTARTS   AGE
gpu-pod   0/2     Pending   0          24h

I've verified that cuda binaries run on both nodes.

Are there other steps I need to take to get this plugin to execute gpu jobs? Is it necessary to execute any of the steps below the Docs section such as the With Docker Build section because Option 2 fails?

It's not clear what the steps below the Docs section are for.

I'm running on Ubuntu 18.04 with kubernetes 1.18.2.

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

The text was updated successfully, but these errors were encountered:

dwschulze · 2020-06-08T15:48:10Z

Running kubectl describe nodes shows that I have 0 gpus allocatable, which would explain why my pods don't get out of pending state. The plugin is not recognizing my gpus.

klueska · 2020-06-08T16:02:56Z

You seem to have followed all of the instructions correctly, and your driver seems to be working properly, since you can run nvidia-smi on the host.

Can you verify your nvidia-docker2 installation by running the following:

docker run nvidia/cuda nvidia-smi

dwschulze · 2020-06-08T16:28:15Z

Did you mean nvidia-docker run nvidia/cuda nvidia-smi ?

$ docker run nvidia/cuda nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown.

$ nvidia-docker run nvidia/cuda nvidia-smi
Mon Jun  8 16:24:29 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 440.59       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P6000        Off  | 00000000:09:00.0 Off |                  Off |
| 26%   18C    P8     9W / 250W |      0MiB / 24449MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

What about driver version, kubernetes version, and plugin version? I'm using the latest drivers (4.40) and kubernetes version 1.18.3, and whatever version of the plugin the create daemonset command installs. Other people have reported version compatibility problems.

klueska · 2020-06-08T16:47:07Z

No, I meant just:

docker run nvidia/cuda nvidia-smi

Assuming you followed the instructions in the QuickStart to make nvidia the default runtime for docker, then you shouldn't need to use the nvidia-docker wrapper script (docker alone will work). In fact it's required in order for Kubernetes support to work.

You will need to enable the nvidia runtime as your default runtime on your node. We will be editing the docker daemon config file which is usually present at /etc/docker/daemon.json:

{
    "default-runtime": "nvidia",  <---- This is the important line
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

dwschulze · 2020-06-08T17:27:25Z

The directory /etc/docker got overwritten when I was reinstalling things this morning. I made the changes you showed in /etc/docker/daemon.json, restarted kubelet on the nodes and created the daemonset again on the master. Running kubectl describe nodes still shows 0 for Capacity and Allocatable for gpus. I tried the docker command on the node again and this is what I get: $ docker run nvidia/cuda nvidia-smi docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown. ERRO[0000] error waiting for container: context canceled I'm running the latest versions of the driver and kubernetes. Is this maybe a versioning problem?

…

On Mon, Jun 8, 2020 at 10:47 AM Kevin Klues ***@***.***> wrote: No, I meant just: docker run nvidia/cuda nvidia-smi Assuming you followed the instructions in the QuickStart to make nvidia the default runtime for docker, then you shouldn't need to use the nvidia-docker wrapper script (docker alone will work). In fact it's required to work in order for Kubernetes support to work. You will need to enable the nvidia runtime as your default runtime on your node. We will be editing the docker daemon config file which is usually present at /etc/docker/daemon.json: { "default-runtime": "nvidia", <---- This is the important line "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] } } } — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#176 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAK46CTJQDD6ZZHHHEMOKTDRVUIZVANCNFSM4NPHQULA> .

klueska · 2020-06-08T17:33:43Z

Did you also restart docker after making that change to daemon.json?

klueska · 2020-06-08T17:35:36Z

Shouldn't be a versioning problem.

Until you get the following to work, nothing will work under Kubernetes:

$ docker run nvidia/cuda nvidia-smi
docker: Error response from daemon: OCI runtime create failed:
container_linux.go:349: starting container process caused "exec:
\"nvidia-smi\": executable file not found in $PATH": unknown.
ERRO[0000] error waiting for container: context canceled

dwschulze · 2020-06-08T17:45:03Z

Restarting docker allows the docker command to run, and kubectl describe nodes shows 1 gpu Allocatable. When I try to run the examples from this page: https://docs.nvidia.com/datacenter/kubernetes/kubernetes-upstream/index.html (have to clone the git repo and branch -- urls on that page are bad) all pods are in pending state. Nothing will run. This is where I was last week. What should I try next?

…

On Mon, Jun 8, 2020 at 11:34 AM Kevin Klues ***@***.***> wrote: Did you also restart docker after making that change to daemon.json? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#176 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAK46CX2NDKKEH2EXMYSBKDRVUOIXANCNFSM4NPHQULA> .

klueska · 2020-06-08T18:09:30Z

Which examples from there are you running?

Most of that site seems to be more about how to deploy Kubernetes in general, with only a very short section about the plugin with a single example:

kubectl create -f https://github.com/NVIDIA/k8s-device-plugin/blob/examples/workloads/deployment.yml

I only glanced at it briefly though, so I could have missed something if there were more.

dwschulze · 2020-06-08T20:25:43Z

That one fails to deploy. I’ve cloned the repo and checkout the examples branch and when I run deployment.yml from my file system it creates a deployment with 32 replicas that all stay in the pending state. I’m not sure what is supposed to happen when you request more replicas than you have available nodes, but I would expect one of those 32 pods to run. The application itself just sleeps for 100 seconds, and I assume that is to give you a chance to run kubectl exec -it gpu-pod nvidia-smi where you would use the pod name of a running pod from the replica set. For me all 32 just stay in the pending state. Same if I run the pod.yml example. That is why I wonder if I’ve got some kind of version mismatch. There do seem to be some (undocumented) version requirements between the driver, plugin, and Kubernetes. From: Kevin Klues <notifications@github.com> Sent: Monday, June 8, 2020 12:10 PM To: NVIDIA/k8s-device-plugin <k8s-device-plugin@noreply.github.com> Cc: dwschulze <dean.w.schulze@gmail.com>; Author <author@noreply.github.com> Subject: Re: [NVIDIA/k8s-device-plugin] Following the QuickStart but my pod is stuck in pending state (#176) Which examples from there are you running? Most of that site seems to be more about how to deploy Kubernetes in general, with only a very short section about the plugin with a single example: kubectl create -f https://github.com/NVIDIA/k8s-device-plugin/blob/examples/workloads/deployment.yml — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#176 (comment)> , or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAK46CUJ3XKFIUJJOSVSCRTRVUSOTANCNFSM4NPHQULA> . <https://github.com/notifications/beacon/AAK46CSLKN5MZ5K6MWJPXFDRVUSOTA5CNFSM4NPHQULKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEYY2J6Y.gif>

dwschulze · 2020-06-08T21:02:35Z

Two things I've noticed.

Creating/deleting the plugin daemonset creates / deletes the file /var/lib/kubelet/device-plugins/nvidia.sock on the nodes.

The nvidia-device-plugin.yml contains

      containers:
      - image: nvidia/k8s-device-plugin:1.0.0-beta6

There is no nvidia/k8s-device-plugin docker image on my master node. There is also no docker container running on my master node with nvid in the name. I haven't used a daemonset before. Should I be able to see a docker imager or container with that name?

klueska · 2020-06-09T16:11:03Z

There shouldn't be any plugins on your master -- it only runs on worker nodes.

dwschulze · 2020-06-09T18:03:14Z

There shouldn't be any plugins on your master -- it only runs on worker nodes.

Oh, they're on the nodes:

nvidia/k8s-device-plugin 1.0.0-beta6 c0fa7866a301 6 weeks ago 64.2MB

Also, which version of the CUDA developer libraries do I need to work with the source code?

klueska · 2020-06-10T13:15:07Z

I was just responding to your statement of:

There is no nvidia/k8s-device-plugin docker image on my master node.

Regarding:

Also, which version of the CUDA developer libraries do I need to work with the source code?

I'm not sure what you are asking here. What is "the source code"?

dwschulze · 2020-06-11T14:57:59Z

I got the source code by cloning this github page. The build instructions say: Without Docker Build $ C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build So they expect you to have cuda libraries installed, but they don’t say which version of cuda. From: Kevin Klues <notifications@github.com> Sent: Wednesday, June 10, 2020 7:15 AM To: NVIDIA/k8s-device-plugin <k8s-device-plugin@noreply.github.com> Cc: dwschulze <dean.w.schulze@gmail.com>; Author <author@noreply.github.com> Subject: Re: [NVIDIA/k8s-device-plugin] Following the QuickStart but my pod is stuck in pending state (#176) I was just responding to your statement of: There is no nvidia/k8s-device-plugin docker image on my master node. Regarding: Also, which version of the CUDA developer libraries do I need to work with the source code? I'm not sure what you are asking here. What is "the source code"? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#176 (comment)> , or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAK46CSAHH4NM2JKY5Y4YZDRV6BOVANCNFSM4NPHQULA> . <https://github.com/notifications/beacon/AAK46CRO4F5UW6TH2D5HW6TRV6BOVA5CNFSM4NPHQULKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEZCCNDQ.gif>

rlantz-cfa · 2020-07-05T11:50:09Z

Similar issue here: I'm using EKS (Kubernetes v 1.16.8), and it's creating the nodegroup with the correct AMI. That is, the Amazon EKS-optimized accelerated AMI as described here

The instructions there use the 1.0 beta version of your DS, but I've tried deploying it with both Helm 3, and kubectl ... v0.6.0 .... For whatever reason the scheduler is not recognizing the GPU on the host.

FWIW, I have to run sudo docker on the host to get the above command to work, but when I do (e.g. sudo docker run <image-tag> python /tmp/test.py) it works as expected. Maybe there's an issue with the AMI?

| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   32C    P0    25W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

tensor([[0.4713, 0.7497, 0.5766],
        [0.3508, 0.8708, 0.7834]], device='cuda:0')

As suggested in the AWS docs, when I run kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" I get <None> for GPU. Do I need something in my node groups to make sure the nvidia.com/gpu label is present?

If I try to run a pod (spec below) I get: Warning FailedScheduling 7s default-scheduler 0/3 nodes are available: 2 node(s) didn't match node selector, 3 Insufficient nvidia.com/gpu

The phrase Insufficient nvidia.com/gpu seems important here, but I'm not sure what combination of label or annotation to

kind: Pod
# some stuff redacted...
metadata:
  name: pytorch-test
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: <label-name>
            operator: In
            values: 
              - <affinity-tag>
  tolerations:
  - key: "nontraining"
    operator: "Exists"
    effect: "NoSchedule"
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
  restartPolicy: OnFailure
  containers:
  - name: pytorch-nvidia
    image: <image-tag>
    command: ["python", "/tmp/test.py"]
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU

When I describe the node it shows the following labels and annotations:

Labels:             alpha.eksctl.io/cluster-name=<cluster-name>
                    alpha.eksctl.io/instance-id=<id>
                    alpha.eksctl.io/nodegroup-name=<nodegroup-name>
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=p3.2xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2b
                    <label>=<label>
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=<host-name>
                    kubernetes.io/os=linux
                    nvidia.com/gpu=true
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true

rlantz-cfa · 2020-07-05T19:05:56Z

Quick follow up for anyone who happens upon this thread... for me the issue was that I had a taint on the node group that prevented the daemonset from being scheduled there. To resolve I just added a toleration to that spec like below, and the DS works like a charm with the default AWS accelerated AMI.

spec:
  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: <custom-label>
    operator: Exists
    effect: NoSchedule

tuvovan · 2021-05-07T02:10:16Z

any update on this?

RakeshRaj97 · 2022-06-06T01:19:16Z

Facing same issue. Any update on this?

github-actions · 2024-02-29T04:25:20Z

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions · 2024-03-31T04:26:14Z

This issue was automatically closed due to inactivity.

dwschulze mentioned this issue Jun 11, 2020

Run Locally instructions fail #178

Closed

11 tasks

jasmingacic mentioned this issue Oct 22, 2020

GPU support k0sproject/k0s#181

Closed

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 29, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Following the QuickStart but my pod is stuck in pending state #176

Following the QuickStart but my pod is stuck in pending state #176

dwschulze commented May 31, 2020

dwschulze commented Jun 8, 2020

klueska commented Jun 8, 2020

dwschulze commented Jun 8, 2020

klueska commented Jun 8, 2020 •

edited

Loading

dwschulze commented Jun 8, 2020 via email

klueska commented Jun 8, 2020

klueska commented Jun 8, 2020

dwschulze commented Jun 8, 2020 via email

klueska commented Jun 8, 2020 •

edited

Loading

dwschulze commented Jun 8, 2020 via email

dwschulze commented Jun 8, 2020

klueska commented Jun 9, 2020

dwschulze commented Jun 9, 2020

klueska commented Jun 10, 2020

dwschulze commented Jun 11, 2020 via email

rlantz-cfa commented Jul 5, 2020

rlantz-cfa commented Jul 5, 2020 •

edited

Loading

tuvovan commented May 7, 2021

RakeshRaj97 commented Jun 6, 2022

github-actions bot commented Feb 29, 2024

github-actions bot commented Mar 31, 2024

Following the QuickStart but my pod is stuck in pending state #176

Following the QuickStart but my pod is stuck in pending state #176

Comments

dwschulze commented May 31, 2020

dwschulze commented Jun 8, 2020

klueska commented Jun 8, 2020

dwschulze commented Jun 8, 2020

klueska commented Jun 8, 2020 • edited Loading

dwschulze commented Jun 8, 2020 via email

klueska commented Jun 8, 2020

klueska commented Jun 8, 2020

dwschulze commented Jun 8, 2020 via email

klueska commented Jun 8, 2020 • edited Loading

dwschulze commented Jun 8, 2020 via email

dwschulze commented Jun 8, 2020

klueska commented Jun 9, 2020

dwschulze commented Jun 9, 2020

klueska commented Jun 10, 2020

dwschulze commented Jun 11, 2020 via email

rlantz-cfa commented Jul 5, 2020

rlantz-cfa commented Jul 5, 2020 • edited Loading

tuvovan commented May 7, 2021

RakeshRaj97 commented Jun 6, 2022

github-actions bot commented Feb 29, 2024

github-actions bot commented Mar 31, 2024

klueska commented Jun 8, 2020 •

edited

Loading

klueska commented Jun 8, 2020 •

edited

Loading

rlantz-cfa commented Jul 5, 2020 •

edited

Loading