
Following the QuickStart but my pod is stuck in pending state #176

Closed
dwschulze opened this issue May 31, 2020 · 21 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@dwschulze

I've followed the QuickStart instructions down to the Docs section on my two nodes. I took the .yaml shown in the Running GPU Jobs section and ran it from the master with

kubectl apply -f nvidia.cuda.yaml

I had modified the .yaml to set nvidia.com/gpu: 1 because I only have one GPU on each of my nodes. However, my pod stays in the Pending state:

$ kubectl get pods
NAME      READY   STATUS    RESTARTS   AGE
gpu-pod   0/2     Pending   0          24h
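
The only GPU-specific change I made was the resource limit. A minimal single-container sketch of the kind of spec I'm applying looks like this (image tag and names here are illustrative; the actual QuickStart example uses two containers):

# Minimal sketch of a pod requesting one GPU (image/name illustrative)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-container
    image: nvidia/cuda:10.2-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # only one GPU per node
EOF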

I've verified that CUDA binaries run on both nodes.

Are there other steps I need to take to get this plugin to execute GPU jobs? Is it necessary to execute any of the steps below the Docs section, such as the With Docker Build section? (I ask because Option 2 there fails for me.)

It's not clear what the steps below the Docs section are for.

I'm running Ubuntu 18.04 with Kubernetes 1.18.2.

$ nvidia-smi
Sun May 31 10:54:58 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P6000        Off  | 00000000:09:00.0 Off |                  Off |
| 26%   23C    P8     8W / 250W |      0MiB / 24449MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@dwschulze
Author

Running kubectl describe nodes shows that I have 0 GPUs allocatable, which would explain why my pods don't get out of the Pending state. The plugin is not recognizing my GPUs.
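
In case it helps anyone checking the same thing, this is how I'm looking at what the node advertises (node name redacted):

# Show what the node reports as allocatable (nvidia.com/gpu should be non-zero)
kubectl describe node <node-name> | grep -A 6 Allocatable
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"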

@klueska
Contributor

klueska commented Jun 8, 2020

You seem to have followed all of the instructions correctly, and your driver seems to be working properly, since you can run nvidia-smi on the host.

Can you verify your nvidia-docker2 installation by running the following:

docker run nvidia/cuda nvidia-smi
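
If that fails, forcing the runtime explicitly is a useful extra data point, since it separates "the default runtime isn't set" from "the nvidia runtime isn't installed at all" (diagnostic only, not part of the QuickStart):

# Diagnostic: bypass the default-runtime setting and select the nvidia runtime by hand
docker run --runtime=nvidia nvidia/cuda nvidia-smi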

@dwschulze
Author

Did you mean nvidia-docker run nvidia/cuda nvidia-smi ?

$ docker run nvidia/cuda nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown.
$ nvidia-docker run nvidia/cuda nvidia-smi
Mon Jun  8 16:24:29 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 440.59       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P6000        Off  | 00000000:09:00.0 Off |                  Off |
| 26%   18C    P8     9W / 250W |      0MiB / 24449MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

What about driver version, Kubernetes version, and plugin version? I'm using the latest drivers (440), Kubernetes 1.18.3, and whatever version of the plugin the create daemonset command installs. Other people have reported version compatibility problems.

@klueska
Contributor

klueska commented Jun 8, 2020

No, I meant just:

docker run nvidia/cuda nvidia-smi

Assuming you followed the instructions in the QuickStart to make nvidia the default runtime for docker, you shouldn't need to use the nvidia-docker wrapper script (docker alone will work). In fact, setting nvidia as the default runtime is required for Kubernetes support to work.

You will need to enable the nvidia runtime as your default runtime on your node. We will be editing the docker daemon config file which is usually present at /etc/docker/daemon.json:

{
    "default-runtime": "nvidia",  <---- This is the important line
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
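
After editing that file you also need to restart docker for the change to take effect, and you can confirm the default runtime before retrying (the systemctl command assumes a systemd-based host, which Ubuntu 18.04 is):

# Restart docker and confirm the default runtime actually changed
sudo systemctl restart docker
docker info | grep -i 'default runtime'
docker run nvidia/cuda nvidia-smi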

@dwschulze
Author

dwschulze commented Jun 8, 2020 via email

@klueska
Contributor

klueska commented Jun 8, 2020

Did you also restart docker after making that change to daemon.json?

@klueska
Contributor

klueska commented Jun 8, 2020

Shouldn't be a versioning problem.

Until you get the following to work, nothing will work under Kubernetes:

$ docker run nvidia/cuda nvidia-smi
docker: Error response from daemon: OCI runtime create failed:
container_linux.go:349: starting container process caused "exec:
\"nvidia-smi\": executable file not found in $PATH": unknown.
ERRO[0000] error waiting for container: context canceled
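
A few sanity checks worth running on the worker node before retrying (package names are the usual ones from the nvidia-docker install instructions for Ubuntu):

# Confirm the pieces the default runtime depends on are actually installed
dpkg -l | grep -E 'nvidia-docker2|nvidia-container-runtime'
which nvidia-container-runtime
cat /etc/docker/daemon.json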

@dwschulze
Author

dwschulze commented Jun 8, 2020 via email

@klueska
Contributor

klueska commented Jun 8, 2020

Which examples from there are you running?

Most of that site seems to be more about how to deploy Kubernetes in general, with only a very short section about the plugin with a single example:

kubectl create -f https://github.com/NVIDIA/k8s-device-plugin/blob/examples/workloads/deployment.yml

I only glanced at it briefly though, so I could have missed something if there were more.

@dwschulze
Author

dwschulze commented Jun 8, 2020 via email

@dwschulze
Author

Two things I've noticed.

Creating/deleting the plugin daemonset creates / deletes the file /var/lib/kubelet/device-plugins/nvidia.sock on the nodes.

The nvidia-device-plugin.yml contains

      containers:
      - image: nvidia/k8s-device-plugin:1.0.0-beta6

There is no nvidia/k8s-device-plugin docker image on my master node. There is also no docker container running on my master node with nvid in the name. I haven't used a DaemonSet before. Should I be able to see a docker image or container with that name?

@klueska
Contributor

klueska commented Jun 9, 2020

There shouldn't be any plugins on your master -- it only runs on worker nodes.
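
If you want to confirm where it landed: the default manifest deploys the plugin as a DaemonSet in kube-system, so something like the following should show its pods sitting on the GPU workers (pod name is whatever the DaemonSet generated):

# The plugin pods (and the pulled image) live on the GPU worker nodes, not the master
kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin
kubectl logs -n kube-system <nvidia-device-plugin-pod-name>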

@dwschulze
Author

There shouldn't be any plugins on your master -- it only runs on worker nodes.

Oh, they're on the nodes:

nvidia/k8s-device-plugin 1.0.0-beta6 c0fa7866a301 6 weeks ago 64.2MB

Also, which version of the CUDA developer libraries do I need to work with the source code?

@klueska
Contributor

klueska commented Jun 10, 2020

I was just responding to your statement of:

There is no nvidia/k8s-device-plugin docker image on my master node.

Regarding:

Also, which version of the CUDA developer libraries do I need to work with the source code?

I'm not sure what you are asking here. What is "the source code"?

@dwschulze
Author

dwschulze commented Jun 11, 2020 via email

@rlantz-cfa

Similar issue here: I'm using EKS (Kubernetes v 1.16.8), and it's creating the nodegroup with the correct AMI. That is, the Amazon EKS-optimized accelerated AMI as described here

The instructions there use the 1.0 beta version of your DS, but I've tried deploying it with both Helm 3 and kubectl ... v0.6.0 .... For whatever reason the scheduler is not recognizing the GPU on the host.
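
For reference, the Helm route I tried was roughly the standard one from the plugin README for that version (repo and chart names as published by NVIDIA; exact flags from memory):

# Roughly how I deployed the plugin with Helm 3
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install --generate-name nvdp/nvidia-device-plugin --version=0.6.0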

FWIW, I have to run sudo docker on the host to get the above command to work, but when I do (e.g. sudo docker run <image-tag> python /tmp/test.py) it works as expected. Maybe there's an issue with the AMI?

| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   32C    P0    25W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

tensor([[0.4713, 0.7497, 0.5766],
        [0.3508, 0.8708, 0.7834]], device='cuda:0') 

As suggested in the AWS docs, when I run kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" I get <None> for GPU. Do I need something in my node groups to make sure the nvidia.com/gpu label is present?

If I try to run a pod (spec below) I get: Warning FailedScheduling 7s default-scheduler 0/3 nodes are available: 2 node(s) didn't match node selector, 3 Insufficient nvidia.com/gpu

The phrase Insufficient nvidia.com/gpu seems important here, but I'm not sure what combination of labels or annotations to use.

kind: Pod
# some stuff redacted...
metadata:
  name: pytorch-test
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: <label-name>
            operator: In
            values: 
              - <affinity-tag>
  tolerations:
  - key: "nontraining"
    operator: "Exists"
    effect: "NoSchedule"
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
  restartPolicy: OnFailure
  containers:
  - name: pytorch-nvidia
    image: <image-tag>
    command: ["python", "/tmp/test.py"]
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU

When I describe the node it shows the following labels and annotations:

Labels:             alpha.eksctl.io/cluster-name=<cluster-name>
                    alpha.eksctl.io/instance-id=<id>
                    alpha.eksctl.io/nodegroup-name=<nodegroup-name>
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=p3.2xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2b
                    <label>=<label>
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=<host-name>
                    kubernetes.io/os=linux
                    nvidia.com/gpu=true
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true

@rlantz-cfa

rlantz-cfa commented Jul 5, 2020

Quick follow-up for anyone who happens upon this thread: for me the issue was that I had a taint on the node group that prevented the DaemonSet from being scheduled there. To resolve it I just added tolerations to that spec, as below, and the DS works like a charm with the default AWS accelerated AMI.

spec:
  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: <custom-label>
    operator: Exists
    effect: NoSchedule
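
If you hit the same symptom, describing the node makes the offending taint easy to spot, and you can confirm the DaemonSet pods actually scheduled afterwards (node name redacted; namespace depends on how you deployed the plugin):

# Look for taints that keep the device-plugin DaemonSet off the GPU nodes,
# then confirm its pods are actually running there
kubectl describe node <node-name> | grep -i taints
kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin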

@tuvovan

tuvovan commented May 7, 2021

Any update on this?

@RakeshRaj97

Facing the same issue. Any update on this?


This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 29, 2024

This issue was automatically closed due to inactivity.

@github-actions github-actions bot closed this as not planned Mar 31, 2024