Following the QuickStart but my pod is stuck in pending state #176
Comments
Running |
You seem to have followed all of the instructions correctly, and your driver seems to be working properly, since you can run Can you verify your
|
Did you mean
What about driver version, kubernetes version, and plugin version? I'm using the latest drivers (440) and kubernetes version 1.18.3, and whatever version of the plugin the create daemonset command installs. Other people have reported version compatibility problems. |
No, I meant just:
Assuming you followed the instructions in the QuickStart to make
|
The directory /etc/docker got overwritten when I was reinstalling things
this morning. I made the changes you showed in /etc/docker/daemon.json,
restarted kubelet on the nodes and created the daemonset again on the
master. Running kubectl describe nodes still shows 0 for Capacity and
Allocatable for gpus. I tried the docker command on the node again and
this is what I get:
$ docker run nvidia/cuda nvidia-smi
docker: Error response from daemon: OCI runtime create failed:
container_linux.go:349: starting container process caused "exec:
\"nvidia-smi\": executable file not found in $PATH": unknown.
ERRO[0000] error waiting for container: context canceled
I'm running the latest versions of the driver and kubernetes. Is this
maybe a versioning problem?
…On Mon, Jun 8, 2020 at 10:47 AM Kevin Klues ***@***.***> wrote:
No, I meant just:
docker run nvidia/cuda nvidia-smi
Assuming you followed the instructions in the QuickStart to make nvidia
the default runtime for docker, then you shouldn't need to use the
nvidia-docker wrapper script (docker alone will work). In fact, plain
docker is required to work in order for Kubernetes support to work.
You will need to enable the nvidia runtime as your default runtime on your node. We will be editing the docker daemon config file which is usually present at /etc/docker/daemon.json:
{
"default-runtime": "nvidia", <---- This is the important line
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
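One way to sanity-check this change before restarting anything is to validate the file's JSON syntax (note the `<----` annotation above is explanatory only and must not appear in the real file, since it would make the JSON invalid). A minimal sketch, using a /tmp path purely for illustration:

```shell
# Write the example daemon.json (without the annotation arrow) and
# confirm it parses as valid JSON before putting it in /etc/docker.
cat > /tmp/daemon.json <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
python3 -m json.tool /tmp/daemon.json > /dev/null && echo "valid JSON"
```

On a real node you would then copy the file to /etc/docker/daemon.json, run `sudo systemctl restart docker`, and re-test with `docker run nvidia/cuda nvidia-smi`.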
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#176 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAK46CTJQDD6ZZHHHEMOKTDRVUIZVANCNFSM4NPHQULA>
.
|
Did you also restart docker after making that change to daemon.json? |
Shouldn't be a versioning problem. Until you get the following to work, nothing will work under Kubernetes:
docker run nvidia/cuda nvidia-smi
|
Restarting docker allows the docker command to run, and kubectl describe
nodes shows 1 gpu Allocatable. When I try to run the examples from this
page:
https://docs.nvidia.com/datacenter/kubernetes/kubernetes-upstream/index.html
(have to clone the git repo and check out the examples branch -- the URLs on that page are broken). All
pods are in pending state. Nothing will run. This is where I was last
week.
What should I try next?
…On Mon, Jun 8, 2020 at 11:34 AM Kevin Klues ***@***.***> wrote:
Did you also restart docker after making that change to daemon.json?
|
Which examples from there are you running? Most of that site seems to be more about how to deploy Kubernetes in general, with only a very short section about the plugin with a single example:
I only glanced at it briefly though, so I could have missed something if there were more. |
That one fails to deploy. I've cloned the repo and checked out the examples branch, and when I run deployment.yml from my file system it creates a deployment with 32 replicas that all stay in the pending state. I'm not sure what is supposed to happen when you request more replicas than you have available nodes, but I would expect one of those 32 pods to run. The application itself just sleeps for 100 seconds, and I assume that is to give you a chance to run
kubectl exec -it gpu-pod nvidia-smi
where you would use the pod name of a running pod from the replica set.
For me all 32 just stay in the pending state. Same if I run the pod.yml example.
That is why I wonder if I’ve got some kind of version mismatch. There do seem to be some (undocumented) version requirements between the driver, plugin, and Kubernetes.
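When a pod is stuck in Pending, the scheduler records the reason as an event on the pod, which is usually faster than guessing at version mismatches. A sketch of how to see it (the pod name here is hypothetical; substitute one of your pending pods):

```shell
# Print only the Events section of the pod description, which states
# why scheduling failed (e.g. "Insufficient nvidia.com/gpu").
kubectl describe pod gpu-pod | sed -n '/Events:/,$p'
```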
From: Kevin Klues <notifications@github.com>
Sent: Monday, June 8, 2020 12:10 PM
To: NVIDIA/k8s-device-plugin <k8s-device-plugin@noreply.github.com>
Cc: dwschulze <dean.w.schulze@gmail.com>; Author <author@noreply.github.com>
Subject: Re: [NVIDIA/k8s-device-plugin] Following the QuickStart but my pod is stuck in pending state (#176)
Which examples from there are you running?
Most of that site seems to be more about how to deploy Kubernetes in general, with only a very short section about the plugin with a single example:
kubectl create -f https://github.com/NVIDIA/k8s-device-plugin/blob/examples/workloads/deployment.yml
|
Two things I've noticed. Creating/deleting the plugin daemonset creates / deletes the file /var/lib/kubelet/device-plugins/nvidia.sock on the nodes. The nvidia-device-plugin.yml contains
There is no nvidia/k8s-device-plugin docker image on my master node. There is also no docker container running on my master node with nvid in the name. I haven't used a daemonset before. Should I be able to see a docker image or container with that name? |
There shouldn't be any plugins on your master -- it only runs on worker nodes. |
Oh, they're on the nodes:
nvidia/k8s-device-plugin   1.0.0-beta6   c0fa7866a301   6 weeks ago   64.2MB
Also, which version of the CUDA developer libraries do I need to work with the source code? |
I was just responding to your statement of:
Regarding:
I'm not sure what you are asking here. What is "the source code"? |
I got the source code by cloning this github page. The build instructions say:
Without Docker
Build
$ C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build
So they expect you to have cuda libraries installed, but they don’t say which version of cuda.
From: Kevin Klues <notifications@github.com>
Sent: Wednesday, June 10, 2020 7:15 AM
To: NVIDIA/k8s-device-plugin <k8s-device-plugin@noreply.github.com>
Cc: dwschulze <dean.w.schulze@gmail.com>; Author <author@noreply.github.com>
Subject: Re: [NVIDIA/k8s-device-plugin] Following the QuickStart but my pod is stuck in pending state (#176)
I was just responding to your statement of:
There is no nvidia/k8s-device-plugin docker image on my master node.
Regarding:
Also, which version of the CUDA developer libraries do I need to work with the source code?
I'm not sure what you are asking here. What is "the source code"?
|
Similar issue here: I'm using EKS (Kubernetes v1.16.8), and it's creating the nodegroup with the correct AMI, that is, the Amazon EKS-optimized accelerated AMI as described here. The instructions there use the 1.0 beta version of your DS, but I've tried deploying it with both Helm 3, and FWIW, I have to run
As suggested in the AWS docs, when I run If I try to run a pod (spec below) I get: The phrase
When I describe the node it shows the following labels and annotations:
|
Quick follow up for anyone who happens upon this thread... for me the issue was that I had a taint on the node group that prevented the daemonset from being scheduled there. To resolve I just added a toleration to that spec like below, and the DS works like a charm with the default AWS accelerated AMI.
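The toleration spec itself did not survive in this thread, so as a sketch, a toleration on the device-plugin daemonset's pod template might look like the following. The key, value, and effect must match whatever taint your node group actually carries (`nvidia.com/gpu=present:NoSchedule` below is only an example, not taken from the original comment):

```yaml
# Hypothetical taint key/value -- match them to the Taints line in
# `kubectl describe node` output for your GPU nodes.
spec:
  template:
    spec:
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "present"
          effect: "NoSchedule"
```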
|
any update on this? |
Facing same issue. Any update on this? |
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. |
This issue was automatically closed due to inactivity. |
I've followed the QuickStart instructions down to the Docs section on my two nodes. I took the .yaml shown in the Running GPU Jobs section and ran it from the master with
kubectl apply -f nvidia.cuda.yaml
I had modified the .yaml to set
nvidia.com/gpu: 1
because I only have one gpu on each of my nodes. However, my pod stays in the pending state. I've verified that cuda binaries run on both nodes.
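For reference, a pod spec modified this way would look roughly like the sketch below. The pod name, image tag, and command are illustrative assumptions, not copied from the QuickStart's .yaml:

```yaml
# Sketch of a GPU pod spec; name/image/command are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvidia/cuda:10.2-base
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # one GPU per node in this setup
```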
Are there other steps I need to take to get this plugin to execute gpu jobs? Is it necessary to execute any of the steps below the Docs section, such as the
With Docker Build
section, because Option 2 fails? It's not clear what the steps below the Docs section are for.
I'm running on Ubuntu 18.04 with kubernetes 1.18.2.
$ nvidia-smi
Sun May 31 10:54:58 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82 Driver Version: 440.82 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P6000 Off | 00000000:09:00.0 Off | Off |
| 26% 23C P8 8W / 250W | 0MiB / 24449MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+