0/2 nodes are available: 2 Insufficient nvidia.com/gpu #159

Closed
nikosep opened this issue Mar 3, 2020 · 9 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.



nikosep commented Mar 3, 2020

I am facing this old issue. I have gone through all the relevant workarounds, but the issue still persists.

Kubernetes version: 1.14
Docker version on GPU node: 19.03.6
GPU node: 4 x GTX1080Ti

I am trying to deploy this example:

---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tensorflow-gpu
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: tensorflow-gpu
    spec:
      volumes:
      - hostPath:
          path: /usr/lib/nvidia-418/bin
        name: bin
      - hostPath:
          path: /usr/lib/nvidia-418
        name: lib
      - hostPath:
          path: /usr/lib/x86_64-linux-gnu/libcuda.so.1
        name: libcuda-so-1
      - hostPath:
          path: /usr/lib/x86_64-linux-gnu/libcuda.so
        name: libcuda-so
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu
        ports:
        - containerPort: 8888
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - mountPath: /usr/local/nvidia/bin
          name: bin
        - mountPath: /usr/local/nvidia/lib
          name: lib
        - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
          name: libcuda-so-1
        - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so
          name: libcuda-so
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-gpu-service
  labels:
    app: tensorflow-gpu
spec:
  selector:
    app: tensorflow-gpu
  ports:
  - port: 8888
    protocol: TCP
    nodePort: 30061
  type: LoadBalancer
---

And I am getting the following error:
0/2 nodes are available: 2 Insufficient nvidia.com/gpu

When I specify the GPU node explicitly in the deployment YAML, I get the following error:
Update plugin resources failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected.

/etc/docker/daemon.json on GPU node:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

I have restarted docker and kubelet.
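
For reference, a quick sanity check that the nvidia runtime is actually the Docker default on the GPU node (a sketch; the CUDA image tag is just an example):

# Should list nvidia among the runtimes and show it as the default runtime
docker info | grep -i runtime

# Should print the nvidia-smi GPU table from inside a container with no extra flags
docker run --rm nvidia/cuda:10.0-base nvidia-smi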

I am using this NVIDIA device plugin DaemonSet:
https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
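
To double-check that the plugin pod is running and that the node actually advertises the GPU resource, something like this should do (a sketch; <gpu-node-name> is a placeholder and the pod name may differ between plugin versions):

# The device plugin runs as a DaemonSet pod in kube-system on each GPU node
kubectl -n kube-system get pods -o wide | grep nvidia-device-plugin

# Capacity and Allocatable should both show nvidia.com/gpu (4 on a 4-GPU node)
kubectl describe node <gpu-node-name> | grep -A 6 -E 'Capacity|Allocatable'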

Should I:
- label the GPU node that has the NVIDIA GPU somehow (see the sketch below)?
- restart the master node?
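
By labeling I mean something like the sketch below; the label key and value are arbitrary examples, and <gpu-node-name> is a placeholder:

# Hypothetical label; the deployment would then need a matching nodeSelector
# under spec.template.spec
kubectl label node <gpu-node-name> accelerator=gtx-1080ti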

Any help here is more than welcome!

@Sarang-Sangram

I am facing the same issue. Going through the container logs, it is throwing the error below, which I assume means something is wrong with the image itself:

libdc1394 error: Failed to initialize libdc1394


nikosep commented Mar 18, 2020

> I am facing the same issue. Going through the container logs, it is throwing the error below, which I assume means something is wrong with the image itself:
>
> libdc1394 error: Failed to initialize libdc1394

I think you need to use the NVIDIA CUDA image as the base image in your Dockerfile:
FROM nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04
(I guess you have already installed the NVIDIA device plugin on the cluster.)

@Sarang-Sangram

You mean in the pod spec file? Even after using the above image, I am seeing the error:

libdc1394 error: Failed to initialize libdc1394

@Sarang-Sangram

So I skipped that example pod and tried this deployment with a lower number of replicas, and it worked fine:

https://github.com/NVIDIA/k8s-device-plugin/blob/examples/workloads/deployment.yml

@RenaudWasTaken
Contributor

Hello!

Sorry for the delay. Could you fill out the default issue template? It is usually super helpful and makes it easier to help :) (see the command sketch after the list)

  • The output of nvidia-smi -a on your host
  • Your docker configuration file (e.g. /etc/docker/daemon.json)
  • The k8s-device-plugin container logs
  • The node description (kubectl describe nodes)
  • The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)
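
For reference, a rough sketch of the commands that produce the items above (the pod name is a placeholder):

# GPU driver state on the host
nvidia-smi -a

# Docker configuration
cat /etc/docker/daemon.json

# k8s-device-plugin container logs
kubectl -n kube-system logs <nvidia-device-plugin-pod>

# Node description
kubectl describe nodes

# Kubelet logs on the node
sudo journalctl -r -u kubelet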

@regulusv

regulusv commented May 8, 2020

@RenaudWasTaken I think the issue is that the Docker default runtime cannot be set to "nvidia" on Docker 19.03; runtime: nvidia has been deprecated, and we need a fix for that.

@klueska
Contributor

klueska commented May 8, 2020

Removed my previous comment with a link to this one so that there is one canonical place with a response to this issue:

#168 (comment)


This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions bot added the lifecycle/stale label on Feb 29, 2024.

This issue was automatically closed due to inactivity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Mar 31, 2024.