
docker: Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/moby/761bd05e8ceb95e1459db860b160e9dda095254a969ebd9a0b777524f73f9263/log.json: no such file or directory): exec: "nvidia-container-runtime": executable file not found in $PATH: unknown. #166

Closed
wjimenez5271 opened this issue May 5, 2020 · 6 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@wjimenez5271

When following the latest installation instructions at https://github.com/NVIDIA/nvidia-docker/, it says that nvidia-docker2 has been deprecated and that one should install the NVIDIA Container Toolkit instead. I followed the instructions for Ubuntu 18.04 with Docker 19.03; however, this does not seem to install the nvidia-container-runtime binary mentioned in this project's README. As a result, Docker cannot start any container after updating the default runtime in /etc/docker/daemon.json per the README. Is this device plugin not compatible with the latest iteration of the NVIDIA container tooling? Here is the error message:

docker: Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/moby/761bd05e8ceb95e1459db860b160e9dda095254a969ebd9a0b777524f73f9263/log.json: no such file or directory): exec: "nvidia-container-runtime": executable file not found in $PATH: unknown.

and just to show:

ls /usr/bin/nvidia-container-runtime
ls: cannot access '/usr/bin/nvidia-container-runtime': No such file or directory

I also tried nvidia-container-cli, since that is what the current package installs. Is it possible this repo needs to be updated to reflect nvidia-docker2's deprecation?
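For anyone hitting the same error, a quick way to check which NVIDIA container components are actually present on the host (a sketch assuming an Ubuntu/Debian system; adjust the package query for other distros):

# list installed NVIDIA container packages (Debian/Ubuntu)
dpkg -l | grep -E 'nvidia-(docker|container)'

# check which of the binaries are actually on the PATH
which nvidia-container-cli nvidia-container-runtime nvidia-container-toolkit

# show the runtimes Docker currently knows about
docker info | grep -i runtime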

@ardenpm

ardenpm commented May 19, 2020

The docs in this repo specifically state that nvidia-container-toolkit should not be used and that nvidia-docker2 should be used instead (even though deprecated) since K8s isn't aware of the --gpus Docker flag yet (not sure if that is still the case).

So it looks like the instructions for Docker and K8s are currently different. I set up per the instructions in this repo for K8s, but right now I can't run anything in Docker, so I doubt it will work in K8s. When I try to run with the nvidia runtime I get segfaults immediately; still trying to track that down.
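For readers following along, the distinction being discussed looks roughly like this (a sketch; the CUDA image tag is just an example):

# Toolkit path: Docker 19.03+ can expose GPUs directly via the --gpus flag,
# which Kubernetes does not use
docker run --rm --gpus all nvidia/cuda:10.2-base-ubuntu18.04 nvidia-smi

# Runtime path: what the k8s device plugin relies on; requires
# nvidia-container-runtime to be registered in /etc/docker/daemon.json
docker run --rm --runtime=nvidia nvidia/cuda:10.2-base-ubuntu18.04 nvidia-smi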

@klueska
Contributor

klueska commented May 19, 2020

I agree, the docs are confusing and should be synchronized better.

Please see my comment here for an explanation of how nvidia-docker2 and nvidia-container-toolkit are related: #168 (comment)

Regarding the segfault, I'm curious if it could be related to: NVIDIA/nvidia-docker#1280 (comment)

@ardenpm

ardenpm commented May 19, 2020

Indeed, that comment helped make it clear. It was also reassuring to know that behind the scenes it's basically the same, since the deprecation statements on nvidia-docker2 are a bit disconcerting.

Now, on the segfault: this was, and still is, really strange. I think mine was actually different from the one in this issue. nvidia-container-cli would also segfault immediately, even just using the info command, so I don't think it was specific to Docker.

All of my testing there was on the latest CentOS 7, and I wasn't able to resolve the problem. Since I needed to do some testing, I switched to Ubuntu 18.04 and was not able to replicate the issue there at all.

I still have both images in a stopped state on AWS from my testing, so I can probably get more details on the actual segfault stack trace, but I am not sure if others are encountering this. The actual error was related to munmap_chunk: invalid pointer.
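If it helps with tracking it down, one way to capture a backtrace for that kind of crash (a sketch; assumes gdb is available, and coredumpctl only applies where systemd-coredump is enabled):

# run the crashing command under gdb and print a backtrace at the segfault
gdb --batch -ex run -ex bt --args nvidia-container-cli info

# alternatively, inspect the most recent core dump for the binary
coredumpctl gdb nvidia-container-cli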

@kengz

kengz commented Jun 7, 2020

I had the same issue setting up a k8s cluster with GPUs. I went through the comments here and other related issues and put together the steps that made it work; hopefully this is useful to people looking for a solution:

Kubernetes NVIDIA GPU device plugin

  • follow the official NVIDIA GPU device plugin instructions up to the step that configures the runtime

  • as explained in this comment, k8s still needs nvidia-container-runtime; install it:

    # install the old nvidia-container-runtime for k8s
    curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
      sudo apt-key add -
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
      sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
    sudo apt-get update
    sudo apt-get install -y nvidia-container-runtime
  • add the following to /etc/docker/daemon.json, as required by k8s

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }
  • restart docker and test:

    sudo systemctl restart docker
    # test that docker can run with GPU without the --gpus flag
    docker run nvidia/cuda:10.2-runtime-ubuntu18.04 nvidia-smi
  • finally, install the NVIDIA device plugin on your cluster (a verification sketch follows after this list):

    kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
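
A quick way to confirm the plugin registered and that a pod can actually request a GPU (a sketch; the pod name gpu-smoke-test and the image tag are just examples):

# the node should now advertise nvidia.com/gpu as an allocatable resource
kubectl describe node | grep -A 2 'nvidia.com/gpu'

# run a throwaway pod that requests one GPU and prints nvidia-smi output
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:10.2-runtime-ubuntu18.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# once the pod completes, check its output
kubectl logs gpu-smoke-test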


This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

@github-actions github-actions bot added the lifecycle/stale label on Feb 29, 2024

This issue was automatically closed due to inactivity.

@github-actions github-actions bot closed this as not planned on Mar 31, 2024