same node pods communication through unix socket #109

Open
Fizzbb opened this issue Mar 15, 2022 · 2 comments

Fizzbb commented Mar 15, 2022

In the v0.3.0 release, vgpu-server creates a socket at /run/alnair.sock on the host machine, and user containers mount /run/alnair.sock as a hostPath volume to communicate with it. In that release vgpu-server was deployed directly on the host. Will this communication still work when vgpu-server is deployed in a pod? (See the sketch after the references below.)

Kubernetes has an example of using an emptyDir volume for communication between containers within the same pod, but that is not exactly what we want here:
https://kubernetes.io/docs/tasks/access-application-cluster/communicate-containers-same-pod-shared-volume/

References:
Expose a Unix socket to the host system from inside a Docker container:
https://serverfault.com/questions/881875/expose-a-unix-socket-to-the-host-system-from-inside-from-a-docker-container/881895#881895
Using Unix sockets with Docker:
https://web.archive.org/web/20210411145047/https://www.jujens.eu/posts/en/2017/Feb/15/docker-unix-socket/
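
To make the question concrete, here is a minimal sketch of the hostPath approach with vgpu-server packaged in a pod. The DaemonSet packaging, image names, and pod names are assumptions for illustration only, not the project's actual manifests; only the /run/alnair.sock path comes from the description above.

```yaml
# Sketch only: packaging vgpu-server as a DaemonSet so one server pod
# runs per GPU node and owns the host-side socket.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: alnair-vgpu-server
spec:
  selector:
    matchLabels:
      app: alnair-vgpu-server
  template:
    metadata:
      labels:
        app: alnair-vgpu-server
    spec:
      containers:
      - name: alnair-vgpu-server
        image: alnair/vgpu-server:v0.3.0     # assumed image name/tag
        volumeMounts:
        - name: host-run
          mountPath: /run                    # server creates /run/alnair.sock here
      volumes:
      - name: host-run
        hostPath:
          path: /run                         # bind-mount of the host's /run
          type: Directory
---
# A user pod on the same node mounts the socket file itself.
apiVersion: v1
kind: Pod
metadata:
  name: user-workload                        # assumed name
spec:
  containers:
  - name: app
    image: user/app:latest                   # assumed image
    volumeMounts:
    - name: alnair-sock
      mountPath: /run/alnair.sock
  volumes:
  - name: alnair-sock
    hostPath:
      path: /run/alnair.sock
      type: Socket                           # mount fails if the socket does not yet exist on the host
```

Because both sides bind-mount the same host path, the Unix socket should behave the same whether vgpu-server runs as a host process or inside a pod; the hostPath volume is what bridges the two mount namespaces. (The follow-up comments below show that mounting all of /run can collide with the nvidia-docker2 hook on some nodes.)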


Fizzbb commented Mar 16, 2022

In the manifest, if we mount /run from a hostPath volume into the vgpu-server container, the container cannot start. The error message is:

"Error: failed to start container "alnair-vgpu-server": Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/5afe69b7c6a1edf6187c693b8e9876f1e7df11b0b37badf1a61eae16c7694f4f/merged/run/nvidia-persistenced/socket: no such device or address: unknown"
However, only one GPU node reports this error while the other does not, even after aligning the Docker (20.10.12), nvidia-docker2 (2.9.1), and nvidia-container-cli (1.8.1) versions on both nodes.
NVIDIA/nvidia-docker#885


Fizzbb commented Mar 16, 2022

The mount error is caused by nvidia-docker2; installing nvidia-container-runtime instead of nvidia-docker2 resolves it. From NVIDIA/nvidia-docker#825:
"@3XX0 is this something that changed in version 2? we've been using nvidia-docker for a while now and we do have nvidia-smi as part of our docker images. we started to get the mount error once we switched to nvidia-docker2"
"No this requirement didn't change between the versions, as part of v2 we prevent the container from starting if you have the NVIDIA driver (e.g: nvidia-smi) in your image as this will lead to undefined behavior."
