rocminfo fails in rocm/rocm-terminal #116

Open
danpetreamd opened this issue Oct 26, 2023 · 5 comments

Comments

@danpetreamd

rocm-smi works fine.

The following was run on a 4x GPU System:

$ docker run -it --rm --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 16G rocm/rocm-terminal:latest

# rocminfo
ROCk module is loaded
Unable to open /dev/kfd read-write: Permission denied
Failed to get user name to check for video group membership

# rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    37.0c           38.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%
1    40.0c           39.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%
2    41.0c           42.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%
3    39.0c           35.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%
================================================================================
============================= End of ROCm SMI Log ==============================

# groups
rocm-user sudo video

rocminfo works fine in rocm/dev-ubuntu-22.04 and rocm/pytorch:

$ docker run -it --rm --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 16G rocm/pytorch:latest

# rocminfo | grep MI100
  Marketing Name:          AMD Instinct MI100
  Marketing Name:          AMD Instinct MI100
  Marketing Name:          AMD Instinct MI100
  Marketing Name:          AMD Instinct MI100

# groups
root video
@danpetreamd
Author

Using the instructions in the README.md:

$ sudo docker run -it --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video rocm/rocm-terminal
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

# rocminfo
ROCk module is loaded
Unable to open /dev/kfd read-write: Permission denied
Failed to get user name to check for video group membership

@danpetreamd
Author

I'm wondering if this image is still in use and/or if we can deprecate it.

@baryluk

baryluk commented Dec 17, 2023

Same here.

Also, one does not need to run sudo docker ...

As long as the user is in the docker group, plain docker ... works. Using docker with sudo should not be promoted like this (not that it matters too much).

$ sudo docker run -it --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video rocm/rocm-terminal
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

rocm-user@015b5fcf64bf:~$ rocminfo 
ROCk module is loaded
Unable to open /dev/kfd read-write: Permission denied
Failed to get user name to check for video group membership
rocm-user@015b5fcf64bf:~$ logout
$ 

The reason is that video is the wrong group; it should be render:

$ ls -l /dev/kfd 
crw-rw---- 1 root render 243, 0 Dec 14 04:54 /dev/kfd
$
$ grep render /etc/group
render:x:993:user
$

For some reason, though, passing --group-add render does not work:

$ sudo docker run -it --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video --group-add render rocm/rocm-terminal
docker: Error response from daemon: Unable to find group render: no matching entries in group file.
ERRO[0000] error waiting for container: context canceled 

This is probably because /etc/group in the container differs from the host's.

Running docker run with --user=root is an option, and not too bad a one (the file system, processes, etc. are still isolated and safe), but it would be nice to find a nicer solution.
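
To check the /etc/group theory, one could compare how the name render resolves in the host's group file and in a copy of the container's. This is only a sketch: `gid_in_group_file` is a hypothetical helper, not part of Docker or ROCm, and `container-group.txt` is an assumed file name for a copy of the container's /etc/group.

```shell
#!/bin/sh
# Hypothetical helper: print the numeric GID for a group name found in a
# file in /etc/group format, or print nothing if the name is not listed.
gid_in_group_file() {
    name="$1"
    file="$2"
    # Fields are name:password:gid:members; match on the first field only.
    awk -F: -v g="$name" '$1 == g { print $3; exit }' "$file"
}

# Host side (993 in the listing above):
#   gid_in_group_file render /etc/group
# Container side, assuming its /etc/group was saved to container-group.txt:
#   gid_in_group_file render container-group.txt
```

If the second lookup prints nothing, that would match the "Unable to find group render" error, since the group name is looked up in the container's group file rather than the host's.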

@baryluk
Copy link

baryluk commented Dec 17, 2023

This looks related - #90

@baryluk

baryluk commented Feb 8, 2024

Still broken when following the instructions in the current README.md.

If I pass the render GID by number, it complains but works:

$ sudo docker run -it --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add $(getent group render | cut -d: -f3)  rocm/rocm-terminal
groups: cannot find name for group ID 993
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

rocm-user@3e3292ebfa5c:~$ 

and /dev/kfd works inside (i.e. rocminfo has no issues accessing it)
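
A variation on the same workaround, as a sketch only: take the numeric GIDs from the device nodes themselves rather than from a group name, so it does not matter whether the host calls the group render or video. `group_add_flags` is a hypothetical helper, `stat -c` assumes GNU coreutils, and the `/dev/dri/renderD*` glob is an assumption about where the render nodes live on the host.

```shell
#!/bin/sh
# Hypothetical helper: emit one "--group-add <gid>" per unique owning GID
# of the device nodes given as arguments. Paths that do not exist are skipped.
group_add_flags() {
    for dev in "$@"; do
        [ -e "$dev" ] || continue
        stat -c '%g' "$dev"          # numeric GID of the node's owning group
    done | sort -un | sed 's/^/--group-add /' | tr '\n' ' '
}

# Print the resulting command instead of running it, so it can be reviewed.
# The command substitution is left unquoted on purpose so the flags split
# into separate words.
echo docker run -it --device=/dev/kfd --device=/dev/dri \
    --security-opt seccomp=unconfined \
    $(group_add_flags /dev/kfd /dev/dri/renderD*) rocm/rocm-terminal
```

The "groups: cannot find name for group ID" message would presumably still appear, since the container's /etc/group has no entry for that GID, but as noted above it does not stop /dev/kfd from working.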
