How can I allocate different AMD gpu device to different process? #841

Closed
nanguanqi opened this issue Jul 11, 2019 · 7 comments

Comments

@nanguanqi

Suppose there are 2 AMD GPU devices on the machine. How can I make process 1 use device 0 and process 2 use device 1? Is there an environment variable like "CUDA_VISIBLE_DEVICES" to set the visible AMD GPU devices for processes?

By default, are AMD GPU devices shared by all processes?

@nanguanqi
Author

For an OpenCL application, setting the GPU_DEVICE_ORDINAL environment variable could help. For an HC or HIP application, does GPU_DEVICE_ORDINAL also take effect when allocating a GPU device to the application?

@sunway513
Contributor

There are a few ways to expose only selected GPUs to the user process on the hip/hcc path:

  1. Use the HIP_VISIBLE_DEVICES environment variable to select the target GPUs for the process at the HIP level, e.g. use the following to select the first GPU:
  • export HIP_VISIBLE_DEVICES=0
  2. Use the ROCR_VISIBLE_DEVICES environment variable to select the target GPUs at the ROCr (ROCm user-bit driver) level, e.g. use the following to select the first GPU:
  • export ROCR_VISIBLE_DEVICES=0
  3. Pass the selected GPU driver interfaces (/dev/dri/render#) to the Docker container, e.g. use the following docker run command options to select the first GPU:
  • sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri/renderD128 --group-add video
    Note that you should see the following four interfaces if you have a 4xGPU system:
    $ ls /dev/dri/render*
    /dev/dri/renderD128 /dev/dri/renderD129 /dev/dri/renderD130 /dev/dri/renderD131
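
To tie this back to the original question (process 1 on device 0, process 2 on device 1), here is a minimal shell sketch that sets the variable per process instead of exporting it for the whole session. The helper name run_on_gpu and the `sh -c` placeholder workers are made up for illustration and are not part of ROCm; substitute a real HIP binary for the placeholder.

```shell
#!/bin/sh
# Hypothetical helper (not part of ROCm): run a command with only the
# given GPU index visible at the HIP level.
run_on_gpu() {
    gpu_index="$1"
    shift
    # `env` sets the variable for this one child process only,
    # so the calling shell's environment is left untouched.
    env HIP_VISIBLE_DEVICES="$gpu_index" "$@"
}

# Placeholder workers; replace the `sh -c ...` part with your HIP application.
run_on_gpu 0 sh -c 'echo "process 1: HIP_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES"' &
run_on_gpu 1 sh -c 'echo "process 2: HIP_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES"' &
wait
```

Swapping HIP_VISIBLE_DEVICES for ROCR_VISIBLE_DEVICES inside the helper should apply the same kind of filter at the ROCr level instead, as described in method 2.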

@nanguanqi
Author

@sunway513 many thanks for your response. It is really helpful to me. I am new to the AMD GPU world, and I have no real environment for testing and experimenting. Could I learn more about ROCR_VISIBLE_DEVICES?

Does the ROCR_VISIBLE_DEVICES environment variable work for HCC, HIP, and OpenCL applications alike?

Are there any environment prerequisites for using ROCR_VISIBLE_DEVICES to expose selected devices? What is the difference between the ROCm user-bit driver and the driver installed from the rocm-dkms primary meta-package?

If I install the ROCm platform via the rocm-dkms primary meta-package, can I then use this variable to expose selected GPU devices to different application processes?

@sunway513
Contributor

Hi @nanguanqi, you are welcome :-)
ROCR_VISIBLE_DEVICES operates on the ROCm ROCr runtime, which sits in the layer below hip/hcc/math-libs etc.; therefore, I'd assume it'll work equally for the OCL path in the ROCm stack.

The ROCm user bit drivers include ROCr and THUNK.
Taking docker as an example, those two user bit drivers should be included inside the docker container.
rock-dkms, on the other hand, includes the kernel driver (amdgpu) and device firmware; those must be installed on the bare metal.
THUNK and the amdgpu kernel driver talk via the /dev/dri and /dev/kfd interfaces; you can refer to the following docker run command for ROCm containers:
docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video

And to your last question, yes, that would work.
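
The command above passes all of /dev/dri into the container. To expose a single GPU instead, the flags can be built per GPU, as in this sketch. The helper name docker_gpu_flags is hypothetical, and the mapping from GPU index N to /dev/dri/renderD(128+N) is an assumption taken from the 4-GPU listing earlier in this thread; it may not hold on every system, so check `ls /dev/dri/render*` first.

```shell
#!/bin/sh
# Hypothetical helper: print the docker flags that expose one GPU.
# Assumption: GPU index N maps to /dev/dri/renderD(128+N), as in the
# 4-GPU listing earlier in this thread. Verify on your own machine.
docker_gpu_flags() {
    gpu_index="$1"
    render_node="/dev/dri/renderD$((128 + gpu_index))"
    printf '%s\n' "--device=/dev/kfd --device=$render_node --group-add video"
}

# Flags for the first GPU under the assumption above.
docker_gpu_flags 0
```

The output can then be spliced into the run command, e.g. `docker run -it --network=host $(docker_gpu_flags 1) <image>`, where `<image>` stands for whichever ROCm container image you use.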

@Necktwi

Necktwi commented Oct 28, 2022

@sunway513 but how do I confirm that my process is using the right device?

@sunway513
Contributor

Hi, you can open another terminal and watch the GPU activity using the following command:
watch -n 0.1 rocm-smi
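
rocm-smi shows which GPU is busy. As a complementary check, the sketch below prints the visibility variables a running process was actually started with. It is Linux-only, the helper name show_gpu_env is hypothetical, and /proc/<pid>/environ reflects the environment at launch, so it will not show changes the process made to its own environment afterwards.

```shell
#!/bin/sh
# Hypothetical helper: print the GPU-visibility variables a running
# process was started with (Linux only; reads /proc/<pid>/environ).
show_gpu_env() {
    pid="$1"
    # Entries in environ are NUL-separated; turn them into lines first,
    # then keep only the two variables discussed in this thread.
    tr '\0' '\n' < "/proc/$pid/environ" |
        grep -E '^(HIP|ROCR)_VISIBLE_DEVICES=' ||
        echo "no HIP/ROCR visibility variables set for PID $pid"
}

# Usage (replace 12345 with the PID of your application):
#   show_gpu_env 12345
```

Reading another user's process this way typically requires matching ownership or root.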

@Necktwi

Necktwi commented Nov 5, 2022

@sunway513, I have a Radeon WX4100 (node 1) and an MI100 (node 2). I set export HIP_VISIBLE_DEVICES=2; export ROCR_VISIBLE_DEVICES=2 and ran rocminfo, which shows only node 1 (WX4100), while I'm expecting node 2 (MI100). I tried all the values 0, 1, 2 and "0,1,2", but the MI100 card is never active except with "0,1,2".
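
One way to narrow down a mismatch like this is to probe one index at a time with only a single variable set, so that the HIP-level and ROCr-level filters cannot interact. That they interact when both are set is an assumption here, not something confirmed in this thread, and the helper name probe_rocr_index is hypothetical. The indices are taken to be zero-based, as in the `=0` "first GPU" examples above.

```shell
#!/bin/sh
# Hypothetical helper: run a command with only ROCR_VISIBLE_DEVICES set
# to the given index, and HIP_VISIBLE_DEVICES explicitly unset.
probe_rocr_index() {
    rocr_index="$1"
    shift
    echo "== ROCR_VISIBLE_DEVICES=$rocr_index =="
    (
        # The subshell keeps these changes away from the calling shell.
        unset HIP_VISIBLE_DEVICES
        ROCR_VISIBLE_DEVICES="$rocr_index"
        export ROCR_VISIBLE_DEVICES
        "$@"
    )
}

# Usage on a two-GPU machine (assumes zero-based indices):
#   for i in 0 1; do probe_rocr_index "$i" rocminfo; done
```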
