Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PCIe hotplug support for amdgpu #131

Open
wants to merge 2 commits into
base: roc-5.0.x
Choose a base branch
from

Conversation

xushuotao
Copy link

@xushuotao xushuotao commented Apr 5, 2022

  1. during probing, decrement KFD lock which were incremented when device was removed; otherwise kfd_open is going to fail.
  2. remove p2p/io links in sysfs when device is hotplugged out.

@fxkuehl
Copy link

@fxkuehl fxkuehl commented Apr 5, 2022

Thank you for the patch. I haven't reviewed it in detail yet, but it seems like a simple enough change that would be worth including in our upstream driver.

We do most of our public code reviews by mailing list on amd-gfx@lists.freedesktop.org for changes to the upstream driver. They should be against the amd-staging-drm-next branch that you can get from https://gitlab.freedesktop.org/agd5f/linux.git.

Please run your patch through checkpatch.pl for formal coding style compliance. Thanks!

@fxkamd
Copy link
Contributor

@fxkamd fxkamd commented Apr 5, 2022

Thank you for the patch. I haven't reviewed it in detail yet, but it seems like a simple enough change that would be worth including in our upstream driver.

We do most of our public code reviews by mailing list on amd-gfx@lists.freedesktop.org for changes to the upstream driver. They should be against the amd-staging-drm-next branch that you can get from https://gitlab.freedesktop.org/agd5f/linux.git.

Please run your patch through checkpatch.pl for formal coding style compliance. Thanks!

I was signed in with the wrong account. I should have replied with this one.

@xushuotao xushuotao force-pushed the msft/hotplug branch 2 times, most recently from 2428412 to 6e5f7b2 Compare Apr 6, 2022
@xushuotao
Copy link
Author

@xushuotao xushuotao commented Apr 6, 2022

@fxkamd Thank you very much for your quick review. I have

  1. check the coding style compliance and made sure it passed the checkpatch.pl
  2. Created a patch against amd-staging-drm-next from https://gitlab.freedesktop.org/agd5f/linux.git; and emailed the patch to amd-gfx@lists.freedesktop.org also with the link to this PR request.

Thank you very much!
Best regards,
Shuotao

Thank you for the patch. I haven't reviewed it in detail yet, but it seems like a simple enough change that would be worth including in our upstream driver.
We do most of our public code reviews by mailing list on amd-gfx@lists.freedesktop.org for changes to the upstream driver. They should be against the amd-staging-drm-next branch that you can get from https://gitlab.freedesktop.org/agd5f/linux.git.
Please run your patch through checkpatch.pl for formal coding style compliance. Thanks!

I was signed in with the wrong account. I should have replied with this one.

@fxkamd
Copy link
Contributor

@fxkamd fxkamd commented Apr 6, 2022

Thank you. The p2plink stuff is not upstream yet. So we'll need to deal with that on our DKMS branch after your upstream changes land.

The preferred way to send patches to amd-gfx is with "git send-email".

@xushuotao
Copy link
Author

@xushuotao xushuotao commented Apr 7, 2022

@fxkamd I have pushed new code that can deal with multi-GPU system per Andrey and Joshi's comment.

  1. I have tested on our 4-node MI100 GPU machine and it works fine with tensorflow: you can freely add and remove GPU nodes, and the rocm stack seems to be working.
  2. Unfortunately the patch upstream (5.16.0-kfd) is unstable out of the box for a multi-GPU job on MI100 system. I have patched that version, and at least rocminfo shows correct info in that. I will send the new patch for review soon.

@fxkamd
Copy link
Contributor

@fxkamd fxkamd commented Apr 7, 2022

OK. We realized that Mukul was working on basically the same issue for a different use case. He will clean up his code and post it for review soon. I believe his patch will be more general. Maybe you can give it a try when he posts it, to see if it addresses your use case.

Currently, the IO-links to the device being removed from topology,
are not cleared. As a result, there would be dangling links left in
the KFD topology. This patch aims to fix the following:
1. Cleanup all IO links to the device being removed.
   [note by shuotao]: Also cleaned up p2plinks for
   backward-compatiblity for rocm-5.0.x; this shall be squashed in the
   upstream of amd-staging-drm-next.
2. Ensure that node numbering in sysfs and nodes proximity domain
   values are consistent after the device is removed:
   a. Adding a device and removing a GPU device are made mutually
      exclusive.
   b. The global proximity domain counter is no longer required to be
      an atomic counter. A normal 32-bit counter can be used instead.
3. Update generation_count to let user-mode know that topology has
   changed due to device removal.

Reviewed-by: Shuotao Xu <shuotaoxu@microsoft.com>
CC: Shuotao Xu <shuotaoxu@microsoft.com>
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
@xushuotao
Copy link
Author

@xushuotao xushuotao commented Apr 8, 2022

OK. We realized that Mukul was working on basically the same issue for a different use case. He will clean up his code and post it for review soon. I believe his patch will be more general. Maybe you can give it a try when he posts it, to see if it addresses your use case.

  1. I tried Mukul's patch and it worked. (In order to work with rocm-5.0.x, I will need to move back the p2p link cleanup which would be deprecated in the upstream).
  2. I also added another patch to deal with unbalanced kfd_locked upon pcie removal. Maybe someone at AMD could take a look

With those two patches, the system worked fine on 4-node MI100 GPU machine for hotplug capabilities with tf.

This PR now has two overall commits that reflect those two patches.

I have also submitted patch to amd-gfx@lists.freedesktop.org.

Thank you very much for all the support!

Best regards,
Shuotao

Adding PCIe Hotplug Support for AMDKFD: the support of hot-plug of GPU
devices can open doors for many advanced applications in data center
in the next few years, such as for GPU resource
disaggregation. Current AMDKFD does not support hotplug out b/o the
following reasons:

1. During PCIe removal, unbalanced kfd_locked increment during
   kgd2kfd_suspend. Fixed it by only incrementing the lock if
   drm_device is not unplugged.

2. Remove redudant p2p/io links in sysfs when device is hotplugged
   out.

3. New kfd node_id is not properly assigned after a new device is
   added after a gpu is hotplugged out in a system. libhsakmt will
   find this anomaly, (i.e. node_from != <dev node id> in iolinks),
   when taking a topology_snapshot, thus returns fault to the rocm
   stack.

-- This patch fixes issue 1; another patch by Mukul fixes issues 2&3.
-- Tested on a 4-GPU MI100 gpu nodes with kernel
   5.13.0-kfd/rocm-5.0.2; kernel 5.16.0-kfd is unstable out-of-box for
   MI100.

Signed-off-by: Shuotao Xu <shuotaoxu@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants