Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: oneAPI SYCL for AMD GPUs #3341

Merged
merged 4 commits into from
Jun 27, 2023

Conversation

nmnobre
Copy link
Contributor

@nmnobre nmnobre commented Jun 2, 2023

Hi @WeiqunZhang and @ax3l,

This PR demonstrates how we could use Intel's SYCL compiler to target AMD and Nvidia GPUs.
I understand this goes against AMReX's philosophy of using each vendor's software to target each vendor's hardware.
For that reason, I'm taking a proof-of-concept, here's-what-we-can-do approach, similarly to #3184, without making any changes to the officially supported compilation flows. The objective is to incite discussion, to lure your interest and for me to better understand if this is something you'd ultimately be interested in or not. :-)

Adding support for Nvidia GPUs is quite trivial, mostly because the support in DPC++ is more mature.
On the other hand, adding support for AMD GPUs requires avoiding std:: functions in device code, and using their sycl:: equivalents instead, which would imply reverting changes we made in the past.

Lastly, the public CI workflows don't seem to actually run any SYCL code, they are limited to compilation tests. It'd be interesting to execute this code on actual hardware to assess correctness and performance. I'm fairly confident as I've been testing SYCL with the Electromagnetic PIC tutorial for a while now on both AMD and Nvidia GPUs, but that doesn't stress the entirety of AMReX's code...

Let me know what you think :-)
-Nuno

CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS WITH THE
SOFTWARE.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below:

Can you download those huge installer files in .github/workflows/dependencies/codeplay/ on the fly?
If we plan to merge this, we would not commit 3MB files to AMReX' git repo, because it makes downloads for downstream users in superbuilds very slow :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, totally agree.
At the moment, you need to create a token, to which you associate a list of IPs, which then allows you to download on demand using curl/wget. Codeplay say we can contact them to lift the IP whitelist restriction to allow any client to download the plugins. See here. Unfortunately, these plugins are still quite tied to oneAPI releases, but they do tend to be released within one or two days of the respective oneAPI release.

Copy link
Member

@ax3l ax3l Jun 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. So you use the GitHub action IP range for this?

If you like, we can add a secret environment variable for this in the project that you can pick up in the scripts.

@ax3l
Copy link
Member

ax3l commented Jun 3, 2023

Hi @nmnobre,

Thank you for this, very exciting!

I would not say this goes against any of our philosophy :)
If we are able to support targeting the same platforms with different acceleration backends in a sustainable, then this adds to diversity of supported toolsets, more research tools, more performance tools, more resilience, enables apples-to-apples comparisons for backends for vendors, and overall more awesomeness.

@@ -20,7 +20,7 @@ echo 'Acquire::Retries "3";' | sudo tee /etc/apt/apt.conf.d/80-retries
# Ref.: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html#ubuntu
curl -O https://repo.radeon.com/rocm/rocm.gpg.key
sudo apt-key add rocm.gpg.key
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian/ ubuntu main' \
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/5.4.3/ ubuntu main' \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the contrary. The AMD plug-in only supports version 5.4.3. Using 5.5.x leads to some opaque pointers problems. Again, unfortunately these plug-ins are quite tied to specific versions of their dependencies... for the time being at least.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, we need have a new script or a new optional argument for installing an older version. This script has already been used for other jobs that are using the latest rocm release available (5.5 at the moment).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@ax3l
Copy link
Member

ax3l commented Jun 3, 2023

Why did you need to change all the std:: math to amrex::Math::? 😅

Is this the main change of this PR to make the compiler stack work? (Would be totally fine, imho.)

@nmnobre
Copy link
Contributor Author

nmnobre commented Jun 3, 2023

Why did you need to change all the std:: math to amrex::Math::? 😅

Is this the main change of this PR to make the compiler stack work? (Would be totally fine, imho.)

Because, unfortunately, loads of CXX standard library functions are unsupported on device code for the AMD/HIP backend. This is indeed the most disruptive change.

@WeiqunZhang
Copy link
Member

We have heard that a few AMReX applications have successfully run on A100 using SYCL without changes in amrex/Src/. So let's not worry about SYCL on AMD GPUs for now until the support for it is more mature. Once we figure out the plugin download issue, we can add a CI for oneAPI on Nvidia GPUs.

@nmnobre nmnobre marked this pull request as draft June 6, 2023 16:06
@nmnobre nmnobre changed the title [WIP/POC] CI: oneAPI SYCL for AMD and Nvidia GPUs CI: oneAPI SYCL for AMD GPUs Jun 15, 2023
@nmnobre
Copy link
Contributor Author

nmnobre commented Jun 15, 2023

Hi @WeiqunZhang,

Would this work for you for the time being?
I'm skipping all tests to avoid any linking and, thus, any problems with the CXX stdlib functions.

I've confirmed today I was fortunate the ElectromagneticPIC tutorial never uses any of the AMReX functionality calling into std functions within device code. That's why I never faced this problem...

I hear your concerns about the allowed subgroup sizes, I agree with them :)
But let's first see if you are happy with this test, before we proceed to that problem (in any case, we already do a runtime check, so we aren't totally in the dark).

Thank you!

@WeiqunZhang
Copy link
Member

Sounds good. Left a comment about the script installing rocm.

@WeiqunZhang WeiqunZhang merged commit 0236a37 into AMReX-Codes:development Jun 27, 2023
66 checks passed
ax3l pushed a commit that referenced this pull request Aug 12, 2023
Hi @ax3l, @WeiqunZhang,

I hope you're enjoying your time in Switzerland. :)

I've removed some redundancy around the nvcc dependencies files, mostly
for consistency with the changes I did to the hip dependencies file in
#3341. It should all be pretty self-explanatory.

Cheers,
-Nuno
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants