Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Preview] Add QMC_GPU=no/NVIDIA/AMD/INTEL #3930

Draft
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

ye-luo
Copy link
Contributor

@ye-luo ye-luo commented Apr 6, 2022

Do not merge! Soliciting feedback.

Proposed changes

The current set of GPU flags are not user-friendly. They are still technically good because they are well defined and orthogonal each other.
So last night I was thinking if we can introduce a user friendly user facing CMake option while keep using technical flags internally.

This list can be expanded to include SYCL as well in the future. Users will be restricted to choose one only.

What type(s) of changes does this code introduce?

  • New feature
  • Build related changes

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

epyc-server

Checklist

  • Yes. This PR is up to date with current the current state of 'develop'

@markdewing
Copy link
Contributor

I would think we would want to skip the *_legacy options, since we eventually want to get rid of those. They would still be available by setting the underlying flags, but they don't need to be advertised in this option.

We should clarify the difference between platform hardware and the programming model. CUDA is often used as synonymous with NVIDIA hardware, but the similarity of the HIP programming model complicates that in practice (the programming model for AMD GPU is basically looks like CUDA + translation to HIP + small amount of HIP-specific code)

A top-level flag such as this should refer to hardware (NVIDIA, AMD, INTEL), and then set the programming model flags (ENABLE_CUDA, etc).
(I don't know if we'll reach a point where a single executable can target multiple brands of GPU at run time? That could complicate a simple model of targeting one type of hardware at the top level flag)

@markdewing
Copy link
Contributor

Would it be reasonable to say our portability strategy seems to be converging on:

OpenMP offload plus small amounts of "native" programming model code for interfacing with libraries (BLAS), higher performance kernels, and additional features.
The "native" models for the various platforms:

  • NVIDIA - CUDA
  • AMD GPU- HIP (translation of CUDA to HIP) and ROCM
  • INTEL GPU - Sycl

The reason for specifying that we choose a "native" model for each platform is that there are may projects to provide portability between these models and hardware (e.g., sycl can target NVIDIA and AMD, a project retarget CUDA to Level Zero, etc. These projects have varying levels of maturity and support). We are unlikely to want to use any of these (?)

@prckent prckent marked this pull request as draft April 6, 2022 16:48
@ye-luo
Copy link
Contributor Author

ye-luo commented Apr 6, 2022

I would think we would want to skip the *_legacy options, since we eventually want to get rid of those. They would still be available by setting the underlying flags, but they don't need to be advertised in this option.

I prefer avoid treating legacy special but I see your point of "not advising". So more opinions are well come.
The idea of QMC_GPU to treat more CMake flags internal and keep them away from most users. Expert can still directly access old options with a warning being printed.

We should clarify the difference between platform hardware and the programming model. CUDA is often used as synonymous with NVIDIA hardware, but the similarity of the HIP programming model complicates that in practice (the programming model for AMD GPU is basically looks like CUDA + translation to HIP + small amount of HIP-specific code)

I kind of think most users won't care such details and also it is not our job to educate all users nor they are willing to learn.

A top-level flag such as this should refer to hardware (NVIDIA, AMD, INTEL), and then set the programming model flags (ENABLE_CUDA, etc).

That is a very reasonable suggestion.

(I don't know if we'll reach a point where a single executable can target multiple brands of GPU at run time? That could complicate a simple model of targeting one type of hardware at the top level flag)

You can consider OpenMP runtime library itself a program and dynamically loads vendor APIs. It is not trivial and thus we won't do that.

@ye-luo ye-luo changed the title [Preview] Add QMC_GPU=no/CUDA/CUDA_legacy/ROCM/ROCM_legacy [Preview] Add QMC_GPU=no/NVIDIA/AMD/INTEL Apr 6, 2022
@ye-luo
Copy link
Contributor Author

ye-luo commented Apr 6, 2022

Remove legacy GPU implementation access from QMC_GPU.
Now QMC_GPU=no/NVIDIA/AMD/INTEL

@PDoakORNL
Copy link
Contributor

I think this makes sense.

And I think its probably better for development as well. Builds away from the "suggested" require more flags but their intent reads clearly. I assume that we can write the cmake such that QMC_GPU sets the suggested set of internal flags and they get replaced if they are explicitly set.

QMC_GPU=NVIDA ENABLE_OFFLOAD=OFF
or 
QMC_GPU=no ENABLE_OFFLOAD=on
or super outlier
QMC_GPU=no QMC_OMP=off

Do we just leave legacy as QMC_CUDA and whatever collection of flags you use to compile it with HIP?

@ye-luo
Copy link
Contributor Author

ye-luo commented Apr 6, 2022

I assume that we can write the cmake such that QMC_GPU sets the suggested set of internal flags and they get replaced if they are explicitly set.

I tend to avoid this. -DQMC_GPU=AMD -DQMC_CUDA2HIP=OFF the intention is unclear and I don't know the proper way of overriding. When QMC_GPU!=no is used, all the programming model options are forced by the "QMCPACK forced configuration choice"

If developers need to explicitly control flags for GPU programming models, they know what they are doing. So just don't set QMC_GPU and set programming models option they need directly. They will see a warning but they should know what they are doing.

If needed in the future, we can add NVIDIA_SYCL and NVIDIA_CUDA as fine-grain variant of NVIDIA.

@PDoakORNL
Copy link
Contributor

I think QMC_CUDA2HIP is just a terrible flag and without reading the cmake again I couldn't tell you what parts of the native CUDA code it effects. I know exactly what I expect if I turn OFFLOAD on and off with a particular GPU setting.

I would suggest it not be referenced in the cmake except to set the internal flags and that it set them with the cache keyword so if they are specified on the command line they are not overwritten.

if(QMC_GPU match "NVIDIA)
set(ENABLE_CUDA true CACHE BOOL "")
set(ENABLE_OFFLOAD true CACHE BOOL "") etc.
...

It will do what you want for users and be convenient for developers as well.

@ye-luo
Copy link
Contributor Author

ye-luo commented Apr 6, 2022

I think QMC_CUDA2HIP is just a terrible flag and without reading the cmake again I couldn't tell you what parts of the native CUDA code it effects. I know exactly what I expect if I turn OFFLOAD on and off with a particular GPU setting.

I cannot remove QMC_CUDA2HIP cmake option if we need legacy CUDA converted HIP to keep working with -DQMC_CUDA=ON -DQMC_CUDA2HIP=ON.

I would suggest it not be referenced in the cmake except to set the internal flags and that it set them with the cache keyword so if they are specified on the command line they are not overwritten.

if(QMC_GPU match "NVIDIA)
set(ENABLE_CUDA true CACHE BOOL "")
set(ENABLE_OFFLOAD true CACHE BOOL "") etc.
...

It will do what you want for users and be convenient for developers as well.

for the code above, -DQMC_GPU=NVIDIA -DENABLE_CUDA=OFF ends up with ENABLE_CUDA being OFF.
In my setting,

if(QMC_GPU match "NVIDIA")
  set(ENABLE_CUDA ON)
endif()
option(ENABLE_CUDA OFF) // it becomes no-op.  QMC_GPU wins. ` -DENABLE_CUDA=OFF` is ignored.

To make developer friendly, I can clear incompatible values in the cache.

@prckent prckent self-requested a review April 7, 2022 14:23
Copy link
Contributor

@PDoakORNL PDoakORNL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the code above, -DQMC_GPU=NVIDIA -DENABLE_CUDA=OFF ends up with ENABLE_CUDA being OFF.
In my setting,

And that is exactly the behavior I want. The explicit option overrides the general. The enable flags remain cache variables that I can see in at least the advanced mode of ccmake.

The way it works now you set directly level variables for the ENABLE flags rudely remove cache variable ENABLE if I have them set. If you really can't deal I guess should can do set(... CACHE FORCE). But honestly a user should be doing a clean build so they should be just running cmake on a clean build directory and -DQMC_GPU=NVIDIA the way I suggested will be fine. They won't have ENABLE variables so it will be fine.

@ye-luo
Copy link
Contributor Author

ye-luo commented Apr 8, 2022

for the code above, -DQMC_GPU=NVIDIA -DENABLE_CUDA=OFF ends up with ENABLE_CUDA being OFF.

And that is exactly the behavior I want. The explicit option overrides the general. The enable flags remain cache variables that I can see in at least the advanced mode of ccmake.

I meant -DQMC_GPU=NVIDIA -DENABLE_CUDA=OFF doesn't make sense. Allowing this combo defeat the purpose of using QMC_GPU. Users/developers should either use indirect QMC_GPU options and hands off or being handy using direct options like ENABLE_CUDA. I cannot support mixed request. When user requested flags is not consistent with QMC_GPU predefined setting, I'd rather treating it as user error and stop CMake. I will try to force caching variables but cached variables are complex state machines, I need to test them very carefully.

@PDoakORNL
Copy link
Contributor

Someone else should weigh in on this. It's an obscuring flag for development and troubleshooting since now the easy visibility of the ENABLE flags is removed. I think its best thought of as a way to set different defaults for the more specific build flags when you indicate a particular hardware.

Leave the freedom to set the more specific flags and not have them overwritten and/or disappeared.

Since users should be building once in a clean directory with only -QMC_GPU=<...> I don't see why the behavior I suggest is an issue.

@prckent
Copy link
Contributor

prckent commented Apr 8, 2022

I will eventually weigh in properly here.

Quick take: I think we should disambiguate between a single clean flag that the user specifies to build the code on the cmake line from what we do internally. Currently these seem to be conflated which, based on the discussion, is problematic. Obviously this is not how things are done today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants