[Preview] Add QMC_GPU=no/NVIDIA/AMD/INTEL #3930

ye-luo · 2022-04-06T16:10:46Z

Do not merge! Soliciting feedback.

Proposed changes

The current set of GPU flags are not user-friendly. They are still technically good because they are well defined and orthogonal each other.
So last night I was thinking if we can introduce a user friendly user facing CMake option while keep using technical flags internally.

This list can be expanded to include SYCL as well in the future. Users will be restricted to choose one only.

What type(s) of changes does this code introduce?

New feature
Build related changes

Does this introduce a breaking change?

No

What systems has this change been tested on?

epyc-server

Checklist

Yes. This PR is up to date with current the current state of 'develop'

markdewing · 2022-04-06T16:32:18Z

I would think we would want to skip the *_legacy options, since we eventually want to get rid of those. They would still be available by setting the underlying flags, but they don't need to be advertised in this option.

We should clarify the difference between platform hardware and the programming model. CUDA is often used as synonymous with NVIDIA hardware, but the similarity of the HIP programming model complicates that in practice (the programming model for AMD GPU is basically looks like CUDA + translation to HIP + small amount of HIP-specific code)

A top-level flag such as this should refer to hardware (NVIDIA, AMD, INTEL), and then set the programming model flags (ENABLE_CUDA, etc).
(I don't know if we'll reach a point where a single executable can target multiple brands of GPU at run time? That could complicate a simple model of targeting one type of hardware at the top level flag)

markdewing · 2022-04-06T16:44:41Z

Would it be reasonable to say our portability strategy seems to be converging on:

OpenMP offload plus small amounts of "native" programming model code for interfacing with libraries (BLAS), higher performance kernels, and additional features.
The "native" models for the various platforms:

NVIDIA - CUDA
AMD GPU- HIP (translation of CUDA to HIP) and ROCM
INTEL GPU - Sycl

The reason for specifying that we choose a "native" model for each platform is that there are may projects to provide portability between these models and hardware (e.g., sycl can target NVIDIA and AMD, a project retarget CUDA to Level Zero, etc. These projects have varying levels of maturity and support). We are unlikely to want to use any of these (?)

ye-luo · 2022-04-06T16:50:42Z

I would think we would want to skip the *_legacy options, since we eventually want to get rid of those. They would still be available by setting the underlying flags, but they don't need to be advertised in this option.

I prefer avoid treating legacy special but I see your point of "not advising". So more opinions are well come.
The idea of QMC_GPU to treat more CMake flags internal and keep them away from most users. Expert can still directly access old options with a warning being printed.

We should clarify the difference between platform hardware and the programming model. CUDA is often used as synonymous with NVIDIA hardware, but the similarity of the HIP programming model complicates that in practice (the programming model for AMD GPU is basically looks like CUDA + translation to HIP + small amount of HIP-specific code)

I kind of think most users won't care such details and also it is not our job to educate all users nor they are willing to learn.

A top-level flag such as this should refer to hardware (NVIDIA, AMD, INTEL), and then set the programming model flags (ENABLE_CUDA, etc).

That is a very reasonable suggestion.

(I don't know if we'll reach a point where a single executable can target multiple brands of GPU at run time? That could complicate a simple model of targeting one type of hardware at the top level flag)

You can consider OpenMP runtime library itself a program and dynamically loads vendor APIs. It is not trivial and thus we won't do that.

ye-luo · 2022-04-06T21:24:46Z

Remove legacy GPU implementation access from QMC_GPU.
Now QMC_GPU=no/NVIDIA/AMD/INTEL

PDoakORNL · 2022-04-06T21:42:10Z

I think this makes sense.

And I think its probably better for development as well. Builds away from the "suggested" require more flags but their intent reads clearly. I assume that we can write the cmake such that QMC_GPU sets the suggested set of internal flags and they get replaced if they are explicitly set.

QMC_GPU=NVIDA ENABLE_OFFLOAD=OFF
or 
QMC_GPU=no ENABLE_OFFLOAD=on
or super outlier
QMC_GPU=no QMC_OMP=off

Do we just leave legacy as QMC_CUDA and whatever collection of flags you use to compile it with HIP?

ye-luo · 2022-04-06T21:54:57Z

I assume that we can write the cmake such that QMC_GPU sets the suggested set of internal flags and they get replaced if they are explicitly set.

I tend to avoid this. -DQMC_GPU=AMD -DQMC_CUDA2HIP=OFF the intention is unclear and I don't know the proper way of overriding. When QMC_GPU!=no is used, all the programming model options are forced by the "QMCPACK forced configuration choice"

If developers need to explicitly control flags for GPU programming models, they know what they are doing. So just don't set QMC_GPU and set programming models option they need directly. They will see a warning but they should know what they are doing.

If needed in the future, we can add NVIDIA_SYCL and NVIDIA_CUDA as fine-grain variant of NVIDIA.

PDoakORNL · 2022-04-06T23:28:39Z

I think QMC_CUDA2HIP is just a terrible flag and without reading the cmake again I couldn't tell you what parts of the native CUDA code it effects. I know exactly what I expect if I turn OFFLOAD on and off with a particular GPU setting.

I would suggest it not be referenced in the cmake except to set the internal flags and that it set them with the cache keyword so if they are specified on the command line they are not overwritten.

if(QMC_GPU match "NVIDIA)
set(ENABLE_CUDA true CACHE BOOL "")
set(ENABLE_OFFLOAD true CACHE BOOL "") etc.
...

It will do what you want for users and be convenient for developers as well.

ye-luo · 2022-04-06T23:54:06Z

I think QMC_CUDA2HIP is just a terrible flag and without reading the cmake again I couldn't tell you what parts of the native CUDA code it effects. I know exactly what I expect if I turn OFFLOAD on and off with a particular GPU setting.

I cannot remove QMC_CUDA2HIP cmake option if we need legacy CUDA converted HIP to keep working with -DQMC_CUDA=ON -DQMC_CUDA2HIP=ON.

I would suggest it not be referenced in the cmake except to set the internal flags and that it set them with the cache keyword so if they are specified on the command line they are not overwritten.
if(QMC_GPU match "NVIDIA)
set(ENABLE_CUDA true CACHE BOOL "")
set(ENABLE_OFFLOAD true CACHE BOOL "") etc.
...
It will do what you want for users and be convenient for developers as well.

for the code above, -DQMC_GPU=NVIDIA -DENABLE_CUDA=OFF ends up with ENABLE_CUDA being OFF.
In my setting,

if(QMC_GPU match "NVIDIA")
  set(ENABLE_CUDA ON)
endif()
option(ENABLE_CUDA OFF) // it becomes no-op.  QMC_GPU wins. ` -DENABLE_CUDA=OFF` is ignored.

To make developer friendly, I can clear incompatible values in the cache.

PDoakORNL

for the code above, -DQMC_GPU=NVIDIA -DENABLE_CUDA=OFF ends up with ENABLE_CUDA being OFF.
In my setting,

And that is exactly the behavior I want. The explicit option overrides the general. The enable flags remain cache variables that I can see in at least the advanced mode of ccmake.

The way it works now you set directly level variables for the ENABLE flags rudely remove cache variable ENABLE if I have them set. If you really can't deal I guess should can do set(... CACHE FORCE). But honestly a user should be doing a clean build so they should be just running cmake on a clean build directory and -DQMC_GPU=NVIDIA the way I suggested will be fine. They won't have ENABLE variables so it will be fine.

ye-luo · 2022-04-08T00:10:29Z

for the code above, -DQMC_GPU=NVIDIA -DENABLE_CUDA=OFF ends up with ENABLE_CUDA being OFF.

And that is exactly the behavior I want. The explicit option overrides the general. The enable flags remain cache variables that I can see in at least the advanced mode of ccmake.

I meant -DQMC_GPU=NVIDIA -DENABLE_CUDA=OFF doesn't make sense. Allowing this combo defeat the purpose of using QMC_GPU. Users/developers should either use indirect QMC_GPU options and hands off or being handy using direct options like ENABLE_CUDA. I cannot support mixed request. When user requested flags is not consistent with QMC_GPU predefined setting, I'd rather treating it as user error and stop CMake. I will try to force caching variables but cached variables are complex state machines, I need to test them very carefully.

PDoakORNL · 2022-04-08T14:07:12Z

Someone else should weigh in on this. It's an obscuring flag for development and troubleshooting since now the easy visibility of the ENABLE flags is removed. I think its best thought of as a way to set different defaults for the more specific build flags when you indicate a particular hardware.

Leave the freedom to set the more specific flags and not have them overwritten and/or disappeared.

Since users should be building once in a clean directory with only -QMC_GPU=<...> I don't see why the behavior I suggest is an issue.

prckent · 2022-04-08T14:24:07Z

I will eventually weigh in properly here.

Quick take: I think we should disambiguate between a single clean flag that the user specifies to build the code on the cmake line from what we do internally. Currently these seem to be conflated which, based on the discussion, is problematic. Obviously this is not how things are done today.

Add QMC_GPU=no/CUDA/CUDA_legacy/ROCM/ROCM_legacy

eaf2ec1

prckent marked this pull request as draft April 6, 2022 16:48

Change QMC_GPU to no/NVIDIA/AMD/INTEL.

9b8fe26

ye-luo changed the title ~~[Preview] Add QMC_GPU=no/CUDA/CUDA_legacy/ROCM/ROCM_legacy~~ [Preview] Add QMC_GPU=no/NVIDIA/AMD/INTEL Apr 6, 2022

Clear cached options.

9f312d2

prckent self-requested a review April 7, 2022 14:23

PDoakORNL requested changes Apr 7, 2022

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Preview] Add QMC_GPU=no/NVIDIA/AMD/INTEL #3930

[Preview] Add QMC_GPU=no/NVIDIA/AMD/INTEL #3930

ye-luo commented Apr 6, 2022

markdewing commented Apr 6, 2022

markdewing commented Apr 6, 2022

ye-luo commented Apr 6, 2022

ye-luo commented Apr 6, 2022

PDoakORNL commented Apr 6, 2022

ye-luo commented Apr 6, 2022 •

edited

Loading

PDoakORNL commented Apr 6, 2022

ye-luo commented Apr 6, 2022

PDoakORNL left a comment

ye-luo commented Apr 8, 2022

PDoakORNL commented Apr 8, 2022

prckent commented Apr 8, 2022

[Preview] Add QMC_GPU=no/NVIDIA/AMD/INTEL #3930

Are you sure you want to change the base?

[Preview] Add QMC_GPU=no/NVIDIA/AMD/INTEL #3930

Conversation

ye-luo commented Apr 6, 2022

Proposed changes

What type(s) of changes does this code introduce?

Does this introduce a breaking change?

What systems has this change been tested on?

Checklist

markdewing commented Apr 6, 2022

markdewing commented Apr 6, 2022

ye-luo commented Apr 6, 2022

ye-luo commented Apr 6, 2022

PDoakORNL commented Apr 6, 2022

ye-luo commented Apr 6, 2022 • edited Loading

PDoakORNL commented Apr 6, 2022

ye-luo commented Apr 6, 2022

PDoakORNL left a comment

Choose a reason for hiding this comment

ye-luo commented Apr 8, 2022

PDoakORNL commented Apr 8, 2022

prckent commented Apr 8, 2022

ye-luo commented Apr 6, 2022 •

edited

Loading