Offload one-body Jastrow ratio calculation for NLPP #3905
Conversation
Test this please
Can you please expand on your last comment? If the worst-case slowdown is 5-10%, what is the best-case speedup that you have seen? And on which GPU? I would imagine that the penalties will only be smaller on future GPUs and with better runtimes, faster hosts, etc. In other words, we should have this on as a default.
Positive values mean a win with J1 offload; negative values mean a loss.
Note that for such small percentage values, timer noise can be larger than the measured effect.
Q. Is this running fully async or with serialization?
LGTM (CI still needs to pass)
I think a sensible strategy is to offload everything "obvious" as a first pass and then optimize later. This falls into the obvious category for me, given the legacy CUDA code and the non-trivial work here for large batch sizes, electron counts, and ion counts.
I agree with this strategy. I'm pretty sure certain kernels can be optimized further, and thus we will benefit from this added offload code path. In the meantime, both the offload and non-offload code paths can run on the CPU, both can be selected via input, and both are covered by unit tests.
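To illustrate the dual-path arrangement described above, here is a minimal sketch (the names and entry point are hypothetical, not QMCPACK's actual API): both paths share one interface and are selected by a flag parsed from the input file, so unit tests can drive each one and compare results. With OpenMP offload, the target region also falls back to host execution when no device is present.

```cpp
// Hypothetical sketch of a runtime-selected offload/CPU code path.
// Names (sumCPU, sumOffload, use_offload) are illustrative only.
#include <vector>

double sumCPU(const std::vector<double>& v)
{
  double s = 0.0;
  for (double x : v)
    s += x;
  return s;
}

double sumOffload(const std::vector<double>& v)
{
  const double* p = v.data();
  const int n    = static_cast<int>(v.size());
  double s       = 0.0;
  // Runs on the device if one is available; otherwise the OpenMP
  // runtime executes the same region on the host.
  #pragma omp target teams distribute parallel for reduction(+ : s) \
      map(to : p[0:n])
  for (int i = 0; i < n; ++i)
    s += p[i];
  return s;
}

// Single entry point; the flag would come from the input file.
double sum(bool use_offload, const std::vector<double>& v)
{
  return use_offload ? sumOffload(v) : sumCPU(v);
}
```

Because both branches implement identical semantics, a unit test can evaluate each path on the same data and require agreement to a tolerance, which is how both code paths stay tested.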
Test this please |
Test this please |
@prckent I corrected a unit test so that it now covers the offload code path. Need another approval before merging.
Proposed changes
Enable J1 NLPP calculation on GPU.
It is not an always-win option: the pre-existing kernels and multi-threaded offload already keep the GPU quite busy, and moving more computation onto the GPU may lose.
Overall, the win/loss is within 5% of total walltime. I prefer keeping it on by default; as we further optimize kernels, this will win more.
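For readers unfamiliar with the computation being offloaded, the following is a self-contained sketch of its shape, not QMCPACK's actual implementation. It assumes a simple exponential J1 form u(r) = a*exp(-b*r), flattened electron-ion distance tables, and an illustrative sign convention for the ratio; all names and parameters are hypothetical. One OpenMP target kernel covers the whole batch of walkers times NLPP quadrature points, which is the non-trivial work referenced above for large batch, electron, and ion counts.

```cpp
// Sketch of batched J1 ratio evaluation for NLPP quadrature points
// via OpenMP target offload. Assumed form: J1 = exp(-sum_j u(r_j)),
// so ratio = exp(U_old - U_new). Parameters a, b are illustrative.
#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
  const int nw    = 4;   // walkers in the batch
  const int nknot = 12;  // NLPP quadrature points per electron
  const int nions = 8;   // ions contributing to J1
  const double a = 0.5, b = 1.2;  // hypothetical J1 parameters

  // dist_new[(iw*nknot + k)*nions + j]: electron-ion distances at the
  // quadrature points; dist_old[iw*nions + j]: at the current position.
  std::vector<double> dist_new(nw * nknot * nions, 1.0);
  std::vector<double> dist_old(nw * nions, 1.5);
  std::vector<double> ratios(nw * nknot);

  double* dn = dist_new.data();
  double* dl = dist_old.data();
  double* r  = ratios.data();

  // One kernel for the whole batch: each (walker, quadrature point)
  // pair reduces over ions independently.
  #pragma omp target teams distribute parallel for collapse(2) \
      map(to : dn[0:nw*nknot*nions], dl[0:nw*nions])           \
      map(from : r[0:nw*nknot])
  for (int iw = 0; iw < nw; ++iw)
    for (int k = 0; k < nknot; ++k)
    {
      double du = 0.0;  // U_old - U_new accumulated over ions
      for (int j = 0; j < nions; ++j)
        du += a * std::exp(-b * dl[iw * nions + j])
            - a * std::exp(-b * dn[(iw * nknot + k) * nions + j]);
      r[iw * nknot + k] = std::exp(du);  // J1 contribution to psi_new/psi_old
    }

  std::printf("ratio[0] = %g\n", r[0]);
  return 0;
}
```

The trade-off discussed in this PR follows from this structure: the kernel is small per element, so whether offloading wins depends on how busy the GPU already is with the pre-existing kernels.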
What type(s) of changes does this code introduce?
Does this introduce a breaking change?
What systems has this change been tested on?
epyc-server
Checklist