Offload one-body Jastrow ratio calculation for NLPP #3905

Merged 11 commits into QMCPACK:develop on Mar 15, 2022

Conversation

@ye-luo (Contributor) commented Mar 15, 2022

Proposed changes

Enable J1 NLPP calculation on the GPU.
It is not always a win: pre-existing kernels and multi-threaded offload already keep the GPU quite busy, and moving more computation onto the GPU may lose.
Overall, the win/loss is within 5% of total walltime. I prefer keeping it on by default. As we further optimize kernels, this will win more.
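
For readers outside the batched-driver work, here is a minimal sketch of the general pattern this PR uses: evaluating one-body Jastrow ratios at the NLPP quadrature points of many walkers inside an OpenMP target region. All names, the memory layout, and the pair function u(r) are illustrative assumptions, not QMCPACK's actual implementation.

```cpp
// Minimal sketch only: the structure of an OpenMP-offloaded, batched
// one-body Jastrow ratio evaluation at NLPP quadrature points.
// Names, memory layout, and u(r) are hypothetical, not QMCPACK's real code.
#include <cmath>

#pragma omp declare target
// hypothetical short-ranged pair function u(r), for illustration only
inline double u_of_r(double r) { return std::exp(-r) / (1.0 + r); }
#pragma omp end declare target

// ratios[w * num_quad + q] = J1 contribution to psi(trial)/psi(current),
// assuming the convention that J1 enters the wavefunction as exp(-U).
void computeJ1RatiosNLPP(int num_walkers, int num_quad, int num_ions,
                         const double* quad_xyz, // [num_walkers][num_quad][3]
                         const double* ion_xyz,  // [num_ions][3]
                         const double* u_old,    // per-walker current U of the moved electron
                         double* ratios)         // [num_walkers][num_quad]
{
#pragma omp target teams distribute parallel for collapse(2)    \
    map(to : quad_xyz[0 : num_walkers * num_quad * 3],          \
            ion_xyz[0 : num_ions * 3], u_old[0 : num_walkers])  \
    map(from : ratios[0 : num_walkers * num_quad])
  for (int w = 0; w < num_walkers; ++w)
    for (int q = 0; q < num_quad; ++q)
    {
      const double* p = quad_xyz + (w * num_quad + q) * 3;
      double u_new    = 0.0;
      for (int i = 0; i < num_ions; ++i) // sum u over all ions
      {
        const double dx = p[0] - ion_xyz[i * 3 + 0];
        const double dy = p[1] - ion_xyz[i * 3 + 1];
        const double dz = p[2] - ion_xyz[i * 3 + 2];
        u_new += u_of_r(std::sqrt(dx * dx + dy * dy + dz * dz));
      }
      ratios[w * num_quad + q] = std::exp(u_old[w] - u_new);
    }
}
```

A kernel like this competes with work already resident on the device, which is why the change is not an unconditional win: when the GPU is saturated, the same loop on the host can be cheaper.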

What type(s) of changes does this code introduce?

  • New feature
  • Testing changes (e.g. new unit/integration/performance tests)

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

epyc-server

Checklist

  • Yes. This PR is up to date with the current state of 'develop'

@ye-luo (Contributor, Author) commented Mar 15, 2022

Test this please

@prckent (Contributor) commented Mar 15, 2022

Can you please expand on your last comment? If the worst-case slowdown is 5-10%, what is the best-case speedup that you have seen? And on which GPU? I would imagine that the penalties will only get smaller on future GPUs and with better runtimes, faster hosts, etc. That is, we should have this on by default.

@ye-luo (Contributor, Author) commented Mar 15, 2022

Positive means a win with J1 offload; negative means a loss.
NiO benchmarks running on Summit:

a16   -1.6%
a32   -1.1%
a64    1.8%
a128   0.5%
a256  -2.5%

Note that at such small percentages, timer noise can be larger than the signal.
When the GPU is fully busy, running on the host can be faster, so it is tricky to make a blanket decision.

@prckent (Contributor) commented Mar 15, 2022

Q. Is this running fully async or with serialization?

@prckent self-requested a review March 15, 2022 17:40
prckent previously approved these changes Mar 15, 2022

@prckent (Contributor) left a comment

LGTM (CI still needs to pass)

I think a sensible strategy is to offload everything "obvious" as a first pass and then optimize later. This falls in the obvious category for me, given the legacy CUDA code and the non-trivial work involved at large batch sizes and electron/ion counts.

@ye-luo (Contributor, Author) commented Mar 15, 2022

> LGTM (CI still needs to pass)
>
> I think a sensible strategy is to offload everything "obvious" as a first pass and then optimize later. This falls in the obvious category for me, given the legacy CUDA code and the non-trivial work involved at large batch sizes and electron/ion counts.

I agree with this strategy. I'm pretty sure certain kernels can be optimized further (a computer-engineering exercise), and thus we will benefit from this added offload code path. In the meantime, both the offload and the non-offload code paths can run on the CPU, both can be selected via the input, and both are covered by unit tests.
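
A minimal Catch2-style sketch of the kind of cross-check such a unit test can make, reusing the illustrative kernel sketched in the PR description above (so the names are equally hypothetical, not QMCPACK's real API): run the offload kernel, recompute the same quantity in a plain host loop, and require agreement.

```cpp
// Sketch only: cross-check the offload kernel against a plain host loop.
// computeJ1RatiosNLPP and u_of_r are the hypothetical names from the
// sketch in the PR description, not QMCPACK's real API.
#include <cmath>
#include <vector>
#include "catch.hpp" // Catch2, which QMCPACK's unit tests use

TEST_CASE("J1 NLPP ratios: offload path matches host path", "[jastrow]")
{
  const int nw = 2, nq = 4, ni = 3; // walkers, quadrature points, ions
  std::vector<double> quad(nw * nq * 3), ions(ni * 3), uold(nw, 0.5);
  for (std::size_t k = 0; k < quad.size(); ++k) quad[k] = 0.1 * k; // arbitrary points
  for (std::size_t k = 0; k < ions.size(); ++k) ions[k] = 0.3 * k; // arbitrary ions

  // offload code path
  std::vector<double> ratios_dev(nw * nq);
  computeJ1RatiosNLPP(nw, nq, ni, quad.data(), ions.data(), uold.data(),
                      ratios_dev.data());

  // host reference: the same math in an ordinary loop
  for (int w = 0; w < nw; ++w)
    for (int q = 0; q < nq; ++q)
    {
      const double* p = &quad[(w * nq + q) * 3];
      double u_new = 0.0;
      for (int i = 0; i < ni; ++i)
      {
        const double dx = p[0] - ions[i * 3 + 0];
        const double dy = p[1] - ions[i * 3 + 1];
        const double dz = p[2] - ions[i * 3 + 2];
        u_new += u_of_r(std::sqrt(dx * dx + dy * dy + dz * dz));
      }
      CHECK(ratios_dev[w * nq + q] ==
            Approx(std::exp(uold[w] - u_new)).epsilon(1e-12));
    }
}
```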

@ye-luo (Contributor, Author) commented Mar 15, 2022

Test this please

@ye-luo (Contributor, Author) commented Mar 15, 2022

Test this please

@ye-luo (Contributor, Author) commented Mar 15, 2022

@prckent I corrected a unit test so that it now covers the offload code path. Need another approval before merging.

@prckent prckent merged commit 3a78e66 into QMCPACK:develop Mar 15, 2022
@ye-luo ye-luo deleted the J1-offload branch March 15, 2022 21:35