Offload one-body Jastrow ratio calculation for NLPP #3905
Conversation
Test this please
Can you please expand on your last comment? If the worst-case slowdown is 5-10%, what is the best-case speedup that you have seen? And on which GPU? I would imagine that the penalties will only be smaller on future GPUs and with better runtimes, faster hosts, etc. In other words, we should have this on as a default.
Positive values mean a win with J1 offload; negative values mean a loss.
Note that for such small percentage values, timer noise can be larger than the measured effect.
Q. Is this running fully async or with serialization?
LGTM (CI still needs to pass)
I think a sensible strategy is to offload everything "obvious" as a first pass and then optimize later. This falls into the obvious category for me, given the legacy CUDA code and the non-trivial work here for large batch sizes, electron counts, and ion counts.
I agree with this strategy. I'm pretty sure certain kernels can be optimized further, and thus we will benefit from this added offload code path. In the meantime, both the offload and non-offload code paths can run on the CPU, both can be selected via input, and both are covered by unit tests.
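To illustrate the dual-path arrangement described above, here is a minimal sketch (the names and entry point are hypothetical, not QMCPACK's actual API): both paths share one interface and are selected by a flag parsed from the input file, so unit tests can drive each one and compare results. With OpenMP offload, the target region also falls back to host execution when no device is present.

```cpp
// Hypothetical sketch of a runtime-selected offload/CPU code path.
// Names (sumCPU, sumOffload, use_offload) are illustrative only.
#include <vector>

double sumCPU(const std::vector<double>& v)
{
  double s = 0.0;
  for (double x : v)
    s += x;
  return s;
}

double sumOffload(const std::vector<double>& v)
{
  const double* p = v.data();
  const int n    = static_cast<int>(v.size());
  double s       = 0.0;
  // Runs on the device if one is available; otherwise the OpenMP
  // runtime executes the same region on the host.
  #pragma omp target teams distribute parallel for reduction(+ : s) \
      map(to : p[0:n])
  for (int i = 0; i < n; ++i)
    s += p[i];
  return s;
}

// Single entry point; the flag would come from the input file.
double sum(bool use_offload, const std::vector<double>& v)
{
  return use_offload ? sumOffload(v) : sumCPU(v);
}
```

Because both branches implement identical semantics, a unit test can evaluate each path on the same data and require agreement to a tolerance, which is how both code paths stay tested.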
Test this please |
Test this please |
@prckent I corrected a unit test so that it now covers the offload code path. Need another approval before merging.
Proposed changes
Enable J1 NLPP calculation on GPU.
It is not an always-win option: the pre-existing kernels and multi-threaded offload already keep the GPU quite busy, and moving more computation onto the GPU may lose.
Overall, the win/loss is within 5% of total walltime. I prefer keeping it on by default; as we further optimize kernels, this will win more.
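For readers unfamiliar with the computation being offloaded, the following is a self-contained sketch of its shape, not QMCPACK's actual implementation. It assumes a simple exponential J1 form u(r) = a*exp(-b*r), flattened electron-ion distance tables, and an illustrative sign convention for the ratio; all names and parameters are hypothetical. One OpenMP target kernel covers the whole batch of walkers times NLPP quadrature points, which is the non-trivial work referenced above for large batch, electron, and ion counts.

```cpp
// Sketch of batched J1 ratio evaluation for NLPP quadrature points
// via OpenMP target offload. Assumed form: J1 = exp(-sum_j u(r_j)),
// so ratio = exp(U_old - U_new). Parameters a, b are illustrative.
#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
  const int nw    = 4;   // walkers in the batch
  const int nknot = 12;  // NLPP quadrature points per electron
  const int nions = 8;   // ions contributing to J1
  const double a = 0.5, b = 1.2;  // hypothetical J1 parameters

  // dist_new[(iw*nknot + k)*nions + j]: electron-ion distances at the
  // quadrature points; dist_old[iw*nions + j]: at the current position.
  std::vector<double> dist_new(nw * nknot * nions, 1.0);
  std::vector<double> dist_old(nw * nions, 1.5);
  std::vector<double> ratios(nw * nknot);

  double* dn = dist_new.data();
  double* dl = dist_old.data();
  double* r  = ratios.data();

  // One kernel for the whole batch: each (walker, quadrature point)
  // pair reduces over ions independently.
  #pragma omp target teams distribute parallel for collapse(2) \
      map(to : dn[0:nw*nknot*nions], dl[0:nw*nions])           \
      map(from : r[0:nw*nknot])
  for (int iw = 0; iw < nw; ++iw)
    for (int k = 0; k < nknot; ++k)
    {
      double du = 0.0;  // U_old - U_new accumulated over ions
      for (int j = 0; j < nions; ++j)
        du += a * std::exp(-b * dl[iw * nions + j])
            - a * std::exp(-b * dn[(iw * nknot + k) * nions + j]);
      r[iw * nknot + k] = std::exp(du);  // J1 contribution to psi_new/psi_old
    }

  std::printf("ratio[0] = %g\n", r[0]);
  return 0;
}
```

The trade-off discussed in this PR follows from this structure: the kernel is small per element, so whether offloading wins depends on how busy the GPU already is with the pre-existing kernels.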
What type(s) of changes does this code introduce?
Does this introduce a breaking change?
What systems has this change been tested on?
epyc-server
Checklist