
llvm/{cuda,}: Reduce and optimize data copies in model execution #2311

Merged — jvesely merged 7 commits into PrincetonUniversity:devel on Feb 7, 2022

Conversation

jvesely (Collaborator) commented on Feb 7, 2022

Do not create a copy of parameters if there are no parameter states.
Add a memcopy helper to work around heavy register usage caused by the store(load(src), dst) idiom.
Upload base arguments to on-chip 'shared' memory for GPU evaluate (optional, default ON).
Add human-readable names to alloca instructions.
Drop unused 'evaluate' kernel wrapper code.
Drop 'cuda_data' debug switch.

Improves performance ~2.5x for both CPU and GPU execution (measured using the predator-prey model with 101 levels of attention).

…e's no modulation

The compiler is not able to eliminate these.

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
…omposition functions

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
This has not been used since GPU accelerated evaluations
were switched to range_evaluate.

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
A 4B-aligned version performs better on GPUs.
An llvm.memcpy fallback is available via PNL_LLVM_DEBUG=unaligned_copy.

Both variants are much more register-efficient (and usually faster) than
store(load(...)).

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
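The real helper emits an LLVM IR copy loop; as a rough Python model of the strategy (names and the semicolon-separated debug-switch format are illustrative assumptions, not the project's actual API), copying in 4-byte units instead of element-by-element looks like:

```python
def memcopy(dst: bytearray, src: bytes, debug_switches: str = "") -> None:
    """Sketch: prefer a 4B-aligned copy; fall back to a plain byte copy
    (standing in for the llvm.memcpy intrinsic) when 'unaligned_copy'
    is requested or the size is not a multiple of 4."""
    assert len(dst) == len(src)
    if len(src) % 4 == 0 and "unaligned_copy" not in debug_switches.split(";"):
        # 4B path: copy in 32-bit chunks, mirroring the aligned GPU-friendly loop
        memoryview(dst).cast("I")[:] = memoryview(src).cast("I")
    else:
        # fallback path: byte-wise copy
        dst[:] = src

src = bytes(range(16))
dst = bytearray(16)
memcopy(dst, src)
assert bytes(dst) == src
```

Either path replaces the per-field store(load(...)) pattern that forced the backend to hold every loaded value in registers at once.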
Move all base params into on-chip shared memory for GPU grid evaluate.
Use PNL_LLVM_DEBUG='cuda_no_shared' to opt out.

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
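A minimal sketch of the opt-out logic described above, assuming a hypothetical predicate name, a typical 48 KiB per-block shared-memory limit, and semicolon-separated debug switches (none of these specifics are confirmed by the source):

```python
SHARED_MEM_PER_BLOCK = 48 * 1024  # common per-block shared-memory limit on NVIDIA GPUs

def place_in_shared(params_size: int, debug_switches: str = "") -> bool:
    """Hypothetical predicate: stage base arguments in on-chip shared
    memory by default, unless the user opted out via 'cuda_no_shared'
    or the arguments do not fit in the per-block budget."""
    if "cuda_no_shared" in debug_switches.split(";"):
        return False
    return params_size <= SHARED_MEM_PER_BLOCK

assert place_in_shared(1024) is True
assert place_in_shared(1024, "cuda_no_shared") is False
assert place_in_shared(64 * 1024) is False
```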
Prefer large blocks if the kernel uses shared memory.
Prefer warp sized blocks otherwise.
Print kernel statistics and the selected block size with
PNL_LLVM_DEBUG=stat.

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
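The selection policy can be sketched as follows (function name and constants are illustrative; 32 is the NVIDIA warp size and 1024 a common per-block thread limit):

```python
WARP_SIZE = 32          # threads per warp on NVIDIA GPUs
MAX_BLOCK_SIZE = 1024   # common per-block thread limit

def select_block_size(uses_shared_mem: bool, debug_switches: str = "") -> int:
    """Sketch of the policy from this commit: large blocks amortize the
    one-time staging of arguments into shared memory across more
    threads; without shared memory, warp-sized blocks suffice."""
    block = MAX_BLOCK_SIZE if uses_shared_mem else WARP_SIZE
    if "stat" in debug_switches.split(";"):
        print("selected block size:", block)
    return block

assert select_block_size(True) == 1024
assert select_block_size(False) == 32
```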
Print upload/download statistics when 'stat' switch is present.

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
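One way such statistics could be tallied (a hypothetical sketch, not the project's implementation): count host-to-device and device-to-host bytes and report them at exit only when the 'stat' switch is set.

```python
import atexit

class TransferStats:
    """Hypothetical tally of host<->device traffic, reported at
    interpreter exit when the 'stat' debug switch is present
    (cf. PNL_LLVM_DEBUG=stat)."""

    def __init__(self, debug_switches: str = ""):
        self.uploaded = 0
        self.downloaded = 0
        if "stat" in debug_switches.split(";"):
            atexit.register(self.report)

    def upload(self, nbytes: int) -> None:
        self.uploaded += nbytes

    def download(self, nbytes: int) -> None:
        self.downloaded += nbytes

    def report(self) -> None:
        print(f"uploaded {self.uploaded} B, downloaded {self.downloaded} B")

stats = TransferStats()
stats.upload(4096)
stats.download(128)
assert (stats.uploaded, stats.downloaded) == (4096, 128)
```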
jvesely added the labels 'compiler' (Runtime Compiler) and 'CUDA' (CUDA target for the runtime compiler) on Feb 7, 2022
jvesely added this to 'In progress' in LLVM Runtime Compiler via automation on Feb 7, 2022
github-actions bot commented Feb 7, 2022

This PR causes the following changes to the html docs (ubuntu-latest-3.7-x64):

No differences!

See CI logs for the full diff.

jvesely merged commit f465fe9 into PrincetonUniversity:devel on Feb 7, 2022
LLVM Runtime Compiler automation moved this from 'In progress' to 'Done' on Feb 7, 2022