llvm/{cuda,}: Reduce and optimize data copies in model execution #2311

jvesely · 2022-02-07T05:40:01Z

Do not create a copy of parameters if there are no parameter states.
Add a memcpy helper to work around the heavy register usage of the store(load(src), dst) idiom.
Upload base arguments to on-chip 'shared' memory for GPU evaluate (optional, default ON).
Add human-readable names to alloca instructions.
Drop unused 'evaluate' kernel wrapper code.
Drop 'cuda_data' debug switch.

Improves performance by ~2.5x for both CPU and GPU execution (measured using the predator-prey model with 101 levels of attention).

A 4B-aligned version performs better on GPUs. An llvm.memcpy fallback is available via PNL_LLVM_DEBUG=unaligned_copy. Both variants are much more register-efficient (and usually faster) than store(load(...)). Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
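To illustrate why a word-wise copy is lighter than store(load(src), dst): loading a whole aggregate keeps every element live at once, while a copy loop only ever holds one 4-byte word. The sketch below models that loop shape in plain Python. It is not the helper from this PR; `copy_by_words` and `WORD` are hypothetical names, and the real helper emits LLVM IR rather than running in Python.

```python
import struct

WORD = 4  # bytes per copied element, mirroring the "4B aligned" variant


def copy_by_words(src: bytes, dst: bytearray) -> None:
    """Copy src into dst one 4-byte word at a time (illustrative only)."""
    if len(src) != len(dst):
        raise ValueError("source and destination sizes differ")
    if len(src) % WORD:
        raise ValueError("size must be a multiple of 4 bytes")
    for offset in range(0, len(src), WORD):
        # Only a single word is "live" per iteration, unlike loading
        # the whole aggregate before storing it.
        (word,) = struct.unpack_from("<I", src, offset)
        struct.pack_into("<I", dst, offset, word)
```

A buffer whose size is not a multiple of 4 bytes is rejected here; a real implementation would need a fallback for that case, which is presumably what the unaligned_copy path covers.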

github-actions · 2022-02-07T05:47:54Z

This PR causes the following changes to the html docs (ubuntu-latest-3.7-x64):

No differences!

...

See CI logs for the full diff.

jvesely added 7 commits February 6, 2022 21:39

llvm/{mechanism,port}: Do not create local copy of parameters if there's no modulation

283ae69

The compiler is not able to eliminate these. Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>

llvm: Add human readable names to most alloca ops for mechanism and composition functions

cbb51bb

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>

llvm/cuda: Drop 'evaluate' specific codepaths in kernel wrapper

a278089

This has not been used since GPU accelerated evaluations were switched to range_evaluate. Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>

llvm/cuda: Add option to upload arguments to shared memory

16a1c42

Move all shared params for GPU grid evaluate. Use PNL_LLVM_DEBUG='cuda_no_shared' to opt out. Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
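As a rough sketch of how an opt-out switch such as cuda_no_shared could be read from PNL_LLVM_DEBUG: the parser below assumes switches are separated by `;` and may carry an optional `=value`. That format is an assumption, and `parse_debug_switches` and `use_shared_memory` are hypothetical names, not functions from the codebase.

```python
import os


def parse_debug_switches(raw):
    """Parse 'a;b=c' into {'a': '', 'b': 'c'} (assumed switch format)."""
    switches = {}
    for item in (raw or "").split(";"):
        name, _, value = item.strip().partition("=")
        if name:
            switches[name] = value
    return switches


def use_shared_memory(switches):
    """Shared-memory upload is the default; 'cuda_no_shared' opts out."""
    return "cuda_no_shared" not in switches


# Read the switches from the environment, as a caller might do at startup.
current = parse_debug_switches(os.environ.get("PNL_LLVM_DEBUG"))
```

Under these assumptions, running with `PNL_LLVM_DEBUG='cuda_no_shared;stat'` would both disable the shared-memory upload and enable the statistics output.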

llvm/cuda: Automatically calculate the 'best' block size

9616932

Prefer large blocks if the kernel uses shared memory. Prefer warp-sized blocks otherwise. Print kernel statistics and the selected block size with PNL_LLVM_DEBUG=stat. Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
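The selection rule described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `pick_block_size` is a hypothetical name, the default warp size of 32 is an assumption that matches current NVIDIA GPUs, and rounding the large block down to a multiple of the warp size is a choice made here for illustration.

```python
def pick_block_size(uses_shared_memory: bool,
                    max_threads_per_block: int,
                    warp_size: int = 32) -> int:
    """Pick a CUDA block size following the rule in the commit message."""
    if max_threads_per_block < 1 or warp_size < 1:
        raise ValueError("limits must be positive")
    if max_threads_per_block < warp_size:
        # The kernel cannot even fill one warp; use what is allowed.
        return max_threads_per_block
    if uses_shared_memory:
        # Shared memory is per block, so a large block lets more threads
        # reuse one upload. Round down to a whole number of warps.
        return (max_threads_per_block // warp_size) * warp_size
    # Without shared memory, a single warp per block is preferred.
    return warp_size
```

In practice the per-kernel thread limit would come from the compiled kernel's attributes, which is presumably the information printed by the stat switch.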

llvm/cuda: Drop 'cuda_data' debug switch

1ae98af

Print upload/download statistics when 'stat' switch is present. Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>

jvesely added the compiler (Runtime Compiler) and CUDA (CUDA target for the runtime compiler) labels Feb 7, 2022

jvesely added this to In progress in LLVM Runtime Compiler via automation Feb 7, 2022

jvesely merged commit f465fe9 into PrincetonUniversity:devel Feb 7, 2022

LLVM Runtime Compiler automation moved this from In progress to Done Feb 7, 2022
