
llvm/cuda: Optimize memory copies in memory execution #2328

Merged 4 commits into PrincetonUniversity:devel on Feb 21, 2022

Conversation

jvesely (Collaborator) commented Feb 21, 2022

- Restrict the 'has_initializers' parameter to mechanisms. This leaves most functions with parameters that are modified by parameter ports.
- Don't copy base parameters into the private parameter space if all of them will be replaced by parameter port outputs.
- Use the mechanism base parameter structure in places that don't modify parameters via parameter ports (e.g., running input/output ports). Unlike the modified result, which is private per evaluation, the base structure is shared.

This improves cache performance and memory utilization on both CPUs and GPUs.
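A minimal C sketch of the second point, under invented names (`MechParams`, `gain`, `bias` are illustrative, not taken from the codebase): when every field will be overwritten by parameter port outputs, the up-front copy of the base structure is pure overhead and can be skipped.

```c
/* Hypothetical parameter structure; field names are illustrative only. */
typedef struct {
    double gain;
    double bias;
} MechParams;

/* Old scheme: unconditionally copy base params into private storage,
 * then overwrite every field with parameter port outputs. */
static void eval_old(const MechParams *base, MechParams *priv,
                     double port_gain, double port_bias) {
    *priv = *base;          /* this copy is redundant here ...        */
    priv->gain = port_gain; /* ... because every field is replaced    */
    priv->bias = port_bias;
}

/* New scheme: when all fields are replaced, skip the base copy;
 * read-only consumers reference the shared base structure directly. */
static void eval_new(MechParams *priv,
                     double port_gain, double port_bias) {
    priv->gain = port_gain;
    priv->bias = port_bias;
}
```

Both paths yield identical private parameters; the new one simply avoids reading the base structure at all when nothing from it survives.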

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
…echanism

In reality, we only use it in RTM.
Reduces the space needed for read-only parameters:
predator-prey: 7.73kB -> 5.96kB
stability-flexibility: 8.84kB -> 5.75kB

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
…rwritten

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
… parameters

The original approach created a copy of the mechanism parameters to modulate them
(if needed to apply mechanism parameter ports) and then passed this copy to
internal function invocations.
Invoking internal functions would then create more copies (if needed)
to apply the parameter ports of function parameters.

The new approach passes the mechanism base parameters instead,
so the copies of function parameters can be made from the original.

The overall amount of copied data is the same,
but the same shared source is now used for all copies.

This is especially beneficial for GPUs, which place the shared parameters
in high-bandwidth on-chip memories.
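The two copy chains can be sketched in C (a hedged illustration; the structure layout and names are invented, since the real parameter structures are generated LLVM IR types): both schemes copy the same number of bytes, but the new one always reads from the shared base structure, which stays resident in cache or on-chip memory instead of being re-read from each thread's private copy.

```c
#include <string.h>

enum { NFUNC = 4, NMECH = 4 };

/* Invented layout: mechanism params embed per-function params. */
typedef struct {
    double func_params[NFUNC];
    double mech_params[NMECH];
} BaseParams;

/* Old approach: first copy the whole mechanism structure into private
 * memory, then derive function-parameter copies from that private copy,
 * so the second copy reads thread-private memory. */
static void run_old(const BaseParams *base, BaseParams *mech_copy,
                    double func_copy[NFUNC]) {
    memcpy(mech_copy, base, sizeof *mech_copy);
    memcpy(func_copy, mech_copy->func_params, NFUNC * sizeof(double));
}

/* New approach: function-parameter copies are made directly from the
 * shared base, so every copy reads the same shared (cached) source. */
static void run_new(const BaseParams *base, double func_copy[NFUNC]) {
    memcpy(func_copy, base->func_params, NFUNC * sizeof(double));
}
```

The results are identical; only the source of the reads changes, which is what drives the reduction in private-memory traffic reported below.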

The observed effect for the stability-flexibility model is a ~20% reduction
in the total amount of data read from thread-private memory,
resulting in a ~10% improvement in kernel execution time.
Measured on a P620 GPU.

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
@jvesely jvesely added the labels compiler (Runtime Compiler) and CUDA (CUDA target for the runtime compiler) on Feb 21, 2022
@jvesely jvesely added this to In progress in LLVM Runtime Compiler via automation Feb 21, 2022
@github-actions commented
This PR causes the following changes to the html docs (ubuntu-latest-3.7-x64):

No differences!

...

See CI logs for the full diff.

@jvesely jvesely merged commit 31c15ce into PrincetonUniversity:devel Feb 21, 2022
LLVM Runtime Compiler automation moved this from In progress to Done Feb 21, 2022