Function call from a kernel #1057
You can call a function inside a kernel. This is completely independent of RAJA. Here's an example: https://github.com/LLNL/RAJAPerf/blob/develop/src/basic/TRAP_INT-Cuda.cpp#L139 The macro defining the kernel body is here: https://github.com/LLNL/RAJAPerf/blob/develop/src/basic/TRAP_INT.hpp#L39 and the function being called is defined here: https://github.com/LLNL/RAJAPerf/blob/develop/src/basic/TRAP_INT-Cuda.cpp#L27
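For readers landing here, a minimal sketch of that pattern (illustrative names, not the exact RAJAPerf code):

```cpp
#include "RAJA/RAJA.hpp"

// A free function annotated so the compiler generates both host and
// device versions; it can then be called from inside a kernel body.
RAJA_HOST_DEVICE
double my_func(double x)
{
  return x * x + 1.0;
}

void apply(double* y, const double* x, int N)
{
  // The lambda forming the kernel body simply calls the function.
  RAJA::forall<RAJA::cuda_exec<256>>(RAJA::RangeSegment(0, N),
    [=] RAJA_DEVICE (int i) {
      y[i] = my_func(x[i]);
    });
}
```
|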
@rhornung67 thanks for your email. In the example you provided, the function is defined in the RAJA space:
My code was written for CPU platforms and all functions are written in C++. Based on the example you provided, I would have to re-define / convert all C++ function to RAJA so that they can be called from a RAJA kernel. Am I correct? |
What do you mean by "RAJA space"? What do you mean by "convert all C++ functions to RAJA"? You shouldn't have to convert any functions, just add the appropriate host/device annotations so the compiler will generate appropriate versions that are callable in host or device code. |
In the above example, where are the variables/macros used to annotate the functions defined? |
The macros are defined here: https://github.com/LLNL/RAJA/blob/develop/include/RAJA/util/macros.hpp#L41. We use them inside RAJA and provide them to users as a convenience. You don't have to use them; you can define your own if you want. The questions you are asking are related to topics independent of RAJA. For example, if you wrote native CUDA code, you would still have to deal with making your functions callable from host and/or device code sections as needed. There is no avoiding that. |
Hi @delcmo, if you have an example, I can try to help. Additionally, you can add both host and device annotations to the same function. |
Hi @artv3, thanks for your reply. This is what I am after. I just started with RAJA and may have overlooked a few things in the documentation. If I understand your response correctly, by including the appropriate annotation the function becomes callable from a RAJA kernel? |
If you want the option to call it from either CPU or GPU, you will need to include the `RAJA_HOST_DEVICE` annotation.
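A minimal sketch, assuming the convenience macro from RAJA/util/macros.hpp (the function name is illustrative):

```cpp
// RAJA_HOST_DEVICE expands to __host__ __device__ in a CUDA build,
// so the compiler emits both a CPU and a GPU version of the function.
RAJA_HOST_DEVICE
double scale(double v, double factor)
{
  return v * factor;
}
```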
|
Ok, perfect let me try this and see if I can get it to compile and run. Thanks for taking the time to reply and to provide that working example. I will get back to you if I have any questions. |
Sounds good, I can definitely help. I use RAJA in a couple of applications and can help with integration/usage questions. |
Hi, I was able to compile and call the function from both the CPU and the GPU using the following syntax:
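(The original snippet did not survive formatting; a sketch of what calling one annotated function from both backends can look like, with illustrative names:)

```cpp
RAJA_HOST_DEVICE
double square(double v) { return v * v; }

void run(double* y, const double* x, int N)
{
  // CPU execution:
  RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N),
    [=] (int i) { y[i] = square(x[i]); });

  // GPU execution, calling the same function:
  RAJA::forall<RAJA::cuda_exec<256>>(RAJA::RangeSegment(0, N),
    [=] RAJA_DEVICE (int i) { y[i] = square(x[i]); });
}
```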
I am now trying to convert one of my functions to get it to run on the GPU:
This function computes the macro variables (density, temperature and momentum components) in a Lattice Boltzmann Method code. I encountered two issues here:
When calling the function
The code compiles and runs fine if I use the following method to declare the arrays with `cudaMallocManaged`:
What is the correct way to pass an array from the host to the device? Is my approach correct here? If not how should I change it?
I avoided the above issue by moving the function definition into the same file as the kernel. How do I make a function defined in a different file available on both the host and the device? Thanks, |
@delcmo Sounds like you're getting the hang of things. This is great.
Also, a word of caution: cudaMalloc calls are expensive, much more than CPU malloc calls. Moreover, cudaMallocManaged calls are even more costly, plus you incur the cost of host-device memory transfers triggered by runtime page faults. The allocation sizes will be an integer multiple of the page size, which is large, even if you think you are allocating only a few bytes. On systems like Summit or Sierra, these transfers are implemented in hardware. If you run on devices that do not support managed memory in hardware, the transfers are handled in runtime software and are more expensive. If you are going this route just to get the code to run, that's fine. But you need to think carefully about host-device memory management and should try to minimize memory space transfers as much as possible for performance reasons.
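A sketch of the allocate-once-and-reuse pattern this caution implies (names illustrative):

```cpp
void simulate(int N, int nsteps)
{
  // One managed allocation up front, reused across many kernel
  // launches, instead of allocating inside the time loop.
  double* f = nullptr;
  cudaMallocManaged(&f, N * sizeof(double));

  for (int step = 0; step < nsteps; ++step) {
    // ... launch kernels that read and write f ...
  }

  cudaFree(f);
}
```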
|
Hi @delcmo. Regarding 1: if you would like to pass `f` and `g` into the function, then inside the function you would have to use the parenthesis operator to access the data in f and g. The role of the views, however, is not to transfer data from the CPU to the GPU, but rather to enable different reshapings of the data. You could potentially choose to treat the data as a 4-dimensional array using views.
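For instance (a sketch; the extents and names are assumptions based on the LBM context):

```cpp
// Treat the flat distribution array as a 4D array: f_(k, j, i, dir).
// The view only reinterprets indexing; it does not move any data.
RAJA::View<double, RAJA::Layout<4>> f_(f, nz, ny, nx, ndir);

double val = f_(k, j, i, dir);  // parenthesis operator, host or device
```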
Regarding 2: I believe you want to add the `RAJA_HOST_DEVICE` annotation to the function definition. |
Hi @rhornung67 and @artv3, thanks for the reply. The piece of code I provided does not include the RAJA Views. Below is the full code:
When I read the RAJA documentation, the RAJA Views object seemed a good option as it appeared to avoid the explicit host-device memory management. As for the second issue, the main function is implemented in a separate file from the kernel. |
Hi @delcmo, the RAJA Views just hold a pointer; the host code is still responsible for allocating the memory. Views will work on both the host and the device. The usage should look like this:
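(The snippet was lost in formatting; a sketch consistent with the description, with assumed names:)

```cpp
// Host code allocates; the view just wraps the pointer.
double* f = nullptr;
cudaMallocManaged(&f, N * sizeof(double));

RAJA::View<double, RAJA::Layout<1>> f_(f, N);

RAJA::forall<RAJA::cuda_exec<256>>(RAJA::RangeSegment(0, N),
  [=] RAJA_DEVICE (int i) {
    f_(i) = 2.0 * f_(i);  // the same f_(i) syntax works on host and device
  });
```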
Given the structure of your kernel, have you looked at RAJA Teams? We have a basic example that compares writing C-style nested loops with what they could look like with Teams here. RAJA Teams creates an execution space where you express kernels as nested for loops. It also supports choosing how to run kernels based on a run-time parameter (host or device). If you could post the C++ version of your kernel, I can give an outline of what the Teams variant would look like. Regarding the second issue, let me see if I can come up with an example. |
I have not looked at the block mapping yet. I spent quite a bit of time learning nested blocks and making sure the code compiles and gives the correct solution, so I will have to look at this later. I looked at the matrix multiplication example to get started with the nested for loops and built from it. The C++ code is quite long (I think), as you can see below:
I will look at your examples and try to understand the logic of the RAJA Teams. Thanks again for your help. |
Hi @delcmo, I took a look at your code, and one possible implementation with RAJA Teams, which enables running either sequentially or with CUDA, could look like this. This first part could go in a header, away from source code, since it can be reused in other kernels:
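(The original block was lost; below is a sketch of what such a header could contain, based on the RAJA::expt Teams API of that period. Treat the alias names as assumptions.)

```cpp
#include "RAJA/RAJA.hpp"

// Launch policy: sequential on the host, CUDA on the device.
using launch_policy = RAJA::expt::LaunchPolicy<
    RAJA::expt::seq_launch_t
#if defined(RAJA_ENABLE_CUDA)
    , RAJA::expt::cuda_launch_t<false>  // false = synchronous launch
#endif
    >;

// Global-thread loop policies: on the GPU these flatten teams/threads
// into one unique global index per dimension; on the CPU they fall
// back to ordinary sequential loops.
using global_thread_x = RAJA::expt::LoopPolicy<
    RAJA::loop_exec
#if defined(RAJA_ENABLE_CUDA)
    , RAJA::expt::cuda_global_thread_x
#endif
    >;

using global_thread_y = RAJA::expt::LoopPolicy<
    RAJA::loop_exec
#if defined(RAJA_ENABLE_CUDA)
    , RAJA::expt::cuda_global_thread_y
#endif
    >;

using global_thread_z = RAJA::expt::LoopPolicy<
    RAJA::loop_exec
#if defined(RAJA_ENABLE_CUDA)
    , RAJA::expt::cuda_global_thread_z
#endif
    >;
```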
And the kernel is below. Here I chose to use global thread policies, which flatten the team/thread model to create unique thread IDs based on the compute grid rather than the standard hierarchical structure. On the GPU each loop iteration gets assigned to a thread, while on the CPU the loops fall back to regular for loops:
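(Again a sketch rather than the lost original; the array and extent names are assumptions:)

```cpp
const int NThreads = 8;
const int NTeams_x = RAJA_DIVIDE_CEILING_INT(nx, NThreads);
const int NTeams_y = RAJA_DIVIDE_CEILING_INT(ny, NThreads);
const int NTeams_z = RAJA_DIVIDE_CEILING_INT(nz, NThreads);

RAJA::expt::launch<launch_policy>(
  RAJA::expt::DEVICE,  // or RAJA::expt::HOST, chosen at run time
  RAJA::expt::Resources(RAJA::expt::Teams(NTeams_x, NTeams_y, NTeams_z),
                        RAJA::expt::Threads(NThreads, NThreads, NThreads)),
  [=] RAJA_HOST_DEVICE (RAJA::expt::LaunchContext ctx) {

    RAJA::expt::loop<global_thread_z>(ctx, RAJA::RangeSegment(0, nz), [&](int k) {
      RAJA::expt::loop<global_thread_y>(ctx, RAJA::RangeSegment(0, ny), [&](int j) {
        RAJA::expt::loop<global_thread_x>(ctx, RAJA::RangeSegment(0, nx), [&](int i) {

          // Each (k, j, i) maps to one unique thread; the direction
          // loop below runs sequentially within that thread.
          double rho = 0.0;
          for (int dir = 0; dir < ndir; ++dir) {
            rho += f_(k, j, i, dir);
          }
          rho_(k, j, i) = rho;

        });
      });
    });
  });
```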
|
One additional detail to note is that I am using macros to guard the CUDA code; this is needed in case we build on a CPU-only platform where the CUDA function calls do not exist. Additionally, there is a memory manager under examples that will allocate with unified memory when we have a CUDA build, or with new when a CPU-only build is detected; this enables portability when building on specific platforms.
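(A sketch of the guarding idea; the actual memory manager lives under examples/ in the RAJA repository and differs in detail:)

```cpp
#include <cstddef>

template <typename T>
T* allocate(std::size_t n)
{
#if defined(RAJA_ENABLE_CUDA)
  T* ptr = nullptr;
  cudaMallocManaged(&ptr, n * sizeof(T));  // unified memory on CUDA builds
  return ptr;
#else
  return new T[n];                         // plain heap on CPU-only builds
#endif
}

template <typename T>
void deallocate(T* ptr)
{
#if defined(RAJA_ENABLE_CUDA)
  cudaFree(ptr);
#else
  delete[] ptr;
#endif
}
```
|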
Thanks for providing the code. I was looking at it and noticed that some of the options look specific to RAJA Teams. Are they documented anywhere? Thanks, |
Hi @delcmo, yes, those are specific to Teams. There is partial documentation in the develop branch of Read the Docs; I think we currently point the docs to our last released version. |
In the post where you converted the code from C++ to RAJA, you left the loop over the direction unchanged. Was this by design? How would RAJA handle that loop in the current implementation? |
That was a choice I made. Given that the parallelism comes from the top three loops, those inner loops would always be executed sequentially, so converting them into RAJA loops wouldn't add more capability. If we chose to convert one to a RAJA loop, it could look something like this:
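(Sketch; the alias name is an assumption:)

```cpp
// Sequential on both backends: each thread walks the directions in order.
using seq_loop = RAJA::expt::LoopPolicy<
    RAJA::loop_exec
#if defined(RAJA_ENABLE_CUDA)
    , RAJA::loop_exec
#endif
    >;

// Inside the launch context, replacing the plain C++ for loop:
RAJA::expt::loop<seq_loop>(ctx, RAJA::RangeSegment(0, ndir), [&](int dir) {
  rho += f_(k, j, i, dir);
});
```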
Additionally, since the loop methods capture by reference, we are able to modify/access anything in the loop hierarchy; this is a different design approach from RAJA kernel, which requires passing parameters by reference into the lambda arguments. |
Hi @artv3, if I understand you correctly, I can keep the C++ for loop syntax as it is. Does that mean each GPU thread will perform a for loop over the directions? |
Yes! Standard C++ loops will be executed sequentially by each thread. |
I was able to modify my code and add the RAJA syntax to get it to work on GPUs. I still have some testing to do, but I am on the right track. In the meantime I have started to look at a different part of the code that is a little bit more complex:
The first question is about the integer offsets used in the loop body. The other question I have is related to handling the different directions. Thanks, |
Hi @delcmo, if I understand correctly (and we can iterate on this): regarding the first part, I think you can use the same global policies for the outer (z, y, x) loops; this will create unique threads, and the loop body will be executed sequentially by each thread. For the second question, if you express the loop body as a for loop over 6 values in which you compute the value of a, offset1, offset2, then we can use the RAJA loop methods to assign that work to threads in the x-direction (which will get executed in parallel); the x loop could be assigned to threads in the y-direction, and the remaining two loops (y, z) could be assigned to blocks in the x and y directions. Basically, hierarchical parallelism. The key, though, is being able to express the work as for loops.
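(A sketch of that mapping; the policy aliases follow the expt Teams naming of that period and should be treated as assumptions:)

```cpp
// Outer loops to blocks (teams), inner loops to threads.
using teams_x   = RAJA::expt::LoopPolicy<RAJA::loop_exec
#if defined(RAJA_ENABLE_CUDA)
    , RAJA::expt::cuda_block_x_direct
#endif
    >;
using teams_y   = RAJA::expt::LoopPolicy<RAJA::loop_exec
#if defined(RAJA_ENABLE_CUDA)
    , RAJA::expt::cuda_block_y_direct
#endif
    >;
using threads_x = RAJA::expt::LoopPolicy<RAJA::loop_exec
#if defined(RAJA_ENABLE_CUDA)
    , RAJA::expt::cuda_thread_x_loop
#endif
    >;
using threads_y = RAJA::expt::LoopPolicy<RAJA::loop_exec
#if defined(RAJA_ENABLE_CUDA)
    , RAJA::expt::cuda_thread_y_loop
#endif
    >;

// z -> blocks in y, y -> blocks in x, x -> threads in y,
// and the 6 directions -> threads in x (parallel):
RAJA::expt::loop<teams_y>(ctx, RAJA::RangeSegment(0, nz), [&](int k) {
  RAJA::expt::loop<teams_x>(ctx, RAJA::RangeSegment(0, ny), [&](int j) {
    RAJA::expt::loop<threads_y>(ctx, RAJA::RangeSegment(0, nx), [&](int i) {
      RAJA::expt::loop<threads_x>(ctx, RAJA::RangeSegment(0, 6), [&](int d) {
        // compute a, offset1, offset2 for direction d, then update
      });
    });
  });
});
```
|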
I was able to re-write that piece of code using a fourth nested loop, as follows:
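(The snippet was lost in formatting; schematically the rewritten nest has this shape, with illustrative names:)

```cpp
for (int k = 0; k < nz; ++k) {
  for (int j = 0; j < ny; ++j) {
    for (int i = 0; i < nx; ++i) {
      for (int d = 0; d < 6; ++d) {  // the new fourth loop
        // compute a, offset1, offset2 for direction d, then update
      }
    }
  }
}
```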
The above implementation should be easier to convert using RAJA syntax. Would that be appropriate? I also have a question on one of the previous iterations of the piece of code:
Under the current implementation, the code would use 8 threads. How does that relate to the number of GPUs on each node? I run on Summit and there are 6 GPUs per node. I tried to run on the GPUs after changing the number of teams and got an error message.
|
Hi @delcmo, yes, the proposed code would work. In regards to the error, I am actually not sure. Could you try running it through cuda-gdb and share a stack trace? I'm wondering if data is not on the GPU or the number of teams is too large. Could you share the sizes of NTeams_? Additionally, RAJA only targets one GPU; it is through combining MPI + RAJA that we can target multiple GPUs. Also, the configuration of threads corresponds to 8^3 = 512 threads per team (or CUDA thread block, which we call a team). The number of blocks is then calculated from the values of NTeams_; here we are computing on a 3D compute grid. |
The code works with NTeams = 8 but gives me the following error message when I set NTeams = 10 or higher:
```
Backtrace for this error:
CUDAassert: invalid configuration argument
/gpfs/alpine/proj-shared/cfd136/raja/build-210412c/include/RAJA/policy/cuda/MemUtils_CUDA.hpp
210
terminate called after throwing an instance of 'std::runtime_error'
what(): CUDAassert
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
ERROR: One or more process (first noticed rank 6) terminated with signal 6
(core dumped)
```
|
To double check, are you choosing values for NThreads or NTeams_{x,y,z}? The max number of threads per CUDA block (RAJA team) is 1024, so NThreads = 10 should work... I'll see if I can reproduce locally. |
I set the number of threads as in the original code you provided in the previous email. The number of teams is calculated from
```
const int NTeams_x = RAJA_DIVIDE_CEILING_INT(x_range_teams.size(),NThreads);
const int NTeams_y = RAJA_DIVIDE_CEILING_INT(y_range_teams.size(),NThreads);
const int NTeams_z = RAJA_DIVIDE_CEILING_INT(z_range_teams.size(),NThreads);
```
where `x_range_teams` is 20 (the number of mesh elements in the x-direction); `y_range_teams` and `z_range_teams` are also set to 20.
|
I ran some more tests, and the piece of code runs fine with any value of NThreads <= 8. When the number of teams falls below 4 (NThreads >= 9), I get a CUDA error message. The number of teams is computed from `RAJA_DIVIDE_CEILING_INT` as above.
|
Does it work if you try launching an empty kernel with the values used directly?
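```cpp
RAJA::expt::launch<launch_policy>
  (RAJA::expt::DEVICE,
   RAJA::expt::Resources(RAJA::expt::Teams(10, 10, 10), RAJA::expt::Threads(8, 8, 8)),
   [=] RAJA_HOST_DEVICE (RAJA::expt::LaunchContext ctx) {
     printf("running cuda kernel \n");
   }); // outer lambda
```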
The error hints that the compute grid is not being set up correctly, but I'm not sure what is going on. |
I added the above lines to the code and ran it to get the following:
```
running cuda kernel
...
running cuda kernel
ERROR: One or more process (first noticed rank 3) terminated with signal 6
(core dumped)
```
It outputs `running cuda kernel` for a while and then stops with the above error message.
|
Can you try running the raja-teams example under examples? |
I will have to try the example, but I ran into another problem lately that I can reproduce with the piece of code below:
The output I get is the following:
which I interpret as the kernel not having completed before the output is printed. When I look at the RAJA Teams examples, some of the C++ loops use the `loop_icount` method; is it needed here? |
Hi @delcmo, no, the `loop_icount` method is only needed when performing tiling. Where is `ndir_` defined? If it is a member or global value, it could be an issue when it comes to lambda capturing. Something else to try is adding `cudaDeviceSynchronize();` after your kernel. |
Ok, thanks for the clarification. I added `cudaDeviceSynchronize();` and it seems to output the correct information. `ndir_` is defined, but I removed that line in the post. I am still struggling with the error message related to
```
CUDAassert: too many resources requested for launch
/gpfs/alpine/proj-shared/cfd136/raja/build-210412c/include/RAJA/policy/cuda/MemUtils_CUDA.hpp
210
terminate called after throwing an instance of 'std::runtime_error'
what(): CUDAassert
```
For some reason, it is triggered this time by adding a print statement (`printf()`) to the kernel. Could it be related to saturated memory?
Also, when using the `RAJA::View` class, are there any specific rules to be aware of? If defining `RAJA::View<double, RAJA::Layout<1>> f_(f, sizef);`, can I read and write the values of `f_` from a kernel without limitation? Should I restrict the use of `RAJA::View` to a minimum in my C++ code? I am seeing weird behavior that I cannot explain when using `f_`-like objects. If I try to read `f_` and update it from a kernel, I also get the above error message.
|
Hi @delcmo, the only rule I can think of is that the data in the view has to be in the same memory space from which you are trying to access it. If you use unified memory, data transfers should be taken care of for you. Regarding your first error, I just thought of something: can you try reducing the number of threads per team, maybe to 6? I came across this article: https://stackoverflow.com/questions/26201172/cuda-too-many-resources-requested-for-launch and it could be that your kernel uses too much register space. One way to check this would be to comment out the code inside the ::launch method; if it runs, it is probably maxing out register space.
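(A sketch of the memory-space rule, assuming unified memory via cudaMallocManaged:)

```cpp
double* h = new double[N];                     // host-only allocation
RAJA::View<double, RAJA::Layout<1>> h_(h, N);  // fine on the host, faults
                                               // if touched in a kernel

double* u = nullptr;
cudaMallocManaged(&u, N * sizeof(double));     // unified memory
RAJA::View<double, RAJA::Layout<1>> u_(u, N);  // safe on host and device
```
|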
I reduced the number of threads and you are correct, it runs. I was previously using the maximum number of threads (Nthreads_x = Nthreads_y = 8 and Nthreads_z = 16). My lack of experience with CUDA programming cost me a few days of work ... ;)
That being said, I now wonder: what would be the correct way to set the number of threads?
|
Choosing the number of threads is a bit of an art; it's hard to say, as it depends on your kernel and how much local memory it needs. It is definitely something to play around with. Happy it is now working, though. One suggestion I have, though, is to change your CUDA launch policy to the asynchronous variant. Additionally, the latest RAJA now has CUDA kernels running on the non-default CUDA stream, something to keep in mind if you allocate memory using `cudaMallocManaged`. |
One additional option in RAJA is to configure CAMP to use the CUDA default stream. That is done by defining the following variable at compile time: `CAMP_USE_PLATFORM_DEFAULT_STREAM`. |
I stopped using it for now. What does default stream mean? How many teams can I use in my code? Am I limited to three, i.e., three nested loops, or can I declare a fourth team to have four nested loops? |
A CUDA stream is the queue in which instructions are stored; CUDA has multiple queues so kernels can be executed asynchronously. Teams/Threads support up to 3 dimensions (like the CUDA programming model). |
I run on Summit at ORNL. I will double check the CUDA examples and my code, because based on your previous email I clearly have something weird in my code. |
If it's like Sierra, it's probably moving the data for you. Try running your code with the following flag: --atsdisable with lalloc. |
I will give it a try, but that would make sense, because I did check that the code is running on the GPUs and I no longer get a CUDA error message. I assume that I will have to modify the code to make it compatible with any CUDA machine and not rely on the specifics of Summit. |
You are correct, the data are being moved.
I am doing some testing on both CPU and GPU to assess the speedup we now get from using the GPU over the CPU. One thing I am not clear on: what does the number of threads mean for CPUs when entering the RAJA loops? Is it ignored by the CPU, or do I have to be careful about what value is used?
|
On the CPU, the number of teams/threads gets ignored; the RAJA loop methods get converted to standard for loops. |
Ok, thanks for the quick reply. I will make sure this is handled correctly in my code by running on CPUs with different numbers of threads.
Thanks,
|
Hello,
I have recently started to use the RAJA package to port some of my C++ subroutines to GPUs, and I was wondering how to call a C++ function from a RAJA kernel. I could not find any related issues, and the documentation does not seem to address this approach.
Does RAJA support calling a C++ function from a kernel? If it does, could you please point me to an example or any documentation?
Thanks,
Marco