Question about seq_exec compiler optimizations #1630

Closed
yencal opened this issue Apr 17, 2024 · 2 comments

yencal commented Apr 17, 2024

Greetings, could someone please explain what sort of compiler optimization happens behind the scenes when one uses seq_exec?
I ask because, for a simple 3D finite-difference nested loop, the RAJA version runs about four times faster than the C code.

Here is the C code. It does not show the body of the loop for brevity, but it is the same as the RAJA version:

  for (int k = 0; k < nz; ++k ) {
    for (int j = 0; j < ny; ++j ) {
      for (int i = 0; i < nx; ++i ) {
         A[i + nx * (j + ny * k)] = ...

And here is the RAJA version of the same loop:

  using EXEC_POLICY_3D =
    RAJA::KernelPolicy<
      RAJA::statement::For<2, RAJA::seq_exec,      // k
        RAJA::statement::For<1, RAJA::seq_exec,    // j
          RAJA::statement::For<0, RAJA::seq_exec,  // i
            RAJA::statement::Lambda<0>
          >
        >
      >
    >;
  RAJA::kernel<EXEC_POLICY_3D>(
    RAJA::make_tuple( RAJA::TypedRangeSegment<int>(0, nz),
                      RAJA::TypedRangeSegment<int>(0, ny),
                      RAJA::TypedRangeSegment<int>(0, nx) ),

    [=] RAJA_DEVICE ( int k, int j, int i) {
        A[i + nx * (j + ny * k)] = ...

Note that I use the same compiler flags for both codes:

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++17 -Ofast -march=native")

I am using an M3 MacBook.
Thanks

@rhornung67
Member

I believe that the compiler sees essentially the same source code whether written as a C-style for-loop or using a RAJA kernel exec method with a seq_exec policy. https://github.com/LLNL/RAJA/blob/develop/include/RAJA/policy/sequential/forall.hpp#L65

That is, there are no pragmas or other annotations applied in RAJA internals. That said, we often observe cases where RAJA code runs faster than native C-style code, but it is not clear why. However, 4x faster seems extraordinary. Have you compared the assembly code for the two versions?
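As a rough illustration of the point above, here is a minimal, self-contained sketch that runs the same triple loop twice: once as raw C-style loops and once through a thin lambda wrapper. Everything in it is an assumption made for illustration: `seq_for`, `stencil`, `run_raw`, `run_wrapped`, and `seconds` are hypothetical names, `seq_for` is not RAJA's implementation, and the 7-point stencil body is a placeholder because the issue elides the real one. The sketch also touches interior points only, so the placeholder stencil never reads out of bounds.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a sequential policy: a plain loop that calls
// the lambda once per index. This is NOT RAJA's code.
template <typename Body>
inline void seq_for(int begin, int end, Body&& body) {
  for (int i = begin; i < end; ++i) body(i);
}

// Placeholder 7-point stencil (assumption: the issue does not show the body).
inline double stencil(const double* B, int i, int j, int k, int nx, int ny) {
  const int c = i + nx * (j + ny * k);
  return B[c - 1] + B[c + 1] + B[c - nx] + B[c + nx] +
         B[c - nx * ny] + B[c + nx * ny] - 6.0 * B[c];
}

// Raw C-style loops over the interior points.
inline void run_raw(std::vector<double>& A, const std::vector<double>& B,
                    int nx, int ny, int nz) {
  assert(A.size() == B.size());
  assert(A.size() == static_cast<std::size_t>(nx) * ny * nz);
  for (int k = 1; k < nz - 1; ++k)
    for (int j = 1; j < ny - 1; ++j)
      for (int i = 1; i < nx - 1; ++i)
        A[i + nx * (j + ny * k)] = stencil(B.data(), i, j, k, nx, ny);
}

// Same loops expressed through the lambda wrapper.
inline void run_wrapped(std::vector<double>& A, const std::vector<double>& B,
                        int nx, int ny, int nz) {
  assert(A.size() == B.size());
  assert(A.size() == static_cast<std::size_t>(nx) * ny * nz);
  double* a = A.data();
  const double* b = B.data();
  seq_for(1, nz - 1, [=](int k) {
    seq_for(1, ny - 1, [=](int j) {
      seq_for(1, nx - 1, [=](int i) {
        a[i + nx * (j + ny * k)] = stencil(b, i, j, k, nx, ny);
      });
    });
  });
}

// Wall-clock time of a callable, in seconds.
template <typename F>
inline double seconds(F&& f) {
  const auto t0 = std::chrono::steady_clock::now();
  f();
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}
```

To use it, fill two output arrays from the same input, time `run_raw` and `run_wrapped` with `seconds`, and check that the outputs match before trusting either timing. If both variants are built as one target with identical flags and a large gap remains, comparing the generated assembly (for example from the compiler's `-S` output) is the next step.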

Author

yencal commented Apr 19, 2024

OK, I will check the assembly code and ensure the CMake flags are propagated accordingly. Thanks
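One way to check flag propagation from inside the code, sketched here under assumptions: GCC and Clang predefine `__OPTIMIZE__` when optimization is enabled and `__FAST_MATH__` when fast-math is in effect (which `-Ofast` turns on). The helper name `build_flags_summary` is hypothetical. Compiling it into both the C and the RAJA targets and printing the result from each should show whether the two builds really saw the same settings.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: summarizes a few predefined macros so that two
// build targets can be compared. It reports what the compiler saw for
// the translation unit that contains this function, nothing more.
inline std::string build_flags_summary() {
  std::string s;
#if defined(__OPTIMIZE__)
  s += "optimize=on";
#else
  s += "optimize=off";
#endif
#if defined(__FAST_MATH__)
  s += " fast-math=on";
#else
  s += " fast-math=off";
#endif
  // Language standard in effect, e.g. 201703 for -std=c++17.
  s += " cplusplus=" + std::to_string(__cplusplus);
  assert(!s.empty());
  return s;
}
```

The macros do not reveal `-march=native`, so inspecting the actual compile commands (for example with a verbose build) is still worthwhile.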

@yencal yencal closed this as completed Apr 19, 2024