Guarantee parallel_scan to use two passes #6897

Open
masterleinad opened this issue Mar 25, 2024 · 7 comments
Labels
Question For Kokkos internal and external contributors and users

Comments

@masterleinad
Contributor

Currently, we only guarantee that parallel_scan calls the functor with is_final == true; we do not guarantee that it is also called with is_final == false.
Note that all backends apart from Serial use a two-pass implementation. If users can't rely on two passes, they can't avoid repeating expensive calculations. For example,

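// Call operator of the scan functor; is_relevant must start out all false.
void operator()(int i, int& chunk_start, bool is_final) const {
  if (is_final) {
    // Final pass: reuse the cached flag instead of re-evaluating the condition.
    if (is_relevant(i)) {
      result(chunk_start) = i;
      ++chunk_start;
    }
  } else {
    // Non-final pass: evaluate the expensive condition once and cache the outcome.
    if (expensive_condition(i)) {
      is_relevant(i) = true;
      ++chunk_start;
    }
  }
}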

doesn't work if there is only one pass, because the is_final == false branch that fills is_relevant never runs. Instead, we would need to launch another kernel to evaluate expensive_condition beforehand, which means launching three kernels on most backends (one "expensive" parallel_for and two "cheap" parallel_scan passes). It appears that the benefit of avoiding one pass in the Serial backend might not be sufficient to justify not allowing caching results in the first parallel_scan pass.
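To make the failure mode concrete, here is a minimal sketch in plain C++ with no Kokkos dependency. Everything in it is an assumption for illustration: `scan_two_pass` and `scan_one_pass` are hypothetical single-chunk stand-ins for what a backend might do, not Kokkos internals, and `Predicate` is a cheap placeholder for the expensive condition that counts how often it is called.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the expensive condition; counts its invocations.
struct Predicate {
  int calls = 0;
  bool operator()(int i) { ++calls; return i % 3 == 0; }
};

// The functor from this issue, with std::vector in place of Kokkos views.
struct CachingCompact {
  Predicate& expensive_condition;
  std::vector<char>& is_relevant;  // must start out zeroed
  std::vector<int>& result;

  void operator()(int i, int& chunk_start, bool is_final) const {
    if (is_final) {
      if (is_relevant[i]) { result[chunk_start] = i; ++chunk_start; }
    } else {
      if (expensive_condition(i)) { is_relevant[i] = 1; ++chunk_start; }
    }
  }
};

// Simplified model of a two-pass scan over one chunk (assumption, not Kokkos code).
template <class F>
int scan_two_pass(int n, const F& f) {
  int partial = 0;
  for (int i = 0; i < n; ++i) f(i, partial, false);  // non-final pass
  int update = 0;
  for (int i = 0; i < n; ++i) f(i, update, true);    // final pass
  return update;
}

// Simplified model of a single-pass scan: only is_final == true is ever seen.
template <class F>
int scan_one_pass(int n, const F& f) {
  int update = 0;
  for (int i = 0; i < n; ++i) f(i, update, true);
  return update;
}

struct Outcome {
  int count;
  int predicate_calls;
  std::vector<int> result;
};

// Runs the caching functor under either model and reports what happened.
Outcome run(int n, bool two_pass) {
  Predicate pred;
  std::vector<char> is_relevant(n, 0);
  std::vector<int> result(n, -1);
  CachingCompact f{pred, is_relevant, result};
  const int count = two_pass ? scan_two_pass(n, f) : scan_one_pass(n, f);
  assert(count >= 0 && count <= n);
  result.resize(count);
  return {count, pred.calls, result};
}
```

Under these assumptions, `run(10, true)` selects the indices 0, 3, 6, 9 with ten predicate calls, while `run(10, false)` never evaluates the predicate and selects nothing. The functor is silently wrong, rather than merely slower, when only the final pass runs.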

@dalg24
Member

dalg24 commented Mar 26, 2024

Does the documentation say anything about this? If the answer is "no we don't", does the doc include anything regarding guarantees with parallel_reduce or parallel_for (I understand parallel_scan is different)?

> It appears that the benefit of avoiding one pass in the Serial backend might not be sufficient to justify not allowing caching results in the first parallel_scan pass.

We will see about that. Some might not be really happy about the serial parallel scan taking a 2x slow down.

@masterleinad
Contributor Author

> We will see about that. Some might not be really happy about the serial parallel scan taking a 2x slow down.

I'd be curious to hear under which circumstances users care about performance for the Serial backend that much.

@dalg24
Member

dalg24 commented Mar 26, 2024 via email

@stanmoore1
Contributor

In general, we want Kokkos Serial to have as low overhead as possible; isn't this why a serial backend exists? Otherwise why not just run OpenMP with 1 thread?

@masterleinad
Contributor Author

> Otherwise why not just run OpenMP with 1 thread?

The Serial backend doesn't require any dependencies, whereas all other backends do, so it's a good fallback and starting point.

@stanmoore1
Contributor

Most systems have OpenMP support, so I think the bar is pretty low. But my point is really that we want the Serial backend to be as close to zero overhead as possible. What I want is no atomic overhead and no multiple loops in parallel_scan.
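The cost on the other side of this trade-off can be sketched as well. A functor that stays correct with a single loop cannot rely on a non-final pass, so it has to evaluate the condition in whichever pass it is called. The plain C++ below is a hedged illustration only: `CountingPredicate`, `RecomputeCompact`, and the loop in `predicate_calls` are hypothetical stand-ins, not Kokkos code.

```cpp
#include <vector>

// Hypothetical expensive predicate that counts its own invocations.
struct CountingPredicate {
  int calls = 0;
  bool operator()(int i) { ++calls; return i % 2 == 1; }
};

// Pass-agnostic compaction: never assumes the non-final pass has run.
struct RecomputeCompact {
  CountingPredicate& expensive_condition;
  std::vector<int>& result;

  void operator()(int i, int& chunk_start, bool is_final) const {
    if (expensive_condition(i)) {
      if (is_final) result[chunk_start] = i;  // write only in the final pass
      ++chunk_start;
    }
  }
};

// Runs `passes` sweeps over [0, n); only the last sweep has is_final == true.
// Stores the compacted indices in `out` and returns the predicate call count.
int predicate_calls(int n, int passes, std::vector<int>& out) {
  CountingPredicate pred;
  std::vector<int> result(n, -1);
  RecomputeCompact f{pred, result};
  int count = 0;
  for (int p = 0; p < passes; ++p) {
    const bool is_final = (p == passes - 1);
    count = 0;  // each sweep starts its running offset from zero
    for (int i = 0; i < n; ++i) f(i, count, is_final);
  }
  result.resize(count);
  out = result;
  return pred.calls;
}
```

In this model the output is the same for one or two sweeps, but the predicate runs once per index with a single sweep and twice per index with two. That doubling is the repeated expensive work the original post wants to avoid on two-pass backends.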

@ajpowelsnl ajpowelsnl added the Question For Kokkos internal and external contributors and users label Apr 22, 2024
@ajpowelsnl
Contributor

@stanmoore1, @masterleinad, regarding "no atomic overhead": as of Kokkos 4.3, there's a new CMake option that might be useful for serial / host-only builds: Kokkos_ENABLE_ATOMICS_BYPASS

4 participants