
joint_exclusive_scan does not work in-place #1440

Open
al42and opened this issue Apr 18, 2024 · 1 comment
Labels
bug Something isn't working

Comments


al42and commented Apr 18, 2024

Bug summary

The current implementation of joint_exclusive_scan (at least the HIP-like one; SSCP does not support these functions yet) does not support in-place operation, even though the standard requires it ("Note that first may be equal to result.").

It also allocates __shared__ scratch storage for the operation (inside __hipsycl_inclusive_scan_over_group) even when the output is itself in __shared__ memory and could safely be reused as scratch. Not a bug, just an inefficiency.

To Reproduce

#include <sycl/sycl.hpp>

template <int workGroupSize, int nElements>
auto exclusivePrefixSumInplace(sycl::handler &cgh, int *gm_data) {
  static_assert(nElements % workGroupSize == 0,
                "This simple scan kernel does not handle padding");
  return [=](sycl::nd_item<1> itemIdx) {
    const int tid = itemIdx.get_local_id(0);
    const sycl::group<1> workGroup = itemIdx.get_group();
    sycl::joint_exclusive_scan(workGroup, gm_data, gm_data + nElements, gm_data,
                               0, sycl::plus<int>{});
  };
}

template <int workGroupSize, int nElements>
auto exclusivePrefixSumOutOfPlace(sycl::handler &cgh, int *gm_data) {
  static_assert(nElements % workGroupSize == 0,
                "This simple scan kernel does not handle padding");
  sycl::local_accessor<int, 1> sm_localBuf(nElements, cgh);
  return [=](sycl::nd_item<1> itemIdx) {
    const int tid = itemIdx.get_local_id(0);
    const sycl::group<1> workGroup = itemIdx.get_group();
    int *sm_localBufPtr = sm_localBuf.get_pointer();
    sycl::joint_exclusive_scan(workGroup, gm_data, gm_data + nElements,
                               sm_localBufPtr, 0, sycl::plus<int>{});
    sycl::group_barrier(workGroup);
    for (int elem = tid; elem < nElements; elem += workGroupSize) {
      gm_data[elem] = sm_localBufPtr[elem];
    }
  };
}

int main() {
  for (const auto &device : sycl::device::get_devices()) {
    // Creating SYCL queue
    sycl::queue q(device, {sycl::property::queue::in_order()});

    std::cout << "Running on " << device.get_info<sycl::info::device::name>()
              << "\n";

    constexpr size_t size = 8192;
    constexpr int wgSize = 256;
    int *dataD = sycl::malloc_device<int>(size, q);
    std::vector<int> dataH(size);

    std::cout << "Running inplace..." << std::endl;
    {
      q.fill<int>(dataD, 1, size);
      q.submit([&](sycl::handler &cgh) {
        sycl::nd_range<1> range{wgSize, wgSize};
        cgh.parallel_for(range,
                         exclusivePrefixSumInplace<wgSize, size>(cgh, dataD));
      });
      q.copy<int>(dataD, dataH.data(), size).wait();

      int numFail = 0;
      for (size_t i = 0; i < size; i++) {
        numFail += (dataH[i] != i);
      }
      std::cout << "Got " << numFail << " failures!" << std::endl;
    }
    std::cout << "Running with a temp buffer..." << std::endl;
    {
      q.fill<int>(dataD, 1, size);
      q.submit([&](sycl::handler &cgh) {
        sycl::nd_range<1> range{wgSize, wgSize};
        cgh.parallel_for(
            range, exclusivePrefixSumOutOfPlace<wgSize, size>(cgh, dataD));
      });
      q.copy<int>(dataD, dataH.data(), size).wait();

      int numFail = 0;
      for (size_t i = 0; i < size; i++) {
        numFail += (dataH[i] != i);
      }
      std::cout << "Got " << numFail << " failures!" << std::endl;
    }
  }
  return 0;
}
$ acpp -O3 --acpp-targets='omp;cuda:sm_86;hip:gfx1034' test_scan.cpp && ACPP_VISIBILITY_MASK="cuda;hip" ./a.out 
[a bunch of "loop not vectorized" warnings]
4 warnings generated when compiling for host.
Running on NVIDIA GeForce RTX 3060
Running inplace...
Got 8191 failures!
Running with a temp buffer...
Got 0 failures!
Running on hipSYCL OpenMP host device
Running inplace...
Got 8191 failures!
Running with a temp buffer...
Got 0 failures!
Running on AMD Radeon RX 6400
Running inplace...
Got 8191 failures!
Running with a temp buffer...
Got 0 failures!

Expected behavior

Both variants should report 0 failures: the in-place joint_exclusive_scan should produce the same exclusive prefix sum as the out-of-place version.

Describe your setup

$ acpp --acpp-version
acpp [AdaptiveCpp compilation driver], Copyright (C) 2018-2023 Aksel Alpay and the AdaptiveCpp project
  AdaptiveCpp version: 24.02.0+git.0359cac9.20240401.branch.develop.dirty
  Installation root: /home/aland/local
  Plugin LLVM version: 17, can accelerate CPU: True
  Available runtime backends:
     librt-backend-cuda.so
     librt-backend-omp.so
     librt-backend-hip.so
     librt-backend-ocl.so


Full configuration [can be overridden using environment variables or command line arguments]:
    version-major: 24
    version-minor: 02
    version-patch: 0
    version-suffix: +git.0359cac9.20240401.branch.develop.dirty
    plugin-llvm-version-major: 17
    plugin-with-cpu-acceleration: true
    default-clang: /usr/lib/llvm-17/bin/clang++
    default-targets: generic
    default-cpu-cxx: /usr/bin/clang++-17
    default-rocm-path: /opt/rocm
    default-use-bootstrap-mode: false
    default-is-dryrun: false
    default-use-accelerated-cpu: true
    default-clang-include-path: /usr/lib/llvm-17/lib/clang/17/include/..
    default-sequential-link-line: -L/usr/lib/x86_64-linux-gnu -lboost_context -lboost_fiber -Wl,-rpath=/usr/lib/x86_64-linux-gnu
    default-sequential-cxx-flags: -I/usr/include -D_ENABLE_EXTENDED_ALIGNED_STORAGE
    default-omp-link-line: -L/usr/lib/x86_64-linux-gnu -lboost_context -lboost_fiber -Wl,-rpath=/usr/lib/x86_64-linux-gnu -fopenmp
    default-omp-cxx-flags: -I/usr/include -fopenmp -D_ENABLE_EXTENDED_ALIGNED_STORAGE
    default-is-explicit-multipass: false
    default-save-temps: false
    default-rocm-link-line: -Wl,-rpath=$HIPSYCL_ROCM_PATH/lib -Wl,-rpath=$HIPSYCL_ROCM_PATH/hip/lib -L/opt/rocm/lib -L/opt/rocm/hip/lib -lamdhip64
    default-rocm-cxx-flags: -isystem $HIPSYCL_PATH/include/AdaptiveCpp/hipSYCL/std/hiplike -isystem /usr/lib/llvm-17/lib/clang/17/include/.. -U__FLOAT128__ -U__SIZEOF_FLOAT128__ -I$HIPSYCL_ROCM_PATH/include -I$HIPSYCL_ROCM_PATH/include --rocm-device-lib-path=$HIPSYCL_ROCM_PATH/amdgcn/bitcode --rocm-path=$HIPSYCL_ROCM_PATH -fhip-new-launch-api -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -D__HIP_ROCclr__
    default-nvcxx: (unconfigured)
    default-cuda-path: /usr/local/cuda
    default-cuda-link-line: -Wl,-rpath=$HIPSYCL_CUDA_LIB_PATH -L$HIPSYCL_CUDA_LIB_PATH -lcudart
    default-cuda-cxx-flags: -U__FLOAT128__ -U__SIZEOF_FLOAT128__ -isystem $HIPSYCL_PATH/include/AdaptiveCpp/hipSYCL/std/hiplike
  • ROCm 6.0.0, CUDA 12.3, Clang 17, Ubuntu 22.04
@al42and al42and added the bug Something isn't working label Apr 18, 2024

al42and commented Apr 18, 2024

One reason is that we overwrite result[0] before doing the inclusive scan; however, moving this conditional assignment to after the __hipsycl_joint_inclusive_scan call is not enough to solve the problem.

Upd 1:

Also, NVIDIA Compute Sanitizer reports a potential race on a scratch buffer in __hipsycl_inclusive_scan_over_group. Fixing that is still not enough.

Both fixes above are in d41f958 for anyone interested.

Upd 2: And the host implementation seems to be pretty broken in the "in-place" case too, but apparently for independent reasons since the implementation is quite different.

Upd 3: Ah, yes, we pass the input and output to __hipsycl_inclusive_scan_over_group shifted by one, so we have a nasty overlap there :( Off the top of my head, I don't see an easy way to fix the current code while keeping the exclusive scan implemented on top of the internal inclusive scan. The easiest fix would perhaps be to template the functions on whether the scan is inclusive or exclusive. Or just duplicate the code.
