Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add clz builtin #965

Merged
merged 26 commits into from
May 5, 2023
Merged

add clz builtin #965

merged 26 commits into from
May 5, 2023

Conversation

tdavidcl
Copy link
Contributor

@tdavidcl tdavidcl commented Mar 7, 2023

Hi,

I've tried to implement the clz builtin in OpenSYCL by mimicking the way it was done for mul24.

The part where I'm unsure whether the definition is correct are :

HIPSYCL_DEFINE_BUILTIN(clz, HIPSYCL_BUILTIN_OVERLOAD_SET_GENINTEGER,
                       HIPSYCL_BUILTIN_GENERATOR_UNARY_T)

and

template<class T>
HIPSYCL_BUILTIN T __hipsycl_clz(T x) noexcept {
  HIPSYCL_RETURN_DISPATCH_BUILTIN(__hipsycl_clz, x);
}

Also do you have a mechanism in place to check the correctness of builtin functions ?

@illuhad
Copy link
Collaborator

illuhad commented Mar 7, 2023

Nice, more awesome PRs! :D

HIPSYCL_DEFINE_BUILTIN(clz, HIPSYCL_BUILTIN_OVERLOAD_SET_GENINTEGER,
HIPSYCL_BUILTIN_GENERATOR_UNARY_T)

This looks plausible to me.

template
HIPSYCL_BUILTIN T __hipsycl_clz(T x) noexcept {
HIPSYCL_RETURN_DISPATCH_BUILTIN(__hipsycl_clz, x);
}

That too.

Also do you have a mechanism in place to check the correctness of builtin functions ?

You can add a test case. We do test builtin correctness for math functions here: https://github.com/OpenSYCL/OpenSYCL/blob/develop/tests/sycl/math.cpp
But we don't yet have a similar file for integer builtins. You can add it :)

There seems to be an issue with the current enable_if logic in the host path. Maybe it is easier just using overloads?

src/libkernel/sscp/ptx/integer.cpp Outdated Show resolved Hide resolved
include/hipSYCL/sycl/libkernel/sscp/builtins.hpp Outdated Show resolved Hide resolved
@tdavidcl
Copy link
Contributor Author

tdavidcl commented Mar 9, 2023

There seems to be an issue with the current enable_if logic in the host path. Maybe it is easier just using overloads?

I've changed the declaration now the checks passes on my computer at least

Currently on the host side I use a declaration like this

template <class T,
          std::enable_if_t<
              (std::is_same_v<T, unsigned int> || std::is_same_v<T, int> ||
               std::is_same_v<T, unsigned short> || std::is_same_v<T, short> ||
               std::is_same_v<T, unsigned char> ||
               std::is_same_v<T, signed char> || std::is_same_v<T, char>),
              int> = 0>
HIPSYCL_BUILTIN T __hipsycl_clz(T x) noexcept {
  return __builtin_clz(x);
}

template <class T, std::enable_if_t<(std::is_same_v<T, unsigned long> ||
                                     std::is_same_v<T, long>),
                                    int> = 0>
HIPSYCL_BUILTIN T __hipsycl_clz(T x) noexcept {
  return __builtin_clzl(x);
}

template <class T, std::enable_if_t<(std::is_same_v<T, unsigned long long> ||
                                     std::is_same_v<T, long long>),
                                    int> = 0>
HIPSYCL_BUILTIN T __hipsycl_clz(T x) noexcept {
  return __builtin_clzll(x);
}

which does work but is rather long. Would this be ok ?

@illuhad
Copy link
Collaborator

illuhad commented Mar 9, 2023

I feel like this could be simplified. Would maybe something like this work?

template<class T>
T __hipsycl_clz(T x) {
  if constexpr(std::is_same_v<T, long long> || std::is_same_v<T, unsigned long long>)
    return __builtin_clzll(x);
  else if constexpr(std::is_same_v<T, long> || std::is_same_v<T unsigned long>)
    return __builtin_clzl(x);
  else
   return __builtin_clz(x); // do we need static_cast<T>(...) for the return statements?
}

@illuhad
Copy link
Collaborator

illuhad commented Mar 9, 2023

CI says nvc++ is not happy:

"/home/runner/work/OpenSYCL/OpenSYCL/build/install/bin/../include/CL/../hipSYCL/sycl/libkernel/builtin_interface.hpp", line 534: error: no instance of overloaded function "hipsycl::sycl::detail::hiplike_builtins::__hipsycl_clz" matches the argument list
            argument types are: (char)
    HIPSYCL_RETURN_DISPATCH_BUILTIN(__hipsycl_clz, x);
    ^
          detected during instantiation of "T hipsycl::sycl::detail::__hipsycl_clz(T) noexcept [with T=char]" at line 788 of "/home/runner/work/OpenSYCL/OpenSYCL/build/install/bin/../include/CL/../hipSYCL/sycl/libkernel/builtins.hpp"

"/home/runner/work/OpenSYCL/OpenSYCL/build/install/bin/../include/CL/../hipSYCL/sycl/libkernel/builtin_interface.hpp", line 534: error: no instance of overloaded function "hipsycl::sycl::detail::hiplike_builtins::__hipsycl_clz" matches the argument list
            argument types are: (signed char)
    HIPSYCL_RETURN_DISPATCH_BUILTIN(__hipsycl_clz, x);
    ^
          detected during instantiation of "T hipsycl::sycl::detail::__hipsycl_clz(T) noexcept [with T=signed char]" at line 788 of "/home/runner/work/OpenSYCL/OpenSYCL/build/install/bin/../include/CL/../hipSYCL/sycl/libkernel/builtins.hpp"

"/home/runner/work/OpenSYCL/OpenSYCL/build/install/bin/../include/CL/../hipSYCL/sycl/libkernel/builtin_interface.hpp", line 534: error: no instance of overloaded function "hipsycl::sycl::detail::hiplike_builtins::__hipsycl_clz" matches the argument list
            argument types are: (unsigned char)
    HIPSYCL_RETURN_DISPATCH_BUILTIN(__hipsycl_clz, x);
    ^
          detected during instantiation of "T hipsycl::sycl::detail::__hipsycl_clz(T) noexcept [with T=unsigned char]" at line 788 of "/home/runner/work/OpenSYCL/OpenSYCL/build/install/bin/../include/CL/../hipSYCL/sycl/libkernel/builtins.hpp"

"/home/runner/work/OpenSYCL/OpenSYCL/build/install/bin/../include/CL/../hipSYCL/sycl/libkernel/builtin_interface.hpp", line 534: error: no instance of overloaded function "hipsycl::sycl::detail::hiplike_builtins::__hipsycl_clz" matches the argument list
            argument types are: (short)
    HIPSYCL_RETURN_DISPATCH_BUILTIN(__hipsycl_clz, x);
    ^
          detected during instantiation of "T hipsycl::sycl::detail::__hipsycl_clz(T) noexcept [with T=short]" at line 788 of "/home/runner/work/OpenSYCL/OpenSYCL/build/install/bin/../include/CL/../hipSYCL/sycl/libkernel/builtins.hpp"

"/home/runner/work/OpenSYCL/OpenSYCL/build/install/bin/../include/CL/../hipSYCL/sycl/libkernel/builtin_interface.hpp", line 534: error: no instance of overloaded function "hipsycl::sycl::detail::hiplike_builtins::__hipsycl_clz" matches the argument list
            argument types are: (unsigned short)
    HIPSYCL_RETURN_DISPATCH_BUILTIN(__hipsycl_clz, x);
    ^
          detected during instantiation of "T hipsycl::sycl::detail::__hipsycl_clz(T) noexcept [with T=unsigned short]" at line 788 of "/home/runner/work/OpenSYCL/OpenSYCL/build/install/bin/../include/CL/../hipSYCL/sycl/libkernel/builtins.hpp"

"/home/runner/work/OpenSYCL/OpenSYCL/build/install/bin/../include/CL/../hipSYCL/sycl/libkernel/builtin_interface.hpp", line 534: error: no instance of overloaded function "hipsycl::sycl::detail::hiplike_builtins::__hipsycl_clz" matches the argument list
            argument types are: (long)
    HIPSYCL_RETURN_DISPATCH_BUILTIN(__hipsycl_clz, x);
    ^
          detected during instantiation of "T hipsycl::sycl::detail::__hipsycl_clz(T) noexcept [with T=long]" at line 788 of "/home/runner/work/OpenSYCL/OpenSYCL/build/install/bin/../include/CL/../hipSYCL/sycl/libkernel/builtins.hpp"

"/home/runner/work/OpenSYCL/OpenSYCL/build/install/bin/../include/CL/../hipSYCL/sycl/libkernel/builtin_interface.hpp", line 534: error: no instance of overloaded function "hipsycl::sycl::detail::hiplike_builtins::__hipsycl_clz" matches the argument list
            argument types are: (unsigned long)
    HIPSYCL_RETURN_DISPATCH_BUILTIN(__hipsycl_clz, x);
    ^
          detected during instantiation of "T hipsycl::sycl::detail::__hipsycl_clz(T) noexcept [with T=unsigned long]" at line 788 of "/home/runner/work/OpenSYCL/OpenSYCL/build/install/bin/../include/CL/../hipSYCL/sycl/libkernel/builtins.hpp"

@tdavidcl
Copy link
Contributor Author

tdavidcl commented Mar 9, 2023

CI says nvc++ is not happy:

actually after checking the hip/cuda doc specify clz for 32 bit ints and clzll for 64 bit ints.

So I've changed it to :

template<class T>
HIPSYCL_HIPLIKE_BUILTIN T __hipsycl_clz(T x) noexcept {

  // use __clzll or __clz by checking the bit lenght because
  // the nvidia/hip documentation mention clz as 32 bits and clzll as 64
  
  if constexpr (sizeof(T)*CHAR_BIT == 64){
    return __clzll(static_cast<__hipsycl_int64>(x));
  }

  return __clz(static_cast<__hipsycl_int32>(x));
}

It should also solve the instantiation issues.

tdavidcl and others added 2 commits March 18, 2023 16:06
Co-authored-by: Ronan Keryell <ronan@keryell.fr>
@illuhad
Copy link
Collaborator

illuhad commented Apr 5, 2023

Also do you have a mechanism in place to check the correctness of builtin functions ?

Can you confirm whether you intend to also add a test case, or if this PR should be considered for final review/testing as is?

@tdavidcl
Copy link
Contributor Author

tdavidcl commented Apr 5, 2023

I think I won't be able to work on this one for a few weeks, so I won't be able to add testing to this as of now...

Can you confirm whether you intend to also add a test case, or if this PR should be considered for final review/testing as is?

So do you prefer to wait until i add some test case or do you want to review it now ?

@illuhad
Copy link
Collaborator

illuhad commented Apr 26, 2023

So do you prefer to wait until i add some test case or do you want to review it now ?

I'll start reviewing now and merge once tests are there too :-)

@tdavidcl
Copy link
Contributor Author

tdavidcl commented May 3, 2023

I've added the test, and cleaned the clz implementation.
The tests pass on omp mode, but i don't have hardware/config to test on amd & nvidia right now.

@illuhad
Copy link
Collaborator

illuhad commented May 3, 2023

Thank you, I have just merged our new self-hosted CI with NVIDIA and AMD GPUs - so if you rebase on current develop, the tests will be run on GPUs :)

@illuhad
Copy link
Collaborator

illuhad commented May 4, 2023

I fixed the typo for nvcc, but i'm suprised that nvcc doesn't have __builtin_clz, would __clz, __clzll work since it is compiling some CUDA ?

We don't support nvcc, only nvc++ :-) (those are two very different compilers).
I'm not sure if the CUDA intrinsic would work in the host path - most likely not, so we'd have to use the fallback clz.

@tdavidcl
Copy link
Contributor Author

tdavidcl commented May 4, 2023

I fixed the typo for nvcc, but i'm suprised that nvcc doesn't have __builtin_clz, would __clz, __clzll work since it is compiling some CUDA ?

We don't support nvcc, only nvc++ :-) (those are two very different compilers). I'm not sure if the CUDA intrinsic would work in the host path - most likely not, so we'd have to use the fallback clz.

yeah my bad i meant nvc++. The test in math.cpp with nvc++ passes, with the fix, tho i'm not sure if the __clz or fallback_clz version was used

edit : after checking in the doc it says __device__​ int __clz ( int x ), so yes not on the host

@tdavidcl
Copy link
Contributor Author

tdavidcl commented May 4, 2023

So i reverted to the previous commit. With the test passing on all config (+ self hosted), is everything ok for final review ?

Copy link
Collaborator

@illuhad illuhad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@illuhad
Copy link
Collaborator

illuhad commented May 4, 2023

Hm.. it seems that there might still be an issue on CPU: https://github.com/OpenSYCL/OpenSYCL/actions/runs/4883593430/jobs/8716435004?pr=965

@tdavidcl
Copy link
Contributor Author

tdavidcl commented May 4, 2023

Yes, but i can not reproduce the issue ...
The previous version with the issue in the null case was working, so i've rolled back to it with an additional check on x==0

https://godbolt.org/z/ToKsdoTKM

@illuhad
Copy link
Collaborator

illuhad commented May 4, 2023

The only potential issue I see is that it would probably evalute !bset[sizeof(T)*CHAR_BIT - idx -1] before it would evaluate the idx<sizeof(T)*CHAR_BIT. So in the step when the second condition evaluates to false, it could happen that it accessed the invalid memory one time (thus triggering UB) for the first condition before finding out that it should not have done that.

Let's see if it works now :)

@tdavidcl
Copy link
Contributor Author

tdavidcl commented May 4, 2023

it's odd that the macos test doesn't work, it was ok here : https://github.com/OpenSYCL/OpenSYCL/actions/runs/4883180762/jobs/8714601525

and the only difference is this line :

if(x==0){return sizeof(T)*CHAR_BIT;}

and the test pass on my MAC (M1) with appleClang :'(

@illuhad
Copy link
Collaborator

illuhad commented May 4, 2023

Mac also failed previously, before the fix. So I wonder why changing to if(x==0) still has not resolved the issue entirely... For reference, this is how Mac CI works: https://github.com/OpenSYCL/OpenSYCL/blob/develop/.github/workflows/macos.yml
If you have a Mac, can you try following these commands to see if you can reproduce? Or maybe try address sanitizer to see if it's still doing some invalid access.

@tdavidcl
Copy link
Contributor Author

tdavidcl commented May 4, 2023

it is still correct on my Mac, I've added bunch of print statements, can you try running it only the Mac OS workflow?

@illuhad
Copy link
Collaborator

illuhad commented May 4, 2023

I've invited you into the OpenSYCL organization; the workflows will then run automatically for you.

@tdavidcl
Copy link
Contributor Author

tdavidcl commented May 4, 2023

I've invited you into the OpenSYCL organization; the workflows will then run automatically for you.

Oh thanks a lot, it will speed up the process for sure ^^

@tdavidcl
Copy link
Contributor Author

tdavidcl commented May 4, 2023

ok I found the issue : in the failing tests you have this ...

castvar = 0
__builtin_clz(castvar) = 73911 (unsigned int)
__builtin_clz(0) = 73896 (unsigned int)

the clz builtin looks broken in the Mac OS workflow, also it does not look consistent and hard to check ...

also I found this : https://stackoverflow.com/questions/19527897/how-undefined-are-builtin-ctz0-or-builtin-clz0
love it ...

@illuhad
Copy link
Collaborator

illuhad commented May 4, 2023

Good find!

So basically we should always

if(x == 0)
  return sizeof(T)*CHAR_BIT
else
  return __builtin_clz(x)

whenever using the builtin?

I do wonder though when this case is so undefined, whether the SYCL specification even mandates a specific result here? I can kind of see that the concept of leading or trailing zeros falls apart somewhat in the case of a value of 0.

Good thing we noticed this edge case :)

@tdavidcl
Copy link
Contributor Author

tdavidcl commented May 4, 2023

The SYCL standard actually says :

geninteger clz(geninteger x)

Returns the number of leading 0-bits in x, starting at the most significant bit position. 
If x is 0, returns the size in bits of the type of x or component type of x, if x is a vector type.

So I've added the check for __builtin_clz

In cuda :

__device__ ​ int __clz ( int  x )
Return the number of consecutive high-order zero bits in a 32-bit integer.

so this one is fine

I'm not 100% sure for the others

@illuhad
Copy link
Collaborator

illuhad commented May 4, 2023

My guess is that SPIR-V will probably be aligned with OpenCL, which is probably aligned with SYCL. No idea for AMD. So maybe also add the check for those two backends to be sure?

@illuhad
Copy link
Collaborator

illuhad commented May 4, 2023

... Or maybe just add the check to the high-level builtin interface, so that the backends don't need to be concerned about this case anymore?

@tdavidcl
Copy link
Contributor Author

tdavidcl commented May 4, 2023

My guess is that SPIR-V will probably be aligned with OpenCL, which is probably aligned with SYCL. No idea for AMD. So maybe also add the check for those two backends to be sure?

As you guessed
From https://registry.khronos.org/SPIR-V/specs/unified1/OpenCL.ExtendedInstructionSet.100.html

clz

Returns the number of leading 0 bits in x, starting at the most significant bit position. 
If x is 0, returns the size in bits of the type of x or component type of x, if x is a vector. 

So spir-v should be fine

... Or maybe just add the check to the high-level builtin interface, so that the backends don't need to be concerned about this case anymore?

Actually we can just add the check in tests, also I don't know also why in the first place the value was 0 in the inputs, it shouldn't when i'm looking at the test.

@tdavidcl
Copy link
Contributor Author

tdavidcl commented May 4, 2023

From RadeonCompute gits:

https://github.com/RadeonOpenCompute/clang/blob/a09d37e345861d68f9768939e485d265f4fcb0ce/include/clang/Basic/TargetInfo.h#L776

Looks like they are aware and check for it :), so looks like we only have to be careful with the host.

@illuhad illuhad merged commit a54d87b into AdaptiveCpp:develop May 5, 2023
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants