compute special constants. #2830

Open
MkazemAkhgary opened this issue Apr 6, 2024 · 1 comment
Labels
Performance All issues related to performance/code generation

Comments

MkazemAkhgary commented Apr 6, 2024

Constants that are filled with 1s from one side and 0s from the other, such as 0xFFFFFFF8 or 0x000000FF, can be computed directly instead of being broadcast from memory, which should be faster. Such numbers are common: 1, -8, 255, ...

If -1 is already present in a register, vpcmpeqd is not needed and each constant takes just one instruction.

vpcmpeqd    ymm0, ymm0, ymm0 # 0xFFFFFFFF
vpslld    ymm1, ymm0, 3 # -8 = 0xFFFFFFF8
vpcmpeqd    ymm0, ymm0, ymm0 # 0xFFFFFFFF
vpsrld    ymm1, ymm0, 24 # 255 = 0x000000FF
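
As a rough illustration of which constants qualify, here is a scalar C sketch that models a single 32-bit lane of the sequences above. The helper names (`high_ones`, `low_ones`, `left_shift_for`, `right_shift_for`) are made up for this example and are not part of ISPC or LLVM.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of "vpslld all_ones, n": the top (32 - n) bits are set. Valid for 0 <= n <= 31. */
static uint32_t high_ones(unsigned n) { return (uint32_t)(UINT32_MAX << n); }

/* One lane of "vpsrld all_ones, n": the low (32 - n) bits are set. Valid for 0 <= n <= 31. */
static uint32_t low_ones(unsigned n) { return (uint32_t)(UINT32_MAX >> n); }

/* Returns n in 1..31 if c == 0xFFFFFFFF << n, otherwise -1. */
int left_shift_for(uint32_t c) {
    for (unsigned n = 1; n < 32; ++n)
        if (high_ones(n) == c) return (int)n;
    return -1;
}

/* Returns n in 1..31 if c == 0xFFFFFFFF >> n, otherwise -1. */
int right_shift_for(uint32_t c) {
    for (unsigned n = 1; n < 32; ++n)
        if (low_ones(n) == c) return (int)n;
    return -1;
}

/* Checks the two constants used in the assembly above. */
void check_one_sided(void) {
    assert(left_shift_for(0xFFFFFFF8u) == 3);    /* -8  */
    assert(right_shift_for(0x000000FFu) == 24);  /* 255 */
    assert(right_shift_for(1u) == 31);           /* 1   */
    assert(left_shift_for(0x3FC00000u) == -1);   /* 1s in the middle: needs two shifts */
}
```

A compiler pass could use a test like this to decide whether a splat constant is a candidate for the single-shift form.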

Constants with 1s in the middle can be computed in a similar way with two shifts. This is perhaps faster than a broadcast, and should be faster if -1 is already present. These are also common (2, 4, -2.0, 0.5, ...).

vpcmpeqd    ymm0, ymm0, ymm0 # 0xFFFFFFFF
vpslld    ymm1, ymm0, 24 # 0xFF000000
vpsrld    ymm1, ymm1, 2 # 1.5f = 0x3FC00000
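
The shift counts for such a constant can be derived mechanically. The following is a minimal scalar sketch, assuming a hypothetical helper `run_of_ones_shifts` that reports the left and right shift counts for one 32-bit lane; it is only an illustration, not existing compiler code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* If c is one contiguous run of 1 bits, finds left/right such that
 * (0xFFFFFFFF << left) >> right == c and returns true. Otherwise returns false. */
bool run_of_ones_shifts(uint32_t c, unsigned *left, unsigned *right) {
    if (c == 0) return false;
    unsigned lo = 0;
    while (((c >> lo) & 1u) == 0) ++lo;          /* index of the lowest set bit */
    uint32_t run = c >> lo;                      /* the run now starts at bit 0 */
    if ((run & (run + 1u)) != 0) return false;   /* not of the form 2^len - 1 */
    unsigned len = 0;
    while (run != 0) { ++len; run >>= 1; }       /* length of the run */
    *left = 32u - len;                           /* vpslld count: leaves len ones at the top */
    *right = 32u - (lo + len);                   /* vpsrld count: moves the run into place */
    return true;
}

/* Checks the 1.5f example above and a few of the other constants mentioned. */
void check_run_of_ones(void) {
    unsigned l, r;
    assert(run_of_ones_shifts(0x3FC00000u, &l, &r) && l == 24 && r == 2);  /* 1.5f */
    assert(run_of_ones_shifts(0x3F000000u, &l, &r) && l == 26 && r == 2);  /* 0.5f */
    assert(run_of_ones_shifts(2u, &l, &r) && l == 31 && r == 30);          /* 2 */
    assert(!run_of_ones_shifts(5u, &l, &r));                               /* two separate runs */
}
```

When `right` comes out as 0 the constant is one-sided and the second shift can be dropped.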

A similar trick computes constants with 0s in the middle using a shift and a rotate (AVX512 only).

vpternlogd    zmm0, zmm0, zmm0, 0xFF # 0xFFFFFFFF (vpcmpeqd on zmm operands writes a mask register, not a vector)
vpsrld    zmm1, zmm0, 1 # 0x7FFFFFFF
vprold    zmm1, zmm1, 2 # -3 = 0xFFFFFFFD
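
The shift and rotate counts can be derived in the same spirit. This scalar sketch models one 32-bit lane; `rotl32` and `run_of_zeros_shift_rotate` are hypothetical names chosen for this example, and runs of zeros that wrap around from bit 31 to bit 0 are deliberately not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One lane of vprold: rotate left by r bits. */
static uint32_t rotl32(uint32_t x, unsigned r) {
    r &= 31u;
    return r ? (uint32_t)((x << r) | (x >> (32u - r))) : x;
}

/* If c has exactly one contiguous (non-wrapping) run of 0 bits, finds srl/rol such that
 * rotl32(0xFFFFFFFF >> srl, rol) == c and returns true. Otherwise returns false. */
bool run_of_zeros_shift_rotate(uint32_t c, unsigned *srl, unsigned *rol) {
    uint32_t z = ~c;                             /* zeros of c become ones */
    if (z == 0 || c == 0) return false;          /* no zeros, or nothing but zeros */
    unsigned lo = 0;
    while (((z >> lo) & 1u) == 0) ++lo;          /* lowest zero bit of c */
    uint32_t run = z >> lo;
    if ((run & (run + 1u)) != 0) return false;   /* more than one run of zeros */
    unsigned len = 0;
    while (run != 0) { ++len; run >>= 1; }
    *srl = len;                                  /* vpsrld count: len zeros at the top */
    *rol = (lo + len) & 31u;                     /* vprold count: moves them down to bit lo */
    return true;
}

/* Checks the -3 example above and one wider gap. */
void check_run_of_zeros(void) {
    unsigned s, r;
    assert(run_of_zeros_shift_rotate(0xFFFFFFFDu, &s, &r) && s == 1 && r == 2);  /* -3 */
    assert(rotl32(UINT32_MAX >> 1, 2) == 0xFFFFFFFDu);
    assert(run_of_zeros_shift_rotate(0xFFFF00FFu, &s, &r) && s == 8 && r == 16);
    assert(rotl32(UINT32_MAX >> 8, 16) == 0xFFFF00FFu);
}
```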

If a negative number is present, vpabsd gives the positive value.
The double of a number can be computed via vpaddd.
If -1 is present, the complement of a constant can be computed using vpxor.
Adjacent numbers can be computed by adding or subtracting -1.
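
These derivations are easy to sanity-check on a single lane. The sketch below uses made-up helper names (`lane_abs`, `lane_double`, `lane_not`, `lane_inc`, `lane_dec`), each standing in for one of the instructions listed above, and starts from the constant -8 as an arbitrary example.

```c
#include <assert.h>
#include <stdint.h>

#define ALL_ONES UINT32_MAX   /* the -1 register, one lane */

/* vpabsd: absolute value of a lane interpreted as signed. */
uint32_t lane_abs(uint32_t x) { return (x & 0x80000000u) ? (uint32_t)(0u - x) : x; }

/* vpaddd x, x: doubles the lane. */
uint32_t lane_double(uint32_t x) { return (uint32_t)(x + x); }

/* vpxor x, all_ones: bitwise complement. */
uint32_t lane_not(uint32_t x) { return x ^ ALL_ONES; }

/* vpsubd x, all_ones: x - (-1) == x + 1. */
uint32_t lane_inc(uint32_t x) { return (uint32_t)(x - ALL_ONES); }

/* vpaddd x, all_ones: x + (-1) == x - 1. */
uint32_t lane_dec(uint32_t x) { return (uint32_t)(x + ALL_ONES); }

/* Derives several neighbours of -8, each with one extra instruction. */
void check_derived_constants(void) {
    const uint32_t minus8 = (uint32_t)(ALL_ONES << 3);   /* 0xFFFFFFF8 */
    assert(lane_abs(minus8) == 8u);
    assert(lane_double(minus8) == (uint32_t)-16);
    assert(lane_not(minus8) == 7u);
    assert(lane_inc(minus8) == (uint32_t)-7);
    assert(lane_dec(minus8) == (uint32_t)-9);
}
```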


For some numbers, vpsubd or vpaddd can replace the second shift to reduce port contention. For example, to get the number 2, compute (~0 >> 30) + ~0 instead of (~0 >> 31 << 1), provided that -1 is present.
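
A quick scalar check that the two forms agree for one lane. The generalisation to other powers of two via subtraction (`pow2_via_sub`) is an assumption added for illustration and is not spelled out in the original post; all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Two shifts: (~0 >> 31) << 1. */
uint32_t two_via_shifts(void) { return (uint32_t)((UINT32_MAX >> 31) << 1); }

/* One shift plus vpaddd with -1: (~0 >> 30) + ~0 == 3 - 1. */
uint32_t two_via_add(void) { return (uint32_t)((UINT32_MAX >> 30) + UINT32_MAX); }

/* Assumed generalisation: one shift plus vpsubd with -1 gives 2^k for k in 1..31,
 * because (~0 >> (32 - k)) == 2^k - 1 and subtracting -1 adds 1. */
uint32_t pow2_via_sub(unsigned k) {
    return (uint32_t)((UINT32_MAX >> (32u - k)) - UINT32_MAX);
}

/* Confirms that the add/sub forms match the plain values. */
void check_add_sub_forms(void) {
    assert(two_via_shifts() == 2u);
    assert(two_via_add() == 2u);
    for (unsigned k = 1; k < 32; ++k)
        assert(pow2_via_sub(k) == (uint32_t)1 << k);
}
```

Whether the add/sub form actually helps depends on which ports the surrounding instructions occupy on the target microarchitecture.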


caveats:

  • These methods rely on having a register set to -1, which may cause register spilling in some cases. In smaller code sections, or where -1 is already available, they may still be beneficial.
  • Using too many shifts can lead to contention on execution ports, so depending on what other instructions are scheduled, a broadcast may make better use of the ports.

dbabokin (Collaborator) commented Apr 8, 2024

This kind of trick is typically implemented in the LLVM backend (codegen). ISPC could handle it, but preferably it should be done in LLVM. I suggest verifying that LLVM doesn't already do this for C/C++ code (using vector extensions), then filing it in the LLVM project and linking this issue, so we can make sure it takes effect in ISPC once it's implemented.

It's important to note in the LLVM issue that this is about vector constants, as they would otherwise assume it's about scalars by default.
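
One possible starting point for that check is a small C file using the GCC/Clang vector extension with splat constants of the kind discussed in this issue. The function names are hypothetical, and nothing here claims what LLVM currently emits; the point is only to have something to inspect.

```c
#include <assert.h>
#include <stdint.h>

/* GCC/Clang vector extension: eight 32-bit lanes, the width of one ymm register. */
typedef int32_t v8i32 __attribute__((vector_size(32)));

/* Test case: AND every lane with the splat constant -8 (0xFFFFFFF8). */
void and_minus8(const v8i32 *in, v8i32 *out) {
    const v8i32 k = {-8, -8, -8, -8, -8, -8, -8, -8};
    *out = *in & k;
}

/* Test case: AND every lane with the splat constant 255 (0x000000FF). */
void and_255(const v8i32 *in, v8i32 *out) {
    const v8i32 k = {255, 255, 255, 255, 255, 255, 255, 255};
    *out = *in & k;
}

/* Functional check only; the interesting part is the generated assembly. */
void check_vector_constants(void) {
    v8i32 a = {0, 1, 7, 8, 9, 15, 16, -1};
    v8i32 r;
    and_minus8(&a, &r);
    assert(r[2] == 0 && r[4] == 8 && r[7] == -8);
    and_255(&a, &r);
    assert(r[1] == 1 && r[7] == 255);
}
```

Compiling with something like `clang -O2 -mavx2 -S` and reading the output should show whether each constant is loaded or broadcast from memory, or materialized with a compare and a shift.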

pbrubaker added the Performance label (All issues related to performance/code generation) on Apr 14, 2024