
i1 kernel arguments loaded as 4 bytes #2

Closed · arsenm opened this issue Nov 18, 2014 · 2 comments

arsenm (Contributor) commented Nov 18, 2014

These should be loaded as a single byte (an 8-bit load). If the ABI requires that they be promoted to 32 bits and occupy that much space, they should be marked with the zeroext parameter attribute.
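
For illustration, here is a minimal C sketch of the contract being described (the helper name and argument-buffer layout are hypothetical, not from this codebase): under zeroext semantics the producer guarantees the upper 24 bits of the promoted 32-bit slot are zero, so the consumer can use a plain byte load.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical sketch, assuming a little-endian argument buffer where
     * an i1 argument occupies a 32-bit slot.  With a zeroext guarantee the
     * upper 24 bits of the slot are zero, so an 8-bit load of the low byte
     * recovers the same value as the 4-byte load. */
    static uint8_t load_i1_arg(const void *arg_buf, size_t slot_offset) {
        uint8_t b;
        memcpy(&b, (const uint8_t *)arg_buf + slot_offset, 1); /* byte load */
        return b; /* 0 or 1 by the zeroext contract */
    }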

rampitec commented Dec 8, 2014

The OpenCL ABI does not mandate this size, since bool cannot be passed to a kernel; the ABI for these functions is one we are defining ourselves. From a performance perspective, 32 bits works better, so it is preferable. Beyond that, the HSAIL spec does not dictate any ABI for kernels.

arsenm (Contributor, Author) commented Apr 22, 2015

Fixed by ff1c637

arsenm closed this as completed Apr 22, 2015
grollinger pushed a commit to grollinger/HLC-HSAIL-Development-LLVM that referenced this issue May 27, 2015
…oring the subregister.

For 0-lane stores, we used to generate code similar to:

  fmov w8, s0
  str w8, [x0, x1, lsl #2]

instead of:

  str s0, [x0, x1, lsl #2]

To correct that: for store lane 0 patterns, directly match to STR <subreg>0.

Byte-sized instructions don't have the special case for a 0 index,
because FPR8s are defined to have untyped content.

rdar://16372710
Differential Revision: http://reviews.llvm.org/D6772


git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225181 91177308-0d34-0410-b5e6-96231b3b80d8
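
For context, the kind of source pattern the commit above improves can be written with NEON intrinsics; the following is an illustrative sketch, not a test case from the patch:

    #include <arm_neon.h>

    /* Illustrative only: store lane 0 of a float vector to base[idx].
     * With the fix this should lower to a single
     *   str s0, [x0, x1, lsl #2]
     * instead of an fmov into a GPR followed by a str. */
    void store_lane0(float *base, long idx, float32x4_t v) {
        vst1q_lane_f32(base + idx, v, 0);
    }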
grollinger pushed a commit to grollinger/HLC-HSAIL-Development-LLVM that referenced this issue May 27, 2015
…ew virtual function to end of subclass. NFC

The previous attempt at fixing this only moved the problem to the subclass
vtable. We can safely move the function into the subclass, so attempt to fix
it that way.



git-svn-id: https://llvm.org/svn/llvm-project/llvm/branches/release_36@236112 91177308-0d34-0410-b5e6-96231b3b80d8
arsenm pushed a commit that referenced this issue Jul 1, 2015
…in-register LUT technique.

Summary:
A description of this technique can be found here:
http://wm.ite.pl/articles/sse-popcount.html

The core of the idea is to use an in-register lookup table and the
PSHUFB instruction to compute the population count for the low and high
nibbles of each byte, and then to use horizontal sums to aggregate these
into vector population counts with wider element types.

On x86 there is an instruction that will directly compute the horizontal
sum for the low 8 and high 8 bytes, giving vNi64 popcount very easily.
Various tricks are used to get vNi32 and vNi16 from the vNi8 that the
LUT computes.
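
A minimal C sketch of the technique described above, using SSSE3 intrinsics (helper names are mine; this shows the idea, not the patch's exact output):

    #include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

    /* In-register LUT: the 16-byte table holds popcounts of 0..15, and
     * PSHUFB looks up the low and high nibble of every byte in parallel. */
    static __m128i popcount_epi8(__m128i v) {
        const __m128i lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                          1, 2, 2, 3, 2, 3, 3, 4);
        const __m128i nib = _mm_set1_epi8(0x0F);
        __m128i lo = _mm_and_si128(v, nib);
        __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), nib);
        return _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
                            _mm_shuffle_epi8(lut, hi));
    }

    /* PSADBW against zero horizontally sums the eight byte counts in each
     * 64-bit half, giving the v2i64 popcount directly. */
    static __m128i popcount_epi64(__m128i v) {
        return _mm_sad_epu8(popcount_epi8(v), _mm_setzero_si128());
    }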

The base implementation of this, and most of the work, was done by Bruno
in a follow-up to D6531. See Bruno's detailed post there for lots of
timing information about these changes.

I have extended Bruno's patch in the following ways:

0) I committed the new tests with baseline sequences so this shows
   a diff, and regenerated the tests using the update scripts.

1) Bruno had noticed and mentioned in IRC a redundant mask that
   I removed.

2) I introduced a particular optimization for the i32 vector cases where
   we use PSHL + PSADBW to compute the low i32 popcounts, and PSHUFD
   + PSADBW to compute doubled high i32 popcounts. This takes advantage
   of the fact that to line up the high i32 popcounts we have to shift
   them anyways, and we can shift them by one fewer bit to effectively
   divide the count by two. While the PSHUFD based horizontal add is no
   faster, it doesn't require registers or load traffic the way a mask
   would, and provides more ILP as it happens on different ports with
   high throughput. (A sketch of one such byte-to-i32 reduction appears
   after this list.)

3) I did some code cleanups throughout to simplify the implementation
   logic.

4) I refactored it to continue to use the parallel bitmath lowering when
   SSSE3 is not available to preserve the performance of that version on
   SSE2 targets where it is still much better than scalarizing as we'll
   still do a bitmath implementation of popcount even in scalar code
   there.
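
As a sketch of one byte-to-i32 reduction (using plain masking plus PSADBW rather than the PSHL/PSHUFD sequence item 2 describes, and reusing the hypothetical popcount_epi8 helper from the earlier sketch):

    #include <emmintrin.h>   /* SSE2: _mm_sad_epu8 (PSADBW) */

    /* `bytes` holds per-byte popcounts (e.g. from the PSHUFB LUT above).
     * Sum the four bytes of each even and each odd i32 lane separately;
     * PSADBW leaves each sum in the low 16 bits of its 64-bit lane. */
    static __m128i sum_bytes_to_epi32(__m128i bytes) {
        const __m128i zero    = _mm_setzero_si128();
        const __m128i lo_mask = _mm_set1_epi64x(0x00000000FFFFFFFFLL);
        __m128i even = _mm_sad_epu8(_mm_and_si128(bytes, lo_mask), zero);
        __m128i odd  = _mm_sad_epu8(_mm_andnot_si128(lo_mask, bytes), zero);
        /* The odd-lane sums land in the even i32 positions; shift them up
         * into the odd positions and merge. */
        return _mm_or_si128(even, _mm_slli_epi64(odd, 32));
    }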

With #1 and #2 above, I analyzed the result in IACA for sandybridge,
ivybridge, and haswell. In every case I measured, the throughput is the
same or better using the LUT lowering, even v2i64 and v4i64, and even
compared with using the native popcnt instruction! The latency of the
LUT lowering is often higher than the latency of the scalarized popcnt
instruction sequence, but I think those latency measurements are deeply
misleading. Keeping the operation fully in the vector unit and having
many chances for increased throughput seems much more likely to win.

With this, we can lower every integer vector popcount implementation
using the LUT strategy if we have SSSE3 or better (and thus have
PSHUFB). I've updated the operation lowering to reflect this. This also
fixes an issue where we were scalarizing horribly some AVX lowerings.

Finally, there are some remaining cleanups. There is duplication between
the two techniques in how they perform the horizontal sum once the byte
population count is computed. I'm going to factor and merge those two in
a separate follow-up commit.

Differential Revision: http://reviews.llvm.org/D10084

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238636 91177308-0d34-0410-b5e6-96231b3b80d8