
i1 kernel arguments loaded as 4 bytes #2

Closed · arsenm opened this issue Nov 18, 2014 · 2 comments

arsenm (Contributor) commented Nov 18, 2014

These should be loaded as a single byte (an 8-bit load). If the ABI requires that they be promoted to 32 bits and occupy that much space, they should be marked with the zeroext parameter attribute.
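
For illustration, here is a minimal C sketch of the contract being described (the helper name and argument-buffer layout are hypothetical, not from this codebase): under zeroext semantics the producer guarantees the upper 24 bits of the promoted 32-bit slot are zero, so the consumer can use a plain byte load.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical sketch, assuming a little-endian argument buffer where
     * an i1 argument occupies a 32-bit slot.  With a zeroext guarantee the
     * upper 24 bits of the slot are zero, so an 8-bit load of the low byte
     * recovers the same value as the 4-byte load. */
    static uint8_t load_i1_arg(const void *arg_buf, size_t slot_offset) {
        uint8_t b;
        memcpy(&b, (const uint8_t *)arg_buf + slot_offset, 1); /* byte load */
        return b; /* 0 or 1 by the zeroext contract */
    }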

rampitec commented Dec 8, 2014

The OpenCL ABI does not mandate this size, since bool cannot be passed to a kernel; the ABI for these functions is one we are defining ourselves. From a performance perspective, 32 bits works better, so it is preferable. Beyond that, the HSAIL spec does not dictate any ABI for kernels.

arsenm (Contributor, Author) commented Apr 22, 2015

Fixed by ff1c637

arsenm closed this as completed Apr 22, 2015
grollinger pushed a commit to grollinger/HLC-HSAIL-Development-LLVM that referenced this issue May 27, 2015
…oring the subregister.

For 0-lane stores, we used to generate code similar to:

  fmov w8, s0
  str w8, [x0, x1, lsl #2]

instead of:

  str s0, [x0, x1, lsl #2]

To correct that: for store lane 0 patterns, directly match to STR <subreg>0.

Byte-sized instructions don't have the special case for a 0 index,
because FPR8s are defined to have untyped content.

rdar://16372710
Differential Revision: http://reviews.llvm.org/D6772


git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225181 91177308-0d34-0410-b5e6-96231b3b80d8
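
For context, the kind of source pattern the commit above improves can be written with NEON intrinsics; the following is an illustrative sketch, not a test case from the patch:

    #include <arm_neon.h>

    /* Illustrative only: store lane 0 of a float vector to base[idx].
     * With the fix this should lower to a single
     *   str s0, [x0, x1, lsl #2]
     * instead of an fmov into a GPR followed by a str. */
    void store_lane0(float *base, long idx, float32x4_t v) {
        vst1q_lane_f32(base + idx, v, 0);
    }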
grollinger pushed a commit to grollinger/HLC-HSAIL-Development-LLVM that referenced this issue May 27, 2015
…ew virtual function to end of subclass. NFC

The previous attempt at fixing this only moved the problem to the subclass
vtable. We can safely move the function into the subclass, so attempt to fix
it that way.



git-svn-id: https://llvm.org/svn/llvm-project/llvm/branches/release_36@236112 91177308-0d34-0410-b5e6-96231b3b80d8
arsenm pushed a commit that referenced this issue Jul 1, 2015
…in-register LUT technique.

Summary:
A description of this technique can be found here:
http://wm.ite.pl/articles/sse-popcount.html

The core of the idea is to use an in-register lookup table and the
PSHUFB instruction to compute the population count for the low and high
nibbles of each byte, and then to use horizontal sums to aggregate these
into vector population counts with wider element types.

On x86 there is an instruction that will directly compute the horizontal
sum for the low 8 and high 8 bytes, giving vNi64 popcount very easily.
Various tricks are used to get vNi32 and vNi16 from the vNi8 that the
LUT computes.
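
A minimal C sketch of the technique described above, using SSSE3 intrinsics (helper names are mine; this shows the idea, not the patch's exact output):

    #include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

    /* In-register LUT: the 16-byte table holds popcounts of 0..15, and
     * PSHUFB looks up the low and high nibble of every byte in parallel. */
    static __m128i popcount_epi8(__m128i v) {
        const __m128i lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                          1, 2, 2, 3, 2, 3, 3, 4);
        const __m128i nib = _mm_set1_epi8(0x0F);
        __m128i lo = _mm_and_si128(v, nib);
        __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), nib);
        return _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
                            _mm_shuffle_epi8(lut, hi));
    }

    /* PSADBW against zero horizontally sums the eight byte counts in each
     * 64-bit half, giving the v2i64 popcount directly. */
    static __m128i popcount_epi64(__m128i v) {
        return _mm_sad_epu8(popcount_epi8(v), _mm_setzero_si128());
    }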

The base implementation of this, and most of the work, was done by Bruno
in a follow-up to D6531. See Bruno's detailed post there for lots of
timing information about these changes.

I have extended Bruno's patch in the following ways:

0) I committed the new tests with baseline sequences so this shows
   a diff, and regenerated the tests using the update scripts.

1) Bruno had noticed and mentioned in IRC a redundant mask that
   I removed.

2) I introduced a particular optimization for the i32 vector cases where
   we use PSHL + PSADBW to compute the low i32 popcounts, and PSHUFD
   + PSADBW to compute doubled high i32 popcounts. This takes advantage
   of the fact that to line up the high i32 popcounts we have to shift
   them anyways, and we can shift them by one fewer bit to effectively
   divide the count by two. While the PSHUFD based horizontal add is no
   faster, it doesn't require registers or load traffic the way a mask
   would, and provides more ILP as it happens on different ports with
   high throughput. (A sketch of one such byte-to-i32 reduction appears
   after this list.)

3) I did some code cleanups throughout to simplify the implementation
   logic.

4) I refactored it to continue to use the parallel bitmath lowering when
   SSSE3 is not available to preserve the performance of that version on
   SSE2 targets where it is still much better than scalarizing as we'll
   still do a bitmath implementation of popcount even in scalar code
   there.
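
As a sketch of one byte-to-i32 reduction (using plain masking plus PSADBW rather than the PSHL/PSHUFD sequence item 2 describes, and reusing the hypothetical popcount_epi8 helper from the earlier sketch):

    #include <emmintrin.h>   /* SSE2: _mm_sad_epu8 (PSADBW) */

    /* `bytes` holds per-byte popcounts (e.g. from the PSHUFB LUT above).
     * Sum the four bytes of each even and each odd i32 lane separately;
     * PSADBW leaves each sum in the low 16 bits of its 64-bit lane. */
    static __m128i sum_bytes_to_epi32(__m128i bytes) {
        const __m128i zero    = _mm_setzero_si128();
        const __m128i lo_mask = _mm_set1_epi64x(0x00000000FFFFFFFFLL);
        __m128i even = _mm_sad_epu8(_mm_and_si128(bytes, lo_mask), zero);
        __m128i odd  = _mm_sad_epu8(_mm_andnot_si128(lo_mask, bytes), zero);
        /* The odd-lane sums land in the even i32 positions; shift them up
         * into the odd positions and merge. */
        return _mm_or_si128(even, _mm_slli_epi64(odd, 32));
    }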

With #1 and #2 above, I analyzed the result in IACA for sandybridge,
ivybridge, and haswell. In every case I measured, the throughput is the
same or better using the LUT lowering, even v2i64 and v4i64, and even
compared with using the native popcnt instruction! The latency of the
LUT lowering is often higher than the latency of the scalarized popcnt
instruction sequence, but I think those latency measurements are deeply
misleading. Keeping the operation fully in the vector unit and having
many chances for increased throughput seems much more likely to win.

With this, we can lower every integer vector popcount implementation
using the LUT strategy if we have SSSE3 or better (and thus have
PSHUFB). I've updated the operation lowering to reflect this. This also
fixes an issue where we were scalarizing horribly some AVX lowerings.

Finally, there are some remaining cleanups. There is duplication between
the two techniques in how they perform the horizontal sum once the byte
population count is computed. I'm going to factor and merge those two in
a separate follow-up commit.

Differential Revision: http://reviews.llvm.org/D10084

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238636 91177308-0d34-0410-b5e6-96231b3b80d8