Fix invalid data layout for BRIGAsmPrinter on 64-bit HSAIL. #1

r-potter · 2014-10-16T15:07:35Z

Without this patch, HSAILDevice::getDataLayout() returns the same data layout string for both 32-bit and 64-bit HSAIL.

BRIGAsmPrinter uses this layout to calculate alignment requirements, so it assumes that pointers are 4-byte aligned even on 64-bit platforms. This manifests as 64-bit kernel arguments being aligned to 4-byte boundaries, as in the following kernel declaration:

prog kernel &_ZN8raytrace6camera22generate_rays_parallelE__ns1RKU3AS1NS_6matrixIfLm4ELm4EEES4_PU3AS1N2rt6sampleERU3AS1NS5_10ray_bufferE(
    align(4) kernarg_u64 %camera_to_world,
    align(4) kernarg_u64 %screen_to_camera,
    align(4) kernarg_u64 %samples,
    align(4) kernarg_u64 %rays)
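
To make the failure mode concrete, here is a minimal C++ sketch of how a pointer alignment falls out of a data layout string. It is not the backend's code: `dataLayoutFor`, `pointerAbiAlignBytes`, and both layout strings are hypothetical and only follow the general LLVM `p:<size>:<abi>` pointer spec (sizes in bits). If both targets were handed the 32-bit string, the computed alignment would be 4 bytes in both cases, which matches the `align(4)` seen above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical layout strings for illustration only. They are NOT the strings
// used by the HSAIL backend; only the pointer spec "p:<size>:<abi>" matters.
std::string dataLayoutFor(bool is64Bit) {
  return is64Bit ? "e-p:64:64" : "e-p:32:32";
}

// Return the pointer ABI alignment in bytes taken from the "p:<size>:<abi>"
// spec of a '-'-separated layout string, or 0 if no such spec is present.
unsigned pointerAbiAlignBytes(const std::string &layout) {
  std::size_t pos = 0;
  while (pos < layout.size()) {
    std::size_t end = layout.find('-', pos);
    if (end == std::string::npos)
      end = layout.size();
    const std::string spec = layout.substr(pos, end - pos);
    if (spec.rfind("p:", 0) == 0) {            // spec starts with "p:"
      const std::size_t colon = spec.find(':', 2);
      if (colon == std::string::npos)
        return 0;                              // no explicit ABI alignment
      const unsigned long bits = std::stoul(spec.substr(colon + 1));
      assert(bits % 8 == 0 && "alignment is expected in whole bytes");
      return static_cast<unsigned>(bits / 8);
    }
    pos = end + 1;
  }
  return 0;
}
```

With distinct strings per address width, a `kernarg_u64` pointer argument would be given 8-byte alignment on the 64-bit target and 4-byte alignment only on the 32-bit one.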

arsenm · 2014-10-16T17:13:59Z

Can you add a test for this in test/CodeGen/HSAIL?

r-potter · 2014-10-17T09:48:54Z

Test case added and braces/string initialization adjusted.

Fix invalid data layout for BRIGAsmPrinter on 64-bit HSAIL.

in-register LUT technique.

Summary:
A description of this technique can be found here:
http://wm.ite.pl/articles/sse-popcount.html

The core of the idea is to use an in-register lookup table and the PSHUFB instruction to compute the population count for the low and high nibbles of each byte, and then to use horizontal sums to aggregate these into vector population counts with wider element types.

On x86 there is an instruction that will directly compute the horizontal sum for the low 8 and high 8 bytes, giving vNi64 popcount very easily. Various tricks are used to get vNi32 and vNi16 from the vNi8 that the LUT computes.

The base implementation of this, and most of the work, was done by Bruno in a follow-up to D6531. See Bruno's detailed post there for lots of timing information about these changes.

I have extended Bruno's patch in the following ways:

0) I committed the new tests with baseline sequences so this shows a diff, and regenerated the tests using the update scripts.

1) Bruno had noticed and mentioned in IRC a redundant mask that I removed.

2) I introduced a particular optimization for the i32 vector cases where we use PSHL + PSADBW to compute the low i32 popcounts, and PSHUFD + PSADBW to compute doubled high i32 popcounts. This takes advantage of the fact that to line up the high i32 popcounts we have to shift them anyway, and we can shift them by one fewer bit to effectively divide the count by two. While the PSHUFD-based horizontal add is no faster, it doesn't require registers or load traffic the way a mask would, and it provides more ILP as it happens on different ports with high throughput.

3) I did some code cleanups throughout to simplify the implementation logic.

4) I refactored it to continue to use the parallel bitmath lowering when SSSE3 is not available. This preserves the performance of that version on SSE2 targets, where it is still much better than scalarizing, as we'll still do a bitmath implementation of popcount even in scalar code there.

With #1 and #2 above, I analyzed the result in IACA for sandybridge, ivybridge, and haswell. In every case I measured, the throughput is the same or better using the LUT lowering, even v2i64 and v4i64, and even compared with using the native popcnt instruction! The latency of the LUT lowering is often higher than the latency of the scalarized popcnt instruction sequence, but I think those latency measurements are deeply misleading. Keeping the operation fully in the vector unit and having many chances for increased throughput seems much more likely to win.

With this, we can lower every integer vector popcount implementation using the LUT strategy if we have SSSE3 or better (and thus have PSHUFB). I've updated the operation lowering to reflect this. This also fixes an issue where we were scalarizing some AVX lowerings horribly.

Finally, there are some remaining cleanups. There is duplication between the two techniques in how they perform the horizontal sum once the byte population count is computed. I'm going to factor and merge those two in a separate follow-up commit.

Differential Revision: http://reviews.llvm.org/D10084

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238636 91177308-0d34-0410-b5e6-96231b3b80d8
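
The nibble-LUT idea in the commit message above can be illustrated with a scalar C++ emulation. This is only a sketch of the arithmetic, not the vector lowering itself: the 16-entry array stands in for the table that PSHUFB would index inside a register, the loop over eight bytes stands in for the horizontal byte sum of one i64 lane, and the names `kNibblePopcount`, `popcountByte`, and `popcountLane64` are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Population count of every 4-bit value. In the vector version this table
// would live in a register and be indexed by PSHUFB.
static const std::uint8_t kNibblePopcount[16] = {0, 1, 1, 2, 1, 2, 2, 3,
                                                 1, 2, 2, 3, 2, 3, 3, 4};

// Popcount of one byte: look up the low and high nibbles and add the results.
std::uint8_t popcountByte(std::uint8_t b) {
  return static_cast<std::uint8_t>(kNibblePopcount[b & 0x0F] +
                                   kNibblePopcount[b >> 4]);
}

// Popcount of one 64-bit lane: sum the per-byte counts of its eight bytes,
// which is the role the horizontal sum plays for vNi64 elements.
unsigned popcountLane64(std::uint64_t lane) {
  unsigned sum = 0;
  for (int i = 0; i < 8; ++i)
    sum += popcountByte(static_cast<std::uint8_t>(lane >> (8 * i)));
  return sum;
}
```

The vector form performs the two lookups for all bytes of a register at once, so the per-byte work shown here is what gets parallelized.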

- Factor out code to query and modify the sign bit of a floating-point value as an integer. This also works if none of the target's integer types is big enough to hold all bits of the floating-point value.

- Legalize FABS(x) as FCOPYSIGN(x, 0.0) if FCOPYSIGN is available, otherwise perform bit manipulation on the sign bit. The previous code used "x >u 0 ? x : -x", which is incorrect for x being -0.0! It also takes 34 instructions on ARM Cortex-M4. With this patch we only require 5:

    vldr d0, LCPI0_0
    vmov r2, r3, d0
    lsrs r2, r3, #31
    bfi r1, r2, #31, #1
    bx lr

  (This could be further improved if the compiler would recognize that r2, r3 is zero.)

- Only lower FCOPYSIGN(x, y) = sign(y) ? -FABS(x) : FABS(x) if FABS is available, otherwise perform bit manipulation on the sign bit.

- Perform the sign(x) test by masking out the sign bit and comparing with 0 rather than shifting the sign bit to the highest position and testing for "<s 0". For x86 copysignl (on 80-bit values) this gets us:

    testl $32768, %eax

  rather than:

    shlq $48, %rax
    sets %al
    testb %al, %al

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@242107 91177308-0d34-0410-b5e6-96231b3b80d8
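
The sign-bit manipulation described in this commit message can be sketched in portable C++ as follows. This is an illustration under assumptions, not the legalizer code: it assumes `double` is a 64-bit IEEE-754 value, and the helper names (`bitsOf`, `fromBits`, `signBitSet`, `fabsBits`, `copysignBits`) are invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "sketch assumes a 64-bit double");

// Assumption: IEEE-754 binary64, where the sign is the most significant bit.
constexpr std::uint64_t kSignMask = 0x8000000000000000ULL;

// Reinterpret a double as its integer bit pattern and back.
std::uint64_t bitsOf(double x) {
  std::uint64_t b;
  std::memcpy(&b, &x, sizeof b);
  return b;
}
double fromBits(std::uint64_t b) {
  double x;
  std::memcpy(&x, &b, sizeof x);
  return x;
}

// Sign test: mask out the sign bit and compare with 0 (no shift needed).
bool signBitSet(double x) { return (bitsOf(x) & kSignMask) != 0; }

// FABS: clear the sign bit. This also maps -0.0 to +0.0.
double fabsBits(double x) { return fromBits(bitsOf(x) & ~kSignMask); }

// FCOPYSIGN: magnitude bits of `mag` combined with the sign bit of `sgn`.
double copysignBits(double mag, double sgn) {
  return fromBits((bitsOf(mag) & ~kSignMask) | (bitsOf(sgn) & kSignMask));
}
```

Because only the sign bit is touched, signed zeros are handled without any floating-point comparison.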

Fix misaligned kernargs on 64-bit HSAIL.

e5282a1

Corrected for LLVM style and added test case.

708df9a

arsenm added a commit that referenced this pull request Oct 17, 2014

Merge pull request #1 from r-potter/hsail-3.6

553a70f

Fix invalid data layout for BRIGAsmPrinter on 64-bit HSAIL.

arsenm merged commit 553a70f into HSAFoundation:hsail-3.6 Oct 17, 2014

arsenm mentioned this pull request Jul 15, 2015

Crash in machine verifier #14

Closed

arsenm mentioned this pull request Aug 11, 2015

opt crashes in SROA #17

Open
