Altivec bitshuffle #98

FrancescAlted · 2019-12-24T10:18:53Z

This PR is for:

Extend work to types larger than 8-bit
Handle the leftovers (buffer size is not a multiple of the block size)
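For readers unfamiliar with the leftovers issue, here is a minimal scalar sketch in C. It is not the code in this PR. It assumes 1-byte elements and a hypothetical grouping of 8 elements, and the names `bitshuffle1_scalar` and `bitshuffle1_with_leftovers` are made up for illustration. The part of the buffer that fills whole groups is bit-transposed, and the trailing bytes are copied verbatim.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar reference: bit-transpose n one-byte elements.
 * n must be a multiple of 8. Bit b of element i ends up in bit plane b,
 * at byte i / 8 of that plane, in bit position i % 8. */
void bitshuffle1_scalar(const uint8_t *src, uint8_t *dst, size_t n) {
    assert(n % 8 == 0);
    size_t plane_bytes = n / 8;  /* bytes per bit plane */
    memset(dst, 0, n);
    for (size_t i = 0; i < n; i++) {
        for (int b = 0; b < 8; b++) {
            uint8_t bit = (uint8_t)((src[i] >> b) & 1u);
            dst[(size_t)b * plane_bytes + i / 8] |= (uint8_t)(bit << (i % 8));
        }
    }
}

/* Hypothetical wrapper: shuffle the largest prefix that is a multiple of
 * 8 elements and copy the leftovers unchanged.
 * Returns the number of bytes that were actually bit-shuffled. */
size_t bitshuffle1_with_leftovers(const uint8_t *src, uint8_t *dst,
                                  size_t nbytes) {
    size_t main_part = nbytes - nbytes % 8;
    bitshuffle1_scalar(src, dst, main_part);
    memcpy(dst + main_part, src + main_part, nbytes - main_part);
    return main_part;
}
```

With a 10-byte buffer, the first 8 bytes are transposed and the last 2 are copied as they are. A vectorized kernel would replace only the inner scalar routine, and the leftover handling would stay the same.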

It is still work in progress.

FrancescAlted · 2020-01-02T17:36:09Z

This is complete and ready for merging (bar some suggestions or comments). @kif can you review?

FrancescAlted · 2020-01-10T10:11:11Z

I have removed a duplicated bitshuffle1_altivec function, but bitunshuffle1_altivec is different because the function that replicates its functionality (bshuf_trans_byte_bitrow_altivec) is implemented quite differently. @kif could you please do some performance measurements and tell me whether bitunshuffle1_altivec has an advantage for type_size == 1?

kif · 2020-01-13T12:46:54Z

I am running the benchmark ...

kif · 2020-01-14T08:42:28Z

Here are some graphs I obtained versus master (which uses the SSE2 auto-translated code) for various block sizes:

kif · 2020-01-14T08:46:43Z

I also performed a benchmark of the bitshuffle1_altivec (labelled JK) versus bshuf_trans_byte_bitrow_altivec (labelled FA):

FrancescAlted · 2020-01-14T09:26:49Z

Yes, I did notice the drop in performance of the VSX version of bitunshuffle for typesize > 1. However, as Blosc2 typically shuffles/unshuffles blocks of at most 1 MB, I don't think the drop in performance in this region is too bad.

But if, for some reason, one absolutely wants better performance for blocks > 1 MB, another possibility is to find a direct replacement for __mm_store_pi and come up with an algorithm similar to the one in master, but using VSX. Finally, if getting rid of SSE2 is not deemed absolutely necessary, one may want to go back to the original bshuf_trans_byte_bitrow_altivec.

FrancescAlted · 2020-01-14T09:28:02Z

@kif Something went wrong in pasting the plot for bitshuffle1_altivec versus bshuf_trans_byte_bitrow_altivec.

kif · 2020-01-14T09:46:31Z

I noticed, but my notebook kernel crashed and I had to relaunch it to get the last image.

FrancescAlted · 2020-01-14T10:01:58Z

> I also performed a benchmark of the bitshuffle1_altivec (labelled JK) versus bshuf_trans_byte_bitrow_altivec (labelled FA):

That's pretty cool. I'd say that we want to use the bitunshuffle1_altivec instead of bshuf_trans_byte_bitrow_altivec for typesize == 1. A switch case would be enough for this.

OTOH, I am not sure why bitshuffle1_altivec is faster than bshuf_trans_byte_bitrow_altivec as at a glance they look pretty much the same. @kif are you aware of a significant difference?

kif · 2020-01-14T20:06:27Z

If I remember correctly, the code does not work in all conditions: there is an additional constraint on the block size, which has to be a multiple of 16 (vector size) * 8 (bits/item), and that makes it not that easy to use.

FrancescAlted · 2020-01-15T09:22:47Z

I see. Well, one possibility is to use the accelerated path only when the condition blocksize % (16 * 8) == 0 holds (which should be a fairly common case, as Blosc typically splits blocks into sizes that are powers of 2). I think this approach will let us quickly pick the low-hanging performance fruit.
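A minimal sketch in C of the guard described above, combined with the switch on typesize suggested earlier in the thread. The enum, the macro names and `choose_bitshuffle_path` are hypothetical and are not part of the Blosc2 sources. The sketch only decides which path to take; the actual kernels are left out.

```c
#include <assert.h>
#include <stddef.h>

#define VECTOR_BYTES 16   /* width of an Altivec/VSX register, in bytes */
#define BITS_PER_ITEM 8   /* bits per 1-byte item */

/* Hypothetical labels for the two code paths discussed in the thread. */
typedef enum { PATH_GENERIC, PATH_ACCELERATED } bshuf_path;

/* Pick the accelerated path only for typesize == 1 and only when the
 * block size is a non-zero multiple of 16 * 8 = 128 bytes.
 * Everything else falls back to the generic path. */
bshuf_path choose_bitshuffle_path(size_t blocksize, size_t typesize) {
    assert(typesize > 0);
    switch (typesize) {
    case 1:
        if (blocksize > 0 &&
            blocksize % (VECTOR_BYTES * BITS_PER_ITEM) == 0) {
            return PATH_ACCELERATED;
        }
        return PATH_GENERIC;
    default:
        return PATH_GENERIC;
    }
}
```

For example, a 1024-byte block with typesize 1 would take the accelerated path, while a 1000-byte block, or any block with typesize 4, would stay on the generic one.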

FrancescAlted · 2020-04-22T06:52:59Z

Merging.

FrancescAlted and others added 22 commits December 24, 2019 10:49

Add missing transpose-altivec.h

9007724

Add the type of loop index

006c3c4

Express the shuffling of 16-bit in terms of Altivec primitives

3af91ea

Use the byte-shuffle technique in transpose-altivec.h for a much better speed

f83743f

Use byte transpose routines explicitly

2c10bda

Use transpose4x16 routine explicitly

efd1ee7

Use transpose8x16 routine explicitly

c5ecd6f

Support for accelerated 128-bit types via transpose16x16

071d36d

Remove several dependencies of SSE2 functions

abe0ccc

Use VSX functions through all bshuf_trans_byte_bitrow_altivec

216527b

Use scalar version for odd values of elem_size

c349121

Use the scalar version for all odd values except 1

328e40b

Re-express permutations using vec_perm()

74d7187

Remove unnecessary emmittrin.h

cfafd82

Free memory after blosc_get_complib_info()

f846959

Fix some pending test_getitem() tests

febde12

Fix warnings for altivec

27a68da

Fix some data types in loops

b30d0b7

Sync with Jerome's transpose-altivec.h

6bde477

Fix warnings for suffle-altivec

2ddfa43

Minor cleanup

5f63fa0

More minor cleanups

4ec34f2

Remove duplicated function

dfabc27

FrancescAlted merged commit 6d23ce8 into master Apr 22, 2020

FrancescAlted mentioned this pull request Apr 22, 2020

Use an accelerated path for ALTIVEC-aware bitshuffle #106

Open
