Altivec bitshuffle #98

Merged
merged 23 commits into from Apr 22, 2020

Member

@FrancescAlted FrancescAlted commented Dec 24, 2019

This PR aims to:

  • Extend the work to types larger than 8 bits
  • Handle the leftovers (when the buffer size is not a multiple of the block size)

It is still a work in progress.
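As an illustration of the leftover handling mentioned in the list above, here is a minimal sketch in plain C for 1-byte elements. It is not the code in this PR: `KERNEL_BYTES`, `bitshuffle1_range` and `bitshuffle1_with_leftovers` are hypothetical names, the bit ordering is an assumption, and the "vector" part is emulated with the same scalar loop so that the example runs on any platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KERNEL_BYTES 16  /* hypothetical: bytes consumed per vector iteration */

/* Bit-transpose input bytes [start, stop) of an n-byte buffer of 1-byte
 * elements. Output row k (n / 8 bytes long) collects bit k of every element.
 * start and stop must be multiples of 8. */
void bitshuffle1_range(const uint8_t *in, uint8_t *out, size_t n,
                       size_t start, size_t stop)
{
    size_t row_bytes = n / 8;
    for (size_t i = start / 8; i < stop / 8; i++) {
        for (int k = 0; k < 8; k++) {
            uint8_t packed = 0;
            for (int j = 0; j < 8; j++) {
                /* Take bit k of element j in this group, store it as bit j. */
                packed |= (uint8_t)(((in[i * 8 + j] >> k) & 1u) << j);
            }
            out[(size_t)k * row_bytes + i] = packed;
        }
    }
}

/* Split the buffer into a bulk part (a multiple of KERNEL_BYTES) and a tail. */
void bitshuffle1_with_leftovers(const uint8_t *in, uint8_t *out, size_t n)
{
    assert(n % 8 == 0);  /* assumption: element count is a multiple of 8 */
    size_t bulk = n - (n % KERNEL_BYTES);
    bitshuffle1_range(in, out, n, 0, bulk);  /* stand-in for the vector kernel */
    bitshuffle1_range(in, out, n, bulk, n);  /* scalar pass over the leftovers */
}
```

With a 40-byte buffer, the first call covers bytes 0 to 31 and the second call covers the remaining 8 bytes, while both write into the same output layout.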

@FrancescAlted
Member Author

This is complete and ready for merging (bar some suggestions or comments). @kif can you review?

@FrancescAlted
Member Author

I have removed a duplicated bitshuffle1_altivec function, but bitunshuffle1_altivec is different because the function that replicates its functionality (bshuf_trans_byte_bitrow_altivec) is implemented quite differently. @kif could you please do some performance measurements and tell me whether bitunshuffle1_altivec has an advantage for type_size == 1?

@kif
Contributor

kif commented Jan 13, 2020

I am running the benchmark ...

@kif
Contributor

kif commented Jan 14, 2020

Here are some graphs comparing this branch against master (which uses the auto-translated SSE2 code) for various block sizes:

[image: bitshuffle_altivec_16bits]
[image: bitshuffle_altivec_32bits]
[image: bitshuffle_altivec_8bits]
[image: bitshuffle_altivec_64bits]

@kif
Contributor

kif commented Jan 14, 2020

I also performed a benchmark of the bitshuffle1_altivec (labelled JK) versus bshuf_trans_byte_bitrow_altivec (labelled FA):
[image: bitshuffle_implementations_8bits]

@FrancescAlted
Member Author

Yes, I did notice the drop in performance of the VSX version for bitunshuffle for typesize > 1. However, as Blosc2 typically shuffles/unshuffles blocks of at most 1 MB, I don't think the drop in performance in this region is too bad.

But if, for some reason, one absolutely wants better performance for blocks > 1 MB, another possibility is to find a direct replacement for __mm_store_pi and come up with an algorithm similar to the one in master, but using VSX. Finally, if getting rid of SSE2 is not deemed absolutely necessary, one may want to go back to the original bshuf_trans_byte_bitrow_altivec.

@FrancescAlted
Member Author

@kif Something went wrong in pasting the plot for bitshuffle1_altivec versus bshuf_trans_byte_bitrow_altivec.

@kif
Contributor

kif commented Jan 14, 2020

I noticed, but my notebook kernel crashed and I had to relaunch it to get the last image.

@FrancescAlted
Member Author

FrancescAlted commented Jan 14, 2020

> I also performed a benchmark of the bitshuffle1_altivec (labelled JK) versus bshuf_trans_byte_bitrow_altivec (labelled FA):
> [image: bitshuffle_implementations_8bits]

That's pretty cool. I'd say that we want to use bitunshuffle1_altivec instead of bshuf_trans_byte_bitrow_altivec for typesize == 1. A switch case would be enough for this.

OTOH, I am not sure why bitshuffle1_altivec is faster than bshuf_trans_byte_bitrow_altivec as at a glance they look pretty much the same. @kif are you aware of a significant difference?
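The switch on typesize suggested above could look roughly like the following sketch. The enum and the function `select_unshuffle_path` are hypothetical and only name the two code paths discussed in this thread; the real kernels and their signatures are not reproduced here.

```c
#include <stddef.h>

/* Hypothetical labels for the two code paths discussed in this thread. */
typedef enum {
    PATH_BITUNSHUFFLE1_ALTIVEC,           /* specialised kernel for 1-byte types */
    PATH_BSHUF_TRANS_BYTE_BITROW_ALTIVEC  /* generic kernel for larger types */
} unshuffle_path;

/* Pick a code path from the element size in bytes. */
unshuffle_path select_unshuffle_path(size_t typesize)
{
    switch (typesize) {
    case 1:
        return PATH_BITUNSHUFFLE1_ALTIVEC;
    default:
        return PATH_BSHUF_TRANS_BYTE_BITROW_ALTIVEC;
    }
}
```

In a real dispatcher each case would call the corresponding kernel directly; returning a label just keeps the sketch independent of the kernels' actual argument lists.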

@kif
Contributor

kif commented Jan 14, 2020 via email

@FrancescAlted
Member Author

I see. Well, one possibility is to use the accelerated path only when the condition blocksize % (16 * 8) == 0 holds (which should be a fairly common case, as Blosc typically splits blocks into sizes that are powers of 2). I think this approach will let us quickly pick the low-hanging performance fruit.
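A minimal sketch of that guard, assuming 16 is the width in bytes of one 128-bit vector register and 8 is the number of bits per byte. The names `VECTOR_BYTES`, `BITS_PER_BYTE` and `can_use_accelerated_path` are hypothetical and do not come from the PR.

```c
#include <stddef.h>

#define VECTOR_BYTES 16   /* assumption: one 128-bit AltiVec/VSX register */
#define BITS_PER_BYTE 8

/* Returns 1 when blocksize is a whole number of 16 * 8 = 128-byte chunks,
 * 0 otherwise. */
int can_use_accelerated_path(size_t blocksize)
{
    return blocksize % (VECTOR_BYTES * BITS_PER_BYTE) == 0;
}
```

Any power-of-2 block size of at least 128 bytes passes the check (for example 32768), whereas sizes such as 64 or 1000 would fall back to the generic path.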

@FrancescAlted
Member Author

Merging.

2 participants