SIMD: faster vint4 load/store with unsigned char conversion #4071

aras-p · 2023-12-08T14:47:41Z

Description

vint4::load from an unsigned char pointer gained a pre-SSE4 code path. Tested on a Ryzen 5950X with VS2022 (with only SSE2 enabled in the build):

vint4 load from unsigned char[]: 946.1 -> 4232.8 Mvals/sec
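
For readers unfamiliar with the technique, here is a minimal sketch of how four unsigned chars can be widened to four 32-bit integer lanes using only SSE2, by interleaving with zero twice. The function name `load_uchar4_sketch` is hypothetical and this is not the code from the PR; it only illustrates the general idea of a pre-SSE4 path (SSE4.1 offers a single-instruction widening load instead). A scalar fallback keeps it compilable on non-x86 targets.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#  include <emmintrin.h>
#  define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper (not an OIIO API): widen 4 unsigned chars to 4 int32 lanes.
inline void load_uchar4_sketch(const unsigned char* src, int32_t out[4])
{
    assert(src && out);
#ifdef SKETCH_HAVE_SSE2
    int32_t packed;
    std::memcpy(&packed, src, sizeof(packed));   // read the 4 bytes as one 32-bit value
    __m128i v          = _mm_cvtsi32_si128(packed);
    const __m128i zero = _mm_setzero_si128();
    v = _mm_unpacklo_epi8(v, zero);    // 8-bit -> 16-bit, zero-extended
    v = _mm_unpacklo_epi16(v, zero);   // 16-bit -> 32-bit, zero-extended
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
#else
    // Scalar fallback with the same result.
    for (int i = 0; i < 4; ++i)
        out[i] = src[i];
#endif
}
```

Because the unpack instructions interleave with zero, values of 128 and above stay positive after widening.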

vint4::store to an unsigned char pointer gained a simpler, faster SSE code path and a new NEON code path. It is also now covered by a correctness test, including the behavior for values outside the unsigned char range (the current behavior keeps only the lowest byte of each integer lane, i.e. it does not clamp).

vint4 store to unsigned char[]: 3489.8 -> 3979.3 Mvals/sec
vint8 store to unsigned char[]: 5516.9 -> 7325.3 Mvals/sec
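
To make the "masks lowest byte, does not clamp" behavior described above concrete, here is a small sketch under stated assumptions. The function name `store_uchar4_sketch` is hypothetical and the instruction sequence is one plausible SSE2 approach, not necessarily the one in the PR: each lane is masked to its low byte first, so the saturating pack instructions never actually saturate.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#  include <emmintrin.h>
#  define SKETCH_STORE_HAVE_SSE2 1
#endif

// Hypothetical helper (not an OIIO API): store the low byte of each of
// 4 int32 lanes to 4 unsigned chars, without clamping.
inline void store_uchar4_sketch(const int32_t in[4], unsigned char* dst)
{
    assert(in && dst);
#ifdef SKETCH_STORE_HAVE_SSE2
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    v = _mm_and_si128(v, _mm_set1_epi32(0xff));  // keep only the lowest byte per lane
    v = _mm_packs_epi32(v, v);    // 32 -> 16 bit; inputs are 0..255, so no saturation
    v = _mm_packus_epi16(v, v);   // 16 -> 8 bit; inputs are 0..255, so no saturation
    const int32_t packed = _mm_cvtsi128_si32(v);
    std::memcpy(dst, &packed, sizeof(packed));
#else
    // Scalar fallback with the same masking semantics.
    for (int i = 0; i < 4; ++i)
        dst[i] = static_cast<unsigned char>(in[i] & 0xff);
#endif
}
```

With this masking behavior, an input of 256 is stored as 0 and an input of -1 is stored as 255, whereas a clamping store would produce 255 and 0 respectively.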

NEON code path as tested on Mac M1 Max (clang 15):

vint4 store to unsigned char[]: 4137.2 -> 6074.8 Mvals/sec

Tests

vint4 store to an unsigned char pointer now has an actual correctness test, which seemingly was lacking before (there was only a benchmark test).

Checklist:

I have read the contribution guidelines.
I have updated the documentation, if applicable.
I have ensured that the change is tested somewhere in the testsuite (adding new test cases if necessary).
If I added or modified a C++ API call, I have also amended the corresponding Python bindings (and if altering ImageBufAlgo functions, also exposed the new functionality as oiiotool options).
My code follows the prevailing code style of this project. If I haven't already run clang-format before submitting, I definitely will look at the CI test that runs clang-format and fix anything that it highlights as being nonconforming.

linux-foundation-easycla · 2023-12-08T14:47:46Z

The committers listed above are authorized under a signed CLA.

✅ login: aras-p / name: Aras Pranckevičius (09de3c4)

lgritz

Outstanding, thank you, Aras!

lgritz · 2023-12-08T23:31:16Z

N.B. The failed "bleeding edge" CI test is because something has changed in the trunk of OpenJPEG and is unrelated to these changes, so I will merge.

Primarily, recent changes (PR #4071) to vint4::store for the NEON case appear to have some type mismatches, which apple clang on ARM-based Mac (including our CI) seems ok with, but which is generating type errors on other ARM Linux platforms. I think the types were weird here, so I tightened it up to get the types right for temporary variables in that function. That's the primary fix here.

Secondarily, I modified simd.h and the CMake setup so that build option USE_SIMD=0 will disable NEON in the same way that it disables SSE. (I realized that USE_SIMD=0 was not disabling NEON, so there was no way for a NEON platform to completely disable SIMD if they needed to.)

Fixes #4111

Signed-off-by: Larry Gritz <lg@larrygritz.com>
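
As an illustration of what "getting the types right for temporary variables" can look like in a NEON narrowing store, here is a hedged sketch. It is not the code from either PR: the function name `store_uchar4_neon_sketch` is hypothetical and the exact mismatch that was fixed is not shown in this thread. The point is only that each intermediate is declared with the exact vector type its intrinsic returns, and that signed/unsigned changes go through an explicit `vreinterpret`, which stricter compilers require. A scalar fallback keeps it compilable on non-ARM targets.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#  include <arm_neon.h>
#  define SKETCH_HAVE_NEON 1
#endif

// Hypothetical helper (not an OIIO API): store the low byte of each of
// 4 int32 lanes to 4 unsigned chars, with explicitly typed temporaries.
inline void store_uchar4_neon_sketch(const int32_t in[4], unsigned char* dst)
{
    assert(in && dst);
#ifdef SKETCH_HAVE_NEON
    const int32x4_t  v32s  = vld1q_s32(in);
    const uint32x4_t v32u  = vreinterpretq_u32_s32(v32s);  // explicit signed -> unsigned
    const uint16x4_t v16   = vmovn_u32(v32u);              // truncating 32 -> 16 bit narrow
    const uint16x8_t v16x8 = vcombine_u16(v16, v16);
    const uint8x8_t  v8    = vmovn_u16(v16x8);             // truncating 16 -> 8 bit narrow
    const uint32_t packed  = vget_lane_u32(vreinterpret_u32_u8(v8), 0);
    std::memcpy(dst, &packed, sizeof(packed));
#else
    // Scalar fallback: same truncating (non-clamping) semantics.
    for (int i = 0; i < 4; ++i)
        dst[i] = static_cast<unsigned char>(in[i] & 0xff);
#endif
}
```

The truncating `vmovn` narrows keep only the low bits of each lane, which matches the masking (non-clamping) behavior described in the PR description.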

aras-p force-pushed the simd-loadstore-opts branch from 0f4d893 to 6a0be5e on December 8, 2023 14:51

aras-p force-pushed the simd-loadstore-opts branch from 6a0be5e to 09de3c4 on December 8, 2023 15:10

lgritz approved these changes Dec 8, 2023

lgritz merged commit d58e4aa into AcademySoftwareFoundation:master Dec 8, 2023
24 of 25 checks passed

lgritz mentioned this pull request Feb 7, 2024

fix(simd.h): Address NEON issues #4143

Merged