Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIMD: faster vint4 load/store with unsigned char conversion #4071

Merged

Conversation

aras-p
Copy link
Contributor

@aras-p aras-p commented Dec 8, 2023

Description

vint4::load from unsigned char pointer got pre-SSE4 code path. Testing on Ryzen 5950X / VS2022 (with only SSE2 enabled in the build):

  • vint4 load from unsigned char[]: 946.1 -> 4232.8 Mvals/sec

vint4::store to unsigned char pointer got simpler/faster SSE code path, and a NEON code path. Additionally, it got test correctness coverage, including what happens to values outside of unsigned char range (current behavior just masks lowest byte, i.e. does not clamp the integer lanes).

  • vint4 store to unsigned char[]: 3489.8 -> 3979.3 Mvals/sec
  • vint8 store to unsigned char[]: 5516.9 -> 7325.3 Mvals/sec

NEON code path as tested on Mac M1 Max (clang 15):

  • vint4 store to unsigned char[]: 4137.2 -> 6074.8 Mvals/sec

Tests

vint4 store to unsigned char pointer got actual correctness checking test, which seemingly was lacking before (only had a benchmark test).

Checklist:

  • I have read the contribution guidelines.
  • I have updated the documentation, if applicable.
  • I have ensured that the change is tested somewhere in the testsuite
    (adding new test cases if necessary).
  • If I added or modified a C++ API call, I have also amended the
    corresponding Python bindings (and if altering ImageBufAlgo functions, also
    exposed the new functionality as oiiotool options).
  • My code follows the prevailing code style of this project. If I haven't
    already run clang-format before submitting, I definitely will look at the CI
    test that runs clang-format and fix anything that it highlights as being
    nonconforming.

Copy link

linux-foundation-easycla bot commented Dec 8, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: aras-p / name: Aras Pranckevičius (09de3c4)

vint4::load from unsigned char pointer got pre-SSE4 code path. Testing
on Ryzen 5950X / VS2022 (with only SSE2 enabled in the build):
- vint4 load from unsigned char[]: 946.1 -> 4232.8 Mvals/sec

vint4::store to unsigned char pointer got simpler/faster SSE code path,
and a NEON code path. Additionally, it got test correctness coverage,
including what happens to values outside of unsigned char range
(current behavior just masks lowest byte, i.e. does not clamp the
integer lanes).

- vint4 store to unsigned char[]: 3489.8 -> 3979.3 Mvals/sec
- vint8 store to unsigned char[]: 5516.9 -> 7325.3 Mvals/sec

NEON code path as tested on Mac M1 Max (clang 15):
- vint4 store to unsigned char[]: 4137.2 -> 6074.8 Mvals/sec

Signed-off-by: Aras Pranckevicius <aras@nesnausk.org>
Copy link
Collaborator

@lgritz lgritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outstanding, thank you, Aras!

@lgritz
Copy link
Collaborator

lgritz commented Dec 8, 2023

N.B. The failed "bleeding edge" CI test is because something has changed in the trunk of OpenJPEG and is unrelated to these changes, so I will merge.

@lgritz lgritz merged commit d58e4aa into AcademySoftwareFoundation:master Dec 8, 2023
24 of 25 checks passed
lgritz pushed a commit to lgritz/OpenImageIO that referenced this pull request Dec 10, 2023
…ademySoftwareFoundation#4071)

## simd.h improvements:

vint4::load from unsigned char pointer got pre-SSE4 code path. Testing
on Ryzen 5950X / VS2022 (with only SSE2 enabled in the build):
- vint4 load from unsigned char[]: 946.1 -> 4232.8 Mvals/sec

vint4::store to unsigned char pointer got simpler/faster SSE code path,
and a NEON code path. Additionally, it got test correctness coverage,
including what happens to values outside of unsigned char range (current
behavior just masks lowest byte, i.e. does not clamp the integer lanes).

- vint4 store to unsigned char[]: 3489.8 -> 3979.3 Mvals/sec
- vint8 store to unsigned char[]: 5516.9 -> 7325.3 Mvals/sec

NEON code path as tested on Mac M1 Max (clang 15):
- vint4 store to unsigned char[]: 4137.2 -> 6074.8 Mvals/sec

## Tests

vint4 store to unsigned char pointer got actual correctness checking
test, which seemingly was lacking before (only had a benchmark test).
lgritz added a commit that referenced this pull request Feb 8, 2024
Primarily, recent changes (PR #4071) to vint4::store for the NEON case
appear to have some type mismatches, which apple clang on ARM-based Mac
(including our CI) seems ok with, but which is generating type errors on
other ARM Linux platforms.

I think the types were weird here, so I tightened it up to get the types
right for temporary variables in that function. That's the primary fix
here.

Secondarily, I modified simd.h and the CMake setup so that build option
USE_SIMD=0 will disable NEON in the same way that it disables SSE. (I
realized that USE_SIMD=0 was not disabling NEON, so there was no way for
a NEON platform to completely disable SIMD if they needed to.)

Fixes #4111

Signed-off-by: Larry Gritz <lg@larrygritz.com>
lgritz added a commit to lgritz/OpenImageIO that referenced this pull request Feb 8, 2024
Primarily, recent changes (PR AcademySoftwareFoundation#4071) to vint4::store for the NEON case
appear to have some type mismatches, which apple clang on ARM-based Mac
(including our CI) seems ok with, but which is generating type errors on
other ARM Linux platforms.

I think the types were weird here, so I tightened it up to get the types
right for temporary variables in that function. That's the primary fix
here.

Secondarily, I modified simd.h and the CMake setup so that build option
USE_SIMD=0 will disable NEON in the same way that it disables SSE. (I
realized that USE_SIMD=0 was not disabling NEON, so there was no way for
a NEON platform to completely disable SIMD if they needed to.)

Fixes AcademySoftwareFoundation#4111

Signed-off-by: Larry Gritz <lg@larrygritz.com>
1div0 pushed a commit to 1div0/OpenImageIO that referenced this pull request Feb 24, 2024
…ademySoftwareFoundation#4071)

## simd.h improvements:

vint4::load from unsigned char pointer got pre-SSE4 code path. Testing
on Ryzen 5950X / VS2022 (with only SSE2 enabled in the build):
- vint4 load from unsigned char[]: 946.1 -> 4232.8 Mvals/sec

vint4::store to unsigned char pointer got simpler/faster SSE code path,
and a NEON code path. Additionally, it got test correctness coverage,
including what happens to values outside of unsigned char range (current
behavior just masks lowest byte, i.e. does not clamp the integer lanes).

- vint4 store to unsigned char[]: 3489.8 -> 3979.3 Mvals/sec
- vint8 store to unsigned char[]: 5516.9 -> 7325.3 Mvals/sec

NEON code path as tested on Mac M1 Max (clang 15):
- vint4 store to unsigned char[]: 4137.2 -> 6074.8 Mvals/sec

## Tests

vint4 store to unsigned char pointer got actual correctness checking
test, which seemingly was lacking before (only had a benchmark test).

Signed-off-by: Peter Kovář <peter.kovar@reflexion.tv>
1div0 pushed a commit to 1div0/OpenImageIO that referenced this pull request Feb 24, 2024
Primarily, recent changes (PR AcademySoftwareFoundation#4071) to vint4::store for the NEON case
appear to have some type mismatches, which apple clang on ARM-based Mac
(including our CI) seems ok with, but which is generating type errors on
other ARM Linux platforms.

I think the types were weird here, so I tightened it up to get the types
right for temporary variables in that function. That's the primary fix
here.

Secondarily, I modified simd.h and the CMake setup so that build option
USE_SIMD=0 will disable NEON in the same way that it disables SSE. (I
realized that USE_SIMD=0 was not disabling NEON, so there was no way for
a NEON platform to completely disable SIMD if they needed to.)

Fixes AcademySoftwareFoundation#4111

Signed-off-by: Larry Gritz <lg@larrygritz.com>
Signed-off-by: Peter Kovář <peter.kovar@reflexion.tv>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants