-
Notifications
You must be signed in to change notification settings - Fork 566
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SIMD: faster vint4 load/store with unsigned char conversion #4071
Merged
lgritz
merged 1 commit into
AcademySoftwareFoundation:master
from
aras-p:simd-loadstore-opts
Dec 8, 2023
Merged
SIMD: faster vint4 load/store with unsigned char conversion #4071
lgritz
merged 1 commit into
AcademySoftwareFoundation:master
from
aras-p:simd-loadstore-opts
Dec 8, 2023
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
|
aras-p
force-pushed
the
simd-loadstore-opts
branch
from
December 8, 2023 14:51
0f4d893
to
6a0be5e
Compare
vint4::load from unsigned char pointer got pre-SSE4 code path. Testing on Ryzen 5950X / VS2022 (with only SSE2 enabled in the build): - vint4 load from unsigned char[]: 946.1 -> 4232.8 Mvals/sec vint4::store to unsigned char pointer got simpler/faster SSE code path, and a NEON code path. Additionally, it got test correctness coverage, including what happens to values outside of unsigned char range (current behavior just masks lowest byte, i.e. does not clamp the integer lanes). - vint4 store to unsigned char[]: 3489.8 -> 3979.3 Mvals/sec - vint8 store to unsigned char[]: 5516.9 -> 7325.3 Mvals/sec NEON code path as tested on Mac M1 Max (clang 15): - vint4 store to unsigned char[]: 4137.2 -> 6074.8 Mvals/sec Signed-off-by: Aras Pranckevicius <aras@nesnausk.org>
aras-p
force-pushed
the
simd-loadstore-opts
branch
from
December 8, 2023 15:10
6a0be5e
to
09de3c4
Compare
lgritz
approved these changes
Dec 8, 2023
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outstanding, thank you, Aras!
N.B. The failed "bleeding edge" CI test is because something has changed in the trunk of OpenJPEG and is unrelated to these changes, so I will merge. |
lgritz
merged commit Dec 8, 2023
d58e4aa
into
AcademySoftwareFoundation:master
24 of 25 checks passed
lgritz
pushed a commit
to lgritz/OpenImageIO
that referenced
this pull request
Dec 10, 2023
…ademySoftwareFoundation#4071) ## simd.h improvements: vint4::load from unsigned char pointer got pre-SSE4 code path. Testing on Ryzen 5950X / VS2022 (with only SSE2 enabled in the build): - vint4 load from unsigned char[]: 946.1 -> 4232.8 Mvals/sec vint4::store to unsigned char pointer got simpler/faster SSE code path, and a NEON code path. Additionally, it got test correctness coverage, including what happens to values outside of unsigned char range (current behavior just masks lowest byte, i.e. does not clamp the integer lanes). - vint4 store to unsigned char[]: 3489.8 -> 3979.3 Mvals/sec - vint8 store to unsigned char[]: 5516.9 -> 7325.3 Mvals/sec NEON code path as tested on Mac M1 Max (clang 15): - vint4 store to unsigned char[]: 4137.2 -> 6074.8 Mvals/sec ## Tests vint4 store to unsigned char pointer got actual correctness checking test, which seemingly was lacking before (only had a benchmark test).
lgritz
added a commit
that referenced
this pull request
Feb 8, 2024
Primarily, recent changes (PR #4071) to vint4::store for the NEON case appear to have some type mismatches, which apple clang on ARM-based Mac (including our CI) seems ok with, but which is generating type errors on other ARM Linux platforms. I think the types were weird here, so I tightened it up to get the types right for temporary variables in that function. That's the primary fix here. Secondarily, I modified simd.h and the CMake setup so that build option USE_SIMD=0 will disable NEON in the same way that it disables SSE. (I realized that USE_SIMD=0 was not disabling NEON, so there was no way for a NEON platform to completely disable SIMD if they needed to.) Fixes #4111 Signed-off-by: Larry Gritz <lg@larrygritz.com>
lgritz
added a commit
to lgritz/OpenImageIO
that referenced
this pull request
Feb 8, 2024
Primarily, recent changes (PR AcademySoftwareFoundation#4071) to vint4::store for the NEON case appear to have some type mismatches, which apple clang on ARM-based Mac (including our CI) seems ok with, but which is generating type errors on other ARM Linux platforms. I think the types were weird here, so I tightened it up to get the types right for temporary variables in that function. That's the primary fix here. Secondarily, I modified simd.h and the CMake setup so that build option USE_SIMD=0 will disable NEON in the same way that it disables SSE. (I realized that USE_SIMD=0 was not disabling NEON, so there was no way for a NEON platform to completely disable SIMD if they needed to.) Fixes AcademySoftwareFoundation#4111 Signed-off-by: Larry Gritz <lg@larrygritz.com>
1div0
pushed a commit
to 1div0/OpenImageIO
that referenced
this pull request
Feb 24, 2024
…ademySoftwareFoundation#4071) ## simd.h improvements: vint4::load from unsigned char pointer got pre-SSE4 code path. Testing on Ryzen 5950X / VS2022 (with only SSE2 enabled in the build): - vint4 load from unsigned char[]: 946.1 -> 4232.8 Mvals/sec vint4::store to unsigned char pointer got simpler/faster SSE code path, and a NEON code path. Additionally, it got test correctness coverage, including what happens to values outside of unsigned char range (current behavior just masks lowest byte, i.e. does not clamp the integer lanes). - vint4 store to unsigned char[]: 3489.8 -> 3979.3 Mvals/sec - vint8 store to unsigned char[]: 5516.9 -> 7325.3 Mvals/sec NEON code path as tested on Mac M1 Max (clang 15): - vint4 store to unsigned char[]: 4137.2 -> 6074.8 Mvals/sec ## Tests vint4 store to unsigned char pointer got actual correctness checking test, which seemingly was lacking before (only had a benchmark test). Signed-off-by: Peter Kovář <peter.kovar@reflexion.tv>
1div0
pushed a commit
to 1div0/OpenImageIO
that referenced
this pull request
Feb 24, 2024
Primarily, recent changes (PR AcademySoftwareFoundation#4071) to vint4::store for the NEON case appear to have some type mismatches, which apple clang on ARM-based Mac (including our CI) seems ok with, but which is generating type errors on other ARM Linux platforms. I think the types were weird here, so I tightened it up to get the types right for temporary variables in that function. That's the primary fix here. Secondarily, I modified simd.h and the CMake setup so that build option USE_SIMD=0 will disable NEON in the same way that it disables SSE. (I realized that USE_SIMD=0 was not disabling NEON, so there was no way for a NEON platform to completely disable SIMD if they needed to.) Fixes AcademySoftwareFoundation#4111 Signed-off-by: Larry Gritz <lg@larrygritz.com> Signed-off-by: Peter Kovář <peter.kovar@reflexion.tv>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
vint4::load from unsigned char pointer got pre-SSE4 code path. Testing on Ryzen 5950X / VS2022 (with only SSE2 enabled in the build):
vint4::store to unsigned char pointer got simpler/faster SSE code path, and a NEON code path. Additionally, it got test correctness coverage, including what happens to values outside of unsigned char range (current behavior just masks lowest byte, i.e. does not clamp the integer lanes).
NEON code path as tested on Mac M1 Max (clang 15):
Tests
vint4 store to unsigned char pointer got actual correctness checking test, which seemingly was lacking before (only had a benchmark test).
Checklist:
(adding new test cases if necessary).
corresponding Python bindings (and if altering ImageBufAlgo functions, also
exposed the new functionality as oiiotool options).
already run clang-format before submitting, I definitely will look at the CI
test that runs clang-format and fix anything that it highlights as being
nonconforming.