_mm_dp_ps does not always match x86_64 #595
Hi @ThiagoIze,
Cuda-Chen added a commit to Cuda-Chen/sse2neon that referenced this issue on May 18, 2023:

This commit improves _mm_dp_ps and its test cases in the following aspects:

1. Remove Kahan algorithm precision enhancement

> Dot Product operations are not specified in IEEE-754. When neither FTZ nor DAZ are enabled, the dot product instructions resemble sequences of IEEE-754 multiplies and adds (with rounding at each stage)

As stated in the Intel® SSE4 Programming Reference, the dot product does not apply the Kahan algorithm. To align with the IEEE-754 result, the Kahan algorithm used for precision enhancement should be removed.

2. Apply shortcut when imm is 0xXF and environment is ARMv8-A

Apply a shortcut when imm is 0xXF and the environment is ARMv8-A. Also, add corresponding tests for when imm is 0xXF.

3. Add more tests

More tests are added to cover the possible combinations of imm.

Close DLTcollab#595.
Cuda-Chen added a commit to Cuda-Chen/sse2neon that referenced this issue on May 18, 2023:

Remove the Kahan algorithm in _mm_dp_ps to align the conversion result with SSE. Also, apply a shortcut when the immediate is 0xXF and the target is ARMv8-A. Finally, add more tests covering the possible combinations of the immediate, including 0xXF. Close DLTcollab#595.
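
For anyone checking the immediate combinations by hand, here is a minimal scalar sketch of what the instruction computes. It is not the sse2neon code: dp_ps_ref is a hypothetical name, and the pairwise add order (lanes 0+1, lanes 2+3, then the two partial sums) is my reading of the operation pseudocode in Intel's documentation, so treat it as an assumption. The high nibble of the immediate selects which products enter the sum, and the low nibble selects which output lanes receive it.

```c
#include <stdint.h>

/* Hypothetical scalar model of _mm_dp_ps (dp_ps_ref is an illustrative
 * name, not an sse2neon function). Assumes the add order
 * (p0 + p1) + (p2 + p3) with a float rounding after every operation. */
static void dp_ps_ref(const float a[4], const float b[4], uint8_t imm,
                      float out[4])
{
    /* volatile forces each intermediate to be stored as a float, which
     * keeps the compiler from fusing a multiply and an add into an FMA. */
    volatile float p[4];

    /* High nibble (bits 4..7): which lane products take part in the sum. */
    for (int i = 0; i < 4; i++)
        p[i] = (imm & (0x10 << i)) ? a[i] * b[i] : 0.0f;

    volatile float lo = p[0] + p[1];
    volatile float hi = p[2] + p[3];
    volatile float sum = lo + hi;
    const float s = sum;

    /* Low nibble (bits 0..3): which output lanes receive the sum. */
    for (int i = 0; i < 4; i++)
        out[i] = (imm & (1 << i)) ? s : 0.0f;
}
```

In this model the low nibble only decides where the sum is written, so dp_ps_ref(a, b, 0xF1, out) and dp_ps_ref(a, b, 0xF8, out) put the same value in lane 0 and lane 3 respectively. That is the property the issue asks the NEON path to preserve.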
The _mm_dp_ps implementation gives an exact match for the 0xFF and 0x7F modes, but the other modes use the Kahan algorithm to get a little more precision. This means that changing which lanes you write the result to can produce slightly different results, which is likely unwanted and surprising. It also won't match what x86_64 produces.

Note that the Intel® SSE4 Programming Reference states that this instruction should produce the same result that you get from a standard non-Kahan implementation.

Either this enhanced-precision code path should be removed (my preference, and it also simplifies the code) or it should be made optional using an SSE2NEON_PRECISE_* define.
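
To illustrate why a Kahan path cannot match a plain one bit for bit, here is a small standalone sketch comparing the two summations on the same inputs. The names sum_plain and sum_kahan are hypothetical and this is not the code from sse2neon; it assumes IEEE-754 single precision with round-to-nearest-even.

```c
#include <float.h>
#include <stddef.h>

/* Plain left-to-right summation, rounding to float after every add. */
static float sum_plain(const float *x, size_t n)
{
    /* volatile forces a store, and therefore a float rounding, per step. */
    volatile float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum = sum + x[i];
    return sum;
}

/* Kahan (compensated) summation: c carries the rounding error of the
 * previous add and feeds it back into the next term. */
static float sum_kahan(const float *x, size_t n)
{
    volatile float sum = 0.0f, c = 0.0f, y, t;
    for (size_t i = 0; i < n; i++) {
        y = x[i] - c;      /* next term, corrected by the stored error */
        t = sum + y;       /* low-order bits of y may be lost here */
        c = (t - sum) - y; /* recover what was lost (negated) */
        sum = t;
    }
    return sum;
}
```

With the products {1.0f, FLT_EPSILON / 2, FLT_EPSILON / 2, 0.0f}, sum_plain returns 1.0f because each small term is rounded away, while sum_kahan returns 1.0f + FLT_EPSILON. The Kahan result is closer to the true sum, but it is not what a sequence of individually rounded adds produces, which is the mismatch described above.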