_mm_dp_ps does not always match x86_64 #595

ThiagoIze · 2023-05-10T00:27:20Z

The _mm_dp_ps implementation will give a perfect match for 0xFF and 0x7F modes, but the other modes will use the Kahan algorithm to get a little bit more precision. This means that changing which lanes you write the result to can produce slightly different results, which is likely unwanted and surprising. It also won't match what x86_64 produces.
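To make the discrepancy concrete, here is a minimal scalar sketch comparing plain summation with Kahan (compensated) summation. It is not the sse2neon code: the function names `sum_plain` and `sum_kahan` are hypothetical, and the input stands in for four lane products that have already been computed.

```c
#include <assert.h>
#include <stddef.h>

/* Plain left-to-right summation: every addition rounds to float. */
float sum_plain(const float *x, size_t n)
{
    assert(x != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}

/* Kahan summation: c carries the rounding error of the previous
 * addition and feeds it back into the next term. */
float sum_kahan(const float *x, size_t n)
{
    assert(x != NULL);
    float sum = 0.0f;
    float c = 0.0f;              /* running compensation */
    for (size_t i = 0; i < n; ++i) {
        float y = x[i] - c;      /* term corrected by the carried error */
        float t = sum + y;       /* low-order bits of y may be lost here */
        c = (t - sum) - y;       /* recover what was lost */
        sum = t;
    }
    return sum;
}
```

For the products `{1.0f, 0x1p-25f, 0x1p-25f, 0x1p-25f}`, `sum_plain` returns `1.0f` because each small term is below half an ulp of 1, while `sum_kahan` returns `1.0f + 0x1p-23f`. A one-ulp gap of this kind is what shows up between a Kahan path and a plain path. The sketch assumes the compiler is not allowed to reassociate floating-point arithmetic (for example, no `-ffast-math`).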

Note that the Intel® SSE4 Programming Reference states that this instruction should produce the same result that you get from a standard non-Kahan algorithm implementation:

> Dot Product operations are not specified in IEEE-754. When neither FTZ nor DAZ are enabled, the dot product instructions resemble sequences of IEEE-754 multiplies and adds (with rounding at each stage).
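As a rough illustration of "multiplies and adds with rounding at each stage", the following scalar sketch models `_mm_dp_ps(a, b, imm)` without any compensation. The name `dp_ps_ref` is hypothetical and this is not the sse2neon implementation. The pairwise grouping `(p0 + p1) + (p2 + p3)` is an assumption based on the DPPS operation pseudocode in Intel's instruction reference and should be checked against the manual before relying on it.

```c
#include <assert.h>

/* Scalar reference sketch for _mm_dp_ps.
 * Bits 4..7 of imm select which lane products enter the sum;
 * bits 0..3 select which output lanes receive the sum. */
void dp_ps_ref(const float a[4], const float b[4], int imm, float out[4])
{
    assert(imm >= 0 && imm <= 0xFF);

    float p[4];
    for (int i = 0; i < 4; ++i) {
        /* Unselected products contribute +0.0. */
        p[i] = (imm & (1 << (4 + i))) ? a[i] * b[i] : 0.0f;
    }

    /* Assumed summation order: pairwise, each add rounded to float. */
    float lo = p[0] + p[1];
    float hi = p[2] + p[3];
    float sum = lo + hi;

    for (int i = 0; i < 4; ++i) {
        /* Unselected output lanes are zeroed. */
        out[i] = (imm & (1 << i)) ? sum : 0.0f;
    }
}
```

With `a = {1, 2, 3, 4}` and `b = {5, 6, 7, 8}`, an immediate of `0xFF` writes 70 to every lane, and `0x71` writes 38 to lane 0 and zero elsewhere. Because the sum is computed the same way for every immediate, changing only the output-lane bits cannot change the value. Compilers that fuse multiply and add by default on some targets can skip the intermediate rounding, so building with contraction disabled (for example `-ffp-contract=off` on GCC or Clang) keeps the per-step rounding this sketch relies on.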

Either this enhanced precision code path should be removed (my preference and it also simplifies the code) or made optional using an SSE2NEON_PRECISE_* define.


Cuda-Chen · 2023-05-11T12:21:49Z

Hi @ThiagoIze,
I vote for removing the enhanced precision code path so that users are not confused.
I will make the changes and then open a PR to resolve this.

This commit improves _mm_dp_ps and its test cases in the following ways:

1. Remove the Kahan algorithm precision enhancement.

> Dot Product operations are not specified in IEEE-754. When neither FTZ nor DAZ are enabled, the dot product instructions resemble sequences of IEEE-754 multiplies and adds (with rounding at each stage).

As stated in the Intel® SSE4 Programming Reference, the dot product does not apply the Kahan algorithm. To align with the IEEE-754 result, the Kahan algorithm used for precision enhancement should be removed.

2. Apply a shortcut when imm is 0xXF and the environment is ARMv8-A. Also add corresponding tests for when imm is 0xXF.

3. Add more tests covering the possible combinations of imm.

Close DLTcollab#595.


Remove the Kahan algorithm in _mm_dp_ps to align the conversion result with SSE. Also apply a shortcut when the immediate is 0xXF and the target is ARMv8-A. Finally, add more tests covering possible combinations of the immediate, including 0xXF. Close DLTcollab#595.

jserv assigned Cuda-Chen May 10, 2023

Cuda-Chen mentioned this issue May 18, 2023

Remove Kahan algorithm in _mm_dp_ps #597

Merged

jserv closed this as completed in #597 May 18, 2023
