_mm_dp_ps does not always match x86_64 #595

Closed
ThiagoIze opened this issue May 10, 2023 · 1 comment · Fixed by #597

@ThiagoIze

The _mm_dp_ps implementation gives a result that exactly matches x86_64 for the 0xFF and 0x7F immediates, but every other immediate uses the Kahan algorithm to get a little more precision. This means that changing which lanes you write the result to can produce slightly different results, which is likely unwanted and surprising. It also won't match what x86_64 produces.
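
To illustrate how compensated summation can disagree with plain summation, here is a minimal sketch in portable C. The names `sum_plain` and `sum_kahan` are hypothetical, and the Kahan loop is the textbook form, not the code that was in sse2neon. It assumes `float` arithmetic is evaluated in single precision and that the compiler does not reassociate the operations (no fast-math).

```c
#include <stddef.h>

/* Plain left-to-right summation: one rounding per add. */
float sum_plain(const float *x, size_t n)
{
    volatile float s = 0.0f; /* volatile discourages excess precision */
    for (size_t i = 0; i < n; i++)
        s = s + x[i];
    return s;
}

/* Textbook Kahan (compensated) summation; hypothetical helper. */
float sum_kahan(const float *x, size_t n)
{
    volatile float s = 0.0f; /* running sum */
    volatile float c = 0.0f; /* running compensation for lost low bits */
    for (size_t i = 0; i < n; i++) {
        volatile float y = x[i] - c; /* apply the correction from the last step */
        volatile float t = s + y;    /* low bits of y may be lost here */
        c = (t - s) - y;             /* recover what was lost */
        s = t;
    }
    return s;
}
```

With the input `{1.0f, 0x1p-24f, 0x1p-24f, 0x1p-24f}`, `sum_plain` should return exactly `1.0f` because each tiny term is rounded away, whereas `sum_kahan` carries the lost bits forward and returns a value slightly above `1.0f`. Two code paths that differ in this way cannot both match the hardware instruction.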

Note that the Intel® SSE4 Programming Reference states that this instruction should produce the same result that you get from a standard non-Kahan algorithm implementation.

> Dot Product operations are not specified in IEEE-754. When neither FTZ nor DAZ are enabled, the dot product instructions resemble sequences of IEEE-754 multiplies and adds (with rounding at each stage).
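
As a rough scalar model of "multiplies and adds with rounding at each stage", consider the sketch below. The name `dp_ps_ref` and the array-based signature are hypothetical. The pairwise add order (lanes 0+1, lanes 2+3, then the two partial sums) is an assumption based on the DPPS pseudocode in Intel's documentation and should be checked against it.

```c
/* Hypothetical scalar model of _mm_dp_ps(a, b, imm); not sse2neon code. */
void dp_ps_ref(const float a[4], const float b[4], int imm, float out[4])
{
    volatile float p[4]; /* volatile keeps each product rounded to float */

    /* imm bits 4..7 choose which products take part; others count as +0.0 */
    for (int i = 0; i < 4; i++)
        p[i] = (imm & (0x10 << i)) ? a[i] * b[i] : 0.0f;

    /* Assumed order: two pairwise adds, then one final add, each rounded */
    volatile float lo = p[0] + p[1];
    volatile float hi = p[2] + p[3];
    volatile float sum = lo + hi;

    /* imm bits 0..3 choose which output lanes receive the sum */
    for (int i = 0; i < 4; i++)
        out[i] = (imm & (1 << i)) ? sum : 0.0f;
}
```

For `a = {1, 2, 3, 4}` and `b = {5, 6, 7, 8}`, an immediate of 0x71 sums the first three products (5 + 12 + 21 = 38) and writes the result only to lane 0. In this model the low four bits only select output lanes, so they never change the value of the sum.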

Either this enhanced precision code path should be removed (my preference and it also simplifies the code) or made optional using an SSE2NEON_PRECISE_* define.

@Cuda-Chen
Collaborator

Hi @ThiagoIze ,
I vote for "enhanced precision code path should be removed", so that users are not confused.
Let me make some changes and then create a PR to solve this.

Cuda-Chen added a commit to Cuda-Chen/sse2neon that referenced this issue May 18, 2023
This commit improves _mm_dp_ps and its test cases in the following
ways:

1. remove Kahan algorithm precision enhancement

> Dot Product operations are not specified in IEEE-754. When neither FTZ nor DAZ are enabled, the dot product instructions resemble sequences of IEEE-754 multiplies and adds (with rounding at each stage)

As stated in the Intel® SSE4 Programming Reference, the dot product does not
apply the Kahan algorithm. To match a sequence of IEEE-754 multiplies and adds, the Kahan algorithm used for precision enhancement should be removed.

2. apply shortcut when imm is 0xXF and environment is ARMv8-A

Apply a shortcut when imm is 0xXF and the environment is ARMv8-A.
Also, add corresponding tests for imm 0xXF.

3. add more tests

More tests are added for testing the possible combinations of imm.

Close DLTcollab#595.
Cuda-Chen added a commit to Cuda-Chen/sse2neon that referenced this issue May 18, 2023
Remove the Kahan algorithm from _mm_dp_ps so that the converted result matches SSE.

Also, apply a shortcut when the immediate is 0xXF and the target is ARMv8-A.

Lastly, add more tests covering possible combinations of the immediate,
including 0xXF.

Close DLTcollab#595.
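
A sweep over all immediates, of the kind the commit messages describe, could look roughly like the following. This is an illustrative sketch, not the project's test suite: `dp_model` is a hypothetical stand-in for the implementation under test, and `check_all_imm` is a hypothetical helper. It checks only the property raised in the issue, namely that the low four bits of the immediate select output lanes without changing the computed value.

```c
#include <string.h>

/* Hypothetical stand-in for the function under test (not sse2neon code). */
void dp_model(const float a[4], const float b[4], int imm, float out[4])
{
    volatile float p[4]; /* volatile keeps each product rounded to float */
    for (int i = 0; i < 4; i++)
        p[i] = ((imm >> (4 + i)) & 1) ? a[i] * b[i] : 0.0f;
    volatile float lo = p[0] + p[1];
    volatile float hi = p[2] + p[3];
    volatile float sum = lo + hi;
    for (int i = 0; i < 4; i++)
        out[i] = ((imm >> i) & 1) ? sum : 0.0f;
}

/* Returns the number of lane mismatches found over all 256 immediates. */
int check_all_imm(const float a[4], const float b[4])
{
    int failures = 0;
    for (int imm = 0; imm < 256; imm++) {
        float got[4], full[4];
        dp_model(a, b, imm, got);
        /* Same input mask with every output lane selected (the 0xXF form) */
        dp_model(a, b, imm | 0x0F, full);
        for (int i = 0; i < 4; i++) {
            /* Selected lanes must equal the 0xXF value, others must be +0.0 */
            float want = ((imm >> i) & 1) ? full[i] : 0.0f;
            /* Compare bit patterns so that -0.0 versus +0.0 is caught too */
            if (memcmp(&got[i], &want, sizeof(float)) != 0)
                failures++;
        }
    }
    return failures;
}
```

To test a real implementation, `dp_model` would be replaced by a wrapper around the function under test, and the inputs would ideally include values whose sum is sensitive to rounding, since small exact integers cannot expose a precision mismatch.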