SIMD.info LP completed #1364
Conversation
… equivalent for NEON.
ARM Learning path for simd.info initial commit
TamarChristinaArm left a comment:
Thanks! Overall looks good! Added some comments.
```c
float32x4_t a = {1.0f, 4.0f, 9.0f, 16.0f};
float32x4_t b = {1.0f, 2.0f, 3.0f, 4.0f};
```
This is undefined behavior for ACLE and would put the values in the wrong order for Big-Endian.
Initialization should be done through a load:

```c
float32_t a_array[4] = {1.0f, 4.0f, 9.0f, 16.0f};
float32_t b_array[4] = {1.0f, 2.0f, 3.0f, 4.0f};
float32x4_t a = vld1q_f32(a_array);
float32x4_t b = vld1q_f32(b_array);
```

which allows the compiler to do the lane correction on Big-Endian. Same for the other examples.
Thank you for the feedback! We originally focused on little-endian, but we've now updated the code to ensure compatibility with big-endian systems.
Compile the above code as follows on an Arm system:

```bash
gcc -O3 calculation_neon.c -o calculation_neon
```
Because the constants are local, the compiler constant-evaluates this example entirely at compile time (https://godbolt.org/z/eeY754815), so there won't be any Adv. SIMD instructions generated here (aside from fsqrt). It may be confusing for someone verifying the source.
You could consider lifting the constants to global scope to prevent this: https://godbolt.org/z/4qjxPd399
Same for the other examples.
Thanks for the comment, we changed the examples to use global scope.
You will note that the result of the first element is a negative number, even though we added two positive products (`130*140` and `150*160`). That is because the result of the addition has to fit in a 16-bit signed integer element, and when the sum exceeds the signed 16-bit range we get the effect of a negative overflow. The bit pattern is the same in binary arithmetic, but when interpreted as a signed integer, it turns the number into a negative value.
The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. You could get the correct order in multiple ways, using the widening **`vmovl`** intrinsics to zero-extend or using the **`zip`** ones to merge with zero elements. The fastest way is the **`vmovl`** intrinsics, as you can see in the next example:
It's the opposite, actually; see the Software Optimization Guides, e.g. https://developer.arm.com/documentation/109898/latest/ for Neoverse V2. As you can see, zero-extends (UXTL) have the same latency as ZIP but much lower throughput. That's why GCC emits ZIP for zero-extends: https://godbolt.org/z/nddeM5Wra
Thank you for the correction and additional information! We’ve adjusted the text based on your feedback.
You may already know the equivalent operations for this particular intrinsic, but let's assume you don't. In this use case, reading the **`_mm_madd_epi16`** entry on **SIMD.info** might indicate that a key characteristic of the instruction involved is the *widening* of the result elements, from 16-bit to 32-bit signed integers. Unfortunately, that is not the case, as this particular instruction does not actually increase the size of the element holding the result values. You will see how that affects the result in the example.

Consider the following code for **SSE2**. Create a new file for the code named `_mm_madd_epi16_test.c` with the contents shown below:
It's worth noting here, or in the general section, that especially when there isn't a 1-1 mapping for intrinsics, you can often get better performance on Arm platforms by reorganizing the data layout to match a native Arm instruction. As an example, you often have code doing unsigned widening multiplies using {xxx,0001,yyy,0001}, i.e. only multiplying the even elements and just zero-extending the odd elements. In this case you should permute the even/odd elements out and do the zero-extend and multiply separately, to avoid the expensive multiply. Just a quick example, but you get the idea.
Thank you for the comment once again. Our primary focus in this work was to optimize the existing algorithm using SIMD intrinsics directly, without altering the algorithm or changing data patterns. While reordering data to align with native ARM instructions can indeed improve performance in some cases, our scope here was limited to optimizing within the constraints of the current data layout and algorithm. We mentioned this idea in the conclusion, also pointing to another LP about vectorization-friendly data layout.
… & some minor fixes in other places
Alterations based on the comments
Thanks for the updates! The changes look good to me.

Thank you @TamarChristinaArm for the technical review and @gMerm @markos for another great learning path. @gMerm can you please resolve the conflicts on this before I merge for the next review.

Thanks @pareenaverma, the conflicts have been resolved.
Before submitting a pull request for a new Learning Path, please review Create a Learning Path
Please do not include any confidential information in your contribution. This includes confidential microarchitecture details and unannounced product information. No AI tool can be used to generate either content or code when creating a learning path or install guide.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the Creative Commons Attribution 4.0 International License.