
Conversation

Contributor

@markos markos commented Oct 31, 2024

Before submitting a pull request for a new Learning Path, please review Create a Learning Path

  • I have reviewed Create a Learning Path

Please do not include any confidential information in your contribution. This includes confidential microarchitecture details and unannounced product information. No AI tool can be used to generate either content or code when creating a learning path or install guide.

  • I have checked my contribution for confidential information

By submitting this pull request, I confirm that this contribution can be used, modified, copied, and redistributed under the terms of the Creative Commons Attribution 4.0 International License.


@TamarChristinaArm TamarChristinaArm left a comment


Thanks! Overall looks good! Added some comments.

Comment on lines 24 to 25
```c
float32x4_t a = {1.0f, 4.0f, 9.0f, 16.0f};
float32x4_t b = {1.0f, 2.0f, 3.0f, 4.0f};
```


This is undefined behavior for ACLE and would put the values in the wrong order for big-endian.
Initialization should be done through a load:

    float32_t a_array[4] = {1.0f, 4.0f, 9.0f, 16.0f};
    float32_t b_array[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float32x4_t a = vld1q_f32 (a_array);
    float32x4_t b = vld1q_f32 (b_array);

which allows the compiler to do the lane correction on Big-Endian. Same for the other examples.

Contributor


Thank you for the feedback! We originally focused on little-endian, but we've now updated the code to ensure compatibility with big-endian systems.

Compile the above code as follows on an Arm system:

```bash
gcc -O3 calculation_neon.c -o calculation_neon
```


Because the constants are local, the compiler constant-folds this example entirely at compile time (https://godbolt.org/z/eeY754815), so there won't be any Advanced SIMD instructions generated here (aside from fsqrt). It may be confusing for someone verifying the source.

You could consider lifting the constants to global scope to prevent this: https://godbolt.org/z/4qjxPd399

Same for the other examples

Copy link
Contributor


Thanks for the comment, we changed the examples to use global scope.


You will note that the result of the first element is a negative number, even though we added two positive products (`130*140` and `150*160`). That is because the sum has to fit in a 16-bit signed integer element, and when it exceeds that range we get a negative overflow. The bit pattern is the same in binary arithmetic, but when interpreted as a signed integer the value becomes negative.

The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. You could restore the correct order in multiple ways: using the widening **`vmovl`** intrinsics to zero-extend, or using the **`zip`** ones to merge with zero elements. The fastest way is the **`vmovl`** intrinsics, as you can see in the next example:


It's the opposite, actually; see the Software Optimization Guides, e.g. https://developer.arm.com/documentation/109898/latest/ for Neoverse V2. As you can see there, zero extends (UXTL) have the same latency as ZIP but much lower throughput. That's why GCC emits ZIP for zero extends: https://godbolt.org/z/nddeM5Wra

Copy link
Contributor


Thank you for the correction and additional information! We’ve adjusted the text based on your feedback.


You may already know the equivalent operations for this particular intrinsic, but let's assume you don't. In this use case, reading the **`_mm_madd_epi16`** entry on **SIMD.info** might indicate that a key characteristic of the instruction involved is the *widening* of the result elements, from 16-bit to 32-bit signed integers. Unfortunately, that is not the case, as this particular instruction does not actually increase the size of the element holding the result values. You will see how that affects the result in the example.

Consider the following code for **SSE2**. Create a new file for the code named `_mm_madd_epi16_test.c` with the contents shown below:


It's worth noting here, or in the general section, that especially whenever there isn't a 1-1 mapping for intrinsics, you can often get better performance on Arm platforms by reorganizing the data layout to match a native Arm instruction. As an example, you often have code doing unsigned widening multiplies with {xxx,0001,yyy,0001}, i.e. only multiplying the even elements and just zero-extending the odd elements. In this case you should permute the even/odd elements out and do the zero extension and the multiply separately, to avoid the expensive multiply. Just a quick example, but you get the idea.

Copy link
Contributor


Thank you for the comment once again. Our primary focus in this work was to optimize the existing algorithm using SIMD intrinsics directly, without altering the algorithm or changing data patterns. While reordering data to align with native Arm instructions can indeed improve performance in some cases, our scope here was limited to optimizing within the constraints of the current data layout and algorithm. We mentioned this idea in the conclusion, also pointing to another LP about vectorization-friendly data layout.

@TamarChristinaArm

Thanks for the updates! The changes look good to me.

@pareenaverma
Contributor

Thank you @TamarChristinaArm for the technical review and @gMerm @markos for another great learning path. @gMerm can you please resolve the conflicts on this before I merge for the next review.

@gMerm
Contributor

gMerm commented Nov 11, 2024

> Thank you @TamarChristinaArm for the technical review and @gMerm @markos for another great learning path. @gMerm can you please resolve the conflicts on this before I merge for the next review.

Thanks @pareenaverma , the conflicts have been resolved.

@pareenaverma pareenaverma merged commit cab68ba into ArmDeveloperEcosystem:main Nov 11, 2024