
Conversation

Contributor

@markos markos commented Oct 31, 2024

Before submitting a pull request for a new Learning Path, please review Create a Learning Path

  • I have reviewed Create a Learning Path

Please do not include any confidential information in your contribution. This includes confidential microarchitecture details and unannounced product information. No AI tool can be used to generate either content or code when creating a learning path or install guide.

  • I have checked my contribution for confidential information

By submitting this pull request, I confirm that this contribution can be used, modified, copied, and redistributed under the terms of the Creative Commons Attribution 4.0 International License.


@TamarChristinaArm TamarChristinaArm left a comment


Thanks! Overall looks good! Added some comments.

Comment on lines 24 to 25
```c
float32x4_t a = {1.0f, 4.0f, 9.0f, 16.0f};
float32x4_t b = {1.0f, 2.0f, 3.0f, 4.0f};
```


This is undefined behavior for ACLE and would put the values in the wrong order for big-endian.
Initialization should be done through a load:

    float32_t a_array[4] = {1.0f, 4.0f, 9.0f, 16.0f};
    float32_t b_array[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float32x4_t a = vld1q_f32 (a_array);
    float32x4_t b = vld1q_f32 (b_array);

which allows the compiler to do the lane correction on Big-Endian. Same for the other examples.

Contributor


Thank you for the feedback! We originally focused on little-endian, but we've now updated the code to ensure compatibility with big-endian systems.

Compile the above code as follows on an Arm system:

```bash
gcc -O3 calculation_neon.c -o calculation_neon
```


Because the constants are local, the compiler constant-folds this example entirely at compile time (https://godbolt.org/z/eeY754815), so there won't be any Advanced SIMD instructions generated here (aside from fsqrt). It may be confusing for someone verifying the source.

You could consider lifting the constants to global scope to prevent this: https://godbolt.org/z/4qjxPd399

Same for the other examples

Copy link
Contributor


Thanks for the comment, we changed the examples to use global scope.


You will note that the result of the first element is a negative number, even though we added two positive products (`130*140` and `150*160`). That is because the sum has to fit in a 16-bit signed integer element, and when it exceeds that range we get a negative overflow. The bit pattern is the same in binary arithmetic, but when interpreted as a signed integer the value becomes negative.

The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. You could restore the correct order in multiple ways: using the widening **`vmovl`** intrinsics to zero-extend, or using the **`zip`** ones to merge with zero elements. The fastest way is the **`vmovl`** intrinsics, as you can see in the next example:


It's the opposite, actually; see the Software Optimization Guides, e.g. https://developer.arm.com/documentation/109898/latest/ for Neoverse V2. As you can see there, zero extends (UXTL) have the same latency as ZIP but much lower throughput. That's why GCC emits ZIP for zero extends: https://godbolt.org/z/nddeM5Wra

Copy link
Contributor


Thank you for the correction and additional information! We’ve adjusted the text based on your feedback.


You may already know the equivalent operations for this particular intrinsic, but let's assume you don't. In this use case, reading the **`_mm_madd_epi16`** entry on **SIMD.info** might indicate that a key characteristic of the instruction involved is the *widening* of the result elements, from 16-bit to 32-bit signed integers. Unfortunately, that is not the case, as this particular instruction does not actually increase the size of the element holding the result values. You will see how that affects the result in the example.

Consider the following code for **SSE2**. Create a new file for the code named `_mm_madd_epi16_test.c` with the contents shown below:


It's worth noting here, or in the general section, that especially whenever there isn't a 1-1 mapping for intrinsics, you can often get better performance on Arm platforms by reorganizing the data layout to match a native Arm instruction. As an example, you often have code doing unsigned widening multiplies with {xxx,0001,yyy,0001}, i.e. only multiplying the even elements and just zero-extending the odd elements. In this case you should permute the even/odd elements out and do the zero extension and the multiply separately, to avoid the expensive multiply. Just a quick example, but you get the idea.

Copy link
Contributor


Thank you for the comment once again. Our primary focus in this work was to optimize the existing algorithm using SIMD intrinsics directly, without altering the algorithm or changing data patterns. While reordering data to align with native Arm instructions can indeed improve performance in some cases, our scope here was limited to optimizing within the constraints of the current data layout and algorithm. We mentioned this idea in the conclusion, also pointing to another LP about vectorization-friendly data layout.

@TamarChristinaArm

Thanks for the updates! The changes look good to me.

@pareenaverma
Contributor

Thank you @TamarChristinaArm for the technical review and @gMerm @markos for another great learning path. @gMerm can you please resolve the conflicts on this before I merge for the next review.

@gMerm
Contributor

gMerm commented Nov 11, 2024

> Thank you @TamarChristinaArm for the technical review and @gMerm @markos for another great learning path. @gMerm can you please resolve the conflicts on this before I merge for the next review.

Thanks @pareenaverma , the conflicts have been resolved.

@pareenaverma pareenaverma merged commit cab68ba into ArmDeveloperEcosystem:main Nov 11, 2024