Original file line number Diff line number Diff line change
@@ -3,7 +3,7 @@ title: Introduction to SIMD.info

minutes_to_complete: 30

who_is_this_for: This is for software developers interested in porting SIMD code across platforms.
who_is_this_for: This is an advanced topic for software developers interested in porting SIMD code across Arm platforms.

learning_objectives:
- Learn how to use SIMD.info’s tools and features, such as navigation, search, and comparison, to simplify the process of finding equivalent SIMD intrinsics between architectures and improving code portability.
Original file line number Diff line number Diff line change
@@ -8,7 +8,7 @@ layout: learningpathall

### Conclusion and Additional Resources

Porting SIMD code between architecture can be a daunting process, in many cases requiring many hours of studying multiple ISAs in online resources or ISA manuals of thousands pages. Our primary focus in this work was to optimize the existing algorithm directly with SIMD intrinsics, without altering the algorithm or data layout. While reordering data to align with native ARM instructions could offer performance benefits, our scope remained within the constraints of the current data layout and algorithm. For those interested in data layout strategies to further enhance performance on ARM, the [vectorization-friendly data layout learning path](https://learn.arm.com/learning-paths/cross-platform/vectorization-friendly-data-layout/) offers valuable insights.
Porting SIMD code between architectures can be a daunting process, in many cases requiring many hours of studying multiple ISAs in online resources or ISA manuals that run to thousands of pages. Our primary focus in this work was to optimize the existing algorithm directly with SIMD intrinsics, without altering the algorithm or data layout. While reordering data to align with native Arm instructions could offer performance benefits, our scope remained within the constraints of the current data layout and algorithm. For those interested in data layout strategies to further enhance performance on Arm, the [vectorization-friendly data layout learning path](https://learn.arm.com/learning-paths/cross-platform/vectorization-friendly-data-layout/) offers valuable insights.

Using **[SIMD.info](https://simd.info)** can be instrumental in reducing the amount of time spent in this process, providing a centralized and user-friendly resource for finding **NEON** equivalents to intrinsics of other architectures. It saves considerable time and effort by offering detailed descriptions, prototypes, and comparisons directly, eliminating the need for extensive web searches and manual lookups.

Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
title: Overview & Context
title: Overview
weight: 2

### FIXED, DO NOT MODIFY
@@ -9,9 +9,8 @@ layout: learningpathall
### The Challenge of SIMD Code Portability
One of the biggest challenges developers face when working with SIMD code is making it portable across different platforms. SIMD instructions are designed to increase performance by executing the same operation on multiple data elements in parallel. However, each architecture has its own set of SIMD instructions, making it difficult to write code that works on all of them without major changes to the code and/or algorithm.

Consider you have the task of porting a software written using Intel intrinsics, like SSE/AVX/AVX512, to Arm Neon.
The differences in instruction sets and data handling require careful attention.
To port software written using Intel intrinsics, like SSE/AVX/AVX512, to Arm Neon, you have to pay attention to how the different instruction sets handle data.

This lack of portability increases development time and introduces the risk of errors during the porting process. Currently, developers rely on ISA documentation and manually search across various vendor platforms like [ARM Developer](https://developer.arm.com/architectures/instruction-sets/intrinsics/) and [Intel Intrinsics Guide](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html) to find equivalent instructions.
Having to port the code between architectures can increase development time and introduce the risk of errors during the porting process. Currently, developers rely on ISA documentation and manually search across various vendor platforms like [Arm Developer](https://developer.arm.com/architectures/instruction-sets/intrinsics/) and [Intel Intrinsics Guide](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html) to find equivalent instructions.

[SIMD.info](https://simd.info) aims to solve this by helping you find equivalent instructions and providing a more streamlined way to adapt your code for different architectures.
Original file line number Diff line number Diff line change
@@ -46,22 +46,8 @@ An example of how the tree structure looks like:

- **Advanced search functionality:** With its robust search engine, **SIMD.info** allows you to either search for a specific intrinsic (e.g. `vaddq_f64`) or enter more general terms (e.g. *How to add 2 vectors*), and it will return a list of the corresponding intrinsics. You can also filter results based on the specific engine you're working with, such as **NEON**, **SSE4.2**, **AVX**, **AVX512**, **VSX**. This functionality streamlines the process of finding the right commands tailored to your needs.

- **Comparison tools:** This feature lets you directly compare SIMD instructions from different (or the same) platforms side by side, offering a clear view of the similarities and differences. It’s an invaluable tool for porting code across architectures, as it ensures accuracy and efficiency.
- **Comparison tools:** This feature lets you directly compare SIMD instructions from different (or the same) platforms side by side, offering a clear view of the similarities and differences. It’s a very helpful tool for porting code across architectures, as it ensures accuracy and efficiency.

- **Discussion forum (like StackOverflow):** The integrated discussion forum, powered by **[Disqus](https://disqus.com/)**, allows users to ask questions, share insights, and troubleshoot problems together. This community-driven space ensures that you’re never stuck on a complex issue without support, fostering collaboration and knowledge-sharing among SIMD developers. Imagine something like **StackOverflow** but specific to SIMD intrinsics.

### Work in Progress & Future Development
- **Pseudo-code:** Currently under development, this feature will enable users to generate high-level pseudo-code based on specific SIMD instructions. This tool aims to enable better understanding of the SIMD instructions, in a *common language*. This will also be used in the next feature, **Intrinsics Diagrams**.

- **Intrinsics Diagrams:** A feature under progress, creating detailed diagrams for each intrinsic to visualize how it operates on a low level using registers. These diagrams will help you grasp the mechanics of SIMD instructions more clearly, aiding in optimization and debugging.

- **[SIMD.ai](https://simd.ai/):** SIMD.ai is an upcoming feature that promises to bring AI-assisted insights and recommendations to the SIMD development process, making it faster and more efficient to port SIMD code between architectures.

### How These Features Aid in SIMD Development
**[SIMD.info](https://simd.info/)** offers a range of features that streamline the process of porting SIMD code across different architectures. The hierarchical structure of tree-based navigation allows you to easily locate instructions within a clear framework. This organization into broad categories and specific subcategories, such as **Arithmetic** and **Boolean Logic**, makes it straightforward to identify the relevant SIMD instructions.

When you need to port code from one architecture to another, the advanced search functionality proves invaluable. You can either search for specific intrinsics or use broader terms to find equivalent instructions across platforms. This capability ensures that you quickly find the right intrinsics for Arm, Intel or Power architectures.

Furthermore, **SIMD.info**’s comparison tools enhance this process by enabling side-by-side comparisons of instructions from various platforms. This feature highlights the similarities and differences between instructions, which is crucial for accurately adapting your code. By understanding how similar operations are implemented across architectures, you can ensure that your ported code performs optimally.

Let's look at an actual example.
You can now learn how to use these features in the context of an actual example.
Original file line number Diff line number Diff line change
@@ -14,7 +14,7 @@ layout: learningpathall

After identifying the **NEON** intrinsics you will need in the ported program, it's time to actually write the code.

Create a new file for the ported NEON code named `calculation_neon.c` with the contents shown below:
This time on your Arm Linux machine, create a new file for the ported NEON code named `calculation_neon.c` with the contents shown below:

```C
#include <arm_neon.h>
@@ -68,7 +68,7 @@ int main() {

It's time to verify that the functionality remains the same, which means you get the same results and similar performance.

Compile the above code as follows on an Arm system:
Compile the above code as follows on your Arm Linux machine:

```bash
gcc -O3 calculation_neon.c -o calculation_neon
@@ -95,5 +95,5 @@ Square Root Result: 1.41 3.46 6.00 8.94
You can see that the results are the same as in the **SSE4.2** example.

{{% notice Note %}}
We initialized the vectors in reverse order compared to the **SSE4.2** version because the array initialization and vld1q_f32 function load vectors from LSB to MSB, whereas **`_mm_set_ps`** loads elements MSB to LSB.
{{% /notice %}}
You initialized the vectors in reverse order compared to the **SSE4.2** version because the array initialization and the **`vld1q_f32`** function fill the vector from the lowest element (LSB side) to the highest (MSB side), whereas **`_mm_set_ps`** takes its arguments from the highest element to the lowest.
{{% /notice %}}
Original file line number Diff line number Diff line change
@@ -8,7 +8,7 @@ layout: learningpathall

Consider the following C example that uses Intel SSE4.2 intrinsics.

Create a file named `calculation_sse.c` with the contents shown below.
On an x86_64 Linux development machine, create a file named `calculation_sse.c` with the contents shown below:

```C
#include <xmmintrin.h>
@@ -54,12 +54,14 @@ int main() {

The program first checks whether the elements in one vector are greater than those in another vector and prints the result. It then adds the two vectors, multiplies the sum by one of the vectors, and finally takes the square root of the product.

Compile the code as follows on an Intel system that supports **SSE4.2**:
Compile the code on your Linux x86_64 system that supports **SSE4.2**:

```bash
gcc -O3 calculation_sse.c -o calculation_sse -msse4.2
```

Now run the program:

```bash
./calculation_sse
```
@@ -76,4 +78,4 @@ Multiplication Result: 2.00 12.00 36.00 80.00
Square Root Result: 1.41 3.46 6.00 8.94
```

It is imperative that you run the code first on the reference platform (here Intel), to make sure you understand how it works and what kind of results are being expected.
It is imperative that you first run the code on an Intel x86_64 reference platform, to make sure you understand how it works and what results to expect.
Original file line number Diff line number Diff line change
@@ -12,7 +12,7 @@ During the porting process, you will observe that certain instructions translate

You may already know the equivalent operations for this particular intrinsic, but let's assume you don't. In this use case, reading the description of **`_mm_madd_epi16`** on **SIMD.info** might indicate that a key characteristic of the instruction is the *widening* of the result elements, from 16-bit to 32-bit signed integers. Unfortunately, that is not the case, as this particular instruction does not actually increase the size of the element holding the result values. You will see how that affects the result in the example.

Consider the following code for **SSE2**. Create a new file for the code named `_mm_madd_epi16_test.c` with the contents shown below:
Consider the following code for **SSE2**. Create a new file on your x86_64 Linux machine named `_mm_madd_epi16_test.c` with the contents shown below:

```C
#include <stdint.h>
@@ -44,7 +44,7 @@ int main() {
}
```

Compile the code as follows on an x86 system (no extra flags required as **SSE2** is assumed by default on all 64-bit x86 systems):
Compile the code as follows on the x86_64 system (no extra flags required as **SSE2** is assumed by default on all 64-bit x86 systems):
```bash
gcc -O3 _mm_madd_epi16_test.c -o _mm_madd_epi16_test
```
@@ -63,8 +63,9 @@ _mm_madd_epi16(a, b) : a4d8 0 56b8 0 2198 0 578 0

Note that the first element reads as a negative number, even though it is the sum of two positive products (`130*140` and `150*160`). That is because the sum has to occupy a 16-bit signed integer element, and when it is too large for that range it overflows into the sign bit. The bit pattern is the same in binary arithmetic, but when interpreted as a signed integer, it becomes a negative number.

The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. In this example, we chose to use **`vmovl`** to zero-extend values, which achieves the correct order with zero elements in place. While both **`vmovl`** and **`zip`** could be used for this purpose, we opted for **`vmovl`** in this implementation. For more details, see the ARM Software Optimization Guides, such as the [Neoverse V2 guide](https://developer.arm.com/documentation/109898/latest/).
The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. In this example, you used **`vmovl`** to zero-extend values, which achieves the correct order with zero elements in place. While both **`vmovl`** and **`zip`** could be used for this purpose, **`vmovl`** was chosen in this implementation. For more details, see the Arm Software Optimization Guides, such as the [Neoverse V2 guide](https://developer.arm.com/documentation/109898/latest/).

Now switch to your Arm Linux machine and create a file called `_mm_madd_epi16_neon.c` with the contents below:
```C
#include <arm_neon.h>
#include <stdint.h>
@@ -107,7 +108,7 @@ int main() {
}
```

Write the above program to a file called `_mm_madd_epi16_neon.c` and compile it:
Compile the code on your Arm Linux machine:

```bash
gcc -O3 _mm_madd_epi16_neon.c -o _mm_madd_epi16_neon
@@ -127,5 +128,5 @@ vpaddq_s16(a, b) : a4d8 56b8 2198 578 0 0 0 0
final : a4d8 0 56b8 0 2198 0 578 0
```

As you can see the results of both match, **SIMD.info** was especially helpful in this process, providing detailed descriptions and examples that guided the translation of complex intrinsics between different SIMD architectures.
As you can see, the results of both executions on different architectures match. You were able to use **SIMD.info** to help with the translation of complex intrinsics between different SIMD architectures.