This repository shows basic usage of Burst C# Neon intrinsics available in Unity. The code is equivalent to that written in C in this repo: https://github.com/Over17/NeonIntrinsics-Android/
Please refer to https://github.com/Over17/NeonIntrinsics-Android/ for an explanation of what is being calculated and how. The input data is identical to that of the C version. Here is the list of what was implemented:
- a pure managed C# scalar version of the calculation
- exactly the same function as above, but bursted by just adding the [BurstCompile] attribute - this is a new option compared to the C version. Unity profiler markers were added to this variant, so it shows up in the Unity profiler and systrace
- using Neon intrinsics, same variations as in the C version: plain Neon; manual loop unrolling 2x, 3x, 4x, 6x; SMLAL+SMLAL2; SMLAL+SMLAL2, manually unrolled 2x
- there is no need for special code to run threads on big cores because Unity already does it by default for you
- generally, there's no need for warm-up code because loading a Unity app clocks up the CPU enough. However, some interesting findings are explained below
- the outer loop (over 1000000 trials) was moved into the calculation functions themselves, to avoid the cost of 1000000 P/Invoke calls and potentially allow for better optimizations. Overall, this is the expected way to do such calculations.
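To make the "SMLAL+SMLAL2" variation above more concrete, here is a minimal C sketch of that pattern next to a scalar reference. It assumes, purely for illustration, that the kernel is a dot product of two 16-bit arrays accumulated into 32 bits; the actual calculation is described in the linked C repository. The function names `dot_scalar` and `dot_smlal` are hypothetical and are not taken from either repository. The Neon path is only compiled on AArch64, and other targets fall back to the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define HAVE_NEON_A64 1
#else
#define HAVE_NEON_A64 0
#endif

/* Scalar reference: widen each 16-bit product to 32 bits and accumulate.
   Assumes the total fits in int32_t. */
int32_t dot_scalar(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}

/* SMLAL + SMLAL2 pattern: 8 input lanes per iteration, 4 x 32-bit accumulators. */
int32_t dot_smlal(const int16_t *a, const int16_t *b, size_t n)
{
#if HAVE_NEON_A64
    int32x4_t acc = vdupq_n_s32(0);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        int16x8_t va = vld1q_s16(a + i);
        int16x8_t vb = vld1q_s16(b + i);
        /* SMLAL: multiply-accumulate the low 4 lanes, widening to 32 bits. */
        acc = vmlal_s16(acc, vget_low_s16(va), vget_low_s16(vb));
        /* SMLAL2: same operation on the high 4 lanes. */
        acc = vmlal_high_s16(acc, va, vb);
    }
    /* Horizontal add of the four partial sums. */
    int32_t sum = vaddvq_s32(acc);
    /* Scalar tail for lengths that are not a multiple of 8. */
    for (; i < n; ++i)
        sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
#else
    /* No AArch64 Neon available: fall back to the scalar loop. */
    return dot_scalar(a, b, n);
#endif
}
```

Both functions should return the same value for the same input, which makes the scalar version a convenient correctness check for any intrinsics variant.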
If you look at the code itself, it looks very similar to the C intrinsics version. This makes porting C Neon intrinsics code or writing C# Neon code for people familiar with C Neon code a relatively easy task.
To compare the performance of Burst C# intrinsics to C intrinsics, the APK was run on the same device (Pixel 3XL). Here is the result screenshot:
Compared with the results of a C intrinsics run, most cases show equivalent performance, with the following exceptions:
- The "No Neon" (pure managed C# scalar version) run is much slower in C# than in C. This managed function gets no auto-vectorization benefit, either from Burst or from clang/LLVM (as in the C case). Interesting fact: this pure scalar implementation is 5 times slower than a simple Neon implementation, and 8 times slower than the fastest Neon intrinsics version.
- The performance of the "No Neon, bursted" version is equivalent to that of the "No Neon" (LLVM auto-vectorized) version in C. Basically, it shows that by default Burst delivers the same performance as code written in C and compiled by the default Android compiler. NOTE: the first run of a bursted function was about 20 ms slower; the profiler showed that the thread was waiting for some I/O to finish, likely loading (this still needs to be checked). In the code this place is commented as "warm-up".
- The performance of the "Neon, 2x unrolling" version is about 15% faster than that of the C version (!!). Checking further, I found that the difference is caused by the "1000000 trials" loop being moved into the bursted function, while it is in the outer function in C. When I move the trials loop into the calculation function in the C version, the performance becomes equivalent. I don't have a clear explanation for why this happens; likely some additional optimizations are performed when the outer loop is inside the calculation function. Overall, there is parity of C# and C performance in this case too.
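The two placements of the trials loop discussed above can be sketched in C as follows. This is only an illustration of the structural difference, not the code from either repository: the names `dot_once`, `run_trials_outside` and `dot_with_trials` are hypothetical, and a plain scalar dot product stands in for whichever kernel variant is being measured.

```c
#include <stddef.h>
#include <stdint.h>

/* One pass over the data; a stand-in for any of the kernel variants. */
static int32_t dot_once(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}

/* Layout A: the caller owns the trials loop and calls the kernel each time. */
int32_t run_trials_outside(const int16_t *a, const int16_t *b,
                           size_t n, int trials)
{
    int32_t last = 0;
    for (int t = 0; t < trials; ++t)
        last = dot_once(a, b, n);
    return last;
}

/* Layout B: the trials loop lives inside the calculation function, so the
   compiler sees the whole loop nest as a single function body. */
int32_t dot_with_trials(const int16_t *a, const int16_t *b,
                        size_t n, int trials)
{
    int32_t sum = 0;
    for (int t = 0; t < trials; ++t) {
        sum = 0;
        for (size_t i = 0; i < n; ++i)
            sum += (int32_t)a[i] * (int32_t)b[i];
    }
    return sum;
}
```

Both layouts return the same result. Any timing difference between them would come from how the compiler optimizes each shape, so it is worth measuring both before comparing numbers across languages.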
Given the explanations above, I think it's valid to say that the performance of these functions written with Burst C# Neon intrinsics is equivalent to that of the same functions written with C intrinsics.
A nice addition to the user experience is the Burst inspector, which lets you check the IL, IR and assembly output of bursted functions without having to build and use any external tools:
It's easy to verify that the output code has the expected instructions using the Burst inspector.
The project was created and built using Unity 2020.3.2f1 and Burst 1.5.0. I don't see any reason why it wouldn't work with a newer Unity or Burst version.
Licensed under the Apache 2.0 license.