[SME] Design proposal #115037
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
@dotnet/arm64-contrib @dotnet/jit-contrib @jkotas @davidwrighton @tannergooding
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Change-Id: I4c1a09bc5830895a8c6b503bfa0bd8dda6b06d27
ping @tannergooding ... does anyone else have any other feedback?
Hi! I work with @a74nh at Arm, and he asked me to have a look at this.
I've left several comments and questions (which should be self-explanatory). The proposal spends a lot of time talking about the architecture (which is good for me reviewing because I know that better than .NET).
Overall, I think some of the trickier bits, like how it gets represented by `Vector<T>`, or how it interacts with a GC, are worth exploring a bit more; ACLE (for C/C++) gets to declare cross-streaming-state uses of VL-dependent objects as "undefined behaviour" but I doubt that's acceptable for .NET.
Thanks!

SME introduces two key concepts essential for AI workloads:

1. **Streaming Mode**: Allows a program to operate with a scalable vector length (SVL) distinct from the non-streaming vector length (NSVL).

I think the VL difference is more of a side effect of having a ZA matrix (which must be VL*VL for the matrix operations to work), rather than a primary goal. Perhaps this point is better written as: "Permits access to SME instructions, including matrix operations that operate on ZA."

In SVE1 and SVE2, the vector length (VL) is scalable and can vary between 16B and 256B, depending on the hardware. This is referred to as **NSVL**. Examples include:
- Microsoft Azure's Cobalt: VL = 16B
- Amazon AWS's Graviton3: VL = 32B
- Fujitsu A64FX Fugaku: VL = 64B

**Streaming Mode**, introduced by SME, operates with a distinct VL known as **SVL**. Some key differences include:
- The SME core is a separate hardware unit, supporting different VLs for NSVL and SVL.
- Transitions to and from Streaming Mode are controlled using the `SMSTART` and `SMSTOP` instructions.
- Certain SVE1, SVE2, and NEON instructions are invalid in Streaming Mode.

This doesn't really explain what VL is. Maybe that's Ok, since there's Arm documentation that does that. This document seems to be written such that it could be read by people not already familiar with SME itself, so perhaps it's useful to link to some extra documentation for basic SVE concepts too. Arm's "Introduction to SVE" might be a good choice.
Personally, I find it useful to explain that SME introduces a "streaming mode" where VL differs, and then NSVL (non-streaming vector length) and SVL (streaming vector length) map intuitively. A plain "VL" is simply the current vector length, according to the current mode.
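
To make the terminology concrete, here is a minimal Python sketch (not part of the proposal) of how the current VL depends on the mode and how many lanes that gives per element size. The function names and the `nsvl_bytes` / `svl_bytes` parameters are illustrative assumptions; the only architectural fact relied on is the 16B to 256B range quoted above, in multiples of 16 bytes.

```python
# Element sizes in bytes for the B, H, S and D element suffixes.
ELEMENT_BYTES = {"B": 1, "H": 2, "S": 4, "D": 8}


def current_vl(streaming: bool, nsvl_bytes: int, svl_bytes: int) -> int:
    """Plain "VL" is whichever length applies in the current mode."""
    vl = svl_bytes if streaming else nsvl_bytes
    if vl % 16 != 0 or not 16 <= vl <= 256:
        raise ValueError("VL must be a multiple of 16 bytes in [16, 256]")
    return vl


def lanes(vl_bytes: int, suffix: str) -> int:
    """Number of elements of the given size that fit in one vector."""
    return vl_bytes // ELEMENT_BYTES[suffix]


# Hypothetical machine (assumption): NSVL = 16 bytes, SVL = 64 bytes.
print(lanes(current_vl(False, 16, 64), "S"))  # 4 lanes outside streaming mode
print(lanes(current_vl(True, 16, 64), "S"))   # 16 lanes in streaming mode
```

The same `.s` operation therefore covers a different number of lanes depending on the mode, which is why a bare "VL" only makes sense relative to the current mode.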

- `SMSTOP`: Deactivates Streaming Mode and restores non-streaming operations.

**Important Considerations**:
- Transitioning between modes incurs a performance penalty, as register states must be saved and restored.

This isn't always true, but I'll leave a detailed comment on your later example.

- Ensure non-streaming instructions are not executed while Streaming Mode is active.

The `FEAT_SME_FA64` feature, if present, integrates the SME unit directly onto each CPU core (referred to as the "Mesh" configuration on the right). In this configuration, no instructions become invalid. However, this setup may not be implemented for server SKUs due to the significant silicon area required for an SME unit on every core. This implementation detail still requires confirmation from Arm.

`FEAT_SME_FA64` indicates that you can use the full A64 instruction set whilst in Streaming SVE mode. It doesn't strictly dictate the SoC design. From a programmer's perspective, it is probably the availability of instructions that is important for correctness. SoC design features become a platform-specific tuning trade-off.

### ZA Storage

The second key concept introduced by SME is **ZA Storage**, which is a 2D matrix of dimensions `N x N`, where `N` represents the Scalable Vector Length (SVL) (note: this is SVL and not NSVL). This matrix is exclusively accessible in Streaming Mode, and any instructions to read, write, or manipulate the ZA Storage must be executed within Streaming Mode. Similar to NEON registers and Scalable Vector Registers, the contents of the ZA Storage matrix can be interpreted in various data types, including 1-byte (`B`), 2-byte half (`H`), 4-byte single (`S`), 8-byte double (`D`), or 16-byte quad (`Q`).

Why not just write `SVL x SVL`, rather than `N x N` and then a definition of `N`?
ZA is actually writeable (in a limited way) outside of Streaming SVE mode. For example, `smstart za` can be used to enable the SME ZA storage without entering streaming mode. A limited set of instructions (some loads, stores and a "zero" instruction) can then access it.
This may not be particularly useful under the proposed `[SME_Streaming]` abstraction, but this section seems to be describing the underlying architecture.
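
As a rough illustration of the two points above, the following Python sketch models the ZA geometry (`SVL x SVL` bytes, split into tiles per element size) and which kinds of access are possible for a given combination of `PSTATE.SM` and `PSTATE.ZA`. It is a simplified model, not an architectural reference: the `PState` class, the `can_execute` helper and its two `kind` labels are made-up names, and the rule that matrix operations need both bits set is a deliberate simplification.

```python
from dataclasses import dataclass

# Element sizes in bytes for the B, H, S, D and Q interpretations of ZA.
ELEMENT_BYTES = {"B": 1, "H": 2, "S": 4, "D": 8, "Q": 16}


def za_bytes(svl_bytes: int) -> int:
    """Total ZA storage: SVL x SVL, with SVL measured in bytes."""
    return svl_bytes * svl_bytes


def tile_count(suffix: str) -> int:
    """Number of ZA tiles for an element size (ZA0.B, ZA0-ZA1.H, ZA0-ZA3.S, ...)."""
    return ELEMENT_BYTES[suffix]


def tile_shape(svl_bytes: int, suffix: str) -> tuple[int, int]:
    """Rows x columns of one tile, counted in elements."""
    n = svl_bytes // ELEMENT_BYTES[suffix]
    return (n, n)


@dataclass
class PState:
    sm: bool = False  # PSTATE.SM: streaming SVE mode enabled
    za: bool = False  # PSTATE.ZA: ZA storage enabled


def can_execute(kind: str, p: PState) -> bool:
    """kind is a hypothetical label, not an architectural instruction class."""
    if kind == "za_load_store_zero":
        return p.za           # usable after `smstart za`, streaming or not
    if kind == "za_matrix_op":
        return p.sm and p.za  # simplified: needs streaming mode and ZA enabled
    raise ValueError(f"unknown kind: {kind}")


after_smstart_za = PState(sm=False, za=True)
print(za_bytes(64), tile_count("S"), tile_shape(64, "S"))
print(can_execute("za_load_store_zero", after_smstart_za))
print(can_execute("za_matrix_op", after_smstart_za))
```

For any element size, `tile_count` times the bytes in one tile equals `za_bytes`, which is a quick sanity check on the tiling.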

```asm
fcmgt   p0.s, p0/z, z8.s, z0.s    # Activate p0 lanes for which z8 > z0
ptrue   p1.s
cntp    x0, p1, p0.s              # Count number of active lanes
cmp     x0, #8                    # If all 8 lanes are active, all satisfy z8 > z0
cset    x0, eq                    # Set result based on comparison
```

This might not be the best example. `fcmgt` doesn't set flags, but it does set a predicate, which can be tested to see if it matched the original governing predicate:
fcmgt p1.s, p0/z, z8.s, z0.s # p1.s[n] = p0.s[n] ? z8.s[n] > z0.s[n] : false
bics p1.s, p0/z, p0.s, p1.s # p1.s[n] = p0.s[n] & !p1.s[n], Z = <no lanes active>
cset x0, none # x0 = (Z ? 1 : 0) # "none" is an alias for "eq"
That's agnostic and probably faster than the fixed-VL listing. (I haven't tested or measured it!)
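
The difference between the two listings can be modelled in Python, with predicates represented as lists of booleans. This is only a sketch of the logic, not of the instructions' exact semantics; the function names and the `expected_lanes` parameter are illustrative assumptions.

```python
def fcmgt(governing, a, b):
    """Per-lane a > b, forced to False where the governing predicate is inactive."""
    return [g and (x > y) for g, x, y in zip(governing, a, b)]


def all_greater_fixed(governing, a, b, expected_lanes=8):
    """Mirrors the cntp / cmp #8 listing: counts true lanes and compares to 8."""
    result = fcmgt(governing, a, b)
    return sum(result) == expected_lanes


def all_greater_agnostic(governing, a, b):
    """Mirrors the bics / cset listing: true when no governed lane failed."""
    result = fcmgt(governing, a, b)
    failed = [g and not r for g, r in zip(governing, result)]
    return not any(failed)


# With 4 lanes (VL = 16B of .s elements) every lane satisfies a > b,
# but only the agnostic check reports that.
gov, a, b = [True] * 4, [2.0] * 4, [1.0] * 4
print(all_greater_fixed(gov, a, b), all_greater_agnostic(gov, a, b))
```

The fixed version is only correct when exactly eight lanes are governed, whereas the agnostic version gives the same answer for any lane count.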

- When suspended execution threads are resumed by the GC, some threads might be in streaming mode while others might not. Investigate whether any handling is required to ensure that their modes are restored correctly, avoiding accidental execution in the wrong mode.

#### Assembly Routines and Stubs
- Hand-written assembly routines will avoid using streaming mode for simplicity. However, if streaming is necessary, ensure that all required streaming and `ZA` states are saved and restored properly.

I'm not sure if that's a correct assumption, but it's true that hand-written assembly will have to manage streaming states (etc) properly.

#### .NET <--> PInvokes / System Calls
- For all calls (e.g., PInvokes, JIT helpers, stubs), ensure that scalable `Z` and predicate `P` registers are saved.

I'm not familiar with how PInvokes work in .NET, but isn't this just about making sure that `__arm_streaming(_compatible)` is respected on external libraries? Saving Z and P registers is just an AAPCS64 concern (and applies irrespective of SME).

For example, the corresponding .NET API for the above instruction could look like this:
```csharp
void SvmlaLaneZA32S8_VG4x2(uint slice, Tuple<Vector<byte>, Vector<byte>> zn, Vector<byte> zm, ulong imm_idx);
```

- Doesn't this lack `[SME_Streaming]`?
- What (if anything) indicates that this function accesses `za`? Will .NET have an equivalent to `__arm_inout("za")`? The nomenclature section below looks like it will explain this, but it doesn't, so perhaps there's a missing bullet point there.

- ZA lazy scheme: https://arm-software.github.io/acle/main/acle.html#sme-instruction-intrinsics

## Open Questions
- What happens when `Vector<T>` is created in streaming mode? Can we pass it around to non-streaming mode and vice-versa? 18.1.7 restricts VL-dependent arguments to be passed that way, but how to restrict them in C#?

I think 18.1.7 is a reference to ACLE. I can't answer the question for C#, but that's one of the hard questions here, I think.
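
The hazard behind this question can be modelled in a few lines of Python. This is purely illustrative and not a mechanism the proposal describes: `TaggedVector` and `use_vector` are hypothetical names, and the idea of checking the length at the point of use is an assumption made only to show where a mismatch would surface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedVector:
    data: bytes  # raw contents; len(data) is the VL in effect at creation


def use_vector(v: TaggedVector, current_vl_bytes: int) -> bytes:
    """Reject a vector whose size does not match the VL of the current mode."""
    if len(v.data) != current_vl_bytes:
        raise ValueError(
            f"vector holds {len(v.data)} bytes but current VL is {current_vl_bytes}"
        )
    return v.data


# Hypothetical machine (assumption): NSVL = 16 bytes, SVL = 64 bytes.
created_non_streaming = TaggedVector(bytes(16))
use_vector(created_non_streaming, 16)      # same mode: fine
try:
    use_vector(created_non_streaming, 64)  # crossed into streaming mode
except ValueError as err:
    print(err)
```

When NSVL and SVL happen to be equal the check always passes, which is part of what makes this class of bug hard to notice on some hardware.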

- Agnostic Vector Length (VL) code generation
- Exception Handling
- Threads and SME state
- NativeAOT and crossgen2 handling

On top of the codegen/unwinding/GC info reporting challenges that ReadyToRun and native AOT are going to share, native AOT has additional problems with data structures generated by the AOT compiler.
The native AOT compiler doesn't just generate code, it also generates data and metadata. An example of the data generated by the compiler is the `MethodTable`. A `MethodTable` is the thing that describes a type to the runtime and captures things like its size or GC info. If the size and GC info of the type are not known until runtime because there is an instance field of type `Vector<T>` somewhere, we may not be able to emit a `MethodTable` into the executable. All accesses to it from code (e.g. for a `new` or cast) will have to go through an indirection (probably a helper call). We'll also need to figure out how these will be accessed from other data structures (e.g. what if a type with `Vector<T>` instance fields is a base of another type: what does the BaseType field of `MethodTable` point to?).
Other considerations:
- What if `Vector<T>` is a static field? (e.g. does C++ allow these variable-sized types as globals?). We place statics into the `.data` section of the executable.
- How does reflection access work (accessing fields in a type that has `Vector<T>` fields before the accessed field; or accessing the `Vector<T>` fields themselves)?
- Can we represent this for the debuggers? (Looks like native-first languages severely constrained how these types can be used so there likely isn't a native (DWARF/CodeView) debug info representation for "field on a struct with variable-size")
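
To illustrate why a variable-sized field is awkward for precomputed type data, here is a small Python sketch that lays out fields in declaration order for a given VL. It is a toy model under stated assumptions: `layout` and the `"VL"` marker are made-up names, and padding and alignment are ignored entirely.

```python
def layout(fields, vl_bytes):
    """Return ({name: offset}, total_size) for fields laid out in order.

    Each field is (name, size), where size is a byte count or the string
    "VL" for a Vector<T>-like field whose size is only known at runtime.
    Padding and alignment are ignored (assumption for brevity).
    """
    offsets = {}
    offset = 0
    for name, size in fields:
        offsets[name] = offset
        offset += vl_bytes if size == "VL" else size
    return offsets, offset


# Hypothetical type with a vector field in the middle.
fields = [("id", 4), ("vec", "VL"), ("count", 8)]
print(layout(fields, 16))  # offsets and size if VL is 16 bytes
print(layout(fields, 64))  # same type, different offsets if VL is 64 bytes
```

Both the total size and the offset of every field after `vec` change with VL, so neither can be baked into the executable ahead of time without some form of runtime indirection.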
Here is the design proposal for supporting SME in the .NET ecosystem.