proposal: generate unrolled implementations in internal/asm
#1926
Comments
I realized that it should be possible to apply a similar idea to SSE, AVX, and NEON as well, e.g.
Also, it feels like someone has already done this, somewhere.
Sorry for the slow response. Local benchmarks (the first is from a relatively old i7 and the second from an RPi 3B+). I think the first take-home is the degree of behaviour variation between architectures, even within the same broad families. On the i7, the assembly sits in the middle, and on the RPi it is no better than the mode. In general, though, I think this is a good idea: if we can get similar or better performance without assembly, that is a win. At the moment we have limited active contributor expertise with assembly, so new features are unlikely to be added and bugs (should they be found) would be difficult to address.
So, I think this needs an additional comparison against assembly that ensures loop alignment. To me the results look kind of weird, and it might be due to loop alignment. However, I'm not sure how we could force loop alignment from Go code.
Background
While going over some of the `internal/asm` code, I realized that most of the assembly implementations do not actually have any SSE, AVX, or NEON optimizations; most of the benefit comes from avoiding the bounds checks and unrolling the loops. So, theoretically, if the Go code can avoid bounds checks, then those assembly implementations can be avoided.
The main issue is that Go doesn't do bounds-check elimination well enough for us to simply write straightforward Go code.
This doesn't exclude having SSE, AVX, NEON (and other) optimizations in the future.
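To make the bounds-check point concrete, here is a minimal sketch of a 4-way unrolled axpy (`y[i] += alpha*x[i]`) in safe Go. The name `axpyUnrolled4` is made up for this illustration and is not one of the benchmarked implementations. The reslicing tricks are hints only; whether the compiler actually drops the checks depends on the Go version and can be inspected with `go build -gcflags="-d=ssa/check_bce/debug=1"`.

```go
package main

import "fmt"

// axpyUnrolled4 computes y[i] += alpha*x[i] for every i < len(x).
// Hypothetical name; a sketch, not gonum's implementation.
func axpyUnrolled4(alpha float64, x, y []float64) {
	// Reslicing y to len(x) panics early if y is too short and lets the
	// compiler relate the lengths of the two slices.
	y = y[:len(x)]
	for len(x) >= 4 {
		// Fixed-size sub-slices: the constant indices 0..3 are in range.
		xs, ys := x[:4:4], y[:4:4]
		ys[0] += alpha * xs[0]
		ys[1] += alpha * xs[1]
		ys[2] += alpha * xs[2]
		ys[3] += alpha * xs[3]
		x, y = x[4:], y[4:]
	}
	// Tail: fewer than four elements remain.
	for i := range x {
		y[i] += alpha * x[i]
	}
}

func main() {
	x := []float64{1, 2, 3, 4, 5}
	y := []float64{10, 10, 10, 10, 10}
	axpyUnrolled4(2, x, y)
	fmt.Println(y) // [12 14 16 18 20]
}
```

Writing this by hand for every operation and unroll factor is exactly the repetitive part that is easy to get wrong.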
Proposal
Most of the operations (but not all) in `internal/asm` can be stated as:
The first question is: what bounds-check-free code would be optimal for Go? I did a bunch of benchmarking experiments with different ways of writing `axpy` (https://github.com/egonelbre/exp/blob/vec/vector/compare/axpy.go). On amd64 the best performing was `AxpyPointerR4`, and on a Mac M1 it was `AxpyUnsafeInlineR4`. Feel free to re-run the benchmarks to verify the results on more machines.
Notice that the Go versions with bounds checks removed ended up faster than the current gonum axpy assembly implementation. Even the unrolled version with bounds checks present ended up faster than the current gonum assembly version.
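For readers who have not opened the linked file, the following is a rough guess at the general shape of a pointer-based, 4-way unrolled variant. It is not a copy of `AxpyPointerR4`; the name `axpyPointer4` and all details are assumptions made for illustration. It requires Go 1.17 or later for `unsafe.Add`.

```go
package main

import (
	"fmt"
	"unsafe"
)

// axpyPointer4 computes y[i] += alpha*x[i] using raw pointers, so the loop
// body carries no bounds checks. Hypothetical sketch, not the benchmarked code.
func axpyPointer4(alpha float64, x, y []float64) {
	n := len(x)
	if len(y) < n {
		panic("axpyPointer4: y is shorter than x")
	}
	if n == 0 {
		return
	}
	const size = int(unsafe.Sizeof(float64(0)))
	xp := unsafe.Pointer(&x[0])
	yp := unsafe.Pointer(&y[0])
	i := 0
	for ; i+4 <= n; i += 4 {
		// View the next four elements as fixed-size arrays; indexing an
		// array pointer with constants needs no run-time check.
		xs := (*[4]float64)(unsafe.Add(xp, i*size))
		ys := (*[4]float64)(unsafe.Add(yp, i*size))
		ys[0] += alpha * xs[0]
		ys[1] += alpha * xs[1]
		ys[2] += alpha * xs[2]
		ys[3] += alpha * xs[3]
	}
	// Tail: handle the remaining 0..3 elements one at a time.
	for ; i < n; i++ {
		xe := (*float64)(unsafe.Add(xp, i*size))
		ye := (*float64)(unsafe.Add(yp, i*size))
		*ye += alpha * *xe
	}
}

func main() {
	x := []float64{1, 2, 3, 4, 5, 6, 7}
	y := make([]float64, len(x))
	axpyPointer4(3, x, y)
	fmt.Println(y) // [3 6 9 12 15 18 21]
}
```

The single length check up front replaces the per-element checks, which is the trade-off the assembly versions make as well.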
Of course, writing such unrolled and optimized versions by hand would be rather error-prone and annoying. I realized that it should be easy to generate the code, as long as we constrain it to the subset of basic operations it needs to support.
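As a toy illustration of why generation is tractable, the sketch below expands a one-line per-element statement into an unrolled loop plus a tail loop and runs the result through `go/format`. The function `generateUnrolled` and its `$i` placeholder syntax are invented for this example; they are not the API of the linked proof-of-concept.

```go
package main

import (
	"fmt"
	"go/format"
	"strings"
)

// generateUnrolled emits Go source for a function called name that applies
// stmt to every index of x and y. Inside stmt, "$i" stands for the element
// index. Hypothetical helper, for illustration only.
func generateUnrolled(name, stmt string, unroll int) (string, error) {
	if unroll < 1 {
		return "", fmt.Errorf("unroll factor must be at least 1, got %d", unroll)
	}
	var b strings.Builder
	b.WriteString("package vec\n\n")
	fmt.Fprintf(&b, "func %s(alpha float64, x, y []float64) {\n", name)
	b.WriteString("y = y[:len(x)]\n")
	b.WriteString("i := 0\n")
	// Main loop: the statement is repeated once per unrolled lane.
	fmt.Fprintf(&b, "for ; i+%d <= len(x); i += %d {\n", unroll, unroll)
	for k := 0; k < unroll; k++ {
		idx := fmt.Sprintf("i+%d", k)
		b.WriteString(strings.ReplaceAll(stmt, "$i", idx) + "\n")
	}
	b.WriteString("}\n")
	// Tail loop: the same statement, one element at a time.
	b.WriteString("for ; i < len(x); i++ {\n")
	b.WriteString(strings.ReplaceAll(stmt, "$i", "i") + "\n")
	b.WriteString("}\n}\n")
	// format.Source also rejects output that is not valid Go syntax.
	src, err := format.Source([]byte(b.String()))
	if err != nil {
		return "", err
	}
	return string(src), nil
}

func main() {
	src, err := generateUnrolled("axpy4", "y[$i] += alpha * x[$i]", 4)
	if err != nil {
		fmt.Println("generate:", err)
		return
	}
	fmt.Print(src)
}
```

String substitution is the crudest possible approach; it is enough to show that only the per-element statement differs between operations.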
Initially I was thinking of using regular Go code as the base implementation and then running an "unroller and optimizer" on it, however, that seemed difficult to work with.
I finally figured out a way (https://github.com/egonelbre/exp/blob/vec/vector/generate/example.go#L37) to write a rather concise definition of such functions:
The generator is mostly hacked together https://github.com/egonelbre/exp/blob/vec/vector/generate/naive.go, but nothing extreme. It currently implements four configurations:
The slice accesses can be replaced with the appropriate `at` implementation. It would even be possible to generate the same code for all the implementations, but have the `at` operation be implemented differently for different targets (`arm`, `amd64`, `safe`).
I'm certain there are a bunch of bugs present in the proof-of-concept, but I decided to show it as is, because it should be sufficient to get the ideas across.
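One way to picture the `at` idea is a pair of element accessors with the same signature, one checked and one unchecked. The names `atSafe`, `atUnsafe`, and `axpyWith` below are hypothetical, and the sketch assumes Go 1.20 or later for `unsafe.SliceData`. In a real setup the accessor would more likely be chosen per target with build tags so that it can be inlined; passing it as a function value here is only to keep the example in a single file.

```go
package main

import (
	"fmt"
	"unsafe"
)

// atSafe returns a pointer to s[i] with the usual bounds check.
func atSafe(s []float64, i int) *float64 { return &s[i] }

// atUnsafe returns a pointer to element i without a bounds check.
// The caller must guarantee 0 <= i < len(s).
func atUnsafe(s []float64, i int) *float64 {
	const size = int(unsafe.Sizeof(float64(0)))
	return (*float64)(unsafe.Add(unsafe.Pointer(unsafe.SliceData(s)), i*size))
}

// axpyWith has one loop body that works with either accessor.
func axpyWith(at func([]float64, int) *float64, alpha float64, x, y []float64) {
	if len(y) < len(x) {
		panic("axpyWith: y is shorter than x")
	}
	for i := 0; i < len(x); i++ {
		*at(y, i) += alpha * *at(x, i)
	}
}

func main() {
	x := []float64{1, 2, 3}
	y1 := []float64{1, 1, 1}
	y2 := []float64{1, 1, 1}
	axpyWith(atSafe, 2, x, y1)
	axpyWith(atUnsafe, 2, x, y2)
	fmt.Println(y1, y2) // [3 5 7] [3 5 7]
}
```

The loop body is identical in both calls, which is what would let a generator emit it once and vary only the accessor.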
Potential impact of proposal
`internal/asm` without using assembly.
`for` loops across multiple slices.