
proposal: generate unrolled implementations in internal/asm #1926

Open

egonelbre opened this issue Nov 22, 2023 · 3 comments

egonelbre commented Nov 22, 2023

Background

While going over some of the internal/asm code, I realized that most of the assembly implementations do not actually use any SSE, AVX, or NEON instructions; most of their benefit comes from avoiding bounds checks and unrolling loops. So, in theory, if the Go code can avoid bounds checks, those assembly implementations can be replaced.

The main issue is that Go's bounds-check elimination is not strong enough for straightforward Go code to perform well.
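To illustrate the limitation: for unit-stride loops the compiler can usually be coaxed into dropping the checks with a slice-length hint, but the trick does not extend to strided access. A minimal sketch (function names are mine; the elimination can be inspected with go build -gcflags=-d=ssa/check_bce):

```go
package main

import "fmt"

// axpyUnit computes dst[i] += alpha * x[i] for unit stride. The
// reslice up front both enforces the length requirement and lets
// the compiler prove every later index is in range, so the
// per-iteration bounds checks disappear.
func axpyUnit(alpha float32, x, dst []float32) {
	x = x[:len(dst)] // panics early if x is too short; hints BCE
	for i := range dst {
		dst[i] += alpha * x[i]
	}
}

// axpyInc is the strided variant. The index i*incX is not provably
// in range, so the compiler keeps a bounds check in the loop body.
func axpyInc(alpha float32, x, dst []float32, n, incX, incDst int) {
	for i := 0; i < n; i++ {
		dst[i*incDst] += alpha * x[i*incX]
	}
}

func main() {
	dst := []float32{1, 2, 3}
	axpyUnit(2, []float32{1, 1, 1}, dst)
	fmt.Println(dst) // [3 4 5]
}
```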

This doesn't rule out adding SSE, AVX, NEON (and other) optimizations in the future.

Proposal

Most of the operations (but not all) in internal/asm can be stated as:

  • iterate over a set of vectors, each with a potentially different increment,
  • applying a given computation to each element.

The first question is: what bounds-check-free code is optimal for Go? I ran a bunch of benchmarking experiments writing axpy: https://github.com/egonelbre/exp/blob/vec/vector/compare/axpy.go. On amd64 the best-performing variant was AxpyPointerR4, and on a Mac M1 it was AxpyUnsafeInlineR4. Feel free to re-run the benchmarks to verify the results on more machines.

goos: windows
goarch: amd64
pkg: github.com/egonelbre/exp/vector/compare
cpu: AMD Ryzen Threadripper 2950X 16-Core Processor
AxpyAssembly-32                  117.6µ ± 1%
AxpyBasic-32                     151.0µ ± 1%
AxpyUnsafe-32                    93.06µ ± 0%
AxpyUnsafeInline-32              111.3µ ± 2%
AxpyPointer-32                   149.0µ ± 0%
AxpyPointerLoop-32               156.2µ ± 0%
AxpyBasicR4-32                   109.7µ ± 1%
AxpyUnsafeR4-32                  101.5µ ± 1%
AxpyUnsafeInlineR4-32            102.3µ ± 2%
AxpyUnsafeInlineR8-32            103.5µ ± 1%
AxpyPointerR4-32                 83.41µ ± 1%
AxpyPointerLoopR4-32             81.90µ ± 1%
AxpyPointerLoopInterleaveR4-32   84.45µ ± 1%
AxpyPointerR4Alt-32              84.55µ ± 1%

goos: darwin
goarch: arm64
pkg: github.com/egonelbre/exp/vector/compare
AxpyBasic-10                     105.4µ ± 0%
AxpyUnsafe-10                    84.80µ ± 1%
AxpyUnsafeInline-10              87.73µ ± 1%
AxpyPointer-10                   128.2µ ± 0%
AxpyPointerLoop-10               124.7µ ± 0%
AxpyBasicR4-10                   94.14µ ± 0%
AxpyUnsafeR4-10                  84.83µ ± 0%
AxpyUnsafeInlineR4-10            73.84µ ± 0%
AxpyUnsafeInlineR8-10            76.90µ ± 0%
AxpyPointerR4-10                 83.03µ ± 0%
AxpyPointerLoopR4-10             83.10µ ± 0%
AxpyPointerLoopInterleaveR4-10   83.09µ ± 0%
AxpyPointerR4Alt-10              83.27µ ± 0%

Notice that the Go versions with bounds checks removed ended up faster than the current gonum axpy assembly implementation. Even the unrolled version with bounds checks still present ended up faster than the current assembly version.
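For reference, the pointer-walking, four-way unrolled shape that won on amd64 looks roughly like this; this is my reconstruction of the idea using unsafe.Add (Go 1.17+) and unsafe.SliceData (Go 1.20+), not the exact AxpyPointerR4 code from the repo:

```go
package main

import (
	"fmt"
	"unsafe"
)

// axpyPointerR4 computes y[i] += alpha * x[i] by walking raw
// pointers, four elements per iteration, so no bounds checks run
// inside the hot loop. The single length check up front keeps the
// accesses in range.
func axpyPointerR4(alpha float32, x, y []float32) {
	n := len(y)
	if len(x) < n {
		panic("x is too small")
	}
	const sz = unsafe.Sizeof(float32(0))
	xp := unsafe.Pointer(unsafe.SliceData(x))
	yp := unsafe.Pointer(unsafe.SliceData(y))
	i := 0
	for ; i+4 <= n; i += 4 {
		*(*float32)(yp) += alpha * *(*float32)(xp)
		*(*float32)(unsafe.Add(yp, 1*sz)) += alpha * *(*float32)(unsafe.Add(xp, 1*sz))
		*(*float32)(unsafe.Add(yp, 2*sz)) += alpha * *(*float32)(unsafe.Add(xp, 2*sz))
		*(*float32)(unsafe.Add(yp, 3*sz)) += alpha * *(*float32)(unsafe.Add(xp, 3*sz))
		xp, yp = unsafe.Add(xp, 4*sz), unsafe.Add(yp, 4*sz)
	}
	for ; i < n; i++ { // remaining tail elements
		*(*float32)(yp) += alpha * *(*float32)(xp)
		xp, yp = unsafe.Add(xp, sz), unsafe.Add(yp, sz)
	}
}

func main() {
	x := []float32{1, 2, 3, 4, 5}
	y := []float32{10, 10, 10, 10, 10}
	axpyPointerR4(2, x, y)
	fmt.Println(y) // [12 14 16 18 20]
}
```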

Of course, writing such unrolled and optimized versions by hand would be rather error-prone and annoying. I realized that it should be easy to generate the code, as long as we constrain ourselves to a subset of basic operations.

Initially I was thinking of using regular Go code as the base implementation and then running an "unroller and optimizer" on it; however, that seemed difficult to work with.

I finally figured out a way https://github.com/egonelbre/exp/blob/vec/vector/generate/example.go#L37 to write rather concise definitions of such functions:

ctx.Func("ScalIncUnitaryTo(dst []$Type, incdst uintptr, alpha $Type, xs []$Type, n, incx uintptr)",
	Iterate{
		Ranges: []It{
			Range("i", 0, "n"),
			Slice("dst", 0, "incdst"),
			Slice("xs", 0, "incx"),
		},
		For: "$dst = alpha * $xs",
	})
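To make the substitution step concrete, here is a toy version of the placeholder expansion that the definition above relies on, where $dst and $xs expand to indexed accesses. The real generator in the linked repo does considerably more (unrolling, counters, bounds checks); names here are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// expand replaces every $name placeholder in the loop body with an
// indexed access scaled by that slice's increment variable.
func expand(body string, incs map[string]string) string {
	for name, inc := range incs {
		body = strings.ReplaceAll(body, "$"+name, fmt.Sprintf("%s[i*%s]", name, inc))
	}
	return body
}

func main() {
	fmt.Println(expand("$dst = alpha * $xs", map[string]string{
		"dst": "incdst",
		"xs":  "incx",
	}))
	// dst[i*incdst] = alpha * xs[i*incx]
}
```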

The generator is mostly hacked together https://github.com/egonelbre/exp/blob/vec/vector/generate/naive.go, but nothing extreme. It currently implements four configurations:

// Unroll: 0, Counter: false

func ScalIncUnitaryTo(dst []float32, incdst uintptr, alpha float32, xs []float32, n, incx uintptr) {
        if len(dst) < int(n*incdst) {
                panic("dst is too small")
        }
        if len(xs) < int(n*incx) {
                panic("xs is too small")
        }
        for i := uintptr(0); i < n; i++ {
                dst[i*incdst] = alpha * xs[i*incx]
        }
}

// Unroll: 0, Counter: true

func ScalIncUnitaryTo(dst []float32, incdst uintptr, alpha float32, xs []float32, n, incx uintptr) {
        if len(dst) < int(n*incdst) {
                panic("dst is too small")
        }
        if len(xs) < int(n*incx) {
                panic("xs is too small")
        }
        idst := uintptr(0)
        ixs := uintptr(0)
        for i := uintptr(0); i < n; i++ {
                dst[idst] = alpha * xs[ixs]
                idst += incdst
                ixs += incx
        }
}

// Unroll: 4, Counter: false

func ScalIncUnitaryTo(dst []float32, incdst uintptr, alpha float32, xs []float32, n, incx uintptr) {
        if len(dst) < int(n*incdst) {
                panic("dst is too small")
        }
        if len(xs) < int(n*incx) {
                panic("xs is too small")
        }
        i := uintptr(0)
        n_unroll := n - n%4
        for ; i < n_unroll; i += 4 {
                dst[i*incdst] = alpha * xs[i*incx]
                dst[(i+1)*incdst] = alpha * xs[(i+1)*incx]
                dst[(i+2)*incdst] = alpha * xs[(i+2)*incx]
                dst[(i+3)*incdst] = alpha * xs[(i+3)*incx]
        }
        for ; i < n; i++ {
                dst[i*incdst] = alpha * xs[i*incx]
        }
}

// Unroll: 4, Counter: true

func ScalIncUnitaryTo(dst []float32, incdst uintptr, alpha float32, xs []float32, n, incx uintptr) {
        if len(dst) < int(n*incdst) {
                panic("dst is too small")
        }
        if len(xs) < int(n*incx) {
                panic("xs is too small")
        }
        i := uintptr(0)
        n_unroll := n - n%4
        idst := uintptr(0)
        ixs := uintptr(0)
        for ; i < n_unroll; i += 4 {
                dst[idst] = alpha * xs[ixs]
                dst[idst+incdst] = alpha * xs[ixs+incx]
                dst[idst+2*incdst] = alpha * xs[ixs+2*incx]
                dst[idst+3*incdst] = alpha * xs[ixs+3*incx]
                idst += incdst * 4
                ixs += incx * 4
        }
        for ; i < n; i++ {
                dst[idst] = alpha * xs[ixs]
                idst += incdst
                ixs += incx
        }
}

The slice accesses can be replaced with an appropriate at implementation. It would even be possible to generate the same code for all implementations, but have the at operation implemented differently for different targets (arm, amd64, safe).
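As a sketch of what that could look like: two interchangeable at helpers sharing one signature, where generated loop bodies call at and the build picks a variant (e.g. via build tags per target). Names and layout here are my assumptions, not gonum code:

```go
package main

import (
	"fmt"
	"unsafe"
)

// atSafe uses a plain index expression, so the runtime bounds
// check stays in place.
func atSafe(s []float32, i uintptr) *float32 { return &s[i] }

// atUnsafe does raw pointer arithmetic with no bounds check; the
// caller must have validated the index range up front. In a real
// layout the two variants would live in separate files selected by
// build tags, with the generated loop body shared between them.
func atUnsafe(s []float32, i uintptr) *float32 {
	return (*float32)(unsafe.Add(unsafe.Pointer(unsafe.SliceData(s)), i*unsafe.Sizeof(float32(0))))
}

func main() {
	s := []float32{1, 2, 3}
	*atSafe(s, 1) = 20
	*atUnsafe(s, 2) = 30
	fmt.Println(s) // [1 20 30]
}
```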

I'm certain there are a bunch of bugs present in the proof-of-concept, but I decided to show it as is, because it should be sufficient to get the ideas across.

Potential impact of proposal

  • Better or equal performance for everything that uses internal/asm, without using assembly.
  • More easily inspectable and maintainable code.
  • Less work to implement new features that require optimized loops across multiple slices.

@egonelbre (Contributor, Author)

I realized that it should be possible to extend the same idea to SSE, AVX, and NEON implementations as well, e.g.


ctx.Func("ScalIncUnitaryTo(dst []$Type, incdst uintptr, alpha $Type, xs []$Type, n, incx uintptr)",
	Iterate{
		Ranges: []It{
			Range("i", 0, "n"),
			Slice("dst", 0, "incdst"),
			Slice("xs", 0, "incx"),
		},
		Go: "$dst = alpha * $xs",
		Neon: ctx.WhenType( // ignore the correctness
			"float32", "VMUL.F32 $dst, @alpha, $xs",
			"float64", "VMUL.F64 $dst, @alpha, $xs",
		),
	})

Also, it feels like someone has already done this, somewhere.

@kortschak (Member)

Sorry for the slow response. Local benchmarks below (the first from a relatively old i7, the second from an RPi 3B+). I think the first take-home is the degree of behavioural variation between architectures, even within the same broad families. On the i7, the assembly sits in the middle of the pack, and on the RPi it is no better than the mode.

In general, though, I think this is a good idea; if we can get similar or better performance without assembly, that is a win. At the moment we have limited active contributor expertise with assembly, so new features are unlikely to be added, and bugs (should they be found) would be difficult to address.

goos: linux
goarch: amd64
pkg: github.com/egonelbre/exp/vector/compare
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
                              │   cmp.bench   │
                              │    sec/op     │
AxpyAssembly-8                  205.5µ ±  17%
AxpyBasic-8                     251.6µ ±  15%
AxpyUnsafe-8                    168.9µ ±   3%
AxpyUnsafeInline-8              195.2µ ±   9%
AxpyPointer-8                   185.9µ ±   3%
AxpyPointerLoop-8               186.9µ ±  23%
AxpyBasicR4-8                   230.6µ ±  10%
AxpyUnsafeR4-8                  192.5µ ±   4%
AxpyUnsafeInlineR4-8            186.1µ ± 117%
AxpyUnsafeInlineR8-8            205.2µ ±  22%
AxpyPointerR4-8                 217.0µ ±   7%
AxpyPointerLoopR4-8             214.4µ ±   2%
AxpyPointerLoopInterleaveR4-8   201.9µ ±  21%
AxpyPointerR4Alt-8              198.4µ ±   2%
geomean                         201.9µ

goos: linux
goarch: arm64
pkg: github.com/egonelbre/exp/vector/compare
                              │  cmp.bench  │
                              │   sec/op    │
AxpyBasic-4                     5.144m ± 1%
AxpyUnsafe-4                    5.168m ± 1%
AxpyUnsafeInline-4              5.605m ± 3%
AxpyPointer-4                   4.335m ± 2%
AxpyPointerLoop-4               4.334m ± 2%
AxpyBasicR4-4                   5.121m ± 1%
AxpyUnsafeR4-4                  5.145m ± 1%
AxpyUnsafeInlineR4-4            5.166m ± 0%
AxpyUnsafeInlineR8-4            5.185m ± 1%
AxpyPointerR4-4                 5.135m ± 1%
AxpyPointerLoopR4-4             5.115m ± 1%
AxpyPointerLoopInterleaveR4-4   5.233m ± 1%
AxpyPointerR4Alt-4              5.126m ± 0%
geomean                         5.051m

@egonelbre (Contributor, Author)

So I think this needs an additional comparison against assembly that ensures loop alignment. The results look somewhat odd to me, and loop alignment might be the cause. However, I'm not sure how we could force loop alignment from Go code.
