
proposal: generate unrolled implementations in internal/asm #1926

Open

egonelbre opened this issue Nov 22, 2023 · 3 comments

egonelbre commented Nov 22, 2023

Background

While going over some of the internal/asm code, I realized that most of the assembly implementations do not actually use any SSE, AVX, or NEON instructions; most of their benefit comes from avoiding bounds checks and unrolling loops. So, in theory, if the Go code can avoid bounds checks, those assembly implementations can be replaced.

The main issue is that Go's bounds-check elimination is not strong enough for straightforward Go code to perform well.
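To illustrate the limitation: for unit-stride loops the compiler can usually be coaxed into dropping the checks with a slice-length hint, but the trick does not extend to strided access. A minimal sketch (function names are mine; the elimination can be inspected with go build -gcflags=-d=ssa/check_bce):

```go
package main

import "fmt"

// axpyUnit computes dst[i] += alpha * x[i] for unit stride. The
// reslice up front both enforces the length requirement and lets
// the compiler prove every later index is in range, so the
// per-iteration bounds checks disappear.
func axpyUnit(alpha float32, x, dst []float32) {
	x = x[:len(dst)] // panics early if x is too short; hints BCE
	for i := range dst {
		dst[i] += alpha * x[i]
	}
}

// axpyInc is the strided variant. The index i*incX is not provably
// in range, so the compiler keeps a bounds check in the loop body.
func axpyInc(alpha float32, x, dst []float32, n, incX, incDst int) {
	for i := 0; i < n; i++ {
		dst[i*incDst] += alpha * x[i*incX]
	}
}

func main() {
	dst := []float32{1, 2, 3}
	axpyUnit(2, []float32{1, 1, 1}, dst)
	fmt.Println(dst) // [3 4 5]
}
```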

This doesn't rule out adding SSE, AVX, NEON (and other) optimizations in the future.

Proposal

Most of the operations (but not all) in internal/asm can be stated as:

  • iterate over a set of vectors, each with a potentially different increment,
  • applying a given computation to each element.

The first question is: what bounds-check-free code is optimal for Go? I ran a bunch of benchmarking experiments writing axpy: https://github.com/egonelbre/exp/blob/vec/vector/compare/axpy.go. On amd64 the best-performing variant was AxpyPointerR4, and on a Mac M1 it was AxpyUnsafeInlineR4. Feel free to re-run the benchmarks to verify the results on more machines.

goos: windows
goarch: amd64
pkg: github.com/egonelbre/exp/vector/compare
cpu: AMD Ryzen Threadripper 2950X 16-Core Processor
AxpyAssembly-32                  117.6µ ± 1%
AxpyBasic-32                     151.0µ ± 1%
AxpyUnsafe-32                    93.06µ ± 0%
AxpyUnsafeInline-32              111.3µ ± 2%
AxpyPointer-32                   149.0µ ± 0%
AxpyPointerLoop-32               156.2µ ± 0%
AxpyBasicR4-32                   109.7µ ± 1%
AxpyUnsafeR4-32                  101.5µ ± 1%
AxpyUnsafeInlineR4-32            102.3µ ± 2%
AxpyUnsafeInlineR8-32            103.5µ ± 1%
AxpyPointerR4-32                 83.41µ ± 1%
AxpyPointerLoopR4-32             81.90µ ± 1%
AxpyPointerLoopInterleaveR4-32   84.45µ ± 1%
AxpyPointerR4Alt-32              84.55µ ± 1%

goos: darwin
goarch: arm64
pkg: github.com/egonelbre/exp/vector/compare
AxpyBasic-10                     105.4µ ± 0%
AxpyUnsafe-10                    84.80µ ± 1%
AxpyUnsafeInline-10              87.73µ ± 1%
AxpyPointer-10                   128.2µ ± 0%
AxpyPointerLoop-10               124.7µ ± 0%
AxpyBasicR4-10                   94.14µ ± 0%
AxpyUnsafeR4-10                  84.83µ ± 0%
AxpyUnsafeInlineR4-10            73.84µ ± 0%
AxpyUnsafeInlineR8-10            76.90µ ± 0%
AxpyPointerR4-10                 83.03µ ± 0%
AxpyPointerLoopR4-10             83.10µ ± 0%
AxpyPointerLoopInterleaveR4-10   83.09µ ± 0%
AxpyPointerR4Alt-10              83.27µ ± 0%

Notice that the Go versions with bounds checks removed ended up faster than the current gonum axpy assembly implementation. Even the unrolled version with bounds checks still present ended up faster than the current assembly version.
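For reference, the pointer-walking, four-way unrolled shape that won on amd64 looks roughly like this; this is my reconstruction of the idea using unsafe.Add (Go 1.17+) and unsafe.SliceData (Go 1.20+), not the exact AxpyPointerR4 code from the repo:

```go
package main

import (
	"fmt"
	"unsafe"
)

// axpyPointerR4 computes y[i] += alpha * x[i] by walking raw
// pointers, four elements per iteration, so no bounds checks run
// inside the hot loop. The single length check up front keeps the
// accesses in range.
func axpyPointerR4(alpha float32, x, y []float32) {
	n := len(y)
	if len(x) < n {
		panic("x is too small")
	}
	const sz = unsafe.Sizeof(float32(0))
	xp := unsafe.Pointer(unsafe.SliceData(x))
	yp := unsafe.Pointer(unsafe.SliceData(y))
	i := 0
	for ; i+4 <= n; i += 4 {
		*(*float32)(yp) += alpha * *(*float32)(xp)
		*(*float32)(unsafe.Add(yp, 1*sz)) += alpha * *(*float32)(unsafe.Add(xp, 1*sz))
		*(*float32)(unsafe.Add(yp, 2*sz)) += alpha * *(*float32)(unsafe.Add(xp, 2*sz))
		*(*float32)(unsafe.Add(yp, 3*sz)) += alpha * *(*float32)(unsafe.Add(xp, 3*sz))
		xp, yp = unsafe.Add(xp, 4*sz), unsafe.Add(yp, 4*sz)
	}
	for ; i < n; i++ { // remaining tail elements
		*(*float32)(yp) += alpha * *(*float32)(xp)
		xp, yp = unsafe.Add(xp, sz), unsafe.Add(yp, sz)
	}
}

func main() {
	x := []float32{1, 2, 3, 4, 5}
	y := []float32{10, 10, 10, 10, 10}
	axpyPointerR4(2, x, y)
	fmt.Println(y) // [12 14 16 18 20]
}
```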

Of course, writing such unrolled and optimized versions by hand would be rather error-prone and annoying. I realized that it should be easy to generate the code, as long as we constrain ourselves to a subset of basic operations.

Initially I was thinking of using regular Go code as the base implementation and then running an "unroller and optimizer" on it; however, that seemed difficult to work with.

I finally figured out a way https://github.com/egonelbre/exp/blob/vec/vector/generate/example.go#L37 to write rather concise definitions of such functions:

ctx.Func("ScalIncUnitaryTo(dst []$Type, incdst uintptr, alpha $Type, xs []$Type, n, incx uintptr)",
	Iterate{
		Ranges: []It{
			Range("i", 0, "n"),
			Slice("dst", 0, "incdst"),
			Slice("xs", 0, "incx"),
		},
		For: "$dst = alpha * $xs",
	})
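To make the substitution step concrete, here is a toy version of the placeholder expansion that the definition above relies on, where $dst and $xs expand to indexed accesses. The real generator in the linked repo does considerably more (unrolling, counters, bounds checks); names here are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// expand replaces every $name placeholder in the loop body with an
// indexed access scaled by that slice's increment variable.
func expand(body string, incs map[string]string) string {
	for name, inc := range incs {
		body = strings.ReplaceAll(body, "$"+name, fmt.Sprintf("%s[i*%s]", name, inc))
	}
	return body
}

func main() {
	fmt.Println(expand("$dst = alpha * $xs", map[string]string{
		"dst": "incdst",
		"xs":  "incx",
	}))
	// dst[i*incdst] = alpha * xs[i*incx]
}
```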

The generator is mostly hacked together https://github.com/egonelbre/exp/blob/vec/vector/generate/naive.go, but nothing extreme. It currently implements four configurations:

// Unroll: 0, Counter: false

func ScalIncUnitaryTo(dst []float32, incdst uintptr, alpha float32, xs []float32, n, incx uintptr) {
        if len(dst) < int(n*incdst) {
                panic("dst is too small")
        }
        if len(xs) < int(n*incx) {
                panic("xs is too small")
        }
        for i := uintptr(0); i < n; i++ {
                dst[i*incdst] = alpha * xs[i*incx]
        }
}

// Unroll: 0, Counter: true

func ScalIncUnitaryTo(dst []float32, incdst uintptr, alpha float32, xs []float32, n, incx uintptr) {
        if len(dst) < int(n*incdst) {
                panic("dst is too small")
        }
        if len(xs) < int(n*incx) {
                panic("xs is too small")
        }
        idst := uintptr(0)
        ixs := uintptr(0)
        for i := uintptr(0); i < n; i++ {
                dst[idst] = alpha * xs[ixs]
                idst += incdst
                ixs += incx
        }
}

// Unroll: 4, Counter: false

func ScalIncUnitaryTo(dst []float32, incdst uintptr, alpha float32, xs []float32, n, incx uintptr) {
        if len(dst) < int(n*incdst) {
                panic("dst is too small")
        }
        if len(xs) < int(n*incx) {
                panic("xs is too small")
        }
        i := uintptr(0)
        n_unroll := n - n%4
        for ; i < n_unroll; i += 4 {
                dst[i*incdst] = alpha * xs[i*incx]
                dst[(i+1)*incdst] = alpha * xs[(i+1)*incx]
                dst[(i+2)*incdst] = alpha * xs[(i+2)*incx]
                dst[(i+3)*incdst] = alpha * xs[(i+3)*incx]
        }
        for ; i < n; i++ {
                dst[i*incdst] = alpha * xs[i*incx]
        }
}

// Unroll: 4, Counter: true

func ScalIncUnitaryTo(dst []float32, incdst uintptr, alpha float32, xs []float32, n, incx uintptr) {
        if len(dst) < int(n*incdst) {
                panic("dst is too small")
        }
        if len(xs) < int(n*incx) {
                panic("xs is too small")
        }
        i := uintptr(0)
        n_unroll := n - n%4
        idst := uintptr(0)
        ixs := uintptr(0)
        for ; i < n_unroll; i += 4 {
                dst[idst] = alpha * xs[ixs]
                dst[idst+incdst] = alpha * xs[ixs+incx]
                dst[idst+2*incdst] = alpha * xs[ixs+2*incx]
                dst[idst+3*incdst] = alpha * xs[ixs+3*incx]
                idst += incdst * 4
                ixs += incx * 4
        }
        for ; i < n; i++ {
                dst[idst] = alpha * xs[ixs]
                idst += incdst
                ixs += incx
        }
}

The slice accesses can be replaced with an appropriate at implementation. It would even be possible to generate the same code for all implementations, but have the at operation implemented differently for different targets (arm, amd64, safe).
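As a sketch of what that could look like: two interchangeable at helpers sharing one signature, where generated loop bodies call at and the build picks a variant (e.g. via build tags per target). Names and layout here are my assumptions, not gonum code:

```go
package main

import (
	"fmt"
	"unsafe"
)

// atSafe uses a plain index expression, so the runtime bounds
// check stays in place.
func atSafe(s []float32, i uintptr) *float32 { return &s[i] }

// atUnsafe does raw pointer arithmetic with no bounds check; the
// caller must have validated the index range up front. In a real
// layout the two variants would live in separate files selected by
// build tags, with the generated loop body shared between them.
func atUnsafe(s []float32, i uintptr) *float32 {
	return (*float32)(unsafe.Add(unsafe.Pointer(unsafe.SliceData(s)), i*unsafe.Sizeof(float32(0))))
}

func main() {
	s := []float32{1, 2, 3}
	*atSafe(s, 1) = 20
	*atUnsafe(s, 2) = 30
	fmt.Println(s) // [1 20 30]
}
```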

I'm certain there are a bunch of bugs present in the proof-of-concept, but I decided to show it as is, because it should be sufficient to get the ideas across.

Potential impact of proposal

  • Better or equal performance for everything that uses internal/asm, without using assembly.
  • More easily inspectable and maintainable code.
  • Less work to implement new features that require optimized loops across multiple slices.

@egonelbre (Contributor, Author)

I realized that it should be possible to extend the same idea to SSE, AVX, and NEON implementations as well, e.g.


ctx.Func("ScalIncUnitaryTo(dst []$Type, incdst uintptr, alpha $Type, xs []$Type, n, incx uintptr)",
	Iterate{
		Ranges: []It{
			Range("i", 0, "n"),
			Slice("dst", 0, "incdst"),
			Slice("xs", 0, "incx"),
		},
		Go: "$dst = alpha * $xs",
		Neon: ctx.WhenType( // ignore the correctness
			"float32", "VMUL.F32 $dst, @alpha, $xs",
			"float64", "VMUL.F64 $dst, @alpha, $xs",
		),
	})

Also, it feels like someone has already done this, somewhere.

@kortschak (Member)

Sorry for the slow response. Local benchmarks below (the first from a relatively old i7, the second from an RPi 3B+). I think the first take-home is the degree of behavioural variation between architectures, even within the same broad families. On the i7, the assembly sits in the middle of the pack, and on the RPi it is no better than the mode.

In general, though, I think this is a good idea; if we can get similar or better performance without assembly, that is a win. At the moment we have limited active contributor expertise with assembly, so new features are unlikely to be added, and bugs (should they be found) would be difficult to address.

goos: linux
goarch: amd64
pkg: github.com/egonelbre/exp/vector/compare
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
                              │   cmp.bench   │
                              │    sec/op     │
AxpyAssembly-8                  205.5µ ±  17%
AxpyBasic-8                     251.6µ ±  15%
AxpyUnsafe-8                    168.9µ ±   3%
AxpyUnsafeInline-8              195.2µ ±   9%
AxpyPointer-8                   185.9µ ±   3%
AxpyPointerLoop-8               186.9µ ±  23%
AxpyBasicR4-8                   230.6µ ±  10%
AxpyUnsafeR4-8                  192.5µ ±   4%
AxpyUnsafeInlineR4-8            186.1µ ± 117%
AxpyUnsafeInlineR8-8            205.2µ ±  22%
AxpyPointerR4-8                 217.0µ ±   7%
AxpyPointerLoopR4-8             214.4µ ±   2%
AxpyPointerLoopInterleaveR4-8   201.9µ ±  21%
AxpyPointerR4Alt-8              198.4µ ±   2%
geomean                         201.9µ

goos: linux
goarch: arm64
pkg: github.com/egonelbre/exp/vector/compare
                              │  cmp.bench  │
                              │   sec/op    │
AxpyBasic-4                     5.144m ± 1%
AxpyUnsafe-4                    5.168m ± 1%
AxpyUnsafeInline-4              5.605m ± 3%
AxpyPointer-4                   4.335m ± 2%
AxpyPointerLoop-4               4.334m ± 2%
AxpyBasicR4-4                   5.121m ± 1%
AxpyUnsafeR4-4                  5.145m ± 1%
AxpyUnsafeInlineR4-4            5.166m ± 0%
AxpyUnsafeInlineR8-4            5.185m ± 1%
AxpyPointerR4-4                 5.135m ± 1%
AxpyPointerLoopR4-4             5.115m ± 1%
AxpyPointerLoopInterleaveR4-4   5.233m ± 1%
AxpyPointerR4Alt-4              5.126m ± 0%
geomean                         5.051m

@egonelbre (Contributor, Author)

So I think this needs an additional comparison against assembly that ensures loop alignment. The results look somewhat odd to me, and loop alignment might be the cause. However, I'm not sure how we could force loop alignment from Go code.
