arm64,purego: field arithmetic mul for arm64 and cleanup build tags #257

gbotrel · 2022-11-04T17:19:40Z

This PR introduces the following changes:

- arm64 performance boost for field arithmetic multiplication. No assembly: the generator emits pure Go code that closely matches what we would hand-write.
- Removes the amd64_adx build tag and the duplicate .s assembly files generated for the amd64 target.
- Adds a purego build tag which, following the convention of other projects, disables assembly when provided.
- Reduces the number of CIOS multiplication variants generated, to minimize codebase size. See the full strategy below.

Benchmarks (`arm64`)

These were run on an M1 chip; AWS Graviton3 shows a roughly similar performance improvement (AWS Graviton2 does not). Some mobile devices may also benefit from these improvements, particularly from the dedicated Square algorithm (not generated by default, to keep the codebase simple).

bn254 (4-word modulus)       Mul  18.3ns ± 0%  17.3ns ± 0%   -5.50%  (-21% on AWS Graviton3)
bls12-381 (6-word modulus)   Mul  35.3ns ± 0%  29.3ns ± 0%  -17.00%  (-22% on AWS Graviton3)
bw6-761 (12-word modulus)    Mul   199ns ± 0%   114ns ± 0%  -42.74%  (-43% on AWS Graviton3)

Mul generation details

// There are a couple of variations of the multiplication (and squaring) algorithms.
//
// All versions are derived from the Montgomery CIOS algorithm: see
// section 2.3.2 of Tolga Acar's thesis
// https://www.microsoft.com/en-us/research/wp-content/uploads/1998/06/97Acar.pdf
//
// For a 1-word modulus, the generator will call mul_cios_one_limb (standard REDC).
//
// For a modulus of 13 words or more, the generator will output unoptimized textbook CIOS code, in plain Go.
//
// For all other moduli, we look at the available bits in the last limb.
// If there are none (like secp256k1), we generate unoptimized textbook CIOS code, in plain Go, for all architectures.
// If there is at least one, we can omit a carry propagation in the CIOS algorithm.
// If there are at least two, we can use the same technique for the CIOS Squaring.
// See appendix in https://eprint.iacr.org/2022/1400.pdf for the exact condition.
//
// In practice, we have 3 different targets in mind: x86 (amd64), arm64 and wasm.
//
// For amd64, we can leverage (when available) the BMI2 and ADX instructions to run 2 carry chains in parallel.
// This makes the use of assembly worth it, as it results in a significant perf improvement; most CPUs since 2016 support these
// instructions, and we assume it to be the "default path"; if the CPU has no support, we fall back to a slow, unoptimized version.
//
// On amd64, the Squaring algorithm always calls the Multiplication (assembly) implementation.
//
// For arm64, we unroll the loops in the CIOS (+nocarry optimization) algorithm, such that the instructions generated
// by the Go compiler closely match what we would hand-write. Hence, there is no assembly needed for the arm64 target.
//
// Additionally, if 2 or more bits are available in the last limb, we have a template to generate a dedicated Squaring algorithm.
// This is not activated by default, to minimize the codebase size.
// On M1 and AWS Graviton3 it results in a 5-10% speedup. On some mobile devices, the observed speedup was larger (~20%).
//
// The same (arm64) unrolled Go code produces satisfactory performance for WASM (compiled using TinyGo).
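
The decision order described in the comment above can be sketched as a small Go function. This is an illustration only, not the generator's code: `chooseMul`, `limbsAndSpareBits` and `pow2Plus` are hypothetical helpers, the returned strings (other than `mul_cios_one_limb`) are labels made up for the sketch, and "at least one spare bit" is the simplified form of the condition (the exact one is in the appendix referenced above).

```go
package main

import (
	"fmt"
	"math/big"
)

// limbsAndSpareBits returns the number of 64-bit words needed to store q and
// the number of unused high bits in the most significant word.
func limbsAndSpareBits(q *big.Int) (limbs, spare int) {
	limbs = (q.BitLen() + 63) / 64
	spare = limbs*64 - q.BitLen()
	return limbs, spare
}

// chooseMul is a hypothetical helper following the decision order of the comment.
// Apart from mul_cios_one_limb, the returned strings are labels made up for this sketch.
func chooseMul(q *big.Int) string {
	limbs, spare := limbsAndSpareBits(q)
	switch {
	case limbs == 1:
		return "mul_cios_one_limb"
	case limbs >= 13 || spare == 0:
		return "textbook CIOS"
	case spare == 1:
		return "CIOS + nocarry"
	default: // two or more spare bits
		return "CIOS + nocarry, dedicated squaring possible"
	}
}

// pow2Plus returns 2^n + k; it is only used to build moduli of a known bit length.
func pow2Plus(n uint, k int64) *big.Int {
	p := new(big.Int).Lsh(big.NewInt(1), n)
	return p.Add(p, big.NewInt(k))
}

func main() {
	fmt.Println(chooseMul(pow2Plus(256, -(1<<32)-977))) // secp256k1: 256 bits, no spare bit
	fmt.Println(chooseMul(pow2Plus(255, -19)))          // 2^255 - 19: one spare bit
	fmt.Println(chooseMul(pow2Plus(253, 1)))            // stand-in 254-bit value: two spare bits
	fmt.Println(chooseMul(pow2Plus(63, 1)))             // stand-in 64-bit value: one limb
}
```

The stand-in values are not necessarily prime; only their bit length matters for the choice.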

internal/field/internal/templates/element/mul_cios.go
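
For reference, a minimal sketch of the textbook CIOS loop (section 2.3.2 of Acar's thesis) for a 4-limb modulus, written with `math/bits`. It is not the template's output: the names `element`, `madd`, `negInv64`, `montMul`, `agrees` and `secp256k1` are assumptions for this example, and the loops are left rolled for readability. The result is checked against a `math/big` computation of `a*b*R^{-1} mod q`.

```go
package main

import (
	"fmt"
	"math/big"
	"math/bits"
)

const limbs = 4 // 64-bit words per element (a 256-bit modulus)

// element stores a field element, least significant word first.
type element [limbs]uint64

// madd returns (hi, lo) with hi*2^64 + lo = a*b + c + d; the sum always fits in 128 bits.
func madd(a, b, c, d uint64) (hi, lo uint64) {
	var carry uint64
	hi, lo = bits.Mul64(a, b)
	lo, carry = bits.Add64(lo, c, 0)
	hi += carry
	lo, carry = bits.Add64(lo, d, 0)
	hi += carry
	return hi, lo
}

// negInv64 returns -q0^{-1} mod 2^64 for odd q0 (Newton iteration).
func negInv64(q0 uint64) uint64 {
	inv := q0 // q0*q0 = 1 mod 8, so 3 bits are already correct
	for i := 0; i < 5; i++ {
		inv *= 2 - q0*inv // doubles the number of correct bits
	}
	return -inv
}

// montMul returns a*b*R^{-1} mod q, R = 2^(64*limbs), for odd q and a, b < q.
func montMul(a, b, q element) element {
	qInvNeg := negInv64(q[0])
	var t [limbs + 2]uint64
	for i := 0; i < limbs; i++ {
		// multiplication step: t += a * b[i]
		var c uint64
		for j := 0; j < limbs; j++ {
			c, t[j] = madd(a[j], b[i], t[j], c)
		}
		t[limbs], c = bits.Add64(t[limbs], c, 0)
		t[limbs+1] = c
		// reduction step: t = (t + m*q) / 2^64, with m chosen so the low word cancels
		m := t[0] * qInvNeg
		c, _ = madd(m, q[0], t[0], 0)
		for j := 1; j < limbs; j++ {
			c, t[j-1] = madd(m, q[j], t[j], c)
		}
		t[limbs-1], c = bits.Add64(t[limbs], c, 0)
		t[limbs] = t[limbs+1] + c
	}
	// final conditional subtraction: the loop leaves t < 2q
	var r element
	var borrow uint64
	for j := 0; j < limbs; j++ {
		r[j], borrow = bits.Sub64(t[j], q[j], borrow)
	}
	if t[limbs] == 0 && borrow == 1 {
		copy(r[:], t[:limbs]) // t was already below q
	}
	return r
}

// toBig converts an element to a big.Int.
func toBig(e element) *big.Int {
	x := new(big.Int)
	for i := limbs - 1; i >= 0; i-- {
		x.Lsh(x, 64)
		x.Add(x, new(big.Int).SetUint64(e[i]))
	}
	return x
}

// agrees checks montMul against a math/big reference computation.
func agrees(a, b, q element) bool {
	bq := toBig(q)
	r := new(big.Int).Lsh(big.NewInt(1), 64*limbs)
	want := new(big.Int).Mul(toBig(a), toBig(b))
	want.Mul(want, new(big.Int).ModInverse(r, bq)).Mod(want, bq)
	return toBig(montMul(a, b, q)).Cmp(want) == 0
}

// secp256k1 returns 2^256 - 2^32 - 977, a modulus with no spare bit in its last limb.
func secp256k1() element {
	return element{0xfffffffefffffc2f, ^uint64(0), ^uint64(0), ^uint64(0)}
}

func main() {
	a := element{1, 2, 3, 4}
	b := element{5, 6, 7, 1 << 63}
	fmt.Println("matches math/big:", agrees(a, b, secp256k1()))
}
```

The two extra words `t[limbs]` and `t[limbs+1]` are what the nocarry optimization removes when the modulus leaves spare bits in its last limb.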

internal/field/internal/templates/element/mul_nocarry.go
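
A hedged sketch of the nocarry variant follows, assuming a 4-limb modulus with at least one spare bit in its last limb (here 2^255 - 19, chosen only as a well-known example). The multiplication and reduction loops are merged and the two extra words of the textbook version disappear, because the running value stays below 2q, which fits in 4 words. The names `element`, `madd`, `negInv64`, `montMulNoCarry`, `agrees` and `p25519` are assumptions for this example; the generated arm64 code unrolls these loops, which this sketch does not.

```go
package main

import (
	"fmt"
	"math/big"
	"math/bits"
)

const limbs = 4 // 64-bit words per element

// element stores a field element, least significant word first.
type element [limbs]uint64

// madd returns (hi, lo) with hi*2^64 + lo = a*b + c + d; the sum always fits in 128 bits.
func madd(a, b, c, d uint64) (hi, lo uint64) {
	var carry uint64
	hi, lo = bits.Mul64(a, b)
	lo, carry = bits.Add64(lo, c, 0)
	hi += carry
	lo, carry = bits.Add64(lo, d, 0)
	hi += carry
	return hi, lo
}

// negInv64 returns -q0^{-1} mod 2^64 for odd q0 (Newton iteration).
func negInv64(q0 uint64) uint64 {
	inv := q0
	for i := 0; i < 5; i++ {
		inv *= 2 - q0*inv
	}
	return -inv
}

// montMulNoCarry returns x*y*R^{-1} mod q, R = 2^(64*limbs), for x, y < q.
// It assumes q is odd and its most significant bit (bit 64*limbs-1) is zero.
func montMulNoCarry(x, y, q element) element {
	qInvNeg := negInv64(q[0])
	var t element
	for i := 0; i < limbs; i++ {
		a, t0 := madd(x[0], y[i], t[0], 0) // carry chain of t + x*y[i]
		m := t0 * qInvNeg
		c, _ := madd(m, q[0], t0, 0) // carry chain of the reduction
		for j := 1; j < limbs; j++ {
			var tj uint64
			a, tj = madd(x[j], y[i], t[j], a)
			c, t[j-1] = madd(m, q[j], tj, c)
		}
		t[limbs-1] = c + a // cannot overflow: the new t is below 2q < 2^(64*limbs)
	}
	// final conditional subtraction
	var r element
	var borrow uint64
	for j := 0; j < limbs; j++ {
		r[j], borrow = bits.Sub64(t[j], q[j], borrow)
	}
	if borrow == 1 {
		return t // t was already below q
	}
	return r
}

// toBig converts an element to a big.Int.
func toBig(e element) *big.Int {
	x := new(big.Int)
	for i := limbs - 1; i >= 0; i-- {
		x.Lsh(x, 64)
		x.Add(x, new(big.Int).SetUint64(e[i]))
	}
	return x
}

// agrees checks montMulNoCarry against a math/big reference computation.
func agrees(x, y, q element) bool {
	bq := toBig(q)
	r := new(big.Int).Lsh(big.NewInt(1), 64*limbs)
	want := new(big.Int).Mul(toBig(x), toBig(y))
	want.Mul(want, new(big.Int).ModInverse(r, bq)).Mod(want, bq)
	return toBig(montMulNoCarry(x, y, q)).Cmp(want) == 0
}

// p25519 returns 2^255 - 19, a modulus with one spare bit in its last limb.
func p25519() element {
	return element{0xffffffffffffffed, ^uint64(0), ^uint64(0), 1<<63 - 1}
}

func main() {
	x := element{1, 2, 3, 4}
	y := element{5, 6, 7, 8}
	fmt.Println("matches math/big:", agrees(x, y, p25519()))
}
```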

Tabaie · 2022-11-14T22:08:37Z

ecc/bls12-377/fp/element_mul_amd64.s

@@ -1,4 +1,4 @@
-// +build !amd64_adx
+// +build !purego


I'm confused by amd64_adx being replaced with purego. I thought that when ADX was available we definitely used assembly and hence it wouldn't be pure Go.

Is what's happening the following:

Previously we had two sets of x64 assembly, one for when ADX is available and one for when it is not. It turns out assembly is not quite worth it unless ADX is available, hence the purego flag is set unless ADX is available.

purego is a wider Go community convention for avoiding the use of assembly entirely.

The previous amd64_adx tag removed the instructions that check for the presence of ADX at run time, which can result in a minor 2-4% speedup.
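
As a minimal illustration of the convention (not a file from this repository), a Go source file guarded by `!purego` is compiled only when the tag is absent. The constant name `mulBackend` is hypothetical.

```go
//go:build !purego

// This file is part of the build only when the purego tag is NOT set.
package main

import "fmt"

// mulBackend is a hypothetical constant naming the implementation selected by this file.
const mulBackend = "assembly"

func main() {
	fmt.Println("field multiplication backend:", mulBackend)
}
```

Building with `go build -tags purego` excludes this file; a sibling file guarded by `//go:build purego` (not shown) would then provide the pure Go definition. The `// +build !purego` line in the diff above is the older syntax for the same constraint; `//go:build` is its Go 1.17+ form.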

ecc/bls12-377/fp/element_ops_amd64.go

gbotrel added 9 commits November 4, 2022 10:24

feat: remove special amd64_adx build tag path

8bbb405

feat: get rid of adx build tag for e2

b269bd5

docs: add generator mul decision strategy docs

13abd32

feat: generate unoptimized CIOS as default fallback

cfc8a7c

feat: one more step towards purego vs amd64

ff56f6d

feat: added purego build tag for field arithmetic

0e3e0ee

feat: replaced x86 optimized purego code with arm64 optimized

dc3a82e

docs: generate mul doc for field arithmetic

e118269

docs: move generator mul doc to template

b3b45d9

gbotrel added the cleanup (suggestion to clean up the code) and perf labels Nov 4, 2022

gbotrel added this to the v0.9.0 milestone Nov 4, 2022

gbotrel requested review from yelhousni and Tabaie November 4, 2022 17:19

build: ensure we run test with purego build tag

16b368c

Tabaie approved these changes Nov 14, 2022


style: update typo in comments

4a0d51d

gbotrel merged commit 8a889f4 into develop Nov 17, 2022

gbotrel deleted the perf/clean/mul branch November 17, 2022 21:44
