arm64,purego: field arithmetic mul for arm64 and cleanup build tags #257

Merged
merged 11 commits into from
Nov 17, 2022

Conversation

gbotrel
Collaborator

@gbotrel gbotrel commented Nov 4, 2022

This PR introduces the following changes:

  • arm64 performance boost for field arithmetic multiplication. No assembly: we generate pure Go code that closely matches what we would hand-write;
  • removes the amd64_adx build tag and the duplicate .s assembly files generated for the amd64 target;
  • adds a purego build tag which, following the convention of other projects, disables assembly when provided;
  • simplifies the variations of the CIOS multiplication algorithm that are generated, to minimize codebase size. See the full strategy below.

Benchmarks (arm64)

These were run on an M1 chip, and give similar performance improvements on AWS Graviton3 (but not on AWS Graviton2!). Some mobile devices may also benefit from these improvements, particularly from the Square algorithm (not generated by default, to simplify the codebase).

curve (modulus size)         op   old           new           delta     delta on AWS Graviton3
bn254 (4-word modulus)       Mul  18.3ns ± 0%   17.3ns ± 0%   -5.50%    -21%
bls12-381 (6-word modulus)   Mul  35.3ns ± 0%   29.3ns ± 0%   -17.00%   -22%
bw6-761 (12-word modulus)    Mul  199ns ± 0%    114ns ± 0%    -42.74%   -43%

Mul generation details

// There are a couple of variations of the multiplication (and squaring) algorithms.
//
// All versions are derived from the Montgomery CIOS algorithm: see
// section 2.3.2 of Tolga Acar's thesis
// https://www.microsoft.com/en-us/research/wp-content/uploads/1998/06/97Acar.pdf
//
// For a 1-word modulus, the generator calls mul_cios_one_limb (standard REDC).
//
// For moduli of 13 words or more, the generator outputs unoptimized textbook CIOS code, in plain Go.
//
// For all other moduli, we look at the available bits in the last limb.
// If there are none (like secp256k1) we generate unoptimized textbook CIOS code, in plain Go, for all architectures.
// If there is at least one we can omit a carry propagation in the CIOS algorithm.
// If there are at least two we can use the same technique for the CIOS Squaring.
// See appendix in https://eprint.iacr.org/2022/1400.pdf for the exact condition.
//
// In practice, we have 3 different targets in mind: x86 (amd64), arm64 and wasm.
//
// For amd64, we can leverage (when available) the BMI2 and ADX instructions to run 2 carry chains in parallel.
// This makes the use of assembly worth it, as it results in a significant perf improvement; most CPUs since 2016 support these
// instructions, and we assume this to be the "default path"; if the CPU has no support, we fall back to a slow, unoptimized version.
//
// On amd64, the Squaring algorithm always calls the Multiplication (assembly) implementation.
//
// For arm64, we unroll the loops in the CIOS (+nocarry optimization) algorithm, such that the instructions generated
// by the Go compiler closely match what we would hand-write. Hence, no assembly is needed for the arm64 target.
//
// Additionally, if 2 or more bits are available in the last limb, we have a template to generate a dedicated Squaring algorithm.
// This is not activated by default, to minimize the codebase size.
// On M1 and AWS Graviton3 it results in a 5-10% speedup. On some mobile devices, the observed speedup was larger (~20%).
//
// The same (arm64) unrolled Go code produces satisfying performance for WASM (compiled using TinyGo).

@gbotrel gbotrel added the cleanup and perf labels Nov 4, 2022
@gbotrel gbotrel added this to the v0.9.0 milestone Nov 4, 2022
internal/field/internal/templates/element/mul_cios.go (outdated, resolved)
@@ -1,4 +1,4 @@
-// +build !amd64_adx
+// +build !purego
Contributor

I'm confused by amd64_adx being replaced with purego. I thought that when ADX was available we definitely used assembly and hence it wouldn't be pure Go.

Contributor

Is what's happening the following:

Previously we had two sets of x64 assembly, one for when ADX is available and one when not. Turns out assembly is not quite worth it unless ADX is available hence the purego flag is set unless ADX is available.

Collaborator Author

purego is a wider Go community convention for avoiding the use of assembly altogether.

The previous amd64_adx tag removed the instructions that check for the presence of ADX at run time, which can result in a minor 2-4% speedup.
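
To make the distinction concrete, here is a rough sketch of the two mechanisms, under loud assumptions: `supportAdx`, `mulADX`, `mulGeneric` and the toy single-word modulus are illustrative stand-ins, not code taken from this PR. The build tag decides at compile time whether the assembly-backed file is included at all; the run-time flag decides, within that file, which path is taken.

```go
// In a real package, a file like this one would be guarded by a build
// constraint excluding the purego tag, so that `go build -tags purego`
// leaves it out and only the plain Go implementation is compiled.
package main

import (
	"fmt"
	"math/bits"
)

// q is a toy single-word modulus (2^61 - 1), used only to keep the sketch short.
const q uint64 = 1<<61 - 1

// supportAdx stands in for a run-time CPU feature check (ADX and BMI2).
// It is hard-coded here; a real build would query the CPU once at start-up.
var supportAdx = false

// mulGeneric is the portable path: plain Go, usable on every architecture.
func mulGeneric(x, y uint64) uint64 {
	hi, lo := bits.Mul64(x, y)
	return bits.Rem64(hi, lo, q)
}

// mulADX is a placeholder for the assembly routine; it simply reuses the
// generic code so that the sketch stays runnable everywhere.
func mulADX(x, y uint64) uint64 { return mulGeneric(x, y) }

// mul dispatches at run time. The removed amd64_adx tag amounted to
// compiling this branch away and always taking the ADX path.
func mul(x, y uint64) uint64 {
	if supportAdx {
		return mulADX(x, y)
	}
	return mulGeneric(x, y)
}

func main() {
	fmt.Println(mul(3, 4), mul(q-1, q-1))
}
```

With a single run-time check, one set of assembly files serves all amd64 CPUs, at the cost of the branch mentioned above.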

ecc/bls12-377/fp/element_ops_amd64.go (outdated, resolved)
@gbotrel gbotrel merged commit 8a889f4 into develop Nov 17, 2022
@gbotrel gbotrel deleted the perf/clean/mul branch November 17, 2022 21:44