By allowing all limbs to be up to 52 bits between operations, which
all our code already permitted, we can make the carry propagation more
parallelizable. This seems to help the compiler-generated code more
than the handwritten asm.

name                     old time/op  new time/op  delta
Add-8                    7.77ns ±19%  6.43ns ± 1%  -17.16%  (p=0.000 n=10+8)
Mul-8                    26.3ns ± 0%  24.6ns ± 1%   -6.32%  (p=0.000 n=9+10)
Mul32-8                  5.86ns ± 1%  5.87ns ± 1%     ~     (p=0.171 n=10+10)
WideMultCall-8           2.54ns ± 0%  2.54ns ± 0%     ~     (p=0.965 n=9+8)
BasepointMul-8           18.6µs ± 1%  18.7µs ± 1%     ~     (p=0.095 n=9+10)
ScalarMul-8              65.6µs ± 3%  63.9µs ± 1%   -2.63%  (p=0.000 n=10+9)
VartimeDoubleBaseMul-8   61.1µs ± 1%  60.7µs ± 2%   -0.73%  (p=0.017 n=10+9)
MultiscalarMulSize8-8     224µs ± 1%   224µs ± 1%     ~     (p=0.182 n=10+9)
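
For illustration, a minimal sketch of the difference between a serial
and a parallel carry pass. It assumes five limbs in radix 2^51 for a
field modulo 2^255-19; the names fe, carrySerial and carryParallel are
hypothetical and are not the identifiers used in this change. In the
serial pass each limb waits for the carry out of the previous one. In
the parallel pass all five carries are read from the inputs first, so
the five updates are independent, at the cost of leaving limbs that
may slightly exceed 51 bits (but stay below 2^52).

```go
package main

import "fmt"

// mask51 selects the low 51 bits of a limb.
const mask51 = (1 << 51) - 1

// fe is a hypothetical field element modulo 2^255-19,
// stored as five limbs in radix 2^51 (illustrative type only).
type fe [5]uint64

// carrySerial propagates carries one limb at a time. Every step depends
// on the result of the previous step, forming one long dependency chain.
func carrySerial(v fe) fe {
	v[1] += v[0] >> 51
	v[0] &= mask51
	v[2] += v[1] >> 51
	v[1] &= mask51
	v[3] += v[2] >> 51
	v[2] &= mask51
	v[4] += v[3] >> 51
	v[3] &= mask51
	// 2^255 = 19 mod 2^255-19, so the top carry wraps into limb 0 times 19.
	v[0] += 19 * (v[4] >> 51)
	v[4] &= mask51
	return v
}

// carryParallel reads all five carries from the input limbs before
// writing anything, so the five additions below do not depend on each
// other. The outputs are not fully reduced: a limb can end up slightly
// above 51 bits, which is fine as long as callers accept 52-bit limbs.
func carryParallel(v fe) fe {
	c0 := v[0] >> 51
	c1 := v[1] >> 51
	c2 := v[2] >> 51
	c3 := v[3] >> 51
	c4 := v[4] >> 51

	v[0] = v[0]&mask51 + 19*c4
	v[1] = v[1]&mask51 + c0
	v[2] = v[2]&mask51 + c1
	v[3] = v[3]&mask51 + c2
	v[4] = v[4]&mask51 + c3
	return v
}

func main() {
	in := fe{1<<51 + 5, 7, 0, 0, 1 << 51}
	fmt.Println(carrySerial(in))
	fmt.Println(carryParallel(in))
}
```

Even with full 64-bit input limbs, each carry is below 2^13 and 19
times a carry is below 2^18, so every output limb of the parallel pass
stays under 2^52.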