Benchmark array-ish structures #2926

dapplion · 2021-08-08T21:42:34Z

Motivation

To properly optimize our beacon state transition performance and memory usage, we need to understand the tradeoffs of our different approaches.

Description

Add informational tests (not run in CI) with hardcoded results.
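As a rough illustration of what such an informational benchmark could look like, here is a minimal sketch that times iteration over a plain array, a Set, and a Map. The helper `msPerOp` and all variable names are hypothetical and not taken from this PR; `Date.now()` is coarse, so the numbers are only indicative and will vary by machine.

```typescript
// Hypothetical helper (not from the PR): average wall-clock milliseconds per call.
function msPerOp(fn: () => void, runs: number): number {
  const start = Date.now();
  for (let i = 0; i < runs; i++) {
    fn();
  }
  return (Date.now() - start) / runs;
}

// Build three containers holding the same numbers 0..elementCount-1.
const elementCount = 250000;
const plainArray: number[] = [];
const numberSet = new Set<number>();
const numberMap = new Map<number, number>();
for (let i = 0; i < elementCount; i++) {
  plainArray.push(i);
  numberSet.add(i);
  numberMap.set(i, i);
}

// Each function visits every element once and returns the sum,
// so the loop bodies do comparable work.
function sumArray(): number {
  let sum = 0;
  for (let i = 0; i < plainArray.length; i++) sum += plainArray[i];
  return sum;
}

function sumSet(): number {
  let sum = 0;
  numberSet.forEach((value) => {
    sum += value;
  });
  return sum;
}

function sumMap(): number {
  let sum = 0;
  numberMap.forEach((value) => {
    sum += value;
  });
  return sum;
}

// Informational only: print the timings instead of asserting on them.
const timings = {
  array: msPerOp(sumArray, 20),
  set: msPerOp(sumSet, 20),
  map: msPerOp(sumMap, 20),
};
console.log(timings);
```

Printing rather than asserting keeps a test like this out of the pass/fail path, which fits a benchmark that is not run in CI.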

Notable results:

Iterating an array is 10x faster than iterating a MutableVector
Iterating a MutableVector is 100x faster than iterating a Tree
Regular JS arrays of numbers take 8 bytes per element
A MutableVector of numbers takes 15 bytes per element
Cloning a MutableVector has a fixed cost of ~1000 bytes. Even with that initial cost, cloning a MutableVector to mutate a few elements is very memory efficient
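One general reason clone-then-mutate can be cheap in a persistent structure is structural sharing: the copy reuses every node it does not touch. The sketch below shows this with a hypothetical two-level `ChunkedVector`. It is not the MutableVector implementation, and its memory costs will not match the figures above; it only illustrates the idea under that assumption.

```typescript
// Hypothetical structure for illustration only; NOT the MutableVector implementation.
const CHUNK_SIZE = 32;

class ChunkedVector {
  private constructor(
    private readonly chunks: ReadonlyArray<ReadonlyArray<number>>,
    readonly length: number
  ) {}

  // Split the input into fixed-size chunks.
  static from(values: number[]): ChunkedVector {
    const chunks: number[][] = [];
    for (let i = 0; i < values.length; i += CHUNK_SIZE) {
      chunks.push(values.slice(i, i + CHUNK_SIZE));
    }
    return new ChunkedVector(chunks, values.length);
  }

  get(index: number): number {
    this.assertInBounds(index);
    return this.chunks[Math.floor(index / CHUNK_SIZE)][index % CHUNK_SIZE];
  }

  // Returns a new vector. Only the chunk table and the one touched chunk
  // are copied; every other chunk is shared with the original.
  set(index: number, value: number): ChunkedVector {
    this.assertInBounds(index);
    const chunkIndex = Math.floor(index / CHUNK_SIZE);
    const newChunk = this.chunks[chunkIndex].slice();
    newChunk[index % CHUNK_SIZE] = value;
    const newChunks = this.chunks.slice();
    newChunks[chunkIndex] = newChunk;
    return new ChunkedVector(newChunks, this.length);
  }

  // Counts chunks that are the very same object in both vectors.
  sharedChunksWith(other: ChunkedVector): number {
    let shared = 0;
    const count = Math.min(this.chunks.length, other.chunks.length);
    for (let i = 0; i < count; i++) {
      if (this.chunks[i] === other.chunks[i]) shared++;
    }
    return shared;
  }

  private assertInBounds(index: number): void {
    if (index < 0 || index >= this.length) {
      throw new RangeError("index " + index + " out of bounds");
    }
  }
}

// Usage: mutate one element of a 1000-element vector.
const seedValues: number[] = [];
for (let i = 0; i < 1000; i++) seedValues.push(i);
const baseVector = ChunkedVector.from(seedValues);
const updatedVector = baseVector.set(5, -1);
console.log(baseVector.get(5), updatedVector.get(5));
console.log(updatedVector.sharedChunksWith(baseVector));
```

In this sketch the outer chunk table is copied on every write, so the per-write cost grows slowly with length. A deeper tree would bound that cost, at the price of slower iteration.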

codeclimate · 2021-08-08T21:43:09Z

Code Climate has analyzed commit 6aacd2f and detected 0 issues on this pull request.

github-actions · 2021-08-08T21:53:42Z

Performance Report

✔️ no performance regression detected

Full benchmark results

Benchmark suite	Current: `13efffb`	Previous: `bea1304`	Ratio
getCommitteeAssignments - req 1000 vs - 250000 vc	9.1524 ms/op	7.7644 ms/op	1.18
epoch altair - 250000 vs - 7PWei - processInactivityUpdates	2.4537 s/op	2.7679 s/op	0.89
epoch altair - 250000 vs - 7PWei - processRewardsAndPenalties	909.44 ms/op	847.30 ms/op	1.07
epoch altair - 250000 vs - 7PWei - processParticipationFlagUpdates	391.59 ms/op	340.11 ms/op	1.15
Process block - 250000 vs - 7PWei - with 0 validator exit	443.40 us/op	479.49 us/op	0.92
Process block - 250000 vs - 7PWei - with 1 validator exit	30.160 ms/op	36.390 ms/op	0.83
Process block - 250000 vs - 7PWei - with 16 validator exits	27.802 ms/op	26.682 ms/op	1.04
epoch phase0 - 250000 vs - 7PWei - prepareEpochProcessState	702.22 ms/op	848.78 ms/op	0.83
epoch phase0 - 250000 vs - 7PWei - processRewardsAndPenalties	437.91 ms/op	572.64 ms/op	0.76
epoch phase0 - 250000 vs - 7PWei - processEffectiveBalanceUpdates	109.40 ms/op	132.96 ms/op	0.82
getAttestationDeltas - 250000 vs - 7PWei	107.34 ms/op	114.20 ms/op	0.94
processSlots - 250000 vs - 7PWei - 32 empty slots	5.1582 s/op	5.3798 s/op	0.96
shuffle list - 16384 els	2.9532 ms/op	1.8159 ms/op	1.63
shuffle list - 250000 els	41.698 ms/op	24.715 ms/op	1.69
getPubkeys - persistent - req 1000 vs - 250000 vc	21.177 us/op	18.185 us/op	1.16
BLS verify - blst-native	2.0754 ms/op	2.0260 ms/op	1.02
BLS verifyMultipleSignatures 3 - blst-native	4.3593 ms/op	4.1901 ms/op	1.04
BLS verifyMultipleSignatures 8 - blst-native	9.4146 ms/op	8.8411 ms/op	1.06
BLS verifyMultipleSignatures 32 - blst-native	33.540 ms/op	36.086 ms/op	0.93
BLS aggregatePubkeys 32 - blst-native	47.899 us/op	45.802 us/op	1.05
BLS aggregatePubkeys 128 - blst-native	175.75 us/op	174.99 us/op	1.00
getAttestationsForBlock	140.13 ms/op	87.886 ms/op	1.59
validate gossip signedAggregateAndProof - struct	5.0982 ms/op	7.4572 ms/op	0.68
validate gossip signedAggregateAndProof - treeBacked	4.9912 ms/op	5.3407 ms/op	0.93
validate gossip attestation - struct	2.3002 ms/op	2.3805 ms/op	0.97
validate gossip attestation - treeBacked	2.4132 ms/op	2.5641 ms/op	0.94

by benchmarkbot/action

twoeths

Thanks for these statistics 👍. As we'll have more and more validators, especially after the Merge, I suppose loop speed will matter more and more, and we want to take a scalable approach.

To deduplicate validator data, I suggest keeping only validator roots in the tree (i.e. validatorRoots: new ListType({elementType: Root, limit: VALIDATOR_REGISTRY_LIMIT})) and still keeping CachedValidatorList, to get the best of both worlds: the hash, the loop, and access to validator properties. I'm not sure how serialize() works for CachedBeaconState, though.
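A minimal sketch of that layout, assuming hypothetical names throughout: `ValidatorStruct`, `toyRoot`, and `ValidatorRegistrySketch` are illustrative stand-ins, not the Lodestar or ssz types. The `roots` array stands in for the tree-backed list of roots, and `toyRoot` is a placeholder digest rather than an SSZ hash tree root.

```typescript
// Hypothetical shape for illustration; not the real Validator type.
interface ValidatorStruct {
  pubkey: string;
  effectiveBalance: number;
  slashed: boolean;
}

// Placeholder digest: a simple polynomial hash over the JSON encoding.
// A real client would compute the SSZ hash tree root here instead.
function toyRoot(validator: ValidatorStruct): number {
  const text = JSON.stringify([
    validator.pubkey,
    validator.effectiveBalance,
    validator.slashed,
  ]);
  let hash = 7;
  for (let i = 0; i < text.length; i++) {
    hash = (hash * 31 + text.charCodeAt(i)) >>> 0;
  }
  return hash;
}

class ValidatorRegistrySketch {
  // Stand-in for the tree-backed list that holds only validator roots.
  private readonly roots: number[] = [];
  // Stand-in for the cached struct list: fast loops and property access.
  private readonly structs: ValidatorStruct[] = [];

  push(validator: ValidatorStruct): void {
    this.structs.push({...validator});
    this.roots.push(toyRoot(validator));
  }

  // Every write goes through here so the root and the struct stay in sync.
  update(index: number, patch: Partial<ValidatorStruct>): void {
    const current = this.structs[index];
    if (current === undefined) {
      throw new RangeError("no validator at index " + index);
    }
    const next: ValidatorStruct = {...current, ...patch};
    this.structs[index] = next;
    this.roots[index] = toyRoot(next);
  }

  get(index: number): ValidatorStruct {
    const validator = this.structs[index];
    if (validator === undefined) {
      throw new RangeError("no validator at index " + index);
    }
    return validator;
  }

  rootAt(index: number): number {
    return this.roots[index];
  }

  // The hot loop reads plain structs and never touches the roots.
  totalEffectiveBalance(): number {
    let total = 0;
    for (let i = 0; i < this.structs.length; i++) {
      total += this.structs[i].effectiveBalance;
    }
    return total;
  }
}
```

The sketch leaves serialization out on purpose: with only roots in the tree, the full validator bytes would have to come from the struct side, which is the open question raised here.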

What do you think, @wemeetagain @dapplion?

wemeetagain · 2021-08-09T16:06:37Z

To deduplicate validator data, I suggest keeping only validator roots in the tree

Definitely agree. I think the only question is how we should go about that.

You mentioned the tradeoff of storing the deserialized validators separately. Done naively, it breaks ssz serialization/deserialization (and proof generation).

Another approach may be to work within the ssz library to support hybrid tree-backed / struct-backed values. This could make it easier to maintain compatibility with the full range of ssz operations. The tradeoff is that it may be harder to customize the implementation or to get the exact performance characteristics we want in lodestar.

dapplion added 5 commits August 8, 2021 21:29

Benchmark array performance

a92a004

Add Tree tests

96187e7

Add x1000 tests

55b6d55

Add memory tests

d999690

Update results

3ad22c9

dapplion requested review from mpetrunic, twoeths and wemeetagain as code owners August 8, 2021 21:42

github-actions bot added the StateTransition label Aug 8, 2021

Add Set and Map

6aacd2f

twoeths approved these changes Aug 9, 2021

dapplion merged commit 2a478d5 into master Aug 9, 2021

dapplion deleted the dapplion/benchmark-arrays branch August 9, 2021 20:13
