Fix archive finalized states db entry key typo #4508

Merged
wemeetagain merged 5 commits into unstable from dapplion/fix-state-pruning on Sep 9, 2022

dapplion
Contributor

@dapplion dapplion commented Sep 3, 2022

Motivation

The archive-finalized-states logic has a typo: it indexes states by epoch instead of slot. As a result, the archiver does not prune states as expected, so the Lodestar DB grows faster than it should.
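
To make the failure mode concrete, here is a minimal sketch of one way an epoch-vs-slot key mismatch can defeat pruning. Everything in it is hypothetical and for illustration only: `ToyStateArchive`, `pruneBetweenEpochs`, and the in-memory `Map` are stand-ins, not Lodestar's actual repository or archiver code. It assumes the mainnet preset of 32 slots per epoch and a prune step that queries keys by slot range.

```typescript
// Illustrative only: all names and the in-memory store are hypothetical.
const SLOTS_PER_EPOCH = 32; // mainnet preset value

function computeStartSlotAtEpoch(epoch: number): number {
  return epoch * SLOTS_PER_EPOCH;
}

// Toy stand-in for an archived-states repository with numeric keys.
class ToyStateArchive {
  private readonly states = new Map<number, string>();

  put(key: number, state: string): void {
    this.states.set(key, state);
  }

  // Return stored keys in [gte, lt), ascending.
  keys(range: {gte: number; lt: number}): number[] {
    return Array.from(this.states.keys())
      .filter((key) => key >= range.gte && key < range.lt)
      .sort((a, b) => a - b);
  }

  batchDelete(keys: number[]): void {
    for (const key of keys) this.states.delete(key);
  }

  get size(): number {
    return this.states.size;
  }
}

// Prune archived states in [fromEpoch, toEpoch). The range is expressed in
// slots, so it only matches entries that were also stored under a slot key.
function pruneBetweenEpochs(db: ToyStateArchive, fromEpoch: number, toEpoch: number): number {
  const keys = db.keys({
    gte: computeStartSlotAtEpoch(fromEpoch),
    lt: computeStartSlotAtEpoch(toEpoch),
  });
  db.batchDelete(keys);
  return keys.length;
}

const finalizedEpochs = [100, 101, 102, 103];

// Buggy write path: entries keyed by epoch.
const keyedByEpoch = new ToyStateArchive();
for (const epoch of finalizedEpochs) keyedByEpoch.put(epoch, `state@${epoch}`);

// Fixed write path: entries keyed by slot.
const keyedBySlot = new ToyStateArchive();
for (const epoch of finalizedEpochs) keyedBySlot.put(computeStartSlotAtEpoch(epoch), `state@${epoch}`);

const prunedFromEpochKeys = pruneBetweenEpochs(keyedByEpoch, 100, 103);
const prunedFromSlotKeys = pruneBetweenEpochs(keyedBySlot, 100, 103);

console.log(prunedFromEpochKeys, keyedByEpoch.size); // 0 4
console.log(prunedFromSlotKeys, keyedBySlot.size); // 3 1
```

With epoch keys, the slot-range query never sees the stored entries, so nothing is deleted and the store keeps growing.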

Description

  • Use Slot to index archiveStates DB repository
  • Add a new step on process start to prune existing states that do not align with configured buckets
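
The startup prune step can be sketched roughly as follows. This is one plausible reading of "states that do not align with configured buckets", not the PR's actual code: it assumes a bucket is a fixed number of epochs and that at most one archived state (the earliest) is kept per bucket. The function name `computeSlotsToPrune` and the 1024-epoch bucket size used in the example are assumptions made for illustration.

```typescript
// Illustrative only: the function name and the bucket rule are assumptions.
const SLOTS_PER_EPOCH = 32; // mainnet preset value

// Given the slots of all archived states, return the slots to delete so that
// each bucket of `bucketEpochs` epochs retains only its earliest state.
function computeSlotsToPrune(storedSlots: number[], bucketEpochs: number): number[] {
  const bucketSlots = bucketEpochs * SLOTS_PER_EPOCH;
  const seenBuckets = new Set<number>();
  const toPrune: number[] = [];

  // Sort a copy so the earliest slot in each bucket is encountered first.
  const ascending = storedSlots.slice().sort((a, b) => a - b);
  for (const slot of ascending) {
    const bucket = Math.floor(slot / bucketSlots);
    if (seenBuckets.has(bucket)) {
      toPrune.push(slot); // bucket already has a state, this one is redundant
    } else {
      seenBuckets.add(bucket); // first state seen for this bucket, keep it
    }
  }
  return toPrune;
}

// Buckets of 1024 epochs = 32768 slots.
const stored = [0, 32, 64, 32768, 32800, 65536];
const slotsToPrune = computeSlotsToPrune(stored, 1024);
console.log(slotsToPrune); // [ 32, 64, 32800 ]
```

Computing the list of slots first, and deleting afterwards, keeps the selection logic a pure function that is easy to unit test independently of the database.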

Opening as draft to deploy and test before merging

Closes #4378
Closes #4417

@github-actions
Contributor

github-actions bot commented Sep 3, 2022

Performance Report

✔️ no performance regression detected

Full benchmark results
Benchmark suite Current: 3b44217 Previous: ac755d9 Ratio
getPubkeys - index2pubkey - req 1000 vs - 250000 vc 1.9907 ms/op 1.8254 ms/op 1.09
getPubkeys - validatorsArr - req 1000 vs - 250000 vc 78.379 us/op 64.583 us/op 1.21
BLS verify - blst-native 2.4027 ms/op 2.1636 ms/op 1.11
BLS verifyMultipleSignatures 3 - blst-native 4.9661 ms/op 4.4772 ms/op 1.11
BLS verifyMultipleSignatures 8 - blst-native 10.598 ms/op 9.6675 ms/op 1.10
BLS verifyMultipleSignatures 32 - blst-native 42.471 ms/op 35.144 ms/op 1.21
BLS aggregatePubkeys 32 - blst-native 56.253 us/op 46.462 us/op 1.21
BLS aggregatePubkeys 128 - blst-native 208.96 us/op 182.29 us/op 1.15
getAttestationsForBlock 175.97 ms/op 151.08 ms/op 1.16
isKnown best case - 1 super set check 469.00 ns/op 480.00 ns/op 0.98
isKnown normal case - 2 super set checks 477.00 ns/op 470.00 ns/op 1.01
isKnown worse case - 16 super set checks 477.00 ns/op 467.00 ns/op 1.02
CheckpointStateCache - add get delete 10.188 us/op 8.9150 us/op 1.14
validate gossip signedAggregateAndProof - struct 5.5448 ms/op 5.0262 ms/op 1.10
validate gossip attestation - struct 2.6363 ms/op 2.3852 ms/op 1.11
pickEth1Vote - no votes 2.3198 ms/op 2.1516 ms/op 1.08
pickEth1Vote - max votes 19.284 ms/op 18.824 ms/op 1.02
pickEth1Vote - Eth1Data hashTreeRoot value x2048 13.048 ms/op 13.265 ms/op 0.98
pickEth1Vote - Eth1Data hashTreeRoot tree x2048 21.323 ms/op 21.634 ms/op 0.99
pickEth1Vote - Eth1Data fastSerialize value x2048 1.7198 ms/op 1.5836 ms/op 1.09
pickEth1Vote - Eth1Data fastSerialize tree x2048 13.126 ms/op 13.093 ms/op 1.00
bytes32 toHexString 1.1450 us/op 976.00 ns/op 1.17
bytes32 Buffer.toString(hex) 755.00 ns/op 790.00 ns/op 0.96
bytes32 Buffer.toString(hex) from Uint8Array 1.0100 us/op 1.0460 us/op 0.97
bytes32 Buffer.toString(hex) + 0x 725.00 ns/op 808.00 ns/op 0.90
Object access 1 prop 0.37200 ns/op 0.39200 ns/op 0.95
Map access 1 prop 0.42700 ns/op 0.31500 ns/op 1.36
Object get x1000 14.225 ns/op 10.900 ns/op 1.31
Map get x1000 0.85500 ns/op 1.0990 ns/op 0.78
Object set x1000 110.66 ns/op 72.958 ns/op 1.52
Map set x1000 75.987 ns/op 48.857 ns/op 1.56
Return object 10000 times 0.38260 ns/op 0.44070 ns/op 0.87
Throw Error 10000 times 7.2026 us/op 5.9723 us/op 1.21
enrSubnets - fastDeserialize 64 bits 2.8630 us/op 2.4940 us/op 1.15
enrSubnets - ssz BitVector 64 bits 815.00 ns/op 716.00 ns/op 1.14
enrSubnets - fastDeserialize 4 bits 406.00 ns/op 360.00 ns/op 1.13
enrSubnets - ssz BitVector 4 bits 788.00 ns/op 738.00 ns/op 1.07
prioritizePeers score -10:0 att 32-0.1 sync 2-0 98.463 us/op 82.147 us/op 1.20
prioritizePeers score 0:0 att 32-0.25 sync 2-0.25 143.25 us/op 115.11 us/op 1.24
prioritizePeers score 0:0 att 32-0.5 sync 2-0.5 252.52 us/op 195.56 us/op 1.29
prioritizePeers score 0:0 att 64-0.75 sync 4-0.75 511.12 us/op 382.31 us/op 1.34
prioritizePeers score 0:0 att 64-1 sync 4-1 586.70 us/op 407.10 us/op 1.44
RateTracker 1000000 limit, 1 obj count per request 186.03 ns/op 182.78 ns/op 1.02
RateTracker 1000000 limit, 2 obj count per request 144.77 ns/op 133.01 ns/op 1.09
RateTracker 1000000 limit, 4 obj count per request 119.22 ns/op 108.43 ns/op 1.10
RateTracker 1000000 limit, 8 obj count per request 105.91 ns/op 98.759 ns/op 1.07
RateTracker with prune 5.1450 us/op 4.1090 us/op 1.25
array of 16000 items push then shift 4.7578 us/op 51.613 us/op 0.09
LinkedList of 16000 items push then shift 18.465 ns/op 12.569 ns/op 1.47
array of 16000 items push then pop 256.89 ns/op 213.45 ns/op 1.20
LinkedList of 16000 items push then pop 17.590 ns/op 12.123 ns/op 1.45
array of 24000 items push then shift 7.3085 us/op 77.349 us/op 0.09
LinkedList of 24000 items push then shift 18.851 ns/op 12.956 ns/op 1.46
array of 24000 items push then pop 221.06 ns/op 196.41 ns/op 1.13
LinkedList of 24000 items push then pop 16.848 ns/op 12.247 ns/op 1.38
intersect bitArray bitLen 8 11.153 ns/op 10.846 ns/op 1.03
intersect array and set length 8 182.48 ns/op 133.58 ns/op 1.37
intersect bitArray bitLen 128 63.556 ns/op 55.666 ns/op 1.14
intersect array and set length 128 2.3307 us/op 1.7903 us/op 1.30
Buffer.concat 32 items 2.1540 ns/op 2.0250 ns/op 1.06
pass gossip attestations to forkchoice per slot 3.2253 ms/op 3.0652 ms/op 1.05
computeDeltas 3.1867 ms/op 3.6153 ms/op 0.88
computeProposerBoostScoreFromBalances 795.45 us/op 803.84 us/op 0.99
altair processAttestation - 250000 vs - 7PWei normalcase 4.2552 ms/op 3.3991 ms/op 1.25
altair processAttestation - 250000 vs - 7PWei worstcase 6.4349 ms/op 5.1740 ms/op 1.24
altair processAttestation - setStatus - 1/6 committees join 222.68 us/op 180.58 us/op 1.23
altair processAttestation - setStatus - 1/3 committees join 430.74 us/op 354.19 us/op 1.22
altair processAttestation - setStatus - 1/2 committees join 626.54 us/op 500.06 us/op 1.25
altair processAttestation - setStatus - 2/3 committees join 834.64 us/op 657.56 us/op 1.27
altair processAttestation - setStatus - 4/5 committees join 1.1260 ms/op 921.26 us/op 1.22
altair processAttestation - setStatus - 100% committees join 1.3619 ms/op 1.1140 ms/op 1.22
altair processBlock - 250000 vs - 7PWei normalcase 29.586 ms/op 24.124 ms/op 1.23
altair processBlock - 250000 vs - 7PWei normalcase hashState 50.469 ms/op 38.249 ms/op 1.32
altair processBlock - 250000 vs - 7PWei worstcase 94.551 ms/op 82.245 ms/op 1.15
altair processBlock - 250000 vs - 7PWei worstcase hashState 117.74 ms/op 111.64 ms/op 1.05
phase0 processBlock - 250000 vs - 7PWei normalcase 4.0586 ms/op 3.2540 ms/op 1.25
phase0 processBlock - 250000 vs - 7PWei worstcase 57.044 ms/op 50.665 ms/op 1.13
altair processEth1Data - 250000 vs - 7PWei normalcase 1.0714 ms/op 690.64 us/op 1.55
Tree 40 250000 create 928.74 ms/op 707.85 ms/op 1.31
Tree 40 250000 get(125000) 290.96 ns/op 264.59 ns/op 1.10
Tree 40 250000 set(125000) 2.9948 us/op 2.2053 us/op 1.36
Tree 40 250000 toArray() 34.858 ms/op 27.183 ms/op 1.28
Tree 40 250000 iterate all - toArray() + loop 33.697 ms/op 27.685 ms/op 1.22
Tree 40 250000 iterate all - get(i) 119.90 ms/op 110.62 ms/op 1.08
MutableVector 250000 create 16.987 ms/op 12.510 ms/op 1.36
MutableVector 250000 get(125000) 13.143 ns/op 10.888 ns/op 1.21
MutableVector 250000 set(125000) 913.67 ns/op 532.57 ns/op 1.72
MutableVector 250000 toArray() 7.4804 ms/op 6.0997 ms/op 1.23
MutableVector 250000 iterate all - toArray() + loop 7.4740 ms/op 6.0809 ms/op 1.23
MutableVector 250000 iterate all - get(i) 3.1774 ms/op 2.7632 ms/op 1.15
Array 250000 create 7.0241 ms/op 5.8296 ms/op 1.20
Array 250000 clone - spread 4.6629 ms/op 3.1366 ms/op 1.49
Array 250000 get(125000) 2.0070 ns/op 1.4370 ns/op 1.40
Array 250000 set(125000) 1.8950 ns/op 1.4720 ns/op 1.29
Array 250000 iterate all - loop 137.00 us/op 151.00 us/op 0.91
effectiveBalanceIncrements clone Uint8Array 300000 374.89 us/op 137.46 us/op 2.73
effectiveBalanceIncrements clone MutableVector 300000 840.00 ns/op 640.00 ns/op 1.31
effectiveBalanceIncrements rw all Uint8Array 300000 266.96 us/op 247.07 us/op 1.08
effectiveBalanceIncrements rw all MutableVector 300000 197.90 ms/op 131.75 ms/op 1.50
phase0 afterProcessEpoch - 250000 vs - 7PWei 197.58 ms/op 189.15 ms/op 1.04
phase0 beforeProcessEpoch - 250000 vs - 7PWei 73.676 ms/op 57.362 ms/op 1.28
altair processEpoch - mainnet_e81889 621.06 ms/op 574.62 ms/op 1.08
mainnet_e81889 - altair beforeProcessEpoch 172.71 ms/op 81.213 ms/op 2.13
mainnet_e81889 - altair processJustificationAndFinalization 73.746 us/op 21.123 us/op 3.49
mainnet_e81889 - altair processInactivityUpdates 10.867 ms/op 9.4378 ms/op 1.15
mainnet_e81889 - altair processRewardsAndPenalties 101.02 ms/op 126.80 ms/op 0.80
mainnet_e81889 - altair processRegistryUpdates 25.551 us/op 4.2820 us/op 5.97
mainnet_e81889 - altair processSlashings 7.7510 us/op 1.2270 us/op 6.32
mainnet_e81889 - altair processEth1DataReset 5.6580 us/op 1.1550 us/op 4.90
mainnet_e81889 - altair processEffectiveBalanceUpdates 2.5863 ms/op 2.4479 ms/op 1.06
mainnet_e81889 - altair processSlashingsReset 32.646 us/op 6.6480 us/op 4.91
mainnet_e81889 - altair processRandaoMixesReset 41.914 us/op 6.3480 us/op 6.60
mainnet_e81889 - altair processHistoricalRootsUpdate 7.5680 us/op 994.00 ns/op 7.61
mainnet_e81889 - altair processParticipationFlagUpdates 25.348 us/op 2.6100 us/op 9.71
mainnet_e81889 - altair processSyncCommitteeUpdates 5.1330 us/op 833.00 ns/op 6.16
mainnet_e81889 - altair afterProcessEpoch 196.72 ms/op 200.94 ms/op 0.98
phase0 processEpoch - mainnet_e58758 594.37 ms/op 494.43 ms/op 1.20
mainnet_e58758 - phase0 beforeProcessEpoch 277.02 ms/op 188.67 ms/op 1.47
mainnet_e58758 - phase0 processJustificationAndFinalization 82.566 us/op 21.146 us/op 3.90
mainnet_e58758 - phase0 processRewardsAndPenalties 138.15 ms/op 71.354 ms/op 1.94
mainnet_e58758 - phase0 processRegistryUpdates 54.608 us/op 10.478 us/op 5.21
mainnet_e58758 - phase0 processSlashings 5.6370 us/op 1.3330 us/op 4.23
mainnet_e58758 - phase0 processEth1DataReset 5.8730 us/op 800.00 ns/op 7.34
mainnet_e58758 - phase0 processEffectiveBalanceUpdates 2.2288 ms/op 1.9315 ms/op 1.15
mainnet_e58758 - phase0 processSlashingsReset 34.535 us/op 4.8330 us/op 7.15
mainnet_e58758 - phase0 processRandaoMixesReset 27.644 us/op 5.9760 us/op 4.63
mainnet_e58758 - phase0 processHistoricalRootsUpdate 8.1970 us/op 1.0230 us/op 8.01
mainnet_e58758 - phase0 processParticipationRecordUpdates 36.333 us/op 5.3320 us/op 6.81
mainnet_e58758 - phase0 afterProcessEpoch 155.39 ms/op 165.05 ms/op 0.94
phase0 processEffectiveBalanceUpdates - 250000 normalcase 2.3661 ms/op 2.0652 ms/op 1.15
phase0 processEffectiveBalanceUpdates - 250000 worstcase 0.5 2.5351 ms/op 2.5194 ms/op 1.01
altair processInactivityUpdates - 250000 normalcase 46.288 ms/op 33.386 ms/op 1.39
altair processInactivityUpdates - 250000 worstcase 49.823 ms/op 40.137 ms/op 1.24
phase0 processRegistryUpdates - 250000 normalcase 25.840 us/op 9.2880 us/op 2.78
phase0 processRegistryUpdates - 250000 badcase_full_deposits 495.96 us/op 375.10 us/op 1.32
phase0 processRegistryUpdates - 250000 worstcase 0.5 238.85 ms/op 182.03 ms/op 1.31
altair processRewardsAndPenalties - 250000 normalcase 95.754 ms/op 104.28 ms/op 0.92
altair processRewardsAndPenalties - 250000 worstcase 126.14 ms/op 108.90 ms/op 1.16
phase0 getAttestationDeltas - 250000 normalcase 12.681 ms/op 11.958 ms/op 1.06
phase0 getAttestationDeltas - 250000 worstcase 13.699 ms/op 12.124 ms/op 1.13
phase0 processSlashings - 250000 worstcase 6.0137 ms/op 5.2024 ms/op 1.16
altair processSyncCommitteeUpdates - 250000 306.08 ms/op 301.94 ms/op 1.01
BeaconState.hashTreeRoot - No change 543.00 ns/op 563.00 ns/op 0.96
BeaconState.hashTreeRoot - 1 full validator 74.603 us/op 71.903 us/op 1.04
BeaconState.hashTreeRoot - 32 full validator 705.86 us/op 650.60 us/op 1.08
BeaconState.hashTreeRoot - 512 full validator 7.8516 ms/op 7.4920 ms/op 1.05
BeaconState.hashTreeRoot - 1 validator.effectiveBalance 95.924 us/op 90.327 us/op 1.06
BeaconState.hashTreeRoot - 32 validator.effectiveBalance 1.4468 ms/op 1.3303 ms/op 1.09
BeaconState.hashTreeRoot - 512 validator.effectiveBalance 17.378 ms/op 17.578 ms/op 0.99
BeaconState.hashTreeRoot - 1 balances 68.796 us/op 69.135 us/op 1.00
BeaconState.hashTreeRoot - 32 balances 700.52 us/op 660.03 us/op 1.06
BeaconState.hashTreeRoot - 512 balances 6.7703 ms/op 6.4038 ms/op 1.06
BeaconState.hashTreeRoot - 250000 balances 113.54 ms/op 100.25 ms/op 1.13
aggregationBits - 2048 els - zipIndexesInBitList 33.221 us/op 25.137 us/op 1.32
regular array get 100000 times 51.957 us/op 64.141 us/op 0.81
wrappedArray get 100000 times 51.590 us/op 60.862 us/op 0.85
arrayWithProxy get 100000 times 31.116 ms/op 29.011 ms/op 1.07
ssz.Root.equals 515.00 ns/op 485.00 ns/op 1.06
byteArrayEquals 503.00 ns/op 477.00 ns/op 1.05
shuffle list - 16384 els 10.878 ms/op 11.336 ms/op 0.96
shuffle list - 250000 els 158.57 ms/op 169.62 ms/op 0.93
processSlot - 1 slots 15.808 us/op 13.421 us/op 1.18
processSlot - 32 slots 2.1519 ms/op 2.1377 ms/op 1.01
getEffectiveBalanceIncrementsZeroInactive - 250000 vs - 7PWei 455.92 us/op 484.68 us/op 0.94
getCommitteeAssignments - req 1 vs - 250000 vc 5.0651 ms/op 5.4002 ms/op 0.94
getCommitteeAssignments - req 100 vs - 250000 vc 7.4336 ms/op 7.9028 ms/op 0.94
getCommitteeAssignments - req 1000 vs - 250000 vc 7.5758 ms/op 8.4658 ms/op 0.89
RootCache.getBlockRootAtSlot - 250000 vs - 7PWei 9.6500 ns/op 8.5900 ns/op 1.12
state getBlockRootAtSlot - 250000 vs - 7PWei 1.1451 us/op 1.1159 us/op 1.03
computeProposers - vc 250000 18.208 ms/op 18.261 ms/op 1.00
computeEpochShuffling - vc 250000 164.58 ms/op 173.05 ms/op 0.95
getNextSyncCommittee - vc 250000 315.64 ms/op 297.85 ms/op 1.06

by benchmarkbot/action

@dapplion dapplion added this to the v1.1.0 milestone Sep 6, 2022
@dapplion
Contributor Author

dapplion commented Sep 6, 2022

Deployed to feat2 and checked lg1k logs, which confirmed that states are pruned on start

Sep-06 15:55:08.988[CHAIN]            info: Pruned 197 finalized states slots

However, metrics do not confirm that disk space is reduced; not sure why.

@dapplion dapplion mentioned this pull request Sep 7, 2022
@dapplion
Contributor Author

dapplion commented Sep 7, 2022

> However, metrics do not confirm that disk space is reduced, not sure why

Due to LevelDB internals, after a significant number of put or del operations it is necessary to compact the affected key range to realize the size reduction. After an offline discussion with other teams:

  • Should compact all archived states range after the prune of this PR
  • Should compact after completing sync
  • Should consider compacting the DB at most once per 7 days
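
The "at most once per 7 days" rule above could be enforced with a small throttle, sketched here under stated assumptions. The `RangeCompactable` interface is hypothetical: its `compactRange(start, end)` method mirrors the shape some LevelDB bindings expose, but the exact signature of any real controller should be checked before use. `shouldCompact` and `CompactionThrottle` are illustrative names, not part of this PR.

```typescript
// Illustrative only: interface and class names are hypothetical.
const COMPACT_MIN_INTERVAL_MS = 7 * 24 * 60 * 60 * 1000; // 7 days

// Assumed minimal shape of a DB handle that can compact a key range.
interface RangeCompactable {
  compactRange(start: Uint8Array, end: Uint8Array): Promise<void>;
}

// Pure decision: compact if never compacted, or if the interval has elapsed.
function shouldCompact(
  lastCompactionMs: number | null,
  nowMs: number,
  minIntervalMs: number = COMPACT_MIN_INTERVAL_MS
): boolean {
  return lastCompactionMs === null || nowMs - lastCompactionMs >= minIntervalMs;
}

class CompactionThrottle {
  private lastCompactionMs: number | null = null;
  private readonly db: RangeCompactable;
  private readonly minIntervalMs: number;

  constructor(db: RangeCompactable, minIntervalMs: number = COMPACT_MIN_INTERVAL_MS) {
    this.db = db;
    this.minIntervalMs = minIntervalMs;
  }

  // Compact [start, end] unless a compaction already ran within the interval.
  // Returns true if a compaction was performed.
  async maybeCompact(start: Uint8Array, end: Uint8Array, nowMs: number = Date.now()): Promise<boolean> {
    if (!shouldCompact(this.lastCompactionMs, nowMs, this.minIntervalMs)) {
      return false;
    }
    await this.db.compactRange(start, end);
    this.lastCompactionMs = nowMs;
    return true;
  }
}
```

A caller would invoke `maybeCompact` after the startup prune and after sync completes; the throttle then ensures those triggers do not cause back-to-back compactions. Note that this sketch keeps the last-compaction timestamp in memory only, so it resets on restart; persisting it would be a separate design decision.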

@dapplion dapplion marked this pull request as ready for review September 7, 2022 11:15
@dapplion dapplion requested a review from a team as a code owner September 7, 2022 11:15
@dapplion
Contributor Author

dapplion commented Sep 7, 2022

@dapplion dapplion marked this pull request as draft September 7, 2022 12:00
@dapplion
Contributor Author

dapplion commented Sep 7, 2022

Deployed this branch to all feat2. In feat2-md16-ctvpsm there are a lot of these errors:

Sep-07 14:02:22.367[eth1]            error: Error updating eth1 chain cache  Database is not open
Error: Database is not open
    at ClassicLevel.values (/usr/src/lodestar/node_modules/abstract-level/abstract-level.js:726:13)
    at LevelDbController.values (file:///usr/src/lodestar/packages/db/src/controller/level.ts:134:36)
    at DepositEventRepository.values (file:///usr/src/lodestar/packages/db/src/abstractRepository.ts:161:32)
    at DepositEventRepository.lastValue (file:///usr/src/lodestar/packages/db/src/abstractRepository.ts:226:31)
    at Eth1DepositsCache.add (file:///usr/src/lodestar/packages/beacon-node/src/eth1/eth1DepositsCache.ts:44:48)
    at Eth1DepositDataTracker.updateDepositCache (file:///usr/src/lodestar/packages/beacon-node/src/eth1/eth1DepositDataTracker.ts:211:30)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at Eth1DepositDataTracker.update (file:///usr/src/lodestar/packages/beacon-node/src/eth1/eth1DepositDataTracker.ts:191:33)
    at Eth1DepositDataTracker.runAutoUpdate (file:///usr/src/lodestar/packages/beacon-node/src/eth1/eth1DepositDataTracker.ts:154:29)

After some of those errors, the node suddenly crashes for no apparent reason.

@wemeetagain
Member

Can we split this into two PRs:

  • minimal fix for the bug, will allow new bns to not store too many states
    • hopefully this can make it into 1.1.0
  • fix for GCing old states, compacting, etc
    • draft PR, ready when it's ready

@dapplion
Contributor Author

dapplion commented Sep 9, 2022

> Can we split this into two PRs:
>
>   • minimal fix for the bug, will allow new bns to not store too many states
>     • hopefully this can make it into 1.1.0
>   • fix for GCing old states, compacting, etc
>     • draft PR, ready when it's ready

Yes makes sense 👍

@dapplion dapplion marked this pull request as ready for review September 9, 2022 07:56
@dapplion dapplion mentioned this pull request Sep 9, 2022
@wemeetagain wemeetagain merged commit 65b38ee into unstable Sep 9, 2022
@wemeetagain wemeetagain deleted the dapplion/fix-state-pruning branch September 9, 2022 13:52
Successfully merging this pull request may close these issues.

  • StateArchiver use epoch as key instead of slot
  • Archive states uses epoch as a key instead of slot
2 participants