
Validator db read/write ops too high on first startup with empty db #5356

Closed
nflaig opened this issue Apr 11, 2023 · 8 comments · Fixed by #5454
Labels
scope-performance Performance issue and ideas to improve performance.

Comments

@nflaig
Member

nflaig commented Apr 11, 2023

Describe the bug

Validator db read/write ops are too high on first startup with an empty db, which causes I/O lag and instability in the VC. After one or two hours the process stabilizes and read/write ops per second decrease drastically.

Expected behavior

The VC should not be unstable for one or two hours when started with an empty db.

Steps to Reproduce

Run the VC with a lot of keys (500+) and make sure the validator-db is empty. If the VC was active before and the db was removed, be careful not to get slashed, as removing the db also removes the slashing protection db.

Metrics

[3 metrics screenshots: validator db read/write items and requests per second]

@dapplion
Contributor

Can you paste the prom query code for the first metrics screen capture?

@nflaig
Member Author

nflaig commented Apr 12, 2023

I just changed the prom queries to use the VC metrics. The rate interval is 1m; we would probably have to add bucket labels to investigate further.

  • rate(validator_db_read_items_total[$rate_interval]) (top left)
  • rate(validator_db_write_items_total[$rate_interval]) (top right)
  • rate(validator_db_read_req_total[$rate_interval]) (bottom left)
  • rate(validator_db_write_req_total[$rate_interval]) (bottom right)

@philknows added the scope-performance label Apr 23, 2023
@nflaig
Member Author

nflaig commented Apr 26, 2023

@dapplion I found the problem: the huge amount of reads/writes is due to minMaxSurround.ts. this.maxEpochLookback is set to Infinity, meaning untilEpoch will always be 0. This results in a db read and a bulk write per validator pubkey for every epoch from the current epoch back to genesis.


I have to dig into how min-max surround works, likely need to use another value for maxEpochLookback.
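
Roughly, the pattern looks like this (a simplified sketch with made-up names, not the actual minMaxSurround.ts code):

// Simplified sketch (hypothetical names): with maxEpochLookback = Infinity,
// untilEpoch clamps to 0, so the span update walks every epoch back to genesis.
function epochsToUpdate(currentEpoch: number, maxEpochLookback: number): number[] {
  const untilEpoch = Math.max(0, currentEpoch - maxEpochLookback); // Infinity => 0
  const epochs: number[] = [];
  for (let epoch = currentEpoch - 1; epoch >= untilEpoch; epoch--) {
    epochs.push(epoch); // each epoch costs a db read plus a bulk-write item, per pubkey
  }
  return epochs;
}

console.log(epochsToUpdate(200_000, Infinity).length); // 200000 epochs per pubkey
console.log(epochsToUpdate(200_000, 4096).length);     // 4096 epochs per pubkey

With 500+ keys the Infinity case works out to roughly 100M db operations on first startup, which lines up with the metrics above.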

@nflaig
Member Author

nflaig commented Apr 28, 2023

Based on Detecting slashing conditions, it seems like maxEpochLookback only needs to be the number of epochs in the weak subjectivity period. More details in the Scanning for surrounding votes section.

@nflaig
Member Author

nflaig commented Apr 29, 2023

Slashers default to a history of 4096 epochs, which means it could be fine to set maxEpochLookback to just that period. However, they might still collect slashable attestations over a longer period, meaning limiting the max lookback might be risky when using the min-max approach.

Maybe we could consider setting maxEpochLookback to a practically safe enough value to fix the I/O issue for now.

An alternative would be to implement a different slashing protection strategy like the "minimal"/high-watermark strategy, as suggested in the Eth R&D Discord.

Edit: It looks like this strategy is already implemented on top of min-max surround:

// Refuse to sign any attestation with:
// - source.epoch < min(att.source_epoch for att in data.signed_attestations if att.pubkey == attester_pubkey), OR
// - target_epoch <= min(att.target_epoch for att in data.signed_attestations if att.pubkey == attester_pubkey)
// (spec v4, Slashing Protection Database Interchange Format)
const attestationLowerBound = await this.attestationLowerBound.get(pubKey);
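
For context, the check on top of that lower bound boils down to something like this (a sketch with hypothetical types and names, not the actual repository code):

// Sketch of the high-watermark ("minimal") rule quoted above (illustrative only):
// refuse any attestation at or below the stored per-pubkey lower bounds.
interface AttestationLowerBound {
  minSourceEpoch: number;
  minTargetEpoch: number;
}

function violatesLowerBound(
  att: {sourceEpoch: number; targetEpoch: number},
  lowerBound: AttestationLowerBound | null
): boolean {
  if (lowerBound === null) return false; // nothing recorded yet for this pubkey
  return (
    att.sourceEpoch < lowerBound.minSourceEpoch || // source below the recorded minimum
    att.targetEpoch <= lowerBound.minTargetEpoch // target at or below the recorded minimum
  );
}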

@dapplion
Contributor

Setting maxEpochLookback to 4096 sounds sensible. We can do that plus using the minimal strategy. That has the benefit of producing way smaller exports, right?

@nflaig
Member Author

nflaig commented Apr 30, 2023

That has the benefit of producing way smaller exports, right?

No, the slashing export size just depends on the data in AttestationByTargetRepository; min-max spans are not exported. Setting maxEpochLookback to 4096 would fix the I/O issue and also massively reduce the validator db size. Right now we are creating ~200k min span records per pubkey (one per mainnet epoch since genesis).

Setting maxEpochLookback to 4096 sounds sensible.

I think a better value would be the number of epochs in the weak subjectivity period, but we don't have access to that on the VC side. All of these values are quite theoretical though, so 4096 might also be fine, possibly with an option to override it via the CLI.
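
Something along these lines, where the default constant and the override are only illustrative (no such CLI flag exists yet):

// Illustrative sketch: bound the min-max surround lookback instead of using Infinity,
// with an optional user-provided override (names and values are hypothetical).
const DEFAULT_MAX_EPOCH_LOOKBACK = 4096; // matches the default slasher history length

function resolveMaxEpochLookback(cliOverride?: number): number {
  return cliOverride !== undefined && cliOverride > 0 ? cliOverride : DEFAULT_MAX_EPOCH_LOOKBACK;
}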

@nflaig
Member Author

nflaig commented Apr 30, 2023

That has the benefit of producing way smaller exports, right?

Actually, right now it even affects the size of the slashing protection export, but this is a bug which will be fixed by #5437. It loads keys from other buckets which are not valid pubkeys; for those there will not be any signed_blocks or signed_attestations, but records are still added to the slashing protection export.

The empty records with invalid pubkeys look like this:

{
  "pubkey": "0x47454e455349535f54494d45",
  "signed_blocks": [],
  "signed_attestations": []
},
{
  "pubkey": "0x47454e455349535f56414c494441544f52535f524f4f54",
  "signed_blocks": [],
  "signed_attestations": []
}
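
Those hex strings are just ASCII-encoded metadata keys from other buckets, which you can confirm by decoding them (Node.js snippet):

// Decoding the invalid "pubkeys" shows they are metadata bucket keys, not BLS pubkeys
Buffer.from("47454e455349535f54494d45", "hex").toString("utf8");
// -> "GENESIS_TIME"
Buffer.from("47454e455349535f56414c494441544f52535f524f4f54", "hex").toString("utf8");
// -> "GENESIS_VALIDATORS_ROOT"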
