
Memory usage of non-validator nodes grows indefinitely, leading to OOM and unclean shutdown #1660

Open
hhsel opened this issue Jul 21, 2023 · 1 comment

Comments

Contributor

hhsel commented Jul 21, 2023

Expected behaviour

When running a QBFT cluster, memory usage should stay within a moderate range as long as the cluster is not busy.

Actual behaviour

Memory usage of QBFT non-validator nodes grows over time at a rate of approximately 50 MB/day, even when the cluster is only producing empty blocks every second.
Non-validator nodes are eventually killed by the OOM killer as a result. I have experienced this with 2 GB and 4 GB nodes; it took about 1 month and 2 months, respectively, for them to be killed.

OOM causes an unclean shutdown, which means the node loses any intermediate state that has not been persisted to disk.
The memory usage grows indefinitely, even though the cluster is producing only empty blocks and almost nothing is happening on the chain.
In my case, all 8 non-validator nodes in the cluster show the same behaviour.

Validator nodes, on the other hand, show a similar tendency, but several sudden drops in memory usage have been observed (their timing is neither regular nor predictable, roughly once every 1-2 weeks).

As a result, I have to watch the memory usage of non-validator nodes closely and, when it gets high, take them out of the load balancer and restart them to avoid OOM.

Steps to reproduce the behaviour

Start a QBFT cluster with an arbitrary number of non-validator nodes, and let the cluster produce empty blocks.
Memory usage of the non-validator nodes grows indefinitely, causing OOM after a few months.

Contributor Author

hhsel commented Jul 23, 2023

Running debug.writeMemProfile() every 15 minutes for 2 months showed that, among the go tool pprof entries, the following are growing constantly, eventually causing the OOM:

  • github.com/ethereum/go-ethereum/consensus/istanbul/validator.newDefaultSet
  • github.com/ethereum/go-ethereum/consensus/istanbul/validator.New (inline)
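
For reference, the collection loop is roughly equivalent to the following standalone Go sketch using runtime/pprof (the file names and the 15-minute interval are just the values mentioned above; this is not how geth's debug.writeMemProfile is implemented, only an illustration of the sampling approach):

```go
// Periodically write heap profiles so growth can be compared later
// with `go tool pprof`.
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	for i := 0; ; i++ {
		f, err := os.Create(fmt.Sprintf("heap-%04d.pprof", i))
		if err != nil {
			panic(err)
		}
		runtime.GC() // refresh allocation statistics before sampling
		if err := pprof.WriteHeapProfile(f); err != nil {
			panic(err)
		}
		f.Close()
		time.Sleep(15 * time.Minute)
	}
}
```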

validator.newDefaultSet calls policy.RegisterValidatorSet(), which adds a validator set to the validator set list called registry inside ProposerPolicy. This function appears to be called at least once per block during block processing.
The only way to clear this list seems to be ClearRegistry(), but that is only called from Commit(), so non-validator nodes never call it and accumulate validator sets indefinitely.
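
To make the pattern concrete, here is a simplified sketch (not the actual Quorum source; the types and fields are stand-ins) of a registry that is only appended to and only cleared from Commit(), so it keeps growing on a node that never commits:

```go
// Simplified illustration of the accumulation described above.
package main

import (
	"fmt"
	"sync"
)

type ValidatorSet struct{ addrs []string } // stand-in for istanbul.ValidatorSet

type ProposerPolicy struct {
	mu       sync.Mutex
	registry []*ValidatorSet // grows on every block processed
}

func (p *ProposerPolicy) RegisterValidatorSet(v *ValidatorSet) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.registry = append(p.registry, v) // called at least once per block
}

func (p *ProposerPolicy) ClearRegistry() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.registry = nil // only reached from Commit(), i.e. only on validators
}

func main() {
	p := &ProposerPolicy{}
	// Simulate a non-validator importing blocks: sets are registered,
	// but ClearRegistry is never called, so memory grows with chain height.
	for block := 0; block < 1000; block++ {
		p.RegisterValidatorSet(&ValidatorSet{addrs: []string{"0xabc"}})
	}
	fmt.Println("registry length after 1000 blocks:", len(p.registry))
}
```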

I don't know why validator nodes also accumulate these validator sets in the same manner, with only occasional sudden drops in memory usage, because validator nodes call Commit() on every block, so the registry is expected to stay relatively small.
