JS heap blows up when running beacon node #2005
here is my crash report (tried pastebin, but they didn't like the JSON I guess):
I have the same issue without running a validator (beacon node only)
I found some promises retained in memory along with their data (CachedValidatorsBeaconState + EpochContext), and they keep accumulating. The strange thing is it only happens to some blocks, seemingly at random, and they stay in memory even though we processed them 10-20 mins ago, i.e. archived blocks past those slots. @ChainSafe/eth2-0 do any of us have any ideas?
I've never seen anything like this. To narrow down the possibilities, I would suggest:
I wonder when the last time we ran the beacon node without getting an OOM was? I hadn't run the beacon node into regular sync in quite some time before I ran it and found this issue. So maybe one thing that could help is to retrace our steps and look at the history of what was modified in the code section that Tuyen linked above (and/or the runStateTransition function itself) between the time we were last able to run the beacon node without OOM and now.
I tried version 0.13.0 for 1h and still saw memory increase significantly. We should sync our node more frequently from now on since we have Contabo servers now. Initially we couldn't test Pyrmont sync on our old server due to its 2GB RAM limitation; the other thing is that the Pyrmont chain was not as long then as it is now. In my local environment, I used to increase the heap size option for Pyrmont in order to sync, and restarted 1-2 times.
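(For reference, the flag below is the standard Node.js/V8 option for raising the heap limit; the 4096 MB value is only an illustration, not what we actually used:)

```bash
# Raise V8's old-space heap limit to ~4 GB before starting the beacon node
NODE_OPTIONS="--max-old-space-size=4096" ./lodestar beacon --network pyrmont
```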
thanks for the suggestion. I tested the vanilla … To make a sync …
Is the rate of memory increase much lower than on the current master?
yeah, just to give you an idea after I synced Pyrmont in 2h:
unfortunately I couldn't test on Contabo-01 due to #1990: the node lost all peers and stopped at epoch 12468, so I have to switch back to that issue atm.
the other time I ran a sync …
@tuyennhv, I'm currently running the beacon node on that branch. For example, here's some of the logs I see:
and yet, if you look at the beaconcha.in link I provided above, the validator apparently missed that epoch (17249). I tried to replicate this behavior on master and was not able to. The validator is able to consistently attest and have said attestations show up in beaconcha.in on master. So, something in that branch is making the validator (or perhaps the validator's interactions with the beacon node) act funny, it seems.
To follow up on the above comment: sometimes the failed attestation publish will show up in the validator logs, but not always. Like this:
But still, even when the error doesn't show up in the validator logs, the attestation never shows up in beaconcha.in (for example, in the last hour, every epoch (epochs 17245-17256) is shown as missed in beaconcha.in).
I got an out-of-memory crash today running on mainnet on DigitalOcean nodes.
UPDATE: Some hours after the out-of-memory crash, memory is climbing again. The worrying part is that the added memory directly affects performance, reducing the sync speed by half. Why do you think the machine slows down so much when memory is high? Could it be due to frequent swapping or garbage collection runs?
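One way to tell the two apart is to watch GC pauses directly. A minimal probe using Node's built-in perf_hooks GC entries (illustrative only, not lodestar code; `node --trace-gc` prints similar information):

```ts
// Log long garbage-collection pauses so slowdowns can be correlated with GC.
import { PerformanceObserver } from "perf_hooks";

const obs = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // entry.duration is the pause length in milliseconds
    if (entry.duration > 50) {
      console.log(`long GC pause: ${entry.duration.toFixed(1)}ms`);
    }
  }
});
obs.observe({ entryTypes: ["gc"] });
```

If total pause time grows along with the heap, it's GC pressure; if not, swapping is the likelier culprit (check the swap-in/out columns in `vmstat`).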
One somewhat easy fix would be to reduce the state cache sizes while we're initial-syncing. Right now we're storing a BUNCH of states that we really don't need to keep around, which accounts for most of that memory.
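A minimal sketch of the idea (the class and names here are hypothetical, not lodestar's actual cache API): an LRU-style cache whose cap can be lowered while initial-syncing, when only a few recent states are needed, and restored once synced:

```ts
// Size-bounded state cache; relies on Map's insertion-order iteration
// so the least recently used entry is always the first key.
class BoundedStateCache<T> {
  private cache = new Map<string, T>();

  constructor(private maxSize: number) {}

  get(root: string): T | undefined {
    const state = this.cache.get(root);
    if (state !== undefined) {
      // re-insert so the entry counts as most recently used
      this.cache.delete(root);
      this.cache.set(root, state);
    }
    return state;
  }

  set(root: string, state: T): void {
    this.cache.set(root, state);
    this.prune();
  }

  /** Shrink the cap during initial sync, restore it once synced. */
  setMaxSize(maxSize: number): void {
    this.maxSize = maxSize;
    this.prune();
  }

  private prune(): void {
    while (this.cache.size > this.maxSize) {
      // evict the least recently used entry (first in insertion order)
      const oldestKey = this.cache.keys().next().value as string;
      this.cache.delete(oldestKey);
    }
  }
}
```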
@wemeetagain Can you spec a proposal in a new issue to implement this? Sounds very important and urgent.
Here's a proposal I have: #2051
To debug the stair-stepping memory, let's take a heap snapshot when the stair-step is low, then another when it is higher, and compare the snapshots.
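To capture the snapshots without restarting the node, something like this works (assumed setup, not part of lodestar; `v8.writeHeapSnapshot` is built into Node since 11.13):

```ts
// Write a V8 heap snapshot whenever the process receives SIGUSR2, so two
// snapshots taken an hour apart can be diffed later.
import { writeHeapSnapshot } from "v8";

process.on("SIGUSR2", () => {
  // NOTE: this blocks the event loop while writing; expect a long pause
  // on a multi-GB heap.
  const file = writeHeapSnapshot(`heap-${Date.now()}.heapsnapshot`);
  console.log(`wrote heap snapshot to ${file}`);
});
```

Trigger it with `kill -USR2 <pid>`, then load both files in Chrome DevTools' Memory tab and use the "Comparison" view.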
I managed to capture a heapdump before the stair-step and after it (see the last two blue lines). They are 1 hour apart but relatively close to the step, given that it takes 3-5 min to capture a heapdump. Note: all heap snapshots are available on our cloud node Contabo01 in …
I've loaded and compared the heap snapshots and the only big diff is …
Some nodes I've been running for 7+ days show no memory leak.

Node A (note the 7-day scale in this chart)

Node B (note the 30-day scale in this chart)

Some or many of the improvements we introduced in the last few months fixed all major leaks. I'll close the issue for now. Let's open a new one if new leaks appear when stressing the node in new situations.
This is happening again on my node after upgrading from 0.18.0 to 0.27.0.
Shall I open a new ticket or shall we stick to this one?
Thanks, that works well! (for now)
Describe the bug
When running a synced lodestar beacon node (cmd: `./lodestar beacon --network pyrmont --rootDir .pyrmont/ --eth1.providerUrl https://goerli.infura.io/v3/3ee38f4a9c4447dfbd050e5afbb4b9c6`) and the validator (cmd: `./lodestar validator --network pyrmont --rootDir .pyrmont/ --graffiti "lodestar wuz here"`) with an activated Pyrmont validator, everything runs smoothly for about 16-18 hours (the beacon node is stable, the validator is attesting, etc.), and then I got an OOM issue. I'm not sure if it's connected to the fact that both the validator and the beacon node were run together, but the fact that other people have run the beacon node without an OOM issue for quite some time (and that, to my knowledge, no one has really been running an active testnet validator alongside the beacon node in lodestar) gives me a hunch that it's connected to the validator.

EDIT: this issue happens when running just the beacon node by itself and does not require the validator to be run to replicate the issue.
This is what my heap space looked like in grafana (the big empty space is between when the crash happened and when I noticed the crash)
![image](https://user-images.githubusercontent.com/32028745/106190791-c1770900-616f-11eb-8d06-3fb91c296766.png)
here is my crash report:
https://pastebin.com/WJCEe6R6