
JS heap blows up when running beacon node #2005

Closed · 3xtr4t3rr3str14l opened this issue Jan 28, 2021 · 25 comments
@3xtr4t3rr3str14l (Contributor) commented Jan 28, 2021

Describe the bug

When running a synced Lodestar beacon node (cmd: ./lodestar beacon --network pyrmont --rootDir .pyrmont/ --eth1.providerUrl https://goerli.infura.io/v3/3ee38f4a9c4447dfbd050e5afbb4b9c6) together with the validator (cmd: ./lodestar validator --network pyrmont --rootDir .pyrmont/ --graffiti "lodestar wuz here") and an activated Pyrmont validator, everything runs smoothly for about 16-18 hours (the beacon node is stable, the validator is attesting, etc.), and then I hit an OOM crash. I'm not sure whether it's connected to running the validator and the beacon node together, but the fact that other people have run the beacon node for quite some time without an OOM (and that, to my knowledge, no one has really been running an active testnet validator alongside the beacon node in Lodestar) gives me a hunch that it's connected to the validator.

EDIT: this issue also happens when running just the beacon node by itself; the validator does not need to be running to reproduce it.

This is what my heap space looked like in Grafana (the big empty space is between when the crash happened and when I noticed it):

[screenshot: Grafana JS heap space chart]

here is my crash report:
https://pastebin.com/WJCEe6R6

@3xtr4t3rr3str14l (Contributor, Author) commented

External memory was slightly increasing over time as well:

[screenshot: Grafana external memory chart]

@3xtr4t3rr3str14l (Contributor, Author) commented Jan 28, 2021

here is my crash report (tried pastebin, but they didn't like the JSON I guess):

report.20210128.061107.19505.0.001.json.txt

@twoeths self-assigned this Jan 28, 2021
@q9f (Contributor) commented Jan 29, 2021

I have the same issue without running a validator (beacon node only)

2021-01-28 23:31:00 [SYNC]             info: Imported blocks for slots blocks=357503,357504,357506,357507,357508,357509,357510,357511,357512,357513,357514,357516,357517,357518,357519,357520,357521,357522,357523,357524,357525,357526,357527,357528,357529,357530,357531,357532,357533,357535,357536,357537,357538,357539,357540,357542,357543,357544,357545,357546,357547,357548,357549,357550,357551,357552,357553,357555,357556,357557,357558,357559,357560,357561,357562,357563,357565,357566
2021-01-28 23:31:00 [CHAIN]            info: Block processed slot=357503, root=0x2b084f567f44a0c158fda41f6e32a3b694fc5f9eadfdb1ba6fefb1a34a7435b4

<--- Last few GCs --->

[81004:0x30f7230] 24156899 ms: Mark-sweep 2023.8 (2079.1) -> 2021.7 (2052.6) MB, 1199.8 / 0.9 ms  (+ 0.0 ms in 29 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 1210 ms) (average mu = 0.109, current mu = 0.009) finalize[81004:0x30f7230] 24158437 ms: Mark-sweep 2023.1 (2052.6) -> 2022.0 (2058.1) MB, 1529.9 / 0.8 ms  (average mu = 0.052, current mu = 0.005) allocation failure scavenge might not succeed


<--- JS stacktrace --->

==== JS stack trace =========================================

    0: ExitFrame [pc: 0x140a8f9]
Security context: 0x2cc1aa9008d1 <JSObject>
    1: /* anonymous */(aka /* anonymous */) [0x31043e746e89] [/home/user/lodestar/packages/lodestar-beacon-state-transition/lib/fast/util/epochProcess.js:~60] [pc=0x2d03f8603781](this=0x1510846804b1 <undefined>,0x2d9c88a08c89 <Object map = 0x14d1e318e809>,7303)
    2: iterNode(aka iterNode) [0x31043e746f11] [/home/user/lodestar/packages/persistent-ts/lib/Vector.j...

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 0xa18150 node::Abort() [node]
 2: 0xa1855c node::OnFatalError(char const*, char const*) [node]
 3: 0xb9715e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xb974d9 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xd54755  [node]
 6: 0xd54de6 v8::internal::Heap::RecomputeLimits(v8::internal::GarbageCollector) [node]
 7: 0xd616a5 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [node]
 8: 0xd62555 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 9: 0xd6500c v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
10: 0xd2ba2b v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationType, v8::internal::AllocationOrigin) [node]
11: 0x106e15c v8::internal::Runtime_AllocateInOldGeneration(int, unsigned long*, v8::internal::Isolate*) [node]
12: 0x140a8f9  [node]

@twoeths (Contributor) commented Feb 1, 2021

I found some promises retained in memory along with their data (CachedValidatorsBeaconState + EpochContext), and they keep accumulating.

[screenshot: heap snapshot showing retained promises, 2021-02-01]

I tried adding a myId property to all suspected promises during the epoch transition to find out which promises were retained, and it's this one: https://github.com/ChainSafe/lodestar/blob/master/packages/lodestar/src/chain/blocks/process.ts#L148

The strange thing is that it only happens to some blocks, seemingly at random, and those promises stayed in memory even though we processed the blocks 10-20 minutes earlier, i.e. we had already archived blocks past those slots. @ChainSafe/eth2-0 do any of us have any ideas?
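For reference, the tagging trick looks roughly like this (a minimal sketch, not the actual Lodestar code; only the myId property name comes from the description above, the helper itself is hypothetical):

```typescript
// Sketch: attach an identifying property to a suspected promise so it can be
// located later in a heap snapshot's retainers view.
let nextId = 0;

function tagPromise<T>(promise: Promise<T>, label: string): Promise<T> {
  // A plain own-property stays on the Promise object and shows up in the
  // DevTools heap snapshot, making the retaining promise easy to identify.
  (promise as Promise<T> & {myId?: string}).myId = `${label}-${nextId++}`;
  return promise;
}

// Hypothetical call site during block/epoch processing:
// const result = await tagPromise(runStateTransition(block, state), "stateTransition");
```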

@dapplion (Contributor) commented Feb 1, 2021

I've never seen anything like this. To narrow down the possibilities I would suggest:

  • Making the runStateTransition function sync and confirming whether the leak persists
  • Using a vanilla version of sleep(0) (a minimal sketch follows below) and confirming whether the leak persists
  • Reaching out to Chromium or NodeJS with this heap and seeing if they have some advice on what to do next
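For reference, a "vanilla" sleep(0) could be as simple as the sketch below (an assumption on my part, not necessarily what Lodestar's util currently does): a plain setTimeout wrapped in a promise, with no abort signal or other extras, so that awaiting it only yields back to the event loop.

```typescript
// Minimal "vanilla" sleep: resolves after `ms` milliseconds via setTimeout.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Example: yield once per iteration of a heavy loop so timers/IO can run.
// for (const item of items) {
//   doHeavyWork(item);
//   await sleep(0);
// }
```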

@3xtr4t3rr3str14l changed the title from "JS heap blows up when running both beacon node and validator" to "JS heap blows up when running both beacon node" on Feb 1, 2021
@3xtr4t3rr3str14l changed the title from "JS heap blows up when running both beacon node" to "JS heap blows up when running beacon node" on Feb 1, 2021
@3xtr4t3rr3str14l (Contributor, Author) commented Feb 1, 2021

I wonder when the last time we ran the beacon node without getting an OOM was. I hadn't run the beacon node into regular sync in quite some time before I ran it and found this issue. So, maybe one thing that could help is to retrace our steps and look at the history of what was modified in the code section that Tuyen linked above (and/or the runStateTransition function itself) between the last time we were able to run the beacon node without an OOM and now.

@twoeths (Contributor) commented Feb 1, 2021

> I wonder when the last time we ran the beacon node without getting an OOM was. I hadn't run the beacon node into regular sync in quite some time before I ran it and found this issue. So, maybe one thing that could help is to retrace our steps and look at the history of what was modified in the code section that Tuyen linked above (and/or the runStateTransition function itself) between the last time we were able to run the beacon node without an OOM and now.

I've tried version 0.13.0 for 1h and still saw memory increase significantly. We should sync our node more frequently from now on, since we have the Contabo servers now. Initially we couldn't test Pyrmont sync on our old server due to its 2GB RAM limitation; the other thing is that the Pyrmont chain was not as long then as it is now.

In my local environment, I used to increase the heap size option in order to sync Pyrmont, and restarted 1-2 times.

@twoeths (Contributor) commented Feb 1, 2021

> I've never seen anything like this. To narrow down the possibilities I would suggest:
>
>   • Making the runStateTransition function sync and confirming whether the leak persists
>   • Using a vanilla version of sleep(0) and confirming whether the leak persists
>   • Reaching out to Chromium or NodeJS with this heap and seeing if they have some advice on what to do next

Thanks for the suggestions. I tested the vanilla sleep(0) and still saw the issue. I'm testing a sync runStateTransition now; let's see how it works on server 01.

To make runStateTransition() sync, I had to move sleep(0) to the outer for loop, so it's similar to my previous try. A debug session in my local environment after a 2h test looks good: there are no leaked CachedValidatorsBeaconState instances and, of course, no Promise instances keeping them alive. I still see memory slowly increasing, though I'm not sure if that's because of my --inspect flag.
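Roughly, the change looks like this (an illustrative sketch under assumptions, not the actual branch code; the names below are placeholders):

```typescript
// Sketch: run the state transition for each block fully synchronously and
// yield to the event loop only between blocks, in the outer loop, so no
// pending promise retains per-block state (CachedValidatorsBeaconState etc.).
interface BlockData {
  slot: number;
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Placeholder for a fully synchronous state transition (no awaits inside).
function runStateTransitionSync(block: BlockData): void {
  // ... slot/epoch/block processing ...
}

async function processBlocks(blocks: BlockData[]): Promise<void> {
  for (const block of blocks) {
    runStateTransitionSync(block);
    await sleep(0); // moved to the outer for loop: yield between blocks only
  }
}
```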

@dapplion (Contributor) commented Feb 2, 2021

> I still see memory slowly increasing, though I'm not sure if that's because of my --inspect flag.

Is the rate of memory increase much lower than on the current master?

@twoeths (Contributor) commented Feb 2, 2021

> > I still see memory slowly increasing, though I'm not sure if that's because of my --inspect flag.
>
> Is the rate of memory increase much lower than on the current master?

Yeah, just to give you an idea after syncing Pyrmont for 2h:

  • master would be at >3GB or even 4GB by that time; normally I stopped after 1h
  • this branch consumed only 2GB at that time and it increased at a much slower rate; the increase may be due to my debugging (--inspect), but at least I don't see CachedValidatorsBeaconState being leaked like on master

Unfortunately I couldn't test on Contabo-01 due to #1990: it lost all peers and stopped at epoch 12468, so I have to switch back to that issue atm.

@twoeths (Contributor) commented Feb 2, 2021

The other time, I ran a sync stateTransition for 2.5h and the heap stayed stable, ranging from 1.1GB to 1.7GB.

@wemeetagain (Member) commented

One thing I noticed is that the heap issue only seems to happen during regular sync.
Initial sync has low and steady heap usage.

[screenshot: Pyrmont heap usage chart]

@3xtr4t3rr3str14l (Contributor, Author) commented Feb 3, 2021

@tuyennhv, I'm currently running the beacon node on the tuyen/experiment-sync-runStateTransition branch. I'm still waiting to see what happens with the heap space usage, but I'm noticing that when I run the validator on the same branch alongside the beacon node (once the beacon node is in regular sync), the validator consistently does not publish its attestations (according to beaconcha.in: https://pyrmont.beaconcha.in/validator/adb60c447b8b59bfa0df75b88bd9ee4bd393258f917ab54eee6520787f3346b6b36fab203cd45baf7d308576a067cb3e#attestations), even though I see the attestations being made in the validator logs. This has happened consistently for multiple epochs in a row; in fact, I haven't seen any published attestations on beaconcha.in since running on this branch.

For example, here are some of the logs I see:

2021-02-02 21:53:43 []                 info: on new block epoch epoch=17249, validator=0xadb60c447b8b59bfa0df75b88bd9ee4bd393258f917ab54eee6520787f3346b6b36fab203cd45baf7d308576a067cb3e
2021-02-02 21:57:07 []                 info: Handling attestation duty slot=551985, committee=27, validator=0xadb60c447b8b59bfa0df75b88bd9ee4bd393258f917ab54eee6520787f3346b6b36fab203cd45baf7d308576a067cb3e
2021-02-02 21:57:11 []                 info: Signed new attestation block=0x18c153f031843f3e364de5a000927c93fe44ccc72b692ebe0ec5dfc4fd5a28d5, committeeIndex=27, slot=551985
2021-02-02 21:57:11 []                 info: Published new attestation slot=551985, committee=27, attestation=0x26d09431e78cf7e0fedd9fa0adec32d430987f333bb94e7c323a28e651a8e90b, block=0x18c153f031843f3e364de5a000927c93fe44ccc72b692ebe0ec5dfc4fd5a28d5, validator=0xadb60c447b8b59bfa0df75b88bd9ee4bd393258f917ab54eee6520787f3346b6b36fab203cd45baf7d308576a067cb3e
2021-02-02 21:57:15 []                 info: Aggregating attestations committeeIndex=27, slot=551985
2021-02-02 21:57:16 []                 info: Published signed aggregate and proof committeeIndex=27, slot=551985

And yet, if you look at the beaconcha.in link I provided above, the validator apparently missed that epoch (17249).

I tried to replicate this behavior on master and was not able to: on master, the validator consistently attests and those attestations show up in beaconcha.in. So it seems something in that branch is making the validator (or perhaps the validator's interaction with the beacon node) act funny.

@3xtr4t3rr3str14l (Contributor, Author) commented Feb 3, 2021

To follow up on the above comment: sometimes the failed attestation publish shows up in the validator logs, but not always. Like this:

2021-02-02 22:07:47 []                error: Failed to publish attestation message=ATTESTATION_ERROR_ATTESTATION_ALREADY_KNOWN
2021-02-02 22:07:51 []                 info: Aggregating attestations committeeIndex=4, slot=552038
2021-02-02 22:07:51 []                error: Failed to produce aggregate and proof message=No matching attestations found for attestationData

But still, even when the error doesn't show up in the validator logs, the attestation never shows up in beaconcha.in (for example, in the last hour every epoch from 17245 to 17256 is shown as missed in beaconcha.in).

@3xtr4t3rr3str14l (Contributor, Author) commented Feb 3, 2021

So, the heap blew up earlier for me this time (on tuyen/experiment-sync-runStateTransition) than on master. It looks like this branch may have plugged one memory issue but introduced another (notice the pink line going up).

[screenshot: Grafana memory chart, 2021-02-03]

@dapplion (Contributor) commented

I got an out-of-memory crash today running on mainnet on Digital Ocean nodes.

[screenshot: memory chart, 2021-02-14]

==== JS stack trace =========================================

Security context: 0x179d4cdc08a1 <JSObject>
    0: builtin exit frame: BigInt(aka BigInt)(this=0x1b5ec21004a9 <undefined>,40,0x1b5ec21004a9 <undefined>)

    1: readOnlyMap [0x3dd077e32789] [/usr/app/node_modules/@chainsafe/ssz/lib/backings/tree/array.js:~211] [pc=0x1c31e130bcdf](this=0x16c3a3cd2b01 <TreeHandler map = 0xc0f0df1f81>,0x174a8ddbf0c1 <Tree map = 0xd54264f1491>,0x174a8ddbf089 <JSFunction (sfi = 0x2ecf077fbfd1)>)
    2: process...

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memo

UPDATE: A few hours after the out-of-memory crash, memory is climbing again. The worrying part is that the added memory directly affects performance, reducing the sync speed by half. Why do you think the machine slows down so much when memory is high? Could it be due to frequent swapping or garbage collection runs?

[screenshot: memory chart climbing again, 2021-02-14]

@wemeetagain (Member) commented

One somewhat easy fix would be to reduce the state cache sizes while we're initial-syncing. Right now we're storing a BUNCH of states that we really don't need to keep around, which accounts for most of that memory.
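Roughly the kind of thing I mean (an illustrative sketch only, not our actual cache code; the class, names, and sizes below are made up): cap the state cache at a small size while initial-syncing and raise the cap once synced.

```typescript
// Sketch: a bounded state cache whose maximum size can be lowered during
// initial sync and raised afterwards. Oldest entries are dropped first.
class BoundedStateCache<State> {
  private readonly cache = new Map<string, State>();

  constructor(private maxSize: number) {}

  setMaxSize(maxSize: number): void {
    this.maxSize = maxSize;
    this.prune();
  }

  add(stateRoot: string, state: State): void {
    this.cache.set(stateRoot, state);
    this.prune();
  }

  get(stateRoot: string): State | undefined {
    return this.cache.get(stateRoot);
  }

  // Map iteration order is insertion order, so this evicts the oldest states.
  private prune(): void {
    while (this.cache.size > this.maxSize) {
      const oldest = this.cache.keys().next().value as string;
      this.cache.delete(oldest);
    }
  }
}

// Hypothetical usage: small cap while initial-syncing, larger once synced.
// const stateCache = new BoundedStateCache<unknown>(4);
// onSyncCompleted(() => stateCache.setMaxSize(96));
```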

@dapplion (Contributor) commented Feb 16, 2021

> One somewhat easy fix would be to reduce the state cache sizes while we're initial-syncing. Right now we're storing a BUNCH of states that we really don't need to keep around, which accounts for most of that memory.

@wemeetagain Can you spec a proposal in a new issue to implement this? Sounds very important and urgent.

@twoeths (Contributor) commented Feb 17, 2021

Here's a proposal: #2051

@wemeetagain (Member) commented

To debug the stair-stepping memory, let's take a heap snapshot when the stair step is low, then another when it is higher, and compare the snapshots.
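One possible way to grab the two snapshots (just a sketch; heapdump tooling or DevTools work too) is Node's built-in v8.writeHeapSnapshot, which writes a .heapsnapshot file that can then be diffed in Chrome DevTools' Comparison view:

```typescript
// Sketch: capture a labeled heap snapshot with Node's built-in v8 module.
import {writeHeapSnapshot} from "v8";

function captureSnapshot(label: string): string {
  // NOTE: this blocks the event loop and can take minutes on a multi-GB heap.
  const filename = `lodestar-${label}-${new Date().toISOString()}.heapsnapshot`;
  return writeHeapSnapshot(filename);
}

// captureSnapshot("before-stairstep");
// ... wait for the stair step ...
// captureSnapshot("after-stairstep");
```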

@dapplion (Contributor) commented Mar 2, 2021

I managed to capture a heap dump before and after the stair step (see the last two blue lines). They are 1 hour apart but relatively close to the step, given that it takes 3-5 min to capture a heap dump.

[screenshot: memory chart with heap-dump capture markers, 2021-03-02]

Note: all heap snapshots are available on our cloud node Contabo01 in /root/. Please feel free to scp the files to your local environment and investigate what's going on.

-rw-r--r--  1 root root 1367036296 Mar  2 12:52 lodestar-master-06dab5-2021-03-02T12:52:00.000Z.heapsnapshot
-rw-r--r--  1 root root 1343952846 Mar  2 13:43 lodestar-master-06dab5-2021-03-02T13:43:00.000Z.heapsnapshot
-rw-r--r--  1 root root 1334864890 Mar  2 14:55 lodestar-master-06dab5-2021-03-02T14:55:00.000Z.heapsnapshot

I've loaded and compared the heap snapshots and the only big diff is (system). Any clue what this is?

[screenshot: DevTools heap snapshot comparison showing (system) diff]

@dapplion (Contributor) commented

Some nodes I've been running for 7+ days show no memory leak.

Node A (note the 7-day scale in this chart):

[screenshots: Node A memory charts, 2021-04-22]

Node B (note the 30-day scale in this chart):

[screenshots: Node B memory charts, 2021-04-22]

Some or many of the improvements we introduced in the last few months fixed all major leaks. I'll close the issue for now; let's open a new one if new leaks appear when stressing the node in new situations.

@q9f (Contributor) commented Aug 11, 2021

This is happening again on my node after upgrading from 0.18.0 to 0.27.0:

Aug-11 14:44:15.595 []                 info: Lodestar version=0.27.0 master 2c010b25, network=mainnet
Aug-11 14:44:15.680 [DB]               info: Connected to LevelDB database name=/srv/chain/lodestar/chain-db
Aug-11 14:44:20.333 []                 info: Initializing beacon state from anchor state slot=1217312, epoch=38041, stateRoot=0x677f9dcaf9e0cfa49586a020259c06480082988f5409472dd6c68127d8f43f80
Aug-11 14:44:32.248 [METRICS]          info: Starting metrics HTTP server port=8008
(node:1717966) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 uncaughtException listeners added to [process]. Use emitter.setMaxListeners() to increase limit
Aug-11 14:44:32.335 [API]              info: Started REST api server address=http://127.0.0.1:9596, namespaces=["beacon","config","events","node","validator"]
Aug-11 14:44:43.342 [NETWORK]          info: PeerId 16Uiu2HAkxpvEzCjLDCEKK4sxGH8TzjAr3j3H135brtQghsqo6Re9, Multiaddrs /ip4/127.0.0.1/tcp/9000,/ip4/135.181.182.54/tcp/9000
Aug-11 14:44:43.343 []                 warn: Low peer count peers=1
Aug-11 14:44:43.344 []                 info: Searching for peers - peers: 1 - finalized: 38039 0xbf67…e44c - head: 1217312 0xf5c7…19cd - clockSlot: 1821821
Aug-11 14:44:54.492 []                 info: Syncing - 1.2 days left - 5.74 slots/s - finalized: 38041 0xf5c7…19cd - head: 1217376 0x7be4…ee3b - clockSlot: 1821822 - peers: 2
Aug-11 14:45:15.248 []                 info: Syncing - 1.8 days left - 3.89 slots/s - finalized: 38043 0x7be4…ee3b - head: 1217440 0xfb97…2967 - clockSlot: 1821824 - peers: 4
Aug-11 14:45:17.465 []                 info: Syncing - 1.9 days left - 3.68 slots/s - finalized: 38043 0x7be4…ee3b - head: 1217441 0x8d03…ce63 - clockSlot: 1821824 - peers: 4
Aug-11 14:45:29.142 []                 info: Syncing - 1.8 days left - 3.88 slots/s - finalized: 38044 0x13cc…e6de - head: 1217503 0x010a…83d0 - clockSlot: 1821825 - peers: 5
Aug-11 14:45:45.344 []                 info: Syncing - 1.9 days left - 3.60 slots/s - finalized: 38046 0xd1ba…1503 - head: 1217537 0x8cf6…f5cf - clockSlot: 1821826 - peers: 5
Aug-11 14:45:53.848 []                 info: Syncing - 2 days left - 3.53 slots/s - finalized: 38047 0x584c…b199 - head: 1217569 0x3778…1bc7 - clockSlot: 1821827 - peers: 5
Aug-11 14:46:05.112 []                 info: Syncing - 2 days left - 3.56 slots/s - finalized: 38048 0x56f1…c00e - head: 1217618 0xf015…7905 - clockSlot: 1821828 - peers: 8
Aug-11 14:46:17.310 []                 info: Syncing - 2 days left - 3.43 slots/s - finalized: 38049 0x6fd4…fecb - head: 1217636 0x94a0…138f - clockSlot: 1821829 - peers: 10

<--- Last few GCs --->

[1717966:0x362f180]   122575 ms: Scavenge 2028.8 (2040.6) -> 2027.9 (2040.6) MB, 10.2 / 3.8 ms  (average mu = 0.107, current mu = 0.016) allocation failure 
[1717966:0x362f180]   122656 ms: Scavenge 2028.8 (2040.6) -> 2028.0 (2040.6) MB, 10.1 / 3.7 ms  (average mu = 0.107, current mu = 0.016) allocation failure 
[1717966:0x362f180]   122729 ms: Scavenge 2028.9 (2040.6) -> 2028.0 (2040.6) MB, 10.0 / 3.7 ms  (average mu = 0.107, current mu = 0.016) allocation failure 


<--- JS stacktrace --->

==== JS stack trace =========================================

    0: ExitFrame [pc: 0x140a7f9]
Security context: 0x1829873c08d1 <JSObject>
    1: new constructor(aka Tree) [0x1879803bbb89] [/home/user/lodestar/node_modules/@chainsafe/persistent-merkle-tree/lib/tree.js:~13] [pc=0x3ab22a5e1dba](this=0x0dfa9e31b5a9 <Tree map = 0x2372219894e9>,0x0ad8d4252111 <Node map = 0x237221989499>,0x0dfa9e31b571 <JSFunction (sfi = 0x1ad02c97d7b9)>)
    2: ConstructFrame [pc: 0x138b8c0]
    3: getProperty [0x70c51b8...

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 0xa18150 node::Abort() [node]
 2: 0xa1855c node::OnFatalError(char const*, char const*) [node]
 3: 0xb9705e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xb973d9 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xd54655  [node]
 6: 0xd54ce6 v8::internal::Heap::RecomputeLimits(v8::internal::GarbageCollector) [node]
 7: 0xd615a5 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [node]
 8: 0xd62455 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 9: 0xd63aff v8::internal::Heap::HandleGCRequest() [node]
10: 0xd12565 v8::internal::StackGuard::HandleInterrupts() [node]
11: 0x106dba6 v8::internal::Runtime_StackGuard(int, unsigned long*, v8::internal::Isolate*) [node]
12: 0x140a7f9  [node]

Shall I open a new ticket or shall we stick to this one?

@twoeths (Contributor) commented Aug 12, 2021

@q9f recently we've been running the nodes with 6GB of heap memory:

NODE_OPTIONS: --max_old_space_size=6144

There's an ongoing effort to reduce memory usage now at #2885.

@q9f (Contributor) commented Aug 12, 2021

Thanks, that works well! (for now)
