1.5.0-rc1 with state_history_plugin gets stuck on kylin network #6377
I reproduced this issue 3 times. The 2nd and 3rd times I synced from the network, just in case replay was the issue. All 3 times, the issue starts with the same transaction: 6139bd32da0c523116efcbb2e8536787f68e2b9da364e40792598ea7e64f5815
Nov 24 05:14:20 testserver2 nodeos[1058]: info 2018-11-24T05:14:20.975 thread-0 producer_plugin.cpp:341 on_incoming_block ] Received block f33b817e6ba2063e... #19408000 @ 2018-11-07T06:06:04.500 signed by eosargentina [trxs:
I got it to this point, but am not hitting this issue.
I have attached my config.ini and Dockerfiles. I am building from the branch.
Which branch? release/1.5.x?
Yes, release/1.5.x.
Still unable to reproduce with Ubuntu Server 18.04. Here are the build steps I used from a fresh VM image (not Docker):
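For a fresh Ubuntu Server 18.04 VM, a from-scratch build of release/1.5.x with the stock build scripts looks roughly like this (an approximation, not necessarily the commenter's exact commands):

```sh
# Approximate build of release/1.5.x on a fresh Ubuntu Server 18.04 VM
# (not the original commenter's exact steps; script location may vary by release).
sudo apt-get update && sudo apt-get install -y git
git clone https://github.com/EOSIO/eos.git --recursive
cd eos
git checkout release/1.5.x
git submodule update --init --recursive
./eosio_build.sh          # builds dependencies (including LLVM) and nodeos
sudo ./eosio_install.sh   # installs nodeos, cleos, keosd
```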
I can't reproduce the environment exactly, since I don't have access to the base image (image.eosn.io/eosnation/builder:1.3). The builder image it references is our standard builder: https://github.com/EOSIO/eos/blob/master/Docker/builder/Dockerfile. Using that image with release/1.5.x, I wasn't able to reproduce this.
I will do some more testing to figure out exactly what the issue is here... I will report back when I have something to share.
I'm testing rc2 and still seeing the same behaviour. Next I will test without Docker... (I've tested replaying blocks.log and syncing from the network, no difference.)
I have sometimes seen this right before the error messages start:
There have been a couple of reports of this happening on mainnet.
Running this:
generates the following error on Ubuntu 18.10:
That's #6337. You need to cd into the /tmp/... folder where it's building LLVM, apply the fix, run make -j, then make -j install. After that, rerun eosio_build.sh.
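As a rough sketch of that workaround (the LLVM build directory is a placeholder; use whatever temporary directory the build script actually created):

```sh
# Sketch of the workaround described above; directory names are placeholders.
cd /tmp/<llvm-build-dir>        # wherever eosio_build.sh left the partially built LLVM
# ... apply the fix referenced in #6337 here ...
make -j"$(nproc)"
make -j"$(nproc)" install
# then rerun the top-level build
cd ~/eos && ./eosio_build.sh
```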
Is this related to #6495?
Different initial cause. Once the plugin detects a problem it won't continue; this is to prevent it from recording an incorrect history.
From Telegram: EOS Mainnet BPs, jem writes: I'm doing a state_history replay, and started seeing this around block 31.7M:
warn 2019-01-04T11:34:28.546 thread-0 controller.cpp:244 emit ] 3110000 plugin_exception: Plugin exception
Any clue what happened, or how to mitigate? I am replaying from a blocks.log file saved from a peer node, all irreversible blocks in the chain. There are no active peers during replay.
The "missed a block" check stops it from creating holes in the data after an initial failure. There should be an earlier error in the log.
That sounds like #6495.
jem added: I was using mainnet-1.5.1, and had the problem from the block referenced in the issue. Now running v1.6.0-rc2, replay started from a snapshot at block #31685000, and it pushed through that block without complaint.
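For reference, starting a replay from a snapshot looks roughly like this (paths are placeholders, not taken from jem's setup; it requires starting with an empty state directory):

```sh
# Rough sketch of starting nodeos from a snapshot; paths are placeholders.
# The state directory must be empty when starting from a snapshot.
nodeos --data-dir /srv/eos/data \
       --snapshot /srv/eos/snapshots/snapshot-31685000.bin \
       --disable-replay-opts
```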
Did 1.6.0-rc2 work on both mainnet and Kylin?
I believe jem got 1.6.0-rc2 to work on mainnet. (Well, he started with something older and then upgraded when it got stuck.) I still have no luck on Kylin... (I didn't try mainnet and am not planning to until I can get all the test networks to work reliably.)
(replay progress output: 19387700 ... 19388300 of 29430486)
Same issue here: 1.6.0-rc2 / EOS Mainnet / Ubuntu 18.04:
<4>warn 2019-01-17T21:46:57.013 thread-0 controller.cpp:235 emit ] 3110000 plugin_exception: Plugin exception
@ramtej The "missed a block" check stops it from creating holes in the data after an initial failure. There should be an earlier error in the log.
So do we have to replay from the start if we get this error? I got this on 1.5.x on Ubuntu 18.04 around block 30M somewhere. I'm upgrading to 1.6.x; can I re-use the existing data on that sync?
I'm always happy to see there are more victims of this issue, so it is not only me running into the problem. |
I'm starting to think you are the one causing them... Every time I bang my head on something, you already have a github issue open on it... 👍 |
@eosusa what are the specs on your machine or VM? |
Not impressive for our Jungle node :) 4 cores, 8 GB RAM, 500 GB (8 SATA drives in RAID 6). I can throw some additional resources at it if you think it's resource-related.
I never test below 32 GB RAM, so that opens a new possibility. Try bumping the RAM way up. If that's it, then it's possible something somewhere may be swallowing a bad_alloc.
I tend to test on 'poor' hardware as well. It's expensive to run a lot of beefy cloud servers just to do testing. |
@k26dr thanks for your reply, but I used v1.6 and found the same problem. I analyzed the code and found that the write failed because of several abnormal transactions. I modified the EOS code, and now everything works fine; the abnormal transactions are filtered out.
Bumped it to 12 cores & 22 GB RAM... anything more than that will have to wait until some Hyperion indexing finishes so I can steal resources from it sometime tomorrow. Dumped logs/state/state-history and kicked off with --delete-all-blocks --genesis-json --disable-replay-opts, and will see how it goes!
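For reference, those flags correspond to an invocation along these lines (directory and genesis paths are placeholders, not taken from the comment above):

```sh
# Rough reconstruction of the invocation described above; paths are placeholders.
nodeos --data-dir /srv/kylin/data --config-dir /srv/kylin/config \
       --delete-all-blocks \
       --genesis-json kylin-genesis.json \
       --disable-replay-opts
```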
No dice... it was running along smoothly and then seems to have hit the exact same range of blocks and tanked. Getting missing trace transactions again:
info 2019-03-21T11:21:51.205 thread-0 producer_plugin.cpp:345 on_incoming_block ] Received block ef93a26ecf8ec81a... #8856000 @ 2019-01-15T10:05:46.500 signed by eosphereiobp [trxs: 0, lib: 8855668, conf: 0, latency: 5620564705 ms]
I can give it more resources if needed, but it seems odd to fail at the same place in the chain with 3x the memory :)
I created a branch off of 1.7.0 to test the bad_alloc idea. Could you try running with it? Look for this message:
I forgot to give the branch name: |
Still no dice:
info 2019-03-21T16:56:18.557 thread-0 main.cpp:99 main ] nodeos version v1.7.0-1-gef64fe27a
info 2019-03-22T01:18:37.483 thread-0 producer_plugin.cpp:345 on_incoming_block ] Received block a844f129c27f525d... #8855000 @ 2019-01-15T09:57:00.000 signed by eosdacserval [trxs: 0, lib: 8854676, conf: 227, latency: 5671297483 ms]
Exact same place as last time.
So, is there anything else you want me to try, or should I roll back to 1.6.3 and see if it's any happier?
@arhag has an idea for test code for you to try. We'll try to get it ready for you to test on Monday.
@eosusa I just pushed more logging to the |
Same crash, much more info in the logs:
info 2019-03-28T02:54:17.889 thread-0 producer_plugin.cpp:345 on_incoming_block ] Received block ef93a26ecf8ec81a... #8856000 @ 2019-01-15T10:05:46.500 signed by eosphereiobp [trxs: 0, lib: 8855668, conf: 0, latency: 6194911389 ms]
warn 2019-03-28T02:54:18.625 thread-0 controller.cpp:234 emit ] fc::exception: 3110000 plugin_exception: Plugin exception
Bingo! That's the cause:
The reason I could never reproduce it: I don't run history_plugin while testing state_history_plugin.
So I believe the conclusion is that the following 2 options are not compatible with each other:
plugin = eosio::history_plugin
plugin = eosio::state_history_plugin
nodeos should fail to start if both plugins are enabled.
@hamza-kakar there should be an earlier error which triggered that.
Today I ran across someone else running history and state_history on the same nodeos instance. I referred them to this issue and advised against it. It would be great to make nodeos fatally stop when both plugins are loaded. |
Created a new issue that outlines the fix proposed by @matthewdarwin |
So I replayed the Kylin blocks from scratch using this config:
/opt/eosio/bin/nodeos --hard-replay --replay-blockchain --disable-replay-opts
plugin = eosio::state_history_plugin
state-history-dir = "state-history"
trace-history = true
chain-state-history = true
state-history-endpoint = 0.0.0.0:9090
[and a few other options that have been there for a long time]
Sometime after block 19406900 (not sure of the exact number), the software goes into an infinite loop:
Nov 22 15:37:56 peer1 nodeos[28294]: warn 2018-11-22T15:37:56.654 thread-0 controller.cpp:244 emit ] 3110000 plugin_exception: Plugin exception
Nov 22 15:37:56 peer1 nodeos[28294]: missing trace for transaction 6139bd32da0c523116efcbb2e8536787f68e2b9da364e40792598ea7e64f5815
Nov 22 15:37:56 peer1 nodeos[28294]: {"id":"6139bd32da0c523116efcbb2e8536787f68e2b9da364e40792598ea7e64f5815"}
Nov 22 15:37:56 peer1 nodeos[28294]: thread-0 state_history_plugin.cpp:385 store_traces
Nov 22 15:37:56 peer1 nodeos[28294]: warn 2018-11-22T15:37:56.659 thread-0 controller.cpp:244 emit ] 3110000 plugin_exception: Plugin exception
Nov 22 15:37:56 peer1 nodeos[28294]: missed a block in trace_history.log
Nov 22 15:37:56 peer1 nodeos[28294]: {"name":"trace_history"}
Nov 22 15:37:56 peer1 nodeos[28294]: thread-0 state_history_log.hpp:84 write_entry
Nov 22 15:37:56 peer1 nodeos[28294]: warn 2018-11-22T15:37:56.661 thread-0 controller.cpp:244 emit ] 3110000 plugin_exception: Plugin exception
Nov 22 15:37:56 peer1 nodeos[28294]: missed a block in trace_history.log
Nov 22 15:37:56 peer1 nodeos[28294]: {"name":"trace_history"}
Nov 22 15:37:56 peer1 nodeos[28294]: thread-0 state_history_log.hpp:84 write_entry
Nov 22 15:37:56 peer1 nodeos[28294]: warn 2018-11-22T15:37:56.662 thread-0 controller.cpp:244 emit ] 3110000 plugin_exception: Plugin exception
Nov 22 15:37:56 peer1 nodeos[28294]: missed a block in trace_history.log
Nov 22 15:37:56 peer1 nodeos[28294]: {"name":"trace_history"}
Nov 22 15:37:56 peer1 nodeos[28294]: thread-0 state_history_log.hpp:84 write_entry
etc...