Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Processes are killed with segfault on high load #38

Open
cebe opened this issue Jan 5, 2018 · 20 comments
Open

Processes are killed with segfault on high load #38

cebe opened this issue Jan 5, 2018 · 20 comments

Comments

@cebe
Copy link

cebe commented Jan 5, 2018

When a lot of requests are made against the RPC API, multichain processes regularily fail with segfault.

Here is what I see in the systemd journal:

Jan 06 00:06:15  systemd[1]: multichain@9201.service: Main process exited, code=killed, status=11/SEGV
Jan 06 00:06:15  systemd[1]: multichain@9201.service: Unit entered failed state.
Jan 06 00:06:15  systemd[1]: multichain@9201.service: Failed with result 'signal'.

I am seeing this error regularily coming up in times where a lot of requests are made against the RPC API, when the amount of requests was low it did not happen before.

Multichain Version: 1.0.2, I am using the binaries from the website.

To provide more information about the problem I'd compile multichain myself and run it with gdb, I wonder however, which branch I should use, as this repo does not seem to have any tags?

@cebe
Copy link
Author

cebe commented Jan 6, 2018

Another case:

Jan 06 01:11:55  multichaind[13532]: *** Error in `/opt/multichain/bin/multichaind': double free or corruption (out): 0x00007fa09c002060 ***
Jan 06 01:11:55  multichaind[13532]: ======= Backtrace: =========
Jan 06 01:11:55  multichaind[13532]: /lib/x86_64-linux-gnu/libc.so.6(+0x70bcb)[0x7fa0e5885bcb]
Jan 06 01:11:55  multichaind[13532]: /lib/x86_64-linux-gnu/libc.so.6(+0x76f96)[0x7fa0e588bf96]
Jan 06 01:11:55  multichaind[13532]: /lib/x86_64-linux-gnu/libc.so.6(+0x777de)[0x7fa0e588c7de]
Jan 06 01:11:55  multichaind[13532]: /opt/multichain/bin/multichaind[0x4d0120]
Jan 06 01:11:55  multichaind[13532]: /opt/multichain/bin/multichaind[0x4adb9b]
Jan 06 01:11:55  multichaind[13532]: /opt/multichain/bin/multichaind[0x53717c]
Jan 06 01:11:55  multichaind[13532]: /opt/multichain/bin/multichaind[0x51b075]
Jan 06 01:11:55  multichaind[13532]: /opt/multichain/bin/multichaind[0x5354d5]
Jan 06 01:11:55  multichaind[13532]: /opt/multichain/bin/multichaind[0x852474]
Jan 06 01:11:55  multichaind[13532]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7494)[0x7fa0e5bbb494]
Jan 06 01:11:55  multichaind[13532]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fa0e58fdaff]
Jan 06 01:11:55  multichaind[13532]: ======= Memory map: ========
Jan 06 01:11:55  multichaind[13532]: 00400000-00cc6000 r-xp 00000000 09:02 168298965                          /opt/multichain/bin/multichaind
Jan 06 01:11:55  multichaind[13532]: 00ec6000-00ef4000 r--p 008c6000 09:02 168298965                          /opt/multichain/bin/multichaind
Jan 06 01:11:55  multichaind[13532]: 00ef4000-00f07000 rw-p 008f4000 09:02 168298965                          /opt/multichain/bin/multichaind
Jan 06 01:11:55  multichaind[13532]: 00f07000-00f12000 rw-p 00000000 00:00 0
Jan 06 01:11:55  multichaind[13532]: 01fb9000-02e35000 rw-p 00000000 00:00 0                                  [heap]
Jan 06 01:11:55  multichaind[13532]: 02e35000-02e36000 rw-p 00000000 00:00 0                                  [heap]
Jan 06 01:11:55  multichaind[13532]: 02e36000-02e37000 rw-p 00000000 00:00 0                                  [heap]
Jan 06 01:11:55  multichaind[13532]: 02e37000-02e38000 rw-p 00000000 00:00 0                                  [heap]
Jan 06 01:11:55  multichaind[13532]: 02e38000-03765000 rw-p 00000000 00:00 0                                  [heap]
Jan 06 01:11:55  multichaind[13532]: 7fa064000000-7fa064793000 rw-p 00000000 00:00 0
Jan 06 01:11:55  multichaind[13532]: 7fa064793000-7fa068000000 ---p 00000000 00:00 0
Jan 06 01:11:55  multichaind[13532]: 7fa068000000-7fa068794000 rw-p 00000000 00:00 0
Jan 06 01:11:55  multichaind[13532]: 7fa068794000-7fa06c000000 ---p 00000000 00:00 0
Jan 06 01:11:55  multichaind[13532]: 7fa06c000000-7fa06c7a9000 rw-p 00000000 00:00 0
Jan 06 01:11:55  multichaind[13532]: 7fa06c7a9000-7fa070000000 ---p 00000000 00:00 0
Jan 06 01:11:55  multichaind[13532]: 7fa070000000-7fa070797000 rw-p 00000000 00:00 0
Jan 06 01:11:55  multichaind[13532]: 7fa070797000-7fa074000000 ---p 00000000 00:00 0
Jan 06 01:11:55  multichaind[13532]: 7fa074000000-7fa074796000 rw-p 00000000 00:00 0
Jan 06 01:11:55  systemd[1]: multichain@9201.service: Main process exited, code=killed, status=6/ABRT
Jan 06 01:11:55  systemd[1]: multichain@9201.service: Unit entered failed state.
Jan 06 01:11:55  systemd[1]: multichain@9201.service: Failed with result 'signal'.

@gidgreen
Copy link
Contributor

gidgreen commented Jan 6, 2018

Thanks for reporting this. The branch you want is 1.0.x-release. Can you also let us know please which APIs you were calling since we (and many of our partners) have done extensive load testing of MultiChain and not seen this phenomenon. It may be a bug caused by the particular APIs you are calling.

@cebe
Copy link
Author

cebe commented Jan 7, 2018

The APIs frequently called in this scenario are:

  • getassettransaction (node is subscribed to the asset)
  • getaddressbalances
  • listaddresstransactions

@cebe
Copy link
Author

cebe commented Jan 7, 2018

Seems its the getassettransaction call which causes the segfault. For certain transaction ids it fails on every request.

@cebe
Copy link
Author

cebe commented Jan 7, 2018

It also seems to work fine when calling the API via multichain-cli but it fails with another HTTP client. The request that causes the node to segfault:

POST / HTTP/1.0
Host: 127.0.0.1:10201
User-Agent: React/alpha
Content-Type: application/json
Content-Length: 171
Authorization: Basic ...

{"id":"bo5a51722d06a14","method":"getassettransaction","params":["testcoin","7c2d46bb91221374b2ae09f565f984b952a2451b1494ec30f86660804bb08632",true],"chain_name":"testcoin"}

(I have replaced the HTTP basic auth header with ...)

edit: does not seem to be related, depends on the txid.

@gidgreen
Copy link
Contributor

gidgreen commented Jan 7, 2018

Thanks, that's very helpful. We'll take a look at getassettransaction. If we can't find the problem from the source code we may ask for a copy of your blockchain files. So if you're willing, please be sure to back up the directory for this chain for us while the problem persists.

@gidgreen
Copy link
Contributor

Hi @cebe we believe we've fixed this problem (specifically with getassettransaction) and have created a build with the fix. If you are willing could you please download and test it:

https://www.multichain.com/download/multichain-1.0.2.1.tar.gz

Thank you in advance.

@cebe
Copy link
Author

cebe commented Jan 16, 2018

Hi @gidgreen, thanks for the update. I have installed your updated version but it still crashes on getassettransaction calls. I'd now try to compile multichain myself and see if I find the cause with gdb. Could you please push the latest changes?

@mike31
Copy link
Contributor

mike31 commented Jan 17, 2018

Hi @cebe. The updated branch is 1.0.x-dev (it is not official release).

If this happens always on some specific transaction, is there any chance you can share your chain directory with us?

@cebe
Copy link
Author

cebe commented Jan 17, 2018

The 1.0.x branches are updated only 5 month ago: https://github.com/MultiChain/multichain/branches ?

The problem happens on the production blockchain, I will try to reproduce this on a dev chain.

@mike31
Copy link
Contributor

mike31 commented Jan 17, 2018

1.0.x-dev is the branch. Simply, last commit is 5-month old (cherry-picked from another branch)

Can you confirm that it always happens to specific txid? If yes, can you please check what happens if you call

  1. getassettransaction your-asset txid
  2. getassettransaction some-other-asset-maybe-unsubscribed txid
  3. getwallettransaction txid
  4. getrawtransaction txid

@cebe
Copy link
Author

cebe commented Jan 22, 2018

  1. getassettransaction testcoin 7c2d46bb91221374b2ae09f565f984b952a2451b1494ec30f86660804bb08632

    {"method":"getassettransaction","params":["testcoin","7c2d46bb91221374b2ae09f565f984b952a2451b1494ec30f86660804bb08632"],"id":1,"chain_name":"testcoin"}
    
    error: no response from server
    

    systemd journal:

    Jan 22 14:36:59 htzn022 systemd[1]: multichain-testcoin@9201.service: Main process exited, code=killed, status=11/SEGV
    Jan 22 14:36:59 htzn022 systemd[1]: multichain-testcoin@9201.service: Unit entered failed state.
    Jan 22 14:36:59 htzn022 systemd[1]: multichain-testcoin@9201.service: Failed with result 'signal'.
    Jan 22 14:37:02 htzn022 systemd[1]: multichain-testcoin@9201.service: Service hold-off time over, scheduling restart.
    Jan 22 14:37:02 htzn022 systemd[1]: Stopped Multichain Blockchain for testcoin, node 9201.
    Jan 22 14:37:02 htzn022 systemd[1]: Started Multichain Blockchain for testcoin, node 9201.
    Jan 22 14:37:02 htzn022 multichaind[12950]: Chain testcoin already exists, adding 127.0.0.1:9200 to list of peers
    Jan 22 14:37:02 htzn022 multichaind[12950]: Other nodes can connect to this node using:
    Jan 22 14:37:02 htzn022 multichaind[12950]: multichaind testcoin@...:9201
    Jan 22 14:37:02 htzn022 multichaind[12950]: Node ready.
    
  2. we only have one asset in the chain. When I try with an unexisting asset, it returns

    error code: -708
    error message:
    Asset with this name not found: alpcointest
    
  3. getwallettransaction 7c2d46bb91221374b2ae09f565f984b952a2451b1494ec30f86660804bb08632

    error code: -710
    error message:
    Wallet addresses with specified criteria are not involved in transaction
    
  4. getrawtransaction 7c2d46bb91221374b2ae09f565f984b952a2451b1494ec30f86660804bb08632

    0100000001948dd31a4abf659ced974ae3da7f9aff49f65b875add2510ec3712e5ceaa8af2020000006b4830450221009ee1358ea8829ff983ee012f6fe9a5b45ed45fd57ffa9d145b5c41f4234d3d3402200395386e2c6b009939546e5b7954003d57655ea0adc5c2d2ca74087e7d4e8484012103d169eaeba95c8ed00ede3c445a969922d86c8b6c7bdddccd4825cd67a1a043a7ffffffff0300000000000000003776a9147266fbcb517a69e7442d90b17b0bf8160ae92bbd88ac1c73706b6f7bf0f2fcf51d6fab6791082bf454763ddd3d1f0a0c000000750000000000000000911473706b657bf0f2fcf51d6fab6791082bf454763d754c7773706b75016465736372697074696f6e001d65786368616e67652066726f6d20416c70636f696e20436c6173736963636c61737369635f74786964002a54583635303532313734376336383836356337323338303834326233333239633739346237656230323365786368616e67655f72617465000135756a00000000000000001976a914647515d4b3b315fbc2deb369a0b5c6148232efaa88ac00000000
    

@cebe
Copy link
Author

cebe commented Jan 22, 2018

While trying to debug the problem, I wanted to compile my own multichain binary, however I am unable to compile multichain on debian stable, which system are you using for compiling the binaries?

@cebe
Copy link
Author

cebe commented Jan 22, 2018

If I create a new node and let it subscribe to the asset, I can get the transaction information (same transaction as in #38 (comment) ) without crashing.

@cebe
Copy link
Author

cebe commented Jan 22, 2018

If you could provide a binary with debugging symbols I could run it in gdb and provide a backtrace of the error.

@gidgreen
Copy link
Contributor

Here's a version with debug symbols:

https://www.multichain.com/download/multichain-1.0.2.1-wdebug.tar.gz

@cebe
Copy link
Author

cebe commented Jan 23, 2018

Thanks for providing the debug version. Using this binary the nodes get segfaults directly after start:

Starting program: /opt/multichain/alpcointest/bin/multichaind-wdebug alpcointest@127.0.0.1:20200 -datadir=/var/lib/multichain/alpcointest/node-20401
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

MultiChain 1.0.2.1 Daemon (latest protocol 10009)

[New Thread 0x7ffff6654700 (LWP 12459)]
[New Thread 0x7ffff5e53700 (LWP 12460)]
[New Thread 0x7ffff5652700 (LWP 12461)]
[New Thread 0x7ffff4e51700 (LWP 12462)]
[New Thread 0x7fffe7fff700 (LWP 12463)]
[New Thread 0x7fffdffff700 (LWP 12464)]
[New Thread 0x7fffe77fe700 (LWP 12465)]
[New Thread 0x7fffe6ffd700 (LWP 12466)]
[New Thread 0x7fffe67fc700 (LWP 12467)]
[New Thread 0x7fffe5ffb700 (LWP 12468)]
[New Thread 0x7fffe57fa700 (LWP 12469)]
[New Thread 0x7fffe4ff9700 (LWP 12470)]
Chain alpcointest already exists, adding 127.0.0.1:20200 to list of peers

[New Thread 0x7fffdf27d700 (LWP 12471)]
[New Thread 0x7fffdea7c700 (LWP 12472)]
[New Thread 0x7fffde27b700 (LWP 12473)]
[New Thread 0x7fffdda7a700 (LWP 12474)]
[New Thread 0x7fffdd279700 (LWP 12475)]
[New Thread 0x7fffdca78700 (LWP 12476)]
[New Thread 0x7fffbffff700 (LWP 12477)]
[New Thread 0x7fffbf7fe700 (LWP 12478)]
Other nodes can connect to this node using:
multichaind alpcointest@...:20401

[New Thread 0x7fffbeffd700 (LWP 12479)]
[Thread 0x7fffbeffd700 (LWP 12479) exited]
[New Thread 0x7fffbe7fc700 (LWP 12480)]
[New Thread 0x7fffbdffb700 (LWP 12481)]
[Thread 0x7fffbe7fc700 (LWP 12480) exited]
[New Thread 0x7fffbd7fa700 (LWP 12482)]
[New Thread 0x7fffbcff9700 (LWP 12483)]
[New Thread 0x7fffaffff700 (LWP 12484)]
[New Thread 0x7fffa7fff700 (LWP 12485)]
Node ready.


Thread 27 "bitcoin-msghand" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffaffff700 (LWP 12484)]
CPubKey::GetID (this=0x388) at ./keys/pubkey.h:147
147	./keys/pubkey.h: No such file or directory.
(gdb) backtrace
#0  CPubKey::GetID (this=0x388) at ./keys/pubkey.h:147
#1  0x0000000000742c59 in CWallet::GetKeyFromAddressBook (this=<optimized out>, result=..., 
    type=<optimized out>, addresses=<optimized out>, mapSpecialEntity=<optimized out>)
    at wallet/wallet.cpp:3462
#2  0x00000000004fc9ff in PushMultiChainVerack (pfrom=0x7fffb4000c90, fIsVerackack=<optimized out>)
    at protocol/handshake.cpp:486
#3  0x00000000004a432e in ProcessMessage (pfrom=<optimized out>, strCommand=..., vRecv=..., 
    nTimeReceived=<optimized out>) at core/main.cpp:5594
#4  0x00000000004ae082 in ProcessMessages (pfrom=0x7fffb4000c90) at core/main.cpp:6788
#5  0x0000000000537b4c in boost::signals2::detail::signal1_impl<bool, CNode*, boost::signals2::optional_last_value<bool>, int, std::less<int>, boost::function<bool (CNode*)>, boost::function<bool (boost::signals2::connection const&, CNode*)>, boost::signals2::mutex>::operator()(CNode*) ()
#6  0x000000000051ba45 in boost::signals2::signal1<bool, CNode*, boost::signals2::optional_last_value<bool>, int, std::less<int>, boost::function<bool (CNode*)>, boost::function<bool (boost::signals2::connection const&, CNode*)>, boost::signals2::mutex>::operator()(CNode*) (arg1=<optimized out>, 
    this=<optimized out>) at /usr/local/include/boost/signals2/detail/signal_template.hpp:718
#7  ThreadMessageHandler () at net/net.cpp:1643
#8  0x0000000000535ea5 in void TraceThread<void (*)()>(char const*, void (*)()) ()
#9  0x0000000000852354 in thread_proxy ()
#10 0x00007ffff6d16494 in start_thread (arg=0x7fffaffff700) at pthread_create.c:333
#11 0x00007ffff6a58aff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

This seems unrelated to the original problem.

I have attached the blockchain where this happens: problem-chain.zip

Started the nodes like this:

/opt/multichain/multichaind alpcointest -datadir="/var/lib/multichain/alpcointest/master"
gdb --args /opt/multichain/multichaind-wdebug alpcointest@127.0.0.1:20200 -datadir="/var/lib/multichain/alpcointest/node-20401"
(gdb) run

@mike31
Copy link
Contributor

mike31 commented Jan 23, 2018

Hi @cebe.

Yes, it is completely unrelated. Probably it is related to debug symbols or gdb. We have no problem with these binaries on Ubuntu.

Let's try previous approach (calling APIs). The outputs of the APIs you provided earlier significantly narrowed the list of suspects, thanks. The current situation seems very unlikely - it looks like the database on your node is corrupted (may be because of the issue we already fixed). If so, we should decide how to repair it.

Can you please send us the output of:

getrawtransaction 7c2d46bb91221374b2ae09f565f984b952a2451b1494ec30f86660804bb08632 1

getrawtransaction f28aaacee51237ec1025dd5a875bf649ff9a7fdae34a97ed9c65bf4a1ad38d94 1

getwallettransaction f28aaacee51237ec1025dd5a875bf649ff9a7fdae34a97ed9c65bf4a1ad38d94 true true

getassettransaction testcoin f28aaacee51237ec1025dd5a875bf649ff9a7fdae34a97ed9c65bf4a1ad38d94

getblockcount

and, if possible

listassets * 1

listaddresses 1GToD6orU6gPgPRv9RDuHrQtkPucBzBkPHHVh6,1EaV1JMskeUNv2TPPq7KGxwCeTGrJx1GbDvEeh true

Please confirm also that you are able to call getassettransaction for other txids and this asset.

@cebe
Copy link
Author

cebe commented Jan 24, 2018

The current situation seems very unlikely - it looks like the database on your node is corrupted (may be because of the issue we already fixed).

I can confirm that I have no issues with fresh databases created with version 1.0.2.1. So it is very likely you already fixed the problem. Do you want to reproduce the problem on 1.0.2 to figure out how a broken database could be fixed? For our case it would be enough to recreate database of affected nodes, so there is no need for further debugging. If you want more information I can still provide them, let me know. Thank you for your help!

Block count is 19805 btw.

@mike31
Copy link
Contributor

mike31 commented Jan 24, 2018

Thank you for your message.

If you can provide the outputs of the APis I asked (before recovery), it would be great. May be we'll be able to guess better what exactly is broken, how and why.

As for repairing you node's database. Here is the procedure:

  1. Backup entire chain directory (/var/lib/multichain/alpcointest/node-20401/alpcointest ?). This step is not necessary, but if we decide to make further debugging it is important.
  2. Restart multichaind with -rescan runtime parameter. It will take some time to start up. I suppose it will be enough to resolve the issue.
  3. If your multichaind still crashes - restart multichaind with -reindex=1.
  4. If you still see the problem - we'll probably have to make further investigation (and we'll need backed up directory from 1)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants