
v0.0.0.62 network unreliability, node goes unresponsive/offline following socket timeouts #139

Closed
walkjivefly opened this issue Oct 22, 2022 · 10 comments
Labels: bug, Review/Close

@walkjivefly (Member)

Nodes sometimes become unresponsive after a run of socket timeouts.

Expected behavior
Nodes should stay up and keep responding.

Actual behavior
Nodes sometimes become unresponsive after a run of socket timeouts.

To reproduce
Start a node and wait; the failure occurs spontaneously.

System information
v0.0.0.62

debug.log ends with

2022-10-22T13:22:10Z connect() to 194.163.189.111:18333 failed after select(): Connection refused (111)
2022-10-22T13:22:10Z connect() to 194.163.189.109:18333 failed after select(): Connection refused (111)
2022-10-22T13:22:11Z socket sending timeout: 1201s
2022-10-22T13:22:13Z socket sending timeout: 1201s
2022-10-22T13:22:13Z socket sending timeout: 1201s
2022-10-22T13:22:13Z socket sending timeout: 1201s

The node is still running but not talking to anything; it won't even answer RPC requests from crown-cli.

gdb shows

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f70ccfdc23f in __GI___clock_nanosleep (clock_id=clock_id@entry=0, 
    flags=flags@entry=0, req=0x7ffc0a67cb70, rem=0x7ffc0a67cb70)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
78	../sysdeps/unix/sysv/linux/clock_nanosleep.c: No such file or directory.
(gdb) info thread
  Id   Target Id                                           Frame 
* 1    Thread 0x7f70ccefa780 (LWP 250346) "crownd"         0x00007f70ccfdc23f in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, 
    req=0x7ffc0a67cb70, rem=0x7ffc0a67cb70)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
  2    Thread 0x7f70caad0700 (LWP 250348) "b-scriptch.0"   futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55c6154bb1d0 <scriptcheckqueue+80>)
    at ../sysdeps/nptl/futex-internal.h:183
  3    Thread 0x7f70ca2cf700 (LWP 250349) "b-crown-minter" __lll_lock_wait (
    futex=futex@entry=0x55c6154bc510 <budget+48>, private=0)
    at lowlevellock.c:52
  4    Thread 0x7f70c9ace700 (LWP 250350) "b-http"         0x00007f70cd01e46e in epoll_wait (epfd=5, events=0x55c615dd2160, maxevents=32, timeout=-1)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  5    Thread 0x7f70c92cd700 (LWP 250351) "b-httpworker.0" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c615dc0a64)
    at ../sysdeps/nptl/futex-internal.h:183
  6    Thread 0x7f70c8acc700 (LWP 250352) "b-httpworker.1" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c615dc0a60)
    at ../sysdeps/nptl/futex-internal.h:183
  7    Thread 0x7f70bbfff700 (LWP 250353) "b-httpworker.2" __lll_lock_wait (
    futex=futex@entry=0x55c6154b0300 <cs_main>, private=0) at lowlevellock.c:52
  8    Thread 0x7f70b37fe700 (LWP 250354) "b-httpworker.3" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55c615dc0a60)
    at ../sysdeps/nptl/futex-internal.h:183
  9    Thread 0x7f70bb7fe700 (LWP 250355) "crownd"         futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55c6154c2138 <leveldb::Env::Default()::env_container+88>)
    at ../sysdeps/nptl/futex-internal.h:183
  10   Thread 0x7f70b9631700 (LWP 250358) "b-torcontrol"   0x00007f70cd01e46e in epoll_wait (epfd=36, events=0x55c617f2e520, maxevents=32, timeout=2100000)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  11   Thread 0x7f70b8e30700 (LWP 250359) "b-net"          0x00007f70cd01199f in __GI___poll (fds=0x7f709404ee10, nfds=3, timeout=50)
    at ../sysdeps/unix/sysv/linux/poll.c:29
  12   Thread 0x7f70b2ffd700 (LWP 250361) "b-addcon"       __lll_lock_wait (
    futex=futex@entry=0x55c6154b0300 <cs_main>, private=0) at lowlevellock.c:52
  13   Thread 0x7f70b27fc700 (LWP 250362) "b-mncon"        futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f70b27fb970, 
    clockid=<optimized out>, expected=0, futex_word=0x55c615ea8410)
    at ../sysdeps/nptl/futex-internal.h:320
  14   Thread 0x7f70b1ffb700 (LWP 250363) "b-opencon"      futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f70b1ffa270, 
    clockid=<optimized out>, expected=0, futex_word=0x55c615ea8410)
    at ../sysdeps/nptl/futex-internal.h:320
  15   Thread 0x7f70b17fa700 (LWP 250364) "b-msghand"      __lll_lock_wait (
    futex=futex@entry=0x55c6154b0300 <cs_main>, private=0) at lowlevellock.c:52
@walkjivefly (Member Author)

It happened again to two nodes overnight; two other nodes hosted on the same infrastructure stayed up and ran normally.

@walkjivefly (Member Author)

It has happened to four nodes again since yesterday.
The first debug.log ends with

crown@crown-testnet-01:~/.crown/testnet3$ tail debug.log
2022-10-24T12:04:46Z connect() to 194.163.189.109:18333 failed after select(): Connection refused (111)
2022-10-24T12:05:47Z connect() to 194.163.189.106:18333 failed after select(): Connection refused (111)
2022-10-24T12:05:47Z connect() to 194.163.189.111:18333 failed after select(): Connection refused (111)
2022-10-24T12:05:48Z connect() to 194.163.189.109:18333 failed after select(): Connection refused (111)
2022-10-24T12:06:04Z socket receive timeout: 1201s
2022-10-24T12:06:07Z socket send error Connection reset by peer (104)
2022-10-24T12:06:07Z socket sending timeout: 1201s
2022-10-24T12:14:41Z socket sending timeout: 1201s
2022-10-24T12:14:41Z socket sending timeout: 1201s
2022-10-24T12:14:41Z socket sending timeout: 1201s

crown@crown-testnet-01:~/.crown/testnet3$ crown-cli getmasternodestatus
{
  "txid": "91553b579693099b6e552510b0276048c042dbf90bddcceafbfaf068ebb2d6d2",
  "outputidx": 1,
  "netaddr": "92.60.45.236:18333",
  "addr": "n2xTMYnpcvohenNq1vrsS5qYQXQYi69fMA",
  "status": "Masternode successfully started"
}

This node managed to answer one RPC but locked up on the next one.
gdb shows

(gdb) info thread
  Id   Target Id                                           Frame 
* 1    Thread 0x7f184116c780 (LWP 357607) "crownd"         0x00007f184124e23f in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, 
    req=0x7ffd04a57020, rem=0x7ffd04a57020)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
  2    Thread 0x7f183ed42700 (LWP 357608) "b-scriptch.0"   futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55d11506e1d0 <scriptcheckqueue+80>)
    at ../sysdeps/nptl/futex-internal.h:183
  3    Thread 0x7f183e541700 (LWP 357609) "b-crown-minter" __lll_lock_wait (
    futex=futex@entry=0x55d11506f510 <budget+48>, private=0)
    at lowlevellock.c:52
  4    Thread 0x7f183dd40700 (LWP 357610) "b-http"         0x00007f184129046e in epoll_wait (epfd=5, events=0x55d1156cb900, maxevents=32, timeout=-1)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  5    Thread 0x7f183d53f700 (LWP 357611) "b-httpworker.0" __lll_lock_wait (
    futex=futex@entry=0x55d115063300 <cs_main>, private=0) at lowlevellock.c:52
  6    Thread 0x7f183cd3e700 (LWP 357612) "b-httpworker.1" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55d11552f854)
    at ../sysdeps/nptl/futex-internal.h:183
  7    Thread 0x7f182ffff700 (LWP 357613) "b-httpworker.2" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55d11552f854)
    at ../sysdeps/nptl/futex-internal.h:183
  8    Thread 0x7f182f7fe700 (LWP 357614) "b-httpworker.3" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55d11552f854)
    at ../sysdeps/nptl/futex-internal.h:183
  9    Thread 0x7f182dffb700 (LWP 357617) "b-torcontrol"   0x00007f184129046e in epoll_wait (epfd=34, events=0x55d117267e80, maxevents=32, timeout=2100000)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  10   Thread 0x7f182d7fa700 (LWP 357618) "b-net"          0x00007f184128399f in __GI___poll (fds=0x7f181001e650, nfds=3, timeout=50)
    at ../sysdeps/unix/sysv/linux/poll.c:29
  11   Thread 0x7f181bfff700 (LWP 357620) "b-addcon"       __lll_lock_wait (
    futex=futex@entry=0x55d115063300 <cs_main>, private=0) at lowlevellock.c:52
  12   Thread 0x7f181b7fe700 (LWP 357621) "b-mncon"        futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f181b7fd970, 
    clockid=<optimized out>, expected=0, futex_word=0x55d115737fd0)
    at ../sysdeps/nptl/futex-internal.h:320
  13   Thread 0x7f181affd700 (LWP 357622) "b-opencon"      futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f181affc270, 
    clockid=<optimized out>, expected=0, futex_word=0x55d115737fd0)
    at ../sysdeps/nptl/futex-internal.h:320
  14   Thread 0x7f181a7fc700 (LWP 357623) "b-msghand"      __pthread_mutex_unlock_usercnt (decr=1, mutex=0x55d115066d20 <mnodeman>)
    at pthread_mutex_unlock.c:77

Second debug.log ends with

crown@crown-testnet-02:~/.crown/testnet3$ tail debug.log
2022-10-24T14:51:39Z CBudgetManager::Sync - sent 0 items
2022-10-24T14:51:39Z CBudgetManager::Sync - sent 4 items
2022-10-24T14:51:39Z CBudgetManager::Sync - sent 0 items
2022-10-24T14:51:39Z CBudgetManager::CheckAndRemove
2022-10-24T14:51:39Z CBudgetManager::CheckAndRemove - mapBudgetDrafts cleanup - size: 9
2022-10-24T15:11:40Z socket sending timeout: 1201s
2022-10-24T15:11:40Z socket sending timeout: 1201s
2022-10-24T15:11:40Z socket sending timeout: 1201s
2022-10-24T15:11:40Z socket sending timeout: 1201s
2022-10-24T15:11:40Z socket sending timeout: 1201s

Third ends with

2022-10-24T20:01:25Z NodeMinter: Attempting to stake..
2022-10-24T20:01:55Z CreateCoinStake: Found valid kernel for mn/sn collateral COutPoint(4f5f0861da4aac43768059d9201bc001af26d4f76662374e6abddce2a53969e3, 1)
2022-10-24T20:01:55Z CreateCoinStake: OutPoint: 9eded0b2d9cc9a38c302f9d7c3359db4730109011be3b9abd5a30b3113a4263b:2 Modifier=9e1533ef4c832e2e09577519fe48d93ff8a9348a240f578bf3ce6cd7b2c32fa1 timeblockfrom=1666479520 time=1666641699 amount=50000000000
2022-10-24T20:01:55Z CreateNewBlock: vtx size 2
2022-10-24T20:01:55Z CreateNewBlock(): block weight: 4000 txs: 0 fees: 0 sigops 400
2022-10-24T20:01:56Z CheckProofOfStake : OutPoint: f9f91876e014b9614ab4ae425a1891b1fc361a0c56c440ad335e5eb319d03deb:2 Modifier=680daa7c8c795c4acd66c2bc0cb4be86ffce18007dc0086a129fef89e8b72cee timeblockfrom=1666606223 time=1666641695 amount=50000000000
2022-10-24T20:01:56Z CheckProofOfStake : OutPoint: 9eded0b2d9cc9a38c302f9d7c3359db4730109011be3b9abd5a30b3113a4263b:2 Modifier=9e1533ef4c832e2e09577519fe48d93ff8a9348a240f578bf3ce6cd7b2c32fa1 timeblockfrom=1666479520 time=1666641699 amount=50000000000
2022-10-24T20:21:21Z socket sending timeout: 1201s
2022-10-24T20:21:56Z socket sending timeout: 1201s
2022-10-24T20:21:56Z socket sending timeout: 1201s

and the fourth with

2022-10-24T21:43:23Z CreateNewBlock: vtx size 2
2022-10-24T21:43:23Z CreateNewBlock(): block weight: 4000 txs: 0 fees: 0 sigops 400
2022-10-24T21:43:23Z ProcessGetDataMasternodeTypes sn winner 1b93071eda6a28b6758a07697a147cc46466548558e11a24815d44170ee0d259
2022-10-24T21:43:23Z CheckProofOfStake : OutPoint: 76a8cb451edda1f457ca5d70d593dd612d8b716707bcc4d1b59f8bd776c796de:1 Modifier=137fc0e972963fc3e9749fa0dd604a1767e3c12c1e50183dedef8c03e423c6c0 timeblockfrom=1666455988 time=1666647778 amount=1000000000000
2022-10-24T21:43:23Z CheckProofOfStake : OutPoint: 782727c1a8611d85f122916746745f2d0a9494a41df1d2699248b8d33ac533f1:2 Modifier=bce955aea2bdb53c1133206afba18545b420a0f04509fa3dba33709b0c50e076 timeblockfrom=1666408653 time=1666647796 amount=50000000000
2022-10-24T22:03:24Z socket sending timeout: 1201s
2022-10-24T22:03:24Z socket sending timeout: 1201s
2022-10-24T22:03:24Z socket sending timeout: 1201s
2022-10-25T08:05:44Z [testnet] keypool reserve 4
2022-10-25T08:05:44Z [testnet] keypool keep 4

This one accepted a createcontract RPC but neither timed out nor completed the request.
gdb shows it stuck with

(gdb) info thread
  Id   Target Id                                           Frame 
* 1    Thread 0x7f08ce7c4780 (LWP 710519) "crownd"         0x00007f08ce8a623f in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, 
    req=0x7ffde9cb79b0, rem=0x7ffde9cb79b0)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
  2    Thread 0x7f08cc39a700 (LWP 710520) "b-scriptch.0"   futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55f1f411e1d4 <scriptcheckqueue+84>)
    at ../sysdeps/nptl/futex-internal.h:183
  3    Thread 0x7f08cbb99700 (LWP 710521) "b-crown-minter" __lll_lock_wait (
    futex=futex@entry=0x55f1f411cd80 <snodeman>, private=0)
    at lowlevellock.c:52
  4    Thread 0x7f08cb398700 (LWP 710522) "b-http"         0x00007f08ce8e846e in epoll_wait (epfd=5, events=0x55f1f47e7d20, maxevents=32, timeout=-1)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  5    Thread 0x7f08cab97700 (LWP 710523) "b-httpworker.0" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55f1f47673f4)
    at ../sysdeps/nptl/futex-internal.h:183
  6    Thread 0x7f08ca396700 (LWP 710524) "b-httpworker.1" __lll_lock_wait (
    futex=futex@entry=0x55f1f4113300 <cs_main>, private=0) at lowlevellock.c:52
  7    Thread 0x7f08c9b95700 (LWP 710525) "b-httpworker.2" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55f1f47673f0)
    at ../sysdeps/nptl/futex-internal.h:183
  8    Thread 0x7f08c9394700 (LWP 710526) "b-httpworker.3" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55f1f47673f0)
    at ../sysdeps/nptl/futex-internal.h:183
  9    Thread 0x7f08aaffd700 (LWP 710529) "b-torcontrol"   0x00007f08ce8e846e in epoll_wait (epfd=35, events=0x55f1f6916970, maxevents=32, timeout=2100000)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  10   Thread 0x7f08aa7fc700 (LWP 710530) "b-net"          0x00007f08ce8db99f in __GI___poll (fds=0x7f089d2c61a0, nfds=3, timeout=50)
    at ../sysdeps/unix/sysv/linux/poll.c:29
  11   Thread 0x7f08a97fa700 (LWP 710532) "b-addcon"       __lll_lock_wait (
    futex=futex@entry=0x55f1f4113300 <cs_main>, private=0) at lowlevellock.c:52
  12   Thread 0x7f08a8ff9700 (LWP 710533) "b-mncon"        futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f08a8ff8970, 
    clockid=<optimized out>, expected=0, futex_word=0x55f1f4949ce0)
    at ../sysdeps/nptl/futex-internal.h:320
  13   Thread 0x7f08a3fff700 (LWP 710534) "b-opencon"      futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f08a3ffe270, 
    clockid=<optimized out>, expected=0, futex_word=0x55f1f4949ce0)
    at ../sysdeps/nptl/futex-internal.h:320
  14   Thread 0x7f08a37fe700 (LWP 710535) "b-msghand"      __lll_lock_wait (
    futex=futex@entry=0x55f1f4113300 <cs_main>, private=0) at lowlevellock.c:52
  15   Thread 0x7f08a2ffd700 (LWP 713363) "b-msghand"      futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x55f1f4125138 <leveldb::Env::Default()::env_container+88>)
    at ../sysdeps/nptl/futex-internal.h:183

@walkjivefly (Member Author)

Another one bites the dust

2022-11-02T13:12:15Z CheckProofOfStake : OutPoint: 5d202b41e5e46d814628630f1c3ed368e2a94c8609779b29a829597744ec1d85:1 Modifier=ed9e5b0fdbb1e07c031a037411c1d0550ba58a593ded6cea82b10f8ad6809d50 timeblockfrom=1667385472 time=1667394735 amount=1000000000000
2022-11-02T13:32:16Z socket sending timeout: 1201s
2022-11-02T13:32:16Z socket sending timeout: 1201s
2022-11-02T13:32:16Z socket sending timeout: 1201s
2022-11-02T13:32:16Z socket sending timeout: 1201s
2022-11-02T13:32:16Z socket sending timeout: 1201s
2022-11-02T13:32:16Z socket sending timeout: 1201s
2022-11-02T13:32:16Z socket sending timeout: 1201s
2022-11-02T13:32:16Z socket sending timeout: 1201s
2022-11-02T13:32:16Z socket sending timeout: 1201s

It refuses to respond to CLI RPCs; gdb shows the same picture as above.

@walkjivefly walkjivefly changed the title v0.0.0.62 network unreliability, node goes offline following socket timeouts v0.0.0.62 network unreliability, node goes unresponsive/offline following socket timeouts Nov 2, 2022
@walkjivefly (Member Author)

The v0.0.0.69 collateral wallet experienced the 1201s timeouts after restarting following the crown-qt crash in #168.
issue168or139.log

The peers, probably captured just before the timeouts, were

mark@xub22:~/.crown/testnet3$ crown-cli getpeerinfo|grep -e '"addr": "' -e startingheight -e synced_
    "addr": "92.60.45.236:18333",
    "startingheight": 26344,
    "synced_headers": 26351,
    "synced_blocks": 26351,
    "addr": "92.60.45.237:18333",
    "startingheight": 26344,
    "synced_headers": 26352,
    "synced_blocks": 26352,
    "addr": "92.60.45.238:18333",
    "startingheight": 26344,
    "synced_headers": 26352,
    "synced_blocks": 26352,
    "addr": "92.60.45.239:18333",
    "startingheight": 26344,
    "synced_headers": 26352,
    "synced_blocks": 26352,
    "addr": "92.60.46.27",
    "startingheight": 26344,
    "synced_headers": 26351,
    "synced_blocks": 26351,
    "addr": "92.60.46.28",
    "startingheight": 26344,
    "synced_headers": 26351,
    "synced_blocks": 26351,
    "addr": "92.60.46.29",
    "startingheight": 26344,
    "synced_headers": 26351,
    "synced_blocks": 26351,
    "addr": "92.60.46.30",
    "startingheight": 26344,
    "synced_headers": 26352,
    "synced_blocks": 26352,

and chaintips included

mark@xub22:~/.crown/testnet3$ crown-cli getchaintips
[
  {
    "height": 26352,
    "hash": "fcf6cf6e78db74780221b3d0d6c103d639621972839810116d129182ecaf4c33",
    "branchlen": 0,
    "status": "active"
  },
  {
    "height": 26352,
    "hash": "1212f64485b12dc9ad9baffc6557d9cecb9f497a4342c59ee208e277cfca9b2d",
    "branchlen": 3,
    "status": "valid-headers"
  },
  {
    "height": 26350,
    "hash": "64e4f7dc75d3d06d22917af75cc25632dcaf47c89b83e736a15a239381ca88d5",
    "branchlen": 1,
    "status": "valid-headers"
  },
  {
    "height": 26350,
    "hash": "15997792553ba27e84330d07520d568d46b870cf1dd64ba4b793b1c8ee50d3b3",
    "branchlen": 1,
    "status": "valid-headers"
  },
...

CPU use of crownd is pegged at 100% of one vCPU despite it not actually doing anything useful. Asking it to reconsider the last good block didn't help.
gdb shows the process state as

(gdb) info threads
  Id   Target Id                                          Frame 
* 1    Thread 0x7fca80e1d780 (LWP 8173) "crownd"          0x00007fca80f07868 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, 
    req=0x7ffc5761d1e0, rem=0x7ffc5761d1e0)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
  2    Thread 0x7fca7dd3d640 (LWP 8174) "b-scriptch.0"    __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x55fc6b1b11b4 <scriptcheckqueue+84>)
    at ./nptl/futex-internal.c:57
  3    Thread 0x7fca7d53c640 (LWP 8175) "b-crown-nodesyn" __futex_abstimed_wait_common64 (private=517861120, cancel=true, abstime=0x7fca7d53bcc0, op=393, 
    expected=0, futex_word=0x55fc6bf522d8) at ./nptl/futex-internal.c:57
  4    Thread 0x7fca7cd3b640 (LWP 8176) "b-http"          0x00007fca80f47fde in epoll_wait (epfd=5, events=0x55fc6bfc6260, maxevents=32, timeout=-1)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  5    Thread 0x7fca77fff640 (LWP 8177) "b-httpworker.0"  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x55fc6bfa6ae4) at ./nptl/futex-internal.c:57
  6    Thread 0x7fca777fe640 (LWP 8178) "b-httpworker.1"  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x55fc6bfa6ae0) at ./nptl/futex-internal.c:57
  7    Thread 0x7fca76ffd640 (LWP 8179) "b-httpworker.2"  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x55fc6bfa6ae0) at ./nptl/futex-internal.c:57
  8    Thread 0x7fca767fc640 (LWP 8180) "b-httpworker.3"  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x55fc6bfa6ae0) at ./nptl/futex-internal.c:57
  9    Thread 0x7fca62de9640 (LWP 8185) "b-torcontrol"    0x00007fca80f47fde in epoll_wait (epfd=35, events=0x55fc73ef5e50, maxevents=32, timeout=2100000)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  10   Thread 0x7fca625e8640 (LWP 8186) "b-net"           0x00007fca80f3ad7f in __GI___poll (fds=0x7fca308c6790, nfds=11, timeout=50)
    at ../sysdeps/unix/sysv/linux/poll.c:29
  11   Thread 0x7fca615e6640 (LWP 8188) "b-addcon"        __futex_abstimed_wait_common64 (private=855544824, cancel=true, abstime=0x7fca615e5bb0, op=137, 
    expected=0, futex_word=0x55fc6c02acf0) at ./nptl/futex-internal.c:57
  12   Thread 0x7fca60de5640 (LWP 8189) "b-mncon"         __futex_abstimed_wait_common64 (private=1796962040, cancel=true, abstime=0x7fca60de4b80, op=137, 
    expected=0, futex_word=0x55fc6c02acf0) at ./nptl/futex-internal.c:57
  13   Thread 0x7fca4bfff640 (LWP 8190) "b-opencon"       __futex_abstimed_wait_common64 (private=1797285236, cancel=true, abstime=0x7fca4bffe570, op=137, 
    expected=0, futex_word=0x55fc6c02acf0) at ./nptl/futex-internal.c:57
  14   Thread 0x7fca4b7fe640 (LWP 8191) "b-msghand"       0x000055fc6a95c14d in CBudgetManager::NewBlock(CConnman&) ()
  15   Thread 0x7fca4affd640 (LWP 8192) "b-loadblk"       __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x55fc6b1b8238 <leveldb::Env::Default()::env_container+88>)
    at ./nptl/futex-internal.c:57

Similar, but not identical, to the earlier examples.

@walkjivefly (Member Author)

3 out of 4 v0.0.0.69 nodes hit the 1201s timeout overnight:
Screenshot_2022-11-08_10-30-46

@walkjivefly (Member Author) commented Nov 10, 2022

Had two more instances of the 1201s timeouts since yesterday afternoon, after the nodes were moved to different servers in the Crown infrastructure. They had to be kill -9'd but restarted OK and are ENABLED again now.
Here's the debug.log from one of them
1201.zip

@walkjivefly (Member Author)

Had another instance overnight; here's the debug.log and a core dump of the hung node:
issue139-20221110.zip

@BlockMechanic added the Review/Close label Nov 19, 2022
@walkjivefly (Member Author) commented Nov 21, 2022

Had at least two more instances overnight. This is not resolved and cannot be closed.

@walkjivefly removed the Review/Close label Nov 21, 2022
@BlockMechanic added the Review/Close label Nov 29, 2022
@BlockMechanic (Collaborator)

And now?

@walkjivefly (Member Author)

Let's work on the basis that (most of) the 1201s timeouts are a consequence of getting stuck in NodeMinter, so keep #195 open and close this one.
