Pruning regression test failures #476

Open
ftrader opened this issue Apr 26, 2017 · 15 comments
ftrader commented Apr 26, 2017

I've always had trouble executing the pruning.py test successfully. I'm opening this issue to sort out these failures and to see whether others also have trouble with this test.

To test, one needs to re-enable the test (it is currently Disabled in qa/pull-tester/rpc-tests.py).
Change the line which says:

Disabled('pruning', "too much disk"),

to simply read:

'pruning',

and then execute qa/pull-tester/rpc-tests.py pruning
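
For clarity, the change amounts to swapping the Disabled(...) wrapper for the bare script name in the runner's test list. A minimal sketch, assuming the entry sits in a list like the one below (the list name and neighbouring entries are illustrative, not the actual file contents):

# qa/pull-tester/rpc-tests.py -- sketch only; list name and neighbours are illustrative
testScripts = [
    # ...
    # Disabled('pruning', "too much disk"),  # before: the runner skips this entry
    'pruning',                               # after: the runner executes pruning.py
    # ...
]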

The test takes a long time, which is why it was disabled even though it is already in the 'extended' test suite, which doesn't get run automatically in Travis CI.

The failures I got in the past looked like timeout / connection related issues. Yesterday I tried again, based on the latest dev branch HEAD, on a Debian 7 x86_64 system.
I've attached the failure output from the test as a file. The failure is slightly different from what I've seen before: this time an actual assertion in the client was triggered.
I will run the same test on release and post results, as well as execute it on an Ubuntu 16.10 VM for both release and dev.

I hope others can also try executing this test on a diverse range of systems and report whether it succeeds for them. Best to do it overnight, and remember to either set the terminal scrollback to a sufficient length, run it under script(1) to capture the output, or otherwise redirect the output to a file.

20170426_pruning_fail_on_bu_dev_debian7.txt

sickpig commented Apr 26, 2017

I'm going to test it on both release and dev.

ftrader commented Apr 26, 2017

Testing the 'release' branch on Debian 7, it went into a loop after the test output
Mine 24 (stale) blocks on Node 1, followed by 25 (main chain) block reorg from Node 0, for 12 rounds
It seemed to be looping at these heights:
[1095, 1094, 1095]
with such a line printed roughly every second. I might retry this later, but will first try Ubuntu 16.10.
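
The repeated bracketed line looks like the per-node block counts printed while the test framework's sync_blocks helper (qa/rpc-tests/test_framework/util.py, see the tracebacks further down) waits for all nodes to reach the same height. A minimal sketch of that polling loop, assuming it prints the counts on each pass (simplified; not the exact helper code):

import time

def sync_blocks(rpc_connections, wait=1):
    # Poll every node's getblockcount until they all agree.
    while True:
        counts = [x.getblockcount() for x in rpc_connections]
        print(counts)  # assumed source of the repeated "[1095, 1094, 1095]" lines
        if counts == [counts[0]] * len(counts):
            break      # all nodes report the same height -> synced
        time.sleep(wait)  # nodes never converge -> this loops indefinitely, as seen here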

Output log:
20170426_pruning_loop_on_bu_release_debian7.txt

ftrader commented Apr 26, 2017

Test of 'release' (e78ce5e) looped after first step on Ubuntu 16.10 x86_64 (yakkety yak)

20170426_pruning_loop_on_bu_release_ubuntu16_10.txt

Test of 'dev' (b22d70e) failed in a similar manner on Ubuntu 16.10 x86_64 as on Debian 7 x86_64:

20170426_pruning_fail_on_bu_dev_ubuntu16_10.txt

sickpig commented Apr 26, 2017

The release branch (current HEAD) test looped:

[594, 594, 593]
[594, 594, 593]
[594, 594, 593]

Attached are the debug logs of the nodes used in the test:

nodes_debug.tar.gz

sickpig commented Apr 26, 2017

dev@HEAD

[1320, 1320, 1320]
Usage can be over target because of high stale rate: 638.6924505233765
Check that we can survive a 288 block reorg still
Current block height: 1320
Invalidating block at height: 1033 7599124c75931612b61d567c7e8d7be57756cb6b22e3e53b7de092e9295adb65
New best height 1032
Generating new longer chain of 300 more blocks
Reconnect nodes
[1320, 1332, 1320]
[1320, 1332, 1320]
[1320, 1332, 1033]
bitcoind: main.cpp:4894: void CheckBlockIndex(const Consensus::Params&): Assertion `foundInUnlinked' failed.
Unexpected exception caught during testing: ConnectionRefusedError(111, 'Connection refused')
  File "/home/sickpig/code/BU/bu-src/qa/rpc-tests/test_framework/test_framework.py", line 184, in main
    self.run_test()
  File "/home/sickpig/code/BU/bu-src/qa/rpc-tests/pruning.py", line 333, in run_test
    (self.forkheight,self.forkhash) = self.reorg_test() #(1033, )
  File "/home/sickpig/code/BU/bu-src/qa/rpc-tests/pruning.py", line 192, in reorg_test
    sync_blocks(self.nodes[0:3])
  File "/home/sickpig/code/BU/bu-src/qa/rpc-tests/test_framework/util.py", line 158, in sync_blocks
    counts = [ x.getblockcount() for x in rpc_connections ]
  File "/home/sickpig/code/BU/bu-src/qa/rpc-tests/test_framework/util.py", line 158, in <listcomp>
    counts = [ x.getblockcount() for x in rpc_connections ]
  File "/home/sickpig/code/BU/bu-src/qa/rpc-tests/test_framework/coverage.py", line 49, in __call__
    return_val = self.auth_service_proxy_instance.__call__(*args, **kwargs)

nodes_debug_log_dev.tar.gz
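
For context, the reorg step that is running when the assertion fires (reorg_test in pruning.py, per the traceback) roughly matches the log lines above; the ConnectionRefusedError afterwards is just fallout, since the assertion aborts bitcoind and the next RPC call finds nobody listening. A hedged sketch of the flow, reconstructed from the log rather than copied from pruning.py (invalidateblock/generate are standard bitcoind RPCs; connect_nodes_bi/sync_blocks are the usual test_framework.util helpers):

# Rough sketch of the reorg_test flow seen in the log; not the actual pruning.py code.
def reorg_test(self):
    height = self.nodes[1].getblockcount()       # "Current block height: 1320"
    invalidheight = height - 287                 # fork far enough back that surviving needs 288 blocks of undo data
    badhash = self.nodes[1].getblockhash(invalidheight)
    self.nodes[1].invalidateblock(badhash)       # "Invalidating block at height: 1033 ..."
    self.nodes[1].generate(300)                  # "Generating new longer chain of 300 more blocks"
    connect_nodes_bi(self.nodes, 0, 1)           # "Reconnect nodes"
    connect_nodes_bi(self.nodes, 1, 2)
    sync_blocks(self.nodes[0:3])                 # hangs or errors out here once a node asserts
    return (invalidheight, badhash)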

ftrader commented Apr 26, 2017

A rollback to 6a06bda achieved a test PASS for sickpig (from Slack chat feedback).
I tried the same on my Ubuntu 16.10 VM, but it wasn't successful on the first run, nor on a repeat (test ongoing for Debian 7).
Here's the Ubuntu failure log of the first run (timeout):
20170426_pruning_fail_on_bu_6a06bda_ubuntu16_10.txt

The second failed run was the same failure, but at heights [1344, 1344, 1344].

sickpig commented Apr 27, 2017

@ftrader: tested again with the repo rolled back to 6a06bda and it passes:

[1551, 1551, 1551]
[1552, 1551, 1551]
[1552, 1552, 1552]
Usage should be below target: 447.3477268218994
Test that we can rerequest a block we previously pruned if needed for a reorg
Will need to redownload block 1033
Rewind node 0 to prev main chain to mine longer chain to trigger redownload. Blocks needed: 233



Verify node 2 reorged back to the main chain, some blocks of which it had to redownload
Done
Stopping nodes
Cleaning up
Tests successful
Duration: 1895 s


Test                                                Status    Time (s)
----------------------------------------------------------------------
pruning.py                                          PASS         1895
----------------------------------------------------------------------
                                              Total time (s):    1895

1 test(s) passed / 0 test(s) failed / 1 test(s) executed
0 test(s) disabled / 0 test(s) skipped due to platform

sickpig commented Apr 27, 2017

@ptschip @ftrader
just tested dev + #479 + #482 and got another failure:

[1320, 1295, 1306]
[1320, 1310, 1320]
[1320, 1320, 1320]
Usage can be over target because of high stale rate: 638.6925411224365
Check that we can survive a 288 block reorg still
Current block height: 1320
Invalidating block at height: 1033 492748de4c1b2b3a147a09bae348715dc34cb83284227290a18fcf9562101f44
New best height 1032
Generating new longer chain of 300 more blocks
Reconnect nodes
[1320, 1332, 1320]
[1320, 1332, 1320]
[1320, 1332, 1032]
bitcoind: main.cpp:4900: void CheckBlockIndex(const Consensus::Params&): Assertion `foundInUnlinked' failed.
Unexpected exception caught during testing: ConnectionRefusedError(111, 'Connection refused')
  File "/home/sickpig/Bitcoin/BU/bu-src/qa/rpc-tests/test_framework/test_framework.py", line 184, in main
    self.run_test()
  File "/home/sickpig/Bitcoin/BU/bu-src/qa/rpc-tests/pruning.py", line 333, in run_test
    (self.forkheight,self.forkhash) = self.reorg_test() #(1033, )
  File "/home/sickpig/Bitcoin/BU/bu-src/qa/rpc-tests/pruning.py", line 192, in reorg_test
    sync_blocks(self.nodes[0:3])
  File "/home/sickpig/Bitcoin/BU/bu-src/qa/rpc-tests/test_framework/util.py", line 158, in sync_blocks
    counts = [ x.getblockcount() for x in rpc_connections ]
  File "/home/sickpig/Bitcoin/BU/bu-src/qa/rpc-tests/test_framework/util.py", line 158, in <listcomp>
    counts = [ x.getblockcount() for x in rpc_connections ]
  File "/home/sickpig/Bitcoin/BU/bu-src/qa/rpc-tests/test_framework/coverage.py", line 49, in __call__
...
...
  File "/usr/lib/python3.5/socket.py", line 711, in create_connection
    raise err
  File "/usr/lib/python3.5/socket.py", line 702, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
Duration: 1729 s


Test                                                Status    Time (s)
----------------------------------------------------------------------
pruning.py                                          FAILED       1729
----------------------------------------------------------------------
                                              Total time (s):    1729

0 test(s) passed / 1 test(s) failed / 1 test(s) executed
0 test(s) disabled / 0 test(s) skipped due to platform

These are the nodes' debug.log files:
node_logs_dev.tar.gz

ftrader commented Apr 27, 2017

A third run of 6a06bda on the Ubuntu VM failed as well, for the same timeout reason as before.
A run of 6a06bda on Debian 7 also failed with the same timeout error.

Haven't tried with any PR patches yet.

ptschip commented Apr 27, 2017 via email

ftrader commented Apr 27, 2017

It would be nice if we could have more people testing this on VMs, specifically commit 6a06bda, which passed here for @sickpig.
It could of course be something to do with my specific virtualization environment, but I don't see how that could easily generalize to the various other VM environments.

ftrader commented Apr 28, 2017

VM positive result: Test passed for commit 6a06bda on a cloud server running Debian 8 on QEMU.
Storage was SSD (plenty of free disk space); the machine had 2 GB RAM and one vCPU (Xeon E5-2697 v3 @ 2.60GHz) rated at 5199.99 bogomips.

Total time (s): 5890

ftrader commented Apr 28, 2017

Also passed on the aforementioned Debian 8 QEMU box with release branch 1.0.1.4 (9d36bec), with the #482 commit a145250 merged in.

Total time (s): 5868

ftrader commented May 12, 2017

I was able to produce a block sync timeout failure on the Core 0.14 branch too.

Raised issue 10393 on Bitcoin Core.

sickpig commented Aug 14, 2017

Just an update on this one: I'm still able to reproduce the problem on current dev.
