ci: Switch to Github Actions and remove Travis CI #4357

Merged: 37 commits into ElementsProject:master on Jan 28, 2021

Conversation

@cdecker (Member) commented Jan 26, 2021

With Travis CI refusing to build any of our configurations, it's time to switch to something hopefully a bit more stable.

@rustyrussell (Contributor) left a comment:

Great improvements in here! Pretty epic, in fact.

I will test on my local machine to see if any of the test fixes broke things for me.

Resolved review threads: .github/workflows/ci.yaml (3), tests/test_pay.py (1)

We weren't waiting for the transactions to enter the mempool, which could cause all of our fine-tuned block counts to be off. Now we just wait for the expected number of txs.
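A minimal sketch of that wait, assuming pyln-testing's `wait_for` helper and `bitcoind` fixture (the surrounding names like `expected` are assumptions, not the exact fix):

```
from pyln.testing.utils import wait_for

# Wait until the expected number of txs (assumed broadcast above) has
# actually landed in bitcoind's mempool before mining.
wait_for(lambda: len(bitcoind.rpc.getrawmempool()) == expected)

# Only now are the fine-tuned block counts meaningful.
bitcoind.generate_block(6)
```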
We were getting a number of incompatibility warnings due to the dependencies being expressed too rigidly. This loosens the requirement definitions to being compatible with a known good version, and while we're at it we also bump all outdated requirements.
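Illustratively (not the actual diff), the move from rigid pins to compatible-release specifiers looks like this in a requirements.txt:

```
# Before: exact pin, fights with anything else needing a newer 5.4.x.
pytest==5.4.1

# After: "compatible release" accepts any 5.4.x from 5.4.1 upward.
pytest~=5.4.1
```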
The timeout on the pay future was too short under valgrind.

We weren't waiting for the `dev_fail` transaction to hit the mempool, throwing the results off.

Syncing with the blockchain was slower than our timeout...

It requires `--dev-force-features`, which isn't available without `DEVELOPER=1`.

We also make the logic a bit nicer to read. The failure was due to more than one status message being present if we look at the wrong time:

```
arr = ['CLOSINGD_SIGEXCHANGE:We agreed on a closing fee of 20334 satoshi for tx:17f1e9d377840edf79d8b6f1ed0faba59bb307463461...9b98', 'CLOSINGD_SIGEXCHANGE:Waiting for another closing fee offer: ours was 20334 satoshi, theirs was 20332 satoshi,']

     def only_one(arr):
         """Many JSON RPC calls return an array; often we only expect a single entry
         """
 >       assert len(arr) == 1
 E       AssertionError
```
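One way to make such a check robust, sketched with an assumed pyln-testing node fixture `l1` rather than the exact fix from this PR:

```
from pyln.testing.utils import wait_for

# Don't insist on exactly one status line; wait until the line we care
# about appears among however many are present at this instant.
wait_for(lambda: any(
    s.startswith('CLOSINGD_SIGEXCHANGE:We agreed on a closing fee')
    for s in l1.rpc.listpeers()['peers'][0]['channels'][0]['status']
))
```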
They were using `TIMEOUT / 2`, which may be way too long (it ran up against the test timeout), so we use a still-ludicrous 30 seconds instead.

We were getting a couple of starvations, so we need a fair filelock. I also wasn't too happy with the lock as-is, so I hand-coded it quickly (see the sketch below).

Should be correct, but the overall timeout will tell us how well we are doing on CI.
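A minimal sketch of what a hand-coded fair (FIFO) filelock can look like, assuming all contenders share one lock directory on a common filesystem; this is illustrative, not the code that landed:

```
import os
import time
from pathlib import Path

class FairFileLock:
    """FIFO file lock: each waiter queues a ticket file; the oldest
    ticket holds the lock, so nobody can starve."""

    def __init__(self, lockdir):
        self.lockdir = Path(lockdir)
        self.lockdir.mkdir(parents=True, exist_ok=True)
        self.ticket = None

    def acquire(self, poll_interval=0.05):
        # (wall-clock ns, pid) gives a total order across processes.
        name = '%020d-%06d' % (time.time_ns(), os.getpid())
        self.ticket = self.lockdir / name
        self.ticket.touch()
        # We hold the lock once our ticket is the oldest in the queue.
        while min(p.name for p in self.lockdir.iterdir()) != name:
            time.sleep(poll_interval)

    def release(self):
        self.ticket.unlink()
        self.ticket = None

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, *exc):
        self.release()
```

This polls rather than blocks, which is fine for test orchestration where fairness matters more than wake-up latency.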
We were sometimes waiting only 5 seconds, which is way too short on a
heavily loaded machine such as CI. Making it 30 seconds and collecting
it in a single place so we can adjust more easily.
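The "single place" can be as simple as one module-level constant that the environment can override (names assumed):

```
import os

# One knob for all the little waits; CI can stretch it via the environment.
TEST_POLL_TIMEOUT = int(os.environ.get('TEST_POLL_TIMEOUT', '30'))
```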
We were using a lot of Docker conventions, which are not necessary in the script itself.

We have a couple of very heavy tests bunched together; randomizing the test order could potentially lessen the peak load.

This should really be set by the environment, either by creating a pytest.ini or by setting the PYTEST_OPTS envvar, as in the example below.
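For instance, a hypothetical pytest.ini baking the options in, with the envvar as the per-run alternative:

```
; pytest.ini -- applies to every invocation in this directory.
[pytest]
addopts = -v --durations=10

; Alternatively, per run (assuming the Makefile passes PYTEST_OPTS through):
;   PYTEST_OPTS="-v --durations=10" make pytest
```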
If we're quick (or the node is slow) we end up reconnecting before our
counterparty has realized the state transition, resulting in an
unexpected re-establish.
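A sketch of the corresponding guard, with assumed pyln-testing node fixtures and an illustrative log regex:

```
# Let the counterparty notice the state transition before we reconnect,
# otherwise it sees an unexpected re-establish.
l2.daemon.wait_for_log(r'State changed from CHANNELD_NORMAL')
l1.rpc.connect(l2.info['id'], 'localhost', l2.port)
```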
The test was not considering that concurrent sendrawtx of the same tx is not stable: either endpoint may submit it first. Now we just check state transitions and the mempool.
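In sketch form (fixtures and `txid` assumed):

```
from pyln.testing.utils import wait_for

# Whoever wins the broadcast race, the tx must end up in the mempool and
# both sides must converge on the expected channel state.
wait_for(lambda: txid in bitcoind.rpc.getrawmempool())
wait_for(lambda: l1.rpc.listpeers()['peers'][0]['channels'][0]['state']
         == 'CLOSINGD_SIGEXCHANGE')
```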
Reconnections and unsynchronized states were causing us some issues.

openchannel internally generates blocks, which may cause nodes to be out of sync and ignore "future" channel announcements, resulting in bad gossip (see the sync sketch below).
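pyln-testing ships a helper for exactly this; a sketch with assumed node fixtures:

```
from pyln.testing.utils import sync_blockheight

# Make sure every node has processed the blocks openchannel just mined
# before expecting their gossip to line up.
sync_blockheight(bitcoind, [l1, l2, l3])
```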
pytest-rerunfailures keeps not working.

Can be used in a second stage to generate stats and detect flaky tests.
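For instance, a second-stage script could fold the JUnit XML artifacts into per-test failure counts (filenames assumed):

```
import glob
import xml.etree.ElementTree as ET
from collections import Counter

# Tally failures/errors across all collected pytest JUnit reports; tests
# that fail only sometimes across reruns are the flaky candidates.
failures = Counter()
for report in glob.glob('reports/*.xml'):
    for case in ET.parse(report).iter('testcase'):
        if case.find('failure') is not None or case.find('error') is not None:
            failures[case.get('name')] += 1

for name, count in failures.most_common():
    print(f'{count:3d} {name}')
```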
We don't have a good way of referring to the configuration that failed, so let's give them a numeric ID. Particularly useful for the artifacts that'd be overwritten otherwise.

We are printing `repr(obj)`, which is not pretty-printed, is hard to read, and can't even be copied out and inspected with JSON tools. We now print the JSONified and indented calls and responses for easier debugging based solely on the logs (useful for CI!).

Changelog-Added: pyln-testing: The RPC client will now pretty-print requests and responses to facilitate log-based debugging.
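The gist of the change, sketched (the exact pyln-testing code differs):

```
import json

def pretty(obj):
    # JSONify with indentation so a request/response can be read in the
    # log and pasted straight into JSON tooling.
    return json.dumps(obj, indent=2, sort_keys=True)

request = {'jsonrpc': '2.0', 'id': 1, 'method': 'getinfo', 'params': {}}
print('Sending:\n%s' % pretty(request))
```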
@rustyrussell (Contributor) commented:

Tried rerunning, and it breaks in different ways every time. I can't see how to run a single job; it only lets me do them all :(

@cdecker (Member, Author) commented Jan 28, 2021

> Tried rerunning, and it breaks in different ways every time. I can't see how to run a single job; it only lets me do them all :(

Yep, that's a common feature request on GitHub's own tracker. Hopefully they'll implement it sometime soon. I'll continue stabilizing tests after this is merged, making it more reliable and reducing the number of reruns we need. So far I have a failure rate of about 1/4, but each fixed test has a big effect on this.

Since we have the reports generated as artifacts, we can easily integrate them into my flaky-test tracking tool (http://46.101.246.115:5001/) and then hunt them down more easily.

@cdecker marked this pull request as ready for review on January 28, 2021, 10:57
@cdecker (Member, Author) commented Jan 28, 2021

I should have mentioned that I have been rescheduling the runs a couple of times to see if new flakes show up, and to fix known ones, so unsurprisingly some of the runs will remain in a failed state because I follow them up with a fix :-)

@rustyrussell (Contributor) commented:

Ack f2b8355

@rustyrussell merged commit a5f16ab into ElementsProject:master on Jan 28, 2021