
Improve UX of experimental options/features and other small fixes #5047

Merged

Conversation

@SimonVrouwe (Collaborator) commented Feb 21, 2022

So I was trying to connect the dots to understand dual-funding and leases, and then give it a shot.
I did my first dual-fund on mainnet a couple of days ago and then a lease; it all worked out, very cool.

I needed to run custom pytests and look up BOLT pull requests to get some confidence in the meaning of various options and parameters, and I came across a few issues during my journey.

This PR proposes fixes for them:

  • a single location to see our feature bits would be handy (proposed in issue #4310); edit: I just found out the plugin init configuration also has these, so a plugin can be used to show them (see the sketch after this list)
  • docs missing details and some confusing comments
  • other small fixes, typos, comments, etc.
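For illustration, here is a minimal sketch of such a plugin using pyln-client. It assumes the init configuration carries a feature_set entry (as mentioned above); field names may differ between versions, so treat this as a sketch rather than the exact interface.

#!/usr/bin/env python3
# Minimal sketch (assumption: the `configuration` passed at init contains a
# `feature_set` entry with the hex feature strings; verify on your version).
from pyln.client import Plugin

plugin = Plugin()


@plugin.init()
def init(options, configuration, plugin):
    plugin.log("our feature_set: {}".format(configuration.get('feature_set')))


plugin.run()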

@SimonVrouwe (Collaborator, Author)

That's odd, test_plugin_shutdown failed while waiting for the lines:

        l1.daemon.wait_for_logs(['misc_notifications.py: via lightningd shutdown, datastore failed',
>                                'test_libplugin: failed to self-terminate in time, killing.'])

Which are both there:

lightningd-1: 2022-02-21T14:24:48.264Z INFO    plugin-misc_notifications.py: via lightningd shutdown, datastore failed
...
lightningd-1: 2022-02-21T14:25:18.189Z DEBUG   lightningd: test_libplugin: failed to self-terminate in time, killing.
lightningd-1: 2022-02-21T14:25:18.194Z DEBUG   plugin-test_libplugin: Killing plugin: exited during normal operation
lightningd-1: 2022-02-21T14:25:18.196Z DEBUG   lightningd: Command returned result after jcon close

The CI test failure:

                  ValueError: Process died while waiting for logs

Which is here in wait_for_logs:

    if not self.running:
        raise ValueError('Process died while waiting for logs')

These lines are at the end of the log. Maybe the process died quicker than its (log) tail?

@rustyrussell (Contributor) left a comment:

Great series! Only minor changes needed, tbh...

Review threads (resolved): tests/test_plugin.py, plugins/funder.c
@SimonVrouwe (Collaborator, Author)

Maybe I got lucky, but the wait_for_logs fix attempt seems-to-work ™️

Will push another fixup for another test round, if that also passes will squash.

@niftynei (Collaborator) left a comment:

utACK, nice cleanup

@rustyrussell (Contributor)

> Maybe I got lucky, but the wait_for_logs fix attempt seems-to-work tm
>
> Will push another fixup for another test round, if that also passes will squash.

This fix doesn't do anything. If there is more in the logs, we don't check if it's dead; only once we've exhausted the logs do we see if it's dead.

The real problem is that wait_for_log only reads forward. Here's what happens when I run locally (with valgrind, it's too fast to trigger otherwise):

2022-02-22T23:33:24.306Z INFO    plugin-misc_notifications.py: via lightningd shutdown, datastore failed
2022-02-22T23:33:24.332Z DEBUG   plugin-keysend: Killing plugin: exited during normal operation
2022-02-22T23:33:24.333Z DEBUG   plugin-topology: Killing plugin: exited during normal operation
2022-02-22T23:33:24.335Z DEBUG   plugin-autoclean: Killing plugin: exited during normal operation
2022-02-22T23:33:24.336Z DEBUG   plugin-bcli: Killing plugin: exited during normal operation
2022-02-22T23:33:24.340Z DEBUG   plugin-test_libplugin: shutdown called
2022-02-22T23:33:24.343Z DEBUG   plugin-offers: Killing plugin: exited during normal operation
2022-02-22T23:33:24.360Z DEBUG   plugin-misc_notifications.py: Killing plugin: exited during normal operation
2022-02-22T23:33:24.362Z DEBUG   plugin-funder: Killing plugin: exited during normal operation
2022-02-22T23:33:24.366Z DEBUG   plugin-txprepare: Killing plugin: exited during normal operation
2022-02-22T23:33:24.368Z DEBUG   plugin-spenderp: Killing plugin: exited during normal operation
2022-02-22T23:33:24.384Z DEBUG   plugin-fetchinvoice: Killing plugin: exited during normal operation
2022-02-22T23:33:24.386Z DEBUG   plugin-pay: Killing plugin: exited during normal operation
2022-02-22T23:33:54.299Z DEBUG   lightningd: test_libplugin: failed to self-terminate in time, killing.

Now, we call l1.daemon.wait_for_log(r'test_libplugin: shutdown called'), and that skips past the plugin-misc_notifications.py: via lightningd shutdown, datastore failed line, so waiting for it afterwards fails. Which is why we used to manually move the search needle back.
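To make the failure mode concrete, here is a simplified, hypothetical sketch of a forward-only matcher; the real pyln-testing code also tails the running process, but the search position only ever advances in the same way:

import re


class ForwardOnlyLog:
    # Simplified stand-in for the test framework's log matching: the search
    # position only ever advances, so a line that has been passed over can
    # never be matched by a later wait.
    def __init__(self, lines):
        self.lines = lines
        self.pos = 0

    def wait_for_log(self, regex):
        while self.pos < len(self.lines):
            line = self.lines[self.pos]
            self.pos += 1
            if re.search(regex, line):
                return line
        raise ValueError('Process died while waiting for logs')


log = ForwardOnlyLog([
    'plugin-misc_notifications.py: via lightningd shutdown, datastore failed',
    'plugin-test_libplugin: shutdown called',
    'lightningd: test_libplugin: failed to self-terminate in time, killing.',
])
log.wait_for_log(r'test_libplugin: shutdown called')
# The needle is now past the misc_notifications.py line, so a subsequent
# log.wait_for_log(r'datastore failed') would exhaust the log and raise.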

Switching to a single wait_for_logs() on all three fixes this:

diff --git a/tests/test_plugin.py b/tests/test_plugin.py
index 5963a437f..ad69ead61 100644
--- a/tests/test_plugin.py
+++ b/tests/test_plugin.py
@@ -2513,6 +2513,6 @@ def test_plugin_shutdown(node_factory):
     # Now, should also shutdown or timeout on finish, RPC calls then fail with error code -5
     l1.rpc.plugin_start(p, dont_shutdown=True)
     l1.rpc.stop()
-    l1.daemon.wait_for_log(r'test_libplugin: shutdown called')
-    l1.daemon.wait_for_logs(['misc_notifications.py: via lightningd shutdown, datastore failed',
+    l1.daemon.wait_for_logs(['test_libplugin: shutdown called',
+                             'misc_notifications.py: via lightningd shutdown, datastore failed',
                              'test_libplugin: failed to self-terminate in time, killing.'])

@SimonVrouwe (Collaborator, Author)

-    l1.daemon.wait_for_log(r'test_libplugin: shutdown called')
-    l1.daemon.wait_for_logs(['misc_notifications.py: via lightningd shutdown, datastore failed',
+    l1.daemon.wait_for_logs(['test_libplugin: shutdown called',
+                             'misc_notifications.py: via lightningd shutdown, datastore failed',
                              'test_libplugin: failed to self-terminate in time, killing.'])
  • fixup with the above recommendation (I keep tripping over that same stone)
  • fixup adding --lease-fee-base-msat as plugin_option_deprecated and another Changelog note
  • rebased on latest master and squashed fixups with relevant commits

@SimonVrouwe (Collaborator, Author)

Trying to debug the failing test:

def test_ping_timeout(node_factory):
    # Disconnects after this, but doesn't know it.
    l1_disconnects = ['xWIRE_PING']
    l1, l2 = node_factory.line_graph(2, opts=[{'dev-no-reconnect': None,
                                               'disconnect': l1_disconnects},
                                              {}])
    # Takes 15-45 seconds, then another to try second ping
    l1.daemon.wait_for_log('Last ping unreturned: hanging up',
                           timeout=45 + 45 + 5)

This is in the log: instead of l1, it is l2 that logs the awaited line. But now I think it doesn't matter which one logs that line, see the explanation below.

lightningd-1: 2022-02-25T18:10:59.080Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-connectd: dev_disconnect: xWIRE_PING (WIRE_PING)
lightningd-2: 2022-02-25T18:11:38.118Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: Last ping unreturned: hanging up

I can reproduce it locally after a few runs with valgrind=1.
Added this status_peer_debug(&peer->id, "peer->expecting_pong %i", peer->expecting_pong); just above the first if in:

static void send_ping(struct peer *peer)
{
        /* Already have a ping in flight? */
        if (peer->expecting_pong != PONG_UNEXPECTED) {
                status_peer_debug(&peer->id, "Last ping unreturned: hanging up");
                if (peer->to_peer)
                        io_close(peer->to_peer);
                return;
        }
        inject_peer_msg(peer, take(make_ping(NULL, 1, 0)));
        peer->expecting_pong = PONG_EXPECTED_PROBING;
        set_ping_timer(peer);
}

Then I ran the test with log-level=io:connectd; below is the case where the test fails.
I substituted l1's id with L1 etc. and trimmed the message IO in the log a bit for clarity; the <== parts are my comments.

lightningd-2: 2022-02-26T15:27:53.007Z DEBUG   L1-connectd: peer->expecting_pong 0
lightningd-2: 2022-02-26T15:27:53.008Z L1-connectd: [OUT] WIRE_PING
lightningd-1: 2022-02-26T15:27:53.229Z L2-connectd: [IN] WIRE_PING
lightningd-1: 2022-02-26T15:27:53.230Z L2-connectd: [OUT] WIRE_PONG
lightningd-2: 2022-02-26T15:27:53.438Z L1-connectd: [IN] WIRE_PONG
lightningd-2: 2022-02-26T15:27:53.440Z DEBUG   L1-connectd: handle_pong_in, case PONG_EXPECTED_PROBING <== this resets peer->expecting_pong

lightningd-1: 2022-02-26T15:28:07.992Z DEBUG   L2-connectd: peer->expecting_pong 0
lightningd-1: 2022-02-26T15:28:07.994Z L2-connectd: [OUT] WIRE_PING
lightningd-1: 2022-02-26T15:28:07.996Z DEBUG   L2-connectd: dev_disconnect: xWIRE_PING (WIRE_PING)
lightningd-2: 2022-02-26T15:28:08.221Z L1-connectd: [IN] WIRE_PING      <== leaves l2's peer->expecting_pong alone, see handle_ping_in
lightningd-2: 2022-02-26T15:28:08.223Z L1-connectd: [OUT] WIRE_PONG     <==

lightningd-2: 2022-02-26T15:28:16.009Z DEBUG   L1-connectd: peer->expecting_pong 0
lightningd-2: 2022-02-26T15:28:16.010Z L1-connectd: [OUT] WIRE_PING     <== Bingo! This sets peer->expecting_pong to 2
~ 32s since last l2's WIRE_PING <== is 15 + pseudorand(30)
lightningd-2: 2022-02-26T15:28:48.019Z DEBUG   L1-connectd: peer->expecting_pong 2
lightningd-2: 2022-02-26T15:28:48.021Z DEBUG   L1-connectd: Last ping unreturned: hanging up <== which also kills l1's ping timer

What is happening in the above (failing) case:
l2's last [OUT] WIRE_PING is never PONGed because of l1's dev_disconnect.
l2's ping_timer goes off (in this failing case before l1's ping_timer) and wants to send [OUT] WIRE_PING, but it still expects a pong and therefore closes the connection.
Closing the connection makes l1 delete its ping_timer (I think), so it never sends its [OUT] WIRE_PING and never logs Last ping unreturned: hanging up.

The solution is to check for the desired log line in both l1 and l2.
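A hedged sketch of what that check could look like in the test (hypothetical helper, not necessarily the exact fix that was merged; it assumes the daemon object exposes an is_in_log() check):

import time


def wait_for_log_on_either(nodes, regex, timeout=95):
    # Hypothetical helper: poll both nodes until one of them has logged the
    # line, since the race decides whose ping timer fires first.
    deadline = time.time() + timeout
    while time.time() < deadline:
        for node in nodes:
            if node.daemon.is_in_log(regex):
                return node
        time.sleep(1)
    raise TimeoutError('neither node logged {!r}'.format(regex))


# Usage in the test would then be something like:
# wait_for_log_on_either([l1, l2], r'Last ping unreturned: hanging up',
#                        timeout=45 + 45 + 5)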

@SimonVrouwe (Collaborator, Author) commented Feb 27, 2022

Just a last point with respect to getinfo's "our_features" field.
Instead of a hex-string we can make it a comma-separated string of bit numbers (using fmt_featurebits).
The highest bit number that can show up in "our_features" currently is 105 (shutdown_wrong_funding), giving hex-strings like:

"our_features": {
      "init": "0200000000000000008828226aa2",
      "node": "0200000000000080008828226aa2",
      "channel": "",
      "invoice": "20024200"
   }

would become:

"our_features": {
      "init": "1,5,7,9,11,13,14,17,21,27,29,35,39,105",
      "node": "1,5,7,9,11,13,14,17,21,27,29,35,39,55,105",
      "channel": "",
      "invoice": "9,14,17,29"
   }

I think the latter is more readable and would be more compact when bit numbers get higher.
For example, bit number 2023 (no idea what it means) exists in many node_announcements:

   "nodes": [
      {
         "nodeid": "03b09440c75e335ae72000231ff0358c47e5943eef73be5746799f624cfc034cf0",
         "alias": "LabNode",
         "color": "3399ff",
         "last_timestamp": 1645837821,
         "features": "800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002000808a52a1",
         "addresses": [
            {
               "type": "torv3",
               "address": "yerr22pgypq23lp4sppm7rkudtq24beaq4uyppjgwua5xbtchiukdqyd.onion",
               "port": 9735

as comma-separated string:

"features":  "0,5,7,9,12,14,17,19,23,31,45,2023"

However, it would not be consistent with the "features" field returned by connect, listchannels and listnodes, which all return a hex string.
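For reference, the conversion shown in the examples above can also be reproduced client-side from the hex string; a small sketch, assuming the hex encodes the feature bitmap big-endian with bit 0 as the least significant bit:

def feature_bits(hexstr):
    # Return the set bit numbers of a feature bitmap given as a hex string
    # (big-endian, bit 0 = least significant bit).
    if not hexstr:
        return []
    value = int(hexstr, 16)
    return [bit for bit in range(value.bit_length()) if value & (1 << bit)]


print(','.join(str(b) for b in feature_bits('20024200')))
# prints: 9,14,17,29  (matches the "invoice" example above)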

@rustyrussell (Contributor)

Nice, I'm folding fixups now...

@rustyrussell (Contributor)

> Just a last point with respect to getinfo's "our_features" field. Instead of a hex-string we can make it a comma-separated string of bit numbers (using fmt_featurebits). […]
>
> However it would not be consistent with "features" field returned by connect, listchannels and listnodes, these all return hex-string.

Let's go one better and add a "featuremap"? {0: "option_data_loss_protect/even", 1023: "option_unknown_odd_1022/odd"}, etc.?

Separate patch though, IMHO.
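As an illustration only, the suggested shape could be produced roughly like this (hypothetical sketch, not lightningd code; the known-names table is a stand-in):

def featuremap(bits, known_names):
    # Hypothetical sketch of the suggested output: bit number -> "name/parity",
    # with a placeholder name for bits we don't recognise.
    out = {}
    for bit in bits:
        parity = 'odd' if bit % 2 else 'even'
        base = bit & ~1  # a feature pair shares one name; the odd bit is optional
        name = known_names.get(base, 'option_unknown_{}_{}'.format(parity, base))
        out[bit] = '{}/{}'.format(name, parity)
    return out


print(featuremap([0, 1023], {0: 'option_data_loss_protect'}))
# {0: 'option_data_loss_protect/even', 1023: 'option_unknown_odd_1022/odd'}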

…ent with tlvs

Terminal prompt got messed up because of a missing newline in the case of empty fields.

printwire_addresses expected it to be the last field, which is not
the case for a node_announcement with tlvs.
All build flags and (experimental) options make it hard to find
out what features are supported or enabled.

And the undocumented `--list-features-only` does not account for all
of our featurebits, for example bit 55 (keysend).

Changelog-Added: JSON-RPC: `getinfo` result now includes `our_features` (bits) for various BOLT #9 contexts
It did not pick up the last (bolt11) bits set.
…` field returned by listnodes, listchannels and connect.
…z-percent

Changelog-Experimental: option `--lease-fee-base-msat` renamed to `--lease-fee-base-sat`

Changelog-Experimental: option `--lease-fee-base-msat` deprecated and will be removed next release
…o works with dualfund

and some cleanup; fundchannel has been returning tx, txid and outnum for years
…re reliable

it now waits for 'alias' in node_announcement, not just block confirms.
more cleanup
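Relating to the Changelog-Added entry for `getinfo` above: once merged, the new field can be inspected from Python via pyln-client, for example (the socket path below is an assumption; point it at your own lightning-dir and network):

from pyln.client import LightningRpc

# Example path only; adjust to your node's RPC socket.
rpc = LightningRpc('/home/user/.lightning/bitcoin/lightning-rpc')
print(rpc.getinfo().get('our_features'))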
@rustyrussell (Contributor)

Ack 3dd54c9

@rustyrussell merged commit 175e0b5 into ElementsProject:master on Feb 28, 2022