Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lightningd: don't timeout plugins if init is slow! #5741

Merged
merged 1 commit into from Nov 27, 2022

Conversation

rustyrussell
Copy link
Contributor

This is a minimal fix: we wait until all plugins reply from init before continuing. Really large or busy nodes can have other things monopolize lightningd, then the timer goes off and we blame the plugin (which has responded, we just haven't read it yet!).

The real answer is to have some timeouts only advance when we're idle, or have them low-priority so we only activate them when we're idle (this doesn't apply to all timers: some are probably important!). But this is a minimal fix for -rc3.

Fixes: #5736

Copy link
Collaborator

@vincenzopalazzo vincenzopalazzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HACK a418a46

@cdecker cdecker added this to the v22.11 milestone Nov 26, 2022
@cdecker
Copy link
Member

cdecker commented Nov 26, 2022

Thanks Rusty, shepherding through ci and adding it to rc3 👍

@cdecker
Copy link
Member

cdecker commented Nov 26, 2022

Added changelog entry from #5741

@cdecker cdecker changed the title lightningd: don't timeout plugins if init is slow! **NOTE NEW CHANGELOG ENTRY** lightningd: don't timeout plugins if init is slow! Nov 26, 2022
This is a minimal fix: we wait until all plugins reply from init before
continuing.  Really large or busy nodes can have other things monopolize
lightningd, then the timer goes off and we blame the plugin (which has
responded, we just haven't read it yet!).

The real answer is to have some timeouts only advance when we're idle,
or have them low-priority so we only activate them when we're idle (this
doesn't apply to all timers: some are probably important!).  But
this is a minimal fix for -rc3.

Fixes: ElementsProject#5736
Changelog-Fixed: plugins: on large/slow nodes we could blame plugins for failing to answer init in time, when we were just slow.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@cdecker
Copy link
Member

cdecker commented Nov 27, 2022

I'll apply this as is since including it seems safer than not including it, but we have a timeout on valgrind-4:

    def test_important_plugin(node_factory):
        # Cache it here.
        pluginsdir = os.path.join(os.path.dirname(__file__), "plugins")
    
        n = node_factory.get_node(options={"important-plugin": os.path.join(pluginsdir, "nonexistent")},
                                  may_fail=True, expect_fail=True,
                                  allow_broken_log=True, start=False)
    
        n.daemon.start(wait_for_initialized=False, stderr_redir=True)
        # Will exit with failure code.
        assert n.daemon.wait() == 1
        assert n.daemon.is_in_stderr(r"Failed to register .*nonexistent: No such file or directory")
    
        # Check we exit if the important plugin dies.
        n.daemon.opts['important-plugin'] = os.path.join(pluginsdir, "fail_by_itself.py")
    
        n.daemon.start(wait_for_initialized=False)
        # Will exit with failure code.
>       assert n.daemon.wait() == 1

tests/test_plugin.py:2128: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
contrib/pyln-testing/pyln/testing/utils.py:657: in wait
    self.proc.wait(timeout)
/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/subprocess.py:1019: in wait
    return self._wait(timeout=timeout)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <subprocess.Popen object at 0x7efd8b29f150>, timeout = 900

    def _wait(self, timeout):
        """Internal implementation of wait() on POSIX."""
        if self.returncode is not None:
            return self.returncode
    
        if timeout is not None:
            endtime = _time() + timeout
            # Enter a busy loop if we have a timeout.  This busy loop was
            # cribbed from Lib/threading.py in Thread.wait() at r71065.
            delay = 0.0005 # 500 us -> initial delay of 1 ms
            while True:
                if self._waitpid_lock.acquire(False):
                    try:
                        if self.returncode is not None:
                            break  # Another thread waited.
                        (pid, sts) = self._try_wait(os.WNOHANG)
                        assert pid == self.pid or pid == 0
                        if pid == self.pid:
                            self._handle_exitstatus(sts)
                            break
                    finally:
                        self._waitpid_lock.release()
                remaining = self._remaining_time(endtime)
                if remaining <= 0:
>                   raise TimeoutExpired(self.args, timeout)
E                   subprocess.TimeoutExpired: Command '['valgrind', '-q', '--trace-children=yes', '--trace-children-skip=*python*,*bitcoin-cli*,*elements-cli*,*cln-grpc', '--error-exitcode=7', '--log-file=/tmp/ltests-tj13wttf/test_important_plugin_1/lightning-1//valgrind-errors.%p', '--read-inline-info=no', '--suppressions=/home/runner/work/lightning/lightning/tests/valgrind-suppressions.txt', '--gen-suppressions=all', 'lightningd/lightningd', '--log-level=debug', '--cltv-delta=6', '--cltv-final=5', '--watchtime-blocks=5', '--rescan=1', '--disable-dns', '--lightning-dir=/tmp/ltests-tj13wttf/test_important_plugin_1/lightning-1/', '--addr=127.0.0.1:34521', '--allow-deprecated-apis=false', '--network=regtest', '--ignore-fee-limits=false', '--bitcoin-rpcuser=rpcuser', '--bitcoin-rpcpassword=rpcpass', '--bitcoin-datadir=/tmp/ltests-tj13wttf/test_important_plugin_1/lightning-1/', '--dev-fast-gossip', '--dev-bitcoind-poll=1', '--log-file=-', '--log-file=/tmp/ltests-tj13wttf/test_important_plugin_1/lightning-1/log', '--log-prefix=lightningd-1 ', '--dev-fail-on-subdaemon-fail', '--dev-no-version-checks', '--dev-no-plugin-checksum', '--dev-no-reconnect', '--important-plugin=/home/runner/work/lightning/lightning/tests/plugins/fail_by_itself.py', '--bitcoin-rpcport=60953']' timed out after 900 seconds

900 seconds feels way too long for it to be flaky, and the BROKEN error level is emitted early on, so we may end up waiting forever.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UNUSUAL plugin-autoclean: Killing plugin: timed out before replying to init
3 participants