
connectd hanging while being unable to connect to peers #7462

Closed
grubles opened this issue Jul 10, 2024 · 6 comments · Fixed by #7492

grubles (Contributor) commented Jul 10, 2024

Running master at 029034a. CLN can't connect to any peers and lightning_connectd seems to hang at 100% CPU.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                     
 567272 test      20   0   60672  49152  45056 R 100.0   0.1   2:17.99 lightning_connectd

When trying to shut down CLN, lightningd and lightning_connectd both hang at 100% CPU each, and the stop command also hangs indefinitely. I have to use kill to end those processes in order to restart CLN.

There doesn't seem to be anything useful in debug.log to share.

CLN config includes:

experimental-dual-fund
experimental-splicing
experimental-offers
experimental-peer-storage
experimental-quiesce

On a different machine, I am able to reproduce this without those experimental config options.

kilrau commented Jul 15, 2024

So it looks like #7365 didn't fix anything for you...

grubles (Contributor, Author) commented Jul 15, 2024

This is more of a medium-sized node and wasn't running into the CPU usage that PR addresses, so I'm not sure. Also, the other machine I tested on has a single signet channel and was experiencing the issue described above.

kilrau commented Jul 15, 2024

OK, something different then; we'll go and test #7365.

michael1011 commented Jul 15, 2024

I can reproduce that problem on a fresh, new node:

  1. Create a new node with latest master
  2. Connect to some peers
  3. Watch connectd spike to 100% CPU

For convenience, I created a little script to reproduce this.
Run lightning-cli listnodes > nodes.json and then this Python script to connect to some nodes, and you'll see connectd go wild:

#!/usr/bin/env python3
import json
import subprocess

with open('nodes.json') as f:
    nodes = json.load(f)["nodes"]

print(f"Got {len(nodes)} nodes")

with_address = []

for node in nodes:
    if "addresses" not in node or len(node["addresses"]) == 0:
        continue

    with_address.append(node)

print(f"{len(with_address)} with address")

ipv4 = []

for node in with_address:
    for address in node["addresses"]:
        if address["type"] != "ipv4":
            continue

        ipv4.append(f"{node['nodeid']}@{address['address']}:{address['port']}")

print(f"{len(ipv4)} with IPV4 address")

for (i, address) in enumerate(ipv4):
    print(f"Connecting to {i+1}/{len(ipv4)}: {address}")
    res = subprocess.Popen(
        f"timeout 10 lightning-cli connect {address}",
        shell=True,
        stdout=subprocess.PIPE,
    ).stdout.read()
    try:
        print(json.dumps(
            json.loads(res),
            indent=4,
        ))
    except json.JSONDecodeError:
        print("Connect timed out")

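(Note on the script: each connect runs under timeout 10, so if connectd is wedged the lightning-cli call is killed after 10 seconds and typically produces no JSON output; the failed JSON parse is then reported as a timeout.)
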
Edit:

This is definitely a regression since v24.05. I created a new node with v24.05 and ran the script; it was just fine. Updated to master, ran it again and connectd jumped to 100% CPU before it even connected to the first peer.


hMsats (Contributor) commented Jul 24, 2024

Can confirm the original post.

Channel: main node <-> test node

v24.05 <-> v24.05: no problems

v24.05 <-> master: same problems

When I return to v24.05, everything is fine again.

endothermicdev added this to the v24.08 milestone Jul 24, 2024
rustyrussell added a commit to rustyrussell/lightning that referenced this issue Jul 24, 2024
If we need to iterate forward to find a timestamp (only happens if we have gossip older than
2 hours), we didn't exit the loop, as it didn't actually move the offset.

Fixes: ElementsProject#7462
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
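
The commit message above describes the bug class: a forward scan whose exit condition depends on an offset advancing, while the loop body never actually advances it. Below is a minimal, illustrative C sketch of that pattern, assuming hypothetical names (struct rec, skip_old_records); it is not the actual gossmap.c code, just a demonstration of why the process spins at 100% CPU and why a one-line fix is enough.

/* Illustrative sketch only -- hypothetical names, not the actual gossmap.c code. */
#include <stdint.h>
#include <stdio.h>

struct rec {
    uint32_t timestamp; /* gossip record timestamp */
};

/* Skip records older than `cutoff`, returning the index of the first
 * recent-enough record (or `len` if there is none). */
static size_t skip_old_records(const struct rec *recs, size_t len,
                               size_t offset, uint32_t cutoff)
{
    while (offset < len && recs[offset].timestamp < cutoff) {
        /* Buggy version: nothing here moved `offset`, so the condition never
         * changed and the loop never exited -- 100% CPU, the symptom reported
         * in this issue.  The one-line fix is simply to advance the offset. */
        offset++;
    }
    return offset;
}

int main(void)
{
    const struct rec recs[] = { {100}, {200}, {300}, {400} };
    size_t first = skip_old_records(recs, 4, 0, 250);
    printf("first recent record at index %zu\n", first);
    return 0;
}
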
hMsats (Contributor) commented Jul 24, 2024

Added the one-line change from the pull request to gossmap.c and it solved the issue for me!
