
connectd hanging while being unable to connect to peers #7462

Closed
grubles opened this issue Jul 10, 2024 · 6 comments · Fixed by #7492

grubles (Contributor) commented Jul 10, 2024

Running master at 029034a. CLN can't connect to any peers and lightning_connectd seems to hang at 100% CPU.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                     
 567272 test      20   0   60672  49152  45056 R 100.0   0.1   2:17.99 lightning_connectd

When trying to shut down CLN, lightningd and lightning_connectd both hang at 100% CPU each, and the stop command also hangs indefinitely. I have to use kill to end those processes in order to restart CLN.

There doesn't seem to be anything useful in debug.log to share.

CLN config includes:

experimental-dual-fund
experimental-splicing
experimental-offers
experimental-peer-storage
experimental-quiesce

On a different machine, I am able to reproduce this without those experimental config options.

kilrau commented Jul 15, 2024

So it looks like #7365 didn't fix anything for you...

grubles (Contributor, Author) commented Jul 15, 2024

This is more of a medium-sized node and wasn't running into the CPU usage that PR addresses, so I'm not sure. Also, the other machine I tested on has a single signet channel and was experiencing the issue described above.

kilrau commented Jul 15, 2024

OK, something different then; we'll go and test #7365.

michael1011 commented Jul 15, 2024

I can reproduce that problem on a fresh, new node:

  1. Create a new node with latest master
  2. Connect to some peers
  3. Watch connectd spike to 100% CPU

For convenience, I created a little script to reproduce this.
Run lightning-cli listnodes > nodes.json and then this Python script to connect to some nodes, and you'll see connectd go wild:

#!/usr/bin/env python3
import json
import subprocess

with open('nodes.json') as f:
    nodes = json.load(f)["nodes"]

print(f"Got {len(nodes)} nodes")

with_address = []

for node in nodes:
    if "addresses" not in node or len(node["addresses"]) == 0:
        continue

    with_address.append(node)

print(f"{len(with_address)} with address")

ipv4 = []

for node in with_address:
    for address in node["addresses"]:
        if address["type"] != "ipv4":
            continue

        ipv4.append(f"{node['nodeid']}@{address['address']}:{address['port']}")

print(f"{len(ipv4)} with IPV4 address")

for (i, address) in enumerate(ipv4):
    print(f"Connecting to {i+1}/{len(ipv4)}: {address}")
    res = subprocess.Popen(
        f"timeout 10 lightning-cli connect {address}",
        shell=True,
        stdout=subprocess.PIPE,
    ).stdout.read()
    try:
        print(json.dumps(
            json.loads(res),
            indent=4,
        ))
    except json.JSONDecodeError:
        print("Connect timed out")

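(Note on the script: each connect runs under timeout 10, so if connectd is wedged the lightning-cli call is killed after 10 seconds and typically produces no JSON output; the failed JSON parse is then reported as a timeout.)
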
Edit:

This is definitely a regression since v24.05. I created a new node with v24.05 and ran the script; it was just fine. Updated to master, ran it again and connectd jumped to 100% CPU before it even connected to the first peer.


hMsats (Contributor) commented Jul 24, 2024

Can confirm the original post.

Channel: main node <-> test node

v24.05 <-> v24.05: no problems

v24.05 <-> master: same problems

When I return to v24.05, everything is fine again.

endothermicdev added this to the v24.08 milestone Jul 24, 2024
rustyrussell added a commit to rustyrussell/lightning that referenced this issue Jul 24, 2024
If we need to iterate forward to find a timestamp (only happens if we have gossip older than
2 hours), we didn't exit the loop, as it didn't actually move the offset.

Fixes: ElementsProject#7462
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
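
The commit message above describes the bug class: a forward scan whose exit condition depends on an offset advancing, while the loop body never actually advances it. Below is a minimal, illustrative C sketch of that pattern, assuming hypothetical names (struct rec, skip_old_records); it is not the actual gossmap.c code, just a demonstration of why the process spins at 100% CPU and why a one-line fix is enough.

/* Illustrative sketch only -- hypothetical names, not the actual gossmap.c code. */
#include <stdint.h>
#include <stdio.h>

struct rec {
    uint32_t timestamp; /* gossip record timestamp */
};

/* Skip records older than `cutoff`, returning the index of the first
 * recent-enough record (or `len` if there is none). */
static size_t skip_old_records(const struct rec *recs, size_t len,
                               size_t offset, uint32_t cutoff)
{
    while (offset < len && recs[offset].timestamp < cutoff) {
        /* Buggy version: nothing here moved `offset`, so the condition never
         * changed and the loop never exited -- 100% CPU, the symptom reported
         * in this issue.  The one-line fix is simply to advance the offset. */
        offset++;
    }
    return offset;
}

int main(void)
{
    const struct rec recs[] = { {100}, {200}, {300}, {400} };
    size_t first = skip_old_records(recs, 4, 0, 250);
    printf("first recent record at index %zu\n", first);
    return 0;
}
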
hMsats (Contributor) commented Jul 24, 2024

Added the one-line change from the pull request to gossmap.c and it solved the issue for me!
