-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
jool causes boot to hang for five minutes #409
Comments
So the way I managed to trigger this is by starting the jool services via /etc/network/interfaces instead of directly via the system unit:
I do this to have a convenient way to start/stop jool and have a place to install routes to be redistributed into my IGP. In testing this I disabled the jool/jool_siit systemd units to make sure the DHCP script I have for dynamically updating pool4 doesn't break when jool is not loaded or no jool instance is active. So my workaround for now is to just re-enable the jool services. It seems if jool is loaded before any interfaces go up it works alright. One more, probably irrelevant, detail, I had already modified the jool units slightly: instead of After=network.target I have After=network-pre.target. Perhaps without this the bug might rear it's head again? --Daniel |
I've just observed the hang happening even when jool is modprob'ed before any interfaces come up:
--Daniel |
What's your distro?
I mean both warning and delay happen right between the
What does "it" and "boot is normal" refer to? The WARNING still happens but there is no 5 minute delay? Or the 5 minute delay still happens but the WARNING disappears?
What was the goal of this? I mean... if Jool starts before the interface starts... wouldn't Jool intercept interface setup traffic? And maybe this is what's causing the delay? Jool cannot ignore the addresses that belong to its own node if said addresses have not yet been assigned to the interface. That's how it tells them apart. Maybe it has something to do with that? Do you have a means to enable Jool's debug logging while the delay is happening?
What do you mean? There doesn't seem to be any 5 minute hang in this output. |
Hi Alberto,
Thanks for the quick response, see below.
On Tue, Jul 25, 2023 at 03:23:07PM -0700, Alberto Leiva Popper wrote:
What's your distro?
Debian. FYI: That's mentioned in the first dmesg line :)
> I don't think that's jool related.
I mean both warning and delay happen right between the `jool` modprobe
and the `jool_siit` modprobe. Jool looks pretty sus to me.
Sometimes it does, sometimes it happens way before jool loads. I think it's
a bug in the current Debian kernel revision. Happens on my workstation too
and that doesn't have jool. I just haven't gotten around to debugging that
yet :)
> If I disable the service(s) it's still happening and boot is normal.
What does "it" and "boot is normal" refer to?
The WARNING still happens but there is no 5 minute delay?
"it": the 5min delay, "normal" no delay to ability to login via console (or
otherwise). As I said, the WARNING is probably a red herring.
> One more, probably irrelevant, detail, I had already modified the jool
> units slightly: instead of After=network.target I have
> After=network-pre.target.
What was the goal of this?
If only I remembered that. I wrote the custom unit file quite a while ago
unfortunately. I've since reverted that bit see below.
I mean... if Jool starts before the interface starts... wouldn't Jool
intercept interface setup traffic? And maybe this is what's causing the
delay?
You mean stuff like DAD? Possible though I would expect to see systemd
output warning about the startup delay from the networking.service unit,
but instead the system just becomes unresponsive. Note: if I trigger a
reboot from my hypervisor during the 5min delay that does actually take
effect so the kernel isn't completely stuck so we can't rule out userspace
being at fault here.
Jool cannot ignore [the addresses that belong to its own
node](https://nicmx.github.io/Jool/en/usr-flags-denylist4.html#description)
if said addresses have not yet been assigned to the interface. That's how
it tells them apart. Maybe it has something to do with that?
Do you have a means to enable Jool's debug logging while the delay is
happening?
Sure, I already did that when debugging #396, but I'm afraid I've forgotten
how to do it. Do you have a docs link for me?
If it would help I'd be happy to give you ssh/libvirt access to a clone of
this VM or just a disk image if that's easier.
I can also just have another go at teasing out the exact conditions needed
for the hang to trigger. I have since reverted the After=network-pre.target
as well as the "starting jool from /etc/network/interfaces) thing and that
seems to have helped but only a couple reboots will really tell since
apparently it's not 100% deterministic.
> I've just observed the hang happening even when jool is modprob'ed
> before any interfaces come up:
What do you mean? There doesn't seem to be any 5 minute hang in this output.
I didn't bother waiting the full five minutes before posting, but rest
assured it was hanging there :)
…--Daniel
|
Oh, my bad I didn't read enough context. Fixed: "it": the printed WARNING. I.e. put another way if I disable jool and it's not loaded the WARNING still prints and the delay doesn't happen ergo it's (likely) unrelated to the delay. |
The normal way is
But that requires a running instance, so it's probably too late for your purposes. The other way is enabling all debug logging by default by compiling the kernel modules with
But I don't know if you can pass these kinds of flags to a Debian packaged module.
Okay. |
I compiled the modules with the I also checked One thing I note is -DDEBUG doesn't seem to enable messages like Any other ideas? --Daniel |
That's a pretty healthy indication that
If enabling the flag is too much trouble, try raising the severity of the relevant logs:
Probably mind the newline. |
Ok, I replicated the bug. Sorry for the wait. Some notes:
I think this happened because you have two versions of Jool installed: The debian package and the custom However, if you uninstall the Debian package, you lose the systemd unit files, so you have to reinstall them manually:
AND you have to rename the userspace clients in them, because the custom binaries are installed somewhere else:
Make sure the configuration files are in place...
Also force the old Jool out, and fulfill some systemd bureaucracy...
Anyway, that logs the 5-minute gap a couple seconds before the first Jool DEBUG message. I'll try researching more tomorrow. |
Found it: https://github.com/NICMx/Jool/blob/debian/debian/jool-tools.jool.service#L4 The systemd unit file states Jool's service needs to start after the network has been initialized. You're starting the Jool service during network initialization. Ergo, they deadlock until timeout. I see two solutions, and they both seem to work, but I don't know all the consequences:
Any opinions? Jool doesn't really have anything to do with the configuration of the interfaces, and can stand idle while there is no network. But NAT64 Jool requires defrag, and depending on configuration, either Jool might depend on iptables. They're both kernel modules that probably belong in the category "firewall services". |
Yeah I think network-pre.target is appropriate to use for jool. On my system it is odered like this:
So (eg) nftables will come up first then network-pre (and the jool if we use After=network-pre.target), then ifup though networking.service, all the connection managers and network.target. Looks good to me.
IIRC the kernel has a refcount for whether any module wants defrag and starts doing it if anyone requests it. iptables/nftables does this when any rules involving NAT are installed for example. If jool needs defrag (does it really?) it should likely do the same thing and not depend on module load ordering, no? FYI: Someone finally noticed the WARN flow_dissector thing: https://lore.kernel.org/lkml/e2b4ee49-0998-ff3b-f817-5076d0b43b77@gmail.com/T/ |
I'm on the same boat. I should have paid more attention to that comment of yours :/
Right.
NAT64 Jool relies on defrag because it needs to know the transport header ports of all packets. If defrag is disabled, NAT64 Jool can't access the ports of nonzero fragment-offset fragments (because the TCP/UDP header is missing).
I think you're right. My concern is that I'm not fluent with systemd's philosophy and ruleset, so I'm afraid of another deadlock. ie. Jool depends on defrag, but something on systemd forces defrag to start after wherever Jool plants itself.
Ok, then let's settle on |
Right, I was thinking in SIIT terms.
Context: I've fought my fair share of battles with systemd and am a frequent reader of its (decent) man pages. I'm not sure what you mean by "forces defrag to start". There's no service for that unless you mean firewall rules i.e nftables/iptabes.service? Very few units declare an ordering on network-pre.target so I woudn't worry too much either way. Jf it's better for jool to start after the firewall this is where we want to be anyway, since nftables.service declares Before=network-pre.target (so the order would be nftables, then jool) Honestly I'd also be fine with just leaving the jool units as they are, my config is pretty odd I'll admit : ) And this time I'll make a bloody note as to why I change my jool units to network-pre.target haha. Thanks, |
The old After=network.target causes deadlock when users attempt to start the service during network initialization. Fixes #409.
Thank you; I needed to hear this.
I think your configuration is pretty reasonable. Patch is up, I'll have it released in 4.1.11, once #410 is settled. |
Jool 4.1.11 has been released; closing. |
Hi,
I'm observing some rather weird behaviour. If I
modprobe jool
during boot the system seems to hang for about 5min and then resume booting. You'll notice the kernel throwsWARNING: CPU: 1 PID: 0 at net/core/flow_dissector.c:1016 __skb_flow_dissect+0xa91/0x1cd0
just before the hang. I don't think that's jool related. If I disable the service(s) it's still happening and boot is normal.Any idea what's going on?
Thanks,
--Daniel
The text was updated successfully, but these errors were encountered: