Jool when translating always drops the packet due to an error #382
Comments
Well Error code 1 is "Operation not permitted." This looks like an environment problem. Have you tried flushing all iptables/nftables rules in all relevant network namespaces? Obviously not permanently, but rather to check whether those are blocking the packet. |
Might want to try disabling reverse path filtering too |
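The two suggestions above can be sketched roughly as follows. This is a hedged example, not the maintainer's exact commands: which firewall tool applies depends on the system, and the flushes are meant to be temporary, for testing only (restore your rules afterwards).

```shell
# Temporarily rule out firewall interference (testing only!):
sudo nft flush ruleset            # if nftables is in use
sudo iptables -F                  # if legacy iptables is in use
sudo iptables -t nat -F
sudo iptables -t mangle -F

# Disable reverse path filtering (0 = disabled):
sudo sysctl -w net.ipv4.conf.all.rp_filter=0
sudo sysctl -w net.ipv4.conf.default.rp_filter=0
```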
Neither iptables nor nftables is installed (just a plain Debian 11 install with jool-tools and jool-dkms). I have also disabled the rp_filter, but it still happens. |
Which image did you use to install Debian? |
This one: |
Sorry for taking so long, and double sorry for not coming up with anything. I installed Debian 11 in a VirtualBox virtual machine, and I could not reproduce the problem.
My host machine:
sudo ip link set vboxnet0 up
sudo ip addr flush dev vboxnet0 scope global
sudo ip address add 2001:db8::8/96 dev vboxnet0
sudo ip link set vboxnet1 up
sudo ip addr flush dev vboxnet1 scope global
sudo ip address add 194.1.1.8/24 dev vboxnet1
sudo ip route add 64:ff9b::/96 via 2001:db8::1
My virtual machine:
sudo apt install ./jool-dkms_4.1.8-1_all.deb ./jool-tools_4.1.8-1_amd64.deb
INTERFACE1=enp0s3
INTERFACE2=enp0s8
sudo ip link set $INTERFACE1 up
sudo ip address flush dev $INTERFACE1 scope global
sudo ip address add 2001:db8::1/96 dev $INTERFACE1
sudo ip link set $INTERFACE2 up
sudo ip address flush dev $INTERFACE2 scope global
sudo ip address add 194.1.1.1/24 dev $INTERFACE2
sudo sysctl -w net.ipv4.conf.all.forwarding=1
sudo sysctl -w net.ipv6.conf.all.forwarding=1
sudo sysctl -w net.ipv4.ip_local_port_range="10000 20000"
sudo modprobe jool
sudo jool file handle jool.conf
In jool.conf, I changed
Want to discuss your network? You can find my email in my Github profile, if you prefer a less public medium. |
I have the "same" problem. Is it OK to report here or should I open a new issue?
Working:
Not working:
Everything else is identical between configs:
{
"comment": "Configuration for the systemd NAT64 Jool service.",
"instance": "nat64",
"framework": "netfilter",
"global": {
"comment": "Sample pool6 prefix",
"pool6": "64:ff9b::/96"
},
"comment": "Sample pool4 table",
"pool4": [
{
"protocol": "ICMP",
"prefix": "172.26.0.27",
"port range": "61001-65535"
}, {
"protocol": "TCP",
"prefix": "172.26.0.27",
"port range": "61001-65535"
}, {
"protocol": "UDP",
"prefix": "172.26.0.27",
"port range": "61001-65535"
}
]
} |
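As background for the pool6 values used in these configs: with a /96 prefix such as 64:ff9b::/96, the IPv4 destination is embedded in the low 32 bits of the IPv6 address (RFC 6052). A small sketch, not Jool's code, that computes the synthesized address a client would target:

```shell
# A sketch (not Jool's code) of RFC 6052 address synthesis with a /96 prefix:
# the IPv4 destination is embedded in the low 32 bits of the IPv6 address.
nat64_synth() {
    prefix=$1   # pool6 prefix without the trailing "::", e.g. "64:ff9b"
    IFS=. read -r a b c d <<EOF
$2
EOF
    printf '%s::%02x%02x:%02x%02x\n' "$prefix" "$a" "$b" "$c" "$d"
}

nat64_synth 64:ff9b 192.0.2.1   # prints 64:ff9b::c000:0201
```

So, for example, with pool6 64:ff9b::/96 a client reaches the IPv4 address 192.0.2.1 at 64:ff9b::c000:0201 (which can also be written 64:ff9b::192.0.2.1).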
Please forgive my ignorance, but I'm having a lot of trouble replicating this environment. (I dropped support for CentOS < 8 fairly long ago, to reduce the release and maintenance overhead.) It seems the distro's official gcc (4.8.5) is too old to even understand the kernel headers, and at this point I'm just guessing what you did. What's your compiler, and its version? (If you can provide entire installation instructions, that'd be better) |
Since it looks like it's going to be tricky to reproduce.
BTW: Since I'm not really sure I'm going to be able to reproduce the problem, I just uploaded a commit that prints the packet when the "Operation not permitted" error is found: 8980f79. If you can compile/install/run it and give me the output, I can analyze the packet and try to find issues with it. Feel free to censor fields, but let me know what you're censoring. Sample output:
|
I was originally trying out newer kernels on CentOS 7 because I needed a TCP BBR capable kernel, and that was easier than a full OS upgrade, but I recognize that was a nonstandard configuration and unsupportable by you and others. I have now upgraded the OS to RHEL 8 with standard RH kernel (4.18.0). Jool still doesn't work for me though. I then installed a clean RHEL 8 in a VM and tried to replicate the setup (network interfaces, sysctl knobs, iptables rules, routes), and thought I had it working in the VM with everything identical to the physical host (where it was still not working). Then tweaked some things and it stopped working in the VM too, tried to reverse what I'd tweaked and it continued to not work.
That's understandable.
This is no longer relevant, but I was using
I was thinking about "Operation not permitted" - wouldn't that normally be returned as
I'm also not sure that dst_output always returns
I will certainly put it on both my physical host and VM and try to replicate. |
After installing 8980f79 from the issue382 branch, I managed to capture a log with
I turned off
|
Good point. There don't seem to be any IPv4 output functions that return positive EPERM. Although the POSTROUTING chain is a big black box.
If they reach the network interface, yes. There's possibly a lot of code before that, is the thing.
Ok, this output is haunting. This is definitely a bug now. |
I'm still trying to recreate it in a VM to provide you with a more concrete and minimal reproducer, but I'm not getting very far - currently it is working in the VM. I have found one odd issue that has been distracting, because externally it looks like I've reproduced the issue, but actually it's something else. As soon as I load the iptables ruleset (there are no Jool-specific rules; Jool is in netfilter mode), the jool module unloads itself (or is unloaded):
Translation starts working again after reloading the module
|
That was caused by the iptables systemd unit conflicting with and stopping firewalld, which has this config option, and
|
Debug commit, second version: issue382 branch. This time, it'll print the incoming packet, the translated packet, and two additional pieces of information in the middle:
If you can provide the new output, the review should be significantly easier. |
|
Ok, the issue382 branch should work properly again. |
I'm sorry; I need to ask two very obnoxious questions. Are you completely sure that there are no stray routes in your routing table?
Is the packet really meant to depart through this interface? |
"WTF; I can't replicate anything. I literally wish my VM were more
I feel your pain. Yesterday, after much back and forth trying different things, and seemingly random things working and failing in the VM, I thought I had found a pattern (it looks obvious when written like this, but it was not obvious when trying different things minutes apart, without realising that it was not what I had done that changed the behaviour, just the passing of time/packets):
Of course today it's not behaving like that at all and I have a ping that has been running successfully via VM for nearly 2 hours. Nothing in your latest commits was expected to fix it, right? (haven't tried it on physical host yet)
I don't think they're obnoxious.
Fairly sure on the translator side; they're quite straightforward. On the client side, it is macOS, so ... who knows 😄 There's a lot of stuff in there, though I don't think any of it should be affecting this (on the client, only the 64:ff9b::/96 route is relevant, right?):
Is it the client's route table you're looking for, or the translator's? If the client's, I'll try to clean it up and send it, or find a different client to test from. Details for the translator, both physical host and VM, are below.
Censored fields: IPV6PREFIX - local IPv6 network prefix
Physical Host:
Physical Host:
Note: the 64:ff9b route here is currently pointing to the VM, but it is not when I am testing the physical host.
VM:
VM:
Yes. For the purposes of Jool/NAT64: |
I have reproduced on VM, here's an example from the logs:
All of the log entries have the same
I think I've narrowed down all the configuration required to reproduce in a VM from a clean RHEL8 install, although I haven't gone as far as to start from scratch just yet. Of course it all comes down to the one reason I wanted an updated kernel in the first place, BBR support:
Reboot after setting those; I don't think they take effect after the link is up, and I didn't figure out how to reproduce when setting these after boot (watch out for these being included in the initrd too, otherwise they might still be applied).
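Not from the original report, but for anyone trying to reproduce: the BBR-related knobs alluded to above can be inspected with standard sysctls (the names below are the usual ones; the values shown are the common BBR pairing, not necessarily this reporter's):

```shell
# Inspect the congestion-control setup commonly used with BBR:
sysctl net.ipv4.tcp_congestion_control            # e.g. "bbr"
sysctl net.ipv4.tcp_available_congestion_control  # must list "bbr"
sysctl net.core.default_qdisc                     # often "fq" alongside BBR
```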
Configure network:
(default IPv6 route discovered by RA)
Start jool:
Then ping from an IPv6 client that has a route for 64:ff9b::/96 via this VM and wait ~5 minutes:
|
This reminds me of an old troubleshooting experience in which neighbors were disappearing from the neighbor table for some reason. It'd work for a while, then the neighbor entry would expire and not get renewed automatically, perhaps because Jool was eating ARP packets. So the ping would randomly stop working. I'm still trying to replicate the bug, but maybe try monitoring that table:
$ ip neigh
192.168.1.174 dev wlp3s0 lladdr 84:16:f9:15:d5:4b STALE
192.168.1.1 dev wlp3s0 lladdr 64:09:ad:3a:fa:96 REACHABLE
fe80::e52e:76dc:64a7:cd95 dev wlp3s0 lladdr 84:16:f9:15:d5:4b STALE
The neighbor tables of the VM, host and gateway are all suspects.
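One way to do that monitoring continuously, as a sketch (this is just iproute2's event monitor; nothing here is specific to this setup):

```shell
# Log neighbor-table changes with timestamps while the ping runs:
ip monitor neigh | while read -r line; do
    printf '%s %s\n' "$(date -Is)" "$line"
done
```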
Nothing at all.
As long as it's the most specific one for the relevant packets, I think it's safe to assume so.
On the translator; I was thinking of another troubleshooting session in which Jool's routing table had a bunch of bogus routes of unknown origin, so the packet TTL'd out, bouncing pointlessly on loopback. Although every other routing table is also relevant. (But the clients' are much harder to misconfigure, I imagine.)
Oh, my network was different. Good to know. Actually, I'm getting more confused now. Please bear with my cluelessness, as I'm a developer, not a network admin/designer. It seems this is the path the packet has to travel:
What does this mean? Are you NATting VMPOOL4IP into HWPOOL4IP or something? And why? You seem to be using bridges.
What's the private network for? Something unrelated?
Back to the neighbor theory: Is enp2s0 answering ARP requests for the pool4 addresses? Some ARP tangle would explain the problem fairly well, because the routing table would yield a route for Jool to use, and then Linux would later fail to fetch the packet once it reaches layer 2. Once communication breaks, is it possible for the physical machine and/or VM to ping the gateway using VMPOOL4IP as a source? I think this would do it:
And if it doesn't work, does it also remain broken after you remove Jool?
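The check suggested above might look something like the following. This is a guess on my part, not the maintainer's elided command; VMPOOL4IP and the gateway address are placeholders:

```shell
# Ping the IPv4 gateway, forcing the pool4 address as the source.
# 172.26.0.27 stands in for VMPOOL4IP, 192.0.2.254 for the gateway.
ping -I 172.26.0.27 192.0.2.254

# Meanwhile, check whether ARP resolution for the gateway succeeded:
ip neigh show to 192.0.2.254
```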
Ok, so it looks like the suspiciously empty packet was actually a consequence of my recent brain fart. We're back to not knowing if this is a Jool bug or not; the packets look fine.
Suspicious. A TCP-specific technology, breaking ICMP traffic? Hmmmmmm |
Oh no! I've horribly confused you. It's late here, so I'm not going to reply to your whole message, but I have 2 parallel configurations I have been testing: 1 - a physical host (HW), 2 - a virtual machine (VM). They are installed in parallel on the same networks, to try to replicate the problem from HW exactly on the VM, where there will be less network traffic to make diagnosis easier (well, that was the plan). So it looks EITHER like this:
OR like this:
... depending on whether radvd is running on the physical host or the VM to advertise the route for 64:ff9b::/96 |
In the VM version, how are you connecting the interfaces to the physical network? Is it a bridge? Or the Mac is actually the host? |
Brief status update from my side: Spurred on by the comments here, I have been rebuilding parts of my network to provide more robust isolation of network segments and eliminate the possibility of ARP issues (although I don't think there were any; just to be sure...). Unfortunately it hasn't all gone to plan, and I have encountered a few other unrelated issues that will slow down diagnosis of these Jool/NAT64 issues in the short term. |
I didn't expect to end up here, however seeing |
4.1.10 released; closing. |
Hi, I have been playing with Jool a bit, but when using my own address space for NAT64 I always get this issue:
Jool NAT64/ba23ed80/default: ===============================================
Jool NAT64/ba23ed80/default: Packet: 2001:xxx:518f->2001:xxx::1f2f:4d04
Jool NAT64/ba23ed80/default: TCP 41168->80
Jool NAT64/ba23ed80/default: Step 1: Determining the Incoming Tuple
Jool NAT64/ba23ed80/default: Tuple: 2001:xxx:518f#41168 -> 2001:xxx::1f2f:4d04#80 (TCP)
Jool NAT64/ba23ed80/default: Done step 1.
Jool NAT64/ba23ed80/default: Step 2: Filtering and Updating
Jool NAT64/ba23ed80/default: BIB entry: 2001:xxx:518f#41168 - 194.1.1.1#22154 (TCP)
Jool NAT64/ba23ed80/default: Session entry: 2001:xxx:518f#41168 - 2001:xxx::1f2f:4d04#80 | 194.1.1.1#22154 - 31.47.77.4#80 (TCP)
Jool NAT64/ba23ed80/default: Done: Step 2.
Jool NAT64/ba23ed80/default: Step 3: Computing the Outgoing Tuple
Jool NAT64/ba23ed80/default: Tuple: 194.1.1.1#22154 -> 31.47.77.4#80 (TCP)
Jool NAT64/ba23ed80/default: Done step 3.
Jool NAT64/ba23ed80/default: Step 4: Translating the Packet
Jool NAT64/ba23ed80/default: Translating packet addresses 2001:xxx:518f->2001:xxx::1f2f:4d04...
Jool NAT64/ba23ed80/default: Result: 194.1.1.1->31.47.77.4
Jool NAT64/ba23ed80/default: Packet routed via device 'eth0'.
Jool NAT64/ba23ed80/default: Done step 4.
Jool NAT64/ba23ed80/default: Sending packet.
Jool NAT64/ba23ed80/default: dst_output() returned errcode 1.
Jool: Dropping packet.
When Jool is down, the server is able to reach anything in the public IPv4 space.
I have totally run out of ideas.
my jool.conf:
{
"instance": "default",
"framework": "netfilter",
"global": {
"maximum-simultaneous-opens": 1000,
"drop-externally-initiated-tcp": true,
"pool6": "2001:xxx::/96"
},
"pool4": [
{
"comment": "mark, port range and max-iterations are optional.",
"protocol": "TCP",
"prefix": "194.1.1.1",
"port range": "21001-65535"
},
{
"protocol": "UDP",
"prefix": "194.1.1.1",
"port range": "21001-65535"
},
{
"protocol": "ICMP",
"prefix": "194.1.1.1",
"port range": "21001-65535"
}
]
}
sysctl values:
net.ipv4.ip_local_port_range = 10000 20000
net.ipv4.conf.all.forwarding=1
net.ipv6.conf.all.forwarding=1
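A detail worth checking with configs like the one above (my own note, not from the thread): Jool's documentation recommends that pool4's port ranges not overlap the kernel's ephemeral port range, which is what lowering ip_local_port_range accomplishes here. A quick sanity check:

```shell
# pool4 reserves ports 21001-65535; the ephemeral range must stay clear of it.
cat /proc/sys/net/ipv4/ip_local_port_range   # here: 10000  20000
```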
Don't know if it's a bug or what.
Tried this with:
Jool 4.1.5, 4.1.8, 4.2.0-rc2
Kernels 5.10, 5.15, 5.16