nf_defrag_ipv4 and nf_defrag_ipv6

Alberto Leiva Popper edited this page Apr 1, 2015 · 1 revision

nf_defrag_ipv4 and nf_defrag_ipv6 are Netfilter's defragmentation modules. I generally perceive the two of them as a single logical entity, which I call "defrag" (that thingo which "defragments" before Jool).

Because Stateful NAT64 needs to mangle layer 4 information, it needs some level of defragmentation. Therefore, modprobe automatically inserts defrag before inserting Jool. The user doesn't need to do anything for this to happen (conversely, modprobe -r automatically removes defrag after removing Jool - assuming Jool is the only module left which was using defrag).

SIIT doesn't want defragmentation:

  • For one thing, SIIT doesn't mangle layer-4 information (except for checksum, which doesn't matter), so defragmentation is redundant.
  • Defragmentation is stateful. This makes Stateless IP/ICMP Translation inherit some of Stateful NAT64's drawbacks:
    • Statefulness is a security risk because it forces the translator to hold on to memory. Because fragments are fairly big, this memory quickly adds up.
    • Defragmentation inhibits redundancy, because it forces all fragments of a common packet to traverse the same (defragmenting) gateway.

When running SIIT, it's wise to throw defrag out of the way.
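On an SIIT-only box, the straightforward way to do that is to remove the modules. This is just a sketch; it only works if nothing else (nf_conntrack, for example) still holds a reference to them:

```shell
# Remove defrag. modprobe -r fails (harmlessly) if another module
# still uses it; check lsmod's third column first -- it must be 0
# for both modules.
sudo modprobe -r nf_defrag_ipv4
sudo modprobe -r nf_defrag_ipv6
lsmod | grep defrag   # should now print nothing
```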

A kernel module cannot depend on another module in some situations and not in others; dependencies are defined at compile time. This is the one and only reason why SIIT Jool and NAT64 Jool are separate binaries: NAT64 Jool needs defrag, but SIIT Jool should work without it.

That said, while Stateful Jool enforces its dependency on defrag, the kernel doesn't provide a way for SIIT Jool to enforce defrag's absence. Other modules (besides NAT64 Jool) that the user might modprobe can activate defrag. If SIIT Jool is operating in such a situation, the drawbacks will kick in.

One can check the presence of defrag by querying lsmod:

$ lsmod | grep defrag
$ modprobe nf_defrag_ipv6
$ modprobe nf_defrag_ipv4
$ lsmod | grep defrag
nf_defrag_ipv6         34768  0 
nf_defrag_ipv4         12758  0 

The following output tells us NAT64 Jool is currently using defrag (the fourth column lists dependents):

$ lsmod | grep jool
jool                  152517  0 
nf_defrag_ipv6         34768  1 jool
nf_defrag_ipv4         12758  1 jool

This is undesirable; it means you're running SIIT and defragmentation at the same time:

$ lsmod | grep "defrag\|jool"
jool_siit              97050  0 
nf_defrag_ipv4         12758  0 
nf_defrag_ipv6         34768  0 

One way to accidentally insert defrag is by using iptables' state match (which loads nf_conntrack, which in turn needs defrag):

$ lsmod | grep defrag
$ sudo iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
$ lsmod | grep defrag
nf_defrag_ipv4         12758  1 nf_conntrack_ipv4

I guess this information should actually belong in the manual and not here. From this point on, I will talk about how defrag affects the code.

Implementation gotchas

While Linux's representation of fragments is a fine mess, the main pitfall is that it works differently depending on which kernel version you're on.

You can read about the kernel's intended representation of fragments in chapter 21 of Understanding Linux Network Internals. You can also infer a good chunk of it from this document.

The following sections will talk about defrag's behaviour on different kernels.

Here's a little local glossary:

  • First fragment: Fragment whose fragment offset is zero.
  • Subsequent fragment: Fragment whose fragment offset is not zero.
  • sk_buff: The Linux kernel's structure representing packets. Instances of this structure are usually called "skb".

Common behaviour

Except for the differences noted below, the variants tend to behave as follows:

  • defrag sits in Netfilter's PREROUTING chain.
  • defrag steals and groups the fragments, "assembling" the original intended packet.
    • defrag does not actually create a new sk_buff where each fragment's bytes are copied. Instead, the fragments are listed. That's why I quoted "assembling".
    • Depending on kernel version and defrag's protocol, the list is either skb_shinfo(skb)->frag_list or skb_shinfo(skb)->frags (where skb is the first fragment's sk_buff). As far as I can read, the correct version is frag_list. frags is actually intended for paging, and I guess Linux sometimes misuses it out of laziness. (Or cleverness. But it is annoying.)
    • Packets in the list are sorted by fragment offset.
    • In short, the first fragment is skb. The next fragment is (usually) skb_shinfo(skb)->frag_list. The next fragment is skb_shinfo(skb)->frag_list->next. The next fragment is skb_shinfo(skb)->frag_list->next->next. And so on until ->next yields NULL. As far as I can tell, ->prev is never used.
  • Once the packet is complete, defrag feeds skb (i.e., the first fragment) to the Netfilter chain.
    • This means that defrag kind of looks like a black hole step where several packets enter and only some survive. Of course, in reality, defrag is fetching all the fragments; it's just that most of them are invisible inside some other packet's list.
  • Many kernel functions painstakingly attempt to make skb look like a single/full packet.
    • If you access the list directly, you can still find each fragment's content and lengths.
    • However, layer-3-and-below headers of all the subsequent fragments are generally assumed to be lost.
      • In the case of frag_list, skb->data of all subsequent fragments point to the (network header's) payload. Technically, you can still use skb_network_header() to access the network header, but apparently only if the kernel was compiled with NET_SKBUFF_DATA_USES_OFFSET OFF. This is because the core skb operations (skb_pull(), skb_push(), etc) do not update offsets.
      • In the case of frags, only the network header's payload is paged. Apparently, the network header is completely ignored.
    • The net result is that any differences among the original fragments' network headers can be lost. If Linux needs to forward the fragments, each header might be rebuilt from the first one.
    • Also, and in other words, you should never attempt to extract layer-3 data from a subsequent fragment.
  • Somewhat as a consequence of the previous point, some header data is actually corrupted. This is one of the reasons why I've learned to hate atomic fragments so much.
    • In IPv4,
      • DF is turned off. Unfortunately, this collides with RFC 6145 because the atomic fragments logic infers information from this flag.
        • Fortunately, packets defrag doesn't need to mangle do not suffer from this quirk.
          • This means only fragmented packets with DF ON will get affected. It doesn't seem these packets are normal or useful Internet traffic so we should probably not care about this.
      • MF is turned off.
        • Again, this is because it's trying to fool you into thinking it's a full packet.
      • Total Length now includes the rest of the fragments' payload.
        • Idem.
      • I don't remember if the checksum is updated to reflect these changes. Probably so.
    • In IPv6,
      • The Fragment header is literally eradicated. The headers that preceded it are shifted 8 bytes to the right, because logic.
        • This means we lose the fragment Identification.
          • It's OK, though. We can generate a new Identification because we do have all the fragments.
      • Payload Length now includes the rest of the fragments' payload.
    • It's not too bad. You don't have to worry about this when computing the outgoing Total Length from the incoming Payload length, for example. You're converting a length that misleadingly embraces all fragments into another length that also embraces all fragments in the same way. However, it is another layer of complexity added on top of the IP protocols, and it does mean you have to know what you're doing before directly manipulating these fields.
  • Of course, defrag also updates the first fragment's lengths (skb->len, skb->data_len and skb->truesize). See chapter 21 of the book.

Here's an example of some of defrag's work. If I send three fragments of a common packet like this:

Fig.1 - Input to defrag6

defrag6 will pervert them into this:

Fig.2 - defrag6's output

"nh" is skb_network_header() and "th" is skb_transport_header(). They assume NET_SKBUFF_DATA_USES_OFFSET is OFF. Even though I've pictured them for the sake of clarity, you should probably not access these values in subsequent fragments.

Notice the skb->data pointer almost goes bananas.

nf_defrag_ipv6 - kernels 3.13+

The fragments are stored in skb_shinfo(skb)->frag_list. frags is always empty (assuming no paging, I guess).

When forwarding:

  • If skb_linearize() is not called, Linux is capable of reverting the frag_list hack, and sniffing the network shows the original fragments. All that might change is the fragment order.
  • If skb_linearize() is called after defrag, the fragments are irreversibly fused; the original fragment lengths are lost. Assuming the packet fits, sniffing only sees one packet.

nf_defrag_ipv4 - kernels 3.12-

The fragments are stored in skb_shinfo(skb)->frag_list. frags is always empty (assuming no paging, I guess).

Linux joins the fragments before throwing them to the network. When this happens, DF is lost; it always becomes zero.

skb_linearize() does not seem to affect anything.

nf_defrag_ipv4 - kernels 3.13+

The fragments are stored in skb_shinfo(skb)->frags. frag_list seems to always be empty.

Linux joins the fragments before throwing them to the network. DF isn't lost.

skb_linearize() does not seem to affect anything.

nf_defrag_ipv6 - kernels 3.12-

This is the funny guy.

This is what defrag6 does when it has fragments to handle:

  • It waits until all of them have arrived, storing them.
  • Once all the fragments are available, it fetches them separately (and sorted by fragment offset).
  • frag_list and frags are never affected.
  • No pointers or header fields are edited.

I have to say, even though it's the only one that behaves differently, it's the least intrusive one, because it doesn't corrupt anything (other than fragment order) and it's super easy to explain.

However, we don't want to make a special case out of it throughout the entire codebase, so Jool has a "fragment database". All it does is wait until defrag6 has fetched all the fragments, and then apply the other defrags' hacks to them. This is done very early in Jool's translation pipeline.