L3 Tunneling

Ido Schimmel edited this page Dec 12, 2022 · 10 revisions
Table of Contents
  1. Introduction
    1. Topology
  2. Overlay Configuration
  3. Tunnel Configuration
    1. GRE in Main VRF
    2. General GRE Configuration
  4. Decap-only Tunnels
  5. Configuration Changes
  6. Features and Limitations
    1. Since Linux 6.2
    2. Since Linux 5.16
    3. Since Linux 5.5
    4. Since Linux 4.15
    5. Since Linux 4.14
  7. Further Resources

Introduction

Since L3 tunneling is fundamentally a routing technology, the switch where tunnels are to be configured needs to have routing enabled. See Static Routing for more details.

Topology

In the abstract, the reason to create an IP-in-IP tunnel is to connect two IP networks separated by another IP network. In the example here, the two domains to be connected are represented by two hosts with the arbitrarily chosen addresses 192.168.1.33 and 192.168.2.33, respectively. Each host is connected to a tunnel endpoint, which wraps up the host traffic and delivers it through a tunnel to the other endpoint. The endpoints are addressed 1.2.3.4 and 1.2.3.5 (that is, 1.2.3.4/31). The encapsulated traffic travels over a transport network, here addressed 192.168.99.0/24.

In tunneling parlance, the traffic flowing between the two separated IP domains is called overlay traffic, and the network where it flows is the overlay network. The encapsulated traffic, on the other hand, is called underlay traffic, and the network where it flows is the underlay network.

+--------------+         +--------------+
|              |         |              |
|    host1     |         |    host2     |
|              |         |              |
| 192.168.1.33 |         | 192.168.2.33 |
|      +       |         |      +       |
|      |       |         |      |       |
+--------------+         +--------------+
       |                        |
+--------------+         +--------------+
|      |       |         |      |       |
|      +       |         |      +       |   Overlay
| 192.168.1.1  |         | 192.168.2.1  | - - - - - -
|              |         |              |   Underlay
|   switch1    |         |   switch2    |
|              |         |              |
|   1.2.3.4    |         |   1.2.3.5    |
|      +       |         |      +       |
|      |       |         |      |       |
| 192.168.99.1 |         | 192.168.99.2 |
|      +       |         |      +       |
|     | |      |         |     | |      |
+--------------+         +--------------+
      | |______________________| |
      '--------------------------'

The switch, as a tunneling gateway, naturally handles both overlay and underlay traffic. Both can be in the same VRF (possibly the default one), or each can be in a different VRF. See below for details of each of these configurations.

Currently, mlxsw offloads GRE tunnels, but not all possible configurations are supported. Refer to Features and Limitations for the list of constraints that the tunnel needs to satisfy to be offloaded.

Besides setting up a tunnel device, one also needs a local route matching the tunnel's local address, which is offloaded to decapsulate packets, and possibly one or more routes that direct traffic to the tunnel, which are offloaded to encapsulate packets.

Features by Version

Kernel Version   Feature
4.15             Offload GRE tunnels.
5.1              Spectrum-2 support.
5.16             Offload GRE6 for Spectrum-2 and above.
6.2              Offload GRE6 for Spectrum-1.

Overlay Configuration

First, set up the connection to the local overlay network and a route for tunneling traffic destined for the remote overlay network (from host1's point of view, 192.168.2.0/24):

host1 $ ip link set dev eth0 up
host1 $ ip address add dev eth0 192.168.1.33/24
host1 $ ip route add 192.168.2.0/24 via 192.168.1.1
host2 $ ip link set dev eth0 up
host2 $ ip address add dev eth0 192.168.2.33/24
host2 $ ip route add 192.168.1.0/24 via 192.168.2.1

On the switch, set up the overlay interface accordingly:

sw1 $ ip link set dev sw1p49 up
sw1 $ ip address add dev sw1p49 192.168.1.1/24
sw2 $ ip link set dev sw1p49 up
sw2 $ ip address add dev sw1p49 192.168.2.1/24

Tunnel Configuration

You need a GRE module in order to set up GRE tunnels:

sw $ modprobe ip-gre

There are two main ways to set up a GRE tunnel endpoint. If the tunnel is not bound to another device, the underlay is always in the main VRF. If it is bound to a device, the underlay is in the VRF where that device is.

The following sections describe how to set up first a simple case, where both overlay and underlay are in the main VRF, and then a general case, where they are possibly separate.

GRE in Main VRF

In this configuration, overlay and underlay traffic are both in the main VRF:

   +------------------( switch )-------------------+
   |                                               |
   |   overlay          GRE         transport      |
---|-+ 192.168.1.1      1.2.3.4 +-- 192.168.99.1 +=|===
   |                                               |
   +-----------------------------------------------+

First, set up the tunnel itself:

sw1 $ ip tunnel add name g mode gre local 1.2.3.4 remote 1.2.3.5 tos inherit
sw1 $ ip link set dev g up
sw1 $ ip address add dev g 1.2.3.4/32
sw2 $ ip tunnel add name g mode gre local 1.2.3.5 remote 1.2.3.4 tos inherit
sw2 $ ip link set dev g up
sw2 $ ip address add dev g 1.2.3.5/32

Or, if you want to use GRE keys:

sw1 $ ip tunnel add name g mode gre local 1.2.3.4 remote 1.2.3.5 tos inherit \
         key 123

Or:

sw1 $ ip tunnel add name g mode gre local 1.2.3.4 remote 1.2.3.5 tos inherit \
         ikey 456 okey 789
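
The two endpoints use mirrored commands that differ only in which address is local. As a minimal sketch, the hypothetical helper `gre_cmds` below (not part of iproute2 or mlxsw) prints the commands from this section for one endpoint, with an optional GRE key. It prints rather than executes, so the output can be reviewed before it is run on a switch.

```shell
#!/bin/sh
# gre_cmds NAME LOCAL REMOTE [KEY]
# Hypothetical helper: print (do not run) the commands that set up one
# GRE endpoint in the main VRF, as shown in this section.
gre_cmds() {
    name=$1 local_ip=$2 remote_ip=$3 key=${4:-}
    cmd="ip tunnel add name $name mode gre local $local_ip remote $remote_ip tos inherit"
    # A single "key" sets both the input and the output key.
    if [ -n "$key" ]; then
        cmd="$cmd key $key"
    fi
    echo "$cmd"
    echo "ip link set dev $name up"
    echo "ip address add dev $name $local_ip/32"
}

# Commands for sw1, then the mirrored set for sw2.
gre_cmds g 1.2.3.4 1.2.3.5
gre_cmds g 1.2.3.5 1.2.3.4
```

Passing a fourth argument, for example `gre_cmds g 1.2.3.4 1.2.3.5 123`, appends `key 123` to the tunnel command.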

Note that the tunnel remote address must be reachable from this node. For example:

sw1 $ ip link set dev sw1p51 up
sw1 $ ip address add dev sw1p51 192.168.99.1/24
sw1 $ ip route add 1.2.3.5/32 via 192.168.99.2
sw2 $ ip link set dev sw1p51 up
sw2 $ ip address add dev sw1p51 192.168.99.2/24
sw2 $ ip route add 1.2.3.4/32 via 192.168.99.1
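
One way to confirm reachability is `ip route get`, which reports the route the kernel would pick for a destination. The sketch below defines a hypothetical helper, `egress_dev`, that extracts the egress device from that output. The sample line fed to it is illustrative and was not captured from a real switch.

```shell
#!/bin/sh
# egress_dev: read `ip route get ADDR` output on stdin and print the name
# that follows the "dev" keyword on the first line that has one.
egress_dev() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "dev") { print $(i + 1); exit }
    }'
}

# Illustrative input; on sw1 one would run: ip route get 1.2.3.5 | egress_dev
echo "1.2.3.5 via 192.168.99.2 dev sw1p51 src 192.168.99.1 uid 0" | egress_dev
```

An empty result means that no route with an egress device was reported for the remote address.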

At this point, it is possible to direct traffic at the tunnel:

sw1 $ ip route add 192.168.2.0/24 dev g
sw1 $ ip route add 2001:db8:2::/56 dev g
sw2 $ ip route add 192.168.1.0/24 dev g
sw2 $ ip route add 2001:db8:1::/56 dev g

To verify that the individual routes have been offloaded:

sw $ ip route show table local dev g
local 1.2.3.4 dev g proto kernel scope host src 1.2.3.4 offload
sw $ ip route show dev g
192.168.2.0/24 scope link offload
sw $ ip -6 route show dev g
2001:db8:2::/56 metric 1024 offload pref medium
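
With many routes, scanning the output by eye becomes tedious. As a sketch, the hypothetical filter `not_offloaded` below prints only the routes that lack the `offload` flag. It matches the literal word `offload` as it appears in the output above, so other flags that an iproute2 version might print are not interpreted.

```shell
#!/bin/sh
# not_offloaded: read `ip route show` output on stdin, print every route
# that lacks the literal "offload" flag, and return non-zero if any did.
not_offloaded() {
    awk '
        BEGIN { bad = 0 }
        NF == 0 { next }            # ignore blank lines
        {
            flagged = 0
            for (i = 1; i <= NF; i++)
                if ($i == "offload") flagged = 1
            if (!flagged) { print; bad = 1 }
        }
        END { exit bad }
    '
}

# Intended use on a switch (not executed here):
#   ip route show dev g | not_offloaded || echo "some routes are not offloaded"
```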

General GRE Configuration

A tunnel that is bound to another device has overlay in the VRF where the tunnel is, and underlay where the device that it is bound to is. Typically the underlay would be a different VRF than the one with the GRE netdevice itself, but it does not have to be.

Note: Bind devices are offloaded correctly only when their master is a VRF device. In that case, the bind device is only used to select the VRF to use for underlay traffic. In the main VRF, the bind device instead selects the interface through which encapsulated traffic egresses. mlxsw does not recognize that use: it assumes that a bind device always just selects an underlay VRF, even when the bind device is in the main VRF. That is why this tutorial uses a dummy device; it is the only device that makes sense as an anchor to select a VRF.

This is what the set-up looks like:

   +------------------( switch )-------------------+
   |                                               |   <-- VRF ol
   |   overlay           GRE                       |
---|-+ 192.168.1.1        ^                        |
   |                      |                        |
   | - - - - - - - - - - -|- - - - - - - - - - - - |
   |                      v                        |   <-- VRF ul
   |                    dummy       transport      |
   |                    1.2.3.4 +-- 192.168.99.1 +=|===
   |                                               |
   +-----------------------------------------------+

First, create the VRFs themselves. For more details on that, see Virtual Routing and Forwarding (VRF):

sw $ ip link add name ol type vrf table 10
sw $ ip link set dev ol up
sw $ ip link add name ul type vrf table 20
sw $ ip link set dev ul up

Next, create a dummy device that is used to select the underlay VRF:

sw1 $ ip link add name d type dummy
sw1 $ ip link set dev d master ul
sw1 $ ip link set dev d up
sw1 $ ip address add dev d 1.2.3.4/32
sw2 $ ip link add name d type dummy
sw2 $ ip link set dev d master ul
sw2 $ ip link set dev d up
sw2 $ ip address add dev d 1.2.3.5/32

Now create a tunnel, binding it to the dummy:

sw1 $ ip tunnel add name g mode gre local 1.2.3.4 remote 1.2.3.5 dev d tos inherit
sw1 $ ip link set dev g master ol
sw1 $ ip link set dev g up
sw2 $ ip tunnel add name g mode gre local 1.2.3.5 remote 1.2.3.4 dev d tos inherit
sw2 $ ip link set dev g master ol
sw2 $ ip link set dev g up

You can of course set an input and/or output GRE key as shown in the section on the main VRF.

At this point, it is possible to direct traffic at the tunnel:

sw1 $ ip route add vrf ol 192.168.2.0/24 dev g
sw1 $ ip route add vrf ol 2001:db8:2::/56 dev g
sw2 $ ip route add vrf ol 192.168.1.0/24 dev g
sw2 $ ip route add vrf ol 2001:db8:1::/56 dev g

Also remember to put the ports that connect to the overlay and underlay networks in the correct VRFs. For example:

sw $ ip link set dev sw1p49 master ol
sw $ ip link set dev sw1p51 master ul
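
Verification in this set-up needs the VRF tables rather than the main table. The sketch below prints the checks one might run, and executes them instead when `DRY_RUN=0` is set. It assumes table 20 for the underlay VRF as chosen above, and that the decapsulating local route appears in the underlay VRF's table, which is where Linux installs local routes for addresses on enslaved devices. The `run` and `verify_vrf_tunnel` helpers are hypothetical.

```shell
#!/bin/sh
DRY_RUN=${DRY_RUN:-1}   # 1: only print the commands; 0: execute them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# verify_vrf_tunnel UNDERLAY_TABLE OVERLAY_VRF TUNNEL_DEV
verify_vrf_tunnel() {
    ul_table=$1 ol_name=$2 tun=$3
    run ip route show table "$ul_table" type local   # decapsulating local route
    run ip route show vrf "$ol_name" dev "$tun"      # IPv4 encapsulating routes
    run ip -6 route show vrf "$ol_name" dev "$tun"   # IPv6 encapsulating routes
}

verify_vrf_tunnel 20 ol g
```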

Decap-only Tunnels

Tunnel decap is offloaded as soon as there is a local route matching the local address of a tunnel. However, in the slow path, if the decapsulated packets are to be forwarded to hosts, one of the following conditions needs to hold:

  • There actually needs to be a corresponding route that would direct traffic from those hosts to the tunnel device (i.e. an encapsulating route)
  • Reverse path filtering needs to be disabled:
    sysctl -w net.ipv4.conf.all.rp_filter=0
    
  • The decapsulated traffic needs to be IPv6

mlxsw ignores the rp_filter setting and offloads as if it were disabled. This might create a discrepancy between how slow path and fast path packets are processed.

Another possibility to create a decap-only tunnel is to actually introduce the encapsulating routes, but set the bind device down. In that scenario, Linux (and mlxsw) does not forward encapsulated traffic, but the existence of the route makes the reverse path filtering work.
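
When disabling reverse path filtering, keep in mind that Linux applies the maximum of the `all` value and the per-interface value when validating the source on an interface. Setting only `net.ipv4.conf.all.rp_filter=0` therefore leaves filtering active on any interface whose own value is non-zero. A minimal sketch of that rule follows; `effective_rp_filter` is a hypothetical helper, and the device name `g` in the usage comment is the tunnel device from the examples above.

```shell
#!/bin/sh
# effective_rp_filter ALL_VALUE IFACE_VALUE
# Print the value the kernel applies on an interface: the maximum of
# conf/all/rp_filter and conf/<iface>/rp_filter.
effective_rp_filter() {
    if [ "$1" -gt "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

# On a switch the inputs could be read with, for example:
#   effective_rp_filter "$(sysctl -n net.ipv4.conf.all.rp_filter)" \
#                       "$(sysctl -n net.ipv4.conf.g.rp_filter)"
effective_rp_filter 0 1   # all=0, interface=1: filtering is still applied
```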

Features and Limitations

Only tunnels satisfying the following conditions are offloaded:

  • Only GRE tunnels
  • Both local and remote addresses shall be given (NBMA tunnels and LWT are currently not supported)
  • TTL and TOS shall both be inherit (note that in Linux the default TTL value for IPv6 tunnels is 64, unlike IPv4 tunnels where it is inherit by default. TOS inherit is not a default setting in Linux for either tunnel type)
  • No two tunnels that share underlay VRF shall share a local address (i.e. dispatch based on tunnel key is not supported)
  • Sequence numbers and checksumming shall not be used

The tunnel may have i-key and/or o-key set, and if it has both, the two may differ.
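
The per-tunnel conditions above can be expressed as a simple pre-flight check. The sketch below is a hypothetical helper, not something mlxsw or iproute2 provides: it takes the tunnel parameters as `key=value` arguments instead of parsing tool output, and it does not cover the condition on shared local addresses, which depends on the other tunnels present.

```shell
#!/bin/sh
# check_offloadable KEY=VALUE...
# Hypothetical checker for the per-tunnel conditions listed above.
# Recognised keys: mode, local, remote, ttl, tos, seq (yes/no), csum (yes/no)
# Prints one line per violated condition; returns non-zero if any.
check_offloadable() {
    mode= local_ip= remote_ip= ttl= tos= seq=no csum=no
    for arg in "$@"; do
        case $arg in
            mode=*)   mode=${arg#mode=} ;;
            local=*)  local_ip=${arg#local=} ;;
            remote=*) remote_ip=${arg#remote=} ;;
            ttl=*)    ttl=${arg#ttl=} ;;
            tos=*)    tos=${arg#tos=} ;;
            seq=*)    seq=${arg#seq=} ;;
            csum=*)   csum=${arg#csum=} ;;
        esac
    done
    rc=0
    case $mode in
        gre|ip6gre) ;;
        *) echo "mode '$mode' is not a GRE tunnel"; rc=1 ;;
    esac
    [ -n "$local_ip" ]   || { echo "local address missing"; rc=1; }
    [ -n "$remote_ip" ]  || { echo "remote address missing"; rc=1; }
    [ "$ttl" = inherit ] || { echo "ttl must be inherit"; rc=1; }
    [ "$tos" = inherit ] || { echo "tos must be inherit"; rc=1; }
    [ "$seq" = no ]      || { echo "sequence numbers must not be used"; rc=1; }
    [ "$csum" = no ]     || { echo "checksumming must not be used"; rc=1; }
    return $rc
}

# The IPv4 tunnel from this page: ttl defaults to inherit, tos is explicit.
check_offloadable mode=gre local=1.2.3.4 remote=1.2.3.5 ttl=inherit tos=inherit
```

An IPv6 tunnel created without `ttl inherit` would be described with `ttl=64` and reported as `ttl must be inherit`.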

Since Linux 6.2

  • GRE tunnels with IPv6 underlay can be offloaded to Spectrum-1. Each router interface (RIF) representing an ip6gre tunnel consumes two RIF entries.

Since Linux 5.16

  • GRE tunnels with IPv6 underlay can be offloaded for Spectrum-2 and above. The type should be ip6gre and both TTL and TOS should be set to inherit. For example, to add a GRE tunnel with IPv6 underlay, run:
sw $ ip link add name g1 type ip6gre local 2001:db8:3::1 remote 2001:db8:3::2 tos inherit ttl inherit

Since Linux 5.5

  • Underlay of an unbound GRE device is now correctly the main VRF. That means that it is not possible anymore to cause local address collision by moving the GRE netdevice to another VRF.

Since Linux 4.15

  • If a GRE netdevice is moved to another VRF such that it causes local address collision, both tunnels are unoffloaded. The opposite logic which would notice that a netdevice became eligible for offloading due to configuration changes is currently not implemented. What falls to slow path, stays there.
  • The underlay of an unbound GRE device is the same VRF that the GRE device is in. This is unlike Linux, where it would be the main VRF. This issue is fixed in Linux 5.5.
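
Since a tunnel that falls to the slow path stays there, it can help to screen for local address collisions before moving a GRE netdevice. The sketch below assumes a hand-maintained inventory with one tunnel per line in the form `NAME UNDERLAY_VRF LOCAL_ADDRESS`; both that format and the `find_collisions` helper are hypothetical.

```shell
#!/bin/sh
# find_collisions: stdin has one tunnel per line as
#   NAME UNDERLAY_VRF LOCAL_ADDRESS
# Print each (VRF, address) pair that is used by more than one tunnel.
find_collisions() {
    awk '
        NF >= 3 {
            k = $2 " " $3
            if (k in names)
                names[k] = names[k] "," $1
            else
                names[k] = $1
            count[k]++
        }
        END {
            for (k in count)
                if (count[k] > 1) print k ": " names[k]
        }
    '
}

# Hypothetical inventory: g1 and g2 collide in VRF ul, g3 does not.
find_collisions <<'EOF'
g1 ul 1.2.3.4
g2 ul 1.2.3.4
g3 main 1.2.3.4
EOF
```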

Since Linux 4.14

  • Forming encapsulating routes to two tunnels that have the same local address and underlay VRF leads to invocation of the abort mechanism (see Static Routing)
  • Nothing is offloaded until an encapsulating route is added (i.e. the decap-only flow is not supported)
  • Changes to configuration done after the tunnel is offloaded are not reflected. This can be circumvented by removing and re-adding all encapsulating routes at once (not one at a time).
  • The state of the bound device (up/down) is not reflected
  • The underlay of an unbound GRE device is the same VRF that the GRE device is in. This is unlike Linux, where it would be the main VRF. This issue is fixed in Linux 5.5.
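
The workaround of removing and re-adding all encapsulating routes at once can be sketched as a generated batch file: every route is deleted first, and only then are the routes added back. The `readd_batch` helper is hypothetical, and the prefixes in the example are the ones used on sw1 earlier on this page. The output is meant to be reviewed and then fed to `ip -batch`.

```shell
#!/bin/sh
# readd_batch DEV PREFIX...
# Print an `ip -batch` script that deletes all listed encapsulating routes
# first and only then adds them back, so that no route remains in between.
readd_batch() {
    dev=$1
    shift
    for p in "$@"; do echo "route del $p dev $dev"; done
    for p in "$@"; do echo "route add $p dev $dev"; done
}

# Review the output, then pass it to `ip -batch -` on the switch.
readd_batch g 192.168.2.0/24 2001:db8:2::/56
```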

Further Resources

  1. man ip-tunnel
  2. https://www.deepspace6.net/docs/iproute2tunnel-en.html