
DPDK Notes


Libtrace Data Plane Development Kit (DPDK) Support

For the latest DPDK support, we recommend building libtrace from the develop branch

Latest stable libtrace release

Libtrace release: 4.0.21
Supported DPDK versions: 16.11, 17.11, 18.11, 19.11, 20.11, 21.11, 22.11 [DPDK stable/LTS releases]
Unsupported but might work: Non-LTS releases (i.e. non XX.11 releases), 2.x series

The Data Plane Development Kit (DPDK) format allows packets to be captured in a truly zero-copy manner and provides direct access to every packet with almost zero overhead, leaving more CPU for your application to process packets. Libtrace's DPDK capture format works in a very similar way to the DAG capture format. DPDK supports many network cards; a list of supported hardware can be found on the DPDK website.

Documentation and source code for the DPDK can be downloaded from https://www.dpdk.org.

DPDK development is occurring rapidly and the DPDK library API is very unstable. We are always happy to accept pull requests to fill in any gaps.

Building Libtrace against your system's DPDK (Recommended)

Most Linux distributions now ship DPDK through their package manager. This is the easiest way to get started with Libtrace and DPDK.

Install DPDK's development package from your package manager, e.g. on Debian/Ubuntu:

apt install dpdk-dev
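
Libtrace uses pkg-config to locate a system-installed DPDK, so a quick sanity check (the libdpdk module name is standard, though packaging varies by distribution) is:

pkg-config --modversion libdpdk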

Build libtrace as per usual, but make sure to run ./configure with the --with-dpdk option.

./configure --with-dpdk
make
sudo make install

Note: We have found DPDK packaged in many different ways, and new releases often break libtrace's automatic detection of the system library. If you are having any problems, make sure you are building the latest development version of libtrace.
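
One hedged way to confirm that configure detected DPDK is to search its output before running make, for example:

./configure --with-dpdk 2>&1 | grep -i dpdk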

Building libtrace against DPDK source

There are two different ways to build DPDK, depending on the version:

  1. The old way: make; support was removed in DPDK 20.11
  2. The new way: meson; support was added in DPDK 18.02

We recommend building with make where possible (DPDK 20.08 and older), and switching to meson only when building DPDK 20.11 or newer.

Building DPDK (20.08 and older) from source using make

  1. Download a release from https://core.dpdk.org/download/
wget https://fast.dpdk.org/rel/dpdk-19.11.6.tar.gz
tar -xf dpdk-19.11.6.tar.gz
cd dpdk-stable-19.11.6/
  2. Create the default configuration file (x86_64-native-linuxapp-gcc/.config) for DPDK
  • Prior to DPDK 1.7, x86_64-native-linuxapp-gcc was named x86_64-default-linuxapp-gcc, and recent versions (around 19.11) have started renaming x86_64-native-linuxapp-gcc to x86_64-native-linux-gcc
make config T=x86_64-native-linuxapp-gcc O=x86_64-native-linuxapp-gcc

Note: it is important to create and modify the .config file rather than passing the options directly to make install, as the latter will not save the options in the .config file

  3. Set the required CONFIG_RTE_BUILD_COMBINE_LIBS=y configuration option in the .config file
  • Use CONFIG_RTE_BUILD_SHARED_LIB=y to build a shared library
  • CONFIG_RTE_LIBRTE_PMD_PCAP=y enables support for virtual pcap interfaces and is recommended for testing
echo "CONFIG_RTE_BUILD_COMBINE_LIBS=y" >> ./x86_64-native-linuxapp-gcc/.config
echo "CONFIG_RTE_LIBRTE_PMD_PCAP=y" >> ./x86_64-native-linuxapp-gcc/.config

To improve build compatibility with an unsupported kernel, you might find it useful to disable these modules:

echo "CONFIG_RTE_LIBRTE_KNI=n" >> ./x86_64-native-linuxapp-gcc/.config
echo "CONFIG_RTE_KNI_KMOD=n" >> ./x86_64-native-linuxapp-gcc/.config
echo "CONFIG_RTE_EAL_IGB_UIO=n" >> ./x86_64-native-linuxapp-gcc/.config

If you are using a device whose driver is not enabled by default, enable it here, for example:

echo "CONFIG_RTE_LIBRTE_MLX5_PMD=y" >> ./x86_64-native-linuxapp-gcc/.config
  4. Build the DPDK library with the EXTRA_CFLAGS="-fPIC" option. This should create the static library in ./x86_64-native-linuxapp-gcc/libs/libintel_dpdk.a required by libtrace.
  • DPDK requires a CPU with SSE3 support to build; if this is not detected correctly you can try EXTRA_CFLAGS="-fPIC -march=core2" to force it.
make install T=x86_64-native-linuxapp-gcc EXTRA_CFLAGS="-fPIC" -j 4
  5. Export RTE_SDK and RTE_TARGET; these are used by the libtrace configure script (if they change, re-run ./configure)
export RTE_SDK=$(pwd)
export RTE_TARGET=x86_64-native-linuxapp-gcc

Building DPDK (20.11 and newer) from source using meson

  1. Install meson and pkg-config from your package manager
apt install meson pkg-config
  2. Download a release from https://core.dpdk.org/download/
wget https://fast.dpdk.org/rel/dpdk-20.11.tar.gz
tar -xf dpdk-20.11.tar.gz
cd dpdk-20.11/
  3. Configure where meson will install DPDK to and a directory to use for build artifacts
meson --prefix="$(pwd)/install" build

--prefix is the install location and build is the name of the directory meson will use for its build artifacts.

Note: you need to use --prefix rather than the DESTDIR method of setting the install path.

  4. Create the install directory, then build and install DPDK using meson
mkdir install
cd build/
meson install
  5. Export RTE_SDK to point to the install path; RTE_TARGET is not needed
cd ..
export RTE_SDK=$(pwd)

Libtrace will search for the pkg-config directory based on RTE_SDK, looking at the following locations: $(RTE_SDK)/install/lib/, $(RTE_SDK)/lib/, and $(RTE_SDK)/.
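
As a sanity check, you can confirm that pkg-config can find this DPDK install. The exact pkgconfig subdirectory under the install prefix varies by distribution and DPDK version; on a Debian-based system it is typically:

PKG_CONFIG_PATH="$RTE_SDK/install/lib/x86_64-linux-gnu/pkgconfig" pkg-config --modversion libdpdk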

Compiling libtrace against a local copy of DPDK

  1. Configure and build libtrace - RTE_SDK (and, for make-based builds, RTE_TARGET) must be set in the environment for DPDK to be detected
cd ..
wget https://research.wand.net.nz/software/libtrace/libtrace-latest.tar.bz2
tar xf libtrace-latest.tar.bz2
cd libtrace*
./configure --with-dpdk=yes
make
sudo make install

If compiling fails due to an undefined symbol, you can resolve this by passing the required libraries to make via DPDKLIBS. This can happen if you enable a driver whose dependencies libtrace does not know about. For example, if you have the ISAL driver enabled and need to link against the isal library, instead of make run:
make DPDKLIBS="-Wl,-lisal"
Note: you will need the -Wl, prefix so that the library is added in the correct order.

Running Libtrace with DPDK

Note: It is strongly recommended that you test DPDK with its included samples and verify they are functioning correctly before attempting to use libtrace with DPDK.

  1. If you are not already familiar with DPDK, read the DPDK Getting Started Guide and make sure the DPDK prerequisites, such as hugepages, are met (a minimal hugepage setup sketch is included at the end of this section). Use the DPDK sample applications to verify DPDK is working correctly.

  2. Load a suitable DPDK driver/kernel module (docs)

sudo modprobe uio_pci_generic

or

sudo modprobe vfio-pci

or

cd $RTE_SDK/$RTE_TARGET/kmod
sudo modprobe uio
sudo insmod ./igb_uio.ko
  3. Use the dpdk-devbind.py tool (found in DPDK/usertools/ or DPDK/tools/) to bind the port to the appropriate driver. (This tool has been renamed several times, so it might be called something else, such as dpdk_nic_bind.py in older releases.)
cd $RTE_SDK/usertools/
sudo ./dpdk-devbind.py -b <driver name> <nic pci address>
sudo ./dpdk-devbind.py --status
  4. Test DPDK using a libtrace tool; the PCI address can be found with the dpdk-devbind.py tool or lspci
sudo tracesummary dpdk:0000:01:00.0

You can also attach DPDK to a virtual device, for example using the pcap driver:

sudo tracesummary dpdkvdev:net_pcap0,iface=veth0
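
As mentioned in step 1, DPDK requires hugepages. A minimal sketch of reserving 2MB hugepages and mounting hugetlbfs (the page count and mount point are examples; see the DPDK Getting Started Guide for details):

echo 1024 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
sudo mkdir -p /mnt/huge
sudo mount -t hugetlbfs nodev /mnt/huge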

System Requirements

  • gettimeofday() and/or clock_gettime() must be implemented as virtual system calls (vDSO) by your Linux kernel. These are called for every packet received, so the advantage of using DPDK is lost if a real system call still has to be made (a quick check is sketched after this list).
  • DPDK is a polling format, so a multi-core system is highly recommended so that other processes can run on the remaining cores.
  • For best performance, the CPU cores that DPDK is bound to should only be running DPDK; move interrupts away from the DPDK cores.
  • For best performance, DPDK should not be run on hyper-threads.
  • For best performance, use a core on the same NUMA node as the PCI card. The core can be selected when starting a format using the dpdk:<PCI>-<core> notation, e.g. dpdk:0000:42:0.1-2 to use CPU core 2.
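
A quick, hedged way to check the first point: trace only the time-related system calls while running your libtrace application (your-libtrace-app is a placeholder); if they are served by the vDSO, strace will report few or none of them.

strace -c -e trace=gettimeofday,clock_gettime ./your-libtrace-app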

Libtrace application requirements

  • The same thread must be used to create, start, and read/write packets, and to make all other format-dependent libtrace calls (a minimal sketch follows this list).
  • Minimal processing should be done on the thread interacting with libtrace and the DPDK format, for two main reasons:
    1. Packets will be dropped when queues fill up; this applies to all formats
    2. DPDK packets are timestamped when trace_read_packet() is called (using gettimeofday()), so the longer packet processing takes, the less accurate the timestamps are, unless hardware timestamping is used.
  • When using the DPDK format the system should remain awake at all times; do not put it to sleep or into hibernation.
  • Each libtrace interface will try to allocate 512MB of huge pages, so make sure you have enough huge pages available.
  • Starting with libtrace 4.0.16, a single application can open multiple DPDK interfaces. A single interface is still limited to being opened either for reading or for writing, but not both at the same time. Prior to libtrace 4.0.16, applications could only open one DPDK interface.
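
A minimal single-threaded sketch that follows these rules (the PCI address is an example and must match a port bound to a DPDK-compatible driver):

#include <libtrace.h>
#include <inttypes.h>
#include <stdio.h>

int main(void) {
    /* create, start and read all happen on this one thread, as the DPDK format requires */
    libtrace_t *trace = trace_create("dpdk:0000:01:00.0");
    libtrace_packet_t *packet = trace_create_packet();
    uint64_t count = 0;

    if (trace_is_err(trace) || trace_start(trace) == -1) {
        trace_perror(trace, "starting dpdk trace");
        return 1;
    }
    while (trace_read_packet(trace, packet) > 0) {
        count++;   /* keep per-packet work on this thread minimal */
    }
    printf("read %" PRIu64 " packets\n", count);
    trace_destroy_packet(packet);
    trace_destroy(trace);
    return 0;
}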

Known issues

Packet counters

DPDK drivers are known to count dropped and errored packets in different ways; sometimes dropped packets are counted as errored, other times they are kept separate. This is down to the individual driver, so there is no good way to solve this issue. If you are relying on these counters, double-check that they are correct for your hardware.

Many of these inconsistencies were fixed in DPDK 16.04, however, we have not tested all drivers.

Advanced Settings (defines at the top of libtrace/lib/dpdk.c)

This section is out of date; it is based on testing with Intel DPDK 1.3.1_7 (no longer supported by libtrace) and an Intel 82580 based Ethernet controller. Some of these settings are not supported by all controllers.

NB_RX_MBUF - Number of memory buffers i.e. the number of packets in the ring buffer

A patch is included at libtrace/Intel DPDK Patches/larger_ring.patch

NB_RX_MBUF controls the maximum number of packets the DPDK format can buffer at one time. In general, the larger this is, the lower the packet drop rate (ideally it reaches 0).

The pmd driver places a limit of 4k per RX ring on NB_RX_MBUF. This is controlled by a define; for the IGB driver it is located in IntelDPDK/lib/librte_pmd_e1000/igb_rxtx.c at line 1063:

#define IGB_MAX_RING_DESC

It appears this can be increased without any side-effects (except more memory usage). There is a limit of 65535 because DPDK uses a uint16_t to represent this size. To exceed this, multiple queues would need to be used (not supported by libtrace). NOTE: 65535 itself cannot be used directly due to the alignment size; however, 65536 minus the alignment (such as 128) can be used. If you want to use this setting on your Intel NIC, check the documentation to make sure there isn't a hardware limit placed on this value.

Capturing Bad Packets - Those with an ethernet checksum mismatch

A minor change can be made to the pmd driver IntelDPDK/lib/librte_pmd_e1000/igb_rxtx.c to keep packets with bad Ethernet checksums, which would otherwise be dropped by default. Simply change rctl &= ~E1000_RCTL_SBP; to rctl |= E1000_RCTL_SBP;

NOTE: Bad packets don't appear to get timestamped, so this will cause problems if used with hardware timestamping, because there is no way of knowing whether a packet is bad and whether a timestamp is sitting in front of the packet.

HAS_HW_TIMESTAMPS_82580 - Hardware Timestamping Packets (Implemented for Intel 82580 based NICs)

To get a hardware timestamp from the Intel DPDK, a change must be made to the pmd driver. A patch for Intel 82580 based NICs is provided; see libtrace/Intel DPDK Patches/hardware_timestamp.patch. This must first be applied to DPDK, and then the HAS_HW_TIMESTAMPS_82580 define in dpdk.c must be set to 1. Once applied, the libtrace DPDK format can only be used with Intel 82580 controllers. Packets must be read by calling trace_read_packet() within half of the hardware clock's wrap-around time, which for the Intel 82580 controller is 18/2 seconds.

In order to use timestamping, the Intel NIC must support Receive Packet Timestamp in Buffer, meaning the NIC places the timestamp in a header in front of the packet data. Libtrace then needs to interpret this header correctly; the things that need to be considered are:

  • Clock resolution - convert this to nanoseconds
  • Synchronising with the current time - record the time of the first packet you've received and add this to all packets after it.
  • Timer wrap around - compare the system time to that of the last packet received, estimate how many times the timer has (possibly) wrapped around, then pick what makes sense (a generic sketch follows this list).
  • Consider what happens after the device is paused - timestamps need to be restarted because the clock is reset when the device is started again. The current implementation gets a system timestamp (hopefully via vsyscall) every time a packet is received. On a system without vsyscalls this could be done differently, by starting a background thread that increments a counter (i.e. does what estimated_wraps does) every 18 seconds, when the clock is expected to wrap around. At that point you should get the system time to make sure you stay correctly in sync with it, and base the next sleep on the difference.
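
A generic sketch of the wrap-around arithmetic described above (names are illustrative, not libtrace's; wrap is the counter period in ticks and the system clock supplies a coarse estimate of the elapsed ticks):

#include <stdint.h>

/* Reconstruct elapsed hardware ticks between two raw counter reads,
 * using the system clock to decide how many whole wraps occurred. */
static uint64_t elapsed_ticks(uint64_t hw_prev, uint64_t hw_now,
                              uint64_t wrap, double sys_elapsed_ticks) {
    /* modular difference covers the interval left over after whole wraps */
    uint64_t d = (hw_now >= hw_prev) ? (hw_now - hw_prev)
                                     : (wrap - hw_prev + hw_now);
    /* whole wraps suggested by the (coarser) system clock */
    double est = (sys_elapsed_ticks - (double)d) / (double)wrap;
    uint64_t wraps = (est > 0.0) ? (uint64_t)(est + 0.5) : 0;
    return wraps * wrap + d;
}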

GET_MAC_CRC_CHECKSUM

This option can be turned on by setting the GET_MAC_CRC_CHECKSUM define to 1. It captures the full packet including the checksum. This is safe to turn on; however, note that when writing to native interfaces such as int: and ring: it is assumed that there is no checksum.

USE_CLOCK_GETTIME

Use clock_gettime() instead of gettimeofday() (nanosecond vs microsecond resolution). This should only be considered if clock_gettime() is a virtual system call on your system. Remember that this timestamp is added by libtrace when trace_read_packet() is called, so it is unlikely to be accurate enough to justify nanosecond resolution anyway. If you require timestamps accurate to the nanosecond, hardware timestamping is the only way to truly achieve this.

NOTE: This setting has no effect if hardware timestamping is already being used.

Capturing Jumbo Frames

Jumbo frames can be captured by setting TRACE_OPTION_SNAPLEN using trace_config(). The size specified here excludes the checksum and is limited to around 9k by most Intel NICs. TRACE_OPTION_SNAPLEN may also be set to less than the maximum Ethernet packet size of 1514; however, any packet larger than the configured size will be dropped. So if the snaplen is set to 100, any packet over 100 bytes + 4 bytes (Ethernet CRC) will be dropped automatically by the NIC.
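
A minimal sketch of setting the snap length with trace_config() before starting the trace (the 9000 byte value is an example; NIC limits vary):

int snaplen = 9000;   /* example jumbo snap length, excluding the CRC */
if (trace_config(trace, TRACE_OPTION_SNAPLEN, &snaplen) == -1)
    trace_perror(trace, "setting TRACE_OPTION_SNAPLEN");
/* this must be called before trace_start(trace) */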
