
Upstream step by step guide

Parav Pandit edited this page Oct 24, 2023 · 19 revisions

1. Device configuration

This is a step-by-step guide for ConnectX-6DX. For other devices, the minimum firmware version and the maximum limits may vary; check the respective sections for those devices.

1.1 Update firmware

Update to a firmware version that supports scalable functions. The minimum firmware version needed is 20.30.1004. It can be downloaded from firmware downloads.

1.2 Enable support

Once firmware is updated, enable scalable function support in the device. Scalable functions support must be enabled on the PF where SFs will be used.

$ mlxconfig -d 0000:06:00.0 s PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_TOTAL_SF=252 PF_SF_BAR_SIZE=10 SRIOV_EN=0

When SFs are to be used on the external controller of the DPU, the user must enable SFs on the external host PF.

(a) Disable global symmetrical MSI-X configuration in external host PF.

$ mlxconfig -d 0000:06:00.0 s NUM_PF_MSIX_VALID=0

(b) Enable per PF MSI-X configuration in external host PF.

$ mlxconfig -d 0000:06:00.0 s PF_NUM_PF_MSIX_VALID=1

(c) Set up the MSI-X vectors per PF. The count should be 4 times the number of SFs configured. For example, when PF_TOTAL_SF=250, configure 1000 MSI-X vectors.

$ mlxconfig -d 0000:06:00.0 s PF_TOTAL_SF=250 PF_NUM_PF_MSIX=1000 PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_SF_BAR_SIZE=10 SRIOV_EN=0
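
The sizing rule in step (c) can be scripted so that the SF count and the MSI-X count stay consistent. The following is only a sketch: the helper names sf_msix_vectors and sf_mlxconfig_cmd are hypothetical, and the second helper only prints the command shown above instead of running it.

```shell
#!/bin/sh
# Hypothetical helpers; they are not part of mlxconfig or the mlx5 driver.

# Print the MSI-X vector count for a given number of SFs,
# using the 4-vectors-per-SF rule from step (c).
sf_msix_vectors() {
    echo $(( $1 * 4 ))
}

# Print (do not run) the mlxconfig command for PCI device $1 and SF count $2.
sf_mlxconfig_cmd() {
    echo "mlxconfig -d $1 s PF_TOTAL_SF=$2 PF_NUM_PF_MSIX=$(sf_msix_vectors "$2") PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_SF_BAR_SIZE=10 SRIOV_EN=0"
}
```

For example, sf_mlxconfig_cmd 0000:06:00.0 250 prints the command used in step (c), which can be reviewed before it is run.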

1.3 Cold reboot

Perform a cold system reboot for the configuration to take effect.

2. Mandatory kernel configuration

Linux kernel mlx5 subfunction support must be enabled. It is disabled by default.

The following two Kconfig flags must be enabled.

  1. MLX5_ESWITCH
  2. MLX5_SF
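
One way to check a kernel build for these flags is to look for them in its config file. The sketch below assumes that both flags are boolean options that appear as CONFIG_MLX5_ESWITCH=y and CONFIG_MLX5_SF=y; the helper name mlx5_sf_kconfig_ok is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if both options are set to "y"
# in the kernel config file given as $1.
mlx5_sf_kconfig_ok() {
    grep -q '^CONFIG_MLX5_ESWITCH=y$' "$1" &&
        grep -q '^CONFIG_MLX5_SF=y$' "$1"
}
```

On distributions that install the config under /boot (an assumption, the location varies), it could be used as: mlx5_sf_kconfig_ok "/boot/config-$(uname -r)" && echo "mlx5 SF support is enabled"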

3. Software control and commands

Scalable functions use a 4-step process from creation to use, as shown below.

pictures/create-config-deploy-use.png

3.1 Setup udev rules and scripts

3.1.1 Setup udev rule file

Some systems require an explicit udev rule file, either because systemd/udev is old or because the PCI BDF name is long, which makes the netdevice name longer than the kernel supports.

Create a file /etc/udev/rules.d/83-mlnx-sf-name.rules

SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}!="", ATTR{phys_port_name}!="", ATTR{phys_port_name}=="*pf*sf*", \
    IMPORT{program}="/sbin/sf-rep-netdev-rename $attr{phys_port_name} $attr{ifindex}", \
    NAME="$env{NAME}"

SUBSYSTEM=="net", SUBSYSTEMS=="auxiliary", ACTION=="add", ATTRS{sfnum}!="", \
    IMPORT{program}="/sbin/auxdev-sf-netdev-rename $attr{sfnum} $attr{ifindex}", \
    NAME="$env{SF_NETDEV_NAME}"

3.1.2 Setup SF representor rename file

Create /sbin/sf-rep-netdev-rename and make it executable (chmod +x).

#!/bin/bash

PORT_NAME=$1
IFINDEX=$2

for rep_ndev in `ls /sys/class/net/`; do
        _ifindex=`head -1 /sys/class/net/$rep_ndev/ifindex 2>/dev/null`
        if [ "$_ifindex" = "$IFINDEX" ]
        then
            devpath=`udevadm info /sys/class/net/$rep_ndev | grep "DEVPATH="`
            pcipath=`echo $devpath | awk -F "/net/$rep_ndev" '{print $1}'`
            array=($(echo "$pcipath" | sed 's/\// /g'))
            len=${#array[@]}
            # last element in array is pci parent device
            parent_pdev=${array[$len-1]}
            #pdev is : 0000:03:00.0, so extract them by their index
            b=`echo ${parent_pdev:5:2} | sed 's/^0//'`
            f=${parent_pdev: -1}
            echo "NAME=en${b}f${f}${PORT_NAME}"
            exit
        fi
done

3.1.3 Setup SF auxiliary device rename file

Create /sbin/auxdev-sf-netdev-rename and make it executable (chmod +x).

#!/bin/bash

# This file renames netdevice of the SF's auxiliary device.
# It is done by using its parent PCI device + sf number.
#
# For example, when SF with sfnumber 88 is located on its parent PCI device 03:00.0, it will be renamed as,
#
# enp3s0f0s88.
#
# en = Ethernet
# p = pci
# 3s0f0 = pci bdf = 03:00.0
# s88 = SF number 88

SFNUM=$1
IFINDEX=$2

for sf_ndev in `ls /sys/class/net/`; do
    _ifindex=`head -1 /sys/class/net/$sf_ndev/ifindex 2>/dev/null`
    if [ "$_ifindex" = "$IFINDEX" ]
    then
            _sfnum=`head -1 /sys/class/net/$sf_ndev/device/sfnum 2>/dev/null`
            if [ "$_sfnum" = "$SFNUM" ]
            then
                    devpath=`udevadm info /sys/class/net/$sf_ndev | grep "DEVPATH="`
                    pcipath=`echo $devpath | awk -F "/mlx5_core.sf" '{print $1}'`
                    array=($(echo "$pcipath" | sed 's/\// /g'))
                    len=${#array[@]}
                    # last element in array is pci parent device
                    parent_pdev=${array[$len-1]}
                    #pdev is : 0000:03:00.0, so extract them by their index
                    b=`echo ${parent_pdev:5:2} | sed 's/^0//'`
                    d=`echo ${parent_pdev:8:2} | sed 's/^0//'`
                    f=${parent_pdev: -1}
                    echo "SF_NETDEV_NAME=enp${b}s${d}f${f}s${SFNUM}"
                    exit
            fi
    fi
done
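
The naming step of the script above can be tried in isolation, without any SF present. The sketch below reimplements only the BDF-to-name mapping as a standalone function; the name sf_netdev_name is hypothetical, and it assumes a parent PCI address in the 0000:03:00.0 form used in the script's comments.

```shell
#!/bin/sh
# Hypothetical helper mirroring the naming step of auxdev-sf-netdev-rename.
# $1 = parent PCI device (domain:bus:device.function), $2 = SF number.
# The body runs in a subshell so its variables stay local.
sf_netdev_name() (
    rest=${1#*:}         # drop the PCI domain  -> "03:00.0"
    bus=${rest%%:*}      # bus                  -> "03"
    devfn=${rest#*:}     # device.function      -> "00.0"
    dev=${devfn%%.*}     # device               -> "00"
    fn=${devfn#*.}       # function             -> "0"
    # Strip one leading zero, as the sed 's/^0//' calls in the script do.
    echo "enp${bus#0}s${dev#0}f${fn}s$2"
)
```

For example, sf_netdev_name 0000:03:00.0 88 prints enp3s0f0s88, the name described in the script's header comment.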

3.2 Download and install latest iproute2 source

Download:

$ git clone git://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git iproute2-next

Install:

$ yum install -y libmnl-devel
$ cd iproute2-next
$ ./configure --prefix=/usr
$ make -j all
$ make install
$ devlink -V
devlink utility, iproute2-5.11.0

Make sure it is 5.11 or higher.
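
The version check can also be scripted. The sketch below assumes that the banner has the "iproute2-MAJOR.MINOR..." form shown above; builds that print a different version string are reported as unparsable (exit status 2). The helper name devlink_version_ok is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the "devlink -V" banner passed as $1
# reports iproute2 5.11 or newer. The body runs in a subshell, so "exit"
# only leaves the function.
devlink_version_ok() (
    ver=${1##*iproute2-}    # e.g. "5.11.0"
    major=${ver%%.*}
    rest=${ver#*.}
    minor=${rest%%.*}
    # Reject banners without a numeric major.minor version.
    case "$major$minor" in
        ''|*[!0-9]*) exit 2 ;;
    esac
    [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 11 ]; }
)
```

It could be used as: devlink_version_ok "$(devlink -V)" || echo "iproute2 is older than 5.11"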

3.3 Move PCI PF to switchdev mode

$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
$ devlink dev eswitch show pci/0000:06:00.0

3.4 Show the physical (aka uplink) port of the PF

$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false

3.5 Add one SF

After addition, an SF is not yet usable by the end-user application. It becomes usable only after configuration and activation.

$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached

When an SF is added for the external controller, such as on a DPU/SmartNIC, the user needs to supply the controller number. In a single-host DPU case, there is only one controller, with controller numbering starting at 1.

Example of adding SF for the PF 0 of the external controller 1:

$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88 controller 1
pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 1 pfnum 0 sfnum 88 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached

3.6 Show the newly added devlink port

Show the SF by its representor device or by its port index:

$ devlink port show ens2f0npf0sf88

Or

$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached

3.7 Set the MAC address of the SF

$ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88

3.8 Configure OVS

$ systemctl start openvswitch
$ ovs-vsctl add-br network1
$ ovs-vsctl add-port network1 ens2f0npf0sf88
$ ip link set dev ens2f0npf0sf88 up
$ ovs-vsctl add-port network1 ens2f0np0
$ ip link set dev ens2f0np0 up

3.9 Now activate the SF

Activating the SF creates an auxiliary device and initiates the driver load sequence for the netdevice, rdma and vdpa devices.

Once the operational state is marked as attached, the driver is attached to this SF and device loading starts.

An application that wants to use the SF's netdevice and rdma device needs to watch for them to appear, either through udev monitor or by polling the sysfs hierarchy of the SF's auxiliary device.

In the future, an explicit option will be added to deterministically add the netdev and rdma device of the SF.

$ devlink port function set pci/0000:06:00.0/32768 state active
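
The sysfs polling approach mentioned above could look like the following sketch. It assumes that the SF's netdevice shows up as a net/ subdirectory of the auxiliary device directory, in the same way the infiniband/ subdirectory is listed in 3.14. The helper names, the 0.1 second interval and the default of 50 tries are all hypothetical choices.

```shell
#!/bin/sh
# Hypothetical helpers for polling an SF auxiliary device directory in sysfs.

# Print the names of all netdevices found under directory $1.
sf_aux_netdevs() {
    for d in "$1"/net/*; do
        if [ -e "$d" ]; then
            basename "$d"
        fi
    done
    return 0
}

# Wait until a netdevice appears under $1, then print its name.
# $2 is the number of 0.1 second tries (default 50); fails on timeout.
sf_wait_netdev() {
    tries=${2:-50}
    while [ "$tries" -gt 0 ]; do
        name=$(sf_aux_netdevs "$1")
        if [ -n "$name" ]; then
            echo "$name"
            return 0
        fi
        tries=$((tries - 1))
        sleep 0.1
    done
    return 1
}
```

It could be used as: sf_wait_netdev /sys/bus/auxiliary/devices/mlx5_core.sf.4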

3.10 View the new state of the SF

$ devlink port show ens2f0npf0sf88 -jp
{
   "port": {
      "pci/0000:06:00.0/32768": {
         "type": "eth",
         "netdev": "ens2f0npf0sf88",
         "flavour": "pcisf",
         "controller": 0,
         "pfnum": 0,
         "sfnum": 88,
         "splittable": false,
         "function": {
           "hw_addr": "00:00:00:00:88:88",
           "state": "active",
           "opstate": "attached"
          }
       }
    }
  }

3.11 View the auxiliary device of the SF

View auxiliary devices and their associated protocol (net, rdma, vdpa devices).

$  tree -l -L 3 -P "mlx5_core.sf." /sys/bus/auxiliary/devices/

There can be hundreds of auxiliary SF devices on the auxiliary bus. Each SF's auxiliary device contains a unique sfnum and PCI information. Each SF's sfnum can be read using:

$ cat /sys/bus/auxiliary/devices/mlx5_core.sf.4/sfnum
88
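
Because the index in the auxiliary device name (the 4 in mlx5_core.sf.4) is not the sfnum, a script usually has to search for the device that carries a given sfnum. The sketch below scans the sfnum files and prints the first match; the helper name sf_auxdev_by_sfnum is hypothetical, and it does not distinguish between SFs of different PFs that share an sfnum.

```shell
#!/bin/sh
# Hypothetical helper: print the name of the first auxiliary SF device whose
# sfnum equals $1. $2 optionally overrides the sysfs directory to scan.
sf_auxdev_by_sfnum() {
    base=${2:-/sys/bus/auxiliary/devices}
    for dev in "$base"/mlx5_core.sf.*; do
        [ -r "$dev/sfnum" ] || continue
        if [ "$(cat "$dev/sfnum")" = "$1" ]; then
            basename "$dev"
            return 0
        fi
    done
    return 1
}
```

On the system used in this guide, sf_auxdev_by_sfnum 88 would be expected to print mlx5_core.sf.4.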

To see the parent PCI device of the SF:

$ readlink /sys/bus/auxiliary/devices/mlx5_core.sf.1
../../../devices/pci0033:00/0033:00:00.0/0000:06:00.0/mlx5_core.sf.1

View the devlink instance of the SF device:

$ devlink dev show
$ devlink dev show auxiliary/mlx5_core.sf.4

3.12 Set all SF specific device parameters

By default, all the upper layer devices (netdev, rdma and vdpa) are disabled for SFs located on the eswitch PF.

NOTE: This step is not applicable for the SFs located on the external controller.

Enable these devices explicitly. For example, enable netdev and rdma devices.

$ devlink dev param set auxiliary/mlx5_core.sf.4 name enable_eth value true cmode driverinit
$ devlink dev param set auxiliary/mlx5_core.sf.4 name enable_rdma value true cmode driverinit
$ devlink dev reload auxiliary/mlx5_core.sf.4

If the user wants to use only the vdpa device of the SF, enable only the vdpa auxiliary device.

$ devlink dev param set auxiliary/mlx5_core.sf.4 name enable_vnet value true cmode driverinit
$ devlink dev reload auxiliary/mlx5_core.sf.4

3.13 View the port and netdevice associated with the SF

$ devlink port show auxiliary/mlx5_core.sf.4/1
auxiliary/mlx5_core.sf.4/1: type eth netdev enp60s0f0s88 flavour virtual splittable false

3.14 View the RDMA device for the SF

$ rdma dev show
$ ls /sys/bus/auxiliary/devices/mlx5_core.sf.4/infiniband/

3.15 How to use SF devices (RDMA, netdev, vdpa)

3.15.1 netdevice and RDMA Device of the SF

netdev and RDMA device usage guide.

3.15.2 VDPA device of the SF

vdpa device usage guide.

3.16 Deactivate SF

Once SF usage is complete, deactivate the SF. This triggers driver unload in the host system. Once the SF is deactivated, its operational state changes to "detached". An orchestration system should poll until the operational state has changed to "detached" before deleting the SF. This ensures a graceful hot unplug.

$ devlink port function set pci/0000:06:00.0/32768 state inactive
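
The polling described above could be done as in the following sketch. It parses the plain-text "devlink port show" output shown in the earlier steps; the helper names, the 0.1 second interval and the default of 50 tries are hypothetical choices, not something devlink prescribes.

```shell
#!/bin/sh
# Hypothetical helpers for waiting on the operational state of an SF port.

# Read plain-text "devlink port show" output on stdin and print the opstate.
sf_opstate() {
    sed -n 's/.*opstate \([a-z]*\).*/\1/p'
}

# Poll port $1 until its opstate equals $2.
# $3 is the number of 0.1 second tries (default 50); fails on timeout.
sf_wait_opstate() {
    tries=${3:-50}
    while [ "$tries" -gt 0 ]; do
        if [ "$(devlink port show "$1" | sf_opstate)" = "$2" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 0.1
    done
    return 1
}
```

It could be used as: sf_wait_opstate pci/0000:06:00.0/32768 detached && echo "SF can be deleted"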

3.17 Delete SF

Finally, once the state is "inactive" and the operational state is "detached", the user can safely delete the SF. For faster provisioning, a user can reconfigure and activate the SF again without deleting it.

$ devlink port del pci/0000:06:00.0/32768