Flow Control and QoS configuration for RoCE fabrics
InfiniBand networks are inherently lossless: they use link-level flow control to ensure that packets are not dropped within the fabric. RoCE (RDMA over Converged Ethernet) carries the InfiniBand transport protocol over a standard Ethernet/IP network, which can be lossy. Because packet loss hurts RoCE performance, it is recommended to enable some form of flow control within your fabric.
For detailed information, refer to the document "Network Considerations for Global Pause, PFC and QoS with Mellanox Switches and Adapters".
"Global Pause" is the simplest mode of flow control for achieving a lossless Ethernet fabric.
Ethernet as defined by IEEE 802.3 is unreliable (or "lossy") by design. In its basic form, it does not guarantee that frames reach their destination; the standard leaves that responsibility to upper-layer protocols (e.g. TCP).
Later, the IEEE 802.3x flow control standard (Annex 31B of 802.3) was defined for applications that cannot build reliability into the upper-layer protocols. It lets a receiver report buffer pressure (e.g. imminent overflow) back to its sender.
The pause action (XOFF) is a control frame sent by the receiver to warn the sender that the receive buffer is under stress and about to overflow. The sender responds by stopping transmission of new packets until the receiver is ready to accept them again. The pause frame contains a timeout value. The sender waits until this timeout expires or until it receives an XON control frame.
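To give a feel for the timeout carried in a pause frame, the following sketch converts it to wall-clock time. It relies on the IEEE 802.3x definition of the timeout as a 16-bit count of pause quanta, where one quantum is 512 bit times. The function name `pause_duration_ns` is a hypothetical helper, and the link speeds are example values only.

```shell
#!/bin/sh
# pause_duration_ns QUANTA SPEED_GBPS
# Hypothetical helper: converts a pause-frame timeout to nanoseconds.
# One pause quantum is 512 bit times; at N Gb/s one bit takes 1/N ns,
# so the duration is QUANTA * 512 / N ns (integer division, rounded down).
pause_duration_ns() {
    quanta=$1
    speed_gbps=$2
    echo $(( quanta * 512 / speed_gbps ))
}

# Largest possible timeout (65535 quanta) on a 10 Gb/s link.
pause_duration_ns 65535 10     # prints 3355392 (about 3.36 ms)

# The same timeout value on a 100 Gb/s link is ten times shorter.
pause_duration_ns 65535 100    # prints 335539 (about 0.34 ms)
```

Because the quantum is defined in bit times, the same timeout value pauses a faster link for a proportionally shorter period.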
To check the current pause settings of an adapter on Linux, use ethtool:
ethtool -a <interface name>
$ ethtool -a ens2
Pause parameters for ens2:
Autonegotiate: off
RX: off
TX: off
If RX and TX are off, enable them:
$ ethtool -A ens2 rx on tx on
$ ethtool -a ens2
Pause parameters for ens2:
Autonegotiate: off
RX: on
TX: on
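The check-then-enable steps above can be scripted so that several hosts or interfaces are handled uniformly. This is a minimal sketch, assuming a POSIX shell with `awk` and `ethtool` installed. The function names `pause_is_enabled` and `ensure_global_pause` are hypothetical helpers, not part of ethtool, and the parsing assumes the `RX:` / `TX:` output layout shown in the example above.

```shell
#!/bin/sh
# pause_is_enabled: reads `ethtool -a` output on stdin.
# Exits 0 only if both the "RX:" and "TX:" lines report "on".
pause_is_enabled() {
    awk '$1 == "RX:" { rx = $2 }
         $1 == "TX:" { tx = $2 }
         END { exit !(rx == "on" && tx == "on") }'
}

# ensure_global_pause IFACE: enables RX/TX pause only when needed.
# Changing the setting typically requires root privileges.
ensure_global_pause() {
    iface=$1
    if ethtool -a "$iface" | pause_is_enabled; then
        echo "$iface: global pause already enabled"
    else
        ethtool -A "$iface" rx on tx on
    fi
}
```

For example, `ensure_global_pause ens2` leaves an already configured interface untouched and otherwise runs the same `ethtool -A` command shown above.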
After enabling flow control on the adapters, the switch(es) must be configured accordingly. When using Mellanox switches you can run the following commands on individual ports, or provide a range:
mellanox-switch [standalone: master] (config) # interface ethernet 1/1-1/32 flowcontrol receive on force
mellanox-switch [standalone: master] (config) # interface ethernet 1/1-1/32 flowcontrol send on force
If you are using a switch from another vendor, refer to that vendor's documentation for enabling global pause (IEEE 802.3x port-based flow control).
More complex environments may require more advanced configuration and other flow control options, such as PFC, to achieve maximum performance.
Refer to the document "Recommended Network Configuration Examples for RoCE Deployment" for detailed configuration recipes that match various production use cases.