
Fast Reboot

Xin Liu edited this page Nov 9, 2017 · 3 revisions

SONiC Fast-Reboot (Fast-Reload) Design

Definition

The Fast-Reboot feature enables a switch to reboot quickly with minimal disruption to the data plane. This matters in data center networks, where infrastructure maintenance and upgrades are unavoidable. Fast-Reboot is one of the features that minimize traffic impact during such events.

Requirements

  1. Fast-Reboot must be initiated by a CLI command on a SONiC device, e.g. /usr/bin/fast-reboot
  2. Fast-Reboot must disrupt the data plane for no more than 25 seconds
  3. Fast-Reboot must disrupt the control plane for no more than 90 seconds
  4. Fast-Reboot should use stale FIB information while the control plane reboots
  5. Fast-Reboot must support at least 2000 hosts connected to SONiC VLAN interfaces
  6. Fast-Reboot must support at least 6000 IPv4 BGP routes and 3000 IPv6 /64 BGP routes
  7. LACP must be in SLOW mode on all LAG interfaces of a SONiC device
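
The numeric limits above can be checked mechanically against a measured test run. The following is a minimal sketch, not part of SONiC: the `FastRebootRun` record and the `violations` helper are hypothetical names introduced here for illustration, and only the thresholds come from the requirements list.

```python
from dataclasses import dataclass
from typing import List

# Limits taken from the requirements list above.
MAX_DATA_PLANE_OUTAGE_S = 25
MAX_CONTROL_PLANE_OUTAGE_S = 90
MIN_VLAN_HOSTS = 2000
MIN_IPV4_ROUTES = 6000
MIN_IPV6_ROUTES = 3000


@dataclass
class FastRebootRun:
    """Hypothetical record of one measured fast-reboot test run."""
    data_plane_outage_s: float
    control_plane_outage_s: float
    vlan_hosts: int
    ipv4_routes: int
    ipv6_routes: int


def violations(run: FastRebootRun) -> List[str]:
    """Return human-readable requirement violations (empty list if none)."""
    problems = []
    if run.data_plane_outage_s > MAX_DATA_PLANE_OUTAGE_S:
        problems.append("data plane outage exceeds 25 s")
    if run.control_plane_outage_s > MAX_CONTROL_PLANE_OUTAGE_S:
        problems.append("control plane outage exceeds 90 s")
    # A run at a smaller scale does not demonstrate the required scale.
    if run.vlan_hosts < MIN_VLAN_HOSTS:
        problems.append("tested with fewer than 2000 VLAN hosts")
    if run.ipv4_routes < MIN_IPV4_ROUTES:
        problems.append("tested with fewer than 6000 IPv4 routes")
    if run.ipv6_routes < MIN_IPV6_ROUTES:
        problems.append("tested with fewer than 3000 IPv6 routes")
    return problems
```

A run that meets every limit yields an empty list, so the result can be used directly as a pass/fail signal in a test harness.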

Interaction with other modules

Fast-Reboot must save FDB and ARP entries right before the reboot. Both FDB and ARP information should be dumped from the SONiC DBs. This information is used later to restore data plane state as soon as possible after ASIC reinitialization. The FDB and ARP entries can be reinserted into the SONiC DBs with the swssconfig utility. SONiC must support installing dynamic FDB entries into the ASIC, and the ASIC must be able to update these entries later when they expire or change. Installing the FDB and ARP entries with bulk operations would help reduce the Fast-Reboot data plane disruption time.
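
As an illustration of the dump step, the sketch below converts in-memory ARP and FDB entries into a JSON list of "set" operations that a tool such as swssconfig could replay after the reboot. The table names (`NEIGH_TABLE`, `FDB_TABLE`), the field names, and the input dictionary layout are assumptions made for this example and are not confirmed by this page; reading the entries out of the SONiC DBs is left out entirely.

```python
import json


def to_restore_ops(arp_entries, fdb_entries):
    """Build a list of SET operations from saved ARP and FDB entries.

    Key and field names below are assumed for illustration only.
    """
    ops = []
    for e in arp_entries:
        key = "NEIGH_TABLE:{}:{}".format(e["interface"], e["ip"])
        ops.append({key: {"neigh": e["mac"], "family": e["family"]}, "OP": "SET"})
    for e in fdb_entries:
        key = "FDB_TABLE:{}:{}".format(e["vlan"], e["mac"])
        # Entries are installed as dynamic so the ASIC can age or update them.
        ops.append({key: {"port": e["port"], "type": "dynamic"}, "OP": "SET"})
    return ops


def dump_to_file(ops, path):
    """Write the operations to a file that survives the control plane reboot."""
    with open(path, "w") as f:
        json.dump(ops, f, indent=2)


arp = [{"interface": "Vlan1000", "ip": "192.168.0.2",
        "mac": "00:11:22:33:44:55", "family": "IPv4"}]
fdb = [{"vlan": "Vlan1000", "mac": "00:11:22:33:44:55", "port": "Ethernet0"}]
ops = to_restore_ops(arp, fdb)
```

Keeping the conversion separate from the file write makes it easy to test the generated operations without touching the disk.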

Configuration flow

Fast-Reboot doesn't require any configuration.

UI types to support

Fast-Reboot doesn't require any user input.

Implementation

Command line user interface

Fast-Reboot is initiated by running the /usr/bin/fast-reboot command with root privileges. No other actions are required after running the command. The user is expected to check the health of the SONiC device before initiating the Fast-Reboot procedure on it.

Required technologies

  1. The LACP daemon must send a regular LACP update right before it stops
  2. The BGP daemon must support BGP graceful restart (RFC 4724) in Restarting Speaker mode. It must send an OPEN message with the "Forwarding State" bit set in the Graceful Restart capability.
  3. The Linux kernel must support the KEXEC feature
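
The LACP requirement ties directly to the 90-second control plane budget. In the LACP standard, a partner in slow mode expects an LACPDU every 30 seconds and declares the link down after 90 seconds without one; in fast mode the values are 1 second and 3 seconds. The short sketch below works through that arithmetic. The function names are hypothetical and exist only for this example.

```python
# Periodic transmission intervals defined by the LACP standard, in seconds.
PERIODIC_INTERVAL_S = {"slow": 30, "fast": 1}
# A partner times out after three periodic intervals without an LACPDU.
TIMEOUT_MULTIPLIER = 3


def partner_timeout_s(mode: str) -> int:
    """Seconds the LACP partner waits after the last LACPDU before timing out."""
    return PERIODIC_INTERVAL_S[mode] * TIMEOUT_MULTIPLIER


def lag_survives(mode: str, control_plane_outage_s: float) -> bool:
    """True if the LAG stays up when the last LACPDU is sent right before the outage."""
    return control_plane_outage_s < partner_timeout_s(mode)


print(partner_timeout_s("slow"))  # 90
print(partner_timeout_s("fast"))  # 3
```

This is why the last LACP update must be sent right before the daemon stops: it restarts the partner's timer and gives the control plane the full slow-mode timeout to come back.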

Implementation details

After Fast-Reboot is initiated by the CLI command, the fast-reboot procedure performs the following steps:

  1. Dump FDB and ARP entries from SONiC tables to disk space that is preserved across control plane reboots.
  2. Stop the BGP daemon process in BGP graceful restart mode. This preserves routes on neighbor nodes for 120 seconds through BGP Graceful Restart.
  3. Stop the LAG daemon so that it sends a last LAG protocol update. This gives us 90 seconds to reboot the control plane.
  4. Stop the docker service; otherwise the SONiC file system will be corrupted.
  5. Stop any ASIC drivers if necessary.
  6. Load a new kernel from disk and start it using KEXEC.
  7. After the control plane has rebooted, it should determine that the reboot was caused by the Fast-Reboot procedure.
  8. syncd starts with the '-t fast' parameter, which indicates a start after the Fast-Reboot procedure. This mode allows the ASIC to be initialized in fast mode.
  9. swss restores FDB and ARP entries. This reduces boot time by removing the FDB learning phase.
  10. The LAG daemon restores LAG interfaces.
  11. The BGP daemon restores BGP sessions with its neighbors.
  12. After that, SONiC works in normal mode. Fast-Reboot is complete.
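
Steps 6 and 7 can be sketched together: the kernel is loaded and started with the kexec tool, and one way for the rebooted control plane to recognize the cause is a marker on the kernel command line. The marker string, the paths, and the helper names below are assumptions for illustration; this page does not specify how the reboot cause is recorded. The sketch only builds the command lines and does not execute them.

```python
# Hypothetical marker appended to the kernel command line before kexec.
FAST_REBOOT_MARKER = "fast-reboot"


def build_kexec_commands(kernel, initrd, cmdline):
    """Return the argv lists for loading and then executing the new kernel."""
    append = "{} {}".format(cmdline, FAST_REBOOT_MARKER).strip()
    load = ["kexec", "-l", kernel,
            "--initrd={}".format(initrd),
            "--append={}".format(append)]
    execute = ["kexec", "-e"]
    return load, execute


def is_fast_reboot(proc_cmdline):
    """Check the booted kernel's command line (e.g. /proc/cmdline) for the marker."""
    return FAST_REBOOT_MARKER in proc_cmdline.split()


# Example with placeholder paths and boot options.
load, execute = build_kexec_commands(
    "/boot/vmlinuz", "/boot/initrd.img", "root=/dev/sda1 ro")
```

After the reboot, a startup script could pass the contents of /proc/cmdline to `is_fast_reboot` and, if it returns true, start syncd with the '-t fast' parameter.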

Scale/performance

For better performance, SONiC should support bulk operations for adding routing, FDB, and ARP entries. Currently, bulk operations are supported for routes; support for FDB and ARP entries will follow.
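
To illustrate why bulk operations help, the sketch below groups entries into fixed-size batches so that each batch could be handed to a single bulk call instead of one call per entry. The batch size and the `batches` helper are hypothetical; the actual bulk interface is not described on this page.

```python
from itertools import islice


def batches(entries, size):
    """Yield successive lists of at most `size` entries."""
    it = iter(entries)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# 2000 host entries in batches of 500 need 4 bulk calls instead of 2000 single calls.
calls = list(batches(range(2000), 500))
print(len(calls))  # 4
```

The saving comes from paying the per-call overhead once per batch rather than once per entry, which matters most during the window when the data plane is being restored.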
