Source: CVPR

Year: 2018

Authors: Jie Hu, Li Shen, Gang Sun

Institutions: Momenta, University of Oxford

For any given transformation $F_{tr}: X\to U$, with $X\in\mathbb{R}^{H'\times W'\times C'}$ and $U\in\mathbb{R}^{H\times W\times C}$, a corresponding SE block can be constructed to perform feature recalibration. The features $U$ are first passed through a squeeze operation, which aggregates the feature maps across the spatial dimensions $H\times W$ to produce a channel descriptor. This is followed by an excitation operation, in which sample-specific activations, learned for each channel by a self-gating mechanism based on channel dependence, govern the excitation of each channel. The feature maps $U$ are then reweighted to generate the output of the SE block, which can be fed directly into subsequent layers.

While the template for the building block is generic, the role it performs at different depths adapts to the needs of the network. In the early layers, it learns to excite informative features in a class agnostic manner, bolstering the quality of the shared lower level representations. In later layers, the SE block becomes increasingly specialised, and responds to different inputs in a highly class-specific manner. Consequently, the benefits of feature recalibration conducted by SE blocks can be accumulated through the entire network.

Each of the learned filters operates with a local receptive field and consequently each unit of the transformation output $U$ is unable to exploit contextual information outside of this region.

To mitigate this problem, global spatial information is squeezed into a channel descriptor. This is achieved by using global average pooling to generate channel-wise statistics. Formally, a statistic $z\in\mathbb{R}^C$ is generated by shrinking $U$ through spatial dimensions $H\times W$, where the $c$-th element of $z$ is calculated by
$$z_c=F_{sq}(u_c)=\frac{1}{H\times W}\sum_{i=1}^H\sum_{j=1}^W u_c(i,j)$$
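As a minimal sketch of the squeeze step, the snippet below computes $z$ by averaging each channel over its spatial positions. It assumes the feature maps are stored as plain nested Python lists in channel-first order (`u[c][i][j]`); the function name `squeeze` and the sample values are illustrative, not taken from the paper.

```python
def squeeze(u):
    """Global average pooling over H x W for each channel.

    u is a list of C channels, each an H x W nested list (an assumed layout).
    Returns the channel descriptor z as a list of C floats.
    """
    z = []
    for channel in u:
        height = len(channel)
        width = len(channel[0])
        total = sum(sum(row) for row in channel)
        z.append(total / (height * width))
    return z


# Two channels of size 2 x 2 with illustrative values.
u = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
z = squeeze(u)  # one scalar per channel
```

Each entry of `z` summarises an entire feature map, so it carries information from outside any single filter's receptive field.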

The excitation operation aims to fully capture channel-wise dependencies. To do so, it must meet two criteria:
1. It must be flexible (in particular, it must be capable of learning a nonlinear interaction between channels).
2. It must learn a non-mutually-exclusive relationship, so that multiple channels can be emphasised at once, as opposed to a one-hot activation.

The excitation operation employs a simple gating mechanism with a sigmoid activation:
$$s=F_{ex}(z,W)=\sigma(g(z,W))=\sigma(W_2\delta(W_1z))$$
where $\sigma$ is the sigmoid function, $\delta$ refers to the ReLU function, $W_1\in\mathbb{R}^{\frac{C}{r}\times C}$ and $W_2\in\mathbb{R}^{C\times\frac{C}{r}}$.
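The gating function can be sketched in a few lines, assuming $C=4$ and $r=2$ so that $W_1$ is $2\times 4$ and $W_2$ is $4\times 2$. The helper names (`matvec`, `relu`, `sigmoid`, `excite`) and the weight values are hypothetical, chosen only to make the shapes visible; bias terms are left out to match the formula.

```python
import math


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]


def excite(z, w1, w2):
    """s = sigmoid(W2 * relu(W1 * z)), with no bias terms."""
    hidden = relu(matvec(w1, z))          # length C / r
    return sigmoid(matvec(w2, hidden))    # length C


# Illustrative values: C = 4 channels, reduction ratio r = 2.
z = [1.0, 2.0, 3.0, 4.0]
w1 = [[0.5, 0.5, 0.0, 0.0],
      [0.0, 0.0, -1.0, 0.5]]
w2 = [[1.0, 0.0],
      [0.0, 1.0],
      [-1.0, 0.0],
      [0.0, 0.0]]
s = excite(z, w1, w2)  # one gate in (0, 1) per channel
```

Because each gate comes from its own sigmoid rather than a softmax, the gates need not sum to one, which is what allows several channels to be emphasised simultaneously.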

The excitation operation forms a bottleneck with two fully connected layers around the non-linearity: a dimensionality-reduction layer with parameters $W_1$ and reduction ratio $r$, a ReLU, and then a dimensionality-increasing layer with parameters $W_2$ that restores the channel dimension $C$. The final output of the block is obtained by rescaling the transformation output $U$ with the activations:
$$\tilde{x}_c=F_{scale}(u_c,s_c)=s_cu_c$$
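The rescaling is a channel-wise multiplication: every spatial position in channel $c$ is multiplied by the same scalar $s_c$. The sketch below assumes the same nested-list, channel-first layout (`u[c][i][j]`); the function name `scale` and the gate values are illustrative.

```python
def scale(u, s):
    """Multiply every value in channel c of u by the scalar gate s[c]."""
    return [
        [[s_c * value for value in row] for row in channel]
        for channel, s_c in zip(u, s)
    ]


# Two 2 x 2 channels and one gate per channel (illustrative values).
u = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
s = [0.5, 0.25]
x_tilde = scale(u, s)  # same shape as u
```

The output keeps the shape of $U$, which is why the block can be dropped in front of subsequent layers without changing them.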

For non-residual networks, such as Inception networks, SE blocks are constructed by taking the transformation $F_{tr}$ to be an entire Inception module.

Moreover, SE blocks are sufficiently flexible to be used in residual networks. The SE block transformation $F_{tr}$ is taken to be the non-identity branch of a residual module. Squeeze and excitation both act before summation with the identity branch.
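The ordering in the residual case can be sketched as follows: the gates are computed from, and applied to, the non-identity branch only, and the identity branch is added afterwards. This is a simplified illustration, not the paper's implementation: each channel is a flat list of spatial activations, `double_branch` and `constant_gate` are hypothetical stand-ins for the residual transformation and for squeeze plus excitation, and any activation applied after the summation is omitted.

```python
def se_residual_block(x, residual_branch, gate):
    """x: list of C channels, each a flat list of spatial activations."""
    u = residual_branch(x)   # F_tr: the non-identity branch
    s = gate(u)              # squeeze + excitation: one gate per channel
    # Recalibrate the branch output first, then add the identity branch.
    return [
        [x_v + s_c * u_v for x_v, u_v in zip(x_c, u_c)]
        for x_c, u_c, s_c in zip(x, u, s)
    ]


# Hypothetical stand-ins, used only to make the data flow concrete.
def double_branch(x):
    return [[2.0 * v for v in channel] for channel in x]


def constant_gate(u):
    return [0.5 for _ in u]


x = [[1.0, 2.0], [3.0, 4.0]]
y = se_residual_block(x, double_branch, constant_gate)
```

Note that the identity path `x` is never multiplied by the gates; only the branch output `u` is recalibrated.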