Residual connections, often seen in architectures like ResNet, are designed to let the network learn modifications (or residuals) to the identity mapping rather than learning an entirely new transformation. They help in training very deep neural networks by alleviating issues like vanishing gradients. Let’s break down the concept in detail, including the mathematical aspects.

---

### 1. **Overview of Residual Connections**

- **Motivation:**  
  In deep networks, gradients must pass backward through every layer during training, and this becomes harder as the number of layers grows. Residual connections give gradients a path that bypasses certain layers, which makes deeper models easier to train.

- **Basic Idea:**  
  Instead of trying to learn a direct mapping \( H(x) \) from input \( x \) to output, the network learns a residual function \( F(x) \) such that:
  \[
  H(x) = F(x) + x.
  \]
  Here, \( F(x) \) represents the change or “residual” needed from the input \( x \) to achieve the desired output.

- **“Add” Operation:**  
  The term “add residual connection” refers to the element-wise addition of the input \( x \) to the output \( F(x) \) of a series of transformations. This addition forms the shortcut connection that bypasses one or more layers.
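
The basic idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not a framework implementation: the names `add_residual` and `small_correction` are hypothetical, and vectors are represented as lists of floats.

```python
from typing import Callable, List

Vector = List[float]


def add_residual(f: Callable[[Vector], Vector], x: Vector) -> Vector:
    """Return H(x) = F(x) + x using element-wise addition."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("F(x) and x must have the same shape to be added")
    return [a + b for a, b in zip(fx, x)]


def small_correction(x: Vector) -> Vector:
    """Hypothetical residual function F: a small change relative to x."""
    return [0.5 * v for v in x]


print(add_residual(small_correction, [1.0, 2.0, -4.0]))  # [1.5, 3.0, -6.0]
```

The wrapped function only has to model the change relative to \( x \); the input itself is carried through by the addition.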

---

### 2. **Structure of a Residual Block**

A typical residual block consists of:
- **A series of layers:**  
  These layers perform a transformation \( F(x) \). For example, a basic block might have two convolutional layers, each followed by batch normalization, with a non-linear activation such as ReLU between them.
  
- **A Shortcut Connection:**  
  The input \( x \) is directly added to the output of \( F(x) \) via element-wise addition:
  \[
  \text{Output} = F(x) + x.
  \]
  Optionally, if the dimensions of \( F(x) \) and \( x \) do not match, a linear projection (e.g., using a \(1 \times 1\) convolution) is applied to \( x \) to align the dimensions.
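
The dimension-matching case can be sketched as follows. This is an illustration under stated assumptions: a single feature vector stands in for one spatial position, and the matrix `w_s` plays the role of the \(1 \times 1\) convolution, which applies the same linear map to the channels at every position. The names `matvec`, `shortcut_add`, and `w_s` are hypothetical.

```python
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (length cols)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def shortcut_add(fx: Vector, x: Vector, w_s: Optional[Matrix] = None) -> Vector:
    """Return F(x) + x, projecting x with w_s first when one is given."""
    shortcut = x if w_s is None else matvec(w_s, x)
    if len(shortcut) != len(fx):
        raise ValueError("shortcut and F(x) must have the same length")
    return [a + b for a, b in zip(fx, shortcut)]


fx = [0.5, -1.0]                           # hypothetical F(x) with 2 features
x = [1.0, 2.0, 3.0]                        # input with 3 features
w_s = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]   # hypothetical 2x3 projection
print(shortcut_add(fx, x, w_s))            # [1.5, 4.0]
```

When the shapes already match, `w_s` is left as `None` and the shortcut is the plain identity, which adds no parameters.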

---

### 3. **Mathematical Formulation**

Consider an input \( x \) entering a residual block. The block computes:
1. **Residual Mapping:**  
   \[
   F(x, \{W_i\}) \quad \text{(e.g., a stack of convolution, batch normalization, and ReLU operations)}
   \]
   where \(\{W_i\}\) are the weights of the layers inside the block.

2. **Element-wise Addition (Skip Connection):**  
   The final output \( y \) is computed as:
   \[
   y = F(x, \{W_i\}) + x.
   \]

3. **Activation (Optional):**  
   Often, the sum is then passed through another activation function:
   \[
   y = f(F(x, \{W_i\}) + x),
   \]
   where \( f \) is typically a non-linear function like ReLU.

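This formulation ensures that if \( F(x, \{W_i\}) \) learns to output zeros, the block without a final activation simply outputs \( x \) (i.e., an identity mapping), which can be beneficial if additional layers are not needed at that depth. When an activation \( f \) such as ReLU follows the sum, the output is \( f(x) \), which still equals \( x \) whenever \( x \) is non-negative, for example when \( x \) is itself the output of a preceding ReLU.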
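
The identity property can be checked numerically. The sketch below assumes small dense layers in place of convolutions and omits batch normalization to keep the example short; `dense`, `residual_block`, and the `post_activation` flag are hypothetical names introduced for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    return [max(0.0, t) for t in v]


def dense(w: Matrix, b: Vector, x: Vector) -> Vector:
    """Affine layer: w @ x + b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def residual_block(x: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector,
                   post_activation: bool = False) -> Vector:
    """Compute y = F(x) + x, or relu(F(x) + x) if post_activation is set."""
    fx = dense(w2, b2, relu(dense(w1, b1, x)))  # residual mapping F(x, {W_i})
    y = [f + xi for f, xi in zip(fx, x)]        # element-wise skip addition
    return relu(y) if post_activation else y


zeros_w = [[0.0, 0.0], [0.0, 0.0]]
zeros_b = [0.0, 0.0]
x = [2.0, -3.0]

# With all weights at zero, F(x) = 0 and the block returns its input.
print(residual_block(x, zeros_w, zeros_b, zeros_w, zeros_b))  # [2.0, -3.0]

# With a ReLU after the sum, negative entries of x are clipped to zero.
print(residual_block(x, zeros_w, zeros_b, zeros_w, zeros_b, post_activation=True))  # [2.0, 0.0]
```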

---

### 4. **Benefits of Residual Connections**

- **Easier Gradient Flow:**  
  The shortcut connection allows gradients to flow directly back to earlier layers during backpropagation, reducing the vanishing gradient problem.

- **Identity Mapping:**  
  If the identity function is optimal at some depth, a block can represent it simply by driving \( F(x) \) toward zero, so extra layers can fall back to passing their input through instead of degrading the result.

- **Faster Convergence:**  
  Training often converges faster because the shortcut carries the input forward unchanged, which helps preserve information across many layers.
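
The gradient-flow benefit can be made concrete with the chain rule. As a sketch, assume a block \( y = F(x) + x \) with no activation after the addition and a scalar loss \( \mathcal{L} \) (a symbol introduced here for illustration):
\[
\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(I + \frac{\partial F}{\partial x}\right) = \frac{\partial \mathcal{L}}{\partial y} + \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}.
\]
The first term passes the upstream gradient through unchanged, so even when \( \partial F / \partial x \) is very small, the gradient reaching \( x \) does not vanish for that reason alone.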

---

### 5. **Example: A Simple Residual Block**

Suppose you have a residual block with two convolutional layers. The computations are as follows:

1. **First Layer:**
   \[
   z_1 = \text{Conv}_1(x) + b_1, \quad a_1 = \text{ReLU}(z_1)
   \]

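2. **Second Layer:**
   \[
   z_2 = \text{Conv}_2(a_1) + b_2, \quad F(x) = z_2
   \]
   No ReLU is applied to \( z_2 \) here: a ReLU would force \( F(x) \ge 0 \), so the block could only add to \( x \) and never subtract from it. As in the original ResNet design, any final activation is applied after the addition.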

3. **Residual Addition:**
   \[
   y = F(x) + x.
   \]
   If needed, before addition, \( x \) may be transformed (e.g., via a \(1 \times 1\) convolution) to match dimensions.

This simple block illustrates how the residual mapping \( F(x) \) is added to the original input \( x \) to form the final output.
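
The three steps can be traced with a small runnable sketch. It makes several simplifying assumptions: a single-channel 1-D signal stands in for an image, each convolution uses zero padding so the length is preserved, and the residual branch is taken as \( z_2 \) with no ReLU before the addition. The helper names and the kernel values are hypothetical.

```python
from typing import List

Vector = List[float]


def conv1d_same(x: Vector, kernel: Vector, bias: float) -> Vector:
    """1-D cross-correlation with zero padding; output length equals len(x)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + x + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel)) + bias
        for i in range(len(x))
    ]


def relu(v: Vector) -> Vector:
    return [max(0.0, t) for t in v]


def simple_residual_block(x: Vector, k1: Vector, b1: float,
                          k2: Vector, b2: float) -> Vector:
    z1 = conv1d_same(x, k1, b1)    # first layer, pre-activation
    a1 = relu(z1)                  # first layer, activation
    z2 = conv1d_same(a1, k2, b2)   # second layer, pre-activation
    fx = z2                        # residual branch (assumption: no ReLU here)
    return [f + xi for f, xi in zip(fx, x)]  # residual addition y = F(x) + x


x = [1.0, -2.0, 3.0]
k1, b1 = [0.0, 1.0, 0.0], 0.0  # hypothetical kernel that copies its input
k2, b2 = [0.0, 0.5, 0.0], 1.0  # hypothetical kernel that halves, then adds 1
print(simple_residual_block(x, k1, b1, k2, b2))  # [2.5, -1.0, 5.5]
```

Here \( a_1 = [1, 0, 3] \) and \( F(x) = [1.5, 1, 2.5] \), so the second entry of the output stays negative: the shortcut carries the original \( -2 \) past the ReLU inside the block.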

---

### 6. **Conclusion**

Residual connections, or "add" residual connections, provide a mechanism for deep networks to learn more effectively by allowing layers to refine an identity mapping. This is achieved through:
- Learning a residual function \( F(x) \) instead of a direct mapping,
- Utilizing element-wise addition to combine the residual function with the original input,
- Facilitating easier gradient flow and potentially enabling faster and more stable training in very deep networks.

Understanding these components gives insight into why modern architectures like ResNet have been so successful in tackling deep learning tasks.