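## Abstract

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and their associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from the full image in one evaluation. Because the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.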

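## Unified Detection

- We unify the separate components of object detection into a single neural network.
- The network uses the entire image to make each prediction, and it predicts all bounding boxes across all classes for an image simultaneously.
- The design delivers real-time speed while maintaining accuracy.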

<div style="background-color: white; padding: 10px; border-radius: 5px; width: 80%; text-align:center; margin: auto;">
    <image src="./assets/model.png" alt="YOLO System">
    <a id="fig2"></a>
    <span style="font-size: 0.9em; color: gray;">Figure 2: The YOLO model</span>
</div>

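- Our system divides the image into an $S \times S$ grid ($7 \times 7$ in the figure above). If the center of an object falls into a grid cell, that cell is responsible for detecting the object.
- Each grid cell predicts $B$ bounding boxes with their confidence scores, plus one set of class probabilities:
    - $B$ bounding boxes, each as $(x, y, w, h, \text{confidence})$
    - $C$ conditional class probabilities: $Pr(\text{Class}_i | \text{Object})$
- Each bounding box consists of 5 predicted values: $(x, y, w, h, \text{confidence})$
    - $(x, y)$: the center of the box, relative to the bounds of the grid cell
    - $(w, h)$: the width and height of the box, as fractions of the width and height of the whole image
    - confidence: $Pr(\text{Object}) * \operatorname{IOU}_{\text{pred}}^{\text{truth}}$ (the IOU between the predicted box and the ground-truth box; zero when the cell contains no object)
- At test time, the conditional class probabilities are multiplied by the confidence of each box:
  - $$Pr(\text{Class}_i | \text{Object}) * Pr(\text{Object}) * \operatorname{IOU}_{\text{pred}}^{\text{truth}} = Pr(\text{Class}_i) * \operatorname{IOU}_{\text{pred}}^{\text{truth}}$$
  - This score encodes both the probability that the class appears in the box and how well the predicted box fits the object.
- The final prediction is encoded as an $S \times S \times (B * 5 + C)$ tensor.
- To evaluate YOLO on PASCAL VOC, we use:
  - $S=7, B=2, C=20$, so the output is a $7 \times 7 \times 30$ tensor with $1470$ values.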
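
The encoding described in the list above can be made concrete with a small sketch. It assumes boxes are given in pixels, either as center/width/height (for `encode_box`) or as corners (for `iou`). The helper names `encode_box`, `iou`, `class_specific_scores`, and `output_size` are illustrative and do not come from the paper or from any library.

```python
def encode_box(cx, cy, bw, bh, img_w, img_h, S=7):
    """Map a pixel box (center x/y, width, height) to its grid cell and
    the (x, y, w, h) values described above."""
    cell_w, cell_h = img_w / S, img_h / S
    col = min(int(cx / cell_w), S - 1)  # grid column containing the center
    row = min(int(cy / cell_h), S - 1)  # grid row containing the center
    x = cx / cell_w - col               # center offset inside the cell, in [0, 1)
    y = cy / cell_h - row
    w = bw / img_w                      # size as a fraction of the whole image
    h = bh / img_h
    return row, col, (x, y, w, h)


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_specific_scores(class_probs, box_confidence):
    """Test-time score per class: Pr(Class_i | Object) * box confidence."""
    return [p * box_confidence for p in class_probs]


def output_size(S, B, C):
    """Number of values in the S x S x (B * 5 + C) prediction tensor."""
    return S * S * (B * 5 + C)


# A 112 x 224 pixel box centered in a 448 x 448 image.
print(encode_box(224, 224, 112, 224, 448, 448))  # (3, 3, (0.5, 0.5, 0.25, 0.5))
print(output_size(7, 2, 20))                     # 1470
```

With $S=7$ and a $448 \times 448$ input, each cell covers $64 \times 64$ pixels, so a center at pixel 224 lands halfway through cell index 3 in both directions.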

### Network Design

<div style="background-color: white; padding: 10px; border-radius: 5px; width: 80%; text-align:center; margin: auto;">
    <image src="./assets/net-0.png" alt="The Architecture">
    <a id="fig3"></a>
    <span style="font-size: 0.9em; color: gray;">Figure 3: The YOLO model architecture</span>
</div>

```python
YOLOv1(
  (backbone): Sequential(
    (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3))  # (B, 3, 448, 448) → (B, 64, 224, 224)
    (1): LeakyReLU(negative_slope=0.1)
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)  # (B, 64, 224, 224) → (B, 64, 112, 112)
    (3): Conv2d(64, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))  # (B, 64, 112, 112) → (B, 192, 112, 112)
    (4): LeakyReLU(negative_slope=0.1)
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)  # (B, 192, 112, 112) → (B, 192, 56, 56)
    (6): Conv2d(192, 128, kernel_size=(1, 1), stride=(1, 1))  # (B, 192, 56, 56) → (B, 128, 56, 56)
    (7): LeakyReLU(negative_slope=0.1)
    (8): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))  # (B, 128, 56, 56) → (B, 256, 56, 56)
    (9): LeakyReLU(negative_slope=0.1)
    (10): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))  # (B, 256, 56, 56) → (B, 256, 56, 56)
    (11): LeakyReLU(negative_slope=0.1)
    (12): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))  # (B, 256, 56, 56) → (B, 512, 56, 56)
    (13): LeakyReLU(negative_slope=0.1)
    (14): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)  # (B, 512, 56, 56) → (B, 512, 28, 28)
    (15): Sequential(
      (0): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))  # (B, 512, 28, 28) → (B, 256, 28, 28)
      (1): LeakyReLU(negative_slope=0.1)
      (2): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))  # (B, 256, 28, 28) → (B, 512, 28, 28)
      (3): LeakyReLU(negative_slope=0.1)
    )
    (16): Sequential(
      (0): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))  # (B, 512, 28, 28) → (B, 256, 28, 28)
      (1): LeakyReLU(negative_slope=0.1)
      (2): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))  # (B, 256, 28, 28) → (B, 512, 28, 28)
      (3): LeakyReLU(negative_slope=0.1)
    )
    (17): Sequential(
      (0): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))  # (B, 512, 28, 28) → (B, 256, 28, 28)
      (1): LeakyReLU(negative_slope=0.1)
      (2): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))  # (B, 256, 28, 28) → (B, 512, 28, 28)
      (3): LeakyReLU(negative_slope=0.1)
    )
    (18): Sequential(
      (0): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))  # (B, 512, 28, 28) → (B, 256, 28, 28)
      (1): LeakyReLU(negative_slope=0.1)
      (2): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))  # (B, 256, 28, 28) → (B, 512, 28, 28)
      (3): LeakyReLU(negative_slope=0.1)
    )
    (19): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))  # (B, 512, 28, 28) → (B, 512, 28, 28)
    (20): LeakyReLU(negative_slope=0.1)
    (21): Conv2d(512, 1024, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))  # (B, 512, 28, 28) → (B, 1024, 28, 28)
    (22): LeakyReLU(negative_slope=0.1)
    (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)  # (B, 1024, 28, 28) → (B, 1024, 14, 14)
    (24): Sequential(
      (0): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1))  # (B, 1024, 14, 14) → (B, 512, 14, 14)
      (1): LeakyReLU(negative_slope=0.1)
      (2): Conv2d(512, 1024, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))  # (B, 512, 14, 14) → (B, 1024, 14, 14)
      (3): LeakyReLU(negative_slope=0.1)
    )
    (25): Sequential(
      (0): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1))  # (B, 1024, 14, 14) → (B, 512, 14, 14)
      (1): LeakyReLU(negative_slope=0.1)
      (2): Conv2d(512, 1024, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))  # (B, 512, 14, 14) → (B, 1024, 14, 14)
      (3): LeakyReLU(negative_slope=0.1)
    )
    (26): Conv2d(1024, 1024, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))  # (B, 1024, 14, 14) → (B, 1024, 14, 14)
    (27): LeakyReLU(negative_slope=0.1)
    (28): Conv2d(1024, 1024, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))  # (B, 1024, 14, 14) → (B, 1024, 7, 7)
    (29): LeakyReLU(negative_slope=0.1)
    (30): Conv2d(1024, 1024, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))  # (B, 1024, 7, 7) → (B, 1024, 7, 7)
    (31): LeakyReLU(negative_slope=0.1)
    (32): Conv2d(1024, 1024, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))  # (B, 1024, 7, 7) → (B, 1024, 7, 7)
    (33): LeakyReLU(negative_slope=0.1)
  )
  (fc_layers): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=50176, out_features=4096, bias=True)  # 1024 * 7 * 7 = 50176
    (2): LeakyReLU(negative_slope=0.1)
    (3): Dropout(p=0.5, inplace=False)
    (4): Linear(in_features=4096, out_features=1470, bias=True)  # 7 * 7 * 30 = 1470
    (5): Sigmoid()
  )
)
Input shape: torch.Size([64, 3, 448, 448])
Output shape: torch.Size([64, 7, 7, 30])
```
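
The shape comments in the listing above follow from the usual output-size rule for convolution and pooling layers, $\lfloor (n + 2p - k) / s \rfloor + 1$. The sketch below replays only the layers that change the spatial size. The helper `out_size` and the `resizing_layers` list are illustrative names; the layer parameters are transcribed from the printout, not taken from the model's source code.

```python
def out_size(n, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (n + 2 * padding - kernel) // stride + 1


# (kernel, stride, padding) of the layers that change the spatial size.
# The remaining 1x1 convolutions and 3x3 convolutions with stride 1 and
# padding 1 leave the size unchanged.
resizing_layers = [
    (7, 2, 3),  # backbone[0]:  7x7 convolution, stride 2
    (2, 2, 0),  # backbone[2]:  max pooling
    (2, 2, 0),  # backbone[5]:  max pooling
    (2, 2, 0),  # backbone[14]: max pooling
    (2, 2, 0),  # backbone[23]: max pooling
    (3, 2, 1),  # backbone[28]: 3x3 convolution, stride 2
]

size = 448
sizes = [size]
for kernel, stride, padding in resizing_layers:
    size = out_size(size, kernel, stride, padding)
    sizes.append(size)

flat_features = 1024 * size * size  # input width of the first Linear layer

print(sizes)          # [448, 224, 112, 56, 28, 14, 7]
print(flat_features)  # 50176
```

The printed output shape `[64, 7, 7, 30]` suggests that the 1470 values produced by the last `Linear` layer are reshaped to $7 \times 7 \times 30$ afterwards; that step does not appear in the module listing.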

### Training

__Loss function__:

$$\begin{aligned}
&\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
+&\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right]\\
+&\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2 \\
+&\lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 \\
+&\sum_{i=0}^{S^2}  \mathbb{1}_{i}^{\text{obj}} \sum_{c\in \text{classes}} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}$$

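- $\sum_{i=0}^{S^2}$: $i$ iterates over all grid cells.
- $\sum_{j=0}^{B}$: $j$ iterates over the bounding boxes of each grid cell.
- $\mathbb{1}_{i}^{\text{obj}}$: indicates whether an object appears in grid cell $i$.
- $\mathbb{1}_{ij}^{\text{obj}}$: indicates that the $j$-th bounding box of grid cell $i$ is responsible for predicting the object. ($^{\text{noobj}}$ marks the opposite case, where the box is not assigned an object.)
- $(x_i, y_i, w_i, h_i)$: the ground-truth coordinates for the $j$-th bounding box of grid cell $i$. (A hat $\hat{}$ denotes the predicted value.)
- $C_i$: the ground-truth confidence for the $j$-th bounding box of grid cell $i$. (A hat $\hat{}$ denotes the predicted value.)
- $p_i(c)$: the ground-truth conditional probability of class $c$ in grid cell $i$. (A hat $\hat{}$ denotes the predicted value.)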
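
To make the five terms concrete, here is a minimal sketch of the loss contribution of a single grid cell in plain Python. It rests on several assumptions. The weights $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$ are the values reported in the YOLO paper and are not stated in this section. The responsible box index and the target confidence are passed in as arguments rather than computed from IOU. Predicted widths and heights are assumed to be non-negative. The function name `cell_loss` and its signature are hypothetical.

```python
import math

LAMBDA_COORD = 5.0  # weight of the coordinate terms (value from the YOLO paper)
LAMBDA_NOOBJ = 0.5  # weight of the no-object confidence term (value from the YOLO paper)


def cell_loss(pred_boxes, pred_probs, true_box=None, true_conf=0.0,
              true_probs=None, responsible=None):
    """Loss contribution of one grid cell.

    pred_boxes  : list of B tuples (x, y, w, h, confidence)
    pred_probs  : list of C predicted class probabilities
    true_box    : (x, y, w, h) of the object in this cell, or None if empty
    true_conf   : target confidence of the responsible box
    true_probs  : list of C target class probabilities
    responsible : index j of the box responsible for the object
    """
    loss = 0.0
    for j, (x, y, w, h, conf) in enumerate(pred_boxes):
        if true_box is not None and j == responsible:
            tx, ty, tw, th = true_box
            # Terms 1 and 2: center and square-root size errors.
            loss += LAMBDA_COORD * ((tx - x) ** 2 + (ty - y) ** 2)
            loss += LAMBDA_COORD * ((math.sqrt(tw) - math.sqrt(w)) ** 2
                                    + (math.sqrt(th) - math.sqrt(h)) ** 2)
            # Term 3: confidence error of the responsible box.
            loss += (true_conf - conf) ** 2
        else:
            # Term 4: boxes without an assigned object are pushed toward 0.
            loss += LAMBDA_NOOBJ * (0.0 - conf) ** 2
    if true_box is not None:
        # Term 5: class probability error, only for cells containing an object.
        loss += sum((t - p) ** 2 for t, p in zip(true_probs, pred_probs))
    return loss


# An empty cell whose two boxes both predict confidence 0.5.
empty = cell_loss([(0.1, 0.2, 0.3, 0.4, 0.5), (0.5, 0.5, 0.5, 0.5, 0.5)],
                  [0.0, 1.0])
print(empty)  # 0.25
```

Summing `cell_loss` over all $S \times S$ cells gives the loss for one image. The square roots make a given absolute error in width or height cost more for small boxes than for large ones.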