## Cross-Entropy

$$\begin{aligned}
\text{CE}(p, y) &= -\sum_{i=1}^{C} y_i \log(p_i) \\
\end{aligned}$$

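where:
- $C$ is the total number of classes
- $p_i$ is the model's predicted probability for class $i$
- $y_i$ is the indicator for class $i$ in the ground-truth label (1 for the correct class, 0 for all others)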
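
As a minimal sketch of the definition above, the following NumPy snippet evaluates the cross-entropy of a single sample. The function name `cross_entropy`, the `eps` clipping constant, and the example probabilities are illustrative assumptions and do not come from the original notes.

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    """CE(p, y) = -sum_i y_i * log(p_i) for one sample (hypothetical helper)."""
    # Clip to avoid log(0); eps is an assumed safety constant.
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    y = np.asarray(y, dtype=float)
    return float(-np.sum(y * np.log(p)))

p = [0.7, 0.2, 0.1]  # predicted probabilities (made-up values)
y = [1, 0, 0]        # one-hot label: class 0 is the correct class
loss = cross_entropy(p, y)
print(loss)          # equals -log(0.7), since only the true class contributes
```

Because $y$ is one-hot, the sum collapses to the single term $-\log(p_c)$ for the correct class $c$.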

### Binary Cross-Entropy

$$\begin{aligned}
\text{CE}(p, y) &= \begin{cases}
-\log(p) & \text{if } y = 1 \\
-\log(1 - p) & \text{otherwise}
\end{cases} \\
p_t &= \begin{cases}
p & \text{if } y = 1 \\
1 - p & \text{otherwise}
\end{cases} \\
\text{CE}(p_t) &= -\log(p_t) \\
\end{aligned}$$
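
The $p_t$ notation can be sketched in a few lines of NumPy. This is only an illustration: `binary_cross_entropy`, `eps`, and the sample values are assumed names and numbers, not taken from the notes.

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-12):
    """Per-sample CE(p_t) = -log(p_t), with p_t = p if y == 1 else 1 - p."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)  # probability assigned to the true class
    p_t = np.clip(p_t, eps, 1.0)        # assumed guard against log(0)
    return -np.log(p_t)

p = np.array([0.9, 0.9, 0.2])  # predicted probability of the positive class
y = np.array([1, 0, 1])        # ground-truth labels
bce = binary_cross_entropy(p, y)
print(bce)
```

The first sample is a confident correct prediction and gets a small loss, while the second and third assign low probability to the true class and are penalized more heavily.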

### Balanced Cross-Entropy


$$\begin{aligned}
\text{CE}(p_t) &= -\alpha_t \log(p_t) \\
\end{aligned}$$

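where:
- $\alpha_t$ is the class-balancing factor. It is usually set to the inverse class frequency, or treated as a hyperparameter and tuned by cross-validation.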
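
One possible way to derive $\alpha_t$ from inverse class frequencies is sketched below. The helper names, the normalization of $\alpha$ to sum to 1, and the toy labels are assumptions made for illustration.

```python
import numpy as np

def inverse_frequency_alpha(targets, num_classes):
    """Hypothetical helper: alpha proportional to 1 / class count, normalized to sum to 1."""
    counts = np.bincount(targets, minlength=num_classes)
    counts = np.maximum(counts, 1)  # avoid division by zero for absent classes
    alpha = 1.0 / counts
    return alpha / alpha.sum()

def balanced_cross_entropy(p_t, targets, alpha, eps=1e-12):
    """Per-sample balanced CE: -alpha_t * log(p_t)."""
    p_t = np.clip(np.asarray(p_t, dtype=float), eps, 1.0)
    return -alpha[targets] * np.log(p_t)

targets = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2])  # class counts 6 : 2 : 1
alpha = inverse_frequency_alpha(targets, num_classes=3)
p_t = np.full(targets.shape, 0.5)                # same confidence for every sample
per_sample = balanced_cross_entropy(p_t, targets, alpha)
class_totals = np.bincount(targets, weights=per_sample, minlength=3)
print("alpha:", alpha)
print("per-class totals:", class_totals)
```

With equal per-sample confidence, the inverse-frequency weights make every class contribute the same total loss regardless of how many samples it has.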

## Focal Loss

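- The balancing factor $\alpha$ only accounts for differences in class counts; it does not distinguish easy examples from hard ones.
- For example, some background elements reach very high confidence during training ($p_t \approx 1$, so $\text{CE} = -\log(p_t) \approx 0$) and are easy to classify. Because there are so many of them, however, their summed loss still makes up a large share of the total loss and dominates the gradient, which degrades training.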
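
A quick back-of-the-envelope check makes this concrete. The counts and confidences below are made-up numbers chosen only to illustrate the effect.

```python
import numpy as np

# Hypothetical batch: many easy background samples, few hard foreground samples.
n_easy, p_t_easy = 100_000, 0.9  # confidently correct
n_hard, p_t_hard = 100, 0.1      # badly misclassified

easy_total = n_easy * -np.log(p_t_easy)  # summed CE of the easy samples
hard_total = n_hard * -np.log(p_t_hard)  # summed CE of the hard samples
easy_share = easy_total / (easy_total + hard_total)

print(f"easy total: {easy_total:.1f}")
print(f"hard total: {hard_total:.1f}")
print(f"share of total loss from easy samples: {easy_share:.1%}")
```

Each easy sample contributes far less loss than a hard one, yet under these assumed numbers the easy samples together account for roughly 98% of the total.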

$$\begin{aligned}
\text{FL}(p_t) &= - (1 - p_t)^\gamma \log(p_t) \\
\end{aligned}$$

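- $\gamma$ is the focusing parameter and $(1 - p_t)^\gamma$ is the modulating factor. [Figure 1](#fig_speed) visualizes several values of $\gamma \in [0, 5]$, which reveals two properties:
  - When an example is misclassified and $p_t$ is small, the modulating factor $(1 - p_t)^\gamma$ is close to 1 and the loss is barely affected. As $p_t$ grows toward 1, the factor drops quickly toward 0, which shrinks the loss contribution of easy (high-$p_t$) examples.
  - $\gamma$ smoothly controls the rate at which easy examples are down-weighted. When $\gamma = 0$, FL is equivalent to CE. As $\gamma$ increases, the effect of the modulating factor grows as well ($\gamma = 2$ worked best in the experiments).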
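
The two properties can be checked numerically with a short sketch. The function name `focal_loss`, the `eps` clipping, and the chosen $p_t$ and $\gamma$ values are illustrative assumptions.

```python
import numpy as np

def focal_loss(p_t, gamma=2.0, eps=1e-12):
    """Per-sample FL(p_t) = -(1 - p_t)^gamma * log(p_t), without alpha weighting."""
    p_t = np.clip(np.asarray(p_t, dtype=float), eps, 1.0)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

p_t = np.array([0.1, 0.5, 0.9, 0.99])  # from hard (left) to easy (right)
for gamma in (0.0, 0.5, 1.0, 2.0, 5.0):
    factor = (1.0 - p_t) ** gamma      # modulating factor
    print(f"gamma={gamma}: factor={np.round(factor, 4)}, "
          f"FL={np.round(focal_loss(p_t, gamma), 4)}")
```

With $\gamma = 2$, a sample at $p_t = 0.9$ has its loss scaled by about $0.01$, while a sample at $p_t = 0.1$ keeps about $0.81$ of its cross-entropy loss.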

<div style="background: white; padding: 10px; border-radius: 5px; margin: auto; width: 60%; ">
    <a id="fig_speed"></a>
    <img src="./assets/speed.png" alt="Focal Loss Modulating Factor">
</div>

In practice, the $\alpha$-balanced variant of focal loss is used:
$$\begin{aligned}
\text{FL}(p_t) &= - \alpha_t (1 - p_t) ^\gamma \log(p_t) \\
\end{aligned}$$
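
For the binary case, the $\alpha$-balanced form might be sketched as follows. This assumes $\alpha_t$ is defined analogously to $p_t$ ($\alpha$ for $y = 1$, $1 - \alpha$ otherwise); the function name and the default values `alpha=0.25`, `gamma=2.0` are assumptions for illustration.

```python
import numpy as np

def binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Per-sample FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y)
    p_t = np.clip(np.where(y == 1, p, 1.0 - p), eps, 1.0)
    # Assumed convention: alpha for positives, 1 - alpha for negatives.
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

p = np.array([0.95, 0.05, 0.3])  # predicted probability of the positive class
y = np.array([1, 0, 1])          # easy positive, easy negative, hard positive
losses = binary_focal_loss(p, y)
print(losses)
```

Setting `gamma=0.0` removes the modulating factor and recovers the balanced cross-entropy from the earlier section.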

## Ablation Study

<div style="background: white; padding: 10px; border-radius: 5px; margin: auto; width: 60%; ">
    <img src="./assets/ablation.png" alt="Focal Loss Ablation Study">
</div>

## Code

```python
import numpy as np

# Fix the NumPy random seed so the results are reproducible.
np.random.seed(42)


def my_softmax(logits):  # logits: (B, C)
    # Subtract the row-wise max so every exponent is <= 0 (numerical stability).
    max_vals = np.max(logits, axis=1, keepdims=True)
    exp_logits = np.exp(logits - max_vals)
    return exp_logits / np.sum(exp_logits, axis=1, keepdims=True)


def per_class_sum(losses, targets, num_classes):
    # Sum the per-sample losses separately for each ground-truth class.
    return np.array([losses[targets == cls].sum() for cls in range(num_classes)])


def my_cross_entropy_loss(logits, targets, alpha=None):
    probs = my_softmax(logits)
    # Take the class count from logits so that classes absent from the batch keep their slot.
    batch_size, num_classes = logits.shape
    eps = 1e-12
    # Probability of the true class, e.g. [0.04130189 0.08789438 0.10629915 0.22321804 0.15923905 ...]
    true_probs = probs[np.arange(batch_size), targets] + eps
    ce_loss = -np.log(true_probs)  # only the true-class probability contributes
    print("Per-class contribution to the CE loss before alpha weighting:",
          per_class_sum(ce_loss, targets, num_classes))
    print("ce_loss:", ce_loss.mean())
    if alpha is not None:
        ce_loss = np.asarray(alpha)[targets] * ce_loss
        print("Per-class contribution to the CE loss after alpha weighting:",
              per_class_sum(ce_loss, targets, num_classes))
        print("weighted ce_loss:", ce_loss.mean())
    return ce_loss.mean()


def my_focal_loss(logits, targets, gamma=2.0, alpha=None):
    probs = my_softmax(logits)
    batch_size, num_classes = logits.shape
    eps = 1e-12
    # Probability of the true class, e.g. [0.04130189 0.08789438 0.10629915 0.22321804 0.15923905 ...]
    pt = probs[np.arange(batch_size), targets] + eps
    fl_core = -np.log(pt) * (1 - pt) ** gamma  # CE scaled by the modulating factor
    print("Per-class contribution to the FL loss before alpha weighting:",
          per_class_sum(fl_core, targets, num_classes))
    print("fl_loss:", fl_core.mean())
    if alpha is not None:
        fl_core = np.asarray(alpha)[targets] * fl_core
        print("Per-class contribution to the FL loss after alpha weighting:",
              per_class_sum(fl_core, targets, num_classes))
        print("weighted fl_loss:", fl_core.mean())
    return fl_core.mean()


n_classes = 5
batch_size = 64
imbalance_ratio = 0.3
# Geometrically decaying class frequencies: [1, 0.7, 0.49, ...]
class_probs = np.array([(1 - imbalance_ratio) ** i for i in range(n_classes)])
class_probs /= class_probs.sum()  # normalize: [0.36060726 0.25242508 0.17669756 0.12368829 0.0865818 ]
targets = np.random.choice(n_classes, size=batch_size, p=class_probs)
print("True class of each sample:\n", targets)
logits = np.random.randn(batch_size, n_classes)
print("Model logits (first 5 rows):\n", logits[:5])
probs = my_softmax(logits)
print("Model probabilities (first 5 rows):\n", probs[:5])

# Inverse-frequency class weights, normalized to sum to 1.
class_counts = np.bincount(targets, minlength=n_classes)
class_counts = np.maximum(class_counts, 1)  # guard against division by zero for absent classes
alpha = 1.0 / class_counts.astype(np.float32)
alpha /= alpha.sum()
print("Class-balancing factor alpha:", alpha)

ce_loss = my_cross_entropy_loss(logits, targets, alpha=alpha)
fl_loss = my_focal_loss(logits, targets, gamma=2.0, alpha=alpha)
```

True class of each sample:
 [1 4 2 1 0 0 0 3 1 2 0 4 3 0 0 0 0 1 1 0 1 0 0 1 1 2 0 1 1 0 1 0 0 4 4 3 0
 0 2 1 0 1 0 3 0 2 0 1 1 0 4 2 4 3 1 4 0 0 0 0 1 0 3 0]
Model logits (first 5 rows):
 [[ 0.34361829 -1.76304016  0.32408397 -0.38508228 -0.676922  ]
 [ 0.61167629  1.03099952  0.93128012 -0.83921752 -0.30921238]
 [ 0.33126343  0.97554513 -0.47917424 -0.18565898 -1.10633497]
 [-1.19620662  0.81252582  1.35624003 -0.07201012  1.0035329 ]
 [ 0.36163603 -0.64511975  0.36139561  1.53803657 -0.03582604]]
Model probabilities (first 5 rows):
 [[0.33953151 0.04130189 0.33296335 0.16383604 0.12236721]
 [0.2207486  0.33574359 0.30387862 0.0517348  0.08789438]
 [0.23905531 0.45530915 0.10629915 0.14256136 0.05677503]
 [0.02994662 0.22321804 0.38446903 0.09216801 0.2701983 ]
 [0.15923905 0.05818635 0.15920077 0.51636147 0.10701237]]
Class-balancing factor alpha: [0.06257669 0.10306748 0.29202455 0.29202455 0.25030676]
Per-class contribution to the CE loss before alpha weighting: [52.79996377 38.62728709 12.29027883  6.42584673 13.02320717]
ce_loss: 1.9244778687028559
Per-class contribution to the CE loss after alpha weighting: [3.3040