## GRPO Derivations and Implementation

#### 1. Advantage Function

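In GRPO, the advantage is computed from a group of $G$ rewards actually obtained from the environment (one per sampled output for the same prompt), rather than from a learned value estimate. Each reward is normalized by the mean and standard deviation of its group:
$$ A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \ldots, r_G\})}{\mathrm{std}(\{r_1, r_2, \ldots, r_G\})} $$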

##### Code Implementation

In [13]:
import torch

In [14]:
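def grpo_adv(rewards):
    """
    Compute the GRPO advantage for one group of rewards.
    
    Args:
        rewards: rewards of one group, shape [group_size]
        
    Returns:
        advantages: normalized advantages, same shape as rewards
    """
    
    # Mean and standard deviation of the group.
    # Note: torch.std applies Bessel's correction (divides by G - 1) by default.
    mean_rewards = torch.mean(rewards)
    std_rewards = torch.std(rewards)
    
    # If every reward in the group is identical, there is no learning signal
    if std_rewards == 0:
        return torch.zeros_like(rewards)
    
    # Normalize; the small epsilon keeps the division numerically stable
    advantages = (rewards - mean_rewards) / (std_rewards + 1e-8)
    
    return advantages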
    

In [15]:
# Use a float tensor: torch.mean and torch.std do not accept integer dtypes
rewards = torch.tensor([0, 1, 0, 1, 1, 0], dtype=torch.float)
print(rewards)
adv_rst = grpo_adv(rewards)
print(adv_rst)

tensor([0., 1., 0., 1., 1., 0.])
tensor([-0.9129,  0.9129, -0.9129,  0.9129,  0.9129, -0.9129])
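
In training, rewards usually arrive for several prompts at once. The following is a minimal sketch of a batched variant, assuming the rewards are laid out as `[num_prompts, group_size]` with one group per row. The name `grpo_adv_batched` and the `eps` parameter are illustrative and not part of the original notebook.

```python
import torch

def grpo_adv_batched(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize rewards within each group (row) independently.

    rewards: shape [num_prompts, group_size]; each row holds the G rewards
    sampled for one prompt (assumed layout).
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    # torch.std applies Bessel's correction by default, as in grpo_adv
    std = rewards.std(dim=-1, keepdim=True)
    # When all rewards in a row are equal, the numerator is 0, so the row is all zeros
    return (rewards - mean) / (std + eps)

rewards_2d = torch.tensor([[0., 1., 0., 1., 1., 0.],
                           [1., 1., 1., 1., 1., 1.]])
adv_2d = grpo_adv_batched(rewards_2d)
print(adv_2d)
```

The first row reproduces the single-group result above, and the second row (identical rewards) yields zeros without a separate branch, because the numerator is already zero.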


#### 2. KL Divergence

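GRPO uses a KL divergence term to measure how far the current policy $\pi_{\theta}$ has drifted from the reference policy $\pi_{\text{ref}}$. It is computed per sample with the following estimator:
$$ \mathbb{D}_{\text{KL}} \left( \pi_{\theta} \,\|\, \pi_{\text{ref}} \right) = \frac{\pi_{\text{ref}}(o_i|q)}{\pi_{\theta}(o_i|q)} - \log \frac{\pi_{\text{ref}}(o_i|q)}{\pi_{\theta}(o_i|q)} - 1 $$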
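
A short derivation may help explain why this per-sample expression can stand in for the KL divergence. It assumes that the outputs are sampled from $\pi_{\theta}(\cdot|q)$ and that $\pi_{\theta}$ is positive wherever $\pi_{\text{ref}}$ is. The shorthand $x$ is introduced here only for illustration, with $x = \pi_{\text{ref}}(o|q) / \pi_{\theta}(o|q)$:

$$ \mathbb{E}_{o \sim \pi_{\theta}}[x] = \sum_{o} \pi_{\theta}(o|q) \, \frac{\pi_{\text{ref}}(o|q)}{\pi_{\theta}(o|q)} = \sum_{o} \pi_{\text{ref}}(o|q) = 1 $$

$$ \mathbb{E}_{o \sim \pi_{\theta}}[x - \log x - 1] = 1 - \mathbb{E}_{o \sim \pi_{\theta}}[\log x] - 1 = \mathbb{E}_{o \sim \pi_{\theta}}\left[ \log \frac{\pi_{\theta}(o|q)}{\pi_{\text{ref}}(o|q)} \right] = \mathbb{D}_{\text{KL}} \left( \pi_{\theta} \,\|\, \pi_{\text{ref}} \right) $$

Under these assumptions the expression is correct in expectation, and because $\log x \le x - 1$ for all $x > 0$, every individual sample term is non-negative.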

##### Code Implementation

In [None]:
import torch

In [19]:
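def grpo_kl(pi_logprobs, pi_ref_logprobs):
    """
    Compute the per-sample KL divergence estimate used in GRPO.
    
    Args:
        pi_logprobs: log-probabilities under the current policy
        pi_ref_logprobs: log-probabilities under the reference policy
    
    Returns:
        Per-sample KL estimate, same shape as the inputs
    """
    # Log of the probability ratio: log(pi_ref / pi) = log(pi_ref) - log(pi)
    log_ratio = pi_ref_logprobs - pi_logprobs
    
    # Probability ratio: pi_ref / pi = exp(log(pi_ref / pi))
    ratio = torch.exp(log_ratio)
    
    # KL estimate: ratio - log(ratio) - 1
    kl = ratio - log_ratio - 1.0
    
    return kl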

In [None]:

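# Create random stand-ins for log-probabilities.
# Illustrative only: real log-probabilities are <= 0, while torch.randn can return positive values.
batch_size = 5
pi_logprobs = torch.randn(batch_size)
pi_ref_logprobs = torch.randn(batch_size)

# Compute the KL divergence estimate
kl = grpo_kl(pi_logprobs, pi_ref_logprobs)

# Print the results
print("Current policy log-probs:", pi_logprobs)
print("Reference policy log-probs:", pi_ref_logprobs)
print("Per-sample KL estimate:", kl)
print("Mean KL estimate:", kl.mean().item())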
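
As a sanity check on the estimator, the sketch below compares its sample average against the exact KL divergence on a toy problem. The two categorical distributions `pi` and `pi_ref` are hypothetical values chosen for illustration, and the check assumes the samples are drawn from the current policy.

```python
import torch

torch.manual_seed(0)

# Two hypothetical categorical policies over 4 possible outputs
pi = torch.tensor([0.4, 0.3, 0.2, 0.1])
pi_ref = torch.tensor([0.25, 0.25, 0.25, 0.25])

# Exact KL(pi || pi_ref), computed directly from the definition
exact_kl = torch.sum(pi * (pi.log() - pi_ref.log()))

# Draw outputs from the current policy and look up their log-probabilities
samples = torch.multinomial(pi, num_samples=200_000, replacement=True)
log_ratio = pi_ref.log()[samples] - pi.log()[samples]

# Same per-sample expression as grpo_kl: ratio - log(ratio) - 1
estimate = torch.exp(log_ratio) - log_ratio - 1.0

print("Exact KL:", exact_kl.item())
print("Monte Carlo estimate:", estimate.mean().item())
print("Smallest per-sample value:", estimate.min().item())
```

With enough samples the two numbers should agree closely, and no per-sample value should be negative.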

