
Fixing the random seed still does not reproduce results consistently from run to run #129

Closed

agave233 opened this issue Sep 22, 2020 · 4 comments

@agave233

For the graph-convolution examples (GCN/GAT/GIN), I fixed the paddle and numpy random seeds as follows:

import random
import numpy as np

seed = 123
random.seed(seed)
np.random.seed(seed)
# train_program / startup_program are the paddle.fluid static-graph Programs
train_program.random_seed = seed
startup_program.random_seed = seed

On CPU every run produces exactly the same results, but on GPU the results differ noticeably from run to run. After I removed PGL's graph-convolution layer, the model became reproducible on GPU as well. Does PGL's underlying implementation still contain some nondeterministic operations on CUDA?

PS: I also tried setting export FLAGS_cudnn_deterministic=True; it does not seem to help.
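
For reference, here is the seeding above combined with the cuDNN flag in one place. This is only a sketch against the 2020-era paddle.fluid static-graph API; setting FLAGS_cudnn_deterministic via os.environ before paddle is imported is an assumption, not something confirmed in this thread:

import os
os.environ["FLAGS_cudnn_deterministic"] = "1"  # assumed: must be set before importing paddle

import random
import numpy as np
import paddle.fluid as fluid

seed = 123
random.seed(seed)
np.random.seed(seed)

train_program = fluid.Program()
startup_program = fluid.Program()
train_program.random_seed = seed
startup_program.random_seed = seed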

@Yelrose
Collaborator

Yelrose commented Sep 23, 2020

Is this the code in example/gcn, example/gat, and example/gin?

@agave233
Author

> Is this the code in example/gcn, example/gat, and example/gin?

Yes. I re-ran it yesterday and found that on GPU the training loss differs in the fourth or fifth decimal place from run to run. I traced down the source of the randomness: it appears to come from PGL's send function. Some op probably produces nondeterministic results when running on CUDA. Could you investigate?
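
To make the suspect operation concrete: in PGL-style message passing, send gathers source-node features along the edges and the reduce step sums them per destination node, i.e. a gather followed by a scatter_add. A minimal numpy sketch of that pattern (the arrays src, dst, h are illustrative, not PGL's actual API):

import numpy as np

# Toy graph: each edge (src -> dst) sends the source node's feature
# vector; each destination node sums its incoming messages.
src = np.array([0, 1, 2, 2])
dst = np.array([1, 2, 0, 1])
h = np.arange(12, dtype="float32").reshape(3, 4)  # node features

msgs = h[src]              # gather: one message per edge
out = np.zeros_like(h)
np.add.at(out, dst, msgs)  # scatter_add: accumulate per destination
print(out)

On GPU the scatter_add step is the one performed with atomic adds, which, as explained below, is where the run-to-run variation enters.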

@Yelrose
Collaborator

Yelrose commented Sep 23, 2020

Yes, we are investigating as well. We haven't observed this with example/gcn, but GIN indeed shows it.

@Yelrose
Collaborator

Yelrose commented Sep 23, 2020

The GPU implementation of scatter_add, and of gather's backward gradient (which is itself a scatter_add), accumulates concurrently via atomic adds. Floating-point addition is not associative, so the order in which the additions execute changes the low-order digits of the result.

The following numpy snippet demonstrates the effect:

import numpy as np

data = np.random.randn(10000).astype("float32")

# Summing the same float32 values in a different order gives slightly
# different results, because floating-point addition is not associative.
for _ in range(10):
    np.random.shuffle(data)
    print(np.cumsum(data)[-1])

Output:

63.24582
63.245487
63.24563
63.245617
63.245617
63.245495
63.24563
63.245598
63.24564
63.245487
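
For contrast, a small sketch (added here for illustration, not from the thread): if the accumulation order is canonicalized, for example by sorting before summing, the float32 result is bit-identical across shuffles. A deterministic, non-atomic scatter_add would behave analogously:

import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(10000).astype("float32")

# Sorting fixes the accumulation order, so the float32 sum is the same
# no matter how the inputs were shuffled beforehand.
for _ in range(5):
    rng.shuffle(data)
    print(np.cumsum(np.sort(data))[-1])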
