
Fixing the random seed still does not reproduce results consistently from run to run #129

Closed

agave233 opened this issue Sep 22, 2020 · 4 comments

@agave233

For the graph-convolution examples (GCN/GAT/GIN), I fixed the paddle and numpy random seeds as follows:

import random
import numpy as np

seed = 123
random.seed(seed)
np.random.seed(seed)
# train_program / startup_program are the paddle.fluid static-graph Programs
train_program.random_seed = seed
startup_program.random_seed = seed

On CPU every run produces exactly the same results, but on GPU the results differ noticeably from run to run. After I removed PGL's graph-convolution layer, the model became reproducible on GPU as well. Does PGL's underlying implementation still contain some nondeterministic operations on CUDA?

PS: I also tried setting export FLAGS_cudnn_deterministic=True; it does not seem to help.
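
For reference, here is the seeding above combined with the cuDNN flag in one place. This is only a sketch against the 2020-era paddle.fluid static-graph API; setting FLAGS_cudnn_deterministic via os.environ before paddle is imported is an assumption, not something confirmed in this thread:

import os
os.environ["FLAGS_cudnn_deterministic"] = "1"  # assumed: must be set before importing paddle

import random
import numpy as np
import paddle.fluid as fluid

seed = 123
random.seed(seed)
np.random.seed(seed)

train_program = fluid.Program()
startup_program = fluid.Program()
train_program.random_seed = seed
startup_program.random_seed = seed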

@Yelrose
Collaborator

Yelrose commented Sep 23, 2020

Is this the code in example/gcn, example/gat, and example/gin?

@agave233
Author

> Is this the code in example/gcn, example/gat, and example/gin?

Yes. I re-ran it yesterday and found that on GPU the training loss differs in the fourth or fifth decimal place from run to run. I traced down the source of the randomness: it appears to come from PGL's send function. Some op probably produces nondeterministic results when running on CUDA. Could you investigate?
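
To make the suspect operation concrete: in PGL-style message passing, send gathers source-node features along the edges and the reduce step sums them per destination node, i.e. a gather followed by a scatter_add. A minimal numpy sketch of that pattern (the arrays src, dst, h are illustrative, not PGL's actual API):

import numpy as np

# Toy graph: each edge (src -> dst) sends the source node's feature
# vector; each destination node sums its incoming messages.
src = np.array([0, 1, 2, 2])
dst = np.array([1, 2, 0, 1])
h = np.arange(12, dtype="float32").reshape(3, 4)  # node features

msgs = h[src]              # gather: one message per edge
out = np.zeros_like(h)
np.add.at(out, dst, msgs)  # scatter_add: accumulate per destination
print(out)

On GPU the scatter_add step is the one performed with atomic adds, which, as explained below, is where the run-to-run variation enters.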

@Yelrose
Collaborator

Yelrose commented Sep 23, 2020

Yes, we are investigating as well. We haven't observed this with example/gcn, but GIN indeed shows it.

@Yelrose
Collaborator

Yelrose commented Sep 23, 2020

The GPU implementation of scatter_add, and of gather's backward gradient (which is itself a scatter_add), accumulates concurrently via atomic adds. Floating-point addition is not associative, so the order in which the additions execute changes the low-order digits of the result.

The following numpy snippet demonstrates the effect:

import numpy as np

data = np.random.randn(10000).astype("float32")

# Summing the same float32 values in a different order gives slightly
# different results, because floating-point addition is not associative.
for _ in range(10):
    np.random.shuffle(data)
    print(np.cumsum(data)[-1])

Output:

63.24582
63.245487
63.24563
63.245617
63.245617
63.245495
63.24563
63.245598
63.24564
63.245487
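
For contrast, a small sketch (added here for illustration, not from the thread): if the accumulation order is canonicalized, for example by sorting before summing, the float32 result is bit-identical across shuffles. A deterministic, non-atomic scatter_add would behave analogously:

import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(10000).astype("float32")

# Sorting fixes the accumulation order, so the float32 sum is the same
# no matter how the inputs were shuffled beforehand.
for _ in range(5):
    rng.shuffle(data)
    print(np.cumsum(np.sort(data))[-1])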
