Remove inplace broadcast_add #5551
Conversation
CI failed, removing label automerge
This issue was found while running SincNet; we need to investigate further why BroadcastAdd was incorrectly invoked inside linear.
That is probably because the bias add went through the BroadcastAdd logic.
Right, but as I remember, during bias add it is x that ends up being broadcast; in principle no broadcast should happen here.
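For context on the bias-add case discussed above, here is an illustrative NumPy sketch (not OneFlow's actual implementation) of the usual broadcasting in a linear layer: the bias of shape `(out_features,)` is broadcast over the batch dimension, while x keeps its shape.

```python
import numpy as np

# Hypothetical shapes for illustration only.
x = np.ones((8, 16))   # (batch, in_features)
W = np.ones((4, 16))   # (out_features, in_features)
b = np.ones(4)         # (out_features,)

# b broadcasts from (4,) to (8, 4); x itself is not broadcast.
y = x @ W.T + b
print(y.shape)  # (8, 4)
```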
It was probably because broadcast add did not support inplace that we replaced it with broadcast + add directly in Python. Take a look at these two lines of Python code: oneflow/oneflow/python/nn/modules/math_ops.py Lines 474 to 475 in a5df297
Before this PR's change, the unit tests passed, i.e. a real broadcast add was performed, but it leaked GPU memory; training correctness was unaffected. That is why this PR split BroadcastAdd into broadcast + add.
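The split described above can be sketched in NumPy (an assumption about the shape of the change, not the PR's actual code): instead of one fused broadcast-add op, first materialize y at x's shape (the role of broadcast_like), then perform an ordinary same-shape elementwise add.

```python
import numpy as np

x = np.zeros((256, 1024))
y = np.arange(1024, dtype=np.float64)

# Step 1: explicit broadcast of y up to x's shape (broadcast_like's job).
y_b = np.broadcast_to(y, x.shape)

# Step 2: plain elementwise add with no implicit broadcasting left to do.
out = x + y_b
print(out.shape)  # (256, 1024)
```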
When using BroadcastAdd, if the forward pass completes but loss.backward() is never called, the previous backward graph is not released, so repeated calls eventually OOM. We suspect a circular reference when Broadcast captures tensors in the inplace case. Here is a minimal reproduction that reliably triggers it on the commit before this PR:

```python
#!/usr/bin/env python
# coding=utf-8
import oneflow as flow

for i in range(1000):
    a = flow.ones((256 * 256, 1024), requires_grad=True).to("cuda")  # 256 MB
    b = flow.ones((1024,), requires_grad=True).to("cuda")  # 4 KB
    a += b
    c = a.sum()
    print(c.numpy())
```
Because the BroadcastAdd op broadcasts both x and y, the result cannot always be written inplace into x, so the inplace version must not call this op directly.
This PR also makes broadcast_like accept None for its axes argument.
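The constraint above can be seen with NumPy's inplace semantics (an analogy, not OneFlow code): an inplace add can only write into x when the broadcast result shape equals x's shape; if x itself would need broadcasting, there is no room in x to hold the result.

```python
import numpy as np

# Case 1: only y is broadcast, so the result fits in x.
x = np.ones((4, 8))
y = np.ones(8)
np.add(x, y, out=x)  # result shape (4, 8) == x.shape, inplace is fine

# Case 2: x would have to be broadcast; the result cannot fit in x.
x2 = np.ones(8)
y2 = np.ones((4, 8))
try:
    np.add(x2, y2, out=x2)  # result shape (4, 8) != x2.shape
except ValueError:
    print("inplace impossible when x must be broadcast")
```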