Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove inplace broadcast_add #5551

Merged
merged 2 commits into from Jul 21, 2021
Merged

Conversation

wyg1997
Copy link
Contributor

@wyg1997 wyg1997 commented Jul 21, 2021

由于BroadcastAdd Op会把x和y都做广播处理,不一定能inplace,所以inplace版本的BroadcastAdd不能直接调用此Op。

另外使broadcast_like支持axes参数为None。

@github-actions
Copy link
Contributor

CI failed, removing label automerge

@oneflow-ci-bot oneflow-ci-bot removed their request for review July 21, 2021 04:09
@oneflow-ci-bot oneflow-ci-bot merged commit 5af1c83 into master Jul 21, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the fix-inplace_broadcast_add branch July 21, 2021 08:36
@wyg1997
Copy link
Contributor Author

wyg1997 commented Aug 14, 2021

这个问题是当时跑SincNet时发现的,需要进一步查一下为什么会在linear中错误调用了BroadcastAdd

@hjchen2
Copy link
Contributor

hjchen2 commented Aug 14, 2021

这个问题是当时跑SincNet时发现的,需要进一步查一下为什么会在linear中错误调用了BroadcastAdd

那是bias add走了BroadcastAdd的逻辑吧

@wyg1997
Copy link
Contributor Author

wyg1997 commented Aug 14, 2021

那是bias add走了BroadcastAdd的逻辑吧

对,但是我记得当时bias add的时候是会对x做broadcast的,这里按理说不应该broadcast

@hjchen2
Copy link
Contributor

hjchen2 commented Aug 14, 2021

但是我记得当时bias add的时候是会对x做broadcast的,这里按理说不应该broadcast

之前应该也是因为broadcast add不支持inplace,在python里面就直接换成broadcast+add了,你看一下这两行python代码,

y = flow.experimental.broadcast_like(y, x)
return ElementwiseAdd(inplace=True)(x, y)

@wyg1997
Copy link
Contributor Author

wyg1997 commented Aug 14, 2021

之前应该也是因为broadcast add不支持inplace,在python里面就直接换成broadcast+add了

这个PR改之前单测是能过的,也就是真的做了broadcast add,但是会有显存泄露问题,对训练正确性没影响。才有这个PR把BroadcastAdd拆成了broadcast+add。

@wyg1997
Copy link
Contributor Author

wyg1997 commented Aug 14, 2021

是使用BroadcastAdd时,如果跑完前向不执行 loss.backward() 时,前一个后向图不会释放,多次调用就会OOM,怀疑是Broadcast在inplace情况下捕获时出现了循环引用。

这里实现了一个最小复现代码,在这个PR前的commit上可以稳定复现:

#!/usr/bin/env python
# coding=utf-8

import oneflow as flow

for i in range(1000):
    a = flow.ones((256*256, 1024), requires_grad=True).to("cuda")  # 256MB
    b = flow.ones((1024), requires_grad=True).to("cuda")  # 256MB
    a += b
    c = a.sum()
    print(c.numpy())

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants