Fix upsample sbp infer bug and add global test #7884
# backward compute result of oneflow is not same with pytorch
@autotest(n=1, auto_backward=False, check_graph=False)
def _test_global_upsample2d_bicubic(test_case, placement, sbp):
In bicubic mode, OneFlow's backward result does not match PyTorch's; I'm not sure whether the two implementations differ.
To reproduce: set `auto_backward=True`, then run
python test_consistent_upsample.py --verbose --failfast
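We need to confirm whether the results really don't match, using PyTorch 1.11 as the reference. If they indeed don't match, please post an update at https://github.com/Oneflow-Inc/OneTeam/issues/1207#issuecomment-1073432125 and I'll debug it.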
After upgrading PyTorch to 1.11 (the earlier test was run under 1.10), the backward results still differ. I've posted the update in Oneflow-Inc/OneTeam#1207 (comment).
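One framework-independent way to check a backward implementation, offered only as a sketch: upsampling is linear in its input, so its backward pass must be the transpose of its forward pass, and the identity dot(forward(x), g) == dot(x, backward(g)) should hold for any x and g. The example below uses a 1-D nearest-neighbor upsample with a floor-based index rule for brevity. The function names and the index rule are assumptions for illustration, not OneFlow's or PyTorch's implementation, and the bicubic case would need its own forward and backward.

```python
def upsample_nearest_1d(x, out_size):
    """Forward: each output element copies one input element (floor index rule)."""
    in_size = len(x)
    return [x[(i * in_size) // out_size] for i in range(out_size)]


def upsample_nearest_1d_backward(grad_out, in_size):
    """Backward: accumulate each output gradient into the input it was copied from."""
    out_size = len(grad_out)
    grad_in = [0] * in_size
    for i, g in enumerate(grad_out):
        grad_in[(i * in_size) // out_size] += g
    return grad_in


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# Integer data keeps the comparison exact (no floating-point tolerance needed).
x = [1, 2, 3, 4]
g = [1, -2, 3, -4, 5, -6]

y = upsample_nearest_1d(x, len(g))
grad_x = upsample_nearest_1d_backward(g, len(x))

lhs = dot(y, g)       # <forward(x), g>
rhs = dot(x, grad_x)  # <x, backward(g)>
print(lhs, rhs, lhs == rhs)
```

If the two dot products disagree for some x and g, the backward is not the transpose of the forward, regardless of what another framework returns.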
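@clackhan This bug has been fixed in https://github.com/Oneflow-Inc/oneflow/pull/7916.
After merging PR 7916, enabling the backward test aborts immediately; with backward disabled there is no problem and the test runs normally. The error output is: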
python test_consistent_upsample.py --verbose --failfast
test_global_upsample2d_bicubic (__main__.TestGlobalUpsample2d) ... Environment has been initialized, this env init will do nothing.
/home/hanbinbin/anaconda3/envs/oneflow/lib/python3.8/site-packages/torch/_tensor.py:1104: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at aten/src/ATen/core/TensorBody.h:475.)
return self._grad
free(): corrupted unsorted chunks
Aborted (core dumped)
Strange. I'll take another look.
OK.
…add_var_upsample_global_test
@unittest.skip(
"The nearest interpolate operation in pytorch has bug, https://github.com/pytorch/pytorch/issues/65200"
)
@globaltest
def test_global_upsample2d_nearest(test_case):
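The PyTorch issue referenced here has been closed, but the results of this test still don't match PyTorch. I'm not sure whether the problem is in OneFlow or in PyTorch.
To reproduce: comment out @unittest.skip, then run
python test_consistent_upsample.py --verbose --failfast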
This may be because the PyTorch version in our CI environment is fairly old. For now, skip it here as well, as is done for the Local test.
@caishenghang
Shenghang and I ran into many problems during the upgrade and haven't had time to resolve them one by one yet.
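I reran the test and found the cause: PyTorch's CPU and GPU results don't match when the scale factor is not an integer. I did report this bug earlier, but PyTorch closed my issue without fixing it. Let's leave this problem aside for now.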
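The thread does not say which index formula each PyTorch backend uses, so the following is only an illustration of why non-integer scale factors are the sensitive case for nearest-neighbor upsampling. It compares two common source-index conventions, a floor-based rule and a pixel-center rule, using integer arithmetic. Both function names are hypothetical and neither is claimed to be the CPU or GPU kernel.

```python
def nearest_floor_index(dst, in_size, out_size):
    """Source index = floor(dst * in_size / out_size), clamped to the input range."""
    return min((dst * in_size) // out_size, in_size - 1)


def nearest_center_index(dst, in_size, out_size):
    """Source index = floor((dst + 0.5) * in_size / out_size), clamped to the input range."""
    return min(((2 * dst + 1) * in_size) // (2 * out_size), in_size - 1)


def index_maps(in_size, out_size):
    """Return the source-index map produced by each convention."""
    floor_map = [nearest_floor_index(i, in_size, out_size) for i in range(out_size)]
    center_map = [nearest_center_index(i, in_size, out_size) for i in range(out_size)]
    return floor_map, center_map


# Integer scale factor (4 -> 8, i.e. x2): the two conventions pick the same pixels.
integer_scale = index_maps(4, 8)

# Non-integer scale factor (4 -> 6, i.e. x1.5): the two conventions diverge.
fractional_scale = index_maps(4, 6)

print("x2  :", integer_scale)
print("x1.5:", fractional_scale)
```

A test that only uses integer scale factors would therefore not expose a difference between two such conventions, which is consistent with the mismatch appearing only for non-integer scales.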
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/7884/
CI failed when running job: cpu-misc. PR label automerge has been removed
Add var and upsample global tests