fix amax/amin/max/min write overflow #47570

Merged

@luotao1 luotao1 commented Nov 2, 2022

PR types

Bug fixes

PR changes

APIs

Describe

Reproduced with export FLAGS_use_system_allocator=1. The problem originates from #47125.

Minimal reproduction for amax:

import paddle
import numpy as np
x_np = np.array([[0.2, 0.3, 0.9, 0.9], [0.1, 0.1, 0.6, 0.7]])
x = paddle.to_tensor(x_np, dtype="float32", stop_gradient=False)
out = paddle.amax(x, axis=None, keepdim=False) # fails (the forward pass is fine; the error is triggered in backward)
grad_tensor = paddle.ones_like(x)
paddle.autograd.backward([out], [grad_tensor], True) 

Minimal reproduction for max:

import paddle
import numpy as np
x_np = np.array([[0.2, 0.3, 0.9, 0.9], [0.1, 0.1, 0.6, 0.7]])
x = paddle.to_tensor(x_np, dtype="float32", stop_gradient=False)
out = paddle.max(x, axis=None, keepdim=False) # fails (the forward pass is fine; the error is triggered in backward)
# out = paddle.max(x, axis=[0], keepdim=False) # does not fail
grad_tensor = paddle.ones_like(x)
paddle.autograd.backward([out], [grad_tensor], True) 

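Root-cause analysis

What @veyron95 found:

  • amax failure: in Paddle/paddle/phi/kernels/funcs/reduce_functor.h, a computation inside AMaxOrAMinGradFunctor writes out of bounds: dx->device(place) = dy->broadcast(dim) * mask / equal_number;
  • max failure: in Paddle/paddle/phi/kernels/funcs/reduce_functor.h, a computation inside MaxOrMinGradFunctor writes out of bounds: dx->device(place) = (dy->broadcast(dim) * equals.select(ones, zeros));
  • In both cases the left-hand dx has dims (2,4), but printing the right-hand result shows 64 values. Writing the right-hand result into the left-hand side overruns the buffer, which then causes an exception when the memory is later freed.
  • Why no error was reported before: by default FLAGS_use_system_allocator=0, so memory is managed by BestFitAllocator, which requests one large block of memory for tensors to use. Even when a write goes out of bounds, nothing else is using the overrun region, so it causes no serious problem.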
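The element-count mismatch described above can be sketched in NumPy. This is only an illustration: np.tile stands in for Eigen's broadcast (which replicates a tensor by per-axis factors), the helper name eigen_style_broadcast is hypothetical, and the (2,4) shape assumed for dy in the second case is a guess suggested by the reproduction passing paddle.ones_like(x) as the gradient. The PR does not state the actual dy shape.

```python
import numpy as np


def eigen_style_broadcast(t, factors):
    """Hypothetical stand-in for Eigen's broadcast: replicate `t`
    along each axis by the corresponding factor."""
    return np.tile(t, factors)


dx_shape = (2, 4)  # dims of dx reported in the analysis

# Case A: dy has one element and all-ones dims, so tiling by (2, 4)
# yields exactly as many elements as dx.
dy_ok = np.ones((1, 1), dtype=np.float32)
rhs_ok = eigen_style_broadcast(dy_ok, dx_shape)

# Case B (assumption, for illustration only): dy still has 8 elements,
# so tiling by (2, 4) yields 8 * 8 = 64 elements, far more than dx holds.
dy_bad = np.ones((2, 4), dtype=np.float32)
rhs_bad = eigen_style_broadcast(dy_bad, dx_shape)

dx_size = int(np.prod(dx_shape))
print("dx elements:", dx_size)
print("case A rhs elements:", rhs_ok.size)
print("case B rhs elements:", rhs_bad.size)
```

NumPy would refuse to assign the case B result into an 8-element array, whereas the analysis above describes the kernel writing past the end of dx without any such check.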

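What @luotao1 found:

  • Static graph mode is fine and the old dygraph mode is fine; only the new dygraph mode fails. Commenting out def test_dygraph(self): at line 115 of python/paddle/fluid/tests/unittests/test_max_min_amax_amin_op.py makes the test pass.
  • The error occurs only when axis=None, that is, TestMaxMinAmaxAminAPI2 and TestMaxMinAmaxAminAPI6 fail.

In both cases the left-hand dx has dims (2,4), but printing the right-hand result shows 64 values

Printing the right-hand-side variables one by one shows that dy->broadcast(dim) is what holds 64 values, so dy needs a reshape first.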
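As a rough reference for what the two gradient expressions should produce once dy has the right shape, here is a NumPy sketch for the axis=None case. It is not the change made in this PR: the function names are hypothetical, dy is assumed to be a single upstream gradient value, and the formulas simply mirror the two expressions quoted from reduce_functor.h.

```python
import numpy as np


def max_grad_reduce_all(x, dy):
    """Mirror of dy->broadcast(dim) * equals.select(ones, zeros) for axis=None.
    `dy` is assumed to hold exactly one value."""
    # Reshape dy to one size-1 dim per input dim, then expand to x.shape.
    dy_b = np.broadcast_to(np.reshape(dy, (1,) * x.ndim), x.shape)
    mask = (x == x.max()).astype(x.dtype)
    return dy_b * mask


def amax_grad_reduce_all(x, dy):
    """Mirror of dy->broadcast(dim) * mask / equal_number for axis=None."""
    dy_b = np.broadcast_to(np.reshape(dy, (1,) * x.ndim), x.shape)
    mask = (x == x.max()).astype(x.dtype)
    equal_number = mask.sum()  # how many elements tie for the maximum
    return dy_b * mask / equal_number


x = np.array([[0.2, 0.3, 0.9, 0.9], [0.1, 0.1, 0.6, 0.7]], dtype=np.float32)
dy = np.float32(1.0)

dx_max = max_grad_reduce_all(x, dy)
dx_amax = amax_grad_reduce_all(x, dy)
print(dx_max)
print(dx_amax)
```

Both results have the same (2,4) shape as x, so the assignment into dx stays within bounds. With these formulas the two tied maxima each receive the full gradient under max and half of it under amax.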

paddle-bot bot commented Nov 2, 2022

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@cxxly cxxly left a comment

LGTM

@veyron95 veyron95 left a comment

LGTM

@luotao1 luotao1 merged commit 6f7a80c into PaddlePaddle:develop Nov 2, 2022
@luotao1 luotao1 deleted the fix_amax_memory_write_overflow branch November 2, 2022 06:58