GPU error with models on version 1.5 and later #19628
Comments
@zhaoyuchen2018 Could you help take a look at this sum op related issue?
It looks like one of the tensors fed into sum contains no data. Could you provide code that reproduces the problem?
Hi, is there any progress on this issue? A user has reported a similar problem: PaddlePaddle/PARL#151
I checked; one of the tensors has no data. We need reproduction code to debug further.
So far we have found the problem only occurs with multiple GPUs; setting the environment variable
Is there any progress on this? I have run into a similar problem: #19679
@zhaoyuchen2018
The model is an LSTM generation model with a copy mechanism.
![image](https://user-images.githubusercontent.com/2569549/64237527-18eb5800-cf2f-11e9-8f36-89a879936779.png)
Under versions 1.3 and 1.4, the model runs without problems on both GPU and CPU.
With paddle 1.5 and the latest version, the GPU build fails while the CPU build works. The GPU error is shown below.
The error message gives no concrete location for the error.
Following the hint, I searched for sum-related OPs, without success.
The code I finally narrowed it down to is inside DynamicRNN's block(), as follows.
Removing this code makes the error go away.
Here self.decoder_size is the decoder LSTM size, and self.max_length is the maximum number of generation steps.
current_h is the LSTM output at the current time step, with shape [batch_size, self.decoder_size].
copy_score_weight comes from all the words on the encoder side, with shape [batch_size, self.max_length, self.decoder_size].
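The original Paddle snippet was not captured in this thread, but the shape arithmetic of the copy-score step can be sketched in plain NumPy. Everything below other than the names current_h and copy_score_weight (and the symbolic shapes) is a hypothetical illustration; the concrete sizes are made up:

```python
import numpy as np

# Hypothetical sizes; the issue only states the symbolic shapes.
batch_size, max_length, decoder_size = 4, 10, 32

# current_h: LSTM output at the current step, [batch_size, decoder_size]
current_h = np.random.rand(batch_size, decoder_size).astype("float32")

# copy_score_weight: encoder-side features,
# [batch_size, max_length, decoder_size]
copy_score_weight = np.random.rand(
    batch_size, max_length, decoder_size
).astype("float32")

# Broadcast current_h over the max_length axis, multiply elementwise,
# then reduce over the decoder_size axis -- the same shape arithmetic
# a reduce_sum over the last dim would perform inside the DynamicRNN block.
copy_score = (copy_score_weight * current_h[:, None, :]).sum(axis=-1)

print(copy_score.shape)  # (4, 10), i.e. [batch_size, max_length]
```

This only demonstrates the intended shapes; it does not reproduce the GPU-side sum op failure, which the maintainers note requires the full reproduction code.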
I also experimented with removing the reduce_sum op; the code is
The resulting error is identical to the original one: it still reports sum to lod tensor.
The full error output is as follows: