
An issue about multi-GPU #235

Closed
luyaojie opened this issue Sep 6, 2017 · 4 comments

luyaojie commented Sep 6, 2017

Hello, all.

I encountered an error when I ran OpenNMT with the multi-GPU option:

python train.py -data data/demo -save_model demo-model -word_vec_size 620 -gpuid 0 1 2 3

Traceback (most recent call last):
  File "train.py", line 309, in <module>
    main()
  File "train.py", line 270, in main
    model.encoder.embeddings.load_pretrained_vectors(opt.pre_word_vecs_enc)
  File "/home1/yaojie/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 262, in __getattr__
    type(self).__name__, name))
AttributeError: 'DataParallel' object has no attribute 'encoder'


srush commented Sep 7, 2017 via email


dalegebit commented Sep 7, 2017

Your problem can be fixed by moving the line model.encoder.embeddings.load_pretrained_vectors(opt.pre_word_vecs_enc) so that it runs before model is wrapped in DataParallel.
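A minimal sketch of why the AttributeError occurs (plain Python, no torch needed, with a hypothetical FakeDataParallel stand-in): DataParallel stores the wrapped network as self.module and does not forward arbitrary attribute lookups to it, so attributes must be loaded before wrapping, or reached via .module.

```python
class Encoder:
    def load_pretrained_vectors(self, path):
        return "loaded %s" % path

class Model:
    def __init__(self):
        self.encoder = Encoder()

class FakeDataParallel:
    """Stand-in for torch.nn.DataParallel: it keeps the wrapped model
    as `self.module` and has no `encoder` attribute of its own."""
    def __init__(self, module):
        self.module = module

model = Model()
# Works: load pretrained vectors BEFORE wrapping.
model.encoder.load_pretrained_vectors("vecs.enc")

model = FakeDataParallel(model)
try:
    model.encoder  # fails after wrapping: the wrapper has no `encoder`
except AttributeError:
    pass
# Alternatively, reach the wrapped model through `.module`.
model.module.encoder.load_pretrained_vectors("vecs.enc")
```

The same two options apply to the real torch.nn.DataParallel: either do all direct attribute access before wrapping, or go through model.module afterwards.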

There are two main problems concerning multi-GPU training:

  1. When a batch is split across GPUs, the maximum sequence length of each sub-batch may not equal that of the whole batch. So when the outputs are recovered as padded LongTensors through unpack, the recovered sizes may differ between sub-batches. I recommend fixing this by passing an extra argument that explicitly indicates the maximum length of the whole batch, and then concatenating extra padding onto each output up to that length. See: Pad PackedSequences to original batch length, pytorch/pytorch#1591. You can also refer to my implementation: https://github.com/dalegebit/OpenNMT-py/blob/d09323599b9ff5759b0daf08d814118faf0716c1/onmt/Models.py#L217
  2. PyTorch currently doesn't support DataParallel returning dicts or instances of custom classes. I have fixed this: Allow DataParallel to return dicts and other instances of iterable custom classes, pytorch/pytorch#2511. In the meantime, you should turn RNNDecoderState into an iterable: https://github.com/dalegebit/OpenNMT-py/blob/d09323599b9ff5759b0daf08d814118faf0716c1/onmt/Models.py#L508
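The padding problem in point 1 can be illustrated with a torch-free sketch (the function name unpack_to_length is hypothetical): each replica unpacks its sub-batch to its local maximum length, so the recovered shapes disagree unless every replica pads out to the whole batch's maximum.

```python
PAD = 0

def unpack_to_length(sub_batch, total_length):
    """Pad each sequence (a list of token ids) out to `total_length`,
    mimicking unpacking to a fixed, whole-batch maximum length."""
    return [seq + [PAD] * (total_length - len(seq)) for seq in sub_batch]

# A batch whose longest sequence (length 4) lands in the first split:
batch = [[1, 2, 3, 4], [5, 6], [7, 8, 9], [10]]
total_length = max(len(s) for s in batch)

# DataParallel-style scatter into two sub-batches:
splits = [batch[:2], batch[2:]]

# Without the extra argument, each split would pad only to its own local
# max (4 vs 3); with it, every replica's output has the same width:
outputs = [unpack_to_length(s, total_length) for s in splits]
assert all(len(seq) == total_length for out in outputs for seq in out)
```

This is the same idea behind passing the whole-batch maximum length through to the unpack step so that the gathered outputs can be concatenated.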


srush commented Sep 7, 2017

@dalegebit we would love a PR if you have one.

@dalegebit

Sure

marcotcr pushed a commit to marcotcr/OpenNMT-py that referenced this issue Sep 20, 2017
@vince62s vince62s closed this as completed Aug 2, 2018