DDP issue - IndexError: Caught IndexError in replica 0 on device 0 #12

Closed
chenqianben opened this issue Mar 16, 2022 · 6 comments
Labels
question Further information is requested

Comments

@chenqianben

Hello, when I run in single-machine multi-GPU mode, I get the following error:

Traceback (most recent call last):
File "/data/home/qianbenchen/DocEE-main/dee/tasks/dee_task.py", line 587, in get_loss_on_batch
teacher_prob=teacher_prob,
File "/data/home/qianbenchen/envs/torch/venv/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/home/qianbenchen/envs/torch/venv/lib64/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/data/home/qianbenchen/envs/torch/venv/lib64/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/data/home/qianbenchen/envs/torch/venv/lib64/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/data/home/qianbenchen/envs/torch/venv/lib64/python3.6/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
IndexError: Caught IndexError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/data/home/qianbenchen/envs/torch/venv/lib64/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/data/home/qianbenchen/envs/torch/venv/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/home/qianbenchen/DocEE-main/dee/models/trigger_aware.py", line 172, in forward
ent_fix_mode=self.config.ent_fix_mode,
File "/data/home/qianbenchen/DocEE-main/dee/modules/doc_info.py", line 305, in get_doc_arg_rel_info_list
) = get_span_mention_info(span_dranges_list, doc_token_type_mat)
File "/data/home/qianbenchen/DocEE-main/dee/modules/doc_info.py", line 16, in get_span_mention_info
mention_type_list.append(doc_token_type_list[sent_idx][char_s])
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run_dee_task.py", line 274, in <module>
dee_task.train(save_cpt_flag=in_argv.save_cpt_flag)
File "/data/home/qianbenchen/DocEE-main/dee/tasks/dee_task.py", line 656, in train
base_epoch_idx=resume_base_epoch,
File "/data/home/qianbenchen/DocEE-main/dee/tasks/base_task.py", line 693, in base_train
total_loss = get_loss_func(self, batch, **kwargs_dict1)
File "/data/home/qianbenchen/DocEE-main/dee/tasks/dee_task.py", line 598, in get_loss_on_batch
raise Exception("Cannot get the loss")

Has this problem been resolved? Thank you!
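Editorial note: the inner frames of the traceback pass through torch/nn/parallel/data_parallel.py, so the model was running under nn.DataParallel when the error was raised. As a hedged illustration only (a pure-Python toy, not PyTorch's scatter code and not the DocEE implementation), the sketch below shows one way an `IndexError: list index out of range` can arise in a lookup like `doc_token_type_list[sent_idx][char_s]`: the per-sentence rows are split into one chunk per replica, while the span indices still refer to sentence positions in the full input. The names `split_rows` and `lookup_types` and the sample data are hypothetical.

```python
def split_rows(rows, num_replicas):
    """Split rows into contiguous chunks, one per replica (toy stand-in for a dim-0 split)."""
    chunk = -(-len(rows) // num_replicas)  # ceiling division
    return [rows[i:i + chunk] for i in range(0, len(rows), chunk)]


def lookup_types(span_ranges, token_type_rows):
    """Return the token type at the start of each (sent_idx, char_s, char_e) span."""
    return [token_type_rows[sent_idx][char_s] for sent_idx, char_s, _ in span_ranges]


# Hypothetical data: 4 sentences with 3 token types each.
token_types = [[0, 1, 0], [2, 2, 0], [0, 3, 3], [1, 0, 0]]
# Spans index sentences in the FULL input (sentence 3 is the last one).
spans = [(0, 1, 2), (3, 0, 1)]

# Unsplit input: every sentence index is valid.
full_result = lookup_types(spans, token_types)

# Split across 2 replicas: the first replica sees only 2 rows,
# but the spans were not re-indexed, so sent_idx=3 is out of range.
replica_rows = split_rows(token_types, 2)[0]
try:
    lookup_types(spans, replica_rows)
    split_error = None
except IndexError as exc:
    split_error = str(exc)
```

Whether this is the exact mechanism in DocEE is an assumption; the thread only establishes that the error disappears when the job is started with the distributed launch script.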

@Spico197
Owner

Hello, this problem did not occur when the current code was run and tested. Did you use your own data or modify the code?

@Spico197 added the question label on Mar 16, 2022
@chenqianben
Author

chenqianben commented Mar 16, 2022

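Hello, I re-cloned the repository and used the complete code you released, and found that single-machine multi-GPU mode does not run. I am not sure whether this is because the model parameters were not all copied to the other GPUs. Have you run into this problem? Thank you very much!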

@Spico197
Owner

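Hello, I just tested the training process of PTPCG and Doc2EDAG on ChFinAnn and found nothing abnormal. Are you using a custom dataset?
Also, you mentioned single-machine multi-GPU mode, but judging from the error message, this problem is unrelated to DDP. If you are using DDP, did you write your multi-GPU launch script following the pattern in scripts/run_doc2edag.sh?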

[Attached images: PTPCG, Doc2EDAG]
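Editorial note: a quick way to tell whether a script was started through a distributed launcher is to inspect the environment of the process. Launchers such as torch.distributed.launch export `RANK` and `WORLD_SIZE` to each worker process. The minimal sketch below checks for them; the helper name `distributed_launch_info` is hypothetical, and how DocEE itself chooses between parallel modes is not shown in this thread.

```python
import os


def distributed_launch_info(env=None):
    """Return (rank, world_size) if launcher variables are present, else None."""
    env = os.environ if env is None else env
    if "RANK" in env and "WORLD_SIZE" in env:
        return int(env["RANK"]), int(env["WORLD_SIZE"])
    return None


if __name__ == "__main__":
    info = distributed_launch_info()
    if info is None:
        print("No launcher variables found: started as a plain `python script.py`.")
    else:
        print(f"Started by a distributed launcher: rank {info[0]} of {info[1]}.")
```

Printing this at the top of a training script makes it obvious when a multi-GPU run was started without the launcher.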

@chenqianben
Author

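Hello, yes, you are right. I found my mistake: I was not using the distributed launch tool. Deploying with the run_doc2edag.sh script works without any problem. Thank you very much, and have a nice day!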

@Spico197
Owner

Sounds good.

@hxi667

hxi667 commented Apr 4, 2022

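> Hello, yes, you are right. I found my mistake: I was not using the distributed launch tool. Deploying with the run_doc2edag.sh script works without any problem. Thank you very much, and have a nice day!

Hi, I ran into the same problem. By "using the distributed launch tool" in single-machine multi-GPU mode, do you mean something like this: python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of your training script)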
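Editorial note: the thread shows no reply to this question; the maintainer's reference is scripts/run_doc2edag.sh in the repository. As a rough sketch only, the generic launcher command quoted above can be assembled in Python as an argument list. The helper name `build_launch_command`, the GPU count, and the script arguments are hypothetical placeholders, not DocEE's actual options.

```python
import sys


def build_launch_command(script, nproc_per_node, script_args=()):
    """Assemble the generic `python -m torch.distributed.launch` command as an argv list."""
    return [
        sys.executable,                        # the current Python interpreter
        "-m", "torch.distributed.launch",      # launcher module quoted in the thread
        f"--nproc_per_node={nproc_per_node}",  # one worker process per GPU
        script,
        *script_args,                          # arguments forwarded to the training script
    ]


# Hypothetical example: 2 GPUs and a placeholder argument.
cmd = build_launch_command("run_dee_task.py", 2, ["--arg1"])
print(" ".join(cmd))
# To actually start training, the list could be passed to subprocess.run(cmd, check=True).
```

Newer PyTorch releases also provide `torchrun` as the successor to `torch.distributed.launch`; which one applies depends on the installed version.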
