
About multi-gpus training #1

Open
ZepingZhou opened this issue Oct 12, 2020 · 6 comments

@ZepingZhou

Hi, Liu. Thanks for sharing your work. I've run into a problem when training simple-IAM with multiple GPUs. nn.DataParallel works when training the PRM classification network, but training fails when it reaches the IAM stage. Here are my modifications to your code:

self.optimizer_filling = nn.DataParallel(self.optimizer_filling, device_ids=self.Device_ids)
self.optimizer_prm = nn.DataParallel(self.optimizer_prm, device_ids=self.Device_ids)
self.prm_module = nn.DataParallel(peak_response_mapping(self.basebone, **config['model']), device_ids=self.Device_ids)
self.filling_module = nn.DataParallel(instance_extent_filling(config), device_ids=self.Device_ids)
self.filling_module.module.load_state_dict(checkpoint['state_dict'], False)
self.prm_module.module.load_state_dict(checkpoint['state_dict'], False)

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/user/anaconda3/envs/CenterMask/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/user/anaconda3/envs/CenterMask/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/media/ExtHDD/zzp/simple-IAM-master/iam/modules/instance_extent_filling.py", line 105, in forward
self.channel_num, self.kernel, self.kernel)
RuntimeError: shape '[2, 112, 112, 16, 3, 3]' is invalid for input of size 1806336
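
My guess (not verified against the repo) is that the view() at that line uses a batch size fixed from the config, while nn.DataParallel splits the batch across replicas: 1806336 elements is exactly 1 × 112 × 112 × 16 × 3 × 3, but the requested shape assumes a batch of 2. A minimal sketch of that failure mode, with made-up tensor names:

# Minimal sketch (not the actual repo code) of a reshape that breaks under
# nn.DataParallel when the batch size is stored at construction time instead
# of being read from the input tensor.
import torch

batch_size = 2                    # full batch size taken from the config
h, w, channel_num, kernel = 112, 112, 16, 3

# each replica only sees a batch of 1 after DataParallel scatters the input
x = torch.randn(1, h * w * channel_num * kernel * kernel)

# This reproduces the error: 2*112*112*16*3*3 != 1806336
# x.view(batch_size, h, w, channel_num, kernel, kernel)

# Reading the batch size from the input keeps the view valid on every replica:
x = x.view(x.size(0), h, w, channel_num, kernel, kernel)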

@ZepingZhou
Author

Do you have any suggestions? I would appreciate it if you could reply. Thanks.

@LiuYiwai
Owner

Does this error occur if you use a CPU or a single GPU?

@ZepingZhou
Author

Does this error occur if you use a CPU or a single GPU?
Thanks for your reply! The error only happens when I use multiple GPUs. The problem is now solved by using DistributedDataParallel.
By the way, I find that the training loss of the filling module is very high (about 0.38). How about yours?
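
In case it helps others, here is a minimal sketch of the DistributedDataParallel setup I used; the module and optimizer below are placeholders, not the actual simple-IAM code:

# Minimal single-node DistributedDataParallel sketch (launch with torchrun);
# FillingNet and the optimizer are placeholders standing in for the real modules.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

class FillingNet(torch.nn.Module):          # placeholder module
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 16, 3, padding=1)

    def forward(self, x):
        return self.conv(x)

dist.init_process_group(backend="nccl")     # torchrun provides rank/world size via env vars
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = FillingNet().cuda(local_rank)
model = DDP(model, device_ids=[local_rank]) # wrap the nn.Module only

# The optimizer is built on the wrapped model's parameters; optimizers are
# never wrapped in DataParallel/DistributedDataParallel themselves.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)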

@LiuYiwai
Owner

I got results similar to yours. I'm not sure if it's a code problem or a hyperparameter problem. I tried to email the author of the paper, but I didn't get a reply.

Finally, the remaining problem I encountered is that I don't know how to calculate mAP, because some threshold parameters, such as class_threshold and peak_threshold in PRM, are not in the range 0 to 1.

I hope this is helpful for your future experiments. If you have good ideas or find any bugs, please contact me.
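
For what it's worth, my rough understanding (I may be missing something specific to IAM) is that AP is usually computed by ranking predictions by confidence and integrating the precision-recall curve, so thresholds like class_threshold and peak_threshold would only decide which predictions get emitted, not how AP itself is computed. A rough sketch of that standard computation, not this repo's evaluation code:

# Rough sketch of AP for one class: rank predictions by confidence, mark each
# as TP/FP (e.g. by IoU against ground truth), then average precision over the
# ranked list. No single fixed threshold is needed for the AP math itself.
import numpy as np

def average_precision(scores, is_tp, num_gt):
    order = np.argsort(-np.asarray(scores))           # highest confidence first
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-9)
    # 11-point interpolation for brevity
    ap = 0.0
    for r in np.arange(0.0, 1.01, 0.1):
        p = precision[recall >= r].max() if np.any(recall >= r) else 0.0
        ap += p / 11.0
    return ap

# Example: 4 predictions, 3 ground-truth instances
print(average_precision([0.9, 0.8, 0.6, 0.3], [1, 0, 1, 1], num_gt=3))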

@LiuYiwai
Owner

Also, the code for the PRM part was provided by the author.

@ZepingZhou
Author

OK, thank you for the suggestions! I will keep trying!
