
About multi-gpus training #1

Open
ZepingZhou opened this issue Oct 12, 2020 · 6 comments

@ZepingZhou

Hi, Liu. Thanks for sharing your work. I've run into a problem when training simple-IAM with multiple GPUs. nn.DataParallel works when training the PRM classification network, but training fails when it reaches the IAM stage. Here are my modifications to your code:

self.optimizer_filling = nn.DataParallel(self.optimizer_filling, device_ids=self.Device_ids)
self.optimizer_prm = nn.DataParallel(self.optimizer_prm, device_ids=self.Device_ids)
self.prm_module = nn.DataParallel(peak_response_mapping(self.basebone, **config['model']), device_ids=self.Device_ids)
self.filling_module = nn.DataParallel(instance_extent_filling(config), device_ids=self.Device_ids)
self.filling_module.module.load_state_dict(checkpoint['state_dict'], False)
self.prm_module.module.load_state_dict(checkpoint['state_dict'], False)

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/user/anaconda3/envs/CenterMask/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/user/anaconda3/envs/CenterMask/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/media/ExtHDD/zzp/simple-IAM-master/iam/modules/instance_extent_filling.py", line 105, in forward
self.channel_num, self.kernel, self.kernel)
RuntimeError: shape '[2, 112, 112, 16, 3, 3]' is invalid for input of size 1806336
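
My guess (not verified against the repo) is that the view() at that line uses a batch size fixed from the config, while nn.DataParallel splits the batch across replicas: 1806336 elements is exactly 1 × 112 × 112 × 16 × 3 × 3, but the requested shape assumes a batch of 2. A minimal sketch of that failure mode, with made-up tensor names:

# Minimal sketch (not the actual repo code) of a reshape that breaks under
# nn.DataParallel when the batch size is stored at construction time instead
# of being read from the input tensor.
import torch

batch_size = 2                    # full batch size taken from the config
h, w, channel_num, kernel = 112, 112, 16, 3

# each replica only sees a batch of 1 after DataParallel scatters the input
x = torch.randn(1, h * w * channel_num * kernel * kernel)

# This reproduces the error: 2*112*112*16*3*3 != 1806336
# x.view(batch_size, h, w, channel_num, kernel, kernel)

# Reading the batch size from the input keeps the view valid on every replica:
x = x.view(x.size(0), h, w, channel_num, kernel, kernel)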

@ZepingZhou
Author

Do you have any suggestions? I would appreciate it if you could reply. Thanks.

@LiuYiwai
Owner

Does this error occur if you use a CPU or a single GPU?

@ZepingZhou
Author

Does this error occur if you use a CPU or a single GPU?
Thanks for your reply! The error only happens when I use multiple GPUs. The problem is now solved by using DistributedDataParallel.
By the way, I find that the training loss of the filling module is very high (about 0.38). How about yours?
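
In case it helps others, here is a minimal sketch of the DistributedDataParallel setup I used; the module and optimizer below are placeholders, not the actual simple-IAM code:

# Minimal single-node DistributedDataParallel sketch (launch with torchrun);
# FillingNet and the optimizer are placeholders standing in for the real modules.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

class FillingNet(torch.nn.Module):          # placeholder module
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 16, 3, padding=1)

    def forward(self, x):
        return self.conv(x)

dist.init_process_group(backend="nccl")     # torchrun provides rank/world size via env vars
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = FillingNet().cuda(local_rank)
model = DDP(model, device_ids=[local_rank]) # wrap the nn.Module only

# The optimizer is built on the wrapped model's parameters; optimizers are
# never wrapped in DataParallel/DistributedDataParallel themselves.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)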

@LiuYiwai
Owner

I got results similar to yours. I'm not sure if it's a code problem or a hyperparameter problem. I tried to email the author of the paper, but I didn't get a reply.

Finally, the remaining problem I encountered is that I don't know how to calculate mAP, because some threshold parameters, such as class_threshold and peak_threshold in PRM, are not in the range 0 to 1.

I hope this is helpful for your future experiments. If you have good ideas or find any bugs, please contact me.
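
For what it's worth, my rough understanding (I may be missing something specific to IAM) is that AP is usually computed by ranking predictions by confidence and integrating the precision-recall curve, so thresholds like class_threshold and peak_threshold would only decide which predictions get emitted, not how AP itself is computed. A rough sketch of that standard computation, not this repo's evaluation code:

# Rough sketch of AP for one class: rank predictions by confidence, mark each
# as TP/FP (e.g. by IoU against ground truth), then average precision over the
# ranked list. No single fixed threshold is needed for the AP math itself.
import numpy as np

def average_precision(scores, is_tp, num_gt):
    order = np.argsort(-np.asarray(scores))           # highest confidence first
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-9)
    # 11-point interpolation for brevity
    ap = 0.0
    for r in np.arange(0.0, 1.01, 0.1):
        p = precision[recall >= r].max() if np.any(recall >= r) else 0.0
        ap += p / 11.0
    return ap

# Example: 4 predictions, 3 ground-truth instances
print(average_precision([0.9, 0.8, 0.6, 0.3], [1, 0, 1, 1], num_gt=3))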

@LiuYiwai
Owner

Also, the code for the PRM part was provided by the author.

@ZepingZhou
Author

OK, thank you for the suggestions! I will keep trying!
