
Data Parallel issue with types.MethodType #112

Open
pkdogcom opened this issue Dec 7, 2018 · 5 comments
pkdogcom commented Dec 7, 2018

I found that when using nn.DataParallel along with this library, errors occur during the forward pass on multiple GPUs. When a torchvision network is modified (such as in the modify_resnets function), types.MethodType binds the method to the model instance on GPU 0, so when forward is called on GPU 1, the model and the input end up on different GPUs, which leads to errors.
The original approach of binding the function to the class instead of the instance seems to solve this issue, but it may suffer from other problems, as in #71. Is there any way to fix this issue without introducing another?
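To illustrate the failure mode described above, here is a minimal, torch-free sketch. DataParallel replicates a module by shallow-copying its attributes, which copy.copy simulates here; the Model class, the forward function, and the device strings are hypothetical stand-ins, not the library's actual code:

```python
import copy
import types

class Model:
    def __init__(self, device):
        self.device = device

def forward(self):
    # returns the device of whatever instance the method is bound to
    return self.device

m0 = Model("cuda:0")
# bind forward to the instance, as types.MethodType in modify_resnets does
m0.forward = types.MethodType(forward, m0)

# simulate DataParallel replication onto GPU 1 with a shallow copy
m1 = copy.copy(m0)
m1.device = "cuda:1"

# the copied bound method still points at the original instance,
# so the replica computes with the GPU-0 model's state
print(m1.forward())               # cuda:0, not cuda:1
print(m1.forward.__self__ is m0)  # True
```

Because the bound method lives in the instance's `__dict__`, every replica inherits a method whose `__self__` is the original module on GPU 0.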

Cadene commented Dec 8, 2018

Thanks for reporting this issue.
A simple way to fix this would be to rewrite the models from torchvision in this library.

@Cadene Cadene self-assigned this Dec 8, 2018
@Cadene Cadene added the bug label Dec 8, 2018
pkdogcom commented Dec 9, 2018

I actually think the original approach of adding unbound methods (features, logits, forward) to the class is the way to go: an unbound method is always bound to the instance that calls it (that's why unbound methods exist, right?), so the weights, devices, and any other instance state will be correct.
I think the reason the modify_xxx functions switched from adding functions to the class as unbound methods to adding them to instances as bound methods is that the implementation of modify_vgg itself is not idempotent (it deletes the original features, classifier, etc. after modification). A simple fix may be to check whether the model (or, in fact, the VGG class) has already been modified before modifying it.
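A sketch of that guard, with hypothetical names (the _modified flag and the placeholder logits body are invented for illustration). Attaching the functions to the class means every instance, including each DataParallel replica, binds them to itself, and the flag makes repeated modification a no-op:

```python
class VGG:
    """Stand-in for torchvision's VGG class."""
    pass

def logits(self, x):
    return x  # placeholder body

def modify_vgg(cls):
    # idempotence guard: skip the (destructive) rewrite if already applied
    if getattr(cls, "_modified", False):
        return cls
    cls.logits = logits  # plain function on the class: binds per instance
    cls._modified = True
    return cls

modify_vgg(VGG)
modify_vgg(VGG)  # second call is a no-op

v = VGG()
print(v.logits.__self__ is v)  # True: bound to the calling instance
```

With the guard in place, instantiating the model several times no longer re-runs the destructive part of modify_vgg, so the class-level (unbound) binding becomes safe again.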


pkdogcom commented Dec 9, 2018

That way we can use the unbound-method approach in the modify_xxx functions for correct data_parallel behavior during training, while not suffering from the multi-instantiation issue with the same model, or more specifically, the VGG model.


pkdogcom commented Dec 9, 2018

P.S. I also suggest making modify_vgg consistent with the other models, such as ResNet, so that the features method outputs a 4-d tensor (e.g. batch × channels × 7 × 7) as the other models do, rather than one that has been reshaped and transformed by the two fully-connected layers. That way the modified torchvision VGG model can be used interchangeably with other models as the feature extractor of a user-defined model.
All the fc and relu layers in classifier can be kept in the logits method.


wpwei commented Apr 4, 2019

> Thanks for reporting this issue.
> A simple way to fix this would be to rewrite the models from torchvision in this library.

I made a simple fix, with those models rewritten. Please check it out: #145
