
Data Parallel issue with types.MethodType #112

Open
pkdogcom opened this issue Dec 7, 2018 · 5 comments
pkdogcom commented Dec 7, 2018

I found that when using nn.DataParallel along with this library, errors occur during the forward pass on multiple GPUs. When a torchvision network is modified (such as in the modify_resnets function), types.MethodType binds the method to the model instance on GPU 0, so when forward is called on GPU 1, the model and the input end up on different GPUs, which leads to errors.
The original approach of binding the function to the class instead of the instance seems to solve this issue, but it may suffer from other problems, as in #71. Is there any way to fix this issue without introducing another?
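To illustrate the failure mode described above, here is a minimal, torch-free sketch. DataParallel replicates a module by shallow-copying its attributes, which copy.copy simulates here; the Model class, the forward function, and the device strings are hypothetical stand-ins, not the library's actual code:

```python
import copy
import types

class Model:
    def __init__(self, device):
        self.device = device

def forward(self):
    # returns the device of whatever instance the method is bound to
    return self.device

m0 = Model("cuda:0")
# bind forward to the instance, as types.MethodType in modify_resnets does
m0.forward = types.MethodType(forward, m0)

# simulate DataParallel replication onto GPU 1 with a shallow copy
m1 = copy.copy(m0)
m1.device = "cuda:1"

# the copied bound method still points at the original instance,
# so the replica computes with the GPU-0 model's state
print(m1.forward())               # cuda:0, not cuda:1
print(m1.forward.__self__ is m0)  # True
```

Because the bound method lives in the instance's `__dict__`, every replica inherits a method whose `__self__` is the original module on GPU 0.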

Cadene commented Dec 8, 2018

Thanks for reporting this issue.
A simple way to fix this would be to rewrite the models from torchvision in this library.

@Cadene Cadene self-assigned this Dec 8, 2018
@Cadene Cadene added the bug label Dec 8, 2018
pkdogcom commented Dec 9, 2018

I actually think the original approach of adding unbound methods (features, logits, forward) to the class is the way to go: an unbound method is always bound to the instance that calls it (that's why unbound methods exist, right?), so the weights, devices, and any other instance state will be correct.
I think the reason the modify_xxx functions switched from adding functions to the class as unbound methods to adding them to instances as bound methods is that the implementation of modify_vgg itself is not idempotent (it deletes the original features, classifier, etc. after modification). A simple fix may be to check whether the model (or, in fact, the VGG class) has already been modified before modifying it.
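A sketch of that guard, with hypothetical names (the _modified flag and the placeholder logits body are invented for illustration). Attaching the functions to the class means every instance, including each DataParallel replica, binds them to itself, and the flag makes repeated modification a no-op:

```python
class VGG:
    """Stand-in for torchvision's VGG class."""
    pass

def logits(self, x):
    return x  # placeholder body

def modify_vgg(cls):
    # idempotence guard: skip the (destructive) rewrite if already applied
    if getattr(cls, "_modified", False):
        return cls
    cls.logits = logits  # plain function on the class: binds per instance
    cls._modified = True
    return cls

modify_vgg(VGG)
modify_vgg(VGG)  # second call is a no-op

v = VGG()
print(v.logits.__self__ is v)  # True: bound to the calling instance
```

With the guard in place, instantiating the model several times no longer re-runs the destructive part of modify_vgg, so the class-level (unbound) binding becomes safe again.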


pkdogcom commented Dec 9, 2018

That way we can use the unbound-method approach in the modify_xxx functions for correct data_parallel behavior during training, while not suffering from the multi-instantiation issue with the same model, or more specifically, the VGG model.


pkdogcom commented Dec 9, 2018

P.S. I also suggest making modify_vgg consistent with the other models, such as ResNet, so that the features method outputs a 4-d tensor (e.g. batch × channels × 7 × 7) as the other models do, rather than one that has been reshaped and transformed by the two fully-connected layers. That way the modified torchvision VGG model can be used interchangeably with other models as the feature extractor of a user-defined model.
All the fc and relu layers in classifier can be kept in the logits method.


wpwei commented Apr 4, 2019

> Thanks for reporting this issue.
> A simple way to fix this would be to rewrite the models from torchvision in this library.

I made a simple fix, with those models rewritten. Please check it out: #145
