DDP multi-gpu training issues with Imagenet example #128

Closed
kartikgupta-at-anu opened this issue Jul 5, 2022 · 7 comments

Comments

@kartikgupta-at-anu

I am trying to run multi-GPU QAT training with the ImageNet example code. It fails right after the parameter update of the first training iteration:

RuntimeError: grad.numel() == bucket_view.numel() INTERNAL ASSERT FAILED at "/pytorch/torch/lib/c10d/reducer.cpp":343, please report a bug to PyTorch.

Multi-GPU training works fine if I comment out the wrapper call that quantizes the original model, i.e., model=prepare_by_platform(model, args.backend). Has anyone encountered the same issue?

@kartikgupta-at-anu kartikgupta-at-anu changed the title DDP issues in Imagenet example DDP multi-gpu training issues with Imagenet example Jul 5, 2022
@PannenetsF
Contributor

Which backend does your code use? We have run MQBench successfully for multi-GPU QAT and PTQ.

@kartikgupta-at-anu
Author

kartikgupta-at-anu commented Jul 5, 2022

I am using the tensorrt backend. Can you share a modified script, equivalent to main.py in this repo's imagenet example, that runs with multi-GPU training?

@PannenetsF
Contributor

PannenetsF commented Jul 5, 2022

@PannenetsF
Contributor

PannenetsF commented Jul 5, 2022

There is also a DDP example here, which should behave like main.py in imagenet_example: https://github.com/ModelTC/MQBench/blob/main/application/imagenet_example/main_dist.py

@kartikgupta-at-anu
Author

kartikgupta-at-anu commented Jul 5, 2022

I am trying to train on the ImageNet dataset using
https://github.com/ModelTC/MQBench/blob/main/application/imagenet_example/main.py. If you can point out why this script fails, that would be great. I am not sure whether the repo you shared uses DDP for multi-GPU training, but that is what I am trying to use. Also, I can't find ImageNet training scripts in United-Perception.

@kartikgupta-at-anu
Author

> Also there is a DDP example here, which should behave like the main.py in imagenet_example. https://github.com/ModelTC/MQBench/blob/main/application/imagenet_example/main_dist.py

main_dist.py seems to be missing "import models". Also, what is the @link_dist decorator for?

@kartikgupta-at-anu
Author

I modified main_dist.py, and multi-GPU training with DDP now seems to work fine.
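
For anyone hitting the same assertion: DDP sizes its gradient buckets from the parameters present when the wrapper is constructed, so one ordering that generally avoids size mismatches is to finish any model rewriting first and wrap with DistributedDataParallel last. The sketch below illustrates only that ordering, in a single CPU process with the gloo backend. `stand_in_prepare`, the layer sizes, and the port number are hypothetical placeholders; this is not MQBench's `prepare_by_platform`, and nothing in this thread confirms that ordering was the cause of the error above.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def stand_in_prepare(model: nn.Module) -> nn.Module:
    """Hypothetical stand-in for a model-rewriting step (e.g. quantization prep)."""
    # Appends an extra learnable layer, so the parameter set changes.
    return nn.Sequential(model, nn.Linear(4, 4))


def build_ddp_model() -> DDP:
    model = nn.Linear(8, 4)
    model = stand_in_prepare(model)  # 1. rewrite the model first
    return DDP(model)                # 2. wrap last, on the final parameter set


def run_one_step() -> float:
    # Single-process group on CPU; address and port are assumptions for the demo.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29512")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
    try:
        torch.manual_seed(0)
        ddp_model = build_ddp_model()
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
        loss = ddp_model(torch.randn(2, 8)).sum()
        loss.backward()   # gradients are reduced through DDP's buckets here
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()
```

In a real multi-GPU run the process group would normally be set up by a launcher such as torchrun rather than hard-coded as above, and main_dist.py in the repo remains the reference for how MQBench itself is meant to be used with DDP.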
