DDP multi-gpu training issues with Imagenet example #128
Which backend is used in your code? We have applied MQBench to multi-GPU QAT/PTQ successfully.
I am using the TensorRT backend. Can you share a modified script, equivalent to main.py in the imagenet example in this repo, that can run with multi-GPU training?
You can give United-Perception a try; here is a multi-GPU training config: https://github.com/ModelTC/United-Perception/blob/main/configs/quant/det/yolox/yolox_fpga_quant_vitis_qat.yaml.
Also, there is a DDP example here, which should behave like main.py in imagenet_example: https://github.com/ModelTC/MQBench/blob/main/application/imagenet_example/main_dist.py
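For readers unfamiliar with the DDP pattern that main_dist.py follows, here is a minimal, self-contained sketch of the training-step skeleton. It runs a single-process "world" on CPU with the gloo backend purely for illustration; a real multi-GPU launch would use torchrun to spawn one process per GPU and set RANK/WORLD_SIZE in the environment. The model and data here are stand-ins, not the repo's actual code.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "world" for illustration; a real launch would use
# torchrun to spawn one process per GPU and set RANK / WORLD_SIZE.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(8, 2)         # stand-in for the ImageNet model
ddp_model = DDP(model)          # gradients are all-reduced across ranks
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

# One toy training step: forward, backward (triggers the all-reduce), update.
x = torch.randn(4, 8)
y = torch.randint(0, 2, (4,))
loss = nn.functional.cross_entropy(ddp_model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
loss_value = loss.item()

dist.destroy_process_group()
```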
I am trying to train on the ImageNet dataset using
main_dist.py seems to be missing `import models`. Also, I am not sure what the `@link_dist` decorator is for.
I modified main_dist.py, and it now works fine for multi-GPU training with DDP.
I am trying multi-GPU QAT training using the imagenet example code. It runs into an issue after the first training iteration's update:
RuntimeError: grad.numel() == bucket_view.numel() INTERNAL ASSERT FAILED at "/pytorch/torch/lib/c10d/reducer.cpp":343, please report a bug to PyTorch.
The code works fine with multi-GPU training if I comment out the wrapper code that quantizes the original model, i.e. `model = prepare_by_platform(model, args.backend)`. Has anyone encountered the same issue?
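The assertion failure is consistent with the parameter set changing after DDP built its gradient buckets: DDP's reducer records each parameter's size at construction time, and a quantization rewrite inserts fake-quantize modules with their own learnable parameters. So the rewrite must happen before the model is wrapped in DDP. Below is a minimal sketch of that ordering; `quantize_model` is a hypothetical stand-in for MQBench's `prepare_by_platform`, used only so the example runs without MQBench installed:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def quantize_model(model: nn.Module) -> nn.Module:
    # Hypothetical stand-in for prepare_by_platform(model, backend).
    # Graph rewrites like this can add or replace parameters, which is
    # exactly what DDP's reducer cannot tolerate after construction.
    return nn.Sequential(model, nn.Identity())

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(8, 2)
model = quantize_model(model)   # 1) rewrite the graph FIRST
model = DDP(model)              # 2) THEN wrap; buckets match the final params

x = torch.randn(4, 8)
y = torch.randint(0, 2, (4,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()                 # reducer sees the sizes it registered

dist.destroy_process_group()
```

If the order is reversed (DDP first, then quantize), the reducer's bucket views no longer match the gradients it receives, which would produce exactly the `grad.numel() == bucket_view.numel()` internal assert above.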