DDP multi-gpu training issues with Imagenet example #128

kartikgupta-at-anu · 2022-07-05T02:26:53Z

I am trying to run multi-GPU QAT training using the ImageNet example code. It fails after the first training iteration's update with:

RuntimeError: grad.numel() == bucket_view.numel() INTERNAL ASSERT FAILED at "/pytorch/torch/lib/c10d/reducer.cpp":343, please report a bug to PyTorch.

The code works fine with multi-GPU training if I comment out the wrapper code that quantizes the original model, i.e., model = prepare_by_platform(model, args.backend). Has anyone encountered the same issue?
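A possible way to narrow this down: the assertion compares a gradient's element count with the bucket view that DDP allocated for it, so it may be worth checking whether any parameter changes size after the model has been wrapped. This is only a guess at the cause, not a confirmed diagnosis for this issue. The helper names below (`snapshot_numel`, `changed_params`) are hypothetical and not part of MQBench or PyTorch; they accept any `(name, obj)` pairs where `obj` has a `numel()` method, which is what `model.named_parameters()` yields.

```python
from typing import Dict, Iterable, List, Protocol, Tuple


class HasNumel(Protocol):
    """Anything exposing numel(), e.g. a torch parameter."""

    def numel(self) -> int: ...


def snapshot_numel(named_params: Iterable[Tuple[str, HasNumel]]) -> Dict[str, int]:
    """Record the element count of every named parameter."""
    return {name: int(param.numel()) for name, param in named_params}


def changed_params(before: Dict[str, int], after: Dict[str, int]) -> List[str]:
    """List parameters that were resized, added, or removed between two snapshots."""
    report = []
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            report.append(f"{name}: {old} -> {new}")
    return report


# Intended use (assumes `model` is the quantized model already wrapped in DDP):
#   before = snapshot_numel(model.named_parameters())
#   ... run one forward pass ...
#   after = snapshot_numel(model.named_parameters())
#   print(changed_params(before, after))
```

An empty list would suggest that parameter sizes are stable and that the cause lies elsewhere.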

PannenetsF · 2022-07-05T03:45:59Z

Which backend are you using in your code? We have successfully used MQBench for multi-GPU QAT/PTQ.

kartikgupta-at-anu · 2022-07-05T03:51:19Z

I am using the tensorrt backend. Could you share a modified script, equivalent to main.py in this repo's ImageNet example, that runs with multi-GPU training?

PannenetsF · 2022-07-05T03:56:12Z

You can give United-Perception a try; here is a multi-GPU training config: https://github.com/ModelTC/United-Perception/blob/main/configs/quant/det/yolox/yolox_fpga_quant_vitis_qat.yaml.

PannenetsF · 2022-07-05T04:15:37Z

Also there is a DDP example here, which should behave like the main.py in imagenet_example. https://github.com/ModelTC/MQBench/blob/main/application/imagenet_example/main_dist.py

kartikgupta-at-anu · 2022-07-05T04:18:49Z

I am trying to train on the ImageNet dataset using
https://github.com/ModelTC/MQBench/blob/main/application/imagenet_example/main.py. If you can point out why this script runs into this issue, that would be great. I am not sure whether the repo you shared uses DDP for multi-GPU training, but that is what I am trying to use. Also, I can't find ImageNet training scripts in United-Perception.

kartikgupta-at-anu · 2022-07-05T04:24:23Z

> Also there is a DDP example here, which should behave like the main.py in imagenet_example. https://github.com/ModelTC/MQBench/blob/main/application/imagenet_example/main_dist.py

main_dist.py seems to be missing "import models". Also, I am not sure what the @link_dist decorator is for.

kartikgupta-at-anu · 2022-07-05T05:53:26Z

I modified main_dist.py and it now seems to work fine for multi-GPU training with DDP.

kartikgupta-at-anu changed the title ~~DDP issues in Imagenet example~~ DDP multi-gpu training issues with Imagenet example Jul 5, 2022

kartikgupta-at-anu closed this as completed Jul 5, 2022
