I thought I'd document a fix for non-distributed training in case other people are trying to use this repo as well.
Even if distributed training isn't selected, you'll get a

```
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
```

during the forward call.
This happens in the decoder head because of the SyncBN layer. Find the config file under `configs/_base_/` corresponding to your model and change the `norm_cfg` to use `BN` instead of `SyncBN`. For example, I'm using SeMask-FPN, so in `configs/_base_/models/semfpn_semask_swin.py` I change:
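(The exact lines were omitted above; this is a sketch assuming the stock mmsegmentation-style `norm_cfg`, so verify it against your copy of the config.)

```python
# Default: synchronized BN, which requires torch.distributed to be initialized.
norm_cfg = dict(type='SyncBN', requires_grad=True)
```

to

```python
# Plain BN works without a process group (single-GPU / non-distributed runs).
norm_cfg = dict(type='BN', requires_grad=True)
```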
The two layer types seem to be fully weight-compatible, so checkpoints trained with SyncBN load into a BN model without issue. Just a warning: you'll need to hand-edit this setting whenever you switch between regular and distributed training.
You can quickly run a single-GPU process with the `tools/dist_train.sh` script by setting `GPUS=1`; you won't have to hardcode anything in that case (see the sketch below). Is there some disadvantage to doing that that made you run the regular training instead?
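For reference, assuming the repo keeps the stock mmsegmentation `dist_train.sh` interface (config path first, GPU count second; the config path here is a placeholder), the single-GPU launch looks like:

```bash
# Launches train.py through torch.distributed with one process,
# so init_process_group is called and SyncBN works unchanged.
./tools/dist_train.sh path/to/your_config.py 1
```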
This is a great point, and I don't think there is any issue with just using the dist_train.sh script instead of running train.py directly. Thanks for the great work btw, I've used SeMask for a few projects and the pre-trained weights are really helpful!