
test produces a warning when using DDP #12862

Description

@ruro

Calling `trainer.test` with multiple GPUs (or even on a single GPU with `DDPStrategy`) produces the following warning:

PossibleUserWarning: Using `DistributedSampler` with the dataloaders. During `trainer.test()`,
it is recommended to use `Trainer(devices=1)` to ensure each sample/batch gets evaluated
exactly once. Otherwise, multi-device settings use `DistributedSampler` that replicates some
samples to make sure all devices have same batch size in case of uneven inputs.
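
For reference, here is a minimal sketch of the kind of script that triggers this (toy model and data, not my actual code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def test_step(self, batch, batch_idx):
        x, y = batch
        self.log("test_mse", torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# 101 samples across 2 GPUs -> uneven shards, so DistributedSampler pads
dataset = TensorDataset(torch.randn(101, 8), torch.randn(101, 1))
loader = DataLoader(dataset, batch_size=4)

model = ToyModel()
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp", max_epochs=1)
trainer.fit(model, loader)
trainer.test(model, dataloaders=loader)  # <- emits the PossibleUserWarning quoted above
```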

The problem is that the warning doesn't adequately explain how to fix this in all possible cases.

1. What if I am running trainer.test after trainer.fit?

Setting `devices=1` in that case is not really a solution, because I want to use multiple GPUs for training. Creating a new Trainer instance also doesn't quite work, because that would create a separate experiment (AFAIK?). For example, `ckpt_path="best"` wouldn't work with a new Trainer instance, the TensorBoard logs would get split up, and so on. A rough sketch of what I mean is below.
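
This is roughly what the "separate test Trainer" workaround looks like (a hedged sketch; `model`, `dm` and `logger` are placeholders, and I'm assuming the best checkpoint path has to be forwarded by hand, since `ckpt_path="best"` only consults the trainer it is called on):

```python
import pytorch_lightning as pl

fit_trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp", logger=logger)
fit_trainer.fit(model, datamodule=dm)

# A second Trainer just for testing, as the warning suggests. Note the extra
# bookkeeping: reuse the same logger and forward the best checkpoint path
# manually, because ckpt_path="best" would look at *this* trainer's callbacks.
test_trainer = pl.Trainer(accelerator="gpu", devices=1, logger=logger)
test_trainer.test(
    model,
    datamodule=dm,
    ckpt_path=fit_trainer.checkpoint_callback.best_model_path,
)
```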

Is it possible to use a different Strategy for `tune`, `fit` and `test` within a single Trainer? (By the way, this might be useful even outside of this issue, since `tune` currently doesn't work well with DDP.)

2. What if I don't care about DistributedSampler adding extra samples?

Please correct me if I am wrong, but `DistributedSampler` should add at most `num_devices - 1` extra samples (see the quick check after this list). This means that unless you are using hundreds of devices or an extremely small dataset, the difference in metrics will probably be

a) less than the rounding precision
b) less than the natural fluctuations due to random initialization and non-deterministic CUDA shenanigans
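
A quick self-contained check of that bound with plain torch (no Lightning involved):

```python
from torch.utils.data import DistributedSampler

dataset = list(range(103))  # 103 samples, not divisible by 4
all_indices = []
for rank in range(4):
    sampler = DistributedSampler(dataset, num_replicas=4, rank=rank, shuffle=False)
    all_indices.extend(iter(sampler))

# 4 ranks x ceil(103 / 4) = 104 indices in total -> exactly 1 duplicated sample,
# and in general the padding is at most num_replicas - 1.
print(len(all_indices))                          # 104
print(len(all_indices) - len(set(all_indices)))  # 1
```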

I think that bothering users with such a minor issue isn't really desirable. Can this warning be silenced somehow?
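
One way to silence it today would be a plain warnings filter (a sketch, assuming `PossibleUserWarning` is importable from `pytorch_lightning.utilities.warnings`):

```python
import warnings

from pytorch_lightning.utilities.warnings import PossibleUserWarning

# Silences every PossibleUserWarning; a `message=...` pattern could narrow it
# down to just the DistributedSampler one.
warnings.filterwarnings("ignore", category=PossibleUserWarning)
```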

3. Can this be fixed without requiring any changes from the users?

I found `pytorch_lightning.overrides.distributed.UnrepeatedDistributedSampler`, which allegedly solves this exact problem, but doesn't work for training.

Does `UnrepeatedDistributedSampler` solve this issue? If it does, I think it should at least be mentioned in the warning, and ideally be used automatically during `test` instead of warning the user.
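
In case it helps the discussion, here is roughly how I imagine wiring it in manually today (a sketch, assuming `UnrepeatedDistributedSampler` keeps the `DistributedSampler` constructor signature and that `replace_sampler_ddp=False` stops Lightning from swapping in its own padded sampler; `self.test_dataset` is a placeholder):

```python
from torch.utils.data import DataLoader

import pytorch_lightning as pl
from pytorch_lightning.overrides.distributed import UnrepeatedDistributedSampler


class MyModel(pl.LightningModule):
    # ... training_step / test_step / configure_optimizers elided ...

    def test_dataloader(self):
        # Each rank gets a disjoint, unpadded shard, so no sample is repeated.
        sampler = UnrepeatedDistributedSampler(
            self.test_dataset,
            num_replicas=self.trainer.world_size,
            rank=self.trainer.global_rank,
            shuffle=False,
        )
        return DataLoader(self.test_dataset, batch_size=32, sampler=sampler)


trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp",
    replace_sampler_ddp=False,  # keep the custom sampler above
)
# trainer.test(MyModel()) would then visit each test sample on exactly one rank.
```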

cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7
