How to get ddp working for custom collate functions #3429
edraizen asked this question in DDP / multi-GPU / multi-node (Unanswered)
Replies: 1 comment
I was able to get it to work by correcting the batch size and using the normal scatter function. It turns out I had manually multiplied the batch size by the number of GPUs, which works for dp but not for ddp. Would it be possible for PyTorch Lightning to scale the batch size by the number of GPUs when using dp, or to warn the user that they should increase it? It would also be useful if there was a …
Thanks,
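For anyone hitting the same issue, here is a minimal sketch of the difference (my own illustration, not code from this thread; the dataset is a stand-in for the real sparse one):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

NUM_GPUS = 8
PER_GPU_BATCH = 4

# Stand-in dataset; the real one yields sparse (coords, feats, label) samples.
dataset = TensorDataset(torch.randn(256, 3))

# dp: a single process builds one large batch and Lightning scatters it
# across GPUs, so the DataLoader batch size is the *global* batch size.
dp_loader = DataLoader(dataset, batch_size=PER_GPU_BATCH * NUM_GPUS)

# ddp: every GPU runs its own process with its own DataLoader, so the
# DataLoader batch size is the *per-GPU* batch size; the effective global
# batch size is already PER_GPU_BATCH * NUM_GPUS without manual scaling.
ddp_loader = DataLoader(dataset, batch_size=PER_GPU_BATCH)
```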
❓ Questions and Help
Hello,
I am confused about ddp training, batch sizes, and custom collate functions. I am using MinkowskiEngine for sparse 3D image segmentation, which requires batches to consist of lists of minibatches (lists of samples) because the data cannot be split randomly or by a set length. However, PyTorch Lightning scatters the data prior to sending it to training_step, incorrectly scattering individual samples across different GPUs. For dp training, this can be corrected with a custom collate function that makes a minibatch for each GPU, where each minibatch has the length of the batch size. ddp is slightly more confusing. I am first trying to get it to work using one node with 8 GPUs, with a batch size of 4.
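For context, a rough sketch of the dp-style collate I mean (my own illustration, not the exact code: it assumes each dataset sample is a (coords, feats, label) tuple, uses MinkowskiEngine's ME.utils.sparse_collate, and hard-codes NUM_GPUS):

```python
import MinkowskiEngine as ME

NUM_GPUS = 8  # hard-coded for this sketch

def per_gpu_collate(samples):
    """Group a dp batch into one minibatch per GPU so that Lightning's
    scatter sends whole minibatches, not individual samples, to each device."""
    chunks = [samples[i::NUM_GPUS] for i in range(NUM_GPUS)]
    minibatches = []
    for chunk in chunks:
        if not chunk:
            continue
        coords, feats, labels = zip(*chunk)
        # sparse_collate batches variable-length sparse samples by prepending
        # a batch index to each set of coordinates.
        minibatches.append(ME.utils.sparse_collate(coords, feats, labels))
    return minibatches
```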
What is your question?
When and where does the DataLoader (and LightningDataModule) get initialized and create the data? Does the DataLoader get initialized for each ddp process? If so, is it initialized with the same num_workers each time? When I printed it out, the DataLoader was initialized 16 times (or 32; PL seems to print everything twice), all with the same number of GPUs, but distributed_backend became None for half of them. If distributed_backend is None, should I set num_workers to 0?
If I split the data the same way I do for dp, each ddp process (8 for 8 GPUs) calls the scatter function with a list of 8 minibatches, where each minibatch has batch-size samples (=> 32 samples over 8 minibatches), but only one device in device_ids. Since each ddp process is separate and only has access to one GPU, it makes sense that there is only one device, but is there a way to make sure the scatter function only gets one minibatch? Or, if there is another way to think about this, I'd really appreciate it!
Thanks for your help!
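(Not from the original post: a small self-contained sketch for checking how often the DataLoader is built. Under ddp each spawned process should call train_dataloader once and print its own LOCAL_RANK, which Lightning sets in the child processes' environment; the toy dataset is a placeholder.)

```python
import os
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class ToyDataModule(pl.LightningDataModule):
    def setup(self, stage=None):
        # Runs in every ddp process once it has been spawned/relaunched.
        self.train_set = TensorDataset(torch.randn(32, 3))

    def train_dataloader(self):
        # Under ddp this should be called once per GPU process.
        print("rank", os.environ.get("LOCAL_RANK", "0"), "is building its DataLoader")
        return DataLoader(self.train_set, batch_size=4, num_workers=4)
```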
Code
What have you tried?
What's your environment?