Description
Discussed in #9251
Originally posted by jatentaki September 1, 2021
I have two questions regarding the behavior of DataLoaders when doing multi-GPU training with ddp/ddp_spawn. Let me first define that I use "GPU worker" to mean the process driving each of the N GPUs for the model forward/backward, and "data worker" to mean the processes created by torch.utils.data.DataLoader to load and preprocess batches.
- How many data workers are there per GPU worker? I see that with ddp the dataset is being recreated for each GPU worker, but the total number of data workers seems to be constant: does each GPU worker get its share of N_total_data_workers / N_gpu_workers? Is this documented somewhere?
- I have a pipeline where the data workers actually use some GPU functionality (they render synthetic data via OpenGL) and I need to specify which GPU they should use. How can I figure out which GPU worker a data worker belongs to, so that I can load-balance that rendering across GPUs? (See the sketch after this list.)
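
To illustrate the second question, here is a minimal sketch of how a data worker could discover which GPU process spawned it. It assumes the launcher exposes the process's local rank through the `LOCAL_RANK` environment variable (the ddp launcher typically does; with ddp_spawn the rank may need to be passed in explicitly), and `RenderDataset` with its `gpu_id` attribute is a hypothetical stand-in for the OpenGL pipeline, not part of any library API:

```python
import os

import torch
from torch.utils.data import DataLoader, Dataset, get_worker_info


class RenderDataset(Dataset):
    """Hypothetical dataset whose workers render synthetic samples on a GPU."""

    def __init__(self, size, gpu_id=0):
        self.size = size
        self.gpu_id = gpu_id  # GPU the (hypothetical) OpenGL renderer should target

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        # A real pipeline would render on GPU `self.gpu_id` here.
        return torch.zeros(3, 64, 64), idx


def worker_init_fn(worker_id):
    # Every DDP process builds its own DataLoader, so each GPU process gets its
    # own `num_workers` data workers rather than a share of a global pool.
    # Assumption: the launcher sets LOCAL_RANK for the GPU process and the data
    # workers inherit it, so each worker can target its parent process's GPU.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    get_worker_info().dataset.gpu_id = local_rank  # mutate this worker's dataset copy


loader = DataLoader(
    RenderDataset(size=1000),
    batch_size=32,
    num_workers=4,
    worker_init_fn=worker_init_fn,
)
```

With this pattern, "which GPU worker does a data worker belong to" reduces to reading the rank of the process that created the DataLoader, since the worker pool is per-process rather than shared.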
Suppose I run ddp with 4 GPUs, and I instantiate my dataset object and then prepare the dataset in the setup() function like below.
```python
def setup(self, stage=None):
    # Assign train/val datasets for use in dataloaders
    instancialize = self.instancialize()
    instancialize.get_dataset()
    if stage == 'fit' or stage is None:
        self.trainset = instancialize.dataset['train']
        self.valset = instancialize.dataset['test']
    # Assign test dataset for use in dataloader(s)
    if stage == 'test' or stage is None:
        self.testset = instancialize.dataset['test']
```
I notice from the contents printed in instancialize.get_dataset() that the dataset is set up 4 times with ddp. From my understanding of ddp, we get a big batch of data from the dataset each time and distribute it equally across the GPUs. After all GPUs compute the loss, we take the mean of the losses across GPUs, and the model replica on each GPU updates with this loss. Having multiple dataloaders seems reasonable here, but multiple copies of the dataset does not, and it may harm the training. So, am I doing something wrong in my code, or do I have a misunderstanding of ddp? Appreciate your answer, thx~
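
For context on the "multiple copies of the dataset" concern, here is a minimal sketch (plain PyTorch, not the code from this issue) of how a DistributedSampler splits one dataset across ranks: every process holds an identical copy of the dataset object, but each rank draws a disjoint set of indices per epoch, so no sample is duplicated within a pass. Lightning adds such a sampler automatically for ddp in the default configuration.

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

# Toy dataset with 12 samples; in the snippet above this would be the object
# behind instancialize.dataset['train'] (names from the snippet, not an API).
dataset = TensorDataset(torch.arange(12))

# Each of the 4 GPU processes builds the same sampler with its own rank, so
# the dataset copies are identical but the index shards are disjoint.
for rank in range(4):
    sampler = DistributedSampler(dataset, num_replicas=4, rank=rank, shuffle=False)
    print(rank, list(sampler))
# 0 [0, 4, 8]
# 1 [1, 5, 9]
# 2 [2, 6, 10]
# 3 [3, 7, 11]
```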
cc @carmocca @awaelchli @Borda @ananthsub @ninginthecloud @jjenniferdai @rohitgr7 @justusschock @kaushikb11 @akihironitta