
question on DistributedDataParallel (DDP) #170

Closed
jakeyahn opened this issue Oct 14, 2022 · 10 comments

Comments

@jakeyahn

jakeyahn commented Oct 14, 2022

Hi, I am testing the DDP code with 4 V100 GPUs as below:

export MASTER_ADDR="localhost"
export MASTER_PORT=2222
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 src/main.py -t -metrics none -cfg CONFIG_PATH -data DATA_PATH -save SAVE_PATH -DDP -sync_bn -mpc

but I am getting less than 48% GPU utilization.
Could I get some help with improving this?
[screenshots of nvidia-smi output]

Below is what I trained with a single RTX 3090 Ti for comparison.
[screenshot]

@mingukkang
Collaborator

Which model are you training?
Also, what configuration are you using?

@jakeyahn
Author

> Which model are you training? Also, what configuration are you using?

Both are the default 'cifar10-contragan' configuration.

@mingukkang
Collaborator

mingukkang commented Oct 14, 2022

If you use V100 GPUs that are not fully connected via NVLink, you may encounter a communication bottleneck when using SyncBN and the class-conditional contrastive loss, both of which require communication between GPUs.
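
For reference, here is a minimal sketch (standard PyTorch DDP primitives, not StudioGAN's actual code) of the two communication points I mean, assuming a process group has already been initialized:

```python
import torch
import torch.distributed as dist


def enable_sync_bn(model: torch.nn.Module) -> torch.nn.Module:
    # SyncBN all-reduces batch statistics across ranks on every forward pass.
    return torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)


def gather_features(features: torch.Tensor) -> torch.Tensor:
    # A class-conditional contrastive loss needs embeddings from every rank,
    # so each rank all-gathers the features before computing the loss.
    gathered = [torch.zeros_like(features) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, features)
    return torch.cat(gathered, dim=0)
```

Both collectives block until every GPU has exchanged its tensors, which is why a slow interconnect shows up as idle GPU time.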

@jakeyahn
Author

So, inferring from your comment, may I understand that this is not a problem with the code but with the hardware itself, that is, the connection between the GPUs? I am not really familiar with NVLink, but from what I found by googling, I guess there is some setup that needs to be done related to SLI. Am I correct?

@mingukkang
Collaborator

Sorry.

I found that your GPUs (V100-SXM2) communicate via NVLink by default.

Looking at the nvidia-smi output, I think you are running a relatively simple model on high-end GPUs (4x V100), so each GPU might not be using all of its CUDA cores for computation.

Could you increase the training batch size so that we can identify whether the low GPU utilization is attributable to the StudioGAN code or not?
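
If you want to double-check the interconnect yourself, a small sketch like the one below (plain PyTorch, not part of StudioGAN) reports whether peer-to-peer access is available between the visible GPUs:

```python
import torch

# Report whether each pair of visible GPUs can access the other's memory
# directly (peer-to-peer over NVLink or PCIe). "unavailable" pairs would
# point to a communication bottleneck rather than to the training code.
num_gpus = torch.cuda.device_count()
for i in range(num_gpus):
    for j in range(num_gpus):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'available' if ok else 'unavailable'}")
```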

@jakeyahn
Author

> Sorry.
>
> I found that your GPUs (V100-SXM2) communicate via NVLink by default.
>
> Looking at the nvidia-smi output, I think you are running a relatively simple model on high-end GPUs (4x V100), so each GPU might not be using all of its CUDA cores for computation.
>
> Could you increase the training batch size so that we can identify whether the low GPU utilization is attributable to the StudioGAN code or not?


I could not find a batch_size option in StudioGAN's argument parser. Should I use std_max, or is there another way?

@alex4727
Collaborator

batch_size can be configured in the config (.yaml) files!

@jakeyahn
Author

> batch_size can be configured in the config (.yaml) files!

Nearly the last question before closing the issue.

Are you sure batch_size should not be changed through config.py, as below?
[screenshot of config.py]

The configs/CIFAR10/ContraGAN-ADC.yaml file does not have a batch_size parameter!

@jakeyahn
Author

And please tell me if I am wrong:
I changed the optimization.batch_size setting in config.py from 64 to 256, and I still get the following result.

[screenshot]

Any comment?

@alex4727
Collaborator

@jakeyahn
Sorry for the late reply, I've been busy recently :(

  1. It is correct to change the batch size in the config files (.yaml), not in config.py. The values specified in config.py are defaults that can be overridden by the config (.yaml) files; if you don't specify the batch size in a config file, StudioGAN will stick with the default batch size, which is 64 (see the sketch after this list). https://github.com/POSTECH-CVLab/PyTorch-StudioGAN/blob/eb9cafbb61dfb9722afb5fb21eff75ed999ad52f/src/config.py#L33-L34
  2. As for the GPU utilization problem, I'd say CIFAR10 is not a dataset that requires four high-end GPUs. In my experience, StudioGAN's GPU utilization increases when we use bigger images, more complex networks with more operations, or larger batches. Otherwise, CPU code or I/O becomes the bottleneck and reduces GPU utilization. The thing is, increasing the batch size might speed up training a little, but it will probably end up with worse results. For CIFAR10, we find the ideal batch size to be 64 or 128. In our paper experiments, all CIFAR10 models were trained on a single GPU (which took about a day), so I'd recommend just using a single GPU for CIFAR10 and trying DDP for bigger datasets.
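
For illustration only, here is a minimal sketch of the "defaults in config.py, overrides in .yaml" pattern described in point 1; the keys and helper here are hypothetical, not StudioGAN's exact schema:

```python
import yaml  # PyYAML

# Hypothetical defaults, standing in for the values defined in src/config.py.
DEFAULTS = {"OPTIMIZATION": {"batch_size": 64}}


def load_config(yaml_path: str) -> dict:
    # Start from the defaults, then overwrite any key that also appears in the
    # .yaml file; keys missing from the YAML keep their default value.
    with open(yaml_path) as f:
        overrides = yaml.safe_load(f) or {}
    cfg = {section: dict(values) for section, values in DEFAULTS.items()}
    for section, values in overrides.items():
        cfg.setdefault(section, {}).update(values)
    return cfg


# Example: a YAML file containing "OPTIMIZATION: {batch_size: 256}" would
# override the default 64, while omitting the key keeps 64.
```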

Thanks!

alex4727 closed this as completed on Nov 9, 2022