
question on DistributedDataParallel (DDP) #170

Closed
jakeyahn opened this issue Oct 14, 2022 · 10 comments

Comments

@jakeyahn

jakeyahn commented Oct 14, 2022

Hi, I am testing the DDP code with 4 V100 GPUs as below:

export MASTER_ADDR="localhost"
export MASTER_PORT=2222
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 src/main.py -t -metrics none -cfg CONFIG_PATH -data DATA_PATH -save SAVE_PATH -DDP -sync_bn -mpc

but I am getting less than 48% GPU utilization.
Could I get some help with improving this?
[screenshots of nvidia-smi output]

Below is what I trained with a single RTX 3090 Ti for comparison.
[screenshot]

@mingukkang
Collaborator

Which model are you training?
Also, what configuration are you using?

@jakeyahn
Author

> Which model are you training? Also, what configuration are you using?

Both are the default 'cifar10-contragan' configuration.

@mingukkang
Collaborator

mingukkang commented Oct 14, 2022

If you use V100 GPUs that are not fully connected via NVLink, you may encounter a communication bottleneck when using SyncBN and the class-conditional contrastive loss, both of which require communication between GPUs.
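
For reference, here is a minimal sketch (standard PyTorch DDP primitives, not StudioGAN's actual code) of the two communication points I mean, assuming a process group has already been initialized:

```python
import torch
import torch.distributed as dist


def enable_sync_bn(model: torch.nn.Module) -> torch.nn.Module:
    # SyncBN all-reduces batch statistics across ranks on every forward pass.
    return torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)


def gather_features(features: torch.Tensor) -> torch.Tensor:
    # A class-conditional contrastive loss needs embeddings from every rank,
    # so each rank all-gathers the features before computing the loss.
    gathered = [torch.zeros_like(features) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, features)
    return torch.cat(gathered, dim=0)
```

Both collectives block until every GPU has exchanged its tensors, which is why a slow interconnect shows up as idle GPU time.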

@jakeyahn
Author

So, inferring from your comment, may I understand that this is not a problem with the code but with the hardware itself, that is, the connection between the GPUs? I am not really familiar with NVLink, but from what I found by googling, I guess there is some setup that needs to be done related to SLI. Am I correct?

@mingukkang
Collaborator

Sorry.

I found that your GPUs (V100-SXM2) communicate via NVLink by default.

Looking at the nvidia-smi output, I think you are running a relatively simple model on high-end GPUs (4x V100), so each GPU might not be using all of its CUDA cores for computation.

Could you increase the training batch size so that we can identify whether the low GPU utilization is attributable to the StudioGAN code or not?
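
If you want to double-check the interconnect yourself, a small sketch like the one below (plain PyTorch, not part of StudioGAN) reports whether peer-to-peer access is available between the visible GPUs:

```python
import torch

# Report whether each pair of visible GPUs can access the other's memory
# directly (peer-to-peer over NVLink or PCIe). "unavailable" pairs would
# point to a communication bottleneck rather than to the training code.
num_gpus = torch.cuda.device_count()
for i in range(num_gpus):
    for j in range(num_gpus):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'available' if ok else 'unavailable'}")
```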

@jakeyahn
Author

> Sorry.
>
> I found that your GPUs (V100-SXM2) communicate via NVLink by default.
>
> Looking at the nvidia-smi output, I think you are running a relatively simple model on high-end GPUs (4x V100), so each GPU might not be using all of its CUDA cores for computation.
>
> Could you increase the training batch size so that we can identify whether the low GPU utilization is attributable to the StudioGAN code or not?


I could not find a batch_size option in StudioGAN's argument parser. Should I use std_max, or is there another way?

@alex4727
Collaborator

batch_size can be configured in the config (.yaml) files!

@jakeyahn
Author

> batch_size can be configured in the config (.yaml) files!

Nearly the last question before closing the issue.

Are you sure batch_size should not be changed through config.py, as below?
[screenshot of config.py]

The configs/CIFAR10/ContraGAN-ADC.yaml file does not have a batch_size parameter!

@jakeyahn
Author

And please tell me if I am wrong:
I changed the optimization.batch_size setting in config.py from 64 to 256, and I still get the following result.

[screenshot]

Any comment?

@alex4727
Collaborator

@jakeyahn
Sorry for the late reply, I've been busy recently :(

  1. It is correct to change the batch size in the config files (.yaml), not in config.py. The values specified in config.py are defaults that can be overridden by the config (.yaml) files; if you don't specify the batch size in a config file, StudioGAN will stick with the default batch size, which is 64 (see the sketch after this list). https://github.com/POSTECH-CVLab/PyTorch-StudioGAN/blob/eb9cafbb61dfb9722afb5fb21eff75ed999ad52f/src/config.py#L33-L34
  2. As for the GPU utilization problem, I'd say CIFAR10 is not a dataset that requires four high-end GPUs. In my experience, StudioGAN's GPU utilization increases when we use bigger images, more complex networks with more operations, or larger batches. Otherwise, CPU code or I/O becomes the bottleneck and reduces GPU utilization. The thing is, increasing the batch size might speed up training a little, but it will probably end up with worse results. For CIFAR10, we find the ideal batch size to be 64 or 128. In our paper experiments, all CIFAR10 models were trained on a single GPU (which took about a day), so I'd recommend just using a single GPU for CIFAR10 and trying DDP for bigger datasets.
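
For illustration only, here is a minimal sketch of the "defaults in config.py, overrides in .yaml" pattern described in point 1; the keys and helper here are hypothetical, not StudioGAN's exact schema:

```python
import yaml  # PyYAML

# Hypothetical defaults, standing in for the values defined in src/config.py.
DEFAULTS = {"OPTIMIZATION": {"batch_size": 64}}


def load_config(yaml_path: str) -> dict:
    # Start from the defaults, then overwrite any key that also appears in the
    # .yaml file; keys missing from the YAML keep their default value.
    with open(yaml_path) as f:
        overrides = yaml.safe_load(f) or {}
    cfg = {section: dict(values) for section, values in DEFAULTS.items()}
    for section, values in overrides.items():
        cfg.setdefault(section, {}).update(values)
    return cfg


# Example: a YAML file containing "OPTIMIZATION: {batch_size: 256}" would
# override the default 64, while omitting the key keeps 64.
```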

Thanks!

alex4727 closed this as completed on Nov 9, 2022