Question on DistributedDataParallel (DDP) #170
Comments
Which model are you training?
Both are the default 'cifar10-contragan'.
If you use V100 GPUs that are not fully connected via NVLink, you may encounter a communication bottleneck, since SyncBN and the class-conditional contrastive loss both require communication between GPUs.
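For illustration, here is a minimal sketch (not StudioGAN's actual code) of how SyncBN is typically enabled in a DDP worker; the `setup_ddp_model` helper and the use of the `LOCAL_RANK` environment variable are assumptions for the example. Every BatchNorm layer converted this way all-reduces its batch statistics across processes on each forward pass, which is exactly the inter-GPU traffic that becomes a bottleneck without a fast interconnect.

```python
import os
import torch
import torch.nn as nn

def setup_ddp_model(model: nn.Module) -> nn.Module:
    """Illustrative helper (not from StudioGAN): wrap a model for DDP
    with synchronized batch norm."""
    # Assumes the launcher set LOCAL_RANK and that
    # torch.distributed.init_process_group("nccl") was already called.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # SyncBatchNorm all-reduces per-batch mean/variance across every
    # process on each forward pass, so each BN layer adds a round of
    # inter-GPU communication.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model).to(local_rank)

    # DDP itself additionally all-reduces gradients during backward.
    return nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```

On the hardware side, `nvidia-smi topo -m` shows whether each GPU pair is connected via NVLink (entries like `NV1`, `NV2`) or only via PCIe.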
So, inferring from your comment, may I understand it to mean that this is not a problem with the code, but with the hardware itself and the connection between the GPUs? I am not really familiar with NVLink, but from searching around, I guess there is some work to be done related to SLI. Is that correct?
Sorry, I found that your GPUs (V100-SXM2) communicate via NVLink by default. Looking at the nvidia-smi output, I think you are running a relatively simple model on high-end GPUs (4x V100), so each GPU might not use all of its CUDA cores for computation. Could you increase the batch size for training so that we can identify whether the low GPU utilization is attributable to the StudioGAN code or not?
I could not find a batch_size argument in StudioGAN's parser. Should I use std_max, or is there another way?
batch_size can be configured in the config files (.yaml files)!
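If it helps, a quick way to locate and change the batch size is to inspect the .yaml passed via -cfg. The snippet below is only an illustration using PyYAML; the `OPTIMIZATION` / `batch_size` key path is an assumption about the config layout, so adjust it to match the actual file.

```python
import yaml  # PyYAML

CFG_PATH = "CONFIG_PATH"  # the .yaml file passed via -cfg

with open(CFG_PATH) as f:
    cfg = yaml.safe_load(f)

# Dump the config to find where the batch size lives; the key path
# below (OPTIMIZATION -> batch_size) is a guess and may differ.
print(yaml.dump(cfg, default_flow_style=False))

# Hypothetical edit: raise the batch size and write the file back.
cfg.setdefault("OPTIMIZATION", {})["batch_size"] = 128
with open(CFG_PATH, "w") as f:
    yaml.dump(cfg, f, default_flow_style=False)
```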
@jakeyahn
Thanks!
Hi, I am testing the DDP code with 4 V100 GPUs as below:
export MASTER_ADDR="localhost"
export MASTER_PORT=2222
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 src/main.py -t -metrics none -cfg CONFIG_PATH -data DATA_PATH -save SAVE_PATH -DDP -sync_bn -mpc
but I am getting less than 48% GPU utilization.
Could I get some help to improve this?
Below is what I trained with a single RTX 3090 Ti for comparison.