
Why is multi-GPU training slower than single GPU? #5250

Open
1 task done
wangdada-love opened this issue Dec 21, 2023 · 1 comment
Assignees
Labels
question Further information is requested

Comments

@wangdada-love

Describe the question.

I have rewritten my data augmentation methods using the DALI module and applied them to train a DeeplabV3 model based on TensorFlow. However, I have observed that training is faster on a single GPU, and the speed decreases significantly when training on 4 GPUs. Both my data augmentation methods and the creation of the DALIDataset follow the official documentation: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/frameworks/tensorflow/tensorflow-dataset-multigpu.html
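For reference, the per-GPU dataset construction roughly follows the pattern from that tutorial. Below is a minimal sketch of that pattern; the paths, shapes, and augmentations are placeholders, not the actual training code:

```python
# Sketch of a sharded, per-GPU DALIDataset setup following the DALI/TF multi-GPU tutorial.
import tensorflow as tf
import nvidia.dali.fn as fn
import nvidia.dali.plugin.tf as dali_tf
from nvidia.dali import pipeline_def

NUM_GPUS = 4
BATCH_SIZE = 64  # per-GPU batch size

@pipeline_def(batch_size=BATCH_SIZE, num_threads=4)
def seg_pipeline(shard_id, num_shards):
    # Placeholder reader + augmentations; the real pipeline does the DeeplabV3 preprocessing.
    jpegs, labels = fn.readers.file(
        file_root="data/train", shard_id=shard_id, num_shards=num_shards,
        random_shuffle=True, name="Reader")
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.resize(images, resize_x=513, resize_y=513)
    return images, labels

strategy = tf.distribute.MirroredStrategy(
    devices=[f"/gpu:{i}" for i in range(NUM_GPUS)])

input_options = tf.distribute.InputOptions(
    experimental_place_dataset_on_device=True,
    experimental_fetch_to_device=False,
    experimental_replication_mode=tf.distribute.InputReplicationMode.PER_REPLICA)

def dataset_fn(input_context):
    # One DALI pipeline per GPU, each reading its own shard of the dataset.
    device_id = input_context.input_pipeline_id
    with tf.device(f"/gpu:{device_id}"):
        pipe = seg_pipeline(shard_id=device_id, num_shards=NUM_GPUS,
                            device_id=device_id)
        return dali_tf.DALIDataset(
            pipeline=pipe,
            batch_size=BATCH_SIZE,
            output_shapes=((BATCH_SIZE, 513, 513, 3), (BATCH_SIZE, 1)),
            output_dtypes=(tf.uint8, tf.int32),
            device_id=device_id)

train_dataset = strategy.distribute_datasets_from_function(dataset_fn, input_options)
```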

My current concerns are as follows:

  1. Why is training slower with multiple GPUs, even though GPU memory is fully utilized with a batch size of 64?
  2. Does DALI data augmentation really provide a noticeable speed improvement over TensorFlow's native data augmentation? Is it still worthwhile to validate the DALI approach?

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report
@wangdada-love wangdada-love added the question Further information is requested label Dec 21, 2023
@szalpal
Member

szalpal commented Dec 21, 2023

Hello @wangdada-love ,

Thank you for the interesting question. Let me answer the second one first. DALI is used in the MLPerf benchmarks submitted by NVIDIA. Since MLPerf is all about performance, if native TF were faster, we'd be using that one ;) Additionally, we have a multitude of success stories (please refer here) that emphasise how DALI helps with data augmentation.

With regards to your first question, it is hard to tell what's happening without some additional details. Should you like to diagnose it, I'd suggest two things. First, please look at the output of nvidia-smi and htop and verify whether your worker's resources are 100% utilized. If they are not, it is likely that you need to tune the training and pipeline parameters (e.g. num_threads, batch_size, hw_decoder_load) to fit the multi-GPU environment. Secondly, you may want to profile your training. You can find many resources and tutorials on profiling with Nsight Systems. TL;DR - you can invoke your training under nsys like this:

nsys profile -o my_profile python train.py

Then open the captured profile in Nsight Systems and look at what happened.
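To make the first suggestion concrete, here is a hypothetical sketch (not the issue author's pipeline) showing where those knobs live: num_threads and prefetch_queue_depth are arguments of the pipeline definition, while hw_decoder_load is an argument of the mixed-backend image decoder:

```python
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn

@pipeline_def(
    batch_size=64,           # per-GPU batch size
    num_threads=8,           # CPU worker threads per pipeline; raise if htop shows idle cores
    prefetch_queue_depth=2,  # batches prepared ahead of the model
)
def tuned_pipeline(shard_id, num_shards):
    jpegs, labels = fn.readers.file(file_root="data/train", shard_id=shard_id,
                                    num_shards=num_shards, random_shuffle=True)
    # hw_decoder_load splits JPEG decoding between the hardware decoder and CUDA kernels
    # (on GPUs that have one); tune it if decoding dominates the captured profile.
    images = fn.decoders.image(jpegs, device="mixed", hw_decoder_load=0.65)
    return images, labels
```

Keep in mind that with one pipeline per GPU the CPU-side preprocessing work is multiplied by the number of GPUs, so an under-provisioned num_threads (or a saturated CPU in general) may be the reason a 4-GPU run ends up slower than a single-GPU one.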

@szalpal szalpal removed their assignment Dec 21, 2023