
GPU util is not 100% #26

Closed
sangwoomo opened this issue Oct 20, 2020 · 7 comments

Comments

@sangwoomo

Hi, thank you for a great repository!

I'm trying to train SNGAN on my custom dataset (128x128) on 4 GPUs, but the GPU utilization is only ~90%. I also increased num_workers to 16 and applied -rm_API, but the issue persists. Do you have any idea what might be causing this?

By the way, even BigGAN_256 raises an out-of-memory (OOM) error on 4 GPUs with 12 GB of memory each. Do I need an 8-GPU server or distributed training for BigGAN_256?

Thank you for your help!

p.s. typo: load_frameowrk -> load_framework

@mingukkang
Collaborator

  1. When we train our model with the DP (DataParallel) setting, the scatter and gather operations required for multi-GPU training can become a bottleneck. One possible solution is to use DDP (DistributedDataParallel) :(
    Please refer to the following post (https://medium.com/daangn/pytorch-multi-gpu-학습-제대로-하기-27270617936b); a minimal sketch of the difference is shown after this list.

  2. Yes, you need 8 TITAN Xp or 4 TITAN RTX GPUs to train BigGAN 256.

  3. I have a BigGAN 256 checkpoint that reaches roughly 13~15 FID and 46.715 IS. If you need it, please contact me (mgkang@postech.ac.kr).
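
For reference, here is a minimal sketch (not StudioGAN's actual code) of what switching from DataParallel to DistributedDataParallel looks like in PyTorch. The `Generator` placeholder and the launch method (e.g. torchrun, which sets the LOCAL_RANK environment variable) are assumptions for illustration:

```python
# Minimal sketch: one process per GPU, each wrapping its own model replica with DDP.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_model(model_cls):
    # torchrun (or an equivalent launcher) sets LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = model_cls().to(local_rank)
    # DDP keeps one replica per process and only synchronizes gradients,
    # avoiding the per-step scatter/gather that makes DataParallel a bottleneck.
    return DDP(model, device_ids=[local_rank])
```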

@mingukkang
Collaborator

Also, I plan to port all of our implementations to DDP and fill in the missing parts of our table after the camera-ready deadline.

Sincerely,

Minguk Kang

@sangwoomo
Author

Hi @mingukkang, thank you for your quick response!

It would be awesome if you could provide an official DDP implementation. Thank you for your work! 🤩

@mingukkang
Collaborator

Hi @sangwoomo :)

Recently, I created the ddp branch, and it fully supports distributed data parallel (DDP) training.

If you want to use it, please follow the instructions below:

  1. git stash
  2. git pull origin master
  3. git pull origin ddp
  4. git checkout ddp
  5. CUDA_VISIBLE_DEVICES=0,1,2,3 python3 main.py -t -e -l -rm_API -DDP

Thank you!

I also plan to add SyncBN using PyTorch's official synchronized BatchNorm module.
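
For context, this is a minimal sketch of how PyTorch's official SyncBN can be applied before wrapping a model with DDP; it is an assumption for illustration, not necessarily how the ddp branch implements it:

```python
# Sketch: convert BatchNorm layers to SyncBatchNorm so BN statistics are
# computed across all GPUs, then wrap the model with DDP as usual.
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def to_sync_bn(model, local_rank):
    # Replaces every nn.BatchNorm* module in the model with nn.SyncBatchNorm.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    return DDP(model.to(local_rank), device_ids=[local_rank])
```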

I will merge the branch as soon as possible.

Sincerely,

@sangwoomo
Author

Thanks a lot! :)

@mingukkang
Collaborator

I have updated all of StudioGAN's models to support DDP training.

You can now use PyTorch's official SyncBN and mixed-precision training to stabilize and speed up GAN training.
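
As a rough illustration of the mixed-precision part, here is a hypothetical discriminator step using torch.cuda.amp; the hinge loss and function names are assumptions for the sketch, not StudioGAN's actual trainer code:

```python
# Sketch: mixed-precision discriminator update with autocast + GradScaler.
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

def discriminator_step(D, G, optimizer_d, reals, z):
    optimizer_d.zero_grad()
    with autocast():  # run the forward pass in float16 where it is safe
        fakes = G(z).detach()
        # Hypothetical hinge loss; the actual loss depends on the configuration.
        loss_d = torch.mean(torch.relu(1.0 - D(reals))) + \
                 torch.mean(torch.relu(1.0 + D(fakes)))
    scaler.scale(loss_d).backward()  # scale the loss to avoid fp16 underflow
    scaler.step(optimizer_d)
    scaler.update()
    return loss_d.item()
```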

Sincerely,

Minguk Kang

@mingukkang
Collaborator

Successfully done :)
