
GPU util is not 100% #26

Closed
sangwoomo opened this issue Oct 20, 2020 · 7 comments

Comments

@sangwoomo

Hi, thank you for a great repository!

I'm trying to train SNGAN on my custom dataset (128x128) on 4 GPUs, but the GPU utilization is only ~90%. I also increased num_workers to 16 and applied -rm_API, but the issue persists. Do you have any idea what might be causing this?

By the way, even BigGAN_256 raises an out-of-memory (OOM) error on 4 GPUs with 12 GB of memory each. Do I need an 8-GPU server or distributed training for BigGAN_256?

Thank you for your help!

p.s. typo: load_frameowrk -> load_framework

@mingukkang
Collaborator

  1. When we train our model with the DP (DataParallel) setting, the scatter and gather operations required for multi-GPU training can become a bottleneck. One possible solution is to use DDP (DistributedDataParallel) :(
    Please refer to the following post (https://medium.com/daangn/pytorch-multi-gpu-학습-제대로-하기-27270617936b); a minimal sketch of the difference is shown after this list.

  2. Yes, you need 8 TITAN Xp or 4 TITAN RTX GPUs to train BigGAN 256.

  3. I have a BigGAN 256 checkpoint that reaches roughly 13~15 FID and 46.715 IS. If you need it, please contact me (mgkang@postech.ac.kr).
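
For reference, here is a minimal sketch (not StudioGAN's actual code) of what switching from DataParallel to DistributedDataParallel looks like in PyTorch. The `Generator` placeholder and the launch method (e.g. torchrun, which sets the LOCAL_RANK environment variable) are assumptions for illustration:

```python
# Minimal sketch: one process per GPU, each wrapping its own model replica with DDP.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_model(model_cls):
    # torchrun (or an equivalent launcher) sets LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = model_cls().to(local_rank)
    # DDP keeps one replica per process and only synchronizes gradients,
    # avoiding the per-step scatter/gather that makes DataParallel a bottleneck.
    return DDP(model, device_ids=[local_rank])
```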

@mingukkang
Collaborator

Also, I plan to port all of our implementations to DDP and fill in the missing parts of our table after the camera-ready deadline.

Sincerely,

Minguk Kang

@sangwoomo
Author

Hi @mingukkang, thank you for your quick response!

It would be awesome if you could provide an official DDP implementation. Thank you for your work! 🤩

@mingukkang
Collaborator

Hi @sangwoomo :)

Recently, I created the ddp branch, and it fully supports distributed data parallel (DDP) training.

If you want to use it, please follow the instructions below:

  1. git stash
  2. git pull origin master
  3. git pull origin ddp
  4. git checkout ddp
  5. CUDA_VISIBLE_DEVICES=0,1,2,3 python3 main.py -t -e -l -rm_API -DDP

Thank you!

I also plan to add SyncBN using PyTorch's official synchronized BatchNorm module.
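
For context, this is a minimal sketch of how PyTorch's official SyncBN can be applied before wrapping a model with DDP; it is an assumption for illustration, not necessarily how the ddp branch implements it:

```python
# Sketch: convert BatchNorm layers to SyncBatchNorm so BN statistics are
# computed across all GPUs, then wrap the model with DDP as usual.
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def to_sync_bn(model, local_rank):
    # Replaces every nn.BatchNorm* module in the model with nn.SyncBatchNorm.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    return DDP(model.to(local_rank), device_ids=[local_rank])
```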

I will merge the branch as soon as possible.

Sincerely,

@sangwoomo
Author

Thanks a lot! :)

@mingukkang
Collaborator

I have updated all of StudioGAN's models to support DDP training.

You can now use PyTorch's official SyncBN and mixed-precision training to stabilize and speed up GAN training.
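
As a rough illustration of the mixed-precision part, here is a hypothetical discriminator step using torch.cuda.amp; the hinge loss and function names are assumptions for the sketch, not StudioGAN's actual trainer code:

```python
# Sketch: mixed-precision discriminator update with autocast + GradScaler.
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

def discriminator_step(D, G, optimizer_d, reals, z):
    optimizer_d.zero_grad()
    with autocast():  # run the forward pass in float16 where it is safe
        fakes = G(z).detach()
        # Hypothetical hinge loss; the actual loss depends on the configuration.
        loss_d = torch.mean(torch.relu(1.0 - D(reals))) + \
                 torch.mean(torch.relu(1.0 + D(fakes)))
    scaler.scale(loss_d).backward()  # scale the loss to avoid fp16 underflow
    scaler.step(optimizer_d)
    scaler.update()
    return loss_d.item()
```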

Sincerely,

Minguk Kang

@mingukkang
Collaborator

Successfully done :)
