
Should I change the batch size? #17

Closed
civilServant-666 opened this issue Jun 20, 2019 · 12 comments

@civilServant-666

Pardon me if I asked a silly question.
I am using your code to train a model on my own datasets (1214 images for styleA and 1921 images for styleB; image size 256*256), with the default batch size of 1.
The training process is really slow: each epoch takes nearly 9 hours. Is this kind of speed normal?
If not, should I change the batch size? What batch size would give both good computational efficiency and accuracy?
Thank you in advance.

@LynnHo
Owner

LynnHo commented Jun 20, 2019

@civilServant-666 The speed is abnormal. I think you should check whether the GPU is used by the code. Batch size 1 is the official setting of CycleGAN.

@civilServant-666
Author

Thanks. By the way, I am using the K80 GPU (an ml.p2.xlarge instance) provided by AWS SageMaker for the model training. How long do you anticipate each epoch normally takes?

@LynnHo
Owner

LynnHo commented Jun 20, 2019

@civilServant-666 I haven't used K80 before, so I have no idea about that. I have used 1080Ti, which took 2-3 days for 200 epochs.

@andrewginns

@civilServant-666 on my GTX 980 I get about 1 iteration processed per second at 256*256. Your K80 should be faster than that.

You are definitely not running on the GPU. Check what your TensorFlow device is reported as.

If you want faster performance, try reducing your image size to something like 128*128. This will reduce your computation requirements by a factor of ~4.
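A rough back-of-the-envelope check of the numbers above (assuming, for this sketch, that one epoch means a single pass over the larger of the two datasets at batch size 1; implementations vary):

```python
# Throughput arithmetic for the figures quoted in this thread.
# Assumes batch size 1, so iterations per epoch = size of the larger dataset.
iters_per_epoch = max(1214, 1921)      # styleA vs. styleB image counts

# At ~1 iteration/second (the GTX 980 figure above), one epoch should take
# about half an hour, nowhere near 9 hours:
minutes_per_epoch = iters_per_epoch / 1.0 / 60
print(minutes_per_epoch)               # ~32 minutes

# Halving the image side from 256 to 128 cuts per-image compute roughly
# in proportion to the pixel count:
factor = (256 * 256) / (128 * 128)
print(factor)                          # 4.0
```

A 9-hour epoch over ~1921 iterations works out to ~17 seconds per iteration, which is consistent with CPU-only training rather than a slow GPU.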

@civilServant-666
Author

Wow~>_<
In my case, it only got through 11 epochs in 5 days!
Is there any way I can pause the training, check out what the problem is, and then continue the training after I fix it? @LynnHo
I am quite new to deep learning, so I really appreciate your help.

@civilServant-666
Author

@andrewginns Thank you for the advice. I'll check it out.

@LynnHo
Owner

LynnHo commented Jun 20, 2019

@civilServant-666 The code saves a checkpoint every epoch and will resume training if a checkpoint exists. You can stop the process directly with CTRL+C. (Check whether there is a checkpoint in the output folder.)
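The resume behaviour described here follows a common pattern; a minimal sketch in plain Python, using pickle purely for illustration (the repository itself uses TensorFlow checkpoints, and the path below is hypothetical):

```python
import os
import pickle
import tempfile

# Hypothetical checkpoint path, for illustration only.
ckpt = os.path.join(tempfile.mkdtemp(), "state.pkl")

def load_or_init(path):
    """Resume from a checkpoint if one exists, else start from epoch 0."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"epoch": 0}

def save(path, state):
    """Save once per epoch, so CTRL+C loses at most the current epoch."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

state = load_or_init(ckpt)
for epoch in range(state["epoch"], 3):   # 3 "epochs" just for the demo
    # ... one epoch of training would run here ...
    state["epoch"] = epoch + 1
    save(ckpt, state)

print(load_or_init(ckpt)["epoch"])  # 3; rerunning resumes instead of restarting
```

Because the loop's start index comes from the checkpoint, killing and relaunching the process simply picks up at the next unfinished epoch.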

@civilServant-666
Author

@LynnHo Yes, there is a "checkpoints" folder under the "output" directory. To resume training, I only need to type in the command `CUDA_VISIBLE_DEVICES=0 python train.py --dataset <my_dataset>`, just like when I initially started the training, right?

@LynnHo
Owner

LynnHo commented Jun 20, 2019

@civilServant-666 Exactly.

@civilServant-666
Author

@andrewginns I just used the following commands to check the currently available devices, and no GPU showed up. Does that mean the GPU is being occupied by the training job?

```python
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
```

and here is what I get:

```
[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality { }
 incarnation: 13894235229207954586,
 name: "/device:XLA_CPU:0"
 device_type: "XLA_CPU"
 memory_limit: 17179869184
 locality { }
 incarnation: 5161380515914230634
 physical_device_desc: "device: XLA_CPU device"]
```

@civilServant-666
Author

Turns out I didn't install the tensorflow-gpu package. After running the following command in the terminal, the problem was fixed.

```shell
pip install tensorflow-gpu==2.0.0-alpha0
```

@LynnHo Just in case others who are new to TensorFlow like me run into the same problem, I think it would be better to include "tensorflow-gpu" as one of the prerequisites in the documentation.

Thank you guys for your help!

@LynnHo
Owner

LynnHo commented Jun 21, 2019

@civilServant-666 That was careless of me. I have updated the README.md. Thanks a lot!

@LynnHo LynnHo closed this as completed Jun 21, 2019