
Should I change the batch size? #17

Closed
civilServant-666 opened this issue Jun 20, 2019 · 12 comments

@civilServant-666

Pardon me if I asked a silly question.
I am using your code to train a model on my own datasets (1214 images for styleA and 1921 images for styleB; image size 256*256), with the default batch size of 1.
The training process is really slow: each epoch takes nearly 9 hours. Is this kind of speed normal?
If not, should I change the batch size? What batch size would give both good computational efficiency and accuracy?
Thank you in advance.

@LynnHo
Owner

LynnHo commented Jun 20, 2019

@civilServant-666 The speed is abnormal. I think you should check whether the GPU is used by the code. Batch size 1 is the official setting of CycleGAN.

@civilServant-666
Author

Thanks. By the way, I am using the K80 GPU (an ml.p2.xlarge instance) provided by AWS SageMaker for the model training. How long do you anticipate each epoch normally takes?

@LynnHo
Owner

LynnHo commented Jun 20, 2019

@civilServant-666 I haven't used K80 before, so I have no idea about that. I have used 1080Ti, which took 2-3 days for 200 epochs.

@andrewginns

@civilServant-666 on my GTX 980 I get about 1 iteration processed per second at 256*256. Your K80 should be faster than that.

You are definitely not running on the GPU. Check what your TensorFlow device is reported as.

If you want faster performance, try reducing your image size to something like 128*128. This will reduce your computation requirements by a factor of ~4.
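A rough back-of-the-envelope check of the numbers above (assuming, for this sketch, that one epoch means a single pass over the larger of the two datasets at batch size 1; implementations vary):

```python
# Throughput arithmetic for the figures quoted in this thread.
# Assumes batch size 1, so iterations per epoch = size of the larger dataset.
iters_per_epoch = max(1214, 1921)      # styleA vs. styleB image counts

# At ~1 iteration/second (the GTX 980 figure above), one epoch should take
# about half an hour, nowhere near 9 hours:
minutes_per_epoch = iters_per_epoch / 1.0 / 60
print(minutes_per_epoch)               # ~32 minutes

# Halving the image side from 256 to 128 cuts per-image compute roughly
# in proportion to the pixel count:
factor = (256 * 256) / (128 * 128)
print(factor)                          # 4.0
```

A 9-hour epoch over ~1921 iterations works out to ~17 seconds per iteration, which is consistent with CPU-only training rather than a slow GPU.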

@civilServant-666
Author

Wow~>_<
In my case, it only got through 11 epochs in 5 days!
Is there any way I can pause the training, check out what the problem is, and then continue the training after I fix it? @LynnHo
I am quite new to deep learning, so I really appreciate your help.

@civilServant-666
Author

@andrewginns Thank you for the advice. I'll check it out.

@LynnHo
Owner

LynnHo commented Jun 20, 2019

@civilServant-666 The code saves a checkpoint every epoch and will resume training if a checkpoint exists. You can stop the process directly with CTRL+C. (Check whether there is a checkpoint in the output folder.)
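The resume behaviour described here follows a common pattern; a minimal sketch in plain Python, using pickle purely for illustration (the repository itself uses TensorFlow checkpoints, and the path below is hypothetical):

```python
import os
import pickle
import tempfile

# Hypothetical checkpoint path, for illustration only.
ckpt = os.path.join(tempfile.mkdtemp(), "state.pkl")

def load_or_init(path):
    """Resume from a checkpoint if one exists, else start from epoch 0."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"epoch": 0}

def save(path, state):
    """Save once per epoch, so CTRL+C loses at most the current epoch."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

state = load_or_init(ckpt)
for epoch in range(state["epoch"], 3):   # 3 "epochs" just for the demo
    # ... one epoch of training would run here ...
    state["epoch"] = epoch + 1
    save(ckpt, state)

print(load_or_init(ckpt)["epoch"])  # 3; rerunning resumes instead of restarting
```

Because the loop's start index comes from the checkpoint, killing and relaunching the process simply picks up at the next unfinished epoch.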

@civilServant-666
Author

@LynnHo Yes, there is a "checkpoints" folder under the "output" directory. To resume training, I only need to type in the command `CUDA_VISIBLE_DEVICES=0 python train.py --dataset <my_dataset>`, just like when I initially started the training, right?

@LynnHo
Owner

LynnHo commented Jun 20, 2019

@civilServant-666 Exactly.

@civilServant-666
Author

@andrewginns I just used the following commands to check the currently available devices, and no GPU showed up. Does that mean the GPU is being occupied by the training job?

```python
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
```

and here is what I get:

```
[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality { }
 incarnation: 13894235229207954586,
 name: "/device:XLA_CPU:0"
 device_type: "XLA_CPU"
 memory_limit: 17179869184
 locality { }
 incarnation: 5161380515914230634
 physical_device_desc: "device: XLA_CPU device"]
```

@civilServant-666
Author

Turns out I didn't install the tensorflow-gpu package. After running the following command in the terminal, the problem was fixed.

```shell
pip install tensorflow-gpu==2.0.0-alpha0
```

@LynnHo Just in case others who are new to TensorFlow like me run into the same problem, I think it would be better to include "tensorflow-gpu" as one of the prerequisites in the documentation.

Thank you guys for your help!

@LynnHo
Owner

LynnHo commented Jun 21, 2019

@civilServant-666 That was careless of me. I have updated the README.md. Thanks a lot!

@LynnHo LynnHo closed this as completed Jun 21, 2019