Should I change the batch size? #17
Comments
@civilServant-666 The speed is abnormal. I think you should check whether the GPU is actually being used by the code. Batch size 1 is the official setting of CycleGAN.
Thanks. By the way, I am using the K80 GPU (an ml.p2.xlarge instance) provided by AWS SageMaker for the model training. How long do you anticipate it normally takes to train each epoch?
@civilServant-666 I haven't used a K80 before, so I have no idea about that. I have used a 1080Ti, which took 2-3 days for 200 epochs.
@civilServant-666 On my GTX 980 I get about 1 iteration processed per second at 256*256. Your K80 should be faster than that, so you are definitely not running on the GPU. Check what your TF device is reported as. If you want faster performance, try reducing your image size to something like 128*128. This will reduce your computation requirements by a factor of ~4.
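The ~4x estimate above can be sanity-checked with simple arithmetic: convolution cost scales with the number of spatial positions, which is quadratic in the image side length.

```python
# Back-of-the-envelope check: convolution work per image scales with
# the number of spatial positions, i.e. quadratically in side length.
full = 256 * 256    # spatial positions at the original resolution
small = 128 * 128   # positions after halving each side
print(full / small)  # 4.0 -> roughly 4x less computation per image
```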
Wow~>_<
@andrewginns Thank you for the advice. I'll check it out.
@civilServant-666 The code saves a checkpoint every epoch and will restore training if a checkpoint exists, so you can directly stop the process with CTRL+C. (Check whether a checkpoint exists in the "output" directory.)
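The resume behaviour described here can be sketched roughly as follows. This is an illustrative sketch using the standard TF 1.x checkpoint API, not verbatim code from the repo; the `restore_if_checkpoint_exists` helper and the directory layout are assumptions.

```python
try:
    import tensorflow as tf  # TF 1.x API, as used in this era of CycleGAN code
except ImportError:
    tf = None  # TensorFlow not installed; the helper below is just a sketch

def restore_if_checkpoint_exists(sess, saver, ckpt_dir):
    """Resume training from the latest checkpoint in ckpt_dir, if any."""
    latest = tf.train.latest_checkpoint(ckpt_dir)
    if latest is not None:
        saver.restore(sess, latest)  # continue from the saved state
        return True
    return False                     # no checkpoint: start from scratch
```

With this pattern, stopping training with CTRL+C is safe: the next run simply picks up from the last saved checkpoint.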
@LynnHo Yes, there is a "checkpoints" folder under the "output" directory. To restore training, I only need to type in the command "CUDA_VISIBLE_DEVICES=0 python train.py --dataset <my_dataset>", just like how I initially started the training, right?
@civilServant-666 Exactly.
@andrewginns I just used the following command to check the currently available devices, and no GPU showed up there. Does that mean the GPU is being occupied by the training job?
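For reference, one way to list the devices TensorFlow can actually see is `device_lib.list_local_devices()` from the TF 1.x API (a sketch; note that if TensorFlow is not installed at all, the import itself fails, which turned out to be part of the problem here):

```python
try:
    from tensorflow.python.client import device_lib
    # Each entry has a .name such as '/device:CPU:0' or '/device:GPU:0'
    devices = [d.name for d in device_lib.list_local_devices()]
except ImportError:
    devices = []  # TensorFlow is not installed in this environment

# A working GPU build should list at least one '/device:GPU:...' entry.
print(devices)
```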
Turns out I hadn't installed the tensorflow-gpu package. After running the following command in the terminal, the problem was fixed. @LynnHo In case others who are new to TensorFlow run into the same problem, I think it would be better to include "tensorflow-gpu" as one of the prerequisites in the documentation. Thank you guys for your help!
@civilServant-666 That was my carelessness. I have updated the README.md. Thanks a lot!
Pardon me if I asked a silly question.
I am using your code to train a model on my own datasets (1214 images for styleA and 1921 images for styleB; each image is 256*256), with the default batch size of 1.
The training process is really slow: it takes nearly 9 hours per epoch. Is this kind of speed normal?
If not, should I change the batch size? What is the optimal batch size that can attain both good computation efficiency and accuracy?
Thank you in advance.