
Questions about training details #24

Closed
ObeliskChoi opened this issue May 25, 2020 · 2 comments

@ObeliskChoi

Thanks for your excellent work. I have some questions about the training phase.

  1. When I trained the model on 2 NVIDIA P100 GPUs with batch size 32, it took nearly 6.5 minutes per 100 iterations, and the time was nearly the same whether I trained on the .png or the lmdb format. Could you give any advice on accelerating training?
  2. To obtain the provided pretrained model, did you train with batch size 16 for 600,000 iterations as in the .yml file, or with batch size 24 and fewer iterations? When the batch size changed, did you modify the initial learning rate or other related settings?
  3. A segmentation fault occurred when I ran test.py in codes/models/modules/DCNv2, but I ignored the error and still managed to run training and testing with a PSNR close to the reported one. Could this be the reason for the slow training speed, or could it lead to other errors?
    Looking forward to your reply. Thank you.
@Mukosame (Owner) commented Jun 8, 2020

Hi @ObeliskChoi,

  1. I have noticed that the training times reported in different issues vary a lot: some people see under 2 minutes per 100 iterations, while others see much longer, which is largely down to the machines themselves. A few small tips: put your data on an SSD instead of a mounted HDD, and try not to share the machine with other tasks that compete for disk I/O or GPU time. A quick way to check whether data loading is the bottleneck is sketched after this list.
  2. The more iterations you use, the better the results you can get. We use batch size 16 to make debugging easier on a single GPU, and batch size 24 to train on 2 GPUs. Scaling the initial learning rate with the batch size could give you a better result, but I did not do that for this paper (see the sketch after this list).
  3. In my limited experience, a segmentation fault usually means DCNv2 cannot be loaded at all during test/train, so I am also confused by your case. You can check whether DCNv2 really runs correctly on the GPU (a sanity check is sketched after this list).
    Please let me know if you have more questions.
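
On the speed question (point 1), one way to see whether disk I/O is the bottleneck is to time how long the training loop spends waiting for batches versus the total iteration time. This is only a minimal sketch: the stand-in `TensorDataset`, batch size, and worker count below are placeholders, not values from the actual training .yml.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Stand-in dataset so the sketch runs on its own; in practice use the
    # LR/HR dataset built from the training .yml (png- or lmdb-backed).
    train_set = TensorDataset(torch.randn(512, 3, 64, 64))
    loader = DataLoader(train_set, batch_size=32, num_workers=4, pin_memory=True)

    data_time, total_time = 0.0, 0.0
    t0 = time.time()
    for (batch,) in loader:
        data_time += time.time() - t0   # time spent waiting for the next batch
        # ... forward / backward / optimizer step would go here ...
        total_time += time.time() - t0  # full iteration time
        t0 = time.time()

    print(f"data loading: {data_time:.2f}s of {total_time:.2f}s total")
    # If data_time dominates, disk I/O (or too few workers) is the bottleneck;
    # moving the data to an SSD or raising num_workers should help.
```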
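
On point 2, if you do want to adjust the learning rate when changing the batch size, the usual heuristic (not something used in the paper, just a common rule of thumb) is to scale it linearly with the batch size. The base values below are hypothetical; take them from your own .yml.

```python
# Linear learning-rate scaling heuristic (an assumption, not what the paper used):
# keep lr / batch_size roughly constant when the batch size changes.
base_lr = 4e-4        # hypothetical initial LR from the training .yml
base_batch_size = 16  # batch size the .yml was tuned for

def scaled_lr(new_batch_size, base_lr=base_lr, base_batch_size=base_batch_size):
    """Return the initial learning rate scaled linearly with the batch size."""
    return base_lr * new_batch_size / base_batch_size

print(scaled_lr(24))  # LR for batch size 24 (about 6e-4)
print(scaled_lr(32))  # LR for batch size 32 (about 8e-4)
```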
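
On point 3, a minimal GPU sanity check for the compiled deformable-convolution extension could look like the following. It assumes the bundled DCNv2 exposes a `DCN` module with the constructor used in the upstream DCNv2 package; adjust the import to whatever codes/models/modules/DCNv2 actually provides.

```python
import torch
# Assumed import path; the module name may differ in codes/models/modules/DCNv2.
from dcn_v2 import DCN

assert torch.cuda.is_available(), "DCNv2 needs a CUDA device"

# A single deformable conv layer and a random feature map on the GPU.
dcn = DCN(64, 64, kernel_size=(3, 3), stride=1, padding=1, deformable_groups=2).cuda()
x = torch.randn(2, 64, 32, 32, device="cuda")

y = dcn(x)
y.mean().backward()       # also exercise the backward pass
torch.cuda.synchronize()  # surface any asynchronous CUDA errors here
print("DCNv2 forward/backward OK:", y.shape)
```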

@ObeliskChoi (Author)

Thanks for your reply.
