
Questions about training details #24

Closed
ObeliskChoi opened this issue May 25, 2020 · 2 comments

@ObeliskChoi

Thanks for your excellent work. I have some questions about the training phase.

  1. When I trained the model on 2 NVIDIA P100 GPUs with batch size 32, it took nearly 6.5 minutes per 100 iterations, and the time was nearly the same whether I trained on the .png or the lmdb format. Could you give any advice on accelerating training?
  2. To obtain the provided pretrained model, did you train with batch size 16 for 600,000 iterations as in the .yml file, or with batch size 24 and fewer iterations? When the batch size changed, did you modify the initial learning rate or other related settings?
  3. A segmentation fault occurred when I ran test.py in codes/models/modules/DCNv2, but I ignored the error and still managed to run training and testing with a PSNR close to the reported one. Could this be the reason for the slow training speed, or could it lead to other errors?
    Looking forward to your reply. Thank you.
@Mukosame (Owner) commented Jun 8, 2020

Hi @ObeliskChoi,

  1. I have noticed that the training times reported in different issues vary a lot: some people see under 2 minutes per 100 iterations, while others see much longer, which is largely down to the machines themselves. A few small tips: put your data on an SSD instead of a mounted HDD, and try not to share the machine with other tasks that compete for disk I/O or GPU time. A quick way to check whether data loading is the bottleneck is sketched after this list.
  2. The more iterations you use, the better the results you can get. We use batch size 16 to make debugging easier on a single GPU, and batch size 24 to train on 2 GPUs. Scaling the initial learning rate with the batch size could give you a better result, but I did not do that for this paper (see the sketch after this list).
  3. In my limited experience, a segmentation fault usually means DCNv2 cannot be loaded at all during test/train, so I am also confused by your case. You can check whether DCNv2 really runs correctly on the GPU (a sanity check is sketched after this list).
    Please let me know if you have more questions.
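
On the speed question (point 1), one way to see whether disk I/O is the bottleneck is to time how long the training loop spends waiting for batches versus the total iteration time. This is only a minimal sketch: the stand-in `TensorDataset`, batch size, and worker count below are placeholders, not values from the actual training .yml.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Stand-in dataset so the sketch runs on its own; in practice use the
    # LR/HR dataset built from the training .yml (png- or lmdb-backed).
    train_set = TensorDataset(torch.randn(512, 3, 64, 64))
    loader = DataLoader(train_set, batch_size=32, num_workers=4, pin_memory=True)

    data_time, total_time = 0.0, 0.0
    t0 = time.time()
    for (batch,) in loader:
        data_time += time.time() - t0   # time spent waiting for the next batch
        # ... forward / backward / optimizer step would go here ...
        total_time += time.time() - t0  # full iteration time
        t0 = time.time()

    print(f"data loading: {data_time:.2f}s of {total_time:.2f}s total")
    # If data_time dominates, disk I/O (or too few workers) is the bottleneck;
    # moving the data to an SSD or raising num_workers should help.
```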
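
On point 2, if you do want to adjust the learning rate when changing the batch size, the usual heuristic (not something used in the paper, just a common rule of thumb) is to scale it linearly with the batch size. The base values below are hypothetical; take them from your own .yml.

```python
# Linear learning-rate scaling heuristic (an assumption, not what the paper used):
# keep lr / batch_size roughly constant when the batch size changes.
base_lr = 4e-4        # hypothetical initial LR from the training .yml
base_batch_size = 16  # batch size the .yml was tuned for

def scaled_lr(new_batch_size, base_lr=base_lr, base_batch_size=base_batch_size):
    """Return the initial learning rate scaled linearly with the batch size."""
    return base_lr * new_batch_size / base_batch_size

print(scaled_lr(24))  # LR for batch size 24 (about 6e-4)
print(scaled_lr(32))  # LR for batch size 32 (about 8e-4)
```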
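
On point 3, a minimal GPU sanity check for the compiled deformable-convolution extension could look like the following. It assumes the bundled DCNv2 exposes a `DCN` module with the constructor used in the upstream DCNv2 package; adjust the import to whatever codes/models/modules/DCNv2 actually provides.

```python
import torch
# Assumed import path; the module name may differ in codes/models/modules/DCNv2.
from dcn_v2 import DCN

assert torch.cuda.is_available(), "DCNv2 needs a CUDA device"

# A single deformable conv layer and a random feature map on the GPU.
dcn = DCN(64, 64, kernel_size=(3, 3), stride=1, padding=1, deformable_groups=2).cuda()
x = torch.randn(2, 64, 32, 32, device="cuda")

y = dcn(x)
y.mean().backward()       # also exercise the backward pass
torch.cuda.synchronize()  # surface any asynchronous CUDA errors here
print("DCNv2 forward/backward OK:", y.shape)
```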

@ObeliskChoi (Author)

Thanks for your reply.
