Errors in training when using FLOAT16 #490
hi @abdelrahman-gaber
When I use all FLOAT16 as you said, I got the error:
This is the train.prototxt file I am using: https://drive.google.com/open?id=1_5YS_XY_rPKsV_uXHn6hXDSIB8yEiVo9
Thank you.
@abdelrahman-gaber what particular script did you use to create your LMDB?
I used the script provided by the original SSD implementation, with a minor modification to accept text files; my script can be found here: https://drive.google.com/file/d/1HBbGD4-G2mhqmeIW8aY5CeQBxTHwTRt1/view?usp=sharing
Note that the generated lmdb files were used to train the model with the original SSD implementation and it worked fine, so I used the same lmdb files when training the model with NVIDIA Caffe and ran into these problems.
@abdelrahman-gaber sorry, I need complete step-by-step instructions to re-build the lmdb. Your script uses a label map, which I would need to re-create too, etc.
I solved this problem by running the same lmdb script again with the NVIDIA Caffe version, which generated new lmdb files (same scripts and same data, just re-run with the new Caffe). However, the problem when using all FLOAT16 still exists, and it only works with:
@abdelrahman-gaber seems like we are not synced yet. :)
I am sorry for that. I uploaded all the necessary files, and they are as follows.
As the ground truth bboxes need to be converted to a certain format, here are the ground truth files used for the training: https://drive.google.com/file/d/1Iw48nhHIplZvBfpTFvmR7L1IXrCGyn02/view?usp=sharing
The images for training can be downloaded from the website: http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/
Please tell me if I missed any step.
Hi @abdelrahman-gaber thank you for reporting this bug. It's reproduced and fixed now. Please read the note for #493 about performance implications and SSD fp16 sample models. I'd appreciate your feedback.
Thank you so much. I will run the training again by the middle of this week and let you know if I face any problems.
@abdelrahman-gaber Please also do
I'll fix it later
Thank you @drnikolaev. The training is working now after fixing this bug.
Hi @drnikolaev The training can run now without reporting any errors, but the training process itself is not working well. I read about the scaling of gradients, which is necessary for training in fp16 mode; as I understand it, I should tune the parameter:
I also faced another problem when trying to use the VGG pretrained model: it works well in fp32 mode, but not in fp16 mode! I reported this in a new issue #499
All the files I am using for training and testing are here; in the logs folder you can find the output logs for the different configurations I tried. Thank you.
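For context, gradient scaling in NVIDIA Caffe is set in the solver rather than the net. A minimal sketch, assuming this release exposes the global_grad_scale solver field; the value 128 is only an illustrative starting point, not a recommendation from this thread:
# solver.prototxt (sketch): scale the loss so that small fp16 gradients do not underflow
global_grad_scale: 128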
@abdelrahman-gaber sorry about the delay. Please try to switch back to
@drnikolaev Thank you for your reply. I did all the modifications you mentioned, but the problem is actually still the same. Now I am trying the training without using any pretrained model, and when I set the IO and math types to Float like this:
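(The exact snippet was not captured in this thread; a plausible reconstruction, mirroring the FLOAT16 block in the original report with every type switched to FLOAT, would be:)
default_forward_type: FLOAT
default_backward_type: FLOAT
default_forward_math: FLOAT
default_backward_math: FLOAT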
Only in this case is the model working well; the validation accuracy is increasing and the training loss is decreasing. I would be more than thankful if you could try this training process yourself; all the files for preparing the dataset are as mentioned in the previous comments. I also hope you can give an estimated time for solving this issue. Thank you.
@abdelrahman-gaber could you verify the https://github.com/drnikolaev/caffe/tree/caffe-0.17 release candidate?
@drnikolaev Thanks for the update, I will test it and let you know.
@drnikolaev Thank you, it seems that the training is working now: the validation accuracy is increasing and the test loss is decreasing. However, I can only train from scratch and am still not able to use the pretrained model, as mentioned here: #499 I will let the training run until the end and will tell you if I notice any weird behavior.
@abdelrahman-gaber Please verify https://github.com/NVIDIA/caffe/tree/v0.17.1 release and reopen the issue if needed. |
Hi,
I am training a model with caffe-0.17 and want to use fp16 support. The training runs well when I use the normal float, but once I add these lines to the train.prototxt:
default_forward_type: FLOAT16
default_backward_type: FLOAT16
default_forward_math: FLOAT
default_backward_math: FLOAT
an error happens after 3 iterations of training, as follows:
The error also changes when I run the training again, and it can look like this:
Check failed: label >= 0 (-1 vs. 0)
or
Check failed: label < num_classes (3 vs. 2)
When I replace FLOAT16 with FLOAT it works fine! I am using a Tesla V100-SXM2 GPU with 16 GB memory, CUDA 9.0, and cuDNN 7.0.
I want to make sure that fp16 is supported for this configuration (this GPU and these CUDA libraries). Also, the error is not the same each time, which indicates that something is unstable. Is there any modification I should make to enable fp16 support?
Thank you.
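As an aside, not from the thread: since the failing checks come from the label path, one hedged workaround sketch is to keep the data layer in fp32 via NVIDIA Caffe's per-layer type overrides while the rest of the net stays in FLOAT16. The layer name and type below are hypothetical placeholders:
# train.prototxt fragment (sketch): override precision for a single layer
layer {
  name: "data"                # hypothetical layer name
  type: "AnnotatedData"       # assumed SSD-style data layer
  forward_type: FLOAT         # keep label/bbox handling in fp32
  backward_type: FLOAT
}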