NaN values after a single gradient step #40
Hi, the model checkpoint contains fp16 parameters for speed, but gradients for these weights are very prone to overflow/underflow without careful loss scaling, causing the NaN values you see. You can avoid this by casting all weights to fp32.
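To see how narrow fp16's range is (a generic NumPy sketch, not code from this repo): the largest representable fp16 value is about 65504, and values much below ~6e-8 round to zero, so unscaled gradients easily overflow or underflow:

```python
import numpy as np

# fp16 max is ~65504, smallest subnormal is ~6e-8: gradient values
# outside this range overflow to inf or underflow to zero.
big = np.float16(60000) * np.float16(2)
small = np.float16(1e-4) * np.float16(1e-4)
print(big)    # inf -- overflow
print(small)  # 0.0 -- underflow

# The same products stay finite and nonzero in fp32.
print(np.float32(60000) * np.float32(2))
print(np.float32(1e-4) * np.float32(1e-4))
```

This is why casting the weights (and hence the gradients) to fp32 makes the NaNs go away.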
Thank you! I didn't realize a single step already breaks the fp16 parameters. The code works now.
@MartinPernus Hey, I'm curious about fine-tuning CLIP. Could you share some details on how many image-text pairs you used and what batch size you chose? Did the fine-tuning work? I'd appreciate any input!
@NotNANtoN
Interesting use case, thanks! It's strange that the validation loss decreases while the test loss doesn't; I don't think I've ever seen that in my trainings.
I should have elaborated further. The training and validation splits were constructed manually from a set of binary attributes, while the test set belonged to an actual dataset of face descriptions + face images. The motivation was that we could, in theory, leverage datasets that contain binary attributes to improve CLIP performance for text-based face search. This could be useful in criminal investigations, where the victim describes the suspect and we would not have to constrain ourselves to a predefined set of binary attributes. I hope that makes it clearer; I'm happy to discuss this further.
Hi Martin, thanks, that clears it up! Interesting use case, but it makes sense that extrapolating from binary attributes to full sentences is very hard for the model. Unfortunately, I can't think of a better way to leverage datasets with binary attributes than what you tried there.
Thanks, it works for me!
Freezing the loaded model may also solve the problem. For example:
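The original snippet was not captured here; a minimal PyTorch sketch of what freezing looks like (the `Linear` module is a stand-in for the loaded CLIP model):

```python
import torch

model = torch.nn.Linear(4, 2)  # stands in for the loaded CLIP model

# Freeze every parameter so the optimizer never touches the fp16 weights.
for p in model.parameters():
    p.requires_grad = False

model.eval()  # also disable dropout / batch-norm statistics updates

# Only parameters that still require grad should be passed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
print(len(trainable))  # 0
```

With nothing trainable in the frozen model, no gradient step can push its fp16 weights to NaN.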
I met the same problem recently: NaNs after a random number of epochs. I've now changed the model to fp32 and am training again, but the training time has nearly doubled... Is there any way to use fp16 and avoid NaNs? Why do NaNs happen when training an fp16 model? Thanks for your reply.
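One standard approach (a generic PyTorch AMP sketch, not code from this thread) is to keep the master weights in fp32 and let `torch.cuda.amp.GradScaler` do the careful loss scaling mentioned above, while `torch.autocast` runs the forward pass in fp16 where it is safe:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # fp16 autocast only pays off on GPU

model = torch.nn.Linear(8, 1).to(device)  # fp32 master weights
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(4, 8, device=device)
y = torch.randn(4, 1, device=device)

for _ in range(3):
    opt.zero_grad()
    with torch.autocast(device_type=device, enabled=use_amp):
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()  # scale loss so small grads don't underflow
    scaler.step(opt)               # unscales grads; skips the step on inf/NaN
    scaler.update()

# The fp32 master weights stay finite across updates.
print(all(torch.isfinite(p).all() for p in model.parameters()))  # True
```

This keeps most of the fp16 speed on GPU while avoiding the overflow/underflow that produces NaNs.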
Hi!
Using PyTorch 1.7.1, I get NaN values after a single parameter update:
Output: