Issue on pre-training #4
Hi, thanks for your interest in our work! In my experience, I never encountered such errors in my pre-training experiments on either our internal codebase or this public one. Since my current internship hasn't finished, it might be hard for me to debug this issue this week, but I'll definitely let you know as soon as we have any updates. Meanwhile, feel free to share updates if you have any new findings (or new errors, lol). Thanks!
Thanks for your prompt reply! I can pre-train normally after setting Woodbury to False in line 54 of model/FRN.py, and the results look reasonable. 😀 Not sure if the Woodbury identity is causing the problem.
Hi, thanks for your updates! My colleague helped me run a preliminary check on pretraining using our lab's machine. Without changing any hyper-parameters, the current pretraining code runs fine without errors, which also matches my impression. But I'll double-check the pretraining myself after I return to school. Btw, just wondering: are you using the same Conda environment we provided in the README? Not sure whether this error is due to the PyTorch version, but it's worth checking. And thanks for letting us know the non-Woodbury implementation works well on your side! As you can find in the paper, although all of our results use the Woodbury form for consistency, the two implementations are indeed mathematically equivalent. And as you can see in the speed analysis section of our paper, for this specific pretraining case, the non-Woodbury form should actually be slightly faster than the Woodbury one. Since pretraining on ImageNet takes quite a long time, I personally didn't re-do the pre-training using the non-Woodbury implementation, but only checked inference consistency. I'm not sure whether the current hyper-parameters also work best for non-Woodbury pre-training, but good to know its results look reasonable on your side. Thanks!
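For reference, the equivalence mentioned above is the matrix push-through identity underlying the Woodbury form: one side inverts a d x d matrix, the other an r x r matrix. A minimal NumPy sketch (shapes and names here are purely illustrative, not taken from the FRN repo):

```python
import numpy as np

# Hypothetical sizes: r support feature vectors of dimension d.
rng = np.random.default_rng(0)
r, d, lam = 5, 8, 0.1
S = rng.standard_normal((r, d))

# Non-Woodbury form: invert a d x d matrix (S^T S + lam*I).
a = np.linalg.inv(S.T @ S + lam * np.eye(d)) @ S.T

# Woodbury / push-through form: invert an r x r matrix (S S^T + lam*I).
b = S.T @ np.linalg.inv(S @ S.T + lam * np.eye(r))

# The two ridge-regression solutions agree up to floating-point error.
assert np.allclose(a, b)
```

Which form is faster depends on whether r or d is smaller, which matches the speed analysis mentioned above.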
I also encountered the same problem. After setting Woodbury to False according to the above suggestions, the pre-training no longer reported an error. However, the problem reappeared when running the fine-tuning part of train.sh. |
I should add that the error I got is: inverse_cuda: For batch 2: U(24,24) is zero, singular U. Looking forward to your reply!
Hi all, thanks for the updates and information. I'm still unable to reproduce this issue on our lab machine using the current hyperparameters. From what I can tell, though, it looks like training is diverging for the learned parameter alpha, sending the corresponding regularization term lambda to zero (or possibly infinity, in @AmethystZeroAtwell's case). I can think of a few possible fixes beyond changing the Woodbury parameter:
I have pushed fix 1, since that's just good practice, but in the interest of faithfully reproducing the results in the paper we'll probably leave fixes 2 and 3 to the user's discretion. I was unable to observe any difference in pretraining performance with fix 2, but I also couldn't finish the entire run (for unrelated reasons), so I can't say with any certainty that implementing fix 2 will actually reproduce the results of the paper. Same disclaimer for fix 3: it shouldn't affect benchmark performance beyond stability, but it still might. Hope that helps; let me know if the floor value fails to eliminate these errors.
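As a rough illustration of what a floor value on the regularizer could look like, here is a NumPy sketch (the function name and `lam_floor` value are hypothetical, not the repo's actual fix):

```python
import numpy as np

def regularized_inverse(sts, lam, lam_floor=1e-6):
    """Invert (S^T S + lam*I) with a floor on lam, so the matrix stays
    non-singular even if the learned regularizer collapses toward zero.
    (lam_floor=1e-6 is an illustrative choice, not the repo's value.)"""
    lam = max(float(lam), lam_floor)
    d = sts.shape[-1]
    return np.linalg.inv(sts + lam * np.eye(d))
```

With a floor in place, a rank-deficient S^T S (which is what produces the "singular U" error above) still yields a finite, well-defined inverse.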
Thanks, Davis! And thanks @walker1207 for bringing up the issues on your side. Just a quick follow-up question, @AmethystZeroAtwell @walker1207: have you tried using the Conda environment provided in the README? Not sure whether these issues are related to package versions, but given that we cannot reproduce this error on our machines, it might be worth double-checking. Thanks!
Hi @walker1207, I found that running the code on PyTorch 1.7.1 causes this problem; on PyTorch 1.7.0 it works fine. Thanks to @Tsingularity @daviswer for your complete responses and wonderful work!
@walker1207 just a quick follow-up: did changing the PyTorch version solve the error on your side?
Closing this for now. Feel free to re-open it if this issue happens again and cannot be solved using the solutions above.
I ran mini-ImageNet/FRN/ResNet-12_pretrain/train.sh according to the instructions but got the error below, which I think is caused by the network outputting inf values. Reducing the initial learning rate to 1e-3 avoids this error.
Traceback (most recent call last):
File "train.py", line 35, in
tm.train(model)
File "../../../../trainers/trainer.py", line 185, in train
iter_counter=iter_counter)
File "../../../../trainers/frn_train.py", line 97, in pre_train
log_prediction = model.forward_pretrain(inp)
File "../../../../models/FRN.py", line 136, in forward_pretrain
beta=beta) # way*query_shot*resolution, way
File "../../../../models/FRN.py", line 75, in get_recon_dist
m_inv = (sts + torch.eye(sts.size(-1)).to(sts.device).unsqueeze(0).mul(lam)).inverse() # way, d, d
RuntimeError: CUDA error: an illegal memory access was encountered
Looking forward to your reply!
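Since the traceback points at the explicit `.inverse()` call crashing after activations overflow, a guard in front of that inversion can turn the opaque CUDA error into a readable one. A hedged sketch (in NumPy for self-containment; the repo itself uses PyTorch, and the function name here is hypothetical):

```python
import numpy as np

def safe_recon_inverse(sts, lam):
    """Guarded version of the inversion pattern at models/FRN.py line 75
    (sketch only): fail fast with a readable error when the input has
    overflowed to inf/NaN, instead of crashing inside the inverse kernel."""
    if not np.isfinite(sts).all():
        raise ValueError("non-finite values in S^T S; consider lowering the "
                         "initial learning rate (e.g. 1e-3 as noted above)")
    d = sts.shape[-1]
    # Solving against the identity is equivalent to an explicit inverse
    # but is generally a bit more numerically stable.
    return np.linalg.solve(sts + lam * np.eye(d), np.eye(d))
```

In PyTorch the same idea would use `torch.isfinite(...).all()` and `torch.linalg.solve` in place of the NumPy calls.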