Issue on pre-training #4

Closed
AmethystZeroAtwell opened this issue Aug 16, 2021 · 10 comments

@AmethystZeroAtwell

I ran mini-ImageNet/FRN/ResNet-12_pretrain/train.sh according to the instructions but got the error below, which I think is caused by the network outputting inf values. Reducing the initial learning rate to 1e-3 avoids the error.

Traceback (most recent call last):
File "train.py", line 35, in
tm.train(model)
File "../../../../trainers/trainer.py", line 185, in train
iter_counter=iter_counter)
File "../../../../trainers/frn_train.py", line 97, in pre_train
log_prediction = model.forward_pretrain(inp)
File "../../../../models/FRN.py", line 136, in forward_pretrain
beta=beta) # way*query_shot*resolution, way
File "../../../../models/FRN.py", line 75, in get_recon_dist
m_inv = (sts + torch.eye(sts.size(-1)).to(sts.device).unsqueeze(0).mul(lam)).inverse() # way, d, d
RuntimeError: CUDA error: an illegal memory access was encountered

Looking forward to your reply!

@Tsingularity
Owner

Hi, thanks for your interest in our work!

In my experience, I have never encountered such errors in my pre-training experiments, either on our internal codebase or on this public one.

Since my current internship hasn't finished, it might be hard for me to debug this issue this week, but I'll definitely let you know as soon as we have any updates. Meanwhile, feel free to post updates if you have any new findings (or new errors lol).

Thanks!

@AmethystZeroAtwell
Author

Thanks for your prompt reply!

I can pre-train normally after setting Woodbury to False in line 54 of models/FRN.py, and the results look reasonable. 😀

Not sure if the Woodbury identity is causing the problem.

@Tsingularity
Owner

Hi, thanks for your updates!

My colleague helped me run a preliminary check on pre-training using our lab's machine. It looks like, without changing any hyper-parameters, the current pre-training code runs without errors, which matches my impression. I'll also double-check the pre-training myself after I return to school.

By the way, are you using the same conda environment provided in the README? I'm not sure whether this error is due to the PyTorch version, but it's worth checking.

And thanks for letting us know the non-Woodbury implementation works well on your side! As you can find in the paper, although all of our reported results use the Woodbury implementation for consistency, the two implementations are mathematically equivalent. As the speed analysis section of the paper also shows, for this specific pre-training case the non-Woodbury form should be slightly faster than the Woodbury one. Since pre-training on ImageNet takes quite a long time, I personally didn't re-do the pre-training with the non-Woodbury implementation and only checked inference consistency. I'm not sure whether the current hyper-parameters also work best for non-Woodbury pre-training, but it's good to know the results look reasonable on your side. Thanks!
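For readers hitting the same error, here is a minimal sketch of the two equivalent ridge-regression forms behind the Woodbury flag. It is a paraphrase, not the repository's exact code: the tensor shapes and the line being inverted follow the traceback above, while the function name, variable names, and the exact lambda scaling are assumptions.

```python
import torch

def recon_dist_sketch(query, support, alpha, beta, woodbury=True):
    # query:   (way * query_shot * resolution, d)
    # support: (way, shot * resolution, d)
    # alpha, beta: learned scalar parameters (used through exp())
    lam = support.size(1) / support.size(2) * alpha.exp()   # ridge term; scaling assumed
    rho = beta.exp()

    st = support.permute(0, 2, 1)                           # way, d, shot*resolution

    if woodbury:
        # Inverts a (d, d) matrix per class -- this is the line that raised
        # the CUDA error in the traceback above.
        sts = st.matmul(support)                             # way, d, d
        m_inv = (sts + torch.eye(sts.size(-1), device=sts.device)
                 .unsqueeze(0).mul(lam)).inverse()           # way, d, d
        hat = m_inv.matmul(sts)                              # way, d, d
    else:
        # Equivalent form: inverts a (shot*resolution, shot*resolution) matrix instead.
        sst = support.matmul(st)                             # way, r, r
        m_inv = (sst + torch.eye(sst.size(-1), device=sst.device)
                 .unsqueeze(0).mul(lam)).inverse()           # way, r, r
        hat = st.matmul(m_inv).matmul(support)               # way, d, d

    q_bar = query.matmul(hat).mul(rho)                       # way, n_query, d
    dist = (q_bar - query.unsqueeze(0)).pow(2).sum(2).permute(1, 0)
    return dist                                              # n_query, way
```

The two branches give the same reconstruction because of the push-through identity S^T (S S^T + λI)^{-1} = (S^T S + λI)^{-1} S^T; they differ only in the size of the matrix being inverted.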

@Fulin-Gao

I also encountered the same problem. After setting Woodbury to False as suggested above, pre-training no longer reported an error. However, the problem reappeared when running the fine-tuning part of train.sh.

@Fulin-Gao

To add to the above: the error I got is "inverse_cuda: For batch 2: U(24,24) is zero, singular U". Looking forward to your reply!

@daviswer
Collaborator

daviswer commented Aug 23, 2021

Hi all, thanks for the updates and information. I'm still unable to reproduce this issue on our lab machine using the current hyperparameters. From what I can tell though, it looks like learning is diverging for the learned parameter alpha and sending the corresponding regularization term lambda to zero (or possibly infinity, in @AmethystZeroAtwell's case). I can think of a few possible fixes beyond changing the Woodbury parameter:

  1. Add a floor value of 1e-6 to lambda for safety (FRN.py, line 66)
  2. Scale down the initialized support pool values (FRN.py, line 43) by a factor of sqrt(num_channels) when using a resnet backbone (so that support and query feature distributions align, hopefully preventing divergent learning on the regularization parameter)
  3. Artificially slow down the learning rate for alpha by replacing alpha.exp() (FRN.py, line 66) with alpha.div(10).exp()

I have pushed fix 1 since that's just good practice, but in the interest of faithfully reproducing the results in the paper we'll probably leave fixes 2 and 3 to the user's discretion. I was unable to observe any difference in pre-training performance with fix 2, but I also couldn't finish the entire run (for unrelated reasons), so I can't say with any certainty that implementing fix 2 will actually reproduce the results of the paper. The same disclaimer applies to fix 3: it shouldn't affect benchmark performance beyond stability, but it still might.
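For anyone who wants to try fixes 2 and 3 locally, here is a hedged sketch of what the edits could look like. The helper name, the pool attribute, and the tensor sizes are placeholders introduced for illustration; only the 1e-6 floor, the div(10), and the sqrt(num_channels) scaling come from the list above, and any extra scaling the repository applies to lambda is omitted.

```python
import math
import torch
import torch.nn as nn

# Fixes 1 and 3: wherever lambda is built from the learned alpha
# (FRN.py line 66 uses alpha.exp()), add a small floor and, optionally,
# damp alpha.
def make_lambda(alpha: torch.Tensor, slow_alpha: bool = False) -> torch.Tensor:
    a = alpha.div(10) if slow_alpha else alpha   # fix 3: slow alpha's effective learning rate
    return a.exp() + 1e-6                        # fix 1: keep lambda strictly positive

# Fix 2: scale the initialized support pool (FRN.py, line 43) down by
# sqrt(num_channels) so support and query feature magnitudes roughly align.
# The names and sizes below are illustrative, not the repository's attributes.
num_classes, resolution, channels = 64, 25, 640  # e.g. ResNet-12 on mini-ImageNet
support_pool = nn.Parameter(
    torch.randn(num_classes, resolution, channels) / math.sqrt(channels))
```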

Hope that helps, let me know if the floor value fails to eliminate these errors.

@Tsingularity
Owner

Thanks to Davis!

And thanks to @walker1207 for bringing up the issues on your side.

Just a quick follow-up question here, @AmethystZeroAtwell @walker1207: have you tried using the conda environment provided in the README? I'm not sure whether these issues are related to package versions, but given that we cannot reproduce the error on our machines, it might be worth double-checking.

Thanks!

@AmethystZeroAtwell
Author

Hi @walker1207, I found that running the code on PyTorch 1.7.1 causes this problem; on PyTorch 1.7.0 it works fine.

Thanks to @Tsingularity @daviswer for your complete response and wonderful work!
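For anyone comparing environments, a quick way to confirm which PyTorch build is active before launching a long pre-training run (nothing repository-specific here):

```python
import torch

# Per the report above, 1.7.1 triggered the batched-inverse / illegal-memory
# errors while 1.7.0 did not, so check the version (and CUDA build) first.
print(torch.__version__, torch.version.cuda)
```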

@Tsingularity
Owner

@walker1207 just a quick follow-up: would changing the PyTorch version solve the error on your side?
Thanks!

@Tsingularity
Owner

Closing this for now. Feel free to re-open it if the issue happens again and cannot be solved with the solutions above.

This issue was closed.