Issue on pre-training #4

Closed
AmethystZeroAtwell opened this issue Aug 16, 2021 · 10 comments

@AmethystZeroAtwell

I ran mini-ImageNet/FRN/ResNet-12_pretrain/train.sh according to the instructions but got the error below, which I think is caused by the network outputting inf values. Reducing the initial learning rate to 1e-3 avoids the error.

Traceback (most recent call last):
File "train.py", line 35, in
tm.train(model)
File "../../../../trainers/trainer.py", line 185, in train
iter_counter=iter_counter)
File "../../../../trainers/frn_train.py", line 97, in pre_train
log_prediction = model.forward_pretrain(inp)
File "../../../../models/FRN.py", line 136, in forward_pretrain
beta=beta) # way*query_shot*resolution, way
File "../../../../models/FRN.py", line 75, in get_recon_dist
m_inv = (sts + torch.eye(sts.size(-1)).to(sts.device).unsqueeze(0).mul(lam)).inverse() # way, d, d
RuntimeError: CUDA error: an illegal memory access was encountered

Looking forward to your reply!

@Tsingularity
Owner

Hi, thanks for your interest in our work!

In my experience, I have never encountered such errors in my pre-training experiments, either on our internal codebase or on this public one.

Since my current internship hasn't finished, it might be hard for me to debug this issue this week, but I'll definitely let you know as soon as we have any updates. Meanwhile, feel free to post updates if you have any new findings (or new errors lol).

Thanks!

@AmethystZeroAtwell
Author

Thanks for your prompt reply!

I can pre-train normally after setting Woodbury to False in line 54 of models/FRN.py, and the results look reasonable. 😀

Not sure if the Woodbury identity is causing the problem.

@Tsingularity
Owner

Hi, thanks for your updates!

My colleague helped me run a preliminary check on pre-training using our lab's machine. It looks like, without changing any hyper-parameters, the current pre-training code runs without errors, which matches my impression. I'll also double-check the pre-training myself after I return to school.

By the way, are you using the same conda environment provided in the README? I'm not sure whether this error is due to the PyTorch version, but it's worth checking.

And thanks for letting us know the non-Woodbury implementation works well on your side! As you can find in the paper, although all of our reported results use the Woodbury implementation for consistency, the two implementations are mathematically equivalent. As the speed analysis section of the paper also shows, for this specific pre-training case the non-Woodbury form should be slightly faster than the Woodbury one. Since pre-training on ImageNet takes quite a long time, I personally didn't re-do the pre-training with the non-Woodbury implementation and only checked inference consistency. I'm not sure whether the current hyper-parameters also work best for non-Woodbury pre-training, but it's good to know the results look reasonable on your side. Thanks!
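For readers hitting the same error, here is a minimal sketch of the two equivalent ridge-regression forms behind the Woodbury flag. It is a paraphrase, not the repository's exact code: the tensor shapes and the line being inverted follow the traceback above, while the function name, variable names, and the exact lambda scaling are assumptions.

```python
import torch

def recon_dist_sketch(query, support, alpha, beta, woodbury=True):
    # query:   (way * query_shot * resolution, d)
    # support: (way, shot * resolution, d)
    # alpha, beta: learned scalar parameters (used through exp())
    lam = support.size(1) / support.size(2) * alpha.exp()   # ridge term; scaling assumed
    rho = beta.exp()

    st = support.permute(0, 2, 1)                           # way, d, shot*resolution

    if woodbury:
        # Inverts a (d, d) matrix per class -- this is the line that raised
        # the CUDA error in the traceback above.
        sts = st.matmul(support)                             # way, d, d
        m_inv = (sts + torch.eye(sts.size(-1), device=sts.device)
                 .unsqueeze(0).mul(lam)).inverse()           # way, d, d
        hat = m_inv.matmul(sts)                              # way, d, d
    else:
        # Equivalent form: inverts a (shot*resolution, shot*resolution) matrix instead.
        sst = support.matmul(st)                             # way, r, r
        m_inv = (sst + torch.eye(sst.size(-1), device=sst.device)
                 .unsqueeze(0).mul(lam)).inverse()           # way, r, r
        hat = st.matmul(m_inv).matmul(support)               # way, d, d

    q_bar = query.matmul(hat).mul(rho)                       # way, n_query, d
    dist = (q_bar - query.unsqueeze(0)).pow(2).sum(2).permute(1, 0)
    return dist                                              # n_query, way
```

The two branches give the same reconstruction because of the push-through identity S^T (S S^T + λI)^{-1} = (S^T S + λI)^{-1} S^T; they differ only in the size of the matrix being inverted.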

@Fulin-Gao

I also encountered the same problem. After setting Woodbury to False as suggested above, pre-training no longer reported an error. However, the problem reappeared when running the fine-tuning part of train.sh.

@Fulin-Gao

To add to the above: the error I got is "inverse_cuda: For batch 2: U(24,24) is zero, singular U". Looking forward to your reply!

@daviswer
Collaborator

daviswer commented Aug 23, 2021

Hi all, thanks for the updates and information. I'm still unable to reproduce this issue on our lab machine using the current hyperparameters. From what I can tell though, it looks like learning is diverging for the learned parameter alpha and sending the corresponding regularization term lambda to zero (or possibly infinity, in @AmethystZeroAtwell's case). I can think of a few possible fixes beyond changing the Woodbury parameter:

  1. Add a floor value of 1e-6 to lambda for safety (FRN.py, line 66)
  2. Scale down the initialized support pool values (FRN.py, line 43) by a factor of sqrt(num_channels) when using a resnet backbone (so that support and query feature distributions align, hopefully preventing divergent learning on the regularization parameter)
  3. Artificially slow down the learning rate for alpha by replacing alpha.exp() (FRN.py, line 66) with alpha.div(10).exp()

I have pushed fix 1 since that's just good practice, but in the interest of faithfully reproducing the results in the paper we'll probably leave fixes 2 and 3 to the user's discretion. I was unable to observe any difference in pre-training performance with fix 2, but I also couldn't finish the entire run (for unrelated reasons), so I can't say with any certainty that implementing fix 2 will actually reproduce the results of the paper. The same disclaimer applies to fix 3: it shouldn't affect benchmark performance beyond stability, but it still might.
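For anyone who wants to try fixes 2 and 3 locally, here is a hedged sketch of what the edits could look like. The helper name, the pool attribute, and the tensor sizes are placeholders introduced for illustration; only the 1e-6 floor, the div(10), and the sqrt(num_channels) scaling come from the list above, and any extra scaling the repository applies to lambda is omitted.

```python
import math
import torch
import torch.nn as nn

# Fixes 1 and 3: wherever lambda is built from the learned alpha
# (FRN.py line 66 uses alpha.exp()), add a small floor and, optionally,
# damp alpha.
def make_lambda(alpha: torch.Tensor, slow_alpha: bool = False) -> torch.Tensor:
    a = alpha.div(10) if slow_alpha else alpha   # fix 3: slow alpha's effective learning rate
    return a.exp() + 1e-6                        # fix 1: keep lambda strictly positive

# Fix 2: scale the initialized support pool (FRN.py, line 43) down by
# sqrt(num_channels) so support and query feature magnitudes roughly align.
# The names and sizes below are illustrative, not the repository's attributes.
num_classes, resolution, channels = 64, 25, 640  # e.g. ResNet-12 on mini-ImageNet
support_pool = nn.Parameter(
    torch.randn(num_classes, resolution, channels) / math.sqrt(channels))
```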

Hope that helps, let me know if the floor value fails to eliminate these errors.

@Tsingularity
Owner

Thanks to Davis!

And thanks to @walker1207 for bringing up the issues on your side.

Just a quick follow-up question here, @AmethystZeroAtwell @walker1207: have you tried using the conda environment provided in the README? I'm not sure whether these issues are related to package versions, but given that we cannot reproduce the error on our machines, it might be worth double-checking.

Thanks!

@AmethystZeroAtwell
Author

Hi @walker1207, I found that running the code on PyTorch 1.7.1 causes this problem; on PyTorch 1.7.0 it works fine.

Thanks to @Tsingularity @daviswer for your complete response and wonderful work!
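For anyone comparing environments, a quick way to confirm which PyTorch build is active before launching a long pre-training run (nothing repository-specific here):

```python
import torch

# Per the report above, 1.7.1 triggered the batched-inverse / illegal-memory
# errors while 1.7.0 did not, so check the version (and CUDA build) first.
print(torch.__version__, torch.version.cuda)
```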

@Tsingularity
Owner

@walker1207 just a quick follow-up: would changing the PyTorch version solve the error on your side?
Thanks!

@Tsingularity
Owner

Closing this for now. Feel free to re-open it if the issue happens again and cannot be solved with the solutions above.

This issue was closed.