
Model can not converge on the LRA Pathfinder #22

Closed
violet-zct opened this issue Apr 20, 2022 · 19 comments

@violet-zct

Hi,

Thanks for the great work! When I ran your code on the LRA Pathfinder dataset (using your config), I found it fails to converge even by the end of the 200th epoch, as shown in the following log: loss=0.693, val/accuracy=0.499, val/loss=0.693, test/accuracy=0.495, test/loss=0.693, train/accuracy=0.501, train/loss=0.693. The loss stays at 0.693 throughout training.
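Note that 0.693 is exactly ln 2, the loss of a constant 50/50 prediction on Pathfinder's two classes, so the model is sitting at chance; a one-line check of that arithmetic:

import math
# ln(2) is the cross-entropy of predicting probability 0.5 for every example of a binary task
print(math.log(2))  # 0.6931471805599453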

Do you have any thoughts on this? Thanks!

@albertfgu
Contributor

The model changed a bit since the initial release. We will release a branch or tag marking the initial release where those configs should reproduce the original results (you should also be able to find it in the commit history). We have also been working on re-creating those results with the newest version of the model, which should involve only minor changes to the configs.

@violet-zct
Author

Thanks! Would you mind pointing me to the commit that I can use to reproduce the results?

@albertfgu
Contributor

Actually it looks like we already tagged it here: https://github.com/HazyResearch/state-spaces/releases/tag/v1

@violet-zct
Author

Great, thanks so much!

@violet-zct
Author

Hi, I used this commit and ran it on two datasets, pathfinder-32 and cifar, with two random seeds each, including your default seed 1112.
For pathfinder, the best valid accuracies are 77.59 and 78.14.
For cifar, the best valid accuracies are 79.72 and 79.8.

For both datasets, the validation accuracies are much worse than the test results reported in the paper. I used an A40 to run these experiments. Do you know why this happens, or is there a different commit of the code I should use for reproduction?
Thanks!

@albertfgu
Contributor

I'm not sure why this is happening. Many other people have been able to reproduce the experiments. What versions of pytorch and pytorch-lightning are you running? Which Cauchy kernel do you have installed? Can you paste the command lines you're using?

@violet-zct
Author

Here are my specifications:
pytorch 1.11.0
pytorch-lightning 1.6.1
Cauchy kernel: cauchy_conj_slow(v, z, w) in state-spaces/src/models/functional/cauchy.py, following the issue here #9 (comment).
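For reference, the operation that slow kernel evaluates is just the Cauchy sum k(z) = sum_j v_j / (z - w_j); a minimal sketch of that computation (my own, not the repo's exact cauchy_conj_slow, which additionally exploits conjugate symmetry in w):

import torch

def cauchy_naive(v, z, w):
    # k[..., l] = sum_j v[..., j] / (z[l] - w[..., j])
    # shapes: v, w: (..., N) complex; z: (L,) complex; returns (..., L)
    return (v.unsqueeze(-1) / (z - w.unsqueeze(-1))).sum(dim=-2)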

The command lines I used are exactly the same as in your README:

python -m train wandb=null experiment=s4-lra-cifar
python -m train wandb=null experiment=s4-lra-pathfinder

Thanks!

@albertfgu
Contributor

Did you have any issue installing either of the two faster Cauchy kernels? It is conceivable that they might have subtle numerical differences. We tested using the custom CUDA kernel.
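One quick way to probe for such numerical differences (a self-contained sketch using the naive sum from earlier in this thread, not an official test of the repo's kernels) is to evaluate the same Cauchy sum in single and double precision and look at the gap:

import torch

def cauchy_naive(v, z, w):
    # same naive Cauchy sum sketched earlier in the thread
    return (v.unsqueeze(-1) / (z - w.unsqueeze(-1))).sum(dim=-2)

v = torch.randn(64, dtype=torch.complex64)
w = -(torch.rand(64) + 0.5) + 1j * torch.randn(64)   # roots strictly in the left half-plane
z = 1j * torch.linspace(0.0, 3.14, 1024)             # evaluation nodes on the imaginary axis

k32 = cauchy_naive(v, z, w)
k64 = cauchy_naive(v.to(torch.complex128), z.to(torch.complex128), w.to(torch.complex128))
print((k32 - k64.to(torch.complex64)).abs().max())   # rough size of the single-precision error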

I just checked out that commit and ran the CIFAR command; I'm getting 80% val accuracy at 15 epochs and currently 82% at 30 epochs, so I think it's working properly.

It is also possible (although less likely) that pytorch-lightning changed something. If possible, I would suggest:

pip install pytorch-lightning==1.5.10
git checkout main
cd extensions/cauchy
python setup.py install
cd ../..
git checkout v1

and try running the command from there. The pykeops kernel installed with pip install pykeops==1.5 should also work.

@violet-zct
Author

Thanks so much for the response and instructions! I will try what you suggested.

@albertfgu
Contributor

The job I launched ended up getting to around 86% val accuracy. Let me know if you figure out the issue; if it ends up being a problem with cauchy_conj_slow or a package version I'll update the README.

@violet-zct
Author

Thanks! I was handling something else yesterday and will get back to you asap.

@violet-zct
Author

Hi Albert, sorry for the delay. I just created a new environment with pytorch 1.11.0 and pytorch_lightning 1.5.10 installed, and I also successfully compiled the custom CUDA Cauchy kernel. I ran experiments on CIFAR on both an A40 and an A100; however, I still could not reproduce the results and got something similar to my previous run:

[screenshot of CIFAR training metrics, similar to the earlier run]

I have no clue why this happens, since I didn't modify anything in your code.
Thanks!

@violet-zct
Author

Hi Albert, to confirm: my friend and I have independently been unable to reproduce the results with v1, but I can reproduce your v2 results.

@albertfgu
Contributor

Thanks for reporting back! I'll leave this issue open for longer because some other people are still trying to reproduce V1.

Just to check more variables: Is your friend using the same computing resources (e.g. same cluster or machine types) as you?

I definitely checked these results on an A100 before the V1 release, and as I reported above, a fresh checkout of the repo still gets into the high 80s on CIFAR for me on a P100, so I am really confused as well.

@violet-zct
Author

We are using the same cluster but different machine types.

@albertfgu
Contributor

Hi,

Could you downgrade to PyTorch 1.10 and try again when you have time? We just discovered a bug in PyTorch 1.11 (pytorch/pytorch#77081) with Dropout2d that causes a noticeable difference on small sCIFAR models and will probably cause a difference for large models as well.
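If downgrading is inconvenient, another option is to avoid Dropout2d's 3-D input handling altogether by dropping channels with an explicit mask; a minimal sketch of such a workaround (my own, not necessarily what the repo adopted):

import torch
import torch.nn as nn

class ChannelDropout(nn.Module):
    """Zeroes entire channels of a (batch, channels, length) tensor.

    Sketch of a workaround that does not rely on nn.Dropout2d, whose handling
    of 3-D inputs changed in PyTorch 1.11 (pytorch/pytorch#77081)."""
    def __init__(self, p=0.25):
        super().__init__()
        self.p = p

    def forward(self, x):  # x: (B, C, L)
        if not self.training or self.p == 0.0:
            return x
        # one keep/drop decision per (batch, channel), broadcast over length
        keep = (torch.rand(x.shape[0], x.shape[1], 1, device=x.device) > self.p).to(x.dtype)
        return x * keep / (1.0 - self.p)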

@violet-zct
Author

Thanks! Did you use PyTorch 1.10 for your version 1? I can downgrade and see if I can reproduce the results later.

@albertfgu
Contributor

Yeah, we were on torch 1.10 for a long time. The run I did above was also on 1.10.

@albertfgu
Contributor

Closing this issue as the original problems were confirmed to be a PyTorch bug and have since been resolved.
