
Model can not converge on the LRA Pathfinder #22

Closed
violet-zct opened this issue Apr 20, 2022 · 19 comments

@violet-zct

Hi,

Thanks for the great work! When I ran your code on the LRA Pathfinder dataset (using your config), I found it fails to converge even by the end of the 200th epoch, as shown in the following log: loss=0.693, val/accuracy=0.499, val/loss=0.693, test/accuracy=0.495, test/loss=0.693, train/accuracy=0.501, train/loss=0.693. The loss stays at 0.693 throughout training.
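Note that 0.693 is exactly ln 2, the loss of a constant 50/50 prediction on Pathfinder's two classes, so the model is sitting at chance; a one-line check of that arithmetic:

import math
# ln(2) is the cross-entropy of predicting probability 0.5 for every example of a binary task
print(math.log(2))  # 0.6931471805599453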

Do you have any thoughts on this? Thanks!

@albertfgu
Contributor

The model changed a bit since the initial release. We will release a branch or tag marking the initial release where those configs should reproduce the original results (you should also be able to find it in the commit history). We have also been working on re-creating those results with the newest version of the model, which should involve only minor changes to the configs.

@violet-zct
Author

Thanks! Would you mind pointing me to the commit that I can use to reproduce the results?

@albertfgu
Contributor

Actually it looks like we already tagged it here: https://github.com/HazyResearch/state-spaces/releases/tag/v1

@violet-zct
Author

Great, thanks so much!

@violet-zct
Author

Hi, I used this commit and ran it on two datasets, pathfinder-32 and cifar, with two random seeds each, including your default seed 1112.
For pathfinder, the best valid accuracies are 77.59 and 78.14.
For cifar, the best valid accuracies are 79.72 and 79.8.

For both datasets, the validation accuracies are much worse than the test results reported in the paper. I used an A40 to run these experiments. Do you know why this happens, or is there a different commit of the code I should use for reproduction?
Thanks!

@albertfgu
Contributor

I'm not sure why this is happening. Many other people have been able to reproduce the experiments. What versions of pytorch and pytorch-lightning are you running? Which Cauchy kernel do you have installed? Can you paste the command lines you're using?

@violet-zct
Author

Here are my specifications:
pytorch 1.11.0
pytorch-lightning 1.6.1
Cauchy kernel: cauchy_conj_slow(v, z, w) in state-spaces/src/models/functional/cauchy.py, following the issue here #9 (comment).
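For reference, the operation that slow kernel evaluates is just the Cauchy sum k(z) = sum_j v_j / (z - w_j); a minimal sketch of that computation (my own, not the repo's exact cauchy_conj_slow, which additionally exploits conjugate symmetry in w):

import torch

def cauchy_naive(v, z, w):
    # k[..., l] = sum_j v[..., j] / (z[l] - w[..., j])
    # shapes: v, w: (..., N) complex; z: (L,) complex; returns (..., L)
    return (v.unsqueeze(-1) / (z - w.unsqueeze(-1))).sum(dim=-2)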

The command lines I used are exactly the same as in your README:

python -m train wandb=null experiment=s4-lra-cifar
python -m train wandb=null experiment=s4-lra-pathfinder

Thanks!

@albertfgu
Contributor

Did you have any issue installing either of the two faster Cauchy kernels? It is conceivable that they might have subtle numerical differences. We tested using the custom CUDA kernel.
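One quick way to probe for such numerical differences (a self-contained sketch using the naive sum from earlier in this thread, not an official test of the repo's kernels) is to evaluate the same Cauchy sum in single and double precision and look at the gap:

import torch

def cauchy_naive(v, z, w):
    # same naive Cauchy sum sketched earlier in the thread
    return (v.unsqueeze(-1) / (z - w.unsqueeze(-1))).sum(dim=-2)

v = torch.randn(64, dtype=torch.complex64)
w = -(torch.rand(64) + 0.5) + 1j * torch.randn(64)   # roots strictly in the left half-plane
z = 1j * torch.linspace(0.0, 3.14, 1024)             # evaluation nodes on the imaginary axis

k32 = cauchy_naive(v, z, w)
k64 = cauchy_naive(v.to(torch.complex128), z.to(torch.complex128), w.to(torch.complex128))
print((k32 - k64.to(torch.complex64)).abs().max())   # rough size of the single-precision error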

I just checked out that commit and ran the CIFAR command; I'm getting 80% val accuracy at 15 epochs and currently 82% at 30 epochs, so I think it's working properly.

It is also possible (although less likely) that pytorch-lightning changed something. If possible, I would suggest:

pip install pytorch-lightning==1.5.10
git checkout main
cd extensions/cauchy
python setup.py install
cd ../..
git checkout v1

and try running the command from there. The pykeops kernel installed with pip install pykeops==1.5 should also work.

@violet-zct
Author

Thanks so much for the response and instructions! I will try what you suggested.

@albertfgu
Contributor

The job I launched ended up getting to around 86% val accuracy. Let me know if you figure out the issue; if it ends up being a problem with cauchy_conj_slow or a package version I'll update the README.

@violet-zct
Author

Thanks! I was handling something else yesterday and will get back to you asap.

@violet-zct
Author

Hi Albert, sorry for the delay. I just created a new environment with pytorch 1.11.0 and pytorch_lightning 1.5.10 installed, and I also successfully compiled the custom CUDA Cauchy kernel. I ran experiments on CIFAR on both an A40 and an A100; however, I still could not reproduce the results and got something similar to my previous run:

[screenshot of CIFAR training metrics, similar to the earlier run]

I have no clue why this happens, since I didn't modify anything in your code.
Thanks!

@violet-zct
Author

Hi Albert, to confirm: my friend and I have independently been unable to reproduce the results with v1, but I can reproduce your v2 results.

@albertfgu
Contributor

Thanks for reporting back! I'll leave this issue open for longer because some other people are still trying to reproduce V1.

Just to check more variables: Is your friend using the same computing resources (e.g. same cluster or machine types) as you?

I definitely checked these results on an A100 before the V1 release, and as I reported above, a fresh checkout of the repo still gets into the high 80s on CIFAR for me on a P100, so I am really confused as well.

@violet-zct
Author

We are using the same cluster but different machine types.

@albertfgu
Contributor

Hi,

Could you downgrade to PyTorch 1.10 and try again when you have time? We just discovered a bug in PyTorch 1.11 (pytorch/pytorch#77081) with Dropout2d that causes a noticeable difference on small sCIFAR models and will probably cause a difference for large models as well.
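If downgrading is inconvenient, another option is to avoid Dropout2d's 3-D input handling altogether by dropping channels with an explicit mask; a minimal sketch of such a workaround (my own, not necessarily what the repo adopted):

import torch
import torch.nn as nn

class ChannelDropout(nn.Module):
    """Zeroes entire channels of a (batch, channels, length) tensor.

    Sketch of a workaround that does not rely on nn.Dropout2d, whose handling
    of 3-D inputs changed in PyTorch 1.11 (pytorch/pytorch#77081)."""
    def __init__(self, p=0.25):
        super().__init__()
        self.p = p

    def forward(self, x):  # x: (B, C, L)
        if not self.training or self.p == 0.0:
            return x
        # one keep/drop decision per (batch, channel), broadcast over length
        keep = (torch.rand(x.shape[0], x.shape[1], 1, device=x.device) > self.p).to(x.dtype)
        return x * keep / (1.0 - self.p)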

@violet-zct
Author

Thanks! Did you use PyTorch 1.10 for your version 1? I can downgrade and see if I can reproduce the results later.

@albertfgu
Contributor

Yeah, we were on torch 1.10 for a long time. The run I did above was also on 1.10.

@albertfgu
Contributor

Closing this issue as the original problems were confirmed to be a PyTorch bug and have since been resolved.
