Error Loading pre-trained checkpoints #174
Note that I had updated the dataset config section to test the pre-trained model on the val set:
Hi @chaitjo - This seems independent of the recent changes. The pretrained models were trained with DDP, and you are attempting to load them without DDP (we can update the docs to clarify this), hence the
Otherwise, you could run this with DDP and specify only 1 GPU to avoid modifying
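Another standard PyTorch workaround (not suggested in this thread, but commonly used for this situation) is to strip the `module.` prefix that DDP adds to every parameter key before calling `load_state_dict`. A minimal sketch; the checkpoint filename in the usage comment is hypothetical:

```python
from collections import OrderedDict

def strip_ddp_prefix(state_dict):
    """Remove the 'module.' prefix that DistributedDataParallel adds
    to every key when a wrapped model's state_dict is saved."""
    return OrderedDict(
        (k[len("module."):] if k.startswith("module.") else k, v)
        for k, v in state_dict.items()
    )

# Usage (filename is illustrative, not an actual released checkpoint):
# ckpt = torch.load("schnet_s2ef.pt", map_location="cpu")
# model.load_state_dict(strip_ddp_prefix(ckpt["state_dict"]))
```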
Thanks for the prompt response @mshuaibii. I see, it works now! Just anecdotally, do you think a single server with 2-4 GPU cards is sufficient for someone to play around with new ideas on OCP with the 200K data split? (Obviously, to do something more serious, I understand one would need larger compute!)
Additionally, is there a way to evaluate/test pre-trained models with multi-GPU on a single server, without using distributed data parallel?
Yup - a few of us at CMU have been exploring ideas with similar resources, although it's a little harder to iterate on several new ideas, since the 200k split on 2-4 GPUs can take a few days to train. As for the IS2RE splits, 2-4 GPUs can be used comfortably for all of them.
Multi-GPU support is implemented in the repo through distributed data parallel, so no. DDP doesn't require multiple nodes to function; it can be run on a single machine just fine with the following command:
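The original command wasn't captured in this extract. A typical single-node launch with the PyTorch distributed launcher looks roughly like the sketch below; the config path and the repo-specific flags (`--mode`, `--config-yml`, `--distributed`) are assumptions based on the repo's usual entry point, not taken from this thread:

```shell
# Single-node DDP: --nproc_per_node sets the number of GPUs on this machine.
python -m torch.distributed.launch --nproc_per_node=2 \
    main.py --mode train \
    --config-yml configs/s2ef/200k/schnet/schnet.yml \
    --distributed
```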
Was there a specific issue you were encountering that prevented you from using DDP on your machine? |
Hi! It seems that the updates to the codebase have made the pre-trained checkpoints released earlier incompatible with the current models. Here, I tried loading a pre-trained SchNet downloaded for S2EF from MODELS.md, following the instructions in TRAIN.md:
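The failure mode can be reproduced in isolation: a state_dict saved from a DDP-wrapped model carries a `module.` prefix on every parameter key, so loading it into a bare model raises a key-mismatch `RuntimeError`. A minimal, self-contained sketch (the tiny `nn.Linear` stands in for the actual OCP model; the real traceback from this issue is not shown here):

```python
import torch.nn as nn

# Stand-in for a real model; actual OCP models are much larger.
model = nn.Linear(4, 2)

# Simulate a checkpoint saved from a DDP-wrapped model: every key
# gains a "module." prefix.
ddp_state = {"module." + k: v for k, v in model.state_dict().items()}

try:
    model.load_state_dict(ddp_state)
except RuntimeError as e:
    # Strict loading fails: missing "weight"/"bias", unexpected "module.*" keys.
    print(type(e).__name__)
```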