having trouble training for GEOM-Mol + trained models #1
Comments
Hi, I would like to avoid uploading further models, since they already take up quite some space in the repo and I think GitHub is meant more for code. Maybe I will make some more pre-trained models available elsewhere.
Hi, sorry, I posted the wrong error message. This is the right one:
Current thread 0x00007fc9f915e700 (most recent call first):
Thread 0x00007fcad7f3f700 (most recent call first):
Thread 0x00007fcc56bdd700 (most recent call first):
This happens after training starts, as can be seen from the "epoch train 1" message, but then it crashes. Any idea what's causing this?
Ah ok, if you only want to train a single model initialization and avoid this problem, you can just remove the [...]. If you want to train all the different seeds, you would have to set them manually instead of using [...].
Not sure. Removing multithreaded_seeds: as well as setting multithreaded_seeds:[123] gives the same "core dumped" error as before.
Okay, but what is the complete error message in that case? (It cannot contain the multithreading stuff and should be different, I think.) As far as I know (which is not very far with this topic), the segmentation fault basically means that something in CUDA went wrong and execution had to be stopped. The different error message will show you at which call from Python this happened (which works because "faulthandler" is imported in train.py).
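For anyone following along, here is a minimal sketch of what the standard-library faulthandler module does, which is where the "Current thread ... (most recent call first)" lines come from. The helper names enable_crash_tracebacks and current_stack_dump are made up for illustration; this is not the code in train.py, only an example of the same mechanism.

```python
import faulthandler
import tempfile


def enable_crash_tracebacks(log_file=None):
    """Ask Python to dump the stack of every thread on a fatal signal.

    With faulthandler enabled, a SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL
    prints the Python traceback before the process dies.
    `log_file` is a hypothetical parameter; when omitted, output goes to stderr.
    """
    if log_file is None:
        faulthandler.enable(all_threads=True)
    else:
        # The file must stay open for as long as faulthandler is enabled.
        faulthandler.enable(file=log_file, all_threads=True)
    return faulthandler.is_enabled()


def current_stack_dump():
    """Return the same kind of dump on demand, without crashing anything."""
    with tempfile.TemporaryFile(mode="w+") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read()
```

Calling enable_crash_tracebacks() once at the top of a training script should be enough. The frames listed under each thread header then show which Python call was active when the process crashed, so it helps to post the full dump rather than only the header lines.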
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/functional.py", line 1753, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA out of memory. Tried to allocate 410.00 MiB (GPU 0; 11.17 GiB total capacity; 9.92 GiB already allocated; 336.44 MiB free; 10.30 GiB reserved in total by PyTorch)
Any idea?
Also, would it be possible for you to put up trained models for both QM9 and GEOM-Drugs?
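As a reading aid for the out-of-memory message above, here is a rough sketch that pulls the numbers out of the text and compares the requested allocation with the free memory. The regular expression and the parse_oom helper are assumptions written for this exact message format; PyTorch versions may word the message differently.

```python
import re

# Matches the wording of the message quoted in this thread (an assumption,
# not a stable PyTorch format).
OOM_PATTERN = re.compile(
    r"Tried to allocate (?P<request>[\d.]+) (?P<request_unit>[KMG]iB).*?"
    r"(?P<total>[\d.]+) (?P<total_unit>[KMG]iB) total capacity; "
    r"(?P<allocated>[\d.]+) (?P<allocated_unit>[KMG]iB) already allocated; "
    r"(?P<free>[\d.]+) (?P<free_unit>[KMG]iB) free; "
    r"(?P<reserved>[\d.]+) (?P<reserved_unit>[KMG]iB) reserved"
)

UNIT_TO_MIB = {"KiB": 1.0 / 1024.0, "MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Return the sizes from the message in MiB, or None if it does not match."""
    match = OOM_PATTERN.search(message)
    if match is None:
        return None
    fields = ("request", "total", "allocated", "free", "reserved")
    return {
        name: float(match[name]) * UNIT_TO_MIB[match[name + "_unit"]]
        for name in fields
    }


MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 410.00 MiB "
    "(GPU 0; 11.17 GiB total capacity; 9.92 GiB already allocated; "
    "336.44 MiB free; 10.30 GiB reserved in total by PyTorch)"
)

stats = parse_oom(MESSAGE)
shortfall = stats["request"] - stats["free"]  # MiB missing for this allocation
```

Here the request (410 MiB) is larger than what is free (336.44 MiB), so the GPU is simply full at that point in the forward pass. A smaller batch size or a GPU with more memory is the usual thing to try, though whether that is enough for this dataset is a guess.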