
having trouble training for GEOM-Mol + trained models #1

Closed
rohanvarm opened this issue Nov 1, 2021 · 5 comments

@rohanvarm

File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/functional.py", line 1753, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA out of memory. Tried to allocate 410.00 MiB (GPU 0; 11.17 GiB total capacity; 9.92 GiB already allocated; 336.44 MiB free; 10.30 GiB reserved in total by PyTorch)
Any idea?

Also, would it be possible for you to put up trained models for both QM9 and GEOM-Drugs?

@HannesStark
Owner

Hi,
It seems that you do not have enough GPU memory.
You would need less memory if you reduce the batch size. This can be done in the .yml config file that you are using by changing batch_size: 500 to a lower value such as 250, as in the sketch below.
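A minimal sketch of the relevant config line (the comment is just for illustration; everything else in your config stays the same):

```yaml
# in the .yml config file used for training
batch_size: 250  # lowered from 500; smaller batches need less GPU memory per step
```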

I would like to avoid uploading further models since they already take up quite some space in the repo, and I think GitHub is meant more for code. Maybe I will make some more pre-trained models available elsewhere.

@rohanvarm
Author

Hi, sorry I posted the wrong error message:
[Epoch 1; Iter 50/ 1400] train: loss: 4.0249128
Fatal Python error: Segmentation fault

Current thread 0x00007fc9f915e700 (most recent call first):

Thread 0x00007fcad7f3f700 (most recent call first):
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/threading.py", line 300 in wait
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/queue.py", line 179 in get
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/tensorboard/summary/writer/event_file_writer.py", line 227 in run
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/threading.py", line 926 in _bootstrap_inner
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007fcc56bdd700 (most recent call first):
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147 in backward
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/tensor.py", line 245 in backward
File "/home/ubuntu/codebase/mol_pretraining/3DInfomax/trainer/trainer.py", line 129 in process_batch
File "/home/ubuntu/codebase/mol_pretraining/3DInfomax/trainer/trainer.py", line 148 in predict
File "/home/ubuntu/codebase/mol_pretraining/3DInfomax/trainer/trainer.py", line 82 in train
File "train.py", line 548 in train_geom
File "train.py", line 277 in train
File "train.py", line 687 in <module>
Segmentation fault (core dumped)

This happens after training starts, as can be seen from the epoch 1 training message, but then it crashes. Any idea what's causing this?

@HannesStark
Owner

Ah ok,
I found that happening in some cases when you use multiple threads and run multiple processes on the same GPU on Linux.
This is turned on in the code when you train multiple different seeds at once via the multithreaded_seeds: option in a config file.

If you only want to train a single model initialization and avoid this problem, you can just remove the multithreaded_seeds: option.

If you want to train all the different seeds, you would have to set them manually instead of using multithreaded_seeds: (see the sketch below).
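A rough sketch of the two options (whether the single-seed key is spelled seed: is an assumption about the config schema, not something confirmed here):

```yaml
# Option 1: train a single model initialization --
# simply leave multithreaded_seeds: out of the config entirely

# Option 2: train each seed in its own separate run instead of in parallel threads
seed: 123  # assumed key name; repeat the run once per seed you want
```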

@rohanvarm
Author

Not sure; removing multithreaded_seeds: as well as setting multithreaded_seeds: [123] gives the same core-dumped error as before.

@HannesStark
Owner

Okay, but what is the complete error message in that case? (It should not contain the multithreading stuff and should be different, I think.)

As far as I know (which is not very far with this topic), the segmentation fault basically means that something in CUDA went wrong and execution had to be stopped. The different error message will show you at which Python call this happened (which works because "faulthandler" is imported in train.py); see the sketch below.
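For reference, this is roughly how faulthandler produces that output (a sketch; the exact lines in train.py may differ):

```python
import faulthandler

# Once enabled, a segmentation fault (or other fatal signal) makes Python dump
# the tracebacks of all threads, which is where the
# "Current thread ... (most recent call first)" output above comes from.
faulthandler.enable()
```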
