
having trouble training for GEOM-Mol + trained models #1

Closed
rohanvarm opened this issue Nov 1, 2021 · 5 comments

@rohanvarm

File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/functional.py", line 1753, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA out of memory. Tried to allocate 410.00 MiB (GPU 0; 11.17 GiB total capacity; 9.92 GiB already allocated; 336.44 MiB free; 10.30 GiB reserved in total by PyTorch)
Any idea?

Also, would it be possible for you to put up trained models for both QM9 and GEOM-Drugs?

@HannesStark
Owner

Hi,
It seems that you do not have enough GPU memory.
You would need less memory if you reduce the batch size. This can be done in the .yml config file that you are using by changing batch_size: 500 to a lower value such as 250, as in the sketch below.
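A minimal sketch of the relevant config line (the comment is just for illustration; everything else in your config stays the same):

```yaml
# in the .yml config file used for training
batch_size: 250  # lowered from 500; smaller batches need less GPU memory per step
```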

I would like to avoid uploading further models since they already take up quite some space in the repo, and I think GitHub is meant more for code. Maybe I will make some more pre-trained models available elsewhere.

@rohanvarm
Author

Hi, sorry I posted the wrong error message:
[Epoch 1; Iter 50/ 1400] train: loss: 4.0249128
Fatal Python error: Segmentation fault

Current thread 0x00007fc9f915e700 (most recent call first):

Thread 0x00007fcad7f3f700 (most recent call first):
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/threading.py", line 300 in wait
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/queue.py", line 179 in get
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/tensorboard/summary/writer/event_file_writer.py", line 227 in run
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/threading.py", line 926 in _bootstrap_inner
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007fcc56bdd700 (most recent call first):
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147 in backward
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/tensor.py", line 245 in backward
File "/home/ubuntu/codebase/mol_pretraining/3DInfomax/trainer/trainer.py", line 129 in process_batch
File "/home/ubuntu/codebase/mol_pretraining/3DInfomax/trainer/trainer.py", line 148 in predict
File "/home/ubuntu/codebase/mol_pretraining/3DInfomax/trainer/trainer.py", line 82 in train
File "train.py", line 548 in train_geom
File "train.py", line 277 in train
File "train.py", line 687 in <module>
Segmentation fault (core dumped)

This happens after training starts, as can be seen from the epoch 1 training message, but then it crashes. Any idea what's causing this?

@HannesStark
Owner

Ah ok,
I found that happening in some cases when you use multiple threads and run multiple processes on the same GPU on Linux.
This is turned on in the code when you train multiple different seeds at once via the multithreaded_seeds: option in a config file.

If you only want to train a single model initialization and avoid this problem, you can just remove the multithreaded_seeds: option.

If you want to train all the different seeds, you would have to set them manually instead of using multithreaded_seeds: (see the sketch below).
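A rough sketch of the two options (whether the single-seed key is spelled seed: is an assumption about the config schema, not something confirmed here):

```yaml
# Option 1: train a single model initialization --
# simply leave multithreaded_seeds: out of the config entirely

# Option 2: train each seed in its own separate run instead of in parallel threads
seed: 123  # assumed key name; repeat the run once per seed you want
```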

@rohanvarm
Author

Not sure; removing multithreaded_seeds: as well as setting multithreaded_seeds: [123] gives the same core-dumped error as before.

@HannesStark
Owner

Okay, but what is the complete error message in that case? (It should not contain the multithreading stuff and should be different, I think.)

As far as I know (which is not very far with this topic), the segmentation fault basically means that something in CUDA went wrong and execution had to be stopped. The different error message will show you at which Python call this happened (which works because "faulthandler" is imported in train.py); see the sketch below.
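For reference, this is roughly how faulthandler produces that output (a sketch; the exact lines in train.py may differ):

```python
import faulthandler

# Once enabled, a segmentation fault (or other fatal signal) makes Python dump
# the tracebacks of all threads, which is where the
# "Current thread ... (most recent call first)" output above comes from.
faulthandler.enable()
```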
