train_folder_ff does not utilize GPU #115

rashigeek · 2023-06-16T16:43:37Z

I am trying to train a force-field model using a variation of the following README command, adapted to my directories:

train_folder_ff.py --root_dir "alignn/examples/sample_data_ff" --config "alignn/examples/sample_data_ff/config_example_atomwise.json" --output_dir=temp
However, training is very slow and does not appear to use the GPU at all. Running nvidia-smi during training confirms this:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.43.02 Driver Version: 535.43.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:07:00.0 Off | N/A |
| 0% 42C P8 13W / 170W | 71MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1405 G /usr/lib/xorg/Xorg 56MiB |
| 0 N/A N/A 1571 G /usr/bin/gnome-shell 5MiB |
+---------------------------------------------------------------------------------------+

If I train a model that does not use force fields, the GPU is used.
For example, running train_folder.py --root_dir "alignn/examples/sample_data" --config "alignn/examples/sample_data/config_example.json" --output_dir=temp while simultaneously running nvidia-smi gives the following output:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.43.02 Driver Version: 535.43.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:07:00.0 Off | N/A |
| 0% 46C P2 62W / 170W | 921MiB / 12288MiB | 39% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1405 G /usr/lib/xorg/Xorg 56MiB |
| 0 N/A N/A 1571 G /usr/bin/gnome-shell 5MiB |
| 0 N/A N/A 29095 C .../miniconda3/envs/version/bin/python 848MiB |
+---------------------------------------------------------------------------------------+
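
The comparison above can also be made by sampling utilization repeatedly while training runs, rather than reading a single nvidia-smi snapshot. The following is a minimal sketch, assuming nvidia-smi is on the PATH; the helper names parse_gpu_sample and sample_gpu are illustrative and are not part of ALIGNN.

```python
import subprocess

# Ask nvidia-smi for two fields per GPU, as bare CSV without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_sample(line):
    """Parse one CSV line such as '39, 921' into (util_percent, memory_mib)."""
    util, mem = (field.strip() for field in line.split(","))
    return int(util), int(mem)


def sample_gpu():
    """Return one (util_percent, memory_mib) tuple per visible GPU.

    Raises FileNotFoundError if nvidia-smi is not installed, and ValueError
    if a field is reported as non-numeric (for example '[N/A]').
    """
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_sample(line) for line in out.splitlines() if line.strip()]
```

Calling sample_gpu() once per second in a second terminal during training gives a short time series. Utilization that stays near 0% with memory that never grows beyond the desktop processes would match the first table above.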

I have done my best to check that all dependencies are compatible, and I can confirm that the device is set to cuda in the train_folder_ff.py script.
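
A device variable set to cuda does not by itself show that the model's parameters were moved to the GPU. One possible check is to count parameters per device just before the training loop. This is a sketch under assumptions: the helper names are hypothetical, it is not taken from train_folder_ff.py, and it only relies on the object exposing .parameters() whose items have a .device attribute, as a torch.nn.Module does.

```python
from collections import Counter


def count_param_devices(model):
    """Count parameters per device, e.g. {'cuda:0': 120} or {'cpu': 120}.

    Works with any object that exposes .parameters() yielding items with a
    .device attribute (torch.nn.Module satisfies this).
    """
    return Counter(str(p.device) for p in model.parameters())


def all_on_cuda(model):
    """True only if the model has parameters and every one is on a CUDA device."""
    counts = count_param_devices(model)
    return bool(counts) and all(name.startswith("cuda") for name in counts)
```

If count_param_devices(model) reports only cpu entries, the model was never moved. If it reports only cuda entries, the next thing to inspect would be the device of each input batch.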


knc6 · 2023-06-16T17:30:27Z

Hi @rashigeek

What is the batch_size that you are using?

A small batch_size tends to under-utilize the GPU.

rashigeek · 2023-06-16T18:19:01Z

I have tried a wide range of batch sizes, including very large ones such as 1028, but performance was unaffected. I also tried passing batch_size as an argument, and the problem persisted.
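
For anyone repeating this test, one way to vary the batch size reproducibly is to write a modified copy of the config for each run. This is a minimal sketch that assumes batch_size is a top-level key in the JSON config; the function name write_config_with_batch_size is hypothetical.

```python
import json
from pathlib import Path


def write_config_with_batch_size(src, dst, batch_size):
    """Copy the JSON config at src to dst with 'batch_size' overridden.

    Assumes 'batch_size' is a top-level key; returns the modified config dict.
    """
    config = json.loads(Path(src).read_text())
    config["batch_size"] = batch_size
    Path(dst).write_text(json.dumps(config, indent=2))
    return config
```

The written file can then be passed through the same --config option shown in the commands above.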

ChemZhihaoWang · 2024-01-24T03:21:34Z

I'm having the same problem: I can't use the GPU when running run_alignn_ff.py.

knc6 · 2024-01-24T12:28:26Z

I am not able to reproduce this issue. Running this example on colab:
