System seemed stopped during refine #35

Open · ChrisLoSK opened this issue Nov 25, 2022 · 8 comments

ChrisLoSK commented Nov 25, 2022

Hi there,

I am a student new to cryo-EM. I am trying to apply IsoNet to my data and have encountered a problem.

The refine step takes an extremely long time without any response or error messages.

After 4 hours of running, the Slurm log was still at the Epoch 1/10 stage [see (1) below]. I repeated the run with the official tutorial HIV dataset, using exactly the same commands and parameters as in the tutorial, and got the same problem: the job still appeared to be in Epoch 1/10 even after 15 hours.
I then checked the GPUs [nvidia-smi output in (2) below]. The GPUs seem to be idle (0% utilization) while their memory is in use, and no new files were written during the waiting hours.

Could anyone give me some advice? Thank you very much!

Chris

(1) Slurm log-----------------------------------------------------------------------------------
11-25 10:58:34, INFO
######Isonet starts refining######

11-25 10:58:38, INFO Start Iteration1!
11-25 10:58:38, WARNING The results folder already exists
The old results folder will be renamed (to results~)
11-25 11:00:31, INFO Noise Level:0.0
11-25 11:01:08, INFO Done preparing subtomograms!
11-25 11:01:08, INFO Start training!
11-25 11:01:10, INFO Loaded model from disk
11-25 11:01:10, INFO begin fitting
Epoch 1/10
slurm-37178.out (END)

(2) nvidia-smi --------------------------------------------------------

Fri Nov 25 14:38:39 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 Off | 00000000:04:00.0 Off | N/A |
| 30% 33C P8 19W / 350W | 17755MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 3090 Off | 00000000:43:00.0 Off | N/A |
| 30% 32C P8 20W / 350W | 17755MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 3090 Off | 00000000:89:00.0 Off | N/A |
| 30% 30C P8 32W / 350W | 17755MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 3090 Off | 00000000:C4:00.0 Off | N/A |
| 30% 30C P8 25W / 350W | 17755MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1212666 C python3 17747MiB |
| 0 N/A N/A 3278054 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 1212666 C python3 17747MiB |
| 1 N/A N/A 3278054 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 1212666 C python3 17747MiB |
| 2 N/A N/A 3278054 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 1212666 C python3 17747MiB |
| 3 N/A N/A 3278054 G /usr/lib/xorg/Xorg 4MiB |

@LianghaoZhao
Contributor

I seem to have the same problem. I wonder if you have solved it?

@procyontao
Collaborator

I found this similar issue: keras-team/keras#11603, which is related to the cuDNN version. The dependencies should match what is listed here: https://www.tensorflow.org/install/source#gpu
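
For anyone checking their own setup, here is a quick sanity check (a hedged sketch, not part of IsoNet) that prints the CUDA/cuDNN versions a TensorFlow wheel was built against and whether it actually sees the GPUs; the build-info version keys are only populated on GPU builds:

```python
# Hedged sanity check: compare these versions against the compatibility
# table at https://www.tensorflow.org/install/source#gpu
import tensorflow as tf

print("TensorFlow:", tf.__version__)

# Versions the installed wheel was compiled against (GPU builds only).
build = tf.sysconfig.get_build_info()
print("Built against CUDA:", build.get("cuda_version"))
print("Built against cuDNN:", build.get("cudnn_version"))

# If this list is empty, TensorFlow silently fell back to CPU,
# which can make training look like it has hung.
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```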

@ChrisLoSK
Author

ChrisLoSK commented Dec 8, 2022 via email

@LianghaoZhao
Contributor

I finally found that conda install cudatoolkit worked well; it provides the essential libraries for TensorFlow. Besides, I found that changing the log level from "info" to "debug" in isonet.py provides more information.
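
For anyone looking for where to make that change, here is a hypothetical sketch of the log-level tweak, assuming isonet.py configures Python's standard logging module with a basicConfig call (the exact line in IsoNet may look different):

```python
# Hypothetical sketch of the log-level change described above; the actual
# configuration in isonet.py may differ. DEBUG level surfaces errors that
# are otherwise swallowed while the job appears to hang.
import logging

logging.basicConfig(
    format="%(asctime)s, %(levelname)s %(message)s",
    datefmt="%m-%d %H:%M:%S",
    level=logging.DEBUG,   # was logging.INFO
)
```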

@abhatta2p

abhatta2p commented Dec 9, 2022

Hi all,

A Linux novice here. I have the same issue as the OP on a standalone workstation: the "refine" job gets stuck in the first iteration at Epoch 1/10.

Following @LianghaoZhao's comment, I tried "conda install cudatoolkit", but that did not solve the problem. After changing the log level to "debug", though, I could at least identify the issue from the log:

OP_REQUIRES failed at xla_ops.cc:417 : INVALID_ARGUMENT: Trying to access resource Resource-1414-at-0x38b1eb20 located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:3 Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device

which I assume means GPU 3 is trying to access something that lives on GPU 0. Using only one GPU, I was able to get the refinement to progress (it hasn't finished yet at the time of writing this post), but I am unsure what might be causing this issue with multiple GPUs.
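
For others hitting the same error, here is a hedged sketch of the single-GPU workaround: hide all but one GPU from TensorFlow before it is imported, so the multi-GPU distribution path is never taken (CUDA_VISIBLE_DEVICES is a standard CUDA environment variable, not an IsoNet option):

```python
# Hedged workaround sketch: expose only GPU 0 to TensorFlow.
# This must be set before TensorFlow is imported (or exported in the
# shell / Slurm script before launching the job).
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))  # should now list exactly one device
```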

OS: Ubuntu 20.04.5
GPUs: 4x RTX A5000
Nvidia driver version: 515.86.01
CUDA version: 11.2
cuDNN version: 8.1.1
Python version: 3.7.15
Tensorflow version: 2.11.0

I would be more than happy to provide more info/logs for debugging, if needed. I've been having issues with TensorFlow/Keras in DeePict as well, and wonder if the two issues are somehow related.

Best,
Arjun

@LianghaoZhao
Contributor

(quoting @abhatta2p's comment above)

Oh, I finally solved this. I eventually ran into exactly the same error; conda install cudatoolkit is not the correct solution.
This only happens in a multi-GPU environment; training on a single GPU works correctly. I changed the code in train.py to move model.compile into the with strategy.scope() block, and that solved it. I have forked this repo, and the last commit in my fork is the fix. My CUDA, cuDNN, and TensorFlow versions are the same as yours.
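
For context, here is a minimal sketch of the pattern described above, with a placeholder model rather than IsoNet's actual train.py: under tf.distribute.MirroredStrategy, both building and compiling the model should happen inside strategy.scope(), otherwise the optimizer's variables can be created on a different device than the replicated model, which is exactly the kind of cross-device resource access the XLA error complains about.

```python
# Minimal sketch of the fix described above (placeholder model, not
# IsoNet's train.py). Key point: build AND compile the model inside
# strategy.scope(), so optimizer state is created under the same
# distribution strategy as the model variables.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # replicates across all visible GPUs

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    # Moving compile() inside the scope is what resolves the
    # "Trying to access resource ... from device GPU:3" error.
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) can then be called outside the scope as usual.
```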

@procyontao
Collaborator

Hi @LianghaoZhao,

Thank you for reporting your bug fix. Would you like to review the code in your fork and create a pull request, so that it can be merged into the master branch?

@BhattaArjun2p
Copy link

(quoting @LianghaoZhao's reply above)

Thank you very much for the bug-fix, @LianghaoZhao. I just tried out the newest commit, and it works just fine with multiple GPUs.
