System seemed stopped during refine #35

Open · ChrisLoSK opened this issue Nov 25, 2022 · 8 comments

ChrisLoSK commented Nov 25, 2022

Hi there,

I am a student new to cryo-EM. I am trying to apply IsoNet to my data and have encountered a problem.

The refine step takes an extremely long time without any response or error messages.

After 4 hours of running, the Slurm log was still at the Epoch 1/10 stage [see (1) below]. I repeated the run with the official tutorial HIV dataset, using exactly the same commands and parameters as in the tutorial, and got the same problem: the job still appeared to be in Epoch 1/10 even after 15 hours.
I then checked the GPUs [nvidia-smi output in (2) below]. The GPUs seem to be idle (0% utilization) while their memory is in use, and no new files were written during the waiting hours.

Could anyone give me some advice? Thank you very much!

Chris

(1) Slurm log-----------------------------------------------------------------------------------
11-25 10:58:34, INFO
######Isonet starts refining######

11-25 10:58:38, INFO Start Iteration1!
11-25 10:58:38, WARNING The results folder already exists
The old results folder will be renamed (to results~)
11-25 11:00:31, INFO Noise Level:0.0
11-25 11:01:08, INFO Done preparing subtomograms!
11-25 11:01:08, INFO Start training!
11-25 11:01:10, INFO Loaded model from disk
11-25 11:01:10, INFO begin fitting
Epoch 1/10
slurm-37178.out (END)

(2) nvidia-smi --------------------------------------------------------

Fri Nov 25 14:38:39 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 Off | 00000000:04:00.0 Off | N/A |
| 30% 33C P8 19W / 350W | 17755MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 3090 Off | 00000000:43:00.0 Off | N/A |
| 30% 32C P8 20W / 350W | 17755MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 3090 Off | 00000000:89:00.0 Off | N/A |
| 30% 30C P8 32W / 350W | 17755MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 3090 Off | 00000000:C4:00.0 Off | N/A |
| 30% 30C P8 25W / 350W | 17755MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1212666 C python3 17747MiB |
| 0 N/A N/A 3278054 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 1212666 C python3 17747MiB |
| 1 N/A N/A 3278054 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 1212666 C python3 17747MiB |
| 2 N/A N/A 3278054 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 1212666 C python3 17747MiB |
| 3 N/A N/A 3278054 G /usr/lib/xorg/Xorg 4MiB |

@LianghaoZhao
Contributor

I seem to have the same problem. I wonder if you have solved it?

@procyontao
Collaborator

I found this similar issue: keras-team/keras#11603, which is related to the cuDNN version. The dependencies should match what is listed here: https://www.tensorflow.org/install/source#gpu
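
For anyone checking their own setup, here is a quick sanity check (a hedged sketch, not part of IsoNet) that prints the CUDA/cuDNN versions a TensorFlow wheel was built against and whether it actually sees the GPUs; the build-info version keys are only populated on GPU builds:

```python
# Hedged sanity check: compare these versions against the compatibility
# table at https://www.tensorflow.org/install/source#gpu
import tensorflow as tf

print("TensorFlow:", tf.__version__)

# Versions the installed wheel was compiled against (GPU builds only).
build = tf.sysconfig.get_build_info()
print("Built against CUDA:", build.get("cuda_version"))
print("Built against cuDNN:", build.get("cudnn_version"))

# If this list is empty, TensorFlow silently fell back to CPU,
# which can make training look like it has hung.
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```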

@ChrisLoSK
Author

ChrisLoSK commented Dec 8, 2022 via email

@LianghaoZhao
Contributor

I finally found that conda install cudatoolkit worked well; it provides the essential libraries for TensorFlow. Besides, I found that changing the log level from "info" to "debug" in isonet.py provides more information.
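
For anyone looking for where to make that change, here is a hypothetical sketch of the log-level tweak, assuming isonet.py configures Python's standard logging module with a basicConfig call (the exact line in IsoNet may look different):

```python
# Hypothetical sketch of the log-level change described above; the actual
# configuration in isonet.py may differ. DEBUG level surfaces errors that
# are otherwise swallowed while the job appears to hang.
import logging

logging.basicConfig(
    format="%(asctime)s, %(levelname)s %(message)s",
    datefmt="%m-%d %H:%M:%S",
    level=logging.DEBUG,   # was logging.INFO
)
```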

@abhatta2p

abhatta2p commented Dec 9, 2022

Hi all,

A Linux novice here. I have the same issue as the OP on a standalone workstation: the "refine" job gets stuck in the first iteration at Epoch 1/10.

Following @LianghaoZhao's comment, I tried "conda install cudatoolkit", but that did not solve the problem. After changing the log level to "debug", though, I could at least identify the issue from the log:

OP_REQUIRES failed at xla_ops.cc:417 : INVALID_ARGUMENT: Trying to access resource Resource-1414-at-0x38b1eb20 located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:3 Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device

which I assume means GPU 3 is trying to access something that lives on GPU 0. Using only one GPU, I was able to get the refinement to progress (it hasn't finished yet at the time of writing this post), but I am unsure what might be causing this issue with multiple GPUs.
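
For others hitting the same error, here is a hedged sketch of the single-GPU workaround: hide all but one GPU from TensorFlow before it is imported, so the multi-GPU distribution path is never taken (CUDA_VISIBLE_DEVICES is a standard CUDA environment variable, not an IsoNet option):

```python
# Hedged workaround sketch: expose only GPU 0 to TensorFlow.
# This must be set before TensorFlow is imported (or exported in the
# shell / Slurm script before launching the job).
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))  # should now list exactly one device
```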

OS: Ubuntu 20.04.5
GPUs: 4x RTX A5000
Nvidia driver version: 515.86.01
CUDA version: 11.2
cuDNN version: 8.1.1
Python version: 3.7.15
Tensorflow version: 2.11.0

I would be more than happy to provide more info/logs for debugging, if needed. I've been having issues with TensorFlow/Keras in DeePict as well, and wonder if the two issues are somehow related.

Best,
Arjun

@LianghaoZhao
Contributor

(quoting @abhatta2p's comment above)

Oh, I finally solved this. I eventually ran into exactly the same error; conda install cudatoolkit is not the correct solution.
This only happens in a multi-GPU environment; training on a single GPU works correctly. I changed the code in train.py to move model.compile into the with strategy.scope() block, and that solved it. I have forked this repo, and the last commit in my fork is the fix. My CUDA, cuDNN, and TensorFlow versions are the same as yours.
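
For context, here is a minimal sketch of the pattern described above, with a placeholder model rather than IsoNet's actual train.py: under tf.distribute.MirroredStrategy, both building and compiling the model should happen inside strategy.scope(), otherwise the optimizer's variables can be created on a different device than the replicated model, which is exactly the kind of cross-device resource access the XLA error complains about.

```python
# Minimal sketch of the fix described above (placeholder model, not
# IsoNet's train.py). Key point: build AND compile the model inside
# strategy.scope(), so optimizer state is created under the same
# distribution strategy as the model variables.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # replicates across all visible GPUs

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    # Moving compile() inside the scope is what resolves the
    # "Trying to access resource ... from device GPU:3" error.
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) can then be called outside the scope as usual.
```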

@procyontao
Collaborator

Hi @LianghaoZhao,

Thank you for reporting your bug fix. Would you like to review the code in your fork and create a pull request, so that it can be merged into the master branch?

@BhattaArjun2p
Copy link

(quoting @LianghaoZhao's reply above)

Thank you very much for the bug-fix, @LianghaoZhao. I just tried out the newest commit, and it works just fine with multiple GPUs.
