Errors during refine #38

schmitbp opened this issue Jan 6, 2023 · 5 comments

schmitbp commented Jan 6, 2023

Hello everyone, I've been running into difficulties with IsoNet's refinement and was hoping to find some assistance.
To give some background: I'm trying to correct for the missing wedge on 5 tomograms. After following the tutorial online, I was able to generate a star file, correct the CTF, generate masks, and extract the subtomograms. However, when running the refine program, the job fails.

I ran the following script: isonet.py refine subtomo.star --gpuID 0,1,2,3,4,5 --iterations 30 --noise_start_iter 10,15,20,25 --noise_level 0.05,0.1,0.15,0.2
I submitted the job on a node on our university cluster, requesting 6 A100 GPUs and 600 GB of memory.
Later in the evening the job failed, after stalling at Epoch 1/10.
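
For completeness, here is a quick sanity check I can run inside the same environment (my own sketch, not part of IsoNet) to confirm that TensorFlow sees the GPUs and can launch a trivial computation on them:

```python
# Quick sanity check (my own sketch, not part of IsoNet): confirm that the
# TensorFlow in this environment sees the GPUs and can run a trivial op.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)

if gpus:
    # Run a tiny matmul on the first GPU to make sure kernels actually launch.
    with tf.device("/GPU:0"):
        x = tf.random.normal((1024, 1024))
        y = tf.linalg.matmul(x, x)
    print("Test matmul OK, result shape:", y.shape)
```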

I've been in contact with our CHPC department, who looked further into this and found that we have three issues tangled together, which makes it hard to pin down exactly what the problem is. So, to break it down into its constituent parts:
a. IsoNet was originally installed on Rocky 8. Our CHPC department tested TensorFlow and checked whether cuDNN worked correctly; it did.
b. The corresponding module was written for Rocky 8; however, a few days ago we realized we need the software to run on a CentOS 7 node (this is the server I have access to).
c. When attempting this on CentOS 7 (or Rocky 8), I didn't see any tangible progress (still stuck at Epoch 1/10). In the end, both jobs were stopped and threw the error shown in the screenshot below:

Do you all have any suggestions for getting the refinement to work? Please let me know if you need any additional details.
Best,
Ben

[Screenshot: Error1]

schmitbp commented Jan 6, 2023

I will add to this: I've tried the suggested fix for the OOM error (reducing the batch size from the default, which is 8 for 4 GPUs, to 4).

schmitbp commented Jan 7, 2023

When I run with --log_level debug I get the following:

######Isonet starts refining######

2023-01-07 10:45:03.212576: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
10:45:06, DEBUG [tpu_cluster_resolver.py:35] Falling back to TensorFlow client; we recommended you install the Cloud TPU client directly with pip install cloud-tpu-client.
2023-01-07 10:45:12.112492: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2023-01-07 10:45:12.114592: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2023-01-07 10:45:12.243121: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA A40 computeCapability: 8.6
coreClock: 1.74GHz coreCount: 84 deviceMemorySize: 44.56GiB deviceMemoryBandwidth: 648.29GiB/s
2023-01-07 10:45:12.244814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:23:00.0 name: NVIDIA A40 computeCapability: 8.6
coreClock: 1.74GHz coreCount: 84 deviceMemorySize: 44.56GiB deviceMemoryBandwidth: 648.29GiB/s
2023-01-07 10:45:12.246497: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 2 with properties:
pciBusID: 0000:41:00.0 name: NVIDIA A40 computeCapability: 8.6
coreClock: 1.74GHz coreCount: 84 deviceMemorySize: 44.56GiB deviceMemoryBandwidth: 648.29GiB/s
2023-01-07 10:45:12.248127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 3 with properties:
pciBusID: 0000:61:00.0 name: NVIDIA A40 computeCapability: 8.6
coreClock: 1.74GHz coreCount: 84 deviceMemorySize: 44.56GiB deviceMemoryBandwidth: 648.29GiB/s
2023-01-07 10:45:12.248169: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2023-01-07 10:45:12.653225: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2023-01-07 10:45:12.653308: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2023-01-07 10:45:13.011714: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-01-07 10:45:13.618927: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-01-07 10:45:13.970493: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2023-01-07 10:45:14.179938: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2023-01-07 10:45:14.848669: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2023-01-07 10:45:14.873971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1, 2, 3
10:45:14, DEBUG [refine.py:223] [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
10:45:14, INFO [refine.py:58] Start Iteration1!
10:45:14, WARNING [utils.py:28] The results folder already exists
The old results folder will be renamed (to results~)
2023-01-07 10:45:15.013727: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-07 10:45:15.508665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA A40 computeCapability: 8.6
coreClock: 1.74GHz coreCount: 84 deviceMemorySize: 44.56GiB deviceMemoryBandwidth: 648.29GiB/s
2023-01-07 10:45:15.510222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:23:00.0 name: NVIDIA A40 computeCapability: 8.6
coreClock: 1.74GHz coreCount: 84 deviceMemorySize: 44.56GiB deviceMemoryBandwidth: 648.29GiB/s
2023-01-07 10:45:15.511709: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 2 with properties:
pciBusID: 0000:41:00.0 name: NVIDIA A40 computeCapability: 8.6
coreClock: 1.74GHz coreCount: 84 deviceMemorySize: 44.56GiB deviceMemoryBandwidth: 648.29GiB/s
2023-01-07 10:45:15.513165: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 3 with properties:
pciBusID: 0000:61:00.0 name: NVIDIA A40 computeCapability: 8.6
coreClock: 1.74GHz coreCount: 84 deviceMemorySize: 44.56GiB deviceMemoryBandwidth: 648.29GiB/s
2023-01-07 10:45:15.513248: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2023-01-07 10:45:15.513278: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2023-01-07 10:45:15.513310: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2023-01-07 10:45:15.513342: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-01-07 10:45:15.513362: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-01-07 10:45:15.513381: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2023-01-07 10:45:15.513399: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2023-01-07 10:45:15.513417: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2023-01-07 10:45:15.524693: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1, 2, 3
2023-01-07 10:45:15.524754: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
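
The log above shows this TensorFlow loading libcudart.so.10.1 and libcudnn.so.7. To record exactly which CUDA/cuDNN versions the binary was built against, a small sketch (assuming the installed TensorFlow is 2.3 or newer, which provides tf.sysconfig.get_build_info()) is:

```python
# Sketch: print the CUDA/cuDNN versions this TensorFlow binary was built
# against, to compare with the libraries the log shows being loaded
# (libcudart.so.10.1, libcudnn.so.7). Assumes TensorFlow >= 2.3, which
# provides tf.sysconfig.get_build_info().
import tensorflow as tf

info = tf.sysconfig.get_build_info()
print("is_cuda_build :", info.get("is_cuda_build"))
print("cuda_version  :", info.get("cuda_version"))
print("cudnn_version :", info.get("cudnn_version"))
```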

schmitbp commented Jan 7, 2023

tensorflow.python.framework.errors_impl.ResourceExhaustedError: 5 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[2,32,32,32,128] and type bool on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
[[node replica_1/model/dropout_14/dropout/GreaterEqual (defined at /threading.py:973) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[Func/gradient_tape/replica_2/mean_absolute_error/cond/StatelessIf/then/_149/input/_477/_964]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[2,32,32,32,128] and type bool on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
[[node replica_1/model/dropout_14/dropout/GreaterEqual (defined at /threading.py:973) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(2) Resource exhausted: OOM when allocating tensor with shape[2,32,32,32,128] and type bool on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
[[node replica_1/model/dropout_14/dropout/GreaterEqual (defined at /threading.py:973) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[replica_2/cond_1/then/_247/replica_2/cond_1/cond/switch_pred/_1020/_1022]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(3) Resource exhausted: OOM when allocating tensor with shape[2,32,32,32,128] and type bool on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
[[node replica_1/model/dropout_14/dropout/GreaterEqual (defined at /threading.py:973) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[gradient_tape/mean_absolute_error/cond/StatelessIf/then/_93/gradient_tape/mean_absolute_error/cond/gradients/mean_absolute_error/cond/cond_grad/StatelessIf/then/_676/gradient_tape/mean_absolute_error/cond/gradients/mean_absolute_error/cond/cond_grad/gradients/mean_absolute_error/cond/cond/remove_squeezable_dimensions/cond_grad/StatelessIf/switch_pred/_1635/_1254]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(4) Resource exhausted: OOM when allocating tensor with shape[2,32,32,32,128] and type bool on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
[[node replica_1/model/dropout_14/dropout/GreaterEqual (defined at /threading.py:973) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[replica_3/cond_1/then/_269/replica_3/cond_1/cond/then/_1070/replica_3/cond_1/cond/remove_squeezable_dimensions/cond/switch_pred/_1858/_1186]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_48451]

Function call stack:
train_function -> train_function -> train_function -> train_function -> train_function

2023-01-07 12:18:06.622056: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
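
As a possible way to narrow this down, here is a minimal standalone reproduction I could try outside of IsoNet (my own sketch): allocating the exact tensor shape the OOM reports and running a dropout on GPU:1, to see whether even that small allocation fails in isolation:

```python
# Standalone sketch (not IsoNet code): allocate the exact tensor shape the OOM
# message reports ([2, 32, 32, 32, 128]) on GPU:1 and run a dropout, since the
# failing node is a dropout GreaterEqual op. If this also fails, the problem is
# not specific to IsoNet's training loop.
import tensorflow as tf

with tf.device("/GPU:1"):
    x = tf.random.normal((2, 32, 32, 32, 128))
    y = tf.nn.dropout(x, rate=0.3)
print("Dropout ran, output shape:", y.shape)
```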

procyontao (Collaborator) commented

Hi,

I have not encountered this type of problem before, and you are using A100/A40 cards, both of which should have sufficient VRAM.

It seems that the data cannot be loaded onto the GPU. I think this could be related to an NVIDIA CUDA/driver problem. How did you install the environment? I typically use "conda install" to install the CUDA toolkit and cuDNN in an Anaconda environment.
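
If it helps, a rough smoke test (a sketch, not IsoNet code) for a conda-installed cudatoolkit/cudnn is to run a small 3D convolution on one GPU, since that exercises the same cuDNN kernels that refine needs:

```python
# Rough smoke test (sketch, not IsoNet code): a small 3D convolution on the GPU
# goes through cuDNN, so it quickly shows whether the conda-installed
# cudatoolkit/cudnn and the NVIDIA driver work together.
import tensorflow as tf

with tf.device("/GPU:0"):
    x = tf.random.normal((1, 32, 32, 32, 1))      # NDHWC input volume
    k = tf.random.normal((3, 3, 3, 1, 8))         # depth, height, width, in, out
    y = tf.nn.conv3d(x, k, strides=[1, 1, 1, 1, 1], padding="SAME")
print("conv3d OK, output shape:", y.shape)
```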

schmitbp commented Jan 9, 2023

Hello, thank you for your help. I've attached how the program was installed (as a .png), as well as the .yml file (saved as a .pdf) which contains all the dependencies:
isonet.yml.pdf
[Screenshot: Installation]
