multi-GPU training fails #20

Borda · 2019-10-21T17:18:29Z

Training crashes with a similar error even when training only the head:

INFO:root:Train on 14626 samples, val on 1625 samples, with batch size 16.
Epoch 1/150
2019-10-11 23:42:30.041545: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
913/914 [============================>.] - ETA: 1s - loss: 27.9974
2019-10-12 00:00:47.026746: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at tensor_array_ops.cc:661 : Invalid argument: TensorArray replica_0/model_3/yolo_loss/TensorArray_5484: Could not read from TensorArray index 0. Furthermore, the element shape is not fully defined: [?,?,3]. It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written. If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.
2019-10-12 00:00:47.027158: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at tensor_array_ops.cc:661 : Invalid argument: TensorArray replica_0/model_3/yolo_loss/TensorArray_1_5485: Could not read from TensorArray index 0. Furthermore, the element shape is not fully defined: [?,?,3]. It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written. If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.
2019-10-12 00:00:47.027194: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at tensor_array_ops.cc:661 : Invalid argument: TensorArray replica_0/model_3/yolo_loss/TensorArray_2_5486: Could not read from TensorArray index 0. Furthermore, the element shape is not fully defined: [?,?,3]. It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written. If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.
Traceback (most recent call last):
  File "scripts/training.py", line 211, in <module>
    _main(**arg_params)
  File "scripts/training.py", line 182, in _main
    callbacks=[tb_logging, checkpoint, reduce_lr, early_stopping])
  File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/engine/training_generator.py", line 234, in fit_generator
    workers=0)
  File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/engine/training.py", line 1472, in evaluate_generator
    verbose=verbose)
  File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/engine/training_generator.py", line 346, in evaluate_generator
    outs = model.test_on_batch(x, y, sample_weight=sample_weight)
  File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/engine/training.py", line 1256, in test_on_batch
    outputs = self.test_function(ins)
  File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/home/j.borovec/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/home/j.borovec/.local/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: TensorArray replica_0/model_3/yolo_loss/TensorArray_5484: Could not read from TensorArray index 0. Furthermore, the element shape is not fully defined: [?,?,3]. It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written. If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.
[[{{node replica_0/model_3/yolo_loss/TensorArrayStack/TensorArrayGatherV3}}]]
[[{{node replica_1/model_3/yolo_loss/ExpandDims_3}}]]

See qqwweee#204 and qqwweee#497.


Borda · 2019-10-21T18:27:23Z

https://stackoverflow.com/questions/56813036/could-not-read-from-tensorarray-index-0-possible-you-are-working-with-resizeabl

Borda · 2019-10-23T15:41:36Z

At the end of an epoch, training fails with:

2019-10-23 17:38:42.472726: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at tensor_array_ops.cc:661 : Invalid argument: TensorArray replica_0/model_3/yolo_loss/TensorArray_1689: Could not read from TensorArray index 0.  Furthermore, the element shape is not fully defined: [?,?,3].  It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written.  If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.
2019-10-23 17:38:42.473311: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at tensor_array_ops.cc:661 : Invalid argument: TensorArray replica_0/model_3/yolo_loss/TensorArray_1_1690: Could not read from TensorArray index 0.  Furthermore, the element shape is not fully defined: [?,?,3].  It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written.  If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.
2019-10-23 17:38:42.473343: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at tensor_array_ops.cc:661 : Invalid argument: TensorArray replica_0/model_3/yolo_loss/TensorArray_2_1691: Could not read from TensorArray index 0.  Furthermore, the element shape is not fully defined: [?,?,3].  It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written.  If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.

This happens only with multi-GPU training.
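One possible explanation, which is an assumption and not something confirmed in this thread: when a batch is split across replicas, a very small batch can leave one replica with zero samples, and a loss that loops over the samples would then never write to its TensorArray. The sketch below only illustrates that arithmetic. The slicing rule (integer division, remainder to the last replica) and the helper names `replica_slice_sizes` and `has_empty_replica` are hypothetical and may not match what the Keras multi-GPU wrapper actually does.

```python
def replica_slice_sizes(batch_size, n_replicas):
    """Per-replica sample counts under an ASSUMED slicing rule:
    every replica gets batch_size // n_replicas samples and the
    last replica additionally takes the remainder."""
    step = batch_size // n_replicas
    sizes = [step] * (n_replicas - 1)
    sizes.append(batch_size - step * (n_replicas - 1))
    return sizes


def has_empty_replica(batch_size, n_replicas):
    """True if at least one replica would receive zero samples."""
    return any(size == 0 for size in replica_slice_sizes(batch_size, n_replicas))


if __name__ == "__main__":
    # Batch size 16 is taken from the log above; the smaller values and
    # the replica count of 2 are made up for illustration.
    for batch_size in (16, 9, 2, 1):
        sizes = replica_slice_sizes(batch_size, 2)
        print(batch_size, sizes, has_empty_replica(batch_size, 2))
```

If this is the cause, making sure every batch (including the last one of an epoch and the validation batches) holds at least as many samples as there are GPUs should avoid the empty slice. I have not verified this against the setup in this issue.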

Borda mentioned this issue Oct 24, 2019

porting to TF 2.0 #21

Closed

Borda added the bug and help wanted labels on Nov 12, 2019
