This repository has been archived by the owner on Mar 24, 2021. It is now read-only.

multi-GPU training fails #20

Open
Borda opened this issue Oct 21, 2019 · 2 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@Borda
Owner

Borda commented Oct 21, 2019

The run crashes with a similar error even when training the head:

INFO:root:Train on 14626 samples, val on 1625 samples, with batch size 16.
Epoch 1/150
2019-10-11 23:42:30.041545: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
913/914 [============================>.] - ETA: 1s - loss: 27.9974
2019-10-12 00:00:47.026746: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at tensor_array_ops.cc:661 : Invalid argument: TensorArray replica_0/model_3/yolo_loss/TensorArray_5484: Could not read from TensorArray index 0. Furthermore, the element shape is not fully defined: [?,?,3]. It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written. If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.
2019-10-12 00:00:47.027158: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at tensor_array_ops.cc:661 : Invalid argument: TensorArray replica_0/model_3/yolo_loss/TensorArray_1_5485: Could not read from TensorArray index 0. Furthermore, the element shape is not fully defined: [?,?,3]. It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written. If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.
2019-10-12 00:00:47.027194: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at tensor_array_ops.cc:661 : Invalid argument: TensorArray replica_0/model_3/yolo_loss/TensorArray_2_5486: Could not read from TensorArray index 0. Furthermore, the element shape is not fully defined: [?,?,3]. It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written. If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.
Traceback (most recent call last):
File "scripts/training.py", line 211, in <module>
_main(**arg_params)
File "scripts/training.py", line 182, in _main
callbacks=[tb_logging, checkpoint, reduce_lr, early_stopping])
File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/engine/training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/engine/training_generator.py", line 234, in fit_generator
workers=0)
File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/engine/training.py", line 1472, in evaluate_generator
verbose=verbose)
File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/engine/training_generator.py", line 346, in evaluate_generator
outs = model.test_on_batch(x, y, sample_weight=sample_weight)
File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/engine/training.py", line 1256, in test_on_batch
outputs = self.test_function(ins)
File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
fetched = self._callable_fn(*array_vals)
File "/home/j.borovec/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
run_metadata_ptr)
File "/home/j.borovec/.local/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: TensorArray replica_0/model_3/yolo_loss/TensorArray_5484: Could not read from TensorArray index 0. Furthermore, the element shape is not fully defined: [?,?,3]. It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written. If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.
[[{{node replica_0/model_3/yolo_loss/TensorArrayStack/TensorArrayGatherV3}}]]
[[{{node replica_1/model_3/yolo_loss/ExpandDims_3}}]]

See qqwweee#204 and qqwweee#497.

@Borda
Owner Author

Borda commented Oct 23, 2019

Training fails at the end of an epoch with:

2019-10-23 17:38:42.472726: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at tensor_array_ops.cc:661 : Invalid argument: TensorArray replica_0/model_3/yolo_loss/TensorArray_1689: Could not read from TensorArray index 0.  Furthermore, the element shape is not fully defined: [?,?,3].  It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written.  If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.
2019-10-23 17:38:42.473311: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at tensor_array_ops.cc:661 : Invalid argument: TensorArray replica_0/model_3/yolo_loss/TensorArray_1_1690: Could not read from TensorArray index 0.  Furthermore, the element shape is not fully defined: [?,?,3].  It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written.  If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.
2019-10-23 17:38:42.473343: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at tensor_array_ops.cc:661 : Invalid argument: TensorArray replica_0/model_3/yolo_loss/TensorArray_2_1691: Could not read from TensorArray index 0.  Furthermore, the element shape is not fully defined: [?,?,3].  It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written.  If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.

This happens only with multi-GPU training.
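
One thing that may be worth checking, offered as a sketch rather than a diagnosis: "Could not read from TensorArray index 0" means the array holds no elements, which can happen when a per-sample loop runs zero times on an empty input. The `replica_0`/`replica_1` name scopes suggest the model is wrapped with Keras 2.x `keras.utils.multi_gpu_model`, but the issue does not confirm this, so treat it as an assumption. Under that assumption, each batch is split so that every replica except the last receives `batch_size // n_replicas` samples and the last replica receives the remainder. The helper names below are hypothetical, and the number of GPUs is not stated in the issue.

```python
def replica_slice_sizes(batch_size, n_replicas):
    """Samples each replica receives, assuming Keras 2.x multi_gpu_model slicing.

    Assumption: all replicas but the last get batch_size // n_replicas samples,
    and the last replica gets whatever is left over.
    """
    step = batch_size // n_replicas
    sizes = [step] * (n_replicas - 1)
    sizes.append(batch_size - step * (n_replicas - 1))
    return sizes


def has_empty_replica(batch_size, n_replicas):
    """True if at least one replica would receive zero samples."""
    return any(size == 0 for size in replica_slice_sizes(batch_size, n_replicas))


if __name__ == "__main__":
    # Batch size 16 is taken from the log; the smaller sizes stand in for a
    # hypothetical short batch at the end of an epoch.
    for batch_size in (16, 9, 2, 1):
        for n_replicas in (2, 4):
            sizes = replica_slice_sizes(batch_size, n_replicas)
            print(batch_size, n_replicas, sizes, has_empty_replica(batch_size, n_replicas))
```

If any batch reaching `test_on_batch` or `train_on_batch` is smaller than the number of GPUs, the first replicas would receive empty slices under this scheme. Logging the batch sizes the data generator actually yields would confirm or rule this out.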

Borda mentioned this issue Oct 24, 2019
Borda added the bug and help wanted labels Nov 12, 2019