Exceptions and failures when using MultiWorkerMirroredStrategy #373

Closed
ma-siddiqui opened this issue Nov 16, 2020 · 6 comments
Assignees: dathudeptrai
Labels: bug 🐛 Something isn't working, wontfix

Comments

@ma-siddiqui

When I use tf.distribute.experimental.MultiWorkerMirroredStrategy to run training on multiple machines, I face the following errors. Please advise if any other changes are needed.

2020-11-16 12:03:50,968 (cross_device_ops:1130) INFO: Collective batch_all_reduce for IndexedSlices: 1 all-reduces, group_size = 2
2020-11-16 12:03:56.443402: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:439] error: Internal: Complete shape not known for AdamWeightDecay/allreduce/CollectiveReduce_23
2020-11-16 12:03:56.443474: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:1121] error: Internal: Complete shape not known for AdamWeightDecay/allreduce/CollectiveReduce_23
2020-11-16 12:03:56.443606: E tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:1138] ScopedAllocatorOptimizer: Internal: Complete shape not known for AdamWeightDecay/allreduce/CollectiveReduce_23
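
For context, a minimal multi-worker setup typically looks like the sketch below. This is only illustrative: the cluster addresses, model, and data are placeholders, not the reporter's actual training code, and each worker would set its own task index in TF_CONFIG.

```python
import json
import os

import tensorflow as tf

# TF_CONFIG must be set on every worker before the strategy is created.
# The host addresses below are placeholders; each worker uses its own "index".
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},
    "task": {"type": "worker", "index": 0},
})

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Any Keras model works here; a tiny placeholder model is used for brevity.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")

# Dummy data stands in for the real training dataset.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([64, 10]), tf.random.normal([64, 1]))
).batch(8)

model.fit(dataset, epochs=1)
```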

@dathudeptrai
Collaborator

> When I use tf.distribute.experimental.MultiWorkerMirroredStrategy to run training on multiple machines, I face the following errors. Please advise if any other changes are needed.
>
> 2020-11-16 12:03:50,968 (cross_device_ops:1130) INFO: Collective batch_all_reduce for IndexedSlices: 1 all-reduces, group_size = 2
> 2020-11-16 12:03:56.443402: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:439] error: Internal: Complete shape not known for AdamWeightDecay/allreduce/CollectiveReduce_23
> 2020-11-16 12:03:56.443474: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:1121] error: Internal: Complete shape not known for AdamWeightDecay/allreduce/CollectiveReduce_23
> 2020-11-16 12:03:56.443606: E tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:1138] ScopedAllocatorOptimizer: Internal: Complete shape not known for AdamWeightDecay/allreduce/CollectiveReduce_23

Can you try replacing AdamWeightDecay with plain Adam first?
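
Testing this suggestion could look roughly like the sketch below. This is only a sketch: the learning rate and the exact place where the optimizer is built depend on the actual training configuration, which is not shown in this thread.

```python
import tensorflow as tf

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Build plain Adam instead of the custom AdamWeightDecay, to check whether
    # the "Complete shape not known" allreduce warnings still appear.
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
```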

@ma-siddiqui
Author

ma-siddiqui commented Nov 20, 2020

Yes, I tried that, but no luck; I hit the same error again.

@ma-siddiqui
Author

> > When I use tf.distribute.experimental.MultiWorkerMirroredStrategy to run training on multiple machines, I face the following errors. Please advise if any other changes are needed.
> > 2020-11-16 12:03:50,968 (cross_device_ops:1130) INFO: Collective batch_all_reduce for IndexedSlices: 1 all-reduces, group_size = 2
> > 2020-11-16 12:03:56.443402: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:439] error: Internal: Complete shape not known for AdamWeightDecay/allreduce/CollectiveReduce_23
> > 2020-11-16 12:03:56.443474: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:1121] error: Internal: Complete shape not known for AdamWeightDecay/allreduce/CollectiveReduce_23
> > 2020-11-16 12:03:56.443606: E tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:1138] ScopedAllocatorOptimizer: Internal: Complete shape not known for AdamWeightDecay/allreduce/CollectiveReduce_23
>
> Can you try replacing AdamWeightDecay with plain Adam first?

Yes, I tried that, but no luck; I hit the same error again.

@dathudeptrai dathudeptrai self-assigned this Nov 23, 2020
@dathudeptrai dathudeptrai added the bug 🐛 Something isn't working label Nov 23, 2020
@ma-siddiqui
Author

Hi, any update?

@ma-siddiqui
Author

Hi, just to confirm: will the fix below solve my problem? Please confirm whether it addresses this bug.

Support Multi-GPU gradient Accumulate for trainer. #377
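
For reference, gradient accumulation generally works by summing gradients over several micro-batches and applying them once. The sketch below illustrates the idea only; it is not the code from #377, and the function name, loss, and accumulation count are illustrative.

```python
import tensorflow as tf

def accumulate_and_apply(model, optimizer, loss_fn, batches, accum_steps=4):
    """Sum gradients over `accum_steps` micro-batches, then apply them once."""
    accumulated = [tf.zeros_like(v) for v in model.trainable_variables]
    for step, (x, y) in enumerate(batches, start=1):
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        # Assumes every trainable variable receives a gradient on each step.
        grads = tape.gradient(loss, model.trainable_variables)
        accumulated = [a + g for a, g in zip(accumulated, grads)]
        if step % accum_steps == 0:
            # Average the accumulated gradients and update the weights once.
            optimizer.apply_gradients(
                [(a / accum_steps, v)
                 for a, v in zip(accumulated, model.trainable_variables)]
            )
            accumulated = [tf.zeros_like(v) for v in model.trainable_variables]
```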

@stale

stale bot commented Jan 29, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the wontfix label Jan 29, 2021
@stale stale bot closed this as completed Feb 5, 2021