Exceptions and failures when using MultiWorkerMirroredStrategy #373

Closed
ma-siddiqui opened this issue Nov 16, 2020 · 6 comments
Assignees: dathudeptrai
Labels: bug 🐛 Something isn't working, wontfix

Comments

@ma-siddiqui

When I use tf.distribute.experimental.MultiWorkerMirroredStrategy to run training on multiple machines, I face the following errors. Please advise if any other changes are needed.

2020-11-16 12:03:50,968 (cross_device_ops:1130) INFO: Collective batch_all_reduce for IndexedSlices: 1 all-reduces, group_size = 2
2020-11-16 12:03:56.443402: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:439] error: Internal: Complete shape not known for AdamWeightDecay/allreduce/CollectiveReduce_23
2020-11-16 12:03:56.443474: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:1121] error: Internal: Complete shape not known for AdamWeightDecay/allreduce/CollectiveReduce_23
2020-11-16 12:03:56.443606: E tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:1138] ScopedAllocatorOptimizer: Internal: Complete shape not known for AdamWeightDecay/allreduce/CollectiveReduce_23
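
For context, a minimal multi-worker setup typically looks like the sketch below. This is only illustrative: the cluster addresses, model, and data are placeholders, not the reporter's actual training code, and each worker would set its own task index in TF_CONFIG.

```python
import json
import os

import tensorflow as tf

# TF_CONFIG must be set on every worker before the strategy is created.
# The host addresses below are placeholders; each worker uses its own "index".
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},
    "task": {"type": "worker", "index": 0},
})

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Any Keras model works here; a tiny placeholder model is used for brevity.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")

# Dummy data stands in for the real training dataset.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([64, 10]), tf.random.normal([64, 1]))
).batch(8)

model.fit(dataset, epochs=1)
```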

@dathudeptrai
Collaborator

> When I use tf.distribute.experimental.MultiWorkerMirroredStrategy to run training on multiple machines, I face the following errors. Please advise if any other changes are needed.
>
> 2020-11-16 12:03:50,968 (cross_device_ops:1130) INFO: Collective batch_all_reduce for IndexedSlices: 1 all-reduces, group_size = 2
> 2020-11-16 12:03:56.443402: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:439] error: Internal: Complete shape not known for AdamWeightDecay/allreduce/CollectiveReduce_23
> 2020-11-16 12:03:56.443474: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:1121] error: Internal: Complete shape not known for AdamWeightDecay/allreduce/CollectiveReduce_23
> 2020-11-16 12:03:56.443606: E tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:1138] ScopedAllocatorOptimizer: Internal: Complete shape not known for AdamWeightDecay/allreduce/CollectiveReduce_23

Can you try replacing AdamWeightDecay with plain Adam first?
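
Testing this suggestion could look roughly like the sketch below. This is only a sketch: the learning rate and the exact place where the optimizer is built depend on the actual training configuration, which is not shown in this thread.

```python
import tensorflow as tf

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Build plain Adam instead of the custom AdamWeightDecay, to check whether
    # the "Complete shape not known" allreduce warnings still appear.
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
```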

@ma-siddiqui
Author

ma-siddiqui commented Nov 20, 2020

Yes, I tried that, but no luck; I hit the same error again.

@ma-siddiqui
Author

> > When I use tf.distribute.experimental.MultiWorkerMirroredStrategy to run training on multiple machines, I face the following errors. Please advise if any other changes are needed.
> > 2020-11-16 12:03:50,968 (cross_device_ops:1130) INFO: Collective batch_all_reduce for IndexedSlices: 1 all-reduces, group_size = 2
> > 2020-11-16 12:03:56.443402: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:439] error: Internal: Complete shape not known for AdamWeightDecay/allreduce/CollectiveReduce_23
> > 2020-11-16 12:03:56.443474: W tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:1121] error: Internal: Complete shape not known for AdamWeightDecay/allreduce/CollectiveReduce_23
> > 2020-11-16 12:03:56.443606: E tensorflow/core/grappler/optimizers/scoped_allocator_optimizer.cc:1138] ScopedAllocatorOptimizer: Internal: Complete shape not known for AdamWeightDecay/allreduce/CollectiveReduce_23
>
> Can you try replacing AdamWeightDecay with plain Adam first?

Yes, I tried that, but no luck; I hit the same error again.

@dathudeptrai dathudeptrai self-assigned this Nov 23, 2020
@dathudeptrai dathudeptrai added the bug 🐛 Something isn't working label Nov 23, 2020
@ma-siddiqui
Author

Hi, any update?

@ma-siddiqui
Author

Hi, just to confirm: will the fix below solve my problem? Please confirm whether it addresses this bug.

Support Multi-GPU gradient Accumulate for trainer. #377
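
For reference, gradient accumulation generally works by summing gradients over several micro-batches and applying them once. The sketch below illustrates the idea only; it is not the code from #377, and the function name, loss, and accumulation count are illustrative.

```python
import tensorflow as tf

def accumulate_and_apply(model, optimizer, loss_fn, batches, accum_steps=4):
    """Sum gradients over `accum_steps` micro-batches, then apply them once."""
    accumulated = [tf.zeros_like(v) for v in model.trainable_variables]
    for step, (x, y) in enumerate(batches, start=1):
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        # Assumes every trainable variable receives a gradient on each step.
        grads = tape.gradient(loss, model.trainable_variables)
        accumulated = [a + g for a, g in zip(accumulated, grads)]
        if step % accum_steps == 0:
            # Average the accumulated gradients and update the weights once.
            optimizer.apply_gradients(
                [(a / accum_steps, v)
                 for a, v in zip(accumulated, model.trainable_variables)]
            )
            accumulated = [tf.zeros_like(v) for v in model.trainable_variables]
```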

@stale

stale bot commented Jan 29, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the wontfix label Jan 29, 2021
@stale stale bot closed this as completed Feb 5, 2021