Skip to content

Resolve TF saved model not portable issue with tf.keras.optimizers #4031

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

supercharleszhu
Copy link
Contributor

@supercharleszhu supercharleszhu commented Mar 15, 2024

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Fixes #4028
Root Cause: the self._allreduce_grad will be treated as a tf.function when we initialize the optimizer outside tf.function(code). Since all the concrete function will be saved into saved_model.pb (see code here), it will cause the HorovodAllRuduce Op to be saved into the graph, which might not be registered in other environment.

When using the tf.keras.optimzer.legacy, It will not reach this path because when we call model.fit, the optimizer.minimize will call _compute_gradients which is not overriden by compute_gradient function in DistributedOptimizer, so the distributed optimizer is not taking effect at all!

Resolution

We don't need to explicitly register the allreduce_grad as a tf function, as in graph mode, it will be trace in the outer tf.function. In this case, the horovod ops will not be saved explicitly

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

Copy link

Unit Test Results

   568 files   -   156    568 suites   - 156   6h 41m 28s ⏱️ - 1h 24m 39s
   887 tests ±    0    692 ✅  -    76    195 💤 + 76  0 ❌ ±0 
12 738 runs   - 3 471  8 641 ✅  - 2 712  4 097 💤  - 759  0 ❌ ±0 

Results for commit 36ea180. ± Comparison against base commit 9f88e1d.

This pull request skips 76 tests.
test.parallel.test_mxnet1.MX1Tests ‑ test_gluon_trainer
test.parallel.test_mxnet1.MX1Tests ‑ test_gpu_required
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allgather_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error
test.parallel.test_tensorflow.TensorFlowTests ‑ test_gpu_required
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_fused_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_grad_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_variable_size_fused_gpu
…

Copy link

Unit Test Results (with flaky tests)

   593 files   -   295    593 suites   - 295   6h 50m 12s ⏱️ - 2h 8m 29s
   887 tests ±    0    691 ✅  -    77    195 💤 +   76  0 ❌ ±0  1 🔥 +1 
13 081 runs   - 7 158  8 926 ✅  - 4 863  4 154 💤  - 2 296  0 ❌ ±0  1 🔥 +1 

For more details on these errors, see this check.

Results for commit 36ea180. ± Comparison against base commit 9f88e1d.

This pull request skips 76 tests.
test.parallel.test_mxnet1.MX1Tests ‑ test_gluon_trainer
test.parallel.test_mxnet1.MX1Tests ‑ test_gpu_required
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allgather_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error
test.parallel.test_tensorflow.TensorFlowTests ‑ test_gpu_required
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_fused_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_grad_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_variable_size_fused_gpu
…

@EnricoMi
Copy link
Collaborator

EnricoMi commented Mar 21, 2024

Needs #4033 and #4035.

@supercharleszhu
Copy link
Contributor Author

@EnricoMi Has the CI been fixed? Seems that above 2 prs are checked in.

Copy link

stale bot commented Jan 31, 2025

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jan 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

Tensorflow Saved model not portable with latest tf.keras.optimizers
2 participants