Upgrade MaxText to work with Goodput v15 #2687
Merged
Description
This PR upgrades MaxText to use the Goodput v15 library correctly. It also fixes a potential process contention bug by removing an outdated API call. Additionally, this PR moves the monitoring API calls to context managers and adds a safe exit that flushes metrics at the end of the run.
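For reviewers less familiar with the pattern, here is a minimal sketch of what "context manager plus safe exit and flush" means in general. It does not use the Goodput API: `FakeMonitor`, `monitoring`, and `run_training` are hypothetical names made up for illustration, and the real integration in this PR may differ.

```python
import contextlib


class FakeMonitor:
    """Hypothetical stand-in for a monitoring client (NOT the Goodput API)."""

    def __init__(self):
        self.running = False
        self.pending = []   # metrics recorded but not yet uploaded
        self.uploaded = []  # metrics that have been flushed

    def start(self):
        self.running = True

    def record(self, metric):
        self.pending.append(metric)

    def flush(self):
        # Move everything still buffered to the "uploaded" sink.
        self.uploaded.extend(self.pending)
        self.pending.clear()

    def stop(self):
        self.running = False


@contextlib.contextmanager
def monitoring(monitor):
    """Start monitoring on entry; always flush and stop on exit."""
    monitor.start()
    try:
        yield monitor
    finally:
        # Runs on normal completion and when the body raises.
        try:
            monitor.flush()
        finally:
            monitor.stop()


def run_training(monitor, steps, fail_at=None):
    """Toy training loop; fail_at simulates a disruption at that step."""
    with monitoring(monitor) as m:
        for step in range(steps):
            if step == fail_at:
                raise RuntimeError("simulated disruption")
            m.record(("step", step))
```

The point of the `try`/`finally` is that buffered metrics are flushed and the monitor is stopped even when the run is interrupted by an exception, so there is no separate cleanup call for the caller to forget.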
The Goodput library was previously pinned to v10. This change removed the pin, causing an automatic upgrade to the latest version (ml-goodput-measurement v0.0.15). This version has multiple changes, including the move from a multithreaded monitoring system to a multiprocess one. It also moved all cumulative computation and monitoring into a single monitoring process to reduce the need to synchronize shared data sources (the Goodput cache, cloud logs) and to fix critical performance bottlenecks. We recently found an issue where launching two separate processes sequentially (one for cumulative Goodput, one for cumulative step deviation) caused the primary Goodput monitoring process to crash and shut down when the secondary process started.
ml-goodput-measurement v0.0.15 consolidated the computation and upload of the cumulative Goodput and step deviation metrics into the main Goodput monitoring process. MaxText previously spun off separate processes for these metrics, so the correct integration is to remove the redundant API call, which eliminates the contention between two processes doing repeated work.
FIXES: b/456054371
Tests
E2E run 1 with no disruptions:
E2E run 2 with disruptions:
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.