Skip to content

Conversation

@dipannita08
Copy link
Collaborator

@dipannita08 dipannita08 commented Nov 14, 2025

Description

This PR upgrades MaxText to use the Goodput v15 library correctly. This also fixes a potential process contention bug by removing an outdated API call. Additionally, this PR moves monitoring API to use context managers and adds the safe exit + flushing of metrics at the end of the run.

The Goodput library used to previously be pinned to v10. This change removed the pin, causing an automatic upgrade to latest version (ml-goodput-measurement v0.0.15). This version has multiple changes including the move from multithreaded monitoring system to a multiprocess system. It also moved all cumulative computation and monitoring to a single monitoring process to reduce the need to synchronize shared data sources (the goodput cache, cloud logs) and fix critical performance bottlenecks. We recently found an issue where launching two separate processes sequentially (one for cumulative Goodput, one for cumulative step deviation) caused the primary Goodput monitoring process to crash and shut down upon the secondary process starting:

I1029 08:08:10.968331 140099251349312 monitoring.py:754] Cumulative goodput monitoring process started for job: lid-pw-deepsee-1-ylg (PID: 185)
Started Goodput upload to Tensorboard & GCM in the background!
I1029 08:08:10.969586 140099251349312 monitoring.py:45] [PID: 185] Goodput worker process started for job: lid-pw-deepsee-1-ylg
I1029 08:08:10.979298 140099251349312 monitoring.py:858] Step deviation process started for job: lid-pw-deepsee-1-ylg (PID: 222)
Started step time deviation upload to Tensorboard & GCM in the background!
I1029 08:08:10.980461 140099251349312 monitoring.py:114] [PID: 222] Step deviation worker process started for job: lid-pw-deepsee-1-ylg
I1029 08:08:10.979673 140099251349312 monitoring.py:767] Shutting down cumulative goodput process (PID: 185)
I1029 08:08:11.371815 140099251349312 monitoring.py:104] [PID: 185] Goodput worker process for job lid-pw-deepsee-1-ylg stopped.

The ml-goodput-measurement v0.0.15 consolidated the cumulative goodput and step deviation metrics computation and upload into the main Goodput monitoring process. The correct integration in MaxText that previously spun off separate processes for these metrics, is to remove the redundant API call and eliminate the source of contention between two processes that are doing repeated work.

FIXES: b/456054371

Tests

E2E run 1 with no disruptions:

  • Logs
  • Cloud Monitoring:
image

E2E run 2 with disruptions:

  • Logs
  • Cloud Monitoring:
image

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@dipannita08 dipannita08 force-pushed the fix-goodput-v15-integration branch from 155cf62 to 178985e Compare November 14, 2025 20:18
@dipannita08 dipannita08 force-pushed the fix-goodput-v15-integration branch from 178985e to 4fc65f9 Compare November 14, 2025 20:43
@copybara-service copybara-service bot merged commit 62b249a into main Nov 15, 2025
40 checks passed
@copybara-service copybara-service bot deleted the fix-goodput-v15-integration branch November 15, 2025 07:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants