Skip to content

Elastic events goodput recording integration#3901

Open
dipannita08 wants to merge 1 commit into
mainfrom
record-elastic-goodput-events
Open

Elastic events goodput recording integration#3901
dipannita08 wants to merge 1 commit into
mainfrom
record-elastic-goodput-events

Conversation

@dipannita08
Copy link
Copy Markdown
Collaborator

@dipannita08 dipannita08 commented May 13, 2026

Description

This PR instruments the elastic training implementation with GoodputRecorder API calls to capture the breakdown of time during elastic events. This is replica of this PR which tested these changes on replica-resize for GA2.

If the change fixes a bug or a Github issue, please include a link, e.g.,:
FIXES: b/123456
FIXES: #123456

Tests

  • Example logs generated for an elastic event here
  • Workload Overview timeline for above workload: here
  • Elastic events timeline for above workload: here

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 14, 2026

Codecov Report

❌ Patch coverage is 14.28571% with 18 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/maxtext/trainers/pre_train/train.py 5.26% 17 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

@dipannita08 dipannita08 changed the title Custom goodput recording for elastic training events Elastic events goodput recording integration May 14, 2026
Copy link
Copy Markdown
Collaborator

@lukebaumann lukebaumann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were you able to run at a small scale with these changes and confirm that the events are logged correctly?

from pathwaysutils.elastic import manager

elastic_manager: manager.Manager | None = None
_pending_reinit_recorder = None
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This shouldn't have an underscore prefix since it is used in train.py, right?

if recorder:
recorder.record_custom_badput_event_end_time(custom_badput_event_type=event_type)
recorder.record_custom_badput_event_start_time(custom_badput_event_type='elastic_reinitialization')
elastic_utils._pending_reinit_recorder = recorder
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be updated through a setter function in elastic_utils?

@shralex
Copy link
Copy Markdown
Collaborator

shralex commented May 14, 2026

@gemini-cli /investigate

@github-actions
Copy link
Copy Markdown

🤖 Hi @shralex, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@github-actions
Copy link
Copy Markdown

🤖 CI Failure Investigation Report

I have analyzed the recent test failures in the CI pipeline for PR #3901 and identified that they are not caused by the changes in this PR. Instead, they appear to be regressions introduced by the JAX 0.10.0 upgrade (PR #3854), which was recently integrated into the CI environment.

🔍 What Failed

  • Job/Matrix: maxtext_jupyter_notebooks / run

  • Error: ImportError: cannot import name 'get_terminal_size' from 'click.termui'

  • Impact: All Jupyter notebook runs failed.

  • Job/Matrix: cpu-post-training-unit tests / tpu-post-training-unit tests

  • Error: Segmentation fault (exit code 139)

  • Impact: All post-training unit tests failed.

  • Job/Matrix: maxtext_tpu_pathways_unit_tests

  • Error: AssertionError: Dataset c4: could not find data in /root/tensorflow_datasets

  • Impact: Pathways unit tests failed.

🪵 Error Details & Stack Trace

ImportError in Jupyter Notebooks:

  File "/__w/maxtext/maxtext/.venv/lib/python3.12/site-packages/huggingface_hub/cli/hf.py", line 19, in <module>
    import typer
  File "/__w/maxtext/maxtext/.venv/lib/python3.12/site-packages/typer/__init__.py", line 12, in <module>
    from click.termui import get_terminal_size as get_terminal_size
ImportError: cannot import name 'get_terminal_size' from 'click.termui'

Segmentation fault in Post-Training Tests:

Segmentation fault      $PYTHON_EXE -m pytest ${INPUTS_PYTEST_ADDOPTS} -v -m "${FINAL_PYTEST_MARKER}" --durations=0 $PYTEST_COV_ARGS $SPLIT_ARGS ${INPUTS_PYTEST_EXTRA_ARGS}

💡 Root Cause Analysis & Context

Confidence: High

The failures are highly correlated with the JAX 0.10.0 upgrade and the associated dependency updates in PR #3854:

  1. ImportError: click was updated to version 8.3.3. This version (and other 8.x versions) removed click.termui.get_terminal_size, which is still used by older versions of typer. huggingface_hub depends on typer, leading to this breakage in the Jupyter notebook environment.
  2. Segmentation fault: Post-training tests (which do not use the changes in PR Elastic events goodput recording integration #3901) are crashing with a segmentation fault. This strongly suggests a regression or instability in the newly updated JAX 0.10.0 or its interaction with the TPU/CPU environment.
  3. Dataset Access: The AssertionError regarding the C4 dataset in Pathways tests suggests an environment configuration issue or a change in how tensorflow-datasets (updated to 4.9.9) interacts with the CI filesystem.

🛠️ Recommended Action

Since these issues are regressions on main (or the merged state of main and recent PRs), they should be addressed in a separate effort to stabilize the JAX 0.10.0 upgrade. PR #3901 itself appears to be safe and unrelated to these failures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants