Gate TPU power profiling events behind profile_power_events flag #3303

Merged
copybara-service[bot] merged 1 commit into AI-Hypercomputer:main from abhinavgoel95:fix/gate-tpu-power-profiling-events
Mar 4, 2026

@abhinavgoel95
Contributor

Problem

Using profiler=xplane on a GPU host fails because advanced_configuration is
unconditionally populated with TPU-specific keys (tpu_power_trace_level,
e2e_enable_fw_*_event). CUPTI does not recognise these keys and throws
INVALID_ARGUMENT, aborting the GPU trace and leaving only CPU traces behind.

Fix

Introduce a profile_power_events flag in base.yml, defaulting to False. The
advanced_configuration block in profiler.py is now gated behind this flag, so
GPU runs work out of the box by default; TPU users who want power/thermal tracing
can opt in with profile_power_events=True.
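
The gating described above can be sketched as follows. This is a minimal illustration, not the actual code in profiler.py: `ProfilerConfig` and `build_advanced_configuration` are hypothetical names, the value assigned to `tpu_power_trace_level` is a placeholder, and the `e2e_enable_fw_*_event` keys are left out because their full names are not spelled out in this description.

```python
from dataclasses import dataclass


@dataclass
class ProfilerConfig:
    """Hypothetical stand-in for the run config; only the fields relevant here."""
    profiler: str = "xplane"
    profile_power_events: bool = False  # default stated for base.yml in this PR


def build_advanced_configuration(config: ProfilerConfig) -> dict:
    """Return TPU-specific profiler options only when the user opted in."""
    advanced = {}
    if config.profile_power_events:
        # Placeholder value (assumption); the real value lives in profiler.py.
        advanced["tpu_power_trace_level"] = 1
        # The e2e_enable_fw_*_event keys would also be set here; their exact
        # names are elided in the PR text, so they are omitted from this sketch.
    return advanced


# Default (GPU-safe): nothing TPU-specific is passed to the tracer.
gpu_options = build_advanced_configuration(ProfilerConfig())

# Opt-in (TPU power/thermal tracing): TPU keys are populated.
tpu_options = build_advanced_configuration(ProfilerConfig(profile_power_events=True))
```

With the default config the returned dict is empty, so a GPU tracer would not be handed any key it does not recognise; the TPU keys appear only after an explicit opt-in.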

Testing

  • GPU: profiler=xplane now captures full GPU traces without INVALID_ARGUMENT error
  • TPU: existing behaviour preserved when profile_power_events=True
  • Default profile_power_events=False is backward-compatible for all existing runs

@codecov

codecov Bot commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/maxtext/common/profiler.py | 0.00% | 0 Missing and 1 partial ⚠️ |

On GPU, the CUPTI tracer does not recognize TPU-specific keys in
advanced_configuration (tpu_power_trace_level, e2e_enable_fw_*_event),
causing an INVALID_ARGUMENT error that aborts the GPU xplane trace and
leaves only CPU traces behind.

Introduce a profile_power_events flag (default False) that gates the
TPU-specific advanced_configuration block. GPU runs are unaffected by
default; TPU users who want power/thermal tracing can opt in with
profile_power_events=True.
@abhinavgoel95 abhinavgoel95 force-pushed the fix/gate-tpu-power-profiling-events branch from 000d6d0 to ea55a0d Compare March 3, 2026 23:52
@copybara-service copybara-service Bot merged commit 12fe4ce into AI-Hypercomputer:main Mar 4, 2026
17 of 22 checks passed
@abhinavgoel95 abhinavgoel95 deleted the fix/gate-tpu-power-profiling-events branch March 5, 2026 00:16
