Gate TPU power profiling events behind profile_power_events flag #3303
Merged
copybara-service[bot] merged 1 commit into AI-Hypercomputer:main on Mar 4, 2026
gobbleturk
approved these changes
Mar 3, 2026
hengtaoguo
approved these changes
Mar 3, 2026
On GPU, the CUPTI tracer does not recognize TPU-specific keys in advanced_configuration (tpu_power_trace_level, e2e_enable_fw_*_event), causing an INVALID_ARGUMENT error that aborts the GPU xplane trace and leaves only CPU traces behind. Introduce a profile_power_events flag (default False) that gates the TPU-specific advanced_configuration block. GPU runs are unaffected by default; TPU users who want power/thermal tracing can opt in with profile_power_events=True.
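The failure mode described above can be illustrated with a toy model. This is only a sketch, assuming a tracer that rejects any option key it does not know. `start_gpu_trace`, `KNOWN_GPU_OPTIONS`, `InvalidArgumentError`, and the value `1` are hypothetical names and placeholders for illustration, not the CUPTI tracer's real interface.

```python
# Toy model only: NOT the CUPTI tracer's real interface.
# It mimics a tracer that validates option keys strictly.

# Hypothetical set of keys the GPU tracer understands (made up for this sketch).
KNOWN_GPU_OPTIONS = {"hypothetical_gpu_option"}


class InvalidArgumentError(ValueError):
  """Stands in for the INVALID_ARGUMENT status mentioned in the commit message."""


def start_gpu_trace(advanced_configuration):
  """Start a pretend GPU trace, rejecting any key the tracer does not know."""
  unknown = sorted(set(advanced_configuration) - KNOWN_GPU_OPTIONS)
  if unknown:
    # A single unknown key aborts the whole GPU trace.
    raise InvalidArgumentError(f"unrecognized keys: {unknown}")
  return "gpu trace started"


# An empty configuration is accepted.
print(start_gpu_trace({}))

# A TPU-only key is rejected (the value 1 is a placeholder).
try:
  start_gpu_trace({"tpu_power_trace_level": 1})
except InvalidArgumentError as err:
  print(f"GPU trace aborted: {err}")
```

Under this model, passing an empty configuration on GPU is always safe, which is why defaulting the flag to False leaves GPU runs unaffected.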
000d6d0 to ea55a0d
12fe4ce into AI-Hypercomputer:main
17 of 22 checks passed
Problem
Using profiler=xplane on a GPU host fails because advanced_configuration is unconditionally populated with TPU-specific keys (tpu_power_trace_level, e2e_enable_fw_*_event). CUPTI does not recognise these keys and throws INVALID_ARGUMENT, aborting the GPU trace and leaving only CPU traces behind.

Fix
Introduce a profile_power_events: False flag in base.yml. The advanced_configuration block in profiler.py is now gated behind this flag. GPU runs work out of the box by default; TPU users who want power/thermal tracing can opt in with profile_power_events=True.

Testing
profiler=xplane now captures full GPU traces without INVALID_ARGUMENT error
profile_power_events=True
profile_power_events=False is backward-compatible for all existing runs
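
The gating described under Fix can be sketched roughly as follows. This is a minimal illustration, not the actual profiler.py change: `ProfilerConfig` and `build_advanced_configuration` are hypothetical names, the value `1` for tpu_power_trace_level is a placeholder, and the e2e_enable_fw_*_event keys are left out because their full names are not spelled out here.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ProfilerConfig:
  """Hypothetical stand-in for the config fields relevant to this PR."""
  profiler: str = "xplane"
  profile_power_events: bool = False  # default introduced by this PR


def build_advanced_configuration(config: ProfilerConfig) -> Dict[str, Any]:
  """Return TPU-specific profiler options only when the user opts in."""
  advanced: Dict[str, Any] = {}
  if config.profile_power_events:
    # Placeholder value; the real setting lives in profiler.py.
    advanced["tpu_power_trace_level"] = 1
    # The e2e_enable_fw_*_event keys would be added here as well.
  return advanced


# Default: nothing TPU-specific is passed, so a GPU tracer sees no unknown keys.
print(build_advanced_configuration(ProfilerConfig()))

# Opt-in: TPU users get the power-tracing keys.
print(build_advanced_configuration(ProfilerConfig(profile_power_events=True)))
```

Returning an empty mapping by default keeps the decision in one place: callers do not need to know which backend they are on, only whether the user asked for power events.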