[pull] master from tensorflow:master#2575
Merged
pull[bot] merged 16 commits into Mu-L:master from tensorflow:master on Jan 14, 2026
Conversation
PiperOrigin-RevId: 856063702
Reverts 68f1213 PiperOrigin-RevId: 856065178
PiperOrigin-RevId: 856068335
This change updates the `opaque` field in `CustomCallThunkProto` and the `str` fields in XLA FFI attribute protos from `string` to `bytes`. This allows these fields to contain arbitrary byte sequences, including non-UTF8 data, without causing proto parsing errors. New tests are added to verify that parsing succeeds with non-UTF8 content. PiperOrigin-RevId: 856069026
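The motivation is easy to see with plain UTF-8 validation: proto3 requires `string` fields to hold valid UTF-8, so an opaque payload containing arbitrary bytes would fail to parse, while `bytes` fields impose no such constraint. A minimal Python sketch of the check (the payload is made up):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Mirror the UTF-8 check that proto3 applies to `string` fields."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# An opaque custom-call payload can legitimately contain non-UTF8 bytes,
# which is why the field was changed from `string` to `bytes`.
opaque_payload = b"\xff\xfe\x00raw opaque data"
```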
The functionality of the RealImagExpander, which simplifies `real(x)` and `imag(x)` when the input `x` is not a complex type, has been integrated into the AlgebraicSimplifier. PiperOrigin-RevId: 856075359
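The rewrite being folded in is simple: for a non-complex input `x`, `real(x)` is `x` itself and `imag(x)` is zero. A sketch of the scalar rule in Python (not the actual HLO pass):

```python
def simplify_real(x):
    # real(x) -> x when x is not complex
    return x.real if isinstance(x, complex) else x

def simplify_imag(x):
    # imag(x) -> 0 when x is not complex
    return x.imag if isinstance(x, complex) else 0.0
```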
Imported from GitHub PR openxla/xla#34715

📝 Summary of Changes
Update the custom_call op_name with the target so that different oneDNN custom ops can be easily distinguished.

🎯 Justification
This helps with debugging and with viewing the ops in the profiler trace view. Instead of `custom-call.2993.clone`, the op will show as `custom-call.2993.clone__onednn$matmul`, making the timeline trace easier to read.

📊 Benchmark (for Performance Improvements)
This doesn't affect performance.

Copybara import of the project:

-- cec648197096c80c694b41247b973b2747e1e45e by Gauri Deshpande <gauri1.deshpande@intel.com>:

Update onednn custom call name

Merging this change closes #34715

PiperOrigin-RevId: 856080728
Have the autotuner control the register spilling strategy rather than the PTX compiler. This is a continuation of the previous work that added register spilling information to Executable so it can be accessed by the caller. This CL now uses that information to discard (or keep) executable candidates. This approach is both more logical from the caller's perspective and allows for more fine-tuning of the behavior in the future. Also fixed in autotuner_compile_util.cc: out.value() was accessed before checking for an error status; a return on failure was added before the register spilling check. Reverts 31a591a PiperOrigin-RevId: 856086795
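The selection policy described above can be sketched as: prefer candidates that do not spill registers, and fall back to spilling candidates only if every candidate spills. A minimal Python sketch under assumed data shapes (the candidate tuples are hypothetical, not the real autotuner API):

```python
def pick_best_candidate(candidates):
    """candidates: list of (name, runtime_ms, spilled_registers) tuples."""
    no_spill = [c for c in candidates if c[2] == 0]
    pool = no_spill or candidates  # fall back only if everything spills
    return min(pool, key=lambda c: c[1])
```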
Adds a function to check whether a `riegeli::Reader` points to a split proto file. This will be used to determine whether a serialized ExecutableAndOptionsProto or GpuExecutable is in the old or new format when deserializing. PiperOrigin-RevId: 856099943
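Format sniffing of this kind typically peeks at a leading marker without consuming it. A hedged Python sketch — the magic constant is invented for illustration and is not the actual split-proto header:

```python
import io

SPLIT_PROTO_MAGIC = b"SPLT"  # hypothetical marker, not the real header bytes

def looks_like_split_proto(reader: io.BufferedReader) -> bool:
    """Peek at the first bytes without advancing the reader."""
    head = reader.peek(len(SPLIT_PROTO_MAGIC))[: len(SPLIT_PROTO_MAGIC)]
    return head == SPLIT_PROTO_MAGIC
```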
The code was assuming that if the output shape is a tuple, there will be users, and that those users will be GetTupleElement ops. However, all that is needed is to determine the buffer slices, and we can do that via the tuple itself by passing the right ShapeIndex. This bug was found when also calling RunBackend() for a gpu_compiler_test. The modified test failed before and passes now. PiperOrigin-RevId: 856108591
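The key idea is that a tuple shape can be walked directly: each leaf buffer is addressed by a ShapeIndex (a path of tuple element positions), so no GetTupleElement users are required. A toy Python sketch using nested tuples in place of XLA shapes:

```python
def leaf_indices(shape):
    """Yield the ShapeIndex-like path to every leaf of a nested tuple shape."""
    if isinstance(shape, tuple):
        for i, sub in enumerate(shape):
            for path in leaf_indices(sub):
                yield (i,) + path
    else:
        yield ()  # a leaf array: empty path
```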
The CL enhances the error reporting in Triton support checks by providing more specific messages when a conversion or operation is not supported. This includes detailing the types involved and the reasons for the unsupported decision. Additionally, test failure messages are improved by including the explanation from the CodegenDecision when an instruction is unexpectedly supported. PiperOrigin-RevId: 856110556
The table contains info we cannot get from the CUDA API. It is used to fill the recently added execution unit description in the device description. This change also includes immediate usage of this table:
* The FPU count is updated to use it if the info is present in the table.
* The performance model base gets a new method to estimate the peak scalar performance for a given datatype.
PiperOrigin-RevId: 856117428
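A peak-scalar-performance estimate of this kind is ordinarily just counting: execution units per SM, times SM count, times clock, times FLOPs per cycle. A hedged sketch with made-up numbers (not the actual performance-model method):

```python
def peak_scalar_gflops(sm_count, units_per_sm, clock_ghz, flops_per_cycle=2):
    """flops_per_cycle=2 assumes one FMA per unit per cycle, counted as 2 FLOPs."""
    return sm_count * units_per_sm * clock_ghz * flops_per_cycle
```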
This change adds support for scaled-dot operations where the input operands are of type F4E2M1FN and the scales are of type F8E8M0FNU. It includes:
- Adding a test case in JAX's scaled_dot_test.py for F4 types.
- Adding a device test in Triton's fusion_emitter_device_test.cc to verify Triton's handling of F4 scaled-dot on Hopper GPUs.
- Disabling cuBLAS autotuning for F4 types in gemm_fusion_autotuner.cc, as cuBLAS does not support them.
- Updating composite_rewriter.cc to recognize F4E2M1FN as a valid operand type for scaled-dot when paired with F8E8M0FNU scales.
- Improving logging in composite_rewriter.cc for unsupported scaled-dot cases.

Note: FP4 has an error in the MMAv2 lowering path. The line `auto dotOpA = cast<DotOperandEncodingAttr>(aTensorTy.getEncoding());` is wrong because the attribute is not of type DotOperandEncodingAttr. As a result, FP4 scaled-dot lowering crashes on some tile sizes on B200 and always crashes on H100. PiperOrigin-RevId: 856123857
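Numerically, a scaled dot applies per-operand scales before the contraction. A pure-Python reference under an assumed layout (one scale per row of A and per column of B; the real block-scaled formats use finer granularity):

```python
def scaled_dot(a, b, scale_a, scale_b):
    """a: m x k, b: k x n; scale_a[i] scales row i of a, scale_b[j] column j of b."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        [
            sum(a[i][p] * scale_a[i] * b[p][j] * scale_b[j] for p in range(k))
            for j in range(n)
        ]
        for i in range(m)
    ]
```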
[XLA:GPU] Fuse shmem write loops for transposes in PackedTranspose

Imported from GitHub PR openxla/xla#34633

Replace per-transpose loops with a single unified loop that processes all transposes simultaneously, computing indices once and reusing them across all operations. Update the `packed_transpose_multiple_heroes.hlo` test to verify the single-loop structure with multiple iter_args. This reduces the execution time for `fused_convert_transpose_3.hlo` ([attached](https://github.com/user-attachments/files/23861262/fused_convert_transpose_3.txt)) from Llama 3 8B FP8 by ~30% on MI300 and MI355.

Copybara import of the project:

-- 37e4ed1dd96afb39bb2b4a958800842d12545fa5 by Aleksei Nurmukhametov <anurmukh@amd.com>:

[XLA:GPU] Fuse shmem write loops for transposes in PackedTranspose

Replace per-transpose loops with a single unified loop that processes all transposes simultaneously, computing indices once and reusing them across all operations. Update packed_transpose_multiple_heroes.hlo test to verify the single-loop structure with multiple iter_args.

Merging this change closes #34633

PiperOrigin-RevId: 856127072
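The fusion idea: instead of one loop per transpose, run one loop over the shared index space and write every transpose's element inside it, so the indices are computed once. A toy Python sketch over 2D lists:

```python
def fused_transposes(inputs):
    """Transpose several equally-shaped 2D lists in a single index loop."""
    rows, cols = len(inputs[0]), len(inputs[0][0])
    outs = [[[None] * rows for _ in range(cols)] for _ in inputs]
    for r in range(rows):
        for c in range(cols):
            # (r, c) computed once, reused for every transpose
            for t, m in enumerate(inputs):
                outs[t][c][r] = m[r][c]
    return outs
```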
PiperOrigin-RevId: 856130233
We had to infer TC clock scales since the info is not officially available. This change also corrects the B200 test device info based on available sources. A couple of interesting points: lower throughput on F64 and on non-TC F16 vs. H100. gpu_fusible_test has been updated to account for the corrected, larger SM count in the test data (cores * threads: 132 * 2048 = 270336 -> 148 * 2048 = 303104). PiperOrigin-RevId: 856144053
Imported from GitHub PR openxla/xla#36346

Hi, I was testing TensorFlow code with the Svace static analyzer and found a possible null dereference in XLA. The null dereference may occur because of a missing return in MsaAlgorithm::UpdateAllocationRequirementForUseAliases(): there is a case where the `aliased_allocation` variable can be null, and it would then be dereferenced in AddAliasedRequiredAssignment(). #99907

Copybara import of the project:

-- 969d905e81751fdf92581ad5fc5289ecc798d727 by Daniil Kutz <kutz@ispras.ru>:

[XLA] Add missing return to prevent nullptr dereference

Merging this change closes #36346

PiperOrigin-RevId: 856151503
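The bug pattern is a lookup that can yield null followed by an unconditional use. A Python analogue of the guard that was added (the names are illustrative, not the MSA code):

```python
def update_allocation_requirements(uses, aliased_allocations):
    """aliased_allocations: dict from use -> allocation; misses are possible."""
    required = []
    for use in uses:
        alloc = aliased_allocations.get(use)
        if alloc is None:
            # The missing early return: without it, `alloc` would be used
            # below (an AttributeError, the analogue of a null dereference).
            return None
        required.append(alloc)
    return required
```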
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)