Conversation

@mmakevic-amd

No description provided.

akuegel and others added 30 commits June 18, 2024 01:31
It can be disabled again by setting --xla_gpu_mlir_emitter_level=0
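
For reference, a minimal sketch (not part of the commit) of flipping this flag from Python via the `XLA_FLAGS` environment variable, assuming the flag is picked up when the first computation is compiled:

```python
import os

# Set before the first XLA computation is compiled; the GPU backend
# reads extra command-line flags from XLA_FLAGS.
os.environ["XLA_FLAGS"] = "--xla_gpu_mlir_emitter_level=0"
```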

Remove two unrolling tests. With MLIR emitters enabled, we do unroll
in these two cases, but benchmarking shows that it is at least as fast
as with the old emitters.

Remove one test in element_wise_row_vectorization.hlo. It checks that we
do not vectorize for multi-output fusions, but there is no reason why we
shouldn't. With MLIR emitters, we would vectorize here.

PiperOrigin-RevId: 644283642
PiperOrigin-RevId: 644285351
PiperOrigin-RevId: 644286074
…ned_einsum_handler to handle all2all

Imported from GitHub PR openxla/xla#13310

Added rewrite logic in gpu_windowed_einsum_handler to decompose a2a+gemm or gemm+a2a into smaller a2a+gemm/gemm+a2a pairs to hide communication overhead. Partial results are aggregated at the end.
An example:
```
input
   |
a2a{replica_groups={{0,1}}}
   |
gemm
```

which is decomposed into:
```
          input
         /     \
    slice1      slice2
       |          |
     a2a1        a2a2
       |          |
    gemm1       gemm2
         \      /
           add
```

All partial gemms are also dispatched to parallel streams to achieve gemm-gemm overlap.
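
As a rough illustration of why the partial gemm outputs can simply be added, here is a NumPy sketch with the all-to-alls elided, assuming the slices are taken along the gemm's contracting dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 128)).astype(np.float32)  # stand-in for the a2a output
w = rng.standard_normal((128, 64)).astype(np.float32)

full = x @ w  # the original, undecomposed gemm

# slice1/slice2 split the contracting dimension; gemm1 + gemm2 -> add.
# In XLA each slice gets its own a2a, so the communication for one
# slice overlaps with the gemm of the other.
x1, x2 = x[:, :64], x[:, 64:]
w1, w2 = w[:64, :], w[64:, :]
partial = x1 @ w1 + x2 @ w2

np.testing.assert_allclose(full, partial, rtol=1e-4, atol=1e-4)
```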

Performance metrics:
For a unit with just a2a+gemm or gemm+a2a, we see a 5-15% speedup from this decomposition, depending on the problem size.
Copybara import of the project:

--
557c540df51b3c238f87ae01262f1a6000ee4499 by TJ <tjx@nvidia.com>:

Added rewrite logic in gpu_windowed_einsum_handler to decompose
a2a+gemm or gemm+a2a into smaller a2a+gemm/gemm+a2a pairs to hide
communication overhead.

--
d3e9b2fc28b484f263609c21b6177ea948aa8e01 by TJ <tjx@nvidia.com>:

Changed tests to use FileCheck

--
3de9fbb962438837e77353c3c0b2a96e3e0d397e by TJ Xu <tjx@nvidia.com>:

Added e2e tests
Addressed recent changes to thunk emission with execution stream id

--
d7790ed5e206c5e1ebf33afa8e34d7faedff4d47 by TJ Xu <tjx@nvidia.com>:

Added file check to BUILD file

Merging this change closes tensorflow#13310

PiperOrigin-RevId: 644291310
Imported from GitHub PR openxla/xla#13866

Copybara import of the project:

--
66fc76f233ec0beae75265d89e3526dee5cf84da by Harsha HS <Harsha.HavanurShamsundara@amd.com>:

Handle disabled backends for AMD case

Merging this change closes tensorflow#13866

PiperOrigin-RevId: 644296216
PiperOrigin-RevId: 644296802
PiperOrigin-RevId: 644300313
Heuristics for compute and memory access time are shared with GpuPerformanceModel. The main difference is that we assume that each tile is read or computed only once and results are shared between threads via shared memory.
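
A toy sketch of this style of estimate; the function and parameter names are illustrative, not the actual GpuPerformanceModel API:

```python
def estimate_tile_time_us(tile_bytes_read, tile_flops,
                          bandwidth_bytes_per_us, flops_per_us):
    # Each tile is read and computed only once; its results are then
    # shared between threads via shared memory, so neither cost is
    # charged per-thread.
    read_time = tile_bytes_read / bandwidth_bytes_per_us
    compute_time = tile_flops / flops_per_us
    # Assume memory access and compute overlap, so the slower one dominates.
    return max(read_time, compute_time)
```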

PiperOrigin-RevId: 644313383
…codes in Triton emitter.

In order to test this exhaustively, change the `TritonType` derivation logic to
propagate a `Status` instead of crashing whenever a mapping between the provided
HLO type and Triton types has not been defined.
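
A rough Python analogue of the new behavior; the mapping and names are illustrative stand-ins for the C++ code, which propagates a `Status`:

```python
# Assumed subset of the HLO -> Triton type mapping, for illustration only.
_TRITON_TYPES = {"f32": "fp32", "f16": "fp16", "bf16": "bf16"}

def triton_type(hlo_type: str) -> str:
    # Report an error to the caller instead of crashing on unmapped types.
    if hlo_type not in _TRITON_TYPES:
        raise ValueError(f"no Triton type defined for HLO type {hlo_type!r}")
    return _TRITON_TYPES[hlo_type]
```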

Landing this as a single change makes sense because exhaustively testing
`bitcast`s and `reshape`s is a rather canonical way of checking that this new
error propagation logic works as intended.

PiperOrigin-RevId: 644313457
Constraints are constructed by merging the constraints of all the
`SymbolicTile`s encountered while constructing the resulting symbolic tiled HLO
computation. If any of the `SymbolicTile`s is unsatisfiable, construction
of the `SymbolicTileAnalysis` object does not succeed. Likewise, construction
fails if some constraints cannot be merged with others.

Constraints are now checked against the provided tile parameters
when attempting to extract a `TiledHloComputation` out of
`SymbolicTileAnalysis`. To avoid checking constraints too many times,
callers may promise that the provided tile parameters satisfy the
constraints, voluntarily bypassing the checks.
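
The check-with-bypass might look like this sketch; the names are hypothetical, not the real `SymbolicTileAnalysis` API:

```python
def tiled_hlo_computation(analysis, tile_params,
                          constraints_are_known_satisfied=False):
    # Verify the tile parameters unless the caller has promised that
    # they already satisfy the analysis constraints.
    if not constraints_are_known_satisfied:
        if not analysis.constraints_satisfied_by(tile_params):
            raise ValueError("tile parameters violate the constraints")
    return analysis.build_tiled_hlo_computation(tile_params)
```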

PiperOrigin-RevId: 644321131
Updates LLVM usage to match
[93ffe1792fd9](llvm/llvm-project@93ffe1792fd9)

PiperOrigin-RevId: 644334496
…::Status.

In some situations, this meant also changing unrelated files to directly include tsl/platform/statusor.h to get the definitions of TF_ASSIGN_OR_RETURN, etc., which they had previously been getting via transitive includes for free.

PiperOrigin-RevId: 644340828
…row.

Imported from GitHub PR openxla/xla#13781

Copybara import of the project:

--
060f02a0c356edffa1037da07150eb19ef387231 by Ilia Sergachev <isergachev@nvidia.com>:

[GPU] Let the on-disk kernel compilation cache grow.

--
3572421e08fef72e8a6c49105c6a5ec5c9b47a5d by Ilia Sergachev <isergachev@nvidia.com>:

Add new flag description

--
ecb79ca37583599b8bed8cd8c457139981c589be by Ilia Sergachev <isergachev@nvidia.com>:

Move code

--
3695df02c2f9ec73f4c70d6cacb23d983e3e7a21 by Ilia Sergachev <isergachev@nvidia.com>:

Improve checks

--
f421890b6548a9a4dfcc5350e791bc8860615dbc by Ilia Sergachev <isergachev@nvidia.com>:

Add another test

Merging this change closes tensorflow#13781

PiperOrigin-RevId: 644345246
PiperOrigin-RevId: 644348830
Tile sizes are usually small, so it's better to use InlinedVector or SmallVector to store them.

PiperOrigin-RevId: 644362570
…est`.

`bitcast`s are not meaningful pre-optimization because intermediate HLO ops
do not have a layout at that point. For that reason, incorrect `bitcast`s
evade verifier checks in `ParseAndReturnVerifiedModule`. This was hiding a
data type mismatch in our tests.

Since all the `bitcast`s in `symbolic_tile_test` have `reshape` semantics,
we simply replace them with `reshape`s, which `ParseAndReturnVerifiedModule`
handles well.

PiperOrigin-RevId: 644366242
When autotuning them, calling cuBLASLt is considered a fallback. Before this change, only the cuBLAS path was considered in the autotuner.

This is a temporary change; the GemmRewriter "fp8" parameter will be removed, so only one call will be needed.

PiperOrigin-RevId: 644374262
PiperOrigin-RevId: 644382031
This algorithm is responsible for numerical problems in 4+ models
from different customers. It's likely that other customers
have issues that they haven't reported yet.

Let's disable algo id 14 for all shapes for now until the cuDNN team has a chance
to look at the issue.

PiperOrigin-RevId: 644385128
PiperOrigin-RevId: 644392118
…_sharding_util`.

We may also use it in the SPMD partitioner. This CL only changes the location of a utility function, with no behavior change.

PiperOrigin-RevId: 644393231
Updates LLVM usage to match
[52d87de7a42d](llvm/llvm-project@52d87de7a42d)

PiperOrigin-RevId: 644397020
…ption

If the option is set, we will maintain (read/write) a per-fusion autotune cache in the given directory.

The directory must exist.

Cache invalidation has to be handled by the user (e.g. please use an empty directory if you want to start with an empty cache).

XLA version checks must be done by the user (e.g. if you want to cache fusions created with different versions of XLA, please use different directories).
(If the library using XLA already has a version-handling mechanism, as JAX does, it shouldn't be difficult to create separate directories based on that version and any other parameters that matter.)
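
One hypothetical way to carve out such per-version directories, purely as a sketch:

```python
import os

def versioned_cache_dir(root: str, xla_version: str) -> str:
    # Key the cache directory on the compiler version (and anything else
    # that should invalidate entries) so results never mix across versions.
    path = os.path.join(root, f"xla-{xla_version}")
    os.makedirs(path, exist_ok=True)  # the cache directory must exist
    return path
```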

Default: no file-based cache.

There is minimal support for multiple processes using the same cache: the rename trick is used to avoid multiple processes writing the same file at the same time, and to avoid readers seeing incomplete files.

We use SHA256 hashes in the filenames and assume that no collisions occur.
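
A sketch of that write path; the file layout and extension are assumptions, not XLA's actual scheme:

```python
import hashlib
import os
import tempfile

def write_cache_entry(cache_dir: str, key: bytes, result: bytes) -> None:
    # Filename is the SHA256 of the cache key; collisions are assumed
    # not to occur.
    name = hashlib.sha256(key).hexdigest()
    # Write to a unique temp file in the same directory, then atomically
    # rename it into place, so a concurrent reader never sees a partially
    # written entry and concurrent writers don't clobber each other.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(result)
    os.replace(tmp_path, os.path.join(cache_dir, name))
```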

This is a simple implementation to allow people to test it and find good use-cases. If needed we can refine it later.

Considered use case:
People running [multiple] [similar] models [through JAX]. For example, there are two similar HLOs that we want to run with JAX (using the same "XLA binary"), and it would be nice to reuse the autotune results from the first if some kernels appear in both.
Similarly, consider a researcher in a Colab session making small changes to their model: they should mostly get cache hits!

Limitations:

It is not recommended to change the cache directory during the run of a process, because the in-memory and file-based caches can then become inconsistent. At a minimum, clear the in-memory cache if you change it.

When loading results with LoadAutotuneResults[FromFile], they are not written into the cache directory.

PiperOrigin-RevId: 644406688
chihuahua and others added 21 commits June 23, 2024 02:39
…tcher with oneDNN custom call rewrite

Imported from GitHub PR openxla/xla#10301

This PR enables oneDNN library calls for matched XLA HLO Convolution patterns through the custom_call instruction. In particular, this PR:

1. Adds a oneDNN convolution rewriter pass that rewrites HLO Convolution to oneDNN Convolution.
2. Refactors the backend config to enhance code reusability.
3. Adds a convolution test file to verify the rewrite and execution results.
Copybara import of the project:

--
b7d0abb4f683595c91bf144bcf1b208254ca5c74 by Akhil Goel <akhil.goel@intel.com>:

Add onednn convolution support

--
fbf544ea346250129c00023da7e193aecad5375b by Akhil Goel <akhil.goel@intel.com>:

Remove unused symbol from BUILD file

--
76c079109361c75a6c512d32801039ed7a3f30e1 by Akhil Goel <akhil.goel@intel.com>:

Fix buildifier error

--
01f59d2d11481a1ac0b3e7a491261e1ee541aac0 by Akhil Goel <akhil.goel@intel.com>:

Address Review Comments

--
667502387a855ed0974822334283b890f84d1c34 by Akhil Goel <akhil.goel@intel.com>:

Refactor oneDNN rewritability check to a separate function

--
6670413b5150518044a371e99e60a9ea36984660 by Akhil Goel <akhil.goel@intel.com>:

Add cpu package to onednn_config proto file

--
c1e8fc78e2a6a039ef615b17130be8b0fbd9c901 by Akhil Goel <akhil.goel@intel.com>:

Push missing change in merge

--
9f446d82fef58f0b0b946f6c5e4baba8cf5e4a50 by Akhil Goel <akhil.goel@intel.com>:

Mark outdated ids as reserved

Merging this change closes tensorflow#10301

PiperOrigin-RevId: 645921244
PiperOrigin-RevId: 645929520
PiperOrigin-RevId: 645988622
PiperOrigin-RevId: 646015314
PiperOrigin-RevId: 646031586
Updates LLVM usage to match
[e5a41f0afc15](llvm/llvm-project@e5a41f0afc15)

PiperOrigin-RevId: 646067665
Updates LLVM usage to match
[5cd0ba30f53d](llvm/llvm-project@5cd0ba30f53d)

PiperOrigin-RevId: 646103303
Imported from GitHub PR openxla/xla#14017

This fixes a build break due to f4212dc and 0f75900
Copybara import of the project:

--
d914293af6aec074f8e313b93140bde66fac5171 by Harsha HS <Harsha.HavanurShamsundara@amd.com>:

[ROCm] Fix Build break due to f4212dc and 0f75900

Merging this change closes tensorflow#14017

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#14017 from ROCm:ci_fix_struct_build_break_20240620 d914293af6aec074f8e313b93140bde66fac5171
PiperOrigin-RevId: 645152887
@i-chaochen
Collaborator

retest Ubuntu-GPU-single please

@mmakevic-amd force-pushed the develop-upstream-sync-240624 branch from 15233ac to 150bb37 on July 16, 2024 05:46
@hsharsha merged commit f1d1afd into develop-upstream on Jul 29, 2024
@i-chaochen
Collaborator

@hsharsha why aren't we waiting for the CI run with the updated docker image to finish?

@hsharsha

@hsharsha why aren't we waiting for the CI run with the updated docker image to finish?

It passed http://ml-ci.amd.com:21096/job/tensorflow/job/ThirdParty-XLA/job/PR-2569/12/console. I re-triggered it by accident.
