Conversation

@mmakevic-amd

No description provided.

akuegel and others added 30 commits June 18, 2024 01:31
It can be disabled again by setting --xla_gpu_mlir_emitter_level=0
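
For reference, a minimal sketch (not part of the commit) of flipping this flag from Python via the `XLA_FLAGS` environment variable, assuming the flag is picked up when the first computation is compiled:

```python
import os

# Set before the first XLA computation is compiled; the GPU backend
# reads extra command-line flags from XLA_FLAGS.
os.environ["XLA_FLAGS"] = "--xla_gpu_mlir_emitter_level=0"
```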

Remove two unrolling tests. With MLIR emitters enabled, we do unroll
in these two cases, but benchmarking shows that it is at least as fast
as with the old emitters.

Remove one test in element_wise_row_vectorization.hlo. It checks that we
do not vectorize for multi-output fusions, but there is no reason why we
shouldn't. With MLIR emitters, we would vectorize here.

PiperOrigin-RevId: 644283642
PiperOrigin-RevId: 644285351
PiperOrigin-RevId: 644286074
…ned_einsum_handler to handle all2all

Imported from GitHub PR openxla/xla#13310

Added rewrite logic in gpu_windowed_einsum_handler to decompose a2a+gemm or gemm+a2a into smaller a2a+gemm/gemm+a2a pairs to hide communication overhead. Partial results are aggregated at the end.
An example:
```
input
   |
a2a{replica_groups={{0,1}}}
   |
gemm
```

which is decomposed into:
```
          input
         /     \
    slice1      slice2
       |          |
     a2a1        a2a2
       |          |
    gemm1       gemm2
         \      /
           add
```

All partial gemms are also dispatched to parallel streams to achieve gemm-gemm overlap.
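
As a rough illustration of why the partial gemm outputs can simply be added, here is a NumPy sketch with the all-to-alls elided, assuming the slices are taken along the gemm's contracting dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 128)).astype(np.float32)  # stand-in for the a2a output
w = rng.standard_normal((128, 64)).astype(np.float32)

full = x @ w  # the original, undecomposed gemm

# slice1/slice2 split the contracting dimension; gemm1 + gemm2 -> add.
# In XLA each slice gets its own a2a, so the communication for one
# slice overlaps with the gemm of the other.
x1, x2 = x[:, :64], x[:, 64:]
w1, w2 = w[:64, :], w[64:, :]
partial = x1 @ w1 + x2 @ w2

np.testing.assert_allclose(full, partial, rtol=1e-4, atol=1e-4)
```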

Performance metrics:
For a unit with just a2a+gemm or gemm+a2a, we see a 5-15% speedup from this decomposition, depending on the problem size.
Copybara import of the project:

--
557c540df51b3c238f87ae01262f1a6000ee4499 by TJ <tjx@nvidia.com>:

Added rewrite logic in gpu_windowed_einsum_handler to decompose
a2a+gemm or gemm+a2a into smaller a2a+gemm/gemm+a2a pairs to hide
communication overhead.

--
d3e9b2fc28b484f263609c21b6177ea948aa8e01 by TJ <tjx@nvidia.com>:

Changed tests to use FileCheck

--
3de9fbb962438837e77353c3c0b2a96e3e0d397e by TJ Xu <tjx@nvidia.com>:

Added e2e tests
Addressed recent changes to thunk emission with execution stream id

--
d7790ed5e206c5e1ebf33afa8e34d7faedff4d47 by TJ Xu <tjx@nvidia.com>:

Added file check to BUILD file

Merging this change closes tensorflow#13310

PiperOrigin-RevId: 644291310
Imported from GitHub PR openxla/xla#13866

Copybara import of the project:

--
66fc76f233ec0beae75265d89e3526dee5cf84da by Harsha HS <Harsha.HavanurShamsundara@amd.com>:

Handle disabled backends for AMD case

Merging this change closes tensorflow#13866

PiperOrigin-RevId: 644296216
PiperOrigin-RevId: 644296802
PiperOrigin-RevId: 644300313
Heuristics for compute and memory access time are shared with GpuPerformanceModel. The main difference is that we assume that each tile is read or computed only once and results are shared between threads via shared memory.
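
A toy sketch of this style of estimate; the function and parameter names are illustrative, not the actual GpuPerformanceModel API:

```python
def estimate_tile_time_us(tile_bytes_read, tile_flops,
                          bandwidth_bytes_per_us, flops_per_us):
    # Each tile is read and computed only once; its results are then
    # shared between threads via shared memory, so neither cost is
    # charged per-thread.
    read_time = tile_bytes_read / bandwidth_bytes_per_us
    compute_time = tile_flops / flops_per_us
    # Assume memory access and compute overlap, so the slower one dominates.
    return max(read_time, compute_time)
```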

PiperOrigin-RevId: 644313383
…codes in Triton emitter.

In order to test this exhaustively, change the `TritonType` derivation logic to
propagate a `Status` instead of crashing whenever a mapping between the provided
HLO type and Triton types has not been defined.
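
A rough Python analogue of the new behavior; the mapping and names are illustrative stand-ins for the C++ code, which propagates a `Status`:

```python
# Assumed subset of the HLO -> Triton type mapping, for illustration only.
_TRITON_TYPES = {"f32": "fp32", "f16": "fp16", "bf16": "bf16"}

def triton_type(hlo_type: str) -> str:
    # Report an error to the caller instead of crashing on unmapped types.
    if hlo_type not in _TRITON_TYPES:
        raise ValueError(f"no Triton type defined for HLO type {hlo_type!r}")
    return _TRITON_TYPES[hlo_type]
```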

Landing this as a single change makes sense because exhaustively testing
`bitcast`s and `reshape`s is a rather canonical way of checking that this new
error propagation logic works as intended.

PiperOrigin-RevId: 644313457
Constraints are constructed by merging the constraints of all the
`SymbolicTile`s encountered while constructing the resulting symbolic tiled HLO
computation. If any of the `SymbolicTile`s is unsatisfiable, construction
of the `SymbolicTileAnalysis` object does not succeed. Likewise, construction
fails if some constraints cannot be merged with others.

Constraints are now checked against the provided tile parameters
when attempting to extract a `TiledHloComputation` out of
`SymbolicTileAnalysis`. To avoid checking constraints too many times,
callers may promise that the provided tile parameters satisfy the
constraints, voluntarily bypassing the checks.
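
The check-with-bypass might look like this sketch; the names are hypothetical, not the real `SymbolicTileAnalysis` API:

```python
def tiled_hlo_computation(analysis, tile_params,
                          constraints_are_known_satisfied=False):
    # Verify the tile parameters unless the caller has promised that
    # they already satisfy the analysis constraints.
    if not constraints_are_known_satisfied:
        if not analysis.constraints_satisfied_by(tile_params):
            raise ValueError("tile parameters violate the constraints")
    return analysis.build_tiled_hlo_computation(tile_params)
```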

PiperOrigin-RevId: 644321131
Updates LLVM usage to match
[93ffe1792fd9](llvm/llvm-project@93ffe1792fd9)

PiperOrigin-RevId: 644334496
…::Status.

In some situations, this meant also changing unrelated files to directly include tsl/platform/statusor.h to get the definitions of TF_ASSIGN_OR_RETURN, etc., which they had previously been getting via transitive includes for free.

PiperOrigin-RevId: 644340828
…row.

Imported from GitHub PR openxla/xla#13781

Copybara import of the project:

--
060f02a0c356edffa1037da07150eb19ef387231 by Ilia Sergachev <isergachev@nvidia.com>:

[GPU] Let the on-disk kernel compilation cache grow.

--
3572421e08fef72e8a6c49105c6a5ec5c9b47a5d by Ilia Sergachev <isergachev@nvidia.com>:

Add new flag description

--
ecb79ca37583599b8bed8cd8c457139981c589be by Ilia Sergachev <isergachev@nvidia.com>:

Move code

--
3695df02c2f9ec73f4c70d6cacb23d983e3e7a21 by Ilia Sergachev <isergachev@nvidia.com>:

Improve checks

--
f421890b6548a9a4dfcc5350e791bc8860615dbc by Ilia Sergachev <isergachev@nvidia.com>:

Add another test

Merging this change closes tensorflow#13781

PiperOrigin-RevId: 644345246
PiperOrigin-RevId: 644348830
Tile sizes are usually small, so it's better to use InlinedVector or SmallVector to store them.

PiperOrigin-RevId: 644362570
…est`.

`bitcast`s are not meaningful pre-optimization because intermediate HLO ops
do not have a layout at that point. For that reason, incorrect `bitcast`s
evade verifier checks in `ParseAndReturnVerifiedModule`. This was hiding a
data type mismatch in our tests.

Since all the `bitcast`s in `symbolic_tile_test` have `reshape` semantics,
we simply replace them with `reshape`s, which `ParseAndReturnVerifiedModule`
handles well.

PiperOrigin-RevId: 644366242
When autotuning them, calling cuBLASLt is considered a fallback. Before this change, only the cuBLAS path was considered in the autotuner.

This is a temporary change; the GemmRewriter "fp8" parameter will be removed, so only one call will be needed.

PiperOrigin-RevId: 644374262
PiperOrigin-RevId: 644382031
This algorithm is responsible for numerical problems in 4+ models
from different customers. It's likely that other customers
have issues that they haven't reported yet.

Let's disable algo id 14 for all shapes for now until the cuDNN team has a chance
to look at the issue.

PiperOrigin-RevId: 644385128
PiperOrigin-RevId: 644392118
…_sharding_util`.

We may also use it in the SPMD partitioner. This CL only changes the location of a utility function, with no behavior change.

PiperOrigin-RevId: 644393231
Updates LLVM usage to match
[52d87de7a42d](llvm/llvm-project@52d87de7a42d)

PiperOrigin-RevId: 644397020
…ption

If the option is set, we will maintain (read/write) a per-fusion autotune cache in the given directory.

The directory must exist.

Cache invalidation has to be handled by the user (e.g. please use an empty directory if you want to start with an empty cache).

XLA version checks must be done by the user (e.g. if you want to cache fusions created with different versions of XLA, please use different directories).
(If the library using XLA already has a version-handling mechanism, as JAX does, it shouldn't be difficult to create separate directories based on that version and any other parameters that matter.)
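
One hypothetical way to carve out such per-version directories, purely as a sketch:

```python
import os

def versioned_cache_dir(root: str, xla_version: str) -> str:
    # Key the cache directory on the compiler version (and anything else
    # that should invalidate entries) so results never mix across versions.
    path = os.path.join(root, f"xla-{xla_version}")
    os.makedirs(path, exist_ok=True)  # the cache directory must exist
    return path
```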

Default: no file-based cache.

There is minimal support for multiple processes using the same cache: the rename trick is used to avoid multiple processes writing the same file at the same time, and to avoid readers seeing incomplete files.

We use SHA256 hashes in the filenames and assume that no collisions occur.
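
A sketch of that write path; the file layout and extension are assumptions, not XLA's actual scheme:

```python
import hashlib
import os
import tempfile

def write_cache_entry(cache_dir: str, key: bytes, result: bytes) -> None:
    # Filename is the SHA256 of the cache key; collisions are assumed
    # not to occur.
    name = hashlib.sha256(key).hexdigest()
    # Write to a unique temp file in the same directory, then atomically
    # rename it into place, so a concurrent reader never sees a partially
    # written entry and concurrent writers don't clobber each other.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(result)
    os.replace(tmp_path, os.path.join(cache_dir, name))
```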

This is a simple implementation to allow people to test it and find good use-cases. If needed we can refine it later.

Considered use case:
People running [multiple] [similar] models [through JAX]. For example, there are two similar HLOs that we want to run with JAX (using the same "XLA binary"), and it would be nice to reuse the autotune results from the first if some kernels appear in both.
Similarly, consider a researcher in a Colab session making small changes to their model: they should mostly get cache hits!

Limitations:

It is not recommended to change the cache directory during the run of a process, because the in-memory and file-based caches can then become inconsistent. At a minimum, clear the in-memory cache if you change it.

When loading results with LoadAutotuneResults[FromFile], they are not written into the cache directory.

PiperOrigin-RevId: 644406688
chihuahua and others added 21 commits June 23, 2024 02:39
…tcher with oneDNN custom call rewrite

Imported from GitHub PR openxla/xla#10301

This PR enables oneDNN library calls for matched XLA HLO Convolution patterns through the custom_call instruction. In particular, this PR:

1. Adds a oneDNN convolution rewriter pass that rewrites HLO Convolution to oneDNN Convolution.
2. Refactors the backend config to enhance code reusability.
3. Adds a convolution test file to verify the rewrite and execution results.
Copybara import of the project:

--
b7d0abb4f683595c91bf144bcf1b208254ca5c74 by Akhil Goel <akhil.goel@intel.com>:

Add onednn convolution support

--
fbf544ea346250129c00023da7e193aecad5375b by Akhil Goel <akhil.goel@intel.com>:

Remove unused symbol from BUILD file

--
76c079109361c75a6c512d32801039ed7a3f30e1 by Akhil Goel <akhil.goel@intel.com>:

Fix buildifier error

--
01f59d2d11481a1ac0b3e7a491261e1ee541aac0 by Akhil Goel <akhil.goel@intel.com>:

Address Review Comments

--
667502387a855ed0974822334283b890f84d1c34 by Akhil Goel <akhil.goel@intel.com>:

Refactor oneDNN rewritability check to a separate function

--
6670413b5150518044a371e99e60a9ea36984660 by Akhil Goel <akhil.goel@intel.com>:

Add cpu package to onednn_config proto file

--
c1e8fc78e2a6a039ef615b17130be8b0fbd9c901 by Akhil Goel <akhil.goel@intel.com>:

Push missing change in merge

--
9f446d82fef58f0b0b946f6c5e4baba8cf5e4a50 by Akhil Goel <akhil.goel@intel.com>:

Mark outdated ids as reserved

Merging this change closes tensorflow#10301

PiperOrigin-RevId: 645921244
PiperOrigin-RevId: 645929520
PiperOrigin-RevId: 645988622
PiperOrigin-RevId: 646015314
PiperOrigin-RevId: 646031586
Updates LLVM usage to match
[e5a41f0afc15](llvm/llvm-project@e5a41f0afc15)

PiperOrigin-RevId: 646067665
Updates LLVM usage to match
[5cd0ba30f53d](llvm/llvm-project@5cd0ba30f53d)

PiperOrigin-RevId: 646103303
Imported from GitHub PR openxla/xla#14017

This fixes a build break due to f4212dc and 0f75900
Copybara import of the project:

--
d914293af6aec074f8e313b93140bde66fac5171 by Harsha HS <Harsha.HavanurShamsundara@amd.com>:

[ROCm] Fix Build break due to f4212dc and 0f75900

Merging this change closes tensorflow#14017

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#14017 from ROCm:ci_fix_struct_build_break_20240620 d914293af6aec074f8e313b93140bde66fac5171
PiperOrigin-RevId: 645152887
@i-chaochen
Collaborator

retest Ubuntu-GPU-single please

@mmakevic-amd force-pushed the develop-upstream-sync-240624 branch from 15233ac to 150bb37 on July 16, 2024 05:46
@hsharsha merged commit f1d1afd into develop-upstream on Jul 29, 2024
@i-chaochen
Collaborator

@hsharsha why aren't we waiting for the CI run with the updated docker image to finish?

@hsharsha

@hsharsha why aren't we waiting for the CI run with the updated docker image to finish?

It passed http://ml-ci.amd.com:21096/job/tensorflow/job/ThirdParty-XLA/job/PR-2569/12/console. I re-triggered it by accident.
