Conversation

@z1-cciauto
Collaborator

No description provided.

matthias-springer and others added 30 commits November 18, 2025 13:47
)

Add pass options to run lowerings to NVVM without pattern rollback. This
makes the dialect conversions easier to debug and improves
performance/memory usage.
The index == 0 scenario has already been handled by the early return, so
only the upper-half scenario is relevant here.
llvm#168392)

Once llvm#149042 is relanded, we will start EVL tail folding
vectorized loops that have live-outs, e.g.:

```c
int f(int *x, int n) {
  int y = 0;
  for (int i = 0; i < n; i++) {
    y = x[i] + 1;
    x[y] = y;
  }
  return y;
}
```

These are vectorized by extracting the last "active lane" in the loop's
exit:

```llvm
loop:
  %vl = call i32 @llvm.experimental.get.vector.length(i64 %avl, i32 4, i1 true)
  ...

exit:
  %lastidx = sub i64 %vl, 1
  %lastelt = extractelement <vscale x 4 x i32> %y, i64 %lastidx
```

On RISC-V this translates to a vslidedown.vx with a VL of 1:

```llvm
bb.loop:
    %vl:gprnox0 = PseudoVSETVLI ...
    %y:vr = PseudoVADD_VI_M1 $noreg, %x, 1,  AVL=-1
    ...
bb.exit:
    %lastidx:gprnox0 = ADDI %vl, -1
    %w:vr = PseudoVSLIDEDOWN_VX_M1 $noreg, %y, %lastidx, AVL=1
```

However, today we fail to reduce the VL of %y in the loop and end up
with two extra VL toggles. The reason is that RISCVVLOptimizer is
currently conservative with vslidedown.vx, since it can read lanes of
%y past its own VL, so in `getMinimumVLForUser` we say that
vslidedown.vx demands the entirety of %y.

One observation about the sequence above is that it only actually needs
to read the first %vl lanes of %y: with a VL of 1, the highest lane of vs2
it reads is the lane at `offset`, so it demands the first offset + 1 lanes.
In this case, that's `%lastidx + 1 = %vl - 1 + 1 = %vl`.

This PR teaches RISCVVLOptimizer about this case in
`getMinimumVLForVSLIDEDOWN_VX`, and in doing so removes the VL toggles.

The one case I had to think about for a bit was when `ADDI %vl,
-1` wraps, i.e. when %vl=0 and the resulting offset is all ones. That
offset is always larger than the largest possible VLMAX, so vs2 will be
completely slid down and absent from the output, and we don't need to
read anything from it.

This patch on its own has no observable effect on llvm-test-suite or
SPEC CPU 2017 w/ rva23u64 today.
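
To make the intuition concrete, here is a rough C++ sketch of the demanded-lanes reasoning; the helper below is hypothetical and is not the actual `getMinimumVLForVSLIDEDOWN_VX` from RISCVVLOptimizer:

```c++
// Conceptual sketch only -- not the actual RISCVVLOptimizer code. With an
// offset of `Offset` and a VL of `UserVL`, a vslidedown.vx reads lanes
// Offset .. Offset + UserVL - 1 of vs2, so it demands the first
// Offset + UserVL lanes. If the offset computation wrapped (ADDI %vl, -1
// with %vl == 0), the offset exceeds any VLMAX and no lanes of vs2 are read.
static unsigned demandedVS2Lanes(unsigned Offset, unsigned UserVL,
                                 unsigned VLMax) {
  if (Offset >= VLMax)
    return 0;               // vs2 is completely slid out of the result
  return Offset + UserVL;   // e.g. (%vl - 1) + 1 == %vl in the example above
}
```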
### Summary
This PR resolves llvm#163895.
It just adds the fcmp-sse part of the X86 vector builtins for CIR.

---------

Co-authored-by: liuzhenya <zyliu@siorigin.com>
We need to fall through here in case we're not jumping to the labels.
This is only needed in expression contexts.
Add a pass option to `convert-scf-to-cf` to deactivate pattern rollback
for better performance. The SCF->CF lowering patterns benefit a lot
from this feature because `splitBlock` is expensive in the rollback
driver.
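
For context, this is roughly what driving a conversion without rollback looks like in C++; `ConversionConfig::allowPatternRollback` is assumed here, and the exact pass-option name added by this commit is not shown:

```c++
// Sketch inside a conversion pass, assuming the allowPatternRollback knob on
// mlir::ConversionConfig used by the no-rollback driver; check the MLIR
// documentation for the authoritative API.
mlir::ConversionConfig config;
config.allowPatternRollback = false; // fail instead of rolling back patterns
if (failed(applyPartialConversion(getOperation(), target,
                                  std::move(patterns), config)))
  signalPassFailure();
```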
…vm#168430)

Updated the evaluate handler to check for DAP ErrorResponse bodies,
which are used to display user errors when a request fails. This was
updated in PR llvm#167720.

This should fix https://lab.llvm.org/buildbot/#/builders/163
…ts (llvm#166851)

`-fsanitize=address,fuzzer` should be rejected like
`-fsanitize=fuzzer,address`.
The address sanitizer enables the device sanitizer pipeline. The fuzzer
implicitly turns on LLVM's SanitizerCoverage, which the driver then
forwards to the device cc1. SanitizerCoverage is not supported on
amdgcn.
…vm#168058)

Also breaks the long inheritance chains by making both `SIGfx10CacheControl` and `SIGfx12CacheControl` inherit from `SICacheControl` directly.

With this patch, we now have just 3 `SICacheControl` implementations that each do their own thing, and there is no more code hidden 3 superclasses above (which made this code harder to read and maintain than it needed to be).
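
A minimal sketch of the flattened hierarchy; only the two classes named above and the direct inheritance come from the commit message, the third implementation's name and the comments are guesses:

```c++
// Each generation now implements SICacheControl directly instead of
// inheriting behavior through several layers of superclasses.
class SICacheControl {
public:
  virtual ~SICacheControl() = default;
  // ... cache-control hooks ...
};

class SIGfx6CacheControl  : public SICacheControl { /* pre-GFX10 handling */ };
class SIGfx10CacheControl : public SICacheControl { /* GFX10/11 handling  */ };
class SIGfx12CacheControl : public SICacheControl { /* GFX12 handling     */ };
```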
1. Fixed 2 DTLTO cache tests that failed on macOS because the input to the grep
command differs from Windows.
2. Removed unneeded comments from dtlto-cache.ll.
…m#166360)

This will be used to support ZT0 in the MachineSMEABIPass.
…lvm#166247)

This patch implements a transform to hoist single-scalar replicated
loads with invariant addresses out of the vector loop to the preheader
when scoped noalias metadata proves they cannot alias with any stores in
the loop.

This enables hoisting of loads we can prove do not alias any stores in
the loop thanks to the memory runtime checks added during vectorization.

PR: llvm#166247
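
As a hypothetical illustration (not one of the PR's test cases), this is the kind of loop the transform targets:

```c++
// A replicated scalar load from an invariant address inside a vectorizable
// loop. If the runtime alias checks added during vectorization (and the
// resulting scoped-noalias metadata) prove that *scale cannot alias the
// stores to dst, the load of *scale can be hoisted to the loop preheader
// instead of being re-executed in every vector iteration.
void apply_scale(float *dst, const float *src, const float *scale, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = src[i] * *scale; // *scale is loop-invariant
}
```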
)

This generates better codegen when using partial reductions with
predication.

```
partial_reduce_*mla(acc, sel(p, mul(*ext(a), *ext(b)), splat(0)), splat(1))
-> partial_reduce_*mla(acc, sel(p, a, splat(0)), b)

partial.reduce.*mla(acc, sel(p, *ext(op), splat(0)), splat(1))
-> partial.reduce.*mla(acc, sel(p, op, splat(0)), splat(trunc(1)))
```
…lvm#168341)

This is harmless due to the previous checks for > 0, but it is still
confusing for readers.
…154972)

AsmLexer expects the buffer it's provided for lexing to be
NULL-terminated, where the NULL terminator is pointed to by
`CurBuf.end()`. However, this expectation isn't explicitly stated
anywhere.

This commit adds a couple of comments as well as an assert as a means of
documenting this expectation.
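
A minimal sketch of the invariant being documented (the exact comment wording and assert placement in AsmLexer differ; this is only illustrative):

```c++
#include "llvm/ADT/StringRef.h"
#include <cassert>

// Illustrative only: the buffer handed to the lexer is expected to carry a
// NUL terminator at the position pointed to by CurBuf.end().
static void checkNulTerminated(llvm::StringRef CurBuf) {
  (void)CurBuf; // only used in the assert below
  assert(*CurBuf.end() == '\0' &&
         "AsmLexer buffer must be NUL-terminated at CurBuf.end()");
}
```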
…#167705)

Generally, to_tensor and to_buffer already perform sufficient
verification. However, there are some unnecessarily strict constraints:
* a builtin tensor requires its buffer counterpart to always be a memref
* to_buffer on a ranked tensor is required to always return a memref

These checks are assertions (i.e. preconditions); however, they actually
prevent an apparently useful bufferization where builtin tensors could
become custom buffers. Lift these assertions, keeping the
verification procedure unchanged, to allow builtin -> custom
bufferizations at the operation boundary level.
…used in constexpr (llvm#162816)

This PR just resolves the ss/sd part of the AVX512 masked arithmetic intrinsics of llvm#160559.
…s to be used in constexpr (llvm#168496)

### Summary
This PR resolves llvm#160559 - the remaining pd/ps/epi/epu part of the AVX512 masked arithmetic intrinsics.
Add a few patterns for extadd pairwise.
…in (NFC) (llvm#168343)

In 4 years the plugin was never adapted to other object formats. This patch
makes it ELF-specific, which will allow removing some abstractions
down the line. It also moves the plugin from LLVMOrcJIT into
LLVMOrcDebugging, which didn't exist back then.
Nest arguments are supported by the calling convention in X86CallingConv.td. Nothing special
is required in GlobalISel as we reuse that code.

The nest attribute is mostly generated by the Fortran frontend.
…167322)

There was a minor oversight in commit 6836261; the AArch64 GICv5
instruction `GIC CDEOI` takes no operands, since the text of the
specification says:
```
The Rt field should be set to 0b11111. If the Rt field is not
set to 0b11111, it is CONSTRAINED UNPREDICTABLE whether:
* The instruction is UNDEFINED.
* The instruction behaves as if the Rt field is set to 0b11111.
```
This commit adds support for the tcgen05.mma family of instructions in the NVVM MLIR dialect and lowers them to LLVM intrinsics. Please refer to the [PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-mma-instructions) for more information.
…m#148650)

This patch introduces preliminary support for additional memory
locations: target_mem0 and target_mem1. They model memory locations
that cannot be represented with the existing ones.

This was the solution suggested in:
https://discourse.llvm.org/t/rfc-improving-fpmr-handling-for-fp8-intrinsics-in-llvm/86868/6

Currently, these locations are not yet target-specific. The goal is to
enable the compiler to express read/write effects on these resources.
(Reland of llvm#161546, fixing three build and test issues)

This commit adds optimized assembly versions of single-precision float
multiplication and division. Both functions are implemented in a style
that can be assembled as either Arm or Thumb2; for multiplication, a
separate implementation is provided for Thumb1. Extensive new tests
are also added for multiplication and division.

These implementations can be removed from the build by setting the
CMake variable COMPILER_RT_ARM_OPTIMIZED_FP to OFF.

Outlying parts of the functionality that are not on the fast path, such
as NaN handling and underflow, are handled in helper functions written
in C. These can be shared between the Arm/Thumb2 and Thumb1
implementations, and also reused by other optimized assembly functions
we hope to add in the future.
fhahn and others added 6 commits November 18, 2025 11:31
…llvm#166245)

Implement CastInfo from VPRecipeBase to VPIRMetadata to support
isa/dyn_cast. This is similar to CastInfoVPPhiAccessors, supporting
dyn_cast by down-casting to the concrete recipe types inheriting from
VPIRMetadata.

Can be used for more generalized VPIRMetadata printing following
llvm#165825.

PR: llvm#166245
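
With the CastInfo in place, a use site could look like the following hedged sketch; only the type names come from the commit message, and the iteration over a VPBasicBlock (`VPBB`) is illustrative:

```c++
// Illustrative: dyn_cast each recipe to VPIRMetadata; this succeeds only for
// concrete recipe types that inherit from VPIRMetadata.
for (VPRecipeBase &R : *VPBB) {
  if (auto *MD = dyn_cast<VPIRMetadata>(&R)) {
    // e.g. print or merge the IR metadata attached to the recipe
    (void)MD;
  }
}
```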
Fixed llvm#148354

Lower SPIR-V Tan/Tanh ops using the corresponding LLVM intrinsics to
reduce instructions and prevent overflow caused by the previous
`exp`-based expansion.
…7915)

Exceptions include intrinsics that:
* take or return floating point data
* read or write FFR
* read or write memory
* read or write SME state
…m#168427)

This adds handling for f16 and f128 lround/llround under LP64 targets,
promoting the f16 where needed and using a libcall for f128. The
codegen is now identical to the SelectionDAG version.
@z1-cciauto z1-cciauto requested a review from a team November 18, 2025 12:07

@z1-cciauto z1-cciauto merged commit e9b3d79 into amd-staging Nov 18, 2025
5 checks passed
@z1-cciauto z1-cciauto deleted the upstream_merge_202511180707 branch November 18, 2025 14:48