Conversation

@z1-cciauto
Collaborator

No description provided.

matthias-springer and others added 30 commits November 18, 2025 13:47
)

Add pass options to run lowerings to NVVM without pattern rollback. This
makes the dialect conversions easier to debug and improves
performance/memory usage.
The index == 0 scenario has already been handled by the early return, so
only the upper-half scenario is relevant here.
llvm#168392)

Once llvm#149042 is relanded, we will start EVL tail folding
vectorized loops that have live-outs, e.g.:

```c
int f(int *x, int n) {
  int y = 0;
  for (int i = 0; i < n; i++) {
    y = x[i] + 1;
    x[y] = y;
  }
  return y;
}
```

These are vectorized by extracting the last "active lane" in the loop's
exit:

```llvm
loop:
  %vl = call i32 @llvm.experimental.get.vector.length(i64 %avl, i32 4, i1 true)
  ...

exit:
  %lastidx = sub i64 %vl, 1
  %lastelt = extractelement <vscale x 4 x i32> %y, i64 %lastidx
```

On RISC-V this translates to a vslidedown.vx with a VL of 1:

```llvm
bb.loop:
    %vl:gprnox0 = PseudoVSETVLI ...
    %y:vr = PseudoVADD_VI_M1 $noreg, %x, 1,  AVL=-1
    ...
bb.exit:
    %lastidx:gprnox0 = ADDI %vl, -1
    %w:vr = PseudoVSLIDEDOWN_VX_M1 $noreg, %y, %lastidx, AVL=1
```

However, today we fail to reduce the VL of %y in the loop and end up
with two extra VL toggles. The reason is that RISCVVLOptimizer is
currently conservative with vslidedown.vx, since it can read lanes of
%y past its own VL, so in `getMinimumVLForUser` we say that
vslidedown.vx demands the entirety of %y.

One observation about the sequence above is that it only actually needs
to read the first %vl lanes of %y: with a VL of 1, the highest lane of vs2
it reads is the lane at `offset`, so it demands the first offset + 1 lanes.
In this case, that's `%lastidx + 1 = %vl - 1 + 1 = %vl`.

This PR teaches RISCVVLOptimizer about this case in
`getMinimumVLForVSLIDEDOWN_VX`, and in doing so removes the VL toggles.

The one case I had to think about for a bit was when `ADDI %vl,
-1` wraps, i.e. when %vl=0 and the resulting offset is all ones. That
offset is always larger than the largest possible VLMAX, so vs2 will be
completely slid down and absent from the output, and we don't need to
read anything from it.

This patch on its own has no observable effect on llvm-test-suite or
SPEC CPU 2017 w/ rva23u64 today.
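
To make the intuition concrete, here is a rough C++ sketch of the demanded-lanes reasoning; the helper below is hypothetical and is not the actual `getMinimumVLForVSLIDEDOWN_VX` from RISCVVLOptimizer:

```c++
// Conceptual sketch only -- not the actual RISCVVLOptimizer code. With an
// offset of `Offset` and a VL of `UserVL`, a vslidedown.vx reads lanes
// Offset .. Offset + UserVL - 1 of vs2, so it demands the first
// Offset + UserVL lanes. If the offset computation wrapped (ADDI %vl, -1
// with %vl == 0), the offset exceeds any VLMAX and no lanes of vs2 are read.
static unsigned demandedVS2Lanes(unsigned Offset, unsigned UserVL,
                                 unsigned VLMax) {
  if (Offset >= VLMax)
    return 0;               // vs2 is completely slid out of the result
  return Offset + UserVL;   // e.g. (%vl - 1) + 1 == %vl in the example above
}
```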
### Summary
This PR resolves llvm#163895.
It just adds the fcmp-sse part of the X86 vector builtins for CIR.

---------

Co-authored-by: liuzhenya <zyliu@siorigin.com>
We need to fall through here in case we're not jumping to the labels.
This is only needed in expression contexts.
Add a pass option to `convert-scf-to-cf` to deactivate pattern rollback
for better performance. The SCF->CF lowering patterns benefit a lot
from this feature because `splitBlock` is expensive in the rollback
driver.
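
For context, this is roughly what driving a conversion without rollback looks like in C++; `ConversionConfig::allowPatternRollback` is assumed here, and the exact pass-option name added by this commit is not shown:

```c++
// Sketch inside a conversion pass, assuming the allowPatternRollback knob on
// mlir::ConversionConfig used by the no-rollback driver; check the MLIR
// documentation for the authoritative API.
mlir::ConversionConfig config;
config.allowPatternRollback = false; // fail instead of rolling back patterns
if (failed(applyPartialConversion(getOperation(), target,
                                  std::move(patterns), config)))
  signalPassFailure();
```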
…vm#168430)

Updated the evaluate handler to check for DAP ErrorResponse bodies,
which are used to display user errors when a request fails. This was
updated in PR llvm#167720.

This should fix https://lab.llvm.org/buildbot/#/builders/163
…ts (llvm#166851)

`-fsanitize=address,fuzzer` should be rejected like
`-fsanitize=fuzzer,address`.
The address sanitizer enables the device sanitizer pipeline. The fuzzer
implicitly turns on LLVM's SanitizerCoverage, which the driver then
forwards to the device cc1. SanitizerCoverage is not supported on
amdgcn.
…vm#168058)

Also breaks the long inheritance chains by making both `SIGfx10CacheControl` and `SIGfx12CacheControl` inherit from `SICacheControl` directly.

With this patch, we now have just 3 `SICacheControl` implementations that each do their own thing, and there is no more code hidden 3 superclasses above (which made this code harder to read and maintain than it needed to be).
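
A minimal sketch of the flattened hierarchy; only the two classes named above and the direct inheritance come from the commit message, the third implementation's name and the comments are guesses:

```c++
// Each generation now implements SICacheControl directly instead of
// inheriting behavior through several layers of superclasses.
class SICacheControl {
public:
  virtual ~SICacheControl() = default;
  // ... cache-control hooks ...
};

class SIGfx6CacheControl  : public SICacheControl { /* pre-GFX10 handling */ };
class SIGfx10CacheControl : public SICacheControl { /* GFX10/11 handling  */ };
class SIGfx12CacheControl : public SICacheControl { /* GFX12 handling     */ };
```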
1. Fixed 2 DTLTO cache tests that failed on macOS because the input to the grep
command differs from Windows.
2. Removed unneeded comments from dtlto-cache.ll.
…m#166360)

This will be used to support ZT0 in the MachineSMEABIPass.
…lvm#166247)

This patch implements a transform to hoist single-scalar replicated
loads with invariant addresses out of the vector loop to the preheader
when scoped noalias metadata proves they cannot alias with any stores in
the loop.

This enables hoisting of loads we can prove do not alias any stores in
the loop thanks to the memory runtime checks added during vectorization.

PR: llvm#166247
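
As a hypothetical illustration (not one of the PR's test cases), this is the kind of loop the transform targets:

```c++
// A replicated scalar load from an invariant address inside a vectorizable
// loop. If the runtime alias checks added during vectorization (and the
// resulting scoped-noalias metadata) prove that *scale cannot alias the
// stores to dst, the load of *scale can be hoisted to the loop preheader
// instead of being re-executed in every vector iteration.
void apply_scale(float *dst, const float *src, const float *scale, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = src[i] * *scale; // *scale is loop-invariant
}
```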
)

This generates better codegen when using partial reductions with
predication.

```
partial_reduce_*mla(acc, sel(p, mul(*ext(a), *ext(b)), splat(0)), splat(1))
-> partial_reduce_*mla(acc, sel(p, a, splat(0)), b)

partial.reduce.*mla(acc, sel(p, *ext(op), splat(0)), splat(1))
-> partial.reduce.*mla(acc, sel(p, op, splat(0)), splat(trunc(1)))
```
…lvm#168341)

This is harmless due to the previous checks for > 0, but it is still
confusing for readers.
…154972)

AsmLexer expects the buffer it's provided for lexing to be
NULL-terminated, where the NULL terminator is pointed to by
`CurBuf.end()`. However, this expectation isn't explicitly stated
anywhere.

This commit adds a couple of comments as well as an assert as a means of
documenting this expectation.
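
A minimal sketch of the invariant being documented (the exact comment wording and assert placement in AsmLexer differ; this is only illustrative):

```c++
#include "llvm/ADT/StringRef.h"
#include <cassert>

// Illustrative only: the buffer handed to the lexer is expected to carry a
// NUL terminator at the position pointed to by CurBuf.end().
static void checkNulTerminated(llvm::StringRef CurBuf) {
  (void)CurBuf; // only used in the assert below
  assert(*CurBuf.end() == '\0' &&
         "AsmLexer buffer must be NUL-terminated at CurBuf.end()");
}
```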
…#167705)

Generally, to_tensor and to_buffer already perform sufficient
verification. However, there are some unnecessarily strict constraints:
* a builtin tensor requires its buffer counterpart to always be a memref
* to_buffer on a ranked tensor is required to always return a memref

These checks are assertions (i.e. preconditions); however, they actually
prevent an apparently useful bufferization where builtin tensors could
become custom buffers. Lift these assertions, keeping the
verification procedure unchanged, to allow builtin -> custom
bufferizations at the operation boundary level.
…used in constexpr (llvm#162816)

This PR just resolves the ss/sd part of the AVX512 masked arithmetic intrinsics of llvm#160559.
…s to be used in constexpr (llvm#168496)

### Summary
This PR resolves llvm#160559 - the remaining pd/ps/epi/epu part of the AVX512 masked arithmetic intrinsics.
Add a few patterns for extadd pairwise.
…in (NFC) (llvm#168343)

In 4 years the plugin was never adapted to other object formats. This patch
makes it ELF-specific, which will allow removing some abstractions
down the line. It also moves the plugin from LLVMOrcJIT into
LLVMOrcDebugging, which didn't exist back then.
Nest arguments are supported by the calling convention in X86CallingConv.td. Nothing special
is required in GlobalISel as we reuse that code.

The nest attribute is mostly generated by the Fortran frontend.
…167322)

There was a minor oversight in commit 6836261; the AArch64 GICv5
instruction `GIC CDEOI` takes no operands, since the text of the
specification says:
```
The Rt field should be set to 0b11111. If the Rt field is not
set to 0b11111, it is CONSTRAINED UNPREDICTABLE whether:
* The instruction is UNDEFINED.
* The instruction behaves as if the Rt field is set to 0b11111.
```
This commit adds support for the tcgen05.mma family of instructions in the NVVM MLIR dialect and lowers them to LLVM intrinsics. Please refer to the [PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-mma-instructions) for more information.
…m#148650)

This patch introduces preliminary support for additional memory
locations: target_mem0 and target_mem1. They model memory locations
that cannot be represented with the existing ones.

This was the solution suggested in:
https://discourse.llvm.org/t/rfc-improving-fpmr-handling-for-fp8-intrinsics-in-llvm/86868/6

Currently, these locations are not yet target-specific. The goal is to
enable the compiler to express read/write effects on these resources.
(Reland of llvm#161546, fixing three build and test issues)

This commit adds optimized assembly versions of single-precision float
multiplication and division. Both functions are implemented in a style
that can be assembled as either Arm or Thumb2; for multiplication, a
separate implementation is provided for Thumb1. Extensive new tests
are also added for multiplication and division.

These implementations can be removed from the build by setting the
CMake variable COMPILER_RT_ARM_OPTIMIZED_FP to OFF.

Outlying parts of the functionality that are not on the fast path, such
as NaN handling and underflow, are handled in helper functions written
in C. These can be shared between the Arm/Thumb2 and Thumb1
implementations, and also reused by other optimized assembly functions
we hope to add in the future.
fhahn and others added 6 commits November 18, 2025 11:31
…llvm#166245)

Implement CastInfo from VPRecipeBase to VPIRMetadata to support
isa/dyn_cast. This is similar to CastInfoVPPhiAccessors, supporting
dyn_cast by down-casting to the concrete recipe types inheriting from
VPIRMetadata.

Can be used for more generalized VPIRMetadata printing following
llvm#165825.

PR: llvm#166245
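
With the CastInfo in place, a use site could look like the following hedged sketch; only the type names come from the commit message, and the iteration over a VPBasicBlock (`VPBB`) is illustrative:

```c++
// Illustrative: dyn_cast each recipe to VPIRMetadata; this succeeds only for
// concrete recipe types that inherit from VPIRMetadata.
for (VPRecipeBase &R : *VPBB) {
  if (auto *MD = dyn_cast<VPIRMetadata>(&R)) {
    // e.g. print or merge the IR metadata attached to the recipe
    (void)MD;
  }
}
```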
Fixed llvm#148354

Lower SPIR-V Tan/Tanh ops using the corresponding LLVM intrinsics to
reduce instructions and prevent overflow caused by the previous
`exp`-based expansion.
…7915)

Exceptions include intrinsics that:
* take or return floating point data
* read or write FFR
* read or write memory
* read or write SME state
…m#168427)

This adds handling for f16 and f128 lround/llround under LP64 targets,
promoting the f16 where needed and using a libcall for f128. The
codegen is now identical to the SelectionDAG version.
@z1-cciauto z1-cciauto requested a review from a team November 18, 2025 12:07

@z1-cciauto z1-cciauto merged commit e9b3d79 into amd-staging Nov 18, 2025
5 checks passed
@z1-cciauto z1-cciauto deleted the upstream_merge_202511180707 branch November 18, 2025 14:48