
Conversation


@ronlieb ronlieb commented Oct 21, 2025

No description provided.

mgorny and others added 30 commits October 21, 2025 08:42
Fix incorrect linking and dependencies introduced in llvm#161179 that break
standalone builds of Flang.

Signed-off-by: Michał Górny <mgorny@gentoo.org>
…lvm#161638)

They were previously optimized to not emit any waitcnt, which is
technically correct because there is no reordering of operations at
workgroup scope in CU mode for GFX10+.

This breaks transitivity however, for example if we have the following
sequence of events in one thread:

- some stores
- store atomic release syncscope("workgroup")
- barrier

then another thread follows with

- barrier
- load atomic acquire
- store atomic release syncscope("agent")

It does not work because, while the other thread sees the stores, it
cannot release them at the wider scope. Our release fences aren't strong
enough to "wait" on stores from other waves.

We also cannot strengthen our release fences any further to allow for
releasing other waves' stores, because only GFX12 can do that with
`global_wb`. GFX10-11 do not have the writeback instruction.
It'd also add yet another level of complexity to code sequences, with
both acquire/release having CU-mode only alternatives.
Lastly, acq/rel are always used together. The price for synchronization
has to be paid either at the acq or at the rel. Strengthening the releases
would just make the memory model more complex but wouldn't help
performance.

So the choice here is to streamline the code sequences by making CU and
WGP mode emit almost identical (vL0 inv is not needed in CU mode) code
for release (or stronger) atomic ordering.

This also removes the `vm_vsrc(0)` wait before barriers. Now that the
release fence in CU mode is strong enough, it is no longer needed.

Supersedes llvm#160501
Solves SC1-6454
This adds support for ptrtoaddr in the `ptradd p, ptrtoaddr(p2) -
ptrtoaddr(p) -> p2` fold.

This fold requires that p and p2 have the same underlying object
(otherwise the provenance may not be the same).
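
For illustration, a rough InstCombine-style sketch of the fold. The
`m_PtrToAddr` matcher is assumed here by analogy with the existing
`m_PtrToInt`; the patch's actual code may be structured differently:

```cpp
// Illustrative only: ptradd Ptr, (ptrtoaddr P2 - ptrtoaddr Ptr) --> P2.
// m_PtrToAddr is an assumed matcher, analogous to m_PtrToInt.
using namespace llvm;
using namespace llvm::PatternMatch;

static Value *foldPtrAddOfAddrDiff(GetElementPtrInst &GEP, Value *Ptr) {
  Value *P2;
  if (match(GEP.getOperand(1),
            m_Sub(m_PtrToAddr(m_Value(P2)), m_PtrToAddr(m_Specific(Ptr)))) &&
      getUnderlyingObject(P2) == getUnderlyingObject(Ptr))
    return P2; // same underlying object => same non-address bits
  return nullptr;
}
```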

The argument I would like to make here is that because the underlying
objects are the same (and the pointers are in the same address space), the
non-address bits of the pointer must be the same. Looking at some
specific cases of underlying object relationship:

* phi/select: Trivially true.
* getelementptr: Only modifies address bits; the non-address bits must
remain the same.
* addrspacecast round-trip cast: Must preserve all bits because we
optimize such round-trip casts away.
* non-interposable global alias: I'm a bit unsure about this one, but I
guess the alias and the aliasee must have the same non-address bits?
* various intrinsics like launder.invariant.group and ptrmask: I think
these all either preserve all pointer bits (like the invariant.group
ones) or at least the non-address bits (like ptrmask). There are some
interesting cases like amdgcn.make.buffer.rsrc, but those are cross
address-space.

-----

There is a second `gep (gep p, C), (sub 0, ptrtoint(p)) -> C` transform
in this function, which I am not extending to handle ptrtoaddr, adding
negative tests instead. This transform is overall dubious for provenance
reasons, but especially dubious with ptrtoaddr, as then we don't have
the guarantee that the provenance of `p` has been exposed.
This test uses -debug-only, so needs an assertion-enabled build.
…lvm#160499)

If we already have a dup(x) as part of the DAG along with a
scalar_to_vec(x), we can re-use the result of the dup for the
scalar_to_vec(x).
…lvm#162850)

This reapplication fixes the use after free caused by not properly
updating the bucket list in one case.

Original commit message:
Instead of just calling the single element `erase` on every element of
the range, we can combine some of the operations in a custom
implementation. Specifically, we don't need to search for the previous
node or re-link the list every iteration. Removing this unnecessary work
results in some nice performance improvements:
```
-----------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                             old           new
-----------------------------------------------------------------------------------------------------------------------
std::unordered_set<int>::erase(iterator, iterator) (erase half the container)/0                    457 ns        459 ns
std::unordered_set<int>::erase(iterator, iterator) (erase half the container)/32                   995 ns        626 ns
std::unordered_set<int>::erase(iterator, iterator) (erase half the container)/1024               18196 ns       7995 ns
std::unordered_set<int>::erase(iterator, iterator) (erase half the container)/8192              124722 ns      70125 ns
std::unordered_set<std::string>::erase(iterator, iterator) (erase half the container)/0            456 ns        461 ns
std::unordered_set<std::string>::erase(iterator, iterator) (erase half the container)/32          1183 ns        769 ns
std::unordered_set<std::string>::erase(iterator, iterator) (erase half the container)/1024       27827 ns      18614 ns
std::unordered_set<std::string>::erase(iterator, iterator) (erase half the container)/8192      266681 ns     226107 ns
std::unordered_map<int, int>::erase(iterator, iterator) (erase half the container)/0               455 ns        462 ns
std::unordered_map<int, int>::erase(iterator, iterator) (erase half the container)/32              996 ns        659 ns
std::unordered_map<int, int>::erase(iterator, iterator) (erase half the container)/1024          15963 ns       8108 ns
std::unordered_map<int, int>::erase(iterator, iterator) (erase half the container)/8192         136493 ns      71848 ns
std::unordered_multiset<int>::erase(iterator, iterator) (erase half the container)/0               454 ns        455 ns
std::unordered_multiset<int>::erase(iterator, iterator) (erase half the container)/32              985 ns        703 ns
std::unordered_multiset<int>::erase(iterator, iterator) (erase half the container)/1024          16277 ns       9085 ns
std::unordered_multiset<int>::erase(iterator, iterator) (erase half the container)/8192         125736 ns      82710 ns
std::unordered_multimap<int, int>::erase(iterator, iterator) (erase half the container)/0          457 ns        454 ns
std::unordered_multimap<int, int>::erase(iterator, iterator) (erase half the container)/32        1091 ns        646 ns
std::unordered_multimap<int, int>::erase(iterator, iterator) (erase half the container)/1024     17784 ns       7664 ns
std::unordered_multimap<int, int>::erase(iterator, iterator) (erase half the container)/8192    127098 ns      72806 ns
```


This reverts commit acc3a62.
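
To make the trick above concrete, here is a self-contained sketch of the
idea on a plain singly linked list (not libc++'s actual internals, which
also have to maintain per-bucket pointers):

```cpp
struct Node { int value; Node *next; };

// Erase the run [first, last) with ONE predecessor search and ONE relink,
// instead of repeating both for every erased element.
void erase_range(Node &sentinel, Node *first, Node *last) {
  Node *prev = &sentinel;
  while (prev->next != first) // single search for the previous node
    prev = prev->next;
  while (first != last) {     // destroy the run without touching links
    Node *next = first->next;
    delete first;
    first = next;
  }
  prev->next = last;          // single relink of the list
}
```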
Some instances of the `Operand` class used in Tablegen instruction
definitions expand to a cluster of multiple operands at the MC layer,
such as complex addressing modes involving base + offset + shift, or
clusters of operands describing conditional Arm instructions or
predicated MVE instructions. There's currently no convenient way for C++
code to know the offset of one of those sub-operands from the start of
the cluster: instead it just hard-codes magic numbers like `index+2`,
which is hard to read and fragile.

This patch adds an extra piece of output to `InstrInfoEmitter` to define
those instruction offsets, based on the name of the `Operand` class
instance in Tablegen, and the names assigned to the sub-operands in the
`MIOperandInfo` field. For example, if target Foo were to define

  def Bar : Operand {
    let MIOperandInfo = (ops GPR:$first, i32imm:$second);
    // ...
  }

then the new constants would be `Foo::SUBOP_Bar_first` and
`Foo::SUBOP_Bar_second`, defined as 0 and 1 respectively.

As an example, I've converted some magic numbers related to the MVE
predication operand types (`vpred_n` and its superset `vpred_r`) to use
the new named constants in place of the integer literals they previously
used. This is more verbose, but also clearer, because it explains why
each integer is chosen rather than just stating its value.
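
For instance, a consumer of the `Bar` example above might look like this
(hypothetical target code, shown only to contrast the two styles):

```cpp
// 'FirstIdx' is the MC-level index where the Bar operand cluster starts.
int64_t getBarSecond(const llvm::MachineInstr &MI, unsigned FirstIdx) {
  // Before: MI.getOperand(FirstIdx + 1) -- a fragile magic number.
  // After: the constant names which sub-operand is meant.
  return MI.getOperand(FirstIdx + Foo::SUBOP_Bar_second).getImm();
}
```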
While debugging the tests for llvm#155000 I found it helpful to have both
sides of the simulated gdb-rsp traffic rather than just the responses, so
I've extended the packetLog in MockGDBServerResponder to record traffic in
both directions. Tests have been updated accordingly.
…pose (llvm#161841)

## Description
Adds a new canonicalizer that folds
`vector.from_elements(vector.transpose)` => `vector.from_elements`.
This canonicalization reorders the input elements for
`vector.from_elements` and adjusts the output shape to match the effect of
the transpose op, eliminating the need for it.

## Testing
Added a 2D vector lit test that verifies the rewrite.

---------

Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>
The dependence testing functions in DA assume that the analyzed AddRec
does not wrap over the entire iteration space. For AddRecs that may
wrap, DA should conservatively return unknown dependence. However, no
validation is currently performed to ensure that this condition holds,
which can lead to incorrect results in some cases.

This patch introduces the notion of *monotonicity* and a validation
logic to check whether a SCEV is monotonic. The monotonicity check
classifies the SCEV into one of the following categories:

- Unknown: Nothing is known about the monotonicity of the SCEV.
- Invariant: The SCEV is loop-invariant.
- MultivariateSignedMonotonic: The SCEV doesn't wrap in a signed sense
for any iteration of the loops in the loop nest.

The current validation logic basically traverses an affine AddRec
recursively and checks whether the `nsw` flag is present. Notably, it is
still unclear whether we should also have a category for unsigned
monotonicity.
The monotonicity check is still under development and disabled by
default for now. Since such a check is necessary to make DA sound, it
should be enabled by default once the functionality is sufficient.
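
A rough sketch of the shape such a check takes (illustrative only; the
patch's actual logic lives in DependenceAnalysis and is more involved):

```cpp
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"
using namespace llvm;

enum class Monotonicity { Unknown, Invariant, MultivariateSignedMonotonic };

static Monotonicity classify(const SCEV *S, ScalarEvolution &SE,
                             const Loop *L) {
  if (SE.isLoopInvariant(S, L))
    return Monotonicity::Invariant;
  if (const auto *AR = dyn_cast<SCEVAddRecExpr>(S)) {
    // Reject non-affine AddRecs and those without the nsw guarantee.
    if (!AR->isAffine() || !AR->hasNoSignedWrap())
      return Monotonicity::Unknown;
    // Start and step must themselves be classifiable.
    if (classify(AR->getStart(), SE, L) == Monotonicity::Unknown ||
        classify(AR->getStepRecurrence(SE), SE, L) == Monotonicity::Unknown)
      return Monotonicity::Unknown;
    return Monotonicity::MultivariateSignedMonotonic;
  }
  return Monotonicity::Unknown;
}
```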

Split off from llvm#154527.
Replace with trunc(add(X,Y)) to avoid premature folding in upcoming patch llvm#164227
Let Clang emit `llvm.tbaa.errno` metadata in order to let LLVM
carry out optimizations around errno-writing libcalls, as
long as it is proven that the involved memory location does not alias
`errno`.

Previous discussion: https://discourse.llvm.org/t/rfc-modelling-errno-memory-effects/82972.
Split off from PR llvm#163525, this standalone patch replaces
the use of undef as incoming PHI values with zero, in order
to reduce the likelihood of contributors hitting the
`undef deprecator` warning on GitHub.
PR llvm#157084 added an option `da-run-siv-routines-only` to run only SIV
routines in DA. This PR replaces that option with a more fine-grained
one that allows selecting routines other than SIV as well. This option
is useful for regression testing of individual DA routines. This patch
also reorganizes the regression tests that use `da-run-siv-routines-only`.
)

Part of llvm#102817.

This is a natural follow-up to llvm#163006. We are forwarding
`std::generate_n` to `std::__for_each_n` (`std::for_each_n` requires
C++17), resulting in improved performance for segmented iterators.

before:

```
std::generate_n(deque<int>)/32          17.5 ns         17.3 ns     40727273
std::generate_n(deque<int>)/50          25.7 ns         25.5 ns     26352941
std::generate_n(deque<int>)/1024         490 ns          487 ns      1445161
std::generate_n(deque<int>)/8192        3908 ns         3924 ns       179200
```

after:

```
std::generate_n(deque<int>)/32          11.1 ns         11.0 ns     64000000
std::generate_n(deque<int>)/50          16.1 ns         16.0 ns     44800000
std::generate_n(deque<int>)/1024         291 ns          292 ns      2357895
std::generate_n(deque<int>)/8192        2269 ns         2250 ns       298667
```
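
The forwarding itself is conceptually simple; a minimal sketch in terms of
the public `std::for_each_n` (libc++ really targets the internal
`std::__for_each_n`, which carries the segmented-iterator optimization and
has no C++17 requirement):

```cpp
#include <algorithm>
#include <deque>

template <class OutputIt, class Size, class Gen>
OutputIt generate_n_via_for_each(OutputIt first, Size n, Gen gen) {
  // Assign through for_each_n; a segmented-iterator-aware for_each_n can
  // then process each deque segment as one contiguous chunk.
  return std::for_each_n(first, n, [&](auto &elem) { elem = gen(); });
}

int main() {
  std::deque<int> d(1024);
  int i = 0;
  generate_n_via_for_each(d.begin(), d.size(), [&] { return i++; });
}
```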
There are cases where `addEntryPointAtOffset` is called with a given
`Offset` that points to an address within a constant island. This
triggers `assert(!isInConstantIsland(EntryPointAddress))` and causes BOLT
to crash. This patch adds a check which ignores functions that would add
such entry points and warns the user.
In both verbose and non-verbose mode we will now use the
`llvm::dwarf::LanguageDescription` to turn the version into a human
readable string. In verbose mode we also display the raw version code
(similar to how we display addresses in verbose mode). To make the
version code and the prettified name easier to distinguish, we print the
prettified name in colour (if available), which is consistent with how
`DW_AT_language` is printed in colour.

Before:
```
0x0000000c: DW_TAG_compile_unit                                                                           
              DW_AT_language_name       (DW_LNAME_C)                                                      
              DW_AT_language_version    (201112)             
```
After:
```
0x0000000c: DW_TAG_compile_unit                                                                           
              DW_AT_language_name       (DW_LNAME_C)                                                      
              DW_AT_language_version    (201112 C11)                                                             
```
…runc(x),trunc(x)) (llvm#164227)

We're very careful not to truncate binary arithmetic ops if it will
affect legality, or cause additional truncation instructions, hence we
currently limit this to cases where one operand is constant.

But if both ops are the same (i.e. for some add/mul cases) then we
wouldn't increase the number of truncations, so can be slightly more
aggressive at folding the truncation.
I noticed a couple more small optimization opportunities when generating
DWARF expressions from the internal
DW_OP_LLVM_extract_bits_* operations:

* DW_OP_deref can be used, rather than DW_OP_deref_size, when the deref
size is the word size.

* If the bit offset is 0 and an unsigned extraction is desired, then
sometimes the shifting can be skipped entirely, or replaced with
DW_OP_and.
llvm#149706)

Move narrowInterleaveGroups to the general VPlan optimization stage.

To do so, narrowInterleaveGroups now has to find a suitable VF where all
interleave groups are consecutive and saturate the full vector width.

If such a VF is found, the original VPlan is split into 2:
 a) a new clone which contains all VFs of Plan, except VFToOptimize, and
 b) the original Plan with VFToOptimize as single VF.

The original Plan is then optimized. If a new copy for the other VFs has
been created, it is returned and the caller has to add it to the list of
candidate plans.

Together with llvm#149702, this
allows taking the narrowed interleave groups into account when
computing costs to choose the best VF and interleave count.

One example where we currently miss interleaving/unrolling when
narrowing interleave groups is https://godbolt.org/z/Yz77zbacz

PR: llvm#149706
From Sam Liu:
>CUDA supports thread block clusters
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#thread-block-clusters
>
>In their atomic intrinsics, cluster scope is supported
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#nv-atomic-fetch-add-and-nv-atomic-add
>
>For compatibility, clang and hip needs to support cluster scope.
…op induction variables (llvm#161117)

Fix llvm#157934. In liveness
analysis, variables that are not analyzed are set as dead variables, but
some variables are definitely live.

---------

Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
llvm#163598)

While my objective is to make the shrinkfp path safe for ConstantFP
based splats, I discovered that the following issues also affect
ConstantVector based splats:

1. PreferBFloat is not set for bfloat vectors.
2. getMinimumFPType() returns a scalar type for vector constants where
getSplatValue() is successful.
Try to remove `UnsafeFPMath` uses in PowerPC backend. These global flags
block some improvements like
https://discourse.llvm.org/t/rfc-honor-pragmas-with-ffp-contract-fast/80797.
Remove them incrementally.

FP operations that may raise exceptions are replaced by constrained
intrinsics. However, vector types are not supported by these intrinsics.
)

Add "sourced" in a few places where OmpBlockConstruct was created.
cachemeifyoucan and others added 18 commits October 21, 2025 17:08
Fix MSAN failure and expensive test failure.
Implement MIR2Vec embedder for generating vector representations of Machine IR instructions, basic blocks, and functions. This patch introduces changes necessary to *embed* machine opcodes. Machine operands would be handled incrementally in the upcoming patches.
)

Based on the double precision's sin/cos fast path algorithm:

Step 1: Perform range reduction `y = x mod pi/8` with target errors <
2^-54.
This is because the worst-case mod pi/8 for single precision is ~2^-31,
so to have up to 1 ULP of error from
the range reduction, the targeted error should be `2^(-31 - 23) =
2^-54`.
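
Equivalently, with `k` the integer nearest to `x * 8/pi`, the reduction
writes:
```math
x = k \cdot \frac{\pi}{8} + y, \qquad |y| \le \frac{\pi}{16}
```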

Step 2: Polynomial approximation
We use degree-5 and degree-4 polynomials to approximate sin and cos of
the reduced angle respectively.

Step 3: Combine the results using trig identities
```math
\begin{align*}
  \sin(x) &= \sin(y) \cdot \cos(k \cdot \frac{\pi}{8}) + \cos(y) \cdot \sin(k \cdot \frac{\pi}{8}) \\
  \cos(x) &= \cos(y) \cdot \cos(k \cdot \frac{\pi}{8}) - \sin(y) \cdot \sin(k \cdot \frac{\pi}{8})
\end{align*}
```

Overall errors: <= 3 ULPs for default rounding modes (tested
exhaustively).

Current limitation: large range reduction requires FMA instructions for
binary32. This restriction will be removed in the followup PR.

---------

Co-authored-by: Petr Hosek <phosek@google.com>
…lvm#164455)

Add attributes to the unit tests required to pass `spirv-val`.


Addresses llvm#161852
)

Create a POSIX `<nl_types.h>` header with `catopen`, `catclose`, and
`catgets` function declarations.
Provide stub/placeholder implementations which always return an error.
This is consistent with the way locales are currently (un-)implemented
in llvm-libc.
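
A minimal sketch of what such always-failing stubs can look like
(illustrative; llvm-libc's actual entrypoints are structured differently):

```cpp
#include <errno.h>
#include <nl_types.h> // the header added by this patch

nl_catd catopen(const char *name, int flag) {
  errno = EINVAL;      // no message catalogs are supported
  return (nl_catd)-1;  // POSIX error return for catopen
}

char *catgets(nl_catd catalog, int set_id, int msg_id, const char *s) {
  return (char *)s;    // on failure, catgets returns the supplied default
}

int catclose(nl_catd catalog) {
  errno = EBADF;       // nothing was ever opened
  return -1;
}
```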

Notably, providing `<nl_types.h>` fixes the last remaining issue with
building libc++ against llvm-libc
(on certain configurations of x86_64 Linux) after disabling threads and
wide characters in libc++.
Upstream the basic support for the C++ try-catch statement, with a try
block that doesn't contain any call instructions and with a catch-all
handler.

Issue llvm#154992
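
Concretely, the kind of source now supported looks like this (illustrative):

```cpp
int f() {
  int x = 0;
  try {
    x = 1;        // straight-line code only: no call instructions
  } catch (...) { // catch-all handler
    x = 2;
  }
  return x;
}
```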
With llvm#163862, this is not really necessary and causes downstream issues.
llvm#162332)

Originally llvm#161912, we've now decided that an explicit GPL notification
is redundant with the LICENSE file, which is a common convention for
relaying this information.

Co-authored-by: Cameron McInally <cmcinally@nvidia.com>
…lvm#164346)

llvm#140443 makes use of the CMake
variable `Python3_EXECUTABLE_DEBUG`, which was introduced in CMake
version 3.30. On systems with an older version of CMake, the lit
tests will try to run with an empty `config.python_executable`.

This PR adds a warning and falls back to using `Python3_EXECUTABLE` if
the CMake version is less than `3.30`.
Fixes test failure issues (caused by llvm#162161) in Windows buildbots.
This introduces support for the 32-bit ARM Fuchsia target, which uses the
aapcs-linux ABI, defaulting to thumbv8a as the target.
Implementation files using the Intel syntax explicitly specify it.
Do the same for the few files using AT&T syntax.

This also enables building LLVM with `-mllvm -x86-asm-syntax=intel` in one's Clang config files
(i.e. a global preference for Intel syntax).

No functional change intended.
…IL target (llvm#164472)

This is a temporary measure to explicitly remove the unrecognized named
metadata when targeting DXIL.

This should be replaced with an allowlist, as tracked here:
llvm#164473.
This commit introduces a base-class implementation for a method that
reads memory from multiple ranges at once. This implementation simply
calls the underlying `ReadMemoryFromInferior` method on each requested
range, intentionally bypassing the memory caching mechanism (though this
may be easily changed in the future).

`Process` implementations that can perform this operation more
efficiently - e.g. with the MultiMemPacket described in [1] - are
expected to override this method.

As an example, this commit changes AppleObjCClassDescriptorV2 to use the
new API.

Note about the API
------------------

In the RFC, we discussed having the API return some kind of class
`ReadMemoryRangesResult`. However, while writing such a class, it became
clear that it was merely wrapping a vector, without providing anything
useful. For example, this class:

```
struct ReadMemoryRangesResult {
  ReadMemoryRangesResult(
      llvm::SmallVector<llvm::MutableArrayRef<uint8_t>> ranges)
      : ranges(std::move(ranges)) {}

  llvm::ArrayRef<llvm::MutableArrayRef<uint8_t>> getRanges() const {
    return ranges;
  }

private:
  llvm::SmallVector<llvm::MutableArrayRef<uint8_t>> ranges;
};
```

As can be seen in the added test and in the added use-case
(AppleObjCClassDescriptorV2), users of this API will just iterate over
the vector of memory buffers. So they want a return type that can be
iterated over, and the vector seems more natural than creating a new
class and defining iterators for it.

Likewise, in the RFC, we discussed wrapping the result into an
`Expected`. Upon experimenting with the code, this feels like it limits
what the API is able to do: the base class implementation never needs
to fail the entire result; it's the individual reads that may fail, and
this is expressed through a zero-length result. Any derived classes
overriding `ReadMemoryRanges` should likewise never produce a top-level
failure: when one would occur, they can just fall back to the base class
implementation, which would produce a better result.

The choice of having the caller allocate a buffer and pass it to
`Process::ReadMemoryRanges` is done mostly to follow conventions already
done in the Process class.
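
Putting the pieces together, the base-class behavior described above
amounts to something like the following sketch (the exact signature and
range type in the patch may differ):

```cpp
// Sketch only: 'ranges' uses lldb_private's Range type; 'buffer' is the
// caller-allocated backing store, assumed large enough for all ranges.
llvm::SmallVector<llvm::MutableArrayRef<uint8_t>>
ReadRangesUncached(lldb_private::Process &process,
                   llvm::ArrayRef<lldb_private::Range<lldb::addr_t, size_t>>
                       ranges,
                   llvm::MutableArrayRef<uint8_t> buffer) {
  llvm::SmallVector<llvm::MutableArrayRef<uint8_t>> results;
  size_t offset = 0;
  for (const auto &range : ranges) {
    lldb_private::Status error;
    // Intentionally bypass the memory cache, as described above.
    const size_t bytes_read = process.ReadMemoryFromInferior(
        range.GetRangeBase(), buffer.data() + offset, range.GetByteSize(),
        error);
    // A failed read becomes a zero-length result, never a top-level error.
    results.push_back(buffer.slice(offset, error.Fail() ? 0 : bytes_read));
    offset += range.GetByteSize();
  }
  return results;
}
```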



[1]:
https://discourse.llvm.org/t/rfc-a-new-vectorized-memory-read-packet/
…NFC)

Split off to clarify naming, as suggested in
llvm#156262.

@ronlieb ronlieb requested a review from kzhuravl October 21, 2025 22:54
@ronlieb ronlieb merged commit 20410c0 into amd-staging Oct 22, 2025
13 checks passed
@ronlieb ronlieb deleted the amd/merge/upstream_merge_20251021171255 branch October 22, 2025 02:21
