Skip to content

Conversation

@z1-cciauto
Copy link
Collaborator

No description provided.

cmc-rep and others added 30 commits November 26, 2025 17:34
`[[nodiscard]]` should be applied to functions where discarding the
return value is most likely a correctness issue.

-
https://libcxx.llvm.org/CodingGuidelines.html#apply-nodiscard-where-relevant
…thIf` (llvm#169606)

This commit adds support for `replaceUsesWithIf` (and variants such as
`replaceAllUsesExcept`) to the `ConversionPatternRewriter`. This API is
supported only in no-rollback mode. An assertion is triggered in
rollback mode. (This missing assertion has been confusing for users
because it seemed that the API supported, while it was actually not
working properly.)

This commit brings us a bit closer towards removing
[this](https://github.com/llvm/llvm-project/blob/76ec25f729fcc7ae576caf21293cc393e68e7cf7/mlir/lib/Transforms/Utils/DialectConversion.cpp#L1214)
workaround.

Additional changes are needed to support this API in rollback mode. In
particular, no entries should be added to the `ConversionValueMapping`
for conditional replacements. It's unclear at this point if this API can
be supported in rollback mode, so this is deferred to later.

This commit turns `replaceUsesWithIf` into a virtual function, so that
the `ConversionPatternRewriter` can override it. All other API functions
for conditional value replacements call that function.

Note for LLVM integration: If you are seeing failed assertions due to
this change, you are using unsupported API in your dialect conversion.
You have 3 options: (1) Migrate to the no-rollback driver. (2) Rewrite
your patterns without the unsupported API. (3) Last resort: bypass the
rewriter and call `replaceUsesWithIf` etc. directly on the `Value`
object.
`[[nodiscard]]` should be applied to functions where discarding the
return value is most likely a correctness issue.
-
https://libcxx.llvm.org/CodingGuidelines.html#apply-nodiscard-where-relevant
…lvm#169744)

# Summary
This is a forward fix for test errors from
llvm#163653.

The PR moved debugger initialization outside of
InitializeRequestHandler, and into Launch/AttachRequestHandlers to
support DAP sessions sharing debugger instances for dynamically created
targets. However, DExTer's DAP class seemed to set breakpoints before
the debugger was initialized, which caused the tests to hang waiting for
a breakpoint to hit due to none of the breakpoints getting resolved.

# Tests
```
bin/llvm-lit -v /home/qxy11/llvm/llvm-project/cross-project-tests/debuginfo-tests/dexter-tests/
```
…69005)

This change adds the `RN` and `RZ` rounding modes to the
`convert.f32x2.to.f16x2` and `convert.f32x2.to.bf16x2` Ops.

Tests are added in `convert_fp16x2.mlir` and
`invalid_convert_fp16x2.mlir`.
Tests with these Ops in `convert_stochastic_rounding.mlir` and
`invalid-convert-stochastic-rounding.mlir` have been removed or
modified.

PTX spec reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt
… types (llvm#168188)

Add createIntrinsicCall overload that accepts return type and arguments,
automatically resolve overload types rather than requiring manual
computation. Simplifies NVVM_PrefetchOp by removing conditional overload
logic.
… macro arguments (llvm#169757)

When the start and end token are both spelled in macro arguments, we
still want to reject the range if they come from two separate macro
arguments, as the original specified range is not precisely spelled in a
single sequence of characters in source.
…vm#169517)

This commit extends the CGProfile module flags export with support for missing function references. Previously, this caused a crash and now it's properly exported to `null` values in the metadata node.
Fixes: llvm#160717
We emit this diagnostic from CheckPointerToIntegralCast() already, so
remove the emission from CastPointerIntegral().
- Mass-reformat tests in
`std/utilities/optional/optional.object/optional.object.ctor` and
rearrange header `#include`s
- No functional changes
- Prelude for llvm#169203
Use const& in range-based for loop to avoid unnecessary copies
…specified layouts (llvm#169267)

Introduce anchor layout for XeGPU anchor ops: load_nd, store_nd,
prefetch_nd, dpas, load, store, prefetch, load_matrix, store_matrix, and
atomic_rmw. Anchor layout is permanent, and is guaranteed to be honored
by XeGPU distribution and lowerinngs once specified.
1. Add anchor_layout for XeGPU anchor OPs: load_nd, store_nd,
prefetch_nd, dpas, load, store, prefetch, load_matrix, store_matrix, and
atomic_rmw.
2. rename layout attributes to anchor_layout for these ops: load, store,
load_matrix, store_matrix
3. update layout propagation pass: Only when user doesn't specify anchor
layout, the pass computes a default layout and set to anchor op's
permant layout and use that for propagation. if user specified anchor
layout, the pass takes user-specified anchor layout. permant layout and
use that for propagation. if user specified anchor layout, the pass
takes user-specified anchor layout.
…ax_lanes (llvm#169293)

On RISC-V, some loops that the loop vectorizer vectorizes pre-LTO may
turn out to have the exact trip count exposed after LTO, see llvm#164762.

If the trip count is small enough we can fold away the
@llvm.experimental.get.vector.length intrinsic based on this corollary
from the LangRef:

> If %cnt is less than or equal to %max_lanes, the return value is equal
to %cnt.

This on its own doesn't remove the @llvm.experimental.get.vector.length
in llvm#164762 since we also need to teach computeKnownBits about
@llvm.experimental.get.vector.length and the sub recurrence, but this PR
is a starting point.

I've added this in InstCombine rather than InstSimplify since we may
need to insert a truncation (@llvm.experimental.get.vector.length can
take an i64 %cnt argument, the result is always i32).

Note that there was something similar done in VPlan in llvm#167647 for when
the loop vectorizer knows the trip count.
This patch adds recognition of high-half multiply by parts into a single
larger multiply.

Considering a multiply made up of high and low parts, we can split the
multiply into:
  x * y == (xh*T + xl) * (yh*T + yl)
where `xh == x>>32` and `xl == x & 0xffffffff`. `T = 2^32`.
This expands to
  xh*yh*T*T + xh*yl*T + xl*yh*T + xl*yl
which I find it helpful to be drawn as
  [  xh*yh  ]
       [  xh*yl  ]
       [  xl*yh  ]
            [  xl*yl  ]

We are looking for the "high" half, which is xh*yh + xh*yl>>32 + xl*yh>>32 +
carrys. The carry makes this difficult and there are multiple ways of
representing it. The ones we attempt to support here are:
 Carry:  xh*yh + carry + lowsum
         carry = lowsum < xh*yl ? 0x1000000 : 0
         lowsum = xh*yl + xl*yh + (xl*yl>>32)
 Ladder: xh*yh + c2>>32 + c3>>32
         c2 = xh*yl + (xl*yl >> 32); c3 = c2&0xffffffff + xl*yh
 Carry4: xh*yh + carry + crosssum>>32 + (xl*yl + crosssum&0xffffffff) >> 32
         crosssum = xh*yl + xl*yh
         carry = crosssum < xh*yl ? 0x1000000 : 0
 Ladder4: xh*yh + (xl*yh)>>32 + (xh*yl)>>32 + low>>32;
         low = (xl*yl)>>32 + (xl*yh)&0xffffffff + (xh*yl)&0xfffffff

They all start by matching `xh*yh` + 2 or 3 other operands. The bottom of the
tree is `xh*yh`, `xh*yl`, `xl*yh` and `xl*yl`.

Based on llvm#156879 by @c-rhodes
With EVL tail folding, the LastActiveLane can be computed with EVL - 1.
This removes the need for a header mask and vfirst.m for loops with live
outs on RISC-V:

     # %bb.5:                                # %for.cond.cleanup7
    -       vsetvli zero, zero, e32, m2, ta, ma
    -       vmv.v.x v8, s1
    -       vmsleu.vv       v10, v8, v22
    -       vfirst.m        a0, v10
    -       srli    a1, a0, 63
    -       czero.nez       a0, a0, a1
    -       czero.eqz       a1, s8, a1
    -       or      a0, a0, a1
    -       addi    a0, a0, -1
    -       vsetvli zero, zero, e64, m4, ta, ma
    -       vslidedown.vx   v8, v12, a0
    +       addi    s1, s1, -1
    +       vslidedown.vx   v8, v12, s1
…#169776)

This commit fixes the import of `branch_weights` metadata from LLVM IR
to the LLVM dialect. Previously, `branch_weights` metadata containing
the `!"expected"` field were rejected because the importer expected
integer weights at operand 1, but found a string.
As far as I can tell the llvm.arm.mve.vminnm.m intrinsic used in these tests
was the pre-upstream name of llvm.arm.mve.min.predicated. The tests should not
need IR sections, so remove them just relying on the MIR portions.
This tests show how type-checking is performed for
`__builtin_amdgcn_load_to_lds`,
but not for `__builtin_amdgcn_raw_ptr_buffer_load_lds`,
`__builtin_amdgcn_struct_ptr_buffer_load_lds` and
`__builtin_amdgcn_global_load_lds` since they are declared with the 't'
attribute.


Stacked on top of: llvm#165387
…ed by no-return blocks (llvm#167548)

At present, the shrink wrapping pass misses opportunities to shrink wrap
in the presence of machine basic blocks which exit the function without
returning. Such cases arise from C++ functions like the following:
```cxx
int foo(int err, void* ptr) {
    if (err == -1) {
         if (ptr == nullptr) {
             throw MyException("Received `nullptr`!", __FILE__, __LINE__);
         }
         
         handle(ptr);
    }
    
    return STATUS_OK;
}
```
In particular, assuming `MyException`'s constructor is not marked
`noexcept`, the above code will generate a trivial EH landing pad
calling `__cxa_free_exception()` and rethrowing the unhandled internal
exception, exiting the function without returning. As such, the shrink
wrapping pass refuses to touch the above function, spilling to the stack
on every call, even though no CSRs are clobbered on the hot path. This
patch tweaks the shrink wrapping logic to enable the pass to fire in
this and similar cases.
… AVX512 VPMULTISHIFTQB intrinsics to be used in constexpr (llvm#168995)

Resolves llvm#167477
The natural assumption is that there's some sort of order here
and having people read the reference manual before the basic
tutorial does not make sense to me.
Currently LLVM fails to recognize a manual implementation of `phadd`

https://godbolt.org/z/zozrssaWb

```llvm
declare <8 x i16> @llvm.x86.ssse3.phadd.sw.128(<8 x i16>, <8 x i16>)

declare <8 x i16> @llvm.sadd.sat.v8i16(<8 x i16>, <8 x i16>)

define <8 x i16> @phaddsw_v8i16_intrinsic(<8 x i16> %a, <8 x i16> %b) {
entry:
  %res = call <8 x i16> @llvm.x86.ssse3.phadd.sw.128(<8 x i16> %a, <8 x i16> %b)
  ret <8 x i16> %res
}

define <8 x i16> @phaddsw_v8i16_generic(<8 x i16> %a, <8 x i16> %b) {
entry:
  %even = shufflevector <8 x i16> %a, <8 x i16> %b,
    <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
  %odd  = shufflevector <8 x i16> %a, <8 x i16> %b,
    <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
  %sum = call <8 x i16> @llvm.sadd.sat.v8i16(<8 x i16> %even, <8 x i16> %odd)
  ret <8 x i16> %sum
}
```

```asm
phaddsw_v8i16_intrinsic:                # @phaddsw_v8i16_intrinsic
        phaddsw xmm0, xmm1
        ret

phaddsw_v8i16_generic:                  # @phaddsw_v8i16_generic
        movdqa  xmm2, xmmword ptr [rip + .LCPI1_0] # xmm2 = [0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
        movdqa  xmm3, xmm1
        pshufb  xmm3, xmm2
        movdqa  xmm4, xmm0
        pshufb  xmm4, xmm2
        punpcklqdq      xmm4, xmm3              # xmm4 = xmm4[0],xmm3[0]
        psrad   xmm1, 16
        psrad   xmm0, 16
        packssdw        xmm0, xmm1
        paddsw  xmm0, xmm4
        ret
```

This PR does recognize the pattern.
`[[nodiscard]]` should be applied to functions where discarding the
return value is most likely a correctness issue.

-
https://libcxx.llvm.org/CodingGuidelines.html#apply-nodiscard-where-relevant
tbaederr and others added 10 commits November 27, 2025 13:06
…set (llvm#169786)

We can't access the RecordLayout of an invalid decl, so return failure
if that happens.

Fixes llvm#167076
Extends test coverage to include different start and step values, as
well as interleaving.
`[[nodiscard]]` should be applied to functions where discarding the
return value is most likely a correctness issue.

-
https://libcxx.llvm.org/CodingGuidelines.html#apply-nodiscard-where-relevant
…169781)

Let standard casting / builtin_convertvector handle the conversions from BF16 to F32

My only query is how to best implement _mm_cvtpbh_ps - I went for the
v8bf16 -> v8f32 conversion followed by subvector extraction in the end,
but could just as easily extract a v4bf16 first - makes no difference to
final optimized codegen.

First part of llvm#154911
…9670)

Remove explicit VT numbers from ValueTypes.td so that patches that add a
new VT do not have to renumber the entire file.

In TableGen VTs are now identified by ValueType.LLVMName instead of
ValueType.Value. This is important for target-defined types (typically
based on PtrValueType) which are not mentioned in ValueTypes.td itself.
Adding support for serializing the ada entry flags helps with mir based
test cases. Without this change, the flags are simple displayed as being
"unkmown".
@z1-cciauto z1-cciauto requested a review from a team November 27, 2025 13:19
@z1-cciauto
Copy link
Collaborator Author

@z1-cciauto z1-cciauto merged commit 2520866 into amd-staging Nov 27, 2025
13 checks passed
@z1-cciauto z1-cciauto deleted the upstream_merge_202511270819 branch November 27, 2025 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.