Add a fast processor-native bitshift function #52828

Draft
wants to merge 5 commits into base: master

Conversation

jakobnissen
Contributor

@jakobnissen jakobnissen commented Jan 9, 2024

In Julia, bitshifting a B-bit integer x by B bits causes the result to be zero(x). Also, negative bitshifts are supported, e.g. x << -1. This may be semantically sensible, but it does not correspond to either x86 or AArch64 behaviour. The result is that Julia's bitshifts are not optimised to a single instruction, which makes them unnecessarily slow in some performance-sensitive contexts.
In contrast, in the C language, bitshifting by more than the bitwidth is undefined behaviour, which allows the compiler to assume it never happens and optimise the shift to a single assembly instruction.
The difference between one CPU instruction and a handful may seem trivial, but in performance-sensitive code it can really matter, e.g. #30674 and attractivechaos/plb2#48.
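
For example, under the current documented semantics (not this PR's proposed functions):

x = 0x01                 # a UInt8
x << 8  === 0x00         # shifting by the full bitwidth yields zero
x << -1 === x >> 1       # a negative shift count shifts in the opposite direction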

This commit adds the functions unsafe_ashr, unsafe_lshr and unsafe_shl. The goal of these functions is to compile to single shift instructions in order to be as fast as possible.

Decisions

1. Semantics when the shift is too high

What happens in the CPU when you shift x >> n, where x is a B-bit integer and n >= B? Let's call these "overflowing shifts". As far as I can tell, on x86, AArch64 and RISC-V - so, basically all the instruction sets that matter - only the lower 5 bits of n are used for 8-32 bit integers, and only the lower 6 bits for 64-bit integers.
Note that this implies that when x and n are 8-bit integers, masking n by 0x07 does NOT correspond to the native shift instruction - it should be masked by 0x1f. I'm not 100% certain about that - all the documentation I can find simply assumes 32-bit operands.

So, what options do we have?

a) Native behaviour

Here, we just do what the CPU does when it encounters overflowing bitshifts. In particular, we shift with n % max(32, 8*sizeof(x)). That's a weird rule, but really, IMO, no more weird than how signed overflow wraps from e.g. 127 to -128. This has maximal performance, but the semantics are weirdly complex and unpedagogical on overflow.
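
As a value-level sketch of that masking rule (the helper names here are hypothetical and not part of this PR; the actual implementation would use the intrinsic-based code shown later in the thread):

# Hypothetical illustration only: mask the shift amount the way the hardware does.
native_shift_mask(x::Base.BitInteger) = max(32, 8*sizeof(x)) - 1   # 31 for 8-32 bit, 63 for 64 bit, 127 for 128 bit
native_lshr(x::Base.BitInteger, n::Base.BitInteger) = x >>> ((n % UInt32) & UInt32(native_shift_mask(x)))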

If we take this approach, we might want to keep the documented behaviour simple by formally promising only that the return type is correct, while making no promises about the returned value.

c) Shift with n % (8*sizeof(x))

On x86, this produces optimal code for 64 and 32 bits, and is a little slower on 8 and 16 bit integers (the performance is somewhere between the native behaviour and the current shifting behaviour). The advantage here is that it's semantically simpler than the solution above.
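
A minimal sketch of this variant in terms of the existing operators (hypothetical name, shown only to illustrate the semantics):

# Shift by n mod the bitwidth; the mask keeps the shift amount in 0:8*sizeof(x)-1.
masked_shl(x::Base.BitInteger, n::Base.BitInteger) = x << (n & (8*sizeof(x) - 1))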

2. What should the name be for e.g. the equivalent of >>?

>>%

Pro: It's short, it looks like >>, and it's an infix operator, which makes it much more readable in complex bitshifting code

Con: It takes up valuable ASCII real estate, and the proposed semantics of >>% vs >> differ from those of +% vs + as proposed by Keno, which might be confusing

unsafe_ashr

The unsafe_ prefix is a nice way to warn that the resulting value may be arbitrary (if we go with that behaviour), similarly to unsafe_trunc. However, it may misleadingly suggest the kind of memory unsafety that other unsafe_ functions can cause. It's also long and annoying to use in bit hacking compared with infix operators.

3. Should we have a fallback definition for (::Integer, ::Integer)?

Pro: It makes generic code possible with these operations, and it makes it less annoying for users who don't have to define their own implementations

Con: A processor-native bitshift only really makes sense for BitIntegers, and adding a generic function is semantically misleading

My recommendations:

  • It should have native behaviour on overflow. I see little reason to have simple, easy-to-understand behaviour on overflow - presumably, unintentional overflow usually leads to nasty bugs no matter how simple the overflow behaviour is. And it seems silly to me to leave half the performance on the table for 8- and 16-bit bitshifts just because we want nicer behaviour on overflow.
  • We should not have a fallback Integer definition, since the purpose of this function is native bitshifting, which doesn't exist for generic objects, only for bit integers.

Closes #50225

In Julia, bitshifting a bit integer x by more bits than are present in x causes
the result to be zero. Also, negative bitshifts are supported.
This might semantically be more correct (and the former also matches LLVM's
definition of bitshifts), but it does not correspond to either x86 or AArch64
behaviour. The result is that Julia's bitshifts are not optimised to a single
instruction.
In contrast, in the C language, bitshifting by more than the bitwidth is UB,
which allows the compiler to assume it never happens, and optimise the shift to
a single instruction.

This commit adds the wrapping shift functions >>%, >>>% and <<%, which overflow
if the shift is too high or negative. The overflow behaviour is explicitly not
stated, but the implemented behaviour matches the native behaviour of x86 and
AArch64.

This commit requires JuliaSyntax support, which will be implemented in a parallel
PR to JuliaSyntax.
@jakobnissen jakobnissen requested a review from Keno January 9, 2024 11:12
@jakobnissen jakobnissen added the needs decision A decision on this change is needed label Jan 9, 2024
@jakobnissen
Contributor Author

I've assigned @Keno since you did #50790 which is along the same lines.

@jakobnissen jakobnissen marked this pull request as draft January 9, 2024 15:10
@giordano
Contributor

giordano commented Jan 9, 2024

If anyone has an ARM computer and can help me check it, that'd be grand

What's the test?

@jakobnissen
Contributor Author

jakobnissen commented Jan 9, 2024

@giordano I've used this:

Code to test:
function foo(x::Union{Int8, UInt8, Int16, UInt16}, y::Base.BitInteger)
    Core.Intrinsics.lshr_int(Core.Intrinsics.zext_int(UInt32, x), (y % UInt32) & 0x1f) % typeof(x)
end

function foo(x::T, y::Base.BitInteger) where {T <: Union{Int32, UInt32, UInt64, Int64, UInt128, Int128}}
    Core.Intrinsics.lshr_int(x, (y % UInt32) & (8*sizeof(T) - 1)) % T
end

function qux(x::Union{Int8, UInt8, Int16, UInt16}, y::Base.BitInteger)
    Core.Intrinsics.shl_int(Core.Intrinsics.zext_int(UInt32, x), (y % UInt32) & 0x1f) % typeof(x)
end

function qux(x::T, y::Base.BitInteger) where {T <: Union{Int32, UInt32, UInt64, Int64, UInt128, Int128}}
    Core.Intrinsics.shl_int(x, (y % UInt32) & (8*sizeof(T) - 1)) % T
end

function bar(x::Union{Int8, Int16}, y::Base.BitInteger)
    Core.Intrinsics.ashr_int(Core.Intrinsics.sext_int(UInt32, x), (y % UInt32) & 0x1f) % typeof(x)
end

function bar(x::T, y::Base.BitInteger) where {T <: Union{Int32, Int64, Int128}}
    Core.Intrinsics.ashr_int(x, (y % UInt32) & (8*sizeof(T) - 1)) % T
end

function bar(x::Union{UInt8, UInt16}, y::Base.BitInteger)
    Core.Intrinsics.lshr_int(Core.Intrinsics.zext_int(UInt32, x), (y % UInt32) & 0x1f) % typeof(x)
end

function bar(x::T, y::Base.BitInteger) where {T <: Union{UInt32, UInt64, UInt128}}
    Core.Intrinsics.lshr_int(x, (y % UInt32) & (8*sizeof(T) - 1)) % T
end

for T1 in Base.BitInteger64_types
    for T2 in Base.BitInteger64_types
        for f in [foo, bar, qux]
            print(T1, " ", T2, " ", f)
            io = IOBuffer()
            code_native(io, f, (T1, T2); dump_module=false, debuginfo=:none, raw=true)
            s = collect(eachline(IOBuffer(String(take!(io)))))
            filter!(s) do i
                ss = lstrip(i)
                all(!startswith(ss, j) for j in [r"mov\s", r"pop\s", r"nop\s", r"ret\s?", r"push\s"])
            end
            for i in s
                println(i)
            end
        end
    end
end

It prints the emitted instructions for each type combination, filtering out the trivial ones like moves, pushes and returns. Then I go through the list and make sure every combination only emits a shift.
The all(!startswith(ss, j) for j in [r"mov\s", r"pop\s", r"nop\s", r"ret\s?", r"push\s"]) line may need some modification for AArch64, though. Specifically, check that there are no jumps, conditional moves, calls or anything like that.

@giordano
Contributor

giordano commented Jan 9, 2024

With your unmodified code I get

Int8 Int8 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xff
	lsr	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 Int8 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 Int8 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xff
	lsl	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 Int16 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xff
	lsr	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 Int16 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 Int16 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xff
	lsl	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 Int32 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xff
	lsr	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 Int32 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 Int32 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xff
	lsl	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 Int64 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xff
	lsr	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 Int64 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 Int64 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xff
	lsl	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 UInt8 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xff
	lsr	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 UInt8 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 UInt8 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xff
	lsl	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 UInt16 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xff
	lsr	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 UInt16 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 UInt16 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xff
	lsl	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 UInt32 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xff
	lsr	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 UInt32 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 UInt32 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xff
	lsl	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 UInt64 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xff
	lsr	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 UInt64 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int8 UInt64 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xff
	lsl	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 Int8 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xffff
	lsr	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 Int8 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 Int8 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xffff
	lsl	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 Int16 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xffff
	lsr	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 Int16 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 Int16 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xffff
	lsl	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 Int32 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xffff
	lsr	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 Int32 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 Int32 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xffff
	lsl	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 Int64 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xffff
	lsr	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 Int64 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 Int64 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xffff
	lsl	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 UInt8 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xffff
	lsr	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 UInt8 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 UInt8 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xffff
	lsl	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 UInt16 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xffff
	lsr	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 UInt16 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 UInt16 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xffff
	lsl	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 UInt32 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xffff
	lsr	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 UInt32 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 UInt32 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xffff
	lsl	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 UInt64 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xffff
	lsr	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 UInt64 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int16 UInt64 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	and	w9, w0, #0xffff
	lsl	w0, w9, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 Int8 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 Int8 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 Int8 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 Int16 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 Int16 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 Int16 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 Int32 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 Int32 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 Int32 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 Int64 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 Int64 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 Int64 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 UInt8 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 UInt8 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 UInt8 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 UInt16 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 UInt16 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 UInt16 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 UInt32 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 UInt32 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 UInt32 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 UInt64 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 UInt64 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int32 UInt64 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 Int8 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 Int8 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 Int8 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 Int16 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 Int16 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 Int16 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 Int32 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 Int32 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 Int32 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 Int64 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 Int64 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 Int64 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 UInt8 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 UInt8 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 UInt8 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 UInt16 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 UInt16 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 UInt16 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 UInt32 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 UInt32 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 UInt32 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 UInt64 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 UInt64 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	asr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
Int64 UInt64 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 Int8 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 Int8 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 Int8 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 Int16 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 Int16 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 Int16 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 Int32 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 Int32 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 Int32 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 Int64 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 Int64 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 Int64 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 UInt8 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 UInt8 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 UInt8 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 UInt16 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 UInt16 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 UInt16 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 UInt32 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 UInt32 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 UInt32 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 UInt64 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 UInt64 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt8 UInt64 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 Int8 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 Int8 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 Int8 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 Int16 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 Int16 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 Int16 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 Int32 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 Int32 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 Int32 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 Int64 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 Int64 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 Int64 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 UInt8 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 UInt8 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 UInt8 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 UInt16 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 UInt16 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 UInt16 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 UInt32 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 UInt32 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 UInt32 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 UInt64 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 UInt64 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt16 UInt64 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 Int8 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 Int8 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 Int8 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 Int16 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 Int16 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 Int16 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 Int32 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 Int32 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 Int32 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 Int64 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 Int64 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 Int64 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 UInt8 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 UInt8 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 UInt8 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 UInt16 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 UInt16 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 UInt16 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 UInt32 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 UInt32 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 UInt32 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 UInt64 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 UInt64 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt32 UInt64 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	w0, w0, w1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 Int8 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 Int8 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 Int8 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 Int16 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 Int16 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 Int16 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 Int32 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 Int32 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 Int32 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 Int64 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 Int64 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 Int64 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 UInt8 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 UInt8 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 UInt8 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 UInt16 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 UInt16 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 UInt16 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 UInt32 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 UInt32 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 UInt32 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 UInt64 foo	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 UInt64 bar	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsr	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16
UInt64 UInt64 qux	.text
	stp	x29, x30, [sp, #-16]!
	ldr	x8, [x20, #16]
	lsl	x0, x0, x1
	ldr	x8, [x8, #16]
	ldr	xzr, [x8]
	ldp	x29, x30, [sp], #16

stp is a store instruction, ldr and ldp are load instructions, and lsr/asr/lsl are the shift instructions.

With more cleaning up

Code:

for T1 in Base.BitInteger64_types
    for T2 in Base.BitInteger64_types
        for f in [foo, bar, qux]
            print(T1, " ", T2, " ", f)
            io = IOBuffer()
            code_native(io, f, (T1, T2); dump_module=false, debuginfo=:none, raw=true)
            s = collect(eachline(IOBuffer(String(take!(io)))))
            filter!(s) do i
                ss = lstrip(i)
                all(!startswith(ss, j) for j in [r"ret\s?", r"(mov|ldr|ldp|stp)\s"])
            end
            for i in s
                println(i)
            end
        end
    end
end

I get this simplified output:

Int8 Int8 foo	.text
	and	w9, w0, #0xff
	lsr	w0, w9, w1
Int8 Int8 bar	.text
	asr	w0, w0, w1
Int8 Int8 qux	.text
	and	w9, w0, #0xff
	lsl	w0, w9, w1
Int8 Int16 foo	.text
	and	w9, w0, #0xff
	lsr	w0, w9, w1
Int8 Int16 bar	.text
	asr	w0, w0, w1
Int8 Int16 qux	.text
	and	w9, w0, #0xff
	lsl	w0, w9, w1
Int8 Int32 foo	.text
	and	w9, w0, #0xff
	lsr	w0, w9, w1
Int8 Int32 bar	.text
	asr	w0, w0, w1
Int8 Int32 qux	.text
	and	w9, w0, #0xff
	lsl	w0, w9, w1
Int8 Int64 foo	.text
	and	w9, w0, #0xff
	lsr	w0, w9, w1
Int8 Int64 bar	.text
	asr	w0, w0, w1
Int8 Int64 qux	.text
	and	w9, w0, #0xff
	lsl	w0, w9, w1
Int8 UInt8 foo	.text
	and	w9, w0, #0xff
	lsr	w0, w9, w1
Int8 UInt8 bar	.text
	asr	w0, w0, w1
Int8 UInt8 qux	.text
	and	w9, w0, #0xff
	lsl	w0, w9, w1
Int8 UInt16 foo	.text
	and	w9, w0, #0xff
	lsr	w0, w9, w1
Int8 UInt16 bar	.text
	asr	w0, w0, w1
Int8 UInt16 qux	.text
	and	w9, w0, #0xff
	lsl	w0, w9, w1
Int8 UInt32 foo	.text
	and	w9, w0, #0xff
	lsr	w0, w9, w1
Int8 UInt32 bar	.text
	asr	w0, w0, w1
Int8 UInt32 qux	.text
	and	w9, w0, #0xff
	lsl	w0, w9, w1
Int8 UInt64 foo	.text
	and	w9, w0, #0xff
	lsr	w0, w9, w1
Int8 UInt64 bar	.text
	asr	w0, w0, w1
Int8 UInt64 qux	.text
	and	w9, w0, #0xff
	lsl	w0, w9, w1
Int16 Int8 foo	.text
	and	w9, w0, #0xffff
	lsr	w0, w9, w1
Int16 Int8 bar	.text
	asr	w0, w0, w1
Int16 Int8 qux	.text
	and	w9, w0, #0xffff
	lsl	w0, w9, w1
Int16 Int16 foo	.text
	and	w9, w0, #0xffff
	lsr	w0, w9, w1
Int16 Int16 bar	.text
	asr	w0, w0, w1
Int16 Int16 qux	.text
	and	w9, w0, #0xffff
	lsl	w0, w9, w1
Int16 Int32 foo	.text
	and	w9, w0, #0xffff
	lsr	w0, w9, w1
Int16 Int32 bar	.text
	asr	w0, w0, w1
Int16 Int32 qux	.text
	and	w9, w0, #0xffff
	lsl	w0, w9, w1
Int16 Int64 foo	.text
	and	w9, w0, #0xffff
	lsr	w0, w9, w1
Int16 Int64 bar	.text
	asr	w0, w0, w1
Int16 Int64 qux	.text
	and	w9, w0, #0xffff
	lsl	w0, w9, w1
Int16 UInt8 foo	.text
	and	w9, w0, #0xffff
	lsr	w0, w9, w1
Int16 UInt8 bar	.text
	asr	w0, w0, w1
Int16 UInt8 qux	.text
	and	w9, w0, #0xffff
	lsl	w0, w9, w1
Int16 UInt16 foo	.text
	and	w9, w0, #0xffff
	lsr	w0, w9, w1
Int16 UInt16 bar	.text
	asr	w0, w0, w1
Int16 UInt16 qux	.text
	and	w9, w0, #0xffff
	lsl	w0, w9, w1
Int16 UInt32 foo	.text
	and	w9, w0, #0xffff
	lsr	w0, w9, w1
Int16 UInt32 bar	.text
	asr	w0, w0, w1
Int16 UInt32 qux	.text
	and	w9, w0, #0xffff
	lsl	w0, w9, w1
Int16 UInt64 foo	.text
	and	w9, w0, #0xffff
	lsr	w0, w9, w1
Int16 UInt64 bar	.text
	asr	w0, w0, w1
Int16 UInt64 qux	.text
	and	w9, w0, #0xffff
	lsl	w0, w9, w1
Int32 Int8 foo	.text
	lsr	w0, w0, w1
Int32 Int8 bar	.text
	asr	w0, w0, w1
Int32 Int8 qux	.text
	lsl	w0, w0, w1
Int32 Int16 foo	.text
	lsr	w0, w0, w1
Int32 Int16 bar	.text
	asr	w0, w0, w1
Int32 Int16 qux	.text
	lsl	w0, w0, w1
Int32 Int32 foo	.text
	lsr	w0, w0, w1
Int32 Int32 bar	.text
	asr	w0, w0, w1
Int32 Int32 qux	.text
	lsl	w0, w0, w1
Int32 Int64 foo	.text
	lsr	w0, w0, w1
Int32 Int64 bar	.text
	asr	w0, w0, w1
Int32 Int64 qux	.text
	lsl	w0, w0, w1
Int32 UInt8 foo	.text
	lsr	w0, w0, w1
Int32 UInt8 bar	.text
	asr	w0, w0, w1
Int32 UInt8 qux	.text
	lsl	w0, w0, w1
Int32 UInt16 foo	.text
	lsr	w0, w0, w1
Int32 UInt16 bar	.text
	asr	w0, w0, w1
Int32 UInt16 qux	.text
	lsl	w0, w0, w1
Int32 UInt32 foo	.text
	lsr	w0, w0, w1
Int32 UInt32 bar	.text
	asr	w0, w0, w1
Int32 UInt32 qux	.text
	lsl	w0, w0, w1
Int32 UInt64 foo	.text
	lsr	w0, w0, w1
Int32 UInt64 bar	.text
	asr	w0, w0, w1
Int32 UInt64 qux	.text
	lsl	w0, w0, w1
Int64 Int8 foo	.text
	lsr	x0, x0, x1
Int64 Int8 bar	.text
	asr	x0, x0, x1
Int64 Int8 qux	.text
	lsl	x0, x0, x1
Int64 Int16 foo	.text
	lsr	x0, x0, x1
Int64 Int16 bar	.text
	asr	x0, x0, x1
Int64 Int16 qux	.text
	lsl	x0, x0, x1
Int64 Int32 foo	.text
	lsr	x0, x0, x1
Int64 Int32 bar	.text
	asr	x0, x0, x1
Int64 Int32 qux	.text
	lsl	x0, x0, x1
Int64 Int64 foo	.text
	lsr	x0, x0, x1
Int64 Int64 bar	.text
	asr	x0, x0, x1
Int64 Int64 qux	.text
	lsl	x0, x0, x1
Int64 UInt8 foo	.text
	lsr	x0, x0, x1
Int64 UInt8 bar	.text
	asr	x0, x0, x1
Int64 UInt8 qux	.text
	lsl	x0, x0, x1
Int64 UInt16 foo	.text
	lsr	x0, x0, x1
Int64 UInt16 bar	.text
	asr	x0, x0, x1
Int64 UInt16 qux	.text
	lsl	x0, x0, x1
Int64 UInt32 foo	.text
	lsr	x0, x0, x1
Int64 UInt32 bar	.text
	asr	x0, x0, x1
Int64 UInt32 qux	.text
	lsl	x0, x0, x1
Int64 UInt64 foo	.text
	lsr	x0, x0, x1
Int64 UInt64 bar	.text
	asr	x0, x0, x1
Int64 UInt64 qux	.text
	lsl	x0, x0, x1
UInt8 Int8 foo	.text
	lsr	w0, w0, w1
UInt8 Int8 bar	.text
	lsr	w0, w0, w1
UInt8 Int8 qux	.text
	lsl	w0, w0, w1
UInt8 Int16 foo	.text
	lsr	w0, w0, w1
UInt8 Int16 bar	.text
	lsr	w0, w0, w1
UInt8 Int16 qux	.text
	lsl	w0, w0, w1
UInt8 Int32 foo	.text
	lsr	w0, w0, w1
UInt8 Int32 bar	.text
	lsr	w0, w0, w1
UInt8 Int32 qux	.text
	lsl	w0, w0, w1
UInt8 Int64 foo	.text
	lsr	w0, w0, w1
UInt8 Int64 bar	.text
	lsr	w0, w0, w1
UInt8 Int64 qux	.text
	lsl	w0, w0, w1
UInt8 UInt8 foo	.text
	lsr	w0, w0, w1
UInt8 UInt8 bar	.text
	lsr	w0, w0, w1
UInt8 UInt8 qux	.text
	lsl	w0, w0, w1
UInt8 UInt16 foo	.text
	lsr	w0, w0, w1
UInt8 UInt16 bar	.text
	lsr	w0, w0, w1
UInt8 UInt16 qux	.text
	lsl	w0, w0, w1
UInt8 UInt32 foo	.text
	lsr	w0, w0, w1
UInt8 UInt32 bar	.text
	lsr	w0, w0, w1
UInt8 UInt32 qux	.text
	lsl	w0, w0, w1
UInt8 UInt64 foo	.text
	lsr	w0, w0, w1
UInt8 UInt64 bar	.text
	lsr	w0, w0, w1
UInt8 UInt64 qux	.text
	lsl	w0, w0, w1
UInt16 Int8 foo	.text
	lsr	w0, w0, w1
UInt16 Int8 bar	.text
	lsr	w0, w0, w1
UInt16 Int8 qux	.text
	lsl	w0, w0, w1
UInt16 Int16 foo	.text
	lsr	w0, w0, w1
UInt16 Int16 bar	.text
	lsr	w0, w0, w1
UInt16 Int16 qux	.text
	lsl	w0, w0, w1
UInt16 Int32 foo	.text
	lsr	w0, w0, w1
UInt16 Int32 bar	.text
	lsr	w0, w0, w1
UInt16 Int32 qux	.text
	lsl	w0, w0, w1
UInt16 Int64 foo	.text
	lsr	w0, w0, w1
UInt16 Int64 bar	.text
	lsr	w0, w0, w1
UInt16 Int64 qux	.text
	lsl	w0, w0, w1
UInt16 UInt8 foo	.text
	lsr	w0, w0, w1
UInt16 UInt8 bar	.text
	lsr	w0, w0, w1
UInt16 UInt8 qux	.text
	lsl	w0, w0, w1
UInt16 UInt16 foo	.text
	lsr	w0, w0, w1
UInt16 UInt16 bar	.text
	lsr	w0, w0, w1
UInt16 UInt16 qux	.text
	lsl	w0, w0, w1
UInt16 UInt32 foo	.text
	lsr	w0, w0, w1
UInt16 UInt32 bar	.text
	lsr	w0, w0, w1
UInt16 UInt32 qux	.text
	lsl	w0, w0, w1
UInt16 UInt64 foo	.text
	lsr	w0, w0, w1
UInt16 UInt64 bar	.text
	lsr	w0, w0, w1
UInt16 UInt64 qux	.text
	lsl	w0, w0, w1
UInt32 Int8 foo	.text
	lsr	w0, w0, w1
UInt32 Int8 bar	.text
	lsr	w0, w0, w1
UInt32 Int8 qux	.text
	lsl	w0, w0, w1
UInt32 Int16 foo	.text
	lsr	w0, w0, w1
UInt32 Int16 bar	.text
	lsr	w0, w0, w1
UInt32 Int16 qux	.text
	lsl	w0, w0, w1
UInt32 Int32 foo	.text
	lsr	w0, w0, w1
UInt32 Int32 bar	.text
	lsr	w0, w0, w1
UInt32 Int32 qux	.text
	lsl	w0, w0, w1
UInt32 Int64 foo	.text
	lsr	w0, w0, w1
UInt32 Int64 bar	.text
	lsr	w0, w0, w1
UInt32 Int64 qux	.text
	lsl	w0, w0, w1
UInt32 UInt8 foo	.text
	lsr	w0, w0, w1
UInt32 UInt8 bar	.text
	lsr	w0, w0, w1
UInt32 UInt8 qux	.text
	lsl	w0, w0, w1
UInt32 UInt16 foo	.text
	lsr	w0, w0, w1
UInt32 UInt16 bar	.text
	lsr	w0, w0, w1
UInt32 UInt16 qux	.text
	lsl	w0, w0, w1
UInt32 UInt32 foo	.text
	lsr	w0, w0, w1
UInt32 UInt32 bar	.text
	lsr	w0, w0, w1
UInt32 UInt32 qux	.text
	lsl	w0, w0, w1
UInt32 UInt64 foo	.text
	lsr	w0, w0, w1
UInt32 UInt64 bar	.text
	lsr	w0, w0, w1
UInt32 UInt64 qux	.text
	lsl	w0, w0, w1
UInt64 Int8 foo	.text
	lsr	x0, x0, x1
UInt64 Int8 bar	.text
	lsr	x0, x0, x1
UInt64 Int8 qux	.text
	lsl	x0, x0, x1
UInt64 Int16 foo	.text
	lsr	x0, x0, x1
UInt64 Int16 bar	.text
	lsr	x0, x0, x1
UInt64 Int16 qux	.text
	lsl	x0, x0, x1
UInt64 Int32 foo	.text
	lsr	x0, x0, x1
UInt64 Int32 bar	.text
	lsr	x0, x0, x1
UInt64 Int32 qux	.text
	lsl	x0, x0, x1
UInt64 Int64 foo	.text
	lsr	x0, x0, x1
UInt64 Int64 bar	.text
	lsr	x0, x0, x1
UInt64 Int64 qux	.text
	lsl	x0, x0, x1
UInt64 UInt8 foo	.text
	lsr	x0, x0, x1
UInt64 UInt8 bar	.text
	lsr	x0, x0, x1
UInt64 UInt8 qux	.text
	lsl	x0, x0, x1
UInt64 UInt16 foo	.text
	lsr	x0, x0, x1
UInt64 UInt16 bar	.text
	lsr	x0, x0, x1
UInt64 UInt16 qux	.text
	lsl	x0, x0, x1
UInt64 UInt32 foo	.text
	lsr	x0, x0, x1
UInt64 UInt32 bar	.text
	lsr	x0, x0, x1
UInt64 UInt32 qux	.text
	lsl	x0, x0, x1
UInt64 UInt64 foo	.text
	lsr	x0, x0, x1
UInt64 UInt64 bar	.text
	lsr	x0, x0, x1
UInt64 UInt64 qux	.text
	lsl	x0, x0, x1

Note that there are some and instructions, in addition to the shifting ones.

@StefanKarpinski StefanKarpinski added the status:triage This should be discussed on a triage call label Jan 9, 2024
@StefanKarpinski
Sponsor Member

I'm in favor. It's very doable to implement these for yourself for any given argument type, but annoying to implement generically as you've done here, so having the guaranteed wrapping versions seems good.

@mbauman
Sponsor Member

mbauman commented Jan 9, 2024

The docstrings here are pretty confusing to me — it's not immediately clear what "overflowing" means or when it'd happen in this context. In the analogy to + and * arithmetic, I think of the existing x << n as already being an overflowing operation — it performs its operation mod 2^(sizeof(x)*8).
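
For instance, with the existing operator:

0x81 << 1 === 0x02   # 129 * 2 = 258, and 258 mod 2^8 = 2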

Using the word overflowing as a jargony stand-in for "undefined behavior" seems not great.

@mikmoore
Contributor

mikmoore commented Jan 9, 2024

When you said "wrapping shift," I initially thought you meant bitrotate. But I see that isn't actually what you're doing here. What does it mean for a bitshift to overflow? Even after reading this discussion, the diff, and the docstrings, I'm not entirely clear.

My best understanding is that this is a bitshift with undefined behavior when attempting to shift left_operand by any value not in the set 0:8*sizeof(left_operand). Or maybe the same but on the set 0:8*sizeof(Int)? Is either of these correct?

The rest of my post is written assuming my above guess is correct. Although I think most of my arguments remain relevant even if not.

"Overflowing bit shift" does not appear to be standard terminology. A Google search does not turn up any specific operation by that name. The search does find a few discussions on overflow in bit shifts (and the UB that follows), but in any case this is a bad description. Given my persistent fuzzyness on what this operator actually does, the docstring needs a significant change of terminology and explanation.

If this exposes UB, it should explicitly mention "undefined behavior" and the conditions for invoking/avoiding it within the docstring.

My initial reaction to >>% is that it seems more natural as an infix for bitrotate. How common is the operation you're exposing? Is it really worth an infix in the first place? Is there precedent for this infix operator in other languages? From my poor understanding, it seems that the only difference between >>% and >> is performance and UB. It seems like a named operator would be plenty suitable, in that case. I would suggest this get a named function, instead. Something like unchecked_shra or unsafe_shra (and _shrl, _shl), for example. Whether it is exported or merely public is something one can debate, but I'm inclined towards public if UB really is a risk.

The current implementation only supports Base.BitInteger types. There is no definition for an arbitrary Integer in either position. A generic fallback should be provided. If >>% is just >> with UB, it can simply fall back to >>.

@jakobnissen
Contributor Author

jakobnissen commented Jan 9, 2024

The purpose of the functions is to provide bitshifting with zero overhead. In particular, I want the semantics to be:

  • For an operator f(a, b), where f is any of the three shift operators, the valid values of b are 0:8*sizeof(a)-1, of any Base.BitInteger type (a test-style sketch of this contract follows the list)
  • Given a valid shift value, compute the same result as the ordinary shift operators as efficiently as possible
  • Given an invalid shift value, we guarantee the function still returns, and still returns the correct type, but we make no guarantees about the returned value.
  • In practice, the implementation might differ between architectures, such that it always compiles down to just a single instruction. AFAIK we can't write arch-specific Julia code ATM, but we can optimise it for x86 for now and then change it later.
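
In test form, that contract might look something like this (using the unsafe_shl spelling from the PR description; purely illustrative):

x = 0x12                                        # any Base.BitInteger value
for n in 0:8*sizeof(x)-1
    @assert unsafe_shl(x, n) === x << n         # valid shift amounts agree with <<
end
@assert unsafe_shl(x, 200) isa typeof(x)        # invalid shift amounts: only the return type is guaranteed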

Using the word overflowing as a jargony standin for "undefined behavior" seems not great.

I don't think "undefined behaviour" is a good term. People associate it with nasal demons and compilers doing terrible things like deleting whole functions, and your entire program being considered invalid such that anything goes. This is very much not the case with invalid bitshift values here. We guarantee the function returns in an orderly fashion, although the returned value is unspecified. I've changed the docstrings to be more precise, but I've avoided the term "undefined behaviour".

How common is the operation you're exposing?

I'm not sure. I use it a lot in my own code, but I don't really have a sense of how common this is. If I'm atypical, it makes sense to preserve the >>% operators for something else. From my point of view, they're used more often than bitrotate for sure (these functions are even used in the current Julia implementation of bitrotate, through bit hacks). I suspect they're used a fair bit in high performance code - after all, that's one of the things Julia excels at, and that's when you want to go as close to the metal as possible.

There is no definition for an arbitrary Integer in either position. A generic fallback should be provided.

I disagree. This function is all about doing bit operations as efficiently as possible, by matching the CPU's native instructions. It makes no sense to provide a generic implementation; that would only mislead people into thinking there is an efficient implementation for their custom Integer type. If you have an unknown Integer type (or write completely generic code), there is no way you can do super efficient bit operations on it anyway, and you should just use the standard bitshift operators. Staying this close to the metal only makes sense if you know approximately what data you have.

I won't die on that hill, but I really think that if you write code so generic that you don't know for sure your data is a Base.BitInteger (or some other concrete type you've implemented >>% for), you shouldn't use these functions.

@PallHaraldsson
Contributor

I think we want this, at least since Julia itself uses << and >> (e.g. in base/hamt.jl), so it could use it. The point is that this is faster, right? Probably OK to export (but not to change the current definitions).

@mbauman
Sponsor Member

mbauman commented Jan 9, 2024

I understand all that — and I have wanted/written the functionality myself! — but neither the term "overflowing shift" nor the name <<% make all that much sense to me. In the analogy to #50790, the operation x +% y is explicitly saying that you want to perform pure mathematical addition and then apply % T. This allows you to convert other + operations to error upon overflow — and here overflow means a mathematical result that's bigger than T!

I suppose there's the question of what "pure mathematical bitshifting" is, but in Julia we say it's equivalent to x * 2^n or fld(x, 2^n). This seems pretty straightforward - <<% should behave exactly like <<. The existing << happily satisfies the definition of modular arithmetic above. Suppose we have <<%: what would the semantics of #50790's @Base.Experimental.make_all_arithmetic_checked be for <<? Does it error on 2 << 63? Or just 1 << 64? What about 2 << -1?
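
For reference, with the current (unchecked) definitions and a 64-bit Int, all of these return well-defined values:

2 << 63 === 0   # 2 * 2^63 wraps to 0 mod 2^64
1 << 64 === 0   # a shift count >= the bitwidth gives zero
2 << -1 === 1   # a negative shift count shifts the other way: fld(2, 2) == 1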

The fact that our processors happily overflow and do "modular arithmetic" up to a point is just quirky. And it's annoying to need to know that quirk to get good performance. But I don't think it's the same as +%.

@mikmoore
Contributor

mikmoore commented Jan 9, 2024

With the new explanation, this seems (to me) like a clear case of UB. UB is used to assert that certain values will not appear in certain situations so that useful optimizations can be made. If those values actually occur, then UB permits the compiler to ignore the consequences. If it was on the hook to handle the consequences predictably, then optimization would be impossible and UB would be useless. UB is not something to be ashamed of, but it is definitely something to be made aware of and definitely the terminology "undefined behavior" should be used so that users can understand the consequences of abuse.

I don't see any difference between the UB here and the UB exposed in @fastmath, @inbounds, or unsafe_wrap (e.g., @fastmath isnan(NaN) == false). Those also make an orderly return of a value of the correct type (unless a segfault occurs, but that only happens when the non-UB conditions are violated), but it may not be the answer one had hoped for if the criteria for avoiding UB are violated.

Some languages (C, I think?) define the overflow of signed integer addition to be UB. This is useful because the compiler can always assume that the sum of two nonnegative values is nonnegative. Julia instead defines signed integer addition overflow to return the result of modular arithmetic. In practice, those other languages also give that same result (that's what the hardware does, after all), but the compiler can assume extra properties about that result thanks to the UB. Those properties might be wrong if the non-UB conditions are violated, so the compiler might make optimizations that result in "incorrect" results in UB situations.

Any program written with the proposed <<% and avoiding UB (due to out-of-range shifts) would be semantically equivalent if written with <<. Thus, the proposal is a pure performance transformation and it relies on UB to achieve it. So I'm increasingly in favor of not using an infix operator and instead using a name like unsafe_shl rather than <<%. This can absolutely be public API and I'm still amenable to exporting it.

From the stated criteria

  • Given a valid shift value, compute the same result as the ordinary shift operators as efficiently as possible
  • Given an invalid shift value, we guarantee the function still returns, and still returns the correct type, but we make no guarantees about the returned value.

I'm absolutely in favor of having generic fallbacks to ordinary shifts for Integer. The ordinary shifts match those same criteria, with the only difference being that "as efficiently as possible" has no better definition than the ordinary shifts. Needing to copy-paste an implementation just to replace all << with <<% for certain types is bad ergonomics and is error prone. We still implement fma(::Int,::Int,::Int) even though there is zero performance difference to +(*(::Int,::Int),::Int). Generic code is what makes things compose.
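
A minimal sketch of such fallbacks, assuming the unsafe_ spellings discussed above and that the generic methods simply defer to the ordinary operators:

unsafe_shl(x::Integer, n::Integer)  = x << n     # generic fallback: same result, no performance claim
unsafe_ashr(x::Integer, n::Integer) = x >> n
unsafe_lshr(x::Integer, n::Integer) = x >>> n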

@adienes
Contributor

adienes commented Jan 9, 2024

So I'm increasingly in favor of not using an infix operator and instead using a name like unsafe_shl rather than <<%. This can absolutely be public API and I'm still amenable to exporting it.

My reasoning is more superficial than principled, but I agree with this conclusion, primarily because I don't like the aesthetics of op% so much and think that named functions are clearer.

We still implement fma(::Int,::Int,::Int) even though there is zero performance difference to +(*(::Int,::Int),::Int).

Not to derail the discussion, but what do you mean by this? I have certainly seen performance improvements in my own code by replacing +(*(... )) with fma(...)

@jakobnissen
Copy link
Contributor Author

jakobnissen commented Jan 9, 2024

Thinking more about it, you're right. If we think about + vs +%, the former could be turned into a checked version (which we guarantee will not overflow), whereas the latter has explicit overflow behaviour. My proposed >>% is the opposite - >> has explicit "overflow" behaviour, whereas >>% produces platform-specific values in the presence of overflow.

So yes, let's call them unsafe_shift_left or something. I guess I can always do const ⪡ = unsafe_shift_left in my own code if I want it infix :)

Although:

If those values actually occur, then UB permits the compiler to ignore the consequences.

This is NOT what I propose. I don't want the compiler to do whatever it likes in the presence of an invalid shift value. I want it to return an arbitrary value of the correct type (in practice, whatever the platform's native shift instruction does). In this particular case, I don't see any advantage in declaring invalid bitshift values to be UB, and these associations with the term "undefined behaviour" are exactly why I think we should not choose it. If there are good performance reasons to have something result in UB, so be it. But here, I just don't see that being the case.

@KristofferC
Copy link
Sponsor Member

FWIW, since this will presumably use LLVM's freeze to prevent any UB from propagating, I don't think unsafe is really warranted, since that in my mind has connotations of e.g. memory unsafety.

@mikmoore
Copy link
Contributor

mikmoore commented Jan 9, 2024

NOTE: someone else has offered a more nuanced view of unsafe_ than mine, so feel free to disregard the following in this specific discussion. I'll concede this situation is more akin to @fastmath optimizations than no-holds-barred UB. But I'm keeping the below because I think it is still useful commentary. Perhaps unchecked_ might be a more appropriate prefix. Although I'll still insist that a processor is mostly free (to my knowledge) to HCF or have other sorts of UB for out-of-range inputs, so I'm not completely dissuaded from unsafe_.


This is NOT what I propose. I don't want the compiler to do whatever in the presence of an invalid shift value. I want it to return an arbitrary value of the correct type (in practice, whatever the platform's native shift instruction does).

No compiler has ever actually attempted to summon nasal demons (or if one has tried, no one has ever reported it succeeding). The compiler must ensure the result is correct when the non-UB conditions are satisfied. It is not required to fulfill any specific semantic in UB situations. It will almost certainly decide against printing the complete works of William Shakespeare on UB, because that is more work than it has to do. It will also not go out of its way to insert an error path for values that you promised it would never see, because that's just more work. In all likelihood, the compiler will see that it gets the correct result for valid inputs via an unchecked shift, and it will quietly ignore the fact that it doesn't know what will happen for invalid values, because UB says it doesn't have to care.

Is it important that this be semantically guaranteed not to throw an error? Wouldn't it be convenient if the user were informed that they are breaking the promise they made and that the output of the program might be garbage? They won't actually get that notice, but they wouldn't be mad if they did. They'd try to go fix the mistake rather than try to silence the error. What use is a program that doesn't produce the correct result?

EDIT: Thanks to the following poster for pointing out one aspect of UB that I consistently manage to neglect, which does draw a slight wedge between the desired semantics here and the unrestricted UB I've been advocating. Although I'm still not totally certain that UB is entirely unreasonable here, I can acknowledge the possible merits of requiring hardware-native behavior rather than compiler-level UB on invalid shifts.

@mbauman
Copy link
Sponsor Member

mbauman commented Jan 9, 2024

In all likelihood, the compiler will see that it gets the correct result for valid inputs via an unchecked shift and it will quietly ignore the fact that it doesn't know what will happen if it uses invalid values because UB says it doesn't have to care.

@mikmoore this isn't terribly unlike our discussion on software-checked integer division. It's true that LLVM's shl i64 %x, %n will behave exactly as we'd want for all values of %n, even out-of-bounds ones... as long as %n is itself non-const. The moment it gets an invalid constant, you'll get a ret i64 poison. And once you get a poison, LLVM will happily keep throwing away work as much as it can, continuing on as far as it can go. Julia will happily propagate constants as deep as it can go and LLVM will happily bubble poison right back up at ya. That's the difference between a "compiler's UB" and simply returning an implementation-defined value. So, yeah, I shouldn't have used the words UB — no iteration here actually did UB in the poison sense of the phrase.


So what is this operation called? native_shl and <<′? Or a theoretical assume(0 <= n < sizeof(x)*8); x << n? Or bitshift_left? Or simply shl?

@gbaraldi
Copy link
Member

gbaraldi commented Jan 9, 2024

I'm a bit confused as to what this proposes. Since a wrapping shift doesn't map to an LLVM instruction, is this then something that behaves directly like the LLVM/C shift, where it's just UB?

The value produced is op1 * 2^op2 mod 2^n, where n is the width of the result. If op2 is (statically or dynamically) equal to or larger than the number of bits in op1, this instruction returns a poison value. If the arguments are vectors, each vector element of op1 is shifted by the corresponding shift amount in op2.
See reference https://llvm.org/docs/LangRef.html#shl-instruction

@mbauman changed the title from "Add wrapping shift function" to "Add a fast processor-native bitshift function" on Jan 9, 2024
@mbauman
Copy link
Sponsor Member

mbauman commented Jan 9, 2024

I've re-titled this to hopefully better capture the intent here ("wrapping" confused me, too) — the goal as I see it is to have some x <<′ n that's a single native processor operation and behaves like << for a limited subset of values of n, specifically 0 <= n < sizeof(x)*8. What it does outside that range could either be defined to act like the existing processor instructions do or it could be dependent upon the platform; I don't really care.

@gbaraldi
Copy link
Member

gbaraldi commented Jan 9, 2024

So the thing is, currently this maps to the LLVM call that has UB (which we might not care about). LLVM also has the funnel shifts (which do quite cool things but I'm not sure if that's what we want)

@KristofferC
Copy link
Sponsor Member

Just freeze the value coming out of it? I think we already do that for other things to avoid UB, e.g.

julia> f(x) = Base.fptosi(Int64, x)
f (generic function with 1 method)

julia> @code_llvm f(1.0)
;  @ REPL[5]:1 within `f`
define i64 @julia_f_227(double %0) #0 {
top:
  %1 = fptosi double %0 to i64
  %2 = freeze i64 %1 <-----------------------
  ret i64 %2
}

@StefanKarpinski
Copy link
Sponsor Member

I feel like this PR is headed in the wrong direction. I'd like for these operations to be well-defined and safe, but do what most CPUs already do, which is discard all but the low-order bits of the shift argument. Unfortunately, that means it won't be possible to shift by 64 bits, but so be it; perhaps that's what these operations should do.

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Jan 10, 2024

I also think the original names of <<% etc. were pretty good. The question is what the % is actually a modulus of, and the answer is that the shift is reduced modulo the number of bits of the first argument. An implementation that keeps the compiler happy is a bit annoying, but this one is fairly simple:

<<%(n::T, k::Integer) where {T<:Base.BitInteger} = n << ((k % UInt8) % UInt8(8*sizeof(T)))
>>%(n::T, k::Integer) where {T<:Base.BitInteger} = n >> ((k % UInt8) % UInt8(8*sizeof(T)))
>>>%(n::T, k::Integer) where {T<:Base.BitInteger} = n >>> ((k % UInt8) % UInt8(8*sizeof(T)))

And now you can see where the modulus in the name comes from. If we had defined % to do a proper modulus instead of rem, then it might even be possible to skip the first % UInt8; but nevertheless, this is a modulus.
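With these definitions the shift amount wraps at the bit width of the first argument. Spelling out the right-hand side of the definitions above explicitly, the would-be 0x01 <<% 9 and 1 <<% 65 evaluate as:

julia> 0x01 << ((9 % UInt8) % UInt8(8*sizeof(UInt8)))   # UInt8 operand: shift taken mod 8
0x02

julia> 1 << ((65 % UInt8) % UInt8(8*sizeof(Int64)))     # Int64 operand: 65 wraps to 1
2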

@jakobnissen
Copy link
Contributor Author

jakobnissen commented Jan 10, 2024

Unfortunately, that implementation doesn't map exactly onto x86 shift instructions; instead, it compiles to an and plus a shr instruction when used with 8-bit integers.

IIUC this is because of two annoying factors:

  1. On integers smaller than 64 bits, CPUs always use the lower 5 bits of the second operand. This also includes shifts on 8-bit integers. So the second operand should be masked not by % UInt8(8 * sizeof(T)), but rather by % max(0x20, UInt8(8*sizeof(T))).
  2. However, LLVM does not accept a shift larger than the bitsize of the first operand.

So, for 8-bit integers, if we modulo 8, an extra and instruction must be inserted since the CPU "wants" to modulo 32 natively. However, if we modulo 32, then LLVM will insert extra instructions to avoid returning a poison value.

We can get around it by extending the 8-bit integer to a 32-bit integer, shifting modulo 32, then truncating back to 8 bits. That produces better code, but it's icky that we force 32-bit bitshifts on 8-bit integers.
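For concreteness, a sketch of that widening trick for an 8-bit left shift (hypothetical helper name; this shows the shape of the workaround, not the PR's actual implementation):

function shl8_via_32(a::UInt8, n::Integer)
    a32 = UInt32(a)               # zero-extend to 32 bits
    (a32 << (n & 31)) % UInt8     # shift modulo 32, then truncate back to 8 bits
end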

Is there a way to get around LLVM's limitation and just produce the raw shr instruction we want?

@mikmoore
Copy link
Contributor

Just wanting to make sure that things stay in perspective:

Can someone give even a rough sense of the potential performance improvement this may bring to a useful calculation (i.e., not a nanobenchmark that simply does a bunch of shifts without motivation)? I don't need a literal benchmark, just hoping someone can at least assert a vague figure. My understanding is that the best "safe" version (modulo-shift) results in 2 instructions instead of the more desirable 1 (but improved from ~4 for known-sign shifts or ~11 for unknown shifts), but is this really a bottleneck in practice? AND instructions are among the cheapest available on a processor. Are there useful situations where the computational density of non-constant shifts is so high as to make an extra AND per shift a meaningful performance loss? More than 10-20%?

If there is real performance that we're missing and wanting here, then we probably aren't alone and perhaps this warrants an upstream issue requesting that LLVM expose the desired semantics? At some point, that may be easier than trying to hack something.


P.S.: With the semantic that >>% takes the shift amount modulo the bitwidth, I would find that spelling totally acceptable and intuitive.

@gbaraldi
Copy link
Member

Btw, most constant bitshifts inside of functions do get compiled to a single instruction. It's the dynamic ones that need some coaxing to not have the guards around them.

@StefanKarpinski
Copy link
Sponsor Member

I'm honestly not too concerned about the extra and in the case where the first argument is a single byte; I'd rather just have a semantically consistent definition of what these operators do that maps to a single instruction for all the other argument types. Why don't I care about UInt8? Several reasons:

  1. It's relatively rare to be doing bit shifts on a single byte
  2. I doubt the extra and instruction has much performance impact
  3. Single-byte operations typically don't have SIMD versions, which is where the real perf lies

So my inclination is to define the operations consistently the way I did and let the and be in there.

@mbauman
Copy link
Sponsor Member

mbauman commented Jan 10, 2024

Rust does call this a wrapping_shl, which then calls an unchecked_shl that has UB. This safe "wrapping" version masks by the number of bits, and so it doesn't have the super-microoptimized one-cpu-op for the smaller ints... and since they're OK with exposing LLVM's UB, folks who really want the one-op version can go for that.

I like the <<% name, too, but only in isolation. I don't like it if it's gonna live alongside +%, -% and *% because then it's confusing what's being modulo'ed... and I could even see there being a @% macro that flags an entire block of code as known to do modular arithmetic a la @. (yes, I know it'd be semantically very different, but I think it's still a transferable concept). Of course, throwing a naming wrench based on names we don't even have doesn't feel very fair.

The binary GCD algorithm from #30674 makes for a decent testbed. x <<ᵐ n is masking n as LLVM demands. <<ᵃ is using UnsafeAssume to promise to LLVM that n is in its allowed range — this probably isn't a good "final" version because it's crashy but it's a good proxy for the ideal performance.

Definitions
@noinline function gcd(<<, >>, a::T, b::T) where {T}
    @noinline throw1(a, b) = throw(OverflowError("gcd($a, $b) overflows"))
    a == 0 && return abs(b)
    b == 0 && return abs(a)
    za = trailing_zeros(a)
    zb = trailing_zeros(b)
    k = min(za, zb)
    u = unsigned(abs(a >> za))
    v = unsigned(abs(b >> zb))
    while u != v
        if u > v
            u, v = v, u
        end
        v -= u
        v >>= trailing_zeros(v)
    end
    r = u << k
    # T(r) would throw InexactError; we want OverflowError instead
    r > typemax(T) && throw1(a, b)
    r % T
end
x <<ᵐ n = x << (n & (sizeof(x)*8-1))
x >>ᵐ n = x >> (n & (sizeof(x)*8-1))

using UnsafeAssume
x <<ᵃ n = (unsafe_assume_condition(n >= 0); unsafe_assume_condition(n < sizeof(x)*8); x << n)
x >>ᵃ n = (unsafe_assume_condition(n >= 0); unsafe_assume_condition(n < sizeof(x)*8); x >> n)

For Int64 they generate the exact same native code on my ARM M1. But for Int16 and Int8 the assume versions perform better by skipping the superfluous mask:

julia> A = rand(Int64, 100_000);

julia> @btime (s = 0; @inbounds for n = 1:length($A)-1 s += gcd(<<, >>, $A[n], $A[n+1]) end; s);
  10.641 ms (0 allocations: 0 bytes)

julia> @btime (s = 0; @inbounds for n = 1:length($A)-1 s += gcd(<<ᵐ, >>ᵐ, $A[n], $A[n+1]) end; s);
  9.244 ms (0 allocations: 0 bytes)

julia> @btime (s = 0; @inbounds for n = 1:length($A)-1 s += gcd(<<ᵃ, >>ᵃ, $A[n], $A[n+1]) end; s);
  9.244 ms (0 allocations: 0 bytes)

julia> A = rand(Int16, 100_000);

julia> @btime (s = 0; @inbounds for n = 1:length($A)-1 s += gcd(<<, >>, $A[n], $A[n+1]) end; s);
  3.166 ms (0 allocations: 0 bytes)

julia> @btime (s = 0; @inbounds for n = 1:length($A)-1 s += gcd(<<ᵐ, >>ᵐ, $A[n], $A[n+1]) end; s);
  3.155 ms (0 allocations: 0 bytes)

julia> @btime (s = 0; @inbounds for n = 1:length($A)-1 s += gcd(<<ᵃ, >>ᵃ, $A[n], $A[n+1]) end; s);
  2.899 ms (0 allocations: 0 bytes)

julia> A = rand(Int8, 100_000);

julia> @btime (s = 0; @inbounds for n = 1:length($A)-1 s += gcd(<<, >>, $A[n], $A[n+1]) end; s);
  1.498 ms (0 allocations: 0 bytes)

julia> @btime (s = 0; @inbounds for n = 1:length($A)-1 s += gcd(<<ᵐ, >>ᵐ, $A[n], $A[n+1]) end; s);
  1.484 ms (0 allocations: 0 bytes)

julia> @btime (s = 0; @inbounds for n = 1:length($A)-1 s += gcd(<<ᵃ, >>ᵃ, $A[n], $A[n+1]) end; s);
  1.396 ms (0 allocations: 0 bytes)

@jakobnissen
Copy link
Contributor Author

jakobnissen commented Jan 11, 2024

One more benchmark: Same as Matt's above, but with two differences:

  • To see the effect of the reduction in instructions more clearly, I use @inline instead of @noinline on the gcd function
  • I also test one more set of functions, ⪡ and ⪢, which display the safe but somewhat obscure overflow behaviour I advocate (and which are implemented in the current state of this PR). Results for Int64 and Int32 are the same, but for Int16 and Int8:
function f(f1::F1, f2::F2, A) where {F1, F2}
    s = 0
    @inbounds for n in 1:length(A)-1
        s += gcd(f1, f2, A[n], A[n+1])
    end
    s
end

function ⪢(a, n)
    if sizeof(a) == 4 || sizeof(a) == 8
        return a >> (n & (8*sizeof(a)-1))
    else
        a2 = Core.Intrinsics.sext_int(UInt32, a)
        (a2 >> (n & 31)) % typeof(a)
    end
end

function ⪡(a, n)
    if sizeof(a) == 4 || sizeof(a) == 8
        return a << (n & (8*sizeof(a)-1))
    else
        a2 = Core.Intrinsics.zext_int(UInt32, a)
        (a2 << (n & 31)) % typeof(a)
    end
end
julia> A = rand(Int16, 100_000);

julia> @btime f(<<, >>, A);
  998.830 μs (1 allocation: 16 bytes)

julia> @btime f(<<ᵐ, >>ᵐ, A);
  942.925 μs (1 allocation: 16 bytes)

julia> @btime f(<<ᵃ, >>ᵃ, A);
  906.677 μs (1 allocation: 16 bytes)

julia> @btime f(⪡, ⪢, A);
  886.559 μs (1 allocation: 16 bytes)

julia> A = rand(Int8, 100_000);

julia> @btime f(<<, >>, A);
  1.007 ms (1 allocation: 16 bytes)

julia> @btime f(<<ᵐ, >>ᵐ, A);
  945.901 μs (1 allocation: 16 bytes)

julia> @btime f(<<ᵃ, >>ᵃ, A);
  909.542 μs (1 allocation: 16 bytes)

julia> @btime f(⪡, ⪢, A);
  887.090 μs (1 allocation: 16 bytes)

So, somehow even faster than the unsafe assume one, despite it being safe, and about 12% faster than the default bitshifts.
With the @inline, I see ⪡ and ⪢ performing on par with the unsafe assume one - slightly slower on 8-bit ones, slightly faster on 16-bit ones.

@mbauman
Copy link
Sponsor Member

mbauman commented Jan 11, 2024

Because everything is terrible, I see the opposite behavior on an M1:

preamble
julia> @inline function gcd(<<, >>, a::T, b::T) where {T}
           @noinline throw1(a, b) = throw(OverflowError("gcd($a, $b) overflows"))
           a == 0 && return abs(b)
           b == 0 && return abs(a)
           za = trailing_zeros(a)
           zb = trailing_zeros(b)
           k = min(za, zb)
           u = unsigned(abs(a >> za))
           v = unsigned(abs(b >> zb))
           while u != v
               if u > v
                   u, v = v, u
               end
               v -= u
               v >>= trailing_zeros(v)
           end
           r = u << k
           # T(r) would throw InexactError; we want OverflowError instead
           r > typemax(T) && throw1(a, b)
           r % T
       end
       x <<ᵐ n = x << (n & (sizeof(x)*8-1))
       x >>ᵐ n = x >> (n & (sizeof(x)*8-1))

       using UnsafeAssume
       x <<ᵃ n = (unsafe_assume_condition(n >= 0); unsafe_assume_condition(n < sizeof(x)*8); x << n)
       x >>ᵃ n = (unsafe_assume_condition(n >= 0); unsafe_assume_condition(n < sizeof(x)*8); x >> n)
>>ᵃ (generic function with 1 method)

julia> function f(f1::F1, f2::F2, A) where {F1, F2}
           s = 0
           @inbounds for n in 1:length(A)-1
               s += gcd(f1, f2, A[n], A[n+1])
           end
           s
       end
f (generic function with 1 method)

julia> function ⪢(a, n)
           if sizeof(a) == 4 || sizeof(a) == 8
               return a >> (n & (8*sizeof(a)-1))
           else
               a2 = Core.Intrinsics.sext_int(UInt32, a)
               (a2 >> (n & 31)) % typeof(a)
           end
       end
⪢ (generic function with 1 method)

julia> function ⪡(a, n)
           if sizeof(a) == 4 || sizeof(a) == 8
               return a << (n & (8*sizeof(a)-1))
           else
               a2 = Core.Intrinsics.zext_int(UInt32, a)
               (a2 << (n & 31)) % typeof(a)
           end
       end
⪡ (generic function with 1 method)

julia> using BenchmarkTools

julia> A = rand(Int32, 100_000);

julia> @btime f(<<, >>, A);
  5.133 ms (1 allocation: 16 bytes)

julia> @btime f(<<ᵐ, >>ᵐ, A);
  4.671 ms (1 allocation: 16 bytes)

julia> @btime f(<<ᵃ, >>ᵃ, A);
  4.672 ms (1 allocation: 16 bytes)

julia> @btime f(⪡, ⪢, A);
  4.672 ms (1 allocation: 16 bytes)
julia> A = rand(Int16, 100_000);

julia> @btime f(<<, >>, A);
  3.021 ms (1 allocation: 16 bytes)

julia> @btime f(<<ᵐ, >>ᵐ, A);
  3.092 ms (1 allocation: 16 bytes)

julia> @btime f(<<ᵃ, >>ᵃ, A);
  2.771 ms (1 allocation: 16 bytes)

julia> @btime f(⪡, ⪢, A);
  3.050 ms (1 allocation: 16 bytes)

julia> A = rand(Int8, 100_000);

julia> @btime f(<<, >>, A);
  1.363 ms (1 allocation: 16 bytes)

julia> @btime f(<<ᵐ, >>ᵐ, A);
  1.343 ms (1 allocation: 16 bytes)

julia> @btime f(<<ᵃ, >>ᵃ, A);
  1.257 ms (1 allocation: 16 bytes)

julia> @btime f(⪡, ⪢, A);
  1.382 ms (1 allocation: 16 bytes)

It's also worth checking the SIMD-ability of these operators. The GCD example doesn't SIMD. And of course ⪡ will only SIMD at 32-bit width minimum.

julia> code_native((xs,y)->map(x->(x<<y), xs), (NTuple{8,Int16},Int16), debuginfo=:none)
# ...
	sxtw	x9, w1
	cmp	w1, #0
	cset	w10, lt
	cmp	w1, #15
	cset	w11, hi
	neg	x12, x9
	cmp	x12, #15
	mov	w12, #15
	csneg	x9, x12, x9, hs
	ldr	q0, [x0]
	dup.8h	v1, w1
	ushl.8h	v1, v0, v1
	dup.8b	v2, w11
	ushll.8h	v2, v2, #0
	shl.8h	v2, v2, #15
	cmge.8h	v2, v2, #0
	and.16b	v1, v1, v2
	dup.8h	v2, w9
	neg.8h	v2, v2
	dup.8b	v3, w10
	sshl.8h	v0, v0, v2
	ushll.8h	v2, v3, #0
	shl.8h	v2, v2, #15
	cmlt.8h	v2, v2, #0
	bif.16b	v0, v1, v2
	str	q0, [x8]
	ret

julia> code_native((xs,y)->map(x->(x<<ᵐy), xs), (NTuple{8,Int16},Int16), debuginfo=:none)
# ...
	and	w9, w1, #0xf
	ldr	q0, [x0]
	dup.8h	v1, w9
	ushl.8h	v0, v0, v1
	str	q0, [x8]
	ret

julia> code_native((xs,y)->map(x->(x<<ᵃy), xs), (NTuple{8,Int16},Int16), debuginfo=:none)
# ...
	ldr	q0, [x0]
	dup.8h	v1, w1
	ushl.8h	v0, v0, v1
	str	q0, [x8]
	ret

julia> code_native((xs,y)->map(x->(x⪡y), xs), (NTuple{8,Int16},Int16), debuginfo=:none)
# ...
	and	w9, w1, #0x1f
	ldr	q0, [x0]
	ushll.4s	v1, v0, #0
	ushll2.4s	v0, v0, #0
	dup.4s	v2, w9
	ushl.4s	v0, v0, v2
	ushl.4s	v1, v1, v2
	uzp1.8h	v0, v1, v0
	str	q0, [x8]
	ret

And M1 does have ushl.16b ops for Int8, too.

@JeffBezanson
Copy link
Sponsor Member

Good discussion. I just want to chime in to agree that we should avoid UB, and that this should not be considered an "unsafe" function.

I agree with Matt on the >>% names. For a large shift amount, the mathematically-correct result for >> is -1 or 0, which we already give, so >>% should not be necessary. For <<, we give the result modulo the type, so it already computes <<% and we could potentially introduce that as an overflow-unchecked operator just like the proposed +%. Using the % for the strange shift amount masking is a pun.
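For reference, the current behaviour at the REPL:

julia> -1 >> 100    # arithmetic right shift of a negative value saturates to -1
-1

julia> 7 >> 100     # and to 0 for nonnegative values
0

julia> typemax(Int8) << 1    # << already returns the result modulo the type
-2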

@StefanKarpinski
Copy link
Sponsor Member

Ok, so what should we call these then? They're not unsafe (at least we don't want them to be), but they're also not modular. So what are they then?

@JeffBezanson
Copy link
Sponsor Member

Seems closest to @fastmath to me.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jan 12, 2024

Crystal seems to call them >>& and <<& (or a masked shift), since the implementation of the operator in this PR is x << (n & nbits_mask)

@gbaraldi
Copy link
Member

gbaraldi commented Jan 18, 2024

So in triage we discussed this and came to the conclusion that the behaviour should follow what LLVM defines:

This instruction always performs a logical shift right operation. The most significant bits of the result will be filled with zero bits after the shift. If op2 is (statically or dynamically) equal to or larger than the number of bits in op1, this instruction returns a poison value. If the arguments are vectors, each vector element of op1 is shifted by the corresponding shift amount in op2.

but with a freeze operation afterwards, which should eliminate the UB on the operation and reduce it to merely returning an unspecified value

@mbauman
Copy link
Sponsor Member

mbauman commented Jan 18, 2024

That would seem to imply a name of FastMath.shl_fast, correct? That way we're not embedding the modulo-nbits behaviors into either the documentation or name. It could initially be implemented with modulo behaviors, though, since that's the easiest way to get the optimal codegen for Int32 and Int64... and gets us 90% of the way there for the smaller ints.
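A sketch of what that initial masking-based definition could look like (hypothetical names; an eventual freeze-based intrinsic could replace the bodies without changing the documented contract):

shl_fast(x::Base.BitInteger, n::Integer)  = x << (n & (8*sizeof(x) - 1))   # left shift
ashr_fast(x::Base.BitInteger, n::Integer) = x >> (n & (8*sizeof(x) - 1))   # arithmetic right shift
lshr_fast(x::Base.BitInteger, n::Integer) = x >>> (n & (8*sizeof(x) - 1))  # logical right shift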

@jakobnissen removed the "needs decision" (A decision on this change is needed) and "status:triage" (This should be discussed on a triage call) labels on Jan 18, 2024
@gbaraldi
Copy link
Member

gbaraldi commented Jan 18, 2024

So this is what this might look like for one of the types (though this should be implemented as an intrinsic)

function shl(x::Int64, n::Int64)
    Base.llvmcall(
        """ %3 = shl i64 %0, %1
            %4 = freeze i64 %3
            ret i64 %4""", Int64, Tuple{Int64, Int64},x,n)
end


function shr(x::Int64, n::Int64)
    Base.llvmcall(
        """ %3 = ashr i64 %0, %1
            %4 = freeze i64 %3
            ret i64 %4""", Int64, Tuple{Int64, Int64}, x, n)
end

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jan 18, 2024

In that case, fastmath is indeed a good name. It should be noted that that version is not eligible for constant folding (unlike this PR), since the computation is not consistent (it is not pure).

@mikmoore
Copy link
Contributor

Every instance of "fast math" in Julia up to now (and in many languages) has referred exclusively to IEEE 754 floating-point values, though, so there's a bit of a name collision there. The same goes for the Base.FastMath module, so I don't agree that's the best spot for this (unless we're expecting to add several other similar optimizations in the near/medium future). I still think names like unsafe_shl would be suitable -- all the unsafe_ functions have specific preconditions for proper operation and may behave unpredictably otherwise, which seems to be what's happening here.

Regardless of name and code location, I absolutely definitely would not want this to be affiliated with @fastmath (I can't tell if that's being proposed). The education campaign to tell people it affected integer operations would need to be immense. It would also be a breaking change, since integer operations were previously safe in @fastmath contexts.

@uniment
Copy link

uniment commented Feb 23, 2024

Spitballing, I wonder if there's a new idiom to build here...

Imagine the following ways to specify the "flavors" of an operation:

safe{+}(a, b)
unsafe{<<}(x, n)
fast{sin}(x)

@PallHaraldsson
Copy link
Contributor

PallHaraldsson commented Feb 23, 2024

[I like that we are thinking of a solution, but maybe it should live in a package. Is it strictly needed for Julia itself? At the least we should have a documented way, e.g. by pointing to a package or to a safe/optimal solution/type such as UInt6, see below.]

At the JuliaSyntax issue:

For bitshifting, Julia's definition of e.g. >> checks for overflow, which leads to suboptimal code.

First, that's wrong, it can't overflow, but it got me thinking about what checks are needed, and a possible solution:

@code_lowered 1 >> Unsigned(1)  # still sub-optimal, but better than 1 >> 1

Could we not leverage the type system with a bit-shift type? 1 >> int_shift(1) could get rid of all the overhead, i.e. the cmpq $63, %rsi etc. Basically it would be a UInt6 on 64-bit platforms, or a UInt5 on 32-bit... You may want both of those defined (not sure if Rust has them, but it has the intriguing NonZeroU8 in std::num), with modular arithmetic; but at least you shift the problem away from the shift instructions (no pun intended). [I note that the rotate instructions do not seem to have any problems, which is a bit funny since they are often based on two shift instructions - so can it be relied on that rotates will always be compiled well?]

EDIT: I see we would actually need UInt3 too, and everything up to UInt7 for Int128... Zig has such types (as do a few other languages), maybe (also) for this same reason.

@code_native Int8(1) >> Unsigned(1)  # is such a shift actually often needed, or e.g. the less optimized Int128(1) >> Unsigned(1)?
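A rough sketch of what such a shift-amount type could look like in a package (all names hypothetical; whether the guard really disappears in the generated code would need checking with @code_native):

struct ShiftAmount64
    n::UInt8
    ShiftAmount64(k::Integer) = new((k % UInt8) & 0x3f)   # always in 0:63 by construction
end
Base.:>>(x::Int64, s::ShiftAmount64) = x >> s.n
Base.:<<(x::Int64, s::ShiftAmount64) = x << s.n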

@c42f
Copy link
Member

c42f commented Mar 12, 2024

I know I'm late to the party here, but I wanted to chime in briefly and say that my gut reaction is that we should absolutely not add new operator syntax for this.

Special syntax is an expense which is subtly imposed on all Julia users.

If >>% is unusual and not commonly used, or has subtle/ugly semantics, then it should get an unusual and ugly name to reflect that. Like FastMath.shl_fast as suggested by Matt. The length of the discussion above seems like proof that it has subtle semantics.

Special syntax needs to meet a high bar of being fairly useful to a broad range of users. Or being extremely useful to a narrower set of users. I don't think this operation meets that bar.

The rest of this seems great - we should absolutely teach the compiler what it needs to know about this operation, and make it possible for people who need it to use it. Let's just use a normal function name for this (maybe with @fastmath support).

Development

Successfully merging this pull request may close these issues.

Feature request: Native bitshift operation