Skip to content

Commit

Permalink
[BPF] Replace BPFMIPeepholeTruncElim by custom logic in isZExtFree()
Browse files Browse the repository at this point in the history
Replace `BPFMIPeepholeTruncElim` by adding an overload for
`TargetLowering::isZExtFree()` aware that zero extension is
free for `ISD::LOAD`.

Short description
=================

The `BPFMIPeepholeTruncElim` handles two patterns:

Pattern #1:

    %1 = LDB %0, ...              %1 = LDB %0, ...
    %2 = AND_ri %1, 0xff      ->  %2 = MOV_ri %1    <-- (!)

Pattern #2:

    bb.1:                         bb.1:
      %a = LDB %0, ...              %a = LDB %0, ...
      br %bb3                       br %bb3
    bb.2:                         bb.2:
      %b = LDB %0, ...        ->    %b = LDB %0, ...
      br %bb3                       br %bb3
    bb.3:                         bb.3:
      %1 = PHI %a, %b               %1 = PHI %a, %b
      %2 = AND_ri %1, 0xff          %2 = MOV_ri %1  <-- (!)

Plus variations:
- AND_ri_32 instead of AND_ri
- SLL/SLR instead of AND_ri
- LDH, LDW, LDB32, LDH32, LDW32

Both patterns could be handled by built-in transformations at
instruction selection phase if suitable `isZExtFree()` implementation
is provided. The idea is borrowed from `ARMTargetLowering::isZExtFree`.

When evaluating on BPF kernel selftests and remove_truncate_*.ll LLVM
test cases this revisions performs slightly better than
BPFMIPeepholeTruncElim, see "Impact" section below for details.

Commit also adds a few test cases to make sure that patterns in
question are handled.

Long description
================

Why this works: Pattern #1
--------------------------

Consider the following example:

    define i1 @foo(ptr %p) {
    entry:
      %a = load i8, ptr %p, align 1
      %cond = icmp eq i8 %a, 0
      ret i1 %cond
    }

Log for `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel` command:

    ...
    Type-legalized selection DAG: %bb.0 'foo:entry'
    SelectionDAG has 13 nodes:
      t0: ch,glue = EntryToken
              t2: i64,ch = CopyFromReg t0, Register:i64 %0
            t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
          t19: i64 = and t16, Constant:i64<255>
        t17: i64 = setcc t19, Constant:i64<0>, seteq:ch
      t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
      t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
    ...
    Replacing.1 t19: i64 = and t16, Constant:i64<255>
    With: t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
     and 0 other values
    ...
    Optimized type-legalized selection DAG: %bb.0 'foo:entry'
    SelectionDAG has 11 nodes:
      t0: ch,glue = EntryToken
            t2: i64,ch = CopyFromReg t0, Register:i64 %0
          t20: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
        t17: i64 = setcc t20, Constant:i64<0>, seteq:ch
      t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
      t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
    ...

Note:
- Optimized type-legalized selection DAG:
  - `t19 = and t16, 255` had been replaced by `t16` (load).
  - Patterns like `(and (load ... i8), 255)` are replaced by `load`
    in `DAGCombiner::BackwardsPropagateMask` called from
    `DAGCombiner::visitAND`.
  - Similarly patterns like `(shl (srl ..., 56), 56)` are replaced by
    `(and ..., 255)` in `DAGCombiner::visitSRL` (this function is huge,
    look for `TLI.shouldFoldConstantShiftPairToMask()` call).

Why this works: Pattern #2
--------------------------

Consider the following example:

    define i1 @foo(ptr %p) {
    entry:
      %a = load i8, ptr %p, align 1
      br label %next

    next:
      %cond = icmp eq i8 %a, 0
      ret i1 %cond
    }

Consider log for `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel` command.
Log for first basic block:

    Initial selection DAG: %bb.0 'foo:entry'
    SelectionDAG has 9 nodes:
      t0: ch,glue = EntryToken
      t3: i64 = Constant<0>
            t2: i64,ch = CopyFromReg t0, Register:i64 %1
          t5: i8,ch = load<(load (s8) from %ir.p)> t0, t2, undef:i64
        t6: i64 = zero_extend t5
      t8: ch = CopyToReg t0, Register:i64 %0, t6
    ...
    Replacing.1 t6: i64 = zero_extend t5
    With: t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
     and 0 other values
    ...
    Optimized lowered selection DAG: %bb.0 'foo:entry'
    SelectionDAG has 7 nodes:
      t0: ch,glue = EntryToken
          t2: i64,ch = CopyFromReg t0, Register:i64 %1
        t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
      t8: ch = CopyToReg t0, Register:i64 %0, t9

Note:
- Initial selection DAG:
  - `%a = load ...` is lowered as `t6 = (zero_extend (load ...))`
    w/o special `isZExtFree()` overload added by this commit
    it is instead lowered as `t6 = (any_extend (load ...))`.
  - The decision to generate `zero_extend` or `any_extend` is
    done in `RegsForValue::getCopyToRegs` called from
    `SelectionDAGBuilder::CopyValueToVirtualRegister`:
    - if `isZExtFree()` for load returns true `zero_extend` is used;
    - `any_extend` is used otherwise.
- Optimized lowered selection DAG:
  - `t6 = (any_extend (load ...))` is replaced by
    `t9 = load ..., zext from i8`
    This is done by `DagCombiner.cpp:tryToFoldExtOfLoad()` called from
    `DAGCombiner::visitZERO_EXTEND`.

Log for second basic block:

    Initial selection DAG: %bb.1 'foo:next'
    SelectionDAG has 13 nodes:
      t0: ch,glue = EntryToken
                t2: i64,ch = CopyFromReg t0, Register:i64 %0
              t4: i64 = AssertZext t2, ValueType:ch:i8
            t5: i8 = truncate t4
          t8: i1 = setcc t5, Constant:i8<0>, seteq:ch
        t9: i64 = any_extend t8
      t11: ch,glue = CopyToReg t0, Register:i64 $r0, t9
      t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
    ...
    Replacing.2 t18: i64 = and t4, Constant:i64<255>
    With: t4: i64 = AssertZext t2, ValueType:ch:i8
    ...
    Type-legalized selection DAG: %bb.1 'foo:next'
    SelectionDAG has 13 nodes:
      t0: ch,glue = EntryToken
              t2: i64,ch = CopyFromReg t0, Register:i64 %0
            t4: i64 = AssertZext t2, ValueType:ch:i8
          t18: i64 = and t4, Constant:i64<255>
        t16: i64 = setcc t18, Constant:i64<0>, seteq:ch
      t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
      t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
    ...
    Optimized type-legalized selection DAG: %bb.1 'foo:next'
    SelectionDAG has 11 nodes:
      t0: ch,glue = EntryToken
            t2: i64,ch = CopyFromReg t0, Register:i64 %0
          t4: i64 = AssertZext t2, ValueType:ch:i8
        t16: i64 = setcc t4, Constant:i64<0>, seteq:ch
      t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
      t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
    ...

Note:
- Initial selection DAG:
  - `t0` is an input value for this basic block, it corresponds load
    instruction (`t9`) from the first basic block.
  - It is accessed within basic block via
    `t4` (AssertZext (CopyFromReg t0, ...)).
  - The `AssertZext` is generated by RegsForValue::getCopyFromRegs
    called from SelectionDAGBuilder::getCopyFromRegs, it is generated
    only when `LiveOutInfo` with known number of leading zeros is
    present for `t0`.
  - Known register bits in `LiveOutInfo` are computed by
    `SelectionDAG::computeKnownBits` called from
    `SelectionDAGISel::ComputeLiveOutVRegInfo`.
  - `computeKnownBits()` generates leading zeros information for
    `(load ..., zext from ...)` but *does not* generate leading zeros
    information for `(load ..., anyext from ...)`.
    This is why `isZExtFree()` added in this commit is important.
- Type-legalized selection DAG:
  - `t5 = truncate t4` is replaced by `t18 = and t4, 255`
- Optimized type-legalized selection DAG:
  - `t18 = and t4, 255` is replaced by `t4`, this is done by
    `DAGCombiner::SimplifyDemandedBits` called from
    `DAGCombiner::visitAND`, which simplifies patterns like
    `(and (assertzext ...))`

Impact
------

This change covers all remove_truncate_*.ll test cases:
- for -mcpu=v4 there are no changes in the generated code;
- for -mcpu=v2 code generated for remove_truncate_7 and
  remove_truncate_8 improved slightly, for other tests it is
  unchanged.

For remove_truncate_7:

    Before this revision                 After this revision
    --------------------                 -------------------
        r1 <<= 0x20                          r1 <<= 0x20
        r1 >>= 0x20                          r1 >>= 0x20
        if r1 == 0x0 goto +0x2 <LBB0_2>      if r1 == 0x0 goto +0x2 <LBB0_2>
        r1 = *(u32 *)(r2 + 0x0)              r0 = *(u32 *)(r2 + 0x0)
        goto +0x1 <LBB0_3>                   goto +0x1 <LBB0_3>
    <LBB0_2>:                            <LBB0_2>:
        r1 = *(u32 *)(r2 + 0x4)              r0 = *(u32 *)(r2 + 0x4)
    <LBB0_3>:                            <LBB0_3>:
        r0 = r1                              exit
        exit

For remove_truncate_8:

    Before this revision                 After this revision
    --------------------                 -------------------
        r2 = *(u32 *)(r1 + 0x0)              r2 = *(u32 *)(r1 + 0x0)
        r3 = r2                              r3 = r2
        r3 <<= 0x20                          r3 <<= 0x20
        r4 = r3                              r3 s>>= 0x20
        r4 s>>= 0x20
        if r4 s> 0x2 goto +0x5 <LBB0_3>      if r3 s> 0x2 goto +0x4 <LBB0_3>
        r4 = *(u32 *)(r1 + 0x4)              r3 = *(u32 *)(r1 + 0x4)
        r3 >>= 0x20
        if r3 >= r4 goto +0x2 <LBB0_3>       if r2 >= r3 goto +0x2 <LBB0_3>
        r2 += 0x2                            r2 += 0x2
        *(u32 *)(r1 + 0x0) = r2              *(u32 *)(r1 + 0x0) = r2
    <LBB0_3>:                            <LBB0_3>:
        r0 = 0x3                             r0 = 0x3
        exit                                 exit

For kernel BPF selftests statistics is as follows: (-mcpu=v4):
- For -mcpu=v4: 9 out of 655 object files have differences,
  in all cases total number of instructions marginally decreased
  (-27 instructions).
- For -mcpu=v2: 9 out of 655 object files have differences:
  - For 19 object files number of instruction decreased
    (-129 instruction in total): some redundant `rX &= 0xffff`
    and register to register assignments removed;
  - For 2 object files number of instructions increased +2
    instructions in each file.

Both -mcpu=v2 instruction increases could be reduced to the same
example:

    define void @foo(ptr %p) {
    entry:
      %a = load i32, ptr %p, align 4
      %b = sext i32 %a to i64
      %c = icmp ult i64 1, %b
      br i1 %c, label %next, label %end

    next:
      call void inttoptr (i64 62 to ptr)(i32 %a)
      br label %end

    end:
      ret void
    }

Note that this example uses value loaded to `%a` both as a sign
extended (`%b`) and as zero extended (`%a` passed as parameter).
Here is the difference in final assembly code:

    Before this revision          After this revision
    --------------------          -------------------
        r1 = *(u32 *)(r1 + 0)         r1 = *(u32 *)(r1 + 0)
        r1 <<= 32                     r1 <<= 32
        r1 s>>= 32                    r1 s>>= 32
        if r1 < 2 goto <LBB0_2>       if r1 < 2 goto <LBB0_2>
                                      r1 <<= 32
                                      r1 >>= 32
        call 62                       call 62
    <LBB0_2>:                     <LBB0_2>:
        exit                          exit

Before this commit `%a` is passed to call as a sign extended value,
after this commit `%a` is passed to call as a zero extended value,
both are correct as 32-bit sub-register is the same.

The difference comes from `DAGCombiner` operation on the initial DAG:

Initial selection DAG before this commit:

    t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
          t6: i64 = any_extend t5         <--------------------- (1)
        t8: ch = CopyToReg t0, Register:i64 %0, t6
            t9: i64 = sign_extend t5
          t12: i1 = setcc Constant:i64<1>, t9, setult:ch

Initial selection DAG after this commit:

    t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
          t6: i64 = zero_extend t5        <--------------------- (2)
        t8: ch = CopyToReg t0, Register:i64 %0, t6
            t9: i64 = sign_extend t5
          t12: i1 = setcc Constant:i64<1>, t9, setult:ch

The node `t9` is processed before node `t6` and `load` instruction is
combined to load with sign extension:

    Replacing.1 t9: i64 = sign_extend t5
    With: t30: i64,ch = load<(load (s32) from %ir.p), sext from i32> t0, t2, undef:i64
     and 0 other values
    Replacing.1 t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
    With: t31: i32 = truncate t30
     and 1 other values

This is done by `DAGCombiner.cpp:tryToFoldExtOfLoad` called from
`DAGCombiner::visitSIGN_EXTEND`. Note that `t5` is used by `t6` which
is `any_extend` in (1) and `zero_extend` in (2).
`tryToFoldExtOfLoad()` rewrites such uses of `t5` differently:
- `any_extend` is simply removed
- `zero_extend` is replaced by `and t30, 0xffffffff`, which is later
  converted to a pair of shifts. This pair of shifts survives till the
  end of translation.

Differential Revision: https://reviews.llvm.org/D157870
  • Loading branch information
eddyz87 committed Aug 21, 2023
1 parent 48e0a6f commit 651e644
Show file tree
Hide file tree
Showing 6 changed files with 95 additions and 182 deletions.
4 changes: 1 addition & 3 deletions llvm/lib/Target/BPF/BPF.h
Original file line number Diff line number Diff line change
Expand Up @@ -23,14 +23,12 @@ ModulePass *createBPFCheckAndAdjustIR();
FunctionPass *createBPFISelDag(BPFTargetMachine &TM);
FunctionPass *createBPFMISimplifyPatchablePass();
FunctionPass *createBPFMIPeepholePass();
FunctionPass *createBPFMIPeepholeTruncElimPass();
FunctionPass *createBPFMIPreEmitPeepholePass();
FunctionPass *createBPFMIPreEmitCheckingPass();

void initializeBPFCheckAndAdjustIRPass(PassRegistry&);
void initializeBPFDAGToDAGISelPass(PassRegistry &);
void initializeBPFMIPeepholePass(PassRegistry&);
void initializeBPFMIPeepholeTruncElimPass(PassRegistry &);
void initializeBPFMIPeepholePass(PassRegistry &);
void initializeBPFMIPreEmitCheckingPass(PassRegistry&);
void initializeBPFMIPreEmitPeepholePass(PassRegistry &);
void initializeBPFMISimplifyPatchablePass(PassRegistry &);
Expand Down
12 changes: 12 additions & 0 deletions llvm/lib/Target/BPF/BPFISelLowering.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -224,6 +224,18 @@ bool BPFTargetLowering::isZExtFree(EVT VT1, EVT VT2) const {
return NumBits1 == 32 && NumBits2 == 64;
}

bool BPFTargetLowering::isZExtFree(SDValue Val, EVT VT2) const {
EVT VT1 = Val.getValueType();
if (Val.getOpcode() == ISD::LOAD && VT1.isSimple() && VT2.isSimple()) {
MVT MT1 = VT1.getSimpleVT().SimpleTy;
MVT MT2 = VT2.getSimpleVT().SimpleTy;
if ((MT1 == MVT::i8 || MT1 == MVT::i16 || MT1 == MVT::i32) &&
(MT2 == MVT::i32 || MT2 == MVT::i64))
return true;
}
return TargetLoweringBase::isZExtFree(Val, VT2);
}

BPFTargetLowering::ConstraintType
BPFTargetLowering::getConstraintType(StringRef Constraint) const {
if (Constraint.size() == 1) {
Expand Down
1 change: 1 addition & 0 deletions llvm/lib/Target/BPF/BPFISelLowering.h
Original file line number Diff line number Diff line change
Expand Up @@ -144,6 +144,7 @@ class BPFTargetLowering : public TargetLowering {
// For 32bit ALU result zext to 64bit is free.
bool isZExtFree(Type *Ty1, Type *Ty2) const override;
bool isZExtFree(EVT VT1, EVT VT2) const override;
bool isZExtFree(SDValue Val, EVT VT2) const override;

unsigned EmitSubregExt(MachineInstr &MI, MachineBasicBlock *BB, unsigned Reg,
bool isSigned) const;
Expand Down
177 changes: 0 additions & 177 deletions llvm/lib/Target/BPF/BPFMIPeephole.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -606,180 +606,3 @@ FunctionPass* llvm::createBPFMIPreEmitPeepholePass()
{
return new BPFMIPreEmitPeephole();
}

STATISTIC(TruncElemNum, "Number of truncation eliminated");

namespace {

struct BPFMIPeepholeTruncElim : public MachineFunctionPass {

static char ID;
const BPFInstrInfo *TII;
MachineFunction *MF;
MachineRegisterInfo *MRI;

BPFMIPeepholeTruncElim() : MachineFunctionPass(ID) {
initializeBPFMIPeepholeTruncElimPass(*PassRegistry::getPassRegistry());
}

private:
// Initialize class variables.
void initialize(MachineFunction &MFParm);

bool eliminateTruncSeq();

public:

// Main entry point for this pass.
bool runOnMachineFunction(MachineFunction &MF) override {
if (skipFunction(MF.getFunction()))
return false;

initialize(MF);

return eliminateTruncSeq();
}
};

static bool TruncSizeCompatible(int TruncSize, unsigned opcode)
{
if (TruncSize == 1)
return opcode == BPF::LDB || opcode == BPF::LDB32;

if (TruncSize == 2)
return opcode == BPF::LDH || opcode == BPF::LDH32;

if (TruncSize == 4)
return opcode == BPF::LDW || opcode == BPF::LDW32;

return false;
}

// Initialize class variables.
void BPFMIPeepholeTruncElim::initialize(MachineFunction &MFParm) {
MF = &MFParm;
MRI = &MF->getRegInfo();
TII = MF->getSubtarget<BPFSubtarget>().getInstrInfo();
LLVM_DEBUG(dbgs() << "*** BPF MachineSSA TRUNC Elim peephole pass ***\n\n");
}

// Reg truncating is often the result of 8/16/32bit->64bit or
// 8/16bit->32bit conversion. If the reg value is loaded with
// masked byte width, the AND operation can be removed since
// BPF LOAD already has zero extension.
//
// This also solved a correctness issue.
// In BPF socket-related program, e.g., __sk_buff->{data, data_end}
// are 32-bit registers, but later on, kernel verifier will rewrite
// it with 64-bit value. Therefore, truncating the value after the
// load will result in incorrect code.
bool BPFMIPeepholeTruncElim::eliminateTruncSeq() {
MachineInstr* ToErase = nullptr;
bool Eliminated = false;

for (MachineBasicBlock &MBB : *MF) {
for (MachineInstr &MI : MBB) {
// The second insn to remove if the eliminate candidate is a pair.
MachineInstr *MI2 = nullptr;
Register DstReg, SrcReg;
MachineInstr *DefMI;
int TruncSize = -1;

// If the previous instruction was marked for elimination, remove it now.
if (ToErase) {
ToErase->eraseFromParent();
ToErase = nullptr;
}

// AND A, 0xFFFFFFFF will be turned into SLL/SRL pair due to immediate
// for BPF ANDI is i32, and this case only happens on ALU64.
if (MI.getOpcode() == BPF::SRL_ri &&
MI.getOperand(2).getImm() == 32) {
SrcReg = MI.getOperand(1).getReg();
if (!MRI->hasOneNonDBGUse(SrcReg))
continue;

MI2 = MRI->getVRegDef(SrcReg);
DstReg = MI.getOperand(0).getReg();

if (!MI2 ||
MI2->getOpcode() != BPF::SLL_ri ||
MI2->getOperand(2).getImm() != 32)
continue;

// Update SrcReg.
SrcReg = MI2->getOperand(1).getReg();
DefMI = MRI->getVRegDef(SrcReg);
if (DefMI)
TruncSize = 4;
} else if (MI.getOpcode() == BPF::AND_ri ||
MI.getOpcode() == BPF::AND_ri_32) {
SrcReg = MI.getOperand(1).getReg();
DstReg = MI.getOperand(0).getReg();
DefMI = MRI->getVRegDef(SrcReg);

if (!DefMI)
continue;

int64_t imm = MI.getOperand(2).getImm();
if (imm == 0xff)
TruncSize = 1;
else if (imm == 0xffff)
TruncSize = 2;
}

if (TruncSize == -1)
continue;

// The definition is PHI node, check all inputs.
if (DefMI->isPHI()) {
bool CheckFail = false;

for (unsigned i = 1, e = DefMI->getNumOperands(); i < e; i += 2) {
MachineOperand &opnd = DefMI->getOperand(i);
if (!opnd.isReg()) {
CheckFail = true;
break;
}

MachineInstr *PhiDef = MRI->getVRegDef(opnd.getReg());
if (!PhiDef || PhiDef->isPHI() ||
!TruncSizeCompatible(TruncSize, PhiDef->getOpcode())) {
CheckFail = true;
break;
}
}

if (CheckFail)
continue;
} else if (!TruncSizeCompatible(TruncSize, DefMI->getOpcode())) {
continue;
}

BuildMI(MBB, MI, MI.getDebugLoc(), TII->get(BPF::MOV_rr), DstReg)
.addReg(SrcReg);

if (MI2)
MI2->eraseFromParent();

// Mark it to ToErase, and erase in the next iteration.
ToErase = &MI;
TruncElemNum++;
Eliminated = true;
}
}

return Eliminated;
}

} // end default namespace

INITIALIZE_PASS(BPFMIPeepholeTruncElim, "bpf-mi-trunc-elim",
"BPF MachineSSA Peephole Optimization For TRUNC Eliminate",
false, false)

char BPFMIPeepholeTruncElim::ID = 0;
FunctionPass* llvm::createBPFMIPeepholeTruncElimPass()
{
return new BPFMIPeepholeTruncElim();
}
2 changes: 0 additions & 2 deletions llvm/lib/Target/BPF/BPFTargetMachine.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,6 @@ extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeBPFTarget() {
PassRegistry &PR = *PassRegistry::getPassRegistry();
initializeBPFCheckAndAdjustIRPass(PR);
initializeBPFMIPeepholePass(PR);
initializeBPFMIPeepholeTruncElimPass(PR);
initializeBPFDAGToDAGISelPass(PR);
}

Expand Down Expand Up @@ -155,7 +154,6 @@ void BPFPassConfig::addMachineSSAOptimization() {
if (!DisableMIPeephole) {
if (Subtarget->getHasAlu32())
addPass(createBPFMIPeepholePass());
addPass(createBPFMIPeepholeTruncElimPass());
}
}

Expand Down
81 changes: 81 additions & 0 deletions llvm/test/CodeGen/BPF/remove_truncate_9.ll
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
; RUN: llc -mcpu=v2 -march=bpf < %s | FileCheck %s
; RUN: llc -mcpu=v4 -march=bpf < %s | FileCheck %s

; Zero extension instructions should be eliminated at instruction
; selection phase for all test cases below.

; In BPF zero extension is implemented as &= or a pair of <<=/>>=
; instructions, hence simply check that &= and >>= do not exist in
; generated code (<<= remains because %c is used by both call and
; lshr in a few test cases).

; CHECK-NOT: &=
; CHECK-NOT: >>=

define void @shl_lshr_same_bb(ptr %p) {
entry:
%a = load i8, ptr %p, align 1
%b = zext i8 %a to i64
%c = shl i64 %b, 56
%d = lshr i64 %c, 56
%e = icmp eq i64 %d, 0
; hasOneUse() is a common requirement for many CombineDAG
; transofmations, make sure that it does not matter in this case.
call void @sink1(i8 %a, i64 %b, i64 %c, i64 %d, i1 %e)
ret void
}

define void @shl_lshr_diff_bb(ptr %p) {
entry:
%a = load i16, ptr %p, align 2
%b = zext i16 %a to i64
%c = shl i64 %b, 48
%d = lshr i64 %c, 48
br label %next

; Jump to the new basic block creates a COPY instruction for %d, which
; might be materialized as noop or as AND_ri (zero extension) at the
; start of the basic block. The decision depends on TLI.isZExtFree()
; results, see RegsForValue::getCopyToRegs(). Check below verifies
; that COPY is materialized as noop.
next:
%e = icmp eq i64 %d, 0
call void @sink2(i16 %a, i64 %b, i64 %c, i64 %d, i1 %e)
ret void
}

define void @load_zext_same_bb(ptr %p) {
entry:
%a = load i8, ptr %p, align 1
; zext is implicit in this context
%b = icmp eq i8 %a, 0
call void @sink3(i8 %a, i1 %b)
ret void
}

define void @load_zext_diff_bb(ptr %p) {
entry:
%a = load i8, ptr %p, align 1
br label %next

next:
%b = icmp eq i8 %a, 0
call void @sink3(i8 %a, i1 %b)
ret void
}

define void @load_zext_diff_bb_2(ptr %p) {
entry:
%a = load i32, ptr %p, align 4
br label %next

next:
%b = icmp eq i32 %a, 0
call void @sink4(i32 %a, i1 %b)
ret void
}

declare void @sink1(i8, i64, i64, i64, i1);
declare void @sink2(i16, i64, i64, i64, i1);
declare void @sink3(i8, i1);
declare void @sink4(i32, i1);

0 comments on commit 651e644

Please sign in to comment.