
[CodeGen][Spill2Reg] Initial patch #118832

Open · wants to merge 8 commits into main
Conversation

@vporpo (Contributor) commented Dec 5, 2024

This is the first commit for the Spill2Reg optimization pass. The goal of this pass is to selectively replace spills to the stack with spills to vector registers. This can help remove back-end stalls in x86.

Old code review: https://reviews.llvm.org/D118298

RFC:
https://lists.llvm.org/pipermail/llvm-dev/2022-January/154782.html
https://discourse.llvm.org/t/rfc-spill2reg-selectively-replace-spills-to-stack-with-spills-to-vector-registers/59630
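Not part of the patch, but for illustration: on x86 the intended transformation looks roughly like the sketch below (pseudocode-level assembly, assuming a 32-bit GPR spill and a free `%xmm0`):

```
# Before: spill/reload through the stack (occupies a memory unit)
mov    %eax, -8(%rsp)      # spill %eax to a stack slot
...
mov    -8(%rsp), %eax      # reload

# After Spill2Reg: spill into a free vector register instead
movd   %eax, %xmm0         # spill %eax into %xmm0
...
movd   %xmm0, %eax         # reload
```

The register-to-register copies bypass the memory pipeline, which is where the back-end stall relief comes from.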

@llvmbot (Member) commented Dec 12, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-llvm-regalloc

Author: vporpo (vporpo)

Changes

This is the first commit for the Spill2Reg optimization pass. The goal of this pass is to selectively replace spills to the stack with spills to vector registers. This can help remove back-end stalls in x86.

Old code review: https://reviews.llvm.org/D118298

RFC:
https://lists.llvm.org/pipermail/llvm-dev/2022-January/154782.html
https://discourse.llvm.org/t/rfc-spill2reg-selectively-replace-spills-to-stack-with-spills-to-vector-registers/59630


Full diff: https://github.com/llvm/llvm-project/pull/118832.diff

6 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/Passes.h (+3)
  • (modified) llvm/include/llvm/InitializePasses.h (+1)
  • (modified) llvm/lib/CodeGen/CMakeLists.txt (+1)
  • (modified) llvm/lib/CodeGen/CodeGen.cpp (+1)
  • (added) llvm/lib/CodeGen/Spill2Reg.cpp (+56)
  • (modified) llvm/lib/CodeGen/TargetPassConfig.cpp (+9)
diff --git a/llvm/include/llvm/CodeGen/Passes.h b/llvm/include/llvm/CodeGen/Passes.h
index d1fac4a304cffe..77d305aa7d0a9c 100644
--- a/llvm/include/llvm/CodeGen/Passes.h
+++ b/llvm/include/llvm/CodeGen/Passes.h
@@ -608,6 +608,9 @@ namespace llvm {
 
   /// Lowers KCFI operand bundles for indirect calls.
   FunctionPass *createKCFIPass();
+
+  /// This pass replaces spills to stack with spills to registers.
+  extern char &Spill2RegID;
 } // End llvm namespace
 
 #endif
diff --git a/llvm/include/llvm/InitializePasses.h b/llvm/include/llvm/InitializePasses.h
index 7b81c9a8e143a3..7467844ec34038 100644
--- a/llvm/include/llvm/InitializePasses.h
+++ b/llvm/include/llvm/InitializePasses.h
@@ -321,6 +321,7 @@ void initializeWasmEHPreparePass(PassRegistry &);
 void initializeWinEHPreparePass(PassRegistry &);
 void initializeWriteBitcodePassPass(PassRegistry &);
 void initializeXRayInstrumentationPass(PassRegistry &);
+void initializeSpill2RegPass(PassRegistry &);
 
 } // end namespace llvm
 
diff --git a/llvm/lib/CodeGen/CMakeLists.txt b/llvm/lib/CodeGen/CMakeLists.txt
index 7b47c0e6f75dbe..8cbd5650fdd10c 100644
--- a/llvm/lib/CodeGen/CMakeLists.txt
+++ b/llvm/lib/CodeGen/CMakeLists.txt
@@ -219,6 +219,7 @@ add_llvm_component_library(LLVMCodeGen
   SjLjEHPrepare.cpp
   SlotIndexes.cpp
   SpillPlacement.cpp
+  Spill2Reg.cpp
   SplitKit.cpp
   StackColoring.cpp
   StackFrameLayoutAnalysisPass.cpp
diff --git a/llvm/lib/CodeGen/CodeGen.cpp b/llvm/lib/CodeGen/CodeGen.cpp
index 59428818c1ee7c..2e599451a4b4a2 100644
--- a/llvm/lib/CodeGen/CodeGen.cpp
+++ b/llvm/lib/CodeGen/CodeGen.cpp
@@ -143,4 +143,5 @@ void llvm::initializeCodeGen(PassRegistry &Registry) {
   initializeWasmEHPreparePass(Registry);
   initializeWinEHPreparePass(Registry);
   initializeXRayInstrumentationPass(Registry);
+  initializeSpill2RegPass(Registry);
 }
diff --git a/llvm/lib/CodeGen/Spill2Reg.cpp b/llvm/lib/CodeGen/Spill2Reg.cpp
new file mode 100644
index 00000000000000..09ffa71b891cb5
--- /dev/null
+++ b/llvm/lib/CodeGen/Spill2Reg.cpp
@@ -0,0 +1,56 @@
+//===- Spill2Reg.cpp - Spill To Register Optimization ---------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+//
+/// \file This file implements Spill2Reg, an optimization which selectively
+/// replaces spills/reloads to/from the stack with register copies to/from the
+/// vector register file. This works even on targets where load/stores have
+/// similar latency to register copies because it can free up memory units which
+/// helps avoid back-end stalls.
+///
+//===----------------------------------------------------------------------===//
+
+#include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/Passes.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Support/CommandLine.h"
+
+using namespace llvm;
+
+namespace {
+
+class Spill2Reg : public MachineFunctionPass {
+public:
+  static char ID;
+  Spill2Reg() : MachineFunctionPass(ID) {
+    initializeSpill2RegPass(*PassRegistry::getPassRegistry());
+  }
+  void getAnalysisUsage(AnalysisUsage &AU) const override;
+  void releaseMemory() override;
+  bool runOnMachineFunction(MachineFunction &) override;
+};
+
+} // namespace
+
+void Spill2Reg::getAnalysisUsage(AnalysisUsage &AU) const {
+  AU.setPreservesCFG();
+  MachineFunctionPass::getAnalysisUsage(AU);
+}
+
+void Spill2Reg::releaseMemory() {}
+
+bool Spill2Reg::runOnMachineFunction(MachineFunction &MFn) {
+  llvm_unreachable("Unimplemented");
+}
+
+char Spill2Reg::ID = 0;
+
+char &llvm::Spill2RegID = Spill2Reg::ID;
+
+INITIALIZE_PASS_BEGIN(Spill2Reg, "spill2reg", "Spill2Reg", false, false)
+INITIALIZE_PASS_END(Spill2Reg, "spill2reg", "Spill2Reg", false, false)
diff --git a/llvm/lib/CodeGen/TargetPassConfig.cpp b/llvm/lib/CodeGen/TargetPassConfig.cpp
index d407e9f0871d4c..87ee076db7a9f3 100644
--- a/llvm/lib/CodeGen/TargetPassConfig.cpp
+++ b/llvm/lib/CodeGen/TargetPassConfig.cpp
@@ -214,6 +214,11 @@ static cl::opt<bool> DisableReplaceWithVecLib(
     "disable-replace-with-vec-lib", cl::Hidden,
     cl::desc("Disable replace with vector math call pass"));
 
+// Enable the Spill2Reg pass.
+static cl::opt<bool> EnableSpill2Reg("enable-spill2reg", cl::Hidden,
+                                     cl::init(false),
+                                     cl::desc("Enable Spill2Reg pass"));
+
 /// Option names for limiting the codegen pipeline.
 /// Those are used in error reporting and we didn't want
 /// to duplicate their names all over the place.
@@ -1415,6 +1420,10 @@ bool TargetPassConfig::addRegAssignAndRewriteOptimized() {
   // Finally rewrite virtual registers.
   addPass(&VirtRegRewriterID);
 
+  // Replace spills to stack with spills to registers.
+  if (EnableSpill2Reg)
+    addPass(&Spill2RegID);
+
   // Regalloc scoring for ML-driven eviction - noop except when learning a new
   // eviction policy.
   addPass(createRegAllocScoringPass());

@vporpo (Contributor, Author) commented Dec 20, 2024

Per @nvjle's request I uploaded the rest of the patches for reference.

@williamweixiao self-requested a review on December 22, 2024.
@williamweixiao (Contributor) commented:

There are some DFS traversals on the CFG. Do you have data about the impact on compilation speed?

MachineBasicBlock *MBB = Reload->getParent();
bool IsSpillBlock = SpillMBBs.count(MBB);
// Add all MBB's live-outs.
LRU.addLiveOuts(*MBB);
Contributor:

Do we need "stepBackward" code here similar to the snippet below, as in "GetReloadLRU"?

    // Start at the bottom of the BB and walk up until we find `Reload`.
    for (MachineInstr &MI : llvm::reverse(*MBB)) {
      if (&MI == Reload)
        break;
      ReloadLRU.stepBackward(MI);
    }

Contributor Author:

That's a good question. I think LRU.accumulate() is the correct one because LRU.stepBackward() seems to be removing the defined regs from the set. But what we need is to collect all the registers that are used at any point to avoid using them as the target vector register. I added a TODO to check this later.

Contributor:

I agree that LRU.accumulate() is correct but conservative.
Consider the example below:

spill rax
...
reload rax
xmm1 = ...
xmm2 = ...

we can actually reuse xmm1/xmm2 for the spill.

Contributor Author:

If the spill and reload are in the same block, then we walk from the reload to the spill (see line 458: `bool FoundSpill = AccumulateLRUUntilSpillFn(Reload, ReloadLRU);`), so the xmm registers won't be in the set.

I think accumulate() is the correct function to use because of cases like:

spill rax
xmm1 = ...
... = xmm1
reload rax

In this case if we stepBackward() from reload rax to spill rax then I think the register set at spill rax won't contain xmm1. But it's not safe to use.

Contributor:

Using my example above for discussion: the xmm1/xmm2 registers will be in the set by line 392 (i.e., "LRU.addLiveOuts(*MBB);").
The loop from line 394 to 400 is OK and we should use "accumulate()" there.
But we can remove the xmm1/xmm2 registers from the set before that loop with the code below:

    // Start at the bottom of the BB and walk up until we find `Reload`.
    for (MachineInstr &MI : llvm::reverse(*MBB)) {
      if (&MI == Reload)
        break;
      ReloadLRU.stepBackward(MI);
    }

Contributor Author:

I think you are right:

AccumulateLRUUntilSpillFn() is called in two cases:

  1. when we calculate live regs from reload until a spill (after the call to GetReloadLRU()), in which case we have already calculated the live-outs at the point of the reload using GetReloadLRU(), starting from the live-outs and stepping backwards until the reload.
  2. when we are looking for the spills, in which case we start from the bottom of the BB and walk up until the spill (called inside AccumulateLRUFn()). In this case we need to initialize the set with the live-outs at MBB.

So I think the issue is that we call LRU.addLiveOuts() inside AccumulateLRUUntilSpillFn(), instead of calling it only in AccumulateLRUFn() just before we call AccumulateLRUUntilSpillFn().

I have added some comments to make the code more readable.

SmallVector<MIDataWithLiveIn, 1> Reloads;

/// \Returns the physical register being spilled.
Register getSpilledReg() const { return Spills.front().MO->getReg(); }
Contributor:

The function name is somewhat confusing. One stack slot can receive spills from multiple registers. What we really need is the TargetRegisterClass, right?

Contributor Author:

I think you are right. We are only using this to get the register class. I replaced this with getSpilledRegClass().

if (const MachineOperand *MO = TII->isStoreToStackSlotMO(MI, StackSlot)) {
MachineInstr *Spill = &MI;
auto &Entry = StackSlotData[StackSlot];
if (SkipEntry(StackSlot, MO->getReg())) {
Contributor:

check "Entry.Disable" first?

Contributor Author:

Good point, I changed this to: if (Entry.Disable || SkipEntry(StackSlot, MO->getReg())).

} else {
// This should capture uses of the stack in instructions that access
// memory (e.g., folded spills/reloads) and non-memory instructions,
// like x86 LEA.
Contributor:

My hunch is that most cases may come from memory folding instructions.

Contributor Author:

Yeah, this should be the common case at least in x86.


bool Spill2Reg::run() {
// Walk over each instruction in the code keeping track of the processor's
// port pressure and look for memory unit hot-spots.
Contributor:

I guess "port pressure and look for memory unit hot-spots." is "TODO" work, right?

Contributor Author:

Yeah, this is not currently modeled properly. Ideally we should feed the instructions into a pipeline model and check for bottlenecks.

if (X86::VK16RegClass.contains(Reg))
return false;

switch (unsigned Bits = TRI->getRegSizeInBits(Reg, *MRI)) {
Contributor:

Are "double" and "float" legal here?

Contributor Author:

No, float or double is not legal because they are already in a vector register. I think Bits is 128 or more for those, depending on the target.

if (!MBB->isLiveIn(VectorReg))
MBB->addLiveIn(VectorReg);
}
for (MachineBasicBlock *PredMBB : Reload->getParent()->predecessors())
Contributor:

We don't need to do "DFS" if "ReloadData.IsLiveIn" is false.

Contributor Author:

Good catch, fixed.

@vporpo (Contributor, Author) left a review:

There are some DFS traversals on the CFG. Do you have data about the impact on compilation speed?

It's been a while since I tested this pass, but as far as I remember it wasn't too bad. The traversals done in the code generation phase of the pass are not too frequent as they only happen when we need to spill to vector registers, which shouldn't be too often.


/// \Returns the register class of the register being spilled.
const TargetRegisterClass *
getSpilledRegClass(const TargetInstrInfo *TII,
Contributor:

dead code?

Contributor Author:

Yeah, this must have been the result of a bad rebase, I removed it.

LiveRegUnits LRU(*TRI);
calculateLiveRegs(Entry, LRU);

// Look for a physical register that in LRU.
Contributor:

that is not in LRU?

Contributor Author:

Fixed.

const unsigned MinVecBits =
TRI->getRegSizeInBits(*TRI->getRegClass(X86::VR128RegClassID));
if (MF->getFrameInfo().getObjectSize(MO.getIndex()) >= MinVecBits)
return true;
Contributor:

could you please give me an instruction example that can return "true" here (i.e., vector-size stack access without any vector register operand)?

Contributor Author:

I don't have such an instruction in mind and I can't recall if I did back when I wrote this. It's probably a conservative check. I added a TODO to check if this is needed.


// compilation time.
for (auto &MID : Entry.Reloads)
if (MID.MI->getParent() == &MBB)
MID.IsLiveIn = false;
Contributor:

Do we need "live-in" for the case below?

...
reload stack.0
...
spill  stack.0
...
reload stack.0

Contributor Author:

Yes, this looks wrong. I removed this code.

if (X86::VK16RegClass.contains(Reg))
return false;

switch (unsigned Bits = TRI->getRegSizeInBits(Reg, *MRI)) {
Contributor:

"Bits" is unused variable.

Contributor Author:

Removed.

@RKSimon requested review from RKSimon and topperc on January 2, 2025.
@RKSimon (Collaborator) left a review:

I'd like to ensure this patch doesn't get too focused on just working for gpr->vector spills. In my experience those profitable cases are pretty rare. What has been more useful has been cases such as storing scalar f32/f64 in the upper elements of xmm registers, or even using ymm upper halves to store xmm vector data, and to a lesser extent storing an i32 in the upper 32 bits of an i64 GPR.

@@ -294,6 +294,11 @@ class TargetInstrInfo : public MCInstrInfo {
return isLoadFromStackSlot(MI, FrameIndex);
}

virtual const MachineOperand *isLoadFromStackSlotMO(const MachineInstr &MI,
int &FrameIndex) const {
llvm_unreachable("target did not implement");
Collaborator:

Why not just return nullptr by default?

Contributor Author:

The reasoning is that if I was implementing this for a new target I would prefer getting a crash telling me that I should override this function, rather than getting it to silently skip spill2reg because some functions are not overridden. Wdyt?

}

virtual const TargetRegisterClass *
getVectorRegisterClassForSpill2Reg(const TargetRegisterInfo *TRI,
Collaborator:

I'd prefer we don't use Vector in the spill2reg naming convention as I'd like to see this work for more cases than spilling gpr to vector registers - maybe getCandidateRegisterClassForSpill2Reg?

Contributor Author:

Makes sense, done.

There are a couple more of these, like spill2RegExtractFromVectorReg(). I am thinking of using the term Host instead of Vector to describe the register used by spill2reg, what do you think?

const MachineRegisterInfo *MRI) const {
llvm_unreachable(
"Target didn't implement TargetInstrInfo::isLegalToSpill2Reg!");
}
Collaborator:

Why are we putting all of these in here and not TargetRegisterInfo?

Contributor Author:

I think most of them have to do with instructions rather than registers, with the exception of isLegalToSpill2Reg() and getVectorRegisterClassForSpill2Reg(). I will move these to TargetRegisterInfo.

  • isStoreToStackSlotMO() inspects an instruction
  • targetSupportsSpill2Reg() could be placed in either file as it does not check an instruction or register
  • isSpill2RegProfitable() checks the instruction sequence
  • spill2RegInsertToVectorReg() emits instructions
  • spill2RegExtractFromVectorReg() emits instructions

@williamweixiao (Contributor) commented:

I'd like to ensure this patch doesn't get too focused on just working for gpr->vector spills. In my experience those profitable cases are pretty rare. What has been more useful has been cases such as storing scalar f32/f64 in the upper elements of xmm registers, or even using ymm upper halves to store xmm vector data, and to a lesser extent storing an i32 in the upper 32 bits of an i64 GPR.

Yes, we also observed some cases in which spilling a float value into a GPR can help performance (assuming GPR register pressure is low at the time).

@vporpo (Contributor, Author) commented Jan 9, 2025

I'd like to ensure this patch doesn't get too focused on just working for gpr->vector spills. In my experience those profitable cases are pretty rare. What has been more useful has been cases such as storing scalar f32/f64 in the upper elements of xmm registers, or even using ymm upper halves to store xmm vector data, and to a lesser extent storing an i32 in the upper 32 bits of an i64 GPR.

Yes, we also observed some cases in which spilling a float value into a GPR can help performance (assuming GPR register pressure is low at the time).

I think that the structure of the pass is already fairly agnostic to the variant of spill2reg (like GPR->lower vector, GPR->upper vector, GPR->upper GPR, F32/64->upper vector). The candidates are filtered by TRI callbacks, like isLegalToSpillToReg(), and code generation is done with TII callbacks: spill2RegInsertToVectorReg() and spill2RegExtractFromVectorReg(). All of these can be updated to work with different spill2reg variants.

Once we add support for more than one spill2reg variant, during the collection phase we would need to determine the variant and set it in StackSlotDataEntry using an enum. This would then be used later in the pass (in generateCode()) to generate the corresponding code.

@vporpo (Contributor, Author) commented Jan 9, 2025

Changed TII function names spill2RegInsertToVectorReg() to spill2RegInsertToS2RReg() and spill2RegExtractFromVectorReg() to spill2RegExtractFromS2RReg(). Rebased.

github-actions bot commented Jan 9, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@vporpo force-pushed the Spill2Reg branch 3 times, most recently from 8b212b6 to d2147f1 on January 14, 2025.
@vporpo (Contributor, Author) commented Jan 14, 2025

Shall we start focusing on one patch at a time for the code reviews? @williamweixiao @RKSimon any comments on the first patch?


/// already walking through the code there. Otherwise we would need to
/// walk through the code again in `updateLiveIns()` just to check for
/// other spills in the block, which would waste compilation time.
bool IsLiveIn = true;

Contributor:

It seems that "IsLiveIn" is always "true". Do we really need it?

Contributor Author:

I think a better solution is to set the IsLiveIn flag whenever we visit a reload in collectSpillsAndReloads(). I have updated the code.

Commit 1:
This is the first commit for the Spill2Reg optimization pass. The goal of this pass is to selectively replace spills to the stack with spills to other registers. This can help remove back-end stalls in x86.

Old code review: https://reviews.llvm.org/D118298

RFC:
https://lists.llvm.org/pipermail/llvm-dev/2022-January/154782.html
https://discourse.llvm.org/t/rfc-spill2reg-selectively-replace-spills-to-stack-with-spills-to-vector-registers/59630

Commit 2:
Walk through the code looking for spills and reloads and group them per stack slot.

Original review: https://reviews.llvm.org/D118299

Commit 3:
This patch adds the main structure of the code generation phase of Spill2Reg. Iterate through the spills/reloads collected earlier and generate the new instructions.

Original review: https://reviews.llvm.org/D118300

Commit 4:
Spill2Reg can now emit spill and reload instructions. This will not generate correct code, as it does not keep track of live regs.

Original review: https://reviews.llvm.org/D118302

Commit 5:
This patch implements tracking of live registers. This is used to look for free vector registers. It works by walking up the CFG from the reloads all the way to the spills, accumulating the register units being used. This implementation caches the live register units used by each MBB for faster compilation.

Note: Live register tracking relies on MBB live-ins/outs being maintained correctly, which is implemented in a follow-up patch. So this patch will still not generate correct code for all but some simple cases.

Original review: https://reviews.llvm.org/D118303

Commit 6:
This patch implements updates of the MBB live-ins for the newly introduced instructions emitted by Spill2Reg. This is required for correct tracking of live register usage.

Original review: https://reviews.llvm.org/D118304

Commit 7:
This patch adds support for 8/16-bit values in x86.

Original review: https://reviews.llvm.org/D118305

Commit 8:
This patch updates the vector spill/reload instructions to use the AVX opcodes by default if the target supports it. This can be turned off with the -spill2reg-no-avx flag.

Original review: https://reviews.llvm.org/D118951
@vporpo (Contributor, Author) commented Jan 17, 2025

Thank you for the comments, @williamweixiao. Do you think we should start focusing on the individual patches of this PR one by one? There are 8 patches in the chain at this point, which makes it hard to review and maintain.

@RKSimon (Collaborator) commented Jan 21, 2025

Have you tried graphite to maintain the patch series?

@vporpo (Contributor, Author) commented Jan 21, 2025

Have you tried graphite to maintain the patch series?

Thanks for the suggestion, I will give it a try. My main concern is testing: while the patch series is off-tree, we can only do so much testing on each change we make, which can lead to regressions that go unnoticed. Given that there is enough interest in this project, my suggestion is to start working on this in-tree.
