[InstrProf] Evaluate function order using test traces #92451

ellishg · 2024-05-16T20:46:46Z

The llvm-profdata order command is used to compute a function order using traces from the input profile. Add the --num-test-traces flag to keep aside N traces to evalute this order. These test traces are assumed to be the actual function execution order in some experiment. The output is a number that represents how many page faults we got. Lower is better.

I tested on a large profile I already had.

llvm-profdata order default.profdata --num-test-traces=30
# Ordered 149103 functions
# Total area under the page fault curve: 2.271827e+09
...

I also improved TemporalProfTraceTy::createBPFunctionNodes() in a few ways:

Simplified how UNs are computed
Change how the initial Node order is computed
Filter out rare and common UNs
Output vector is an aliased argument instead of a return

These changes slightly improved the evaluation in my test.

llvm-profdata order default.profdata --num-test-traces=30
# Ordered 149103 functions
# Total area under the page fault curve: 2.268586e+09
...

The `llvm-profdata order` command is used to compute a function order using traces from the input profile. Add the `--num-test-traces` flag to keep aside N traces to evalute this order. These test traces are assumed to be the actual function execution order in some experiment. The output is a number that represents how many page faults we got. Lower is better. I tested on a large profile I already had. ``` llvm-profdata order default.profdata --num-test-traces=30 \# Ordered 149103 functions \# Total area under the page fault curve: 2.271827e+09 ... ``` I also improved `TemporalProfTraceTy::createBPFunctionNodes()` in a few ways: * Simplified how `UN`s are computed * Output vector is not an aliased argument instead of a return These changes improved the evaluation in my test. ``` llvm-profdata order default.profdata --num-test-traces=30 \# Ordered 149103 functions \# Total area under the page fault curve: 2.268586e+09 ... ```

llvmbot · 2024-05-16T20:47:18Z

@llvm/pr-subscribers-pgo

Author: Ellis Hoag (ellishg)

Changes

The llvm-profdata order command is used to compute a function order using traces from the input profile. Add the --num-test-traces flag to keep aside N traces to evalute this order. These test traces are assumed to be the actual function execution order in some experiment. The output is a number that represents how many page faults we got. Lower is better.

I tested on a large profile I already had.

llvm-profdata order default.profdata --num-test-traces=30
# Ordered 149103 functions
# Total area under the page fault curve: 2.271827e+09
...

I also improved TemporalProfTraceTy::createBPFunctionNodes() in a few ways:

Simplified how UNs are computed
Change how the initial Node order is computed
Filter out rare and common UNs
Output vector is an aliased argument instead of a return

These changes slightly improved the evaluation in my test.

llvm-profdata order default.profdata --num-test-traces=30
# Ordered 149103 functions
# Total area under the page fault curve: 2.268586e+09
...

Full diff: https://github.com/llvm/llvm-project/pull/92451.diff

6 Files Affected:

(modified) llvm/include/llvm/ProfileData/InstrProf.h (+2-2)
(modified) llvm/lib/ProfileData/InstrProf.cpp (+42-31)
(added) llvm/test/tools/llvm-profdata/show-order-error.proftext (+27)
(modified) llvm/test/tools/llvm-profdata/show-order.proftext (+8-3)
(modified) llvm/tools/llvm-profdata/llvm-profdata.cpp (+39-3)
(modified) llvm/unittests/ProfileData/BPFunctionNodeTest.cpp (+14-17)

diff --git a/llvm/include/llvm/ProfileData/InstrProf.h b/llvm/include/llvm/ProfileData/InstrProf.h
index 88c7fe425b5a5..b06ec5283a78b 100644
--- a/llvm/include/llvm/ProfileData/InstrProf.h
+++ b/llvm/include/llvm/ProfileData/InstrProf.h
@@ -385,8 +385,8 @@ struct TemporalProfTraceTy {
   /// Use a set of temporal profile traces to create a list of balanced
   /// partitioning function nodes used by BalancedPartitioning to generate a
   /// function order that reduces page faults during startup
-  static std::vector<BPFunctionNode>
-  createBPFunctionNodes(ArrayRef<TemporalProfTraceTy> Traces);
+  static void createBPFunctionNodes(ArrayRef<TemporalProfTraceTy> Traces,
+                                    std::vector<BPFunctionNode> &Nodes);
 };
 
 inline std::error_code make_error_code(instrprof_error E) {
diff --git a/llvm/lib/ProfileData/InstrProf.cpp b/llvm/lib/ProfileData/InstrProf.cpp
index 806d01de1ada5..220d7e01f7ccd 100644
--- a/llvm/lib/ProfileData/InstrProf.cpp
+++ b/llvm/lib/ProfileData/InstrProf.cpp
@@ -1002,46 +1002,57 @@ void InstrProfRecord::addValueData(uint32_t ValueKind, uint32_t Site,
     ValueSites.emplace_back(VData, VData + N);
 }
 
-std::vector<BPFunctionNode> TemporalProfTraceTy::createBPFunctionNodes(
-    ArrayRef<TemporalProfTraceTy> Traces) {
+void TemporalProfTraceTy::createBPFunctionNodes(
+    ArrayRef<TemporalProfTraceTy> Traces, std::vector<BPFunctionNode> &Nodes) {
   using IDT = BPFunctionNode::IDT;
   using UtilityNodeT = BPFunctionNode::UtilityNodeT;
-  // Collect all function IDs ordered by their smallest timestamp. This will be
-  // used as the initial FunctionNode order.
-  SetVector<IDT> FunctionIds;
-  size_t LargestTraceSize = 0;
-  for (auto &Trace : Traces)
-    LargestTraceSize =
-        std::max(LargestTraceSize, Trace.FunctionNameRefs.size());
-  for (size_t Timestamp = 0; Timestamp < LargestTraceSize; Timestamp++)
-    for (auto &Trace : Traces)
-      if (Timestamp < Trace.FunctionNameRefs.size())
-        FunctionIds.insert(Trace.FunctionNameRefs[Timestamp]);
-
-  const int N = Log2_64(LargestTraceSize) + 1;
-
+  UtilityNodeT MaxUN = 0;
+  DenseMap<IDT, size_t> IdToFirstTimestamp;
+  DenseMap<IDT, UtilityNodeT> IdToFirstUN;
+  DenseMap<IDT, SmallVector<UtilityNodeT>> IdToUNs;
   // TODO: We need to use the Trace.Weight field to give more weight to more
   // important utilities
-  DenseMap<IDT, SmallVector<UtilityNodeT, 4>> FuncGroups;
-  for (size_t TraceIdx = 0; TraceIdx < Traces.size(); TraceIdx++) {
-    auto &Trace = Traces[TraceIdx].FunctionNameRefs;
-    for (size_t Timestamp = 0; Timestamp < Trace.size(); Timestamp++) {
-      for (int I = Log2_64(Timestamp + 1); I < N; I++) {
-        auto FunctionId = Trace[Timestamp];
-        UtilityNodeT GroupId = TraceIdx * N + I;
-        FuncGroups[FunctionId].push_back(GroupId);
+  for (auto &Trace : Traces) {
+    size_t CutoffTimestamp = 1;
+    for (size_t Timestamp = 0; Timestamp < Trace.FunctionNameRefs.size();
+         Timestamp++) {
+      IDT Id = Trace.FunctionNameRefs[Timestamp];
+      auto [It, WasInserted] = IdToFirstTimestamp.try_emplace(Id, Timestamp);
+      if (!WasInserted)
+        It->getSecond() = std::min<size_t>(It->getSecond(), Timestamp);
+      if (Timestamp >= CutoffTimestamp) {
+        ++MaxUN;
+        CutoffTimestamp = 2 * Timestamp;
       }
+      IdToFirstUN.try_emplace(Id, MaxUN);
     }
+    for (auto &[Id, FirstUN] : IdToFirstUN)
+      for (auto UN = FirstUN; UN <= MaxUN; ++UN)
+        IdToUNs[Id].push_back(UN);
+    ++MaxUN;
+    IdToFirstUN.clear();
   }
 
-  std::vector<BPFunctionNode> Nodes;
-  for (auto Id : FunctionIds) {
-    auto &UNs = FuncGroups[Id];
-    llvm::sort(UNs);
-    UNs.erase(std::unique(UNs.begin(), UNs.end()), UNs.end());
+  DenseMap<UtilityNodeT, unsigned> UNFrequency;
+  for (auto &[Id, UNs] : IdToUNs)
+    for (auto &UN : UNs)
+      ++UNFrequency[UN];
+  // Filter out rare and common utility nodes to make BalancedPartitioning more
+  // effective.
+  for (auto &[Id, UNs] : IdToUNs)
+    llvm::erase_if(UNs, [&](auto &UN) {
+      return UNFrequency[UN] <= 1 || 2 * UNFrequency[UN] > IdToUNs.size();
+    });
+
+  for (auto &[Id, UNs] : IdToUNs)
     Nodes.emplace_back(Id, UNs);
-  }
-  return Nodes;
+
+  // Since BalancedPartitioning is sensitive to the initial order, we explicitly
+  // order nodes by their earliest timestamp.
+  llvm::sort(Nodes, [&](auto &L, auto &R) {
+    return std::make_pair(IdToFirstTimestamp[L.Id], L.Id) <
+           std::make_pair(IdToFirstTimestamp[R.Id], R.Id);
+  });
 }
 
 #define INSTR_PROF_COMMON_API_IMPL
diff --git a/llvm/test/tools/llvm-profdata/show-order-error.proftext b/llvm/test/tools/llvm-profdata/show-order-error.proftext
new file mode 100644
index 0000000000000..633f1a9949b6f
--- /dev/null
+++ b/llvm/test/tools/llvm-profdata/show-order-error.proftext
@@ -0,0 +1,27 @@
+# RUN: not llvm-profdata order %s --num-test-traces=10 2>&1 | FileCheck %s
+
+# CHECK: --num-test-traces must be smaller than the total number of traces
+
+# Header
+:ir
+:temporal_prof_traces
+# Num Traces
+1
+# Trace Stream Size:
+1
+# Weight
+1
+a, b
+
+a
+# Func Hash:
+0x1234
+# Num Counters:
+1
+# Counter Values:
+101
+
+b
+0x5678
+1
+202
diff --git a/llvm/test/tools/llvm-profdata/show-order.proftext b/llvm/test/tools/llvm-profdata/show-order.proftext
index 8ef26847ad77e..28eb1b9b42af7 100644
--- a/llvm/test/tools/llvm-profdata/show-order.proftext
+++ b/llvm/test/tools/llvm-profdata/show-order.proftext
@@ -1,4 +1,6 @@
-# RUN: llvm-profdata order %s | FileCheck %s
+# RUN: llvm-profdata order %s --num-test-traces=1 | FileCheck %s
+
+# CHECK: # Total area under the page fault curve: 4.000000e+00
 
 # CHECK: a
 # CHECK: b
@@ -9,9 +11,9 @@
 :ir
 :temporal_prof_traces
 # Num Traces
-3
+4
 # Trace Stream Size:
-3
+4
 # Weight
 1
 a, main.c:b, c
@@ -21,6 +23,9 @@ a, x, main.c:b, c
 # Weight
 1
 a, main.c:b, c
+# Weight
+1
+a, main.c:b, c, x
 
 a
 # Func Hash:
diff --git a/llvm/tools/llvm-profdata/llvm-profdata.cpp b/llvm/tools/llvm-profdata/llvm-profdata.cpp
index 4126b55576ddc..70294c10d698c 100644
--- a/llvm/tools/llvm-profdata/llvm-profdata.cpp
+++ b/llvm/tools/llvm-profdata/llvm-profdata.cpp
@@ -340,7 +340,7 @@ cl::opt<unsigned long long> OverlapValueCutoff(
         "profile with max count value greater then the parameter value"),
     cl::sub(OverlapSubcommand));
 
-// Options unique to show subcommand.
+// Options specific to show subcommand.
 cl::opt<bool> ShowCounts("counts", cl::init(false),
                          cl::desc("Show counter values for shown functions"),
                          cl::sub(ShowSubcommand));
@@ -439,6 +439,13 @@ cl::opt<bool> ShowProfileVersion("profile-version", cl::init(false),
                                  cl::desc("Show profile version. "),
                                  cl::sub(ShowSubcommand));
 
+// Options specific to order subcommand.
+cl::opt<unsigned> NumTestTraces(
+    "num-test-traces", cl::init(0),
+    cl::desc("Keep aside <num-test-traces> traces when computing the function "
+             "order and instead use them to evaluate that order"),
+    cl::sub(OrderSubcommand));
+
 // We use this string to indicate that there are
 // multiple static functions map to the same name.
 const std::string DuplicateNameStr = "----";
@@ -3278,13 +3285,42 @@ static int order_main() {
     // Read all entries
     (void)I;
   }
-  auto &Traces = Reader->getTemporalProfTraces();
-  auto Nodes = TemporalProfTraceTy::createBPFunctionNodes(Traces);
+  ArrayRef Traces = Reader->getTemporalProfTraces();
+  if (NumTestTraces && NumTestTraces >= Traces.size())
+    exitWithError(
+        "--" + NumTestTraces.ArgStr +
+        " must be smaller than the total number of traces: expected: < " +
+        Twine(Traces.size()) + ", actual: " + Twine(NumTestTraces));
+  ArrayRef TestTraces = Traces.take_back(NumTestTraces);
+  Traces = Traces.drop_back(NumTestTraces);
+
+  std::vector<BPFunctionNode> Nodes;
+  TemporalProfTraceTy::createBPFunctionNodes(Traces, Nodes);
   BalancedPartitioningConfig Config;
   BalancedPartitioning BP(Config);
   BP.run(Nodes);
 
   OS << "# Ordered " << Nodes.size() << " functions\n";
+  if (!TestTraces.empty()) {
+    // Since we don't know the symbol sizes, we assume 32 functions per page.
+    DenseMap<BPFunctionNode::IDT, unsigned> IdToPageNumber;
+    for (auto &Node : Nodes)
+      IdToPageNumber[Node.Id] = IdToPageNumber.size() / 32;
+
+    SmallSet<unsigned, 0> TouchedPages;
+    unsigned Area = 0;
+    for (auto &Trace : TestTraces) {
+      for (auto Id : Trace.FunctionNameRefs) {
+        auto It = IdToPageNumber.find(Id);
+        if (It == IdToPageNumber.end())
+          continue;
+        TouchedPages.insert(It->getSecond());
+        Area += TouchedPages.size();
+      }
+      TouchedPages.clear();
+    }
+    OS << "# Total area under the page fault curve: " << (float)Area << "\n";
+  }
   OS << "# Warning: Mach-O may prefix symbols with \"_\" depending on the "
         "linkage and this output does not take that into account. Some "
         "post-processing may be required before passing to the linker via "
diff --git a/llvm/unittests/ProfileData/BPFunctionNodeTest.cpp b/llvm/unittests/ProfileData/BPFunctionNodeTest.cpp
index 6af6f1bcdc40a..af0104da3341f 100644
--- a/llvm/unittests/ProfileData/BPFunctionNodeTest.cpp
+++ b/llvm/unittests/ProfileData/BPFunctionNodeTest.cpp
@@ -8,7 +8,6 @@
 
 #include "llvm/ProfileData/InstrProf.h"
 #include "llvm/Support/BalancedPartitioning.h"
-#include "llvm/Testing/Support/SupportHelpers.h"
 #include "gmock/gmock.h"
 #include "gtest/gtest.h"
 
@@ -31,22 +30,20 @@ TEST(BPFunctionNodeTest, Basic) {
                        UnorderedElementsAreArray(UNs)));
   };
 
-  auto Nodes = TemporalProfTraceTy::createBPFunctionNodes({
-      TemporalProfTraceTy({0, 1, 2, 3}),
-  });
-  EXPECT_THAT(Nodes,
-              UnorderedElementsAre(NodeIs(0, {0, 1, 2}), NodeIs(1, {1, 2}),
-                                   NodeIs(2, {1, 2}), NodeIs(3, {2})));
-
-  Nodes = TemporalProfTraceTy::createBPFunctionNodes({
-      TemporalProfTraceTy({0, 1, 2, 3, 4}),
-      TemporalProfTraceTy({4, 2}),
-  });
-
-  EXPECT_THAT(Nodes,
-              UnorderedElementsAre(NodeIs(0, {0, 1, 2}), NodeIs(1, {1, 2}),
-                                   NodeIs(2, {1, 2, 4, 5}), NodeIs(3, {2}),
-                                   NodeIs(4, {2, 3, 4, 5})));
+  std::vector<BPFunctionNode> Nodes;
+  TemporalProfTraceTy::createBPFunctionNodes(
+      {TemporalProfTraceTy({0, 1, 2, 3})}, Nodes);
+  EXPECT_THAT(Nodes, UnorderedElementsAre(NodeIs(0, {1}), NodeIs(1, {1}),
+                                          NodeIs(2, {}), NodeIs(3, {})));
+
+  Nodes.clear();
+  TemporalProfTraceTy::createBPFunctionNodes(
+      {TemporalProfTraceTy({0, 1, 2, 3, 4}), TemporalProfTraceTy({4, 2})},
+      Nodes);
+
+  EXPECT_THAT(Nodes, UnorderedElementsAre(NodeIs(0, {1}), NodeIs(1, {1}),
+                                          NodeIs(2, {5}), NodeIs(3, {}),
+                                          NodeIs(4, {5})));
 }
 
 } // end namespace llvm

kyulee-com · 2024-05-17T18:54:30Z

llvm/lib/ProfileData/InstrProf.cpp

+  for (auto &[Id, UNs] : IdToUNs)
+    for (auto &UN : UNs)
+      ++UNFrequency[UN];
+  // Filter out rare and common utility nodes to make BalancedPartitioning more


Initially I felt these wording rare and common seem contradictory.
Looking at the code below, you appear to avoid the case where UM appears only 1 function or more than half of total functions. Perhaps suggest the wording something like too infrequent or too prevalent?

kyulee-com · 2024-05-17T18:56:58Z

llvm/tools/llvm-profdata/llvm-profdata.cpp

+// Options specific to order subcommand.
+cl::opt<unsigned> NumTestTraces(
+    "num-test-traces", cl::init(0),
+    cl::desc("Keep aside <num-test-traces> traces when computing the function "


Looks like you keep the last N traces as the test traces. Not sure we want to given an option to randomize it or using separate test traces. In either case, I'd like to get the description more accurate above.

kyulee-com · 2024-05-17T18:58:07Z

llvm/unittests/ProfileData/BPFunctionNodeTest.cpp

+      {TemporalProfTraceTy({0, 1, 2, 3, 4}), TemporalProfTraceTy({4, 2})},
+      Nodes);
+
+  EXPECT_THAT(Nodes, UnorderedElementsAre(NodeIs(0, {1}), NodeIs(1, {1}),


Can you add the comment how com these initial BP nodes with utility indices are computed?

I added the RemoveOutlierUNs parameter so we can keep the original tests

kyulee-com · 2024-05-17T19:00:47Z

llvm/lib/ProfileData/InstrProf.cpp

+      return UNFrequency[UN] <= 1 || 2 * UNFrequency[UN] > IdToUNs.size();
+    });
+
+  for (auto &[Id, UNs] : IdToUNs)
    Nodes.emplace_back(Id, UNs);


UNs seems being copied when constructing BPFunctionNode. Not sure how big UNs are.
If it's not worth, I'm okay as is for the better readability/simplicity.

I tried using std::move() here and in the constructor, and I didn't see too much of a difference in runtime. I suspect the bottleneck is more in running balanced partitioning. I could also create the nodes earlier instead of having IdToUNs, but that would require making BPFunctionNode::UtilityNodes public.

llvmbot added the PGO Profile Guided Optimizations label May 16, 2024

ellishg requested review from MaskRay and kyulee-com May 16, 2024 20:47

kyulee-com reviewed May 17, 2024

View reviewed changes

Address review

3a60647

kyulee-com approved these changes May 23, 2024

View reviewed changes

ellishg merged commit 73eb9b3 into llvm:main May 23, 2024
3 of 4 checks passed

ellishg deleted the test-traces branch May 23, 2024 18:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[InstrProf] Evaluate function order using test traces #92451

[InstrProf] Evaluate function order using test traces #92451

ellishg commented May 16, 2024

llvmbot commented May 16, 2024

kyulee-com May 17, 2024

kyulee-com May 17, 2024

kyulee-com May 17, 2024

ellishg May 20, 2024

kyulee-com May 17, 2024

ellishg May 20, 2024

[InstrProf] Evaluate function order using test traces #92451

[InstrProf] Evaluate function order using test traces #92451

Conversation

ellishg commented May 16, 2024

llvmbot commented May 16, 2024

kyulee-com May 17, 2024

Choose a reason for hiding this comment

kyulee-com May 17, 2024

Choose a reason for hiding this comment

kyulee-com May 17, 2024

Choose a reason for hiding this comment

ellishg May 20, 2024

Choose a reason for hiding this comment

kyulee-com May 17, 2024

Choose a reason for hiding this comment

ellishg May 20, 2024

Choose a reason for hiding this comment