Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PGO][Offload] Allow PGO flags to be used on GPU targets #94268

Merged
merged 5 commits into from
Mar 20, 2025

Conversation

EthanLuisMcDonough
Copy link
Member

@EthanLuisMcDonough EthanLuisMcDonough commented Jun 3, 2024

This pull request is the third part of an ongoing effort to extends PGO instrumentation to GPU device code and depends on #93365. This PR makes the following changes:

  • Allows PGO flags to be supplied to GPU targets
  • Pulls version global from device
  • Modifies __llvm_write_custom_profile and lprofWriteDataImpl to allow the PGO version to be overridden

Copy link

github-actions bot commented Jun 3, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Copy link
Contributor

@jhuber6 jhuber6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be lots of accidental clang-format changes. Why do we need new flags for this instead of just using the old ones and changing behavior when the target is a known GPU? I.e. SPIR-V, CUDA, or HSA.

@EthanLuisMcDonough
Copy link
Member Author

@jhuber6 The clang format errors are mostly due to my local version of clang-format disagreeing with the buildbot's version. Its a bit annoying, but it shouldn't be too much of a problem given I plan on squashing and merging once this gets approved.

I added new flags for GPU PGO specifically because I didn't want to modify the PGO flags' existing behavior. PGO has a significant runtime cost, so I figured it would be best for the end user experience to only enable PGO on the GPU when it was specifically requested.

@jhuber6
Copy link
Contributor

jhuber6 commented Jun 24, 2024

@jhuber6 The clang format errors are mostly due to my local version of clang-format disagreeing with the buildbot's version. Its a bit annoying, but it shouldn't be too much of a problem given I plan on squashing and merging once this gets approved.

I added new flags for GPU PGO specifically because I didn't want to modify the PGO flags' existing behavior. PGO has a significant runtime cost, so I figured it would be best for the end user experience to only enable PGO on the GPU when it was specifically requested.

Is this something that specifically requires its own flag? Or could we just do -Xarch_device -fprofile-generate.

@EthanLuisMcDonough
Copy link
Member Author

Is this something that specifically requires its own flag? Or could we just do -Xarch_device -fprofile-generate.

Right now the -fprofile-generate-gpu and -fprofile-instr-generate-gpu flags make sure that the compiler-rt profiling library is included even if host profiling isn't enabled, but your suggestion seems quite nice and compact. I'm open to looking into this. Thoughts @jdoerfert?

@@ -0,0 +1,82 @@
// RUN: %libomptarget-compile-generic -fprofile-generate-gpu
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When targeting a V100, this command fails for me in both pgo1.c and pgo2.c. In the LTO case:

LLVM ERROR: Circular dependency found in global variable set 

In the non-LTO case:

fatal error: error in backend: NVPTX aliasee must be a non-kernel function definition

I do not see this problem in PR #93365's pgo1.c.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a limitation of the PTX target, globals cannot reference themselves. Most likely whatever NVIDIA engineer wrote the PTX parser found it annoying to reference something that wasn't fully parsed yet so he just decided to make it an error and here we are. See https://godbolt.org/z/53PP5c5ve.

@jdenny-ornl
Copy link
Collaborator

jdenny-ornl commented Oct 3, 2024

For some codes, I get the following error for a gfx906:

LLVM ERROR: Relocation for CG Profile could not be created: unknown relocation name 

I see it for OpenMC, but the following is a simpler example:

$ cat test.c
#include <stdio.h>
#include <stdlib.h>
__attribute__((noinline))
double test(double x, int n) {
  double res = 1;
  for (int i = 0; i < n; ++i)
    res *= x;
  return res;
}
int main(int argc, char *argv[]) {
  double x = atof(argv[1]);
  unsigned n = atoi(argv[2]);
  #pragma omp target map(tofrom:x)
  x = test(x, n);
  printf("%f\n", x);
  return 0;
}

$ clang -O2 -g -fopenmp --offload-arch=native test.c -o test \
      -fprofile-generate -fprofile-generate-gpu

$ LLVM_PROFILE_FILE=test.profraw ./test 2 4
16.000000

$ llvm-profdata merge -output=test.profdata *.profraw

$ clang -O2 -g -fopenmp --offload-arch=native test.c -foffload-lto \
      -fprofile-use-gpu=test.profdata

I can prevent the error by lowering the last -O2 to -O1 or by removing the __attribute__((noinline)). Am I doing something wrong?

@EthanLuisMcDonough
Copy link
Member Author

For some codes, I get the following error for a gfx906:

LLVM ERROR: Relocation for CG Profile could not be created: unknown relocation name 

I see it for OpenMC, but the following is a simpler example:

$ cat test.c
#include <stdio.h>
#include <stdlib.h>
__attribute__((noinline))
double test(double x, int n) {
  double res = 1;
  for (int i = 0; i < n; ++i)
    res *= x;
  return res;
}
int main(int argc, char *argv[]) {
  double x = atof(argv[1]);
  unsigned n = atoi(argv[2]);
  #pragma omp target map(tofrom:x)
  x = test(x, n);
  printf("%f\n", x);
  return 0;
}

$ clang -O2 -g -fopenmp --offload-arch=native test.c -o test \
      -fprofile-generate -fprofile-generate-gpu

$ LLVM_PROFILE_FILE=test.profraw ./test 2 4
16.000000

$ llvm-profdata merge -output=test.profdata *.profraw

$ clang -O2 -g -fopenmp --offload-arch=native test.c -foffload-lto \
      -fprofile-use-gpu=test.profdata

I can prevent the error by lowering the last -O2 to -O1 or by removing the __attribute__((noinline)). Am I doing something wrong?

Not sure if this is still relevant, but I think #114617 should fix this issue. The ASM builder was trying to create a reloc directive with the kind BFD_RELOC_NONE, which isn't defined in getFixupKind for AMDGPU.

@jdenny-ornl
Copy link
Collaborator

For some codes, I get the following error for a gfx906:

LLVM ERROR: Relocation for CG Profile could not be created: unknown relocation name 

Not sure if this is still relevant, but I think #114617 should fix this issue. The ASM builder was trying to create a reloc directive with the kind BFD_RELOC_NONE, which isn't defined in getFixupKind for AMDGPU.

Thanks, it does seem to fix it.

If possible, would you please keep this PR in sync with PR #93365? That should bring in the above fix and generally help me to review against my use cases.

@EthanLuisMcDonough
Copy link
Member Author

For some codes, I get the following error for a gfx906:

LLVM ERROR: Relocation for CG Profile could not be created: unknown relocation name 

Not sure if this is still relevant, but I think #114617 should fix this issue. The ASM builder was trying to create a reloc directive with the kind BFD_RELOC_NONE, which isn't defined in getFixupKind for AMDGPU.

Thanks, it does seem to fix it.

If possible, would you please keep this PR in sync with PR #93365? That should bring in the above fix and generally help me to review against my use cases.

Sure, I'll sync this branch with my changes in the other PR.

Copy link
Contributor

@jhuber6 jhuber6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference is that we do not introduce new flags, but use the ones we have already and use -Xarch as the canonical way to do it.

-fsanitize_instr_generate // Both host and all devices
-Xarch_host -fsanitize_instr_generate // Only on the host
-Xarch_device -fsanitize_instr_generate // Only on the devices
-Xarch_nvptx64 -fsanitize_instr_generate  // Only on NVPTX targets
-Xarch_gfx90a -fsanitize_instr_generate  // Only on gfx90a

All of this handling should be done for you by the getArgsForToolchain() interface that's called before we generate the arguments from the tools. So, all this should require is tests + documentation for this usage. You should just be able to replace existing checks on these new flags to the standard ones.

@@ -273,7 +273,7 @@ void GPUProfGlobals::dump() const {
}

Error GPUProfGlobals::write() const {
if (!__llvm_write_custom_profile)
if (__llvm_write_custom_profile == NULL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!ptr is canonical, but if you want to be explicit use `nullptr.

Group<f_Group>,
Visibility<[ClangOption, CLOption]>,
MetaVarName<"<pathname>">,
HelpText<"Use instrumentation data for profile-guided optimization targeting GPU">;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing non-gpu text reads different, on purpose?

Multilib::flags_list &Result) {
static void getARMMultilibFlags(const Driver &D, const llvm::Triple &Triple,
const llvm::opt::ArgList &Args,
Multilib::flags_list &Result) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated, more below. Don't format the file, use git-clang-format to format only your commit.

@@ -575,6 +575,76 @@ static void addDashXForInput(const ArgList &Args, const InputInfo &Input,
}
}

static void addPGOFlagsGPU(const ToolChain &TC, const ArgList &Args,
ArgStringList &CmdArgs) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this copied from somewhere? Can we merge it, e.g., in a template that takes the options enums (CPU vs GPU)?

@@ -152,7 +152,7 @@ INSTR_PROF_VALUE_NODE(PtrToNodeT, llvm::PointerType::getUnqual(Ctx), Next, \
#define INSTR_PROF_DATA_DEFINED
#endif
INSTR_PROF_RAW_HEADER(uint64_t, Magic, __llvm_profile_get_magic())
INSTR_PROF_RAW_HEADER(uint64_t, Version, __llvm_profile_get_version())
INSTR_PROF_RAW_HEADER(uint64_t, Version, Version)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ensures that a GPU profile is written with the version pulled from the GPU. Previously, if the host used LLVM-level instrumentation and the device has clang-level instrumentation, the GPU profile would use the host's format. This change ensures that is not the case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is a part of public API and is used by a number of users that have their own runtime implementation (for example baremetal users, OS kernels) and this change is going to break all of them. Is there another way we could handle this case that wouldn't require this change and avoid all that churn?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best solution would be to forbid the user from using different instrumentation levels (LLVM IR vs clang) on the host and device. I can replace the version replacement with a check that ensures the device version matches the host version.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed the version detection and made sure to set the Version property the custom value instead of changing the default value.

@EthanLuisMcDonough EthanLuisMcDonough changed the title [PGO][Offload] Add GPU profiling flags to driver [PGO][Offload] Allow PGO flags to be used on GPU targets Mar 18, 2025
@llvmbot llvmbot added clang Clang issues not falling into any other category compiler-rt clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' PGO Profile Guided Optimizations llvm:transforms offload labels Mar 18, 2025
@llvmbot
Copy link
Member

llvmbot commented Mar 18, 2025

@llvm/pr-subscribers-clang-codegen
@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-offload
@llvm/pr-subscribers-pgo
@llvm/pr-subscribers-clang-driver

@llvm/pr-subscribers-clang

Author: Ethan Luis McDonough (EthanLuisMcDonough)

Changes

This pull request is the third part of an ongoing effort to extends PGO instrumentation to GPU device code and depends on #93365. This PR makes the following changes:

  • Allows PGO flags to be supplied to GPU targets
  • Pulls version global from device
  • Modifies __llvm_write_custom_profile and lprofWriteDataImpl to allow the PGO version to be overridden

Patch is 26.26 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/94268.diff

19 Files Affected:

  • (modified) clang/lib/Driver/ToolChains/Clang.cpp (+1-5)
  • (removed) clang/test/Driver/cuda-no-pgo-or-coverage.cu (-33)
  • (modified) compiler-rt/include/profile/InstrProfData.inc (+1-1)
  • (modified) compiler-rt/lib/profile/InstrProfiling.h (+2-1)
  • (modified) compiler-rt/lib/profile/InstrProfilingBuffer.c (+2-1)
  • (modified) compiler-rt/lib/profile/InstrProfilingFile.c (+15-7)
  • (modified) compiler-rt/lib/profile/InstrProfilingInternal.h (+2-1)
  • (modified) compiler-rt/lib/profile/InstrProfilingWriter.c (+10-10)
  • (modified) llvm/include/llvm/ProfileData/InstrProfData.inc (+1-1)
  • (modified) llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp (+4-1)
  • (modified) llvm/test/tools/llvm-profdata/binary-ids-padding.test (+1-1)
  • (modified) llvm/test/tools/llvm-profdata/malformed-not-space-for-another-header.test (+1-1)
  • (modified) llvm/test/tools/llvm-profdata/malformed-num-counters-zero.test (+1-1)
  • (modified) llvm/test/tools/llvm-profdata/malformed-ptr-to-counter-array.test (+1-1)
  • (modified) offload/plugins-nextgen/common/include/GlobalHandler.h (+9-4)
  • (modified) offload/plugins-nextgen/common/src/GlobalHandler.cpp (+14-4)
  • (added) offload/test/offloading/gpupgo/pgo1.c (+84)
  • (added) offload/test/offloading/gpupgo/pgo2.c (+76)
  • (removed) offload/test/offloading/pgo1.c (-66)
diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp
index 1012128085c7a..e0f1206496486 100644
--- a/clang/lib/Driver/ToolChains/Clang.cpp
+++ b/clang/lib/Driver/ToolChains/Clang.cpp
@@ -6387,11 +6387,7 @@ void Clang::ConstructJob(Compilation &C, const JobAction &JA,
   Args.AddLastArg(CmdArgs, options::OPT_fconvergent_functions,
                   options::OPT_fno_convergent_functions);
 
-  // NVPTX/AMDGCN doesn't support PGO or coverage. There's no runtime support
-  // for sampling, overhead of call arc collection is way too high and there's
-  // no way to collect the output.
-  if (!Triple.isNVPTX() && !Triple.isAMDGCN())
-    addPGOAndCoverageFlags(TC, C, JA, Output, Args, SanitizeArgs, CmdArgs);
+  addPGOAndCoverageFlags(TC, C, JA, Output, Args, SanitizeArgs, CmdArgs);
 
   Args.AddLastArg(CmdArgs, options::OPT_fclang_abi_compat_EQ);
 
diff --git a/clang/test/Driver/cuda-no-pgo-or-coverage.cu b/clang/test/Driver/cuda-no-pgo-or-coverage.cu
deleted file mode 100644
index b84587e1e182b..0000000000000
--- a/clang/test/Driver/cuda-no-pgo-or-coverage.cu
+++ /dev/null
@@ -1,33 +0,0 @@
-// Check that profiling/coverage arguments doen't get passed down to device-side
-// compilation.
-//
-//
-// XRUN: not %clang -### --target=x86_64-linux-gnu -c --cuda-gpu-arch=sm_20 \
-// XRUN:   -fprofile-generate %s 2>&1 | \
-// XRUN:   FileCheck --check-prefixes=CHECK,PROF %s
-//
-// RUN: not %clang -### --target=x86_64-linux-gnu -c --cuda-gpu-arch=sm_20 \
-// RUN:   -fprofile-instr-generate %s 2>&1 | \
-// RUN:   FileCheck --check-prefixes=CHECK,PROF %s
-//
-// RUN: not %clang -### --target=x86_64-linux-gnu -c --cuda-gpu-arch=sm_20 \
-// RUN:   -coverage %s 2>&1 | \
-// RUN:   FileCheck --check-prefixes=CHECK,GCOV %s
-//
-// RUN: not %clang -### --target=x86_64-linux-gnu -c --cuda-gpu-arch=sm_20 \
-// RUN:   -ftest-coverage %s 2>&1 | \
-// RUN:   FileCheck --check-prefixes=CHECK,GCOV %s
-//
-// RUN: not %clang -### --target=x86_64-linux-gnu -c --cuda-gpu-arch=sm_20   \
-// RUN:   -fprofile-instr-generate -fcoverage-mapping %s 2>&1 | \
-// RUN:   FileCheck --check-prefixes=CHECK,PROF %s
-//
-//
-// CHECK-NOT: error: unsupported option '-fprofile
-// CHECK-NOT: error: invalid argument
-// CHECK-DAG: "-fcuda-is-device"
-// CHECK-NOT: "-f{{[^"/]*coverage.*}}"
-// CHECK-NOT: "-fprofile{{[^"]*}}"
-// CHECK: "-triple" "x86_64-unknown-linux-gnu"
-// PROF:      "-fprofile{{.*}}"
-// GCOV:      "-coverage-notes-file=
diff --git a/compiler-rt/include/profile/InstrProfData.inc b/compiler-rt/include/profile/InstrProfData.inc
index 2cdfea9a579a4..d51b58386f168 100644
--- a/compiler-rt/include/profile/InstrProfData.inc
+++ b/compiler-rt/include/profile/InstrProfData.inc
@@ -152,7 +152,7 @@ INSTR_PROF_VALUE_NODE(PtrToNodeT, llvm::PointerType::getUnqual(Ctx), Next, \
 #define INSTR_PROF_DATA_DEFINED
 #endif
 INSTR_PROF_RAW_HEADER(uint64_t, Magic, __llvm_profile_get_magic())
-INSTR_PROF_RAW_HEADER(uint64_t, Version, __llvm_profile_get_version())
+INSTR_PROF_RAW_HEADER(uint64_t, Version, Version)
 INSTR_PROF_RAW_HEADER(uint64_t, BinaryIdsSize, __llvm_write_binary_ids(NULL))
 INSTR_PROF_RAW_HEADER(uint64_t, NumData, NumData)
 INSTR_PROF_RAW_HEADER(uint64_t, PaddingBytesBeforeCounters, PaddingBytesBeforeCounters)
diff --git a/compiler-rt/lib/profile/InstrProfiling.h b/compiler-rt/lib/profile/InstrProfiling.h
index 77c8d6c79322d..a90558fdcfbbf 100644
--- a/compiler-rt/lib/profile/InstrProfiling.h
+++ b/compiler-rt/lib/profile/InstrProfiling.h
@@ -310,7 +310,8 @@ int __llvm_write_custom_profile(const char *Target,
                                 const __llvm_profile_data *DataEnd,
                                 const char *CountersBegin,
                                 const char *CountersEnd, const char *NamesBegin,
-                                const char *NamesEnd);
+                                const char *NamesEnd,
+                                const uint64_t *VersionOverride);
 
 /*!
  * This variable is defined in InstrProfilingRuntime.cpp as a hidden
diff --git a/compiler-rt/lib/profile/InstrProfilingBuffer.c b/compiler-rt/lib/profile/InstrProfilingBuffer.c
index 1c451d7ec7563..b406e8db74f3f 100644
--- a/compiler-rt/lib/profile/InstrProfilingBuffer.c
+++ b/compiler-rt/lib/profile/InstrProfilingBuffer.c
@@ -252,5 +252,6 @@ COMPILER_RT_VISIBILITY int __llvm_profile_write_buffer_internal(
       &BufferWriter, DataBegin, DataEnd, CountersBegin, CountersEnd,
       BitmapBegin, BitmapEnd, /*VPDataReader=*/0, NamesBegin, NamesEnd,
       /*VTableBegin=*/NULL, /*VTableEnd=*/NULL, /*VNamesBegin=*/NULL,
-      /*VNamesEnd=*/NULL, /*SkipNameDataWrite=*/0);
+      /*VNamesEnd=*/NULL, /*SkipNameDataWrite=*/0,
+      __llvm_profile_get_version());
 }
diff --git a/compiler-rt/lib/profile/InstrProfilingFile.c b/compiler-rt/lib/profile/InstrProfilingFile.c
index 4667c02892505..19467429cf4c3 100644
--- a/compiler-rt/lib/profile/InstrProfilingFile.c
+++ b/compiler-rt/lib/profile/InstrProfilingFile.c
@@ -1273,10 +1273,13 @@ COMPILER_RT_VISIBILITY int __llvm_profile_set_file_object(FILE *File,
   return 0;
 }
 
-COMPILER_RT_USED int __llvm_write_custom_profile(
-    const char *Target, const __llvm_profile_data *DataBegin,
-    const __llvm_profile_data *DataEnd, const char *CountersBegin,
-    const char *CountersEnd, const char *NamesBegin, const char *NamesEnd) {
+int __llvm_write_custom_profile(const char *Target,
+                                const __llvm_profile_data *DataBegin,
+                                const __llvm_profile_data *DataEnd,
+                                const char *CountersBegin,
+                                const char *CountersEnd, const char *NamesBegin,
+                                const char *NamesEnd,
+                                const uint64_t *VersionOverride) {
   int ReturnValue = 0, FilenameLength, TargetLength;
   char *FilenameBuf, *TargetFilename;
   const char *Filename;
@@ -1358,10 +1361,15 @@ COMPILER_RT_USED int __llvm_write_custom_profile(
   ProfDataWriter fileWriter;
   initFileWriter(&fileWriter, OutputFile);
 
+  uint64_t Version = __llvm_profile_get_version();
+  if (VersionOverride)
+    Version = *VersionOverride;
+
   /* Write custom data to the file */
-  ReturnValue = lprofWriteDataImpl(
-      &fileWriter, DataBegin, DataEnd, CountersBegin, CountersEnd, NULL, NULL,
-      lprofGetVPDataReader(), NULL, NULL, NULL, NULL, NamesBegin, NamesEnd, 0);
+  ReturnValue =
+      lprofWriteDataImpl(&fileWriter, DataBegin, DataEnd, CountersBegin,
+                         CountersEnd, NULL, NULL, lprofGetVPDataReader(), NULL,
+                         NULL, NULL, NULL, NamesBegin, NamesEnd, 0, Version);
   closeFileObject(OutputFile);
 
   // Restore SIGKILL.
diff --git a/compiler-rt/lib/profile/InstrProfilingInternal.h b/compiler-rt/lib/profile/InstrProfilingInternal.h
index b100343ca04f9..03df71828b91d 100644
--- a/compiler-rt/lib/profile/InstrProfilingInternal.h
+++ b/compiler-rt/lib/profile/InstrProfilingInternal.h
@@ -160,7 +160,8 @@ int lprofWriteDataImpl(ProfDataWriter *Writer,
                        VPDataReaderType *VPDataReader, const char *NamesBegin,
                        const char *NamesEnd, const VTableProfData *VTableBegin,
                        const VTableProfData *VTableEnd, const char *VNamesBegin,
-                       const char *VNamesEnd, int SkipNameDataWrite);
+                       const char *VNamesEnd, int SkipNameDataWrite,
+                       uint64_t Version);
 
 /* Merge value profile data pointed to by SrcValueProfData into
  * in-memory profile counters pointed by to DstData.  */
diff --git a/compiler-rt/lib/profile/InstrProfilingWriter.c b/compiler-rt/lib/profile/InstrProfilingWriter.c
index 8816a71155511..bcd88b30d050d 100644
--- a/compiler-rt/lib/profile/InstrProfilingWriter.c
+++ b/compiler-rt/lib/profile/InstrProfilingWriter.c
@@ -254,21 +254,21 @@ COMPILER_RT_VISIBILITY int lprofWriteData(ProfDataWriter *Writer,
   const VTableProfData *VTableEnd = __llvm_profile_end_vtables();
   const char *VNamesBegin = __llvm_profile_begin_vtabnames();
   const char *VNamesEnd = __llvm_profile_end_vtabnames();
+  uint64_t Version = __llvm_profile_get_version();
   return lprofWriteDataImpl(Writer, DataBegin, DataEnd, CountersBegin,
                             CountersEnd, BitmapBegin, BitmapEnd, VPDataReader,
                             NamesBegin, NamesEnd, VTableBegin, VTableEnd,
-                            VNamesBegin, VNamesEnd, SkipNameDataWrite);
+                            VNamesBegin, VNamesEnd, SkipNameDataWrite, Version);
 }
 
-COMPILER_RT_VISIBILITY int
-lprofWriteDataImpl(ProfDataWriter *Writer, const __llvm_profile_data *DataBegin,
-                   const __llvm_profile_data *DataEnd,
-                   const char *CountersBegin, const char *CountersEnd,
-                   const char *BitmapBegin, const char *BitmapEnd,
-                   VPDataReaderType *VPDataReader, const char *NamesBegin,
-                   const char *NamesEnd, const VTableProfData *VTableBegin,
-                   const VTableProfData *VTableEnd, const char *VNamesBegin,
-                   const char *VNamesEnd, int SkipNameDataWrite) {
+COMPILER_RT_VISIBILITY int lprofWriteDataImpl(
+    ProfDataWriter *Writer, const __llvm_profile_data *DataBegin,
+    const __llvm_profile_data *DataEnd, const char *CountersBegin,
+    const char *CountersEnd, const char *BitmapBegin, const char *BitmapEnd,
+    VPDataReaderType *VPDataReader, const char *NamesBegin,
+    const char *NamesEnd, const VTableProfData *VTableBegin,
+    const VTableProfData *VTableEnd, const char *VNamesBegin,
+    const char *VNamesEnd, int SkipNameDataWrite, uint64_t Version) {
   /* Calculate size of sections. */
   const uint64_t DataSectionSize =
       __llvm_profile_get_data_size(DataBegin, DataEnd);
diff --git a/llvm/include/llvm/ProfileData/InstrProfData.inc b/llvm/include/llvm/ProfileData/InstrProfData.inc
index 2cdfea9a579a4..d51b58386f168 100644
--- a/llvm/include/llvm/ProfileData/InstrProfData.inc
+++ b/llvm/include/llvm/ProfileData/InstrProfData.inc
@@ -152,7 +152,7 @@ INSTR_PROF_VALUE_NODE(PtrToNodeT, llvm::PointerType::getUnqual(Ctx), Next, \
 #define INSTR_PROF_DATA_DEFINED
 #endif
 INSTR_PROF_RAW_HEADER(uint64_t, Magic, __llvm_profile_get_magic())
-INSTR_PROF_RAW_HEADER(uint64_t, Version, __llvm_profile_get_version())
+INSTR_PROF_RAW_HEADER(uint64_t, Version, Version)
 INSTR_PROF_RAW_HEADER(uint64_t, BinaryIdsSize, __llvm_write_binary_ids(NULL))
 INSTR_PROF_RAW_HEADER(uint64_t, NumData, NumData)
 INSTR_PROF_RAW_HEADER(uint64_t, PaddingBytesBeforeCounters, PaddingBytesBeforeCounters)
diff --git a/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp b/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
index a8055979acaa2..ea4be07d0c8c8 100644
--- a/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
+++ b/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
@@ -462,7 +462,10 @@ createIRLevelProfileFlagVar(Module &M,
   auto IRLevelVersionVariable = new GlobalVariable(
       M, IntTy64, true, GlobalValue::WeakAnyLinkage,
       Constant::getIntegerValue(IntTy64, APInt(64, ProfileVersion)), VarName);
-  IRLevelVersionVariable->setVisibility(GlobalValue::HiddenVisibility);
+  if (isGPUProfTarget(M))
+    IRLevelVersionVariable->setVisibility(GlobalValue::ProtectedVisibility);
+  else
+    IRLevelVersionVariable->setVisibility(GlobalValue::HiddenVisibility);
   Triple TT(M.getTargetTriple());
   if (TT.supportsCOMDAT()) {
     IRLevelVersionVariable->setLinkage(GlobalValue::ExternalLinkage);
diff --git a/llvm/test/tools/llvm-profdata/binary-ids-padding.test b/llvm/test/tools/llvm-profdata/binary-ids-padding.test
index 292c582b45c52..f31aa15bfe6c9 100644
--- a/llvm/test/tools/llvm-profdata/binary-ids-padding.test
+++ b/llvm/test/tools/llvm-profdata/binary-ids-padding.test
@@ -1,7 +1,7 @@
 // Header
 //
 // INSTR_PROF_RAW_HEADER(uint64_t, Magic, __llvm_profile_get_magic())
-// INSTR_PROF_RAW_HEADER(uint64_t, Version, __llvm_profile_get_version())
+// INSTR_PROF_RAW_HEADER(uint64_t, Version, Version)
 // INSTR_PROF_RAW_HEADER(uint64_t, BinaryIdsSize, __llvm_write_binary_ids(NULL))
 // INSTR_PROF_RAW_HEADER(uint64_t, DataSize, DataSize)
 // INSTR_PROF_RAW_HEADER(uint64_t, CountersSize, CountersSize)
diff --git a/llvm/test/tools/llvm-profdata/malformed-not-space-for-another-header.test b/llvm/test/tools/llvm-profdata/malformed-not-space-for-another-header.test
index 705e5efaf5875..44be2980bb2f2 100644
--- a/llvm/test/tools/llvm-profdata/malformed-not-space-for-another-header.test
+++ b/llvm/test/tools/llvm-profdata/malformed-not-space-for-another-header.test
@@ -1,7 +1,7 @@
 // Header
 //
 // INSTR_PROF_RAW_HEADER(uint64_t, Magic, __llvm_profile_get_magic())
-// INSTR_PROF_RAW_HEADER(uint64_t, Version, __llvm_profile_get_version())
+// INSTR_PROF_RAW_HEADER(uint64_t, Version, Version)
 // INSTR_PROF_RAW_HEADER(uint64_t, BinaryIdsSize, __llvm_write_binary_ids(NULL))
 // INSTR_PROF_RAW_HEADER(uint64_t, DataSize, DataSize)
 // INSTR_PROF_RAW_HEADER(uint64_t, CountersSize, CountersSize)
diff --git a/llvm/test/tools/llvm-profdata/malformed-num-counters-zero.test b/llvm/test/tools/llvm-profdata/malformed-num-counters-zero.test
index 157c13b926a7e..9af9d65a6bdba 100644
--- a/llvm/test/tools/llvm-profdata/malformed-num-counters-zero.test
+++ b/llvm/test/tools/llvm-profdata/malformed-num-counters-zero.test
@@ -1,7 +1,7 @@
 // Header
 //
 // INSTR_PROF_RAW_HEADER(uint64_t, Magic, __llvm_profile_get_magic())
-// INSTR_PROF_RAW_HEADER(uint64_t, Version, __llvm_profile_get_version())
+// INSTR_PROF_RAW_HEADER(uint64_t, Version, Version)
 // INSTR_PROF_RAW_HEADER(uint64_t, BinaryIdsSize, __llvm_write_binary_ids(NULL))
 // INSTR_PROF_RAW_HEADER(uint64_t, DataSize, DataSize)
 // INSTR_PROF_RAW_HEADER(uint64_t, CountersSize, CountersSize)
diff --git a/llvm/test/tools/llvm-profdata/malformed-ptr-to-counter-array.test b/llvm/test/tools/llvm-profdata/malformed-ptr-to-counter-array.test
index 83cf76f68fb63..49c5ae9b0931d 100644
--- a/llvm/test/tools/llvm-profdata/malformed-ptr-to-counter-array.test
+++ b/llvm/test/tools/llvm-profdata/malformed-ptr-to-counter-array.test
@@ -1,7 +1,7 @@
 // Header
 //
 // INSTR_PROF_RAW_HEADER(uint64_t, Magic, __llvm_profile_get_magic())
-// INSTR_PROF_RAW_HEADER(uint64_t, Version, __llvm_profile_get_version())
+// INSTR_PROF_RAW_HEADER(uint64_t, Version, Version)
 // INSTR_PROF_RAW_HEADER(uint64_t, BinaryIdsSize, __llvm_write_binary_ids(NULL))
 // INSTR_PROF_RAW_HEADER(uint64_t, DataSize, DataSize)
 // INSTR_PROF_RAW_HEADER(uint64_t, CountersSize, CountersSize)
diff --git a/offload/plugins-nextgen/common/include/GlobalHandler.h b/offload/plugins-nextgen/common/include/GlobalHandler.h
index e030ab9e6b61f..562bf5ec223bc 100644
--- a/offload/plugins-nextgen/common/include/GlobalHandler.h
+++ b/offload/plugins-nextgen/common/include/GlobalHandler.h
@@ -13,6 +13,7 @@
 #ifndef LLVM_OPENMP_LIBOMPTARGET_PLUGINS_NEXTGEN_COMMON_GLOBALHANDLER_H
 #define LLVM_OPENMP_LIBOMPTARGET_PLUGINS_NEXTGEN_COMMON_GLOBALHANDLER_H
 
+#include <optional>
 #include <type_traits>
 
 #include "llvm/ADT/DenseMap.h"
@@ -64,10 +65,13 @@ struct __llvm_profile_data {
 };
 
 extern "C" {
-extern int __attribute__((weak)) __llvm_write_custom_profile(
-    const char *Target, const __llvm_profile_data *DataBegin,
-    const __llvm_profile_data *DataEnd, const char *CountersBegin,
-    const char *CountersEnd, const char *NamesBegin, const char *NamesEnd);
+extern int __attribute__((weak))
+__llvm_write_custom_profile(const char *Target,
+                            const __llvm_profile_data *DataBegin,
+                            const __llvm_profile_data *DataEnd,
+                            const char *CountersBegin, const char *CountersEnd,
+                            const char *NamesBegin, const char *NamesEnd,
+                            const uint64_t *VersionOverride);
 }
 
 /// PGO profiling data extracted from a GPU device
@@ -76,6 +80,7 @@ struct GPUProfGlobals {
   SmallVector<__llvm_profile_data> Data;
   SmallVector<uint8_t> NamesData;
   Triple TargetTriple;
+  std::optional<uint64_t> Version;
 
   void dump() const;
   Error write() const;
diff --git a/offload/plugins-nextgen/common/src/GlobalHandler.cpp b/offload/plugins-nextgen/common/src/GlobalHandler.cpp
index 8783490831e25..9b9233c95e567 100644
--- a/offload/plugins-nextgen/common/src/GlobalHandler.cpp
+++ b/offload/plugins-nextgen/common/src/GlobalHandler.cpp
@@ -16,6 +16,7 @@
 
 #include "Shared/Utils.h"
 
+#include "llvm/ProfileData/InstrProfData.inc"
 #include "llvm/Support/Error.h"
 
 #include <cstring>
@@ -214,6 +215,13 @@ GenericGlobalHandlerTy::readProfilingGlobals(GenericDeviceTy &Device,
       if (auto Err = readGlobalFromDevice(Device, Image, DataGlobal))
         return Err;
       DeviceProfileData.Data.push_back(std::move(Data));
+    } else if (*NameOrErr == INSTR_PROF_QUOTE(INSTR_PROF_RAW_VERSION_VAR)) {
+      uint64_t RawVersionData;
+      GlobalTy RawVersionGlobal(NameOrErr->str(), Sym.getSize(),
+                                &RawVersionData);
+      if (auto Err = readGlobalFromDevice(Device, Image, RawVersionGlobal))
+        return Err;
+      DeviceProfileData.Version = RawVersionData;
     }
   }
   return DeviceProfileData;
@@ -265,7 +273,7 @@ void GPUProfGlobals::dump() const {
 }
 
 Error GPUProfGlobals::write() const {
-  if (!__llvm_write_custom_profile)
+  if (__llvm_write_custom_profile == nullptr)
     return Plugin::error("Could not find symbol __llvm_write_custom_profile. "
                          "The compiler-rt profiling library must be linked for "
                          "GPU PGO to work.");
@@ -274,6 +282,8 @@ Error GPUProfGlobals::write() const {
          CountsSize = Counts.size() * sizeof(int64_t);
   __llvm_profile_data *DataBegin, *DataEnd;
   char *CountersBegin, *CountersEnd, *NamesBegin, *NamesEnd;
+  const uint64_t *VersionOverride =
+      Version.has_value() ? &Version.value() : nullptr;
 
   // Initialize array of contiguous data. We need to make sure each section is
   // contiguous so that the PGO library can compute deltas properly
@@ -295,9 +305,9 @@ Error GPUProfGlobals::write() const {
   memcpy(NamesBegin, NamesData.data(), NamesData.size());
 
   // Invoke compiler-rt entrypoint
-  int result = __llvm_write_custom_profile(TargetTriple.str().c_str(),
-                                           DataBegin, DataEnd, CountersBegin,
-                                           CountersEnd, NamesBegin, NamesEnd);
+  int result = __llvm_write_custom_profile(
+      TargetTriple.str().c_str(), DataBegin, DataEnd, CountersBegin,
+      CountersEnd, NamesBegin, NamesEnd, VersionOverride);
   if (result != 0)
     return Plugin::error("Error writing GPU PGO data to file");
 
diff --git a/offload/test/offloading/gpupgo/pgo1.c b/offload/test/offloading/gpupgo/pgo1.c
new file mode 100644
index 0000000000000..c8011cbae83c0
--- /dev/null
+++ b/offload/test/offloading/gpupgo/pgo1.c
@@ -0,0 +1,84 @@
+// RUN: %libomptarget-compile-generic -fcreate-profile \
+// RUN:     -Xarch_device -fprofile-generate
+// RUN: env LLVM_PROFILE_FILE=%basename_t.llvm.profraw \
+// RUN:     %libomptarget-run-generic 2>&1
+// RUN: llvm-profdata show --all-functions --counts \
+// RUN:     %target_triple.%basename_t.llvm.profraw | \
+// RUN:     %fcheck-generic --check-prefix="LLVM-PGO"
+
+// RUN: %libomptarget-compile-generic -fcreate-profile \
+// RUN:     -Xarch_device -fprofile-instr-generate
+// RUN: env LLVM_PROFILE_FILE=%basename_t.clang.profraw \
+// RUN:     %libomptarget-run-generic 2>&1
+// RUN: llvm-profdata show --all-functions --counts \
+// RUN:     %target_triple.%basename_t.clang.profraw | \
+// RUN:     %fcheck-generic --check-prefix="CLANG-PGO"
+
+// REQUIRES: gpu
+// REQUIRES: pgo
+
+int test1(int a) { return a / 2; }
+int test2(int a) { return a * 2; }
+
+int main() {
+  int m = 2;
+#pragma omp target
+  for (int i = 0; i < 10; i++) {
+    m = test1(m);
+    for (int j = 0; j < 2; j++) {
+      m = test2(m);
+    }
+  }
+}
+
+// LLVM-PGO-LABEL: __omp_offloading_{{[_0-9a-zA-Z]*}}_main_{{[_0-9a-zA-Z]*}}:
+// LLVM-PGO: Hash: {{0[xX][0-9a-fA-F]+}}
+// LLVM-PGO: Counters: 4
+// LLVM-PGO: Block counts: [20, 10, 2, 1]
+
+// LLVM-PGO-LABEL: test1:
+// LLVM-PGO: Hash: {{0[xX][0-9a-fA-F]+}}
+// LLVM-PGO: Counters: 1
+// LLVM-PGO: Block counts: [10]
+
+// LLVM-PGO-LABEL: test2:
+// LLVM-PGO: Hash...
[truncated]

Copy link
Contributor

@jhuber6 jhuber6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG with a nit

Comment on lines 1088 to 1089
getArch() != llvm::Triple::armeb &&
getArch() != llvm::Triple::thumbeb;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of unrelated formatting changes, please undo those and just use git clang-format HEAD~1.

@@ -265,7 +273,7 @@ void GPUProfGlobals::dump() const {
}

Error GPUProfGlobals::write() const {
if (!__llvm_write_custom_profile)
if (__llvm_write_custom_profile == nullptr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (__llvm_write_custom_profile == nullptr)
if (!__llvm_write_custom_profile)

nit. this is idiomatic and shouldn't have been changed.

@llvmbot llvmbot added the clang:codegen IR generation bugs: mangling, exceptions, etc. label Mar 19, 2025
@EthanLuisMcDonough EthanLuisMcDonough merged commit c50d39f into llvm:main Mar 20, 2025
11 checks passed
@llvm-ci
Copy link
Collaborator

llvm-ci commented Mar 20, 2025

LLVM Buildbot has detected a new failure on builder arc-builder running on arc-worker while building clang,compiler-rt,llvm,offload at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/3/builds/13339

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: 1200 seconds without output running [b'ninja', b'check-all'], attempting to kill
1.227 [59/18/1] Linking CXX executable tools/clang/unittests/Basic/BasicTests
1.355 [58/18/2] Linking CXX executable tools/clang/unittests/Format/FormatTests
5.573 [57/18/3] Linking CXX executable tools/clang/unittests/Lex/LexTests
6.054 [56/18/4] Linking CXX executable tools/clang/unittests/libclang/libclangTests
command timed out: 1200 seconds without output running [b'ninja', b'check-all'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1299.445143

@Artem-B
Copy link
Member

Artem-B commented Mar 28, 2025

This is breaking CUDA/NVPTX.
Enabling PGO results in compiler generating PGO-related data which references itself, and NVPTX can't compile those.

E.g. we see data like this which includes a reference to itself:

@__profd__ZN12cuda_helpers13memcmp_kernelEPjS0_mPb = protected global { i64, i64, i64, i64, ptr, ptr, i32, [3 x i16], i32 } { i64 322482019142718985, i64 287486624882456055, i64 sub (i64 ptrtoint (ptr @__profc__ZN12cuda_helpers13memcmp_kernelEPjS0_mPb to i64), i64 ptr
toint (ptr @__profd__ZN12cuda_helpers13memcmp_kernelEPjS0_mPb to i64)), i64 0, ptr @_ZN12cuda_helpers13memcmp_kernelEPjS0_mPb, ptr null, i32 5, [3 x i16] zeroinitializer, i32 0 }, section "__llvm_prf_data", comdat($__profc__ZN12cuda_helpers13memcmp_kernelEPjS0_mPb), a
lign 8

When LLVM sees such self-reference, it crashes. https://godbolt.org/z/o6PTqr1ca

LLVM ERROR: Circular dependency found in global variable set
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.	Program arguments: /opt/compiler-explorer/clang-trunk/bin/llc -o /app/output.s -x86-asm-syntax=intel -march=nvptx64 -mcpu=sm_20 <source>
 #0 0x0000000003ab1a88 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/opt/compiler-explorer/clang-trunk/bin/llc+0x3ab1a88)
 #1 0x0000000003aaf644 SignalHandler(int, siginfo_t*, void*) Signals.cpp:0:0
 #2 0x00007a9fa6c42520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #3 0x00007a9fa6c969fc pthread_kill (/lib/x86_64-linux-gnu/libc.so.6+0x969fc)
 #4 0x00007a9fa6c42476 gsignal (/lib/x86_64-linux-gnu/libc.so.6+0x42476)
 #5 0x00007a9fa6c287f3 abort (/lib/x86_64-linux-gnu/libc.so.6+0x287f3)
 #6 0x000000000077d6a2 llvm::UniqueStringSaver::save(llvm::Twine const&) (.cold) StringSaver.cpp:0:0
 #7 0x0000000003a079d8 (/opt/compiler-explorer/clang-trunk/bin/llc+0x3a079d8)
 #8 0x00000000017dc57e (/opt/compiler-explorer/clang-trunk/bin/llc+0x17dc57e)
 #9 0x00000000017dc3dd VisitGlobalVariableForEmission(llvm::GlobalVariable const*, llvm::SmallVectorImpl<llvm::GlobalVariable const*>&, llvm::DenseSet<llvm::GlobalVariable const*, llvm::DenseMapInfo<llvm::GlobalVariable const*, void>>&, llvm::DenseSet<llvm::GlobalVariable const*, llvm::DenseMapInfo<llvm::GlobalVariable const*, void>>&) NVPTXAsmPrinter.cpp:0:0
#10 0x00000000017dc6dc llvm::NVPTXAsmPrinter::emitGlobals(llvm::Module const&) (/opt/compiler-explorer/clang-trunk/bin/llc+0x17dc6dc)
#11 0x00000000017dcf7d llvm::NVPTXAsmPrinter::doFinalization(llvm::Module&) (/opt/compiler-explorer/clang-trunk/bin/llc+0x17dcf7d)
#12 0x00000000030a916d llvm::FPPassManager::doFinalization(llvm::Module&) (/opt/compiler-explorer/clang-trunk/bin/llc+0x30a916d)
#13 0x00000000030b48f0 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/opt/compiler-explorer/clang-trunk/bin/llc+0x30b48f0)
#14 0x00000000008b4a08 compileModule(char**, llvm::LLVMContext&) llc.cpp:0:0
#15 0x000000000079237e main (/opt/compiler-explorer/clang-trunk/bin/llc+0x79237e)
#16 0x00007a9fa6c29d90 (/lib/x86_64-linux-gnu/libc.so.6+0x29d90)
#17 0x00007a9fa6c29e40 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e40)
#18 0x00000000008ac9c5 _start (/opt/compiler-explorer/clang-trunk/bin/llc+0x8ac9c5)
Program terminated with signal: SIGSEGV
Compiler returned: 139

@Artem-B
Copy link
Member

Artem-B commented Mar 28, 2025

@jhuber6 @jdoerfert I propose reverting the change, unless it can be quickly fixed forward so it does not affect CUDA/NVPTX.

@jhuber6
Copy link
Contributor

jhuber6 commented Mar 28, 2025

I really wish someone from NVIDIA would just fix this stupid limitation already. @EthanLuisMcDonough This type of stuff shows up for things like this. Any way we can modify the source code to work around this? It's fine to put it in a separate global.

int arr[] = {1, arr[0]};

@EthanLuisMcDonough
Copy link
Member Author

Thank you all for making me aware of this. I'll look into fixing this as soon as possible.

@Artem-B
Copy link
Member

Artem-B commented Mar 28, 2025

The crash is blocking our compiler updates. If nothing depends on this change yet, it would be great to revert the patch and re-land it once it's fixed.

@jhuber6
Copy link
Contributor

jhuber6 commented Mar 28, 2025

The crash is blocking our compiler updates. If nothing depends on this change yet, it would be great to revert the patch and re-land it once it's fixed.

Go ahead and revert it, but I think there were some follow-up patches. Also the pgo1.c and pgo2.c tests were flaky so I'm it's definitely the right move.

@EthanLuisMcDonough
Copy link
Member Author

This patch should disable PGO on nvidia targets: #133522. I'll merge it after I'm done running some tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:codegen IR generation bugs: mangling, exceptions, etc. clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' clang Clang issues not falling into any other category compiler-rt llvm:transforms offload PGO Profile Guided Optimizations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants