ld.lld: error: drivers/gpu/drm/amd/amdgpu/amdgpu.lto.o: SHT_SYMTAB_SHNDX has 79046 entries, but the symbol table associated has 79048 #1388

Closed
nathanchance opened this issue Jun 1, 2021 · 16 comments
Labels
[ARCH] x86_64 This bug impacts ARCH=x86_64 [BUG] linux-next This is an issue only seen in linux-next [BUG] llvm (main) A bug in an unreleased version of LLVM (this label is appropriate for regressions) [FEATURE] LTO Related to building the kernel with LLVM Link Time Optimization [FEATURE] PGO Related to building the kernel with LLVM Profile Guided Optimization [FIXED][LINUX] 5.13 This bug was fixed in Linux 5.13

Comments

@nathanchance
Member

As pointed out by CI: https://github.com/ClangBuiltLinux/continuous-integration2/runs/2718382554?check_suite_focus=true

$ echo 'CONFIG_GCOV_KERNEL=n
CONFIG_KASAN=n
CONFIG_LTO_CLANG_THIN=y' >allmod.config

$ make -skj"$(nproc)" KCONFIG_ALLCONFIG=1 LLVM=1 LLVM_IAS=1 distclean allmodconfig all
...
ld.lld: error: drivers/gpu/drm/amd/amdgpu/amdgpu.lto.o: SHT_SYMTAB_SHNDX has 79582 entries, but the symbol table associated has 79583
...

This only happens when CONFIG_PGO_CLANG is enabled in addition to CONFIG_LTO_CLANG_THIN=y. With CONFIG_PGO_CLANG disabled, the same build completes without the error:

$ echo 'CONFIG_GCOV_KERNEL=n
CONFIG_KASAN=n
CONFIG_PGO_CLANG=n
CONFIG_LTO_CLANG_THIN=y' >allmod.config

$ make -skj"$(nproc)" KCONFIG_ALLCONFIG=1 LLVM=1 LLVM_IAS=1 distclean allmodconfig all
...

I bisected it down to llvm/llvm-project@b426b45 in LLVM.

$ git bisect log
# bad: [a3b3f7e631981bd861d5fe5e20f33b11a0dac978] [HIP] Adjust check in hip-include-path.hip test case
# good: [78eaff2ef8a984859a04f944522280360ee825aa] [llvm-exegesis] Loop unrolling for loop snippet repetitor mode
git bisect start 'a3b3f7e63198' '78eaff2ef8a9'
# good: [8e30b55c82cc245f8b59ef3b29d95c9797584b63] [scudo] Fix CHECK implementation
git bisect good 8e30b55c82cc245f8b59ef3b29d95c9797584b63
# good: [a051bbb53f6de8c2db8adf934ef7a9f5951ed0b8] [libcxxabi] Use ASan interface header for declaration. NFC
git bisect good a051bbb53f6de8c2db8adf934ef7a9f5951ed0b8
# bad: [20c9a44ac0164a657329020c0b3deabab3625688] [benchmark] Silence 'suggest override' and 'missing override' warnings
git bisect bad 20c9a44ac0164a657329020c0b3deabab3625688
# bad: [8cc437a8a16e6d2dd403a9a3a74594574e3371d4] [ARM] Extra predicated tests for VMULH. NFC
git bisect bad 8cc437a8a16e6d2dd403a9a3a74594574e3371d4
# good: [68e45962531a25a0fab63eab163a6c9f84c81f1e] [libcxx] Fix the function name in exceptions from create_directories
git bisect good 68e45962531a25a0fab63eab163a6c9f84c81f1e
# good: [832c99f727723bab2baf5a477bda3d91fed56f5d] Revert "[LoopDeletion] Break backedge if we can prove that the loop is exited on 1st iteration"
git bisect good 832c99f727723bab2baf5a477bda3d91fed56f5d
# bad: [b426b45d101740a21610205ec80610c6d0969966] [Internalize] Rename instead of removal if a to-be-internalized comdat has more than one member
git bisect bad b426b45d101740a21610205ec80610c6d0969966
# first bad commit: [b426b45d101740a21610205ec80610c6d0969966] [Internalize] Rename instead of removal if a to-be-internalized comdat has more than one member

cc @MaskRay

@nathanchance nathanchance added [FEATURE] LTO Related to building the kernel with LLVM Link Time Optimization [BUG] llvm (main) A bug in an unreleased version of LLVM (this label is appropriate for regressions) [FEATURE] PGO Related to building the kernel with LLVM Profile Guided Optimization labels Jun 1, 2021
@MaskRay
Member

MaskRay commented Jun 1, 2021

This may be an issue in the ELF object writer.

I cannot reproduce:

ninja -C /tmp/RelA clang lld llvm-objcopy llvm-ar llvm-objdump llvm-nm llvm-strings llvm-readelf llvm-strip llvm-ar

% git describe HEAD
v5.13-rc4-11-gc2131f7e73c9
% echo 'CONFIG_GCOV_KERNEL=n
CONFIG_KASAN=n
CONFIG_LTO_CLANG_THIN=y' >allmod.config
% PATH=/tmp/RelA/bin:$PATH make -k -j 50 KCONFIG_ALLCONFIG=1 LLVM=1 LLVM_IAS=1 O=/tmp/out/x86_64 clean allmodconfig all

Last few lines of output:

  LD [M]  sound/usb/snd-usbmidi-lib.ko                                         
  LD [M]  sound/usb/usx2y/snd-usb-us122l.ko 
  LD [M]  sound/usb/usx2y/snd-usb-usx2y.ko
  LD [M]  sound/virtio/virtio_snd.ko                                           
  LD [M]  sound/x86/snd-hdmi-lpe-audio.ko  
  LD [M]  sound/xen/snd_xen_front.ko                                           
  LD [M]  virt/lib/irqbypass.ko                                                
make[1]: Leaving directory '/tmp/out/x86_64'

@nathanchance
Member Author

@MaskRay sorry, I should have specified that linux-next is needed to see this failure.

@nickdesaulniers nickdesaulniers added the [BUG] linux-next This is an issue only seen in linux-next label Jun 1, 2021
@MaskRay
Member

MaskRay commented Jun 1, 2021

I think the commit "[Internalize] Rename instead of removal if a to-be-internalized comdat has more than one member" is correct. Without it, PGO GC can be wrong. The old amdgpu.lto.o has 44662 section headers while the new amdgpu.lto.o has 81884 (there are more because there are more .group sections), which exceeds SHN_LORESERVE (0xff00), so .symtab_shndx is now needed.
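
To make the threshold concrete: an ELF symbol stores its section index in the 16-bit st_shndx field, and values from SHN_LORESERVE (0xff00) upward are reserved, so a symbol defined in a section with such an index needs its real index stored in a separate SHT_SYMTAB_SHNDX section. A minimal sketch of that check, using the two section-header counts quoted above (the function name is illustrative and not part of any tool):

```python
# Start of the reserved section-index range in the ELF specification.
SHN_LORESERVE = 0xFF00


def needs_symtab_shndx(num_section_headers: int) -> bool:
    """Return True if some section index cannot be stored in st_shndx.

    Section indices run from 0 to num_section_headers - 1, so the largest
    index reaches the reserved range once the count exceeds SHN_LORESERVE.
    """
    return num_section_headers > SHN_LORESERVE


for label, count in (("old amdgpu.lto.o", 44662), ("new amdgpu.lto.o", 81884)):
    print(f"{label}: {count} section headers, "
          f"may need .symtab_shndx: {needs_symtab_shndx(count)}")
```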

I can reproduce the error: SHT_SYMTAB_SHNDX has 79582 entries, but the symbol table associated has 79583

The problem is that # of entries in .symtab (0x1d24e8/0x18 = 79583) != # of entries in .symtab_shndx (0x04db78/0x04 = 79582).
Is amdgpu.lto.o mangled by objtool for orc? It is possible that LLVM MC creates a valid object file but objtool invalidates it after adding/deleting symbols. I am unfamiliar with the kernel build process.

% readelf -WS drivers/gpu/drm/amd/amdgpu/amdgpu.lto.o | grep symtab
  [81880] .symtab           SYMTAB          0000000000000000 2819ab8 1d24e8 18     81883 74403  8
  [81881] .symtab_shndx     SYMTAB SECTION INDICES 0000000000000000 29ebfa0 04db78 04     81880   0  4
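
The entry counts above are just section size divided by entry size. As a rough sketch, assuming ELF64 `readelf -WS` output where the address column is 16 hex digits, the same check can be scripted (the regular expression and function name are illustrative, not part of readelf):

```python
import re

# The two section-header lines from the readelf output above.
READELF_LINES = """\
  [81880] .symtab           SYMTAB          0000000000000000 2819ab8 1d24e8 18     81883 74403  8
  [81881] .symtab_shndx     SYMTAB SECTION INDICES 0000000000000000 29ebfa0 04db78 04     81880   0  4
"""

# Section name, then the type (which may contain spaces), then the 16-digit
# address followed by offset, size and entry size, all in hexadecimal.
SECTION_RE = re.compile(
    r"\]\s+(\S+)\s+.*?\s([0-9a-f]{16})\s+([0-9a-f]+)\s+([0-9a-f]+)\s+([0-9a-f]+)"
)


def entry_counts(text: str) -> dict:
    """Map section name to number of entries (size // entry size)."""
    counts = {}
    for line in text.splitlines():
        match = SECTION_RE.search(line)
        if not match:
            continue
        name = match.group(1)
        size = int(match.group(4), 16)
        entsize = int(match.group(5), 16)
        if entsize:  # sections without fixed-size entries report 0
            counts[name] = size // entsize
    return counts


counts = entry_counts(READELF_LINES)
print(counts)
print("consistent:", counts[".symtab"] == counts[".symtab_shndx"])
```

Running the same function on the section headers of a pre-objtool copy of the object would show whether the two tables already disagreed before objtool touched the file.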

@nickdesaulniers nickdesaulniers added the [ARCH] x86_64 This bug impacts ARCH=x86_64 label Jun 2, 2021
@nickdesaulniers
Member

@samitolvanen doesn't the LTO patch set modify Kbuild to run objtool on the final vmlinux image? I guess if the AMDGPU driver is being built as a module, then objtool has to run on it at some point? Perhaps we could dump amdgpu.lto.o before and after objtool is run on it?

@samitolvanen
Member

Yes, we run objtool for the .lto.o files in scripts/Makefile.modfinal when LTO is enabled. Should be pretty easy to edit that and save a copy of the pre-objtool file.

@nathanchance
Member Author

$ git diff HEAD
diff --git a/scripts/Makefile.modfinal b/scripts/Makefile.modfinal
index 5e9b8057fb24..b10239c8a50b 100644
--- a/scripts/Makefile.modfinal
+++ b/scripts/Makefile.modfinal
@@ -40,6 +40,7 @@ prelink-ext := .lto

 ifdef CONFIG_STACK_VALIDATION
 cmd_ld_ko_o +=                                                         \
+       cp $(@:.ko=$(prelink-ext).o) $(@:.ko=$(prelink-ext).o).tmp;     \
        $(objtree)/tools/objtool/objtool $(objtool_args)                \
                $(@:.ko=$(prelink-ext).o);
$ llvm-readelf -WS build/x86_64/drivers/gpu/drm/amd/amdgpu/amdgpu.lto.o.tmp |& grep -E "\[Nr\]|symtab"
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [81880] .symtab        SYMTAB          0000000000000000 27b3860 1d24d0 18     81883 74403  8
  [81881] .symtab_shndx  SYMTAB SECTION INDICES 0000000000000000 2985d30 04db78 04     81880   0  4

$ llvm-readelf -WS build/x86_64/drivers/gpu/drm/amd/amdgpu/amdgpu.lto.o |& grep -E "\[Nr\]|symtab"
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [81880] .symtab        SYMTAB          0000000000000000 281c230 1d24e8 18     81883 74403  8
  [81881] .symtab_shndx  SYMTAB SECTION INDICES 0000000000000000 29ee718 04db78 04     81880   0  4

@nickdesaulniers
Member

Thanks for the numbers, @nathanchance. Using @MaskRay's math on entry size, we have:

amdgpu.lto.o.tmp:
0x1d24d0 / 0x18 == 79582
0x04db78 / 0x04 == 79582

amdgpu.lto.o
0x1d24e8 / 0x18 == 79583
0x04db78 / 0x04 == 79582

So indeed, it looks like objtool is adding one entry to .symtab but not to .symtab_shndx. Can we find which symbol that is? @nathanchance, if you still have those two object files, I think something like:

$ comm -1 -3 <$(llvm-readelf -s amdgpu.lto.o.tmp | rev | cut -d ' ' -f 1 | rev) <$(llvm-readelf -s amdgpu.lto.o | rev | cut -d ' ' -f 1 | rev)

should print the one difference. Then we can see what's special about that symbol.

@nathanchance
Member Author

That does not seem to work... here is the direct output of that command.

Here is the full output of llvm-readelf -Ws:

pre-objtool

post-objtool
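
One likely reason the one-liner failed: `<$(...)` is an input redirection from the result of a command substitution, not process substitution (`<(...)`), and `comm` also expects sorted input. A minimal sketch of the same comparison done on two saved `llvm-readelf -Ws` listings, assuming the symbol name is the eighth column (the function names and the sample rows below are made up for illustration):

```python
import re
from collections import Counter

# A symbol row starts with an index and a colon, e.g. "   12: 0000...".
SYMBOL_ROW = re.compile(r"^\s*\d+:\s")


def symbol_names(readelf_output: str) -> Counter:
    """Count symbol names in `readelf -Ws` style output."""
    names = Counter()
    for line in readelf_output.splitlines():
        if not SYMBOL_ROW.match(line):
            continue  # header, blank line, or warning text
        fields = line.split(None, 7)
        # Rows without a name (e.g. the null symbol) have only 7 fields.
        names[fields[7].strip() if len(fields) > 7 else ""] += 1
    return names


def added_symbols(before: str, after: str) -> list:
    """Names that occur more often in `after` than in `before`."""
    return sorted((symbol_names(after) - symbol_names(before)).elements())


# Hypothetical sample rows, not taken from the real object files.
BEFORE = """\
   Num:    Value          Size Type    Bind   Vis       Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT   UND
     1: 0000000000000000     0 FUNC    GLOBAL DEFAULT   UND example_symbol
"""
AFTER = BEFORE + (
    "     2: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT   UND added_by_tool\n"
)

print(added_symbols(BEFORE, AFTER))
```

Counting names rather than building a set keeps duplicates visible, which matters for local symbols that can share a name.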

@nickdesaulniers
Member

post-objtool contains a bunch of warnings from llvm-readelf in the form:

llvm-readelf: warning: 'amdgpu.lto.o': found an extended symbol index (59051), but unable to locate the extended symbol index table
 59051: 0000000000000000     0 SECTION LOCAL  DEFAULT   RSV[0xffff] <?>
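
For context, 0xffff in st_shndx is the SHN_XINDEX escape value: it means the real section index lives in the SHT_SYMTAB_SHNDX table at the same position as the symbol. The lookup can be sketched as follows; this illustrates the ELF rule only and is not llvm-readelf's actual implementation (the function name and sample table are hypothetical):

```python
# Escape value: the real section index is stored in the SHT_SYMTAB_SHNDX table.
SHN_XINDEX = 0xFFFF


def resolve_section_index(sym_index, st_shndx, shndx_table):
    """Return the real section index for a symbol, or None if unresolvable.

    shndx_table is the decoded SHT_SYMTAB_SHNDX contents (one 32-bit word per
    symbol table entry), or None when no usable table is available.
    """
    if st_shndx != SHN_XINDEX:
        return st_shndx  # ordinary (or other reserved) index, used as-is
    if shndx_table is None or sym_index >= len(shndx_table):
        return None  # extended index requested but nowhere to look it up
    return shndx_table[sym_index]


# Hypothetical table with three entries; index 3 is one past its end.
table = [0, 70000, 0]
print(resolve_section_index(1, SHN_XINDEX, table))
print(resolve_section_index(3, SHN_XINDEX, table))
print(resolve_section_index(1, SHN_XINDEX, None))
```

A symbol table that is one entry longer than its index table, as in the counts earlier in this thread, leaves the two out of step with each other.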

@nickdesaulniers
Member

__x86_indirect_alt_call_r11 is the newly added symbol after objtool runs.

@nathanchance
Member Author

Related to #1384 (comment)?

@nickdesaulniers
Member

nickdesaulniers commented Jun 4, 2021

Upstream discussion: https://lore.kernel.org/lkml/20210604205018.2238778-1-ndesaulniers@google.com/. It looks like there's another public report on that patch as well.

@nickdesaulniers
Member

Reverting f66c05d and 9bc0bb5 resolves the issue for me. Peter has a diff, https://lore.kernel.org/lkml/YL3q1qFO9QIRL%2FBA@hirez.programming.kicks-ass.net/, that I'm going to test next.

@nickdesaulniers
Member

The diff resolves the issue for me.

@nickdesaulniers nickdesaulniers added the [PATCH] Exists There is a patch that fixes this issue label Jun 7, 2021
@nickdesaulniers nickdesaulniers added [PATCH] Submitted A patch has been submitted for review [PATCH] Accepted A submitted patch has been accepted upstream and removed [PATCH] Exists There is a patch that fixes this issue [PATCH] Submitted A patch has been submitted for review labels Jun 10, 2021
@nathanchance
Member Author

@nathanchance nathanchance added [FIXED][LINUX] 5.13 This bug was fixed in Linux 5.13 and removed [PATCH] Accepted A submitted patch has been accepted upstream labels Jun 13, 2021