Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LTO on ARM 32bit ARM system fails #1627

Open
nakos937 opened this issue Apr 26, 2022 · 19 comments
Open

LTO on ARM 32bit ARM system fails #1627

nakos937 opened this issue Apr 26, 2022 · 19 comments
Labels
[ARCH] arm32 This bug impacts ARCH=arm enhancement New feature or request [FEATURE] LTO Related to building the kernel with LLVM Link Time Optimization feature-request Not a bug per-se [PATCH] Exists There is a patch that fixes this issue

Comments

@nakos937
Copy link

nakos937 commented Apr 26, 2022

Hello,
My company has a relatively old AT91-based embedded product, and it only has 4 megabytes of flash to boot the kernel from, so I started experimenting with LTO for size reduction and possible performance improvement.
I have ran into an issue - see the attached screenshot.
I suspect the linker incorrectly identifies these symbols in the kernel de-compressor routine as unused, and strips them out.
These are declared at:
https://elixir.bootlin.com/linux/v5.18-rc4/source/arch/arm/boot/compressed/misc.c

What I have tried already:
Setting KBUILD flags in corresponding Makefiles to disable LTO.
Setting the "used" compiler attribute on these specific variables/symbols.

I am honestly at a loss, please help me out on this one.
I am facing this issue on all kernels with existing LTO support.
Making a non-compressed target like "Image" instead of the default zImage works, although it is obviously way bigger.
Steps to reproduce it:
make at91_dt_defconfig - (I use a similar config for this custom board)
make -j $(nproc)

Screenshot from 2022-04-25 20-29-18
.

@nakos937 nakos937 changed the title LTO on ARM 32bit ARM system fials LTO on ARM 32bit ARM system fails Apr 26, 2022
@nathanchance
Copy link
Member

I think the orphan section warnings are the root cause of the issue? Does this help?

diff --git a/arch/arm/boot/compressed/vmlinux.lds.S b/arch/arm/boot/compressed/vmlinux.lds.S
index 1bcb68ac4b01..5538f205449c 100644
--- a/arch/arm/boot/compressed/vmlinux.lds.S
+++ b/arch/arm/boot/compressed/vmlinux.lds.S
@@ -123,7 +123,7 @@ SECTIONS

   . = BSS_START;
   __bss_start = .;
-  .bss                 : { *(.bss) }
+  .bss                 : { *(.bss) *(.bss.*) }
   _end = .;

   . = ALIGN(8);                /* the stack must be 64-bit aligned */

Alternatively, disabling LTO for the compressor might work?

diff --git a/arch/arm/boot/compressed/Makefile b/arch/arm/boot/compressed/Makefile
index 41bcbb460fac..78e2bf5f5fb8 100644
--- a/arch/arm/boot/compressed/Makefile
+++ b/arch/arm/boot/compressed/Makefile
@@ -97,6 +97,7 @@ OBJS  += lib1funcs.o ashldi3.o bswapsdi2.o
 targets       := vmlinux vmlinux.lds piggy_data piggy.o \
                 head.o $(OBJS)

+KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_LTO), $(KBUILD_CFLAGS))
 KBUILD_CFLAGS += -DDISABLE_BRANCH_PROFILING

 ccflags-y := -fpic $(call cc-option,-mno-single-pic-base,) -fno-builtin \

@nakos937
Copy link
Author

Hello, Thanks for your swift response. I'll try both and get back to you with results.

@nathanchance nathanchance added [ARCH] arm32 This bug impacts ARCH=arm [FEATURE] LTO Related to building the kernel with LLVM Link Time Optimization labels Apr 26, 2022
@nakos937
Copy link
Author

First suggestion did not work, it only skipped those linker warnings, bu produced the same error message after.
Disabling LTO with the flags you specified works, and I dont think LTO-ing the kernel decompressor routine would improve performance anywhere.
Thanks for your help.
Regards,
Akos

@nickdesaulniers
Copy link
Member

So...the resulting image boots? That's news if so, since we haven't supported (or tried) LTO on arm32.

@nakos937
Copy link
Author

nakos937 commented Apr 26, 2022

It does. Not sure if it is stable on the long-term. Will be deployed to 0-24 production after meticulous testing, I'll get back with my experience if you are interested.

@nickdesaulniers
Copy link
Member

We should upstream Nathan's patch if it works.

@nickdesaulniers nickdesaulniers added enhancement New feature or request feature-request Not a bug per-se [PATCH] Exists There is a patch that fixes this issue labels Apr 26, 2022
@nakos937
Copy link
Author

Keep in mind that the alpha architecture has exactly the same code for the de-compressor routine, I would eat my hat if it wouldn't behave the same. (LTO-ig the whole kernel just fine then dying at the compression part)
I'll test that out too if that helps, although that is a long-dead arch and is not even used in embedded systems.
But for the sake of clarity / OCD, if we want to upstream a patch, might as well be filling all potholes.

@nathanchance
Copy link
Member

We should upstream Nathan's patch if it works.

It does for me (with multi_v7_defconfig at least). Although we should probably understand why the symbol type changes with LTO...

$ llvm-nm vmlinux &| rg "(free_mem_end_ptr|free_mem_ptr|malloc_count|malloc_ptr|output_data)"
00a38208 B free_mem_end_ptr
00a38204 B free_mem_ptr
00a38214 B malloc_count
00a38210 B malloc_ptr
00a38200 B output_data

$ llvm-nm vmlinux.lto &| rg "(free_mem_end_ptr|free_mem_ptr|malloc_count|malloc_ptr|output_data)"
00a43208 b free_mem_end_ptr
00a43204 b free_mem_ptr
00a43210 b malloc_count
00a4320c b malloc_ptr
00a43200 b output_data

Keep in mind that the alpha architecture has exactly the same code for the de-compressor routine, I would eat my hat if it wouldn't behave the same.

LLVM does not have an Alpha backend as far as I can tell, so it is not really relevant since the only other option for LTO would be GCC, which is not upstream.

@nickdesaulniers
Copy link
Member

Although we should probably understand why the symbol type changes with LTO...

via man 1 nm:

...
           depending on the object file format.  If lowercase, the symbol is usually local; if
           uppercase, the symbol is global (external).  There are however a few lowercase
...
          "B"
           "b" The symbol is in the BSS data section.  This section typically contains zero-
               initialized or uninitialized data, although the exact behavior is system
               dependent.

(That's not an explanation as to why though).

Keep in mind that the alpha architecture has exactly the same code for the de-compressor routine, I would eat my hat if it wouldn't behave the same.

LLVM does not have an Alpha backend as far as I can tell, so it is not really relevant since the only other option for LTO would be GCC, which is not upstream.

I interpret this point more so as "let's figure out why the decompressor goes wrong under LTO BEFORE sending arm32 LTO support upstream, so that we can drop the proposed diff to arch/arm/boot/compressed/Makefile."

@nakos937
Copy link
Author

Also, I forgot to mention that I had enabled 32-bit ARM support by doing an ugly hack on the arch/Kconfig file and removing all the needed dependency factors for "HAS_LTO", so if we send this patch upstream, a proper patch for enabling ARM32 LTO should be included in the pull request, too.
I'd like to report that it has been working for two days or so now without a hiccup, but more in-depth testing is required.

@nakos937
Copy link
Author

nakos937 commented May 5, 2022

Will be deployed to 500+ of these devices soon, as it passed preliminary testing.
This is unrelated to LTO, but I'd like to ask because there may be an easy solution;
I have encountered a different problem: new kernels after the 5.4 series call the first mmc card mmcblk1 instead of mmcblk0, so a boot with the existing u-boot configuration fails, because mmcblk0p1 is hard-coded as the kernel partition. I hope there is a better option for making this work than modifying the u-boot parameters of thousands of devices manually. Patching the kernel side and deploying the kernel is done more conveniently because I can use SSH and Ansible.

@nickdesaulniers
Copy link
Member

Will be deployed to 500+ of these devices soon, as it passed preliminary testing.

Exciting! Any measurements you can share? That would help us when sending a patch upstream.

I hope there is a better option...

Perhaps that's a better question for #armlinux on Libera chat IRC.

@nickdesaulniers
Copy link
Member

Although we should probably understand why the symbol type changes with LTO...

@MaskRay any idea if it's intentional that LTO promotes bss symbols from STB_LOCAL to global STB_GLOBAL?

@MaskRay
Copy link
Member

MaskRay commented May 11, 2022

Although we should probably understand why the symbol type changes with LTO...

@MaskRay any idea if it's intentional that LTO promotes bss symbols from STB_LOCAL to global STB_GLOBAL?

I haven't looked into the case in the PR, but in general: ThinLTO may change symbols from STB_LOCAL to STB_GLOBAL.

If a function accessing a local variable is imported to another compile unit, the only way for the two compile units to see the same copy is to promote the variable to a global variable, then let the linker resolves references to the same copy.

See 4.1 Promotion of Symbols with Local Linkage in ThinLTO: Scalable and Incremental LTO.

T. Johnson, M. Amini and X. David Li, "ThinLTO: Scalable and incremental LTO," 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2017, pp. 111-121, doi: 10.1109/CGO.2017.7863733.

A simplest example:

static int var;
int foo(void) { return ++var; }

If foo is imported, var may be promoted to STB_GLOBAL.

@nickdesaulniers
Copy link
Member

nickdesaulniers commented May 11, 2022

Thanks!

Right, I'm guessing it promotes+renames to avoid any kind of collision at the global level. Edit, yep the paper says exactly that.

Although we should probably understand why the symbol type changes with LTO...

I think that's satisfactory to include in a commit message.

@nathanchance
Copy link
Member

For what it's worth, that issue is also reproducible with full LTO but I will still try to draft up a series for it, once I can verify there are no problems on real hardware and run it through a few other configurations.

@nathanchance
Copy link
Member

I also see a spew of warnings:

ld.lld: warning: arch/arm/built-in.a(perf_event_v7.o at 4900) <inline asm>:1:2: since v7, cp10 and cp11 are reserved for advanced SIMD or floating point instructions
        mcr p10, 7, r0, c11, c0, 0
        ^


ld.lld: warning: arch/arm/built-in.a(perf_event_v7.o at 4900) <inline asm>:1:2: since v7, cp10 and cp11 are reserved for advanced SIMD or floating point instructions
        mcr p10, 7, r2, c11, c0, 0
        ^


ld.lld: warning: arch/arm/built-in.a(perf_event_v7.o at 4900) <inline asm>:1:2: since v7, cp10 and cp11 are reserved for advanced SIMD or floating point instructions
        mcr p10, 7, r5, c11, c0, 0
        ^


ld.lld: warning: arch/arm/built-in.a(perf_event_v7.o at 4900) <inline asm>:1:2: since v7, cp10 and cp11 are reserved for advanced SIMD or floating point instructions
        mcr p10, 7, r0, c11, c0, 0
        ^


ld.lld: warning: arch/arm/built-in.a(perf_event_v7.o at 4900) <inline asm>:1:2: since v7, cp10 and cp11 are reserved for advanced SIMD or floating point instructions
        mcr p10, 7, r2, c11, c0, 0
        ^


ld.lld: warning: arch/arm/built-in.a(perf_event_v7.o at 4900) <inline asm>:1:2: since v7, cp10 and cp11 are reserved for advanced SIMD or floating point instructions
        mcr p10, 7, r5, c11, c0, 0
        ^


ld.lld: warning: arch/arm/mach-at91/built-in.a(pm.o at 226) <inline asm>:3:6: deprecated since v7, use 'dsb'
        1:  mcr    p15, 0, r1, c7, c10, 4
            ^


ld.lld: warning: arch/arm/built-in.a(flush.o at 5800) <inline asm>:2:2: deprecated since v7, use 'dsb'
        mcr     p15, 0, r7, c7, c10, 4
        ^


ld.lld: warning: arch/arm/plat-versatile/built-in.a(hotplug.o at 222) <inline asm>:2:2: deprecated since v7, use 'dsb'
        mcr     p15, 0, r2, c7, c10, 4
        ^

@nathanchance
Copy link
Member

Also, the issue is that global symbols are being changed to local symbols, not the other way around, if I am understanding the check and the commit message of both 5de813b and 8d7e4cc. Is that still expected with LTO? I guess that is expected if the global symbol is not referred to outside of its one translation unit?

@nickdesaulniers
Copy link
Member

nickdesaulniers commented May 12, 2022

re arch/arm/mach-at91, arch/arm/configs/at91_dt_defconfig has:

 12 CONFIG_ARCH_MULTI_V4T=y                                                                                                                                                                                                                
 13 CONFIG_ARCH_MULTI_V5=y                                                                                                                                                                                                                 
 14 # CONFIG_ARCH_MULTI_V7 is not set

I wonder if the arch version isn't being set correctly for LTO, or if explicit .arch directives would help. i.e.
https://lore.kernel.org/linux-arm-kernel/20210928154143.2106903-14-arnd@kernel.org/

Probably a similar issue for plat-versatile.

Though perf_event_v7.o sounds like it's definitely supposed to be for v7; I wonder what's going wrong there.

8d7e4cc makes me wonder why they're using nm to check for symbol binding rather than dumping the relocations ($OBJDUMP -dr) and grepping for the actual relocation type they're trying to avoid?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[ARCH] arm32 This bug impacts ARCH=arm enhancement New feature or request [FEATURE] LTO Related to building the kernel with LLVM Link Time Optimization feature-request Not a bug per-se [PATCH] Exists There is a patch that fixes this issue
Projects
None yet
Development

No branches or pull requests

4 participants