Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[objtool warnings] pairs of undefined stack state followed by stack state mismatch #319

Closed
nickdesaulniers opened this issue Jan 17, 2019 · 37 comments
Assignees
Labels
[ARCH] x86_64 This bug impacts ARCH=x86_64 asm goto related to the implementation of asm goto [BUG] llvm A bug that should be fixed in upstream LLVM [FIXED][LLVM] 9 This bug was fixed in LLVM 9.0 [TOOL] objtool warning is produced by the kernel's objtool

Comments

@nickdesaulniers
Copy link
Member

Forked from #6

lots and lots of objtool warning in the form:

drivers/scsi/scsi_debug.o: warning: objtool: resp_requests()+0x9: undefined stack state
drivers/scsi/scsi_debug.o: warning: objtool: resp_requests()+0x0: stack state mismatch: cfa1=6+16 cfa2=7+8

Note that they come in pairs. For an x86_64 allyesconfig, I see 2215 of these warnings (bad). I assume this is asm goto related.

@nickdesaulniers nickdesaulniers added [BUG] Untriaged Something isn't working [ARCH] x86_64 This bug impacts ARCH=x86_64 labels Jan 17, 2019
@nickdesaulniers nickdesaulniers added the [TOOL] objtool warning is produced by the kernel's objtool label Jan 17, 2019
@nathanchance
Copy link
Member

I expect this to be fully asm goto related, given that with https://github.com/nathanchance/linux/commits/no-asm-goto and a build of Clang @ llvm/llvm-project@e50d9cb, I do not see these warnings.

@nickdesaulniers
Copy link
Member Author

defconfig still has these warnings (different pairing):

arch/x86/mm/fault.o: warning: objtool: __do_page_fault()+0x34: sibling call from callable instruction with modified stack frame
arch/x86/mm/fault.o: warning: objtool: __do_page_fault()+0x0: stack state mismatch: cfa1=7+96 cfa2=7+8

and

net/core/dev.o: warning: objtool: trace_net_dev_queue()+0x81: return with modified stack frame
net/core/dev.o: warning: objtool: trace_net_dev_queue()+0x0: stack state mismatch: cfa1=7+40 cfa2=7+8

and

arch/x86/lib/usercopy_64.o: warning: objtool: copy_user_handle_tail()+0x3d: unreachable instruction

@nickdesaulniers
Copy link
Member Author

Simpler reproducer:

$ make CC=clang defconfig
$ make CC=clang -j46 arch/x86/mm/fault.o
$ ./tools/objtool/objtool check arch/x86/mm/fault.o --no-fp

@nickdesaulniers
Copy link
Member Author

nickdesaulniers commented Jan 19, 2019

creduce coughed this up:
final.i
reproducer
input

when grepping for stack state mismatch, but I don't think it's correct (it's complaining about t() which is an empty function.

@nickdesaulniers
Copy link
Member Author

(tangent; it's weird that clang accepts void n(*o, p, u) {};)

@bwendling
Copy link

It's actually perfectly valid C code. :-) (Albeit non-ANSI C.)

@nickdesaulniers
Copy link
Member Author

nickdesaulniers commented Jan 22, 2019

So what I think is happening here is that the clobber list is somehow messed up (for function n, not t as objtool points out. I think that the number of pushes does not match the number of pops.

Attached is the disassembly from:

$ clang -O2 -no-integrated-as -c final.i.c
$ objdump -d final.i.o > final.txt
$ ~/linux/tools/objtool/objtool check final.i.o --no-fp
final.i.o: warning: objtool: t()+0x0: return with modified stack frame
final.i.o: warning: objtool: t()+0x0: stack state mismatch: cfa1=7+64 cfa2=7+8

final.i.o.txt
final.txt

Specifically, it seems that %rax is pushed but then never popped, leaving the stack pointer in the wrong location upon function return.

@nickdesaulniers nickdesaulniers added [BUG] llvm A bug that should be fixed in upstream LLVM and removed [BUG] Untriaged Something isn't working labels Jan 22, 2019
@nickdesaulniers
Copy link
Member Author

ctopper points out that the add $0x8,%rsp should restore the stack to the correct position.

@nickdesaulniers nickdesaulniers added [BUG] Untriaged Something isn't working and removed [BUG] llvm A bug that should be fixed in upstream LLVM labels Jan 22, 2019
@nickdesaulniers
Copy link
Member Author

For further objtool warnings:

$ grep objtool: log.txt | cut -d ' ' -f 5- | sort | uniq
can't find switch jump table
return with modified stack frame
special: can't find new instruction
stack state mismatch: cfa1=6+16 cfa2=7+8
stack state mismatch: cfa1=7+64 cfa2=7+8
undefined stack state

@nickdesaulniers nickdesaulniers added the asm goto related to the implementation of asm goto label Jan 23, 2019
@dileks
Copy link

dileks commented Jan 24, 2019

Is CONFIG_STACK_VALIDATION=n an option to workaround these objtool warnings?

@nathanchance
Copy link
Member

I guess you could say that since CONFIG_STACK_VALIDATION=n just outright disables objtool.

@dileks
Copy link

dileks commented Jan 25, 2019

Which means disabling CONFIG_ORC_WINDER=n and use CONFIG_UNWINDER_FRAME_POINTER=y instead.

@dileks
Copy link

dileks commented Jan 25, 2019

My objtool warnings:

$ grep objtool: build-log_5.0.0-rc3-2-amd64-cbl.txt | cut -d ' ' -f 5- | sort | uniq
return with modified stack frame
special: can't find new instruction
stack state mismatch: cfa1=6+16 cfa2=7+8
stack state mismatch: cfa1=7+64 cfa2=7+8
stack state mismatch: cfa1=7+96 cfa2=7+8
undefined stack state
unreachable instruction
unsupported stack pointer realignment

@nickdesaulniers
Copy link
Member Author

Cool, so Josh (author of objtool) got back to me. Looks like the problem is seen in the relocations:

$ readelf -a final.i.clang.o
...
Relocation section '.rela__jump_table' at offset 0x420 contains 3 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000000  000200000002 R_X86_64_PC32     0000000000000000 .text + 5e
000000000004  000200000002 R_X86_64_PC32     0000000000000000 .text + 77
000000000008  000200000018 R_X86_64_PC64     0000000000000000 .text + 77
...

That creates a jump table entry, which when patched, makes a JMP from
0x5e to 0x77:

  5e:   0f 1f 44 00 8b          nopl   -0x75(%rax,%rax,1)
  63:   44 24 04                rex.R and $0x4,%al
  66:   48 83 c4 08             add    $0x8,%rsp
  6a:   5b                      pop    %rbx
  6b:   41 5c                   pop    %r12
  6d:   41 5d                   pop    %r13
  6f:   41 5e                   pop    %r14
  71:   41 5f                   pop    %r15
  73:   5d                      pop    %rbp
  74:   c3                      retq
  75:   0f 0b                   ud2
  77:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  7e:   00 00

0000000000000080 <t>:
  80:   c3                      retq


and 0x77 is just a nop which falls through to the next function.  It
never gets to the epilogue so the pushed registers aren't popped before
returning.

@nickdesaulniers
Copy link
Member Author

@topperc points out on IRC that the reproducer bug is visible before we even assemble to an object file. Replacing -c with -S:

...
# %bb.7:                                # %if.then14
	#APP
	1:.byte 15,31,68,0x00,0x00
	.pushsection __jump_table,  "aw" 
	.long 1b - ., .Ltmp0 - . 
	.quad 0 + .Ltmp0 - .
	.popsection 
	
	#NO_APP
	movl	4(%rsp), %eax           # 4-byte Reload
.LBB0_8:                                # %cleanup
	addq	$8, %rsp
	.cfi_def_cfa_offset 56
	popq	%rbx
	.cfi_def_cfa_offset 48
	popq	%r12
	.cfi_def_cfa_offset 40
	popq	%r13
	.cfi_def_cfa_offset 32
	popq	%r14
	.cfi_def_cfa_offset 24
	popq	%r15
	.cfi_def_cfa_offset 16
	popq	%rbp
	.cfi_def_cfa_offset 8
	retq
.LBB0_9:                                # %if.else
	.cfi_def_cfa_offset 64
	ud2
.Ltmp0:                                 # Address of block that was removed by CodeGen
.Lfunc_end0:
...

So the inline asm goto jumps past the epilog (pops and return instructions).

@nickdesaulniers
Copy link
Member Author

nickdesaulniers commented Jan 25, 2019

@topperc points out that Control Flow Optimizer is removing the INLINEASM's successor:

$ clang -mllvm -debug -mllvm -print-after-all -O2 -no-integrated-as 2>&1 final.edit.c
...
# *** IR Dump After Prologue/Epilogue Insertion & Frame Finalization ***:
bb.7.if.then14:
; predecessors: %bb.6
  successors: %bb.10(0x40000000), %bb.8(0x40000000); %bb.10(50.00%), %bb.8(50.00%)
  liveins: $r15
  INLINEASM ...
...
# *** IR Dump After Control Flow Optimizer ***:
bb.7.if.then14:
; predecessors: %bb.6
  successors: %bb.8(0x80000000); %bb.8(100.00%)
  liveins: $r15
  INLINEASM
...

@nickdesaulniers nickdesaulniers added [BUG] llvm A bug that should be fixed in upstream LLVM and removed [BUG] Untriaged Something isn't working labels Jan 25, 2019
@nickdesaulniers
Copy link
Member Author

nickdesaulniers commented Jan 26, 2019

Diff 183623 of https://reviews.llvm.org/D53765 has cut down the vast majority (4593/4607 == 99.7%) of the objtool warnings. (The one we debugged above came from arch/x86/mm/fault.o which is now fixed).

➜  kernel-all git:(master) ✗ grep objtool: log.txt | cut -d ' ' -f 5- | sort | uniq 
can't find switch jump table
return with modified stack frame
special: can't find new instruction
stack state mismatch: cfa1=6+16 cfa2=7+8
stack state mismatch: cfa1=7+64 cfa2=7+8
undefined stack state
➜  kernel-all git:(master) ✗ grep objtool: log.txt | cut -d ' ' -f 5- | wc -l      
4607
➜  kernel-all git:(master) ✗ grep objtool: log2.txt | cut -d ' ' -f 5- | sort | uniq
return with modified stack frame
stack state mismatch: cfa1=6+16 cfa2=7+8
stack state mismatch: cfa1=7+64 cfa2=7+8
undefined stack state
➜  kernel-all git:(master) ✗ grep objtool: log2.txt | cut -d ' ' -f 5- | wc -l
14

Let's see if we can pull out another reproducer and whether these are asm goto related or not.

EDIT: the warning from arch/x86/mm/fault.o is not gone. It's just that we get different sets of objtool warnings for defconfig (4 + 1KI) vs allyesconfig (14).

@nickdesaulniers nickdesaulniers self-assigned this Jan 26, 2019
@topperc
Copy link

topperc commented Jan 26, 2019

If you can get even a non-creduced reproducer i might be able to make some progress with it. i'm not convinced I fixed all the possible issues with BranchFolding.cpp

@bwendling
Copy link

Shouldn't these exist for the second asm goto as well?

__label__ l_yes;
__label__ t;

@nickdesaulniers
Copy link
Member Author

@gwelymernans I don't think so...labels in C are function scoped, unless using GNU C extension "local labels." So the first pair of l_yes/t labels are "local labels" scoped to the GNU C extension "statement expression." The second pair are NOT local labels but regular function scoped labels (which are "shadowed" in the local label scope from the first pair). But I could be wrong.

@nickdesaulniers
Copy link
Member Author

avec D53765 diff 183714:
defconfig: 1 objtool warning, which is a separate known issue (#199 )
allyesconfig: build seems to be hanging (in 30 files)

@topperc
Copy link

topperc commented Jan 26, 2019

Is the hanging a new problem?

@nickdesaulniers
Copy link
Member Author

Is the hanging a new problem?

yes (we can track in a new issue, then close this one out once that is resolved)

@topperc
Copy link

topperc commented Jan 26, 2019

Can you get me a reproducer? Even just one of the hanging sources preprocessed with a command line.

@nickdesaulniers
Copy link
Member Author

nickdesaulniers commented Jan 27, 2019

Can you get me a reproducer?

Moved to #330 .

w/ D53765 diff 183738:
defconfig: 1 objtool warning, which is a separate known issue (#199 )
allyesconfig: 10 warnings remaining (maybe a long tail of control flow patterns).

attaching a non-creduced reproducer. Please let me know if you'd like it creduced.
export3.zip

I don't see the typical Address of block that was removed by CodeGen smoking gun in the generated assembly. The warning produced by the objtool object validator is:

drivers/spi/spi-rockchip.o: warning: objtool: rockchip_spi_max_transfer_size()+0x13: undefined stack state
drivers/spi/spi-rockchip.o: warning: objtool: rockchip_spi_max_transfer_size()+0x0: stack state mismatch: cfa1=6+16 cfa2=7+8

so rockchip_spi_max_transfer_size() appears problematic. Please let me know if creducing this would help.

spi-rockchip2.s.txt

It's not clear to me what's wrong in this case:

objdump -d -r spi-rockchip2.o:

...
00000000000022e0 <rockchip_spi_max_transfer_size>:
    22e0:       e8 00 00 00 00          callq  22e5 <rockchip_spi_max_transfer_size+0x5>
                        22e1: R_X86_64_PC32     __fentry__-0x4
    22e5:       55                      push   %rbp
    22e6:       48 89 e5                mov    %rsp,%rbp
    22e9:       e8 00 00 00 00          callq  22ee <rockchip_spi_max_transfer_size+0xe>
                        22ea: R_X86_64_PC32     __sanitizer_cov_trace_pc-0x4
    22ee:       b8 ff ff 00 00          mov    $0xffff,%eax
    22f3:       5d                      pop    %rbp
    22f4:       c3                      retq   
    22f5:       90                      nop
    22f6:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
    22fd:       00 00 00
...

And I don't see the jump table issues Josh previously pointed out. (There is no .rela__jump_table section).

@nickdesaulniers
Copy link
Member Author

readelf -a spi-rockchip2.o:

Relocation section '.rela.orc_unwind_ip' at offset 0x8ce0 contains 110 entries:
...
0000000000d4  000200000002 R_X86_64_PC32     0000000000000000 .text + 22e9
...

Comparing 22e9 and the disassembly above, I wonder if this is another special case but for the orc unwinder?

@topperc
Copy link

topperc commented Jan 27, 2019

It doesn't look like there are any asm gotos in this file right?

When you say "another" special case? Is that referencing some previous issue that wasn't asm goto related?

@nickdesaulniers
Copy link
Member Author

Is that referencing some previous issue that wasn't asm goto related?

Correct. Don't worry about it; I'll stop posting about it (it will be implied).

@bwendling
Copy link

The second asm goto is also in an expression statement, so I'm not sure I understand ...

@nickdesaulniers
Copy link
Member Author

The second asm goto is also in an expression statement

The labels the second asm goto jump to are just normal labels and thus have function level scope; scoped to function r. The labels the first asm goto jump to are special; GNU local labels, are scoped to the surrounding GNU statement expression, and additionally happen to shadow the function scoped labels the second asm goto jumps to.

@nickdesaulniers
Copy link
Member Author

It doesn't look like there are any asm gotos in this file right?

huh? see export3.zip (has many).

@topperc
Copy link

topperc commented Jan 27, 2019

When I compiled to .ll there were no callbr instructions in the final IR. It looks like a lot of them are in functions in header files. Those functions must not be used by the .c file and so they got dead code eliminated?

@nickdesaulniers
Copy link
Member Author

nickdesaulniers commented Jan 27, 2019

oh, indeed, no callbr's there. Ok, sorry for the false negative. Did some comparisons between w/ and w/o asm goto support: https://gist.github.com/nickdesaulniers/d2181780669db43affa05f60a4b059dd

TL;DR looks like there's just 1 warning unique to asm goto from objtool left (and 5 pre-existing unrelated objtool warnings (4 that we weren't aware of; filed #331 #332 #333 #334 )).

drivers/usb/gadget/function/rndis.o: warning: objtool: gen_ndis_set_resp()+0x9: undefined stack state drivers/usb/gadget/function/rndis.o: warning: objtool: gen_ndis_set_resp()+0x0: stack state mismatch: cfa1=6+16 cfa2=7+8

What's bizarre is that I can only reproduce when building the entire kernel, and not the standalone object file.

$ make CC=clang allyesconfig
$ make CC=clang -j46
...
  CC      drivers/usb/gadget/function/rndis.o
...
drivers/usb/gadget/function/rndis.o: warning: objtool: gen_ndis_set_resp()+0x9: undefined stack state
drivers/usb/gadget/function/rndis.o: warning: objtool: gen_ndis_set_resp()+0x0: stack state mismatch: cfa1=6+16 cfa2=7+8
...
$ tools/objtool/objtool check drivers/usb/gadget/function/rndis.o --no-fp
drivers/usb/gadget/function/rndis.o: warning: objtool: gen_ndis_set_resp()+0x9: undefined stack state
drivers/usb/gadget/function/rndis.o: warning: objtool: gen_ndis_set_resp()+0x0: stack state mismatch: cfa1=6+16 cfa2=7+8
$ rm drivers/usb/gadget/function/rndis.o
$ make CC=clang drivers/usb/gadget/function/rndis.o
  DESCEND  objtool
  CALL    scripts/checksyscalls.sh
  CC      drivers/usb/gadget/function/rndis.o
$ make clean
...
$ make CC=clang drivers/usb/gadget/function/rndis.o
...
  CC      drivers/usb/gadget/function/rndis.o
$ tools/objtool/objtool check drivers/usb/gadget/function/rndis.o 
$ tools/objtool/objtool check drivers/usb/gadget/function/rndis.o --no-fp
$

shakes laptop; why are you haunted?!?

Oh, seems like at least everything under drivers/usb/gadget/ has to be the make target to repro...hmm... -DDEBUG -DVERBOSE_DEBUG is not being set when running the full path, seems like subdir-ccflags- may only apply to one level subdir? Will start an LKML thread with @masahir0y )

Anyways, looks like the .ll does have callbr's. Also, the warning is coming from the kernel's orc unwinder instrumenting the object file after the build (vs the previous issues where it was checking the object files).

From the assembly, I see the smoking gun .Ltmp0: # Address of block that was removed by CodeGen.

Here's the reproducer. Replacing s/-c -o rndis.o/-S/, you can see the smoking gun in the assembly. From the assembly, I also can't tell how you'd get to .LBB1_1.

@topperc I'm not sure if this reproducer is enough to point clearly to a problem or not. I also suspect the lack of inlining of static local functions, even if they contain meaningless asm goto's like the reproducer.

@topperc
Copy link

topperc commented Jan 27, 2019

New diff 183780 is up in phab. .LBB1_1 being unreachable was important to finding the issue.

@nickdesaulniers
Copy link
Member Author

nickdesaulniers commented Jan 28, 2019

Trying now with localmodconfig (a kernel config based on the currently running kernel, a safe configuration if you want to boot on metal); seeing 4 different objtool warnings and some compiler hangs (4 jobs were still running after an hour). Will spend more time bisecting and comparing before+after asm goto patches are applied/checking for the presence of callbr IR instructions, in order to provide smaller reproducers.

Edit: filed #335 and #336 , but not reassigning until I've done more due diligence on them.

@nickdesaulniers
Copy link
Member Author

the issues I was observing with allyesconfig look fixed now as of Diff 183921. Closing this bug, but let's follow up in #336 .

@tpimh tpimh added the [FIXED][LLVM] 9 This bug was fixed in LLVM 9.0 label Jun 24, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[ARCH] x86_64 This bug impacts ARCH=x86_64 asm goto related to the implementation of asm goto [BUG] llvm A bug that should be fixed in upstream LLVM [FIXED][LLVM] 9 This bug was fixed in LLVM 9.0 [TOOL] objtool warning is produced by the kernel's objtool
Projects
None yet
Development

No branches or pull requests

6 participants