Skip to content

mutex between async samples#43

Merged
richardstartin merged 2 commits into
mainfrom
rgs/agct-hardening
Nov 14, 2023
Merged

mutex between async samples#43
richardstartin merged 2 commits into
mainfrom
rgs/agct-hardening

Conversation

@richardstartin
Copy link
Copy Markdown
Contributor

@richardstartin richardstartin commented Nov 10, 2023

What does this PR do?:
This makes two changes:

  1. Add a mutex between signal handlers for different async sample types, which prevents reentrance into AsyncGetCallTrace
  2. Move AsyncGetCallTrace retries into methods which cannot be inlined, so if we see an AGCT crash and only get the backtrace, we can associate it with the first attempt or the retry

Motivation:

We have seen infrequent segfaults in high load services internally with the following stacktrace:

Stack: [0x00007fbaadd00000,0x00007fbaade00000],  sp=0x00007fbaaddf9140,  free space=996k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x6cc0b5]  frame::sender_for_entry_frame(RegisterMap*) const+0x115
V  [libjvm.so+0x6cccc0]  frame::sender(RegisterMap*) const+0x1d0
V  [libjvm.so+0x6c7b84]  vframeStreamForte::forte_next()+0x84
V  [libjvm.so+0x6c8abd]  forte_fill_call_trace_given_top(JavaThread*, ASGCT_CallTrace*, int, frame) [clone .isra.22]+0x7bd
V  [libjvm.so+0x6c92c1]  AsyncGetCallTrace+0x221
C  [libjavaProfiler1081509410225895832.so+0x4284a]  Profiler::getJavaTraceAsync(void*, ASGCT_CallFrame*, int, StackContext*, bool*)+0x1ca
C  [libjavaProfiler1081509410225895832.so+0x43a96]  Profiler::recordSample(void*, unsigned long long, int, int, Event*)+0x236
C  [libjavaProfiler1081509410225895832.so+0x3cb3a]  PerfEvents::signalHandler(int, siginfo_t*, void*)+0x1ba
C  [libc.so.6+0x42520]
V  [libjvm.so+0xf12e35]  ZBarrierSetRuntime::weak_load_barrier_on_weak_oop_field_preloaded(oopDesc*, oopDesc**)+0xc5


siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x0000000000000000

This is caused by dereferencing a null pointer (si_addr: 0x0000000000000000).

I inspected the libjvm build using the docker images Azul publish (currently docker azul/zulu-openjdk:17-latest can be used to obtain a 17.0.9+8 build). After installing binutils,objdump -d /usr/lib/jvm/zulu17-ca-amd64/lib/server/libjvm.so can be used to find disassembly for the function where the segfault takes place:

00000000006cbfa0 <_ZNK5frame22sender_for_entry_frameEP11RegisterMap>:
  6cbfa0:	55                   	push   %rbp
  6cbfa1:	48 89 e5             	mov    %rsp,%rbp
  6cbfa4:	41 54                	push   %r12
  6cbfa6:	53                   	push   %rbx
  6cbfa7:	48 8b 46 20          	mov    0x20(%rsi),%rax
  6cbfab:	48 89 fb             	mov    %rdi,%rbx
  6cbfae:	4c 8b 60 d0          	mov    -0x30(%rax),%r12
  6cbfb2:	49 8b 44 24 20       	mov    0x20(%r12),%rax
  6cbfb7:	48 85 c0             	test   %rax,%rax
  6cbfba:	74 1c                	je     6cbfd8 <_ZNK5frame22sender_for_entry_frameEP11RegisterMap+0x38>
  6cbfbc:	49 8b 44 24 20       	mov    0x20(%r12),%rax
  6cbfc1:	48 85 c0             	test   %rax,%rax
  6cbfc4:	0f 84 e6 00 00 00    	je     6cc0b0 <_ZNK5frame22sender_for_entry_frameEP11RegisterMap+0x110>
  6cbfca:	49 8b 44 24 28       	mov    0x28(%r12),%rax
  6cbfcf:	48 85 c0             	test   %rax,%rax
  6cbfd2:	0f 84 d8 00 00 00    	je     6cc0b0 <_ZNK5frame22sender_for_entry_frameEP11RegisterMap+0x110>
  6cbfd8:	48 89 d7             	mov    %rdx,%rdi
  6cbfdb:	e8 00 d4 ff ff       	call   6c93e0 <_ZN11RegisterMap5clearEv>
  6cbfe0:	49 8b 7c 24 28       	mov    0x28(%r12),%rdi
  6cbfe5:	49 8b 54 24 30       	mov    0x30(%r12),%rdx
  6cbfea:	49 8b 44 24 20       	mov    0x20(%r12),%rax
  6cbfef:	48 89 7b 08          	mov    %rdi,0x8(%rbx)
  6cbff3:	48 89 03             	mov    %rax,(%rbx)
  6cbff6:	48 89 43 28          	mov    %rax,0x28(%rbx)
  6cbffa:	48 89 53 20          	mov    %rdx,0x20(%rbx)
  6cbffe:	e8 3d f2 ec ff       	call   59b240 <_ZN9CodeCache9find_blobEPv>
  6cc003:	49 89 c4             	mov    %rax,%r12
  6cc006:	48 89 43 10          	mov    %rax,0x10(%rbx)
  6cc00a:	48 85 c0             	test   %rax,%rax
  6cc00d:	74 30                	je     6cc03f <_ZNK5frame22sender_for_entry_frameEP11RegisterMap+0x9f>
  6cc00f:	48 8b 00             	mov    (%rax),%rax
  6cc012:	4c 89 e7             	mov    %r12,%rdi
  6cc015:	ff 50 58             	call   *0x58(%rax)
  6cc018:	84 c0                	test   %al,%al
  6cc01a:	74 23                	je     6cc03f <_ZNK5frame22sender_for_entry_frameEP11RegisterMap+0x9f>
  6cc01c:	48 8b 43 08          	mov    0x8(%rbx),%rax
  6cc020:	49 8b 94 24 80 00 00 	mov    0x80(%r12),%rdx
  6cc027:	00 
  6cc028:	48 39 d0             	cmp    %rdx,%rax
  6cc02b:	74 43                	je     6cc070 <_ZNK5frame22sender_for_entry_frameEP11RegisterMap+0xd0>
  6cc02d:	41 83 7c 24 08 03    	cmpl   $0x3,0x8(%r12)
  6cc033:	74 1b                	je     6cc050 <_ZNK5frame22sender_for_entry_frameEP11RegisterMap+0xb0>
  6cc035:	49 3b 84 24 88 00 00 	cmp    0x88(%r12),%rax
  6cc03c:	00 
  6cc03d:	74 31                	je     6cc070 <_ZNK5frame22sender_for_entry_frameEP11RegisterMap+0xd0>
  6cc03f:	31 c0                	xor    %eax,%eax
  6cc041:	89 43 18             	mov    %eax,0x18(%rbx)
  6cc044:	48 89 d8             	mov    %rbx,%rax
  6cc047:	5b                   	pop    %rbx
  6cc048:	41 5c                	pop    %r12
  6cc04a:	5d                   	pop    %rbp
  6cc04b:	c3                   	ret    
  6cc04c:	0f 1f 40 00          	nopl   0x0(%rax)
  6cc050:	48 83 c2 05          	add    $0x5,%rdx
  6cc054:	48 39 d0             	cmp    %rdx,%rax
  6cc057:	74 17                	je     6cc070 <_ZNK5frame22sender_for_entry_frameEP11RegisterMap+0xd0>
  6cc059:	49 8b 94 24 88 00 00 	mov    0x88(%r12),%rdx
  6cc060:	00 
  6cc061:	48 39 d0             	cmp    %rdx,%rax
  6cc064:	74 0a                	je     6cc070 <_ZNK5frame22sender_for_entry_frameEP11RegisterMap+0xd0>
  6cc066:	48 83 c2 05          	add    $0x5,%rdx
  6cc06a:	48 39 d0             	cmp    %rdx,%rax
  6cc06d:	75 d0                	jne    6cc03f <_ZNK5frame22sender_for_entry_frameEP11RegisterMap+0x9f>
  6cc06f:	90                   	nop
  6cc070:	49 8b 04 24          	mov    (%r12),%rax
  6cc074:	48 8d 15 35 14 00 00 	lea    0x1435(%rip),%rdx        # 6cd4b0 <_ZN7nmethod15get_original_pcEPK5frame>
  6cc07b:	48 89 de             	mov    %rbx,%rsi
  6cc07e:	4c 89 e7             	mov    %r12,%rdi
  6cc081:	48 8b 80 b8 01 00 00 	mov    0x1b8(%rax),%rax
  6cc088:	48 39 d0             	cmp    %rdx,%rax
  6cc08b:	75 3b                	jne    6cc0c8 <_ZNK5frame22sender_for_entry_frameEP11RegisterMap+0x128>
  6cc08d:	e8 8e d2 4c 00       	call   b99320 <_ZN7nmethod12orig_pc_addrEPK5frame>
  6cc092:	48 8b 00             	mov    (%rax),%rax
  6cc095:	48 85 c0             	test   %rax,%rax
  6cc098:	74 a5                	je     6cc03f <_ZNK5frame22sender_for_entry_frameEP11RegisterMap+0x9f>
  6cc09a:	48 89 43 08          	mov    %rax,0x8(%rbx)
  6cc09e:	b8 01 00 00 00       	mov    $0x1,%eax
  6cc0a3:	89 43 18             	mov    %eax,0x18(%rbx)
  6cc0a6:	48 89 d8             	mov    %rbx,%rax
  6cc0a9:	5b                   	pop    %rbx
  6cc0aa:	41 5c                	pop    %r12
  6cc0ac:	5d                   	pop    %rbp
  6cc0ad:	c3                   	ret    
  6cc0ae:	66 90                	xchg   %ax,%ax
  6cc0b0:	49 8b 44 24 20       	mov    0x20(%r12),%rax
  6cc0b5:	48 8b 40 f8          	mov    -0x8(%rax),%rax    <—— + 0x115 (crash happens here)
  6cc0b9:	49 89 44 24 28       	mov    %rax,0x28(%r12)
  6cc0be:	e9 15 ff ff ff       	jmp    6cbfd8 <_ZNK5frame22sender_for_entry_frameEP11RegisterMap+0x38>
  6cc0c3:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  6cc0c8:	ff d0                	call   *%rax
  6cc0ca:	eb c9                	jmp    6cc095 <_ZNK5frame22sender_for_entry_frameEP11RegisterMap+0xf5>
  6cc0cc:	0f 1f 40 00          	nopl   0x0(%rax)

(note that 0x6cbfa0 + 0x115 = 0x6cc0b5). The line above the faulting instruction (6cc0b0) is jumped to from 6cbfc4 and 6cbfd2 before jumping back to 6cbfd8. These checks appear to be checks inlined from the function JavaFrameAnchor::make_walkable() shown below:

void JavaFrameAnchor::make_walkable() {
  // last frame set?
  if (last_Java_sp() == nullptr) return;
  // already walkable?
  if (walkable()) return;
  vmassert(last_Java_pc() == nullptr, "already walkable");
  _last_Java_pc = (address)_last_Java_sp[-1];   <-- this looks like where the crash happens
  vmassert(walkable(), "something went wrong");
}

_last_Java_sp is an volatile intptr_t* and _last_Java_sp[-1] is the address 8 bytes before it, which is consistent with the x86 mov -0x8(%rax),%rax.

The only problem is that last_Java_sp() was already actually null-checked a few lines above! So _last_Java_pc was somehow made null during the execution of this function. There are two possibilities:

  1. We interrupted a thread already within AsyncGetCallTrace (either the perf events CPU profiler or the wallclock profiler) from the other one while the thread was in JavaFrameAnchor::make_walkable(). The TLS based mutex would prevent this, and we will know if this is effective if the crashes we see stop.
  2. another thread set the last_Java_sp to null during the invocation of the function. We will consider this possibility if the crashes recur.

The second commit is to allow better diagnostics in case this crash happens exclusively in a retry call and could have been avoided without the workaround.

Additional Notes:

How to test the change?:

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

Unsure? Have a question? Request a review!

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Nov 10, 2023

🔧 Report generated by pr-comment-cppcheck

CppCheck Report

Warnings (2)

Style Violations (156)

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Nov 10, 2023

🔧 Report generated by pr-comment-scanbuild

Scan-Build Report

User:runner@fv-az573-407
Working Directory:/home/runner/work/java-profiler/java-profiler/ddprof-lib/src/test/make
Command Line:make -j4 clean all
Clang Version:Ubuntu clang version 14.0.0-1ubuntu1.1
Date:Fri Nov 10 11:37:03 2023

Bug Summary

Bug TypeQuantityDisplay?
All Bugs6
Logic error
Assigned value is garbage or undefined1
Dereference of null pointer2
Result of operation is garbage or undefined1
Unused code
Dead initialization1
Dead nested assignment1

Reports

Bug Group Bug Type ▾ File Function/Method Line Path Length
Logic errorAssigned value is garbage or undefineddwarf.cppparseInstructions23120
Unused codeDead initializationlivenessTracker.cppcleanup_table451
Unused codeDead nested assignmentvmStructs.cppcheckNativeBinding7081
Logic errorDereference of null pointerflightRecorder.cppflush14008
Logic errorDereference of null pointersafeAccess.hload3518
Logic errorResult of operation is garbage or undefineddwarf.hgetSLeb12325

@richardstartin richardstartin marked this pull request as ready for review November 10, 2023 11:38
@richardstartin richardstartin merged commit 5787ac4 into main Nov 14, 2023
@richardstartin richardstartin mentioned this pull request Nov 15, 2023
2 tasks
@richardstartin richardstartin deleted the rgs/agct-hardening branch January 26, 2024 13:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants