Skip to content

disable the profiler if -XX:+UseAdaptiveGCBoundary is set#9

Merged
richardstartin merged 1 commit into
mainfrom
rgs/detect-use-adaptive-gc-boundary-flag
Aug 23, 2023
Merged

disable the profiler if -XX:+UseAdaptiveGCBoundary is set#9
richardstartin merged 1 commit into
mainfrom
rgs/detect-use-adaptive-gc-boundary-flag

Conversation

@richardstartin
Copy link
Copy Markdown
Contributor

This flag was made obsolete in JDK15 (so that is has no effect) and was removed entirely in JDK16 (so that it causes an error when set), but has made its way into JVM tuning folklore so shows up from time to time. We have seen it crash the profiler with the following backtrace, which reproduces reliably with -XX:+UseAdaptiveGCBoundary on older JDK versions:

Current thread (0x00007f22c9cce800):  JavaThread "dd-profiler-recording-scheduler" daemon [_thread_in_vm, id=1356, stack(0x00007f226232c000,0x00007f226242d000)]

Stack: [0x00007f226232c000,0x00007f226242d000],  sp=0x00007f226242af30,  free space=1019k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x999dc6]  oopDesc::is_a(Klass*) const+0x26
V  [libjvm.so+0xa0b933]  jvmti_GetClassSignature+0x1e3
C  [libjavaProfiler3325426718231816309.so+0x41444]  Recording::writeCpool(Buffer*)+0x2314
C  [libjavaProfiler3325426718231816309.so+0x428a5]  Recording::finishChunk(bool)+0x845
C  [libjavaProfiler3325426718231816309.so+0x35dc8]  FlightRecorder::dump(char const*, int)+0x88
C  [libjavaProfiler3325426718231816309.so+0x28f2f]  Profiler::dump(char const*, int)+0xef
C  [libjavaProfiler3325426718231816309.so+0x12157]  Java_com_datadoghq_profiler_JavaProfiler_dump0+0x47
j  com.datadoghq.profiler.JavaProfiler.dump0(Ljava/lang/String;)V+0
j  com.datadoghq.profiler.JavaProfiler.dump(Ljava/nio/file/Path;)V+11
j  com.datadog.profiling.ddprof.DatadogProfiler.dump(Ljava/nio/file/Path;)V+5
j  com.datadog.profiling.ddprof.DatadogProfilerRecording.snapshot(Ljava/time/Instant;Ldatadog/trace/api/profiling/ProfilingSnapshot$Kind;)Lcom/datadog/profiling/controller/RecordingData;+17
j  com.datadog.profiling.controller.openjdk.OpenJdkOngoingRecording.snapshot(Ljava/time/Instant;Ldatadog/trace/api/profiling/ProfilingSnapshot$Kind;)Lcom/datadog/profiling/controller/RecordingData;+94
j  com.datadog.profiling.controller.ProfilingSystem$SnapshotRecording.snapshot(Z)V+38
j  com.datadog.profiling.controller.ProfilingSystem$SnapshotRecording.snapshot()V+2
j  com.datadog.profiling.controller.ProfilingSystem$$Lambda$555.run(Ljava/lang/Object;)V+4
J 22328 c2 datadog.trace.util.AgentTaskScheduler$PeriodicTask.run()V (25 bytes) @ 0x00007f22bfdf74b8 [0x00007f22bfdf7440+0x0000000000000078]
j  datadog.trace.util.AgentTaskScheduler$Worker.run()V+27
j  java.lang.Thread.run()V+11 java.base@11.0.17
v  ~StubRoutines::call_stub
V  [libjvm.so+0x8dc0ab]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*)+0x39b
V  [libjvm.so+0x8da06d]  JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, Thread*)+0x1ed
V  [libjvm.so+0x98912c]  thread_entry(JavaThread*, Thread*)+0x6c
V  [libjvm.so+0xedb02a]  JavaThread::thread_main_inner()+0x1ba
V  [libjvm.so+0xed7c2f]  Thread::call_run()+0x14f
V  [libjvm.so+0xc776a6]  thread_native_entry(Thread*)+0xe6

@github-actions
Copy link
Copy Markdown
Contributor

🔧 Report generated by pr-comment-cppcheck

CppCheck Report

Style Violations (151)

@github-actions
Copy link
Copy Markdown
Contributor

🔧 Report generated by pr-comment-scanbuild

Scan-Build Report

User:runner@fv-az887-500
Working Directory:/home/runner/work/java-profiler/java-profiler/ddprof-lib/src/test/make
Command Line:make -j4 clean all
Clang Version:Ubuntu clang version 14.0.0-1ubuntu1.1
Date:Wed Aug 23 12:55:43 2023

Bug Summary

Bug TypeQuantityDisplay?
All Bugs4
Logic error
Assigned value is garbage or undefined1
Dereference of null pointer1
Result of operation is garbage or undefined1
Unused code
Dead nested assignment1

Reports

Bug Group Bug Type ▾ File Function/Method Line Path Length
Logic errorAssigned value is garbage or undefineddwarf.cppparseInstructions23120
Unused codeDead nested assignmentvmStructs.cppcheckNativeBinding6351
Logic errorDereference of null pointersymbols_linux.cppElfParser15435
Logic errorResult of operation is garbage or undefineddwarf.hgetSLeb12325

@richardstartin richardstartin merged commit 8d2a226 into main Aug 23, 2023
@richardstartin richardstartin deleted the rgs/detect-use-adaptive-gc-boundary-flag branch January 26, 2024 13:55
jbachorik added a commit that referenced this pull request May 25, 2026
The always-on _in_signal_handler_depth TLS variable, accessed first from
inside our SIGPROF/SIGVTALRM handlers via SIGNAL_HANDLER_GUARD(), was
declared with the default global-dynamic TLS model.  On first access in
a given thread glibc lazily allocates the dtv slot via malloc() and
takes the heap lock — both async-signal-unsafe.

Reproduced deterministically on Graal aarch64 (glibc 17-graal debug)
running ClassGCTest: SIGPROF arrived on the VM Thread while Graal's
JVMCI compiler held the malloc heap lock through
c2v_notifyCompilerPhaseEvent.  Stack:

  #2  __libc_malloc                    -- waiting on heap lock
  #4  allocate_dtv_entry
  #7  _dl_tlsdesc_dynamic
  #8  TLS wrapper for _in_signal_handler_depth
  #9  SignalHandlerScope::SignalHandlerScope
  #10 CTimer::signalHandler

The heap holder is itself blocked at a safepoint waiting for VM Thread
to check in, and VM Thread is stuck in malloc -> full process deadlock.

Switch the variable to the initial-exec TLS model so the loader
allocates its slot from the static TLS surplus at libjavaProfiler.so
load time.  Every existing thread is fixed up at dlopen and every new
thread receives the slot at pthread_create.  Access is then a
register-relative load — async-signal-safe, lock-free, malloc-free.

Also narrow the type to uint8_t (realistic max depth ~3) to make the
intent explicit; alignment-wise this is the same slot.

Refresher tick reduced from 5 s to 500 ms so a library lazily loaded
from signal context becomes resolvable by the stack walker within
half a second.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jbachorik added a commit that referenced this pull request May 25, 2026
Code review surfaced 7 findings — 4 CONFIRMED, 3 PLAUSIBLE.  This
commit addresses all 7:

  #8 (CONFIRMED) — split isInSignalContext into a strict variant.
     dlopen_hook was treating null ProfiledThread as "in signal" and
     deferring refresh on every dlopen from uninstrumented JVM threads
     (VM Thread, JIT, GC), delaying wasmtime sigaction patching by up
     to 500 ms.  Add isInTrackedSignalContext() that returns false on
     null (only true when one of our SignalHandlerScopes is positively
     on the stack); dlopen_hook now uses it so JVM-internal threads
     get synchronous refresh again.  isInSignalContext() retains its
     conservative semantics for any future caller that wants
     AS-safe-by-default.

  #7 (CONFIRMED) — switchLibraryTrap was called before startRefresher
     despite the invariant comment.  Reorder so the refresher is
     running before the trap can fire.

  #9 (CONFIRMED) — DEBUG_ASSERT_NOT_IN_SIGNAL was on the 1-arg and
     2-arg Dictionary::lookup overloads but missed the 4-arg form
     that actually mallocs.  bounded_lookup's runtime-decided
     for_insert path was uncovered.  Move the assertion into the
     4-arg lookup, gated on for_insert (read-only lookups are
     AS-safe).

  #4 (CONFIRMED) — Comments referenced "REFRESH_INTERVAL_NS (5s)"
     but the actual constant is 500 ms.  Fix both stale mentions.

  #13 (PLAUSIBLE) — SIGNAL_HANDLER_GUARD_RELEASE before chaining
     leaves depth == 0 inside a chained handler that returns
     normally; DEBUG_ASSERT_NOT_IN_SIGNAL inside such a handler
     would not fire.  Document the trade-off in segvHandler — the
     longjmp safety property is more important than the sanitizer
     coverage gap, which is bounded to third-party signal handler
     code we don't own.

  #2 (PLAUSIBLE) — refresherLoop used OS::sleep without an EINTR
     loop; any unmasked signal (SIGCHLD, SIGURG, RT signals) would
     cause premature ticks.  Wrap the sleep in an explicit
     elapsed-time loop using OS::nanotime so the refresher ticks at
     500 ms regardless of stray signals.

  #14 (PLAUSIBLE) — refresherLoop published _refresher_tid before
     blocking SIGPROF/SIGVTALRM; a stale per-thread timer from a
     previous lifecycle could fire on the refresher during the
     window.  Block signals first, then publish the TID.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants