fix(crashtracker): move preload logger marking after recursive guard #2023

Open
gyuheon0h wants to merge 1 commit into main from gyuheon0h/PROF-14734-preload-logger-fail-inf-recursion

@gyuheon0h gyuheon0h commented May 21, 2026

What does this PR do?

We saw a crash whose stack trace shows failures inside the crashtracker itself, with frames repeating like this:
handle_posix_sigaction -> dlsym -> ... -> handle_posix_sigaction -> dlsym -> ... -> ...

This happens because handle_posix_sigaction calls mark_preload_logger_collector() before checking the one-time guard NUM_TIMES_CALLED. If the dlsym call inside mark_preload_logger_collector() faults, the following sequence occurs:

  1. SIGBUS fires -> handle_posix_sigaction -> handle_posix_signal_impl
  2. mark_preload_logger_collector() calls dlsym -> SIGBUS again (same faulty mapping)
  3. Because SA_NODEFER is set, the new SIGBUS is delivered immediately, re-entering handle_posix_sigaction -> handle_posix_signal_impl
  4. ...infinite recursion

Note that the production failure might not have come from dd_preload_logger_mark_collector itself; it may have been dlsym faulting while iterating through the dynamically loaded libraries. The important part is that dlsym can fault, and when it does, it re-triggers the crashtracker.
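
For illustration, here is a minimal single-threaded simulation of why the ordering matters. It uses no real signals: `Sim`, `risky_mark`, `handler`, and `handler_entries` are hypothetical names invented for this sketch, not the crashtracker's actual code. A "fault" is modeled as a direct recursive call into the handler, which is roughly what SA_NODEFER permits.

```rust
use std::sync::atomic::{AtomicU64, Ordering::SeqCst};

/// Single-threaded simulation of a crash handler; no real signals are used.
struct Sim {
    /// Stand-in for the one-time guard (NUM_TIMES_CALLED in the description).
    num_times_called: AtomicU64,
    /// How many more times the simulated symbol lookup will "fault".
    faults_left: AtomicU64,
    /// How many times the handler has been entered.
    entries: AtomicU64,
}

impl Sim {
    fn new(faults: u64) -> Self {
        Sim {
            num_times_called: AtomicU64::new(0),
            faults_left: AtomicU64::new(faults),
            entries: AtomicU64::new(0),
        }
    }

    /// Hypothetical stand-in for the preload-logger marking step. While faults
    /// remain, it "raises" the signal again by calling the handler directly.
    fn risky_mark(&self, guard_first: bool) {
        let faulted = self
            .faults_left
            .fetch_update(SeqCst, SeqCst, |n| n.checked_sub(1))
            .is_ok();
        if faulted {
            self.handler(guard_first);
        }
    }

    fn handler(&self, guard_first: bool) {
        self.entries.fetch_add(1, SeqCst);
        if !guard_first {
            // Old ordering: risky work runs on every entry, before the guard.
            self.risky_mark(guard_first);
        }
        if self.num_times_called.fetch_add(1, SeqCst) > 0 {
            return; // Re-entry: bail out without doing any more work.
        }
        if guard_first {
            // New ordering: risky work runs at most once, after the guard.
            self.risky_mark(guard_first);
        }
        // A real handler would collect and emit the crash report here.
    }
}

/// Runs one simulated crash and returns how many times the handler was entered.
fn handler_entries(faults: u64, guard_first: bool) -> u64 {
    let sim = Sim::new(faults);
    sim.handler(guard_first);
    sim.entries.load(SeqCst)
}

fn main() {
    println!("mark before guard: {} entries", handler_entries(30, false));
    println!("guard before mark: {} entries", handler_entries(30, true));
}
```

With the marking step placed before the guard, the number of handler entries grows with the number of faults; with the guard first, re-entry is cut off after a single nested call regardless of how many faults remain.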

Motivation

What inspired you to submit this pull request?

Additional Notes

The repro script below was AI-generated.

How to test the change?

#!/usr/bin/env bash

set -euo pipefail

REPO="$(cd "$(dirname "$0")" && pwd)"
BIN_TEST="$REPO/target/debug/crashtracker_bin_test"
RECEIVER="$REPO/target/debug/crashtracker-receiver"
PRELOAD_C="/tmp/repro_preload.c"
PRELOAD_SO="/tmp/repro_preload.so"
REPORT_OUT="$REPO/repro_crash_report.json"
RECURSE_DEPTH=30 # N: how many times the simulated dlsym failure raises SIGBUS before returning normally

for bin in "$BIN_TEST" "$RECEIVER"; do
    [[ -x "$bin" ]] || { echo "Missing: $bin"; exit 1; }
done
command -v gcc >/dev/null || { echo "gcc not found"; exit 1; }

cat > "$PRELOAD_C" <<EOF
#include <signal.h>
#include <stdatomic.h>

static atomic_int call_count = 0;

__attribute__((visibility("default")))
void dd_preload_logger_mark_collector(void) {
    int n = atomic_fetch_add_explicit(&call_count, 1, memory_order_seq_cst);
    if (n < $RECURSE_DEPTH) {
        raise(SIGBUS);
    }
}
EOF

gcc -shared -fPIC -o "$PRELOAD_SO" "$PRELOAD_C"

OUT_DIR="$(mktemp -d)"
trap 'rm -rf "$OUT_DIR"' EXIT
rm -f "$REPORT_OUT"

set +e
LD_PRELOAD="$PRELOAD_SO" "$BIN_TEST" \
    "file://$REPORT_OUT" \
    "$RECEIVER" \
    "$OUT_DIR" \
    "donothing" \
    "kill_sigbus"
EXIT_CODE=$?
set -e

echo "Exit code: $EXIT_CODE"

if [[ -f "$REPORT_OUT" ]]; then
    FRAMES=$(grep -c "handle_posix_sigaction" "$REPORT_OUT" || true)
    echo "Crash report: $REPORT_OUT"
    echo "handle_posix_sigaction frames: $FRAMES"
    [[ $FRAMES -gt 1 ]] && echo "BUG REPRODUCED." || echo "FIX CONFIRMED: no recursion."
else
    echo "No crash report"
fi

Reproduced by running a script that preloads a dd_preload_logger_mark_collector symbol which raises SIGBUS on each of its first N calls and returns normally after that. The gist: once we enter the crash handler, we should not do anything before checking the one-time guard.
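
The pass/fail check at the end of the script can be mirrored in a few lines of Rust. This is only a sketch: `count_matching_lines` and `verdict` are hypothetical helper names, and the report fragment in `main` is made up for illustration (the real report format may differ). Like `grep -c`, the counter counts matching lines, not individual occurrences.

```rust
/// Counts the lines of `report` that contain `needle`, like `grep -c`.
fn count_matching_lines(report: &str, needle: &str) -> usize {
    report.lines().filter(|line| line.contains(needle)).count()
}

/// Applies the script's threshold: more than one handler frame means recursion.
fn verdict(frames: usize) -> &'static str {
    if frames > 1 {
        "BUG REPRODUCED."
    } else {
        "FIX CONFIRMED: no recursion."
    }
}

fn main() {
    // Made-up report fragment, for illustration only.
    let report = "frame: handle_posix_sigaction\nframe: dlsym\nframe: handle_posix_sigaction\n";
    let frames = count_matching_lines(report, "handle_posix_sigaction");
    println!("handle_posix_sigaction frames: {frames}");
    println!("{}", verdict(frames));
}
```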

@gyuheon0h gyuheon0h changed the title Move preload logger marking after recursive guard fix(crashtracker): move preload logger marking after recursive guard May 21, 2026
datadog-datadog-prod-us1-2 Bot commented May 21, 2026

Tests

🎉 All green!

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
Patch Coverage: 0.00%
Overall Coverage: 72.86% (-0.02%)

🔗 Commit SHA: 3a72305

github-actions Bot commented May 21, 2026

📚 Documentation Check Results

⚠️ 1079 documentation warning(s) found

📦 libdd-crashtracker - 1079 warning(s)


Updated: 2026-05-21 18:37:09 UTC | Commit: 45f3a32 | missing-docs job results

github-actions Bot commented May 21, 2026

Clippy Allow Annotation Report

Comparing clippy allow annotations between branches:

  • Base Branch: origin/main
  • PR Branch: origin/gyuheon0h/PROF-14734-preload-logger-fail-inf-recursion

Summary by Rule

Rule Base Branch PR Branch Change

Annotation Counts by File

File Base Branch PR Branch Change

Annotation Stats by Crate

| Crate | Base Branch | PR Branch | Change |
| --- | --- | --- | --- |
| clippy-annotation-reporter | 5 | 5 | No change (0%) |
| datadog-ffe-ffi | 1 | 1 | No change (0%) |
| datadog-ipc | 21 | 21 | No change (0%) |
| datadog-live-debugger | 6 | 6 | No change (0%) |
| datadog-live-debugger-ffi | 10 | 10 | No change (0%) |
| datadog-profiling-replayer | 4 | 4 | No change (0%) |
| datadog-remote-config | 3 | 3 | No change (0%) |
| datadog-sidecar | 57 | 57 | No change (0%) |
| libdd-common | 13 | 13 | No change (0%) |
| libdd-common-ffi | 12 | 12 | No change (0%) |
| libdd-data-pipeline | 5 | 5 | No change (0%) |
| libdd-ddsketch | 2 | 2 | No change (0%) |
| libdd-dogstatsd-client | 1 | 1 | No change (0%) |
| libdd-profiling | 13 | 13 | No change (0%) |
| libdd-telemetry | 20 | 20 | No change (0%) |
| libdd-tinybytes | 4 | 4 | No change (0%) |
| libdd-trace-normalization | 2 | 2 | No change (0%) |
| libdd-trace-obfuscation | 3 | 3 | No change (0%) |
| libdd-trace-stats | 1 | 1 | No change (0%) |
| libdd-trace-utils | 15 | 15 | No change (0%) |
| Total | 198 | 198 | No change (0%) |

About This Report

This report tracks Clippy allow annotations for specific rules, showing how they've changed in this PR. Decreasing the number of these annotations generally improves code quality.

github-actions Bot commented May 21, 2026

🔒 Cargo Deny Results

⚠️ 4 issue(s) found, showing only errors (advisories, bans, sources)

📦 libdd-crashtracker - 4 error(s)

error[unsound]: Rand is unsound with a custom logger using `rand::rng()`
    ┌─ /home/runner/work/libdatadog/libdatadog/Cargo.lock:200:1
    │
200 │ rand 0.8.5 registry+https://github.com/rust-lang/crates.io-index
    │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ unsound advisory detected
    │
    ├ ID: RUSTSEC-2026-0097
    ├ Advisory: https://rustsec.org/advisories/RUSTSEC-2026-0097
    ├ It has been reported (by @lopopolo) that the `rand` library is [unsound](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#soundness-of-code--of-a-library) (i.e. that safe code using the public API can cause Undefined Behaviour) when all the following conditions are met:
      
      - The `log` and `thread_rng` features are enabled
      - A [custom logger](https://docs.rs/log/latest/log/#implementing-a-logger) is defined
      - The custom logger accesses `rand::rng()` (previously `rand::thread_rng()`) and calls any `TryRng` (previously `RngCore`) methods on `ThreadRng`
      - The `ThreadRng` (attempts to) reseed while called from the custom logger (this happens every 64 kB of generated data)
      - Trace-level logging is enabled or warn-level logging is enabled and the random source (the `getrandom` crate) is unable to provide a new seed
      
      `TryRng` (previously `RngCore`) methods for `ThreadRng` use `unsafe` code to cast `*mut BlockRng<ReseedingCore>` to `&mut BlockRng<ReseedingCore>`. When all the above conditions are met this results in an aliased mutable reference, violating the Stacked Borrows rules. Miri is able to detect this violation in sample code. Since construction of [aliased mutable references is Undefined Behaviour](https://doc.rust-lang.org/stable/nomicon/references.html), the behaviour of optimized builds is hard to predict.
    ├ Announcement: https://github.com/rust-random/rand/pull/1763
    ├ Solution: Upgrade to >=0.10.1 OR <0.10.0, >=0.9.3 OR <0.9.0, >=0.8.6 (try `cargo update -p rand`)
    ├ rand v0.8.5
      ├── libdd-common v4.1.0
      │   ├── libdd-capabilities-impl v2.0.0
      │   │   └── libdd-shared-runtime v1.0.0
      │   │       └── libdd-telemetry v5.0.0
      │   │           └── libdd-crashtracker v1.0.0
      │   ├── (build) libdd-crashtracker v1.0.0 (*)
      │   ├── libdd-shared-runtime v1.0.0 (*)
      │   └── libdd-telemetry v5.0.0 (*)
      └── libdd-crashtracker v1.0.0 (*)

error[vulnerability]: Name constraints for URI names were incorrectly accepted
    ┌─ /home/runner/work/libdatadog/libdatadog/Cargo.lock:215:1
    │
215 │ rustls-webpki 0.103.10 registry+https://github.com/rust-lang/crates.io-index
    │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ security vulnerability detected
    │
    ├ ID: RUSTSEC-2026-0098
    ├ Advisory: https://rustsec.org/advisories/RUSTSEC-2026-0098
    ├ Name constraints for URI names were ignored and therefore accepted.
      
      Note this library does not provide an API for asserting URI names, and URI name constraints are otherwise not implemented.  URI name constraints are now rejected unconditionally.
      
      Since name constraints are restrictions on otherwise properly-issued certificates, this bug is reachable only after signature verification and requires misissuance to exploit.
      
      This vulnerability is identified as [GHSA-965h-392x-2mh5](https://github.com/rustls/webpki/security/advisories/GHSA-965h-392x-2mh5). Thank you to @1seal for the report.
    ├ Solution: Upgrade to >=0.103.12, <0.104.0-alpha.1 OR >=0.104.0-alpha.6 (try `cargo update -p rustls-webpki`)
    ├ rustls-webpki v0.103.10
      └── rustls v0.23.37
          ├── hyper-rustls v0.27.7
          │   └── libdd-common v4.1.0
          │       ├── libdd-capabilities-impl v2.0.0
          │       │   └── libdd-shared-runtime v1.0.0
          │       │       └── libdd-telemetry v5.0.0
          │       │           └── libdd-crashtracker v1.0.0
          │       ├── (build) libdd-crashtracker v1.0.0 (*)
          │       ├── libdd-shared-runtime v1.0.0 (*)
          │       └── libdd-telemetry v5.0.0 (*)
          ├── libdd-common v4.1.0 (*)
          └── tokio-rustls v0.26.0
              ├── hyper-rustls v0.27.7 (*)
              └── libdd-common v4.1.0 (*)

error[vulnerability]: Name constraints were accepted for certificates asserting a wildcard name
    ┌─ /home/runner/work/libdatadog/libdatadog/Cargo.lock:215:1
    │
215 │ rustls-webpki 0.103.10 registry+https://github.com/rust-lang/crates.io-index
    │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ security vulnerability detected
    │
    ├ ID: RUSTSEC-2026-0099
    ├ Advisory: https://rustsec.org/advisories/RUSTSEC-2026-0099
    ├ Permitted subtree name constraints for DNS names were accepted for certificates asserting a wildcard name.
      
      This was incorrect because, given a name constraint of `accept.example.com`, `*.example.com` could feasibly allow a name of `reject.example.com` which is outside the constraint.
      This is very similar to [CVE-2025-61727](https://go.dev/issue/76442).
      
      Since name constraints are restrictions on otherwise properly-issued certificates, this bug is reachable only after signature verification and requires misissuance to exploit.
      
      This vulnerability is identified as [GHSA-xgp8-3hg3-c2mh](https://github.com/rustls/webpki/security/advisories/GHSA-xgp8-3hg3-c2mh). Thank you to @1seal for the report.
    ├ Solution: Upgrade to >=0.103.12, <0.104.0-alpha.1 OR >=0.104.0-alpha.6 (try `cargo update -p rustls-webpki`)
    ├ rustls-webpki v0.103.10
      └── rustls v0.23.37
          ├── hyper-rustls v0.27.7
          │   └── libdd-common v4.1.0
          │       ├── libdd-capabilities-impl v2.0.0
          │       │   └── libdd-shared-runtime v1.0.0
          │       │       └── libdd-telemetry v5.0.0
          │       │           └── libdd-crashtracker v1.0.0
          │       ├── (build) libdd-crashtracker v1.0.0 (*)
          │       ├── libdd-shared-runtime v1.0.0 (*)
          │       └── libdd-telemetry v5.0.0 (*)
          ├── libdd-common v4.1.0 (*)
          └── tokio-rustls v0.26.0
              ├── hyper-rustls v0.27.7 (*)
              └── libdd-common v4.1.0 (*)

error[vulnerability]: Reachable panic in certificate revocation list parsing
    ┌─ /home/runner/work/libdatadog/libdatadog/Cargo.lock:215:1
    │
215 │ rustls-webpki 0.103.10 registry+https://github.com/rust-lang/crates.io-index
    │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ security vulnerability detected
    │
    ├ ID: RUSTSEC-2026-0104
    ├ Advisory: https://rustsec.org/advisories/RUSTSEC-2026-0104
    ├ A panic was reachable when parsing certificate revocation lists via [`BorrowedCertRevocationList::from_der`]
      or [`OwnedCertRevocationList::from_der`].  This was the result of mishandling a syntactically valid empty
      `BIT STRING` appearing in the `onlySomeReasons` element of a `IssuingDistributionPoint` CRL extension.
      
      This panic is reachable prior to a CRL's signature being verified.
      
      Applications that do not use CRLs are not affected.
      
      Thank you to @tynus3 for the report.
    ├ Solution: Upgrade to >=0.103.13, <0.104.0-alpha.1 OR >=0.104.0-alpha.7 (try `cargo update -p rustls-webpki`)
    ├ rustls-webpki v0.103.10
      └── rustls v0.23.37
          ├── hyper-rustls v0.27.7
          │   └── libdd-common v4.1.0
          │       ├── libdd-capabilities-impl v2.0.0
          │       │   └── libdd-shared-runtime v1.0.0
          │       │       └── libdd-telemetry v5.0.0
          │       │           └── libdd-crashtracker v1.0.0
          │       ├── (build) libdd-crashtracker v1.0.0 (*)
          │       ├── libdd-shared-runtime v1.0.0 (*)
          │       └── libdd-telemetry v5.0.0 (*)
          ├── libdd-common v4.1.0 (*)
          └── tokio-rustls v0.26.0
              ├── hyper-rustls v0.27.7 (*)
              └── libdd-common v4.1.0 (*)

advisories FAILED, bans ok, sources ok

Updated: 2026-05-21 18:38:41 UTC | Commit: 45f3a32 | dependency-check job results

@gyuheon0h gyuheon0h force-pushed the gyuheon0h/PROF-14734-preload-logger-fail-inf-recursion branch 2 times, most recently from 118e192 to 40de8cc Compare May 21, 2026 14:06
codecov-commenter commented May 21, 2026

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.85%. Comparing base (7647446) to head (3a72305).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2023      +/-   ##
==========================================
- Coverage   72.87%   72.85%   -0.03%     
==========================================
  Files         457      457              
  Lines       75769    75769              
==========================================
- Hits        55220    55204      -16     
- Misses      20549    20565      +16     
| Components | Coverage Δ |
| --- | --- |
| libdd-crashtracker | 65.21% <0.00%> (-0.03%) ⬇️ |
| libdd-crashtracker-ffi | 36.82% <ø> (ø) |
| libdd-alloc | 98.77% <ø> (ø) |
| libdd-data-pipeline | 86.69% <ø> (ø) |
| libdd-data-pipeline-ffi | 78.63% <ø> (ø) |
| libdd-common | 79.81% <ø> (ø) |
| libdd-common-ffi | 74.41% <ø> (ø) |
| libdd-telemetry | 73.34% <ø> (-0.03%) ⬇️ |
| libdd-telemetry-ffi | 31.36% <ø> (ø) |
| libdd-dogstatsd-client | 82.64% <ø> (ø) |
| datadog-ipc | 76.22% <ø> (ø) |
| libdd-profiling | 81.70% <ø> (+0.01%) ⬆️ |
| libdd-profiling-ffi | 64.79% <ø> (ø) |
| libdd-sampling | 97.46% <ø> (ø) |
| datadog-sidecar | 29.01% <ø> (ø) |
| datdog-sidecar-ffi | 9.29% <ø> (ø) |
| spawn-worker | 48.86% <ø> (ø) |
| libdd-tinybytes | 93.16% <ø> (ø) |
| libdd-trace-normalization | 81.71% <ø> (ø) |
| libdd-trace-obfuscation | 87.30% <ø> (ø) |
| libdd-trace-protobuf | 68.25% <ø> (ø) |
| libdd-trace-utils | 88.86% <ø> (ø) |
| libdd-tracer-flare | 86.88% <ø> (ø) |
| libdd-log | 74.83% <ø> (ø) |

@gyuheon0h gyuheon0h marked this pull request as ready for review May 21, 2026 18:18
@gyuheon0h gyuheon0h requested a review from a team as a code owner May 21, 2026 18:18
@gyuheon0h gyuheon0h force-pushed the gyuheon0h/PROF-14734-preload-logger-fail-inf-recursion branch from 40de8cc to 3a72305 Compare May 21, 2026 18:21
@gyuheon0h gyuheon0h added AI Generated PR largely written by AI tools identified-by:crashtracking labels May 21, 2026
@taegyunkim

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Keep them coming!

@taegyunkim taegyunkim left a comment

I'm not super familiar with this part of libdatadog, but the high-level description makes sense to me. And thanks for adding the reproducer script!

@gleocadie gleocadie left a comment

LGTM

@gyuheon0h gyuheon0h changed the base branch from main to julio/use-builder-on-windows May 21, 2026 19:05
@gyuheon0h gyuheon0h requested review from a team as code owners May 21, 2026 19:05
@gyuheon0h gyuheon0h requested review from dd-oleksii, leoromanovsky and vpellan and removed request for a team, dd-oleksii, leoromanovsky and vpellan May 21, 2026 19:05
@gyuheon0h gyuheon0h changed the base branch from julio/use-builder-on-windows to main May 21, 2026 19:06