test lambda branch by rithikanarayan · Pull Request #17872 · DataDog/dd-trace-py

rithikanarayan · 2026-05-04T20:19:01Z

Description

Testing

Risks

Additional Notes

cit-pr-commenter-54b7da · 2026-05-04T20:26:32Z

Codeowners resolved as

.gitlab-ci.yml: @DataDog/python-guild @DataDog/apm-core-python

datadog-datadog-prod-us1 · 2026-05-04T20:28:32Z

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

_This comment will be updated automatically if new data arrives._

🔗 Commit SHA: 6248b2c

pr-commenter · 2026-05-04T20:45:45Z

Benchmarks

Benchmark execution time: 2026-05-08 14:02:49

Comparing candidate commit 6248b2c in PR branch rithika.narayan/test-trigger with baseline commit 7834861 in branch main.

Found 0 performance improvements and 3 performance regressions! Performance is the same for 368 metrics, 3 unstable metrics.

scenario:span-start

🟥 execution_time [+1.336ms; +1.474ms] or [+8.662%; +9.563%]

scenario:telemetryaddmetric-1-count-metric-1-times

🟥 execution_time [+172.227ns; +202.052ns] or [+8.187%; +9.605%]

scenario:tracer-small

🟥 execution_time [+24.453µs; +27.063µs] or [+7.017%; +7.766%]

…lse (#17929)

## Description

Reverts the default of `_DD_PROFILING_STACK_FAST_COPY` from `true` (set in #17757) back to `false` while we investigate the profiler-shutdown segfault tracked in [PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568) (Slack `#incident-53849`, channel `C0B1H299S4R`).

Reproduces 5/8 natively on x86_64 Linux Python 3.11.13 with default `safe_memcpy`. With `pytest -p no:faulthandler` it drops to **0/8**, which combined with the post-mortem core analysis pinpoints the actual mechanism: `safe_memcpy`'s `sigsetjmp/siglongjmp` recovery is incompatible with pytest's built-in faulthandler plugin owning the SIGSEGV `sigaction`. Switching the default to `process_vm_readv` (a kernel syscall that returns `-1/EFAULT` cleanly on bad source) sidesteps the SIGSEGV path entirely and is the fastest unblock for the serverless team's release. A proper fix for the underlying profiler-shutdown race (sampler walking frames during `scheduler.flush()`-driven imports) is tracked separately in PROF-14568.

Concrete changes:

- `ddtrace/internal/settings/profiling.py`: `fast_copy` default `True` → `False`. Users opting in via `_DD_PROFILING_STACK_FAST_COPY=1` are unaffected.
- `riotfile.py`: flipped the four dedicated profile-uwsgi venvs from `_DD_PROFILING_STACK_FAST_COPY=0` to `=1` (and updated the comment) so the non-default path is still exercised in riot.
- `releasenotes/notes/profiling-phase-out-process-vm-readv-97af2e74953bb9e9.yaml`: dropped (the release note from #17757 announced a default flip that we are no longer landing).

## Findings since the original PLAN.md write-up

The crash decomposes into three layers, only the first of which this PR fixes:

1. **dd-trace-py: `safe_memcpy`'s recovery is fragile when another component owns SIGSEGV.** pytest's built-in faulthandler plugin runs `faulthandler.enable()` in `pytest_configure`, installing `faulthandler_fatal_error` as the SIGSEGV `sigaction`. Our `init_segv_catcher` is wrapped in `call_once` (`sampler.cpp:172`) and never re-installs after another component overwrites it. Post-mortem core confirms: `t_handler_armed = 1` on the crashing thread (we *did* arm), `g_old_segv.sa_handler = faulthandler_fatal_error` (we saved pytest's handler when ours got installed), and the call stack shows the kernel called `faulthandler_fatal_error`, not our `segv_handler`. `pytest -p no:faulthandler` eliminates the crash (0/8).
2. **dd-trace-py: profiler shutdown invokes Python imports while sampler is live.** `Profiler._stop_service` keeps collectors alive across `scheduler.flush()` "for snapshot"; flush triggers `code_provenance.get_code_provenance_file()` → `_package_for_root_module_mapping()` → cold imports of `setuptools`/`_distutils_hack`/`packaging.*`. Sampler races, reads stale `PyCodeObject*`, faults. With layer 1 fixed (via this PR using `process_vm_readv`) this becomes a dropped sample instead of a crash, but should be properly fixed in a PROF-14568 follow-up.
3. **datadog-lambda-python: `tests/test_api.py:175` calls `os.environ.clear()` without restoring env.** Strips `DD_PROFILING_STACK_FAST_COPY` for the rest of the pytest session, so any subsequent `Profiler.start()` re-reads `config.stack.fast_copy` which defaults to `True` and re-flips `safe_copy` back to `safe_memcpy_wrapper`. Confirmed via gdb breakpoint on `set_fast_copy_enabled` showing `arg=0` (env present) then `arg=1` (env cleared) when test_api runs before test_wrapper. To be filed as a separate `datadog-lambda-python` issue.

This PR sidesteps layer 1 universally by defaulting to `process_vm_readv`. With layer 1 gone, layer 3 becomes harmless to the profiler regardless of test pollution.

## Testing

- `hatch run lint:fmt` and `hatch run lint:riot` pass.
- The C++ runtime path is unchanged; only the Python config default and the riot env values move. The C++ static-init opt-out at `ddtrace/internal/datadog/profiling/stack/src/echion/vm.cc:6-16` already treated unset/truthy as fast-copy-enabled, so behavior with the env var unset is now: Python config = `False` → `set_fast_copy_enabled(false)` → `safe_copy = process_vm_readv` (Linux) or `mach_vm_read_overwrite` (Darwin).
- Existing `tests/profiling/collector/test_copy_memory_stats.py` covers both branches explicitly via env var.
- Reproduction validation on workspace-tg: pre-fix `safe_memcpy` default = 5/8 crashes; with this PR's default flipped to `process_vm_readv` (or equivalently `pytest -p no:faulthandler` on the old default) = 0/8.

## Risks

- **Hardened-kernel cohort.** With `VmReader` removed in #17755, the only non-fast-copy reader is `process_vm_readv`. On Linux systems where `process_vm_readv` is unavailable (e.g. `kernel.yama.ptrace_scope=3`, certain seccomp/sandboxed containers), `set_fast_copy_enabled(false)` will set `failed_safe_copy = true` and disable stack profiling, even though `safe_memcpy` would have worked. This reproduces the pre-#17757 behavior exactly, so it does not regress any user who was already running on a release without #17757. Users who upgraded to a release containing #17757 and were silently relying on `safe_memcpy` in such an environment will lose stack profiling until they set `_DD_PROFILING_STACK_FAST_COPY=1` explicitly. Accepted as a trade-off vs. the shutdown-crash risk.
- **No public API change.** `_DD_PROFILING_STACK_FAST_COPY` is a private (`_DD_*`) env var.

## Additional Notes

- Tracking: [PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568)
- Original culprit PR: #17757
- Related upstream test PR: #17872
- `changelog/no-changelog` label applied because the env var is private and we are deleting the user-facing release note from #17757 (which announced a default flip that is no longer landing).

[PROF-14568]: https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
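Layer 3 in the commit message describes a test that calls `os.environ.clear()` and never restores the environment. A minimal sketch of one way to scope such a change, assuming the test only needs an empty environment for its own duration: the standard-library `unittest.mock.patch.dict` puts the original mapping back on exit. The helper name `run_with_empty_env` is hypothetical and is not taken from `datadog-lambda-python`.

```python
import os
from unittest import mock


def run_with_empty_env(func):
    """Call func() with an empty os.environ, then restore the original values.

    Hypothetical helper for illustration; not part of datadog-lambda-python.
    """
    # patch.dict(..., clear=True) empties the mapping on entry and restores
    # the original contents on exit, even if func() raises.
    with mock.patch.dict(os.environ, {}, clear=True):
        return func()


if __name__ == "__main__":
    os.environ["_DD_PROFILING_STACK_FAST_COPY"] = "0"
    # Inside the helper the variable is hidden, so .get() returns None.
    print(run_with_empty_env(lambda: os.environ.get("_DD_PROFILING_STACK_FAST_COPY")))
    # After the helper returns, the original value is visible again.
    print(os.environ["_DD_PROFILING_STACK_FAST_COPY"])
```

Scoping the change this way keeps later tests in the same pytest session from seeing a stripped environment, which is the pollution path described in layer 3.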

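Layer 1 in the commit message turns on which component owns the SIGSEGV handler. As a diagnostic sketch (an illustration, not code from dd-trace-py), the standard-library `faulthandler` module can report whether it has been enabled in the current process, which is what pytest's faulthandler plugin does unless it is disabled with `-p no:faulthandler`. The helper name `faulthandler_owns_fatal_signals` is hypothetical, and the check only says that `faulthandler` installed its handlers at some point; it does not prove which handler is currently active.

```python
import faulthandler
import tempfile


def faulthandler_owns_fatal_signals() -> bool:
    """Return True if faulthandler has been enabled in this process.

    When enabled, faulthandler installs handlers for SIGSEGV, SIGFPE,
    SIGABRT, SIGBUS and SIGILL. Hypothetical helper for illustration.
    """
    return faulthandler.is_enabled()


if __name__ == "__main__":
    # Use a temporary file as the traceback sink so the demo does not
    # depend on sys.stderr having a real file descriptor.
    with tempfile.TemporaryFile() as sink:
        faulthandler.enable(file=sink)
        print(faulthandler_owns_fatal_signals())  # enabled at this point
        # Disable before the sink is closed; faulthandler keeps a reference to it.
        faulthandler.disable()
    print(faulthandler_owns_fatal_signals())  # disabled again
```

A check like this could be logged at profiler start to flag environments where another SIGSEGV owner is present, which is the condition under which the `safe_memcpy` recovery was observed to fail.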
test lambda branch

a639fab

rithikanarayan added 2 commits May 4, 2026 16:59

small change

925005b

little change

ae47af6

purple4reina force-pushed the rithika.narayan/test-trigger branch from 5ee8a72 to ae47af6 Compare May 5, 2026 21:40

rithikanarayan added 2 commits May 6, 2026 10:14

print

2472713

again

4982f62

taegyunkim mentioned this pull request May 6, 2026

revert(profiling): revert _DD_PROFILING_STACK_FAST_COPY default to false #17929

Merged

rithikanarayan added 2 commits May 6, 2026 16:51

another one

2f38875

dj khaled

8d1aa87

rithikanarayan added 6 commits May 7, 2026 13:38

Merge branch 'main' into rithika.narayan/test-trigger

d6ef9b6

we the best music

0f0d931

delete something lambda needs

79ea60c

use main

a6e8ba3

branch

88f2741

back to main

6248b2c

rithikanarayan closed this May 8, 2026

rithikanarayan wants to merge 13 commits into main from rithika.narayan/test-trigger
