test lambda branch#17872
Closed
rithikanarayan wants to merge 13 commits into
Closed
Conversation
Codeowners resolved as |
Contributor
🎉 All green!❄️ No new flaky tests detected 🔗 Commit SHA: 6248b2c | Docs | Datadog PR Page | Give us feedback! |
BenchmarksBenchmark execution time: 2026-05-08 14:02:49 Comparing candidate commit 6248b2c in PR branch Found 0 performance improvements and 3 performance regressions! Performance is the same for 368 metrics, 3 unstable metrics. scenario:span-start
scenario:telemetryaddmetric-1-count-metric-1-times
scenario:tracer-small
|
5ee8a72 to
ae47af6
Compare
brettlangdon
added a commit
that referenced
this pull request
May 7, 2026
…lse (#17929) ## Description Reverts the default of `_DD_PROFILING_STACK_FAST_COPY` from `true` (set in #17757) back to `false` while we investigate the profiler-shutdown segfault tracked in [PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568) (Slack `#incident-53849`, channel `C0B1H299S4R`). Reproduces 5/8 natively on x86_64 Linux Python 3.11.13 with default `safe_memcpy`. With `pytest -p no:faulthandler` it drops to **0/8**, which combined with the post-mortem core analysis pinpoints the actual mechanism: `safe_memcpy`'s `sigsetjmp/siglongjmp` recovery is incompatible with pytest's built-in faulthandler plugin owning the SIGSEGV `sigaction`. Switching the default to `process_vm_readv` (a kernel syscall that returns `-1/EFAULT` cleanly on bad source) sidesteps the SIGSEGV path entirely and is the fastest unblock for the serverless team's release. A proper fix for the underlying profiler-shutdown race (sampler walking frames during `scheduler.flush()`-driven imports) is tracked separately in PROF-14568. Concrete changes: - `ddtrace/internal/settings/profiling.py` — `fast_copy` default `True` → `False`. Users opting in via `_DD_PROFILING_STACK_FAST_COPY=1` are unaffected. - `riotfile.py` — flipped the four dedicated profile-uwsgi venvs from `_DD_PROFILING_STACK_FAST_COPY=0` to `=1` (and updated the comment) so the non-default path is still exercised in riot. - `releasenotes/notes/profiling-phase-out-process-vm-readv-97af2e74953bb9e9.yaml` — dropped (the release note from #17757 announced a default flip that we are no longer landing). ## Findings since the original PLAN.md write-up The crash decomposes into three layers, only the first of which this PR fixes: 1. **dd-trace-py: `safe_memcpy`'s recovery is fragile when another component owns SIGSEGV.** pytest's built-in faulthandler plugin runs `faulthandler.enable()` in `pytest_configure`, installing `faulthandler_fatal_error` as the SIGSEGV `sigaction`. Our `init_segv_catcher` is wrapped in `call_once` (`sampler.cpp:172`) and never re-installs after another component overwrites it. Post-mortem core confirms: `t_handler_armed = 1` on the crashing thread (we *did* arm), `g_old_segv.sa_handler = faulthandler_fatal_error` (we saved pytest's handler when ours got installed), and the call stack shows the kernel called `faulthandler_fatal_error`, not our `segv_handler`. `pytest -p no:faulthandler` eliminates the crash 0/8. 2. **dd-trace-py: profiler shutdown invokes Python imports while sampler is live.** `Profiler._stop_service` keeps collectors alive across `scheduler.flush()` "for snapshot"; flush triggers `code_provenance.get_code_provenance_file()` → `_package_for_root_module_mapping()` → cold imports of `setuptools`/`_distutils_hack`/`packaging.*`. Sampler races, reads stale `PyCodeObject*`, faults. With layer 1 fixed (via this PR using `process_vm_readv`) this becomes a dropped sample instead of a crash, but should be properly fixed in a PROF-14568 follow-up. 3. **datadog-lambda-python: `tests/test_api.py:175` calls `os.environ.clear()` without restoring env.** Strips `DD_PROFILING_STACK_FAST_COPY` for the rest of the pytest session, so any subsequent `Profiler.start()` re-reads `config.stack.fast_copy` which defaults to `True` and re-flips `safe_copy` back to `safe_memcpy_wrapper`. Confirmed via gdb breakpoint on `set_fast_copy_enabled` showing `arg=0` (env present) then `arg=1` (env cleared) when test_api runs before test_wrapper. To be filed as a separate `datadog-lambda-python` issue. This PR sidesteps layer 1 universally by defaulting to `process_vm_readv`. With layer 1 gone, layer 3 becomes harmless to the profiler regardless of test pollution. ## Testing - `hatch run lint:fmt` and `hatch run lint:riot` pass. - The C++ runtime path is unchanged; only the Python config default and the riot env values move. The C++ static-init opt-out at `ddtrace/internal/datadog/profiling/stack/src/echion/vm.cc:6-16` already treated unset/truthy as fast-copy-enabled, so behavior with the env var unset is now: Python config = `False` → `set_fast_copy_enabled(false)` → `safe_copy = process_vm_readv` (Linux) or `mach_vm_read_overwrite` (Darwin). - Existing `tests/profiling/collector/test_copy_memory_stats.py` covers both branches explicitly via env var. - Reproduction validation on workspace-tg: pre-fix `safe_memcpy` default = 5/8 crashes; with this PR's default flipped to `process_vm_readv` (or equivalently `pytest -p no:faulthandler` on the old default) = 0/8. ## Risks - **Hardened-kernel cohort.** With `VmReader` removed in #17755, the only non-fast-copy reader is `process_vm_readv`. On Linux systems where `process_vm_readv` is unavailable (e.g. `kernel.yama.ptrace_scope=3`, certain seccomp/sandboxed containers), `set_fast_copy_enabled(false)` will set `failed_safe_copy = true` and disable stack profiling, even though `safe_memcpy` would have worked. This reproduces the pre-#17757 behavior exactly, so it does not regress any user who was already running on a release without #17757. Users who upgraded to a release containing #17757 and were silently relying on `safe_memcpy` in such an environment will lose stack profiling until they set `_DD_PROFILING_STACK_FAST_COPY=1` explicitly. Accepted as a trade-off vs. the shutdown-crash risk. - **No public API change.** `_DD_PROFILING_STACK_FAST_COPY` is a private (`_DD_*`) env var. ## Additional Notes - Tracking: [PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568) - Original culprit PR: #17757 - Related upstream test PR: #17872 - `changelog/no-changelog` label applied because the env var is private and we are deleting the user-facing release note from #17757 (which announced a default flip that is no longer landing). [PROF-14568]: https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [PROF-14568]: https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
dubloom
pushed a commit
that referenced
this pull request
May 11, 2026
…lse (#17929) ## Description Reverts the default of `_DD_PROFILING_STACK_FAST_COPY` from `true` (set in #17757) back to `false` while we investigate the profiler-shutdown segfault tracked in [PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568) (Slack `#incident-53849`, channel `C0B1H299S4R`). Reproduces 5/8 natively on x86_64 Linux Python 3.11.13 with default `safe_memcpy`. With `pytest -p no:faulthandler` it drops to **0/8**, which combined with the post-mortem core analysis pinpoints the actual mechanism: `safe_memcpy`'s `sigsetjmp/siglongjmp` recovery is incompatible with pytest's built-in faulthandler plugin owning the SIGSEGV `sigaction`. Switching the default to `process_vm_readv` (a kernel syscall that returns `-1/EFAULT` cleanly on bad source) sidesteps the SIGSEGV path entirely and is the fastest unblock for the serverless team's release. A proper fix for the underlying profiler-shutdown race (sampler walking frames during `scheduler.flush()`-driven imports) is tracked separately in PROF-14568. Concrete changes: - `ddtrace/internal/settings/profiling.py` — `fast_copy` default `True` → `False`. Users opting in via `_DD_PROFILING_STACK_FAST_COPY=1` are unaffected. - `riotfile.py` — flipped the four dedicated profile-uwsgi venvs from `_DD_PROFILING_STACK_FAST_COPY=0` to `=1` (and updated the comment) so the non-default path is still exercised in riot. - `releasenotes/notes/profiling-phase-out-process-vm-readv-97af2e74953bb9e9.yaml` — dropped (the release note from #17757 announced a default flip that we are no longer landing). ## Findings since the original PLAN.md write-up The crash decomposes into three layers, only the first of which this PR fixes: 1. **dd-trace-py: `safe_memcpy`'s recovery is fragile when another component owns SIGSEGV.** pytest's built-in faulthandler plugin runs `faulthandler.enable()` in `pytest_configure`, installing `faulthandler_fatal_error` as the SIGSEGV `sigaction`. Our `init_segv_catcher` is wrapped in `call_once` (`sampler.cpp:172`) and never re-installs after another component overwrites it. Post-mortem core confirms: `t_handler_armed = 1` on the crashing thread (we *did* arm), `g_old_segv.sa_handler = faulthandler_fatal_error` (we saved pytest's handler when ours got installed), and the call stack shows the kernel called `faulthandler_fatal_error`, not our `segv_handler`. `pytest -p no:faulthandler` eliminates the crash 0/8. 2. **dd-trace-py: profiler shutdown invokes Python imports while sampler is live.** `Profiler._stop_service` keeps collectors alive across `scheduler.flush()` "for snapshot"; flush triggers `code_provenance.get_code_provenance_file()` → `_package_for_root_module_mapping()` → cold imports of `setuptools`/`_distutils_hack`/`packaging.*`. Sampler races, reads stale `PyCodeObject*`, faults. With layer 1 fixed (via this PR using `process_vm_readv`) this becomes a dropped sample instead of a crash, but should be properly fixed in a PROF-14568 follow-up. 3. **datadog-lambda-python: `tests/test_api.py:175` calls `os.environ.clear()` without restoring env.** Strips `DD_PROFILING_STACK_FAST_COPY` for the rest of the pytest session, so any subsequent `Profiler.start()` re-reads `config.stack.fast_copy` which defaults to `True` and re-flips `safe_copy` back to `safe_memcpy_wrapper`. Confirmed via gdb breakpoint on `set_fast_copy_enabled` showing `arg=0` (env present) then `arg=1` (env cleared) when test_api runs before test_wrapper. To be filed as a separate `datadog-lambda-python` issue. This PR sidesteps layer 1 universally by defaulting to `process_vm_readv`. With layer 1 gone, layer 3 becomes harmless to the profiler regardless of test pollution. ## Testing - `hatch run lint:fmt` and `hatch run lint:riot` pass. - The C++ runtime path is unchanged; only the Python config default and the riot env values move. The C++ static-init opt-out at `ddtrace/internal/datadog/profiling/stack/src/echion/vm.cc:6-16` already treated unset/truthy as fast-copy-enabled, so behavior with the env var unset is now: Python config = `False` → `set_fast_copy_enabled(false)` → `safe_copy = process_vm_readv` (Linux) or `mach_vm_read_overwrite` (Darwin). - Existing `tests/profiling/collector/test_copy_memory_stats.py` covers both branches explicitly via env var. - Reproduction validation on workspace-tg: pre-fix `safe_memcpy` default = 5/8 crashes; with this PR's default flipped to `process_vm_readv` (or equivalently `pytest -p no:faulthandler` on the old default) = 0/8. ## Risks - **Hardened-kernel cohort.** With `VmReader` removed in #17755, the only non-fast-copy reader is `process_vm_readv`. On Linux systems where `process_vm_readv` is unavailable (e.g. `kernel.yama.ptrace_scope=3`, certain seccomp/sandboxed containers), `set_fast_copy_enabled(false)` will set `failed_safe_copy = true` and disable stack profiling, even though `safe_memcpy` would have worked. This reproduces the pre-#17757 behavior exactly, so it does not regress any user who was already running on a release without #17757. Users who upgraded to a release containing #17757 and were silently relying on `safe_memcpy` in such an environment will lose stack profiling until they set `_DD_PROFILING_STACK_FAST_COPY=1` explicitly. Accepted as a trade-off vs. the shutdown-crash risk. - **No public API change.** `_DD_PROFILING_STACK_FAST_COPY` is a private (`_DD_*`) env var. ## Additional Notes - Tracking: [PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568) - Original culprit PR: #17757 - Related upstream test PR: #17872 - `changelog/no-changelog` label applied because the env var is private and we are deleting the user-facing release note from #17757 (which announced a default flip that is no longer landing). [PROF-14568]: https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [PROF-14568]: https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
P403n1x87
pushed a commit
that referenced
this pull request
May 14, 2026
…lse (#17929) ## Description Reverts the default of `_DD_PROFILING_STACK_FAST_COPY` from `true` (set in #17757) back to `false` while we investigate the profiler-shutdown segfault tracked in [PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568) (Slack `#incident-53849`, channel `C0B1H299S4R`). Reproduces 5/8 natively on x86_64 Linux Python 3.11.13 with default `safe_memcpy`. With `pytest -p no:faulthandler` it drops to **0/8**, which combined with the post-mortem core analysis pinpoints the actual mechanism: `safe_memcpy`'s `sigsetjmp/siglongjmp` recovery is incompatible with pytest's built-in faulthandler plugin owning the SIGSEGV `sigaction`. Switching the default to `process_vm_readv` (a kernel syscall that returns `-1/EFAULT` cleanly on bad source) sidesteps the SIGSEGV path entirely and is the fastest unblock for the serverless team's release. A proper fix for the underlying profiler-shutdown race (sampler walking frames during `scheduler.flush()`-driven imports) is tracked separately in PROF-14568. Concrete changes: - `ddtrace/internal/settings/profiling.py` — `fast_copy` default `True` → `False`. Users opting in via `_DD_PROFILING_STACK_FAST_COPY=1` are unaffected. - `riotfile.py` — flipped the four dedicated profile-uwsgi venvs from `_DD_PROFILING_STACK_FAST_COPY=0` to `=1` (and updated the comment) so the non-default path is still exercised in riot. - `releasenotes/notes/profiling-phase-out-process-vm-readv-97af2e74953bb9e9.yaml` — dropped (the release note from #17757 announced a default flip that we are no longer landing). ## Findings since the original PLAN.md write-up The crash decomposes into three layers, only the first of which this PR fixes: 1. **dd-trace-py: `safe_memcpy`'s recovery is fragile when another component owns SIGSEGV.** pytest's built-in faulthandler plugin runs `faulthandler.enable()` in `pytest_configure`, installing `faulthandler_fatal_error` as the SIGSEGV `sigaction`. Our `init_segv_catcher` is wrapped in `call_once` (`sampler.cpp:172`) and never re-installs after another component overwrites it. Post-mortem core confirms: `t_handler_armed = 1` on the crashing thread (we *did* arm), `g_old_segv.sa_handler = faulthandler_fatal_error` (we saved pytest's handler when ours got installed), and the call stack shows the kernel called `faulthandler_fatal_error`, not our `segv_handler`. `pytest -p no:faulthandler` eliminates the crash 0/8. 2. **dd-trace-py: profiler shutdown invokes Python imports while sampler is live.** `Profiler._stop_service` keeps collectors alive across `scheduler.flush()` "for snapshot"; flush triggers `code_provenance.get_code_provenance_file()` → `_package_for_root_module_mapping()` → cold imports of `setuptools`/`_distutils_hack`/`packaging.*`. Sampler races, reads stale `PyCodeObject*`, faults. With layer 1 fixed (via this PR using `process_vm_readv`) this becomes a dropped sample instead of a crash, but should be properly fixed in a PROF-14568 follow-up. 3. **datadog-lambda-python: `tests/test_api.py:175` calls `os.environ.clear()` without restoring env.** Strips `DD_PROFILING_STACK_FAST_COPY` for the rest of the pytest session, so any subsequent `Profiler.start()` re-reads `config.stack.fast_copy` which defaults to `True` and re-flips `safe_copy` back to `safe_memcpy_wrapper`. Confirmed via gdb breakpoint on `set_fast_copy_enabled` showing `arg=0` (env present) then `arg=1` (env cleared) when test_api runs before test_wrapper. To be filed as a separate `datadog-lambda-python` issue. This PR sidesteps layer 1 universally by defaulting to `process_vm_readv`. With layer 1 gone, layer 3 becomes harmless to the profiler regardless of test pollution. ## Testing - `hatch run lint:fmt` and `hatch run lint:riot` pass. - The C++ runtime path is unchanged; only the Python config default and the riot env values move. The C++ static-init opt-out at `ddtrace/internal/datadog/profiling/stack/src/echion/vm.cc:6-16` already treated unset/truthy as fast-copy-enabled, so behavior with the env var unset is now: Python config = `False` → `set_fast_copy_enabled(false)` → `safe_copy = process_vm_readv` (Linux) or `mach_vm_read_overwrite` (Darwin). - Existing `tests/profiling/collector/test_copy_memory_stats.py` covers both branches explicitly via env var. - Reproduction validation on workspace-tg: pre-fix `safe_memcpy` default = 5/8 crashes; with this PR's default flipped to `process_vm_readv` (or equivalently `pytest -p no:faulthandler` on the old default) = 0/8. ## Risks - **Hardened-kernel cohort.** With `VmReader` removed in #17755, the only non-fast-copy reader is `process_vm_readv`. On Linux systems where `process_vm_readv` is unavailable (e.g. `kernel.yama.ptrace_scope=3`, certain seccomp/sandboxed containers), `set_fast_copy_enabled(false)` will set `failed_safe_copy = true` and disable stack profiling, even though `safe_memcpy` would have worked. This reproduces the pre-#17757 behavior exactly, so it does not regress any user who was already running on a release without #17757. Users who upgraded to a release containing #17757 and were silently relying on `safe_memcpy` in such an environment will lose stack profiling until they set `_DD_PROFILING_STACK_FAST_COPY=1` explicitly. Accepted as a trade-off vs. the shutdown-crash risk. - **No public API change.** `_DD_PROFILING_STACK_FAST_COPY` is a private (`_DD_*`) env var. ## Additional Notes - Tracking: [PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568) - Original culprit PR: #17757 - Related upstream test PR: #17872 - `changelog/no-changelog` label applied because the env var is private and we are deleting the user-facing release note from #17757 (which announced a default flip that is no longer landing). [PROF-14568]: https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [PROF-14568]: https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Testing
Risks
Additional Notes