Skip to content

test lambda branch#17872

Closed
rithikanarayan wants to merge 13 commits into
mainfrom
rithika.narayan/test-trigger
Closed

test lambda branch#17872
rithikanarayan wants to merge 13 commits into
mainfrom
rithika.narayan/test-trigger

Conversation

@rithikanarayan
Copy link
Copy Markdown
Contributor

Description

Testing

Risks

Additional Notes

@cit-pr-commenter-54b7da
Copy link
Copy Markdown

cit-pr-commenter-54b7da Bot commented May 4, 2026

Codeowners resolved as

.gitlab-ci.yml                                                          @DataDog/python-guild @DataDog/apm-core-python

@datadog-datadog-prod-us1
Copy link
Copy Markdown
Contributor

datadog-datadog-prod-us1 Bot commented May 4, 2026

Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 6248b2c | Docs | Datadog PR Page | Give us feedback!

@pr-commenter
Copy link
Copy Markdown

pr-commenter Bot commented May 4, 2026

Benchmarks

Benchmark execution time: 2026-05-08 14:02:49

Comparing candidate commit 6248b2c in PR branch rithika.narayan/test-trigger with baseline commit 7834861 in branch main.

Found 0 performance improvements and 3 performance regressions! Performance is the same for 368 metrics, 3 unstable metrics.

scenario:span-start

  • 🟥 execution_time [+1.336ms; +1.474ms] or [+8.662%; +9.563%]

scenario:telemetryaddmetric-1-count-metric-1-times

  • 🟥 execution_time [+172.227ns; +202.052ns] or [+8.187%; +9.605%]

scenario:tracer-small

  • 🟥 execution_time [+24.453µs; +27.063µs] or [+7.017%; +7.766%]

@purple4reina purple4reina force-pushed the rithika.narayan/test-trigger branch from 5ee8a72 to ae47af6 Compare May 5, 2026 21:40
brettlangdon added a commit that referenced this pull request May 7, 2026
…lse (#17929)

## Description

Reverts the default of `_DD_PROFILING_STACK_FAST_COPY` from `true` (set
in #17757) back to `false` while we investigate the profiler-shutdown
segfault tracked in
[PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568) (Slack
`#incident-53849`, channel `C0B1H299S4R`).

Reproduces 5/8 natively on x86_64 Linux Python 3.11.13 with default
`safe_memcpy`. With `pytest -p no:faulthandler` it drops to **0/8**,
which combined with the post-mortem core analysis pinpoints the actual
mechanism: `safe_memcpy`'s `sigsetjmp/siglongjmp` recovery is
incompatible with pytest's built-in faulthandler plugin owning the
SIGSEGV `sigaction`. Switching the default to `process_vm_readv` (a
kernel syscall that returns `-1/EFAULT` cleanly on bad source) sidesteps
the SIGSEGV path entirely and is the fastest unblock for the serverless
team's release. A proper fix for the underlying profiler-shutdown race
(sampler walking frames during `scheduler.flush()`-driven imports) is
tracked separately in PROF-14568.

Concrete changes:
- `ddtrace/internal/settings/profiling.py` — `fast_copy` default `True`
→ `False`. Users opting in via `_DD_PROFILING_STACK_FAST_COPY=1` are
unaffected.
- `riotfile.py` — flipped the four dedicated profile-uwsgi venvs from
`_DD_PROFILING_STACK_FAST_COPY=0` to `=1` (and updated the comment) so
the non-default path is still exercised in riot.
-
`releasenotes/notes/profiling-phase-out-process-vm-readv-97af2e74953bb9e9.yaml`
— dropped (the release note from #17757 announced a default flip that we
are no longer landing).

## Findings since the original PLAN.md write-up

The crash decomposes into three layers, only the first of which this PR
fixes:

1. **dd-trace-py: `safe_memcpy`'s recovery is fragile when another
component owns SIGSEGV.** pytest's built-in faulthandler plugin runs
`faulthandler.enable()` in `pytest_configure`, installing
`faulthandler_fatal_error` as the SIGSEGV `sigaction`. Our
`init_segv_catcher` is wrapped in `call_once` (`sampler.cpp:172`) and
never re-installs after another component overwrites it. Post-mortem
core confirms: `t_handler_armed = 1` on the crashing thread (we *did*
arm), `g_old_segv.sa_handler = faulthandler_fatal_error` (we saved
pytest's handler when ours got installed), and the call stack shows the
kernel called `faulthandler_fatal_error`, not our `segv_handler`.
`pytest -p no:faulthandler` eliminates the crash 0/8.
2. **dd-trace-py: profiler shutdown invokes Python imports while sampler
is live.** `Profiler._stop_service` keeps collectors alive across
`scheduler.flush()` "for snapshot"; flush triggers
`code_provenance.get_code_provenance_file()` →
`_package_for_root_module_mapping()` → cold imports of
`setuptools`/`_distutils_hack`/`packaging.*`. Sampler races, reads stale
`PyCodeObject*`, faults. With layer 1 fixed (via this PR using
`process_vm_readv`) this becomes a dropped sample instead of a crash,
but should be properly fixed in a PROF-14568 follow-up.
3. **datadog-lambda-python: `tests/test_api.py:175` calls
`os.environ.clear()` without restoring env.** Strips
`DD_PROFILING_STACK_FAST_COPY` for the rest of the pytest session, so
any subsequent `Profiler.start()` re-reads `config.stack.fast_copy`
which defaults to `True` and re-flips `safe_copy` back to
`safe_memcpy_wrapper`. Confirmed via gdb breakpoint on
`set_fast_copy_enabled` showing `arg=0` (env present) then `arg=1` (env
cleared) when test_api runs before test_wrapper. To be filed as a
separate `datadog-lambda-python` issue.

This PR sidesteps layer 1 universally by defaulting to
`process_vm_readv`. With layer 1 gone, layer 3 becomes harmless to the
profiler regardless of test pollution.

## Testing

- `hatch run lint:fmt` and `hatch run lint:riot` pass.
- The C++ runtime path is unchanged; only the Python config default and
the riot env values move. The C++ static-init opt-out at
`ddtrace/internal/datadog/profiling/stack/src/echion/vm.cc:6-16` already
treated unset/truthy as fast-copy-enabled, so behavior with the env var
unset is now: Python config = `False` → `set_fast_copy_enabled(false)` →
`safe_copy = process_vm_readv` (Linux) or `mach_vm_read_overwrite`
(Darwin).
- Existing `tests/profiling/collector/test_copy_memory_stats.py` covers
both branches explicitly via env var.
- Reproduction validation on workspace-tg: pre-fix `safe_memcpy` default
= 5/8 crashes; with this PR's default flipped to `process_vm_readv` (or
equivalently `pytest -p no:faulthandler` on the old default) = 0/8.

## Risks

- **Hardened-kernel cohort.** With `VmReader` removed in #17755, the
only non-fast-copy reader is `process_vm_readv`. On Linux systems where
`process_vm_readv` is unavailable (e.g. `kernel.yama.ptrace_scope=3`,
certain seccomp/sandboxed containers), `set_fast_copy_enabled(false)`
will set `failed_safe_copy = true` and disable stack profiling, even
though `safe_memcpy` would have worked. This reproduces the pre-#17757
behavior exactly, so it does not regress any user who was already
running on a release without #17757. Users who upgraded to a release
containing #17757 and were silently relying on `safe_memcpy` in such an
environment will lose stack profiling until they set
`_DD_PROFILING_STACK_FAST_COPY=1` explicitly. Accepted as a trade-off
vs. the shutdown-crash risk.
- **No public API change.** `_DD_PROFILING_STACK_FAST_COPY` is a private
(`_DD_*`) env var.

## Additional Notes

- Tracking:
[PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568)
- Original culprit PR: #17757
- Related upstream test PR: #17872
- `changelog/no-changelog` label applied because the env var is private
and we are deleting the user-facing release note from #17757 (which
announced a default flip that is no longer landing).


[PROF-14568]:
https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[PROF-14568]:
https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
dubloom pushed a commit that referenced this pull request May 11, 2026
…lse (#17929)

## Description

Reverts the default of `_DD_PROFILING_STACK_FAST_COPY` from `true` (set
in #17757) back to `false` while we investigate the profiler-shutdown
segfault tracked in
[PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568) (Slack
`#incident-53849`, channel `C0B1H299S4R`).

Reproduces 5/8 natively on x86_64 Linux Python 3.11.13 with default
`safe_memcpy`. With `pytest -p no:faulthandler` it drops to **0/8**,
which combined with the post-mortem core analysis pinpoints the actual
mechanism: `safe_memcpy`'s `sigsetjmp/siglongjmp` recovery is
incompatible with pytest's built-in faulthandler plugin owning the
SIGSEGV `sigaction`. Switching the default to `process_vm_readv` (a
kernel syscall that returns `-1/EFAULT` cleanly on bad source) sidesteps
the SIGSEGV path entirely and is the fastest unblock for the serverless
team's release. A proper fix for the underlying profiler-shutdown race
(sampler walking frames during `scheduler.flush()`-driven imports) is
tracked separately in PROF-14568.

Concrete changes:
- `ddtrace/internal/settings/profiling.py` — `fast_copy` default `True`
→ `False`. Users opting in via `_DD_PROFILING_STACK_FAST_COPY=1` are
unaffected.
- `riotfile.py` — flipped the four dedicated profile-uwsgi venvs from
`_DD_PROFILING_STACK_FAST_COPY=0` to `=1` (and updated the comment) so
the non-default path is still exercised in riot.
-
`releasenotes/notes/profiling-phase-out-process-vm-readv-97af2e74953bb9e9.yaml`
— dropped (the release note from #17757 announced a default flip that we
are no longer landing).

## Findings since the original PLAN.md write-up

The crash decomposes into three layers, only the first of which this PR
fixes:

1. **dd-trace-py: `safe_memcpy`'s recovery is fragile when another
component owns SIGSEGV.** pytest's built-in faulthandler plugin runs
`faulthandler.enable()` in `pytest_configure`, installing
`faulthandler_fatal_error` as the SIGSEGV `sigaction`. Our
`init_segv_catcher` is wrapped in `call_once` (`sampler.cpp:172`) and
never re-installs after another component overwrites it. Post-mortem
core confirms: `t_handler_armed = 1` on the crashing thread (we *did*
arm), `g_old_segv.sa_handler = faulthandler_fatal_error` (we saved
pytest's handler when ours got installed), and the call stack shows the
kernel called `faulthandler_fatal_error`, not our `segv_handler`.
`pytest -p no:faulthandler` eliminates the crash 0/8.
2. **dd-trace-py: profiler shutdown invokes Python imports while sampler
is live.** `Profiler._stop_service` keeps collectors alive across
`scheduler.flush()` "for snapshot"; flush triggers
`code_provenance.get_code_provenance_file()` →
`_package_for_root_module_mapping()` → cold imports of
`setuptools`/`_distutils_hack`/`packaging.*`. Sampler races, reads stale
`PyCodeObject*`, faults. With layer 1 fixed (via this PR using
`process_vm_readv`) this becomes a dropped sample instead of a crash,
but should be properly fixed in a PROF-14568 follow-up.
3. **datadog-lambda-python: `tests/test_api.py:175` calls
`os.environ.clear()` without restoring env.** Strips
`DD_PROFILING_STACK_FAST_COPY` for the rest of the pytest session, so
any subsequent `Profiler.start()` re-reads `config.stack.fast_copy`
which defaults to `True` and re-flips `safe_copy` back to
`safe_memcpy_wrapper`. Confirmed via gdb breakpoint on
`set_fast_copy_enabled` showing `arg=0` (env present) then `arg=1` (env
cleared) when test_api runs before test_wrapper. To be filed as a
separate `datadog-lambda-python` issue.

This PR sidesteps layer 1 universally by defaulting to
`process_vm_readv`. With layer 1 gone, layer 3 becomes harmless to the
profiler regardless of test pollution.

## Testing

- `hatch run lint:fmt` and `hatch run lint:riot` pass.
- The C++ runtime path is unchanged; only the Python config default and
the riot env values move. The C++ static-init opt-out at
`ddtrace/internal/datadog/profiling/stack/src/echion/vm.cc:6-16` already
treated unset/truthy as fast-copy-enabled, so behavior with the env var
unset is now: Python config = `False` → `set_fast_copy_enabled(false)` →
`safe_copy = process_vm_readv` (Linux) or `mach_vm_read_overwrite`
(Darwin).
- Existing `tests/profiling/collector/test_copy_memory_stats.py` covers
both branches explicitly via env var.
- Reproduction validation on workspace-tg: pre-fix `safe_memcpy` default
= 5/8 crashes; with this PR's default flipped to `process_vm_readv` (or
equivalently `pytest -p no:faulthandler` on the old default) = 0/8.

## Risks

- **Hardened-kernel cohort.** With `VmReader` removed in #17755, the
only non-fast-copy reader is `process_vm_readv`. On Linux systems where
`process_vm_readv` is unavailable (e.g. `kernel.yama.ptrace_scope=3`,
certain seccomp/sandboxed containers), `set_fast_copy_enabled(false)`
will set `failed_safe_copy = true` and disable stack profiling, even
though `safe_memcpy` would have worked. This reproduces the pre-#17757
behavior exactly, so it does not regress any user who was already
running on a release without #17757. Users who upgraded to a release
containing #17757 and were silently relying on `safe_memcpy` in such an
environment will lose stack profiling until they set
`_DD_PROFILING_STACK_FAST_COPY=1` explicitly. Accepted as a trade-off
vs. the shutdown-crash risk.
- **No public API change.** `_DD_PROFILING_STACK_FAST_COPY` is a private
(`_DD_*`) env var.

## Additional Notes

- Tracking:
[PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568)
- Original culprit PR: #17757
- Related upstream test PR: #17872
- `changelog/no-changelog` label applied because the env var is private
and we are deleting the user-facing release note from #17757 (which
announced a default flip that is no longer landing).


[PROF-14568]:
https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[PROF-14568]:
https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
P403n1x87 pushed a commit that referenced this pull request May 14, 2026
…lse (#17929)

## Description

Reverts the default of `_DD_PROFILING_STACK_FAST_COPY` from `true` (set
in #17757) back to `false` while we investigate the profiler-shutdown
segfault tracked in
[PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568) (Slack
`#incident-53849`, channel `C0B1H299S4R`).

Reproduces 5/8 natively on x86_64 Linux Python 3.11.13 with default
`safe_memcpy`. With `pytest -p no:faulthandler` it drops to **0/8**,
which combined with the post-mortem core analysis pinpoints the actual
mechanism: `safe_memcpy`'s `sigsetjmp/siglongjmp` recovery is
incompatible with pytest's built-in faulthandler plugin owning the
SIGSEGV `sigaction`. Switching the default to `process_vm_readv` (a
kernel syscall that returns `-1/EFAULT` cleanly on bad source) sidesteps
the SIGSEGV path entirely and is the fastest unblock for the serverless
team's release. A proper fix for the underlying profiler-shutdown race
(sampler walking frames during `scheduler.flush()`-driven imports) is
tracked separately in PROF-14568.

Concrete changes:
- `ddtrace/internal/settings/profiling.py` — `fast_copy` default `True`
→ `False`. Users opting in via `_DD_PROFILING_STACK_FAST_COPY=1` are
unaffected.
- `riotfile.py` — flipped the four dedicated profile-uwsgi venvs from
`_DD_PROFILING_STACK_FAST_COPY=0` to `=1` (and updated the comment) so
the non-default path is still exercised in riot.
-
`releasenotes/notes/profiling-phase-out-process-vm-readv-97af2e74953bb9e9.yaml`
— dropped (the release note from #17757 announced a default flip that we
are no longer landing).

## Findings since the original PLAN.md write-up

The crash decomposes into three layers, only the first of which this PR
fixes:

1. **dd-trace-py: `safe_memcpy`'s recovery is fragile when another
component owns SIGSEGV.** pytest's built-in faulthandler plugin runs
`faulthandler.enable()` in `pytest_configure`, installing
`faulthandler_fatal_error` as the SIGSEGV `sigaction`. Our
`init_segv_catcher` is wrapped in `call_once` (`sampler.cpp:172`) and
never re-installs after another component overwrites it. Post-mortem
core confirms: `t_handler_armed = 1` on the crashing thread (we *did*
arm), `g_old_segv.sa_handler = faulthandler_fatal_error` (we saved
pytest's handler when ours got installed), and the call stack shows the
kernel called `faulthandler_fatal_error`, not our `segv_handler`.
`pytest -p no:faulthandler` eliminates the crash 0/8.
2. **dd-trace-py: profiler shutdown invokes Python imports while sampler
is live.** `Profiler._stop_service` keeps collectors alive across
`scheduler.flush()` "for snapshot"; flush triggers
`code_provenance.get_code_provenance_file()` →
`_package_for_root_module_mapping()` → cold imports of
`setuptools`/`_distutils_hack`/`packaging.*`. Sampler races, reads stale
`PyCodeObject*`, faults. With layer 1 fixed (via this PR using
`process_vm_readv`) this becomes a dropped sample instead of a crash,
but should be properly fixed in a PROF-14568 follow-up.
3. **datadog-lambda-python: `tests/test_api.py:175` calls
`os.environ.clear()` without restoring env.** Strips
`DD_PROFILING_STACK_FAST_COPY` for the rest of the pytest session, so
any subsequent `Profiler.start()` re-reads `config.stack.fast_copy`
which defaults to `True` and re-flips `safe_copy` back to
`safe_memcpy_wrapper`. Confirmed via gdb breakpoint on
`set_fast_copy_enabled` showing `arg=0` (env present) then `arg=1` (env
cleared) when test_api runs before test_wrapper. To be filed as a
separate `datadog-lambda-python` issue.

This PR sidesteps layer 1 universally by defaulting to
`process_vm_readv`. With layer 1 gone, layer 3 becomes harmless to the
profiler regardless of test pollution.

## Testing

- `hatch run lint:fmt` and `hatch run lint:riot` pass.
- The C++ runtime path is unchanged; only the Python config default and
the riot env values move. The C++ static-init opt-out at
`ddtrace/internal/datadog/profiling/stack/src/echion/vm.cc:6-16` already
treated unset/truthy as fast-copy-enabled, so behavior with the env var
unset is now: Python config = `False` → `set_fast_copy_enabled(false)` →
`safe_copy = process_vm_readv` (Linux) or `mach_vm_read_overwrite`
(Darwin).
- Existing `tests/profiling/collector/test_copy_memory_stats.py` covers
both branches explicitly via env var.
- Reproduction validation on workspace-tg: pre-fix `safe_memcpy` default
= 5/8 crashes; with this PR's default flipped to `process_vm_readv` (or
equivalently `pytest -p no:faulthandler` on the old default) = 0/8.

## Risks

- **Hardened-kernel cohort.** With `VmReader` removed in #17755, the
only non-fast-copy reader is `process_vm_readv`. On Linux systems where
`process_vm_readv` is unavailable (e.g. `kernel.yama.ptrace_scope=3`,
certain seccomp/sandboxed containers), `set_fast_copy_enabled(false)`
will set `failed_safe_copy = true` and disable stack profiling, even
though `safe_memcpy` would have worked. This reproduces the pre-#17757
behavior exactly, so it does not regress any user who was already
running on a release without #17757. Users who upgraded to a release
containing #17757 and were silently relying on `safe_memcpy` in such an
environment will lose stack profiling until they set
`_DD_PROFILING_STACK_FAST_COPY=1` explicitly. Accepted as a trade-off
vs. the shutdown-crash risk.
- **No public API change.** `_DD_PROFILING_STACK_FAST_COPY` is a private
(`_DD_*`) env var.

## Additional Notes

- Tracking:
[PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568)
- Original culprit PR: #17757
- Related upstream test PR: #17872
- `changelog/no-changelog` label applied because the env var is private
and we are deleting the user-facing release note from #17757 (which
announced a default flip that is no longer landing).


[PROF-14568]:
https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[PROF-14568]:
https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant