fix(metrics): emit OOM metric on memory equality with per-request dedup by lym953 · Pull Request #1241 · DataDog/datadog-lambda-extension

lym953 · 2026-05-29T19:13:26Z

Summary

Closes the gap surfaced in #1237: a Node.js Lambda that hit its memory limit (Memory Size 192 MB / Max Memory Used 192 MB, Status: timeout) did not emit aws.lambda.enhanced.out_of_memory because none of the three existing detection paths matched.

Why the existing paths missed it. V8 spent its budget in GC rather than declaring JavaScript heap out of memory, so the runtime log-line match never fired. The runtime crashed on a wall-clock timeout, so PlatformRuntimeDone reported no error_type. And the max_memory_used_mb == memory_size_mb check in PlatformReport was gated on runtime.starts_with("provided.al") to avoid double-counting against the log path, so Node was excluded.
What changes. Drop the provided.al* restriction so the equality check applies to every runtime. To avoid double-counting against the two pre-existing paths (some invocations satisfy both equality and Runtime.OutOfMemory simultaneously), add a per-Context oom_emitted flag. All three detection paths funnel through a new Processor::try_increment_oom_metric, which checks/sets the flag and is a no-op on subsequent calls for the same request_id.
Plumbing. Event::OutOfMemory now carries an Option<String> request_id. The log-path detector reads it from LambdaProcessor::invocation_context.request_id (set on PlatformStart, cleared on PlatformRuntimeDone/PlatformReport). None is only realistic in Managed Instance mode (extensions can't subscribe to INVOKE there); the helper falls back to a best-effort emit without dedup in that case.
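
For reviewers who want the dedup behavior in one place, here is a minimal sketch of the flag and helper described above. It is not the code in this PR: the `Processor` and `Context` shapes, the `contexts` map, the `on_invoke` hook, and the plain `oom_count` counter are simplified stand-ins, and treating an unknown request_id like `None` is an assumption made for the sketch.

```rust
use std::collections::HashMap;

/// Per-invocation state. `oom_emitted` is the dedup flag; the real Context
/// carries more fields, omitted here.
#[derive(Default)]
struct Context {
    oom_emitted: bool,
}

/// Hypothetical stand-in for the extension's processor.
#[derive(Default)]
struct Processor {
    /// Live invocation contexts keyed by request_id (assumed layout).
    contexts: HashMap<String, Context>,
    /// Stand-in for the aws.lambda.enhanced.out_of_memory counter.
    oom_count: u64,
}

impl Processor {
    /// Registers a context when an invocation starts (hypothetical hook).
    fn on_invoke(&mut self, request_id: &str) {
        self.contexts.entry(request_id.to_string()).or_default();
    }

    /// Emits the OOM metric at most once per known request_id.
    /// Returns true if this call emitted.
    fn try_increment_oom_metric(&mut self, request_id: Option<&str>) -> bool {
        if let Some(id) = request_id {
            if let Some(ctx) = self.contexts.get_mut(id) {
                if ctx.oom_emitted {
                    // Another detection path already emitted for this request.
                    return false;
                }
                ctx.oom_emitted = true;
            }
        }
        // No request_id (or, by assumption, no matching context):
        // best-effort emit without dedup.
        self.oom_count += 1;
        true
    }

    /// Dropping the context also drops the flag, so nothing needs resetting.
    fn on_platform_report(&mut self, request_id: &str) {
        self.contexts.remove(request_id);
    }
}
```

With this shape, a second signal for the same request_id returns false and leaves the counter unchanged, while calls with `None` always emit.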

Test plan

Unit tests added (cargo test — 540 passing, +3):

test_try_increment_oom_metric_dedupes_same_request_id — two OOM signals on the same request emit exactly once (sketch sum = 1.0)
test_try_increment_oom_metric_distinct_request_ids_emit_separately — two distinct requests each emit (sum = 2.0)
test_handle_ondemand_report_emits_oom_on_memory_equality — regression coverage for the dropped provided.al* check
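
To illustrate what the third test guards, the sketch below contrasts the equality check before and after the provided.al* gate is dropped. The function names and the `u64` parameter types are assumptions made for illustration; only the `starts_with("provided.al")` condition and the equality comparison come from this PR's description.

```rust
/// Behavior before this PR, as described in the summary: memory equality only
/// counted for runtimes whose name starts with "provided.al".
fn is_oom_before(runtime: &str, max_memory_used_mb: u64, memory_size_mb: u64) -> bool {
    runtime.starts_with("provided.al") && max_memory_used_mb == memory_size_mb
}

/// Behavior after this PR: memory equality counts for every runtime.
fn is_oom_after(max_memory_used_mb: u64, memory_size_mb: u64) -> bool {
    max_memory_used_mb == memory_size_mb
}
```

A Node function reporting 192/192 is rejected by `is_oom_before("nodejs22.x", 192, 192)` and accepted by `is_oom_after(192, 192)`.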

Cross-runtime integration test added (integration-tests/tests/oom.test.ts, suite oom):

One Lambda per OOM shape across Node × 2, Python, Ruby, Java, .NET, Go — seven functions in total. Each intentionally OOMs; the test asserts aws.lambda.enhanced.out_of_memory count is exactly 1 per function. Python/Ruby/Go naturally fire more than one detection path per invocation, so they fail the assertion if the dedup flag regresses.
Adds Ruby + Go scaffolding to the integration-test framework (runtime/layer helpers, build-ruby.sh, build-go.sh, GitLab pipeline build stages).

CI hygiene:

cargo check clean
cargo clippy --all-targets -- -D warnings clean
cargo fmt --check clean

Manual test

Four reproducer Lambdas in us-east-2 (committed to a separate investigation folder outside this repo; details in #1237 thread). Each exercises a distinct OOM symptom on Node 22 (`nodejs:22.v78`).

| Case | Trigger | Status | Error Type | Max Memory Used / Memory Size (MB) |
| --- | --- | --- | --- | --- |
| A: V8 heap exhaustion | `--max-old-space-size=128` + string allocator | error | `Runtime.ExitError` | 244/256 |
| B: Off-heap SIGKILL | `Buffer.allocUnsafe(20MB)` loop | error | `Runtime.OutOfMemory` | 192/192 |
| C: Handler timeout, RSS near limit | 100MB ballast + sleep past timeout | timeout | (none) | 173/192 |
| D: Suppressed init OOM (customer's verbatim shape) | Slow allocator at module top level | timeout | (none) | 156/192 |

Cases C and D are the failure mode #1237 reports; both now emit OOM under this PR via the equality path. Cases A and B emit via the existing log-string and Runtime.OutOfMemory paths respectively, deduped by the new flag from the equality path's emission attempt.

Note: a behavior change since the Aug-2025 design note — Node22's kernel-SIGKILL case (B) now reports `Error Type: Runtime.OutOfMemory` (was `Runtime.ExitError`), so the existing path 2 already catches it. The log-string "Runtime exited with error: signal: killed" check in `logs/lambda/processor.rs` is effectively dead code for Node now (left in place for other runtimes).
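
As a rough picture of the log-string path, the sketch below matches a log line against the two substrings quoted in this PR. It is an assumption-laden illustration, not the contents of `logs/lambda/processor.rs`: the constant, the function name, and the plain substring matching are hypothetical, and the real marker list may be longer.

```rust
/// Substrings quoted in this PR; the extension's actual list may differ.
const OOM_LOG_MARKERS: [&str; 2] = [
    "JavaScript heap out of memory",
    "Runtime exited with error: signal: killed",
];

/// Returns true if the log line contains any known OOM marker.
fn log_line_signals_oom(line: &str) -> bool {
    OOM_LOG_MARKERS.iter().any(|&marker| line.contains(marker))
}
```

A timeout line such as "Task timed out after 3.00 seconds" matches neither marker, which is why the customer's timeout-shaped invocation never reached this path.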

Manual test — end-to-end with the custom extension layer

Built Datadog-Bottlecap-Beta-yiming-oom:1 (ARCHITECTURE=amd64 REGION=us-east-2 SUFFIX=yiming-oom FIPS=0 ./scripts/publish_bottlecap_sandbox.sh), instrumented Case A and Case B with datadog-ci lambda instrument ... -e 97 and swapped in the custom layer in place of the public Datadog-Extension:97. Invoked each once. Queried sum:aws.lambda.enhanced.out_of_memory{functionname:...} by {functionname}.as_count() in Datadog over the invocation window:

| Function | OOM count | Detection path fired |
| --- | --- | --- |
| `yiming-repro-1237-case-a-v8-heap` | 1 | log-string match (`JavaScript heap out of memory`) |
| `yiming-repro-1237-case-b-sigkill` | 1 | `PlatformRuntimeDone` `error_type=Runtime.OutOfMemory` (path 2) |

For Case B, the new equality path also attempted to fire (192/192), but try_increment_oom_metric saw oom_emitted=true from path 2 and skipped — exactly as designed.

🤖 Generated with Claude Code

…th per-request dedup

Customer report (#1237): a Node.js Lambda that hit its memory limit (Memory Size 192 MB / Max Memory Used 192 MB, Status: timeout) did not emit aws.lambda.enhanced.out_of_memory because none of the existing detection paths matched. The Node runtime did not log "JavaScript heap out of memory" (V8 spent its time in GC instead of declaring an OOM), and PlatformRuntimeDone reported no error_type, just a wall-clock timeout, so the log-string and Runtime.OutOfMemory paths both stayed silent.

Drop the provided.al* restriction on the PlatformReport equality check so any runtime emits OOM when max_memory_used_mb == memory_size_mb. To avoid double-counting against the two pre-existing paths (some invocations satisfy both equality and Runtime.OutOfMemory at the same time), add a per-Context oom_emitted flag. All three detection paths now funnel through Processor::try_increment_oom_metric, which checks the flag, sets it on first emission, and is a no-op on subsequent calls for the same request_id. The flag lives with the per-invocation Context and is cleared automatically when on_platform_report removes the context.

Plumbing: Event::OutOfMemory now carries an Option<String> request_id (the log-path detector reads it from the logs processor's invocation_context.request_id, set on PlatformStart and cleared on PlatformRuntimeDone). When request_id is None (only realistic in Managed Instance mode, where extensions cannot subscribe to INVOKE), the helper falls back to a best-effort emit without dedup.

Tests cover three scenarios: same request_id emits exactly once, two distinct request_ids each emit, and the equality path still fires (regression coverage for the dropped provided.al* check).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

datadog-prod-us1-5 · 2026-05-29T19:15:39Z

⚠️ Warnings

🚦 3 Pipeline jobs failed

DataDog/datadog-lambda-extension | integration-suite: [oom]

3 failed tests in tests/oom.test.ts: expected out_of_memory metric to be 1 but received 0 for ruby, java and go runtimes.

DataDog/datadog-lambda-extension | integration-suite: [snapstart]

CloudFormation import failed due to existing resources without proper DeletionPolicy attribute in template.

DataDog/datadog-lambda-extension | e2e-test-status (amd64, fips)

E2E job status failed after multiple retries due to failing tests.

This comment will be updated automatically if new data arrives.

Commit SHA: 5a833ac

Adds a new `oom` integration-test suite that exercises the OOM dedup change (Context::oom_emitted, #1241) end-to-end across every supported runtime. Each lambda intentionally allocates until it OOMs; the test asserts aws.lambda.enhanced.out_of_memory increments by exactly one data point per function over the invocation window, which fails if the dedup flag stops working and two detection paths emit for the same invocation.

New lambda apps under integration-tests/lambda/:

- oom-node-v8-heap: exercises log-line path (JavaScript heap OOM)
- oom-node-sigkill: exercises PlatformRuntimeDone Runtime.OutOfMemory path
- oom-python: MemoryError. Log path AND PlatformRuntimeDone path both fire, so dedup is necessary for count==1
- oom-ruby: NoMemoryError. Same dual-path coverage as Python
- oom-java: OutOfMemoryError (log-line path)
- oom-dotnet: OutOfMemoryException (log-line path)
- oom-go: fatal: runtime: out of memory. Log path AND PlatformReport memory-equality path both fire

Framework additions:

- Ruby and Go runtime/layer helpers in lib/util.ts (Ruby tracer layer; Go has no tracer layer, so the extension layer alone covers the test).
- Oom CDK stack registered in bin/app.ts.
- build-ruby.sh (zip-as-is for now; Gemfile build stubbed) and build-go.sh (Docker cross-compile to ARM64 Linux, bootstrap binary).
- Pipeline template additions for the two new build stages and oom suite registration in test-suites.yaml.
- getMetricCount() + OUT_OF_MEMORY_METRIC in tests/utils/datadog.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

lym953 wants to merge 2 commits into main from yiming.luo/fix-1237-node-oom-metric
