Skip to content

fix(metrics): emit OOM metric on memory equality with per-request dedup#1241

Draft
lym953 wants to merge 2 commits into
mainfrom
yiming.luo/fix-1237-node-oom-metric
Draft

fix(metrics): emit OOM metric on memory equality with per-request dedup#1241
lym953 wants to merge 2 commits into
mainfrom
yiming.luo/fix-1237-node-oom-metric

Conversation

@lym953
Copy link
Copy Markdown
Contributor

@lym953 lym953 commented May 29, 2026

Summary

Closes the gap surfaced in #1237: a Node.js Lambda that hit its memory limit (Memory Size 192 MB / Max Memory Used 192 MB, Status: timeout) did not emit aws.lambda.enhanced.out_of_memory because none of the three existing detection paths matched.

  • Why the existing paths missed it. V8 spent its budget in GC rather than declaring JavaScript heap out of memory, so the runtime log-line match never fired. The runtime crashed on a wall-clock timeout, so PlatformRuntimeDone reported no error_type. And the max_memory_used_mb == memory_size_mb check in PlatformReport was gated on runtime.starts_with(\"provided.al\") to avoid double-counting against the log path, so Node was excluded.

  • What changes. Drop the provided.al* restriction so the equality check applies to every runtime. To avoid double-counting against the two pre-existing paths (some invocations satisfy both equality and Runtime.OutOfMemory simultaneously), add a per-Context oom_emitted flag. All three detection paths funnel through a new Processor::try_increment_oom_metric, which checks/sets the flag and is a no-op on subsequent calls for the same request_id.

  • Plumbing. Event::OutOfMemory now carries an Option<String> request_id. The log-path detector reads it from LambdaProcessor::invocation_context.request_id (set on PlatformStart, cleared on PlatformRuntimeDone/PlatformReport). None is only realistic in Managed Instance mode (extensions can't subscribe to INVOKE there); the helper falls back to a best-effort emit without dedup in that case.

Test plan

Unit tests added (cargo test — 540 passing, +3):

  • test_try_increment_oom_metric_dedupes_same_request_id — two OOM signals on the same request emit exactly once (sketch sum = 1.0)
  • test_try_increment_oom_metric_distinct_request_ids_emit_separately — two distinct requests each emit (sum = 2.0)
  • test_handle_ondemand_report_emits_oom_on_memory_equality — regression coverage for the dropped provided.al* check

Cross-runtime integration test added (integration-tests/tests/oom.test.ts, suite oom):

  • One Lambda per OOM shape across Node × 2, Python, Ruby, Java, .NET, Go — seven functions in total. Each intentionally OOMs; the test asserts aws.lambda.enhanced.out_of_memory count is exactly 1 per function. Python/Ruby/Go naturally fire more than one detection path per invocation, so they fail the assertion if the dedup flag regresses.
  • Adds Ruby + Go scaffolding to the integration-test framework (runtime/layer helpers, build-ruby.sh, build-go.sh, GitLab pipeline build stages).

CI hygiene:

  • cargo check clean
  • cargo clippy --all-targets -- -D warnings clean
  • cargo fmt --check clean

Manual test

Four reproducer Lambdas in us-east-2 (committed to a separate investigation folder outside this repo; details in #1237 thread). Each exercises a distinct OOM symptom on Node 22 (`nodejs:22.v78`).

Case Trigger Status Error Type Max/Size
A — V8 heap exhaustion --max-old-space-size=128 + string allocator error Runtime.ExitError 244/256
B — Off-heap SIGKILL Buffer.allocUnsafe(20MB) loop error Runtime.OutOfMemory 192/192
C — Handler timeout, RSS near limit 100MB ballast + sleep past timeout timeout (none) 173/192
D — Suppressed init OOM (customer's verbatim shape) Slow allocator at module top level timeout (none) 156/192

Cases C and D are the failure mode #1237 reports; both now emit OOM under this PR via the equality path. Cases A and B emit via the existing log-string and Runtime.OutOfMemory paths respectively, deduped by the new flag from the equality path's emission attempt.

Note: a behavior change since the Aug-2025 design note — Node22's kernel-SIGKILL case (B) now reports `Error Type: Runtime.OutOfMemory` (was `Runtime.ExitError`), so the existing path 2 already catches it. The log-string "Runtime exited with error: signal: killed" check in `logs/lambda/processor.rs` is effectively dead code for Node now (left in place for other runtimes).

Manual test — end-to-end with the custom extension layer

Built Datadog-Bottlecap-Beta-yiming-oom:1 (ARCHITECTURE=amd64 REGION=us-east-2 SUFFIX=yiming-oom FIPS=0 ./scripts/publish_bottlecap_sandbox.sh), instrumented Case A and Case B with datadog-ci lambda instrument ... -e 97 and swapped in the custom layer in place of the public Datadog-Extension:97. Invoked each once. Queried sum:aws.lambda.enhanced.out_of_memory{functionname:...} by {functionname}.as_count() in Datadog over the invocation window:

Function OOM count Detection path fired
yiming-repro-1237-case-a-v8-heap 1 log-string match (JavaScript heap out of memory)
yiming-repro-1237-case-b-sigkill 1 PlatformRuntimeDone error_type=Runtime.OutOfMemory (path 2)

For Case B, the new equality path also attempted to fire (192/192), but try_increment_oom_metric saw oom_emitted=true from path 2 and skipped — exactly as designed.

🤖 Generated with Claude Code

…th per-request dedup

Customer report (#1237): a Node.js Lambda that hit its memory limit
(Memory Size 192 MB / Max Memory Used 192 MB, Status: timeout) did not
emit aws.lambda.enhanced.out_of_memory because none of the existing
detection paths matched. The Node runtime did not log
"JavaScript heap out of memory" (V8 spent its time in GC instead of
declaring an OOM), and PlatformRuntimeDone reported no error_type — just
a wall-clock timeout — so the log-string and Runtime.OutOfMemory paths
both stayed silent.

Drop the provided.al* restriction on the PlatformReport equality check
so any runtime emits OOM when max_memory_used_mb == memory_size_mb. To
avoid double-counting against the two pre-existing paths (some
invocations satisfy both equality and Runtime.OutOfMemory at the same
time), add a per-Context oom_emitted flag. All three detection paths now
funnel through Processor::try_increment_oom_metric, which checks the
flag, sets it on first emission, and is a no-op on subsequent calls for
the same request_id. The flag lives with the per-invocation Context and
is cleared automatically when on_platform_report removes the context.

Plumbing: Event::OutOfMemory now carries an Option<String> request_id
(the log-path detector reads it from the logs processor's
invocation_context.request_id, set on PlatformStart and cleared on
PlatformRuntimeDone). When request_id is None — only realistic in
Managed Instance mode, where extensions cannot subscribe to INVOKE — the
helper falls back to a best-effort emit without dedup.

Tests cover three scenarios: same request_id emits exactly once, two
distinct request_ids each emit, and the equality path still fires
(regression coverage for the dropped provided.al* check).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@datadog-prod-us1-5
Copy link
Copy Markdown

datadog-prod-us1-5 Bot commented May 29, 2026

Pipelines

Fix all issues with BitsAI

⚠️ Warnings

🚦 3 Pipeline jobs failed

DataDog/datadog-lambda-extension | integration-suite: [oom]   View in Datadog   GitLab

🔧 Fix in code (Fix with Cursor). 3 failed tests in tests/oom.test.ts: expected out_of_memory metric to be 1 but received 0 for ruby, java and go runtimes.

DataDog/datadog-lambda-extension | integration-suite: [snapstart]   View in Datadog   GitLab

🔧 Fix in code (Fix with Cursor). CloudFormation import failed due to existing resources without proper DeletionPolicy attribute in template.

DataDog/datadog-lambda-extension | e2e-test-status (amd64, fips)   View in Datadog   GitLab

See error E2E job status failed after multiple retries due to failing tests.

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 5a833ac | Docs | Datadog PR Page | Give us feedback!

Adds a new `oom` integration-test suite that exercises the OOM dedup
change (Context::oom_emitted, #1241) end-to-end across every supported
runtime. Each lambda intentionally allocates until it OOMs; the test
asserts aws.lambda.enhanced.out_of_memory increments by exactly one
data point per function over the invocation window — which fails if the
dedup flag stops working and two detection paths emit for the same
invocation.

New lambda apps under integration-tests/lambda/:
- oom-node-v8-heap   : exercises log-line path (JavaScript heap OOM)
- oom-node-sigkill   : exercises PlatformRuntimeDone Runtime.OutOfMemory path
- oom-python         : MemoryError — log path AND PlatformRuntimeDone path
                       both fire, so dedup is necessary for count==1
- oom-ruby           : NoMemoryError — same dual-path coverage as Python
- oom-java           : OutOfMemoryError (log-line path)
- oom-dotnet         : OutOfMemoryException (log-line path)
- oom-go             : fatal: runtime: out of memory — log path AND
                       PlatformReport memory-equality path both fire

Framework additions:
- Ruby and Go runtime/layer helpers in lib/util.ts (Ruby tracer layer;
  Go has no tracer layer — extension layer alone covers the test).
- Oom CDK stack registered in bin/app.ts.
- build-ruby.sh (zip-as-is for now; Gemfile build stubbed) and
  build-go.sh (Docker cross-compile to ARM64 Linux, bootstrap binary).
- Pipeline template additions for the two new build stages and
  oom suite registration in test-suites.yaml.
- getMetricCount() + OUT_OF_MEMORY_METRIC in tests/utils/datadog.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant