fix(metrics): emit OOM metric on memory equality with per-request dedup #1241
Draft
lym953 wants to merge 2 commits into
…th per-request dedup

Customer report (#1237): a Node.js Lambda that hit its memory limit (Memory Size 192 MB / Max Memory Used 192 MB, Status: timeout) did not emit aws.lambda.enhanced.out_of_memory because none of the existing detection paths matched. The Node runtime did not log "JavaScript heap out of memory" (V8 spent its time in GC instead of declaring an OOM), and PlatformRuntimeDone reported no error_type, just a wall-clock timeout, so the log-string and Runtime.OutOfMemory paths both stayed silent.

Drop the provided.al* restriction on the PlatformReport equality check so any runtime emits OOM when max_memory_used_mb == memory_size_mb. To avoid double-counting against the two pre-existing paths (some invocations satisfy both equality and Runtime.OutOfMemory at the same time), add a per-Context oom_emitted flag. All three detection paths now funnel through Processor::try_increment_oom_metric, which checks the flag, sets it on first emission, and is a no-op on subsequent calls for the same request_id. The flag lives with the per-invocation Context and is cleared automatically when on_platform_report removes the context.

Plumbing: Event::OutOfMemory now carries an Option<String> request_id (the log-path detector reads it from the logs processor's invocation_context.request_id, set on PlatformStart and cleared on PlatformRuntimeDone). When request_id is None (only realistic in Managed Instance mode, where extensions cannot subscribe to INVOKE) the helper falls back to a best-effort emit without dedup.

Tests cover three scenarios: same request_id emits exactly once, two distinct request_ids each emit, and the equality path still fires (regression coverage for the dropped provided.al* check).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
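The dedup flow described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `Context::oom_emitted`, `try_increment_oom_metric`, and `on_platform_report` are names taken from the message, while the struct layout, the `on_invoke` helper, the `oom_metric_count` counter (a stand-in for the real metric emission), and the handling of an unknown `request_id` are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Per-invocation state. `oom_emitted` mirrors the flag named in the commit message.
#[derive(Default)]
struct Context {
    oom_emitted: bool,
}

/// Hypothetical stand-in for the real processor: it only counts emitted data points.
#[derive(Default)]
struct Processor {
    contexts: HashMap<String, Context>,
    oom_metric_count: u64,
}

impl Processor {
    /// Assumed helper: registers a fresh context when an invocation starts.
    fn on_invoke(&mut self, request_id: &str) {
        self.contexts.insert(request_id.to_string(), Context::default());
    }

    /// Emits at most once per tracked request_id and returns true if a point was emitted.
    /// With no request_id (or, in this sketch, an untracked one) it emits without dedup.
    fn try_increment_oom_metric(&mut self, request_id: Option<&str>) -> bool {
        if let Some(id) = request_id {
            if let Some(ctx) = self.contexts.get_mut(id) {
                if ctx.oom_emitted {
                    return false; // another detection path already emitted for this request
                }
                ctx.oom_emitted = true;
            }
        }
        self.oom_metric_count += 1;
        true
    }

    /// Equality path with no runtime gate; removing the context also clears the flag.
    fn on_platform_report(&mut self, request_id: &str, max_memory_used_mb: u64, memory_size_mb: u64) {
        if max_memory_used_mb == memory_size_mb {
            self.try_increment_oom_metric(Some(request_id));
        }
        self.contexts.remove(request_id);
    }
}

/// Replays one invocation that satisfies both Runtime.OutOfMemory and memory equality.
fn replay_dual_path_invocation() -> u64 {
    let mut processor = Processor::default();
    processor.on_invoke("req-1");
    processor.try_increment_oom_metric(Some("req-1")); // Runtime.OutOfMemory path fires first
    processor.on_platform_report("req-1", 192, 192); // equality path is deduped by the flag
    processor.oom_metric_count
}
```

In this sketch `replay_dual_path_invocation()` returns 1, because the second detection path finds the flag already set, whereas two calls with `None` would each emit.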
Adds a new `oom` integration-test suite that exercises the OOM dedup change (Context::oom_emitted, #1241) end-to-end across every supported runtime. Each lambda intentionally allocates until it OOMs; the test asserts aws.lambda.enhanced.out_of_memory increments by exactly one data point per function over the invocation window, which fails if the dedup flag stops working and two detection paths emit for the same invocation.

New lambda apps under integration-tests/lambda/:
- oom-node-v8-heap: exercises log-line path (JavaScript heap OOM)
- oom-node-sigkill: exercises PlatformRuntimeDone Runtime.OutOfMemory path
- oom-python: MemoryError; log path AND PlatformRuntimeDone path both fire, so dedup is necessary for count==1
- oom-ruby: NoMemoryError; same dual-path coverage as Python
- oom-java: OutOfMemoryError (log-line path)
- oom-dotnet: OutOfMemoryException (log-line path)
- oom-go: fatal: runtime: out of memory; log path AND PlatformReport memory-equality path both fire

Framework additions:
- Ruby and Go runtime/layer helpers in lib/util.ts (Ruby tracer layer; Go has no tracer layer, so the extension layer alone covers the test).
- Oom CDK stack registered in bin/app.ts.
- build-ruby.sh (zip-as-is for now; Gemfile build stubbed) and build-go.sh (Docker cross-compile to ARM64 Linux, bootstrap binary).
- Pipeline template additions for the two new build stages and oom suite registration in test-suites.yaml.
- getMetricCount() + OUT_OF_MEMORY_METRIC in tests/utils/datadog.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
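To illustrate why the count==1 assertion catches a dedup regression, the sketch below models each test app by the detection paths this commit message says it triggers. The `Path` enum and both functions are hypothetical names introduced for the example; only the app names and their path lists come from the message above.

```rust
/// The three detection paths named in the PR (enum and variant names are illustrative).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Path {
    LogLine,
    RuntimeDone,
    ReportEquality,
}

/// Detection paths each test lambda is expected to trigger, per the list above.
fn expected_paths(app: &str) -> &'static [Path] {
    match app {
        "oom-node-v8-heap" | "oom-java" | "oom-dotnet" => &[Path::LogLine],
        "oom-node-sigkill" => &[Path::RuntimeDone],
        "oom-python" | "oom-ruby" => &[Path::LogLine, Path::RuntimeDone],
        "oom-go" => &[Path::LogLine, Path::ReportEquality],
        _ => &[],
    }
}

/// Data points emitted for one invocation, with or without the per-request flag.
fn emitted_points(paths: &[Path], dedup: bool) -> usize {
    let mut oom_emitted = false;
    let mut points = 0;
    for _path in paths {
        if dedup && oom_emitted {
            continue; // a previous path already emitted for this request
        }
        oom_emitted = true;
        points += 1;
    }
    points
}
```

Under these assumptions every app yields one point with dedup enabled, while `oom-python`, `oom-ruby`, and `oom-go` yield two without it, which is what would make the integration assertion fail.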
Summary
Closes the gap surfaced in #1237: a Node.js Lambda that hit its memory limit (`Memory Size 192 MB / Max Memory Used 192 MB`, `Status: timeout`) did not emit `aws.lambda.enhanced.out_of_memory` because none of the three existing detection paths matched.

Why the existing paths missed it. V8 spent its budget in GC rather than declaring `JavaScript heap out of memory`, so the runtime log-line match never fired. The invocation ended on a wall-clock timeout, so `PlatformRuntimeDone` reported no `error_type`. And the `max_memory_used_mb == memory_size_mb` check in `PlatformReport` was gated on `runtime.starts_with("provided.al")` to avoid double-counting against the log path, so Node was excluded.

What changes. Drop the `provided.al*` restriction so the equality check applies to every runtime. To avoid double-counting against the two pre-existing paths (some invocations satisfy both equality and `Runtime.OutOfMemory` simultaneously), add a per-`Context` `oom_emitted` flag. All three detection paths funnel through a new `Processor::try_increment_oom_metric`, which checks and sets the flag and is a no-op on subsequent calls for the same `request_id`.

Plumbing. `Event::OutOfMemory` now carries an `Option<String>` `request_id`. The log-path detector reads it from `LambdaProcessor::invocation_context.request_id` (set on `PlatformStart`, cleared on `PlatformRuntimeDone`/`PlatformReport`). `None` is only realistic in Managed Instance mode (extensions can't subscribe to INVOKE there); the helper falls back to a best-effort emit without dedup in that case.

Test plan
Unit tests added (`cargo test`: 540 passing, +3):
- `test_try_increment_oom_metric_dedupes_same_request_id`: two OOM signals on the same request emit exactly once (sketch sum = 1.0)
- `test_try_increment_oom_metric_distinct_request_ids_emit_separately`: two distinct requests each emit (sum = 2.0)
- `test_handle_ondemand_report_emits_oom_on_memory_equality`: regression coverage for the dropped `provided.al*` check

Cross-runtime integration test added (`integration-tests/tests/oom.test.ts`, suite `oom`):
- `aws.lambda.enhanced.out_of_memory` count is exactly 1 per function. Python/Ruby/Go naturally fire more than one detection path per invocation, so they fail the assertion if the dedup flag regresses.
- `build-ruby.sh`, `build-go.sh`, GitLab pipeline build stages).

CI hygiene:
- `cargo check` clean
- `cargo clippy --all-targets -- -D warnings` clean
- `cargo fmt --check` clean

Manual test
Four reproducer Lambdas in `us-east-2` (committed to a separate investigation folder outside this repo; details in #1237 thread). Each exercises a distinct OOM symptom on Node 22 (`nodejs:22.v78`).

- `--max-old-space-size=128` + string allocator: `Runtime.ExitError`
- `Buffer.allocUnsafe(20MB)` loop: `Runtime.OutOfMemory`

Cases C and D are the failure mode #1237 reports; both now emit OOM under this PR via the equality path. Cases A and B emit via the existing log-string and `Runtime.OutOfMemory` paths respectively, deduped by the new flag from the equality path's emission attempt.

Note: behavior has changed since the Aug-2025 design note. Node22's kernel-SIGKILL case (B) now reports `Error Type: Runtime.OutOfMemory` (was `Runtime.ExitError`), so the existing path 2 already catches it. The log-string "Runtime exited with error: signal: killed" check in `logs/lambda/processor.rs` is effectively dead code for Node now (left in place for other runtimes).
Manual test — end-to-end with the custom extension layer
Built `Datadog-Bottlecap-Beta-yiming-oom:1` (`ARCHITECTURE=amd64 REGION=us-east-2 SUFFIX=yiming-oom FIPS=0 ./scripts/publish_bottlecap_sandbox.sh`), instrumented Case A and Case B with `datadog-ci lambda instrument ... -e 97`, and swapped in the custom layer in place of the public `Datadog-Extension:97`. Invoked each once. Queried `sum:aws.lambda.enhanced.out_of_memory{functionname:...} by {functionname}.as_count()` in Datadog over the invocation window:

- `yiming-repro-1237-case-a-v8-heap`: JavaScript heap out of memory)
- `yiming-repro-1237-case-b-sigkill`: `PlatformRuntimeDone` `error_type=Runtime.OutOfMemory` (path 2)

For Case B, the new equality path also attempted to fire (`192/192`), but `try_increment_oom_metric` saw `oom_emitted=true` from path 2 and skipped, exactly as designed.

🤖 Generated with Claude Code