[BUG]: AppSec In App WAF blocking causes uncaught exceptions with Next.js when headers already sent #5452

@remenoscodes

Tracer Version(s)

5.42.0

Node.js Version(s)

20.12.1

Bug Report

Environment

Versions

  • Tracer Version: 5.42.0
  • Node.js Version: 20.12.1
  • Next.js Version: 14.2.7
  • Platform: AWS Fargate (Linux/ARM64)

Resource Allocation

  • CPU: 1024 units (1 vCPU)
  • Memory: 3072 MB
  • Datadog Agent: 256 CPU units

Features Enabled

DD_APPSEC_ENABLED=true      # AppSec In App WAF
DD_IAST_ENABLED=true        # Interactive Application Security Testing
DD_PROFILING_ENABLED=true   # Continuous Profiler
DD_LOGS_INJECTION=true      # Log Correlation
DD_RUNTIME_METRICS=true     # Runtime Metrics Collection
DD_DBM_PROPAGATION=full     # Database Monitoring

Bug Report

We've identified critical issues in the AppSec In App WAF blocking functionality that affect high-traffic Next.js applications:

  1. Race Condition and Crashes: When the WAF attempts to block a request after Next.js has already sent headers (common with non-existent routes like /admin.php), the blocking code throws an exception (Headers have already been sent) that nothing catches. Because our error handling calls process.exit on unhandled exceptions, each such request crashes the application.

  2. Status Code Discrepancy: When WAF successfully blocks a request with HTTP 403, the Datadog traces incorrectly record it as HTTP 404. This creates inconsistency between what clients experience and what appears in our monitoring.

  3. Performance Impact: Our workaround solutions have introduced high CPU usage in the "DD AppSec In App WAF Context" span, creating a performance bottleneck.

Patch Evolution

We've tried two different patch approaches:

First Approach (Effective but Performance-Heavy)

try {
  // 1. Check headers first
  if (res.headersSent) {
    log.warn('[ASM] Cannot send blocking response when headers have already been sent')
    return false
  }

  // 2. Get blocking data and send response
  const { body, headers, statusCode } = getBlockingData(req, null, actionParameters)
  for (const headerName of res.getHeaderNames()) {
    res.removeHeader(headerName)
  }
  res.writeHead(statusCode, headers)
  res.constructor.prototype.end.call(res, body)

  // 3. Mark as blocked and cleanup
  responseBlockedSet.add(res)
  rootSpan.setTag('appsec.blocked', 'true')
  abortController?.abort()

  return true
} catch (err) {
  rootSpan?.setTag('_dd.appsec.block.failed', 1)
  log.error('[ASM] Blocking error', err)
  return false
}

Current Minimal Approach

// Current minimal patch - prevents crashes only
if (!res || res.headersSent || res.finished) {
  log.warn('[ASM] Cannot send blocking response when headers have already been sent')
  return false
}

Key Differences:

  • First approach successfully blocked requests but had performance overhead
  • Current approach prevents crashes but fails to block when race condition occurs
  • Both approaches show similar CPU usage patterns
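
The trade-off between the two patches can be reproduced outside dd-trace. The following is a minimal sketch, not the tracer's code: `FakeResponse` is a hypothetical stand-in for `http.ServerResponse` that, like the real class, refuses to write headers twice, and `tryBlock` is a hypothetical helper name. It shows the guard returning `false` instead of throwing when the framework has already responded.

```javascript
// Hypothetical stand-in for http.ServerResponse: like the real class, it
// refuses to write headers a second time.
class FakeResponse {
  constructor () {
    this.headersSent = false
    this.statusCode = 200
    this.body = null
  }

  writeHead (statusCode) {
    if (this.headersSent) {
      const err = new Error('Cannot write headers after they are sent to the client')
      err.code = 'ERR_HTTP_HEADERS_SENT'
      throw err
    }
    this.statusCode = statusCode
    this.headersSent = true
  }

  end (body) {
    this.body = body
  }
}

// Hypothetical helper: try to send a blocking response without ever throwing.
// Returns true if the block was sent, false if it was too late or it failed.
function tryBlock (res, statusCode, body) {
  if (!res || res.headersSent) return false
  try {
    res.writeHead(statusCode)
    res.end(body)
    return true
  } catch (err) {
    return false
  }
}

// Case 1: the WAF verdict arrives before the framework responds.
const early = new FakeResponse()
const blockedEarly = tryBlock(early, 403, 'blocked')

// Case 2: the framework already sent a 404 (e.g. for /admin.php).
const late = new FakeResponse()
late.writeHead(404)
const blockedLate = tryBlock(late, 403, 'blocked')
```

In the second case the process survives, but the client still receives the 404, which matches the behavior of the current minimal patch described above.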

Root Cause Investigation

We're investigating deeper issues with the middleware/instrumentation timing:

  1. Next.js Middleware Timing: Headers appear to be sent before AppSec evaluation completes
  2. Trace Status Capture: Status codes are recorded before AppSec modifications
  3. Monkey Patching: Potential issues with how response methods are patched
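
To illustrate the second hypothesis, here is a minimal sketch of how reading the status code too early could yield a 404 in the trace while the client receives a 403. Everything in it is an assumption for illustration: `StubResponse` and `span` are hypothetical stand-ins, and the issue does not confirm that dd-trace captures the status this way.

```javascript
const { EventEmitter } = require('events')

// Hypothetical stand-in for a response: it emits 'finish' once the final
// status code is known, as http.ServerResponse does.
class StubResponse extends EventEmitter {
  constructor () {
    super()
    this.statusCode = 200
  }

  end () {
    this.emit('finish')
  }
}

// Hypothetical span: just a bag of tags.
const span = {
  tags: {},
  setTag (key, value) { this.tags[key] = value }
}

const res = new StubResponse()

// The router decides the route does not exist.
res.statusCode = 404

// Eager capture: the value is read before the WAF has acted.
const eagerStatus = res.statusCode

// Deferred capture: read the status only when the response finishes.
res.once('finish', () => {
  span.setTag('http.status_code', String(res.statusCode))
})

// The WAF blocks, overrides the status, and the response ends.
res.statusCode = 403
res.end()
```

Here `eagerStatus` holds the stale 404, while the tag written in the `finish` listener reflects the 403 that was actually sent.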

Performance Metrics

Load testing revealed consistent performance impact:

| CPU Usage | Mean Response Time | p95 Response Time |
|-----------|--------------------|-------------------|
| <50%      | 77-84ms            | ~500ms            |
| 50-70%    | 250-450ms          | ~1600ms           |
| >70%      | 600-900ms          | >2300ms           |

Performance zones identified:

  • Optimal: Up to 50-60 req/sec (Response time < 500ms at p95)
  • Degraded: 80-100 req/sec (Response time < 1600ms at p95)
  • Critical: >110 req/sec (Unpredictable performance)

Questions

  1. Response Handling:

    • How should AppSec In App WAF handle requests where headers are already sent?
    • What is the correct point in the request lifecycle to perform WAF evaluation?
  2. Status Code Capture:

    • How can we ensure trace spans capture the final response status? Is the current mismatch a bug or a configuration issue?
    • Is there a way to update span data after AppSec modifies the response?
  3. Performance & Timing:

    • Are there recommended approaches for response tracking that minimize overhead?
    • How should AppSec integrate with Next.js routing to avoid race conditions?

Current Investigation

We're examining several areas that may contribute to the timing issues:

  1. DD-Trace Instrumentation:

    • Response hooks
    • HTTP instrumentation
    • Next.js specific code
  2. Request Flow Analysis:

    Client → HTTP Server → Next.js Router → [Middleware] → AppSec WAF → Response
                     ↑
    Headers may be sent here, before WAF evaluation
    
  3. Status Code Capture Timing:

    sequenceDiagram
        Client->>DD-Trace: Request
        DD-Trace->>AppSec: Process Request
        AppSec-->>Client: Return 403 (Actual Response)
        DD-Trace-->>Datadog: Report 404 (Incorrect Trace)
        Note over DD-Trace,Datadog: Status Code Mismatch
    

We believe the core issue may be related to the timing and order of HTTP method instrumentation rather than just the AppSec module itself.

Reproduction Code

No response

Error Logs

No response

Tracer Config

No response

Operating System

AWS Fargate (Linux/ARM64)

Bundling

Next.js
