fix(server): classify client-disconnect during streaming + live-preview e2e scenarios #337

Merged: SantiagoDePolonia merged 5 commits into main from fix/stream-cancel-audit-and-live-preview-scenarios on May 17, 2026
Conversation

@SantiagoDePolonia SantiagoDePolonia commented May 17, 2026

Summary

  • fix(server) — streaming requests that failed before any chunks were flushed were audited as upstream provider errors (status=502, error_type=provider_error, stream=null) even when the cause was a client disconnect mid-handshake. Audit rows now reflect the streaming intent and tag the cause as client_disconnected when ctx.Err() is set or the error wraps context.Canceled / EPIPE / ECONNRESET. Routed StreamChatCompletion, StreamResponses, handleStreamingResponse, and the passthrough flushStream error site through a single handleStreamingDispatchError helper so classification stays consistent across paths.
  • test(e2e) — refreshed S22 to use xai/grok-4.3 (the previous grok-3-mini target was retired from xAI's catalog) and added a new §17 Dashboard live preview section to release-e2e-scenarios.md:
    • S91 idle subscriber receives reset + heartbeat
    • S92 chat completion produces matching audit.* + usage.* events
    • S93 types=usage filter excludes audit.* events
    • S94 invalid cursor → 400 invalid_request_error
    • S95 streaming client disconnect → stream=true, error_type=client_disconnected (regression test for the fix)
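The summary relies on errors that "wrap" context.Canceled, EPIPE, or ECONNRESET. As a minimal sketch of what that means in Go, the example below checks a few standard-library wrapper types with errors.Is. The helper names (matchesDisconnectSentinel, wrappedSamples, sampleMatches) and the sample URL are assumptions made for illustration and are not taken from the PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/url"
	"os"
	"syscall"
)

// matchesDisconnectSentinel reports whether err's chain contains one of the
// sentinels named in the summary. Hypothetical helper, not the PR's code.
func matchesDisconnectSentinel(err error) bool {
	return errors.Is(err, context.Canceled) ||
		errors.Is(err, syscall.EPIPE) ||
		errors.Is(err, syscall.ECONNRESET)
}

// wrappedSamples builds errors shaped like the wrappers the standard library
// returns, so the unwrapping behaviour can be observed without a network.
func wrappedSamples() map[string]error {
	return map[string]error{
		"url-canceled":  &url.Error{Op: "Post", URL: "http://upstream.invalid", Err: context.Canceled},
		"operr-epipe":   &net.OpError{Op: "write", Net: "tcp", Err: os.NewSyscallError("write", syscall.EPIPE)},
		"fmt-connreset": fmt.Errorf("flush: %w", syscall.ECONNRESET),
		"plain":         errors.New("provider returned 500"),
	}
}

// sampleMatches evaluates every sample against the sentinel check.
func sampleMatches() map[string]bool {
	out := make(map[string]bool)
	for name, err := range wrappedSamples() {
		out[name] = matchesDisconnectSentinel(err)
	}
	return out
}

func main() {
	for name, matched := range sampleMatches() {
		fmt.Printf("%-14s matches sentinel: %v\n", name, matched)
	}
}
```

errors.Is follows each Unwrap method in turn, so the check does not need to know which layer produced the wrapper. Whether a match should count as a client disconnect is a separate policy question.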

Test plan

  • go test ./internal/server ./internal/auditlog ./internal/streaming — all green
  • New unit tests: TestHandleStreamingResponse_ClientDisconnectBeforeUpstream, TestRecordStreamingError_ClassifiesClientDisconnect
  • Pre-commit hooks (lint, race tests, mod tidy, fmt, perf guard) pass on both commits
  • Live runner: tests/e2e/run-release-e2e.sh --scenario S22,S91,S92,S93,S94,S95 — all 6 OK
  • Manual end-to-end probe — cancelled streaming request now audits as status=200, stream=true, error_type=client_disconnected instead of status=502, stream=null, error_type=provider_error

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Streaming error handling now distinguishes client disconnects from upstream failures, preventing spurious error responses and correctly marking disconnected streams as client_disconnected.
    • Pre-flush (dispatch) and flush/write-phase failures are classified separately so client disconnects are swallowed while real upstream errors surface.
  • Tests

    • Added unit tests covering client-disconnect scenarios, wrapped syscall cases, and edge/race conditions for streaming error classification.
  • Tests / E2E

    • Expanded release E2E scenarios (S91–S95) for SSE heartbeat, event filtering, audit/usage emission, invalid cursor handling, and auditing of client-disconnected streams.

SantiagoDePolonia and others added 2 commits May 17, 2026 15:12
Streaming requests that failed before any chunks were flushed were
audited as upstream provider errors (status 502, error_type
provider_error, stream null) even when the cause was the client
closing the connection mid-handshake. The audit row no longer
reflected the request's streaming intent at all, hiding cancellations
from monitoring.

Mark the audit entry as a streaming request and tag the error type as
client_disconnected when ctx.Err() is set or the error wraps
context.Canceled / EPIPE / ECONNRESET. Route the dispatch-time
failures (StreamChatCompletion, StreamResponses, handleStreamingResponse,
passthrough flushStream) through a single helper so the classification
stays consistent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the retired grok-3-mini target in S22 with xai/grok-4.3 so
the xAI smoke probe again hits a live model.

Add §17 Dashboard live preview covering /admin/live/logs:

  S91 idle subscriber receives reset + heartbeat events
  S92 chat completion produces matching audit and usage events
  S93 types=usage filter excludes audit events
  S94 invalid cursor is rejected with 400 invalid_request_error
  S95 streaming client disconnect is audited as client_disconnected

S95 acts as the regression test for the audit fix shipped in the
previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
coderabbitai Bot commented May 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 932bc637-0ae0-4653-b387-560f7a71dc3f

📥 Commits

Reviewing files that changed from the base of the PR and between 30aec8e and 87ead6c.

📒 Files selected for processing (2)
  • internal/server/handlers_test.go
  • internal/server/translated_inference_service.go

Walkthrough

Classify streaming errors into client_disconnected vs stream_error, pass request context into recording, route pre-flush/init errors through a dispatch-aware handler that suppresses responses for client disconnects, enrich audit logs with error_type, and add unit + E2E coverage.

Changes

Streaming Error Handling and Client Disconnect Classification

| Layer / File(s) | Summary |
| --- | --- |
| Dispatch wiring and passthrough integration: internal/server/translated_inference_service.go, internal/server/passthrough_support.go | Streaming dispatch/init errors now use handleStreamingDispatchError; passthrough flushStream call sites pass the request context.Context into recordStreamingError. |
| Streaming error recording and disconnect detection: internal/server/translated_inference_service.go | recordStreamingError now accepts context.Context, computes error_type using isClientDisconnect/isClientDisconnectDuringDispatch (checks context.Canceled, syscall.EPIPE, syscall.ECONNRESET, and a canceled-context nil-err fallback), sets ErrorType/ErrorMessage, and logs error_type with stream identifiers. |
| Streaming dispatch error handler: internal/server/translated_inference_service.go | Added handleStreamingDispatchError which enriches streaming audit context and returns nil for dispatch-time client disconnects while delegating other failures to handleError. |
| Streaming error handling unit tests: internal/server/handlers_test.go | Added tests: TestHandleStreamingResponse_ClientDisconnectBeforeUpstream, TestHandleStreamingResponse_UpstreamResetIsNotClassifiedAsClientDisconnect, and TestRecordStreamingError_ClassifiesClientDisconnect covering plain/wrapped cancellations, syscall errors, race cases, and nil-err fallback. |
| E2E scenario matrix updates: tests/e2e/release-e2e-scenarios.md | Updated matrix size to 95 scenarios, changed xAI model to xai/grok-4.3, and added five Dashboard live preview SSE scenarios (S91–S95) covering heartbeat, audit+usage correlation, filtering, cursor validation, and client-disconnected audit classification. |

Sequence Diagram

sequenceDiagram
  participant Client
  participant TranslatedInferenceService
  participant StreamFn
  participant Passthrough
  participant recordStreamingError
  participant AuditLog
  Client->>TranslatedInferenceService: start streaming request
  TranslatedInferenceService->>StreamFn: initialize streamFn (dispatch)
  StreamFn-->>TranslatedInferenceService: init error -> handleStreamingDispatchError
  TranslatedInferenceService->>Passthrough: flushStream / send chunk
  Passthrough->>recordStreamingError: recordStreamingError(ctx, err) on flush failure
  recordStreamingError->>AuditLog: log stream termination with error_type
  Client->>TranslatedInferenceService: disconnect (context canceled)
  TranslatedInferenceService->>recordStreamingError: recordStreamingError(ctx, ctx.Err())
  recordStreamingError->>AuditLog: log event with error_type: client_disconnected

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hop through logs where streams unwind,
A canceled breeze, a socket unkind.
EPIPE, reset — I name the sigh,
"client_disconnected" I softly write.
Rabbit audits the quiet sky.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main changes: fixing client-disconnect classification during streaming and adding live-preview e2e scenarios, which aligns with the PR's primary objectives. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

greptile-apps Bot commented May 17, 2026

Greptile Summary

This PR updates streaming disconnect auditing and expands release validation coverage. It changes:

  • Streaming dispatch errors now mark requests as streaming.
  • Client disconnect errors are classified as client_disconnected.
  • Stream flush error logging now receives the request context.
  • Unit tests cover disconnect classification cases.
  • Release e2e scenarios add live-preview and stream-cancel checks.

Confidence Score: 4/5

This is close, but the fast-path streaming route should be fixed before merging.

  • Most streaming paths now use the new disconnect classifier.

  • Fast-path passthrough streaming chat still uses the old error handler during setup failures.

  • That path can still write incorrect audit rows for client disconnects.

  • internal/server/translated_inference_service.go

Important Files Changed

| Filename | Overview |
| --- | --- |
| internal/server/translated_inference_service.go | Adds the shared streaming error classifier, but one fast-path passthrough setup error branch still bypasses it. |
| internal/server/passthrough_support.go | Passes request context into stream flush error recording for passthrough SSE responses. |

Comments Outside Diff (1)

  1. internal/server/translated_inference_service.go, lines 356-364

    P1 Classify passthrough disconnects

    This streaming chat path still sends passthrough setup failures through handleError. For fast-path eligible streams, a client disconnect during the upstream handshake makes Passthrough return an error that wraps context.Canceled, but this branch audits it as a provider error instead of marking the request as stream=true with client_disconnected. That leaves a common streaming route with the old incorrect audit result this PR is meant to fix.

Reviews (2): Last reviewed commit: "fix(server): guard recordStreamingError ..."

Comment thread internal/server/translated_inference_service.go

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/server/handlers_test.go`:
- Around line 2688-2702: Add explicit coverage for syscall disconnect errors in
TestRecordStreamingError_ClassifiesClientDisconnect by invoking
recordStreamingError with errors that are syscall.EPIPE and syscall.ECONNRESET
(use syscall.EPIPE and syscall.ECONNRESET wrapped or passed so errors.Is can
detect them) and asserting each resulting auditlog.LogEntry.ErrorType equals
"client_disconnected"; keep the existing canceled-context and generic
stream_error checks, but add two new entries (e.g., entry3/entry4) and
corresponding t.Fatalf assertions to prevent regressions in
recordStreamingError's disconnect classification.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: efd71b75-d478-44b6-8797-6c39899f50a2

📥 Commits

Reviewing files that changed from the base of the PR and between 3a4cfe8 and 554ea8f.

📒 Files selected for processing (4)
  • internal/server/handlers_test.go
  • internal/server/passthrough_support.go
  • internal/server/translated_inference_service.go
  • tests/e2e/release-e2e-scenarios.md

codecov-commenter commented May 17, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 97.14286% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/server/passthrough_support.go | 0.00% | 1 Missing ⚠️ |

isClientDisconnect returned true for any error once ctx.Err was set, so
a real upstream stream failure that races with a client disconnect was
audited as client_disconnected. That hides upstream incidents from the
audit log.

Require the error itself to be context.Canceled / syscall.EPIPE /
syscall.ECONNRESET (or a chain that unwraps to one). The context-only
path now only fires when no concrete error was returned by the call.

Expand TestRecordStreamingError_ClassifiesClientDisconnect into a
table-driven test covering syscall.EPIPE, syscall.ECONNRESET, wrapped
context.Canceled, the race case that must stay stream_error, and the
clean-context generic error case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
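The tightened rule in this commit can be sketched as follows. This is a hypothetical reconstruction from the commit message only: the function body, the errorType wrapper, and the demoCases helper are assumptions, and the real isClientDisconnect may differ in detail.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"syscall"
)

// isClientDisconnect is an assumed reconstruction of the write-phase
// classifier: the error itself must unwrap to a disconnect sentinel, and the
// context-only path fires only when no concrete error was returned.
func isClientDisconnect(ctx context.Context, err error) bool {
	if err == nil {
		return ctx != nil && errors.Is(ctx.Err(), context.Canceled)
	}
	return errors.Is(err, context.Canceled) ||
		errors.Is(err, syscall.EPIPE) ||
		errors.Is(err, syscall.ECONNRESET)
}

// errorType maps the classification onto the two audit values named in the PR.
func errorType(ctx context.Context, err error) string {
	if isClientDisconnect(ctx, err) {
		return "client_disconnected"
	}
	return "stream_error"
}

// demoCases mirrors the rows the commit message lists for the table test.
func demoCases() map[string]string {
	canceled, cancel := context.WithCancel(context.Background())
	cancel() // simulate a client that has already gone away
	live := context.Background()
	upstream := errors.New("upstream stream failure")

	return map[string]string{
		"epipe":            errorType(live, syscall.EPIPE),
		"connreset":        errorType(live, syscall.ECONNRESET),
		"wrapped-canceled": errorType(live, fmt.Errorf("read chunk: %w", context.Canceled)),
		"race":             errorType(canceled, upstream), // must stay stream_error
		"generic":          errorType(live, upstream),
	}
}

func main() {
	for name, got := range demoCases() {
		fmt.Printf("%-17s -> %s\n", name, got)
	}
}
```

The "race" row is the case the commit is about: the context is already canceled, but the concrete error is an upstream failure, so it is not relabelled as a client disconnect.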

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/server/translated_inference_service.go (1)

520-531: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard the nil-error disconnect path before writing audit data.

isClientDisconnect now explicitly supports err == nil && ctx.Err() == context.Canceled, but recordStreamingError still unconditionally calls err.Error(). On that path this will panic while recording the disconnect.

Suggested fix
 func recordStreamingError(streamEntry *auditlog.LogEntry, model, provider, path, requestID string, ctx context.Context, err error) {
 	errorType := "stream_error"
+	errorMessage := ""
+	logErr := err
 	if isClientDisconnect(ctx, err) {
 		errorType = "client_disconnected"
 	}
+	if err != nil {
+		errorMessage = err.Error()
+	} else if ctx != nil && ctx.Err() != nil {
+		errorMessage = ctx.Err().Error()
+		logErr = ctx.Err()
+	}
 
 	if streamEntry != nil {
 		streamEntry.ErrorType = errorType
 		if streamEntry.Data == nil {
 			streamEntry.Data = &auditlog.LogData{}
 		}
-		streamEntry.Data.ErrorMessage = err.Error()
+		streamEntry.Data.ErrorMessage = errorMessage
 	}
 
 	slog.Warn("stream terminated abnormally",
-		"error", err,
+		"error", logErr,
 		"error_type", errorType,
 		"model", model,
 		"provider", provider,
 		"path", path,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/server/translated_inference_service.go` around lines 520 - 531, The
recordStreamingError function writes err.Error() without checking for nil;
update recordStreamingError to handle the disconnect path when err == nil by
checking err != nil before calling err.Error() and set
streamEntry.Data.ErrorMessage to a safe value (e.g., ctx.Err().Error() or an
explicit "client disconnected") when err is nil; adjust the logic around
isClientDisconnect(ctx, err) and the block that assigns
streamEntry.Data.ErrorMessage inside recordStreamingError so it never calls
err.Error() on a nil error while still recording a meaningful message.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/server/handlers_test.go`:
- Around line 2690-2753: The test table in
TestRecordStreamingError_ClassifiesClientDisconnect misses the case where err ==
nil but the context has been canceled (ctx.Err() == context.Canceled); add a
test entry using the existing canceledCtx with err: nil and wantType:
"client_disconnected" so the new production branch (err == nil && ctx.Err() ==
context.Canceled) is exercised by recordStreamingError and guarded against
regressions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c54bf5c0-3ebb-42ac-a600-5ab14ebb0a15

📥 Commits

Reviewing files that changed from the base of the PR and between 554ea8f and 10b766a.

📒 Files selected for processing (2)
  • internal/server/handlers_test.go
  • internal/server/translated_inference_service.go

Comment thread internal/server/handlers_test.go
isClientDisconnect classifies (ctx canceled, err == nil) as a client
disconnect, but recordStreamingError still called err.Error()
unconditionally. The nil-err branch was unreachable from today's two
callsites (both gate on err != nil) but it is documented behaviour
that a future caller could rely on, and a latent nil deref is not the
right place to leave a footgun.

Compute the audit message defensively: prefer err.Error() when
available, otherwise fall back to the context error. The slog "error"
field follows the same source so the log line never carries a stale
nil.

Extend the classifier test with a (canceledCtx, nil) row and assert
the recorded error_message on every row to lock in both the
classification and the message-source fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
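The message-source fallback described in this commit might look like the sketch below. The name auditMessage and the demoMessages helper are hypothetical; the commit only states that err.Error() is preferred and the context error is the fallback.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// auditMessage picks the text to record without ever calling Error() on a
// nil error. Assumed shape, not the PR's actual code.
func auditMessage(ctx context.Context, err error) string {
	if err != nil {
		return err.Error()
	}
	if ctx != nil && ctx.Err() != nil {
		return ctx.Err().Error()
	}
	return ""
}

// demoMessages covers the concrete-error row and the (canceledCtx, nil) row.
func demoMessages() map[string]string {
	canceled, cancel := context.WithCancel(context.Background())
	cancel()

	return map[string]string{
		"concrete-error":   auditMessage(canceled, errors.New("write failed")),
		"nil-err-canceled": auditMessage(canceled, nil),
		"nil-err-live":     auditMessage(context.Background(), nil),
	}
}

func main() {
	for name, msg := range demoMessages() {
		fmt.Printf("%-16s -> %q\n", name, msg)
	}
}
```

When both sources are available the concrete error wins, which keeps the audit message tied to what the failing call actually returned.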

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/server/translated_inference_service.go`:
- Around line 513-515: The current use of isClientDisconnect in
handleStreamingDispatchError classifies syscall.EPIPE/ECONNRESET as client
disconnects even on pre-flush/dispatch failures; change the logic so that on
early dispatch/init failure (before any response bytes are written) only
context.Canceled / ctx.Err() are treated as client disconnects, while
EPIPE/ECONNRESET checks are reserved for response-write/flush error paths;
update handleStreamingDispatchError and the isClientDisconnect call sites
(including the similar block around the other check at lines mentioned) to
branch based on whether the response has been written (or use a boolean/flag
indicating flush/write phase) and only perform syscall.EPIPE/ECONNRESET matching
in the write/flush branch so upstream socket resets aren’t swallowed as empty
200s.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c250ea96-4e1d-452e-8624-0a19cd797f61

📥 Commits

Reviewing files that changed from the base of the PR and between 10b766a and 30aec8e.

📒 Files selected for processing (2)
  • internal/server/handlers_test.go
  • internal/server/translated_inference_service.go

Comment thread internal/server/translated_inference_service.go Outdated
…atch

handleStreamingDispatchError ran the same isClientDisconnect classifier
as the write-phase recordStreamingError. That helper treats
syscall.EPIPE / syscall.ECONNRESET as client disconnects, which is
correct only after the gateway has begun writing the SSE response to
the client. Before the first chunk is flushed the only socket in play
is the upstream provider connection, so an EPIPE / ECONNRESET there
belongs to the provider and must surface as an upstream failure - not
be swallowed as client_disconnected and returned as an empty 200.

Introduce isClientDisconnectDuringDispatch covering only request-
context cancellation (errors.Is(context.Canceled) or the err==nil race
fallback) and wire handleStreamingDispatchError to it. Keep
isClientDisconnect unchanged for the write-phase callers in
recordStreamingError where EPIPE / ECONNRESET genuinely signal the
downstream client going away.

Pin the new boundary with a test that feeds bare and wrapped EPIPE /
ECONNRESET into handleStreamingResponse via streamFn and asserts the
audit entry is not classified as client_disconnected and the response
is not an empty 200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
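To make the dispatch/write boundary concrete, here is a minimal sketch of a dispatch-phase check next to a write-phase check. Both function bodies are assumptions inferred from the commit message, and demoBoundary is a hypothetical helper used only to show the difference.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"syscall"
)

// isClientDisconnectDuringDispatch (assumed shape) accepts only request-context
// cancellation, because before the first flush the only socket in play is the
// upstream provider connection.
func isClientDisconnectDuringDispatch(ctx context.Context, err error) bool {
	if err == nil {
		return ctx != nil && errors.Is(ctx.Err(), context.Canceled)
	}
	return errors.Is(err, context.Canceled)
}

// isClientDisconnect (assumed shape) additionally accepts EPIPE / ECONNRESET,
// which signal the downstream client going away once writing has begun.
func isClientDisconnect(ctx context.Context, err error) bool {
	return isClientDisconnectDuringDispatch(ctx, err) ||
		errors.Is(err, syscall.EPIPE) ||
		errors.Is(err, syscall.ECONNRESET)
}

// demoBoundary classifies the same errors in both phases.
func demoBoundary() map[string]bool {
	ctx := context.Background()
	reset := fmt.Errorf("dial upstream: %w", syscall.ECONNRESET)
	canceled := fmt.Errorf("dispatch: %w", context.Canceled)

	return map[string]bool{
		"dispatch/connreset": isClientDisconnectDuringDispatch(ctx, reset),
		"write/connreset":    isClientDisconnect(ctx, reset),
		"dispatch/epipe":     isClientDisconnectDuringDispatch(ctx, syscall.EPIPE),
		"dispatch/canceled":  isClientDisconnectDuringDispatch(ctx, canceled),
	}
}

func main() {
	for name, disconnected := range demoBoundary() {
		fmt.Printf("%-19s client disconnect: %v\n", name, disconnected)
	}
}
```

The same ECONNRESET is treated as an upstream failure at dispatch time and as a client disconnect in the write phase, which is the asymmetry the new test pins down.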
@SantiagoDePolonia SantiagoDePolonia merged commit 4490b73 into main May 17, 2026
19 checks passed