fix(server): classify client-disconnect during streaming + live-preview e2e scenarios#337
Conversation
Streaming requests that failed before any chunks were flushed were audited as upstream provider errors (status 502, error_type provider_error, stream null) even when the cause was the client closing the connection mid-handshake. The audit row no longer reflected the request's streaming intent at all, hiding cancellations from monitoring. Mark the audit entry as a streaming request and tag the error type as client_disconnected when ctx.Err is set or the error wraps context.Canceled / EPIPE / ECONNRESET. Routed both the dispatch-time failures (StreamChatCompletion, StreamResponses, handleStreamingResponse, passthrough flushStream) through a single helper so the classification stays consistent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the retired grok-3-mini target in S22 with xai/grok-4.3 so the xAI smoke probe again hits a live model. Add §17 Dashboard live preview covering /admin/live/logs: S91 idle subscriber receives reset + heartbeat events S92 chat completion produces matching audit and usage events S93 types=usage filter excludes audit events S94 invalid cursor is rejected with 400 invalid_request_error S95 streaming client disconnect is audited as client_disconnected S95 acts as the regression test for the audit fix shipped in the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughClassify streaming errors into client_disconnected vs stream_error, pass request context into recording, route pre-flush/init errors through a dispatch-aware handler that suppresses responses for client disconnects, enrich audit logs with error_type, and add unit + E2E coverage. ChangesStreaming Error Handling and Client Disconnect Classification
Sequence DiagramsequenceDiagram
participant Client
participant TranslatedInferenceService
participant StreamFn
participant Passthrough
participant recordStreamingError
participant AuditLog
Client->>TranslatedInferenceService: start streaming request
TranslatedInferenceService->>StreamFn: initialize streamFn (dispatch)
StreamFn-->>TranslatedInferenceService: init error -> handleStreamingDispatchError
TranslatedInferenceService->>Passthrough: flushStream / send chunk
Passthrough->>recordStreamingError: recordStreamingError(ctx, err) on flush failure
recordStreamingError->>AuditLog: log stream termination with error_type
Client->>TranslatedInferenceService: disconnect (context canceled)
TranslatedInferenceService->>recordStreamingError: recordStreamingError(ctx, ctx.Err())
recordStreamingError->>AuditLog: log event with error_type: client_disconnected
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Greptile SummaryThis PR updates streaming disconnect auditing and expands release validation coverage. It changes:
Confidence Score: 4/5This is close, but the fast-path streaming route should be fixed before merging.
Important Files Changed
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/server/handlers_test.go`:
- Around line 2688-2702: Add explicit coverage for syscall disconnect errors in
TestRecordStreamingError_ClassifiesClientDisconnect by invoking
recordStreamingError with errors that are syscall.EPIPE and syscall.ECONNRESET
(use syscall.EPIPE and syscall.ECONNRESET wrapped or passed so errors.Is can
detect them) and asserting each resulting auditlog.LogEntry.ErrorType equals
"client_disconnected"; keep the existing canceled-context and generic
stream_error checks, but add two new entries (e.g., entry3/entry4) and
corresponding t.Fatalf assertions to prevent regressions in
recordStreamingError's disconnect classification.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: efd71b75-d478-44b6-8797-6c39899f50a2
📒 Files selected for processing (4)
internal/server/handlers_test.gointernal/server/passthrough_support.gointernal/server/translated_inference_service.gotests/e2e/release-e2e-scenarios.md
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
isClientDisconnect returned true for any error once ctx.Err was set, so a real upstream stream failure that races with a client disconnect was audited as client_disconnected. That hides upstream incidents from the audit log. Require the error itself to be context.Canceled / syscall.EPIPE / syscall.ECONNRESET (or a chain that unwraps to one). The context-only path now only fires when no concrete error was returned by the call. Expand TestRecordStreamingError_ClassifiesClientDisconnect into a table-driven test covering syscall.EPIPE, syscall.ECONNRESET, wrapped context.Canceled, the race case that must stay stream_error, and the clean-context generic error case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/server/translated_inference_service.go (1)
520-531:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard the nil-error disconnect path before writing audit data.
isClientDisconnectnow explicitly supportserr == nil && ctx.Err() == context.Canceled, butrecordStreamingErrorstill unconditionally callserr.Error(). On that path this will panic while recording the disconnect.Suggested fix
func recordStreamingError(streamEntry *auditlog.LogEntry, model, provider, path, requestID string, ctx context.Context, err error) { errorType := "stream_error" + errorMessage := "" + logErr := err if isClientDisconnect(ctx, err) { errorType = "client_disconnected" } + if err != nil { + errorMessage = err.Error() + } else if ctx != nil && ctx.Err() != nil { + errorMessage = ctx.Err().Error() + logErr = ctx.Err() + } if streamEntry != nil { streamEntry.ErrorType = errorType if streamEntry.Data == nil { streamEntry.Data = &auditlog.LogData{} } - streamEntry.Data.ErrorMessage = err.Error() + streamEntry.Data.ErrorMessage = errorMessage } slog.Warn("stream terminated abnormally", - "error", err, + "error", logErr, "error_type", errorType, "model", model, "provider", provider, "path", path,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/server/translated_inference_service.go` around lines 520 - 531, The recordStreamingError function writes err.Error() without checking for nil; update recordStreamingError to handle the disconnect path when err == nil by checking err != nil before calling err.Error() and set streamEntry.Data.ErrorMessage to a safe value (e.g., ctx.Err().Error() or an explicit "client disconnected") when err is nil; adjust the logic around isClientDisconnect(ctx, err) and the block that assigns streamEntry.Data.ErrorMessage inside recordStreamingError so it never calls err.Error() on a nil error while still recording a meaningful message.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/server/handlers_test.go`:
- Around line 2690-2753: The test table in
TestRecordStreamingError_ClassifiesClientDisconnect misses the case where err ==
nil but the context has been canceled (ctx.Err() == context.Canceled); add a
test entry using the existing canceledCtx with err: nil and wantType:
"client_disconnected" so the new production branch (err == nil && ctx.Err() ==
context.Canceled) is exercised by recordStreamingError and guarded against
regressions.
---
Outside diff comments:
In `@internal/server/translated_inference_service.go`:
- Around line 520-531: The recordStreamingError function writes err.Error()
without checking for nil; update recordStreamingError to handle the disconnect
path when err == nil by checking err != nil before calling err.Error() and set
streamEntry.Data.ErrorMessage to a safe value (e.g., ctx.Err().Error() or an
explicit "client disconnected") when err is nil; adjust the logic around
isClientDisconnect(ctx, err) and the block that assigns
streamEntry.Data.ErrorMessage inside recordStreamingError so it never calls
err.Error() on a nil error while still recording a meaningful message.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c54bf5c0-3ebb-42ac-a600-5ab14ebb0a15
📒 Files selected for processing (2)
internal/server/handlers_test.gointernal/server/translated_inference_service.go
isClientDisconnect classifies (ctx canceled, err == nil) as a client disconnect, but recordStreamingError still called err.Error() unconditionally. The nil-err branch was unreachable from today's two callsites (both gate on err != nil) but it is documented behaviour that a future caller could rely on, and a latent nil deref is not the right place to leave a footgun. Compute the audit message defensively: prefer err.Error() when available, otherwise fall back to the context error. The slog "error" field follows the same source so the log line never carries a stale nil. Extend the classifier test with a (canceledCtx, nil) row and assert the recorded error_message on every row to lock in both the classification and the message-source fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/server/translated_inference_service.go`:
- Around line 513-515: The current use of isClientDisconnect in
handleStreamingDispatchError classifies syscall.EPIPE/ECONNRESET as client
disconnects even on pre-flush/dispatch failures; change the logic so that on
early dispatch/init failure (before any response bytes are written) only
context.Canceled / ctx.Err() are treated as client disconnects, while
EPIPE/ECONNRESET checks are reserved for response-write/flush error paths;
update handleStreamingDispatchError and the isClientDisconnect call sites
(including the similar block around the other check at lines mentioned) to
branch based on whether the response has been written (or use a boolean/flag
indicating flush/write phase) and only perform syscall.EPIPE/ECONNRESET matching
in the write/flush branch so upstream socket resets aren’t swallowed as empty
200s.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c250ea96-4e1d-452e-8624-0a19cd797f61
📒 Files selected for processing (2)
internal/server/handlers_test.gointernal/server/translated_inference_service.go
…atch handleStreamingDispatchError ran the same isClientDisconnect classifier as the write-phase recordStreamingError. That helper treats syscall.EPIPE / syscall.ECONNRESET as client disconnects, which is correct only after the gateway has begun writing the SSE response to the client. Before the first chunk is flushed the only socket in play is the upstream provider connection, so an EPIPE / ECONNRESET there belongs to the provider and must surface as an upstream failure - not be swallowed as client_disconnected and returned as an empty 200. Introduce isClientDisconnectDuringDispatch covering only request- context cancellation (errors.Is(context.Canceled) or the err==nil race fallback) and wire handleStreamingDispatchError to it. Keep isClientDisconnect unchanged for the write-phase callers in recordStreamingError where EPIPE / ECONNRESET genuinely signal the downstream client going away. Pin the new boundary with a test that feeds bare and wrapped EPIPE / ECONNRESET into handleStreamingResponse via streamFn and asserts the audit entry is not classified as client_disconnected and the response is not an empty 200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
status=502,error_type=provider_error,stream=null) even when the cause was a client disconnect mid-handshake. Audit rows now reflect the streaming intent and tag the cause asclient_disconnectedwhenctx.Err()is set or the error wrapscontext.Canceled/EPIPE/ECONNRESET. RoutedStreamChatCompletion,StreamResponses,handleStreamingResponse, and the passthroughflushStreamerror site through a singlehandleStreamingDispatchErrorhelper so classification stays consistent across paths.S22to usexai/grok-4.3(the previousgrok-3-minitarget was retired from xAI's catalog) and added a new §17 Dashboard live preview section torelease-e2e-scenarios.md:S91idle subscriber receivesreset+heartbeatS92chat completion produces matchingaudit.*+usage.*eventsS93types=usagefilter excludesaudit.*eventsS94invalid cursor →400 invalid_request_errorS95streaming client disconnect →stream=true, error_type=client_disconnected(regression test for the fix)Test plan
go test ./internal/server ./internal/auditlog ./internal/streaming— all greenTestHandleStreamingResponse_ClientDisconnectBeforeUpstream,TestRecordStreamingError_ClassifiesClientDisconnecttests/e2e/run-release-e2e.sh --scenario S22,S91,S92,S93,S94,S95— all 6 OKstatus=200, stream=true, error_type=client_disconnectedinstead ofstatus=502, stream=null, error_type=provider_error🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests
Tests / E2E