
fix(audit): improve failover and timeout visibility #239

Merged
SantiagoDePolonia merged 1 commit into main from feat/increased-audit-log-visibility on Apr 17, 2026

@SantiagoDePolonia (Contributor) commented Apr 17, 2026

Summary

  • persist failover target details in audit logs and surface failover in workflow charts
  • record timeout/provider errors with the correct HTTP status code and make them searchable/filterable in audit logs
  • preserve the primary route in failover audit charts so cross-provider fallbacks do not render impossible provider/model combinations

Testing

  • go test ./internal/llmclient ./internal/auditlog ./internal/server ./internal/admin
  • node --test internal/admin/dashboard/static/js/modules/workflows.test.js internal/admin/dashboard/static/js/modules/workflows-layout.test.js internal/admin/dashboard/static/js/modules/dashboard-layout.test.js internal/admin/dashboard/static/js/modules/dashboard-display.test.js internal/admin/dashboard/static/js/modules/audit-list.test.js

Summary by CodeRabbit

  • New Features

    • Added failover routing support for workflows with configurable fallback targets
    • Extended audit log search functionality to include error messages
    • Added failover metadata display in audit logs and workflow visualization
  • Bug Fixes

    • Improved timeout error handling with proper gateway timeout status codes
  • Documentation

    • Added 504 (Gateway Timeout) status code option to audit filter

@coderabbitai (bot, Contributor) commented Apr 17, 2026

Walkthrough

This PR introduces failover metadata tracking throughout the system. It adds a new Failover field to audit logs to record fallback target models, implements error-to-status-code mapping for timeout classification, updates the admin dashboard UI to display failover information, and modifies request handlers to enrich audit entries with failover details during fallback execution.
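
As a rough illustration of the audit-log shape described above, here is a minimal sketch. Only `LogData`, `Failover`, `FailoverSnapshot`, `TargetModel`, and `EnrichEntryWithFailover` are named in this PR; the `LogEntry` wrapper, the JSON tags, the function signature, and the `cloneFailover` helper are assumptions made for the example and may differ from the real code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FailoverSnapshot records which model served the request after a fallback.
type FailoverSnapshot struct {
	TargetModel string `json:"target_model"` // JSON tag is an assumption
}

// LogData is reduced here to the single field relevant to this PR.
type LogData struct {
	Failover *FailoverSnapshot `json:"failover,omitempty"`
}

// LogEntry is a hypothetical stand-in for the real audit entry type.
type LogEntry struct {
	Data *LogData `json:"data,omitempty"`
}

// EnrichEntryWithFailover uses a hypothetical signature. It is a no-op for a
// nil entry or an empty model, and otherwise records the failover target.
func EnrichEntryWithFailover(entry *LogEntry, targetModel string) {
	if entry == nil || targetModel == "" {
		return
	}
	if entry.Data == nil {
		entry.Data = &LogData{}
	}
	entry.Data.Failover = &FailoverSnapshot{TargetModel: targetModel}
}

// cloneFailover (hypothetical) deep-copies the snapshot so that a streaming
// entry does not share a pointer with the base entry it was copied from.
func cloneFailover(src *FailoverSnapshot) *FailoverSnapshot {
	if src == nil {
		return nil
	}
	dup := *src
	return &dup
}

func main() {
	entry := &LogEntry{}
	EnrichEntryWithFailover(entry, "fallback-model")

	// Round-trip through JSON, as the extended tests are described as doing.
	out, err := json.Marshal(entry)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))

	// The copy can be changed without touching the original snapshot.
	streamCopy := cloneFailover(entry.Data.Failover)
	streamCopy.TargetModel = "changed-in-copy"
	fmt.Println(entry.Data.Failover.TargetModel)
}
```

Using a pointer field with `omitempty` keeps the key out of serialized entries that never failed over, which is one plausible reason for the `*FailoverSnapshot` type.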

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Admin Dashboard UI**: `internal/admin/dashboard/templates/index.html`, `internal/admin/dashboard/templates/workflow-chart.html` | Added audit log status filter option for 504 and new conditional failover metadata badge in audit logs; introduced failover segment in workflow chart between AI and Response nodes with dynamic classes/labels. |
| **Workflow Dashboard Tests**: `internal/admin/dashboard/static/js/modules/dashboard-layout.test.js`, `internal/admin/dashboard/static/js/modules/workflows-layout.test.js` | Updated test assertions to expect 504 status option and failover UI elements (conditional render blocks with `x-show`, dynamic classes, and text bindings for failover connector/node/labels). |
| **Workflow Logic & Helpers**: `internal/admin/dashboard/static/js/modules/workflows.js` | Added failover parsing helpers (`workflowEntryFailover`, `workflowFailoverTarget`); extended runtime model to include `failoverTarget` and provider/model selection logic; added UI model fields and class/label helpers for conditional failover rendering. |
| **Workflow Logic Tests**: `internal/admin/dashboard/static/js/modules/workflows.test.js` | Extended test expectations to include failover-related fields (`showFailover`, `failoverNodeClass`, `failoverConnClass`, `failoverStatusLabel`, `failoverTargetLabel`) and added new test coverage for failover target extraction and cross-provider failover scenarios. |
| **Audit Logging Core**: `internal/auditlog/auditlog.go` | Added new `Failover *FailoverSnapshot` field to `LogData` struct and introduced exported `FailoverSnapshot` type with `TargetModel` field. |
| **Audit Logging Enrichment**: `internal/auditlog/middleware.go` | Added two new exported functions, `EnrichEntryWithFailover` and `EnrichLogEntryWithFailover`, to record failover target metadata on audit entries. |
| **Audit Logging Tests**: `internal/auditlog/auditlog_test.go`, `internal/auditlog/reader_sqlite_boundary_test.go` | Extended test fixtures with non-nil `Failover` field; added JSON round-trip and copy/propagation assertions; added new boundary test verifying error message search filtering. |
| **Audit Log Readers**: `internal/auditlog/reader_postgresql.go`, `internal/auditlog/reader_sqlite.go` | Expanded free-text search to match against `error_message` JSON field in addition to top-level columns using `ILIKE`/`LIKE` with escape clause. |
| **Audit Streaming**: `internal/auditlog/stream_wrapper.go` | Modified `CreateStreamEntry` to deep-copy `Failover` field from base entry data into streaming entry copy. |
| **HTTP Client Error Handling**: `internal/llmclient/client.go`, `internal/llmclient/client_test.go` | Added centralized error-to-status-code mapping via `providerErrorStatusCode` and `isTimeoutError` helpers; updated request/response error paths to classify timeouts as `http.StatusGatewayTimeout` (504) instead of always using 502; added two new tests for timeout scenarios. |
| **Request Handler Audit Integration**: `internal/server/internal_chat_completion_executor.go`, `internal/server/translated_inference_service.go`, `internal/server/fallback_test.go` | Modified handlers to track `failoverModel` throughout request execution and enrich audit entries with `auditlog.EnrichEntryWithFailover(...)` when fallback is used; updated fallback helper signatures to return the resolved failover model instead of a boolean; updated tests to assert `entry.Data.Failover.TargetModel` matches fallback selection. |
| **API Documentation**: `internal/admin/handler.go` | Updated Swagger parameter description for search filter to indicate it searches `error_type`/`error_message` instead of only `error_type`. |

Sequence Diagram

sequenceDiagram
    participant Client
    participant Handler
    participant HTTPClient
    participant AuditLog
    participant Fallback

    Client->>Handler: ChatCompletion Request
    Handler->>HTTPClient: Send to Primary Provider
    HTTPClient-->>HTTPClient: Timeout Error
    HTTPClient->>HTTPClient: Map to 504 Status
    HTTPClient-->>Handler: Error Response
    Handler->>Fallback: Execute Fallback/Failover
    Fallback->>HTTPClient: Send to Fallback Provider
    HTTPClient-->>Fallback: Success Response
    Fallback-->>Handler: Return with failoverModel
    Handler->>AuditLog: EnrichEntryWithFailover(failoverModel)
    AuditLog-->>Handler: Audit Entry Updated
    Handler-->>Client: Return Response

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

release:internal

Poem

🐰 When timeouts strike and errors rise,
A failover path saves the skies,
With 504s and fallbacks true,
The audit log remembers you—
Tracking every model's graceful flight!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 14.81% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main changes: improving visibility of failover events and timeout errors in audit logs, which aligns with the key objectives of persisting failover details, recording timeout status codes, and making errors searchable. |


Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.



@coderabbitai (bot, Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/llmclient/client.go`:
- Around line 676-692: The string-match fallback in isTimeoutError currently
checks for "client.timeout exceeded" and "timeout awaiting response headers" but
misses the body-read phrasing. Update isTimeoutError to also check the lowercased
error message for the substring "client.timeout or context canceled", so that
non-standard transports that surface that wording are treated as timeouts. Locate
isTimeoutError in internal/llmclient/client.go and add the additional
strings.Contains check to the existing fallback logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 64459caf-3769-4584-91eb-0cfe8bafecfe

📥 Commits

Reviewing files that changed from the base of the PR and between 6b669da and cb4443a.

📒 Files selected for processing (19)
  • internal/admin/dashboard/static/js/modules/dashboard-layout.test.js
  • internal/admin/dashboard/static/js/modules/workflows-layout.test.js
  • internal/admin/dashboard/static/js/modules/workflows.js
  • internal/admin/dashboard/static/js/modules/workflows.test.js
  • internal/admin/dashboard/templates/index.html
  • internal/admin/dashboard/templates/workflow-chart.html
  • internal/admin/handler.go
  • internal/auditlog/auditlog.go
  • internal/auditlog/auditlog_test.go
  • internal/auditlog/middleware.go
  • internal/auditlog/reader_postgresql.go
  • internal/auditlog/reader_sqlite.go
  • internal/auditlog/reader_sqlite_boundary_test.go
  • internal/auditlog/stream_wrapper.go
  • internal/llmclient/client.go
  • internal/llmclient/client_test.go
  • internal/server/fallback_test.go
  • internal/server/internal_chat_completion_executor.go
  • internal/server/translated_inference_service.go

Comment on lines +676 to +692
func isTimeoutError(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, context.DeadlineExceeded) {
		return true
	}

	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}

	message := strings.ToLower(err.Error())
	return strings.Contains(message, "client.timeout exceeded") ||
		strings.Contains(message, "timeout awaiting response headers")
}

⚠️ Potential issue | 🟡 Minor

❓ Verification inconclusive

What exact error message does Go's net/http return when http.Client.Timeout fires during response body read vs while awaiting headers?


When http.Client.Timeout fires while awaiting headers (before receiving the response), the error returned by client.Do/Get/etc. is a *url.Error wrapping the underlying error with the message "(Client.Timeout exceeded while awaiting headers)" appended, e.g., `Get "https://example.com": net/http: request canceled (Client.Timeout exceeded while awaiting headers)`.

When the timeout fires during reading the response body (after headers received, during io.ReadAll(resp.Body) or similar), the resp.Body.Read returns an error wrapped by *http.timeoutError with "(Client.Timeout or context canceled)" appended to the underlying read error, e.g., `read tcp ...: i/o timeout (Client.Timeout or context canceled)`. This is due to the cancelTimerBody wrapper around the body, which detects the timeout and augments the error. Note that issue #58279 discusses this returning an unexposed *http.httpError type, but it implements net.Error.Timeout=true and can be detected that way.


Minor: string-match fallback incompletely covers timeout scenarios.

The errors.Is / net.Error.Timeout() checks reliably catch http.Client.Timeout for both awaiting-headers and body-read phases because Go's internal timeout errors implement net.Error in both cases, so production code will work as-is. However, the string fallback provides a safety net for non-standard transports that might surface timeout errors without the net.Error interface. The fallback currently matches "client.timeout exceeded" (from awaiting-headers) and "timeout awaiting response headers", but not the body-read timeout phrasing "client.timeout or context canceled". Consider adding that pattern for completeness, or document that the fallback is intentionally limited.

♻️ Suggested tweak
 	message := strings.ToLower(err.Error())
 	return strings.Contains(message, "client.timeout exceeded") ||
+		strings.Contains(message, "client.timeout or context canceled") ||
 		strings.Contains(message, "timeout awaiting response headers")

@SantiagoDePolonia SantiagoDePolonia merged commit ac7a7f5 into main Apr 17, 2026
19 checks passed
@SantiagoDePolonia SantiagoDePolonia deleted the feat/increased-audit-log-visibility branch April 25, 2026 11:59
