
feat: add manual failover management #444

Merged
SantiagoDePolonia merged 25 commits into main from feature/failover-management
Jun 30, 2026

Conversation

@SantiagoDePolonia

@SantiagoDePolonia SantiagoDePolonia commented Jun 29, 2026

Contributor

Summary

  • add manual-only failover runtime rules with env/config/dashboard support
  • keep deprecated default_mode accepted but ignored at runtime
  • store failover rules separately from virtual models and expose admin APIs/dashboard UI
  • add failover reset and automatic suggestion generation for dashboard workflows

Verification

  • go test ./cmd/... gomodel/internal/admin
    gomodel/internal/admin/dashboard
    gomodel/internal/anthropicapi
    gomodel/internal/app
    gomodel/internal/auditlog
    gomodel/internal/authkeys
    gomodel/internal/batch
    gomodel/internal/batchrewrite
    gomodel/internal/budget
    gomodel/internal/cache
    gomodel/internal/cache/modelcache
    gomodel/internal/conversationstore
    gomodel/internal/core
    gomodel/internal/embedding
    gomodel/internal/failover
    gomodel/internal/fallback
    gomodel/internal/filestore
    gomodel/internal/gateway
    gomodel/internal/guardrails
    gomodel/internal/httpclient
    gomodel/internal/live
    gomodel/internal/llmclient
    gomodel/internal/modeldata
    gomodel/internal/modelselectors
    gomodel/internal/observability
    gomodel/internal/pricingoverrides
    gomodel/internal/providers
    gomodel/internal/providers/anthropic
    gomodel/internal/providers/azure
    gomodel/internal/providers/bailian
    gomodel/internal/providers/bedrock
    gomodel/internal/providers/deepseek
    gomodel/internal/providers/gemini
    gomodel/internal/providers/googlecommon
    gomodel/internal/providers/groq
    gomodel/internal/providers/minimax
    gomodel/internal/providers/ollama
    gomodel/internal/providers/openai
    gomodel/internal/providers/opencodego
    gomodel/internal/providers/openrouter
    gomodel/internal/providers/oracle
    gomodel/internal/providers/vertex
    gomodel/internal/providers/vllm
    gomodel/internal/providers/xai
    gomodel/internal/providers/xiaomi
    gomodel/internal/providers/zai
    gomodel/internal/realtime
    gomodel/internal/responsecache
    gomodel/internal/responsestore
    gomodel/internal/server
    gomodel/internal/storage
    gomodel/internal/storage/sqlutil
    gomodel/internal/streaming
    gomodel/internal/usage
    gomodel/internal/validation
    gomodel/internal/version
    gomodel/internal/virtualmodels
    gomodel/internal/workflows ./config/...
  • node --test internal/admin/dashboard/static/js/modules/*.test.cjs
  • pre-commit hooks: make test-race, go mod tidy, dashboard JS tests, performance guard, make lint, mint validate

Summary by CodeRabbit

  • New Features
    • Added admin failover management endpoints plus a dashboard failover editor with draft generation, saving, and reset, controlled by FAILOVER_ENABLED.
    • Introduced manual-only failover configuration (fallback.enabled, JSON-based rules, and disabled-model controls).
    • Expanded audit logs with per-attempt request/response tabbing and compact live preview attempt details.
    • Added token throughput usage endpoint and token breakdown fields in usage responses.
  • Bug Fixes
    • Audit failover display now preserves executed routing details and shows the correct executed target.
  • Documentation
    • Updated failover documentation and configuration examples for the new manual model and runtime flag behavior.

@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Contributor


Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2613521c-8b93-4ca1-b228-74dc78c9bb2d

📥 Commits

Reviewing files that changed from the base of the PR and between c5701b5 and 5619d2a.

📒 Files selected for processing (4)
  • config/config_test.go
  • internal/failover/service.go
  • internal/failover/service_test.go
  • internal/failover/store_postgresql.go
📝 Walkthrough

Walkthrough

Adds dashboard-managed failover rules with a new runtime enablement flag, and adds provider-attempt capture, persistence, hydration, and rendering across gateway, server, audit, and dashboard flows.

Changes

Failover Rules Management

| Layer / File(s) | Summary |
| --- | --- |
| Config and runtime flag: `config/*`, `internal/fallback/*`, `internal/admin/*`, `internal/app/*`, `.env.template`, `docs/features/failover.mdx`, `internal/admin/dashboard/static/js/modules/workflows.*` | FallbackConfig gains enabled/manual/disabled inputs, runtime gating moves to FAILOVER_ENABLED, and related docs, example config, resolver behavior, and tests are updated. |
| Failover service and stores: `internal/failover/*` | Defines failover rule/view types, store helpers, SQLite/PostgreSQL/MongoDB persistence, the in-memory service, and the factory lifecycle wrapper, with migration and refresh coverage. |
| Admin API and wiring: `internal/admin/handler.go`, `handler_failover.go`, `routes.go`, `docs/openapi.json`, tests | Wires failover into admin handling, registers the /admin/failover API surface, and updates the dashboard runtime response contract and related tests. |
| Dashboard failover editor: `internal/admin/dashboard/static/js/modules/failover.js`, `dashboard.js`, templates, CSS, tests | Adds the dashboard module, modals, row actions, settings actions, icon, styling, and JS/test wiring for viewing and editing failover mappings. |

Provider Attempt Tracking and Audit

| Layer / File(s) | Summary |
| --- | --- |
| Attempt model and capture helpers: `internal/auditlog/*`, `internal/core/errors.go`, `internal/llmclient/client.go` | Adds AttemptSnapshot to audit data, capture/redaction helpers, bounded upstream error body/header fields, and upstream response header propagation. |
| Gateway attempt recording: `internal/gateway/attempts.go`, `fallback.go`, `inference_execute.go`, `inference_prepare.go`, `tests/perf/hotpath_test.go` | Records primary and fallback attempts with timing via context plumbing, adjusts fallback matching behavior, and updates perf guard ceilings. |
| Audit persistence and hydration: `internal/auditlog/store_*.go`, `reader_*.go`, `middleware.go`, `stream_wrapper.go` | Persists audit_log_attempts, hydrates attempts in readers, and carries normalized attempt snapshots through middleware and stream copies. |
| Server request handling: `internal/server/audit_attempts.go`, `error_support.go`, `internal_chat_completion_executor.go`, `messages_handler.go`, `translated_inference_service.go`, `fallback_test.go` | Wires attempt recording and audit enrichment into request handling and verifies the resulting attempts in fallback tests. |
| Live preview compact attempts: `internal/live/broker.go`, `broker_test.go` | Adds compact attempt summaries to live audit preview payloads without response bodies or headers. |
| Dashboard audit attempt UI: `internal/admin/dashboard/static/js/modules/audit-list.js`, `conversation-drawer.js`, templates, CSS, tests | Adds attempt-aware audit list helpers, tabbed request/response panes, conversation click handling, and the corresponding dashboard template, CSS, and tests. |

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~150 minutes

Possibly related PRs

  • ENTERPILOT/GoModel#186: Both PRs wire failover visibility through dashboard runtime config and execution-plan UI.
  • ENTERPILOT/GoModel#204: Both PRs modify internal/auditlog/stream_wrapper.go’s CreateStreamEntry cloning logic.
  • ENTERPILOT/GoModel#292: Both PRs touch the fallback configuration implementation in config/fallback.go.

Poem

🐇 I hopped through failovers, bright and new,
With audit tracks and attempts in view.
One hop, one log, one tabbed delight—
The rabbit’s trail is clean and tight! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title is concise and accurately summarizes the main change: adding manual failover management. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/failover-management


Comment thread internal/failover/store_mongodb.go Dismissed
Comment thread internal/failover/store_mongodb.go Dismissed
Comment thread internal/failover/store_mongodb.go Dismissed
@codecov-commenter

codecov-commenter commented Jun 29, 2026


@mintlify

mintlify Bot commented Jun 29, 2026


Preview deployment for your docs. Learn more about Mintlify Previews.

| Project | Status | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| gomodel | 🟢 Ready | View Preview | Jun 29, 2026, 9:48 PM |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/fallback/resolver_test.go (1)

55-83: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Assert that manual targets never reappear in SuggestFallbacks().

This test now only checks the first two suggestions. If SuggestFallbacks() starts appending the already-configured manual target ("azure/gpt-4o") later in the slice, the test still passes even though internal/fallback/resolver.go:171-189 is explicitly deduping manual selectors before auto suggestions are generated. Add a negative assertion over the full result set. As per coding guidelines, **/*_test.go: Add or update tests for behavior changes.


Source: Coding guidelines
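The negative assertion requested for `resolver_test.go` amounts to scanning the entire suggestion slice, not just its first two entries. A minimal sketch follows, assuming suggestions and manual targets are plain selector strings. The helper `findLeakedManualTargets` and the sample selectors other than "azure/gpt-4o" are hypothetical, and the real `SuggestFallbacks()` return type may differ.

```go
package main

import "fmt"

// findLeakedManualTargets is a hypothetical test helper. It returns every
// manual target that also appears anywhere in suggestions, so an empty
// result means no manual target leaked into the auto suggestions.
func findLeakedManualTargets(suggestions, manualTargets []string) []string {
	var leaked []string
	for _, target := range manualTargets {
		for _, suggestion := range suggestions {
			if suggestion == target {
				leaked = append(leaked, target)
				break
			}
		}
	}
	return leaked
}

func main() {
	// Illustrative data: the manual target sits at index 2, where a check
	// limited to the first two suggestions would never see it.
	suggestions := []string{"provider-a/model-x", "provider-b/model-y", "azure/gpt-4o"}
	fmt.Println(findLeakedManualTargets(suggestions, []string{"azure/gpt-4o"}))
}
```

In a test, a non-empty return value would be reported with `t.Fatalf`, while the existing ordering checks on the top-ranked suggestions stay as they are.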

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@config/fallback.go`:
- Around line 252-263: The disabled-model JSON parsing currently accepts the
literal null because json.Unmarshal into the slice in the cfg.DisabledModelsJSON
branch succeeds with a nil result; update this parsing path to explicitly reject
null before or after unmarshaling. In the fallback config logic that builds
disabled, keep allowing only a JSON array or boolean object shape, and if the
expanded value is null return an error instead of treating it as valid input.

In `@docs/features/failover.mdx`:
- Around line 13-15: The overview text in the failover documentation has an
awkward, broken sentence that makes the runtime migration guidance unclear.
Update the affected prose in the failover overview to rephrase the runtime
failover guidance into a complete, unambiguous sentence, keeping the meaning
that runtime failover relies on manual rules only. Locate the text near the
introductory failover description and adjust the surrounding sentence flow for
clarity.

In `@docs/openapi.json`:
- Around line 6775-6782: The failover request schemas currently define source as
optional even though the delete and upsert handlers require it; update both
admin.deleteFailoverRuleRequest and admin.upsertFailoverRuleRequest in the
OpenAPI document to mark source as required. Make sure the required field list
matches the server contract so generated clients treat source as mandatory for
both request shapes.
- Around line 952-1003: The failover upsert response contract is missing the
successful 204 No Content path. Update the responses for the PUT /admin/failover
operation in docs/openapi.json to include a 204 entry alongside the existing
200/400/401/502/503 responses, matching the behavior in
internal/admin/handler_failover.go and its refreshed view handling.

In `@internal/admin/dashboard/static/js/modules/failover.js`:
- Around line 83-89: The failover edit flow in failover.js captures
failoverFormOriginalSource but never uses it when saving, so changing source
creates a new persisted rule instead of updating the existing one. Fix the save
path in the form handling logic by either making source immutable in edit mode
or by using failoverFormOriginalSource to delete/replace the old rule before
upserting the edited one; update the relevant edit/save methods around
failoverFormOriginalSource and failoverForm so the original key is handled
correctly.
- Around line 82-88: The failover UI is still using rule.managed to determine
whether a rule is config-managed, but this PR’s contract uses managed_source
instead. Update the failover state setup in the failover module to derive the
managed flag from managed_source, and make sure the template logic for the
edit/delete affordances also checks managed_source consistently so managed rules
are treated as non-editable dashboard entries.

In `@internal/admin/dashboard/templates/layout.html`:
- Around line 162-170: The failover target picker is using a datalist with a
textarea, but the `list` attribute is not supported on `<textarea>`, so the
autocomplete will never work. Update the `failover-target-options` /
`failover-targets` UI in `layout.html` to use a supported control such as an
`<input>` with `list`, or replace it with a custom autocomplete component while
keeping the `failoverForm.targets` binding and `failoverFormManaged` disabled
state intact.

In `@internal/admin/dashboard/templates/page-failover.html`:
- Around line 21-24: The failover filter input is bound to shared state but the
tables still render the unfiltered arrays, so typing has no effect and leaks
into the models page filter. Update the failover template to use a
failover-specific filtered collection in the table loops, and avoid reusing the
shared modelFilter binding by introducing a dedicated failover filter state and
derived list in the surrounding Alpine data/template logic. Ensure both table
sections that currently iterate the raw collections are switched to the new
filtered source.

In `@internal/admin/dashboard/templates/sidebar.html`:
- Around line 43-46: The sidebar Failover nav link is still shown and clickable
even when the feature is unavailable. Update the sidebar template’s Failover
entry to use the same availability check as the other failover entry points by
gating the anchor on failoverAvailable, so it is hidden when the API reports
failover is unavailable.

In `@internal/admin/handler_failover.go`:
- Around line 169-172: The `/admin/failover/generate` endpoint is documented as
returning `failover.View`, but it only fabricates draft data and should not use
the persisted-rule shape. Update the handler/Swagger annotations in
`admin/handler_failover.go` to return a dedicated draft DTO for this endpoint,
and adjust the generation path in the failover handler so it serializes only the
draft fields instead of `created_at`/`updated_at` or other stored-view-only
values.

In `@internal/failover/factory.go`:
- Around line 90-98: The failover factory currently reuses the model-cache
refresh ticker for propagation, which can leave peer nodes stale for too long.
Update the Result construction in the factory logic to use a dedicated, short
failover refresh interval or an explicit invalidation mechanism instead of
cfg.Cache.Model.RefreshInterval and the time.Hour fallback. Keep the change
localized around the service.StartBackgroundRefresh call so failover updates
propagate independently from the model cache.

In `@internal/failover/service_test.go`:
- Around line 60-89: The current test only covers the happy-path merge in
TestServiceConfigRulesOverrideDashboardRules; add regression tests for the new
write/error behavior too. Extend coverage around
Service/Refresh/UpsertRule/ResetDashboardRules to verify config-managed rules
cannot be overwritten by dashboard writes, config rules survive
ResetDashboardRules, and storage failures from refresh/upsert are surfaced as
errors instead of being swallowed. Use the existing symbols NewService, Refresh,
Rules, Get, and the managed-source constants to keep the tests aligned with the
failover contract.

In `@internal/failover/service.go`:
- Around line 183-185: The pre-read in `service.go` is swallowing unexpected
`Store.Get` errors and still allowing the write to continue, which can hide
storage failures and overwrite metadata. Update the `existing, err :=
s.store.Get(ctx, normalized.Source)` check in the rule write path to distinguish
`ErrNotFound` from other errors: keep the current `CreatedAt` preservation only
when a record exists, but immediately return any non-`ErrNotFound` error before
proceeding. Ensure the fix is applied in the same update flow that uses
`normalized.CreatedAt` and `s.store.Get`.

In `@internal/failover/store_mongodb.go`:
- Around line 19-32: NewMongoDBStore creates indexes with context.Background(),
so index creation ignores the caller’s cancellation and shutdown state. Update
the MongoDB initialization path to accept and use the caller context in
NewMongoDBStore, and pass that ctx through to coll.Indexes().CreateMany instead
of creating a new timeout context. Keep the existing error handling and index
definitions intact while aligning the MongoDB flow with the PostgreSQL init
path.

---

Outside diff comments:
In `@internal/fallback/resolver_test.go`:
- Around line 55-83: The fallback ranking test for Resolver currently only
validates the first two suggestions, so it can miss regressions where the manual
target is appended later in the slice. Update
TestResolverSuggestFallbacksReturnsRankingCandidates to assert across the full
result from SuggestFallbacks that the manual selector from Resolver.NewResolver
config (the "azure/gpt-4o" entry) never appears, while still preserving the
existing ordering checks for the top ranked auto suggestions.
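The `config/fallback.go` finding in the list above rests on a property of `encoding/json`: unmarshaling the literal `null` into a slice succeeds and leaves the slice nil. The sketch below demonstrates that and shows one way to reject it. `parseDisabledModels` is a hypothetical helper that handles only the array shape; the boolean-object shape mentioned in the finding is deliberately left out, and the real parsing code may be structured differently.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// parseDisabledModels is a hypothetical helper that accepts a JSON array of
// model names and explicitly rejects the JSON literal null.
func parseDisabledModels(raw string) ([]string, error) {
	trimmed := bytes.TrimSpace([]byte(raw))
	if bytes.Equal(trimmed, []byte("null")) {
		return nil, errors.New("disabled models JSON must not be null")
	}
	var models []string
	if err := json.Unmarshal(trimmed, &models); err != nil {
		return nil, fmt.Errorf("parse disabled models: %w", err)
	}
	return models, nil
}

func main() {
	// Plain json.Unmarshal accepts null without an error and leaves the slice nil.
	var plain []string
	err := json.Unmarshal([]byte("null"), &plain)
	fmt.Println(plain == nil, err)

	// The guarded helper surfaces null as a configuration error instead.
	_, err = parseDisabledModels("null")
	fmt.Println(err)
}
```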

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 849c231b-e921-4bda-ac96-a07c5211f4eb

📥 Commits

Reviewing files that changed from the base of the PR and between 92d4e88 and 73f3f20.

📒 Files selected for processing (36)
  • .env.template
  • config/config.example.yaml
  • config/config.go
  • config/config_test.go
  • config/fallback.go
  • docs/features/failover.mdx
  • docs/openapi.json
  • internal/admin/dashboard/static/js/dashboard.js
  • internal/admin/dashboard/static/js/modules/failover.js
  • internal/admin/dashboard/static/js/modules/workflows-layout.test.cjs
  • internal/admin/dashboard/static/js/modules/workflows.js
  • internal/admin/dashboard/static/js/modules/workflows.test.cjs
  • internal/admin/dashboard/templates/index.html
  • internal/admin/dashboard/templates/layout.html
  • internal/admin/dashboard/templates/model-table-body.html
  • internal/admin/dashboard/templates/page-failover.html
  • internal/admin/dashboard/templates/page-settings.html
  • internal/admin/dashboard/templates/sidebar.html
  • internal/admin/handler.go
  • internal/admin/handler_failover.go
  • internal/admin/handler_test.go
  • internal/admin/routes.go
  • internal/admin/routes_test.go
  • internal/app/app.go
  • internal/app/app_test.go
  • internal/failover/factory.go
  • internal/failover/service.go
  • internal/failover/service_test.go
  • internal/failover/store.go
  • internal/failover/store_mongodb.go
  • internal/failover/store_postgresql.go
  • internal/failover/store_sqlite.go
  • internal/failover/types.go
  • internal/fallback/resolver.go
  • internal/fallback/resolver_test.go
  • internal/server/http_test.go

Comment thread config/fallback.go
Comment thread docs/features/failover.mdx Outdated
Comment thread docs/openapi.json
Comment on lines +952 to +1003
"responses": {
"200": {
"description": "OK",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/failover.View"
}
}
}
},
"400": {
"description": "Bad Request",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/core.GatewayError"
}
}
}
},
"401": {
"description": "Unauthorized",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/core.GatewayError"
}
}
}
},
"502": {
"description": "Bad Gateway",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/core.GatewayError"
}
}
}
},
"503": {
"description": "Service Unavailable",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/core.GatewayError"
}
}
}
}
},


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Document the 204 No Content upsert path.

internal/admin/handler_failover.go can return 204 after a successful PUT /admin/failover when the refreshed view is unavailable, but this operation only advertises 200. That makes the published API contract narrower than the handler.

Suggested OpenAPI fix
         "responses": {
           "200": {
             "description": "OK",
             "content": {
               "application/json": {
                 "schema": {
                   "$ref": "#/components/schemas/failover.View"
                 }
               }
             }
           },
+          "204": {
+            "description": "No Content"
+          },
           "400": {
             "description": "Bad Request",
             "content": {
               "application/json": {
                 "schema": {
Suggested change
"responses": {
  "200": {
    "description": "OK",
    "content": {
      "application/json": {
        "schema": {
          "$ref": "#/components/schemas/failover.View"
        }
      }
    }
  },
  "204": {
    "description": "No Content"
  },
  "400": {
    "description": "Bad Request",
    "content": {
      "application/json": {
        "schema": {
          "$ref": "#/components/schemas/core.GatewayError"
        }
      }
    }
  },
  "401": {
    "description": "Unauthorized",
    "content": {
      "application/json": {
        "schema": {
          "$ref": "#/components/schemas/core.GatewayError"
        }
      }
    }
  },
  "502": {
    "description": "Bad Gateway",
    "content": {
      "application/json": {
        "schema": {
          "$ref": "#/components/schemas/core.GatewayError"
        }
      }
    }
  },
  "503": {
    "description": "Service Unavailable",
    "content": {
      "application/json": {
        "schema": {
          "$ref": "#/components/schemas/core.GatewayError"
        }
      }
    }
  }
},

Comment thread docs/openapi.json
Comment on lines +6775 to +6782
"admin.deleteFailoverRuleRequest": {
"type": "object",
"properties": {
"source": {
"type": "string"
}
}
},


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Mark source as required in both failover request schemas.

Both the delete and upsert handlers reject missing or blank source, but these schemas leave it optional. That weakens generated clients and misstates the server contract.

Suggested OpenAPI fix
       "admin.deleteFailoverRuleRequest": {
         "type": "object",
+        "required": [
+          "source"
+        ],
         "properties": {
           "source": {
             "type": "string"
           }
         }
       },
@@
       "admin.upsertFailoverRuleRequest": {
         "type": "object",
+        "required": [
+          "source"
+        ],
         "properties": {
           "description": {
             "type": "string"
           },

Also applies to: 6957-6976
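For context, the server-side rule this comment describes (reject a missing or blank `source`) might look roughly like the sketch below. The type `deleteRuleRequest` and the function `decodeDeleteRule` are hypothetical stand-ins; the actual handler code is not shown in this thread.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// deleteRuleRequest is a hypothetical mirror of the delete request body.
type deleteRuleRequest struct {
	Source string `json:"source"`
}

// decodeDeleteRule decodes the body and rejects a missing or blank source,
// which is the behavior the OpenAPI "required" list should advertise.
func decodeDeleteRule(body string) (deleteRuleRequest, error) {
	var req deleteRuleRequest
	if err := json.Unmarshal([]byte(body), &req); err != nil {
		return deleteRuleRequest{}, fmt.Errorf("decode request: %w", err)
	}
	req.Source = strings.TrimSpace(req.Source)
	if req.Source == "" {
		return deleteRuleRequest{}, errors.New("source is required")
	}
	return req, nil
}

func main() {
	// A body without source is schema-valid today but fails at the handler.
	_, err := decodeDeleteRule(`{}`)
	fmt.Println(err)

	req, err := decodeDeleteRule(`{"source": " openai/gpt-4o "}`)
	fmt.Println(req.Source, err)
}
```

With `source` marked optional in the schema, a generated client can build the first request without any compile-time complaint and only discover the problem from a 400 at runtime.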


Comment on lines +82 to +88
this.failoverFormManaged = Boolean(rule.managed);
this.failoverFormOriginalSource = rule.source || '';
this.failoverForm = {
source: rule.source || '',
targets: (Array.isArray(rule.targets) ? rule.targets : []).join('\n'),
description: rule.description || '',
enabled: rule.enabled !== false


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Use managed_source consistently for managed-rule state.

This module still keys off rule.managed, but the view contract in this PR uses managed_source. As written, config-managed rules will be labeled as normal dashboard rules here, and the same mismatch leaks into the template’s edit/delete affordances. Gate the UI off managed_source instead.

Also applies to: 297-300


Comment on lines +169 to +172
// @Success 200 {array} failover.View
// @Failure 401 {object} core.GatewayError
// @Failure 503 {object} core.GatewayError
// @Router /admin/failover/generate [post]


🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Don't return draft suggestions as failover.View.

failover.View is the persisted-rule schema, but this endpoint fabricates only a subset of those fields. The generated payload will serialize zero created_at/updated_at values and a stored-view shape for objects that have not been saved yet. Please switch /admin/failover/generate to a dedicated typed draft DTO instead of reusing failover.View.

Suggested direction
+type generateFailoverRuleResponse struct {
+	Source      string   `json:"source"`
+	Targets     []string `json:"targets"`
+	Description string   `json:"description,omitempty"`
+	Enabled     bool     `json:"enabled"`
+}
+
 func (h *Handler) GenerateFailoverRules(c *echo.Context) error {
 	...
-	suggestions := make([]failover.View, 0)
+	suggestions := make([]generateFailoverRuleResponse, 0)
 	...
-		suggestions = append(suggestions, failover.View{
-			Source:        source,
-			Targets:       targets,
-			Enabled:       true,
-			ManagedSource: failover.ManagedSourceDashboard,
-		})
+		suggestions = append(suggestions, generateFailoverRuleResponse{
+			Source:  source,
+			Targets: targets,
+			Enabled: true,
+		})
 	}
 	return c.JSON(http.StatusOK, suggestions)
 }

Also applies to: 207-212
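The "zero created_at/updated_at" concern can be reproduced in isolation. The sketch below assumes the stored view carries plain `time.Time` fields; `storedView` and `draftRule` are hypothetical stand-ins, not the project's `failover.View`. Note that the `omitempty` tag would not help here either, because `encoding/json` never treats a struct value such as `time.Time` as empty.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// storedView is a hypothetical persisted shape with timestamp fields.
type storedView struct {
	Source    string    `json:"source"`
	CreatedAt time.Time `json:"created_at"`
	UpdatedAt time.Time `json:"updated_at"`
}

// draftRule is a hypothetical draft shape that carries only unsaved fields.
type draftRule struct {
	Source  string   `json:"source"`
	Targets []string `json:"targets"`
	Enabled bool     `json:"enabled"`
}

// mustJSON marshals v and panics on failure, which keeps the demo short.
func mustJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Unset timestamps serialize as the year-1 zero time, not as absent fields.
	fmt.Println(mustJSON(storedView{Source: "openai/gpt-4o"}))

	// The draft shape simply has no timestamp fields to leak.
	fmt.Println(mustJSON(draftRule{Source: "openai/gpt-4o", Targets: []string{"azure/gpt-4o"}, Enabled: true}))
}
```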


Comment thread internal/failover/factory.go Outdated
Comment thread internal/failover/service_test.go
Comment thread internal/failover/service.go Outdated
Comment thread internal/failover/store_mongodb.go
@greptile-apps

greptile-apps Bot commented Jun 29, 2026

Confidence Score: 4/5

The core failover routing and audit-attempt recording logic is structurally sound; the main concerns are in the config migration path and the dashboard rule merge order, both of which could silently alter routing behaviour for operators who upgrade from a FEATURE_FALLBACK_MODE=off or populated manual_rules_path configuration.

Three previously-flagged defects remain unresolved: default_mode off no longer disables failover on upgrade, dynamic dashboard rules silently overwrite static config rules in the resolver, and a transient DB error during Upsert silently resets CreatedAt. New findings here are minor (cleanup early-return pattern, duplicated AttemptKind constants).

config/fallback.go (migration of deprecated default_mode), internal/fallback/resolver.go (effectiveManualRules merge order), internal/failover/service.go (Upsert error handling), and internal/auditlog/store_postgresql.go and store_sqlite.go (cleanup early-return).

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Client
    participant Gateway as Gateway (server)
    participant Orchestrator as InferenceOrchestrator
    participant FallbackResolver as FallbackResolver
    participant FailoverService as failover.Service
    participant Provider as AI Provider
    participant AuditLog as AuditLog Store

    Client->>Gateway: POST /v1/chat/completions
    Gateway->>Gateway: prepareTranslated (WithAttemptRecorder)
    Gateway->>Orchestrator: ExecuteChatCompletion(ctx, workflow, req)
    Orchestrator->>Provider: "Primary attempt (model=gpt-4o)"
    Provider-->>Orchestrator: 404 model deprecated
    Orchestrator->>Orchestrator: recordProviderAttempt(primary, failed)
    Orchestrator->>Orchestrator: ShouldAttemptFallback? true
    Orchestrator->>FallbackResolver: ResolveFallbacks(resolution)
    FallbackResolver->>FailoverService: Rules()
    FallbackResolver-->>Orchestrator: [azure/gpt-4o]
    Orchestrator->>Provider: "Failover attempt (model=azure/gpt-4o)"
    Provider-->>Orchestrator: 200 OK
    Orchestrator->>Orchestrator: recordProviderAttempt(failover, success)
    Orchestrator-->>Gateway: Result UsedFallback true
    Gateway->>AuditLog: enrichAuditEntryWithProviderAttempts
    Gateway->>AuditLog: EnrichEntryWithFailover(azure/gpt-4o)
    Gateway-->>Client: 200 OK

Reviews (4): Last reviewed commit: "perf(failover): cache merged rule maps i..." | Re-trigger Greptile

Comment thread config/fallback.go Outdated
Comment on lines 90 to 94
	}

	cfg.DefaultMode = ResolveFallbackDefaultMode(cfg.DefaultMode)
	if !cfg.DefaultMode.Valid() {
		return fmt.Errorf("fallback.default_mode must be one of: auto, manual, off")
	}

	if len(cfg.Overrides) > 0 {

P1 default_mode: off no longer disables failover — silent behavior change on upgrade

FEATURE_FALLBACK_MODE=off / default_mode: off was the documented way to globally disable fallback. The old resolver gate was if mode == FallbackModeOff && len(cfg.Manual) == 0 && len(cfg.Overrides) == 0 { return nil }, so with manual rules present, the resolver existed but modeFor() returned FallbackModeOff for every model that lacked an explicit mode: manual override — those models got no fallbacks.

After this PR, Enabled defaults to true and DefaultMode is silently ignored at runtime. Any operator who had default_mode: off (or FEATURE_FALLBACK_MODE=off) together with a populated manual_rules_path (or FALLBACK_MANUAL_RULES_PATH) will silently receive fallbacks for all those models on the next deploy.

loadFallbackConfig processes the deprecated DefaultMode field but never maps FallbackModeOff → Enabled=false. Either add that migration or update the deprecation notice and migration docs to call out this specific case explicitly.
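
A minimal sketch of the first option (mapping the deprecated value onto the new switch), not the PR's actual code. `FallbackConfig`, `migrateDeprecatedMode`, the `enabledSetExplicitly` flag, and the `fallback.enabled` key named in the warning text are all hypothetical stand-ins for illustration; only `DefaultMode`, `Enabled`, and `FallbackModeOff` come from the comment above.

```go
package main

import (
	"fmt"
	"strings"
)

// FallbackMode mirrors the deprecated default_mode values (auto, manual, off).
type FallbackMode string

const FallbackModeOff FallbackMode = "off"

// FallbackConfig is a hypothetical, trimmed-down stand-in for the real config.
type FallbackConfig struct {
	Enabled     bool
	DefaultMode FallbackMode // deprecated input, ignored by the runtime
}

// migrateDeprecatedMode turns a legacy "off" into Enabled=false so that an
// upgrade does not silently start routing fallbacks. An explicit Enabled
// setting (assumed to be tracked by the loader) always wins.
func migrateDeprecatedMode(cfg FallbackConfig, enabledSetExplicitly bool) (FallbackConfig, string) {
	mode := FallbackMode(strings.ToLower(strings.TrimSpace(string(cfg.DefaultMode))))
	if mode != FallbackModeOff || enabledSetExplicitly {
		return cfg, ""
	}
	cfg.Enabled = false
	return cfg, "fallback.default_mode=off is deprecated; treating it as fallback.enabled=false"
}

func main() {
	cfg, warning := migrateDeprecatedMode(FallbackConfig{Enabled: true, DefaultMode: "off"}, false)
	fmt.Println(cfg.Enabled, warning)
}
```

Returning the warning string rather than logging inside the helper keeps the migration testable; the caller can decide whether to log it once at startup.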

Comment thread internal/fallback/resolver.go
Comment thread internal/failover/service.go
Comment thread internal/admin/handler_failover.go Outdated
Comment on lines +206 to +240
		}
		suggestions = append(suggestions, failover.View{
			Source:        source,
			Targets:       targets,
			Enabled:       true,
			ManagedSource: failover.ManagedSourceDashboard,
		})
	}
	return c.JSON(http.StatusOK, suggestions)
}

func modelSupportsCategory(meta *core.ModelMetadata, category core.ModelCategory) bool {
	if meta == nil || len(meta.Categories) == 0 {
		return true
	}
	for _, candidate := range meta.Categories {
		if candidate == category {
			return true
		}
	}
	return false
}

func (h *Handler) findFailoverView(source string) (failover.View, bool) {
	for _, view := range h.failoverRules.ListViews() {
		if view.Source == source {
			return view, true
		}
	}
	return failover.View{}, false
}

func failoverWriteError(err error) error {
	if errors.Is(err, failover.ErrManaged) {
		return core.NewInvalidRequestError("failover rule is managed by configuration and cannot be changed in the dashboard", err)

P2 GenerateFailoverRules is O(N²) with no rate-limit protection

The handler calls resolver.SuggestFallbacks() per model, and SuggestFallbacks internally iterates the full registry for each call. For a large registry this can be a sustained CPU spike with no timeout or rate-limit guard on the endpoint.
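
One possible mitigation, sketched under assumptions: scan the registry once to build a category index, then look candidates up per model instead of rescanning everything. `Model`, `buildCategoryIndex`, and `suggestTargets` are hypothetical names, and matching by shared category is only a guess at what the real `SuggestFallbacks` does.

```go
package main

import (
	"fmt"
	"sort"
)

// Model is a hypothetical, simplified registry entry.
type Model struct {
	ID         string
	Categories []string
}

// buildCategoryIndex scans the registry once and groups model IDs by category.
func buildCategoryIndex(registry []Model) map[string][]string {
	index := make(map[string][]string)
	for _, m := range registry {
		for _, c := range m.Categories {
			index[c] = append(index[c], m.ID)
		}
	}
	return index
}

// suggestTargets returns up to limit models that share a category with source,
// excluding source itself. It only touches the buckets source belongs to.
func suggestTargets(source Model, index map[string][]string, limit int) []string {
	seen := map[string]bool{source.ID: true}
	var out []string
	for _, c := range source.Categories {
		for _, id := range index[c] {
			if seen[id] {
				continue
			}
			seen[id] = true
			out = append(out, id)
		}
	}
	sort.Strings(out) // deterministic order for the dashboard
	if limit >= 0 && len(out) > limit {
		out = out[:limit]
	}
	return out
}

func main() {
	registry := []Model{
		{ID: "a", Categories: []string{"chat"}},
		{ID: "b", Categories: []string{"chat"}},
		{ID: "c", Categories: []string{"embedding"}},
	}
	index := buildCategoryIndex(registry)
	fmt.Println(suggestTargets(registry[0], index, 3))
}
```

This still degrades toward quadratic work if every model shares one category, so a request timeout or rate limit on the endpoint remains worthwhile.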

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
internal/admin/dashboard/static/js/modules/failover.js (1)

34-37: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Clear generated drafts when the failover feature becomes unavailable.

On the 503 and fetch-error paths, this only clears failoverRules. page-failover.html still renders failoverGeneratedRules independently, so stale generated drafts remain visible and reopenable after the backend reports the feature as unavailable. Reset failoverGeneratedRules in these early-return paths too.

Also applies to: 49-53

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/dashboard/static/js/modules/failover.js` around lines 34 - 37,
The failover unavailable paths in failover.js only clear failoverRules, leaving
stale drafts in failoverGeneratedRules visible in page-failover.html. Update the
early-return branches in the fetch response handling (including the 503 and
fetch-error paths) to also reset failoverGeneratedRules alongside
failoverAvailable and failoverRules. Use the failoverGeneratedRules state in the
failover.js module as the target to clear so the UI no longer shows reopenable
generated drafts after the feature becomes unavailable.
internal/admin/handler_failover.go (1)

69-83: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Normalize fallback_models before storing them.

primary_model is trimmed and validated, but each fallback selector is persisted verbatim. A payload like [" azure/gpt-4o ", ""] will survive this PUT and store selectors that will never match later. Trim each entry and reject blank targets before building the rule.

Suggested fix
 	source := strings.TrimSpace(req.PrimaryModel)
 	if source == "" {
 		return handleError(c, core.NewInvalidRequestError("primary_model is required", nil))
 	}
+	targets := make([]string, 0, len(req.FallbackModels))
+	for _, raw := range req.FallbackModels {
+		target := strings.TrimSpace(raw)
+		if target == "" {
+			return handleError(c, core.NewInvalidRequestError("fallback_models must not contain blank entries", nil))
+		}
+		targets = append(targets, target)
+	}
 	enabled := true
 	if existing, ok := h.failoverRules.Get(source); ok && existing != nil {
 		enabled = existing.Enabled
 	}
@@
 	rule := failover.Rule{
 		Source:  source,
-		Targets: req.FallbackModels,
+		Targets: targets,
 		Enabled: enabled,
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/handler_failover.go` around lines 69 - 83, The failover rule
creation path in handler_failover.go trims and validates primary_model, but
stores req.FallbackModels unchanged in failover.Rule.Targets. Normalize each
fallback model before constructing the rule by trimming whitespace in the same
handler logic, and reject any blank entries with an invalid request error so
only valid selectors are persisted.
docs/features/failover.mdx (1)

42-52: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Document the disable-list controls too.

This page explains manual rule sources, but it never mentions fallback.disabled_models / FAILOVER_DISABLED_MODELS_JSON, even though both are now part of the exposed config surface in config/config.example.yaml and .env.template. Without that, operators have no docs path for opting specific selectors out of failover.

Suggested addition
 For env-only deployments, use inline JSON:
@@
 `fallback.default_mode` and `FEATURE_FALLBACK_MODE` are deprecated
 compatibility inputs. They are accepted but ignored by runtime failover.
+
+To disable failover for specific primary model selectors, use
+`fallback.disabled_models` in YAML or `FAILOVER_DISABLED_MODELS_JSON` in env-only
+deployments. Example:
+
+```bash
+FAILOVER_DISABLED_MODELS_JSON='["claude-sonnet-4"]'
+```
 
 The Dashboard can generate suggested mappings from model metadata, but
 suggestions must be saved as manual mappings before they affect traffic.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/features/failover.mdx` around lines 42 - 52, The failover docs in the
failover page are missing the disable-list configuration, so add coverage for
fallback.disabled_models and FAILOVER_DISABLED_MODELS_JSON alongside the
existing rule source examples. Update the documentation near the inline JSON /
environment variable examples to show how operators can opt specific selectors
out of failover, and mention that this is part of the supported config surface
just like the manual mappings described in the page.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/failover/store_postgresql.go`:
- Around line 26-29: The PostgreSQL failover store DDL only covers fresh
installs, so upgraded databases with the old source/targets columns will still
break when FailoverStore reads or upserts failover_rules. Update the store
initialization/migration path in FailoverStore or its schema setup to detect the
legacy layout and run a real ALTER TABLE migration that renames source to
primary_model and targets to fallback_models before any queries use the new
names.

In `@internal/failover/store_sqlite.go`:
- Around line 21-24: The failover_rules initializer in store_sqlite.go only
creates the new table shape for fresh databases, so existing SQLite installs
will keep the old source/targets schema and break later queries. Update the
initialization/migration path in the failover store setup to detect and migrate
the existing failover_rules table before any SELECT or upsert logic runs,
backfilling/renaming source and targets into primary_model and fallback_models
so the new column names are always available.

In `@internal/failover/types.go`:
- Line 17: The BSON tag change on Targets in the failover types breaks decoding
of legacy failover_rules documents that still persist targets, so
MongoDBStore.List/Get will return empty Targets and upserts may rewrite them
under fallback_models. Add backward-compatible read handling in the failover
decoding path (or a migration in MongoDBStore) so both targets and
fallback_models are accepted, and only switch the persisted tag after existing
documents can be read reliably.

---

Outside diff comments:
In `@docs/features/failover.mdx`:
- Around line 42-52: The failover docs in the failover page are missing the
disable-list configuration, so add coverage for fallback.disabled_models and
FAILOVER_DISABLED_MODELS_JSON alongside the existing rule source examples.
Update the documentation near the inline JSON / environment variable examples to
show how operators can opt specific selectors out of failover, and mention that
this is part of the supported config surface just like the manual mappings
described in the page.

In `@internal/admin/dashboard/static/js/modules/failover.js`:
- Around line 34-37: The failover unavailable paths in failover.js only clear
failoverRules, leaving stale drafts in failoverGeneratedRules visible in
page-failover.html. Update the early-return branches in the fetch response
handling (including the 503 and fetch-error paths) to also reset
failoverGeneratedRules alongside failoverAvailable and failoverRules. Use the
failoverGeneratedRules state in the failover.js module as the target to clear so
the UI no longer shows reopenable generated drafts after the feature becomes
unavailable.

In `@internal/admin/handler_failover.go`:
- Around line 69-83: The failover rule creation path in handler_failover.go
trims and validates primary_model, but stores req.FallbackModels unchanged in
failover.Rule.Targets. Normalize each fallback model before constructing the
rule by trimming whitespace in the same handler logic, and reject any blank
entries with an invalid request error so only valid selectors are persisted.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 111eae64-2273-418b-ba43-1ec136d2ce31

📥 Commits

Reviewing files that changed from the base of the PR and between 73f3f20 and b1bf8d4.

📒 Files selected for processing (14)
  • .env.template
  • config/config.example.yaml
  • docs/features/failover.mdx
  • docs/openapi.json
  • internal/admin/dashboard/static/js/modules/failover.js
  • internal/admin/dashboard/templates/layout.html
  • internal/admin/dashboard/templates/page-failover.html
  • internal/admin/handler_failover.go
  • internal/failover/service.go
  • internal/failover/store.go
  • internal/failover/store_mongodb.go
  • internal/failover/store_postgresql.go
  • internal/failover/store_sqlite.go
  • internal/failover/types.go

Comment thread internal/failover/store_postgresql.go
Comment thread internal/failover/store_sqlite.go
// Rule is one manual failover mapping for a primary model selector.
type Rule struct {
	Source  string   `json:"primary_model" bson:"_id"`
	Targets []string `json:"fallback_models" bson:"fallback_models"`

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Keep backward-compatible BSON decoding for existing failover documents.

Changing Targets to bson:"fallback_models" means legacy failover_rules documents that still store targets will decode with empty Targets in MongoDBStore.List/Get, and the next upsert will rewrite them under the new field name. Add a migration or dual-read compatibility path before switching the persisted tag.
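
A rough sketch of the dual-read option, assuming a separate read-side struct is acceptable. `storedRule` and `toRule` are hypothetical names; the driver call that decodes a document into `storedRule` is omitted, and only the field-selection logic is shown.

```go
package main

import "fmt"

// storedRule is a hypothetical read-side shape that accepts both the new
// fallback_models field and the legacy targets field.
type storedRule struct {
	Source         string   `bson:"_id"`
	FallbackModels []string `bson:"fallback_models,omitempty"`
	LegacyTargets  []string `bson:"targets,omitempty"`
}

// Rule is a trimmed-down stand-in for failover.Rule.
type Rule struct {
	Source  string
	Targets []string
}

// toRule prefers the new field and falls back to the legacy one, so documents
// written before the rename still yield their targets.
func (s storedRule) toRule() Rule {
	targets := s.FallbackModels
	if len(targets) == 0 {
		targets = s.LegacyTargets
	}
	// Copy so callers cannot mutate the decoded document's backing array.
	return Rule{Source: s.Source, Targets: append([]string(nil), targets...)}
}

func main() {
	legacy := storedRule{Source: "gpt-4o", LegacyTargets: []string{"azure/gpt-4o"}}
	fmt.Println(legacy.toRule().Targets)
}
```

With this shape, the next upsert can write only `fallback_models`, which migrates documents lazily without losing the targets read from the old field.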

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/failover/types.go` at line 17, The BSON tag change on Targets in the
failover types breaks decoding of legacy failover_rules documents that still
persist targets, so MongoDBStore.List/Get will return empty Targets and upserts
may rewrite them under fallback_models. Add backward-compatible read handling in
the failover decoding path (or a migration in MongoDBStore) so both targets and
fallback_models are accepted, and only switch the persisted tag after existing
documents can be read reliably.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/failover/store_postgresql.go`:
- Around line 40-63: Scope the migration checks in the failover PostgreSQL
migration so they only inspect the active schema. In the migration logic that
renames failover_rules columns in store_postgresql.go, update each
information_schema.columns existence check to include table_schema =
current_schema() alongside table_name and column_name, so the ALTER TABLE
failover_rules statements only depend on columns in the schema being migrated.

In `@internal/failover/store_sqlite_test.go`:
- Around line 51-64: The test in store_sqlite_test.go currently verifies only
Source and Targets, so add assertions for the migrated metadata fields that the
rule maps as well. Update the existing List() result checks in the relevant test
case to validate enabled, managed_source, created_at, and updated_at on the
returned row from store.List, using the same row variable already referenced in
the test so regressions in metadata preservation fail here too.

In `@internal/failover/store_sqlite.go`:
- Around line 104-113: The SQLite migration in the failover store is selecting
the legacy primary key value without normalizing it, so rows can be inserted
with leading/trailing spaces even though lookups trim inputs. Update the
migration query in the failover store logic (the code building the INSERT/SELECT
for failover_rules in store_sqlite.go) so the primary key expression is TRIMmed
before being stored, matching the normalization used by Get and Delete and the
TRIM filter already applied in the source query.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0dd8c12f-4b04-4ba5-a223-ce262e5afac1

📥 Commits

Reviewing files that changed from the base of the PR and between b1bf8d4 and a9e37d2.

📒 Files selected for processing (6)
  • internal/admin/dashboard/templates/model-table-body.html
  • internal/admin/dashboard/templates/shuffle-icon.html
  • internal/failover/store_mongodb.go
  • internal/failover/store_postgresql.go
  • internal/failover/store_sqlite.go
  • internal/failover/store_sqlite_test.go

Comment thread internal/failover/store_postgresql.go
Comment thread internal/failover/store_sqlite_test.go
Comment thread internal/failover/store_sqlite.go

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (2)
internal/admin/dashboard/static/js/modules/failover.js (2)

77-82: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Use managed_source for managed-rule UI state.

The API/view contract here exposes managed_source, not managed, so config-managed mappings are still treated as editable dashboard rules and shown as On instead of Config. Normalize the managed flag from managed_source before openFailoverEdit() and failoverRuleStatus() consume it.

Also applies to: 363-375

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/dashboard/static/js/modules/failover.js` around lines 77 - 82,
The failover rule UI is reading the wrong field for managed state, so
config-managed mappings are treated like editable dashboard rules. Update the
failover flow to normalize and use managed_source instead of managed before
openFailoverEdit() sets form state and before failoverRuleStatus() renders the
status. Make sure the rule object consistently derives failoverFormManaged and
the displayed label from managed_source so managed rules show as Config rather
than On.

77-85: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Renaming the primary model in edit mode creates a second rule.

failoverFormOriginalSource is captured but never used on save. The admin API upserts by primary_model, so editing source leaves the old mapping active and persists a new one. Make source immutable while editing, or delete/replace the original key before the PUT.

Also applies to: 154-192

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/dashboard/static/js/modules/failover.js` around lines 77 - 85,
The edit flow in openFailoverEdit and the save path in failover.js are allowing
primary model renames to create a second rule because failoverFormOriginalSource
is captured but never applied. Update the edit/save logic so the original
primary_model key is preserved or explicitly deleted/replaced before the PUT,
and make the source field immutable while failoverFormMode is edit. Ensure the
code paths that build the payload and submit the form use
failoverFormOriginalSource consistently when editing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@internal/admin/dashboard/static/js/modules/failover.js`:
- Around line 77-82: The failover rule UI is reading the wrong field for managed
state, so config-managed mappings are treated like editable dashboard rules.
Update the failover flow to normalize and use managed_source instead of managed
before openFailoverEdit() sets form state and before failoverRuleStatus()
renders the status. Make sure the rule object consistently derives
failoverFormManaged and the displayed label from managed_source so managed rules
show as Config rather than On.
- Around line 77-85: The edit flow in openFailoverEdit and the save path in
failover.js are allowing primary model renames to create a second rule because
failoverFormOriginalSource is captured but never applied. Update the edit/save
logic so the original primary_model key is preserved or explicitly
deleted/replaced before the PUT, and make the source field immutable while
failoverFormMode is edit. Ensure the code paths that build the payload and
submit the form use failoverFormOriginalSource consistently when editing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 998194ca-8d28-4db5-8420-b3e20e6a23fe

📥 Commits

Reviewing files that changed from the base of the PR and between a9e37d2 and 17dfa09.

📒 Files selected for processing (5)
  • internal/admin/dashboard/static/css/dashboard.css
  • internal/admin/dashboard/static/js/modules/failover.js
  • internal/admin/dashboard/static/js/modules/failover.test.cjs
  • internal/admin/dashboard/templates/failover-icon.html
  • internal/admin/dashboard/templates/model-table-body.html

SantiagoDePolonia and others added 12 commits June 30, 2026 13:07
…etail as tabs

Capture every external provider call made for a logical request (primary +
failover) as a structured attempt, including failed attempts' raw upstream
response body and headers, and surface them in the dashboard.

Backend:
- Record a ProviderAttempt per provider call; attach the raw upstream error
  body (capped 64 KB) and response headers to GatewayError, then to the attempt.
- Persist per-attempt response_body/response_headers, gated by LOG_BODIES /
  LOG_HEADERS (headers redacted), in audit_log_attempts (SQLite + Postgres,
  additive ADD COLUMN migrations) and embedded for Mongo; hydrate in readers.
- Include a compact attempt summary in the live-log preview so the failover
  indicator appears without waiting for the persisted detail.

Dashboard:
- Request / Response(s) are now tabs (default = last valid response), replacing
  the collapsible panes; each failed attempt gets its own tab with a type pill
  and status badge. Shared audit-pane is headless; Headers 1/3 / Body 2/3 split.
- Collapsed audit row shows a per-attempt pip track; request/response direction
  icons; the error message opens the conversation preview; tab-switch animation.
- Rename the Models page heading "Registered Models" -> "Models".

Bump the gateway chat hot-path byte ceiling (14080 -> 14784) for the larger
per-request attempt structs (allocation count unchanged).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s they happen

Provider attempts were only published to the live audit preview once the whole
request finished, so a failed primary was invisible in real time while failover
was still in flight; the failed attempt only appeared after the request was
persisted.

Add a context-scoped AttemptObserver that the gateway invokes immediately after
recording a failed attempt, which re-publishes the audit entry's attempts live.
The observer is installed only when failover targets exist (FallbackSelectors
non-empty) and fires only on failures, so non-failover/success requests — the
hot path — take on no extra per-request work (allocation count unchanged).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…he dashboard

A request that failed over (primary failed, failover succeeded) stored the
primary as resolved_model/provider instead of the model actually called: the
audit middleware re-applies the workflow's planned (primary) resolution after
the handler set the real route, clobbering it. Costs were unaffected (priced
off the failover model via usagePricingModel), but the audit route was wrong.

- middleware: don't overwrite resolved_model/provider/provider_name when a
  runtime failover was already recorded (entry.Data.Failover set).
- dashboard: the audit row shows "requested ⮕ failover-target" using
  data.failover.target_model (correct even for already-clobbered rows), in
  addition to the existing alias case.
- audit detail: split responses into per-attempt tabs whenever any attempt
  failed (not only with >1 attempt), so a failed primary shows its own tab with
  the error while failover is still in flight; a lone response tab drops the
  seq/type/status chips.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- The request/response panel hangs flush off the tab strip (rounded bottom
  corners, no top border; achieved by dropping the tablist margin rather than a
  negative panel margin).
- Tabs get visible outlines with rounded top corners, a transparent bottom edge,
  and spacing between them; the active tab opens into the panel.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 17

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/admin/dashboard/static/js/modules/conversation-drawer.js`:
- Around line 47-61: The error preview trigger in handleErrorConversationClick
is mouse-only, so make it keyboard accessible by using a focusable interactive
element or adding the needed accessibility behavior to the existing preview
target. Update the conversation drawer’s error block trigger so it can receive
focus and open on Enter/Space as well as click, while preserving the existing
drag/selection checks and the openConversation call flow.

In `@internal/admin/dashboard/templates/audit-pane.html`:
- Around line 5-9: The error preview in the audit pane is clickable via
handleErrorConversationClick but is not keyboard-accessible. Update the <pre> in
audit-pane.html so the interactive preview can receive focus and be activated
from the keyboard, using canShowConversation(pane.entry) to gate the behavior;
preserve the existing click behavior while adding the appropriate focus/keyboard
handling around pane.errorMessage and handleErrorConversationClick.

In `@internal/admin/dashboard/templates/page-audit-logs.html`:
- Around line 98-126: The audit log tab UI in the tablist/template block is
missing the full accessibility contract, so update the tab buttons and panels
together. In the Alpine loop that renders `auditPanes(entry)`, add explicit
`aria-controls` on each tab button and matching `aria-labelledby` on each
`audit-pane-tabpanel`, using stable IDs derived from `p.id` and the panel/tab
elements. Also add arrow-key handling on the tab buttons so left/right (and
optionally up/down) move focus and activate the next/previous tab by updating
`active` in the same `auditEffectiveTab` flow.

In `@internal/admin/handler_failover_test.go`:
- Around line 1-191: The failover handler tests are missing coverage for the
success/error branches in `UpsertFailoverRule`, `DeleteFailoverRule`, and
`failoverGenerateSource`, so add focused cases to exercise those paths. In
`internal/admin/handler_failover_test.go`, extend the existing
`NewHandler`/`RegisterRoutes` setup with tests that verify upsert returns 200 vs
204 depending on the request/result, delete maps a missing source to the
not-found response, and generate handles invalid JSON bodies. Reuse the existing
`failoverHandlerTestStore`, `newFailoverHandlerTestService`, and route-based
request setup to keep the tests aligned with the handler symbols.

In `@internal/auditlog/auditlog.go`:
- Around line 261-266: The truncation in truncateAttemptErrorMessage is
byte-based and can cut a UTF-8 rune in half, producing invalid text for JSON
storage. Update this helper to truncate on rune boundaries instead of slicing
message directly, keeping the same maxAttemptErrorMessageLength behavior while
ensuring the returned string remains valid UTF-8.

In `@internal/auditlog/middleware.go`:
- Around line 612-615: The direct attempt enrichment path in
EnrichLogEntryWithAttempts currently forwards raw AttemptSnapshot data without
any gating, so update it to either accept a logging Config and apply the same
GateAttemptCapture behavior as the Echo path or otherwise default to stripping
response bodies and headers before storing attempts. Keep the existing
enrichEntryWithAttempts flow, but make sure the wrapper is the only entry point
that can attach attempts to a LogEntry and that opt-in-only fields are removed
unless explicitly enabled.
- Around line 219-237: The audit log enrichment in EnrichEntryWithFailover is
incorrectly using Data.Failover as a proxy for whether the executed route was
recorded, which can suppress ResolvedModel, Provider, and ProviderName. Update
the guard logic in internal/auditlog/middleware.go so each field is only skipped
when its actual executed-route field is already populated, or first hydrate the
missing executed route data from Failover.TargetModel before applying the
workflow-derived values. Keep the existing trimmed workflow fallback behavior,
but base the decision on the concrete route fields rather than
executedRouteRecorded alone.

In `@internal/auditlog/reader_postgresql_test.go`:
- Around line 71-73: The Postgres reader tests are only stubbing the new
audit_log_attempts query as empty, so hydrated attempt results and failure paths
are not covered. Update the test setup around the fakePostgreSQLRows and the
reader behavior in audit_log_attempts so at least one case returns a non-empty
attempt row and another exercises rows.Err() or a query error. Add/adjust
assertions in reader_postgresql_test.go to verify attempt hydration and error
handling through the relevant reader methods.

In `@internal/auditlog/reader_postgresql.go`:
- Around line 252-323: The current loadAttempts helper in PostgreSQLReader
performs one audit_log_attempts query per LogEntry, causing N+1 queries during
GetLogs. Refactor loadAttempts to fetch all attempts for the batch at once using
a single WHERE audit_log_id = ANY(...) query, then group the resulting
AttemptSnapshot rows by audit_log_id and attach them back to each entry via
ensureLogData and normalizeAttemptSnapshots.
- Around line 191-193: In the audit log lookup flow in reader_postgresql.go, the
empty-result path currently treats any false from rows.Next() as “not found,”
which can mask query/cursor errors. Update the logic around rows.Next() in the
relevant reader method to check rows.Err() before returning nil,nil, and only
report a missing audit log entry when there is no rows error; if rows.Err() is
set, propagate that error instead.

In `@internal/auditlog/reader_sqlite.go`:
- Around line 335-417: The loadAttempts method is issuing one audit_log_attempts
query per LogEntry, creating an N+1 pattern when GetLogs hydrates many entries.
Change SQLiteReader.loadAttempts to fetch attempts for all entry IDs in a single
query using an IN (...) filter, then group the scanned AttemptSnapshot values by
audit_log_id and attach them back to each entry via ensureLogData. Keep the
existing scan/normalize behavior, but replace the per-entry loop query in
loadAttempts with batched lookup logic.
- Around line 189-190: The no-row branch in the SQLite reader is too permissive:
in the `rows.Next()` check inside `reader_sqlite.go`, a false result can mean
either no results or an iteration error. Update the read path to inspect
`rows.Err()` before returning `(nil, nil)` so the `read`/query logic
distinguishes an actual “not found” case from a failed iteration, using the
existing `rows` handling in this function.

In `@internal/core/errors.go`:
- Around line 54-56: captureGatewayErrorBody truncates oversized error bodies
without leaving any signal that the data is incomplete. Update the logic in
captureGatewayErrorBody so that when body exceeds maxGatewayErrorBodyBytes, it
also records truncation state or appends a clear truncation marker that
downstream consumers can detect. Keep the existing body limit behavior, but make
sure the returned/recorded error payload explicitly indicates truncation
alongside the body content.
- Around line 39-59: Raw upstream error bodies are being retained without
redaction, which can leak sensitive provider data. Update the error-body capture
path in `captureGatewayErrorBody` and the `CaptureAttemptResponseBody` flow to
scrub or redact sensitive content before persisting it for audit, similar to how
`RedactAttemptResponseHeaders` handles headers. Keep the size cap in place, but
ensure the stored `ResponseBody` is sanitized before it is attached to the
gateway error.

In `@internal/gateway/fallback_test.go`:
- Around line 10-24: The two ShouldAttemptFallback tests duplicate the same
setup/assert pattern, so refactor them into a single table-driven
TestShouldAttemptFallback using subtests. Keep the cases for the
model-availability 404 and the generic endpoint 404, and structure the table so
adding more fallback edge cases from fallback.go is straightforward. Use
core.NewProviderError and ShouldAttemptFallback in the shared loop, and verify
each case with the expected boolean.

In `@internal/gateway/fallback.go`:
- Around line 209-243: The 404 fallback logic in `isFallbackError` is too broad
and can classify unrelated NotFound responses as model fallback. Narrow the
`status == http.StatusNotFound` branch in `internal/gateway/fallback.go` so it
only matches provider-specific model unavailability phrasing, or add explicit
exclusions for generic 404s that merely contain words like “disabled”,
“deprecated”, or “unavailable”. Update the `isFallbackError` matcher and its
tests to cover both the intended `endpoint not found` case and the
false-positive 404 cases.

In `@internal/server/audit_attempts.go`:
- Around line 23-48: Add unit test coverage for the pure mapping in
auditAttemptsFromGateway to prevent drift between gateway.ProviderAttempt and
auditlog.AttemptSnapshot. Create a small table-driven test that covers both the
empty gateway.AttemptsFromContext case and a populated attempt, asserting
field-by-field mapping plus the ResponseBody and ResponseHeaders handling via
auditlog.CaptureAttemptResponseBody and auditlog.RedactAttemptResponseHeaders.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: a1ad0e27-90b8-471d-971c-7e3e925e9b90

📥 Commits

Reviewing files that changed from the base of the PR and between 17dfa09 and 1b3151f.

📒 Files selected for processing (49)
  • internal/admin/dashboard/static/css/dashboard.css
  • internal/admin/dashboard/static/js/dashboard.js
  • internal/admin/dashboard/static/js/modules/audit-list.js
  • internal/admin/dashboard/static/js/modules/conversation-drawer.js
  • internal/admin/dashboard/static/js/modules/dashboard-display.test.cjs
  • internal/admin/dashboard/static/js/modules/dashboard-layout.test.cjs
  • internal/admin/dashboard/static/js/modules/failover.js
  • internal/admin/dashboard/static/js/modules/failover.test.cjs
  • internal/admin/dashboard/templates/audit-pane.html
  • internal/admin/dashboard/templates/failover-icon.html
  • internal/admin/dashboard/templates/layout.html
  • internal/admin/dashboard/templates/page-audit-logs.html
  • internal/admin/dashboard/templates/page-models.html
  • internal/admin/dashboard/templates/page-settings.html
  • internal/admin/handler_failover.go
  • internal/admin/handler_failover_test.go
  • internal/auditlog/attempt_capture.go
  • internal/auditlog/attempt_capture_test.go
  • internal/auditlog/auditlog.go
  • internal/auditlog/middleware.go
  • internal/auditlog/middleware_test.go
  • internal/auditlog/reader_mongodb.go
  • internal/auditlog/reader_postgresql.go
  • internal/auditlog/reader_postgresql_test.go
  • internal/auditlog/reader_sqlite.go
  • internal/auditlog/store_mongodb.go
  • internal/auditlog/store_postgresql.go
  • internal/auditlog/store_sqlite.go
  • internal/auditlog/store_sqlite_test.go
  • internal/auditlog/stream_wrapper.go
  • internal/core/errors.go
  • internal/failover/factory.go
  • internal/failover/factory_test.go
  • internal/gateway/attempts.go
  • internal/gateway/attempts_test.go
  • internal/gateway/fallback.go
  • internal/gateway/fallback_test.go
  • internal/gateway/inference_execute.go
  • internal/gateway/inference_prepare.go
  • internal/live/broker.go
  • internal/live/broker_test.go
  • internal/llmclient/client.go
  • internal/server/audit_attempts.go
  • internal/server/error_support.go
  • internal/server/fallback_test.go
  • internal/server/internal_chat_completion_executor.go
  • internal/server/messages_handler.go
  • internal/server/translated_inference_service.go
  • tests/perf/hotpath_test.go

Comment thread internal/admin/dashboard/static/js/modules/conversation-drawer.js
Comment thread internal/admin/dashboard/templates/audit-pane.html
Comment thread internal/admin/dashboard/templates/page-audit-logs.html
Comment thread internal/admin/handler_failover_test.go
Comment thread internal/auditlog/auditlog.go
Comment thread internal/core/errors.go
Comment on lines +39 to +59
	// ResponseBody and ResponseHeaders carry the raw upstream error response so
	// failed provider attempts can be audited. Never serialized to API clients.
	ResponseBody    []byte      `json:"-"`
	ResponseHeaders http.Header `json:"-"`
}

// maxGatewayErrorBodyBytes caps the raw upstream error body retained for audit.
const maxGatewayErrorBodyBytes = 64 * 1024

// captureGatewayErrorBody returns a bounded copy of an upstream error body so
// the original buffer is not retained and large bodies cannot bloat memory.
func captureGatewayErrorBody(body []byte) []byte {
	if len(body) == 0 {
		return nil
	}
	if len(body) > maxGatewayErrorBodyBytes {
		body = body[:maxGatewayErrorBodyBytes]
	}
	out := make([]byte, len(body))
	copy(out, body)
	return out

🔒 Security & Privacy | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
ast-grep run --pattern 'func CaptureAttemptResponseBody($$$) {
  $$$
}' --lang go

rg -n -A 15 'func CaptureAttemptResponseBody' --type=go
rg -n -A 15 'func RedactAttemptResponseHeaders' --type=go

Repository: ENTERPILOT/GoModel

Length of output: 156


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== files ==\n'
git ls-files internal/core/errors.go internal/server/error_support.go

printf '\n== outline internal/core/errors.go ==\n'
ast-grep outline internal/core/errors.go --lang go || true

printf '\n== outline internal/server/error_support.go ==\n'
ast-grep outline internal/server/error_support.go --lang go || true

printf '\n== search for GatewayError / AttemptSnapshot / CaptureAttemptResponseBody ==\n'
rg -n -A 20 -B 10 'GatewayError|AttemptSnapshot|CaptureAttemptResponseBody|RedactAttemptResponseHeaders|ResponseBody|ResponseHeaders' internal/core/errors.go internal/server/error_support.go

Repository: ENTERPILOT/GoModel

Length of output: 24881


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '1,220p' internal/core/errors.go | cat -n
printf '\n---\n'
sed -n '1,220p' internal/server/error_support.go | cat -n

Repository: ENTERPILOT/GoModel

Length of output: 12478


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== search auditlog body/header helpers ==\n'
rg -n -A 20 -B 5 'auditlog\.(Capture|Redact).*Response|CaptureAttempt|RedactAttempt|ResponseHeaders|ResponseBody' .

Repository: ENTERPILOT/GoModel

Length of output: 50375


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== auditlog attempt helpers ==\n'
rg -n -A 20 -B 5 'func (CaptureAttemptResponseBody|RedactAttemptResponseHeaders|CaptureAttemptResponseHeaders|CaptureAttemptResponseBody)|CaptureAttemptResponseBody\(|RedactAttemptResponseHeaders\(' internal/auditlog internal/server

printf '\n== auditlog entry body handling ==\n'
rg -n -A 15 -B 5 'ResponseBody|ResponseHeaders|Redact|Capture' internal/auditlog

Repository: ENTERPILOT/GoModel

Length of output: 50375


Raw provider error bodies are stored without content redaction. CaptureAttemptResponseBody keeps the upstream body verbatim as JSON/string, and captureGatewayErrorBody only truncates size; unlike RedactAttemptResponseHeaders, there’s no body-level scrubbing before audit persistence. That can leak prompt fragments or other sensitive provider data into logs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/core/errors.go` around lines 39 - 59, Raw upstream error bodies are
being retained without redaction, which can leak sensitive provider data. Update
the error-body capture path in `captureGatewayErrorBody` and the
`CaptureAttemptResponseBody` flow to scrub or redact sensitive content before
persisting it for audit, similar to how `RedactAttemptResponseHeaders` handles
headers. Keep the size cap in place, but ensure the stored `ResponseBody` is
sanitized before it is attached to the gateway error.

Comment thread internal/core/errors.go
Comment on lines +54 to +56
	if len(body) > maxGatewayErrorBodyBytes {
		body = body[:maxGatewayErrorBodyBytes]
	}

🗄️ Data Integrity & Integration | 🔵 Trivial | ⚡ Quick win

Truncation happens silently.

captureGatewayErrorBody truncates bodies over 64KiB with no indication left for downstream consumers (e.g., a truncated flag or marker), so anyone debugging from the audit trail can't tell a captured body is incomplete.

🛡️ Suggested tweak
 func captureGatewayErrorBody(body []byte) []byte {
 	if len(body) == 0 {
 		return nil
 	}
+	truncated := len(body) > maxGatewayErrorBodyBytes
 	if len(body) > maxGatewayErrorBodyBytes {
 		body = body[:maxGatewayErrorBodyBytes]
 	}
 	out := make([]byte, len(body))
 	copy(out, body)
+	if truncated {
+		// surface truncation to callers, e.g. via a sentinel suffix or a
+		// separate bool return, so audit consumers know data was cut.
+	}
 	return out
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/core/errors.go` around lines 54 - 56, captureGatewayErrorBody
truncates oversized error bodies without leaving any signal that the data is
incomplete. Update the logic in captureGatewayErrorBody so that when body
exceeds maxGatewayErrorBodyBytes, it also records truncation state or appends a
clear truncation marker that downstream consumers can detect. Keep the existing
body limit behavior, but make sure the returned/recorded error payload
explicitly indicates truncation alongside the body content.

Comment thread internal/gateway/fallback_test.go Outdated
Comment on lines 209 to 243
 	message := strings.ToLower(strings.TrimSpace(gatewayErr.Message))
-	if !strings.Contains(message, "model") {
-		return false
+	if strings.Contains(message, "model") {
+		for _, fragment := range []string{
+			"not found",
+			"does not exist",
+			"unsupported",
+			"unavailable",
+			"not available",
+			"deprecated",
+			"retired",
+			"disabled",
+		} {
+			if strings.Contains(message, fragment) {
+				return true
+			}
+		}
 	}

-	for _, fragment := range []string{
-		"not found",
-		"does not exist",
-		"unsupported",
-		"unavailable",
-		"not available",
-		"deprecated",
-		"retired",
-		"disabled",
-	} {
-		if strings.Contains(message, fragment) {
-			return true
+	if status == http.StatusNotFound {
+		for _, fragment := range []string{
+			"unsupported",
+			"unavailable",
+			"not available",
+			"deprecated",
+			"retired",
+			"disabled",
+		} {
+			if strings.Contains(message, fragment) {
+				return true
+			}
 		}
 	}

 	return false
 }

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n -B2 -A2 'StatusNotFound' internal/gateway/fallback.go internal/gateway/fallback_test.go

Repository: ENTERPILOT/GoModel

Length of output: 1237


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '--- internal/gateway/fallback.go ---\n'
sed -n '200,255p' internal/gateway/fallback.go

printf '\n--- internal/gateway/fallback_test.go ---\n'
sed -n '1,120p' internal/gateway/fallback_test.go

Repository: ENTERPILOT/GoModel

Length of output: 247


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '%s\n' '--- internal/gateway/fallback.go ---'
sed -n '200,255p' internal/gateway/fallback.go

printf '%s\n' ''
printf '%s\n' '--- internal/gateway/fallback_test.go ---'
sed -n '1,140p' internal/gateway/fallback_test.go

Repository: ENTERPILOT/GoModel

Length of output: 2057


Tighten the 404 fallback matcher. The new status == http.StatusNotFound branch catches provider-specific “not available” phrasing, but it also falls back on unrelated 404s that contain words like disabled, deprecated, or unavailable. The existing test only covers endpoint not found, so add a narrower gate for the 404 path or cover the false-positive cases explicitly.
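
One way to narrow the 404 path is sketched below. This is not the code in `internal/gateway/fallback.go`: the helper name `shouldFallbackOn404` and the `routeFragments` exclusion list are assumptions made for illustration, and only the availability keywords are taken from the diff above. The idea is to reject route-style 404s first, then match availability wording.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// availabilityFragments mirrors the keywords in the 404 branch of the diff.
var availabilityFragments = []string{
	"unsupported", "unavailable", "not available",
	"deprecated", "retired", "disabled",
}

// routeFragments is a hypothetical exclusion list: messages that talk about
// routes or endpoints are treated as generic 404s, never as model fallback.
var routeFragments = []string{"endpoint", "route", "path", "page"}

// shouldFallbackOn404 reports whether a 404 message reads like a model or
// deployment availability problem rather than a missing route.
func shouldFallbackOn404(status int, message string) bool {
	if status != http.StatusNotFound {
		return false
	}
	msg := strings.ToLower(strings.TrimSpace(message))
	for _, f := range routeFragments {
		if strings.Contains(msg, f) {
			return false
		}
	}
	for _, f := range availabilityFragments {
		if strings.Contains(msg, f) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldFallbackOn404(404, "This deployment is not available in your region"))
	fmt.Println(shouldFallbackOn404(404, "endpoint disabled"))
	fmt.Println(shouldFallbackOn404(404, "endpoint not found"))
}
```

Substring exclusions are blunt (for example, `path` also matches unrelated words), so a real matcher would likely want provider-specific phrases instead.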

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/gateway/fallback.go` around lines 209 - 243, The 404 fallback logic
in `isFallbackError` is too broad and can classify unrelated NotFound responses
as model fallback. Narrow the `status == http.StatusNotFound` branch in
`internal/gateway/fallback.go` so it only matches provider-specific model
unavailability phrasing, or add explicit exclusions for generic 404s that merely
contain words like “disabled”, “deprecated”, or “unavailable”. Update the
`isFallbackError` matcher and its tests to cover both the intended `endpoint not
found` case and the false-positive 404 cases.

Comment on lines +23 to +48
func auditAttemptsFromGateway(ctx context.Context) []auditlog.AttemptSnapshot {
	gatewayAttempts := gateway.AttemptsFromContext(ctx)
	if len(gatewayAttempts) == 0 {
		return nil
	}
	attempts := make([]auditlog.AttemptSnapshot, 0, len(gatewayAttempts))
	for _, attempt := range gatewayAttempts {
		attempts = append(attempts, auditlog.AttemptSnapshot{
			Seq:             attempt.Seq,
			Kind:            attempt.Kind,
			ProviderType:    attempt.ProviderType,
			ProviderName:    attempt.ProviderName,
			Model:           attempt.Model,
			StatusCode:      attempt.StatusCode,
			Success:         attempt.Success,
			ErrorType:       attempt.ErrorType,
			ErrorCode:       attempt.ErrorCode,
			ErrorMessage:    attempt.ErrorMessage,
			StartedAt:       attempt.StartedAt,
			DurationNs:      attempt.DurationNs,
			ResponseBody:    auditlog.CaptureAttemptResponseBody(attempt.ResponseBody),
			ResponseHeaders: auditlog.RedactAttemptResponseHeaders(attempt.ResponseHeaders),
		})
	}
	return attempts
}

📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add unit test coverage for the gateway → audit attempt mapping.

auditAttemptsFromGateway is a pure mapping function (easy to test in isolation) but no corresponding test file is included in this cohort, and the PR comments note broadly low patch coverage across the failover/attempt-tracking changes. A small table-driven test asserting field-by-field mapping (including body/header capture and the empty-attempts short-circuit) would be low-cost and catch future field-mapping drift between gateway.ProviderAttempt and auditlog.AttemptSnapshot.
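
A rough shape for such a table-driven check is sketched below. The types `providerAttempt` and `attemptSnapshot` and the function `mapAttempts` are simplified stand-ins invented for this sketch, not `gateway.ProviderAttempt`, `auditlog.AttemptSnapshot`, or the real mapper. In the repository this would live in a `_test.go` file and use `testing.T`; a plain function is used here so the example is self-contained.

```go
package main

import (
	"fmt"
	"reflect"
)

// providerAttempt and attemptSnapshot are hypothetical, trimmed-down
// stand-ins for the gateway and audit log types.
type providerAttempt struct {
	Seq        int
	Model      string
	StatusCode int
	Success    bool
}

type attemptSnapshot struct {
	Seq        int
	Model      string
	StatusCode int
	Success    bool
}

// mapAttempts mirrors the shape of the mapper under review: nil for an empty
// input, otherwise one snapshot per attempt with fields copied across.
func mapAttempts(in []providerAttempt) []attemptSnapshot {
	if len(in) == 0 {
		return nil
	}
	out := make([]attemptSnapshot, 0, len(in))
	for _, a := range in {
		out = append(out, attemptSnapshot{
			Seq: a.Seq, Model: a.Model, StatusCode: a.StatusCode, Success: a.Success,
		})
	}
	return out
}

type mappingCase struct {
	name string
	in   []providerAttempt
	want []attemptSnapshot
}

// runMappingCases returns the names of cases whose output differs from want.
// reflect.DeepEqual distinguishes nil from empty slices, so the empty-input
// short-circuit is checked as well.
func runMappingCases() []string {
	cases := []mappingCase{
		{name: "no attempts", in: nil, want: nil},
		{
			name: "failed then succeeded",
			in: []providerAttempt{
				{Seq: 1, Model: "primary", StatusCode: 503},
				{Seq: 2, Model: "backup", StatusCode: 200, Success: true},
			},
			want: []attemptSnapshot{
				{Seq: 1, Model: "primary", StatusCode: 503},
				{Seq: 2, Model: "backup", StatusCode: 200, Success: true},
			},
		},
	}
	var failed []string
	for _, tc := range cases {
		if got := mapAttempts(tc.in); !reflect.DeepEqual(got, tc.want) {
			failed = append(failed, tc.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failing cases:", runMappingCases())
}
```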

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/server/audit_attempts.go` around lines 23 - 48, Add unit test
coverage for the pure mapping in auditAttemptsFromGateway to prevent drift
between gateway.ProviderAttempt and auditlog.AttemptSnapshot. Create a small
table-driven test that covers both the empty gateway.AttemptsFromContext case
and a populated attempt, asserting field-by-field mapping plus the ResponseBody
and ResponseHeaders handling via auditlog.CaptureAttemptResponseBody and
auditlog.RedactAttemptResponseHeaders.

SantiagoDePolonia and others added 3 commits June 30, 2026 22:19
…config

Resolve the still-valid reviewer comments on the manual-failover feature:

- service: Upsert now distinguishes ErrNotFound from other Store.Get errors and
  propagates the latter, so a transient read fault no longer silently resets
  created_at via stampUpsert (+ test).
- sqlite migration: copy the legacy primary key TRIM()'d so a padded key
  ' gpt-4o ' migrates to 'gpt-4o' and stays reachable by Get/Delete (which trim
  input). Extend the migration test to also assert enabled/managed_source/
  created_at/updated_at survive and the trimmed key is found.
- postgres migration: scope the information_schema.columns probes with
  table_schema = current_schema() so a same-named table in another schema can't
  skip the source->primary_model / targets->fallback_models rename.
- config: reject FAILOVER_DISABLED_MODELS_JSON=null instead of silently
  accepting a nil slice (parity with the manual-rules path).
- docs: fix the broken failover overview sentence.
- dashboard: remove the now-dead failoverFormOriginalSource state (the form
  source is read-only in edit, so it was never used).

Left intentionally unaddressed (rationale): the Mongo context.Background() index
init is a pattern shared across ~10 stores (a one-off would be inconsistent);
default_mode:off disabling is an intentional deprecation; the /generate
failover.View draft shape and the OpenAPI 204/required gaps need the generated
docs pipeline; GenerateFailoverRules O(N^2) is an admin-only manual endpoint;
the resolver precedence is already neutralized by Service.mergeConfig. The 3
CodeQL "DB query from user-controlled source" flags on store_mongodb.go are
false positives (parameterized bson values at a constant _id field).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verified the latest CodeRabbit/Greptile comments on PR #444 and applied the
ones that hold up against current code:

- reader_sqlite/reader_postgresql: check rows.Err() on the empty-result path so
  a query/iteration failure is no longer reported to callers as "log not found".
- middleware enrichEntryWithWorkflow: gate each executed-route field on its own
  concrete value instead of on the failover snapshot alone. A failover snapshot
  only proves Failover.TargetModel was set, so the coarse guard could leave
  resolved_model/provider/provider_name blank when the executed route populated
  some but not all of them; the workflow's planned value now fills any gap.
  Behavior is unchanged on the common (non-failover) path.
- truncateAttemptErrorMessage: back the cut off to a rune boundary so a
  multi-byte rune in non-ASCII provider errors is not split into invalid UTF-8.

Added table-driven coverage for the truncation helper and a middleware test for
the partial-route failover case.
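
For reference, the rune-boundary cut described above can be sketched as follows. This is a minimal illustration, not the body of `truncateAttemptErrorMessage`; the helper name `truncateOnRuneBoundary` and its `maxBytes` parameter are assumptions, and the input is assumed to be valid UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateOnRuneBoundary returns at most maxBytes bytes of s. If the byte
// limit falls inside a multi-byte rune, the cut is moved back to the start
// of that rune so the result stays valid UTF-8.
func truncateOnRuneBoundary(s string, maxBytes int) string {
	if maxBytes <= 0 {
		return ""
	}
	if len(s) <= maxBytes {
		return s
	}
	cut := maxBytes
	// s[cut] is the first byte that would be dropped. If it is a
	// continuation byte, the rune before it would be split, so back off.
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

func main() {
	// "é" is two bytes, so a 2-byte limit lands in the middle of it.
	fmt.Printf("%q\n", truncateOnRuneBoundary("héllo", 2))
	fmt.Printf("%q\n", truncateOnRuneBoundary("héllo", 3))
}
```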

Verified and intentionally skipped:
- Greptile's two "second look" areas (effectiveManualRules merge order,
  broadened 404 fallback trigger): both correct. Service.mergeConfig appends
  config rows unconditionally with an atomic swap, so dynamic rules already carry
  config-wins values; the 404 path only fires on availability-keyword messages.
- EnrichLogEntryWithAttempts re-gating: its sole caller already gates with the
  real logger Config; re-gating with Config{} would strip opt-in bodies/headers.
- Attempt error-body redaction: bodies are opt-in (LOGGING_LOG_BODIES) and stored
  verbatim system-wide; only credential-bearing headers are redacted.
- N+1 attempt hydration, dashboard a11y, extra coverage, table-driven fallback
  test, truncation marker: valid but deferred as separate follow-ups.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
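
The `rows.Err()` fix in the commit above follows a common `database/sql` pattern: a false `rows.Next()` means either "no rows" or "iteration failed", and only `rows.Err()` tells them apart. A minimal sketch, assuming a hypothetical helper `hasRow` and a small `rowCursor` interface (which `*sql.Rows` satisfies) so the example runs without a database:

```go
package main

import (
	"errors"
	"fmt"
)

// rowCursor is the subset of *sql.Rows this sketch needs.
type rowCursor interface {
	Next() bool
	Err() error
}

// hasRow advances the cursor once. It returns (false, nil) only when the
// result set is really empty; an iteration error is propagated instead of
// being reported as "not found".
func hasRow(rows rowCursor) (bool, error) {
	if rows.Next() {
		return true, nil
	}
	if err := rows.Err(); err != nil {
		return false, fmt.Errorf("iterate audit log rows: %w", err)
	}
	return false, nil
}

// fakeRows is a test double standing in for a real cursor.
type fakeRows struct {
	remaining int
	err       error
}

func (f *fakeRows) Next() bool {
	if f.err != nil || f.remaining == 0 {
		return false
	}
	f.remaining--
	return true
}

func (f *fakeRows) Err() error { return f.err }

// errFake simulates a cursor failure such as a dropped connection.
var errFake = errors.New("connection reset")

func main() {
	fmt.Println(hasRow(&fakeRows{}))             // empty result set
	fmt.Println(hasRow(&fakeRows{err: errFake})) // failed iteration
	fmt.Println(hasRow(&fakeRows{remaining: 1})) // one row available
}
```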
…ew nits

Worked through the rest of the latest CodeRabbit review batch on PR #444,
fixing the items that hold up against current code:

- reader_sqlite/reader_postgresql: batch provider-attempt hydration into a
  single IN (...) / = ANY($1) query keyed by audit_log_id instead of one query
  per returned entry. A 100-row dashboard page went from 101 reads to 2.
- gateway: convert the ShouldAttemptFallback tests to a table and document the
  404 matcher explicitly — availability/deprecated 404s fall back, while route
  and endpoint-not-found 404s do not (covers the false-positive cases).
- dashboard a11y: make the clickable error preview keyboard-operable
  (role/tabindex + Enter/Space) and complete the request/response tab ARIA
  contract (id/aria-controls/aria-labelledby, roving tabindex, and arrow +
  Home/End navigation via a new auditTabKeydown helper).
- tests: add coverage for auditTabKeydown, the tab ARIA markup, and the admin
  failover handler's upsert(200)/delete(204)/delete-missing(404)/generate
  invalid-body(400) branches.

Verified and intentionally skipped:
- captureGatewayErrorBody truncation marker: a suffix marker would corrupt the
  JSON bodies the audit pane parses; a structured flag isn't worth the plumbing.
- Raw error-body content redaction: bodies are opt-in (LOGGING_LOG_BODIES) and
  stored verbatim system-wide; only credential-bearing headers are redacted.
- EnrichLogEntryWithAttempts re-gating: its sole caller already gates with the
  real logger Config; re-gating with Config{} would strip opt-in bodies/headers.
- auditAttemptsFromGateway unit test: the gateway attempt recorder has no
  exported writer, so a focused test would require test-only public API; the
  mapping is exercised end-to-end. Non-empty Postgres attempt hydration is
  covered on the SQLite path against a real in-memory DB.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
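
The batched hydration from the commit above boils down to two steps: build one `IN (...)` filter for all entry IDs, then group the returned rows by `audit_log_id`. The sketch below shows only that shape for the SQLite-style placeholder form; `attemptRow`, `inClause`, `groupByLogID`, and the `seq` column name are assumptions for illustration, not the repository's code.

```go
package main

import (
	"fmt"
	"strings"
)

// attemptRow is a hypothetical stand-in for a scanned audit_log_attempts row.
type attemptRow struct {
	AuditLogID string
	Seq        int
}

// inClause builds "?, ?, ..." placeholders and the matching args so that all
// IDs can be bound in one parameterized query.
func inClause(ids []string) (string, []any) {
	placeholders := make([]string, len(ids))
	args := make([]any, len(ids))
	for i, id := range ids {
		placeholders[i] = "?"
		args[i] = id
	}
	return strings.Join(placeholders, ", "), args
}

// groupByLogID buckets the rows from the single batched query so each log
// entry can pick up its own attempts without another round trip.
func groupByLogID(rows []attemptRow) map[string][]attemptRow {
	grouped := make(map[string][]attemptRow)
	for _, r := range rows {
		grouped[r.AuditLogID] = append(grouped[r.AuditLogID], r)
	}
	return grouped
}

func main() {
	placeholders, args := inClause([]string{"log-1", "log-2"})
	query := "SELECT audit_log_id, seq FROM audit_log_attempts " +
		"WHERE audit_log_id IN (" + placeholders + ") ORDER BY audit_log_id, seq"
	fmt.Println(query)
	fmt.Println(len(args), "bound args")

	// Rows as they might come back from the query above.
	grouped := groupByLogID([]attemptRow{
		{AuditLogID: "log-1", Seq: 1},
		{AuditLogID: "log-1", Seq: 2},
		{AuditLogID: "log-2", Seq: 1},
	})
	fmt.Println(len(grouped["log-1"]), len(grouped["log-2"]))
}
```

An empty ID list would produce `IN ()`, which is not valid SQL, so a caller would need to skip the query in that case.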
Comment thread internal/admin/dashboard/static/js/modules/audit-list.js Fixed

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/failover/store_postgresql.go`:
- Around line 47-56: Normalize legacy PostgreSQL failover rule data during
migration by trimming whitespace from the renamed primary model values after the
existing source-to-primary_model rename in the failover_rules migration. Update
the migration logic in store_postgresql.go alongside the schema rename steps so
upgraded rows match the trimming behavior already used by Get and stay reachable
after upgrade.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5b110a03-47da-48e4-b601-1a0b0c60c503

📥 Commits

Reviewing files that changed from the base of the PR and between 1b3151f and a40b223.

📒 Files selected for processing (21)
  • config/fallback.go
  • docs/features/failover.mdx
  • internal/admin/dashboard/static/js/modules/audit-list.js
  • internal/admin/dashboard/static/js/modules/audit-list.test.cjs
  • internal/admin/dashboard/static/js/modules/dashboard-layout.test.cjs
  • internal/admin/dashboard/static/js/modules/failover.js
  • internal/admin/dashboard/templates/audit-pane.html
  • internal/admin/dashboard/templates/page-audit-logs.html
  • internal/admin/handler_failover_test.go
  • internal/auditlog/auditlog.go
  • internal/auditlog/auditlog_test.go
  • internal/auditlog/middleware.go
  • internal/auditlog/middleware_test.go
  • internal/auditlog/reader_postgresql.go
  • internal/auditlog/reader_sqlite.go
  • internal/failover/service.go
  • internal/failover/service_test.go
  • internal/failover/store_postgresql.go
  • internal/failover/store_sqlite.go
  • internal/failover/store_sqlite_test.go
  • internal/gateway/fallback_test.go
💤 Files with no reviewable changes (1)
  • internal/admin/dashboard/static/js/modules/failover.js

Comment thread internal/failover/store_postgresql.go
Two CI failures on PR #444:

- lint: CI pinned golangci-lint v2.10, whose bundled staticcheck mis-analyzes
  `if x == nil { t.Fatal() }` guards under the repo's Go 1.26.4 and emits
  false-positive SA5011 warnings (surfaced once this PR invalidated the cached
  analysis of internal/core). Bump the pin to v2.12, whose staticcheck handles
  Go 1.26; the full module lints clean locally on v2.12.

- Integration Tests: the new audit_log_attempts table has a FK to audit_logs
  (ON DELETE CASCADE), so the PostgreSQL teardown's `DROP TABLE audit_logs`
  failed with "other objects depend on it". Drop the child table first and add
  CASCADE to cover future dependents.

The 3 CodeQL go/sql-injection alerts on the new Mongo failover store were
dismissed as false positives: the _id values are Go strings marshaled as BSON
string values, which cannot become Mongo query operators.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/integration/setup_test.go (1)

180-198: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Redundant ordering comment given CASCADE.

Once CASCADE is added to the DROP TABLE statement, Postgres will automatically drop audit_log_attempts when audit_logs is dropped regardless of list order, making the explicit "listed first" comment slightly inaccurate framing (it's defensive but not load-bearing). Not a functional problem — just a minor clarity nit.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/setup_test.go` around lines 180 - 198, The reset-table
ordering comment in the integration test setup is misleading now that DROP TABLE
uses CASCADE, since the explicit “listed first” framing is no longer necessary.
Update the comment near the tables list in setup_test.go to reflect that CASCADE
removes dependent foreign-key constraints automatically, or remove the ordering rationale
entirely, while leaving the table reset loop and table names unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: f3b635f1-7c7c-4135-b2de-e0432a40180c

📥 Commits

Reviewing files that changed from the base of the PR and between a40b223 and e89e92a.

📒 Files selected for processing (2)
  • .github/workflows/test.yml
  • tests/integration/setup_test.go


…gnment

Two follow-up review comments:

- store_postgresql migration: after renaming source -> primary_model, trim
  padded values with btrim so legacy padded keys stay reachable by Get/Delete
  (which trim input), matching the SQLite migration contract.
- audit-list.js auditTabKeydown: drop the unused `let next = idx` initializer
  (flagged by CodeQL/code-quality); the switch assigns next on every handled
  key and unhandled keys return early.
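
Why padded keys become unreachable can be shown with a small in-memory sketch. Everything here is hypothetical: `ruleStore` stands in for the failover rules table, the model names are made up, and `strings.TrimSpace` is assumed as the trimming used by Get. Note that one-argument `btrim` in PostgreSQL strips spaces only, while `strings.TrimSpace` strips all Unicode whitespace; the sketch glosses over that difference.

```go
package main

import (
	"fmt"
	"strings"
)

// ruleStore is a hypothetical in-memory stand-in for the failover rules
// table, keyed by primary_model.
type ruleStore map[string]string

// Get trims its input before the lookup, as the commit message says
// Get/Delete do.
func (s ruleStore) Get(model string) (string, bool) {
	v, ok := s[strings.TrimSpace(model)]
	return v, ok
}

// normalizeKeys mimics the migration step: every stored key is rewritten to
// its trimmed form so trimmed lookups can find it again.
func normalizeKeys(s ruleStore) ruleStore {
	out := make(ruleStore, len(s))
	for k, v := range s {
		out[strings.TrimSpace(k)] = v
	}
	return out
}

func main() {
	legacy := ruleStore{"  model-a  ": "model-b"}

	_, found := legacy.Get("model-a")
	fmt.Println("before migration:", found)

	_, found = normalizeKeys(legacy).Get("model-a")
	fmt.Println("after migration:", found)
}
```

The lookup trims but the stored key does not, so the two never match until the stored side is normalized as well.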

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/failover/store_postgresql.go`:
- Around line 48-50: The trim migration in the PostgreSQL failover setup is tied
to the source rename flow, so it can be skipped for databases that already have
legacy padded primary_model values. In store_postgresql.go, move the UPDATE ...
btrim(...) logic into its own separate IF EXISTS primary_model block after the
rename handling in the failover migration path, so the trim runs independently
and is idempotent. Use the existing migration logic around the failover_rules
update and the primary_model/source rename checks to keep the behavior
consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 184abf1b-ffb7-4495-a921-52818b8c0b01

📥 Commits

Reviewing files that changed from the base of the PR and between e89e92a and 9089fa2.

📒 Files selected for processing (2)
  • internal/admin/dashboard/static/js/modules/audit-list.js
  • internal/failover/store_postgresql.go

Comment thread internal/failover/store_postgresql.go
`fallback.overrides` was a deprecated, undocumented per-model map whose only
live effect (mode: off) duplicated `disabled_models`; manual/auto were already
no-ops. Removed the Overrides field, the FallbackModelOverride type, its
load-time normalization/validation, and the disable consumer.

Compatibility: config.yaml is parsed with yaml.v3's lenient Unmarshal, so an
existing `fallback.overrides` block still loads without error — it is simply
ignored. The only behavior change is for configs that used
`overrides: {m: {mode: off}}` to disable failover for a model: that disable is
no longer applied. Migrate those to `disabled_models: [m]` (identical effect).
The field was YAML-only (no env var), so env-based deployments are unaffected.
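
The suggested migration amounts to collecting every model whose legacy override was `mode: off` and listing it under `disabled_models`. A minimal sketch, assuming a simplified `legacyOverride` type and a hypothetical helper `disabledFromOverrides` (neither exists in the repository):

```go
package main

import (
	"fmt"
	"sort"
)

// legacyOverride mirrors one entry of the removed fallback.overrides map.
// Only the mode field is modelled; the type is illustrative.
type legacyOverride struct {
	Mode string
}

// disabledFromOverrides returns the models whose legacy override was
// mode "off", sorted for stable output. Entries with mode "manual" or
// "auto" are skipped because they were already no-ops.
func disabledFromOverrides(overrides map[string]legacyOverride) []string {
	var disabled []string
	for model, o := range overrides {
		if o.Mode == "off" {
			disabled = append(disabled, model)
		}
	}
	sort.Strings(disabled)
	return disabled
}

func main() {
	overrides := map[string]legacyOverride{
		"m": {Mode: "off"},
		"n": {Mode: "manual"},
		"p": {Mode: "auto"},
	}
	// The result is what would go under disabled_models in config.yaml.
	fmt.Println(disabledFromOverrides(overrides))
}
```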

Tests updated to assert the legacy block now loads and has no effect; the
obsolete overrides duplicate-key test is dropped (the manual-rules duplicate
test still covers that validation path).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@config/config_test.go`:
- Around line 793-798: The test function in config_test.go defines a local
string named yaml that shadows the imported yaml package used elsewhere in the
file. Rename the local variable in this test to a more specific, idiomatic name
such as yamlConfig or yamlContent, and update its use in the surrounding test
setup so the yaml package identifier remains available if needed.
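
The shadowing problem can be illustrated with a standard-library analogue. The test file itself is not shown here, so this sketch uses `encoding/json` in place of the yaml package; the `config` type and `parse` function are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// config is a hypothetical stand-in for the parsed configuration.
type config struct {
	DisabledModels []string `json:"disabled_models"`
}

func parse() (config, error) {
	// Naming this variable `json` would shadow the imported package for the
	// rest of the function, and the json.Unmarshal call below would no
	// longer compile. A more specific name avoids that.
	jsonData := `{"disabled_models": ["m"]}`

	var cfg config
	err := json.Unmarshal([]byte(jsonData), &cfg)
	return cfg, err
}

func main() {
	cfg, err := parse()
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(cfg.DisabledModels)
}
```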

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d7325c4a-0d30-4519-9a03-a63b0604ebaf

📥 Commits

Reviewing files that changed from the base of the PR and between 9089fa2 and c5701b5.

📒 Files selected for processing (2)
  • config/config_test.go
  • config/fallback.go
💤 Files with no reviewable changes (1)
  • config/fallback.go

Comment thread config/config_test.go Outdated
SantiagoDePolonia and others added 2 commits July 1, 2026 00:01
Address two follow-up review comments:

- store_postgresql migration: move the primary_model btrim into its own
  `IF EXISTS primary_model` block that runs independently of the source->
  primary_model rename. Previously, a database already migrated by an earlier,
  non-trimming version kept its padded keys; now they are normalized on the
  next startup, and the update is a no-op once the keys are trimmed.
- config_test: rename the local `yaml` string to `yamlData` so it no longer
  shadows the imported yaml package in the new legacy-overrides test.
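
The shape of the migration described in the first item might look like the following. This is an illustrative PL/pgSQL block held in a Go constant, not the statement shipped in store_postgresql.go; the table name `failover_rules`, the `information_schema` checks, and the helper `guardedBlocks` are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// migrateFailoverRules is an illustrative migration, written so that the
// trim step does not depend on the rename step having run.
const migrateFailoverRules = `
DO $$
BEGIN
    -- Step 1: rename the legacy column if it is still present.
    IF EXISTS (
        SELECT 1 FROM information_schema.columns
        WHERE table_name = 'failover_rules' AND column_name = 'source'
    ) THEN
        ALTER TABLE failover_rules RENAME COLUMN source TO primary_model;
    END IF;

    -- Step 2: trim padded keys whenever primary_model exists, whether or
    -- not step 1 ran. The WHERE clause makes repeat runs a no-op.
    IF EXISTS (
        SELECT 1 FROM information_schema.columns
        WHERE table_name = 'failover_rules' AND column_name = 'primary_model'
    ) THEN
        UPDATE failover_rules
        SET primary_model = btrim(primary_model)
        WHERE primary_model <> btrim(primary_model);
    END IF;
END $$;
`

// guardedBlocks counts the independent IF ... END IF blocks in a migration.
func guardedBlocks(sql string) int {
	return strings.Count(sql, "END IF;")
}

func main() {
	fmt.Println("guarded blocks:", guardedBlocks(migrateFailoverRules))
	fmt.Println(strings.TrimSpace(migrateFailoverRules))
}
```

One caveat for a sketch like this: if primary_model carries a uniqueness constraint and both a padded and an unpadded copy of the same key exist, the UPDATE would fail, so a real migration may need to resolve such collisions first.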

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ResolveFallbacks runs on every dispatch (the live-attempt observer gates on
FallbackSelectors), and it called Service.Rules() and Service.Disabled(), each
of which cloned every failover rule via List() and rebuilt a map — a
per-request allocation proportional to the rule count whenever the failover
Service is wired.

Precompute the rules and disabled lookup maps once per Refresh and publish them
atomically as an immutable *ruleSnapshot. Rules()/Disabled() now return the
cached maps (read-only) with no per-call cloning or map building; List() reads
the same snapshot's rows. The only callers (resolver effectiveManualRules/
effectiveDisabled) range and copy, so they never mutate the shared maps.

Added a regression test asserting the maps are reused across calls and a new
snapshot is published only on Refresh.
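
The snapshot pattern described above can be sketched with `sync/atomic`. This is a simplified model under stated assumptions, not the code in internal/failover: the `Rule` fields, the `current` helper, and the way disabled rules are represented are all hypothetical, and only the names `ruleSnapshot`, `Service`, `Refresh`, `Rules`, and `Disabled` come from the commit message.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Rule is a simplified stand-in for a stored failover rule.
type Rule struct {
	PrimaryModel string
	Fallbacks    []string
	Disabled     bool
}

// ruleSnapshot holds lookup maps built once per refresh. It is never mutated
// after being published, so readers can share it without locking.
type ruleSnapshot struct {
	rows     []Rule
	rules    map[string][]string
	disabled map[string]bool
}

// Service publishes the current snapshot through an atomic pointer.
type Service struct {
	snap atomic.Pointer[ruleSnapshot]
}

// Refresh rebuilds the maps from rows and swaps them in atomically.
func (s *Service) Refresh(rows []Rule) {
	next := &ruleSnapshot{
		rows:     append([]Rule(nil), rows...),
		rules:    make(map[string][]string, len(rows)),
		disabled: make(map[string]bool),
	}
	for _, r := range next.rows {
		next.rules[r.PrimaryModel] = r.Fallbacks
		if r.Disabled {
			next.disabled[r.PrimaryModel] = true
		}
	}
	s.snap.Store(next)
}

// current returns the published snapshot, or an empty one before the first
// Refresh.
func (s *Service) current() *ruleSnapshot {
	if snap := s.snap.Load(); snap != nil {
		return snap
	}
	return &ruleSnapshot{}
}

// Rules returns the cached map. Callers must treat it as read-only.
func (s *Service) Rules() map[string][]string { return s.current().rules }

// Disabled returns the cached set. Callers must treat it as read-only.
func (s *Service) Disabled() map[string]bool { return s.current().disabled }

func main() {
	var svc Service
	svc.Refresh([]Rule{{PrimaryModel: "model-a", Fallbacks: []string{"model-b"}}})

	// Repeated reads hit the same snapshot; nothing is cloned per call.
	fmt.Println(svc.current() == svc.current())
	fmt.Println(svc.Rules()["model-a"])
}
```

The trade-off is that the returned maps are shared, so correctness relies on callers never writing to them; the commit message notes that the only callers range over the maps and copy.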

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@SantiagoDePolonia SantiagoDePolonia merged commit abc9129 into main Jun 30, 2026
20 checks passed

3 participants