feat: add Polly circuit breaker for external API calls (HARD-01) #924
Chris0Jeky merged 17 commits into main
…plication layer

Thread-safe singleton tracker that records Polly circuit breaker state transitions (open/half-open/closed) with timestamps and failure reasons. Settings class binds from appsettings CircuitBreaker section with defaults of 5 failures and 60-second break duration.
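A minimal sketch of how such a tracker could look, assuming a ConcurrentDictionary keyed by circuit name (the later review confirms the real tracker uses ConcurrentDictionary). Every type and member name here (SketchCircuitState, CircuitSnapshot, SketchCircuitTracker, Record, GetAll) is hypothetical and is not taken from the PR; the enum stands in for Polly's CircuitState so the sketch compiles without the package.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Stand-in for Polly's CircuitState (assumption, keeps the sketch package-free).
public enum SketchCircuitState { Closed, Open, HalfOpen }

// Immutable snapshot of one circuit's latest transition.
public sealed record CircuitSnapshot(
    SketchCircuitState State,
    DateTimeOffset LastTransition,
    string? FailureReason);

public sealed class SketchCircuitTracker
{
    // ConcurrentDictionary makes each per-key write atomic without explicit locks.
    private readonly ConcurrentDictionary<string, CircuitSnapshot> _circuits = new();

    // Overwrites the circuit's snapshot with the new state and the current UTC time.
    public void Record(string name, SketchCircuitState state, string? reason = null) =>
        _circuits[name] = new CircuitSnapshot(state, DateTimeOffset.UtcNow, reason);

    // Returns a copy so callers (for example a health endpoint) never see later mutations.
    public IReadOnlyDictionary<string, CircuitSnapshot> GetAll() =>
        new Dictionary<string, CircuitSnapshot>(_circuits);
}
```

Storing an immutable snapshot per key means a reader always sees a consistent state, timestamp, and reason together, which is one simple way to get the thread safety the commit describes.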
Applies HttpPolicyExtensions.HandleTransientHttpError circuit breaker to OpenAI and Gemini typed HTTP clients. Circuit opens after configurable consecutive failures (default 5), stays open for configurable duration (default 60s), then transitions to half-open for probe. State transitions are recorded in CircuitBreakerStateTracker for health endpoint visibility.
Adds circuit breaker protection to GitHub OAuth and OIDC provider backchannel HTTP handlers via PolicyHttpMessageHandler wrapping. Reuses the same CircuitBreakerStateTracker and settings as LLM providers.
Resolves CircuitBreakerStateTracker and settings from service descriptors registered by AddLlmProviders and passes them to AddTaskdeckAuthentication for OAuth/OIDC backchannel circuit breaker wiring.
/health/ready now includes a circuitBreakers section reporting each circuit's state, last transition time, and failure reason. An open circuit degrades overall readiness to 503.
Default configuration: 5 failure threshold, 60-second break duration.
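For reference, these defaults would correspond to an appsettings fragment along the following lines. The section and key names come from the PR summary (CircuitBreaker:FailureThreshold and CircuitBreaker:BreakDurationSeconds); which appsettings file carries the section is not stated here and is left open.

```json
{
  "CircuitBreaker": {
    "FailureThreshold": 5,
    "BreakDurationSeconds": 60
  }
}
```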
Unit tests for CircuitBreakerStateTracker (CRUD, thread safety, timestamps), CircuitBreakerSettings defaults, Polly policy integration (consecutive failures trip circuit, successful requests do not, 400s excluded, 408/5xx counted, open circuit rejects requests), OAuth backchannel handler creation, and health endpoint integration (circuit state reporting, open degrades readiness, half-open/closed do not).
ADR-0031 documents the decision to use Polly circuit breaker for external API calls (LLM providers and OAuth). Configuration reference updated with CircuitBreaker section documentation.
Code Review
This pull request implements Polly circuit breaker policies for external LLM and OAuth providers to improve system resilience, including a state tracker integrated with the health readiness endpoint and ADR-0031. Feedback identifies a critical issue where circuit breaker state is reset on every request and recommends using a PolicyRegistry for shared state. Other suggestions include improving observability for OAuth handlers, correcting a NuGet package version, refining readiness check logic, and adding configuration validation.
var circuitBreakerSettings = configuration.GetSection("CircuitBreaker").Get<CircuitBreakerSettings>() ?? new CircuitBreakerSettings();
services.AddSingleton(circuitBreakerSettings);
var circuitBreakerTracker = new CircuitBreakerStateTracker();
services.AddSingleton(circuitBreakerTracker);
To ensure circuit breaker policies maintain state across requests, they should be registered in a PolicyRegistry and shared. Creating them inside the AddPolicyHandler delegate (as seen on lines 79 and 101) results in a new instance per request, which breaks the circuit breaker logic as the failure count is reset every time.
var circuitBreakerSettings = configuration.GetSection("CircuitBreaker").Get<CircuitBreakerSettings>() ?? new CircuitBreakerSettings();
services.AddSingleton(circuitBreakerSettings);
var circuitBreakerTracker = new CircuitBreakerStateTracker();
services.AddSingleton(circuitBreakerTracker);
services.AddPolicyRegistry((sp, registry) =>
{
    registry.Add("OpenAI", BuildCircuitBreakerPolicy(sp, "OpenAI", circuitBreakerSettings));
    registry.Add("Gemini", BuildCircuitBreakerPolicy(sp, "Gemini", circuitBreakerSettings));
});

        };
    });
})
.AddPolicyHandler((sp, _) => BuildCircuitBreakerPolicy(sp, "OpenAI", circuitBreakerSettings));

        };
    });
})
.AddPolicyHandler((sp, _) => BuildCircuitBreakerPolicy(sp, "Gemini", circuitBreakerSettings));
if (snapshot.State == CircuitState.Open)
{
    isReady = false;
}
Failing the entire readiness check (503) when any circuit is open might be too aggressive. If an optional LLM provider or one of several OIDC providers is down, the system might still be functional for many users. Consider if these should only degrade the status (report in JSON) without failing the readiness probe, or only fail for "critical" circuits.
internal static HttpMessageHandler BuildOAuthBackchannelHandler(
    CircuitBreakerStateTracker tracker,
    CircuitBreakerSettings settings,
    string circuitName)
The OAuth backchannel circuit breaker lacks logging. Unlike BuildCircuitBreakerPolicy in LlmProviderRegistration.cs, this handler won't log when the circuit opens, resets, or half-opens, making it harder to debug authentication issues in production. Consider passing an ILogger or ILoggerFactory to this method to maintain observability parity with the LLM providers.
@@ -1,7 +1,11 @@
using System.Net.Sockets;
using Microsoft.Extensions.Http;
using Polly;
  <PrivateAssets>all</PrivateAssets>
  <IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
</PackageReference>
<PackageReference Include="Microsoft.Extensions.Http.Polly" Version="10.0.7" />
The version 10.0.7 for Microsoft.Extensions.Http.Polly seems incorrect for a net8.0 project. Typically, you should use a version that matches your target framework (e.g., 8.0.x). Using a version from a future release may lead to compatibility issues or restore errors.
<PackageReference Include="Microsoft.Extensions.Http.Polly" Version="8.0.7" />
@@ -0,0 +1,21 @@
namespace Taskdeck.Application.Services;

/// Number of consecutive failures before the circuit opens.
/// Default: 5.
/// </summary>
public int FailureThreshold { get; set; } = 5;
Add [Range] validation to ensure FailureThreshold is at least 1. Polly's CircuitBreakerAsync will throw an ArgumentOutOfRangeException if handledEventsAllowedBeforeBreaking is not greater than 0.
[Range(1, 100, ErrorMessage = "FailureThreshold must be between 1 and 100.")]
public int FailureThreshold { get; set; } = 5;

/// Duration in seconds the circuit stays open before transitioning
/// to half-open. Default: 60 (1 minute).
/// </summary>
public int BreakDurationSeconds { get; set; } = 60;
The previous approach used AddPolicyHandler with a factory delegate that created a new Polly policy per HTTP request. Since Polly tracks consecutive failures per policy instance, every request started with a failure count of zero, which defeated circuit breaking entirely. Policies are now created eagerly during service registration and reused. The BuildCircuitBreakerPolicy method signature is simplified to take the tracker directly instead of IServiceProvider.
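The effect of the bug can be reproduced in isolation. The sketch below is not the PR's code: SharedPolicyDemo, Build, and CountRejectionsAsync are hypothetical names, and the policy handles only 5xx results rather than the full HandleTransientHttpError set, so that it needs nothing beyond the Polly package. It sends always-failing calls through either one shared breaker or a freshly built breaker per call and counts how many calls the breaker rejects.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

public static class SharedPolicyDemo
{
    // Hypothetical helper: opens after `threshold` consecutive 5xx results.
    private static AsyncCircuitBreakerPolicy<HttpResponseMessage> Build(int threshold) =>
        Policy
            .HandleResult<HttpResponseMessage>(r => (int)r.StatusCode >= 500)
            .CircuitBreakerAsync(threshold, TimeSpan.FromSeconds(60));

    // Runs `calls` always-failing executions and counts breaker rejections.
    public static async Task<int> CountRejectionsAsync(bool sharePolicy, int calls, int threshold = 5)
    {
        var shared = Build(threshold);
        var rejected = 0;
        for (var i = 0; i < calls; i++)
        {
            // Building a policy per call gives every call a fresh failure count.
            var policy = sharePolicy ? shared : Build(threshold);
            try
            {
                await policy.ExecuteAsync(() =>
                    Task.FromResult(new HttpResponseMessage(HttpStatusCode.InternalServerError)));
            }
            catch (BrokenCircuitException)
            {
                rejected++; // circuit is open, so the delegate was never invoked
            }
        }
        return rejected;
    }
}
```

With a threshold of 5 and 8 calls, the shared policy lets the first five failures through and rejects the remaining three, while the per-call variant never rejects anything because no single instance ever sees more than one failure.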
Adversarial Self-Review
Critical Bug Found and Fixed
Policy instance per-request: The original implementation used …
Other Review Findings (no action needed)
Test Coverage Assessment
Adversarial Review - PR #924: Polly Circuit Breaker
Summary
Reviewed the full diff, all 10 gemini-code-assist[bot] comments, and all changed/new files. The core circuit breaker implementation is solid -- policies are correctly shared as singleton instances (fixed in commit 6bc86e8), the tracker is thread-safe via ConcurrentDictionary, and both LLM providers handle …
Bot Comment Triage
Issues Found and Fixed
Things That Are Correct
Remaining Minor Items (Not Blocking)
Verification
FailureThreshold=0 causes Polly's CircuitBreakerAsync to throw ArgumentOutOfRangeException with a cryptic message. Add [Range] data annotations and startup guard clauses that fail fast with clear error messages when settings are misconfigured.
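A startup guard of this kind could look roughly like the sketch below. The class name CircuitBreakerSettingsGuard, the exception type, the lower bounds, and the message wording are all assumptions for illustration; the commit only states that misconfigured settings fail fast with clear messages.

```csharp
using System;

public static class CircuitBreakerSettingsGuard
{
    // Hypothetical guard: call once at startup, before any Polly policy is built,
    // so a bad value surfaces with the offending configuration key in the message.
    public static void Validate(int failureThreshold, int breakDurationSeconds)
    {
        if (failureThreshold < 1)
            throw new InvalidOperationException(
                $"CircuitBreaker:FailureThreshold must be at least 1 (was {failureThreshold}).");

        if (breakDurationSeconds < 1)
            throw new InvalidOperationException(
                $"CircuitBreaker:BreakDurationSeconds must be at least 1 (was {breakDurationSeconds}).");
    }
}
```

Naming the configuration key in the exception is the point of the guard: the operator sees which setting to fix instead of Polly's parameter name.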
LLM and OAuth providers are optional -- the system falls back to mock responses when a provider is unavailable. An open circuit should not cause Kubernetes to pull the pod from service rotation. Circuit state is still reported for operator visibility via a _summary field.
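The revised readiness rule can be sketched as follows. This is an illustration rather than the endpoint's actual code: ReadinessSketch, Evaluate, the string-typed circuit states, and the "Healthy" label are assumptions, while "Degraded" is the value the commit messages name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReadinessSketch
{
    // Open circuits influence only the reported summary; readiness itself
    // follows the core checks, so an open circuit alone does not produce a 503.
    public static (bool IsReady, string CircuitSummary) Evaluate(
        bool coreChecksPass,
        IEnumerable<string> circuitStates)
    {
        var anyOpen = circuitStates.Any(s => s == "Open");
        var summary = anyOpen ? "Degraded" : "Healthy";
        return (coreChecksPass, summary);
    }
}
```

Keeping the two values separate lets the orchestrator's probe stay green while operators still see the degraded circuits in the response body.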
- Rename HealthReady_OpenCircuitDegradeOverallReadiness to reflect new behavior (open circuit reports Degraded, does not fail readiness)
- Add Settings_FailureThreshold_HasRangeValidation test
- Add Settings_BreakDurationSeconds_HasRangeValidation test
Reflect that open circuits report as Degraded without failing the readiness probe, and document the startup validation for settings.
… to ADR-0032

ADR-0031 was claimed by SAST Scanning (Semgrep) in main. Renumber Polly Circuit Breaker from 0031 to 0032 and update cross-references.
* docs: update STATUS.md and AUDIT.md for 10-PR post-merge sweep

  Add delivery wave entry for PRs #914--#924 covering CI/hardening (SAST, migration validation, performance regression gate, circuit breaker), frontend decomposition (ReviewView, InboxView, AutomationChatView), ops (alerting rules), docs (data model ERD), and UX (session timeout warning). Mark resolved items in AUDIT.md: oversized views, session timeout, SAST, alerting rules, data model reference, performance regression tests.

* docs: update IMPLEMENTATION_MASTERPLAN.md with wave 27 delivery history

  Add delivery wave 27 for PRs #914--#924 covering CI/hardening (SAST, migration validation, performance regression gate, circuit breaker), frontend decomposition (ReviewView, InboxView, AutomationChatView), ops (alerting rules), docs (data model ERD), and session timeout warning. Note ADR-0031 and ADR-0032. Update wave 26 to cross-reference view decomposition resolution.

* docs: mark 10 issues delivered in ISSUE_EXECUTION_GUIDE.md

  Add Stage 7 with all 10 issues from PRs #914--#924 marked as delivered. Update Stage 6 execution note to reflect view decomposition is now resolved.

* docs: add ADR-0031 and ADR-0032 to decisions index

  ADR-0031: SAST Scanning with Semgrep (from PR #915, CI-01)
  ADR-0032: Circuit Breaker for External API Calls (from PR #924, HARD-01)
  Note: Both PRs originally created ADR-0031. Renumbered the circuit breaker ADR to ADR-0032 to resolve the conflict.

* docs: update TESTING_GUIDE.md with new CI gates and test totals

  Add migration-validation job to ci-required, SAST scanning and performance regression gate to ci-extended, and both to ci-nightly. Update test counts for circuit breaker (23 backend) and session timeout (19 frontend) tests.

* docs: update CLAUDE.md frontend architecture description

  Reflect view decomposition pattern (thin shells + extracted composables/components) and add examples of view-specific composables and component directories.

* docs: mark resolved items in EXPANSION_ROADMAP and HARDENING docs

  Mark view decomposition (ReviewView, InboxView, AutomationChatView) and monitoring/alerting setup as resolved in both roadmap files.

* fix: format STATUS.md Last Updated line to pass docs governance check

  The governance regex requires the line to end after the date (YYYY-MM-DD) with only optional whitespace. Move the parenthetical sweep note to its own line.

* fix: correct OPS-27 and PERF-12 issue numbers across docs

  OPS-27 (config validation) is GitHub issue #863, not #858. PERF-12 (board list pagination) is GitHub issue #848, not #859. The wrong numbers (#858/#859) belong to FE-17 and FE-18 respectively. Fixes references in STATUS.md, IMPLEMENTATION_MASTERPLAN.md, ISSUE_EXECUTION_GUIDE.md, AUDIT.md, HARDENING_AND_PERFORMANCE.md, TESTING_GUIDE.md, and CONFIGURATION_REFERENCE.md.
Summary
- /health/ready under checks.circuitBreakers; open circuit degrades overall readiness to 503
- CircuitBreaker:FailureThreshold and CircuitBreaker:BreakDurationSeconds in appsettings

Closes #876
Test plan
- dotnet build backend/Taskdeck.sln -c Release succeeds with 0 errors