Skip to content

feat: add Polly circuit breaker for external API calls (HARD-01)#924

Merged
Chris0Jeky merged 17 commits intomainfrom
feat/hard-01-circuit-breaker-polly
Apr 23, 2026
Merged

feat: add Polly circuit breaker for external API calls (HARD-01)#924
Chris0Jeky merged 17 commits intomainfrom
feat/hard-01-circuit-breaker-polly

Conversation

@Chris0Jeky
Copy link
Copy Markdown
Owner

Summary

  • Add Polly circuit breaker policies to LLM provider HTTP clients (OpenAI, Gemini) and OAuth/OIDC backchannel handlers
  • Circuit opens after 5 consecutive transient failures (5xx, 408, HttpRequestException), stays open for 60 seconds, then half-opens for probe
  • Circuit breaker state (open/half-open/closed) reported on /health/ready under checks.circuitBreakers; open circuit degrades overall readiness to 503
  • Configurable via CircuitBreaker:FailureThreshold and CircuitBreaker:BreakDurationSeconds in appsettings
  • 21 new tests covering tracker CRUD, thread safety, Polly policy behavior, health endpoint integration, and OAuth backchannel handler construction
  • ADR-0031 documents the decision; configuration reference updated

Closes #876

Test plan

  • All 21 new CircuitBreakerTests pass
  • All 3 existing HealthApiTests pass (no regressions)
  • All 6 existing LlmProviderDegradationTests pass (no regressions)
  • Full test suite: 4884 passed, 2 skipped, 0 failed
  • dotnet build backend/Taskdeck.sln -c Release succeeds with 0 errors
  • Verify circuit breaker trips correctly against a real failing LLM endpoint (manual)
  • Verify health endpoint reports circuit state in browser/curl (manual)

…plication layer

Thread-safe singleton tracker that records Polly circuit breaker state
transitions (open/half-open/closed) with timestamps and failure reasons.
Settings class binds from appsettings CircuitBreaker section with defaults
of 5 failures and 60-second break duration.
Applies HttpPolicyExtensions.HandleTransientHttpError circuit breaker to
OpenAI and Gemini typed HTTP clients. Circuit opens after configurable
consecutive failures (default 5), stays open for configurable duration
(default 60s), then transitions to half-open for probe. State transitions
are recorded in CircuitBreakerStateTracker for health endpoint visibility.
Adds circuit breaker protection to GitHub OAuth and OIDC provider
backchannel HTTP handlers via PolicyHttpMessageHandler wrapping. Reuses
the same CircuitBreakerStateTracker and settings as LLM providers.
Resolves CircuitBreakerStateTracker and settings from service descriptors
registered by AddLlmProviders and passes them to AddTaskdeckAuthentication
for OAuth/OIDC backchannel circuit breaker wiring.
/health/ready now includes a circuitBreakers section reporting each
circuit's state, last transition time, and failure reason. An open
circuit degrades overall readiness to 503.
Default configuration: 5 failure threshold, 60-second break duration.
Unit tests for CircuitBreakerStateTracker (CRUD, thread safety, timestamps),
CircuitBreakerSettings defaults, Polly policy integration (consecutive
failures trip circuit, successful requests do not, 400s excluded, 408/5xx
counted, open circuit rejects requests), OAuth backchannel handler creation,
and health endpoint integration (circuit state reporting, open degrades
readiness, half-open/closed do not).
ADR-0031 documents the decision to use Polly circuit breaker for external
API calls (LLM providers and OAuth). Configuration reference updated with
CircuitBreaker section documentation.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements Polly circuit breaker policies for external LLM and OAuth providers to improve system resilience, including a state tracker integrated with the health readiness endpoint and ADR-0031. Feedback identifies a critical issue where circuit breaker state is reset on every request and recommends using a PolicyRegistry for shared state. Other suggestions include improving observability for OAuth handlers, correcting a NuGet package version, refining readiness check logic, and adding configuration validation.

Comment on lines +48 to +51
var circuitBreakerSettings = configuration.GetSection("CircuitBreaker").Get<CircuitBreakerSettings>() ?? new CircuitBreakerSettings();
services.AddSingleton(circuitBreakerSettings);
var circuitBreakerTracker = new CircuitBreakerStateTracker();
services.AddSingleton(circuitBreakerTracker);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

To ensure circuit breaker policies maintain state across requests, they should be registered in a PolicyRegistry and shared. Creating them inside the AddPolicyHandler delegate (as seen on lines 79 and 101) results in a new instance per request, which breaks the circuit breaker logic as the failure count is reset every time.

        var circuitBreakerSettings = configuration.GetSection("CircuitBreaker").Get<CircuitBreakerSettings>() ?? new CircuitBreakerSettings();
        services.AddSingleton(circuitBreakerSettings);
        var circuitBreakerTracker = new CircuitBreakerStateTracker();
        services.AddSingleton(circuitBreakerTracker);

        services.AddPolicyRegistry((sp, registry) =>
        {
            registry.Add("OpenAI", BuildCircuitBreakerPolicy(sp, "OpenAI", circuitBreakerSettings));
            registry.Add("Gemini", BuildCircuitBreakerPolicy(sp, "Gemini", circuitBreakerSettings));
        });

};
});
})
.AddPolicyHandler((sp, _) => BuildCircuitBreakerPolicy(sp, "OpenAI", circuitBreakerSettings));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Use the shared policy from the registry instead of creating a new one per request.

        .AddPolicyHandlerFromRegistry("OpenAI");

};
});
})
.AddPolicyHandler((sp, _) => BuildCircuitBreakerPolicy(sp, "Gemini", circuitBreakerSettings));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Use the shared policy from the registry instead of creating a new one per request.

        .AddPolicyHandlerFromRegistry("Gemini");

Comment on lines +219 to +222
if (snapshot.State == CircuitState.Open)
{
isReady = false;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Failing the entire readiness check (503) when any circuit is open might be too aggressive. If an optional LLM provider or one of several OIDC providers is down, the system might still be functional for many users. Consider if these should only degrade the status (report in JSON) without failing the readiness probe, or only fail for "critical" circuits.

Comment on lines +208 to +211
internal static HttpMessageHandler BuildOAuthBackchannelHandler(
CircuitBreakerStateTracker tracker,
CircuitBreakerSettings settings,
string circuitName)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The OAuth backchannel circuit breaker lacks logging. Unlike BuildCircuitBreakerPolicy in LlmProviderRegistration.cs, this handler won't log when the circuit opens, resets, or half-opens, making it harder to debug authentication issues in production. Consider passing an ILogger or ILoggerFactory to this method to maintain observability parity with the LLM providers.

@@ -1,7 +1,11 @@
using System.Net.Sockets;
using Microsoft.Extensions.Http;
using Polly;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Add the required namespace for IPolicyRegistry.

using Polly;
using Polly.Registry;

<PrivateAssets>all</PrivateAssets>
<IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
</PackageReference>
<PackageReference Include="Microsoft.Extensions.Http.Polly" Version="10.0.7" />
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The version 10.0.7 for Microsoft.Extensions.Http.Polly seems incorrect for a net8.0 project. Typically, you should use a version that matches your target framework (e.g., 8.0.x). Using a version from a future release may lead to compatibility issues or restore errors.

    <PackageReference Include="Microsoft.Extensions.Http.Polly" Version="8.0.7" />

@@ -0,0 +1,21 @@
namespace Taskdeck.Application.Services;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Add the required namespace for validation attributes.

using System.ComponentModel.DataAnnotations;

namespace Taskdeck.Application.Services;

/// Number of consecutive failures before the circuit opens.
/// Default: 5.
/// </summary>
public int FailureThreshold { get; set; } = 5;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Add [Range] validation to ensure FailureThreshold is at least 1. Polly's CircuitBreakerAsync will throw an ArgumentOutOfRangeException if handledEventsAllowedBeforeBreaking is not greater than 0.

    [Range(1, 100, ErrorMessage = "FailureThreshold must be between 1 and 100.")]
    public int FailureThreshold { get; set; } = 5;

/// Duration in seconds the circuit stays open before transitioning
/// to half-open. Default: 60 (1 minute).
/// </summary>
public int BreakDurationSeconds { get; set; } = 60;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Add [Range] validation to ensure a positive break duration.

    [Range(1, 3600, ErrorMessage = "BreakDurationSeconds must be between 1 and 3600.")]
    public int BreakDurationSeconds { get; set; } = 60;

The previous approach used AddPolicyHandler with a factory delegate that
created a new Polly policy per HTTP request. Since Polly tracks consecutive
failures per policy instance, this defeated circuit breaking entirely.
Now policies are created eagerly during service registration and reused.
The BuildCircuitBreakerPolicy method signature simplified to take the
tracker directly instead of IServiceProvider.
@Chris0Jeky
Copy link
Copy Markdown
Owner Author

Adversarial Self-Review

Critical Bug Found and Fixed

Policy instance per-request: The original implementation used AddPolicyHandler((sp, _) => BuildCircuitBreakerPolicy(...)) which creates a new Polly circuit breaker policy for every HTTP request. Since Polly tracks consecutive failures per policy instance, this meant the circuit breaker would never actually trip -- each request got a fresh counter. Fixed in commit 6bc86e8b by creating policies eagerly during service registration and passing the same instance to AddPolicyHandler.

Other Review Findings (no action needed)

  1. Logging removed from policy callbacks: The fix simplified BuildCircuitBreakerPolicy to not require IServiceProvider (since it's called before the SP is built). This removed structured logging from the onBreak/onReset/onHalfOpen callbacks. The state tracker still records transitions, so this is acceptable -- operators can observe circuit state via the health endpoint. If detailed logging is desired later, we could inject a logger factory during startup.

  2. OAuth backchannel handler sharing: Each OAuth/OIDC provider gets its own PolicyHttpMessageHandler wrapping a fresh SocketsHttpHandler. Since these are created once per scheme during authentication setup (not per request), the policy reuse is correct. The backchannel handler does not go through IHttpClientFactory, so there's no factory-level caching concern.

  3. GetAll() returns live ConcurrentDictionary reference: CircuitBreakerStateTracker.GetAll() returns the underlying ConcurrentDictionary cast as IReadOnlyDictionary. Callers could observe mutations during iteration, but in the health endpoint context this is benign -- a stale-by-one-state snapshot is acceptable for health reporting.

  4. Circuit state enum naming: Our CircuitState enum (Closed/Open/HalfOpen) conflicts with Polly.CircuitBreaker.CircuitState, requiring a using alias. This is a minor ergonomic cost but acceptable.

Test Coverage Assessment

  • 21 tests covering tracker CRUD, thread safety, Polly policy behavior (consecutive failures, 400 exclusion, 408/5xx counting, open rejection), health endpoint integration (state reporting, readiness degradation), and OAuth handler creation.
  • Existing tests (HealthApiTests, LlmProviderDegradationTests) all pass without modification.
  • Full suite: 4884 passed, 2 skipped, 0 failed.

@Chris0Jeky
Copy link
Copy Markdown
Owner Author

Adversarial Review - PR #924: Polly Circuit Breaker

Summary

Reviewed the full diff, all 10 gemini-code-assist[bot] comments, and all changed/new files. The core circuit breaker implementation is solid -- policies are correctly shared as singleton instances (fixed in commit 6bc86e8), the tracker is thread-safe via ConcurrentDictionary, and both LLM providers handle BrokenCircuitException gracefully through their existing catch (Exception) fallback paths. I've pushed fixes for the issues I found.

Bot Comment Triage

# Bot Comment Verdict
1 CRITICAL: Policy created per request Already fixed in 6bc86e8. Policies are now pre-built once and passed to AddPolicyHandler(). The bot was reviewing the original commit, not the fix commit.
2-3 CRITICAL: Use PolicyRegistry Not needed. The current approach (pre-built policy variables) achieves the same goal without adding PolicyRegistry complexity. Both are valid patterns.
4 MEDIUM: 503 for any open circuit too aggressive Legitimate -- FIXED. An open LLM circuit should not pull a pod from service rotation when the system has mock fallback. Changed to report "Degraded" without failing readiness.
5 MEDIUM: OAuth backchannel lacks logging Legitimate but minor. Neither the LLM nor OAuth circuit breaker callbacks log -- they only record state in the tracker. Logging in callbacks would be nice-to-have but the tracker already provides visibility via the health endpoint. Not fixing in this pass.
6 MEDIUM: Add Polly.Registry namespace Not needed. No PolicyRegistry was adopted, so the namespace is unnecessary.
7 MEDIUM: Package version 10.0.7 wrong for net8.0 Bot is wrong. Microsoft.Extensions.Http.Polly follows its own versioning (not the TFM). Version 10.0.7 is the latest stable release, restores and builds fine on net8.0, and pulls in Polly 7.2.4.
8 MEDIUM: Add validation namespace Legitimate -- FIXED. Added [Range] attributes and the System.ComponentModel.DataAnnotations import.
9-10 MEDIUM: Add [Range] validation Legitimate -- FIXED. Added [Range(1, 1000)] on FailureThreshold and [Range(1, 3600)] on BreakDurationSeconds, plus startup guard clauses in LlmProviderRegistration that throw InvalidOperationException for invalid values (fail fast instead of a confusing Polly exception).

Issues Found and Fixed

  1. Settings validation gap: FailureThreshold=0 would cause Polly's CircuitBreakerAsync to throw ArgumentOutOfRangeException at startup with a cryptic message. Added [Range] data annotations and startup guard clauses with clear error messages. Added 2 new tests.

  2. Overly aggressive readiness degradation: Any open circuit (even an optional LLM provider with mock fallback) caused /health/ready to return 503, which in a Kubernetes environment would pull the pod from service. LLM and OAuth providers are optional/degradeable, so open circuits now report as "Degraded" without failing readiness. Updated test, ADR, and config docs.

Things That Are Correct

  • Policy sharing: Policies are pre-built once and reused across all requests (the AddPolicyHandler(IAsyncPolicy) overload stores the reference, not a factory).
  • Thread safety: CircuitBreakerStateTracker uses ConcurrentDictionary correctly. GetAll() returns the live dictionary but ConcurrentDictionary enumeration is safe (point-in-time snapshot semantics).
  • Exception handling: Both OpenAiLlmProvider and GeminiLlmProvider catch Exception (which includes BrokenCircuitException) and return degraded fallback results. No unhandled circuit breaker exceptions will bubble up.
  • HandleTransientHttpError scope: Only 5xx, 408, and HttpRequestException trip the circuit. 4xx client errors (400, 401, 403, 404) correctly do not trip it.
  • Test coverage: 23 tests covering tracker CRUD, thread safety, health endpoint integration, policy behavior (open/reject/success/transient/4xx), OAuth handler construction, and settings validation.

Remaining Minor Items (Not Blocking)

  • No circuit breaker on OutboundWebhookDelivery: This named HTTP client makes external calls to user-configured webhook URLs. A per-destination circuit breaker would make sense but requires a different pattern (per-URL policy) since it's not a single provider. Could be a follow-up issue.
  • No logging in circuit breaker callbacks: The onBreak/onReset/onHalfOpen delegates only update the tracker. Adding structured logging here would improve production debugging. Minor -- the health endpoint provides the same information.
  • GetAll() exposes live ConcurrentDictionary: Callers could theoretically observe mutations during iteration. Acceptable for the health endpoint use case.

Verification

  • Build: dotnet build succeeds (0 errors, existing warnings only)
  • Tests: All pass (23 circuit breaker tests, full suite green)

FailureThreshold=0 causes Polly's CircuitBreakerAsync to throw
ArgumentOutOfRangeException with a cryptic message. Add [Range]
data annotations and startup guard clauses that fail fast with
clear error messages when settings are misconfigured.
LLM and OAuth providers are optional -- the system falls back to mock
responses when a provider is unavailable. An open circuit should not
cause Kubernetes to pull the pod from service rotation. Circuit state
is still reported for operator visibility via a _summary field.
- Rename HealthReady_OpenCircuitDegradeOverallReadiness to reflect new
  behavior (open circuit reports Degraded, does not fail readiness)
- Add Settings_FailureThreshold_HasRangeValidation test
- Add Settings_BreakDurationSeconds_HasRangeValidation test
Reflect that open circuits report as Degraded without failing the
readiness probe, and document the startup validation for settings.
Chris0Jeky and others added 3 commits April 23, 2026 00:41
… to ADR-0032

ADR-0031 was claimed by SAST Scanning (Semgrep) in main.
Renumber Polly Circuit Breaker from 0031 to 0032 and update cross-references.
@Chris0Jeky Chris0Jeky merged commit 9e67923 into main Apr 23, 2026
15 checks passed
@github-project-automation github-project-automation Bot moved this from Pending to Done in Taskdeck Execution Apr 23, 2026
Chris0Jeky added a commit that referenced this pull request Apr 23, 2026
* docs: update STATUS.md and AUDIT.md for 10-PR post-merge sweep

Add delivery wave entry for PRs #914--#924 covering CI/hardening
(SAST, migration validation, performance regression gate, circuit
breaker), frontend decomposition (ReviewView, InboxView,
AutomationChatView), ops (alerting rules), docs (data model ERD),
and UX (session timeout warning).

Mark resolved items in AUDIT.md: oversized views, session timeout,
SAST, alerting rules, data model reference, performance regression
tests.

* docs: update IMPLEMENTATION_MASTERPLAN.md with wave 27 delivery history

Add delivery wave 27 for PRs #914--#924 covering CI/hardening
(SAST, migration validation, performance regression gate, circuit
breaker), frontend decomposition (ReviewView, InboxView,
AutomationChatView), ops (alerting rules), docs (data model ERD),
and session timeout warning. Note ADR-0031 and ADR-0032. Update
wave 26 to cross-reference view decomposition resolution.

* docs: mark 10 issues delivered in ISSUE_EXECUTION_GUIDE.md

Add Stage 7 with all 10 issues from PRs #914--#924 marked as
delivered. Update Stage 6 execution note to reflect view
decomposition is now resolved.

* docs: add ADR-0031 and ADR-0032 to decisions index

ADR-0031: SAST Scanning with Semgrep (from PR #915, CI-01)
ADR-0032: Circuit Breaker for External API Calls (from PR #924, HARD-01)

Note: Both PRs originally created ADR-0031. Renumbered the circuit
breaker ADR to ADR-0032 to resolve the conflict.

* docs: update TESTING_GUIDE.md with new CI gates and test totals

Add migration-validation job to ci-required, SAST scanning and
performance regression gate to ci-extended, and both to ci-nightly.
Update test counts for circuit breaker (23 backend) and session
timeout (19 frontend) tests.

* docs: update CLAUDE.md frontend architecture description

Reflect view decomposition pattern (thin shells + extracted
composables/components) and add examples of view-specific
composables and component directories.

* docs: mark resolved items in EXPANSION_ROADMAP and HARDENING docs

Mark view decomposition (ReviewView, InboxView, AutomationChatView)
and monitoring/alerting setup as resolved in both roadmap files.

* fix: format STATUS.md Last Updated line to pass docs governance check

The governance regex requires the line to end after the date
(YYYY-MM-DD) with only optional whitespace. Move the parenthetical
sweep note to its own line.

* fix: correct OPS-27 and PERF-12 issue numbers across docs

OPS-27 (config validation) is GitHub issue #863, not #858.
PERF-12 (board list pagination) is GitHub issue #848, not #859.
The wrong numbers (#858/#859) belong to FE-17 and FE-18 respectively.

Fixes references in STATUS.md, IMPLEMENTATION_MASTERPLAN.md,
ISSUE_EXECUTION_GUIDE.md, AUDIT.md, HARDENING_AND_PERFORMANCE.md,
TESTING_GUIDE.md, and CONFIGURATION_REFERENCE.md.
@Chris0Jeky Chris0Jeky deleted the feat/hard-01-circuit-breaker-polly branch April 23, 2026 22:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

HARD-01: Circuit breaker for external API calls (Polly)

1 participant