Skip to content

Conversation

@mcmire
Copy link
Contributor

@mcmire mcmire commented Nov 15, 2025

Explanation

In a future commit we will introduce changes to network-controller so that it will keep track of the status of each network as requests are made. This commit paves the way for this to happen by redefining the existing RPC endpoint-related events that NetworkController produces.

Currently, when requests are made through the network clients that NetworkController exposes, three events are published:

  • NetworkController:rpcEndpointDegraded
    • Published when enough successive retriable errors are encountered while making a request to an RPC endpoint that the maximum number of retries is reached.
  • NetworkController:rpcEndpointUnavailable
    • Published when enough successive errors are encountered while making a request to an RPC endpoint that the underlying circuit breaks.
  • NetworkController:rpcEndpointRequestRetried
    • Published when a request is retried (mainly used for testing).

It's important to note that in the context of the RPC failover feature, an "RPC endpoint" can actually encompass multiple URLs, so the above events actually fire for any URL.

While these events are useful for reporting metrics on RPC endpoints, in order to effectively be able to update the status of a network, we need events that are less granular and are guaranteed not to fire multiple times in a row. We also need a new event.

Now the list of events looks like this:

  • NetworkController:rpcEndpointDegraded
    • The same as before.
  • NetworkController:rpcEndpointUnavailable
    • The same as before.
  • NetworkController:rpcEndpointRetried
    • Renamed from NetworkController:rpcEndpointRequestRetried.
  • NetworkController:rpcEndpointChainDegraded
    • Similar to NetworkController:rpcEndpointDegraded, but won't be published again if the RPC endpoint is already in a degraded state.
  • NetworkController:rpcEndpointChainUnavailable
    • Published when all of the circuits underlying all of the URLs for an RPC endpoint have broken (none of the URLs are available). Won't be published again if the RPC endpoint is already in an unavailable state.
  • NetworkController:rpcEndpointChainAvailable
    • A new event. Published the first time a successful request is made to one of the URLs for an RPC endpoint, or following a degraded or unavailable status.

Going a bit deeper, in order to make the changes above, it was necessary to rewrite the core logic responsible for diverting traffic to failovers from RpcService to RpcServiceChain, which was a more natural fit, anyway. This also meant that we could simplify RpcService, as well as its tests.

References

Progresses https://consensyssoftware.atlassian.net/browse/WPC-99.

Checklist

  • I've updated the test suite for new or updated code as appropriate
  • I've updated documentation (JSDoc, Markdown, etc.) for new or updated code as appropriate
  • I've communicated my changes to consumers by updating changelogs for packages I've changed, highlighting breaking changes as necessary
  • I've prepared draft pull requests for clients and consumer packages to resolve any breaking changes
    • This will come in a later PR.

Note

Reworks RPC endpoint events (adds chain-level available/degraded/unavailable, renames retry event, updates payloads) and replaces failover logic with a new RpcServiceChain, wiring through NetworkController and tests.

  • Events/API (BREAKING):
    • Add NetworkController:rpcEndpointChainAvailable, ...ChainDegraded, ...ChainUnavailable.
    • Rename ...rpcEndpointRequestRetried to ...rpcEndpointRetried.
    • Update payloads: failoverEndpointUrlendpointUrl, endpointUrlprimaryEndpointUrl; add networkClientId (and attempt for retried).
  • Core changes:
    • Introduce RpcServiceChain to manage primary/failover endpoints and emit chain/service-level events.
    • Simplify RpcService (no failoverService), add availability/degraded hooks, circuit state helpers.
    • Update create-network-client to use RpcServiceChain and publish new events via messenger.
    • create-auto-managed-network-client now requires networkClientId and passes it through.
    • Export updated types and events; adjust shared Cockatiel event typing.
  • Tests:
    • Add comprehensive tests for new chain/service events and behaviors; update existing tests accordingly.
  • Deps/Changelog:
    • Add devDep cockatiel and update CHANGELOG with breaking/event changes.

Written by Cursor Bugbot for commit 0f55384. This will update automatically on new commits. Configure here.

In a future commit we will introduce changes to `network-controller` so
that it will keep track of the status of each network as requests are
made. These updates to `createServicePolicy` assist with that.

See the changelog for a list of changes to the `ServicePolicy` API.

Besides the changes listed there, the tests for `createServicePolicy`
have been refactored slightly so that it is easier to maintain in the
future.
@mcmire mcmire force-pushed the update-network-controller-rpc-endpoint-events branch from 2f9688e to 2c7678c Compare November 17, 2025 16:55
@mcmire mcmire force-pushed the update-network-controller-rpc-endpoint-events branch from 2c7678c to 4c58933 Compare November 17, 2025 17:55
In a future commit we will introduce changes to `network-controller` so
that it will keep track of the status of each network as requests are
made. This commit paves the way for this to happen by redefining the
existing RPC endpoint-related events that NetworkController produces.

Currently, when requests are made through the network clients that
NetworkController exposes, three events are published:

- `NetworkController:rpcEndpointDegraded`
  - Published when enough successive retriable errors are encountered
    while making a request to an RPC endpoint that the maximum number of
    retries is reached.
- `NetworkController:rpcEndpointUnavailable`
  - Published when enough successive errors are encountered while making
    a request to an RPC endpoint that the underlying circuit breaks.
- `NetworkController:rpcEndpointRequestRetried`
  - Published when a request is retried (mainly used for testing).

It's important to note that in the context of the RPC failover feature,
an "RPC endpoint" can actually encompass multiple URLs, so the above
events actually fire for any URL.

While these events are useful for reporting metrics on RPC endpoints, in
order to effectively be able to update the status of a network, we need
events that are less granular and are guaranteed not to fire multiple
times in a row. We also need a new event.

Now the list of events looks like this:

- `NetworkController:rpcEndpointInstanceDegraded`
  - The same as `NetworkController:rpcEndpointDegraded` before.
- `NetworkController:rpcEndpointInstanceUnavailable`
  - The same as `NetworkController:rpcEndpointInstanceDegraded` before.
- `NetworkController:rpcEndpointInstanceRetried`
  - Renamed from `NetworkController:rpcEndpointRequestRetried`.
- `NetworkController:rpcEndpointDegraded`
  - Similar to `NetworkController:rpcEndpointInstanceDegraded`, but
    won't be published again if the RPC endpoint is already in a
    degraded state.
- `NetworkController:rpcEndpointUnavailable`
  - Published when all of the circuits underlying all of the URLs for an
    RPC endpoint have broken (none of the URLs are available). Won't be
    published again if the RPC endpoint is already in an unavailable
    state.
- `NetworkController:rpcEndpointAvailable`
  - A new event. Published the first time a successful request is made
    to one of the URLs for an RPC endpoint, or following a degraded or
    unavailable status.
@mcmire mcmire force-pushed the update-network-controller-rpc-endpoint-events branch from 4c58933 to 9d090e9 Compare November 17, 2025 19:12
): Promise<JsonRpcResponse<Result | null>> {
return this.#services[0].request(jsonRpcRequest, fetchOptions);
}
// Start with the primary (first) service and switch to failovers as the
Copy link
Contributor Author

@mcmire mcmire Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prior to these changes, each RpcService object could have an optional failoverService property. This class, RpcServiceChain, would then build a chain (really, a linked list) of services. In order to make a request, RpcServiceChain would call request on the first service in the chain, and RpcService would decide whether it needed to call the next service in the chain, etc.

While this model is easy to understand, I needed access to certain data along the way, and it seemed easier to use a loop rather than a linked list. Anyway, I figured it really should be the responsibility of RpcServiceChain to manage how requests are sent across the chain.

onRetry(
listener: AddToCockatielEventData<
Parameters<ServicePolicy['onRetry']>[0],
listener: CockatielEventToEventListenerWithData<
Copy link
Contributor Author

@mcmire mcmire Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be doing the same thing as the previous code. I just thought that AddToCockatielEventData<Parameters<...>[0] was a bit ugly (and I added some more self-descriptive utility types, so this is one of them).

export function createAutoManagedNetworkClient<
Configuration extends NetworkClientConfiguration,
>({
networkClientId,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now expose the network client ID in the rpcEndpoint* messenger events, so we need to receive it and pass it down to createNetworkClient.

});
});

it('publishes the NetworkController:rpcEndpointUnavailable event when the failover occurs', async () => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests got moved to src/create-network-client-tests/rpc-endpoint-events.test.ts as I realized they didn't belong here. It's true that this test file is concerned with the RPC failover feature (which is required to test the rpcEndpoint* events), but all of the tests in tests/network-client directory really exercise the middleware stack that createNetworkClient builds and in so doing loop over all of the RPC methods that our internal provider handles specially. We don't need to go to all that trouble to test the rpcEndpoint* events, we can just use an arbitrary RPC method. (I think eventually I will rename tests/network-client to tests/internal-provider-api or something like that, but that's a PR for another time.)

/**
* Obtains the event data type from a Cockatiel event or event listener type.
*/
export type ExtractCockatielEventData<CockatielEventOrEventListener> =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Cockatiel types are a bit awkward to work with (especially since event listeners whose event payloads are empty are typed as Event<void> which is rather strange). These utilities just make them a bit easier to work with.

- This ought to be unobservable, but we mark it as breaking out of an abundance of caution.
- **BREAKING:** Split up and update payload data for `NetworkController:rpcEndpoint{Degraded,Unavailable}` ([#7166](https://github.com/MetaMask/core/pull/7166))
- The existing events are now called `NetworkController:rpcEndpointInstance{Degraded,Unavailable}` and retain their present behavior.
- `NetworkController:rpcEndpointInstance{Degraded,Unavailable}` do still exist, but they are now designed to represent the entire RPC endpoint and are guaranteed to not be published multiple times in a row. In particular, `NetworkController:rpcEndpointUnavailable` is published only after trying all of the designated URLs for a particular RPC endpoint and the underlying circuit for the last URL breaks, not as each primary's or failover's circuit breaks.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In particular, NetworkController:rpcEndpointUnavailable is published only after trying all of the designated URLs for a particular RPC endpoint and the underlying circuit for the last URL breaks, not as each primary's or failover's circuit breaks.

This change is I suppose a bit controversial, and the first version of this was slightly different before I landed on this approach. But I think it makes sense? Basically, if we can reach the network somehow, don't broadcast that it's unavailable until we're absolutely sure. That does mean that the NetworkController:rpcEndpointUnavailable may never be published for Infura networks if the failover does its job, but I think that's precisely the intent.

@mcmire
Copy link
Contributor Author

mcmire commented Nov 17, 2025

Adding no-changelog because there is a minor change to createServicePolicy which is non-user-facing. I've cherry-picked the createServicePolicy to the other PR, so this no longer applies.

github-merge-queue bot pushed a commit that referenced this pull request Nov 18, 2025
## Explanation

<!--
Thanks for your contribution! Take a moment to answer these questions so
that reviewers have the information they need to properly understand
your changes:

* What is the current state of things and why does it need to change?
* What is the solution your changes offer and how does it work?
* Are there any changes whose purpose might not obvious to those
unfamiliar with the domain?
* If your primary goal was to update one package but you found you had
to update another one along the way, why did you do so?
* If you had to upgrade a dependency, why did you do so?
-->

In a future commit we will introduce changes to `network-controller` so
that it will keep track of the status of each network as requests are
made. These updates to `createServicePolicy` assist with that. See the
changelog for more.

Besides this, the tests for `createServicePolicy` have been refactored
slightly so that they are easier to maintain in the future.

## References

<!--
Are there any issues that this pull request is tied to?
Are there other links that reviewers should consult to understand these
changes better?
Are there client or consumer pull requests to adopt any breaking
changes?

For example:

* Fixes #12345
* Related to #67890
-->

Progresses https://consensyssoftware.atlassian.net/browse/WPC-99.

You can see how these changes will be used in the next PR:
#7166

## Checklist

- [x] I've updated the test suite for new or updated code as appropriate
- [x] I've updated documentation (JSDoc, Markdown, etc.) for new or
updated code as appropriate
- [x] I've communicated my changes to consumers by [updating changelogs
for packages I've
changed](https://github.com/MetaMask/core/tree/main/docs/contributing.md#updating-changelogs),
highlighting breaking changes as necessary
- [x] I've prepared draft pull requests for clients and consumer
packages to resolve any breaking changes

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Adds `getCircuitState`, `onAvailable`, and `reset` to `ServicePolicy`,
exports Cockatiel types, and updates logic/tests to support availability
tracking and circuit state introspection.
> 
> - **controller-utils**:
>   - **ServicePolicy API**:
>     - Add `getCircuitState()` to expose underlying circuit state.
> - Add `onAvailable` event for first success and post-recovery success.
>     - Add `reset()` to close the circuit and reset breaker counters.
>   - **Behavior/Internals**:
> - Track availability status and emit `onAvailable`/`onDegraded`
appropriately.
> - Update `onBreak` to mark unavailable; wire `ConsecutiveBreaker` for
reset.
>   - **Exports**:
> - Export `CockatielEventEmitter` and `CockatielFailureReason`;
re-export via `index`.
>   - **Tests**:
> - Expand/refactor tests to cover `onAvailable`, `getCircuitState`,
`reset`, and timing cases; update export snapshot.
>   - **Docs**:
>     - Update `CHANGELOG.md` with new methods and exports.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
e597d0b. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
});
await ignoreRejection(promise);
expect(onBreakListener).not.toHaveBeenCalled();
testsForNonRetriableErrors({
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the opportunity to do some refactoring, as there were some repeated checks, and it was getting a bit difficult to maintain. As a result, the diff after this point is not the best. However I have done my best to point out differences.

});
});

it('calls the onAvailable callback the first time a successful request occurs', async () => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These next two tests are new.

it('forwards the request to the failover service in addition to the primary endpoint while the circuit is broken, stopping when the primary endpoint recovers', async () => {
const clock = getClock();
const jsonRpcRequest = {
it('does not call onAvailable', async () => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is new.

);
});

it('calls the onAvailable callback if the endpoint becomes degraded via errors and then recovers', async () => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is new.

expect(onBreakListener).not.toHaveBeenCalled();
});

describe('if a failover service is provided', () => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the failover-related tests have been removed from this file, as RpcServiceChain now handles diverting traffic to failovers.

});
});

describe('if a failover service is provided', () => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, this describe block has been removed.

*
* @param payload - The event payload.
* @param payload.chainId - The chain ID of the target network.
* @param payload.endpointUrl - The URL of the endpoint among the chain of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the context of a chain becoming unavailable, this is a confusing description. Wouldn't this nearly always be the failover chain? But technically it could be the primary if both circuits break, then the primary recovers then breaks again?

This seems more confusing for subscribers than helpful. I'm not really sure what the purpose of exposing the primary is either.

Copy link
Contributor Author

@mcmire mcmire Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. For this event, endpointUrl will always be the last failover URL, even if the primary recovers and then breaks again. So this description is not specific enough.

As for the payload data, you're also right that realistically, neither endpointUrl and primaryEndpointUrl are used. You might have noticed that each of the events has the same payload data. My thinking here was if we needed to hook this event up to metrics or the network connection banner in the future, we wouldn't need to make any changes to network-controller. That said, endpointUrl is not very useful for this event due to the behavior described above. So I think it would be safe to remove, if that helps clear the confusion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed endpointUrl from this event. I have also tried to go through each event's JSDoc and clarify any confusing wording: 73017d1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking here was if we needed to hook this event up to metrics or the network connection banner in the future, we wouldn't need to make any changes to network-controller.

We could derive that from the network client ID anyway though right? It still doesn't seem clear why this would be useful in the event. Maybe I'm missing something though.

Even if it's just included for convenience reasons, chain ID seems like the more valuable property (though that's also derivable from the network client ID).

primaryEndpointUrl,
endpointUrl,
attempt,
});
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Credentials leak in RPC event payloads

createNetworkClient uses the raw primaryEndpointUrl from the configuration (which may contain credentials) in multiple event payloads, instead of the sanitized URL provided by the RpcServiceChain event data. This exposes potential credentials (username/password) to event subscribers.

Fix in Cursor Fix in Web

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. Fixed here: 026ce00

primaryEndpointUrl,
endpointUrl,
attempt,
});
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Credentials leak in RPC event payloads

createNetworkClient uses the raw primaryEndpointUrl from the configuration (which may contain credentials) in multiple event payloads (rpcEndpointChainUnavailable, rpcEndpointUnavailable, etc.), instead of the sanitized URL provided by the RpcServiceChain event data. This exposes potential credentials (username/password) to event subscribers.

Fix in Cursor Fix in Web

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. Fixed here: 026ce00

primaryEndpointUrl: string;
},
];
};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Missing endpointUrl in chain unavailable event payload

The NetworkController:rpcEndpointChainUnavailable event payload type definition and usage omit the endpointUrl property. This is inconsistent with all other similar events (ChainDegraded, ChainAvailable, Unavailable) which include it, and contradicts the PR discussion implying uniform payloads. Including endpointUrl is useful for identifying the specific failover endpoint that caused the chain to break.

Fix in Cursor Fix in Web

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional, as per the discussion above.

// The only way we can end up here is if there are no services to loop over.
// That is not possible due to the types on the constructor, but TypeScript
// doesn't know this, so we have to appease it.
throw new Error('Nothing to return');
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Chain available event not published on first success

The NetworkController:rpcEndpointChainAvailable event is documented to be published on the first successful request, but RpcServiceChain only emits this event when an underlying RpcService emits onAvailable (which typically only happens upon circuit recovery). The request method lacks logic to update the status to Available and emit the event upon a successful request when the status is Unknown.

Fix in Cursor Fix in Web

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit confusing; the onAvailable event actually occurs at the service policy level and is then propagated to RpcService and RpcServiceChain. So although there is some logic to track the status within RpcServiceChain.request, it only covers setting the status to unavailable.

primaryEndpointUrl: this.#primaryService.endpointUrl.toString(),
});
}
});
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Chain status oscillates between Degraded and Unavailable

The RpcServiceChain unconditionally updates its status to Degraded when any underlying service emits onDegraded, even if the chain is currently Unavailable. When the request subsequently fails on all endpoints, the status reverts to Unavailable. This oscillation causes duplicate rpcEndpointChainDegraded and rpcEndpointChainUnavailable events to fire repeatedly during outages, violating the requirement to not publish these events multiple times in a row.

Fix in Cursor Fix in Web

Copy link
Contributor Author

@mcmire mcmire Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is accurate. The chain should not transition from unavailable to degraded. If the circuit breaks, then the circuit break duration passes, then the endpoint fails again, onBreak will be emitted, not onDegraded. It is possible to transition from degraded to unavailable to available to degraded, but that is by design.

? this.#failoverService.endpointUrl.toString()
: undefined,
});
if (!('isolated' in data)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for the Update AbstractRpcService/RpcServiceRequestableto remove{ isolated: true }from theonBreak event data type change I guess?

Nit: If we're ignoring this because we never call isolate, perhaps we should throw an error in case we do at some point in the future, so these events aren't totally suppressed? Or log something to the console at at least?

I was going to suggest a comment to explain why we're doing this, but an error or log would probably serve that purpose and highlight this mistake if it does come up.

Copy link
Contributor Author

@mcmire mcmire Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, okay, I guess the changelog entry isn't quite accurate. We do call .isolate on the circuit breaker policy, but we're sort of abusing it. It's not so that we can hold the circuit open; we call it in ServicePolicy.reset so that we can reset the state of the circuit breaker.

The problem is that when we call .isolate, it will immediately call onBreak with { isolated: true }: https://github.com/connor4312/cockatiel/blob/f87137b9bb55e0cdfb02e7d5c401e47c6f5fcc9f/src/CircuitBreakerPolicy.ts#L187. We don't want that to cause the NetworkController:rpcEndpointUnavailable or NetworkController:rpcEndpointChainUnavailable events to be published. So that's why we are ignoring the error here.

If that makes sense, I can add a comment and update the changelog entry.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, makes sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment added here: 0f55384

expect(onAvailableListener).toHaveBeenCalledTimes(1);
});

/* eslint-enable jest/require-top-level-describe */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this is fine, but I wonder going forward if we should disable this rule? Or if we should move test helpers like this to separate files, and add them to the "test helpers" ESLint config section where this rule is disabled.

Either way we can deal with it later, in a separate PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this isn't the first time I've had to do this, and it feels noisy to me. I do think the rule is valuable, so I'm hesitant to disable it. Maybe moving to a separate file is best. I'll think about it.

import { projectLogger, createModuleLogger } from '../logger';

// const log = createModuleLogger(projectLogger, 'RpcServiceChain');
const log = console.log.bind(console);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Debug logging accidentally committed to production code

The proper logger using createModuleLogger is commented out and replaced with console.log.bind(console). This debug code will cause log statements to appear in production, bypassing the application's logging infrastructure. The commented line should be uncommented and the console.log line should be removed.

Fix in Cursor Fix in Web

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed here: 72de1d6

}

return response;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Primary service not reset when failover succeeds

When a failover service succeeds after the primary fails, only services after the successful failover are reset, but the primary service's circuit is not reset. This means subsequent requests will continue to skip the primary even though a failover is working. The code should reset all services before the successful service index (including the primary at index 0) so that future requests will attempt the primary first again.

Fix in Cursor Fix in Web

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's okay, the primary's circuit does not need to be reset, it is already in the state we want it to be in.

this.#status = STATUSES.Available;
this.#onAvailableEventEmitter.emit({
...data,
primaryEndpointUrl: this.#primaryService.endpointUrl.toString(),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Slow successful requests incorrectly mark chain as available

When a request succeeds but takes longer than the degraded threshold, both onDegraded and onAvailable events fire from the underlying service policy. The onAvailable handler unconditionally sets status to Available if it's not already Available, which overwrites the Degraded status that was just set by the onDegraded handler. This causes slow but successful requests to incorrectly mark the chain as Available instead of Degraded. The onAvailable handler should check if the status is Degraded and avoid overwriting it, or the status transition logic needs to account for the fact that a service can be both available and degraded simultaneously.

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants