Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow custom polling behavior in CallHttpAsync() for long-running endpoints #1183

Open
idg10 opened this issue Jan 29, 2020 · 3 comments
Open

Allow custom polling behavior in CallHttpAsync() for long-running endpoints #1183

idg10 opened this issue Jan 29, 2020 · 3 comments
Labels

Comments

@idg10
Copy link

@idg10 idg10 commented Jan 29, 2020

The 202-based polling pattern that Durable Functions supports (both for consuming external HTTP endpoints, and also for presenting the status of Durable Functions in progress) is at odds with the HTTP specification, and also does not conform Microsoft's guidelines at https://github.com/Microsoft/api-guidelines/blob/vNext/Guidelines.md

We would like Durable Functions to be able to consume long-running operations that work in the way described in those guidelines. Currently, Durable Functions require an operation status endpoint to return a 202 status for as long as the operation is running. If the operation status endpoint returns a 200 status, Durable Functions interprets this as meaning that the operation is complete.

That is directly incompatible with section 13.2.5 of Microsoft's API guidelines at https://github.com/Microsoft/api-guidelines/blob/vNext/Guidelines.md#1325-operation-resource which say:

The GET operation against an operation MUST return:

  1. The operation resource, it's state, and any extended state relevant to the particular API.
  2. 200 OK as the response code.

The reason we'd prefer to have long-running endpoints work the way that these guidelines recommend is that they are consistent with the HTTP specification in a way that the approach taken by Durable Functions is not.

Durable Functions expect a GET against the URL representing an operation to return a 202 if the operation is incomplete, but the problem with this, from an HTTP Specification point of view, is that the meaning of a 202 status at this point is not "the operation hasn't finished", it is "we haven't finished retrieving the operation's status". That's not how Durable Functions interprets it of course, but it's what a 202 status at this point implies based on what the HTTP Spec says.

Where a 202 does come into this is as the return code from the endpoints that starts the operation. In that context it makes sense, because the HTTP operation it describes is typically a POST asking to do a thing, and the 202 correctly indicates that the work to do the thing is underway but not complete, and the response can include a header saying where to find out more. (Opinion is divided on exactly what that header should be. Azure uses either of 2 different headers for this as described at https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-manager-async-operations#monitor-status-of-operation while the Microsoft guidelines show a non-standard Operation-Location header in https://github.com/Microsoft/api-guidelines/blob/vNext/Guidelines.md#1327-the-typical-flow-polling and there's a proposal at microsoft/api-guidelines#101 to use the standard Content-Location instead; Durable Functions expects Location).

HTTP status codes tell you about the HTTP request at hand. They are not designed to tell you about the status of something else. The reason the API Guidelines say that the Operation endpoint MUST return a 200 (even if the operation is not in progress) is that this is how it tells the client that the GET request to fetch the status has been successfully handled.

With 202s, this is arguably not a problem (assuming you're prepared to ignore the abuse of the HTTP protocol) but where the problems in this design really become apparent is on failure. If you look at https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-http-api#get-instance-status it says that the way the orchestration instance endpoint reports the failure of the instance is by producing a 500 status.

The big problem with that is that there is no way to distinguish beween the failure of the orchestration instance, and the failure of the server to report the status of the instance.

That's a critical distinction. If a client tries to learn the status of an orchestration instance, and receives a 500 status code, it doesn't know anything about the status of the instance because there is no way to distinguish between failure of the orchestration instance or failure of the attempt to discover that instance's status. But if the request to fetch the status were to return a 200 and provide a body reporting that the instance has failed, the client would know exactly what the status of the instance is. The appropriate recovery/reporting action for this failure mode is completely different than for the first failure mode. (If the attempt to discover the status failed due to a server error, it's worth waiting and retrying because 500 errors are often transient. But in the case where the instance has failed, there's no use in retrying the get status because it's going to carry on telling you the same thing.) But with the current design, you can't tell which of these has happened when you get a 500.

This is exactly why you mustn't use the HTTP status to report anything other than how the particular request being responded to was handled. As soon as you start trying to merge other information into the HTTP status response (e.g., information about how some other operation is going) you introduce ambiguity.

The canonical HTTP mechanism for doing this is to use the status code purely to describe the request handling, and then to put all other state of interest into the response body (or possibly response headers). And that's exactly what the Microsoft API guidelines say to do.

I understand that Durable Functions are now committed to this design when it comes to the implementation of the server side of this pattern because it's out there. So much as I'd like to see it fixed, I understand you've got customers depending on it. (Although you could still introduce a new orchestration instance endpoint that is compliant with the HTTP spec, and gradually deprecate the old one.)

But really want to implement my own APIs in a way that complies with the HTTP spec (which means I must be able to produce a 200 status code from the endpoint reporting the status of an operation in progress regardless of whether that operation is complete, in progress, or has failed). And ideally I'd be able to take advantage of Orchestration's ability to consume long-running operations to monitor operations that report their status in a way that is compliant with the HTTP specification and with Microsoft's own recommendations.

So even if you can't change how Functions' implementation of the server side of this pattern works, you could modify your implementation of the client side of the pattern so that it's possible to consume HTTP-compliant status endpoints.

One way to do this would be to open up the logic that looks at the response from a status endpoint and decides whether it's done. Currently that logic is baked into DurableOrchestrationContext's implementation of IDurableOrchestrationContext.CallHttpAsync at https://github.com/Azure/azure-functions-durable-extension/blob/v2/src/WebJobs.Extensions.DurableTask/ContextImplementations/DurableOrchestrationContext.cs#L233

If this logic could be factored out, with an option to plug in some other test (e.g., something conceptually equivalent to Func<HttpResponseMessage, Task<bool>>, although the reality might be more complex) I could then plug in a test which, rather than inferring the operation status from the HTTP status code, looked at the body and retrieved the status property from the JSON in the response (i.e., using the mechanism described in Microsoft's API guidelines).

@idg10 idg10 changed the title Support long-running operations without forcing them to violate the HTTP specification Support long-running operations without requiring violation of the HTTP specification Jan 29, 2020
@ConnorMcMahon

This comment has been minimized.

Copy link
Contributor

@ConnorMcMahon ConnorMcMahon commented Jan 29, 2020

@idg10

Thank you for the feedback. I think you bring up a lot of very valid points. To clarify, while our design does not match the Microsoft API guidelines you linked to, we did follow the asynchronous request-reply pattern mentioned in other Microsoft documentation.

At the end of the day, because there is no true universal standard for how to handle the asynchronous pattern, we went with the approach on our client-side implementation that was compatible with our server-side implementation, to allow coordination across different Durable Functions applications. It is very possible that there were better alternatives, but at this point we are a bit stuck on what the default behavior is.

I definitely think allowing some callback function to decide retry behavior is an excellent way to allow the best of both worlds. We should definitely revisit the defaults in the future when we decide to do another major release, but it will likely be a long time before we do that again.

As a side-note, the documentation regarding our GET orchestration status endpoint is actually incorrect for Durable v2.x (I'll open a separate issue to get that fixed). By default, we now return a 200 status code on orchestration failures. In order to restore the v1.x behavior, the url needs to use the query string parameter returnInternalServerErrorOnFailure=true. You can see the code for that here.

@ConnorMcMahon ConnorMcMahon changed the title Support long-running operations without requiring violation of the HTTP specification Allow custom polling behavior in CallHttpAsync() for long-running endpoints Feb 8, 2020
@markheath

This comment has been minimized.

Copy link
Contributor

@markheath markheath commented Feb 15, 2020

I think there is scope for several improvements to the way CallHttpAsync works for long-running operations.

First, there is no support for retries. You'd probably want separate retry policy options for the "initial" call to the endpoint, and for the calls to the endpoint for status polling. This could be supported with two optional RetryOptions properties on the DurableHttpRequest

Second, it would be nice to be able to specify an overall maximum polling duration, after which it could throw a time out exception.

Third, from my experiments it seems that the original HTTP request will time out in just under 90 seconds. It would be nice to be able to control that timeout duration with another property on DurableHttpRequest

Fourth, it seems a bit odd that the polling endpoint has to keep specifying the Location header. Could there be an option to always poll the Location header returned from the original request. Or even just to default to using that if there was no Location header returned by the polling endpoint.

Fifth, as @idg10 has pointed out, the polling pattern is a little idiosyncratic and doesn't necessarily match how many long-running APIs work in the wild. Maybe DurableHttpRequest could take a func that inspects the polling response and returns whether to keep polling or not.

Finally, the ManagedIdentityTokenSource that you can provide doesn't let you override the azureAdInstance constructor for the underlying AzureServiceTokenProvider, which prevents you from using it against Azure Government.

I'd be interested to hear any feedback on these suggestions, as we'd like to use this feature for some of our workflows and might be able to offer a PR with these changes in.

@bachuv

This comment has been minimized.

Copy link
Contributor

@bachuv bachuv commented Mar 18, 2020

Below are some possible modifications we can make to incorporate custom polling into the CallHttpAsync API. These modifications would allow customers to provide a Func<T, TResult> delegate to the CallHttpAsync API.

There will be a new type called PollingDecision that has 2 subtypes:

  1. PollAgain subtype: DurableHttpRequest
  2. DonePolling subtype: Empty Object

Func<T, TResult> delegate signature for the CallHttpAsync API:

  • Func<DurableHttpResponse, PollingDecision>

Using this delegate function signature would be simple to use and implement. It would also allow synchronous operations because it would be performed in an orchestrator function.

Modifications to the 2 existing CallHttpAsync APIs:

  1. Task<DurableHttpResponse> CallHttpAsync(DurableHttpRequest req, Func<DurableHttpResponse, PollingDecision> pollingDecider = null)
  2. Task<DurableHttpResponse> CallHttpAsync(HttpMethod method, Uri uri, string content = null, Func<DurableHttpResponse, PollingDecision> funcName = null)

The Func<DurableHttpResponse, PollingDecision> delegate will allow customers to implement their own polling logic.

Also, 2 new issues were created to address the polling durations (issue #1278) and configuring ManagedIdentityTokenSource to work with government clouds (issue #1279).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants
You can’t perform that action at this time.