Join GitHub today
GitHub is home to over 40 million developers working together to host and review code, manage projects, and build software together.Sign up
Allow custom polling behavior in CallHttpAsync() for long-running endpoints #1183
The 202-based polling pattern that Durable Functions supports (both for consuming external HTTP endpoints, and also for presenting the status of Durable Functions in progress) is at odds with the HTTP specification, and also does not conform Microsoft's guidelines at https://github.com/Microsoft/api-guidelines/blob/vNext/Guidelines.md
We would like Durable Functions to be able to consume long-running operations that work in the way described in those guidelines. Currently, Durable Functions require an operation status endpoint to return a 202 status for as long as the operation is running. If the operation status endpoint returns a 200 status, Durable Functions interprets this as meaning that the operation is complete.
That is directly incompatible with section 13.2.5 of Microsoft's API guidelines at https://github.com/Microsoft/api-guidelines/blob/vNext/Guidelines.md#1325-operation-resource which say:
The reason we'd prefer to have long-running endpoints work the way that these guidelines recommend is that they are consistent with the HTTP specification in a way that the approach taken by Durable Functions is not.
Durable Functions expect a GET against the URL representing an operation to return a 202 if the operation is incomplete, but the problem with this, from an HTTP Specification point of view, is that the meaning of a 202 status at this point is not "the operation hasn't finished", it is "we haven't finished retrieving the operation's status". That's not how Durable Functions interprets it of course, but it's what a 202 status at this point implies based on what the HTTP Spec says.
Where a 202 does come into this is as the return code from the endpoints that starts the operation. In that context it makes sense, because the HTTP operation it describes is typically a POST asking to do a thing, and the 202 correctly indicates that the work to do the thing is underway but not complete, and the response can include a header saying where to find out more. (Opinion is divided on exactly what that header should be. Azure uses either of 2 different headers for this as described at https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-manager-async-operations#monitor-status-of-operation while the Microsoft guidelines show a non-standard
HTTP status codes tell you about the HTTP request at hand. They are not designed to tell you about the status of something else. The reason the API Guidelines say that the Operation endpoint MUST return a 200 (even if the operation is not in progress) is that this is how it tells the client that the GET request to fetch the status has been successfully handled.
With 202s, this is arguably not a problem (assuming you're prepared to ignore the abuse of the HTTP protocol) but where the problems in this design really become apparent is on failure. If you look at https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-http-api#get-instance-status it says that the way the orchestration instance endpoint reports the failure of the instance is by producing a 500 status.
The big problem with that is that there is no way to distinguish beween the failure of the orchestration instance, and the failure of the server to report the status of the instance.
That's a critical distinction. If a client tries to learn the status of an orchestration instance, and receives a 500 status code, it doesn't know anything about the status of the instance because there is no way to distinguish between failure of the orchestration instance or failure of the attempt to discover that instance's status. But if the request to fetch the status were to return a 200 and provide a body reporting that the instance has failed, the client would know exactly what the status of the instance is. The appropriate recovery/reporting action for this failure mode is completely different than for the first failure mode. (If the attempt to discover the status failed due to a server error, it's worth waiting and retrying because 500 errors are often transient. But in the case where the instance has failed, there's no use in retrying the get status because it's going to carry on telling you the same thing.) But with the current design, you can't tell which of these has happened when you get a 500.
This is exactly why you mustn't use the HTTP status to report anything other than how the particular request being responded to was handled. As soon as you start trying to merge other information into the HTTP status response (e.g., information about how some other operation is going) you introduce ambiguity.
The canonical HTTP mechanism for doing this is to use the status code purely to describe the request handling, and then to put all other state of interest into the response body (or possibly response headers). And that's exactly what the Microsoft API guidelines say to do.
I understand that Durable Functions are now committed to this design when it comes to the implementation of the server side of this pattern because it's out there. So much as I'd like to see it fixed, I understand you've got customers depending on it. (Although you could still introduce a new orchestration instance endpoint that is compliant with the HTTP spec, and gradually deprecate the old one.)
But really want to implement my own APIs in a way that complies with the HTTP spec (which means I must be able to produce a 200 status code from the endpoint reporting the status of an operation in progress regardless of whether that operation is complete, in progress, or has failed). And ideally I'd be able to take advantage of Orchestration's ability to consume long-running operations to monitor operations that report their status in a way that is compliant with the HTTP specification and with Microsoft's own recommendations.
So even if you can't change how Functions' implementation of the server side of this pattern works, you could modify your implementation of the client side of the pattern so that it's possible to consume HTTP-compliant status endpoints.
One way to do this would be to open up the logic that looks at the response from a status endpoint and decides whether it's done. Currently that logic is baked into
If this logic could be factored out, with an option to plug in some other test (e.g., something conceptually equivalent to
Thank you for the feedback. I think you bring up a lot of very valid points. To clarify, while our design does not match the Microsoft API guidelines you linked to, we did follow the asynchronous request-reply pattern mentioned in other Microsoft documentation.
At the end of the day, because there is no true universal standard for how to handle the asynchronous pattern, we went with the approach on our client-side implementation that was compatible with our server-side implementation, to allow coordination across different Durable Functions applications. It is very possible that there were better alternatives, but at this point we are a bit stuck on what the default behavior is.
I definitely think allowing some callback function to decide retry behavior is an excellent way to allow the best of both worlds. We should definitely revisit the defaults in the future when we decide to do another major release, but it will likely be a long time before we do that again.
As a side-note, the documentation regarding our GET orchestration status endpoint is actually incorrect for Durable v2.x (I'll open a separate issue to get that fixed). By default, we now return a 200 status code on orchestration failures. In order to restore the v1.x behavior, the url needs to use the query string parameter
I think there is scope for several improvements to the way
First, there is no support for retries. You'd probably want separate retry policy options for the "initial" call to the endpoint, and for the calls to the endpoint for status polling. This could be supported with two optional
Second, it would be nice to be able to specify an overall maximum polling duration, after which it could throw a time out exception.
Third, from my experiments it seems that the original HTTP request will time out in just under 90 seconds. It would be nice to be able to control that timeout duration with another property on
Fourth, it seems a bit odd that the polling endpoint has to keep specifying the
Fifth, as @idg10 has pointed out, the polling pattern is a little idiosyncratic and doesn't necessarily match how many long-running APIs work in the wild. Maybe
I'd be interested to hear any feedback on these suggestions, as we'd like to use this feature for some of our workflows and might be able to offer a PR with these changes in.
Below are some possible modifications we can make to incorporate custom polling into the
There will be a new type called
Func<T, TResult> delegate signature for the
Using this delegate function signature would be simple to use and implement. It would also allow synchronous operations because it would be performed in an orchestrator function.
Modifications to the 2 existing