
Honouring retries from 429 response #247

Closed
mcquiggd opened this issue May 9, 2017 · 32 comments

@mcquiggd

mcquiggd commented May 9, 2017

I am attempting to capture a 429 response, and the advised retry interval from the HTTP response, as outlined in the blog post here

I am using .Net Core, with version 5.1 (current latest) of Polly.

I have started by trying to get a simple HTTP response captured, but ultimately I want to use Polly to handle retries for the .Net Core DocumentDB client.

The example code in the blog post is as follows:

var honoursRetryAfter = Policy<HttpResponseMessage>  
  .Handle<HttpException>
  .OrResult(r => statusCodesToRetry.Contains(r.StatusCode)) // detail elided
  .RetryAsync(retryCount: /* your preferred retry count */, 
         sleepDurationProvider: (retryAttempt, context) => 
             context.ContainsKey("RetryAfter") && context["RetryAfter"] != null
             ? (TimeSpan)context["RetryAfter"] 
             : defaultSleepProvider(retryAttempt)  // detail elided
  );

Unfortunately, this particular item seems to be more pseudocode than a usable example - there is no named parameter sleepDurationProvider on RetryAsync. So, I have tried WaitAndRetryAsync as follows:

            var honoursRetryAfterAsync = Policy<HttpResponseMessage>
                .Handle<Exception>()
                .OrResult(r => r.StatusCode.ToString().Contains("429"))
                .WaitAndRetryAsync(                    
                    retryCount: 3,
                    sleepDurationProvider: (retryAttempt, context) => context.ContainsKey("RetryAfter") && context["RetryAfter"] != null
                                                                          ? (TimeSpan)context["RetryAfter"]
                                                                          : TimeSpan.FromSeconds(0.5),
                    onRetryAsync: async (message, timespan, retryAttempt, context) => await DoSomething(context["httpClient"]));

However, it is becoming confusing: there are so many overloads with different signatures, their availability depends on the full signature of the Policy being built, and some of the examples in the documentation do not compile, e.g.:

Policy
  .HandleResult<HttpResponseMessage>(r => r.StatusCode == 500) // Error - cannot use equality between r.StatusCode and int
  .OrResult<HttpResponseMessage>(r => r.StatusCode == 502) // Error - should not specify <HttpResponseMessage> again
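
(For reference, a version of those two lines that does compile - assuming using System.Net for HttpStatusCode; untested - would presumably be:)

Policy
  .HandleResult<HttpResponseMessage>(r => r.StatusCode == HttpStatusCode.InternalServerError)
  .OrResult(r => r.StatusCode == HttpStatusCode.BadGateway)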

So, I am feeling as if I am just hacking things together in an inappropriate way ;)

I would really appreciate it if someone can point me in the right direction to get this feature working, as intended.

@reisenberger
Member

@mcquiggd. Yes, that was pseudo-code-ish (have re-labelled). Thanks for identifying the issues! Let's see if we can sort it out. I'd also been meaning to ask @jmansar (who originally requested the feature, #177) or @tcsatheesh (who was also interested in using it with DocumentDb) if they wanted to expand on my blogged (pseudo-) code: both, feel free to chip in!

More coming shortly.

@reisenberger
Member

@mcquiggd Having spent a couple of hours on this, I am not convinced we have adequate support within Polly yet for the DocumentDB RetryAfter Use Case.

I can see various ways forward to support this better in Polly, but need more time to consider.

A possible workaround for now with Polly might be (using microsoft.azure.documentdb.core) as below:

var honoursRetryAfterFromDocumentDb = Policy
    .Handle<DocumentClientException>(de => de.RetryAfter > TimeSpan.Zero)
    .Or<AggregateException>(ae => ((ae.InnerException as DocumentClientException)?.RetryAfter ?? TimeSpan.Zero) > TimeSpan.Zero)
    .WaitAndRetryAsync(retryCount: 3,
        sleepDurationProvider: (retryAttempt, context) => TimeSpan.Zero,
        onRetryAsync: (exception, timespan, retryAttempt, context) => 
            Task.Delay((exception as DocumentClientException)?.RetryAfter ?? (exception.InnerException as DocumentClientException)?.RetryAfter ?? TimeSpan.FromSeconds(0.5))
       );

Clearly this is sub-optimal: we are bypassing Polly's in-built sleep, and sleeping in the onRetryAsync. We lose the benefit of Polly's in-built cancellation support.

I haven't had time to test the above tonight and would be interested in feedback on whether it works/gets you closer - that would help guide how we wrap this up into a more off-the-shelf/ready-to-go solution!

Thanks!

@reisenberger
Member

reisenberger commented May 9, 2017

Note: the Microsoft Patterns & Practices team recommend the in-built retry for DocumentDb: is this relevant? There is also more in this documentation.

Would be great to hear if this covers it, or if there's a reason it's still worth us pursuing this use case in Polly - will help guide where to place effort. Many thanks!

EDIT: I guess this paragraph:

If you have more than one client cumulatively operating consistently above the request rate, the default
retry count currently set to 9 internally by the client may not suffice; in this case, the client throws a
DocumentClientException with status code 429 to the application. The default retry count can be changed
by setting the RetryOptions on the ConnectionPolicy instance. By default, the DocumentClientException
with status code 429 is returned after a cumulative wait time of 30 seconds if the request continues to
operate above the request rate. This occurs even when the current retry count is less than the max retry
count, be it the default of 9 or a user-defined value.

in the linked documentation suggests a reason why still handling this in Polly could be useful ....

@reisenberger
Member

@mcquiggd Thank you also for pointing out the slips in the readme doco. Now fixed in #248.

@mcquiggd
Author

mcquiggd commented May 9, 2017

@reisenberger

Many thanks for taking the time to look into this - it's appreciated.

We had sort of come to the same point.

My thoughts so far:

  1. My understanding is that the Microsoft recommendations are really a general approach to retries with services (as in - 'you should have retries'), rather than any specific technical reason for using their implementation. Having chatted to a member of the DocumentDB team, they did state that the Client is only intended for basic fault handling scenarios. At present you cannot even specify a sliding scale for wait times in the Client; to do so you need to use the Transient Fault Handling package, which will be deprecated when the Client has feature parity. But there are no plans to add other features AFAIK.

  2. Polly has a major part to play in Cloud architectures, and we are unlikely to see any similar efforts from Microsoft (the Enterprise Library blocks seem to be left to one side). I want to be able to architect a common set of Policies and patterns for handling errors - be they Blob Storage, Azure Search, DocumentDB, DynamoDB, ElasticSearch.

  3. It might therefore be appropriate to take the approach of, say, wrapping one of the more advanced patterns (Circuit Breaker) around the individual service-specific, simple retry logic. Perhaps catch specific exceptions such as DocumentClientException in a try/catch, and when retries are exhausted rethrow and handle a custom 'ServiceException' with a Polly Policy / bundle of Policies, as a small abstraction.

Still thinking about all this... as you mention in your edit, I feel there is a role for Polly to play in this.

I need to ponder it for a while and try some ideas over a cup of tea ... ;-)

@mcquiggd
Author

24 hours is a long time in technology - now we have Azure Cosmos DB... 🥇

From the documentation:

If you have more than one client cumulatively operating above the request rate, the default retry behavior may not suffice, and the client will throw a DocumentClientException with status code 429 to the application. In cases such as this, you may consider handling retry behavior and logic in your application's error handling routines or increasing the reserved throughput for the container.

Having looked at the DocumentClient source code, I am currently testing an approach that disables the built-in retries, and uses Polly instead. Will report back if I make progress.

@reisenberger
Member

reisenberger commented May 10, 2017

Thanks for the thoughtful commentary @mcquiggd ^^ (two up). Agree on all counts!

Re your point 3 - wrapping a bunch of policies around a call as a small abstraction - check out Polly's PolicyWrap (if you haven't already seen it), designed for exactly this purpose. We also provide guidelines on ordering the policies within a wrap, although PolicyWrap is intentionally fully flexible, so any ordering can be used.
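
As a minimal sketch (the particular policies, thresholds and the DocumentDB call are illustrative only, not a recommendation for your exact case):

var retry = Policy
    .Handle<DocumentClientException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromMilliseconds(200 * attempt));

var breaker = Policy
    .Handle<DocumentClientException>()
    .CircuitBreakerAsync(exceptionsAllowedBeforeBreaking: 4, durationOfBreak: TimeSpan.FromSeconds(30));

// Policies are supplied outermost-first; here the retry wraps the breaker, but any ordering can be used.
var resilienceStrategy = Policy.WrapAsync(retry, breaker);

// Executing a call through the whole wrap:
// var response = await resilienceStrategy.ExecuteAsync(() => SomeDocumentDbCallAsync());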

+1 to your idea of:

architect a common set of Policies and patterns for handling errors -
be they Blob Storage, Azure Search, DocumentDB, DynamoDB, ElasticSearch.

We would love to hear more about this later, if it grows and you would like to share. We'd be happy, for example, to blog/guest-blog about it, or perhaps share samples (duly credited) via a Polly.Contrib. (Contact me on Polly slack later if you want to discuss.)

One issue with the in-built retries for the various Azure services is that they are all different APIs - good in the sense that each is probably a good fit/strategy for the particular service, but a slightly fragmented API experience. Another issue with combining Polly policies with Azure's in-built retries is that with Polly-wraps-Azure-retry you can effectively only have the retry innermost (out of the set of policies). However, you might want to have certain policy types inside a retry, in some PolicyWraps.

Re Topaz and the Enterprise Application Blocks, yes, feels like it is well on the way out. The Microsoft patterns and practices team are now moving away from recommending Topaz, in favour of Polly (I've provided input on the Polly samples).

@mcquiggd
Author

mcquiggd commented May 11, 2017

@reisenberger

Well, I have made a little progress.

Firstly, I have taken the DocumentDB Benchmark project (available from the DocumentDB GitHub repo, under samples), copied it and converted it to .Net Core, so we have both .Net 4.x and .Net Core 1.x versions in the same solution.

I am using this as a testbed for my 'experiments', with the latest DocumentDB emulator set to use a single, non-partitioned collection of 400 Request Unit capacity, the 'slowest'.

I deliberately set the values for the number of threads, and the number of documents to insert, to cause throttling.

With the DocumentDB Client retry, throttling was handled automatically.

Next, I disabled the DocumentDB Client retries; of course exceptions occurred. Bear in mind that these settings are 'global' if you follow the sound advice of using a single Client instance, which does a lot of performance enhancement, but is also a little inflexible...

private static readonly ConnectionPolicy ConnectionPolicy = new ConnectionPolicy 
{ 
    ConnectionMode = ConnectionMode.Direct, 
    ConnectionProtocol = Protocol.Tcp, 
    RequestTimeout = new TimeSpan(1, 0, 0), 
    MaxConnectionLimit = 1000, 
    RetryOptions = new RetryOptions 
    { 
        MaxRetryAttemptsOnThrottledRequests = 0, // was 10
        MaxRetryWaitTimeInSeconds = 60
    } 
};

Then, I created a Polly Retry Policy, as follows:

private static readonly RetryPolicy PollyRetryPolicy = Policy.Handle<DocumentClientException>(de => de.RetryAfter > TimeSpan.Zero).
        Or<AggregateException>(ae => ((ae.InnerException as DocumentClientException)?.RetryAfter ?? TimeSpan.Zero) > TimeSpan.Zero).
        WaitAndRetryAsync(retryCount: 3, 
                sleepDurationProvider: (retryAttempt, context) =>
                {
                    var retryAfter = TimeSpan.FromMilliseconds(10); // Should be 0? 
                    if (context == null || !context.ContainsKey("RetryAfter")) return retryAfter;                       
                    return (TimeSpan)context["RetryAfter"];
                }, 
                onRetryAsync: (exception, timespan, retryAttempt, context) =>
                {
                    context["RetryAfter"] = (exception as DocumentClientException)?.RetryAfter
                                            ?? (exception.InnerException as DocumentClientException)?.RetryAfter 
                                                 ?? TimeSpan.FromMilliseconds(1); //Just an arbitrary chosen value, never seems to be used
                    return Task.CompletedTask;
                }                
        );

So, basically I am using Context to pass the retry value from onRetryAsync to the sleepDurationProvider, and using purely Polly to handle the waits etc. And now we can create different policies / settings for different 'operations', using the same Client instance, potentially passing the 'operation name' in Context from the call to Execute the Policy... if my understanding of the Retry life-cycle is correct. I have noticed that retry attempt 1 does not seem to include a server retry-after value, but I had limited time today to verify.

This Policy is then called in the multi-threaded benchmarking app as below:

var response  = await PollyRetryPolicy.ExecuteAsync(async () => await client.CreateDocumentAsync(
                                                                    UriFactory.CreateDocumentCollectionUri(_appSettings.DatabaseName, _appSettings.CollectionName),
                                                                    newDictionary,
                                                                    new RequestOptions()));

The result - Polly handles all the retries, observing the returned retry interval from the server:

(Screenshot: console output showing Polly observing the server-returned retry interval.)

I'll continue to work on this (it can definitely be tidied up, I have a lot of distractions around at the moment), and upload the Solution to Github in case others wish to contribute.

Perhaps you can spot any obvious mistakes - e.g. is it valid to create the Policy as static readonly in a multithreaded environment?

I want to create some more advanced examples for the different scenarios available with Polly, along the lines of your last post... will pop onto Slack at some point...

@mcquiggd
Author

mcquiggd commented May 11, 2017

@reisenberger

Made some changes this morning, including storing the id of the Document to be inserted via Context when ExecuteAsync is called, and that functions as expected, with data subsequently being available within the Policy.
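
A minimal sketch of that usage (documentId is illustrative; the rest follows the earlier benchmark code):

// Hypothetical sketch: passing the document id into the execution via contextData,
// so it is available on the Context inside sleepDurationProvider / onRetryAsync.
var contextData = new Dictionary<string, object> { ["DocumentId"] = documentId };

var response = await PollyRetryPolicy.ExecuteAsync(
    ctx => client.CreateDocumentAsync(
        UriFactory.CreateDocumentCollectionUri(_appSettings.DatabaseName, _appSettings.CollectionName),
        newDictionary,
        new RequestOptions()),
    contextData);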

A1. I changed the Policy declaration to be simply private - i.e. not static, not read only - I observed no change in behaviour.

A2. I have also set the parameter continueOnCapturedContext on ExecuteAsync to true - this does not appear to have any impact on results.

A3. I have verified that on retry attempt 1, there is no RetryAfter being returned in the Context, although it is returned from the server.

A4. On investigation, this is due to the sleepDurationProvider being called before the onRetryAsync delegate, which is not what I expected from the Retry lifecycle:

If another try is permitted, the policy:
Raises the onRetry delegate (if configured)
In the case of a wait-and-retry policy, calculates the duration to wait from the IEnumerable or Func<int, TimeSpan> configured, and waits.

(Screenshot: console output of the retry timings described above.)

A5. I don't see an override of the sleepDurationProvider that provides access to the Exception / handled object - which would allow simplified logic in this instance.

Any thoughts...?

@reisenberger
Member

@mcquiggd Great to see this all coming together!

👍 using the Context to pass things (operation name; Id) into the call - exactly what it is intended for.

A1. All Polly policies are fully thread-safe, so you're absolutely fine re-using an instance across multiple calls (if relevant to any decisions around this).

A2. Whether you want continueOnCapturedContext or not should be driven simply by what is better for your use case / usage with DocumentDB (the usual considerations around how it influences async context marshalling). Polly doesn't care either way. continueOnCapturedContext == true is usually only necessary if you specifically need execution after the ExecuteAsync(...) to continue on the same context as before (some actor frameworks; or needing to preserve UI execution context).

A3/A4. Your analysis is exactly correct. Sleep duration is calculated before onRetry/Async is invoked - even though only used after - because it's an input parameter to onRetry/Async. I've updated the Retry wiki to make this more accurate.

Foresaw your A3 problem as soon as you posted this, btw, but ... busy day this end ... you beat me to the follow-up 😀 and figured it out, so great 👍 . This exact issue is why my example the day before was more circuitous!

(A5: comments in a moment.)

@reisenberger
Member

@mcquiggd

A5. So, we could do this (add sleepDurationProvider overloads taking the exception / handled result), but I'm favouring adding the Exception / handled result to Context instead, to avoid proliferating further overloads (and it opens up the potential for other use cases profiting from the same info).

So, you'd have access to eg, context.LastHandledException and context.LastHandledResult - would this do the trick?

( Great to have this conversation and see the DocumentDb use case in detail: thanks for sharing! )

@mcquiggd
Author

mcquiggd commented May 12, 2017

So, you'd have access to eg, context.LastHandledException and context.LastHandledResult - would this do the trick?

Very nicely, and I was going to suggest that - the existing number of overloads was a little intimidating at first, and of course changing the execution behaviour and breaking backwards compatibility is a non-starter.

@mcquiggd
Author

mcquiggd commented May 12, 2017

@reisenberger

Added some random thoughts:

B1. A Policy can Handle, and also Or different Exception or Result types. As we found in our example for DocumentDB, that could be DocumentClientException or AggregateException. That could result in a bit of spaghetti code to determine what type the context.LastHandledException or context.LastHandledResult should be cast to for accessing the desired properties. Ugly.

B2. As we are using an initial generic Handle<T> or HandleResult<T>, will we be passing context.LastHandledException of type T, context.LastHandledResult of type T (and exploring AggregateExceptions to locate T)? This would save a lot of repetitive code... Any use cases that would rule this out?

B3. If we are able to pass the Handled type, it makes sense to have context.HandledException and context.HandledResult.

@reisenberger
Member

@mcquiggd

B1/B2 Handled results: When a policy is a strongly-typed generic Policy<TResult>, executions through the policy can only return results of type TResult. Given that, we should be able to pass context.LastHandledResult as TResult.

It would mean creating a new generic-typed Context<TResult> : Context. We would need to check for breaking changes, but given Context<TResult> : Context, my initial (shallow) assessment is that breaking changes would be avoided.

(thoughts on other categories coming separately)

@reisenberger
Member

reisenberger commented May 15, 2017

@mcquiggd

B1/B2 Handling multiple exception types: Yes, a single policy can handle multiple unrelated exception types (say, TException and VException). And this power has some relatively common and useful Use Cases; eg using HttpClient w/ async, it's common to want to handle both HttpRequestException and TaskCanceledException (for timeouts).

That power does make providing a strongly-typed TException to eg onRetry not a straight fit.

Thinking around ... The only option (which would be a major API change) would be moving the onRetry-style delegates of Polly into the .Handle<>(...) clause. But it carries other disadvantages. It weakens the meaning of the delegates (would dissociate them from their current, nicely-stated onBreak, onRetry, onFallback role). And it would introduce code duplication for the relatively common Use Case, that you simply want to log something around the exception - that it eg caused the retry or circuit breakage. Not altogether satisfactory in different ways...

At this stage, I'm left thinking that the need (sometimes) to disambiguate which exception you have received is a necessary corollary of the power to handle multiple exception types, encapsulated nicely in a single Policy. Open to thoughts, though, from anyone, if we're missing some clever alternative?

(This discussion excludes AggregateException - comment following)

EDIT: Another option that does exist, with existing Polly, is to specify separate policies for each exception type. You can use the same policy type more than once in a wrap. Where the onRetry handling was known to want to differ for different exception types, this provides a mechanism. For example, this pattern is commonly used to set up one retry policy for re-authentication, and another for transient faults.
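
For example, a rough sketch of that pattern (the exception types and the RefreshAuthToken() helper are purely illustrative):

// One retry dedicated to re-authentication, another to transient faults.
var reauthRetry = Policy
    .Handle<UnauthorizedAccessException>()
    .RetryAsync(1, onRetry: (exception, attempt) => RefreshAuthToken()); // RefreshAuthToken() is a hypothetical helper

var transientRetry = Policy
    .Handle<HttpRequestException>()
    .Or<TaskCanceledException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromMilliseconds(200 * attempt));

// Two retry policies of the same type, combined in one wrap (outermost first).
var combined = Policy.WrapAsync(reauthRetry, transientRetry);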

@reisenberger
Member

reisenberger commented May 15, 2017

Re the case ex is TException || ex is AggregateException && ex.Flatten().InnerExceptions.Single/Any(e => e is TException):

It would probably be possible to introduce handle clauses like:

.HandleIfSingleInAggregate<TException>(Func <Exception, bool> predicate) [*]

and/or

.HandleIfAnyInAggregate<TException>(Func <Exception, bool> predicate) [*]

These could extract a matching (single; or any) exception from within AggregateExceptions, and pass that directly (ie unwrapped) to any policy delegate that expects an Exception input parameter. Reducing user code navigating this, if it's possible that an underlying delegate might throw either.

Do Polly users think this would be useful?

The alternative - changing existing .Handle<>() clauses to do this - would obviously be a breaking change. We've also always adopted the premise that Polly doesn't transform the way exceptions from your existing code are expressed (eg if rethrown) - to make introducing Polly to a project easier - there's no extra/hidden 'magic' you have to know/learn/keep in mind, to adjust for Polly being in the mix. So, if we're going to do this, my leaning would be towards an approach that explicitly states the behaviour, like .HandleIfSingleInAggregate<TEx>().

[*] Long, intention-revealing names intentionally used - other suggestions welcome!

@reisenberger
Member

reisenberger commented May 15, 2017

As an aside @mcquiggd , do we know that DocumentCosmosDb might throw DocumentClientException either directly, or wrapped in an AggregateException? [EDIT: as potentially either, from the same call?; see later note] (ie have we observed this in practice, or found it documented on DocumentDB?).

I saw this dual approach in a code example when researching, but it could have started a spurious hare ...

@mcquiggd
Author

mcquiggd commented May 16, 2017

@reisenberger

Indeed, Document Cosmos DB will throw either a DocumentClientException or an AggregateException; the latter is, I believe, due to the behaviour of asynchronous code, specifically Tasks.

Stephen Cleary states here:

Allowing async to grow through the codebase is the best solution, but this means there’s a lot of initial work for an application to see real benefit from async code. There are a few techniques for incrementally converting a large codebase to async code, but they’re outside the scope of this article. In some cases, using Task.Wait or Task.Result can help with a partial conversion, but you need to be aware of the deadlock problem as well as the error-handling problem. I’ll explain the error-handling problem now and show how to avoid the deadlock problem later in this article.

Every Task will store a list of exceptions. When you await a Task, the first exception is re-thrown, so you can catch the specific exception type (such as InvalidOperationException). However, when you synchronously block on a Task using Task.Wait or Task.Result, all of the exceptions are wrapped in an AggregateException and thrown. Refer again to Figure 4. The try/catch in MainAsync will catch a specific exception type, but if you put the try/catch in Main, then it will always catch an AggregateException. Error handling is much easier to deal with when you don’t have an AggregateException, so I put the “global” try/catch in MainAsync.
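
To illustrate the difference (a minimal standalone sketch inside an async method; not DocumentDB-specific):

Task faulted = Task.FromException(new InvalidOperationException("boom"));

try { faulted.Wait(); }                   // synchronous block
catch (AggregateException ae)             // the exception arrives wrapped
{
    Console.WriteLine(ae.InnerException.GetType().Name); // InvalidOperationException
}

try { await faulted; }                    // await re-throws the inner exception directly
catch (InvalidOperationException ex)
{
    Console.WriteLine(ex.GetType().Name); // InvalidOperationException
}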

So, there is a need to handle both possibilities. This is, I believe, how the Transient Fault Handler works. I will look at the code for it tomorrow, and will drop a message to a person I know on the DocumentDB team to see if they have any insight...

I have some time this week to devote to this - I will try to create an example of what I envisaged for my own application; basically the same sort of approach as the Transient Fault Handler that decorates the DocumentDB client, but with Polly policies. These would be specified at client instantiation as 'global' Policies, and optionally overridden by operation-specific Policies passed when calling a method.

So basically there would be a library of extension methods, which wrap each type of client, and return a 'Resilient' instance; such as IResilientDocumentDBClient, IResilientAzureStorageClient, IResilientHTTPClient, etc, etc.

If we cannot find an elegant way to enhance the existing Polly approach to AggregateExceptions without breaking backwards compatibility, I would probably look at adding a try/catch block around the actual DocumentDB Client call, extracting the DocumentClientException and rethrowing that to the ExecuteAsync method of the Policy, within the Resilient decorator (a rough sketch follows below). For my project I also intend to add such features as concurrency handling (e.g. handling etag mismatch in DocumentDB) where appropriate.
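
A rough sketch of that unwrap-and-rethrow idea (the method, field and client names are illustrative; untested):

public Task<ResourceResponse<Document>> CreateDocumentResilientlyAsync(Uri collectionUri, object document)
{
    return _pollyRetryPolicy.ExecuteAsync(async () =>
    {
        try
        {
            return await _client.CreateDocumentAsync(collectionUri, document);
        }
        catch (AggregateException ae) when (ae.InnerException is DocumentClientException dce)
        {
            // Unwrap, so the policy only needs a single .Handle<DocumentClientException>() clause.
            throw dce;
        }
    });
}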

That's just a summary of my initial thoughts... at this point I prefer to try to write some code to see how it 'feels', and adjust accordingly... I'll share it so we can discuss it...

David

@reisenberger
Member

reisenberger commented May 16, 2017

Hi @mcquiggd . Long familiar with async, TPL and their exception behaviour - I should have been (apologies) more precise in my previous question. Q in my mind was whether any one execution through DocumentDb might fail with either a DocumentClientException or AggregateException (as opposed to DocumentDb expressing that variety if called in different ways). If not, some of the preceding code handling either AggregateException or DocumentClientException in the same policy could be moot (so complexity could perhaps be reduced).

Agree though, it's somewhat theoretical until we see how your code patterns pan out - cranking out some real code 👍

And independently: the .HandleIfSingle/AnyInAggregate<TException>() technique of the earlier comment could remain useful in Polly, regardless of whether any one execution/policy needs to trap the exception at both levels (.Handle<TException>().OrIfSingleInAggregate<TException>()). It could provide useful syntactic sugar/lifting power for AggregateExceptions in general.

Re:

If we cannot find an elegant way to enhance the existing Polly approach to
AggregateExceptions without breaking backwards compatibility

.HandleIfSingle/AnyInAggregate<TException>(Func <Exception, bool> predicate) would be the way to extend Polly in this direction without breaking backwards compatibility. I'll likely pull that out to a separate issue/roadmap point (for community comment) in due course.

@mcquiggd
Author

@reisenberger

I should have been more precise in my previous question. Q in my mind was whether any one execution through DocumentDb might fail with either a DocumentClientException or AggregateException (as opposed to DocumentDb expressing that variety if called in different ways)

Yep, got it - that's why I wanted to look at the Transient Fault Handler code and Document Client code to see under what circumstances (and why) there are AggregateExceptions. I included the background for others reading the thread :)

It would be a perfectly reasonable theory that the async methods on DocumentClient throw AggregateExceptions, and the non-async methods throw DocumentClientExceptions. But as I say, now it's time to code / examine code and see how things work ;)

@reisenberger
Member

@mcquiggd More detailed proposal for handling InnerExceptions of AggregateException, now under #256 : comment welcome!

@reisenberger changed the title from "Sample code for honouring retries from 429 response" to "Honouring retries from 429 response" on Jun 11, 2017
@reisenberger added this to the v5.6.0 milestone on Nov 26, 2017
@reisenberger
Member

@mcquiggd Polly v5.6.0 (uploading to nuget in next day or so) will include the ability for WaitAndRetry policies to define a sleepDurationProvider which can take the handled fault (exception; or returned result) as an input parameter to the sleepDurationProvider delegate.

This should make it even easier to use WaitAndRetry policies to wait for a 'retry after' duration specified by the called system; eg as CosmosDB does with HTTP response code 429 and response headers.
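
For instance, against a raw HTTP 429 a policy using the new overload might look something like this (header handling illustrative; untested):

var honoursRetryAfter = Policy
    .HandleResult<HttpResponseMessage>(r => (int)r.StatusCode == 429)
    .WaitAndRetryAsync(
        retryCount: 3,
        sleepDurationProvider: (retryAttempt, outcome, context) =>
            outcome.Result?.Headers.RetryAfter?.Delta ?? TimeSpan.FromSeconds(1), // fall back if no Retry-After header
        onRetryAsync: (outcome, timespan, retryAttempt, context) => Task.CompletedTask);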

@ranouf

ranouf commented Nov 12, 2018

Hi @reisenberger ,

Is this code valid? If not, do you have a valid solution available somewhere?

Thanks for your help

private static readonly ConnectionPolicy ConnectionPolicy = new ConnectionPolicy 
{ 
    ConnectionMode = ConnectionMode.Direct, 
    ConnectionProtocol = Protocol.Tcp, 
    RequestTimeout = new TimeSpan(1, 0, 0), 
    MaxConnectionLimit = 1000, 
    RetryOptions = new RetryOptions 
    { 
        MaxRetryAttemptsOnThrottledRequests = 0, // was 10
        MaxRetryWaitTimeInSeconds = 60
    } 
};

Then, I created a Polly Retry Policy, as follows:

private static readonly RetryPolicy PollyRetryPolicy = Policy.Handle<DocumentClientException>(de => de.RetryAfter > TimeSpan.Zero).
        Or<AggregateException>(ae => ((ae.InnerException as DocumentClientException)?.RetryAfter ?? TimeSpan.Zero) > TimeSpan.Zero).
        WaitAndRetryAsync(retryCount: 3, 
                sleepDurationProvider: (retryAttempt, context) =>
                {
                    var retryAfter = TimeSpan.FromMilliseconds(10); // Should be 0? 
                    if (context == null || !context.ContainsKey("RetryAfter")) return retryAfter;                       
                    return (TimeSpan)context["RetryAfter"];
                }, 
                onRetryAsync: (exception, timespan, retryAttempt, context) =>
                {
                    context["RetryAfter"] = (exception as DocumentClientException)?.RetryAfter
                                            ?? (exception.InnerException as DocumentClientException)?.RetryAfter 
                                                 ?? TimeSpan.FromMilliseconds(1); //Just an arbitrary chosen value, never seems to be used
                    return Task.CompletedTask;
                }                
        );

@reisenberger
Member

@ranouf We added overloads where the sleepDurationProvider takes the preceding exception as an input parameter, so this can now be done more simply. You can use that overload, move the code you have in onRetryAsync into sleepDurationProvider, and avoid having to pass anything by Context.

@ranouf

ranouf commented Nov 13, 2018

Hi,

Thanks @reisenberger.

For the next person looking for sample code:

            private readonly RetryPolicy _pollyRetryPolicy = Policy
                .Handle<DocumentClientException>(e => e.RetryAfter > TimeSpan.Zero)
                .WaitAndRetryAsync(
                    retryCount: 3,
                    sleepDurationProvider: (i, e, ctx) =>
                    {
                        var dce = (DocumentClientException)e;
                        return dce.RetryAfter;
                    },
                    onRetryAsync: (e, ts, i, ctx) => Task.CompletedTask
                );

Let me know if you see something to improve :)

@briansboyd

Hi @ranouf,

I'm new to Polly and am struggling with the sync version of your above code. Here's my first attempt, but it's not clear to me what I should use for ???:

RetryPolicy PollyRetryPolicy = Policy
                                .Handle<DocumentClientException>(e => e.RetryAfter > TimeSpan.Zero)
                                .WaitAndRetry(
                                    retryCount: 3,
                                    sleepDurationProvider: (i, e, ctx) =>
                                    {
                                        var dce = (DocumentClientException)e;
                                        return dce.RetryAfter;
                                    },
                                    onRetry: (e, ts, i, ctx) => ???
                                );

It seems to be asking for that delegate being retried. How can this be accessed there?

Also, on a side note, do you (or anyone watching this thread) know if 449's (Transient Errors) include a RetryAfter value?

Many thanks,
Brian

@reisenberger
Member

@briansboyd In the original:

onRetryAsync: (e, ts, i, ctx) => Task.CompletedTask

just denotes an async function which does nothing. To create an equivalent do-nothing for a sync function, you might use:

 onRetry: (e, ts, i, ctx) => { }

More realistically, you might use the onRetry delegate hook to do something useful like logging.

Out of interest: Is there a sync API on the Azure CosmosDB SDK which is throwing DocumentClientException with the RetryAfter property set?

@m1nkeh

m1nkeh commented May 22, 2019

Edit: heh, started replying hours ago & got really distracted.. glad i got the same answer..

@briansboyd I guess you don't actually have to do anything in the onRetry if you don't want to, as for the synchronous version it's an Action and not a Func, so it expects no return.. I think?

It's a shame that the only synchronous signature that lets you access the exception in the sleepDurationProvider also requires an onRetry, and there is already an absolute slew of overloads, so it's really quite confusing.. but you could just do:

RetryPolicy _policy;

_policy = Policy.Handle<ResponseException>().WaitAndRetry(
            retryCount: 2,
            sleepDurationProvider: (x, ex, ctx) => ((DocumentClientException)ex).RetryAfter,
            onRetry: (ex, tS, retryCount, ctx) => { }
        );

i tend to log what i'm doing though 🙂

_policy = Policy.Handle<ResponseException>().WaitAndRetry(
            retryCount: 2,
            sleepDurationProvider: (x, ex, ctx) => ((DocumentClientException)ex).RetryAfter,
            onRetry: (ex, tS, retryCount, ctx) => log.LogWarning("retry attempt no. {retryCount}", retryCount)
        );

@briansboyd

briansboyd commented May 22, 2019

Thanks @reisenberger and @m1nkeh for your responses. I guessed {} before I went home last night (working in Tokyo), glad to see that it made sense. I agree that it makes even more sense to log this action :)

Out of interest: Is there a sync API on the Azure CosmosDB SDK which is throwing DocumentClientException with the RetryAfter property set?

Although I haven't explicitly seen this error being thrown (and RetryAfter property set), I'm asking for the Create*Query(...) set of methods: e.g.,
public IOrderedQueryable<T> CreateDocumentQuery<T>(Uri documentCollectionUri, FeedOptions feedOptions = null)

The documentation here doesn't distinguish between sync and async, so I'm hopeful this property will be populated. Of course, would be great to hear if otherwise.

I figure that 429s and 449s are the exceptions to retry for Cosmos DB; I'm surprised that 449 wasn't included in this discussion. Maybe Microsoft handles this and wraps 449s into DocumentClientExceptions with a RetryAfter? I'll ask our Microsoft support contact this question and post back the answer.

Cheers,
Brian

@briansboyd

Here's the response from Microsoft's Cosmos DB Product Group:

"The short answer is yes and it is wrapped. The 449 errors will be wrapped in a document client exception with the RetryAfter property set. All exceptions from Cosmos DB are wrapped in a DocumentClientException.

The 449 errors are internally retried by the SDK:
The SDK does retry 449s in direct mode. (The gateway handles retries for gateway mode and it works slightly different).
The SDK in direct mode does do some amount of exponential backoff. We retry up to 30 seconds ourselves.

If the users would like to perform more retries on their side, similar to how they handle 429, there is a RetryAfter property on the DocumentClientException that can be used to implement that backoff.
https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.documents.documentclientexception.retryafter?view=azure-dotnet#Microsoft_Azure_Documents_DocumentClientException_RetryAfter"

Looks to be good news all around.

Cheers,
Brian

@reisenberger
Member

Thanks @briansboyd . Re the sync/async question, it may be worth noting that the CreateDocumentQuery(...) methods in the CosmosDB SDK don't actually execute the query yet, they just define it. So, afaiu, a sync Polly policy would not be needed to guard a call to CreateDocumentQuery(...) in itself.

  • If you then evaluate/execute that IQueryable<T> synchronously (eg .AsEnumerable() and other overloads) then yes, a sync Polly policy could be useful to guard that sync execution throwing DocumentClientException with RetryAfter.
  • If you evaluate/execute that IQueryable<T> asynchronously (eg .AsDocumentQuery<T> and other overloads), then that execution would be guarded with an async Polly policy; and there isn't any need for a sync Polly policy just to guard the CreateDocumentQuery(...) step which sets up the query.

(Apologies if this is stating something known already - just for clarity.)
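
As a sketch of the async case (the document type, query and variable names are illustrative; untested):

// The query definition itself needs no policy; only the async evaluation,
// which can throw DocumentClientException, is executed through the policy.
var query = client.CreateDocumentQuery<MyDocument>(collectionUri, feedOptions)
    .Where(d => d.Category == "books")
    .AsDocumentQuery();

var results = new List<MyDocument>();
while (query.HasMoreResults)
{
    var page = await pollyRetryPolicy.ExecuteAsync(() => query.ExecuteNextAsync<MyDocument>());
    results.AddRange(page);
}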

@briansboyd

Thanks @reisenberger for the clarity! I had missed that point. Just wanna also say thanks for the quality of your responses throughout this thread - it helped me learn better how Polly works.

What I learned from studying this thread, and the call out to Microsoft Support, was that 30 secs of retry is fine for our use case. Any more than that and we need to just kick down and pay for more RU's :)
