CallActivityWithRetryAsync ability to impact logging behavior #586

Open
baywet opened this issue Jan 23, 2019 · 19 comments

baywet (Member) commented Jan 23, 2019

Let's suppose I have an activity that I expect to fail fairly often because it relies on a "not so great" API. When calling that activity I leverage the CallActivityWithRetryAsync method to make sure my orchestration is more likely to complete successfully.
However, each failed attempt logs everything as errors, which clutters my telemetry with "exceptions". In that scenario I'd like to be able to provide a callback that handles logging; it would be up to my code to decide whether to log the failed attempts and, if so, at which level.
That callback could be defined on the RetryOptions object, as a property like public Action<Exception, int, bool, ILogger> HandleLogging { get; set; } (the exception that occurred, the attempt number, whether it is the last attempt based on the options, and the logger to use).
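
Roughly the usage I have in mind. To be clear, HandleLogging does not exist today; this is only a sketch of the proposed shape, and "FlakyActivity" and input are placeholders:

var retryOptions = new RetryOptions(TimeSpan.FromSeconds(10), maxNumberOfAttempts: 5)
{
    // Hypothetical property - this is the proposal, not an existing API.
    HandleLogging = (exception, attemptNumber, isLastAttempt, logger) =>
    {
        if (isLastAttempt)
            logger.LogError(exception, "Activity gave up after {Attempts} attempts", attemptNumber);
        else
            logger.LogWarning("Attempt {Attempt} failed, a retry is scheduled", attemptNumber);
    }
};
await context.CallActivityWithRetryAsync("FlakyActivity", retryOptions, input);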

cgillum (Collaborator) commented Jan 23, 2019

Thanks for the suggestion. The idea seems reasonable to me.

The challenge is that while we can control logging done by the Durable extension, we don't have control over the logging behavior of the Functions runtime. This means someone would have to change both the extension and the host to make this scenario work the way you want. It's probably doable, but I think we'd need to get a good number of votes for this capability or rely on an externally contributed PR since I expect it will be difficult to prioritize.

Let me know if you think there is a compromise that could result in a simpler fix.

cgillum added the Enhancement (Feature requests) label Jan 23, 2019
baywet (Member, Author) commented Jan 23, 2019

Thanks for the fast answer! At which point is the exception being caught and logged? Within the activity? Within the proxy layer of the extension? Within the Functions SDK? The WebJobs SDK? A mix?

cgillum (Collaborator) commented Jan 24, 2019

The layers of a durable function execution look something like this:

[Storage] --> [DTFx] --> [WebJobs] --> [Durable Extension] --> [User Function]

When there is a failure, both [WebJobs] (which is what I meant by "functions runtime") and [Durable Extension] will write logs to Application Insights. The failure propagates all the way back to [DTFx] (Durable Task Framework) which manages the durable retry - so each retry goes all the way through the entire stack.

baywet (Member, Author) commented Jan 24, 2019

Thanks for the clarification. The highest level of user code where the exception surfaces is the body of the Run method in the activity. Who gets to call that method? The Durable extension, right?
Am I right to assume that if we intercept the exception, still mark the activity as failed, and pass the exception along to the "logging callback" defined in the retry options, it would work without having to touch WebJobs?

cgillum (Collaborator) commented Jan 24, 2019

The problem is that the exception needs to flow all the way down the stack to [DTFx] in order for retry to work correctly. The other problem we have is that it is the orchestrator which manages retries, not the activity function, so activity functions actually don't know anything about whether they are being retried.

BTW: Someone just pointed out to me that there is actually a simple bug in our function logging where we are logging replayed exceptions to Application Insights (i.e. the orchestrator will re-log the same exception multiple times, once for each replay of a failed activity). Could this be the cause of the log-spam you're seeing?
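
Side note, not a fix for that extension-level bug: for logging done from your own orchestrator code, a replay-safe logger avoids the per-replay duplication. A minimal sketch, assuming a Durable extension version that exposes CreateReplaySafeLogger; the activity name and retry settings are placeholders:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

public static class MyOrchestration
{
    [FunctionName("MyOrchestrator")]
    public static async Task Run(
        [OrchestrationTrigger] IDurableOrchestrationContext context, ILogger log)
    {
        // Wrap the injected logger so log statements are suppressed while the orchestrator replays.
        ILogger safeLog = context.CreateReplaySafeLogger(log);

        try
        {
            await context.CallActivityWithRetryAsync(
                "FlakyActivity", new RetryOptions(TimeSpan.FromSeconds(5), 3), null);
        }
        catch (FunctionFailedException ex)
        {
            // Logged once, not once per replay.
            safeLog.LogWarning(ex, "FlakyActivity failed after all retries");
        }
    }
}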

baywet (Member, Author) commented Jan 24, 2019

Hmm, I get it: you don't track the state of the activity in any way other than the exception surfacing at the DTFx level, which is how you know whether it succeeded or failed and whether to retry. So it's not possible today to intercept the exception without breaking the retry mechanism.

This bug could explain some of the behavior, but even in the console log, every failing activity attempt is written as an error. Arguably, until the last retry (when the orchestration gives up) it should be written as at most a warning (and then again, having the ability to handle that myself as the consumer would definitely solve the issue and avoid any opinionated decisions).

sebader (Member) commented Apr 25, 2019

I came across the very same idea for a slightly different use case. I send a request to an external API and then need to poll a status URL every couple of seconds until the result is ready to be fetched. CallActivityWithRetryAsync() is basically a perfect fit for that use case - except for the issue of the log getting cluttered with the exceptions I need to throw.

// In the orchestrator:
var retryOptions = new RetryOptions(TimeSpan.FromSeconds(5), 10)
{
    BackoffCoefficient = 2,
    MaxRetryInterval = TimeSpan.FromSeconds(30),
    Handle = ex => ex.Message == "Status=Pending"
};
var readResult = await monitorContext.CallActivityWithRetryAsync<ReadResult>("GetStatusExternal", retryOptions, resultCheckUrl);

// The activity function being polled:
[FunctionName("GetStatusExternal")]
public static async Task<ReadResult> CheckOcrStatusThrowingExceptions([ActivityTrigger] string resultCheckUrl, ILogger log)
{
    HttpResponseMessage response = await _httpClient.GetAsync(resultCheckUrl);
    response.EnsureSuccessStatusCode();

    var result = await response.Content.ReadAsAsync<ReadResult>();

    if (result?.status == "Succeeded")
    {
        return result;
    }
    else
    {
        log.LogInformation($"Not yet finished. Current status: {result?.status}");
        throw new Exception("Status=Pending");
    }
}

If there is no way to change the behavior of the exception logging, is there a similarly easy way to achieve this with Durable Functions? I had this before using a while loop etc., but this approach would be much cleaner.
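
For reference, the while-loop variant looks roughly like this. It's only a sketch: it assumes GetStatusExternal is changed to return the current ReadResult (including the pending state) instead of throwing, and the interval and attempt count are placeholders.

// Sketch of the while-loop / durable-timer alternative in the orchestrator.
// (usings omitted: System, System.Threading, System.Threading.Tasks,
//  Microsoft.Azure.WebJobs, Microsoft.Azure.WebJobs.Extensions.DurableTask)
[FunctionName("PollForResult")]
public static async Task<ReadResult> PollForResult(
    [OrchestrationTrigger] IDurableOrchestrationContext monitorContext)
{
    var resultCheckUrl = monitorContext.GetInput<string>();

    for (int attempt = 0; attempt < 10; attempt++)
    {
        var result = await monitorContext.CallActivityAsync<ReadResult>("GetStatusExternal", resultCheckUrl);
        if (result?.status == "Succeeded")
        {
            return result;
        }

        // Wait before the next poll; durable timers survive host restarts and don't log errors.
        await monitorContext.CreateTimer(
            monitorContext.CurrentUtcDateTime.AddSeconds(5), CancellationToken.None);
    }

    throw new TimeoutException("Result was not ready after 10 polling attempts.");
}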

jbeeko commented Jun 13, 2019

I'm in the same position. I'd like to use an activity function with exponential backoff to keep trying to send callbacks until a considerable amount of time has passed. It works very well, except for the exceptions cluttering up the Azure logging :(

JosXa commented Jun 20, 2020

I am firing a load of requests at a Cosmos DB instance and fully expect some of those calls to time out. In addition to CosmosException not being serializable, I'm also having the same problem: I cannot retry the call without cluttering the Application Insights logs.

What would be the minimal logLevel settings to get rid of those log entries?
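
For what it's worth, the only knobs I know of are the logLevel filters in host.json, and they are blunt instruments. Something like the snippet below reduces the noise: Host.Triggers.DurableTask is the category the Durable extension logs under, and Function.MyFlakyActivity (a placeholder name) silences that one function's host logs entirely, including genuine failures, so use it with care.

{
  "version": "2.0",
  "logging": {
    "logLevel": {
      "Host.Triggers.DurableTask": "Warning",
      "Function.MyFlakyActivity": "None"
    }
  }
}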

@calloncampbell

I like this feature request

@stevenveerbeek

I'm having the exact same issue. We're using the Azure DevOps REST API to create a new Azure DevOps project, and we use CallActivityWithRetryAsync to poll for the status of the operation. It would be nice if we could prevent our logs from being cluttered with an error entry for each call that results in a "pending" status.

FFranz93 commented Mar 9, 2023

I have a Durable Function that does some renaming on images and then uploads those images to my company's SharePoint Online.
Due to some restrictions, SharePoint throttles my application when I make too many requests and responds with HTTP status code 429 (Too Many Requests) and a Retry-After header.
In my case CallActivityWithRetryAsync() is also the perfect solution, because I can use the delay from the header to set the backoff for my retry policy.

The only thing that's not so nice about this solution is the flood of error logs written to Application Insights, so it would be really nice to have a feature like this!

ThomasArdal commented Mar 20, 2023

I have the same question here: https://stackoverflow.com/questions/75790044/retry-in-azure-durable-functions-without-throwing-exceptions and was guided towards this thread.

Idea: enrich the log message logged as part of this error with information about the retry attempt count. That would make it possible for us to ignore these errors until the maximum number of retries has been reached.

cgillum (Collaborator) commented Mar 20, 2023

Adding @lilyjma for visibility

@NattyMojo

@cgillum are you working on this issue, or is there somewhere else it's being tracked? I'd like to see this enhancement make it through.

@campbellja

Our team is also very interested in this enhancement and hoping it gets implemented soon.

cgillum (Collaborator) commented Sep 12, 2023

We haven't been able to prioritize this, unfortunately.

One thing people can try in the meantime is to add an additional level of retries inside the activity function itself. This unfortunately requires code changes, but I suspect each language has libraries available to help with this. It might also be possible to leverage the Azure Functions built-in retry policies, but I'm not sure whether they apply to all trigger types or whether (as the documentation seems to suggest) they only work with specific trigger types.
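
A rough illustration of that workaround (a sketch only - the function name, URL parameter, delays, and attempt count are placeholders, and the HttpClient handling is simplified): the retries happen inside the activity, so only the final failure surfaces to the host and gets logged as an error.

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

public static class ResilientActivity
{
    private static readonly HttpClient _httpClient = new HttpClient();

    [FunctionName("CallFlakyApi")]
    public static async Task<string> Run([ActivityTrigger] string url, ILogger log)
    {
        const int maxAttempts = 5;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                HttpResponseMessage response = await _httpClient.GetAsync(url);
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
            catch (HttpRequestException ex) when (attempt < maxAttempts)
            {
                // Expected transient failure: log quietly and back off; nothing reaches the host yet.
                log.LogWarning(ex, "Attempt {Attempt} of {Max} failed, retrying", attempt, maxAttempts);
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
            }
            // On the last attempt the exception escapes, the activity fails, and one error is logged.
        }
    }
}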

lilyjma self-assigned this Sep 20, 2023
@ericleigh007

+1 on having retry status.
Use case:
Since activities are kind of expensive, and can actually take some time to become active after being called, we have had to combine several operations in a single activity.
The function does some work and then normally CREATEs a document in Cosmos. It can be reached from many threads (so there is Conflict handling logic).
The first try wants to Create the document, but subsequent tries want to Upsert it (there is also PreconditionFailed logic, of course).
It would be very helpful if we knew which retry we're on. Is there a reason this cannot be made available in the context?
This seems almost as straightforward as the availability of "IsReplaying".
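
Until something like that exists, one workaround (a sketch with placeholder names, not an existing feature) is to drive the retries explicitly from the orchestrator and pass the attempt number as part of the activity input, so the activity can choose Create vs Upsert itself:

// (usings omitted: System, System.Threading, System.Threading.Tasks,
//  Microsoft.Azure.WebJobs, Microsoft.Azure.WebJobs.Extensions.DurableTask)
[FunctionName("WriteWithAttemptNumber")]
public static async Task Run([OrchestrationTrigger] IDurableOrchestrationContext context)
{
    var payload = context.GetInput<string>();

    for (int attempt = 1; attempt <= 5; attempt++)
    {
        try
        {
            // The activity receives the attempt number and decides Create vs Upsert itself.
            await context.CallActivityAsync("WriteDocument", new { Attempt = attempt, Payload = payload });
            break;
        }
        catch (FunctionFailedException) when (attempt < 5)
        {
            // Deterministic backoff between attempts using a durable timer.
            await context.CreateTimer(
                context.CurrentUtcDateTime.AddSeconds(5 * attempt), CancellationToken.None);
        }
    }
}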

@allenzhang83

Where do I vote for this change to be prioritised? I'd like to log the error as information rather than as an exception during retries, and only see it as an exception after the retries have run out.
