CallActivityWithRetryAsync ability to impact logging behavior #586

Open
baywet opened this issue Jan 23, 2019 · 19 comments

baywet (Member) commented Jan 23, 2019

Let's suppose I have an activity that I expect to fail fairly often because it relies on a "not so great" API. When calling that activity I leverage the CallActivityWithRetryAsync method to make sure my orchestration is more likely to complete successfully.
However, each failed attempt logs everything as errors, which clutters my telemetry with "exceptions". In that scenario I'd like to be able to provide a callback that handles logging; it would be up to my code to decide whether to log the failed attempts and, if so, at which level.
That callback could be defined on the RetryOptions object, as a property like public Action<Exception, int, bool, ILogger> HandleLogging { get; set; } (the exception that occurred, the attempt number, whether it is the last attempt based on the options, and the logger to use).
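
Roughly the usage I have in mind. To be clear, HandleLogging does not exist today; this is only a sketch of the proposed shape, and "FlakyActivity" and input are placeholders:

var retryOptions = new RetryOptions(TimeSpan.FromSeconds(10), maxNumberOfAttempts: 5)
{
    // Hypothetical property - this is the proposal, not an existing API.
    HandleLogging = (exception, attemptNumber, isLastAttempt, logger) =>
    {
        if (isLastAttempt)
            logger.LogError(exception, "Activity gave up after {Attempts} attempts", attemptNumber);
        else
            logger.LogWarning("Attempt {Attempt} failed, a retry is scheduled", attemptNumber);
    }
};
await context.CallActivityWithRetryAsync("FlakyActivity", retryOptions, input);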

cgillum (Collaborator) commented Jan 23, 2019

Thanks for the suggestion. The idea seems reasonable to me.

The challenge is that while we can control logging done by the Durable extension, we don't have control over the logging behavior of the Functions runtime. This means someone would have to change both the extension and the host to make this scenario work the way you want. It's probably doable, but I think we'd need to get a good number of votes for this capability or rely on an externally contributed PR since I expect it will be difficult to prioritize.

Let me know if you think there is a compromise that could result in a simpler fix.

cgillum added the Enhancement (Feature requests) label Jan 23, 2019
baywet (Member, Author) commented Jan 23, 2019

Thanks for the fast answer! At which point is the exception being caught and logged? Within the activity? Within the proxy layer of the extension? Within the Functions SDK? The WebJobs SDK? A mix?

cgillum (Collaborator) commented Jan 24, 2019

The layers of a durable function execution look something like this:

[Storage] --> [DTFx] --> [WebJobs] --> [Durable Extension] --> [User Function]

When there is a failure, both [WebJobs] (which is what I meant by "functions runtime") and [Durable Extension] will write logs to Application Insights. The failure propagates all the way back to [DTFx] (Durable Task Framework) which manages the durable retry - so each retry goes all the way through the entire stack.

baywet (Member, Author) commented Jan 24, 2019

Thanks for the clarification. The highest level of user code where the exception surfaces is the body of the Run method in the activity. Who gets to call that method? The Durable extension, right?
Am I right to assume that if we intercept the exception, still mark the activity as failed, and pass the exception along to the "logging callback" defined in the retry options, it would work without having to touch WebJobs?

cgillum (Collaborator) commented Jan 24, 2019

The problem is that the exception needs to flow all the way down the stack to [DTFx] in order for retry to work correctly. The other problem we have is that it is the orchestrator which manages retries, not the activity function, so activity functions actually don't know anything about whether they are being retried.

BTW: Someone just pointed out to me that there is actually a simple bug in our function logging where we are logging replayed exceptions to Application Insights (i.e. the orchestrator will re-log the same exception multiple times, once for each replay of a failed activity). Could this be the cause of the log-spam you're seeing?
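
Side note, not a fix for that extension-level bug: for logging done from your own orchestrator code, a replay-safe logger avoids the per-replay duplication. A minimal sketch, assuming a Durable extension version that exposes CreateReplaySafeLogger; the activity name and retry settings are placeholders:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

public static class MyOrchestration
{
    [FunctionName("MyOrchestrator")]
    public static async Task Run(
        [OrchestrationTrigger] IDurableOrchestrationContext context, ILogger log)
    {
        // Wrap the injected logger so log statements are suppressed while the orchestrator replays.
        ILogger safeLog = context.CreateReplaySafeLogger(log);

        try
        {
            await context.CallActivityWithRetryAsync(
                "FlakyActivity", new RetryOptions(TimeSpan.FromSeconds(5), 3), null);
        }
        catch (FunctionFailedException ex)
        {
            // Logged once, not once per replay.
            safeLog.LogWarning(ex, "FlakyActivity failed after all retries");
        }
    }
}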

baywet (Member, Author) commented Jan 24, 2019

Hmm, I get it: you don't track the state of the activity in any way other than the exception surfacing at the DTFx level, which is how you know whether it succeeded or failed and whether to retry. So it's not possible today to intercept the exception without breaking the retry mechanism.

This bug could explain some of the behavior, but even in the console log, every failing activity attempt is written as an error. Arguably, until the last retry (when the orchestration gives up) it should be written as at most a warning (and then again, having the ability to handle that myself as the consumer would definitely solve the issue and avoid any opinionated decisions).

sebader (Member) commented Apr 25, 2019

I came across the very same idea for a slightly different use case. I send a request to an external API and then need to poll a status URL every couple of seconds until the result is ready to be fetched. CallActivityWithRetryAsync() is basically a perfect fit for that use case - except for the issue of the log getting cluttered with the exceptions I need to throw.

// In the orchestrator:
var retryOptions = new RetryOptions(TimeSpan.FromSeconds(5), 10)
{
    BackoffCoefficient = 2,
    MaxRetryInterval = TimeSpan.FromSeconds(30),
    Handle = ex => ex.Message == "Status=Pending"
};
var readResult = await monitorContext.CallActivityWithRetryAsync<ReadResult>("GetStatusExternal", retryOptions, resultCheckUrl);

// The activity function being polled:
[FunctionName("GetStatusExternal")]
public static async Task<ReadResult> CheckOcrStatusThrowingExceptions([ActivityTrigger] string resultCheckUrl, ILogger log)
{
    HttpResponseMessage response = await _httpClient.GetAsync(resultCheckUrl);
    response.EnsureSuccessStatusCode();

    var result = await response.Content.ReadAsAsync<ReadResult>();

    if (result?.status == "Succeeded")
    {
        return result;
    }
    else
    {
        log.LogInformation($"Not yet finished. Current status: {result?.status}");
        throw new Exception("Status=Pending");
    }
}

If there is no way to change the behavior of the exception logging, is there a similarly easy way to achieve this with Durable Functions? I had this before using a while loop etc., but this approach would be much cleaner.
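
For reference, the while-loop variant looks roughly like this. It's only a sketch: it assumes GetStatusExternal is changed to return the current ReadResult (including the pending state) instead of throwing, and the interval and attempt count are placeholders.

// Sketch of the while-loop / durable-timer alternative in the orchestrator.
// (usings omitted: System, System.Threading, System.Threading.Tasks,
//  Microsoft.Azure.WebJobs, Microsoft.Azure.WebJobs.Extensions.DurableTask)
[FunctionName("PollForResult")]
public static async Task<ReadResult> PollForResult(
    [OrchestrationTrigger] IDurableOrchestrationContext monitorContext)
{
    var resultCheckUrl = monitorContext.GetInput<string>();

    for (int attempt = 0; attempt < 10; attempt++)
    {
        var result = await monitorContext.CallActivityAsync<ReadResult>("GetStatusExternal", resultCheckUrl);
        if (result?.status == "Succeeded")
        {
            return result;
        }

        // Wait before the next poll; durable timers survive host restarts and don't log errors.
        await monitorContext.CreateTimer(
            monitorContext.CurrentUtcDateTime.AddSeconds(5), CancellationToken.None);
    }

    throw new TimeoutException("Result was not ready after 10 polling attempts.");
}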

jbeeko commented Jun 13, 2019

I'm in the same position. I'd like to use an activity function with exponential backoff to keep trying to send callbacks until a considerable amount of time has passed. It works very well, except for the exceptions cluttering up the Azure logging :(

JosXa commented Jun 20, 2020

I am firing a load of requests at a Cosmos DB instance and fully expect some of those calls to time out. In addition to CosmosException not being serializable, I'm also having the same problem: I cannot retry the call without cluttering the Application Insights logs.

What would be the minimal logLevel settings to get rid of those log entries?
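
For what it's worth, the only knobs I know of are the logLevel filters in host.json, and they are blunt instruments. Something like the snippet below reduces the noise: Host.Triggers.DurableTask is the category the Durable extension logs under, and Function.MyFlakyActivity (a placeholder name) silences that one function's host logs entirely, including genuine failures, so use it with care.

{
  "version": "2.0",
  "logging": {
    "logLevel": {
      "Host.Triggers.DurableTask": "Warning",
      "Function.MyFlakyActivity": "None"
    }
  }
}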

@calloncampbell

I like this feature request

@stevenveerbeek

I'm having the exact same issue. We're using the Azure DevOps REST API to create a new Azure DevOps project, and we use CallActivityWithRetryAsync to poll for the status of the operation. It would be nice if we could prevent our logs from being cluttered with an error entry for each call that results in a "pending" status.

FFranz93 commented Mar 9, 2023

I have a Durable Function that does some renaming on images and then uploads those images to my company's SharePoint Online.
Due to some restrictions, SharePoint throttles my application when I make too many requests and responds with HTTP status code 429 (Too Many Requests) and a Retry-After header.
In my case CallActivityWithRetryAsync() is also the perfect solution, because I can use the delay from the header to set the backoff for my retry policy.

The only thing that's not so nice about this solution is the flood of error logs written to Application Insights, so it would be really nice to have a feature like this!

ThomasArdal commented Mar 20, 2023

I have the same question here: https://stackoverflow.com/questions/75790044/retry-in-azure-durable-functions-without-throwing-exceptions and was guided towards this thread.

Idea: enrich the log message logged as part of this error with information about the retry attempt count. That would make it possible for us to ignore these errors until the maximum number of retries has been reached.

cgillum (Collaborator) commented Mar 20, 2023

Adding @lilyjma for visibility

@NattyMojo

@cgillum are you working on this issue, or is there somewhere else it's being tracked? I'd like to see this enhancement make it through.

@campbellja

Our team is also very interested in this enhancement and hoping it gets implemented soon.

cgillum (Collaborator) commented Sep 12, 2023

We haven't been able to prioritize this, unfortunately.

One thing people can try in the meantime is to add an additional level of retries inside the activity function itself. This unfortunately requires code changes, but I suspect each language has libraries available to help with this. It might also be possible to leverage the Azure Functions built-in retry policies, but I'm not sure whether they apply to all trigger types or whether (as the documentation seems to suggest) they only work with specific trigger types.
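
A rough illustration of that workaround (a sketch only - the function name, URL parameter, delays, and attempt count are placeholders, and the HttpClient handling is simplified): the retries happen inside the activity, so only the final failure surfaces to the host and gets logged as an error.

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

public static class ResilientActivity
{
    private static readonly HttpClient _httpClient = new HttpClient();

    [FunctionName("CallFlakyApi")]
    public static async Task<string> Run([ActivityTrigger] string url, ILogger log)
    {
        const int maxAttempts = 5;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                HttpResponseMessage response = await _httpClient.GetAsync(url);
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
            catch (HttpRequestException ex) when (attempt < maxAttempts)
            {
                // Expected transient failure: log quietly and back off; nothing reaches the host yet.
                log.LogWarning(ex, "Attempt {Attempt} of {Max} failed, retrying", attempt, maxAttempts);
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
            }
            // On the last attempt the exception escapes, the activity fails, and one error is logged.
        }
    }
}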

lilyjma self-assigned this Sep 20, 2023
@ericleigh007

+1 on having retry status.
Use case:
Since activities are kind of expensive, and can actually take some time to become active after being called, we have had to combine several operations in a single activity.
The function does some work and then normally CREATEs a document in Cosmos. It can be reached from many threads (so there is Conflict handling logic).
The first try wants to Create the document, but subsequent tries want to Upsert it (there is also PreconditionFailed logic, of course).
It would be very helpful if we knew which retry we're on. Is there a reason this cannot be made available in the context?
This seems almost as straightforward as the availability of "IsReplaying".
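
Until something like that exists, one workaround (a sketch with placeholder names, not an existing feature) is to drive the retries explicitly from the orchestrator and pass the attempt number as part of the activity input, so the activity can choose Create vs Upsert itself:

// (usings omitted: System, System.Threading, System.Threading.Tasks,
//  Microsoft.Azure.WebJobs, Microsoft.Azure.WebJobs.Extensions.DurableTask)
[FunctionName("WriteWithAttemptNumber")]
public static async Task Run([OrchestrationTrigger] IDurableOrchestrationContext context)
{
    var payload = context.GetInput<string>();

    for (int attempt = 1; attempt <= 5; attempt++)
    {
        try
        {
            // The activity receives the attempt number and decides Create vs Upsert itself.
            await context.CallActivityAsync("WriteDocument", new { Attempt = attempt, Payload = payload });
            break;
        }
        catch (FunctionFailedException) when (attempt < 5)
        {
            // Deterministic backoff between attempts using a durable timer.
            await context.CreateTimer(
                context.CurrentUtcDateTime.AddSeconds(5 * attempt), CancellationToken.None);
        }
    }
}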

@allenzhang83

Where do I vote for this change to be prioritised? I'd like to log the error as information rather than as an exception during retries, and only see it as an exception after the retries have run out.
