
Too many errors for endpoint 'https://...': retrying later #19360

Closed

rossigee opened this issue Sep 10, 2023 · 3 comments

@rossigee
I would like to understand better the cause of the following error message...

2023-09-10 01:39:06 GMT | CORE | ERROR | (pkg/forwarder/worker.go:179 in process) | Too many errors for endpoint 'https://10.70.12.34:3834/api/v1/series?api_key=<blahblah>': retrying later

The agent is running on Windows and is configured to use an HAProxy instance serving TLS endpoints with a self-signed certificate. The certificate has been added to the Windows system trust store.
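(Aside, for anyone with a similar setup: a quick standalone way to sanity-check that the proxy's certificate is actually trusted by Go's TLS stack, which the Agent is built on, is a small snippet along these lines. This is not Agent code, and the URL below is just a placeholder for the haproxy frontend.)

    // tlscheck.go - standalone sanity check, not part of the Datadog Agent.
    // It verifies that the certificate presented by the proxy chains to a root
    // trusted by the operating system (the pool Go uses by default on Windows).
    package main

    import (
        "fmt"
        "net/http"
        "os"
    )

    func main() {
        // Placeholder: replace with your haproxy TLS frontend.
        resp, err := http.Get("https://10.70.12.34:3834/")
        if err != nil {
            // An untrusted certificate typically surfaces here as an x509
            // "certificate signed by unknown authority" error.
            fmt.Fprintln(os.Stderr, "TLS/connection error:", err)
            os.Exit(1)
        }
        defer resp.Body.Close()
        fmt.Println("connected, status:", resp.Status)
    }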

Looking into the source code, it seems to mean that the target has been blocked by a 'circuit breaker' intended to blacklist certain targets (?!).

	// Run the endpoint through our blockedEndpoints circuit breaker
	target := t.GetTarget()
	if w.blockedList.isBlock(target) {
		requeue()
		log.Errorf("Too many errors for endpoint '%s': retrying later", target)
	} else if err := t.Process(ctx, w.Client); err != nil {
		w.blockedList.close(target)
		requeue()
		log.Errorf("Error while processing transaction: %v", err)
	} else {
		w.blockedList.recover(target)
	}

Further investigation is still required, but I'm filing this as a 'support' request, even though I feel it borders on a 'bug report': the error message is just far too ambiguous, and lacks any detail that would help normal users understand what needs to be investigated and fixed. Why has my target been blocked?!

Also, from some Googling, I'm clearly not the only person who has scratched their head over this one. The usual advice seems to be to send Datadog Support a flare. Really, that's not the best advice: if everyone has to file private support requests and generate flares every time they come across a poorly written error message, that creates a huge amount of unnecessary friction, work and lost time for both Datadog Support and us customers. So I'm filing this here as a 'support request' on GitHub instead, to try to save other people's precious time.

Please feel free to move to 'bug report' if I've misjudged it.

@remeh
Contributor

remeh commented Sep 15, 2023

Hey @rossigee,

First, thanks for the time you've spent investigating this to open a meaningful issue 👍

I'll explain this piece of code a bit, but TL;DR: blockedList is an unfortunate name (it's not really a block list, more a list maintaining a health status for every endpoint), and this error message is emitted when the backoff policy (which avoids retrying a continuously failing endpoint too often) has decided to slow down its attempts to use this endpoint because its health status looks bad.

For a more detailed answer:

In order to reach the Datadog intake, there is a part of the Agent called the Forwarder which deals with all the endpoints.

When everything is working fine, the payload sent by the Forwarder reaches the intake and an info log is displayed: either this one, or this one after many payloads have been successfully sent.

However, if there is an error reaching an endpoint, an error message is logged. There are multiple possible errors (this one, that one, or that one; the actual call that prints the last one is later on in the branch). These contain useful information to better understand what's not working with the connection, and you should be able to see a few of them in your Agent logs.
That's when the "blocking list"/backoff policy enters the scene: when such an error occurs, a mechanism counts, per endpoint, how many of these failures have happened in a row, and it slows down between retries, because if the requests continuously fail there is no reason to keep trying to push the data. This is when you can start seeing the error you mentioned, Too many errors for endpoint. Since the previous errors for this endpoint have already been displayed, this message only says that, for now, the Forwarder has decided to wait a while before retrying to reach this endpoint.
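To make that mechanism concrete, here is a minimal, self-contained sketch of the kind of per-endpoint backoff tracking described above. The names (endpointHealth, RecordFailure, RecordSuccess) and the exact exponential policy are illustrative assumptions, not the Agent's actual implementation:

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // endpointHealth tracks consecutive failures per endpoint and derives a
    // "do not retry before" time from them (illustrative, not Agent code).
    type endpointHealth struct {
        mu       sync.Mutex
        failures map[string]int
        until    map[string]time.Time
    }

    func newEndpointHealth() *endpointHealth {
        return &endpointHealth{
            failures: make(map[string]int),
            until:    make(map[string]time.Time),
        }
    }

    // IsBlocked reports whether the endpoint's backoff window is still open.
    func (h *endpointHealth) IsBlocked(endpoint string) bool {
        h.mu.Lock()
        defer h.mu.Unlock()
        return time.Now().Before(h.until[endpoint])
    }

    // RecordFailure bumps the consecutive-failure count and grows the wait
    // exponentially: 1s, 2s, 4s, ... capped at 64s in this sketch.
    func (h *endpointHealth) RecordFailure(endpoint string) {
        h.mu.Lock()
        defer h.mu.Unlock()
        h.failures[endpoint]++
        exp := h.failures[endpoint] - 1
        if exp > 6 {
            exp = 6
        }
        h.until[endpoint] = time.Now().Add(time.Second << uint(exp))
    }

    // RecordSuccess clears the endpoint's failure history.
    func (h *endpointHealth) RecordSuccess(endpoint string) {
        h.mu.Lock()
        defer h.mu.Unlock()
        delete(h.failures, endpoint)
        delete(h.until, endpoint)
    }

    func main() {
        health := newEndpointHealth()
        target := "https://10.70.12.34:3834/api/v1/series"

        // Two failed attempts in a row open the backoff window...
        health.RecordFailure(target)
        health.RecordFailure(target)

        // ...and while it is open, the worker only logs and requeues.
        if health.IsBlocked(target) {
            fmt.Printf("Too many errors for endpoint '%s': retrying later\n", target)
        }
    }

In other words, the "Too many errors" line only means the backoff window for that endpoint is currently open; the actual cause is in the earlier connection error logs.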
In your specific case, I agree that this particular error isn't enough to troubleshoot your problem; however, the earlier endpoint connection errors in your Agent logs should have more information on what's happening with the connection. If those error messages are still not enough to troubleshoot the problem, don't hesitate to reach our support by opening a support case related to this 👍 Feel free to mention this conversation in the support ticket if you think it's useful, and on my side, I'll also open a ticket about renaming this blockedList object to something more meaningful.

I hope this helps.

@rossigee
Author

Hi @remeh,

Thanks for taking the time to elaborate publicly on the issue. I'm sure others that encounter this will appreciate the additional context. I certainly do.

While it took me a moment, I did eventually figure out that the repeated "Too many errors" message meant that there had been an earlier, repeated failure. It then clicked that I should look much further back in the logs to identify the error that was the actual root cause. That quickly led me to identify and fix my misconfiguration and move on.

Now, I should probably have realised all this sooner, but the wording used here sent me on a bit of a wild goose chase. I guess there is a little room for improvement regarding the terminology and behaviour, as per your suggestion. I would perhaps rename blockedList to backoffList (or similar) for clarity, and then refactor it a bit to also capture and store the error message (caught from one of the three causes you mentioned above) against the endpoint in the map. That way, when the constantly repeated error message is presented, the root cause can be repeated alongside it, which should save operators some confusion and the need to go hunting for clues way back in the log history.
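Purely as an illustration of the idea (the backoffList and lastError names here are mine, not the Agent's), something like this would let the repeated message carry the root cause:

    package main

    import (
        "errors"
        "fmt"
        "sync"
    )

    // backoffList is an illustrative stand-in for the Agent's blockedList:
    // besides counting failures, it remembers the last error seen per endpoint.
    type backoffList struct {
        mu        sync.Mutex
        errCount  map[string]int
        lastError map[string]string
    }

    func newBackoffList() *backoffList {
        return &backoffList{
            errCount:  make(map[string]int),
            lastError: make(map[string]string),
        }
    }

    // close records a failure and keeps its error message for later reporting.
    func (b *backoffList) close(endpoint string, err error) {
        b.mu.Lock()
        defer b.mu.Unlock()
        b.errCount[endpoint]++
        b.lastError[endpoint] = err.Error()
    }

    // lastErrorFor returns the most recent error recorded for the endpoint.
    func (b *backoffList) lastErrorFor(endpoint string) string {
        b.mu.Lock()
        defer b.mu.Unlock()
        return b.lastError[endpoint]
    }

    func main() {
        list := newBackoffList()
        target := "https://10.70.12.34:3834/api/v1/series"

        // A connection failure is recorded together with its cause...
        list.close(target, errors.New("x509: certificate signed by unknown authority"))

        // ...so the repeated "retrying later" message can surface the root cause.
        fmt.Printf("Too many errors for endpoint '%s' (last error: %s): retrying later\n",
            target, list.lastErrorFor(target))
    }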

Cheers 👍

@remeh
Contributor

remeh commented Sep 18, 2023

👋
Happy to hear you've solved your initial problem! I've created an entry in our backlog as a follow-up to this conversation. I'm closing this issue (SEO will still do its job and this conversation will pop up on search engines).

@remeh remeh closed this as completed Sep 18, 2023