Configuration capabilities to retry for loading config via URL #8854

Closed
sjwang90 opened this issue Feb 12, 2021 · 2 comments · Fixed by #15377
Labels: area/configuration, feature request (Requests for new plugin and for new features to existing plugins)

Comments

@sjwang90
Contributor

sjwang90 commented Feb 12, 2021

Feature Request

Related: #7338

Proposal:

Users should be able to designate the interval and number of retries for loading their config from a URL when their endpoint is down.

Current behavior:

Right now, Telegraf retries three times at 10s intervals when it receives an error loading config from a URL because the remote endpoint is down. The current solution offers no environment variables or flags to change these settings (based on #8803).

Desired behavior:

Users need a way to configure the retry interval and the number of retries that govern loading the config from a URL.

Use case:

From @schmorgs:
We are planning to use Telegraf in production on a large number of servers across the globe, and there are many points where breakages could happen, especially in countries with very low bandwidth and old infrastructure. Along with that come many standards and OS versions, hence our approach of managing config centrally so that we don't have to navigate the variety of ways of reaching an endpoint.

So if Telegraf starts up and there happens to be a breakage somewhere (network connectivity, web server down, etc.), the agent will die. On RHEL7/8 and Windows, we can utilise systemd/SCM to configure infinite retries on the agent so that even if it does die, it will be restarted.
But RHEL6 doesn't have systemd, so we would end up writing some sort of watcher daemon as well, which seems a bit overkill if the agent could handle (at least) this condition.

This matters because Telegraf will be our primary monitoring agent, so we want to make it as available and robust as possible. We would still implement external controls such as systemd restarts to provide an extra layer of resilience, but the more the agent can do in this area, the better.

In some cases, the window in which the agent is unable to get its config would be fairly small, as the agent only pulls config on startup. But we want the agent to pull its config periodically so that it can be configured centrally and pick up changes automatically. I understand this is part of a longer-term strategy for Telegraf, but in the meantime we HUP the agent periodically as a workaround, so the agent now has a constant reliance on the HTTP endpoint and therefore a greater likelihood of encountering a problem.

Whether a switch, environment variable, config file on the server, etc, I'm happy to see whichever approach works best.

@powersj
Contributor

powersj commented Mar 31, 2022

next step: investigate design and implications

@nkcfan

nkcfan commented May 5, 2024

This is quite a normal requirement; consider a power outage at home. The modem and router need time to connect to the Internet, while the Telegraf service with a URL config just quickly tries several times and then fails completely.

powersj added a commit to powersj/telegraf that referenced this issue May 17, 2024
This introduces a new cli option to allow the user to set the number of
retry attempts to something other than 3. It also allows the user to set
the attempt count to -1 to infinitely retry.

fixes: influxdata#8854