feat(agent): persist to statefile at a regular interval #14561
Conversation
Hi @dayvar14, thanks for putting up the PR. We have a few concerns about this change; this is a scenario we discussed in depth when we first added the statefile feature. There are two main points that need to be resolved. First, this needs to be opt-in and not the default behavior. Second, doing this on every gather cycle could be I/O heavy depending on your deployment and your interval setting, so a config option for how often this is done may be necessary. Keep in mind that not all plugins even have a gather cycle. While you are doing this in the agent for each interval, it would be good to test how this operates for other plugin types, like service input plugins that listen for metrics. Thanks
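An opt-in surface with a separate persistence interval could look like the sketch below (all option names here are hypothetical illustrations, not an agreed-upon API):

```toml
[agent]
  ## Hypothetical: enable periodic persistence of plugin state (opt-in,
  ## disabled by default so existing deployments are unaffected).
  # state_persistence_enabled = false

  ## Hypothetical: how often to write the statefile, decoupled from the
  ## gather interval to bound I/O on short-interval deployments.
  # state_persistence_interval = "1m"
```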
Co-authored-by: Thomas Casteleyn <thomas.casteleyn@me.com>
After reviewing the code with my colleagues (@noahneedscoffee and @ericmiller-wpengine), we discovered that when the tails are terminated at the
My apologies, I owe you a response. It looks like you have tested this thoroughly, and I have no doubt it resolves things for the tail plugin. I think there was concern around other plugins that currently implement state, and ensuring we have a consistent mechanism. @srebhan, is that your current thinking as well?
@powersj No worries. I understand the concern a bit better now. I only found four plugins that are stateful at the moment: inputs.tail, inputs.docker_log, inputs.win_eventlog, and processors.dedup. Are you considering as a solution that...
Or do you have another approach in mind?
@dayvar14 my minimal request would be that the plugin's state is consistent when stored on disk. For the current implementation that is the case, as all Gather operations and all dataflows have been stopped, so Telegraf is in a "frozen" state when we collect the states. Here is an example of my concern: imagine we tail 4 files and use a file output to save the results, for illustration. Assume further that every file provides 32 bytes corresponding to 1 metric per "write". The following is fully possible even with locks around the state:
Now we have lost the last two metrics in file 3 and file 1, because the state says we already read them, but they were never written to the output! One possible solution would be double buffering. It will not fix the lost-metric example above, but it might guarantee we have a clear gather state, e.g. by always returning the state as of before the actual gather... @dayvar14 does that make any sense?
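The double-buffering idea might be sketched roughly like this (all names and types are illustrative, not Telegraf's actual state interface):

```go
package main

import "sync"

// DoubleBufferedState keeps two copies of a plugin's state: a "live" copy
// mutated while tailing/gathering, and a "committed" copy that is advanced
// only at a known-consistent point. Periodic persistence always reads the
// committed copy, so a snapshot taken mid-gather never claims progress that
// has not yet been flushed to the outputs.
type DoubleBufferedState struct {
	mu        sync.Mutex
	live      int64 // e.g. a file offset advanced while reading lines
	committed int64 // last offset known safe to persist
}

// Update advances the live state; the committed copy is untouched.
func (s *DoubleBufferedState) Update(offset int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.live = offset
}

// Commit promotes the live state to the committed copy, called only at a
// consistency point (e.g. after the output confirms a flush).
func (s *DoubleBufferedState) Commit() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.committed = s.live
}

// Snapshot returns what the persister should write to disk.
func (s *DoubleBufferedState) Snapshot() int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.committed
}
```

This does not prevent the lost-metric interleaving above on its own, but it makes the on-disk state lag rather than lead the delivered data.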
I see @srebhan. For the tail input plugin specifically, I think there might be a slight limitation: InfluxData's implementation of tail does not output the line index of the currently read line, meaning we are reliant on tail.Tell(), which is specifically noted as not very accurate; it's very possible a new line was received by the time the previous line was processed in the receiver method. I do see some implementations of tail that expose the line index, so it may be possible to either use one of those or update the current implementation to support it. If we do that, or if we are fine with the limitations of tail.Tell(), then buffering the offsets and updating the local state only when the metrics are delivered seems like a good potential solution. Again, thanks for the feedback.
Hi @dayvar14,
For the tail plugin, does that provide something different than the offset used today? As Sven was mentioning, when we start gathering state at an interval, it may not be consistent with the messages that were successfully written: the offset may have advanced, but the metric not been written. If a crash happens before writing, this leaves you missing data, with small gaps in your state when you recover. There is a similar concern with the Windows Event Log and its use of a bookmark. As Sven and I have reviewed this, we are recalling why we did not originally add periodic state tracking: there are enough moving parts to make this racy and inconsistent. I would like to avoid taking a PR that we know will result in this type of behavior for now.
I updated the win_eventlog, docker_log, and dedup plugins:

- win_eventlog: only updates the state at the end of a Gather, so the state represents only metrics that have been successfully added to the accumulator.
- docker_log: separates the state to be persisted from the frequently updated local state modified during the gather, so again the persisted state represents only metrics that have been successfully added to the accumulator provided to Gather.
- dedup: updates the state during Apply.

The edge case I found with the current main implementation (specific to the tail plugin) is that if the process never shuts down safely, then no state is ever recorded. A crash may cause a whole batch of messages to be lost, since tailing starts from the end of the file if no state is found.
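The commit-at-end-of-Gather approach described above could be sketched like this (the plugin and accumulator types are simplified stand-ins, not Telegraf's real interfaces):

```go
package main

// Metric and Accumulator are toy stand-ins for Telegraf's real types.
type Metric struct {
	Offset int64
}

type Accumulator interface {
	AddMetric(m Metric)
}

// sliceAccumulator is a trivial accumulator for demonstration.
type sliceAccumulator struct{ metrics []Metric }

func (a *sliceAccumulator) AddMetric(m Metric) { a.metrics = append(a.metrics, m) }

// Plugin separates the working state used while gathering from the state
// that gets persisted: persistedOffset is assigned only once the whole
// batch has been handed to the accumulator.
type Plugin struct {
	workingOffset   int64 // advanced while reading the source
	persistedOffset int64 // what GetState would serialize to the statefile
}

func (p *Plugin) Gather(acc Accumulator, batch []Metric) error {
	for _, m := range batch {
		acc.AddMetric(m)
		p.workingOffset = m.Offset
	}
	// Only now does the persistable state reflect the new position, so a
	// statefile written between gathers never claims unaccumulated data
	// was already read.
	p.persistedOffset = p.workingOffset
	return nil
}

func (p *Plugin) GetState() int64 { return p.persistedOffset }
```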
Correct; however, unless I missed an update, this PR now introduces a case where the state claims you wrote messages that you may not actually have. While I can understand that saving the state periodically is seen as an improvement, it introduces incorrect behavior that needs to be resolved.
I did notice @srebhan mentioned the following
I might need to look through the code a bit more, but (for non-service plugins) perhaps persisting only the state of the previous gather during a gather might be better? Maybe that is what was meant. I'm not currently sure whether a subsequent gather confirms that the previous gather's metrics were written. As for the tail plugin, confirming the metrics were delivered before the state is written might be a bit more complex.
While rare, some cases may involve a user gathering every 5 seconds and flushing every 10 seconds, so multiple gathers could occur before a flush happens. But keep in mind this only applies to non-service plugins.
100% - this is why he and I are hesitating :) The service plugins (e.g. tail, *-consumer) use tracking metrics: metrics are read not at gather time, but as they come in and as we have room for more.
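For service plugins, tying the persistable offset to delivery acknowledgments could look roughly like this toy sketch (this is not Telegraf's tracking-metric API, just an illustration of the idea):

```go
package main

import "sync"

// OffsetTracker remembers the offset attached to each in-flight metric and
// advances the persistable offset only once delivery is acknowledged, in
// order, mirroring how tracking metrics confirm a write happened.
type OffsetTracker struct {
	mu        sync.Mutex
	pending   []int64 // offsets of metrics handed to outputs, oldest first
	delivered int64   // highest contiguously-acknowledged offset
}

// Track records the offset of a metric that was sent onward but not yet
// confirmed as written.
func (t *OffsetTracker) Track(offset int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pending = append(t.pending, offset)
}

// Ack marks the oldest in-flight metric as written and advances the
// persistable offset to it.
func (t *OffsetTracker) Ack() {
	t.mu.Lock()
	defer t.mu.Unlock()
	if len(t.pending) == 0 {
		return
	}
	t.delivered = t.pending[0]
	t.pending = t.pending[1:]
}

// Persistable is the offset it is safe to write to the statefile: a crash
// replays at most the unacknowledged tail instead of skipping it.
func (t *OffsetTracker) Persistable() int64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.delivered
}
```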
Darn. This is fine in our use case, but I can see why it won't be desirable in most situations.
Perhaps a more acceptable solution might be to configure inputs.tail to read from the start of the file if no state within the statefile is found? Perhaps a
@dayvar14 sorry for the late response!
Yeah and I'm sorry for not being able to offer a good solution! I wish we could provide an easy way...
This should really be fixed, as not shutting down safely causes other troubles as well. Are you able to reproduce the issue with the latest version? If so, please open a bug report and let me know!
This is definitely feasible and I think it can be done. Instead of introducing a new boolean, I would suggest having a

```toml
  ## Read strategy
  ## Available settings are:
  ##   continue      -- continue from the last saved state or, if no state is provided, from the end of the file
  ##   from_tail     -- always start from the end of the file, whether or not a state is provided
  ##   from_head     -- always start from the beginning of the file, whether or not a state is provided
  ##   fallback_head -- continue from the last saved state or, if no state is provided, from the start of the file
  # read_strategy = "continue"
```

What do you think?
@srebhan I believe this is discussed in #14566. State is never recorded, as it is only written when Telegraf shuts down safely.
In that case I will open a new issue for that more specific problem. I'll leave this issue open, as I do believe an incrementing state is desirable. Other log agents like fluentd seem to have this feature (for tail inputs), and in our use case it is important to minimize the amount of critical logs lost, as well as to reduce duplication.
Naming is always hard!
@dayvar14 can we please open an issue for this with the exact requirements and close the PR? We can discuss ways forward in the issue... Keeping the PR open will only clutter the list and make it harder for us to maintain an overview of what is open...
Summary
- State is now stored at each gather for each plugin individually.
- State is restored correctly if Telegraf fails catastrophically.
- The change shouldn't affect any current use of state, since it simply writes to the statefile more frequently.
Checklist
Related issues
resolves #14566