feat(agent): persist to statefile at a regular interval #14561

Closed
wants to merge 14 commits

Conversation

dayvar14
Contributor

@dayvar14 dayvar14 commented Jan 10, 2024

Summary

State is now stored at each gather for each plugin individually.
Screenshot 2024-01-11 at 11 14 09 AM

State is restored correctly if telegraf fails catastrophically.
Screenshot 2024-01-11 at 11 19 46 AM

The change shouldn't affect any current implementation of state usage, as it's simply writing to the state more frequently.
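
For illustration only, a minimal self-contained sketch of the idea as originally titled (persisting each plugin's state right after its gather); the interface and function names here are assumptions, not the PR's actual code:

  // Sketch: write a stateful plugin's state as soon as its gather finishes.
  package sketch

  // statefulPlugin mirrors the notion of a plugin exposing state for persistence.
  type statefulPlugin interface {
      Gather() error
      GetState() interface{}
  }

  // persister is a hypothetical stand-in for the agent's state persister.
  type persister interface {
      Store(pluginID string, state interface{}) error
  }

  func gatherAndPersist(id string, plugin statefulPlugin, p persister) error {
      if err := plugin.Gather(); err != nil {
          return err
      }
      // Persist immediately so a later crash loses at most one gather's worth
      // of progress instead of everything since startup.
      return p.Store(id, plugin.GetState())
  }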

Checklist

  • No AI generated code was used in this PR

Related issues

resolves #14566

@telegraf-tiger telegraf-tiger bot added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label Jan 10, 2024
@dayvar14 dayvar14 changed the title feat: update statefile for each plugin per gather feat(persister): update statefile for each plugin per gather Jan 17, 2024
@powersj
Contributor

powersj commented Jan 17, 2024

Hi @dayvar14,

Thanks for putting up the PR. We have a few concerns about this change. This is a scenario we discussed in depth when we first added the statefile feature; periodically capturing state to survive a crash was ultimately not included in that first pass.

There are two main points that need to be resolved:

First, this needs to be opt-in and not the default behavior.

Second, doing this for every gather cycle could be I/O heavy depending on your deployment and your interval setting. A config option for how often this is done may be necessary.

Also keep in mind that not all plugins even have a gather cycle. While you are doing this in the agent for each interval, it would be good to test how this behaves with other plugin types, such as service input plugins that listen for metrics.
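
For reference, a rough self-contained sketch of what an opt-in, interval-driven persist loop could look like; the interface name, WriteAll method, and the interval option itself are assumptions for illustration, not Telegraf's actual API or configuration:

  // Sketch of an opt-in persist loop; a zero interval keeps today's
  // write-only-on-shutdown behavior.
  package sketch

  import (
      "context"
      "log"
      "time"
  )

  // statePersister is a hypothetical stand-in for the agent's persister.
  type statePersister interface {
      WriteAll() error // snapshot all registered plugin states to the statefile
  }

  func runPersistLoop(ctx context.Context, p statePersister, interval time.Duration) {
      if interval <= 0 {
          return // feature not enabled in the config
      }
      ticker := time.NewTicker(interval)
      defer ticker.Stop()
      for {
          select {
          case <-ctx.Done():
              return
          case <-ticker.C:
              if err := p.WriteAll(); err != nil {
                  log.Printf("E! [agent] persisting plugin states failed: %v", err)
              }
          }
      }
  }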

Thanks

@dayvar14 dayvar14 changed the title feat(persister): update statefile for each plugin per gather feat(persister): persist to statefile at a regular interval Jan 18, 2024
@dayvar14 dayvar14 force-pushed the update-state-per-gather branch 2 times, most recently from e4b2cd2 to b3ec6fa Compare January 18, 2024 08:54
@dayvar14
Contributor Author

dayvar14 commented Jan 18, 2024

Ran the same test described in the issue and verified that:

  • the state is persisting correctly
  • the persist loop stops correctly
  • the interval is correctly rounded if enabled
  • the persist loop does not start if no interval is enabled in the config
  • the tail plugin correctly updates offsets on GetState

Screenshot of starting a docker image after a crash/forced deletion
Screenshot 2024-01-18 at 4 41 11 AM

Co-authored-by: Thomas Casteleyn <thomas.casteleyn@me.com>
@Hipska Hipska added area/agent ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. labels Jan 18, 2024
@dayvar14
Contributor Author

dayvar14 commented Jan 19, 2024

2024-01-18T18:02:35Z D! [inputs.tail] Tail removed for "/nas/log/nginx/access_log.txt"
2024-01-18T18:02:35Z D! [inputs.tail] Recording offset 981 for "/nas/log/mysql/mysql-slow.log"
2024-01-18T18:02:35Z D! [inputs.tail] Tail removed for "/nas/log/mysql/mysql-slow.log"
2024-01-18T18:02:35Z D! [inputs.tail] Recording offset 0 for "/nas/log/nginx/access.json"
2024-01-18T18:02:35Z D! [inputs.tail] Tail removed for "/nas/log/nginx/access.json"
2024-01-18T18:02:35Z D! [agent] Input channel closed
2024-01-18T18:02:35Z D! [agent] Processor channel closed
2024-01-18T18:02:35Z D! [agent] Processor channel closed
2024-01-18T18:02:35Z D! [agent] Processor channel closed
2024-01-18T18:02:35Z D! [agent] Processor channel closed
2024-01-18T18:02:35Z D! [agent] Processor channel closed
2024-01-18T18:02:35Z D! [agent] Processor channel closed
2024-01-18T18:02:35Z D! [agent] Processor channel closed
2024-01-18T18:02:35Z D! [agent] Processor channel closed
2024-01-18T18:02:35Z I! [agent] Hang on, flushing any cached metrics before shutdown
2024-01-18T18:02:35Z D! [outputs.cloud_pubsub] Buffer fullness: 0 / 10000 metrics
2024-01-18T18:02:35Z D! [outputs.cloud_pubsub] Buffer fullness: 0 / 10000 metrics
2024-01-18T18:02:35Z D! [outputs.influxdb] Wrote batch of 274 metrics in 77.130676ms
2024-01-18T18:02:35Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2024-01-18T18:02:35Z D! [outputs.bigquery] Wrote batch of 582 metrics in 103.440871ms
2024-01-18T18:02:35Z D! [outputs.bigquery] Buffer fullness: 0 / 10000 metrics
2024-01-18T18:02:35Z I! [agent] Stopping running outputs
2024-01-18T18:02:35Z D! [agent] Persisting plugin states
2024-01-18T18:02:35Z D! [inputs.tail] Recording offset 0 for "/nas/log/nginx/access_log.txt"
2024-01-18T18:02:35Z D! [inputs.tail] Recording offset 0 for "/nas/log/mysql/mysql-slow.log"
2024-01-18T18:02:35Z D! [inputs.tail] Recording offset 0 for "/nas/log/nginx/access.json"
2024-01-18T18:02:35Z D! [agent] Stopped Successfully

Another example of the state being written to, but then the offsets being overwritten at the end of a reset.

After reviewing the code with my colleagues (@noahneedscoffee and @ericmiller-wpengine), we discovered that when the tails are terminated in Tail's Stop() function, the offsets are read as zero when the persister makes its final GetState() call. We resolved this by setting the tails map to an empty struct and adding a conditional check in GetState. I will provide an update once we identify the source of the corrupt state written by the persister and have run stress tests with a high volume of logs.
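
For reference, a simplified, self-contained sketch of the shape of that GetState guard; the field and type names are illustrative, not the actual inputs.tail code:

  // Sketch: if the active tails were already torn down in Stop(), fall back
  // to the offsets recorded while stopping instead of reporting zeros.
  package sketch

  type tailer struct {
      offset int64 // current read position of one tailed file
  }

  type tailPlugin struct {
      tails   map[string]*tailer // active tails; emptied in Stop()
      offsets map[string]int64   // offsets recorded as each tail is stopped
  }

  func (t *tailPlugin) GetState() interface{} {
      if t.offsets == nil {
          t.offsets = make(map[string]int64)
      }
      // Only refresh from live tails when some are still running; after Stop()
      // the map is empty and reading it would yield zero offsets.
      if len(t.tails) > 0 {
          for path, tl := range t.tails {
              t.offsets[path] = tl.offset
          }
      }
      return t.offsets
  }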

@dayvar14
Contributor Author

dayvar14 commented Jan 23, 2024

This change was stress-tested with 3 tail plugins ingesting 100,000 logs a second, with state writes every 30 seconds. It was also tested in our environment and correctly restored from state when restarted. @srebhan I hope I fixed your concerns; if not, I apologize in advance for misunderstanding.

@dayvar14
Contributor Author

@srebhan @powersj Are there any other concerns with getting this PR merged?

@powersj
Contributor

powersj commented Jan 31, 2024

@dayvar14,

My apologies, I owe you a response. It looks like you have tested this thoroughly and have no doubt resolved it for the tail plugin. I think the remaining concern was around other plugins that currently implement state and ensuring we have a consistent mechanism.

@srebhan is that your current thinking as well?

@dayvar14
Contributor Author

dayvar14 commented Feb 1, 2024

@powersj No worries. I understand the concern a bit better now.

I only found four plugins that are stateful at the moment: inputs.tail, inputs.docker_log, inputs.win_eventlog, and processors.dedup.

Are you considering as a solution:

  1. locks around the state for each plugin that is stateful, or
  2. a more aggressive refactor of StatefulPlugins that is better suited for the persistence loop, perhaps even a separate interface for plugins that support the persistence loop, to ensure this change only affects the behavior of the tail plugin specifically?

Or do you have another approach in mind?

@srebhan
Contributor

srebhan commented Feb 2, 2024

@dayvar14 my minimal request would be that the plugin's state is consistent when stored on disk. For the current implementation that is the case, as all Gather operations and all data flows have been stopped, so Telegraf is in a "frozen" state when we collect the states. Here is an example of my concern: imagine we tail 4 files and use a file output to save the results for illustration. Assume further that every file provides 32 bytes, corresponding to 1 metric, per "write". The following is fully possible even with locks around the state:

- file 1 is read and offset is set to 32 -> state(32, 0, 0, 0)
- file 2 is read and offset is set to 32 -> state(32, 32, 0, 0)
- file 3 is read and offset is set to 32 -> state(32, 32, 32, 0)
- file 4 is read and offset is set to 32 -> state(32, 32, 32, 32)
-> write is called with 4 metrics
- file 3 is read again and its offset advances to 64 -> state(32, 32, 64, 32)
- file 1 is read again and its offset advances to 64 -> state(64, 32, 64, 32)
* state-persistence is triggered for the plugin and stored state(64, 32, 64, 32)
**crash**

Now we have lost the last two metrics from file 3 and file 1, because the state says we already read them, but they were never written to the output!
In a non-event-based plugin (tail triggers state changes on file changes), the state might be modified during the gather loop, e.g. when gathering from multiple servers; if the state is persisted in the middle of that loop, your state represents some intermediate gather state and you have no way of knowing at which point of the gather process it was taken.

One possible solution would be double buffering. It will not fix the lost-metric example above, but it might guarantee that we have a clear gather state, e.g. by always returning the state as it was before the actual gather...
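
As a rough, self-contained sketch of that double-buffering idea (illustrative names only): the plugin mutates a working copy during a gather and only swaps it into the snapshot returned by GetState once the gather has completed.

  // Sketch: GetState always returns the last completed gather's snapshot,
  // never a half-updated one. BeginGather must be called at the start of
  // every gather so the working copy is decoupled from the committed one.
  package sketch

  import "sync"

  type doubleBufferedState struct {
      mu        sync.Mutex
      committed map[string]int64 // snapshot handed to the persister
      working   map[string]int64 // mutated freely while a gather is running
  }

  func (s *doubleBufferedState) BeginGather() {
      s.mu.Lock()
      defer s.mu.Unlock()
      // Start the working copy from the last committed snapshot.
      s.working = make(map[string]int64, len(s.committed))
      for k, v := range s.committed {
          s.working[k] = v
      }
  }

  func (s *doubleBufferedState) Update(key string, value int64) {
      s.mu.Lock()
      defer s.mu.Unlock()
      if s.working == nil {
          s.working = make(map[string]int64)
      }
      s.working[key] = value
  }

  func (s *doubleBufferedState) EndGather() {
      s.mu.Lock()
      defer s.mu.Unlock()
      s.committed = s.working // publish the finished gather's state
  }

  func (s *doubleBufferedState) GetState() interface{} {
      s.mu.Lock()
      defer s.mu.Unlock()
      return s.committed
  }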

@dayvar14 does that make any sense?

@dayvar14
Contributor Author

dayvar14 commented Feb 2, 2024

I see @srebhan.

For the tail input plugin specifically, I think there might be a slight limitation with the tail implementation (influxdata/tail) currently in use.

InfluxData's implementation of tail does not output the line index of the currently read line.

This means we rely on tail.Tell(), which is specifically noted as not being very accurate; it is quite possible that a new line was received by the time the line was processed in the receiver method.

I do see some implementations of tail that do expose this, so it may be possible to either use one of those or update the current implementation to support it.

If we do that, or if we are fine with the limitations of tail.Tell(), buffering the offsets and updating the local state only when the metrics are delivered seems like a good potential solution.
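
A self-contained sketch of what "update the local state only when the metrics are delivered" could look like; the types and the delivery-ID plumbing are assumptions for illustration, not Telegraf's actual tracking-metric API:

  // Sketch: offsets stay pending until the metrics emitted at that offset are
  // acknowledged as written; GetState only ever exposes acknowledged offsets.
  package sketch

  import "sync"

  type pendingOffset struct {
      file   string
      offset int64
  }

  type deliveryTrackedState struct {
      mu        sync.Mutex
      pending   map[uint64]pendingOffset // keyed by a delivery/tracking ID
      delivered map[string]int64         // what GetState hands to the persister
  }

  // Track records the offset reached when a metric (or batch) was emitted.
  func (s *deliveryTrackedState) Track(id uint64, file string, offset int64) {
      s.mu.Lock()
      defer s.mu.Unlock()
      if s.pending == nil {
          s.pending = make(map[uint64]pendingOffset)
      }
      s.pending[id] = pendingOffset{file: file, offset: offset}
  }

  // Ack is called once the outputs confirm the metrics behind id were written.
  func (s *deliveryTrackedState) Ack(id uint64) {
      s.mu.Lock()
      defer s.mu.Unlock()
      p, ok := s.pending[id]
      if !ok {
          return
      }
      if s.delivered == nil {
          s.delivered = make(map[string]int64)
      }
      if p.offset > s.delivered[p.file] {
          s.delivered[p.file] = p.offset
      }
      delete(s.pending, id)
  }

  func (s *deliveryTrackedState) GetState() interface{} {
      s.mu.Lock()
      defer s.mu.Unlock()
      return s.delivered
  }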

Again thanks for the feedback.

@powersj
Contributor

powersj commented Feb 6, 2024

Hi @dayvar14,

InfluxData's implementation of tail does not output the line index of the currently read line.

For the tail plugin, does that provide something different than the offset used today?

As Sven was mentioning, when we start gathering state at an interval it may not be consistent with messages that are successfully written. The offset may have advanced, but the metric not written. If a crash happens before writing, this leads to you missing data and having small gaps in your state when you recover. There is a similar concern with the Windows Event Log and its use of a bookmark.

As Sven and I have reviewed this, we are recalling why we did not originally add periodic state tracking. There are enough moving parts to make this racy and inconsistent. I would like to avoid taking a PR that we know will result in this type of behavior for now.

@dayvar14 dayvar14 force-pushed the update-state-per-gather branch 2 times, most recently from 9440e86 to 32be4ab Compare February 6, 2024 23:40
@dayvar14
Contributor Author

dayvar14 commented Feb 7, 2024

I updated the following plugins: win_eventlog, docker_log, and dedup.

win_eventlog - only updates the state at the end of a gather, meaning the state only represents metrics that have been successfully added to the accumulator.

docker_log - separates the state to be persisted from the frequently updated local state used during the gather, so that, again, the plugin state represents only metrics that have been successfully added to the accumulator provided in the gather.

dedup - updates the state during Apply.
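
As a rough illustration of the pattern described above for win_eventlog and docker_log (names are made up, and locking against a concurrent persister is omitted for brevity): the bookmark advances on a local copy while metrics are added, and the field GetState reads is only assigned once the gather has finished.

  // Sketch: commit the state only after every metric of this gather was
  // handed to the accumulator.
  package sketch

  type accumulator interface {
      AddFields(measurement string, fields map[string]interface{}, tags map[string]string)
  }

  type event struct {
      fields   map[string]interface{}
      bookmark string
  }

  type eventLogSketch struct {
      state string // read by GetState; updated only at the end of Gather
  }

  func (p *eventLogSketch) Gather(acc accumulator, events []event) error {
      bookmark := p.state
      for _, e := range events {
          acc.AddFields("event", e.fields, nil)
          bookmark = e.bookmark // advance only the local copy
      }
      p.state = bookmark // all metrics are in the accumulator; now commit
      return nil
  }

  func (p *eventLogSketch) GetState() interface{} {
      return p.state
  }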

The edge case I found with the current main implementation (specific to the tail plugin) is that if the process never shuts down safely, then no state is ever recorded. A crash may cause a whole bunch of messages to be lost, since it starts from the end of the file if no state is found.

@powersj
Contributor

powersj commented Feb 7, 2024

A crash may cause a whole bunch of messages to be lost, since it starts from the end of the file if no state is found.

Correct; however, unless I missed an update, this PR now introduces a case where the state claims you wrote messages that you may not actually have written. While I can understand that saving the state periodically is seen as an improvement, it introduces incorrect behavior that needs to be resolved.

@dayvar14
Contributor Author

dayvar14 commented Feb 7, 2024

I did notice @srebhan mentioned the following about always returning the state before the actual gather:

One possible solution would be double buffering. It will not fix the lost-metric example above, but it might guarantee that we have a clear gather state, e.g. by always returning the state as it was before the actual gather...

I might need to look through the code a bit more, but (for non-service plugins) perhaps only persisting the state of previous "gather" during a gather might be better? Maybe that is what was meant.

I'm not currently sure whether another gather confirms that the previous gather's metrics were written.

As for the tail plugin, confirming the metrics were delivered before state is written might be a bit more complex.

@powersj
Contributor

powersj commented Feb 7, 2024

I might need to look through the code a bit more, but (for non-service plugins) perhaps only persisting the state of previous "gather" during a gather might be better? Maybe that is what was meant.
I'm not currently sure whether another gather confirms that the previous gather's metrics were written.

While rare, there may be cases where a user gathers every 5 seconds and flushes every 10 seconds, so multiple gathers could occur before a flush happens. But keep in mind this only applies to non-service plugins.

As for the tail plugin, confirming the metrics were delivered before state is written might be a bit more complex.

100% - this is why he and I are hesitating :) The service plugins (e.g. tail, *-consumer) use tracking metrics; metrics are read not at gathers, but instead as they come in and as we have room for more.

@dayvar14
Contributor Author

dayvar14 commented Feb 7, 2024

Darn. This is fine in our use case, but I can see why it won't be desirable in most situations.

A crash may cause a whole bunch of messages to be lost, since it starts from the end of the file if no state is found.

Perhaps a more acceptable solution might be to configure inputs.tail to read from the start of the file if no state is found in the statefile? Perhaps a start_from_beginning_if_no_state option, or allowing start_from_beginning to work with state? This would possibly lead to duplicates, but those would be more desirable in our situation than dropped logs in the rare case that telegraf isn't gracefully shut down.

@srebhan
Contributor

srebhan commented Feb 14, 2024

@dayvar14 sorry for the late response!

Darn. This is fine in our use case, but I can see why it won't be desirable in most situations.

Yeah and I'm sorry for not being able to offer a good solution! I wish we could provide an easy way...

The edge case I found with the current main implementation (specific to the tail plugin) is that if the process never shuts down safely, then no state is ever recorded.

This should really be fixed as not shutting down safely is causing other troubles as well. Are you able to reproduce the issue with the latest version? If so, please open a bug-report and let me know!

Perhaps a more acceptable solution might be to configure inputs.tail to read from the start of the file if no state is found in the statefile? Perhaps a start_from_beginning_if_no_state option, or allowing start_from_beginning to work with state?

This is definitely feasible and I think it can be done. Instead of introducing a new boolean I would suggest having

  ## Read strategy
  ## Available settings are:
  ##   continue      -- continue from the last saved state or, if no state provided, from the end of the file
  ##   from_tail     -- always start from the end of the file no matter if a state is provided or not
  ##   from_head     -- always start from the beginning of the file no matter if a state is provided or not
  ##   fallback_head -- continue from the last saved state or, if no state provided, from the start of the file
  # read_strategy = "continue"

with continue being the default. You might want to improve the naming as this is not my strong suit. ;-)
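
For illustration, a minimal sketch of how such a read_strategy could be interpreted when the plugin (re)starts; the option values are the ones from the snippet above, while the function and its parameters are made up:

  // Sketch: choose the initial read offset from the configured strategy and
  // whether a saved offset exists for the file.
  package sketch

  import "fmt"

  func initialOffset(strategy string, savedOffset int64, haveState bool, fileSize int64) (int64, error) {
      switch strategy {
      case "", "continue": // default: saved state, otherwise end of file
          if haveState {
              return savedOffset, nil
          }
          return fileSize, nil
      case "from_tail": // always end of file
          return fileSize, nil
      case "from_head": // always beginning of file
          return 0, nil
      case "fallback_head": // saved state, otherwise beginning of file
          if haveState {
              return savedOffset, nil
          }
          return 0, nil
      default:
          return 0, fmt.Errorf("unknown read_strategy %q", strategy)
      }
  }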

What do you think?

@dayvar14
Contributor Author

dayvar14 commented Feb 14, 2024

This should really be fixed as not shutting down safely is causing other troubles as well. Are you able to reproduce the issue with the latest version? If so, please open a bug-report and let me know!

@srebhan I believe this is discussed in #14566. State is never recorded, as it's only recorded when telegraf shuts down safely.

This is definitely feasible and I think it can be done. Instead of introducing a new boolean I would suggest having

In that case I will open a new issue for this more specific problem. I'll leave this issue open, as I do believe an incrementally updated state is desirable. Other log agents like fluentd seem to have this feature (for their tail input), and in our use case it is important to minimize the amount of critical logs lost as well as to reduce duplication.

with continue being the default. You might want to improve the naming as this is not my strong suit. ;-)

Naming is always hard!

@srebhan
Contributor

srebhan commented Feb 16, 2024

@dayvar14 can we please open an issue for this with the exact requirements and close the PR? We can discuss ways forward in the issue... Keeping the PR open will only clutter the list and make it harder for us to maintain an overview of what is open...

@dayvar14 dayvar14 closed this Feb 19, 2024
Labels
area/agent feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin
Development

Successfully merging this pull request may close these issues.

update state more frequently in case telegraf is suddenly shutdown