feat(agent): persist to statefile at a regular interval #14561

Closed
wants to merge 14 commits

Conversation

dayvar14
Contributor

@dayvar14 dayvar14 commented Jan 10, 2024

Summary

State is now stored at each gather for each plugin individually.
Screenshot 2024-01-11 at 11 14 09 AM

State is restored correctly if telegraf fails catastrophically.
Screenshot 2024-01-11 at 11 19 46 AM

The change shouldn't affect any current implementation of state usage, as it's simply writing to the state more frequently.
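
For illustration only, a minimal self-contained sketch of the idea as originally titled (persisting each plugin's state right after its gather); the interface and function names here are assumptions, not the PR's actual code:

  // Sketch: write a stateful plugin's state as soon as its gather finishes.
  package sketch

  // statefulPlugin mirrors the notion of a plugin exposing state for persistence.
  type statefulPlugin interface {
      Gather() error
      GetState() interface{}
  }

  // persister is a hypothetical stand-in for the agent's state persister.
  type persister interface {
      Store(pluginID string, state interface{}) error
  }

  func gatherAndPersist(id string, plugin statefulPlugin, p persister) error {
      if err := plugin.Gather(); err != nil {
          return err
      }
      // Persist immediately so a later crash loses at most one gather's worth
      // of progress instead of everything since startup.
      return p.Store(id, plugin.GetState())
  }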

Checklist

  • No AI generated code was used in this PR

Related issues

resolves #14566

@telegraf-tiger telegraf-tiger bot added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label Jan 10, 2024
@dayvar14 dayvar14 changed the title feat: update statefile for each plugin per gather feat(persister): update statefile for each plugin per gather Jan 17, 2024
@powersj
Contributor

powersj commented Jan 17, 2024

Hi @dayvar14,

Thanks for putting up the PR. We have a few concerns about this change. This is a scenario we discussed in depth when we first added the statefile feature; periodically capturing state to survive a crash was ultimately not included in that first pass.

There are two main points that need to be resolved:

First, this needs to be opt-in and not the default behavior.

Second, doing this for every gather cycle could be I/O heavy depending on your deployment and your interval setting. A config option for how often this is done may be necessary.

Also keep in mind that not all plugins even have a gather cycle. While you are doing this in the agent for each interval, it would be good to test how this behaves with other plugin types, such as service input plugins that listen for metrics.
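
For reference, a rough self-contained sketch of what an opt-in, interval-driven persist loop could look like; the interface name, WriteAll method, and the interval option itself are assumptions for illustration, not Telegraf's actual API or configuration:

  // Sketch of an opt-in persist loop; a zero interval keeps today's
  // write-only-on-shutdown behavior.
  package sketch

  import (
      "context"
      "log"
      "time"
  )

  // statePersister is a hypothetical stand-in for the agent's persister.
  type statePersister interface {
      WriteAll() error // snapshot all registered plugin states to the statefile
  }

  func runPersistLoop(ctx context.Context, p statePersister, interval time.Duration) {
      if interval <= 0 {
          return // feature not enabled in the config
      }
      ticker := time.NewTicker(interval)
      defer ticker.Stop()
      for {
          select {
          case <-ctx.Done():
              return
          case <-ticker.C:
              if err := p.WriteAll(); err != nil {
                  log.Printf("E! [agent] persisting plugin states failed: %v", err)
              }
          }
      }
  }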

Thanks

@dayvar14 dayvar14 changed the title feat(persister): update statefile for each plugin per gather feat(persister): persist to statefile at a regular interval Jan 18, 2024
@dayvar14 dayvar14 force-pushed the update-state-per-gather branch 2 times, most recently from e4b2cd2 to b3ec6fa Compare January 18, 2024 08:54
@dayvar14
Contributor Author

dayvar14 commented Jan 18, 2024

Ran the same test described in the issue and verified that:

  • the state is persisting correctly
  • the persist loop stops correctly
  • the interval is correctly rounded if enabled
  • the persist loop does not start if no interval is enabled in the config
  • the tail plugin correctly updates offsets on GetState

Screenshot of starting a docker image after a crash/forced deletion
Screenshot 2024-01-18 at 4 41 11 AM

Co-authored-by: Thomas Casteleyn <thomas.casteleyn@me.com>
@Hipska Hipska added area/agent ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. labels Jan 18, 2024
@dayvar14
Contributor Author

dayvar14 commented Jan 19, 2024

2024-01-18T18:02:35Z D! [inputs.tail] Tail removed for "/nas/log/nginx/access_log.txt"
2024-01-18T18:02:35Z D! [inputs.tail] Recording offset 981 for "/nas/log/mysql/mysql-slow.log"
2024-01-18T18:02:35Z D! [inputs.tail] Tail removed for "/nas/log/mysql/mysql-slow.log"
2024-01-18T18:02:35Z D! [inputs.tail] Recording offset 0 for "/nas/log/nginx/access.json"
2024-01-18T18:02:35Z D! [inputs.tail] Tail removed for "/nas/log/nginx/access.json"
2024-01-18T18:02:35Z D! [agent] Input channel closed
2024-01-18T18:02:35Z D! [agent] Processor channel closed
2024-01-18T18:02:35Z D! [agent] Processor channel closed
2024-01-18T18:02:35Z D! [agent] Processor channel closed
2024-01-18T18:02:35Z D! [agent] Processor channel closed
2024-01-18T18:02:35Z D! [agent] Processor channel closed
2024-01-18T18:02:35Z D! [agent] Processor channel closed
2024-01-18T18:02:35Z D! [agent] Processor channel closed
2024-01-18T18:02:35Z D! [agent] Processor channel closed
2024-01-18T18:02:35Z I! [agent] Hang on, flushing any cached metrics before shutdown
2024-01-18T18:02:35Z D! [outputs.cloud_pubsub] Buffer fullness: 0 / 10000 metrics
2024-01-18T18:02:35Z D! [outputs.cloud_pubsub] Buffer fullness: 0 / 10000 metrics
2024-01-18T18:02:35Z D! [outputs.influxdb] Wrote batch of 274 metrics in 77.130676ms
2024-01-18T18:02:35Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2024-01-18T18:02:35Z D! [outputs.bigquery] Wrote batch of 582 metrics in 103.440871ms
2024-01-18T18:02:35Z D! [outputs.bigquery] Buffer fullness: 0 / 10000 metrics
2024-01-18T18:02:35Z I! [agent] Stopping running outputs
2024-01-18T18:02:35Z D! [agent] Persisting plugin states
2024-01-18T18:02:35Z D! [inputs.tail] Recording offset 0 for "/nas/log/nginx/access_log.txt"
2024-01-18T18:02:35Z D! [inputs.tail] Recording offset 0 for "/nas/log/mysql/mysql-slow.log"
2024-01-18T18:02:35Z D! [inputs.tail] Recording offset 0 for "/nas/log/nginx/access.json"
2024-01-18T18:02:35Z D! [agent] Stopped Successfully

Another example of the state being written to, but then the offsets being overwritten at the end of a reset.

After reviewing the code with my colleagues (@noahneedscoffee and @ericmiller-wpengine), we discovered that when the tails are terminated in Tail's Stop() function, the offsets are read as zero when the persister makes its final GetState() call. We resolved this by setting the tails map to an empty struct and adding a conditional check in GetState. I will provide an update once we identify the source of the corrupt state written by the persister and have run stress tests with a high volume of logs.
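
For reference, a simplified, self-contained sketch of the shape of that GetState guard; the field and type names are illustrative, not the actual inputs.tail code:

  // Sketch: if the active tails were already torn down in Stop(), fall back
  // to the offsets recorded while stopping instead of reporting zeros.
  package sketch

  type tailer struct {
      offset int64 // current read position of one tailed file
  }

  type tailPlugin struct {
      tails   map[string]*tailer // active tails; emptied in Stop()
      offsets map[string]int64   // offsets recorded as each tail is stopped
  }

  func (t *tailPlugin) GetState() interface{} {
      if t.offsets == nil {
          t.offsets = make(map[string]int64)
      }
      // Only refresh from live tails when some are still running; after Stop()
      // the map is empty and reading it would yield zero offsets.
      if len(t.tails) > 0 {
          for path, tl := range t.tails {
              t.offsets[path] = tl.offset
          }
      }
      return t.offsets
  }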

@dayvar14
Contributor Author

dayvar14 commented Jan 23, 2024

This change was stress-tested with 3 tail plugins ingesting 100,000 logs a second, with state writes every 30 seconds. It was also tested in our environment and correctly restored from state when restarted. @srebhan I hope I fixed your concerns; if not, I apologize in advance for misunderstanding.

@dayvar14
Contributor Author

@srebhan @powersj Are there any other concerns with getting this PR merged?

@powersj
Contributor

powersj commented Jan 31, 2024

@dayvar14,

My apologies, I owe you a response. It looks like you have tested this thoroughly and have no doubt resolved it for the tail plugin. I think the remaining concern was around other plugins that currently implement state and ensuring we have a consistent mechanism.

@srebhan is that your current thinking as well?

@dayvar14
Contributor Author

dayvar14 commented Feb 1, 2024

@powersj No worries. I understand the concern a bit better now.

I only found four plugins that are stateful at the moment: inputs.tail, inputs.docker_log, inputs.win_eventlog, and processors.dedup.

Are you considering as a solution:

  1. locks around the state for each plugin that is stateful, or
  2. a more aggressive refactor of StatefulPlugins that is better suited for the persistence loop, perhaps even a separate interface for plugins that support the persistence loop, to ensure this change only affects the behavior of the tail plugin specifically?

Or do you have another approach in mind?

@srebhan
Contributor

srebhan commented Feb 2, 2024

@dayvar14 my minimal request would be that the plugin's state is consistent when stored on disk. For the current implementation that is the case, as all Gather operations and all data flows have been stopped, so Telegraf is in a "frozen" state when we collect the states. Here is an example of my concern: imagine we tail 4 files and use a file output to save the results for illustration. Assume further that every file provides 32 bytes, corresponding to 1 metric, per "write". The following is fully possible even with locks around the state:

- file 1 is read and offset is set to 32 -> state(32, 0, 0, 0)
- file 2 is read and offset is set to 32 -> state(32, 32, 0, 0)
- file 3 is read and offset is set to 32 -> state(32, 32, 32, 0)
- file 4 is read and offset is set to 32 -> state(32, 32, 32, 32)
-> write is called with 4 metrics
- file 3 is read again and its offset advances to 64 -> state(32, 32, 64, 32)
- file 1 is read again and its offset advances to 64 -> state(64, 32, 64, 32)
* state-persistence is triggered for the plugin and stored state(64, 32, 64, 32)
**crash**

Now we have lost the last two metrics from file 3 and file 1, because the state says we already read them, but they were never written to the output!
In a non-event-based plugin (tail triggers state changes on file changes), the state might be modified during the gather loop, e.g. when gathering from multiple servers; if the state is persisted in the middle of that loop, your state represents some intermediate gather state and you have no way of knowing at which point of the gather process it was taken.

One possible solution would be double buffering. It will not fix the lost-metric example above, but it might guarantee that we have a clear gather state, e.g. by always returning the state as it was before the actual gather...
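
As a rough, self-contained sketch of that double-buffering idea (illustrative names only): the plugin mutates a working copy during a gather and only swaps it into the snapshot returned by GetState once the gather has completed.

  // Sketch: GetState always returns the last completed gather's snapshot,
  // never a half-updated one. BeginGather must be called at the start of
  // every gather so the working copy is decoupled from the committed one.
  package sketch

  import "sync"

  type doubleBufferedState struct {
      mu        sync.Mutex
      committed map[string]int64 // snapshot handed to the persister
      working   map[string]int64 // mutated freely while a gather is running
  }

  func (s *doubleBufferedState) BeginGather() {
      s.mu.Lock()
      defer s.mu.Unlock()
      // Start the working copy from the last committed snapshot.
      s.working = make(map[string]int64, len(s.committed))
      for k, v := range s.committed {
          s.working[k] = v
      }
  }

  func (s *doubleBufferedState) Update(key string, value int64) {
      s.mu.Lock()
      defer s.mu.Unlock()
      if s.working == nil {
          s.working = make(map[string]int64)
      }
      s.working[key] = value
  }

  func (s *doubleBufferedState) EndGather() {
      s.mu.Lock()
      defer s.mu.Unlock()
      s.committed = s.working // publish the finished gather's state
  }

  func (s *doubleBufferedState) GetState() interface{} {
      s.mu.Lock()
      defer s.mu.Unlock()
      return s.committed
  }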

@dayvar14 does that make any sense?

@dayvar14
Contributor Author

dayvar14 commented Feb 2, 2024

I see @srebhan.

For the tail input plugin specifically, I think there might be a slight limitation with the tail implementation (influxdata/tail) currently in use.

InfluxData's implementation of tail does not output the line index of the currently read line.

This means we rely on tail.Tell(), which is specifically noted as not being very accurate; it is quite possible that a new line was received by the time the line was processed in the receiver method.

I do see some implementations of tail that do expose this, so it may be possible to either use one of those or update the current implementation to support it.

If we do that, or if we are fine with the limitations of tail.Tell(), buffering the offsets and updating the local state only when the metrics are delivered seems like a good potential solution.
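
A self-contained sketch of what "update the local state only when the metrics are delivered" could look like; the types and the delivery-ID plumbing are assumptions for illustration, not Telegraf's actual tracking-metric API:

  // Sketch: offsets stay pending until the metrics emitted at that offset are
  // acknowledged as written; GetState only ever exposes acknowledged offsets.
  package sketch

  import "sync"

  type pendingOffset struct {
      file   string
      offset int64
  }

  type deliveryTrackedState struct {
      mu        sync.Mutex
      pending   map[uint64]pendingOffset // keyed by a delivery/tracking ID
      delivered map[string]int64         // what GetState hands to the persister
  }

  // Track records the offset reached when a metric (or batch) was emitted.
  func (s *deliveryTrackedState) Track(id uint64, file string, offset int64) {
      s.mu.Lock()
      defer s.mu.Unlock()
      if s.pending == nil {
          s.pending = make(map[uint64]pendingOffset)
      }
      s.pending[id] = pendingOffset{file: file, offset: offset}
  }

  // Ack is called once the outputs confirm the metrics behind id were written.
  func (s *deliveryTrackedState) Ack(id uint64) {
      s.mu.Lock()
      defer s.mu.Unlock()
      p, ok := s.pending[id]
      if !ok {
          return
      }
      if s.delivered == nil {
          s.delivered = make(map[string]int64)
      }
      if p.offset > s.delivered[p.file] {
          s.delivered[p.file] = p.offset
      }
      delete(s.pending, id)
  }

  func (s *deliveryTrackedState) GetState() interface{} {
      s.mu.Lock()
      defer s.mu.Unlock()
      return s.delivered
  }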

Again thanks for the feedback.

@powersj
Contributor

powersj commented Feb 6, 2024

Hi @dayvar14,

InfluxData's implementation of tail does not output the line index of the currently read line.

For the tail plugin, does that provide something different than the offset used today?

As Sven was mentioning, when we start gathering state at an interval it may not be consistent with messages that are successfully written. The offset may have advanced, but the metric not written. If a crash happens before writing, this leads to you missing data and having small gaps in your state when you recover. There is a similar concern with the Windows Event Log and its use of a bookmark.

As Sven and I have reviewed this, we are recalling why we did not originally add periodic state tracking. There are enough moving parts to make this racy and inconsistent. I would like to avoid taking a PR that we know will result in this type of behavior for now.

@dayvar14 dayvar14 force-pushed the update-state-per-gather branch 2 times, most recently from 9440e86 to 32be4ab Compare February 6, 2024 23:40
@dayvar14
Contributor Author

dayvar14 commented Feb 7, 2024

I updated the following plugins: win_eventlog, docker_log, and dedup.

win_eventlog - only updates the state at the end of a gather, meaning the state only represents metrics that have been successfully added to the accumulator.

docker_log - separates the state to be persisted from the frequently updated local state used during the gather, so that, again, the plugin state represents only metrics that have been successfully added to the accumulator provided in the gather.

dedup - updates the state during Apply.
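
As a rough illustration of the pattern described above for win_eventlog and docker_log (names are made up, and locking against a concurrent persister is omitted for brevity): the bookmark advances on a local copy while metrics are added, and the field GetState reads is only assigned once the gather has finished.

  // Sketch: commit the state only after every metric of this gather was
  // handed to the accumulator.
  package sketch

  type accumulator interface {
      AddFields(measurement string, fields map[string]interface{}, tags map[string]string)
  }

  type event struct {
      fields   map[string]interface{}
      bookmark string
  }

  type eventLogSketch struct {
      state string // read by GetState; updated only at the end of Gather
  }

  func (p *eventLogSketch) Gather(acc accumulator, events []event) error {
      bookmark := p.state
      for _, e := range events {
          acc.AddFields("event", e.fields, nil)
          bookmark = e.bookmark // advance only the local copy
      }
      p.state = bookmark // all metrics are in the accumulator; now commit
      return nil
  }

  func (p *eventLogSketch) GetState() interface{} {
      return p.state
  }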

The edge case I found with the current main implementation (specific to the tail plugin) is that if the process never shuts down safely, then no state is ever recorded. A crash may cause a whole bunch of messages to be lost, since it starts from the end of the file if no state is found.

@powersj
Contributor

powersj commented Feb 7, 2024

A crash may cause a whole bunch of messages to be lost, since it starts from the end of the file if no state is found.

Correct; however, unless I missed an update, this PR now introduces a case where the state claims you wrote messages that you may not actually have written. While I can understand that saving the state periodically is seen as an improvement, it introduces incorrect behavior that needs to be resolved.

@dayvar14
Contributor Author

dayvar14 commented Feb 7, 2024

I did notice @srebhan mentioned the following about always returning the state before the actual gather:

One possible solution would be double buffering. It will not fix the lost-metric example above, but it might guarantee that we have a clear gather state, e.g. by always returning the state as it was before the actual gather...

I might need to look through the code a bit more, but (for non-service plugins) perhaps only persisting the state of previous "gather" during a gather might be better? Maybe that is what was meant.

I'm not currently sure whether another gather confirms that the previous gather's metrics were written.

As for the tail plugin, confirming the metrics were delivered before state is written might be a bit more complex.

@powersj
Contributor

powersj commented Feb 7, 2024

I might need to look through the code a bit more, but (for non-service plugins) perhaps only persisting the state of previous "gather" during a gather might be better? Maybe that is what was meant.
I'm not currently sure whether another gather confirms that the previous gather's metrics were written.

While rare, there may be cases where a user gathers every 5 seconds and flushes every 10 seconds, so multiple gathers could occur before a flush happens. But keep in mind this only applies to non-service plugins.

As for the tail plugin, confirming the metrics were delivered before state is written might be a bit more complex.

100% - this is why he and I are hesitating :) The service plugins (e.g. tail, *-consumer) use tracking metrics; metrics are read not at gathers, but instead as they come in and as we have room for more.

@dayvar14
Contributor Author

dayvar14 commented Feb 7, 2024

Darn. This is fine in our use case, but I can see why it won't be desirable in most situations.

A crash may cause a whole bunch of messages to be lost, since it starts from the end of the file if no state is found.

Perhaps a more acceptable solution might be to configure inputs.tail to read from the start of the file if no state is found in the statefile? Perhaps a start_from_beginning_if_no_state option, or allowing start_from_beginning to work with state? This would possibly lead to duplicates, but those would be more desirable in our situation than dropped logs in the rare case that telegraf isn't gracefully shut down.

@srebhan
Contributor

srebhan commented Feb 14, 2024

@dayvar14 sorry for the late response!

Darn. This is fine in our use case, but I can see why it won't be desirable in most situations.

Yeah and I'm sorry for not being able to offer a good solution! I wish we could provide an easy way...

The edge case I found with the current main implementation (specific to the tail plugin) is that if the process never shuts down safely, then no state is ever recorded.

This should really be fixed as not shutting down safely is causing other troubles as well. Are you able to reproduce the issue with the latest version? If so, please open a bug-report and let me know!

Perhaps a more acceptable solution might be to configure inputs.tail to read from the start of the file if no state is found in the statefile? Perhaps a start_from_beginning_if_no_state option, or allowing start_from_beginning to work with state?

This is definitely feasible and I think it can be done. Instead of introducing a new boolean I would suggest having

  ## Read strategy
  ## Available settings are:
  ##   continue      -- continue from the last saved state or, if no state provided, from the end of the file
  ##   from_tail     -- always start from the end of the file no matter if a state is provided or not
  ##   from_head     -- always start from the beginning of the file no matter if a state is provided or not
  ##   fallback_head -- continue from the last saved state or, if no state provided, from the start of the file
  # read_strategy = "continue"

with continue being the default. You might want to improve the naming as this is not my strong suit. ;-)
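
For illustration, a minimal sketch of how such a read_strategy could be interpreted when the plugin (re)starts; the option values are the ones from the snippet above, while the function and its parameters are made up:

  // Sketch: choose the initial read offset from the configured strategy and
  // whether a saved offset exists for the file.
  package sketch

  import "fmt"

  func initialOffset(strategy string, savedOffset int64, haveState bool, fileSize int64) (int64, error) {
      switch strategy {
      case "", "continue": // default: saved state, otherwise end of file
          if haveState {
              return savedOffset, nil
          }
          return fileSize, nil
      case "from_tail": // always end of file
          return fileSize, nil
      case "from_head": // always beginning of file
          return 0, nil
      case "fallback_head": // saved state, otherwise beginning of file
          if haveState {
              return savedOffset, nil
          }
          return 0, nil
      default:
          return 0, fmt.Errorf("unknown read_strategy %q", strategy)
      }
  }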

What do you think?

@dayvar14
Contributor Author

dayvar14 commented Feb 14, 2024

This should really be fixed as not shutting down safely is causing other troubles as well. Are you able to reproduce the issue with the latest version? If so, please open a bug-report and let me know!

@srebhan I believe this is discussed in #14566. State is never recorded, as it's only recorded when telegraf shuts down safely.

This is definitely feasible and I think it can be done. Instead of introducing a new boolean I would suggest having

In that case I will open a new issue for this more specific problem. I'll leave this issue open, as I do believe an incrementally updated state is desirable. Other log agents like fluentd seem to have this feature (for their tail input), and in our use case it is important to minimize the amount of critical logs lost as well as to reduce duplication.

with continue being the default. You might want to improve the naming as this is not my strong suit. ;-)

Naming is always hard!

@srebhan
Contributor

srebhan commented Feb 16, 2024

@dayvar14 can we please open an issue for this with the exact requirements and close the PR? We can discuss ways forward in the issue... Keeping the PR open will only clutter the list and make it harder for us to maintain an overview of what is open...

@dayvar14 dayvar14 closed this Feb 19, 2024
Labels
area/agent feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin
Development

Successfully merging this pull request may close these issues.

update state more frequently in case telegraf is suddenly shutdown