
agent db: make rejecting ooo samples configurable #14094

Open
wants to merge 7 commits into
base: main
from

Conversation

rabenhorst
Contributor

Add the storage.agent.reject-out-of-order-samples flag to make rejection of out-of-order samples in the agent's DB configurable. This makes it possible to stop the agent from remote-writing out-of-order samples, a behavior that was introduced in #12897.

@rabenhorst rabenhorst force-pushed the configurable-ooo-sample-agent-db branch 2 times, most recently from 2f991c2 to 323b88f Compare May 13, 2024 16:56
@ArthurSens
Member

Maybe I'm missing something, but isn't it already possible to disable OOO ingestion with the configuration file?

From my understanding if out_of_order_time_window is zero (the default), then OOO is disabled. See https://prometheus.io/docs/prometheus/latest/configuration/configuration/#tsdb

@rabenhorst
Contributor Author

rabenhorst commented May 14, 2024

From my understanding if out_of_order_time_window is zero (the default), then OOO is disabled. See https://prometheus.io/docs/prometheus/latest/configuration/configuration/#tsdb

Maybe I miss something, but I don't see how this applies to the agent db, which, as I understand, is just the TSDB WAL.

For context, the use-case here is to avoid remote writing ooo samples (e.g. when the sink would reject them anyhow).

@sdufel

sdufel commented May 14, 2024

When agents ingest out-of-order samples and remote write them to external stores, it can be extremely expensive on the other end. We run a setup with Prometheus agents writing to Thanos receivers, and out-of-order samples generate severe amounts of load.

@ArthurSens
Member

ArthurSens commented May 14, 2024

Maybe I miss something, but I don't see how this applies to the agent db, which, as I understand, is just the TSDB WAL.

Yeah, I'm reading the code now and it looks like the agent indeed doesn't take this into consideration.

I think that creating yet another configuration option for out of order would be confusing though. What do you think about re-using the already existing out_of_order_time_window configuration option?

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>
Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>
Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>
Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>
@rabenhorst rabenhorst force-pushed the configurable-ooo-sample-agent-db branch from 26aca44 to 157ec2e Compare May 17, 2024 17:27
@rabenhorst
Contributor Author

rabenhorst commented May 17, 2024

Yeah, I'm reading the code now and it looks like the agent indeed doesn't take this into consideration.

I think that creating yet another configuration option for out of order would be confusing though. What do you think about re-using the already existing out_of_order_time_window configuration option?

I think it makes sense and I changed it and added some test cases for edge cases. Is this what you had in mind?

It might be confusing now that out_of_order_time_window, which is part of tsdb config, is also applied to agent db. What would be the best place to document it?

Member

@ArthurSens ArthurSens left a comment


It LGTM, just a few nitpicks

I'll try to ping maintainers to take a look at this as well

tsdb/agent/db.go (outdated, resolved)
cmd/prometheus/main.go (outdated, resolved)
tsdb/agent/db_test.go (outdated, resolved)
@ArthurSens
Member

ArthurSens commented May 23, 2024

I'm also realizing that changing the default behavior now (always accept OOO -> have to configure window to accept) might be a breaking change. Since agent-mode is a feature flag and OOO is marked as experimental in our docs, I think the change should be fine, but let's see what others think

@ArthurSens
Member

It might be confusing now that out_of_order_time_window, which is part of tsdb config, is also applied to agent db. What would be the best place to document it?

Good point 😬, I don't remember if the configuration documentation is auto-generated... I'm 90% sure it's not. So it should be fine to just add a statement here clarifying that it is also applied for agent-mode?

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>
@rabenhorst
Contributor Author

It LGTM, just a few nitpicks

I'll try to ping maintainers to take a look at this as well

I addressed the nits and thx!

@roidelapluie
Member

cc @rfratto

Member

@jesusvazquez jesusvazquez left a comment


Looks good.

I think rejecting OOO samples when doing remote write is a fair choice regardless of whether the backend can ingest them. But it's an even more valid choice given today's context, where OOO is still not enabled by default.

Left a small nit that I'd like to see fixed for consistency 👍

Nice work 💪

tsdb/agent/db.go Outdated

// minTs returns the minimum timestamp that a sample can have
// and is needed for preventing underflow.
func (a *appender) minTs(lastTs int64) int64 {
Member


I'd rename this method to minValidTime or appendable() to make it consistent with other parts of the codebase:

func (s *memSeries) appendable(t int64, v float64, headMaxt, minValidTime, oooTimeWindow int64) (isOOO bool, oooDelta int64, err error) {
	// Check if we can append in the in-order chunk.
	if t >= minValidTime {

Contributor Author


Done, renamed to minValidTime.
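
For readers following the thread, the underflow concern mentioned in the method comment can be illustrated with a minimal standalone sketch. This is not the code from the PR: the free function `minValidTime(lastTs, window)` and its parameters are hypothetical, and it assumes the window is a non-negative number of milliseconds. The point is only that subtracting the window from a timestamp near `math.MinInt64` would wrap around, so the result is clamped.

```go
package main

import (
	"fmt"
	"math"
)

// minValidTime returns the oldest timestamp that would still be accepted for
// a series whose last appended sample has timestamp lastTs, given an
// out-of-order window. Hypothetical helper for illustration only.
func minValidTime(lastTs, window int64) int64 {
	if window <= 0 {
		// No window configured: nothing older than lastTs is valid.
		return lastTs
	}
	// lastTs - window would wrap around for timestamps close to
	// math.MinInt64, so clamp instead of subtracting.
	if lastTs < math.MinInt64+window {
		return math.MinInt64
	}
	return lastTs - window
}

func main() {
	fmt.Println(minValidTime(1000, 300))                              // 700
	fmt.Println(minValidTime(math.MinInt64+10, 300) == math.MinInt64) // true
}
```

Comparing against `math.MinInt64+window` rather than computing `lastTs-window` first keeps every intermediate value inside the int64 range.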

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>
@jesusvazquez
Member

Holding off merge for a @rfratto review since he maintains the agent 👍

Contributor

@tpaschalis tpaschalis left a comment


Since we're reusing an existing Prometheus config field, should we update docs for out_of_order_time_window around https://prometheus.io/docs/prometheus/latest/configuration/configuration/#tsdb to mention this dual use-case?

@rabenhorst
Contributor Author

Since we're reusing an existing Prometheus config field, should we update docs for out_of_order_time_window around https://prometheus.io/docs/prometheus/latest/configuration/configuration/#tsdb to mention this dual use-case?

I added it to out_of_order_time_window docs:

# When out_of_order_time_window is greater than 0, it also affects experimental agent. It allows
# the agent's WAL to accept out-of-order samples that fall within the specified time window relative
# to the timestamp of the last appended sample for the same series.

WDYT?
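
To make the documented behavior concrete, here is a rough toy model of the per-series check described in that doc comment. It is a sketch under stated assumptions, not the agent's implementation: the `wal` type, `appendSample`, and `errOutOfOrder` are hypothetical names, timestamps are assumed to be non-negative milliseconds, the exact boundary handling is a guess, and the model keeps the newest timestamp per series as the reference point.

```go
package main

import (
	"errors"
	"fmt"
)

// errOutOfOrder is a hypothetical stand-in for the error the agent returns.
var errOutOfOrder = errors.New("out of order sample")

// wal is a toy model of the agent's WAL that only tracks the newest
// timestamp seen per series.
type wal struct {
	window int64            // out_of_order_time_window in ms; 0 rejects all OOO samples
	lastTs map[string]int64 // newest appended timestamp per series
}

// appendSample accepts t if it is newer than the series' last timestamp, or
// older by less than the configured window. Timestamps are assumed to be
// non-negative, so the subtraction below cannot overflow.
func (w *wal) appendSample(series string, t int64) error {
	last, seen := w.lastTs[series]
	if seen && t <= last && last-t >= w.window {
		return errOutOfOrder
	}
	if !seen || t > last {
		w.lastTs[series] = t
	}
	return nil
}

func main() {
	w := &wal{window: 300, lastTs: map[string]int64{}}
	fmt.Println(w.appendSample("up", 1000)) // <nil>
	fmt.Println(w.appendSample("up", 800))  // <nil>: 200 ms old, inside the window
	fmt.Println(w.appendSample("up", 600))  // out of order sample: 400 ms old
}
```

With `window` set to 0, the same check rejects every sample whose timestamp is not newer than the last one for its series, which corresponds to the default discussed earlier in the thread.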

Member

@ArthurSens ArthurSens left a comment


Since the next release process starts next week, I'd merge this even without @rfratto's approval 😬.

@rabenhorst, it seems like you forgot to sign your latest commit. Could you fix that please? 🙂


6 participants