
Issue with deduplication algorithm in Thanos #7364

Open
pardha-visa opened this issue May 16, 2024 · 4 comments
pardha-visa commented May 16, 2024

We have a pretty straightforward Thanos setup consisting of a querier, two Prometheus replicas, and their two corresponding sidecars, each running alongside its own Prometheus instance. Both Prometheus replicas share the exact same configuration and scrape the same set of targets. The sidecars use the Prometheus remote read API for querying.

Recently we saw that, for one of the targets, one of the Prometheus replicas experienced scrape failures due to timeouts, which created gaps in its collected data. The other Prometheus replica didn't face any such issues and had no gaps.

Our expectation was that, when querying data for this target via Thanos Querier, these gaps would be automatically filled by the deduplication algorithm. However, this didn't happen, and Thanos selected the data from the replica that had the gaps.

Here's the graph with deduplication disabled (first replica selected):

[screenshot]

Here's the graph with deduplication disabled (second replica selected):

[screenshot]

Here's the graph with deduplication enabled:

[screenshot]

Here is the raw data from both the replicas for the same time range:


Query = node_cpu_seconds_total{mode='iowait',instance='<masked>',cpu="0"}[5m]

_replica=occ-node-A
9389.87 1713668216.753
9390.03 1713668306.753
9390.33 1713668336.753
9391.36 1713668426.753
9391.38 1713668456.753
9393.49 1713668486.753

_replica=oce-node-A
9389.94 1713668224.198
9389.95 1713668254.198
9390.02 1713668284.198
9390.03 1713668314.198
9390.33 1713668344.198
9390.83 1713668374.198
9391.13 1713668404.198
9391.38 1713668434.198
9391.61 1713668464.198
9393.53 1713668494.198

Thanos version: 0.33.0
Prometheus version: 2.51.1
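
For illustration, here is a naive, self-contained Go sketch of the gap-filling merge we expected deduplication to perform on the raw samples above. It is not Thanos' actual penalty-based algorithm from pkg/dedup; it simply takes the union of both replicas and drops samples that fall within an assumed 15s tolerance of the previously kept one (half the 30s scrape interval):

```go
package main

import (
	"fmt"
	"sort"
)

type sample struct {
	t int64   // timestamp in milliseconds
	f float64 // counter value
}

// dedupMerge takes the union of samples from both replicas and drops any
// sample that lands within `tolerance` of the previously kept one, so a gap
// in one replica is filled by the other. Thanos' real penalty-based dedup
// (pkg/dedup) is more involved, but the expected outcome here is similar.
func dedupMerge(a, b []sample, tolerance int64) []sample {
	all := append(append([]sample{}, a...), b...)
	sort.Slice(all, func(i, j int) bool { return all[i].t < all[j].t })

	var out []sample
	for _, s := range all {
		if len(out) > 0 && s.t-out[len(out)-1].t < tolerance {
			continue // too close to the last kept sample: treat as a duplicate
		}
		out = append(out, s)
	}
	return out
}

func main() {
	occ := []sample{ // replica with the scrape gaps (occ-node-A)
		{1713668216753, 9389.87}, {1713668306753, 9390.03},
		{1713668336753, 9390.33}, {1713668426753, 9391.36},
		{1713668456753, 9391.38}, {1713668486753, 9393.49},
	}
	oce := []sample{ // healthy replica (oce-node-A), samples 30s apart
		{1713668224198, 9389.94}, {1713668254198, 9389.95},
		{1713668284198, 9390.02}, {1713668314198, 9390.03},
		{1713668344198, 9390.33}, {1713668374198, 9390.83},
		{1713668404198, 9391.13}, {1713668434198, 9391.38},
		{1713668464198, 9391.61}, {1713668494198, 9393.53},
	}
	for _, s := range dedupMerge(occ, oce, 15_000) { // assumed 15s tolerance
		fmt.Println(s.t, s.f)
	}
}
```

Running this over the two replicas above produces a series with roughly 30s spacing and no gaps, which is what we expected to see from the querier with deduplication enabled.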

MichaHoffmann (Contributor) commented May 18, 2024

The query in the UI uses [1m]; from the samples it looks like you have a 30s scrape interval. Does it also happen with [5m]? I wrote a quick test with your given inputs, and the resulting series looks something like:

    samples: []sample{
        {t: 1713668216000, f: 9389.87},
        {t: 1713668224000, f: 9389.94},
        {t: 1713668254000, f: 9389.95},
        {t: 1713668284000, f: 9390.02},
        {t: 1713668314000, f: 9390.03},
        {t: 1713668344000, f: 9390.33},
        {t: 1713668374000, f: 9390.83},
        {t: 1713668404000, f: 9391.13},
        {t: 1713668434000, f: 9391.38},
        {t: 1713668464000, f: 9391.61},
        {t: 1713668494000, f: 9393.53},
    },

MichaHoffmann (Contributor) commented:
It looks like all samples are there, with a proper 30s scrape interval between them. It could be that your 1m windows are aligned in a way that only one sample falls inside a given window, which would break rate. I think this is an issue with a too-small window; the deduplication result looks mostly correct to me, except that there is one sample too many at the beginning.
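
To make the window point concrete, here is a rough, self-contained Go sketch of the relevant part of rate()'s behavior (not Prometheus' actual implementation, which additionally extrapolates to the window boundaries and handles counter resets): an evaluation window that contains fewer than two samples yields no point, so the graph shows a gap even though the deduplicated series itself has none.

```go
package main

import "fmt"

type sample struct {
	t int64   // timestamp in milliseconds
	f float64 // counter value
}

// windowRate sketches how rate() behaves per evaluation window: with fewer
// than two samples inside the window there is nothing to compute, so that
// evaluation step yields no point (a gap in the graph).
func windowRate(in []sample) (float64, bool) {
	if len(in) < 2 {
		return 0, false
	}
	first, last := in[0], in[len(in)-1]
	return (last.f - first.f) / (float64(last.t-first.t) / 1000), true
}

func main() {
	// A window that caught two 30s-spaced samples vs. one that caught only one.
	two := []sample{{1713668434198, 9391.38}, {1713668464198, 9391.61}}
	one := []sample{{1713668464198, 9391.61}}

	fmt.Println(windowRate(two)) // ~0.0077 true
	fmt.Println(windowRate(one)) // 0 false -> gap at this evaluation step
}
```

Widening the range (e.g. to [5m]) makes it far more likely that every window contains at least two samples.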

jnyi (Contributor) commented May 20, 2024

We have evidence of a dedup logic bug as well; here is the proof:

The first graph shows data points missing when getting results from Thanos Receive with replicationFactor == 3. Because we were rolling-updating the receiver pods, one copy was definitely absent; however, even after checking the Deduplication option, the result still has dips:

No Dedup
[screenshot]

With Dedup
[screenshot]

After the data got compacted and was returned from the store, the results became correct, with no dips:
[screenshot]

jnyi (Contributor) commented May 21, 2024

Seems related to this issue: #981
