
Issue with deduplication algorithm in Thanos #7364

Open
pardha-visa opened this issue May 16, 2024 · 4 comments
pardha-visa commented May 16, 2024

We have a pretty straightforward Thanos setup consisting of a querier, two Prometheus replicas, and their two corresponding sidecars, each running alongside its own Prometheus instance. Both Prometheus replicas share the exact same configuration and scrape the same set of targets. The sidecars use the Prometheus remote read API for querying.

Recently we saw that, for one of the targets, one of the Prometheus replicas experienced scrape failures due to timeouts, which created gaps in its collected data. The other Prometheus replica didn't face any such issues and had no gaps.

Our expectation was that, when querying data for this target via Thanos Querier, these gaps would be automatically filled by the deduplication algorithm. However, this didn't happen, and Thanos selected the data from the replica that had the gaps.

Here's the graph with deduplication disabled (first replica selected):

[screenshot]

Here's the graph with deduplication disabled (second replica selected):

[screenshot]

Here's the graph with deduplication enabled:

[screenshot]

Here is the raw data from both the replicas for the same time range:


Query = node_cpu_seconds_total{mode='iowait',instance='<masked>',cpu="0"}[5m]

_replica=occ-node-A
9389.87 1713668216.753
9390.03 1713668306.753
9390.33 1713668336.753
9391.36 1713668426.753
9391.38 1713668456.753
9393.49 1713668486.753

_replica=oce-node-A
9389.94 1713668224.198
9389.95 1713668254.198
9390.02 1713668284.198
9390.03 1713668314.198
9390.33 1713668344.198
9390.83 1713668374.198
9391.13 1713668404.198
9391.38 1713668434.198
9391.61 1713668464.198
9393.53 1713668494.198

Thanos version: 0.33.0
Prometheus version: 2.51.1
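
For illustration, here is a naive, self-contained Go sketch of the gap-filling merge we expected deduplication to perform on the raw samples above. It is not Thanos' actual penalty-based algorithm from pkg/dedup; it simply takes the union of both replicas and drops samples that fall within an assumed 15s tolerance of the previously kept one (half the 30s scrape interval):

```go
package main

import (
	"fmt"
	"sort"
)

type sample struct {
	t int64   // timestamp in milliseconds
	f float64 // counter value
}

// dedupMerge takes the union of samples from both replicas and drops any
// sample that lands within `tolerance` of the previously kept one, so a gap
// in one replica is filled by the other. Thanos' real penalty-based dedup
// (pkg/dedup) is more involved, but the expected outcome here is similar.
func dedupMerge(a, b []sample, tolerance int64) []sample {
	all := append(append([]sample{}, a...), b...)
	sort.Slice(all, func(i, j int) bool { return all[i].t < all[j].t })

	var out []sample
	for _, s := range all {
		if len(out) > 0 && s.t-out[len(out)-1].t < tolerance {
			continue // too close to the last kept sample: treat as a duplicate
		}
		out = append(out, s)
	}
	return out
}

func main() {
	occ := []sample{ // replica with the scrape gaps (occ-node-A)
		{1713668216753, 9389.87}, {1713668306753, 9390.03},
		{1713668336753, 9390.33}, {1713668426753, 9391.36},
		{1713668456753, 9391.38}, {1713668486753, 9393.49},
	}
	oce := []sample{ // healthy replica (oce-node-A), samples 30s apart
		{1713668224198, 9389.94}, {1713668254198, 9389.95},
		{1713668284198, 9390.02}, {1713668314198, 9390.03},
		{1713668344198, 9390.33}, {1713668374198, 9390.83},
		{1713668404198, 9391.13}, {1713668434198, 9391.38},
		{1713668464198, 9391.61}, {1713668494198, 9393.53},
	}
	for _, s := range dedupMerge(occ, oce, 15_000) { // assumed 15s tolerance
		fmt.Println(s.t, s.f)
	}
}
```

Running this over the two replicas above produces a series with roughly 30s spacing and no gaps, which is what we expected to see from the querier with deduplication enabled.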

MichaHoffmann (Contributor) commented May 18, 2024

The query in the UI uses [1m]; from the samples it looks like you have a 30s scrape interval. Does it also happen with [5m]? I wrote a quick test with your given inputs, and the resulting series looks something like:

    samples: []sample{
        {t: 1713668216000, f: 9389.87},
        {t: 1713668224000, f: 9389.94},
        {t: 1713668254000, f: 9389.95},
        {t: 1713668284000, f: 9390.02},
        {t: 1713668314000, f: 9390.03},
        {t: 1713668344000, f: 9390.33},
        {t: 1713668374000, f: 9390.83},
        {t: 1713668404000, f: 9391.13},
        {t: 1713668434000, f: 9391.38},
        {t: 1713668464000, f: 9391.61},
        {t: 1713668494000, f: 9393.53},
    },

MichaHoffmann (Contributor) commented:
It looks like all samples are there, with a proper 30s scrape interval between them. It could be that your 1m windows are aligned in a way that only one sample falls inside a given window, which would break rate. I think this is an issue with a too-small window; the deduplication result looks mostly correct to me, except that there is one sample too many at the beginning.
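
To make the window point concrete, here is a rough, self-contained Go sketch of the relevant part of rate()'s behavior (not Prometheus' actual implementation, which additionally extrapolates to the window boundaries and handles counter resets): an evaluation window that contains fewer than two samples yields no point, so the graph shows a gap even though the deduplicated series itself has none.

```go
package main

import "fmt"

type sample struct {
	t int64   // timestamp in milliseconds
	f float64 // counter value
}

// windowRate sketches how rate() behaves per evaluation window: with fewer
// than two samples inside the window there is nothing to compute, so that
// evaluation step yields no point (a gap in the graph).
func windowRate(in []sample) (float64, bool) {
	if len(in) < 2 {
		return 0, false
	}
	first, last := in[0], in[len(in)-1]
	return (last.f - first.f) / (float64(last.t-first.t) / 1000), true
}

func main() {
	// A window that caught two 30s-spaced samples vs. one that caught only one.
	two := []sample{{1713668434198, 9391.38}, {1713668464198, 9391.61}}
	one := []sample{{1713668464198, 9391.61}}

	fmt.Println(windowRate(two)) // ~0.0077 true
	fmt.Println(windowRate(one)) // 0 false -> gap at this evaluation step
}
```

Widening the range (e.g. to [5m]) makes it far more likely that every window contains at least two samples.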

jnyi (Contributor) commented May 20, 2024

We have evidence of a dedup logic bug as well; here is the proof:

The first graph shows data points missing when getting results from Thanos Receive with replicationFactor == 3. Because we were rolling-updating the receiver pods, one copy was definitely absent; however, even after checking the Deduplication option, the result still has dips:

No Dedup
[screenshot]

With Dedup
[screenshot]

After the data got compacted and was returned from the store, the results became correct, with no dips:
[screenshot]

jnyi (Contributor) commented May 21, 2024

Seems related to this issue: #981
