Duplicate series error during join queries #2782

rahulguptajss · 2024-03-27T09:16:43Z

Thanks @ybizeul for reporting.

From the screenshot, it seems that the poller port has changed. We should investigate whether the instance label can be ignored during join queries in Prometheus/Grafana.

rahulguptajss · 2024-04-24T11:00:15Z

Case 1: Instant Query Failure

The problem reported above is an instant query failure. It can only happen if the same poller is monitored on different ports in Harvest at the same time. Here is an example:

Suppose we are publishing the following metrics to Prometheus. The port label is used to simulate the instance label with different values:

volume_avg_latency{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", port="12990"} 2.45
volume_avg_latency{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", port="12991"} 2.45
volume_labels{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", isEncrypted="false", isHardwareEncrypted="false", is_sis_volume="true", junction_path="/DB1", node_root="false", root_volume="No", snapshot_autodelete="true", snapshot_policy="none", state="online", svm_root="false", type="rw", port="12990"} 1
volume_labels{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", isEncrypted="false", isHardwareEncrypted="false", is_sis_volume="true", junction_path="/DB1", node_root="false", root_volume="No", snapshot_autodelete="true", snapshot_policy="none", state="online", svm_root="false", type="rw", port="12991"} 1

When we run the following Prometheus instant query:

volume_avg_latency
* on(aggr,volume) group_right
volume_labels

We receive the following error:

Error executing query: found duplicate series for the match group {aggr="EPICaggr", volume="DB1"} on the left hand-side of the operation: [...]; many-to-many matching not allowed: matching labels must be unique on one side.

Monitoring the same poller on two ports at once is not a valid use case in the field, so there is no need to handle this situation.
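For reference, the matching rule behind this error can be modeled in a few lines. This is only a toy sketch of one-to-many matching, not how Prometheus implements it; the function names (`match_key`, `join_group_right`, `drop_label`) and the trimmed-down label sets are made up for illustration.

```python
def match_key(labels, on):
    """Build the match group from the labels listed in on(...)."""
    return tuple((name, labels.get(name, "")) for name in on)


def join_group_right(left, right, on):
    """Toy model of `left * on(...) group_right right`.

    left and right are lists of (labels_dict, value). With group_right the
    left side is the "one" side, so each match group may appear there once.
    """
    one_side = {}
    for labels, value in left:
        key = match_key(labels, on)
        if key in one_side:
            raise ValueError(
                f"found duplicate series for the match group {dict(key)} "
                "on the left hand-side of the operation"
            )
        one_side[key] = value
    # Every right-hand series that finds a partner keeps its own labels.
    return [
        (labels, one_side[match_key(labels, on)] * value)
        for labels, value in right
        if match_key(labels, on) in one_side
    ]


def drop_label(series, name):
    """Remove one label and merge series that become identical (keeps the max)."""
    merged = {}
    for labels, value in series:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k != name))
        merged[kept] = max(value, merged.get(kept, value))
    return [(dict(kept), value) for kept, value in merged.items()]


# Trimmed-down versions of the series above: same poller seen on two ports.
base = {"aggr": "EPICaggr", "volume": "DB1", "svm": "FP-Test"}
latency = [({**base, "port": p}, 2.45) for p in ("12990", "12991")]
vol_labels = [({**base, "state": "online", "port": p}, 1.0) for p in ("12990", "12991")]
```

Calling `join_group_right(latency, vol_labels, ["aggr", "volume"])` raises the duplicate-series error, because both `latency` series fall into the same match group. Passing `drop_label(latency, "port")` as the left side leaves a single series per group, and the join goes through.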

Case 2: Range Query Failure

During events such as a node move or a volume move, label values of a series change partway through the queried range, so joins may fail. These cases need to be handled query by query, which may involve ignoring the affected labels to fix the issue.

Case 3: Poller Port Change

Another scenario occurs when the poller port changes over time because pollers are added or removed, which changes the Prometheus instance label. As in Case 1, the port label is used to simulate the instance label.

Initially, Prometheus scrapes the following data:

volume_avg_latency{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", port="12990"} 2.45
volume_labels{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", isEncrypted="false", isHardwareEncrypted="false", is_sis_volume="true", junction_path="/DB1", node_root="false", root_volume="No", snapshot_autodelete="true", snapshot_policy="none", state="online", svm_root="false", type="rw", port="12990"} 1

After some time, if the poller port changes to 12991, it publishes the following data:

volume_avg_latency{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", port="12991"} 2.45
volume_labels{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", isEncrypted="false", isHardwareEncrypted="false", is_sis_volume="true", junction_path="/DB1", node_root="false", root_volume="No", snapshot_autodelete="true", snapshot_policy="none", state="online", svm_root="false", type="rw", port="12991"} 1

Running the range query volume_labels in Prometheus then returns two separate series, one per port value. In the Grafana panel, the metric changes color at the point where the port changed and is listed twice.

To address this, we have two options:

1. Modify the query to exclude the port label. Here is the adjusted query:
```
label_replace(volume_labels,"port", "", "port", ".*")
```
2. Drop the port label using relabeling rules in the Prometheus configuration.
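For the second option, a minimal sketch of what such a rule could look like, assuming the label is dropped at scrape time with `metric_relabel_configs`. The job name and target address are placeholders, and `port` is the simulated label from this example; in a real setup the label in question would be `instance`.

```yaml
scrape_configs:
  - job_name: harvest                   # placeholder job name
    static_configs:
      - targets: ["localhost:12990"]    # placeholder poller address
    metric_relabel_configs:
      # Remove the label named "port" from every scraped series.
      - action: labeldrop
        regex: port
```

Note that `labeldrop` is only safe if the series are still uniquely labeled once the label is gone, so this should be checked against the rest of the scrape targets before rolling it out.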

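If we apply solution 1 to topk queries, the modification would look like this. Because label_replace returns an instant vector, the range selector [3h] has to become the subquery [3h:]: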

Before:

volume_labels
  and 
topk(5, avg_over_time(volume_labels[3h] @ end()))

After:

label_replace(volume_labels,"port", "", "port", ".*")
and
topk(5, avg_over_time(label_replace(volume_labels,"port", "", "port", ".*")[3h:] @ end()))
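If many panels need the same change, the rewrite could be generated rather than typed by hand. The helper below is a hypothetical sketch that only builds the query string in the "After" shape shown above; `drop_label_expr` and `topk_query` are illustrative names, not part of Harvest or Prometheus.

```python
def drop_label_expr(selector: str, label: str = "port") -> str:
    """Wrap a selector so that `label` is blanked out, which removes it."""
    return f'label_replace({selector}, "{label}", "", "{label}", ".*")'


def topk_query(metric: str, k: int = 5, window: str = "3h", label: str = "port") -> str:
    """Build the topk filter query with `label` removed on both sides of `and`."""
    cleaned = drop_label_expr(metric, label)
    # The wrapped expression is not a plain selector, so use the
    # subquery form [window:] instead of a range selector [window].
    return f"{cleaned}\nand\ntopk({k}, avg_over_time({cleaned}[{window}:] @ end()))"


if __name__ == "__main__":
    print(topk_query("volume_labels"))
```

Running it prints the same three-line query as the "After" example, apart from a space after the first comma inside `label_replace`.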

rahulguptajss added status/needs-triage customer labels Mar 27, 2024

rahulguptajss self-assigned this Apr 24, 2024

rahulguptajss added the status/open label Apr 24, 2024

rahulguptajss removed the status/open label May 14, 2024
