Duplicate series error during join queries #2782

Open
rahulguptajss opened this issue Mar 27, 2024 · 1 comment

@rahulguptajss

Thanks @ybizeul for reporting.

image

From the screenshot, it seems that the poller port has changed. We should investigate whether the instance label can be ignored during join queries in Prometheus/Grafana.

@rahulguptajss

Case 1: Instant Query Failure

For the problem mentioned above, we encountered an instant query failure. This can only happen if the same poller is being monitored on different ports in Harvest. Here is an example:

Suppose we publish the following metrics to Prometheus. The port label is used to simulate the instance label taking different values:

volume_avg_latency{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", port="12990"} 2.45
volume_avg_latency{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", port="12991"} 2.45
volume_labels{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", isEncrypted="false", isHardwareEncrypted="false", is_sis_volume="true", junction_path="/DB1", node_root="false", root_volume="No", snapshot_autodelete="true", snapshot_policy="none", state="online", svm_root="false", type="rw", port="12990"} 1
volume_labels{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", isEncrypted="false", isHardwareEncrypted="false", is_sis_volume="true", junction_path="/DB1", node_root="false", root_volume="No", snapshot_autodelete="true", snapshot_policy="none", state="online", svm_root="false", type="rw", port="12991"} 1

When we run the following Prometheus instant query:

volume_avg_latency
* on(aggr,volume) group_right
volume_labels

We receive the following error:

Error executing query: found duplicate series for the match group {aggr="EPICaggr", volume="DB1"} on the left hand-side of the operation: [...]; many-to-many matching not allowed: matching labels must be unique on one side.

Since this is not a valid use case in the field, there is no need to handle this situation.
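
To make the failure mode concrete, here is a minimal Python sketch of how a join forms match groups from the on(...) labels and why two series that differ only in port collide on the "one" side of group_right. It is a simplified model for illustration, not Prometheus code; the helper names `match_groups` and `duplicate_groups` are hypothetical, and the label sets are reduced to the labels that matter here.

```python
from collections import defaultdict


def match_groups(series, on):
    """Group series by the values of the `on` labels, as a join would."""
    groups = defaultdict(list)
    for labels in series:
        key = tuple((name, labels.get(name, "")) for name in on)
        groups[key].append(labels)
    return dict(groups)


def duplicate_groups(series, on):
    """Return the match groups that hold more than one series.

    With group_right the left operand is the "one" side, so any group
    returned here for the left operand means the join is rejected.
    """
    return {k: v for k, v in match_groups(series, on).items() if len(v) > 1}


# Left operand from the example above, reduced to the relevant labels.
volume_avg_latency = [
    {"aggr": "EPICaggr", "volume": "DB1", "port": "12990"},
    {"aggr": "EPICaggr", "volume": "DB1", "port": "12991"},
]

dups_without_port = duplicate_groups(volume_avg_latency, on=["aggr", "volume"])
dups_with_port = duplicate_groups(volume_avg_latency, on=["aggr", "volume", "port"])
print(len(dups_without_port), len(dups_with_port))
```

In this model, matching on aggr and volume alone leaves one group with two series, which corresponds to the match group named in the error message. Once port is part of the match key, each group holds a single series.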

Case 2: Range Query Failure

Joins may also fail during events such as a node move or a volume move. These failures need to be handled query by query, which may involve ignoring certain labels in the join.

Case 3: Poller Port Change

Another scenario occurs when the poller port changes over time due to poller addition or deletion, resulting in a change to the Prometheus instance label. For simulation purposes, we use the port label.

Initially, Prometheus scrapes the following data:

volume_avg_latency{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", port="12990"} 2.45
volume_labels{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", isEncrypted="false", isHardwareEncrypted="false", is_sis_volume="true", junction_path="/DB1", node_root="false", root_volume="No", snapshot_autodelete="true", snapshot_policy="none", state="online", svm_root="false", type="rw", port="12990"} 1

After some time, if the poller port changes to 12991, it publishes the following data:

volume_avg_latency{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", port="12991"} 2.45
volume_labels{datacenter="rtp", node="sg-tme-af200-01", volume="DB1", aggr="EPICaggr", cluster="sg-tme-af200-01-02", style="flexvol", svm="FP-Test", isEncrypted="false", isHardwareEncrypted="false", is_sis_volume="true", junction_path="/DB1", node_root="false", root_volume="No", snapshot_autodelete="true", snapshot_policy="none", state="online", svm_root="false", type="rw", port="12991"} 1

Because the port value changed, running the range query volume_labels in Prometheus returns two series for the same volume. In the Grafana panel, the metric changes color at the point of the port change and is listed twice.

image

To address this, we have two options:

  1. Modify the query to exclude the port label. Here is the adjusted query:

    label_replace(volume_labels,"port", "", "port", ".*")
    
  2. Drop the port label using relabeling rules in the Prometheus configuration.
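
Both options aim for the same effect: once the port label is out of the picture, the data before and after the port change belongs to one series instead of two. The sketch below models a series identity as its label set, which is a simplification for illustration; `series_id` is a hypothetical helper and the label sets are shortened.

```python
def series_id(labels, ignore=()):
    """Identity of a series: its label set minus any ignored label names."""
    return frozenset((k, v) for k, v in labels.items() if k not in ignore)


# Shortened label sets for the scrapes before and after the port change.
before_change = {"__name__": "volume_labels", "volume": "DB1", "aggr": "EPICaggr", "port": "12990"}
after_change = {"__name__": "volume_labels", "volume": "DB1", "aggr": "EPICaggr", "port": "12991"}

# With port included, the two scrapes count as different series.
with_port = {series_id(before_change), series_id(after_change)}

# With port ignored, both scrapes map to the same identity.
without_port = {
    series_id(before_change, ignore=("port",)),
    series_id(after_change, ignore=("port",)),
}
print(len(with_port), len(without_port))
```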

Applying option 1 to topk queries changes them as follows:

Before:

volume_labels
and
topk(5, avg_over_time(volume_labels[3h] @ end()))

After:

label_replace(volume_labels,"port", "", "port", ".*")
and
topk(5, avg_over_time(label_replace(volume_labels,"port", "", "port", ".*")[3h:] @ end()))
image
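
If this rewrite has to be applied to many panels, the "After" form can be generated mechanically. The following is a minimal sketch that builds the query text only; `drop_label` and `topk_over_range` are hypothetical helpers, no escaping or validation is done, and the result is just a string to paste into a panel. Note that the wrapped expression needs the subquery form [3h:] rather than [3h], because a range selector can only be applied directly to a metric selector, not to a function call.

```python
def drop_label(expr, label="port"):
    """Wrap an expression in label_replace so `label` is set to the empty string."""
    return f'label_replace({expr},"{label}", "", "{label}", ".*")'


def topk_over_range(expr, k=5, window="3h"):
    """Build `expr and topk(k, avg_over_time(expr[window:] @ end()))`."""
    return f"{expr}\nand\ntopk({k}, avg_over_time({expr}[{window}:] @ end()))"


query = topk_over_range(drop_label("volume_labels"))
print(query)
```

With the defaults shown, the generated text matches the "After" query above.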
