Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Support for Thanos Ruler #527

Closed
wants to merge 2 commits into from
Closed

Add Support for Thanos Ruler #527

wants to merge 2 commits into from

Conversation

anas-aso
Copy link

I would like to use prometheus-engine frontend component as a datasource for Thanos Ruler. Unfortunately, Thanos Ruler always sends 2 extra params dedup and partial_response which results on a 400 error on GCP side:

read query instant response: expected 2xx response, got 400. Body: {
  "error": {
    "code": 400,
    "message": "Invalid JSON payload received. Unknown name \"dedup\": Cannot bind query parameter. Field 'dedup' could not be found in request message.
Invalid JSON payload received. Unknown name \"partial_response\": Cannot bind query parameter. Field 'partial_response' could not be found in request message.",
    "status": "INVALID_ARGUMENT",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.BadRequest",
        "fieldViolations": [
          {
            "description": "Invalid JSON payload received. Unknown name \"dedup\": Cannot bind query parameter. Field 'dedup' could not be found in request message."
          },
          {
            "description": "Invalid JSON payload received. Unknown name \"partial_response\": Cannot bind query parameter. Field 'partial_response' could not be found in request message."
          }
        ]
      }
    ]
  }
}

The change in this PR mutates the request body by removing those extra params before forwarding them to GCP.
@bwplotka your feedback is welcome (maybe you have a different way of doing it).

@lyanco
Copy link
Collaborator

lyanco commented Jul 26, 2023

Are there features that thanos-ruler provides that our self-deployed rule evaluator doesn't? If this request is due to a feature gap, maybe we should address that as well.

@anas-aso
Copy link
Author

Hi @lyanco. Thanks a lot for the quick reply.

Are there features that thanos-ruler provides that our self-deployed rule evaluator doesn't?

Thanos Ruler is already part of our self-managed monitoring infrastructure and the generated series are pushed to Cloud Storage (in raw Prometheus format) instead of sending them to Cloud Monitoring via the CreateTimeSeries API.
In our setup, the extra round trip to Cloud Monitoring doesn't make sense.

The workflow using rule-evaluator is like this:

  1. generate the recording rules
  2. push them to GCP (cloud monitoring - custom metrics)
  3. download them via Stackdriver-exporter
  4. push them again to GCP (cloud storage) via Thanos Sidecar

But with GMP+Thanos Ruler it will look like this:

  1. generate the recording rules (Thanos Ruler)
  2. push them to GCP (Thanos Ruler - shipper functionality).

@lyanco
Copy link
Collaborator

lyanco commented Jul 26, 2023

Got it, thanks for the context. And what metrics are you recording over? System metrics or custom (user-defined) metrics?

If system metrics, why is exporting the raw metrics via stackdriver-exporter not sufficient?
If custom metrics, why are you paying us at all instead of writing these in your oss stack? Are these log-based metrics or some other type of metric that can only be computed within GCP?

@anas-aso
Copy link
Author

And what metrics are you recording over? System metrics or custom (user-defined) metrics?

@lyanco system metrics (http LBs, PubSub, etc).

I have had issues before yielding wrong results with Stackdriver-exporter, but when I try the same query on Cloud Monitoring it returns correct results ... deltas are not easy to translate and the histograms can start returning weird results. I even have an open support ticket with GCP for more than a week now.
When something is wrong the answer is always "stackdriver exporter is not something supported by GCP" and I have to spend lots of time proving that stackdriver-exporter is fine and the problems is on GCP side.

Another reason why I like GMP Frontend is that I don't have to pull high cardinality metrics (histograms) when I am interested in the aggregation only. So doing the aggregation GCP side is more efficient for our needs.

@lyanco
Copy link
Collaborator

lyanco commented Jul 26, 2023

Got it! Thank you for the context on your use case, that makes sense.

As you said, the issue with stackdriver-exporter likely boils down to OSS Prometheus having no idea how to scrape DELTAs natively... But yeah, not our code 🙂

I don't see any issue with this PR from a goals/objective perspective but I'll let the devs weigh in on correctness etc.

@anas-aso anas-aso changed the base branch from main to StevenYCChou/init-removing-tsdb July 26, 2023 22:45
@anas-aso anas-aso changed the base branch from StevenYCChou/init-removing-tsdb to main July 26, 2023 22:46
@bwplotka bwplotka self-requested a review July 27, 2023 14:15
@anas-aso
Copy link
Author

Just an update from my side, I opened another PR for Thanos to allow disabling sending those parameters altogether: thanos-io/thanos#6560. I believe dropping them from the source would be better and will keep prometheus-engine-frontend cleaner.

@bwplotka
Copy link
Collaborator

Thanks!

From the dev site, we are happy to accept this PR, in fact we could go as far as exposing Thanos StoreAPI gRPC at some point.

However, to merge this one we need some unit test and perhaps some flag like --strip-thanos-unsupported-parameters or so that will enable this functionality. We can fix source as well, but not a biggie.

@bwplotka
Copy link
Collaborator

Commented on thanos-io/thanos#6560 (review) nevertheless

@anas-aso
Copy link
Author

Thanks @bwplotka for the feedback. I personally think the PR I opened on Thanos repo is a better way of supporting my use case. If it gets merged, then I will close this PR.

@bwplotka
Copy link
Collaborator

sgtm!

@anas-aso
Copy link
Author

closing this in favor of thanos-io/thanos#6560

@anas-aso anas-aso closed this Jul 27, 2023
@anas-aso anas-aso deleted the support-thanos-ruler branch July 27, 2023 21:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants