Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add envoy stat_prefix support for httpproxy #5535

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

therealak12
Copy link
Contributor

@therealak12 therealak12 commented Jun 29, 2023

This pull request addresses issue#4637.
I'm not sure if this is a perfect implementation but I thought receiving feedback could help me implement it more properly.

Currently it adds a field in the envoy stanza of contour configuration, which, if set, will also be set in the envoy route config.

Another possible implementation would be to add a field to HTTPProxy spec, allowing users to set different StatPrefixes per route.

Also, as a third way to implement, we can set resource.name._resource.kind_resource.namespace as the stat_prefix!

@therealak12 therealak12 requested a review from a team as a code owner June 29, 2023 19:05
@therealak12 therealak12 requested review from tsaarni and stevesloka and removed request for a team June 29, 2023 19:05
@github-actions
Copy link

Hi @therealak12! Welcome to our community and thank you for opening your first Pull Request. Someone will review it soon. Thank you for committing to making Contour better. You can also join us on our mailing list and in our channel in the Kubernetes Slack Workspace

@therealak12
Copy link
Contributor Author

therealak12 commented Jul 1, 2023

Screenshot from 2023-06-29 17-51-42

Here is a demo of the current implementation. There were 2 httpproxies in the cluster with example.com and example.com.ir hosts at the time.

@izturn
Copy link
Member

izturn commented Jul 4, 2023

I prefer this to xref: #5539

@izturn izturn requested a review from sunjayBhatia July 4, 2023 03:28
@izturn izturn added area/metrics Issues or PRs related to exposing time series metrics. release-note/small A small change that needs one line of explanation in the release notes. labels Jul 4, 2023
@izturn izturn requested review from skriss and removed request for stevesloka July 4, 2023 03:30
Copy link
Member

@sunjayBhatia sunjayBhatia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we think users will need to configure this per-route stat prefix? Is there any value in making this configurable? If not we can probably get away with just adding a boolean flag and configuring the stat prefix using some details of the Envoy (hashing the route match parameters for example) route to ensure uniqueness/ability to differentiate between different routes

In general this seems like it achieves what #4637 requires, an opt-in ability to signal Envoy to collect per-route upstream stats

@therealak12
Copy link
Contributor Author

Do we think users will need to configure this per-route stat prefix? Is there any value in making this configurable? If not we can probably get away with just adding a boolean flag and configuring the stat prefix using some details of the Envoy (hashing the route match parameters for example) route to ensure uniqueness/ability to differentiate between different routes

In general this seems like it achieves what #4637 requires, an opt-in ability to signal Envoy to collect per-route upstream stats

This PR does something like what you've explained.

@sunjayBhatia
Copy link
Member

Do we think users will need to configure this per-route stat prefix? Is there any value in making this configurable? If not we can probably get away with just adding a boolean flag and configuring the stat prefix using some details of the Envoy (hashing the route match parameters for example) route to ensure uniqueness/ability to differentiate between different routes
In general this seems like it achieves what #4637 requires, an opt-in ability to signal Envoy to collect per-route upstream stats

This PR does something like what you've explained.

I think it would be useful to modify that PR with a generated stat prefix that differentiates between different Envoy routes (maybe in addition to or instead of the existing stat prefix that differentiates between originating resources) and test it out with a couple different HTTPProxies/Ingresses/HTTPRoutes with multiple route rules

(I would guess that using different stat prefixes could possibly expand the memory footprint that was analysed in #4637 as well, but would have to confirm to be sure)

cc @izturn @wilsonwu from #4637 is the goal to see stats between different route matches (e.g. see the upstream stats for requests to /foo vs /bar), if so I don't think the existing PRs do so?

@therealak12
Copy link
Contributor Author

@sunjayBhatia
Does it look good to you?

image

Please don't worry about the prefix section for now. It is just a temporary thing that I put in place to make the PoC work. I will improve the String() method if it is accepted.

These are the stats for an httpproxy with the following spec:

routes:
- conditions:
  - prefix: /etc
  permitInsecure: true
  services:
  - name: python-http-server
    port: 8000
- conditions:
  - prefix: /tmp
  permitInsecure: true
  services:
  - name: python-http-server
    port: 8000
virtualhost:
  fqdn: some-random-path.apps.private.okd4.ts-2.staging-snappcloud.io

It sets StatPrefix to a string built like this:

fmt.Sprintf("%s_%s_%s_%s", httpRoute.GetNamespace(), "httproute", httpRoute.GetName(), r.PathMatchCondition.String())

@sunjayBhatia
Copy link
Member

this seems more usable and now does separate the stats for different route rules

we'll have to figure out a stat prefix we can generate that can express uniqueness, but also can be used by an operator to figure out what route theyre looking at (e.g. not just an opaque hash)

would love to get @izturn and @wilsonwu's thoughts since they originally requested this

@therealak12
Copy link
Contributor Author

therealak12 commented Jul 14, 2023

This seems more usable and now does separate the stats for different route rules

We'll have to figure out a stat prefix we can generate that can express uniqueness, but also can be used by an operator to figure out what route they're looking at (e.g. not just an opaque hash)

would love to get @izturn and @wilsonwu's thoughts since they originally requested this

We had a conversation about the hash which I've replied here:
https://kubernetes.slack.com/archives/C8XRH2R4J/p1689159578352139?thread_ts=1688670204.661949&cid=C8XRH2R4J

It seems the 60 characters limit can be safely removed now, and then we'll have the full hostname available in the stats.

@therealak12
Copy link
Contributor Author

One point to add here,

It's not that trivial to include the matchCondition in the stats.

There are various cases such as regex, exact, and prefix match. To distinguish between two routes, one having exactMatch set to /sample and the other having prefix set to /sample we should include the type of the match condition in the stat too.

Besides, there are query and header match conditions too.

A possible format for the stat_prefix would be:

namespace_ingressResourceType_ingressResourceName_matchType_matchValue

where matchType can be one of regex, prefix, or exact.

Although this supports all 3 match types, it doesn't count headers or query matches in. Trying to put those in the prefix makes it too complex and even useless.

It would be nice if we agree on a format before going on.

@sunjayBhatia
Copy link
Member

It's not that trivial to include the matchCondition in the stats.

Yep agreed, especially as there can be multiple match conditions in a route and these conditions can be complex to serialize into a string that isnt incredibly long

Some notes for discussion:

  • We could generate a name (based on the originating k8s resource name/route index or a hash based on match conditions) for each Envoy Route (see field here) which we do not do today and also use this as a stat prefix, which would mean users could look this up in their HTTPProxy/Ingress or Envoy config dump to coordinate with to see which stats line up with what
  • This sort of goes against Contours goal to abstract Envoy-isms from the user, but we could make the Route stat prefix configurable in HTTPProxy (this means Ingress/Gateway API HTTPRoute don't have this option)
  • It really depends what users actually will find useful, do we need the granularity of per-Route rule stats or is it enough to know what HTTPProxy/Ingress/HTTPRoute stats may come from?

@izturn
Copy link
Member

izturn commented Jul 19, 2023

@therealak12 , thx for your pr, i am ok with it

@sunjayBhatia
Copy link
Member

@therealak12 , thx for your pr, i am ok with it

can you clarify a bit more @izturn how do you intend to use the new stats? specifically around the last bullet here: #5535 (comment)

@wilsonwu
Copy link
Member

Sorry for the late reply, I think per route stats is enough for our current situation, if some use cases for per virtualhost, just calculated by metrics system is OK.

@izturn
Copy link
Member

izturn commented Aug 7, 2023

@sunjayBhatia Sorry for the late reply, As @wilsonwu said, I think per route stats is enough for our use case

@therealak12
Copy link
Contributor Author

As @sunjayBhatia pointed above, the per-route metrics are located at the highest level of the granularity pyramid:

do we need the granularity of per-Route rule stats

So let's discuss whether per-virtualhost metrics is enough or if we should agree on a format for presenting per-route metrics. In our use case, we are happy with only including the prefix from matchConditions. In general, there are several matchConditions that can differentiate routes. Including all of them results in excessively lengthy metric names.

Your valuable comment will be crucial in reaching the best decision. Thanks in advance.

@wilsonwu
@izturn

@wilsonwu
Copy link
Member

wilsonwu commented Aug 8, 2023

As @sunjayBhatia pointed above, the per-route metrics are located at the highest level of the granularity pyramid:

do we need the granularity of per-Route rule stats

So let's discuss whether per-virtualhost metrics is enough or if we should agree on a format for presenting per-route metrics. In our use case, we are happy with only including the prefix from matchConditions. In general, there are several matchConditions that can differentiate routes. Including all of them results in excessively lengthy metric names.

Your valuable comment will be crucial in reaching the best decision. Thanks in advance.

@wilsonwu @izturn

Thanks for these information, I our use case, we make envoy as an API gateway, so users pay more attention to the API, and as my early reply I thnk if we have per route metrics, the per virtualhost and per envoy metrics can be caculated be themselves, anyway, support per virtualhost metrics directly is better.

@therealak12
Copy link
Contributor Author

@sunjayBhatia
Based on the discussion, it seems providing per-virtual-host metrics is enough. I'd like to hear your thoughts on this.

Signed-off-by: Ahmad Karimi <ak12hastam@gmail.com>
Signed-off-by: Ahmad Karimi <ak12hastam@gmail.com>
Signed-off-by: Ahmad Karimi <ak12hastam@gmail.com>
@therealak12 therealak12 reopened this Dec 23, 2023
@sunjayBhatia sunjayBhatia requested review from a team and clayton-gonsalves and removed request for a team December 23, 2023 11:43
@therealak12
Copy link
Contributor Author

I am checking this out locally and the first comment is that we should also add this flag in contour/pkg/config/parameters.go so users can configure this with contour-config.yaml

I've just made a change to address your comment. I'd appreciate it if you take a look. Thanks.

Copy link

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 14d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

  • Ensure your PR is passing all CI checks. PRs that are fully green are more likely to be reviewed. If you are having trouble with CI checks, reach out to the #contour channel in the Kubernetes Slack workspace.
  • Mark this PR as fresh by commenting or pushing a commit
  • Close this PR
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

@github-actions github-actions bot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 18, 2024
@dnbn
Copy link

dnbn commented Feb 7, 2024

Thanks for the PR @therealak12, I'am also very much interested by this feature.

Copy link

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 14d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

  • Ensure your PR is passing all CI checks. PRs that are fully green are more likely to be reviewed. If you are having trouble with CI checks, reach out to the #contour channel in the Kubernetes Slack workspace.
  • Mark this PR as fresh by commenting or pushing a commit
  • Close this PR
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 25, 2024
@izturn izturn removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 13, 2024
Copy link

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 14d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

  • Ensure your PR is passing all CI checks. PRs that are fully green are more likely to be reviewed. If you are having trouble with CI checks, reach out to the #contour channel in the Kubernetes Slack workspace.
  • Mark this PR as fresh by commenting or pushing a commit
  • Close this PR
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

@github-actions github-actions bot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 28, 2024
Copy link

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 14d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

  • Ensure your PR is passing all CI checks. PRs that are fully green are more likely to be reviewed. If you are having trouble with CI checks, reach out to the #contour channel in the Kubernetes Slack workspace.
  • Mark this PR as fresh by commenting or pushing a commit
  • Close this PR
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

@github-actions github-actions bot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 12, 2024
@chaosbox
Copy link

Thankyou for the PR, and have been waiting for a while now, we would be happy to test / provide initial feedback if reqd.

@izturn
Copy link
Member

izturn commented Apr 28, 2024

any progress?

@therealak12
Copy link
Contributor Author

therealak12 commented Apr 28, 2024

Dear @izturn @chaosbox,

This feature is a controversial one. We've proceeded with a custom implementation that satisfies our minimum requirements on a separate fork.
IMO, implementing a format that satisfies every project's requirements is not possible.

Copy link

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 14d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

  • Ensure your PR is passing all CI checks. PRs that are fully green are more likely to be reviewed. If you are having trouble with CI checks, reach out to the #contour channel in the Kubernetes Slack workspace.
  • Mark this PR as fresh by commenting or pushing a commit
  • Close this PR
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/metrics Issues or PRs related to exposing time series metrics. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. release-note/small A small change that needs one line of explanation in the release notes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants