Publish more prometheus stats throughout system #206

Open
9 of 29 tasks
allada opened this issue Jul 19, 2023 · 4 comments

Comments

@allada
Member

allada commented Jul 19, 2023

Now that Prometheus is added and the API is established, we need to spread the usage around the system.

@chrisstaite
Contributor

That's quite the project. I tried to grab some stats today; after enabling the endpoint I just got #EOF, but that's probably because it was on the scheduler, which doesn't have anything logging metrics yet...

I did want to try and get some metrics off the scheduler based on the user, but Goma doesn't appear to pass the authenticated user through to the RBE. I need to look into that.

@allada
Member Author

allada commented Jul 19, 2023

I'm about to send the behemoth PR to enable metrics for workers. I suggest waiting a little to see how that PR looks before starting to implement it.

As for enabling it for a specific user... this is going to be tricky: we'd have to pass down a lot of context to enable it for a specific strand. Right now we always gather metrics regardless of which user it is.

If this feature is really requested, here's a thought...

  • Disable metrics globally (this would be trivial, since everything is currently wrapped in prometheus_utils.rs): set a global thread_local flag that prevents metric collection, controlled by an env flag, or disable it at runtime with a service endpoint.
  • Add a special endpoint where you can register specific IP addresses or endpoints that should trigger a "debug" mode.
  • When users connect, check whether they are in the list of "debug" users. If they are, create a new thread, manually set up a new tokio runtime for that thread, and set the thread_local flag to capture metrics.

In theory, we'd then capture metrics only for debug endpoints, at near-zero runtime cost.
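
A rough sketch of the thread_local idea, using only the standard library. Every name here (`METRICS_ENABLED`, `record_request`, `handle_client`, `REQUESTS_SEEN`) is made up for illustration and is not taken from prometheus_utils.rs. It also inverts the flag so that collection is off by default and switched on per thread. A real version would presumably build a current-thread tokio runtime inside the dedicated thread so that the client's tasks stay on the thread where the flag is set; a plain `std::thread` stands in for that here.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

thread_local! {
    // Hypothetical per-thread switch. Off by default, so ordinary
    // threads skip metric collection entirely.
    static METRICS_ENABLED: Cell<bool> = Cell::new(false);
}

// Hypothetical stand-in for a real metric.
static REQUESTS_SEEN: AtomicU64 = AtomicU64::new(0);

// Turns collection on or off for the calling thread only.
fn set_metrics_enabled(enabled: bool) {
    METRICS_ENABLED.with(|flag| flag.set(enabled));
}

// Records one request, but only if the calling thread opted in.
fn record_request() {
    if METRICS_ENABLED.with(|flag| flag.get()) {
        REQUESTS_SEEN.fetch_add(1, Ordering::Relaxed);
    }
}

fn requests_seen() -> u64 {
    REQUESTS_SEEN.load(Ordering::Relaxed)
}

// Serves one client. "Debug" clients get a dedicated thread with the
// flag set; everyone else is served on the calling thread with the
// flag left off.
fn handle_client(is_debug_client: bool) {
    if is_debug_client {
        thread::spawn(|| {
            set_metrics_enabled(true);
            record_request();
        })
        .join()
        .expect("debug client thread panicked");
    } else {
        record_request();
    }
}
```

With this shape, the cost for non-debug clients is a single thread-local read per metric call, which is where the "near zero runtime cost" would come from.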

@allada
Member Author

allada commented Jul 20, 2023

> I did want to try and get some metrics off the scheduler based on the user

I started some of the work here. To get it fully to what you want, we still need three things: a way to reset metrics (which should also be trivial, since we already have a visitor), an endpoint to trigger the reset, and a way to create a new thread and runtime for specific attached clients.
#215

@blakehatch
Member

Taking a crack at "Total GRPC connections since server started":

  • Tracking how many clients have connected, using a lazily initialized counter for thread safety.
  • Publishing just the total number of client connections. I'm not sure whether other metrics are wanted, as in "Prometheus now publishes connected clients" (#230). I don't want it to get bloated, since it never removes any clients, unlike when only active connections are recorded.
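
For reference, a lazily initialized, thread-safe total could look roughly like the sketch below. It uses `std::sync::OnceLock` (Rust 1.70+) and an atomic counter; the names `TOTAL_CONNECTIONS`, `on_client_connected`, and `total_connections` are hypothetical and not taken from the codebase. Since `AtomicU64::new` is a `const fn`, a plain static would work just as well here; the `OnceLock` is only there to show the lazy-initialization shape.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::OnceLock;

// Hypothetical process-wide counter, created on first use.
static TOTAL_CONNECTIONS: OnceLock<AtomicU64> = OnceLock::new();

fn total_connections_counter() -> &'static AtomicU64 {
    TOTAL_CONNECTIONS.get_or_init(|| AtomicU64::new(0))
}

// Call once per accepted connection. Returns the new running total.
// The counter only ever grows, so it stays a single integer no matter
// how many clients come and go.
fn on_client_connected() -> u64 {
    total_connections_counter().fetch_add(1, Ordering::Relaxed) + 1
}

// Value that would be published as the "total connections" metric.
fn total_connections() -> u64 {
    total_connections_counter().load(Ordering::Relaxed)
}
```

Because only a count is stored and nothing is kept per client, the metric itself cannot grow without bound even though disconnects are never subtracted.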
