Publish more prometheus stats throughout system #206

allada · 2023-07-19T01:13:45Z

chrisstaite · 2023-07-19T21:34:05Z

That's quite the project. I tried to grab some stats today, but after enabling the endpoint I just got #EOF. That's probably because it was on the scheduler, which doesn't have anything logging any metrics yet...

I did want to try and get some metrics off the scheduler based on the user, but Goma doesn't appear to pass the authenticated user through to the RBE. I need to look into that.

allada · 2023-07-19T22:11:30Z

I'm about to send the behemoth PR to enable metrics for workers. I suggest waiting a little to see how that PR looks before starting to implement it.

As for enabling it for a specific user... this is going to be tricky. We'd have to pass down a lot of context to enable it for a specific strand. Right now we always gather metrics, regardless of which user it is.

If this feature is really requested, here's a thought...

1. Disable metrics globally (this would be trivial, since everything is currently wrapped in prometheus_utils.rs): set a global thread_local flag that prevents metric collection, controlled by an env flag or disabled at runtime through a service endpoint.
2. Add a special endpoint where you can register specific IP addresses or endpoints to put them into a "debug" mode.
3. When users connect, check whether they are in the list of "debug" users. If they are, create a new thread, manually set up a new tokio runtime for that thread, and set the thread_local flag to capture metrics.

In theory we'd only capture metrics for debug endpoints in that case with near zero runtime cost.
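To make the idea concrete, here is a minimal sketch of the thread-local gate using only the standard library. The names `COLLECT_METRICS`, `REQUESTS_SEEN`, `record_request`, and `run_in_debug_thread` are hypothetical and are not taken from prometheus_utils.rs. The per-thread tokio runtime from the last step is left out so the sketch has no external dependencies; a real version would presumably build the runtime inside the spawned thread.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

thread_local! {
    // Hypothetical per-thread switch: collection is off unless a thread opts in.
    static COLLECT_METRICS: Cell<bool> = Cell::new(false);
}

// Hypothetical counter standing in for a real metric.
static REQUESTS_SEEN: AtomicU64 = AtomicU64::new(0);

/// Records one request, but only on threads that have opted in.
/// On every other thread this is a thread-local read and a branch.
fn record_request() {
    if COLLECT_METRICS.with(|flag| flag.get()) {
        REQUESTS_SEEN.fetch_add(1, Ordering::Relaxed);
    }
}

/// Runs `work` on a dedicated thread with metric collection turned on,
/// and waits for it to finish.
fn run_in_debug_thread<F>(work: F)
where
    F: FnOnce() + Send + 'static,
{
    thread::spawn(move || {
        // Only this thread sees the flag as set.
        COLLECT_METRICS.with(|flag| flag.set(true));
        work();
    })
    .join()
    .expect("debug thread panicked");
}
```

Calling `record_request()` on an ordinary thread leaves `REQUESTS_SEEN` untouched, while the same call made inside `run_in_debug_thread` increments it. One caveat with this approach: work that migrates to another thread (for example on a multi-threaded runtime) would not carry the flag with it, which is presumably why a dedicated runtime per debug thread is suggested.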

allada · 2023-07-20T17:14:38Z

> I did want to try and get some metrics off the scheduler based on the user

I started some of the work here. To get it fully to what you want, we need a way to reset metrics (this should also be trivial, since we already have a visitor), an endpoint to trigger the reset, and a way to create a new thread and runtime for specific attached clients.
#215

blakehatch · 2023-09-27T06:18:27Z

Taking a crack at "Total GRPC connections since server started":

I'm tracking how many clients have connected, using a lazily initialized counter for thread safety.
This publishes just the total number of client connections. I'm not sure whether other metrics are wanted, as in "Prometheus now publishes connected clients" (#230). I don't want it to be bloated, since it never removes any clients, unlike a metric that records only active connections.
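As a rough illustration of that approach (not the code from the PR), a lazily initialized, thread-safe counter can be sketched with the standard library alone. `TOTAL_CONNECTIONS`, `on_client_connected`, and `total_connections` are hypothetical names, and hooking this into the gRPC server and the Prometheus registry is omitted.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::OnceLock;

// Hypothetical global: created on first use, then shared by all threads.
static TOTAL_CONNECTIONS: OnceLock<AtomicU64> = OnceLock::new();

/// Returns the shared counter, initializing it to zero on first access.
fn total_connections_counter() -> &'static AtomicU64 {
    TOTAL_CONNECTIONS.get_or_init(|| AtomicU64::new(0))
}

/// Call once per accepted client connection. The value only grows,
/// since disconnects are intentionally not tracked.
fn on_client_connected() {
    total_connections_counter().fetch_add(1, Ordering::Relaxed);
}

/// Total number of connections seen since the process started.
fn total_connections() -> u64 {
    total_connections_counter().load(Ordering::Relaxed)
}
```

Because the value is monotonically increasing, it maps naturally onto a Prometheus counter rather than a gauge. In this reduced form a plain `static AtomicU64` would work as well, since `AtomicU64::new` is a `const fn`; lazy initialization matters more when the metric object has to be constructed or registered at runtime.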

blakehatch mentioned this issue Sep 27, 2023

Prometheus now publishes number of clients connected since server started and the timestamp when the server starts #298

Merged
