Enhancing metric observability #2791
Labels
Priority 1: Must
Type: Enhancement
After two separate instances of providing training and on-site support for performance in AF applications, I believe there are things we can do to improve the observability.
We should separate the observability into two different categories: capacity and timings.
The reason for this separation is that timings can be very hard to alert on. They can fluctuate heavily, which is why we usually express them as percentiles, e.g. 95% of requests are done within 200ms.
However, this value fluctuating is not a problem, as long as it doesn't cause the capacity to be reached. When you reach the capacity (or want to optimize) you start investigating.
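To make the percentile statement concrete, here is a minimal sketch of a nearest-rank percentile check over recorded request durations. The class and method names (`LatencyPercentile`, `percentile`, `meetsTarget`) are hypothetical and not part of any framework API; a real setup would normally read percentiles from its metrics library instead.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not an Axon Framework class.
final class LatencyPercentile {

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * `percent` percent of all samples are less than or equal to it.
     */
    static long percentile(long[] samplesMillis, int percent) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (percent < 1 || percent > 100) {
            throw new IllegalArgumentException("percent must be in 1..100");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        // Integer ceiling of percent * n / 100, which avoids floating-point rounding.
        int rank = (percent * sorted.length + 99) / 100;
        return sorted[rank - 1];
    }

    /** True when `percent` percent of requests finished within the threshold. */
    static boolean meetsTarget(long[] samplesMillis, int percent, long thresholdMillis) {
        return percentile(samplesMillis, percent) <= thresholdMillis;
    }
}
```

With 19 requests at 100ms and one outlier at 5000ms, `meetsTarget(samples, 95, 200)` holds even though the slowest request is far over the threshold, which illustrates why a single fluctuating timing is a poor alerting signal.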
Capacity
The current way to measure capacity for commands and events is the capacity metric. This is the number of threads that were busy (on average) over the last 10 minutes. There are a few problems with this metric:
MonitorCallback is not called, so the time taken is not registered.
I want to propose to:
Capacity monitoring for event processors is good (using eventprocessor latency). We lack any autoscaling capabilities though! And I would like monitoring on the PSEP thread pool, just like in the buses.
I have an idea for the autoscaling. Expect a blog soon.
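For reference, the "average number of busy threads over the last 10 minutes" idea behind the capacity metric can be sketched as total processing time inside the window divided by the window length. This is an illustration under assumptions, not the framework's actual implementation: `BusyThreadEstimator` and its methods are hypothetical names, samples are assumed to be recorded in order of completion, and a task's full duration is attributed to the window in which it finished.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch; not the Axon Framework capacity monitor.
final class BusyThreadEstimator {

    // Each entry is {finishedAtMillis, durationMillis}.
    private final Deque<long[]> samples = new ArrayDeque<>();
    private final long windowMillis;

    BusyThreadEstimator(Duration window) {
        this.windowMillis = window.toMillis();
    }

    /** Record one completed message: when it finished and how long it took. */
    synchronized void record(long finishedAtMillis, long durationMillis) {
        samples.addLast(new long[] {finishedAtMillis, durationMillis});
    }

    /** Average number of threads that were busy during the window ending at nowMillis. */
    synchronized double averageBusyThreads(long nowMillis) {
        long cutoff = nowMillis - windowMillis;
        // Drop samples that finished before the window started.
        while (!samples.isEmpty() && samples.peekFirst()[0] < cutoff) {
            samples.removeFirst();
        }
        long busyMillis = 0;
        for (long[] sample : samples) {
            busyMillis += sample[1];
        }
        return (double) busyMillis / windowMillis;
    }
}
```

Under these assumptions, 1200 seconds of total processing time inside a 600-second window reads as 2.0 busy threads. Any work whose duration is never recorded is invisible to this number, which is the kind of gap described above.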
Timings
The timings are already very good. There are some things we can improve there: