Metrics and settings for last sent/received files #824
The sr configuration language is easy for other tools to adopt, because sr3 dump writes out the entire config as JSON. So someone building a Columbo-like tool could either read in the output of dump, or they might be able
OK, got a first patch that writes the metrics, they look like this:
the metric is: transferRxLast
There is now a "last" column indicating how long ago something was last transferred (in seconds before the present).
Sample use in sr3 status of the metrics produced above.
The last column is great, but right now there's nothing to interpret that number. If sr3 showed a status of "late" for configs whose last value exceeds a threshold, then it would be extremely easy for us to monitor with Nagios right away. Eventually we could get fancier monitoring. I reopened the issue to discuss it. Maybe we want the interpretation done exclusively outside of sr3; that's okay, and we can close this again.
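A minimal sketch of how the interpretation could be done outside of sr3 by a Nagios-style check, assuming the age in seconds has already been read from the last column. The function name `check_last` and both thresholds are hypothetical; only the exit codes follow the usual Nagios plugin convention.

```python
# Nagios plugin exit codes (standard convention).
OK, WARNING, CRITICAL = 0, 1, 2


def check_last(age_seconds, warn_after=300.0, crit_after=900.0):
    """Map the seconds elapsed since the last transfer to a Nagios exit code.

    warn_after and crit_after are illustrative values, not sr3 defaults.
    """
    if age_seconds > crit_after:
        return CRITICAL
    if age_seconds > warn_after:
        return WARNING
    return OK
```

A wrapper script would pass the result to `sys.exit()` so that Nagios picks up the state.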
Also, for components that don't transfer files, it would be nice to have the 'last' metric calculated from the last post (and if a VIP is set in the config, only the instance with the VIP matters for last post time; wVip instances could continue to show
I chose "late" because it's 4 letters and fits with the other states reported by sr3 status. But @MagikEh is worried about the word:
Ah... so report things that we think are a problem by giving them a different status... interesting. If the latency (as reported by LagAvg) is over a threshold, should that trigger "late" as well? Also, I think a high rejection rate should trigger a "bad" status of some kind. We could call that "poor", for poorly tuned. sr3 --diag status would then show only the configs in states other than running or stopped (the bad ones): partial, late, poor ...
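To make the question concrete, here is a sketch of one possible answer, in which high latency also counts as "late" and a high rejection rate gives "poor". The function `classify`, its parameter names, and every threshold are assumptions made for illustration; none of this is taken from the sr3 code.

```python
def classify(last_age, lag_avg, reject_rate,
             max_last=600.0, max_lag=30.0, max_reject=0.8):
    """Derive a display state from three metrics (all limits hypothetical).

    last_age:    seconds since the last transfer
    lag_avg:     average latency in seconds
    reject_rate: fraction of messages rejected, between 0.0 and 1.0
    """
    if last_age > max_last or lag_avg > max_lag:
        return "late"
    if reject_rate > max_reject:
        return "poor"
    return "running"
```

Checking "late" first means a flow that is both late and poorly tuned shows as "late"; that ordering is a design choice, not something settled in this thread.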
Or is calling them all "late" and then just grepping for late good enough?
We might also need settings to indicate what a normal "lastTransfer" is...
So have a system-wide default, and be able to tune it for certain flows.
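The default-plus-override idea can be sketched as a simple lookup. The constant, the function name, and the flow names in the usage note are all hypothetical; how sr3 would actually store such settings is not decided here.

```python
DEFAULT_MAX_LAST = 3600.0  # hypothetical system-wide default, in seconds


def max_last_for(flow_name, overrides, default=DEFAULT_MAX_LAST):
    """Return the threshold for one flow, falling back to the default."""
    return overrides.get(flow_name, default)
```

For example, with `overrides = {"sender/hourly_feed": 7200.0}`, that one flow gets a two hour limit while every other flow keeps the default.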
"slow" is a four-letter word.
Thinking: running means no problems identified, so you do an sr3 status | grep -v run and you should get all the stopped, part, missing, idle, and slow ones.
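The same filter as the grep above, sketched for a monitoring tool that already holds each config's state in a dictionary. The function name and the config names in the usage note are made up for illustration; the state names are the ones mentioned in this thread.

```python
def needs_attention(states):
    """Keep only the configs whose state is something other than running.

    states maps a config name to its state string. Matching on the "run"
    prefix accepts both "run" and "running", like grep -v run would.
    """
    return {name: state for name, state in states.items()
            if not state.startswith("run")}
```

For instance, `needs_attention({"sender/a": "running", "subscribe/b": "slow"})` keeps only the `subscribe/b` entry.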
OK, there is a branch, issue824_morestates, that implements this... sort of... but it doesn't pass any tests.
I was wrong... somehow changing the display broke the flow tests. This will take some time.
Many derived values are calculated from the raw metrics inside status(). Moving these calculations to the _resolve() routine makes the results available to sr overview and sr dump, and provides a foundation for more refined running states, as per #824.
FWIW, somehow this change broke "stop" and tickled all kinds of latent vulnerabilities with sanity, cleanup and stop. Working through them... What I have would likely pass the flow tests, but fail in practice for operations like sr3 stop. This work is on a second branch: issue824_morestates_2ndTry
OK, so the problem was that not all running states were being identified (idle and retry were missing), and that was causing all manner of havoc. Now the static and flakey tests pass. Dynamic is... interesting. I've noticed that sr3 sanity starts even stopped processes, which is not correct.
One of the useful features in Columbo is that it shows when each config either last received a file or last transmitted a file. Adding this information to the metrics could make it easier to develop a monitoring tool for sr3.
Columbo uses its own config file to set limits on how long is too long since last sending/receiving a file. I think it would be better to put these limits directly in the sr3 config for at least two reasons:
For example:
metrics_maxDelayRecv: when more time than this option's value has elapsed since the last accepted message, this config is "late".
metrics_maxDelaySend: when more time than this option's value has elapsed since the last message was posted or file was sent, this config is "late".
sr3 status already takes up a lot of space on the screen, I'm not sure if it's worth trying to squeeze in detailed information about the last send/receive time, but maybe it could show "late" if a file hasn't been sent or received within the defined time limits.
I think the last reception time is always the time of the last accepted message, but the last transmitted time is a little less obvious. For configs that post, using the last post time should work, but for senders that don't post, we would have to use the last successful file send time.
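The rule for picking the last transmitted time could be sketched as below, together with a lateness test against a limit such as the proposed metrics_maxDelaySend. Both function names are hypothetical, timestamps are assumed to be epoch seconds, and treating a config that has never transmitted as not late is an assumption rather than a settled decision.

```python
import time


def last_transmit_time(last_post, last_send):
    """Prefer the last post time; fall back to the last successful send.

    Either argument may be None when that event has never happened.
    """
    return last_post if last_post is not None else last_send


def is_late(last_time, max_delay, now=None):
    """True when more than max_delay seconds have passed since last_time.

    Assumption: a config with no recorded time is not reported as late.
    """
    if last_time is None:
        return False
    if now is None:
        now = time.time()
    return (now - last_time) > max_delay
```

A sender that does not post would call `last_transmit_time(None, last_send)` and so be judged on its last successful file send.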