Metrics and settings for last sent/received files #824

Closed
reidsunderland opened this issue Nov 15, 2023 · 15 comments · Fixed by #871 or #1067
Labels
enhancement (New feature or request) · Priority 3 - Important (worrisome impact... should study carefully) · Sugar (nice to have features... not super important) · v3 (issue deferred to or affects version 3) · wishlist (would be nice, not pressing for any particular client)

Comments

@reidsunderland
Member

One of the useful features in Columbo is that it shows when each config either last received a file or last transmitted a file. Adding this information to the metrics could make it easier to develop a monitoring tool for sr3.

Columbo uses its own config file to set limits on how long is too long since last sending/receiving a file. I think it would be better to put these limits directly in the sr3 config for at least two reasons:

  1. it's more intuitive (to me) to set these limits directly in the config file. It's easy to forget about Columbo's maxSettings.conf when setting up or modifying a Sundew config.
  2. sr3 can know when something is wrong and report it.

For example:

  • metrics_maxDelayRecv: when more time than this option's value has elapsed since the last accepted message, this config is "late".
  • metrics_maxDelaySend: when more time than this option's value has elapsed since the last message was posted or file was sent, this config is "late".
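For illustration, a rough sketch of how this might look in a config file. The option names are only the proposal above, not existing sr3 settings, and the duration syntax is assumed:

    # hypothetical options, not yet implemented in sr3
    metrics_maxDelayRecv 1h     # "late" if no message has been accepted for an hour
    metrics_maxDelaySend 30m    # "late" if nothing has been posted/sent for 30 minutes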

sr3 status already takes up a lot of space on the screen, so I'm not sure it's worth trying to squeeze in detailed information about the last send/receive time, but maybe it could show "late" when a file hasn't been sent or received within the defined time limits.

I think the last reception time is always the time of the last accepted message, but the last transmitted time is a little less obvious. For configs that post, using the last post time should work, but for senders that don't post, we would have to use the last successful file send time.

reidsunderland added the enhancement, v3, wishlist and Sugar labels on Nov 15, 2023
@petersilva
Contributor

Using the sr configuration language is easy for other tools to adopt, because when we do sr3 dump, it dumps the entire config as JSON. So someone building a Columbo-like tool could either read in the output of dump, or they might be able to use
the sr_GlobalState class from sr.py... I haven't looked into how to do that, so it might mean some mods, but in principle, building things that understand the config of an entire node should not be hard.
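As a rough illustration of the first approach, a minimal Python sketch of a Columbo-like tool consuming the dump. It assumes the dump can be captured as a single JSON document on stdout; the loop is only a placeholder, since the actual layout of the dump is not shown here:

    import json
    import subprocess

    # Hypothetical sketch: capture "sr3 dump" output and parse it as JSON.
    raw = subprocess.run(["sr3", "dump"], capture_output=True, text=True, check=True).stdout
    node_config = json.loads(raw)

    # A Columbo-like tool would walk the per-config entries from here.
    for name, section in node_config.items():
        print(name, type(section).__name__)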

petersilva added the Priority 3 - Important label on Dec 4, 2023
@petersilva
Contributor

OK, got a first patch that writes the metrics; they look like this:


fractal% more sarra_download_f20_01.json
{"flow": {"stop_requested": false, "last_housekeeping": 1702684097.2077858, "transferConnected": false, "transferConnectStart": 0, "transferConnectTime": 0, "transferRxBytes": 256603, "transferTxBytes": 0, "transf
erRxFiles": 87, "transferTxFiles": 0, "next_housekeeping": 1702684157.2077858, "transferRxLast": "20231215T234817.66847229", "msgRate": 8.631587419459613, "current_sleep": 0.1}, "post.message": {"connected": false
, "disconnectLast": 1702684098.7549658, "disconnectTime": 0.0032203197479248047, "disconnectCount": 2, "rxByteCount": 0, "rxGoodCount": 0, "rxBadCount": 0, "txByteCount": 50985, "txGoodCount": 91, "txBadCount": 1}
, "gather.message": {"connected": false, "disconnectLast": 1702684097.7512178, "disconnectTime": 1.0048370361328125, "disconnectCount": 1, "rxByteCount": 98249, "rxGoodCount": 175, "rxBadCount": 0, "txByteCount": 
0, "txGoodCount": 0, "txBadCount": 0, "brokerQueuedMessageCount": 287}, "retry": {"msgs_in_download_retry": 0, "msgs_in_post_retry": 9}, "log": {"lagMax": 87.95920205116272, "lagTotal": 7636.8115911483765, "lagMes
sageCount": 87, "rejectCount": 88}, "sarracenia.transfer": {"byteRateInstant": 1159295.2049689442}}

The new metric is: transferRxLast


fractal% more sender_tsource2send_f50_01.json
{"flow": {"stop_requested": false, "last_housekeeping": 1702684012.0278788, "transferConnected": true, "transferConnectStart": 1702684203.829592, "transferConnectTime": 189.70056891441345, "transferRxBytes": 0, "t
ransferTxBytes": 1009148, "transferRxFiles": 0, "transferTxFiles": 958, "transferTxLast": "20231215T234959.566806793", "msgRate": 5.057996349006494, "current_sleep": 3.2}, "post.message": {"connected": true, "disc
onnectLast": 1702684181.2908318, "disconnectTime": 0.02028059959411621, "disconnectCount": 4, "rxByteCount": 0, "rxGoodCount": 0, "rxBadCount": 0, "txByteCount": 834602, "txGoodCount": 820, "txBadCount": 2}, "gath
er.message": {"connected": true, "disconnectLast": 1702684149.1651616, "disconnectTime": 97.30984234809875, "disconnectCount": 2, "rxByteCount": 909723, "rxGoodCount": 958, "rxBadCount": 0, "txByteCount": 0, "txGo
odCount": 0, "txBadCount": 0, "brokerQueuedMessageCount": 38}, "retry": {"msgs_in_download_retry": 0, "msgs_in_post_retry": 138}, "log": {"lagMax": 97.46203708648682, "lagTotal": 9851.86624956131, "lagMessageCount
": 958, "rejectCount": 0}, "sarracenia.transfer": {"byteRateInstant": 1588481.089361702}}
fractal%

In there, the new metric is:

 "transferTxLast": "20231215T234959.566806793"

@petersilva
Contributor

There is now a Last column indicating the last time something was transferred (shown as the time elapsed since then, in seconds).


fractal% sr3 status
status: 
Component/Config                         Processes   Connection        Lag                              Rates                                        
                                         State   Run Retry  msg data   Queued   LagMax LagAvg  Last  %rej     pubsub   messages     RxData     TxData 
                                         -----   --- -----  --- ----   ------   ------ ------  ----  ----   --------       ----     ------     ------ 
cpost/veille_f34                         run     1/1     0 100%   0%      0    0.00s    0.00s  0.0s  0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
cpump/pelle_dd1_f04                      stop    0/0     0 100%   0%      0    0.00s    0.00s  0.0s 79.2%  0 Bytes/s   8 msgs/s  0 Bytes/s  0 Bytes/s
cpump/pelle_dd2_f05                      stop    0/0     0 100%   0%      0    0.00s    0.00s  0.0s 79.2%  0 Bytes/s   8 msgs/s  0 Bytes/s  0 Bytes/s
cpump/xvan_f14                           hung    1/1     0 100%   0%      0    0.00s    0.00s  0.0s  0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
cpump/xvan_f15                           hung    1/1     0 100%   0%      0    0.00s    0.00s  0.0s  0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
poll/f62                                 run     1/1     0 100%   0%      0    0.42s    0.15s  0.0s  0.0%  1.4 KiB/s   0 msgs/s  0 Bytes/s  0 Bytes/s
post/shim_f63                            stop    0/0          -          -         -     -     -          -        -
post/test2_f61                           stop    0/0     0 100%   0%      0    0.07s    0.06s  0.0s  0.0% 15.8 KiB/s   0 msgs/s  0 Bytes/s  0 Bytes/s
sarra/download_f20                       run     3/3     0 100%   0%      0   11.25s    3.54s  0.4s  0.0%  2.0 KiB/s   1 msgs/s 561.8 KiB/s  0 Bytes/s
sender/tsource2send_f50                  run   10/10     0 100%   7%      0    1.71s    0.68s  2.0s  0.0%  7.4 KiB/s   6 msgs/s  0 Bytes/s 81.2 KiB/s
shovel/pclean_f90                        run     3/3   610 100%   0%      0   81.27s   75.74s  0.0s  6.0%  2.5 KiB/s   1 msgs/s  0 Bytes/s  0 Bytes/s
shovel/pclean_f92                        run     3/3     0 100%   0%      0  137.20s   82.64s  0.0s  0.0%  5.5 KiB/s   3 msgs/s  0 Bytes/s  0 Bytes/s
shovel/rabbitmqtt_f22                    hung    3/3     0 100%   0%      0    2.21s    0.78s  0.0s  0.0%  7.6 KiB/s   6 msgs/s  0 Bytes/s  0 Bytes/s
shovel/t_dd1_f00                         run     3/3     0 100%   0%      2   11.94s    2.61s  0.0s 80.1%  2.7 KiB/s  10 msgs/s  0 Bytes/s  0 Bytes/s
shovel/t_dd2_f00                         run     3/3     0 100%   0%      1   11.68s    2.62s  0.0s 80.1%  2.7 KiB/s  10 msgs/s  0 Bytes/s  0 Bytes/s
subscribe/amqp_f30                       run     3/3     0 100%   0%      0   15.91s    5.01s  0.3s  0.0%  1.5 KiB/s   1 msgs/s 151.8 KiB/s  0 Bytes/s
subscribe/cclean_f91                     run     3/3   265 100%   0%      0    0.00s    0.00s  0.0s 100.0% 33 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
subscribe/cdnld_f21                      run     3/3     0 100%   0%      0   37.46s    7.35s 55.6s  0.3% 964 Bytes/s   1 msgs/s 13.1 KiB/s  0 Bytes/s
subscribe/cfile_f44                      run     3/3     0 100%   0%      0    1.90s    0.21s  1.6s  0.0%  1.1 KiB/s   3 msgs/s 1004 Bytes/s  0 Bytes/s
subscribe/cp_f61                         run     3/3     0 100%   0%      0   13.90s    2.71s  1.1s  0.0%  3.8 KiB/s   6 msgs/s 83.5 KiB/s  0 Bytes/s
subscribe/ftp_f70                        run     3/3    12 100%  42%      0  283.14s    1.84s  1.0s  0.0%  1.6 KiB/s   3 msgs/s  9.0 KiB/s  0 Bytes/s
subscribe/q_f71                          run     3/3    16 100%  35%      0  292.54s    4.73s 16.0s  0.1%  1.3 KiB/s   3 msgs/s 41.3 KiB/s  0 Bytes/s
subscribe/rabbitmqtt_f31                 run     3/3     0 100%   0%     24    3.76s    1.05s  2.2s  0.0%  3.7 KiB/s   6 msgs/s 159.7 KiB/s  0 Bytes/s
subscribe/u_sftp_f60                     run     3/3     0 100%   0%      0   11.92s    1.78s  1.7s  0.0%  3.7 KiB/s   6 msgs/s 83.3 KiB/s  0 Bytes/s
watch/f40                                run     1/1     0 100%   0%      0    0.07s    0.02s  0.0s  0.0%  2.7 KiB/s   0 msgs/s  0 Bytes/s  0 Bytes/s
winnow/t00_f10                           run     1/1     0 100%   0%      0    7.28s    2.83s  0.0s 51.1%  1.5 KiB/s   1 msgs/s  0 Bytes/s  0 Bytes/s
winnow/t01_f10                           run     1/1     0 100%   0%      0    4.29s    1.54s  0.0s 55.6%  1.6 KiB/s   1 msgs/s  0 Bytes/s  0 Bytes/s
      Total Running Configs:  23 ( Processes: 62 missing: 0 stray: 0 )

A sample use of the metrics produced above in sr3 status.

@reidsunderland
Member Author

The Last column is great, but right now there's nothing that interprets that number.

If sr3 showed a status of "late" for configs whose Last value exceeds a threshold, it would be extremely easy for us to monitor with Nagios right away. Eventually we could get fancier monitoring.

I reopened the issue to discuss it. Maybe we want the interpretation done exclusively outside of sr3; if so, that's okay and we can close this again.

@andreleblanc11 @junhu3
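For illustration, a minimal sketch of the Nagios-style check this would allow, assuming a future sr3 status that prints "late" in the State column (that behaviour does not exist yet; this is only a proposal):

    import subprocess
    import sys

    # Hypothetical check: WARNING if any config reports a "late" state.
    out = subprocess.run(["sr3", "status"], capture_output=True, text=True, check=True).stdout
    late = [line.split()[0] for line in out.splitlines() if " late " in line]

    if late:
        print("WARNING: late configs: " + ", ".join(late))
        sys.exit(1)      # Nagios WARNING
    print("OK: no late configs")
    sys.exit(0)          # Nagios OK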

@reidsunderland
Member Author

reidsunderland commented May 16, 2024

Also, for components that don't transfer files, it would be nice to have the 'last' metric calculated from the last post.

(And if a vip is set in the config, only the instance holding the VIP matters for the last post time; wVip instances could continue to show - in sr3 status.)

@reidsunderland
Member Author

I chose "late" because it's four letters and fits with the other states reported by sr3 status.

But @MagikEh is worried about the word:

"Caution to terminology, 'late' in the case of 'no obs' can lead an analyst to looking for data that doesn't exist. I'd prefer something along the lines of 'Unexpected Lapse', 'Missed Milestone', 'Checkpoint', or something to that nature"

@petersilva
Contributor

petersilva commented May 17, 2024

ah... so report things that we think are a problem... by a different status... interesting... If the latency (as reported by LagAvg) is over a threshold, should that trigger "late" as well?

Also, I think a high rejection rate should trigger a "bad" status of some kind. We could call that "poor", for poorly tuned?

sr3 --diag status

would only show us the stuff that is in neither a running nor a stopped state (the bad ones): partial, late, poor...?

@petersilva
Contributor

Or is calling them all "late" and then just grepping for late good enough?

@petersilva
Contributor

We might also need settings to indicate what a normal "lastTransfer" is...

  • lastTransferThresholdLate
  • lagAvgThresholdLate
  • rejectionRateLate/poor

So: have a system-wide default, and be able to tune it for certain flows.

@petersilva
Contributor

slow is a four letter word.

@petersilva
Contributor

petersilva commented May 18, 2024

thinking:

running -- no problems identified.
lag -- latency threshold exceeded: lagThreshold (system-wide default, + per-flow override)
idle -- last transfer too long ago: idleThreshold
slow -- transfer rate too low: slowThreshold

so you do an sr3 status | grep -v run

and you should get all the stopped, part, missing, idle, slow ones.
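A minimal Python sketch of the classification this implies; the threshold names come from the list above, but the default values and the order of the checks are assumptions, not released settings:

    # Hypothetical state classification using the proposed thresholds.
    def flow_state(lag_avg: float, seconds_since_last: float, bytes_per_second: float,
                   lagThreshold: float = 30.0, idleThreshold: float = 900.0,
                   slowThreshold: float = 1024.0) -> str:
        if lag_avg > lagThreshold:
            return "lag"
        if seconds_since_last > idleThreshold:
            return "idle"
        if bytes_per_second < slowThreshold:
            return "slow"
        return "run"

    # sr3 status | grep -v run would then surface lag/idle/slow flows
    # along with the stopped, partial and missing ones.
    print(flow_state(lag_avg=2.5, seconds_since_last=1800, bytes_per_second=5e5))  # -> idle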

@petersilva
Contributor

OK, there is issue824_morestates... it implements this... sort of... but it doesn't pass any tests...
How did that happen? I think a regression crept into a previous merge... have to examine.

@petersilva
Contributor

I was wrong... somehow changing the display broke the flow tests... this will take some time.

petersilva added a commit that referenced this issue May 21, 2024
There are many derived values that do calculations based on the
raw metrics, done in the status() message.  By moving these
calculations to the _resolve() routine, we make the results available
to sr overview and sr dump, and this provides a foundation for more
refined running states, as per #824
@petersilva
Contributor

fwiw... somehow this change broke "stop" and tickled all kinds of latent vulnerabilities with sanity, cleanup and stop... working through them...
stop is still occasionally broken (doesn't break every time, but often enough...)

what I have would likely pass the flow tests, but fail in practice for operations like sr3 stop... This work is on a second branch:

issue824_morestates_2ndTry

@petersilva
Contributor

OK, so the problem was not identifying running states (idle and retry were missing), and that was causing all manner of havoc. Now the static and flakey tests pass. Dynamic is... interesting...

I've noticed that sr3 sanity starts even stopped processes... which is not correct.
hmm...
