Metrics and settings for last sent/received files #824

Closed
reidsunderland opened this issue Nov 15, 2023 · 15 comments · Fixed by #871 or #1067
Labels
enhancement (New feature or request) · Priority 3 - Important (worrisome impact... should study carefully) · Sugar (nice to have features... not super important) · v3 (issue deferred to or affects version 3) · wishlist (would be nice, not pressing for any particular client)

Comments

@reidsunderland
Member

One of the useful features in Columbo is that it shows when each config either last received a file or last transmitted a file. Adding this information to the metrics could make it easier to develop a monitoring tool for sr3.

Columbo uses its own config file to set limits on how long is too long since last sending/receiving a file. I think it would be better to put these limits directly in the sr3 config for at least two reasons:

  1. it's more intuitive (to me) to set these limits directly in the config file. It's easy to forget about Columbo's maxSettings.conf when setting up or modifying a Sundew config.
  2. sr3 can know when something is wrong and report it.

For example:

  • metrics_maxDelayRecv: when more time than this option's value has elapsed since the last accepted message, this config is "late".
  • metrics_maxDelaySend: when more time than this option's value has elapsed since the last message was posted or file was sent, this config is "late".
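For illustration, a rough sketch of how this might look in a config file. The option names are only the proposal above, not existing sr3 settings, and the duration syntax is assumed:

    # hypothetical options, not yet implemented in sr3
    metrics_maxDelayRecv 1h     # "late" if no message has been accepted for an hour
    metrics_maxDelaySend 30m    # "late" if nothing has been posted/sent for 30 minutes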

sr3 status already takes up a lot of space on the screen, so I'm not sure it's worth trying to squeeze in detailed information about the last send/receive time, but maybe it could show "late" when a file hasn't been sent or received within the defined time limits.

I think the last reception time is always the time of the last accepted message, but the last transmitted time is a little less obvious. For configs that post, using the last post time should work, but for senders that don't post, we would have to use the last successful file send time.

reidsunderland added the enhancement, v3, wishlist and Sugar labels on Nov 15, 2023
@petersilva
Contributor

Using the sr configuration language is easy for other tools to adopt, because when we do sr3 dump, it dumps the entire config as JSON. So someone building a Columbo-like tool could either read in the output of dump, or they might be able to use
the sr_GlobalState class from sr.py... I haven't looked into how to do that, so it might mean some mods, but in principle, building things that understand the config of an entire node should not be hard.
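As a rough illustration of the first approach, a minimal Python sketch of a Columbo-like tool consuming the dump. It assumes the dump can be captured as a single JSON document on stdout; the loop is only a placeholder, since the actual layout of the dump is not shown here:

    import json
    import subprocess

    # Hypothetical sketch: capture "sr3 dump" output and parse it as JSON.
    raw = subprocess.run(["sr3", "dump"], capture_output=True, text=True, check=True).stdout
    node_config = json.loads(raw)

    # A Columbo-like tool would walk the per-config entries from here.
    for name, section in node_config.items():
        print(name, type(section).__name__)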

petersilva added the Priority 3 - Important label on Dec 4, 2023
@petersilva
Contributor

OK, got a first patch that writes the metrics; they look like this:


fractal% more sarra_download_f20_01.json
{"flow": {"stop_requested": false, "last_housekeeping": 1702684097.2077858, "transferConnected": false, "transferConnectStart": 0, "transferConnectTime": 0, "transferRxBytes": 256603, "transferTxBytes": 0, "transf
erRxFiles": 87, "transferTxFiles": 0, "next_housekeeping": 1702684157.2077858, "transferRxLast": "20231215T234817.66847229", "msgRate": 8.631587419459613, "current_sleep": 0.1}, "post.message": {"connected": false
, "disconnectLast": 1702684098.7549658, "disconnectTime": 0.0032203197479248047, "disconnectCount": 2, "rxByteCount": 0, "rxGoodCount": 0, "rxBadCount": 0, "txByteCount": 50985, "txGoodCount": 91, "txBadCount": 1}
, "gather.message": {"connected": false, "disconnectLast": 1702684097.7512178, "disconnectTime": 1.0048370361328125, "disconnectCount": 1, "rxByteCount": 98249, "rxGoodCount": 175, "rxBadCount": 0, "txByteCount": 
0, "txGoodCount": 0, "txBadCount": 0, "brokerQueuedMessageCount": 287}, "retry": {"msgs_in_download_retry": 0, "msgs_in_post_retry": 9}, "log": {"lagMax": 87.95920205116272, "lagTotal": 7636.8115911483765, "lagMes
sageCount": 87, "rejectCount": 88}, "sarracenia.transfer": {"byteRateInstant": 1159295.2049689442}}

The new metric is: transferRxLast


fractal% more sender_tsource2send_f50_01.json
{"flow": {"stop_requested": false, "last_housekeeping": 1702684012.0278788, "transferConnected": true, "transferConnectStart": 1702684203.829592, "transferConnectTime": 189.70056891441345, "transferRxBytes": 0, "t
ransferTxBytes": 1009148, "transferRxFiles": 0, "transferTxFiles": 958, "transferTxLast": "20231215T234959.566806793", "msgRate": 5.057996349006494, "current_sleep": 3.2}, "post.message": {"connected": true, "disc
onnectLast": 1702684181.2908318, "disconnectTime": 0.02028059959411621, "disconnectCount": 4, "rxByteCount": 0, "rxGoodCount": 0, "rxBadCount": 0, "txByteCount": 834602, "txGoodCount": 820, "txBadCount": 2}, "gath
er.message": {"connected": true, "disconnectLast": 1702684149.1651616, "disconnectTime": 97.30984234809875, "disconnectCount": 2, "rxByteCount": 909723, "rxGoodCount": 958, "rxBadCount": 0, "txByteCount": 0, "txGo
odCount": 0, "txBadCount": 0, "brokerQueuedMessageCount": 38}, "retry": {"msgs_in_download_retry": 0, "msgs_in_post_retry": 138}, "log": {"lagMax": 97.46203708648682, "lagTotal": 9851.86624956131, "lagMessageCount
": 958, "rejectCount": 0}, "sarracenia.transfer": {"byteRateInstant": 1588481.089361702}}
fractal%

In there, the new metric is:

 "transferTxLast": "20231215T234959.566806793"

@petersilva
Contributor

There is now a Last column indicating the last time something was transferred (shown as the time elapsed since then, in seconds).


fractal% sr3 status
status: 
Component/Config                         Processes   Connection        Lag                              Rates                                        
                                         State   Run Retry  msg data   Queued   LagMax LagAvg  Last  %rej     pubsub   messages     RxData     TxData 
                                         -----   --- -----  --- ----   ------   ------ ------  ----  ----   --------       ----     ------     ------ 
cpost/veille_f34                         run     1/1     0 100%   0%      0    0.00s    0.00s  0.0s  0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
cpump/pelle_dd1_f04                      stop    0/0     0 100%   0%      0    0.00s    0.00s  0.0s 79.2%  0 Bytes/s   8 msgs/s  0 Bytes/s  0 Bytes/s
cpump/pelle_dd2_f05                      stop    0/0     0 100%   0%      0    0.00s    0.00s  0.0s 79.2%  0 Bytes/s   8 msgs/s  0 Bytes/s  0 Bytes/s
cpump/xvan_f14                           hung    1/1     0 100%   0%      0    0.00s    0.00s  0.0s  0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
cpump/xvan_f15                           hung    1/1     0 100%   0%      0    0.00s    0.00s  0.0s  0.0%  0 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
poll/f62                                 run     1/1     0 100%   0%      0    0.42s    0.15s  0.0s  0.0%  1.4 KiB/s   0 msgs/s  0 Bytes/s  0 Bytes/s
post/shim_f63                            stop    0/0          -          -         -     -     -          -        -
post/test2_f61                           stop    0/0     0 100%   0%      0    0.07s    0.06s  0.0s  0.0% 15.8 KiB/s   0 msgs/s  0 Bytes/s  0 Bytes/s
sarra/download_f20                       run     3/3     0 100%   0%      0   11.25s    3.54s  0.4s  0.0%  2.0 KiB/s   1 msgs/s 561.8 KiB/s  0 Bytes/s
sender/tsource2send_f50                  run   10/10     0 100%   7%      0    1.71s    0.68s  2.0s  0.0%  7.4 KiB/s   6 msgs/s  0 Bytes/s 81.2 KiB/s
shovel/pclean_f90                        run     3/3   610 100%   0%      0   81.27s   75.74s  0.0s  6.0%  2.5 KiB/s   1 msgs/s  0 Bytes/s  0 Bytes/s
shovel/pclean_f92                        run     3/3     0 100%   0%      0  137.20s   82.64s  0.0s  0.0%  5.5 KiB/s   3 msgs/s  0 Bytes/s  0 Bytes/s
shovel/rabbitmqtt_f22                    hung    3/3     0 100%   0%      0    2.21s    0.78s  0.0s  0.0%  7.6 KiB/s   6 msgs/s  0 Bytes/s  0 Bytes/s
shovel/t_dd1_f00                         run     3/3     0 100%   0%      2   11.94s    2.61s  0.0s 80.1%  2.7 KiB/s  10 msgs/s  0 Bytes/s  0 Bytes/s
shovel/t_dd2_f00                         run     3/3     0 100%   0%      1   11.68s    2.62s  0.0s 80.1%  2.7 KiB/s  10 msgs/s  0 Bytes/s  0 Bytes/s
subscribe/amqp_f30                       run     3/3     0 100%   0%      0   15.91s    5.01s  0.3s  0.0%  1.5 KiB/s   1 msgs/s 151.8 KiB/s  0 Bytes/s
subscribe/cclean_f91                     run     3/3   265 100%   0%      0    0.00s    0.00s  0.0s 100.0% 33 Bytes/s   0 msgs/s  0 Bytes/s  0 Bytes/s
subscribe/cdnld_f21                      run     3/3     0 100%   0%      0   37.46s    7.35s 55.6s  0.3% 964 Bytes/s   1 msgs/s 13.1 KiB/s  0 Bytes/s
subscribe/cfile_f44                      run     3/3     0 100%   0%      0    1.90s    0.21s  1.6s  0.0%  1.1 KiB/s   3 msgs/s 1004 Bytes/s  0 Bytes/s
subscribe/cp_f61                         run     3/3     0 100%   0%      0   13.90s    2.71s  1.1s  0.0%  3.8 KiB/s   6 msgs/s 83.5 KiB/s  0 Bytes/s
subscribe/ftp_f70                        run     3/3    12 100%  42%      0  283.14s    1.84s  1.0s  0.0%  1.6 KiB/s   3 msgs/s  9.0 KiB/s  0 Bytes/s
subscribe/q_f71                          run     3/3    16 100%  35%      0  292.54s    4.73s 16.0s  0.1%  1.3 KiB/s   3 msgs/s 41.3 KiB/s  0 Bytes/s
subscribe/rabbitmqtt_f31                 run     3/3     0 100%   0%     24    3.76s    1.05s  2.2s  0.0%  3.7 KiB/s   6 msgs/s 159.7 KiB/s  0 Bytes/s
subscribe/u_sftp_f60                     run     3/3     0 100%   0%      0   11.92s    1.78s  1.7s  0.0%  3.7 KiB/s   6 msgs/s 83.3 KiB/s  0 Bytes/s
watch/f40                                run     1/1     0 100%   0%      0    0.07s    0.02s  0.0s  0.0%  2.7 KiB/s   0 msgs/s  0 Bytes/s  0 Bytes/s
winnow/t00_f10                           run     1/1     0 100%   0%      0    7.28s    2.83s  0.0s 51.1%  1.5 KiB/s   1 msgs/s  0 Bytes/s  0 Bytes/s
winnow/t01_f10                           run     1/1     0 100%   0%      0    4.29s    1.54s  0.0s 55.6%  1.6 KiB/s   1 msgs/s  0 Bytes/s  0 Bytes/s
      Total Running Configs:  23 ( Processes: 62 missing: 0 stray: 0 )

A sample use of the metrics produced above in sr3 status.

@reidsunderland
Member Author

The Last column is great, but right now there's nothing that interprets that number.

If sr3 showed a status of "late" for configs whose Last value exceeds a threshold, it would be extremely easy for us to monitor with Nagios right away. Eventually we could get fancier monitoring.

I reopened the issue to discuss it. Maybe we want the interpretation done exclusively outside of sr3; if so, that's okay and we can close this again.

@andreleblanc11 @junhu3
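For illustration, a minimal sketch of the Nagios-style check this would allow, assuming a future sr3 status that prints "late" in the State column (that behaviour does not exist yet; this is only a proposal):

    import subprocess
    import sys

    # Hypothetical check: WARNING if any config reports a "late" state.
    out = subprocess.run(["sr3", "status"], capture_output=True, text=True, check=True).stdout
    late = [line.split()[0] for line in out.splitlines() if " late " in line]

    if late:
        print("WARNING: late configs: " + ", ".join(late))
        sys.exit(1)      # Nagios WARNING
    print("OK: no late configs")
    sys.exit(0)          # Nagios OK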

@reidsunderland
Member Author

reidsunderland commented May 16, 2024

Also, for components that don't transfer files, it would be nice to have the 'last' metric calculated from the last post.

(And if a vip is set in the config, only the instance holding the VIP matters for the last post time; wVip instances could continue to show - in sr3 status.)

@reidsunderland
Member Author

I chose "late" because it's four letters and fits with the other states reported by sr3 status.

But @MagikEh is worried about the word:

"Caution to terminology, 'late' in the case of 'no obs' can lead an analyst to looking for data that doesn't exist. I'd prefer something along the lines of 'Unexpected Lapse', 'Missed Milestone', 'Checkpoint', or something to that nature"

@petersilva
Contributor

petersilva commented May 17, 2024

ah... so report things that we think are a problem... by a different status... interesting... If the latency (as reported by LagAvg) is over a threshold, should that trigger "late" as well?

Also, I think a high rejection rate should trigger a "bad" status of some kind. We could call that "poor", for poorly tuned?

sr3 --diag status

would only show us the stuff that is in neither a running nor a stopped state (the bad ones): partial, late, poor...?

@petersilva
Contributor

Or is calling them all "late" and then just grepping for late good enough?

@petersilva
Contributor

We might also need settings to indicate what a normal "lastTransfer" is...

  • lastTransferThresholdLate
  • lagAvgThresholdLate
  • rejectionRateLate/poor

So: have a system-wide default, and be able to tune it for certain flows.

@petersilva
Contributor

slow is a four letter word.

@petersilva
Contributor

petersilva commented May 18, 2024

thinking:

running -- no problems identified.
lag -- latency threshold exceeded: lagThreshold (system-wide default, + per-flow override)
idle -- last transfer too long ago: idleThreshold
slow -- transfer rate too low: slowThreshold

so you do an sr3 status | grep -v run

and you should get all the stopped, part, missing, idle, slow ones.
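A minimal Python sketch of the classification this implies; the threshold names come from the list above, but the default values and the order of the checks are assumptions, not released settings:

    # Hypothetical state classification using the proposed thresholds.
    def flow_state(lag_avg: float, seconds_since_last: float, bytes_per_second: float,
                   lagThreshold: float = 30.0, idleThreshold: float = 900.0,
                   slowThreshold: float = 1024.0) -> str:
        if lag_avg > lagThreshold:
            return "lag"
        if seconds_since_last > idleThreshold:
            return "idle"
        if bytes_per_second < slowThreshold:
            return "slow"
        return "run"

    # sr3 status | grep -v run would then surface lag/idle/slow flows
    # along with the stopped, partial and missing ones.
    print(flow_state(lag_avg=2.5, seconds_since_last=1800, bytes_per_second=5e5))  # -> idle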

@petersilva
Contributor

OK, there is issue824_morestates... it implements this... sort of... but it doesn't pass any tests...
How did that happen? I think a regression crept into a previous merge... have to examine.

@petersilva
Contributor

I was wrong... somehow changing the display broke the flow tests... this will take some time.

petersilva added a commit that referenced this issue May 21, 2024
There are many derived values that do calculations based on the
raw metrics, done in the status() message.  By moving these
calculations to the _resolve() routine, we make the results available
to sr overview and sr dump, and this provides a foundation for more
refined running states, as per #824
@petersilva
Contributor

fwiw... somehow this change broke "stop" and tickled all kinds of latent vulnerabilities with sanity, cleanup and stop... working through them...
stop is still occasionally broken (doesn't break every time, but often enough...)

what I have would likely pass the flow tests, but fail in practice for operations like sr3 stop... This work is on a second branch:

issue824_morestates_2ndTry

@petersilva
Contributor

OK, so the problem was not identifying running states (idle and retry were missing), and that was causing all manner of havoc. Now the static and flakey tests pass. Dynamic is... interesting...

I've noticed that sr3 sanity starts even stopped processes... which is not correct.
hmm...
