
Issue824 morestates and issue966 and... other stuff #1067

Merged
merged 30 commits into from
May 27, 2024

Conversation

petersilva
Contributor

@petersilva petersilva commented May 24, 2024

close #824
close #966

So this branch was originally meant to add more interpreted states to the sr status display. Instead of just saying "running" we read the metrics a bit and say the flow is:

  • lagging ... if the messages being processed are too old.
  • slow ... if data transfers are below a threshold rate.
  • reject ... if messages are being rejected at a high rate (excess cpu to process messages, bad subtopics, etc.)
  • retry ... if lots of files are being retried.
  • idle ... if it's not doing anything at all.

Now those states are calculated from the metrics; they aren't directly in the metrics.
So to make that more obvious, and to have them show up in sr3 dump as well as sr3 status, I moved all
the calculations on metrics out of the status() routine in sr.py and into _resolve() (the routine that looks at all the sources of information and makes judgements about them.)
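To illustrate the idea, here is a minimal sketch of how interpreted states could be derived from raw metrics. All field names, defaults, and the function name are invented for this example; the real _resolve() in sarracenia differs.

```python
import time

def resolve_state(m, now=None, lag_threshold=30, slow_threshold=0.001,
                  reject_threshold=0.8, retry_threshold=1000, idle_threshold=900):
    """Map a dict of raw flow metrics onto one interpreted running state.
    Illustrative only: field names and thresholds are invented."""
    now = now if now is not None else time.time()
    if now - m['last_activity'] > idle_threshold:
        return 'idle'        # not doing anything at all
    if m['message_lag'] > lag_threshold:
        return 'lagging'     # messages being processed are too old
    if m['transfer_rate'] < slow_threshold:
        return 'slow'        # transfers below the threshold rate
    if m['rejected'] > reject_threshold * max(1, m['accepted'] + m['rejected']):
        return 'reject'      # high rejection rate (bad subtopics, etc.)
    if m['retries'] > retry_threshold:
        return 'retry'       # lots of files being retried
    return 'running'
```

The point is that none of these states exist in the metrics themselves; they are judgements computed from them, which is why centralizing the computation in one routine makes them available to every display.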

When I started running (initially broken) versions of this code, all manner of interesting failure modes were exposed:

  • realized that sr3 sanity should likely not be starting stopped jobs... changed things so it wouldn't.

  • found a weird error about credentials lookups failing (a regression): cdf8dac

  • naturally, while the code wasn't quite right, some processes were identified as strays... weirdly. It turns out that if the sanity checks found the configs fine other than a few strays, it would not kill the strays. It would only kill them if something else was also wrong at the same time (missing instances, e.g.)
    01b21f0

  • noticed cleanup doesn't care whether the flows are running: the processes would just redeclare the resources. Added a gate so cleanup refuses to run if the flows are running.

  • sr3 cleanup is protected by dangerWillRobinson, so you have to supply the right number of configs for it to work. When the wrong number of configs was supplied, the sr3 cleanup didn't happen, but the subsequent sr3 remove, if given the correct number for dangerWillRobinson, did succeed; the result was that all the queues remained on the broker after a flow_test.

    from that came the observation that remove() should probably also do a cleanup(),
    which is actually what v2 does.

  • also noticed that the ftp subscribers were bombing during the dynamic flow test (hence the crasher tag). I think this is the result of Better sender io errors (download also...) #1037 narrowing the scope of the try/excepts: there was a failure mode not covered in the new code, and the ftps were aborting with timeouts... so this is exactly the logic that Improve handling of large file transfers #966 is dealing with... so I added a new routine, logProgress, to put log messages out so that the log is no longer dead during long transfers.

  • adding the new states meant adding new settings to control them:

    • lagThreshold ... how far behind do you need to be to say you are lagging.
    • slowThreshold ... how slow do transfers have to be, to be a problem.
    • retryThreshold ... how many retries are a problem?
    • idleThreshold ... how long doing nothing before the flow counts as idle?
  • the existing sanity_log_dead setting is how we decide if a flow is hung. To fit in better, I renamed it hungThreshold (with a synonym, so sanity_log_dead still works in a config file.)
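For illustration, the new settings might appear in a config file like this (the values are invented, and these are the draft names discussed in this PR, which were still under debate):

```
lagThreshold 60
slowThreshold 0.01
retryThreshold 1000
idleThreshold 900
hungThreshold 450
```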

There is still a strange bug in sanity, where it doesn't leave all stopped jobs alone, only some of them... I'm working on that before this PR is declared final, but here is a progress report.

petersilva and others added 22 commits May 21, 2024 00:18
There are many derived values that do calculations based on the
raw metrics, done in the status() message.  By moving these
calculations to the _resolve() routine, we make the results available
to sr overview and sr dump, and this provides a foundation for more
refined running states, as per #824
There is a weird timeout error caused by setting timers that
are never cleared when errors occur. This causes the subscriber to
crash when FTP file retrieval fails, rather than logging and
continuing.

Also added a logProgress routine which is called to print a log
message often enough to satisfy sanity_log_dead; currently fixed
at every 60 seconds during a transfer.
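A sketch of what such a progress logger could look like. The class and method names here are invented for illustration; the actual logProgress in sarracenia may be structured differently.

```python
import logging
import time

logger = logging.getLogger(__name__)

class ProgressLogger:
    """Emit a log line at most once per interval during a long transfer,
    so the log is not dead long enough for sanity to think the flow is hung."""

    def __init__(self, interval=60):
        self.interval = interval   # seconds between progress messages
        self.last = 0.0            # when we last logged

    def logProgress(self, bytes_done, total_bytes):
        now = time.time()
        if now - self.last >= self.interval:
            self.last = now
            logger.info("transfer in progress: %d of %d bytes", bytes_done, total_bytes)
            return True            # a message was emitted
        return False               # too soon, stay quiet
```

Called from inside the transfer loop, this keeps the vulnerability window below the hung threshold without flooding the log on fast transfers.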
@petersilva petersilva added bug Something isn't working enhancement New feature or request crasher Crashes entire app. labels May 24, 2024
@andreleblanc11
Member

This is quite the PR

@andreleblanc11
Member

It might be worth waiting until the dev meeting to go through the PR together too.

@reidsunderland
Member

reidsunderland commented May 24, 2024

I'm not saying I want it to be changed, but would this be a case where we should use underscores in the config options to indicate that they're grouped together?

  • threshold_idle
  • threshold_lag
  • threshold_reject
  • threshold_retry
  • threshold_hung
  • threshold_slow

If we were to group them like that, maybe even sanity_log_dead should be threshold_logDead ? I just noticed that sanity_log_dead is renamed :)

"Underscore is now reserved for cases where it represents a grouping of options, or options related to a given class. For example, post_ settings retained the first underscore, but not the rest. "

@petersilva
Contributor Author

sure, not attached to the names... but threshold_logDead is what is used to decide if a process is hung... which is why I called it hungThreshold... so threshold_hung would make the most sense. There is also the existing accelThreshold, used to decide whether a binary accelerator should be used for a transfer in place of the built-in python.

  • threshold_hung
  • threshold_accel

?

@petersilva
Contributor Author

petersilva commented May 24, 2024

oh... and no issue with waiting until monday... it's still a draft... there is still a weirdness with sanity.

@reidsunderland
Member

yeah threshold_hung/hungThreshold makes sense, I didn't notice that sanity_log_dead had already been changed.

For accelThreshold, I feel like it wouldn't really fit into the threshold_ group, because it's used to change actual behaviour of the program, not just monitoring/status/sanity related.

If we want to commit to underscores, I think I'd actually want an accel_ group, with accel_threshold and accel_xxxCommand 😄

I think I'm leaning towards leaving it alone. If we want to be serious about the underscore grouping, there's a lot of other options that would need to be renamed too

@petersilva
Contributor Author

petersilva commented May 24, 2024

maybe threshold is a bad word in the first place....

  • status_hung
  • status_retry
  • status_idle
    ...

are those more descriptive? I mean they are thresholds... but ... does having that in the name really help?

@petersilva
Contributor Author

I also don't know if I want to commit to underscores... v2 was all underscores... some guy (a really good guy btw) I was talking to at a WMO committee wanted camelCase, so I said sure... that was for fields in the message format, but I wanted the code to reflect the messages, one thing led to another... but camelCase is a java/c++ kind of thing... it's weird in python context. There was no shape to it in v2 (we were coming from a C background.)

We ended up in this weird place... with _ here and there... My mental model... is that... if we had an .ini style or yaml config files, the _ would make sense for groupings. post_ would be something like:


[post]
   broker
   exchange
   topicPrefix

[subscriber]
    broker
    exchange
    topicPrefix
    subtopic

(the subscriber is the one without any _ ... the implied default.)
In that context... [status] would make more sense than [threshold]....


[accel]
     command
     threshold

I dunno... no firm ideas, just sharing thoughts.

This code says the instances we find that don't have pids are missing.
This happens when we read metrics, and then know what the pids of the
flows were when it was last running.
An instance pid that doesn't have a matching file is actually a stray.
I think this code pre-dates strays. Anyway, what was happening:

sr3 stop xx   # xx has current metrics file.
sr3 sanity
  # instances of xx are detected as "missing" because metrics present.
  # sanity therefore starts up xx when it should not.
@petersilva
Contributor Author

perhaps status is too generic... runstate... flowstate

  • flowstate_idle
  • flowstate_lag
  • flowstate_slow
  • flowstate_retry
  • flowstate_reject
  • flowstate_hung

@petersilva
Contributor Author

with the new commits I fixed the odd problem of sanity starting up too much. It is ready for testing. Renaming is a thing we can do also.

@reidsunderland
Member

reidsunderland commented May 24, 2024

Looking at the ini/yaml style config, I do think we can find a better group name than threshold

is stateThreshold_ too long? I think that's the most descriptive of what it actually is. It's a threshold that is used to determine the state on a component/config.

@petersilva
Contributor Author

there is a remaining race condition (that always existed): if you are starting up a multi-instance configuration, the instances take a few tens of seconds to completely set up... the children write their own pid files after they have forked, parsed their configurations, and waited a bit (to avoid fighting over queue declarations and such.) So it takes a little while for all the instances to have pid files.

if sr3 sanity runs during that startup period, the instances without .pid files will be strays. sanity will kill them and start replacements. Net result: it will take a few extra seconds for some of the instances to start up.

ways of resolving this:

  • have the parent write the pid file; the child will read it, see it matches, and be happy; if missing, it will write it.
    The parent will do this as soon as the child is launched, a much reduced vulnerability window.

  • have the start function write a .start file in the state directory... define an additional starting status.
    sr3 status sees the starting status and ... um... does the right thing ... meanwhile the parent scans for pid files until they are all present, and then removes the .start file.

other ideas welcome... and this PR is big enough... maybe this should be a followup or something.
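The first option above could be sketched roughly like this. The file layout, helper names, and the spawn callback are all invented for illustration; this is not the actual sr3 instance-management code.

```python
import os

def launch_instance(state_dir, instance, spawn):
    """Parent side: write the pid file immediately after launching the child,
    shrinking the window in which sanity could mistake it for a stray."""
    pid = spawn()  # fork/exec the instance; returns the child's pid
    pidfile = os.path.join(state_dir, "instance_%02d.pid" % instance)
    with open(pidfile, 'w') as f:   # parent writes it right away
        f.write(str(pid))
    return pid

def verify_pidfile(state_dir, instance):
    """Child side: read the parent's pid file, see it matches and be happy;
    if missing, write it (the old behaviour)."""
    pidfile = os.path.join(state_dir, "instance_%02d.pid" % instance)
    me = os.getpid()
    if os.path.exists(pidfile):
        with open(pidfile) as f:
            return int(f.read().strip()) == me
    with open(pidfile, 'w') as f:   # fall back: child writes its own
        f.write(str(me))
    return True
```

This does not eliminate the race entirely, but the pid file exists from the moment the child is launched rather than after it finishes parsing its configuration.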

@andreleblanc11
Member

Maybe stateLimit could also work? A bit less long than stateThreshold but communicates similar information

@petersilva
Contributor Author

we could do long/short combos...

runStateThreshold_idle ... rst_idle, rst_hung, rst_lag, rst_slow

petersilva and others added 2 commits May 25, 2024 00:57
With this change, the cleanup() routine now returns a boolean: True for
success, False for failure. So remove() now only proceeds if cleanup()
reports success.

Also, cleanup will do nothing unless all the configurations chosen are stopped.
Formerly it would clean up the stopped configurations, and skip the others.
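The gating described in this commit message can be sketched as follows. The function signatures and the is_running callback are hypothetical; the real sr.py code operates on its own internal state.

```python
def cleanup(configs, is_running):
    """Refuse to clean up unless every chosen configuration is stopped;
    returns True for success, False for failure."""
    running = [c for c in configs if is_running(c)]
    if running:
        # formerly: clean the stopped ones, skip the rest.
        # now: do nothing at all, so running flows can't redeclare resources.
        return False
    # ... deletion of broker resources (queues, exchanges) would happen here ...
    return True

def remove(configs, is_running):
    """Only remove configurations after their resources were cleaned up,
    which is what v2 did."""
    if not cleanup(configs, is_running):
        return False
    # ... deletion of the configuration files would happen here ...
    return True
```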
@petersilva
Contributor Author

people want runStateThreshold... as the name.
discussion of defaults:

For deciding if idle:


if posting:
    lastTransfer should be based on last successful post.
elif downloading:
    use last transfer time.
else:
    use last accepted message time.


Member

@reidsunderland reidsunderland left a comment


Reviewed in meeting, it's good to merge once renamed to runStateThreshold_...

@petersilva
Contributor Author

petersilva commented May 27, 2024

from running a dynamic_flow test, proof the new message works:

Starting flow_post on: /home/peter/sarra_devdocroot, saving pid in .flowpostpid, using: POST=sr3_post, CPOST=sr3_cpost
2024-05-27 12:48:47,401 4190153 [WARNING] sarracenia.config finalize hungTreshold 450 set lower than housekeeping 604800.0. sr3 sanity might think this flow is hung kill it too quickly.
2024-05-27 12:48:47,401 4190153 [WARNING] sarracenia.config finalize hungTreshold 450 set lower than housekeeping 604800.0. sr3 sanity might think this flow is hung kill it too quickly.

but there is a typo in it ("hungTreshold")... which is why it wasn't replaced by the rename... must fix. Corrected:


2024-05-27 13:22:51,403 251176 [WARNING] sarracenia.config finalize runStateThreshold_hung 450 set lower than housekeeping 604800.0. sr3 sanity might think this flow is hung kill it too quickly.
2024-05-27 13:22:51,404 251176 [WARNING] sarracenia.config finalize runStateThreshold_hung 450 set lower than housekeeping 604800.0. sr3 sanity might think this flow is hung kill it too quickly.

@petersilva petersilva marked this pull request as ready for review May 27, 2024 17:47
@petersilva petersilva merged commit 08a468a into development May 27, 2024
4 checks passed
@petersilva petersilva deleted the issue824_morestates_2ndTry branch June 6, 2024 20:58
Successfully merging this pull request may close these issues.

  • Improve handling of large file transfers
  • Metrics and settings for last sent/received files
3 participants