
Issue824 morestates and issue966 and... other stuff #1067

Merged
merged 30 commits into from
May 27, 2024

Conversation

petersilva
Contributor

@petersilva petersilva commented May 24, 2024

close #824
close #966

So this branch was originally meant to add more interpreted states to the sr status display. Instead of just saying "running" we read the metrics a bit and say the flow is:

  • lagging ... if the messages being processed are too old.
  • slow ... if data transfers are below a threshold rate.
  • reject ... if messages are being rejected at a high rate (excess cpu to process messages, bad subtopics, etc.)
  • retry ... if lots of files are being retried.
  • idle ... if it's not doing anything at all.

Now those states are calculated from the metrics; they aren't directly in the metrics.
So to make that more obvious, and to have them show up in sr3 dump as well as sr3 status, I moved all
the calculations on metrics out of the status() routine in sr.py and into _resolve() (the routine that looks at all the sources of information and makes judgements about them.)
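To illustrate the idea, here is a minimal sketch of how interpreted states could be derived from raw metrics. All field names, defaults, and the function name are invented for this example; the real _resolve() in sarracenia differs.

```python
import time

def resolve_state(m, now=None, lag_threshold=30, slow_threshold=0.001,
                  reject_threshold=0.8, retry_threshold=1000, idle_threshold=900):
    """Map a dict of raw flow metrics onto one interpreted running state.
    Illustrative only: field names and thresholds are invented."""
    now = now if now is not None else time.time()
    if now - m['last_activity'] > idle_threshold:
        return 'idle'        # not doing anything at all
    if m['message_lag'] > lag_threshold:
        return 'lagging'     # messages being processed are too old
    if m['transfer_rate'] < slow_threshold:
        return 'slow'        # transfers below the threshold rate
    if m['rejected'] > reject_threshold * max(1, m['accepted'] + m['rejected']):
        return 'reject'      # high rejection rate (bad subtopics, etc.)
    if m['retries'] > retry_threshold:
        return 'retry'       # lots of files being retried
    return 'running'
```

The point is that none of these states exist in the metrics themselves; they are judgements computed from them, which is why centralizing the computation in one routine makes them available to every display.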

When I started running (initially broken) versions of this code, all manner of interesting failure modes were exposed:

  • realized that sr3 sanity should likely not be starting stopped jobs... changed things so it wouldn't.

  • found a weird error about credentials lookups failing (a regression): cdf8dac

  • naturally, while the code wasn't quite right, some processes were identified as strays... weirdly. It turns out that if the sanity checks found the configs fine other than a few strays, it would not kill the strays. It would only kill them if something else was also wrong at the same time (missing instances, e.g.)
    01b21f0

  • noticed cleanup doesn't care whether the flows are running: the processes would just redeclare the resources. Added a gate so cleanup refuses to run if the flows are running.

  • sr3 cleanup is protected by dangerWillRobinson, so you have to supply the right number of configs for it to work. When the wrong number of configs was supplied, the sr3 cleanup didn't happen, but the subsequent sr3 remove, if given the correct number for dangerWillRobinson, did succeed; the result was that all the queues remained on the broker after a flow_test.

    from that came the observation that remove() should probably also do a cleanup(),
    which is actually what v2 does.

  • also noticed that the ftp subscribers were bombing during the dynamic flow test (hence the crasher tag). I think this is the result of Better sender io errors (download also...) #1037 narrowing the scope of the try/excepts: there was a failure mode not covered in the new code, and the ftps were aborting with timeouts... so this is exactly the logic that Improve handling of large file transfers #966 is dealing with... so I added a new routine, logProgress, to put log messages out so that the log is no longer dead during long transfers.

  • adding the new states meant adding new settings to control them:

    • lagThreshold ... how far behind do you need to be to say you are lagging.
    • slowThreshold ... how slow do transfers have to be, to be a problem.
    • retryThreshold ... how many retries are a problem?
    • idleThreshold ... how long doing nothing before the flow counts as idle?
  • the existing sanity_log_dead setting is how we decide if a flow is hung. To fit in better, I renamed it hungThreshold (with a synonym, so sanity_log_dead still works in a config file.)
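For illustration, the new settings might appear in a config file like this (the values are invented, and these are the draft names discussed in this PR, which were still under debate):

```
lagThreshold 60
slowThreshold 0.01
retryThreshold 1000
idleThreshold 900
hungThreshold 450
```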

There is still a strange bug in sanity, where it doesn't leave all stopped jobs alone, only some of them... I'm working on that before this PR is declared final, but here is a progress report.

petersilva and others added 22 commits May 21, 2024 00:18
There are many derived values that do calculations based on the
raw metrics, done in the status() message.  By moving these
calculations to the _resolve() routine, we make the results available
to sr overview and sr dump, and this provides a foundation for more
refined running states, as per #824
There is a weird timeout error caused by setting timers that
are never cleared when errors occur. This causes the subscriber to
crash when FTP file retrieval fails, rather than logging and
continuing.

Also added a logProgress routine which is called to print a log
message often enough to satisfy sanity_log_dead; currently fixed
at every 60 seconds during a transfer.
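A sketch of what such a progress logger could look like. The class and method names here are invented for illustration; the actual logProgress in sarracenia may be structured differently.

```python
import logging
import time

logger = logging.getLogger(__name__)

class ProgressLogger:
    """Emit a log line at most once per interval during a long transfer,
    so the log is not dead long enough for sanity to think the flow is hung."""

    def __init__(self, interval=60):
        self.interval = interval   # seconds between progress messages
        self.last = 0.0            # when we last logged

    def logProgress(self, bytes_done, total_bytes):
        now = time.time()
        if now - self.last >= self.interval:
            self.last = now
            logger.info("transfer in progress: %d of %d bytes", bytes_done, total_bytes)
            return True            # a message was emitted
        return False               # too soon, stay quiet
```

Called from inside the transfer loop, this keeps the vulnerability window below the hung threshold without flooding the log on fast transfers.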
@petersilva petersilva added bug Something isn't working enhancement New feature or request crasher Crashes entire app. labels May 24, 2024
@andreleblanc11
Member

This is quite the PR

@andreleblanc11
Member

It might be worth waiting until the dev meeting to go through the PR together too.

@reidsunderland
Member

reidsunderland commented May 24, 2024

I'm not saying I want it to be changed, but would this be a case where we should use underscores in the config options to indicate that they're grouped together?

  • threshold_idle
  • threshold_lag
  • threshold_reject
  • threshold_retry
  • threshold_hung
  • threshold_slow

If we were to group them like that, maybe even sanity_log_dead should be threshold_logDead ? I just noticed that sanity_log_dead is renamed :)

"Underscore is now reserved for cases where it represents a grouping of options, or options related to a given class. For example, post_ settings retained the first underscore, but not the rest. "

@petersilva
Contributor Author

sure, not attached to the names... but threshold_logDead is what is used to decide if a process is hung... which is why I called it hungThreshold... so threshold_hung would make the most sense. There is also the existing accelThreshold, used to decide whether a binary accelerator should be used for a transfer in place of the built-in python.

  • threshold_hung
  • threshold_accel

?

@petersilva
Contributor Author

petersilva commented May 24, 2024

oh... and no issue with waiting until monday... it's still a draft... there is still a weirdness with sanity.

@reidsunderland
Member

yeah threshold_hung/hungThreshold makes sense, I didn't notice that sanity_log_dead had already been changed.

For accelThreshold, I feel like it wouldn't really fit into the threshold_ group, because it's used to change actual behaviour of the program, not just monitoring/status/sanity related.

If we want to commit to underscores, I think I'd actually want an accel_ group, with accel_threshold and accel_xxxCommand 😄

I think I'm leaning towards leaving it alone. If we want to be serious about the underscore grouping, there's a lot of other options that would need to be renamed too

@petersilva
Contributor Author

petersilva commented May 24, 2024

maybe threshold is a bad word in the first place....

  • status_hung
  • status_retry
  • status_idle
    ...

are those more descriptive? I mean they are thresholds... but ... does having that in the name really help?

@petersilva
Contributor Author

I also don't know if I want to commit to underscores... v2 was all underscores... some guy (a really good guy btw) I was talking to at a WMO committee wanted camelCase, so I said sure... that was for fields in the message format, but I wanted the code to reflect the messages, one thing led to another... but camelCase is a java/c++ kind of thing... it's weird in python context. There was no shape to it in v2 (we were coming from a C background.)

We ended up in this weird place... with _ here and there... My mental model... is that... if we had an .ini style or yaml config files, the _ would make sense for groupings. post_ would be something like:


[post]
   broker
   exchange
   topicPrefix

[subscriber]
    broker
    exchange
    topicPrefix
    subtopic

(the subscriber is the one without any _ ... the implied default.)
In that context... [status] would make more sense than [threshold]....


[accel]
     command
     threshold

I dunno... no firm ideas, just sharing thoughts.

This code says the instances we find that don't have pids are missing.
This happens when we read metrics, and then know what the pids of the
flows were when it was last running.
An instance pid that doesn't have a matching file is actually a stray.
I think this code pre-dates strays. Anyway, what was happening:

sr3 stop xx   # xx has current metrics file.
sr3 sanity
  # instances of xx are detected as "missing" because metrics present.
  # sanity therefore starts up xx when it should not.
@petersilva
Contributor Author

perhaps status is too generic... runstate... flowstate

  • flowstate_idle
  • flowstate_lag
  • flowstate_slow
  • flowstate_retry
  • flowstate_reject
  • flowstate_hung

@petersilva
Contributor Author

with the new commits I fixed the odd problem of sanity starting up too much. It is ready for testing. Renaming is a thing we can do also.

@reidsunderland
Member

reidsunderland commented May 24, 2024

Looking at the ini/yaml style config, I do think we can find a better group name than threshold

is stateThreshold_ too long? I think that's the most descriptive of what it actually is. It's a threshold that is used to determine the state on a component/config.

@petersilva
Contributor Author

there is a remaining race condition (that always existed): if you are starting up a multi-instance configuration, the instances take a few tens of seconds to completely set up... the children write their own pid files after they have forked, parsed their configurations, and waited a bit (to avoid fighting over queue declarations and such.) So it takes a little while for all the instances to have pid files.

if sr3 sanity runs during that startup period, the instances without .pid files will be strays. sanity will kill them and start replacements. Net result: it will take a few extra seconds for some of the instances to start up.

ways of resolving this:

  • have the parent write the pid file; the child will read it, see it matches, and be happy; if missing, it will write it.
    The parent will do this as soon as the child is launched, a much reduced vulnerability window.

  • have the start function write a .start file in the state directory... define an additional starting status.
    sr3 status sees the starting status and ... um... does the right thing ... meanwhile the parent scans for pid files until they are all present, and then removes the .start file.

other ideas welcome... and this PR is big enough... maybe this should be a followup or something.
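The first option above could be sketched roughly like this. The file layout, helper names, and the spawn callback are all invented for illustration; this is not the actual sr3 instance-management code.

```python
import os

def launch_instance(state_dir, instance, spawn):
    """Parent side: write the pid file immediately after launching the child,
    shrinking the window in which sanity could mistake it for a stray."""
    pid = spawn()  # fork/exec the instance; returns the child's pid
    pidfile = os.path.join(state_dir, "instance_%02d.pid" % instance)
    with open(pidfile, 'w') as f:   # parent writes it right away
        f.write(str(pid))
    return pid

def verify_pidfile(state_dir, instance):
    """Child side: read the parent's pid file, see it matches and be happy;
    if missing, write it (the old behaviour)."""
    pidfile = os.path.join(state_dir, "instance_%02d.pid" % instance)
    me = os.getpid()
    if os.path.exists(pidfile):
        with open(pidfile) as f:
            return int(f.read().strip()) == me
    with open(pidfile, 'w') as f:   # fall back: child writes its own
        f.write(str(me))
    return True
```

This does not eliminate the race entirely, but the pid file exists from the moment the child is launched rather than after it finishes parsing its configuration.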

@andreleblanc11
Member

Maybe stateLimit could also work? A bit less long than stateThreshold but communicates similar information

@petersilva
Contributor Author

we could do long/short combos...

runStateThreshold_idle ... rst_idle, rst_hung, rst_lag, rst_slow

petersilva and others added 2 commits May 25, 2024 00:57
With this change, the cleanup() routine now returns a boolean: True for
success, False for failure. So remove() now only proceeds if cleanup()
reports success.

Also, cleanup will do nothing unless all the configurations chosen are stopped.
Formerly it would clean up the stopped configurations, and skip the others.
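The gating described in this commit message can be sketched as follows. The function signatures and the is_running callback are hypothetical; the real sr.py code operates on its own internal state.

```python
def cleanup(configs, is_running):
    """Refuse to clean up unless every chosen configuration is stopped;
    returns True for success, False for failure."""
    running = [c for c in configs if is_running(c)]
    if running:
        # formerly: clean the stopped ones, skip the rest.
        # now: do nothing at all, so running flows can't redeclare resources.
        return False
    # ... deletion of broker resources (queues, exchanges) would happen here ...
    return True

def remove(configs, is_running):
    """Only remove configurations after their resources were cleaned up,
    which is what v2 did."""
    if not cleanup(configs, is_running):
        return False
    # ... deletion of the configuration files would happen here ...
    return True
```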
@petersilva
Contributor Author

people want runStateThreshold... as the name.
discussion of defaults:

For deciding if idle:


if posting:
    lastTransfer should be based on last successful post.
elif downloading:
    use last transfer time.
else:
    use last accepted message time.


Member

@reidsunderland reidsunderland left a comment


Reviewed in meeting, it's good to merge once renamed to runStateThreshold_...

@petersilva
Contributor Author

petersilva commented May 27, 2024

from running a dynamic_flow test, proof the new message works:

Starting flow_post on: /home/peter/sarra_devdocroot, saving pid in .flowpostpid, using: POST=sr3_post, CPOST=sr3_cpost
2024-05-27 12:48:47,401 4190153 [WARNING] sarracenia.config finalize hungTreshold 450 set lower than housekeeping 604800.0. sr3 sanity might think this flow is hung kill it too quickly.
2024-05-27 12:48:47,401 4190153 [WARNING] sarracenia.config finalize hungTreshold 450 set lower than housekeeping 604800.0. sr3 sanity might think this flow is hung kill it too quickly.

but there is a typo in it ("hungTreshold")... which is why it wasn't replaced by the rename... must fix. Corrected:


2024-05-27 13:22:51,403 251176 [WARNING] sarracenia.config finalize runStateThreshold_hung 450 set lower than housekeeping 604800.0. sr3 sanity might think this flow is hung kill it too quickly.
2024-05-27 13:22:51,404 251176 [WARNING] sarracenia.config finalize runStateThreshold_hung 450 set lower than housekeeping 604800.0. sr3 sanity might think this flow is hung kill it too quickly.

@petersilva petersilva marked this pull request as ready for review May 27, 2024 17:47
@petersilva petersilva merged commit 08a468a into development May 27, 2024
4 checks passed
@petersilva petersilva deleted the issue824_morestates_2ndTry branch June 6, 2024 20:58
Successfully merging this pull request may close these issues.

  • Improve handling of large file transfers
  • Metrics and settings for last sent/received files
3 participants