Issue824 morestates and issue966 and... other stuff #1067
Conversation
There are many derived values that do calculations based on the raw metrics, done in the status() routine. By moving these calculations to the _resolve() routine, we make the results available to sr overview and sr dump, and this provides a foundation for more refined running states, as per #824
There is a weird timeout error caused by setting timers that are never cleared when errors occur. This causes the subscriber to crash when FTP file retrieval fails, rather than logging and continuing. Also added a logProgress routine, which is called to print a log message often enough to satisfy sanity_log_dead; currently fixed at every 60 seconds during a transfer.
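The logProgress idea can be sketched roughly like this (the name and the 60-second interval come from the description above; everything else is a hypothetical illustration, not the actual sr3 code):

```python
import logging
import time

logger = logging.getLogger(__name__)

# hypothetical interval: the description above says every 60 seconds,
# often enough to keep sanity_log_dead from declaring the process hung.
PROGRESS_INTERVAL = 60

_last_progress = None

def logProgress(bytes_done, bytes_total):
    """Emit a transfer-progress log line, at most once per PROGRESS_INTERVAL
    seconds, so the log never looks dead during a long transfer."""
    global _last_progress
    now = time.monotonic()
    if _last_progress is not None and now - _last_progress < PROGRESS_INTERVAL:
        return False   # throttled: logged recently enough already
    _last_progress = now
    pct = 100.0 * bytes_done / bytes_total if bytes_total else 0.0
    logger.info("transfer in progress: %d/%d bytes (%.1f%%)",
                bytes_done, bytes_total, pct)
    return True
```

Called from the transfer loop, this keeps log volume bounded regardless of how many chunks move per second.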
This is quite the PR
It might be worth waiting until the dev meeting to go through the PR together too.
I'm not saying I want it to be changed, but would this be a case where we should use an underscore in the config options to indicate that they're grouped together?
sure. not attached to the names... but threshold_logDead is what is used to decide if a process is hung... which is why I called it hungThreshold... so threshold_hung would make the most sense. there is also the existing accelThreshold used to decide whether a binary accelerator should be used for a transfer in place of the built-in python.
oh... and no issue with waiting until monday... it's still a draft... there is still a weirdness with sanity.
yeah threshold_hung/hungThreshold makes sense, I didn't notice that sanity_log_dead had already been changed. For accelThreshold, I feel like it wouldn't really fit into the threshold_ group, because it's used to change actual behaviour of the program, not just monitoring/status/sanity related things. I think I'm leaning towards leaving it alone. If we want to be serious about the underscore grouping, there are a lot of other options that would need to be renamed too.
maybe threshold is a bad word in the first place....
are those more descriptive? I mean they are thresholds... but ... does having that in the name really help?
I also don't know if I want to commit to underscores... v2 was all underscores... some guy (a really good guy btw) I was talking to at a WMO committee wanted camelCase, so I said sure... that was for fields in the message format, but I wanted the code to reflect the messages, one thing led to another... but camelCase is a java/c++ kind of thing... it's weird in a Python context. There was no shape to it in v2 (we were coming from a C background.) We ended up in this weird place... with _ here and there... My mental model is that... if we had .ini-style or yaml config files, the _ would make sense for groupings. post_ would be something like:
(the subscriber is the one without any _ ... the implied default.)
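Purely for illustration (the option names and values below are made up for the sketch, not actual sr3 settings), an underscore group rendered as a yaml-style section might look like:

```yaml
# hypothetical sketch: underscore prefixes become config sections
post:
  broker: amqps://user@host
  exchange: xpublic
threshold:
  hung: 450
  idle: 900
```

In that reading, `post_broker` is just `broker` inside the `post` section, and the flat underscore names are the one-file serialization of that tree.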
I dunno... no firm ideas, just sharing thoughts.
This code says the instances we find that don't have pids are missing. This happens when we read metrics, and then know what the pids of the flows were when it was last running. An instance pid that doesn't have a matching file is actually a stray. I think this code pre-dates strays. Anyway, what was happening:

    sr3 stop xx    # xx has a current metrics file.
    sr3 sanity     # instances of xx are detected as "missing" because metrics are present,
                   # so sanity starts up xx when it should not.
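The distinction being argued for can be sketched as follows (a hypothetical illustration of the intended logic, not the actual sr3 code):

```python
def classify_instance(pid_file_exists, process_alive):
    """Classify one instance of a flow for sanity purposes."""
    if pid_file_exists and process_alive:
        return "running"
    if pid_file_exists and not process_alive:
        return "missing"   # pid file present, process gone: restart candidate
    if process_alive and not pid_file_exists:
        return "stray"     # process without a pid file: kill candidate
    # neither pid file nor process: the presence of an old metrics file
    # should NOT make this "missing", or sanity restarts flows that were
    # deliberately stopped.
    return "stopped"
```

The bug described above is equivalent to letting stale metrics push the last case into "missing".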
perhaps status is too generic... runstate... flowstate
with the new commits I fixed the odd problem of sanity starting too much up. It is ready for testing. Renaming is a thing we can do also.
Looking at the ini/yaml style config, I do think we can find a better group name than threshold. Is stateThreshold_ too long? I think that's the most descriptive of what it actually is: a threshold that is used to determine the state of a component/config.
there is a remaining race condition (that always existed): if you are starting up a multi-instance configuration, the instances take a few tens of seconds to completely set up... the children write their own pid files after they have forked, parsed their configurations, and waited a bit (to avoid fighting over queue declarations and such.) So it takes a little while for all the instances to have pid files. If sr3 sanity runs during that startup period, the instances without .pid files will be strays. sanity will kill them and start replacements. Net result: it will take a few extra seconds for some of the instances to start up. Ways of resolving this:
other ideas welcome... and this PR is big enough... maybe this should be a followup or something.
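One possible resolution is a startup grace period: don't treat a missing .pid file as a stray until the configuration has been up long enough for the children to have written their pid files. This is purely a sketch of that one idea (the stamp-file mechanism and names are hypothetical, not sr3 code):

```python
import os
import time

# hypothetical grace period: children write .pid files only after forking,
# parsing configuration, and waiting a bit, so give them this long.
STARTUP_GRACE = 60

def too_young_to_judge(start_stamp_path, now=None):
    """True if this configuration started recently enough that instances
    without .pid files should not yet be declared strays."""
    if now is None:
        now = time.time()
    try:
        started = os.path.getmtime(start_stamp_path)
    except OSError:
        return False   # no startup stamp: judge normally
    return (now - started) < STARTUP_GRACE
```

sanity would check this before killing anything, at the cost of reacting up to STARTUP_GRACE seconds later to real strays.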
Maybe
we could do long/short combos... runStateThreshold_idle ... rst_idle, rst_hung, rst_lag, rst_slow
With this change, the cleanup() routine now returns a boolean: True for success, False for failure. So remove() now only proceeds if cleanup() reports success. Also, cleanup will do nothing unless all the configurations chosen are stopped. Formerly it would clean up the stopped configurations and skip the others.
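The new control flow can be sketched like this (hypothetical signatures standing in for the real sr.py routines):

```python
def cleanup(configs, running):
    """Clean up broker/state resources for the chosen configs.
    Returns True on success, False on failure.
    Refuses to do anything unless ALL chosen configs are stopped."""
    still_running = [c for c in configs if c in running]
    if still_running:
        print("cleanup refused, still running:", still_running)
        return False
    for c in configs:
        pass  # delete queues/exchanges/state for c here
    return True

def remove(configs, running):
    """Remove configs, but only proceed if cleanup() reports success."""
    if not cleanup(configs, running):
        return False
    for c in configs:
        pass  # delete the configuration files for c here
    return True
```

The all-or-nothing gate avoids the old half-done state where stopped configs were cleaned and the rest silently skipped.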
people want runStateThreshold... as the name. For deciding, if idle:
Reviewed in meeting, it's good to merge once renamed to runStateThreshold_...
from running a dynamic_flow test, proof that the new message works:
but there is a typo in it... which is why it wasn't replaced... must fix. corrected:
close #824
close #966
So this branch was originally meant to add more interpreted states to the sr status display. Instead of just saying "running" we read the metrics a bit and say the flow is:
Now those states are things calculated from the metrics, they aren't directly in the metrics.
So to make it more obvious, and have it show up in sr3 dump as well as sr3 status, I moved all
the calculations on metrics out of the status() routine in sr.py and into _resolve() (the routine that looks at all the sources of information and makes judgements about them).
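The kind of derivation being moved into _resolve() can be sketched like this (the threshold values and metric field names are hypothetical illustrations of the idea, not sr3's actual metrics schema):

```python
import time

def derive_state(metrics, hungThreshold=450, idleThreshold=900, now=None):
    """Interpret raw per-flow metrics into a human-readable running state."""
    if now is None:
        now = time.time()
    last_log = metrics.get("last_log_time", 0)
    last_transfer = metrics.get("last_transfer_time", 0)
    if now - last_log > hungThreshold:
        return "hung"      # nothing logged for too long: process likely stuck
    if now - last_transfer > idleThreshold:
        return "idle"      # alive and logging, but not moving any data
    return "running"
```

Computing this once in _resolve() means status, overview, and dump all see the same interpreted state instead of each re-deriving it.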
When I started running (initially broken) versions of this code, all manner of interesting failure modes were exposed:
realized that sr3 sanity should likely not be starting stopped jobs... changed things so it wouldn't.
found weird error about credentials lookups failing (a regression) cdf8dac
naturally, while it wasn't quite right, some processes were identified as strays... weirdly, it turns out that if sanity found the configs fine other than a few strays, it would not kill the strays. It would only kill them if there was something else also wrong at the same time (missing instances, e.g.)
01b21f0
noticed cleanup doesn't care whether the flows are running; the processes would just redeclare the resources. Added a gate so cleanup refuses to run if the flows are running.
sr3 cleanup is protected by dangerWillRobinson, so you have to supply it the right number of configs for that to work. When the wrong number of configs was supplied, the sr3 cleanup didn't happen. The subsequent sr3 remove, if given the correct number for dangerWillRobinson, did succeed, and the result was that all the queues remained on the broker after a flow_test.
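As I understand the guard from the description above, it behaves roughly like this (a hypothetical sketch, not the actual sr3 code):

```python
def danger_guard(chosen_configs, dangerWillRobinson):
    """Destructive bulk actions (cleanup, remove) only proceed when the
    caller passes dangerWillRobinson with the exact number of configs
    chosen, proving they know how much they are about to destroy."""
    return dangerWillRobinson == len(chosen_configs)
```

The failure mode above is the gap between the two commands: cleanup's guard fails (wrong count), remove's guard passes (right count), and remove used to go ahead without the cleanup having happened.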
from that came the observation that remove() should probably also do a cleanup()
which is actually what v2 does.
also noticed that the ftp subscribers were bombing during the dynamic flow test (hence the crasher tag). I think this is the result of Better sender io errors (download also...) #1037 narrowing the scope of try/excepts. There was a failure mode not covered in the new code: the ftps were aborting with timeouts... so this is exactly the logic Improve handling of large file transfers #966 is dealing with... so I added a new routine logProgress to put log messages out so that the log is no longer dead during long transfers.
adding the new states meant adding new settings to control them:
the existing sanity_log_dead setting is how we decide if a flow is hung. To fit in better, renamed it hungThreshold (with a synonym, so sanity_log_dead still works in a config file.)
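A synonym like that can be handled with a simple alias map at option-parsing time (a hypothetical sketch of the mechanism, not sr3's actual option machinery):

```python
# deprecated option names mapped to their current equivalents, so
# existing config files keep working after a rename.
SYNONYMS = {
    "sanity_log_dead": "hungThreshold",
}

def canonical_option(name):
    """Return the current name for a possibly-deprecated option."""
    return SYNONYMS.get(name, name)
```

Resolving names through one table keeps the rest of the parser ignorant of old spellings, and makes future renames one-line changes.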
There is still a strange bug in sanity, where it doesn't leave all stopped jobs alone, only some of them... working on that before this PR is declared final. but here is a progress report.