
sr_subscribe does not return cleanly when run through ssh/dsh #240

Closed · racetted opened this issue Aug 2, 2019 · 36 comments

Labels: bug (Something isn't working) · likely-fixed (likely fix is in the repository, success not confirmed yet)

@racetted

racetted commented Aug 2, 2019

Using version: 2.19.04b2

When I try to restart subscribers on several hosts at once, it hangs at the end of the restart.

My guess is that the stderr pipe is still being written to as the command ends.

Sample code

for host in "some list of hosts" ; do    ssh $host sr_subscribe mirror-ss1-from-hall2-cmop restart;  done
2019-08-01 19:03:22,442 [INFO] sr_subscribe mirror-ss1-from-hall2-cmop 01 stopping (pid=32521)
2019-08-01 19:03:22,946 [INFO] sr_subscribe mirror-ss1-from-hall2-cmop 01 stopped
2019-08-01 19:03:22,947 [INFO] sr_subscribe mirror-ss1-from-hall2-cmop 01 starting
2019-08-01 19:03:22,949 [INFO] sr_subscribe mirror-ss1-from-hall2-cmop 02 stopping (pid=32524)
2019-08-01 19:03:23,452 [INFO] sr_subscribe mirror-ss1-from-hall2-cmop 02 stopped
2019-08-01 19:03:23,452 [INFO] sr_subscribe mirror-ss1-from-hall2-cmop 02 starting
^CKilled by signal 2.

The ctrl-c is done manually as the command simply hangs there.

@petersilva
Contributor

Hey @benlapETS, please have a talk with @racetted and see if you can reproduce the issue.

@benlapETS
Contributor

benlapETS commented Aug 7, 2019

I have some questions that may be in the wrong place; if so, please tell me where we should start that discussion:

  • What shell are you using over ssh (sh, bash, dash, ksh, ...)?

  • Is there more information about the hosts that fail to give up stderr (profile, user rights, network architecture, ...)?

@racetted
Author

racetted commented Aug 7, 2019

I was using bash.
I ran dsh on a RHEL 7.1 PowerPC host, and the ssh sessions target some RHEL 7.3 PowerPC hosts.

As for which hosts, I didn't want to publish the internal lists of hosts to the world. If that's a non-issue, let me know.

@benlapETS
Contributor

benlapETS commented Aug 7, 2019

Have you tried running sr_subscribe interactively (on your troubled hosts), and if so, does it hang?

@racetted
Author

racetted commented Aug 7, 2019

Yes, and no, it doesn't hang, but it returns control before the last stderr output, which I think is the crux of why the ssh does not end properly.

@benlapETS
Contributor

benlapETS commented Aug 7, 2019

The problem has something to do with running a non-interactive script the same way you run an interactive one, and also with the fact that ssh redirects stdin, which conflicts with the rather fancy way we redirect stdout/stderr (something added to accommodate analysts complaining about clogged stdout/stderr). Try this:

ssh $host -t "bash -ic 'sr_subscribe mirror-ss1-from-hall2-cmop restart &'"

... and tell me if it does the trick. If it does, then the problem has nothing to do with Sarracenia (for now), but rather with the way you want to run interactive scripts in non-interactive sessions, and with the way you configure your profiles on those hosts.

@benlapETS
Contributor

Also, I believe that the way we handle logging (redirecting stdout and stderr) in Sarracenia is terribly wrong, as it creates this kind of issue. The handling is itself a workaround covering up a bigger problem, namely the way we handle multiprocessing and startup logging, which is also terribly wrong. Solving all of that would require a big refactoring and a lot of work to correct the flawed design. Redesigning the multi-process handling (with the Python multiprocessing module) and enabling logging (as the user specifies it) before anything else would avoid clogging stdout without missing any important information during startup.

@petersilva
Contributor

FWIW, we can always keep more site-specific information in a matching internal issue. This issue would just be used to track progress on a fix.

@petersilva
Contributor

I think we can get a really easy example. It is reproducible with any configuration...
sr_subscribe start dd_amis
returns relatively quickly, while
ssh localhost sr_subscribe start dd_amis
never returns, so it's pretty easy to reproduce... how to fix, not so obvious.

Ben's hint with the bash -ic is a good start.

@petersilva
Contributor

petersilva commented Aug 7, 2019

I wonder if the issue isn't just that stdin is perhaps not closed? I just tried and it didn't change a thing, but stdout would appear to be the culprit...

peter@idefix:~/.config/sarra/plugins$ ssh localhost "sr_subscribe start dd_amis >/dev/null"
returns immediately (and works properly), so I think it is just a matter of closing stdout in the
instances once the logging is initialized.

@petersilva
Contributor

or redirecting stdout to the log, just like stderr.

@benlapETS
Contributor

benlapETS commented Aug 9, 2019

Note 1, to @petersilva: so, do you agree that this has nothing to do with Sarracenia (and we close this as not a bug)? ... OK, as we discussed, we disagree: you are stating that it is a simple problem, and I believe it will open a can of worms with our multiprocess management. I will continue digging to document that issue, then.

@petersilva
Contributor

As discussed, I came to the conclusion that Sarracenia is not managing stdout correctly and requires correction. Any unix/linux application that needs to daemonize is responsible for dealing properly with its file descriptors.

sample reference:
http://www.netzmafia.de/skripten/unix/linux-daemon-howto.html

In step 4.6 of the Linux daemon howto, closing the standard file descriptors is a standard element of the job of becoming a daemon. In the case of start, we are starting a number (controlled by the instances setting) of independent daemons, so each instance needs to properly manage its stdout. Based on the testing so far, standard error already looks properly dealt with, so it should be a matter of having the existing code do the same thing for the other file descriptor.

Yes, the case is a little complicated by actions like stop, foreground, and status, where we aren't launching background instances (daemons), but all of that already seems correct for stderr, so it looks like we can just do the same thing and it should be OK.

We should correct this, and it should be a very small correction.
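
As a sketch of what that small correction could look like (illustrative only, not Sarracenia's actual code; logpath stands for the per-instance log file):

import os
import sys

def detach_std_fds(logpath):
    # step 4.6 of the daemon howto: detach the standard descriptors
    devnull = os.open(os.devnull, os.O_RDONLY)
    os.dup2(devnull, 0)                  # stdin  <- /dev/null
    os.close(devnull)
    lfd = os.open(logpath, os.O_CREAT | os.O_WRONLY | os.O_APPEND, 0o644)
    os.dup2(lfd, 1)                      # stdout -> log file
    os.dup2(lfd, 2)                      # stderr -> log file
    if lfd > 2:
        os.close(lfd)
    # keep the Python-level objects pointing at the new descriptors
    sys.stdout = os.fdopen(1, 'a', buffering=1)
    sys.stderr = os.fdopen(2, 'a', buffering=1)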

@petersilva
Contributor

some history on this topic #63

@benlapETS
Contributor

benlapETS commented Aug 14, 2019

If we really wanted to daemonize Sarracenia, why is the default behavior to log to the console? It should log directly to a file; then we wouldn't have to bother with any of this and it would be consistent, and the exception would be running in the foreground. Also, sr_post would still be different (because of already known design issues) unless we finally decide to fix those inconsistencies, because sr_post is not only sr_post, it is also sr_watch and sr_poll...

@matthewdarwin

The easy way to daemonize is to put it in a systemd wrapper. Output to stdout/stderr is automatically logged to syslog. There is nothing you need to change in the code. This is my preferred way of running things these days.

[Unit]
Description=My Sarracenia

[Service]
Type=simple
SyslogIdentifier=sarracenia
ExecStart=/usr/bin/sarracenia
Restart=on-failure

[Install]
WantedBy=multi-user.target
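
Assuming the unit above is installed as /etc/systemd/system/sarracenia.service (the path, unit name, and ExecStart here are illustrative), it would typically be enabled and watched like this:

systemctl daemon-reload
systemctl enable --now sarracenia
journalctl -t sarracenia -f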

@benlapETS
Contributor

benlapETS commented Aug 14, 2019

Note 1: Yes, this would be the right thing to do, thanks. It would remove a lot of overhead in the Sarracenia code. But the problem is how multiprocessing has been designed in Sarracenia (and that is part of what I am criticizing here). If one decides to run 5 instances, it will mix up the log information, because the child processes are independent of the parent and the messages will get lost; or maybe not, will they share the same log file? I'm not sure. If they do, it may also bottleneck and impair performance, because the log file gets locked by one process while the others wait to write to it. We're just talking here; I'll wait for some input on that if anyone has any ideas.

Note 2: Maybe we could manage the multiprocessing with systemd configuration and also remove that burden from Sarracenia? But then auditing and restarting processes would have to be redesigned too... or adapted.

@benlapETS benlapETS added the likely-fixed likely fix is in the repository, success not confirmed yet. label Aug 14, 2019
@petersilva
Contributor

Uh, the default is to log to a log file. When using start, stdout and stderr are supposed to go to the log. I don't understand why you claim the default is to log to the console. If anything is writing to stdout after start, it is the result of a bug in Sarracenia, as it should have been redirected to the log. The ability to use foreground to debug interactively is a feature, not a default or a defect.

The systemd wrapper exists, is in tools/sarra_user.service, and is used on systems that support systemd. However, there are many non-systemd cases. I guess we could insist that users start using systemd, but it seems like a work-around for a bug.

@benlapETS
Contributor

benlapETS commented Aug 21, 2019

Note 1: From the Python logging documentation:

Note that Loggers should NEVER be instantiated directly, but always through the module-level function logging.getLogger(name).

...as getLogger is what makes the logging thread-safe, and each logging handler instance will lock its file. Still from the documentation, here is something about thread safety that is incompatible with how we handle signals:

The logging module is intended to be thread-safe without any special work needing to be done by its clients. It achieves this though using threading locks; there is one lock to serialize access to the module’s shared data, and each handler also creates a lock to serialize access to its underlying I/O.

If you are implementing asynchronous signal handlers using the signal module, you may not be able to use logging from within such handlers. This is because lock implementations in the threading module are not always re-entrant, and so cannot be invoked from such signal handlers.

Now, guess what we are doing:

self.logger  = logging.RootLogger(...)

We need to fix that so we no longer leave a file descriptor dangling.
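
For illustration, a minimal sketch of the getLogger-based setup the documentation calls for; the logger name and handler wiring here are hypothetical, not Sarracenia's actual code:

import logging

# always obtain loggers through getLogger(), never by instantiating RootLogger
logger = logging.getLogger('sarra.instance')   # hypothetical name

def setup_logging(logpath, level=logging.INFO):
    handler = logging.FileHandler(logpath)     # the handler owns (and locks) the file
    handler.setFormatter(logging.Formatter('%(asctime)s [%(levelname)s] %(message)s'))
    logger.addHandler(handler)
    logger.setLevel(level)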

Note 2: Also, we were wrong about the correct behavior for stdin, stdout and stderr. Passing the PIPE argument is the right thing to do to create a new fd for the child; if None, the parent's is passed through. From the Popen documentation:

stdin, stdout and stderr specify the executed program’s standard input, standard output and standard error file handles, respectively. Valid values are PIPE, DEVNULL, an existing file descriptor (a positive integer), an existing file object, and None. PIPE indicates that a new pipe to the child should be created. DEVNULL indicates that the special file os.devnull will be used. With the default settings of None, no redirection will occur; the child’s file handles will be inherited from the parent. Additionally, stderr can be STDOUT, which indicates that the stderr data from the applications should be captured into the same file handle as for stdout.

@benlapETS benlapETS added bug Something isn't working and removed likely-fixed likely fix is in the repository, success not confirmed yet. labels Aug 21, 2019
@petersilva
Contributor

It's a good idea to make things clean from a thread-safety point of view.

For what it's worth, thread safety here means using locks to mediate access to the log file from multiple processes at once. To avoid the need for such locks, we open a log file for each process, so there should be no such contention, and thread safety should not be a critical problem. We eschew threads whenever possible (although it isn't always possible).

RootLogger... hmm... yeah chase that... don't know what that is about.

For file descriptors, PIPE is definitely wrong. We don't want new file descriptors. In more detail...

  • stdout=None (the default) means we get the parent's stdout file descriptor, we close it, and redirect to a logger. This is what normal processes do when forking a daemon in C. As the child is intended to communicate entirely via its log file, and not back to the parent, this is the correct setting.

  • stdout=PIPE means we get a fresh file descriptor pointing to a pipe (man 2 pipe). This establishes a communication channel between parent and child process, where the parent holds the pipe as a higher-numbered file descriptor and can read the child's stdout without it getting confused with its own. We don't want the parent to read the child's output; the parent should be fire-and-forget. Even if we get a pipe, we should be redirecting it to the log file and closing it, so the end result should be no different from None. We don't want such a pipe; look at subprocess.communicate for more info. It's not what we are doing, and it also adds complexity. For example, in https://docs.python.org/3/library/subprocess.html, under Popen Objects:

Popen.wait(timeout=None)
Wait for child process to terminate. Set and return returncode attribute.

If the process does not terminate after timeout seconds, raise a TimeoutExpired exception. It is safe to catch this exception and retry the wait.

Note This will deadlock when using stdout=PIPE or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use Popen.communicate() when using pipes to avoid that.

Look up Popen.communicate() for more details...

  • stdout=DEVNULL means it is pointed at os.devnull before we start; we would then need to re-open fd 1 as a logger stream. The challenge here is ensuring that you actually get fd 1.

No matter what, the value of the standard output file descriptor inherited by the child should be irrelevant, as it gets redirected by the logging setup routine. If any of these settings makes a difference, that is a hint to us that something is wrong. None is correct, but anything could work, provided the logging setup is correct.
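
As a sketch of the fire-and-forget launch described above (the command line is illustrative):

import subprocess

# stdout=None / stderr=None: the child inherits the parent's descriptors and is
# expected to redirect them itself once its log file is open.
child = subprocess.Popen(
    ['sr_subscribe', 'start', 'dd_amis'],
    stdin=subprocess.DEVNULL,   # the child reads nothing from the parent
    stdout=None,
    stderr=None,
)
# fire and forget: the parent neither reads the child's output nor waits on it.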

I'm still hoping the bug is something to do with bogus loggers created on initial startup before we read any options... like in sr_amqp.py there are these self.logger = self.hc.logger assignments. I'm worried that self.hc.logger gets set differently later, and self.logger in sr_amqp.py stays stuck with the initial placeholder logging class.

I see some suspicious stuff:

blacklab% grep -n lfd sr_instances.py
767: #lfd=os.open( self.logpath, os.O_CREAT|os.O_WRONLY|os.O_APPEND )
768: #os.dup2(lfd,1)
769: #os.dup2(lfd,2)
blacklab%

That was commented out last November as part of work on #112. It could be that I was confused, and it might be fun to bring that code back and see if stuff is fixed. It looks like it has the right idea.
At least it is easy to try...
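
Restored, with os imported and logpath in place of self.logpath, the idea of those three lines is:

import os

lfd = os.open(logpath, os.O_CREAT | os.O_WRONLY | os.O_APPEND)   # logpath: the log file
os.dup2(lfd, 1)   # point fd 1 (stdout) at the log
os.dup2(lfd, 2)   # point fd 2 (stderr) at the log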

@benlapETS
Contributor

benlapETS commented Aug 22, 2019

Note 1: The first problem that I see is that you misunderstand what PIPE does. With PIPE, each process opens its own pipes for its standard files, as opposed to None, where every subprocess uses its parent's fds for its standard files. I don't say that only from what I understood of the documentation, but also because I just tested it, and it works like that.

Note 2: I also see that you draw parallels with C programming, but Python has nothing to do with C, and it is very concerning to me that you and Michel code in Python as if it were C; we end up with very bad code because we do things that may sound good in C but are anti-Python. I must point out that Python handles things at a much higher level than C, and we must use the powerful syntax of Python to its full potential.

@benlapETS
Contributor

Note 1: for win32 compatibility we must avoid the dup2 call for redirecting stdout: python-on-windows-handle-invalid-when-redirecting-stdout-writing-to-file

@petersilva
Contributor

sad... perhaps a simpler answer. The following might work?

   sys.stdout.close()
   sys.stdout = open(lfn, 'a')   # lfn: the instance log file name
   sys.stderr.close()
   sys.stderr = sys.stdout

Try printing sys.stdout.fileno(), and if it is 1, it's OK.

( from: https://stackoverflow.com/questions/4675728/redirect-stdout-to-a-file-in-python )

of course there is a minor race condition that fd2 mi

@benlapETS
Contributor

This doesn't work; only this actually works:

os.close(1)

@petersilva
Contributor

so that works?

@benlapETS
Contributor

No, this is low-level I/O; we'll mess up Windows with that. I also had to pass the logger handler's stream to make it work, which is a hack, as open() on logpath won't work. I now have to investigate the low-level side more, and/or look at where sys.stdout is still open other than in the parent and the child instance; something prevents sys.stdout.close() from working properly.

@petersilva
Contributor

rather than guessing, I wrote a script... the following:

import sys

sys.stderr.close()
sys.stderr=open('kk','a')
print( 'hello', file=sys.stderr)

does what I expected and was initially trying to describe, using only Python-level constructs. What is objectionable here?

@petersilva
Contributor

Please note that:

  • Modules in Python may not use proper logging; we have no simple way of imposing our logging conventions on third-party modules.

  • The application has a plugin architecture, which means that code written in any language could potentially be invoked in the normal course of operations. The flow test already invokes the wget C binary, for example, and users processing radar data are using FORTRAN. The convention on Linux is to write information to these standard file descriptors, regardless of application language.

Redirection of the standard file descriptors is a well-known basic requirement for this type of application, is very traditional in a Linux environment, and has nothing to do with the programming language used for a particular application. Perfect logging in our application code will not eliminate or even affect the need for standard file redirection, which will remain.

@benlapETS
Contributor

benlapETS commented Aug 28, 2019

OK, I repeat more clearly: sys.stdout.close() does not work in our instance model (using Popen without pipes), but os.close(1) works, and that is a fact (not a guess). I am investigating why.

http://effbot.org/pyfaq/why-doesn-t-closing-sys-stdout-stdin-stderr-really-close-it.htm
https://stackoverflow.com/questions/24173998/difference-between-os-close0-sys-stdin-close

@petersilva
Contributor

Cool info about sys.stdout.close() in the first link, where it says it closes the underlying C stream? A C stream is a file opened with fopen(3) and closed with fclose(3); those are C library functions, rather than system calls (open(2), close(2)). Python uses the C library to do some buffering, which for some file descriptors means there is user-level buffering that can result in file corruption if a file is close(2)'d rather than fclose(3)'d... but stdout is line-buffered and stderr is completely unbuffered, so in this particular case there is very little likelihood (uh... really none) of any issue.

reference:
https://stackoverflow.com/questions/3723795/is-stdout-line-buffered-unbuffered-or-indeterminate-by-default

C is the language the operating system is implemented in, and Python uses calls implemented in both the kernel (2) and the C library (3). So in this case the C API is a good reference for the behavior of the Python code.

So, if we have:


os.close(1)                    # release fd 1 at the OS level
sys.stdout = open(lfn, "a")    # open() takes the lowest free fd, which is now 1

it should perform correctly. Are you saying the application performs correctly, but you have a worry about doing it that way? I think the only thing the Python-level machinery adds is some buffering (exactly like fopen()/fclose()), which is likely not an issue, as explained above. Is that reassuring? Is that the problem you are worried about?
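
A self-contained sketch of that sequence, with the fd-1 check mentioned earlier (the log file name is illustrative; lfn plays the same role as above):

import os
import sys

lfn = 'sr_subscribe_test.log'    # illustrative log file name

os.close(1)                      # release fd 1 at the OS level
sys.stdout = open(lfn, 'a')      # on POSIX, open() grabs the lowest free fd
assert sys.stdout.fileno() == 1  # confirm stdout really reclaimed fd 1

print('redirected')              # lands in the log file, not the console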

@petersilva
Contributor

But I read that link wrong... even os.close() knows whether a file is a plain file or a stream, so there is zero risk of any buffering issue... and I found a newer version of your link:

https://docs.python.org/3/faq/library.html?highlight=sys%20stderr%20close#why-doesn-t-closing-sys-stdout-stdin-stderr-really-close-it. It says:

To close the underlying C file descriptor for one of these three, you should first be sure that’s what you really want to do (e.g., you may confuse extension modules trying to do I/O). If it is, use os.close():

os.close(stdin.fileno())
os.close(stdout.fileno())
os.close(stderr.fileno())
Or you can use the numeric constants 0, 1 and 2, respectively.

I'm pretty sure that we are sure we really want to do that. really.

@benlapETS
Contributor

benlapETS commented Sep 6, 2019

Update on the issue: I tested the merge (with master) on my side with the flow test and it passed. My only remaining task (coding the tests for this issue) breaks down into two steps:

  • Detailing the test cases
  • Writing the tests

Then I will certainly merge everything before next Wednesday.

@benlapETS
Contributor

benlapETS commented Sep 6, 2019

So far I have drafted these test cases:

  1. test streams:
  • test that the type of the stream is _io.TextIOWrapper before redirection
  • test that the handler stream is open on the standard files after we have redirected stdout/stderr to write to it
  • test that the original stdout gets closed after redirection
  • test that the original stream stays the same when creating the wrapper
  • test that the stream changes when the log rotates
  • test that the wrapper stream gets updated after a rotation and a first write
  2. test file descriptors:
  • test that the fd is different before redirection
  • test that the fd is preserved after redirection
  • test that the handler fd is preserved after redirection
  3. test output:
  • test that log received logging output
  • test that log received subprocess output (skipped for now)
  • test that log received sys.stdout output (skipped for now)
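
For example, one of the file-descriptor cases might look roughly like this (pytest style; the redirect helper is a stand-in for the real instance code, not the actual implementation):

import os
import sys

def redirect_stdout_to(logpath):
    # hypothetical helper standing in for the instance redirection code
    os.close(1)
    sys.stdout = open(logpath, 'a')

def test_fd_is_preserved_after_redirection(tmp_path):
    # run with pytest -s so pytest's own capture does not hold fd 1
    redirect_stdout_to(str(tmp_path / 'test.log'))
    assert sys.stdout.fileno() == 1   # the new stream reclaimed fd 1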

@petersilva
Contributor

sounds like good progress!

@benlapETS benlapETS mentioned this issue Sep 10, 2019
@benlapETS
Contributor

benlapETS commented Sep 12, 2019

I will list here the problems with this implementation that need to be addressed before I retry the merge:

  • sr_audit fails on declare/setup/sanity calls with Error in atexit._run_exitfuncs:, which indicates that we messed with a low-level routine. It happens when declare is run in post-like instances (poll/post/watch)
  • remove the pipe from plugins properly (also add a wait time)

Note 2: It seems the problems weren't that serious... ready to try a merge again.

This was referenced Sep 12, 2019
@petersilva petersilva added the likely-fixed likely fix is in the repository, success not confirmed yet. label Sep 16, 2019
@petersilva
Contributor

fix released in 2.19.09
