
sr_subscribe does not return cleanly when run through ssh/dsh #240

Closed · racetted opened this issue Aug 2, 2019 · 36 comments

Labels: bug (Something isn't working) · likely-fixed (likely fix is in the repository, success not confirmed yet)

@racetted

racetted commented Aug 2, 2019

Using version: 2.19.04b2

When I try to restart subscribers on several hosts at once, it hangs at the end of the restart.

My guess is that the stderr pipe is still being written to as the command ends.

Sample code

for host in "some list of hosts" ; do    ssh $host sr_subscribe mirror-ss1-from-hall2-cmop restart;  done
2019-08-01 19:03:22,442 [INFO] sr_subscribe mirror-ss1-from-hall2-cmop 01 stopping (pid=32521)
2019-08-01 19:03:22,946 [INFO] sr_subscribe mirror-ss1-from-hall2-cmop 01 stopped
2019-08-01 19:03:22,947 [INFO] sr_subscribe mirror-ss1-from-hall2-cmop 01 starting
2019-08-01 19:03:22,949 [INFO] sr_subscribe mirror-ss1-from-hall2-cmop 02 stopping (pid=32524)
2019-08-01 19:03:23,452 [INFO] sr_subscribe mirror-ss1-from-hall2-cmop 02 stopped
2019-08-01 19:03:23,452 [INFO] sr_subscribe mirror-ss1-from-hall2-cmop 02 starting
^CKilled by signal 2.

The ctrl-c is done manually as the command simply hangs there.

@petersilva
Contributor

Hey @benlapETS, please have a talk with @racetted and see if you can reproduce the issue.

@benlapETS
Contributor

benlapETS commented Aug 7, 2019

I have some questions that may be in the wrong place; if so, please tell me where we should start that discussion:

  • What shell are you using over ssh (sh, bash, dash, ksh, ...)?

  • Is there more information about the hosts that fail to give up stderr (profile, user rights, network architecture, ...)?

@racetted
Author

racetted commented Aug 7, 2019

I was using bash.
I ran dsh on a RHEL 7.1 PowerPC host, and the ssh sessions target some RHEL 7.3 PowerPC hosts.

As for which hosts, I didn't want to publish the internal lists of hosts to the world. If that's a non-issue, let me know.

@benlapETS
Contributor

benlapETS commented Aug 7, 2019

Have you tried running sr_subscribe interactively (on your troubled hosts), and if so, does it hang?

@racetted
Author

racetted commented Aug 7, 2019

Yes, and no, it doesn't hang, but it returns control before the last stderr output, which I think is the crux of why the ssh does not end properly.

@benlapETS
Contributor

benlapETS commented Aug 7, 2019

The problem has something to do with running a non-interactive script the same way you run an interactive one, and also with the fact that ssh redirects stdin, which conflicts with the rather fancy way we redirect stdout/stderr (something added to accommodate analysts complaining about clogged stdout/stderr). Try this:

ssh $host -t "bash -ic 'sr_subscribe mirror-ss1-from-hall2-cmop restart &'"

... and tell me if it does the trick. If it does, then the problem has nothing to do with Sarracenia (for now), but rather with the way you want to run interactive scripts in non-interactive sessions, and with the way you configure your profiles on those hosts.

@benlapETS
Contributor

Also, I believe that the way we handle logging (redirecting stdout and stderr) in Sarracenia is terribly wrong, as it creates this kind of issue. The handling is itself a workaround covering up a bigger problem, namely the way we handle multiprocessing and startup logging, which is also terribly wrong. Solving all of that would require a big refactoring and a lot of work to correct the flawed design. Redesigning the multi-process handling (with the Python multiprocessing module) and enabling logging (as the user specifies it) before anything else would avoid clogging stdout without missing any important information during startup.

@petersilva
Contributor

FWIW, we can always keep more site-specific information in a matching internal issue. This issue would just be used to track progress on a fix.

@petersilva
Contributor

I think we can get a really easy example. It is reproducible with any configuration...
sr_subscribe start dd_amis
returns relatively quickly, while
ssh localhost sr_subscribe start dd_amis
never returns, so it's pretty easy to reproduce... how to fix, not so obvious.

Ben's hint with the bash -ic is a good start.

@petersilva
Contributor

petersilva commented Aug 7, 2019

I wonder if the issue isn't just that stdin is perhaps not closed? I just tried and it didn't change a thing, but stdout would appear to be the culprit...

peter@idefix:~/.config/sarra/plugins$ ssh localhost "sr_subscribe start dd_amis >/dev/null"
returns immediately (and works properly), so I think it is just a matter of closing stdout in the
instances once the logging is initialized.

@petersilva
Contributor

or redirecting stdout to the log, just like stderr.

@benlapETS
Contributor

benlapETS commented Aug 9, 2019

Note 1, to @petersilva: so, do you agree that this has nothing to do with Sarracenia (and we close this as not a bug)? ... OK, as we discussed, we disagree: you are stating that it is a simple problem, and I believe it will open a can of worms with our multiprocess management. I will continue digging to document that issue, then.

@petersilva
Contributor

As discussed, I came to the conclusion that Sarracenia is not managing stdout correctly and requires correction. Any unix/linux application that needs to daemonize is responsible for dealing properly with its file descriptors.

sample reference:
http://www.netzmafia.de/skripten/unix/linux-daemon-howto.html

In step 4.6 of the Linux daemon howto, closing the standard file descriptors is a standard element of the job of becoming a daemon. In the case of start, we are starting a number (controlled by the instances setting) of independent daemons, so each instance needs to properly manage its stdout. Based on the testing so far, standard error already looks properly dealt with, so it should be a matter of having the existing code do the same thing for the other file descriptor.

Yes, the case is a little complicated by actions like stop, foreground, and status, where we aren't launching background instances (daemons), but all of that already seems correct for stderr, so it looks like we can just do the same thing and it should be OK.

We should correct this, and it should be a very small correction.
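
As a sketch of what that small correction could look like (illustrative only, not Sarracenia's actual code; logpath stands for the per-instance log file):

import os
import sys

def detach_std_fds(logpath):
    # step 4.6 of the daemon howto: detach the standard descriptors
    devnull = os.open(os.devnull, os.O_RDONLY)
    os.dup2(devnull, 0)                  # stdin  <- /dev/null
    os.close(devnull)
    lfd = os.open(logpath, os.O_CREAT | os.O_WRONLY | os.O_APPEND, 0o644)
    os.dup2(lfd, 1)                      # stdout -> log file
    os.dup2(lfd, 2)                      # stderr -> log file
    if lfd > 2:
        os.close(lfd)
    # keep the Python-level objects pointing at the new descriptors
    sys.stdout = os.fdopen(1, 'a', buffering=1)
    sys.stderr = os.fdopen(2, 'a', buffering=1)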

@petersilva
Contributor

some history on this topic #63

@benlapETS
Contributor

benlapETS commented Aug 14, 2019

If we really wanted to daemonize Sarracenia, why is the default behavior to log to the console? It should log directly to a file; then we wouldn't have to bother with any of this and it would be consistent, and the exception would be running in the foreground. Also, sr_post would still be different (because of already known design issues) unless we finally decide to fix those inconsistencies, because sr_post is not only sr_post, it is also sr_watch and sr_poll...

@matthewdarwin

The easy way to daemonize is to put it in a systemd wrapper. Output to stdout/stderr is automatically logged to syslog. There is nothing you need to change in the code. This is my preferred way of running things these days.

[Unit]
Description=My Sarracenia

[Service]
Type=simple
SyslogIdentifier=sarracenia
ExecStart=/usr/bin/sarracenia
Restart=on-failure

[Install]
WantedBy=multi-user.target
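
Assuming the unit above is installed as /etc/systemd/system/sarracenia.service (the path, unit name, and ExecStart here are illustrative), it would typically be enabled and watched like this:

systemctl daemon-reload
systemctl enable --now sarracenia
journalctl -t sarracenia -f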

@benlapETS
Contributor

benlapETS commented Aug 14, 2019

Note 1: Yes, this would be the right thing to do, thanks. It would remove a lot of overhead in the Sarracenia code. But the problem is how multiprocessing has been designed in Sarracenia (and that is part of what I am criticizing here). If one decides to run 5 instances, it will mix up the log information, because the child processes are independent of the parent and the messages will get lost; or maybe not, will they share the same log file? I'm not sure. If they do, it may also bottleneck and impair performance, because the log file gets locked by one process while the others wait to write to it. We're just talking here; I'll wait for some input on that if anyone has any ideas.

Note 2: Maybe we could manage the multiprocessing with systemd configuration and also remove that burden from Sarracenia? But then auditing and restarting processes would have to be redesigned too... or adapted.

@benlapETS benlapETS added the likely-fixed likely fix is in the repository, success not confirmed yet. label Aug 14, 2019
@petersilva
Contributor

Uh, the default is to log to a log file. When using start, stdout and stderr are supposed to go to the log. I don't understand why you claim the default is to log to the console. If anything is writing to stdout after start, it is the result of a bug in Sarracenia, as it should have been redirected to the log. The ability to use foreground to debug interactively is a feature, not a default or a defect.

The systemd wrapper exists, is in tools/sarra_user.service, and is used on systems that support systemd. However, there are many non-systemd cases. I guess we could insist that users start using systemd, but it seems like a work-around for a bug.

@benlapETS
Contributor

benlapETS commented Aug 21, 2019

Note 1: From the Python logging documentation:

Note that Loggers should NEVER be instantiated directly, but always through the module-level function logging.getLogger(name).

...as getLogger is what makes the logging thread-safe, and each logging handler instance will lock its file. Still from the documentation, here is something about thread safety that is incompatible with how we handle signals:

The logging module is intended to be thread-safe without any special work needing to be done by its clients. It achieves this though using threading locks; there is one lock to serialize access to the module’s shared data, and each handler also creates a lock to serialize access to its underlying I/O.

If you are implementing asynchronous signal handlers using the signal module, you may not be able to use logging from within such handlers. This is because lock implementations in the threading module are not always re-entrant, and so cannot be invoked from such signal handlers.

Now, guess what we are doing:

self.logger  = logging.RootLogger(...)

We need to fix that so we no longer leave a file descriptor dangling.
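
For illustration, a minimal sketch of the getLogger-based setup the documentation calls for; the logger name and handler wiring here are hypothetical, not Sarracenia's actual code:

import logging

# always obtain loggers through getLogger(), never by instantiating RootLogger
logger = logging.getLogger('sarra.instance')   # hypothetical name

def setup_logging(logpath, level=logging.INFO):
    handler = logging.FileHandler(logpath)     # the handler owns (and locks) the file
    handler.setFormatter(logging.Formatter('%(asctime)s [%(levelname)s] %(message)s'))
    logger.addHandler(handler)
    logger.setLevel(level)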

Note 2: Also, we were wrong about the correct behavior for stdin, stdout and stderr. Passing the PIPE argument is the right thing to do to create a new fd for the child; if None, the parent's is passed through. From the Popen documentation:

stdin, stdout and stderr specify the executed program’s standard input, standard output and standard error file handles, respectively. Valid values are PIPE, DEVNULL, an existing file descriptor (a positive integer), an existing file object, and None. PIPE indicates that a new pipe to the child should be created. DEVNULL indicates that the special file os.devnull will be used. With the default settings of None, no redirection will occur; the child’s file handles will be inherited from the parent. Additionally, stderr can be STDOUT, which indicates that the stderr data from the applications should be captured into the same file handle as for stdout.

@benlapETS benlapETS added bug Something isn't working and removed likely-fixed likely fix is in the repository, success not confirmed yet. labels Aug 21, 2019
@petersilva
Contributor

It's a good idea to make things clean from a thread-safety point of view.

For what it's worth, thread safety here means using locks to mediate access to the log file from multiple processes at once. To avoid the need for such locks, we open a log file for each process, so there should be no such contention, and thread safety should not be a critical problem. We eschew threads whenever possible (although it isn't always possible).

RootLogger... hmm... yeah chase that... don't know what that is about.

For file descriptors, PIPE is definitely wrong. We don't want new file descriptors. In more detail...

  • stdout=None (the default) means we get the parent's stdout file descriptor, we close it, and redirect to a logger. This is what normal processes do when forking a daemon in C. As the child is intended to communicate entirely via its log file, and not back to the parent, this is the correct setting.

  • stdout=PIPE means we get a fresh file descriptor pointing to a pipe (man 2 pipe). This establishes a communication channel between parent and child process, where the parent holds the pipe as a higher-numbered file descriptor and can read the child's stdout without it getting confused with its own. We don't want the parent to read the child's output; the parent should be fire-and-forget. Even if we get a pipe, we should be redirecting it to the log file and closing it, so the end result should be no different from None. We don't want such a pipe; look at subprocess.communicate for more info. It's not what we are doing, and it also adds complexity. For example, in https://docs.python.org/3/library/subprocess.html, under Popen Objects:

Popen.wait(timeout=None)
Wait for child process to terminate. Set and return returncode attribute.

If the process does not terminate after timeout seconds, raise a TimeoutExpired exception. It is safe to catch this exception and retry the wait.

Note This will deadlock when using stdout=PIPE or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use Popen.communicate() when using pipes to avoid that.

Look up Popen.communicate() for more details...

  • stdout=DEVNULL means it is pointed at os.devnull before we start; we would then need to re-open fd 1 as a logger stream. The challenge here is ensuring that you actually get fd 1.

No matter what, the value of the standard output file descriptor inherited by the child should be irrelevant, as it gets redirected by the logging setup routine. If any of these settings makes a difference, that is a hint to us that something is wrong. None is correct, but anything could work, provided the logging setup is correct.
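
As a sketch of the fire-and-forget launch described above (the command line is illustrative):

import subprocess

# stdout=None / stderr=None: the child inherits the parent's descriptors and is
# expected to redirect them itself once its log file is open.
child = subprocess.Popen(
    ['sr_subscribe', 'start', 'dd_amis'],
    stdin=subprocess.DEVNULL,   # the child reads nothing from the parent
    stdout=None,
    stderr=None,
)
# fire and forget: the parent neither reads the child's output nor waits on it.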

I'm still hoping the bug is something to do with bogus loggers created on initial startup before we read any options... like in sr_amqp.py there are these self.logger = self.hc.logger assignments. I'm worried that self.hc.logger gets set differently later, and self.logger in sr_amqp.py stays stuck with the initial placeholder logging class.

I see some suspicious stuff:

blacklab% grep -n lfd sr_instances.py
767: #lfd=os.open( self.logpath, os.O_CREAT|os.O_WRONLY|os.O_APPEND )
768: #os.dup2(lfd,1)
769: #os.dup2(lfd,2)
blacklab%

That was commented out last November as part of work on #112. It could be that I was confused, and it might be fun to bring that code back and see if stuff is fixed. It looks like it has the right idea.
At least it is easy to try...
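
Restored, with os imported and logpath in place of self.logpath, the idea of those three lines is:

import os

lfd = os.open(logpath, os.O_CREAT | os.O_WRONLY | os.O_APPEND)   # logpath: the log file
os.dup2(lfd, 1)   # point fd 1 (stdout) at the log
os.dup2(lfd, 2)   # point fd 2 (stderr) at the log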

@benlapETS
Contributor

benlapETS commented Aug 22, 2019

Note 1: The first problem that I see is that you misunderstand what PIPE does. With PIPE, each process opens its own pipes for its standard files, as opposed to None, where every subprocess uses its parent's fds for its standard files. I don't say that only from what I understood of the documentation, but also because I just tested it, and it works like that.

Note 2: I also see that you draw parallels with C programming, but Python has nothing to do with C, and it is very concerning to me that you and Michel code in Python as if it were C; we end up with very bad code because we do things that may sound good in C but are anti-Python. I must point out that Python handles things at a much higher level than C, and we must use the powerful syntax of Python to its full potential.

@benlapETS
Contributor

Note 1: for win32 compatibility we must avoid the dup2 call for redirecting stdout: python-on-windows-handle-invalid-when-redirecting-stdout-writing-to-file

@petersilva
Contributor

sad... perhaps a simpler answer. The following might work?

   sys.stdout.close()
   sys.stdout = open(lfn, 'a')   # lfn: the instance log file name
   sys.stderr.close()
   sys.stderr = sys.stdout

Try printing sys.stdout.fileno(), and if it is 1, it's OK.

( from: https://stackoverflow.com/questions/4675728/redirect-stdout-to-a-file-in-python )

of course there is a minor race condition that fd2 mi

@benlapETS
Contributor

This doesn't work; only this actually works:

os.close(1)

@petersilva
Contributor

so that works?

@benlapETS
Contributor

No, this is low-level I/O; we'll mess up Windows with that. I also had to pass the logger handler's stream to make it work, which is a hack, as open() on logpath won't work. I now have to investigate the low-level side more, and/or look at where sys.stdout is still open other than in the parent and the child instance; something prevents sys.stdout.close() from working properly.

@petersilva
Contributor

rather than guessing, I wrote a script... the following:

import sys

sys.stderr.close()
sys.stderr=open('kk','a')
print( 'hello', file=sys.stderr)

does what I expected and was initially trying to describe, using only Python-level constructs. What is objectionable here?

@petersilva
Contributor

Please note that:

  • Modules in Python may not use proper logging; we have no simple way of imposing our logging conventions on third-party modules.

  • The application has a plugin architecture, which means that code written in any language could potentially be invoked in the normal course of operations. The flow test already invokes the wget C binary, for example, and users processing radar data are using FORTRAN. The convention on Linux is to write information to these standard file descriptors, regardless of application language.

Redirection of the standard file descriptors is a well-known basic requirement for this type of application, is very traditional in a Linux environment, and has nothing to do with the programming language used for a particular application. Perfect logging in our application code will not eliminate or even affect the need for standard file redirection, which will remain.

@benlapETS
Contributor

benlapETS commented Aug 28, 2019

OK, I repeat more clearly: sys.stdout.close() does not work in our instance model (using Popen without pipes), but os.close(1) works, and that is a fact (not a guess). I am investigating why.

http://effbot.org/pyfaq/why-doesn-t-closing-sys-stdout-stdin-stderr-really-close-it.htm
https://stackoverflow.com/questions/24173998/difference-between-os-close0-sys-stdin-close

@petersilva
Contributor

Cool info about sys.stdout.close() in the first link, where it says it closes the underlying C stream? A C stream is a file opened with fopen(3) and closed with fclose(3); those are C library functions, rather than system calls (open(2), close(2)). Python uses the C library to do some buffering, which for some file descriptors means there is user-level buffering that can result in file corruption if a file is close(2)'d rather than fclose(3)'d... but stdout is line-buffered and stderr is completely unbuffered, so in this particular case there is very little likelihood (uh... really none) of any issue.

reference:
https://stackoverflow.com/questions/3723795/is-stdout-line-buffered-unbuffered-or-indeterminate-by-default

C is the language the operating system is implemented in, and Python uses calls implemented in both the kernel (2) and the C library (3). So in this case the C API is a good reference for the behavior of the Python code.

So, if we have:


os.close(1)                    # release fd 1 at the OS level
sys.stdout = open(lfn, "a")    # open() takes the lowest free fd, which is now 1

it should perform correctly. Are you saying the application performs correctly, but you have a worry about doing it that way? I think the only thing the Python-level machinery adds is some buffering (exactly like fopen()/fclose()), which is likely not an issue, as explained above. Is that reassuring? Is that the problem you are worried about?
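
A self-contained sketch of that sequence, with the fd-1 check mentioned earlier (the log file name is illustrative; lfn plays the same role as above):

import os
import sys

lfn = 'sr_subscribe_test.log'    # illustrative log file name

os.close(1)                      # release fd 1 at the OS level
sys.stdout = open(lfn, 'a')      # on POSIX, open() grabs the lowest free fd
assert sys.stdout.fileno() == 1  # confirm stdout really reclaimed fd 1

print('redirected')              # lands in the log file, not the console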

@petersilva
Contributor

But I read that link wrong... even os.close() knows whether a file is a plain file or a stream, so there is zero risk of any buffering issue... and I found a newer version of your link:

https://docs.python.org/3/faq/library.html?highlight=sys%20stderr%20close#why-doesn-t-closing-sys-stdout-stdin-stderr-really-close-it. It says:

To close the underlying C file descriptor for one of these three, you should first be sure that’s what you really want to do (e.g., you may confuse extension modules trying to do I/O). If it is, use os.close():

os.close(stdin.fileno())
os.close(stdout.fileno())
os.close(stderr.fileno())
Or you can use the numeric constants 0, 1 and 2, respectively.

I'm pretty sure that we are sure we really want to do that. really.

@benlapETS
Contributor

benlapETS commented Sep 6, 2019

Update on the issue: I tested the merge (with master) on my side with the flow test and it passed. My only remaining task (coding the tests for this issue) breaks down into two steps:

  • Detailing the test cases
  • Writing the tests

Then I will certainly merge everything before next Wednesday.

@benlapETS
Contributor

benlapETS commented Sep 6, 2019

So far I have drafted these test cases:

  1. test streams:
  • test that the type of the stream is _io.TextIOWrapper before redirection
  • test that the handler stream is open on the standard files after we have redirected stdout/stderr to write to it
  • test that the original stdout gets closed after redirection
  • test that the original stream stays the same when creating the wrapper
  • test that the stream changes when the log rotates
  • test that the wrapper stream gets updated after a rotation and a first write
  2. test file descriptors:
  • test that the fd is different before redirection
  • test that the fd is preserved after redirection
  • test that the handler fd is preserved after redirection
  3. test output:
  • test that log received logging output
  • test that log received subprocess output (skipped for now)
  • test that log received sys.stdout output (skipped for now)
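
For example, one of the file-descriptor cases might look roughly like this (pytest style; the redirect helper is a stand-in for the real instance code, not the actual implementation):

import os
import sys

def redirect_stdout_to(logpath):
    # hypothetical helper standing in for the instance redirection code
    os.close(1)
    sys.stdout = open(logpath, 'a')

def test_fd_is_preserved_after_redirection(tmp_path):
    # run with pytest -s so pytest's own capture does not hold fd 1
    redirect_stdout_to(str(tmp_path / 'test.log'))
    assert sys.stdout.fileno() == 1   # the new stream reclaimed fd 1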

@petersilva
Contributor

sounds like good progress!

@benlapETS benlapETS mentioned this issue Sep 10, 2019
@benlapETS
Contributor

benlapETS commented Sep 12, 2019

I will list here the problems with this implementation that need to be addressed before I retry the merge:

  • sr_audit fails on declare/setup/sanity calls with Error in atexit._run_exitfuncs:, which indicates that we messed with a low-level routine. It happens when declare is run in post-like instances (poll/post/watch)
  • remove the pipe from plugins properly (also add a wait time)

Note 2: It seems the problems weren't that serious... ready to try a merge again.

This was referenced Sep 12, 2019
@petersilva petersilva added the likely-fixed likely fix is in the repository, success not confirmed yet. label Sep 16, 2019
@petersilva
Contributor

fix released in 2.19.09
