feat: redirect output stream to os.devnull via parameters of Popen#203
Conversation
|
Thanks for reading the documentation. I think it is the better of the two options since we are using Popen anyway. |
|
I agree that both options should work. Initially, I had thought that all text after the 'ssh' command would get executed on the remote server, and therefore that (a) would be redirecting text on the server rather than on the local machine. But this is not true, as you can see by running the command yourself. Therefore, I still don't understand how/why this solves the problem of the remote process trying to write text and failing on a broken pipe caused by a disconnection of the ssh between our local machine and Zeus. Don't we also still want/need the `nohup`? @BainanXia, when you tested (b) on 746 and 747, how long did you keep the VPN off? Long enough to get beyond the 10-15 minutes of buffering that we've observed?
|
@danielolsen, try running this:
|
|
@rouille you are right. The same happens if we redirect to a file. In that case, maybe there is a difference between (a) and (b)? Because (a) seems to do the redirection of the output in a different place than (b) does.
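For reference, the two styles under discussion can be sketched as follows. This is an illustration, not the code in this PR; `zeus` and `run.jl` are placeholders.

```python
import subprocess
import sys

# Style (a): redirection characters inside the command string. Whichever
# shell ends up parsing the string performs the redirection, which is why
# "where does the output get discarded?" is a fair question.
cmd_a = 'ssh zeus "julia run.jl" > /dev/null 2>&1'  # not executed here

# Style (b): no shell redirection at all; Popen itself points the child's
# stdout/stderr at the null device. Demonstrated with a local command so
# the sketch runs without an ssh setup:
proc = subprocess.Popen(
    [sys.executable, "-c", "print('discarded')"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
proc.wait()
print(proc.returncode)  # 0: the child ran, its output went nowhere
```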
|
With the redirection to a file, there is nothing to read from the pipe because the standard output has been sent to the file. But you can still get the output by reading the file afterwards.
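As a runnable sketch of that difference, with local commands standing in for the ssh call:

```python
import os
import subprocess
import sys
import tempfile

# With stdout=PIPE, the parent reads the child's output from the pipe:
p = subprocess.Popen(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,
    text=True,
)
piped, _ = p.communicate()

# With stdout redirected to a file, there is nothing on the pipe, but the
# text can still be recovered from the file afterwards:
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(path, "w") as f:
    subprocess.run([sys.executable, "-c", "print('hello')"], stdout=f)
with open(path) as f:
    from_file = f.read()

print(piped == from_file)  # True
```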
|
You're right, it's more than just redirection of the text that comes back to the local machine; there are other differences as well in what the two options do. Solution (a) was still using the original Popen parameters, with the redirection appended to the remote command.
|
@danielolsen When I tested 746 and 747, I cut off my VPN right after the jobs were submitted and never reconnected until they were finished and extracted via my remote PC in the office.
|
I would say that a) and b) are equivalent, since the only thing that seems to matter is redirecting the standard streams to `os.devnull`.
What do you mean 'Popen handles the SIGHUP'? |
|
@danielolsen Try it. If we dive into the source code of `subprocess`, we can see how the standard stream parameters are handled.
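For context on the SIGHUP discussion: `nohup`'s core trick is making the child ignore SIGHUP, so the death of the controlling terminal or ssh session does not kill it. A minimal `Popen` equivalent (POSIX only; an illustration, not the code in this branch):

```python
import signal
import subprocess
import sys

def ignore_hup():
    # Runs in the child just before exec; make it immune to hangup,
    # roughly what nohup does (nohup also redirects a terminal stdout
    # to nohup.out, which we skip here).
    signal.signal(signal.SIGHUP, signal.SIG_IGN)

proc = subprocess.Popen(
    [sys.executable, "-c", "print('still alive')"],
    preexec_fn=ignore_hup,
    stdout=subprocess.PIPE,
    text=True,
)
out, _ = proc.communicate()
print(out)
```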
|
@BainanXia I have tried that one before :) For the user's purposes, (a) and (b) are equivalent, and from your testing it seems that (b) works, but I still don't understand why redirecting the streams prevents the broken-pipe failure.
|
Automatically killed by whom? Popen cannot kill the job because the connection is broken. If the job is killed by a signal, who sends it?
Don't you think that setting the streams to `os.devnull` is what makes the difference?
The remote child process is created with default settings, so its file handles are inherited from the parent process. This is documented in my first post.
In my mind I see five generations of processes. I can see how Popen would apply to the first child (the local ssh client), but not how its parameters would reach the remote ones.
@BainanXia, this implies that any streams produced by the remote processes inherit the redirection as well?
|
I'm fairly confident that a remote process can't be a child of a local process - this wouldn't make sense given that the OS kernel has to manage process state locally. The remote parent here is just sshd (whose parent is the remote init system), which executes commands we send it as children. You can see this by inspecting the process tree on the remote machine.
|
@jon-hagg, do you understand why both options seem to act similarly? I don’t know much about all of this. |
|
@rouille yeah I'm going back and forth in my head a bit. Seems like there are some differences, so just to summarize -
Now that I think about it, I'm not sure why the second approach works, unless coincidentally. I'd still expect it to get a SIGHUP on disconnection as before. Can't say much else at the moment, will update if I think of something useful. |
|
It would be nice to understand what is going on.
|
I think at this point, it is a good opportunity to understand what fundamentally causes the disconnection issue. We start a local process via Popen that runs a script on the remote server. I executed such a test script locally on my Mac with the VPN on. And indeed, in both tests, no matter whether I terminated the local process or not, the remote log was complete, i.e. the remote process finished successfully. The only difference was whether I could get the output back on the local side. Then the following question popped up in my mind: during our script's run, do we have any functions that require the connection between the local client and the remote server to stay up? In other words, is there any function that tries to write anything into the pipe and will terminate the remote process if it fails to do so?
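The probe snippets themselves were not preserved in this thread; the idea was a script that counts and prints at a fixed interval, launched with the streams piped back, so one can see where the process dies. A hedged reconstruction, with the counts shortened so it runs quickly:

```python
import subprocess
import sys

# Hedged reconstruction of the counting probe (the original snippet is
# not preserved): print a counter with flushing, while the parent pipes
# the output back, as in the original setup under suspicion. Counting to
# 5 here instead of 30 or 3600.
probe = (
    "import time\n"
    "for i in range(5):\n"
    "    print(i, flush=True)\n"
    "    time.sleep(0.01)\n"
)
proc = subprocess.Popen(
    [sys.executable, "-c", probe],
    stdout=subprocess.PIPE,  # 'PIPE' is the stream option being tested
    text=True,
)
out, _ = proc.communicate()
print(out.splitlines()[-1])  # the last number the probe reached: 4
```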
Currently, we are only writing to the pipe from the start of the Julia process until the process gets available capacity on the Gurobi cloud, and then for about a minute afterwards until all the input files are loaded/prepared. After that, REISE.jl redirects all further output to local (Zeus) files. See Breakthrough-Energy/REISE.jl#57. This REISE.jl feature was implemented around the same time as we started testing the modifications to PowerSimData, which makes it a bit harder to draw solid conclusions. @BainanXia, did Scenarios 746 and 747 have to queue for a while, or did they get cloud capacity right away?
They had to queue for a while, until I spun up another machine.
|
How long were they in the queue? Before we started redirecting the Julia output, we were seeing writes to the ssh pipe still succeeding for 10-15 minutes after the connection had been killed. Did they queue (and keep printing) longer than that? |
|
And we did not have any problem with REISE... |
|
@danielolsen I waited about 10 minutes before I spun up a new machine. Hence, they actually stayed in the queue for about 10 minutes. |
|
@rouille Any suggestions on what else I could test with my test scripts?
If you want to try to expand the capabilities of MATPOWER/MOST, be my guest 😛
Try counting to 3600 instead of 30? If the pipe only fails after a larger amount of output, a longer count should trigger it.
@danielolsen we should not look back, but it is very strange that it does not happen for MATLAB + MATPOWER/MOST + GUROBI. I think that if it breaks at 3600 s, it would be worth implementing the `nohup` approach.
If we want to be rigorous, let's try both the Python and the Julia scripts.
|
Interesting observation: the 1-hour test (counting to 3600) failed for both the Python and the Julia scripts. Python stopped at 3398 whereas Julia stopped at 3346. Will do a 2-hour test now.
|
Just to make sure, you ran 4 tests, one for Julia for options a and b and one for Python for options a and b. Correct? |
Nope. Only two tests, for Python and Julia respectively. Neither a) nor b), but our original setup, with which we encountered the disconnection issue. The purpose of the tests is to reproduce the disconnection issue without our scenario framework, to understand what actually causes the problem.
Ok, thanks for refreshing my memory. When did you start the tests? STG did some maintenance from 7pm onward yesterday affecting the VPN. Now, you are doing a 2h test with no connection interruption. What do you plan to do next?
Yesterday around 5:30. I'm not sure whether the tests were affected by the STG maintenance, since I cut off my connection to Zeus (as well as the VPN) right after I started the tests. I presume the maintenance won't shut down the machines in the lab, hence it should have no impact on the tests. Given that the results are not as expected, I ran another 2h test to check again. The whole loop involves five components: the local client that sshes to Zeus; sshd on Zeus; the Python caller that calls either the Matlab or the Julia script; the script, which starts a Gurobi client and submits jobs to Gurobi Cloud; and Gurobi Cloud, which sends results back. We are aware that the issue is caused by something that keeps trying to write streams to the ssh pipe between the local client and Zeus and terminates the process if it fails to do so, but we are not sure what it is. I think we need to check those components one by one until we locate the function/script that is doing this.
|
We know there is some sort of buffer in writing to the SSH pipe, as evidenced by the jobs not immediately failing once the SSH is disconnected, and we know that the buffer has some sort of limit, because the pipe eventually does fail. We have previously seen this happen after 10-15 minutes, but maybe it is not time-based, but size-based. That might explain the longer time before failure for the counting tests, which produce output more slowly.
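The size-based theory is plausible in general: several buffers sit between the writer and the network, and one of them, the kernel pipe, has a fixed capacity that a writer only exhausts after a certain volume of output, not a certain amount of time. On Linux that capacity can be inspected directly:

```python
import fcntl
import os

# F_GETPIPE_SZ (command value 1032, Linux-specific) reports a pipe's
# kernel buffer capacity; the default is 64 KiB. Once the reader is gone
# and this buffer fills, further writes block or fail, which would make
# the observed failure size-based rather than time-based.
F_GETPIPE_SZ = 1032

r, w = os.pipe()
size = fcntl.fcntl(w, F_GETPIPE_SZ)
os.close(r)
os.close(w)
print(size)  # typically 65536 on Linux
```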
|
Test update: the 2h tests conducted yesterday ended at 1270 with Python and 998 with Julia, whereas 7200 is the expected number. Given that both the 1h and the 2h tests failed at some random point, with or without a connection interruption (STG maintenance), can we conclude that the SSH connection has to stay active when the standard stream option is 'PIPE'? If so, why could REISE survive?
My best theory so far is that there is an undocumented feature within MATLAB (or the Python interface to MATLAB) that wraps around the stdout printing: either catching and ignoring the broken pipe error, or with its own internal buffer, or something like that. I bet we could replicate this with a try/except wrapper around the print calls. Based on everything I've read so far on running commands over SSH, they should be expected to fail if the SSH tunnel does, and the fact that they did not with REISE is the exception, rather than the rule.
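The hypothesized wrapper is easy to sketch. This is purely illustrative; we have not confirmed that MATLAB does anything like this:

```python
def safe_print(*args, **kwargs):
    """Print, but survive a broken pipe (e.g. a dead ssh tunnel)."""
    try:
        print(*args, flush=True, **kwargs)
    except BrokenPipeError:
        # The reader is gone; drop this output and keep computing,
        # instead of letting the exception kill the process.
        pass

safe_print("this output is best-effort")
```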
|
Should we go with option a and call it a day? |
|
I think we should merge this PR with option a). It does not mean that we give up on the other approaches.
|
@rouille Sure. Will switch to a) in my next commit. I will also create an issue to track the remaining tests that we would like to carry out.
danielolsen left a comment:
Thanks for your patience as we continued to discuss root causes.
Purpose
This branch is a clean-up version of the recent working branch `nohup_execute`, regarding the disconnection issue we encountered during scenario runs. We conducted a series of test runs to explore possible solutions to remove the requirement of an active SSH connection between the local client and our remote server (Zeus). The following two approaches, which redirect the output stream to `os.devnull`, are proven to work:
os.devnullare proved to work:a)
b)
After reading through the documentation of `subprocess.Popen` and what its standard stream parameters do: literally, there is no difference between a) and b). In both cases, we create a child process with its output stream redirected to `os.devnull`, and any lower-level process (the Julia script) will inherit this property, suppressing the communication with the local client that starts the process.

Closes #278.
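The inheritance claim can be checked directly. A sketch using a temporary file instead of `os.devnull`, so that the result is inspectable:

```python
import os
import subprocess
import sys
import tempfile

# The direct child gets its stdout pointed at our file; its own child (a
# stand-in for the Julia script) inherits that descriptor, so the
# grandchild's print lands in the same place. With os.devnull instead of
# a file, both generations' output would be suppressed the same way.
child_code = (
    "import subprocess, sys; "
    "subprocess.run([sys.executable, '-c', \"print('from grandchild')\"])"
)
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(path, "w") as f:
    subprocess.run([sys.executable, "-c", child_code], stdout=f)
with open(path) as f:
    content = f.read()
print(content)  # from grandchild
```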
What is the code doing
Implement approach b) above.
Validation
This branch has been tested via Scenarios 746 and 747. Both were created locally on a Mac laptop with the VPN on; they then went into the queue on Gurobi Cloud. Right after they were submitted, the local VPN was cut off. Both scenarios finished successfully and were extracted.
Time to review
5 min to 30 min, depending on how much one would like to dive into the documentation of the `subprocess` module.