
stdout not redirected to repl #10

Closed
bjarthur opened this issue Apr 3, 2014 · 16 comments

Comments

@bjarthur
Collaborator

bjarthur commented Apr 3, 2014

While stdout for a local process appears in the REPL, that for a remote SGE process does not; see the transcript below. Not shown: it works fine for remote SSH processes. Is this related to JuliaLang/julia#6030, JuliaLang/julia#5995, and #6?

[arthurb@h06u01 ~]$ juliac
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.3.0-prerelease+2417 (2014-04-02 18:29 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 193cb11* (0 days old master)
|__/                   |  x86_64-redhat-linux

julia> using ClusterManagers

julia> ClusterManagers.addprocs_sge(1)
job id is 8083782, waiting for job to start ..................................................
1-element Array{Any,1}:
2

julia> addprocs(1)
1-element Array{Any,1}:
3

julia> remotecall_fetch(3,println,"foo")
From worker 3: foo

julia> remotecall_fetch(2,println,"foo")

julia> remotecall_fetch(1,println,"foo")
foo

julia> versioninfo()
Julia Version 0.3.0-prerelease+2417
Commit 193cb11* (2014-04-02 18:29 UTC)
Platform Info:
System: Linux (x86_64-redhat-linux)
CPU: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
LAPACK: libopenblas
LIBM: libopenlibm

julia> Pkg.status()
10 required packages:

  • ClusterManagers 0.0.1+ master
  • DSP 0.0.1+ spectrogram
  • Debug 0.0.1
  • Distributions 0.4.2
  • HDF5 0.2.20
  • IProfile 0.2.5
  • MAT 0.2.3
  • PyPlot 1.2.2
  • Stats 0.1.0
  • WAV 0.2.2
11 additional packages:
  • ArrayViews 0.4.2
  • BinDeps 0.2.12
  • Color 0.2.9
  • NumericExtensions 0.6.0
  • NumericFuns 0.2.1
  • PDMats 0.1.1
  • Polynomial 0.1.1
  • PyCall 0.4.2
  • StatsBase 0.3.9
  • URIParser 0.0.1
  • Zlib 0.1.6

julia>

@amitmurthy
Contributor

No, I don't think it is related to the issues mentioned.

cc @nlhepler

@bjarthur
Collaborator Author

More data on this: the remote process's stdout does eventually appear in the local REPL, but sometimes not for several tens of seconds. If I simultaneously examine the Julia log files (the ones it clutters the home directory with), the stdout there is also delayed; it appears in the REPL and in the log files at roughly the same time, both much delayed. So I think the fix is just a matter of flushing the I/O buffer. I tried adding flush(STDOUT) to the remote script, but got an error about serializing a pointer. I also looked in Base for a place to add a flush, but it was not obvious where to put it. Thanks for any help.

@amitmurthy
Contributor

You could define a function that does the flush and just call it remotely. For example, flush_stdout() = flush(STDOUT), then execute remotecall(p, flush_stdout).

@bjarthur
Collaborator Author

I tried precisely that, and just tried again. Here is the error I get:

julia> using ClusterManagers

julia> ClusterManagers.addprocs_sge(1)
job id is 8855037, waiting for job to start ..............................
1-element Array{Any,1}:
 2

julia> @everywhere flush_stdout() = flush(STDOUT)

julia> remotecall_fetch(2,println,"foo")

julia> remotecall_fetch(2,flush_stdout)
Worker 2 terminated.
ERROR: ProcessExitedException()
 in remotecall_fetch at multi.jl:673
 in remotecall_fetch at multi.jl:678

julia>  From worker 2:  foo
    From worker 2:  fatal error on 2: ERROR: cannot serialize a pointer
    From worker 2:   in serialize at serialize.jl:60
julia> 

Note that "foo" eventually appears in the REPL, and that it is the call to flush() that causes the error.

@amitmurthy
Contributor

Strange. It does not seem to be an issue with local workers, i.e. something like addprocs(2).

Could you try with this definition?
@everywhere flush_stdout() = eval(parse("flush(STDOUT)"))

@bjarthur
Collaborator Author

That doesn't work either; exact same error. And you're right about it working fine on local workers.

@amitmurthy
Contributor

Just to try:

Instead of the @everywhere, could you just do a

julia> remotecall_fetch(2,println,"foo")

julia> remotecall_fetch(2,()->eval(parse("flush(STDOUT)"))  )  

@bjarthur
Collaborator Author

Nope, same error. Is the problem that it's referring to the STDOUT in worker 1 and not worker 2?

@amitmurthy
Contributor

That's what I initially thought, but it does not seem to be the case. At least with parse and eval that is ruled out.

I just saw flush_cstdio() in the documentation. Could you try with remotecall_fetch(2, flush_cstdio)?

@amitmurthy
Contributor

Also STDOUT seems to be of type TTY and not a regular stream as we have been assuming...

@bjarthur
Collaborator Author

flush_cstdio() has no effect: no error, no stdout in the REPL, no change to the log files in the home directory.

@amitmurthy
Contributor

Is it possible that SGE is responsible for the delay? I am not at all familiar with cluster technologies, but a Google search brought up this link - http://scicomp.stackexchange.com/questions/7804/flush-output-in-torque-scheduler . Is there a similar option for SGE?

@bjarthur
Collaborator Author

The problem is not SGE but rather the buffering done by our high-performance file system. A flush() in Julia, as described in JuliaLang/julia#6549, is not sufficient; for stdout to actually appear in the Julia log files I also had to run readall(`ls $(ENV["HOME"])`). With these two extra steps, stdout from worker procs now appears in the REPL. Thanks for all the help.
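For concreteness, the two-step workaround above could be wrapped in a small helper. This is a sketch in Julia 0.3-era syntax matching the rest of the thread; the name flush_worker_output is hypothetical, and the parse/eval trick is the one from JuliaLang/julia#6549:

```julia
# Sketch of the workaround (Julia 0.3-era syntax; helper name is hypothetical).
# Step 1: flush STDOUT on the worker itself, via parse/eval so the worker's
#         STDOUT (not a serialized copy of the master's) is flushed.
# Step 2: list the home directory to force the async NFS client to revalidate,
#         so the worker log files (and hence the REPL relay) get updated.
function flush_worker_output(p)
    remotecall_fetch(p, () -> eval(parse("flush(STDOUT)")))
    readall(`ls $(ENV["HOME"])`)
    nothing
end
```

Calling flush_worker_output(p) after each remote println would then push the worker's buffered output through to the REPL.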

@nlhepler
Contributor

Whew, glad to see that was resolved.
Amit, I think we already use the appropriate flag (-k o on PBS and -j y on SGE). Their semantics seem a little different in the documentation, though honestly the documentation has not been too illuminating.
bjarthur, what filesystem are you using? It might be worth mentioning this (and your workaround) in the README.
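For reference, the flags mentioned above would appear on the submission command line roughly like this (a sketch only; exact option spellings depend on the PBS/Torque and SGE versions in use):

```
# PBS/Torque: keep stdout spooled on the execution host (-k o)
qsub -k o job.sh

# SGE: merge stderr into the stdout stream (-j y)
qsub -j y job.sh
```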

@bjarthur
Collaborator Author

Yesterday I got it to work by adding @fetchfrom 3 flush(eval(:STDOUT)); readall(`ls $(ENV["HOME"])`) after every println. But today the flush is throwing a serialization error, despite Jeff's trick in JuliaLang/julia#6549, so I'm reopening this issue. Nothing has changed that I know of.

julia> addprocs(1)
1-element Array{Any,1}:
 2

julia> using ClusterManagers

julia> ClusterManagers.addprocs_sge(1)
job id is 9010468, waiting for job to start ............................................
1-element Array{Any,1}:
 3

julia> @fetchfrom 1 flush(eval(:STDOUT))

julia> @fetchfrom 2 flush(eval(:STDOUT))

julia> remotecall_fetch(3,println,"foo")

julia> @fetchfrom 3 flush(eval(:STDOUT))
Worker 3 terminated.
ERROR: ProcessExitedException()
 in remotecall_fetch at multi.jl:673
 in remotecall_fetch at multi.jl:678

julia> workers()
1-element Array{Int64,1}:
 2

julia>  From worker 3:  foo
    From worker 3:  fatal error on 3: ERROR: cannot serialize a pointer
    From worker 3:   in serialize at serialize.jl:60
julia> 

Note that the remote worker's stdout only appears in the REPL (and in the log file) after the worker crashes.

This is on an asynchronous NFSv3 file system. The sysadmin strongly discourages using the disk for interprocess communication because of the async buffering.

I'm looking into replacing qsub with qrsh...

@bjarthur
Collaborator Author

The qrsh implementation is in #11.
