SGE issue: could not connect #24

Closed
dkoslicki opened this issue Oct 8, 2015 · 5 comments

@dkoslicki

I tried using the following example:

using ClusterManagers
ClusterManagers.addprocs_sge(4)

And got the following output:

job id is 2559006, waiting for job to start ............................................................
    From worker 0:  Master process (id 1) could not connect within 60.0 seconds.
    From worker 0:  exiting.

I noticed (using qstat -U) that the jobs were started:

-bash-4.1$ qstat -U koslickd
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
2559006 0.50500 julia-3457 koslickd     t     10/07/2015 21:58:44 all.q@chrom15.cgrb.oregonstate     1 4
2559006 0.50500 julia-3457 koslickd     t     10/07/2015 21:58:44 all.q@chrom16.cgrb.oregonstate     1 3
2559006 0.50500 julia-3457 koslickd     t     10/07/2015 21:58:44 all.q@chrom22.cgrb.oregonstate     1 2
2559006 0.50500 julia-3457 koslickd     r     10/07/2015 21:58:44 math@math0.cgrb.oregonstate.lo     1 1

But shortly thereafter, the jobs seemed to disappear:

-bash-4.1$ qstat -U koslickd
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
2559006 0.50500 julia-3457 koslickd     r     10/07/2015 21:58:44 math@math0.cgrb.oregonstate.lo     1 1

Julia then seemed to freeze indefinitely and had to be killed manually.

Any ideas as to what went wrong?

Thanks,

~David

@gcamilo
Contributor

gcamilo commented Oct 9, 2015

I think the problem is that the nodes are not being launched with the --worker flag, though it's curious that I ran into some problems that you did not. Try adding --worker after $execflags on line 29 of qsub.jl.

@dkoslicki
Author

I think I may have found the problem: I'm using Julia version 0.3.3, so Pkg.add("ClusterManagers") was giving me version 0.0.2. I guess I'll have to ask the sysadmin to update Julia to v0.4.0.
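
For anyone hitting the same symptom, a quick way to check whether the Julia on a cluster node is older than the version discussed here is sketched below. This is only an illustration, not part of ClusterManagers: the helper names `version_ge` and `julia_version_from` are hypothetical, the script assumes `julia --version` prints a line of the form `julia version X.Y.Z`, and it relies on GNU `sort -V` being available.

```shell
# Hypothetical helper: succeeds if dotted version $1 is >= dotted version $2.
# Assumes GNU sort with -V (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: pull "X.Y.Z" out of a line like "julia version 0.3.3".
julia_version_from() {
    printf '%s\n' "$1" | sed -n 's/^julia version \([0-9][0-9.]*\).*/\1/p'
}

required="0.4.0"   # the Julia version mentioned in this thread

if command -v julia >/dev/null 2>&1; then
    installed="$(julia_version_from "$(julia --version)")"
    if version_ge "$installed" "$required"; then
        echo "Julia $installed is at least $required"
    else
        echo "Julia $installed is older than $required"
    fi
else
    echo "julia not found on PATH"
fi
```

Running this on both the login node and a compute node would also show whether the two see different Julia installations.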

@gcamilo
Contributor

gcamilo commented Oct 9, 2015

I installed Julia in my user directory, since going through the sysadmin could take months over here.

@dkoslicki
Author

Yeah, that will probably be the best route, thanks!

@jiahao
Contributor

jiahao commented Apr 12, 2016

Closing as probably fixed by upgrading to Julia 0.4.
