
Resolving a batchtools_slurm-related error, probably a configuration problem #11

Closed
wlandau-lilly opened this issue Oct 30, 2017 · 9 comments


@wlandau-lilly

wlandau-lilly commented Oct 30, 2017

See ropensci/drake#115, particularly here. @kendonB and I have been trying to debug his drake/SLURM powered project, and we are running into trouble. The following assumes development drake 8a3558a3fa9269f2c19c98e1a6404603486d5c3a.

library(drake)
example_drake("slurm")
setwd("slurm")
source("run.R")

On my Debian 9.2 VM with a toy installation of SLURM, this runs perfectly. But on his serious SLURM cluster, @kendonB gets the following.

Error: BatchtoolsExpiration: Future ('<none>') expired (registry path ~/.future/20171030_104404-sArxEt/batchtools_1079708388).. The last few lines of the logged output:
46: try(execJob(job))
47: doJobCollection.JobCollection(obj, output = output)
48: doJobCollection.character("~/.future/20171030_104404-sArxEt/batchtools_1079708388/jobs/jobcf9a9c87902b62125af3a0a1d9e37a56.rds")
49: batchtools::doJobCollection("~/.future/20171030_104404-sArxEt/batchtools_1079708388/jobs/jobcf9a9c87902b62125af3a0a1d9e37a56.rds")
An irrecoverable exception occurred. R is aborting now ...
/var/spool/slurmd/job65957677/slurm_script: line 22:  5660 Illegal instruction     (core dumped) Rscript -e 'batchtools::doJobCollection("~...
In addition: Warning message:
In waitForJobs(ids = jobid, timeout = timeout, sleep = sleep_fcn,  :
  Some jobs disappeared from the system

I suspect a configuration error that could be resolved with the right template file. I would rather not have to implement a special slurm_apply backend, especially since drake already has two different ways to talk to SLURM.

@kendonB

kendonB commented Oct 30, 2017

Hi all,

I get the above error when running this as well:

library(future.batchtools)
future::plan(batchtools_slurm(template = "batchtools_slurm.tmpl"))
future_lapply(1:2, cat)

The jobs successfully appear in squeue, complete, then I see the above in R.

@wlandau-lilly, could you edit out the working directory above?

@wlandau-lilly
Author

wlandau-lilly commented Oct 30, 2017

Sure, I scrubbed it and inserted ~ everywhere. Any chance you would be willing to show more of the tmpl file post-configuration?

@kendonB

kendonB commented Oct 30, 2017

I made only minor changes to the config file:

- set resources$account = "my_account_name"
- set resources$walltime = 1 * 3600
- uncommented `module load R`
- commented out `module load gcc/4.8.5` and `module load java`
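For readers following along, the edits above might look roughly like this in the template. This is a hypothetical excerpt, not kendonB's actual file: the partition name, the walltime conversion, and the surrounding boilerplate are assumptions modeled on the default batchtools SLURM template, where `<%= ... %>` placeholders are filled in from the `resources` list set in R.

```shell
#!/bin/bash
## Hypothetical batchtools.slurm.tmpl excerpt (placeholders are brew-style,
## filled from the `resources` list passed from R).
#SBATCH --account=<%= resources$account %>
#SBATCH --time=<%= ceiling(resources$walltime / 60) %>  ## walltime given in seconds

## Uncommented so R is available on the compute node:
module load R

## Commented out:
# module load gcc/4.8.5
# module load java

Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
```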

@HenrikBengtsson
Owner

Hmm... that Illegal instruction (core dumped) looks really bad and I'm not sure where it's coming from. Before anything else, does it work when you use the batchtools_local backend?

> library("future.batchtools")
> future::plan(batchtools_local)
> future_lapply(1:2, identity)
[[1]]
[1] 1

[[2]]
[1] 2

@kendonB

kendonB commented Oct 30, 2017

Yep, that one works for me. Same output as you.

@HenrikBengtsson
Owner

Next, check whether the problem is unrelated to future.batchtools. First, verify that you can run the following:

library("batchtools")
cf <- makeClusterFunctionsSocket(2)
reg <- makeRegistry(NA)
reg$cluster.functions <- cf

batchMap(fun = identity, x = 1:2)
submitJobs()
waitForJobs()
reduceResultsList()
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2

When that works, restart R and retry by using:

stopifnot(file.exists("batchtools.slurm.tmpl"))
cf <- makeClusterFunctionsSlurm("slurm") ## for ./batchtools.slurm.tmpl

@kendonB

kendonB commented Oct 31, 2017

OK, I believe I've solved it. It was my fault: we have two architectures on our cluster, and I had compiled my R packages on sandybridge, but the jobs were getting sent to westmere. The fix was as simple as adding another configuration flag to the *.tmpl file. The minimal example now works! Sorry to waste your time.
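For anyone hitting the same symptom on a heterogeneous cluster: kendonB does not name the exact flag, but this class of fix is typically a node-selection line in the template. The sketch below is a hypothetical illustration; the feature name (`sandybridge`) and whether your site uses `--constraint` at all depend entirely on how the cluster administrators have tagged their nodes.

```shell
## Hypothetical addition to the *.tmpl file: restrict jobs to nodes whose
## architecture matches the one the R packages were compiled on, so compiled
## code does not die with "Illegal instruction" on an older CPU.
#SBATCH --constraint=sandybridge
```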

@wlandau-lilly
Author

So glad to hear this! No time was wasted at all, thank you for sticking with it. I have big dreams for drake, and you helped me make a lot of progress!

@HenrikBengtsson
Owner

Good to hear. No time wasted; this thread adds to the searchable knowledge base.
