Future ('<none>') expired #31

Closed
mhesselbarth opened this issue Sep 3, 2018 · 2 comments

@mhesselbarth

Hey Henrik,

I'm using future.batchtools to submit jobs to an HPC cluster running LSF. When I submit only a small number of jobs, everything runs smoothly. However, if I increase the number of submitted jobs, I get the following error:

Error: BatchtoolsExpiration: Future ('<none>') expired (registry path /home/uni08/hesselbarth3/.future/20180901_172706-Dvm4Cm/batchtools_1849950729).. The last few lines of the logged output:
    Max Swap :                                   -
    Max Processes :                              1
    Max Threads :                                1
    Run time :                                   9 sec.
    Turnaround time :                            14 sec.
The output (if any) follows:

Looking into the .future folder, most jobs seem to be fine, but at least one job fails with the following log:

TERM_EXTERNAL_SIGNAL: job killed by a signal external to LSF.
Exited with signal termination: Killed.

Resource usage summary:

    CPU time :                                   0.07 sec.
    Max Memory :                                 15 MB
    Average Memory :                             15.00 MB
    Total Requested Memory :                     -
    Delta Memory :                               -
    Max Swap :                                   -
    Max Processes :                              1
    Max Threads :                                1
    Run time :                                   9 sec.
    Turnaround time :                            14 sec.

The output (if any) follows:

I use future.batchtools_0.7.1-9000, future_1.9.0 and furrr_0.1.0.

@HenrikBengtsson
Owner

I don't know LSF, but the message (which is not from anywhere in the future ecosystem):

TERM_EXTERNAL_SIGNAL: job killed by a signal external to LSF.
Exited with signal termination: Killed.

looks similar to the messages you get when schedulers such as Torque and SGE decide to terminate a job. Could it be that you are hitting the LSF scheduler with too many jobs in a short period of time? Do you have a sysadmin who could help you look into the local logs?

By default, plan(batchtools_lsf) and the other HPC scheduler backends have an "infinite" number of workers, e.g.

> plan(batchtools_lsf)
> nbrOfWorkers()
[1] Inf

This means that there is no limit on the number of jobs the future framework can submit; it will keep submitting jobs to the scheduler as more futures are created. You can limit this by specifying:

> plan(batchtools_lsf, workers = 50)
> nbrOfWorkers()
[1] 50

That way there will be at most 50 jobs on the queue (running or queued). That might help you avoid the problem.
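
Since the original report uses furrr, here is a minimal sketch of how the worker cap might be combined with a furrr call. The workload (`inputs` and the `sqrt()` task) is hypothetical, and the sketch assumes a working LSF setup with a batchtools template that future.batchtools can find. As far as I understand furrr's default scheduling, capping the workers also means the elements are grouped into roughly one chunk per worker instead of one job per element.

```r
library(future.batchtools)
library(furrr)

# Cap the number of LSF jobs that may be queued or running at once.
# 50 is the value from the example above; tune it for your cluster.
plan(batchtools_lsf, workers = 50)

# Hypothetical workload: 1000 small, independent tasks.
inputs <- seq_len(1000)

# Each chunk of elements is submitted as one LSF job; the results come
# back as a list with one entry per input element.
results <- future_map(inputs, function(x) sqrt(x))
```

If jobs are still being killed with the cap in place, lowering `workers` further is a cheap experiment to run before digging through the scheduler logs.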

@mhesselbarth
Author

Hey Henrik,

I already suspected that the problem lies with the HPC itself rather than with future. Limiting the number of workers sounds like a good idea, and I will definitely try it, but I will also check with the cluster admin.

Thanks a lot.
