Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicitly specifying memory units breaks the profile in older versions of LSF #18

Closed
leoisl opened this issue Apr 10, 2020 · 4 comments
Labels
bug Something isn't working

Comments

@leoisl
Copy link
Collaborator

leoisl commented Apr 10, 2020

I use LSF in two different clusters. In one, this is the LSF version:

$ lsid
IBM Spectrum LSF Standard 10.1.0.6, May 25 2018

Explicitly specifying memory units in this LSF version works just fine, e.g. this command works fine:

bsub -M 1000MB -n 1 -R 'select[mem>1000MB] rusage[mem=1000MB] span[hosts=1]' -o ls.out -e ls.err -J ls_job ls

In the other cluster, the version is a bit older:

$ lsid
IBM Spectrum LSF Standard 10.1.0.0, Jul 08 2016

Explicitly specifying memory units in this cluster fails:

$ bsub -M 1000MB -n 1 -R 'select[mem>1000MB] rusage[mem=1000MB] span[hosts=1]' -o ls.out -e ls.err -J ls_job ls
1000MB: MEMLIMT value should be a positive integer. Job not submitted.

It works if MB is removed:

$ bsub -M 1000 -n 1 -R 'select[mem>1000] rusage[mem=1000] span[hosts=1]' -o ls.out -e ls.err -J ls_job ls
Job <9491039> is submitted to default queue <standard>.

I did not track which update of LSF between these two versions enabled memory units to be specified in -M.

I wonder if:

  1. Should we support older versions of LSF? Or should we require users to have an updated version of LSF? Supporting only the most updated version has the advantage of using new features, like this one, but it also means it will work only for a subset of the users. Supporting older versions (e.g. 10.1.0.0) means that a way larger fraction of users can use this profile, but we have to code around the lacking features.
  2. We should choose which version to support, and setup a docker image with the chosen version, and make some real end-to-end tests with trivial pipelines to ensure the profile works. The testing framework we have now is fine, but we are mocking the behaviour of LSF. I am hoping LSF is backwards compatible, so we don't actually need to test in every version after the chosen one.
@mbhall88
Copy link
Member

Oh man, that's annoying. I think only supporting the latest would be silly. We should be flexible. I will look into whether there is a way to specify the units that is backwards compatible.

@mbhall88
Copy link
Member

mbhall88 commented Apr 10, 2020

Damn, it looks like the ability to specify memory units was only added in 10.1.0.2 (See "Improvements to units for resource requirements and limits").

The only way of handling this that I can think of now is getting the LSF_UNIT_FOR_LIMITS variable in lsf.conf and using this to convert MB into whatever unit it has set. If LSF_UNITS_FOR_LIMITS is not present then we assume KB (which is the LSF default).
Luckily this file is supposed to live in a standardised location - ${LSF_ENVDIR}/lsf.conf

On my cluster

$ grep LSF_UNIT_FOR_LIMITS ${LSF_ENVDIR}/lsf.conf
LSF_UNIT_FOR_LIMITS=MB

This is annoying as it will add a tiny bit of IO to each job, but these files on our cluster seem to have ~150 lines so it shouldn't be too bad.

The best solution would be to add some auxiliary functions to the MemoryUnits enum that can handle this conversion for us.

In terms of the fastest way to get the value of LSF_UNIT_FOR_LIMITS we should benchmark a pythonic way and also a system call to grep. Although, is it safe to assume grep is present on all machines? The version of grep won't matter as we would be purely using grep PATTERN FILE.

Do you see a better solution or any problems with this suggestion @leoisl ?

@mbhall88 mbhall88 added the bug Something isn't working label Apr 10, 2020
@leoisl
Copy link
Collaborator Author

leoisl commented Apr 10, 2020

This solution is fine for me! I am just wondering if there is a way of getting the value of LSF_UNIT_FOR_LIMITS only once, when we are configuring the profile through cookiecutter, but instead of being configured by the user, it would be configured automatically for the cluster the profile is being setup. I am assuming this value rarely changes.

@mbhall88
Copy link
Member

mbhall88 commented Apr 11, 2020

Actually, that is a really good point. I guess we can just ask the user for this value during setup. We can provide code for how to get it both in the README and potentially in the cookiecutter prompts too.
Today I implemented a library to deal with memory units as this is becoming something I am having to do regularly. So tomorrow I will use this to fix this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants