This repository has been archived by the owner. It is now read-only.

Supporting Enhanced Reproducibility for Platforms like Galaxy #1191

Closed
jmchilton opened this Issue Sep 19, 2014 · 27 comments

9 participants
@jmchilton

jmchilton commented Sep 19, 2014

I was told this was the right forum for this conversation, though it is not a bug or feature request per se. The Galaxy project develops an analysis platform focused on reproducibility and, to help with that, a product called the Tool Shed, which includes its own custom package manager. Many people on the Galaxy core development team feel that the package management piece should be augmented to allow Homebrew/linuxbrew based specification and installation of dependencies.

Unfortunately Galaxy has a lot of specific requirements, both because it is a platform and because its main focus is reproducibility - and I am pretty sure Homebrew/linuxbrew do not do exactly what is needed. The three biggest potential hurdles or questions that arose when first investigating this were:

  1. Supporting installing multiple versions of applications at any time - not just the newest. For instance should we salvage and improve the deprecated versions command (seems @chapmanb has had problems with it in CloudBioLinux) or should there exist versioned taps for scientific software?

  2. Creating isolated environments - is there a way to grab just what Galaxy needs from the Cellar at runtime for a particular job and not link stuff in globally at install time for instance? Galaxy knows what dependencies it needs for a particular job in the abstract - it would be nice to have a way to just set those up for a particular job - like Modules.

I have jotted some Galaxy-specific thoughts on this issue in our issue tracker here and here.

  3. How to best handle scripting languages. The current tool shed implementation has nice support for isolated Python, R, Perl, and Ruby environments and it would be great to replicate this in the brew ecosystem - I am not sure whether to rebuild something on top of brew or try to improve brew’s support for these.

If a path like this seems tenable to the Galaxy core development team - the team will happily devote developer resources to achieve these goals - be that maintaining forks of projects like linuxbrew and homebrew-science or working with the community to push changes we need upstream or help out with new community maintained projects in this ecosystem.

Any advice on these ideas/issues would be most appreciated and if other people have these same concerns we would be eager to talk things out and try to develop a unified approach.

@chapmanb

Contributor

chapmanb commented Sep 19, 2014

John;
Thanks much for looking at brew for this. I've been happy with brew as a replacement for the custom Python code we were writing in CloudBioLinux and have been slowly migrating everything over to brew recipes. It's great to be working with a community that is updating and adding recipes. Shaun is definitely more in touch with the Homebrew community and I mainly try to get things done and contribute back to homebrew-science when I can, but I can give you my point of view as a user.

  1. Supporting installing multiple versions of applications at any time - not just the newest.

This is a bummer. The old approach was simple: dig into the git history to get the version you want, and run from that recipe. However, there are issues when those older versions have specific version dependencies. The homebrew infrastructure wouldn't handle this cleanly as dependency management is pretty simple. It seems like they've settled on needing specific versioned recipes, with versioned dependencies as well. This could get nasty if you're trying to support this for every single version change.

  2. Creating isolated environments (is there a way to grab just what we need from the Cellar at runtime for a particular job and not link stuff in globally at install time).

I don't know of any mechanisms for this but could be ignorant.

  3. How to best handle scripting languages.

I've been keeping scripting languages separate from brew and haven't tried to dig into this at all. One thing brew tries to do is take over the environment, with its own Python and gcc and other bits. Generally I've tried to keep it as lightweight as possible and to avoid this, as the more compiles I need to support the more scared I am of something breaking.

Any advice on these ideas/issues would be most appreciated and if other people have these same concerns we would be eager to talk things out and try to develop a unified approach.

You already know this, but my plan is to have Docker solve these issues. You'd have a specific Docker container with everything set up and use that, avoiding the need to bake all of this into a build system. This way you store versioned pre-built Docker containers instead of versioned recipes. brew is obviously a critical component of this since you'd need a way to build things in Docker images, but it could keep moving forward with versions as long as you archive the old ones.

My feeling from supporting a lot of automated tool builds with bcbio is that this is the only way to avoid needing to deal with the nightmares of compiling everything well on everyone's strange system setups, so I think this goes beyond build tools to having a more useful way of scaling up support to the biological community.

Happy to talk more about any of this and hope this helps some.

@sjackman sjackman added the question label Sep 19, 2014

@ianml

Contributor

ianml commented Sep 19, 2014

  1. Supporting installing multiple versions of applications at any time - not just the newest.

Unfortunately the current way to support this is through versioned formulae. We already maintain some versions here (e.g. jellyfish) but they are usually dependencies of other formulae. Toolkit-specific versions should probably be spun off to their own tap. Depending on how many combinations of versions and versioned dependencies are needed, this could be a maintenance nightmare.

  2. Creating isolated environments (is there a way to grab just what we need from the Cellar at runtime for a particular job and not link stuff in globally at install time).

Formulae marked with keg_only are kept in Cellar (and linked to opt) and not linked globally. So you can access the Cellar either through brew (e.g. Formula["foo"].lib) or by modifying the user environment (the Modules approach). I think the latter approach would be flexible enough for runtime linkage while the former is what most versioned formulae take to avoid conflicts. Homebrew-julia is an example of a tap which is somewhat self-contained in handling certain dependencies. You can also manually brew unlink things which are normally linked during install.

@sjackman

Member

sjackman commented Sep 19, 2014

  1. Supporting installing multiple versions of applications at any time - not just the newest. For instance should we salvage and improve the deprecated versions command (seems @chapmanb has had problems with it in CloudBioLinux) or should there exist versioned taps for scientific software?

I really like the versions command, still use it, and am sad that it's deprecated. With Homebrew it's easy to keep multiple versions of software installed. It's just a bit tricky to install a particular version of some software. One way to accomplish that would be to tag each commit of a new version of software, so that the URL https://raw.githubusercontent.com/Homebrew/homebrew-science/samtools-0.1.19/samtools.rb fetches the samtools 0.1.19 formula for you. We currently tag Homebrew-science each month, so that https://raw.githubusercontent.com/Homebrew/homebrew-science/2014-07/samtools.rb fetches the version of samtools from 2014-07.

The homebrew infrastructure wouldn't handle this cleanly as dependency management is pretty simple. It seems like they've settled on needing specific versioned recipes, with versioned dependencies as well. This could get nasty if you're trying to support this for every single version change.

I don't think that the versioned recipes as practiced in Homebrew/versions will scale up easily to being able to install any version of any software. Using the tagging scheme described above it would be easy to add a custom command that when told brew install-version samtools-0.1.19 expanded that request to the versioned URL above, and installed it with the current versions of dependencies. If you wanted to install using the dependencies that were contemporaneous with samtools-0.1.19, you could instead git checkout samtools-0.1.19 && brew install samtools; git checkout master. Again, that could be rolled into its own command.
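
A minimal sketch of the URL expansion such a command would need, assuming the per-version tag naming `<formula>-<version>` proposed above. The function names are hypothetical, and the tag scheme is only the one suggested in this comment, not something the tap is known to provide:

```shell
#!/bin/sh
# Build the raw GitHub URL of a formula at a per-version tag.
# ASSUMPTION: tags are named "<formula>-<version>", as proposed in this thread.
versioned_formula_url() {
    formula=$1
    version=$2
    tap=${3:-Homebrew/homebrew-science}
    echo "https://raw.githubusercontent.com/$tap/$formula-$version/$formula.rb"
}

# Accept a spec such as "samtools-0.1.19" and split it on the last hyphen,
# so formula names that themselves contain hyphens still work.
install_version_url() {
    spec=$1
    versioned_formula_url "${spec%-*}" "${spec##*-}"
}
```

The result could then be handed to brew, e.g. `brew install "$(install_version_url samtools-0.1.19)"`, provided the Homebrew in use accepts a formula URL as an install argument.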

  2. Creating isolated environments - is there a way to grab just what Galaxy needs from the Cellar at runtime for a particular job and not link stuff in globally at install time for instance? Galaxy knows what dependencies it needs for a particular job in the abstract - it would be nice to have a way to just set those up for a particular job - like Modules.

This should actually be pretty straightforward. Most formulae are meant to be runnable from the Cellar without being linked. That's a design goal, I believe; how well it's accomplished in practice, who knows.

We can create a brew module external command like module from Modules that adds Cellar/$package/$version/bin to the PATH and Cellar/$package/$version/lib to the LD_LIBRARY_PATH. This would probably be enough for 90% of packages. Here's a start:

#!/bin/sh
# brew-module: print shell code that adds one Cellar keg to PATH and LD_LIBRARY_PATH.
set -eu
action=$1
module=$2    # package/version, e.g. samtools/0.1.19
shift 2
if [ "$action" != load ]; then
    echo "module: error: Unknown action $action" >&2
    exit 1
fi
prefix="$(brew --prefix)/Cellar/$module"
# Emit assignments only for directories that exist in the keg.
if [ -d "$prefix/bin" ]; then
    echo "PATH=$prefix/bin:\$PATH"
fi
if [ -d "$prefix/lib" ]; then
    echo "LD_LIBRARY_PATH=$prefix/lib:\$LD_LIBRARY_PATH"
fi
echo export PATH LD_LIBRARY_PATH

❯❯❯ brew module load samtools/0.1.19
PATH=/usr/local/Cellar/samtools/0.1.19/bin:$PATH
LD_LIBRARY_PATH=/usr/local/Cellar/samtools/0.1.19/lib:$LD_LIBRARY_PATH
export PATH LD_LIBRARY_PATH

  3. How to best handle scripting languages. The current tool shed implementation has nice support for isolated Python, R, Perl, and Ruby environments and it would be great to replicate this in the brew ecosystem - I am not sure whether to rebuild something on top of brew or try to improve brew’s support for these.

I'm in less familiar territory here, and Homebrew does not attempt to solve this problem. I believe each scripting language has its own solution to this problem. For Ruby, the de facto solution is rbenv.

@brainstorm

Contributor

brainstorm commented Sep 25, 2014

@sjackman, @jmchilton No, please, no module system in homebrew, that's an aberration. Let's not mix community-maintained systems (homebrew) with in-house, deprecated, incompatible systems (modules)... every HPC installation does its own thing with the module system, and it is really not to be trusted in my experience.

It also causes tons of headaches for users. Just the other day a couple of friends were struggling to install conda on an HPC system when they realized they couldn't install it without running "module unload biopython" beforehand... go figure.

I'm with @chapmanb when it comes to pre-built docker "releases" of, for instance, bcbio-nextgen with all tools in there. Then, wrapping this with http://www.w3.org/TR/prov-overview/ like the neuro people do with @incf-nidash and W3C-prov would be the new way (utopia?) to go after, imho. See an example here:

https://github.com/incf-nidash/nidm-results_fsl/blob/master/NIDMStat.py

P.S.: @chapmanb, also, I think bcbio-nextgen needs some work w.r.t. continuous integration/deployment. The travis-ci build cannot complete the installation of all the tools/dependencies within the 15 minute limit right now, as @guillermo-carrasco recently found out. Fixing that could help in getting versioned (docker) releases of the whole pipeline and tool updates.

@jmchilton

Author

jmchilton commented Sep 26, 2014

Wow - thanks for the fantastic comments on this! A few replies (speaking only for myself not the Galaxy project obviously).

(In response to @chapmanb and @brainstorm) Docker is definitely part of the solution long term and the Galaxy community and core team are actively pursuing many Docker related threads - Galaxy is distributed via Docker by @bgruening, Galaxy can very flexibly run jobs in Docker containers, and can interface with isolated IPython instances via Docker with RStudio on the way as well.<\sales_pitch>

That said, it is important to the project that Galaxy continues to support non-Docker based deployments. Galaxy is deployed at a lot of central computing facilities that likely won't even be running kernels capable of running Docker for years - and even after they are, it will still be a few more years before Docker is actually enabled. Many system administrators deploying Galaxy have made it clear they are not interested in or are unable to deploy Docker.

Beyond the sort of hard reality for the project that Docker cannot be the only solution for some time - there is at least some value in a unified package manager infrastructure for deploying informatics software into Docker, right? Brew provides provenance information about how and when software was deployed that a Dockerfile alone doesn't - perhaps it's questionable whether this is needed if you actually have the container, but I think it is of some value - it's good to be able to reproduce the analysis as well as the container that produced the analysis.

(Responding to @sjackman): I certainly worry that a versioned tap won't scale up as well, but let me explain why I am actually more concerned with the admittedly cool tagging idea. Galaxy supports certain platforms and these evolve over time and so will brew. However, once Galaxy supports installing a certain version of software (say samtools 0.1.18) - I think the Galaxy project will always want to support installing that version of software. So if the recipe for samtools 0.1.18 needs to be modified slightly (e.g. to make a dependency explicit, to reflect some small change in brew, to support a new platform) we will want to fix the recipe. While the git tag could be updated - wouldn't we need to update every dependency of that version of software as well? The problem is it won't be clear which versions of software depend on it without making the version dependency explicit. And it seems making the dependencies explicit requires using a versioned tap, which in turn renders the tags unnecessary. Is this a crazy concern?

As for your idea about the runtime environments - I think you are exactly correct. I may try to prototype something based on that with extensions for the big scripting languages as well - Python, Perl, R, and Ruby.

Thanks again for the comments all.

@sjackman

Member

sjackman commented Sep 26, 2014

@jmchilton

The problem is it won't be clear which versions of software depend on it without making the version dependency explicit.

Do you mean specifying in a formula that it depends specifically on version samtools 0.1.18 and will not build with the current version of samtools?

@chapmanb

Contributor

chapmanb commented Sep 30, 2014

John;
Understood about docker pushback right now. I just mention it as it's worth thinking about how much engineering effort to put in when solutions will be coming (at some undefined time in the future).

Regarding versioning, would it be sufficient to revert the entire state of homebrew-science and homebrew to some point in the past, rather than trying to grab reverted versions of each program and dependency? For example, if you wanted to install samtools 1.0, you could roll back to the homebrew-science from August 15th (https://github.com/Homebrew/homebrew-science/commits/master/samtools.rb) and install. This would require an isolated Homebrew and tools to be sure you don't contaminate with more recently built dependencies already in the Cellar, but it sounds like that type of isolation is something you're considering anyways.
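
As a rough sketch of such a rollback, git can report the last commit at or before a given date; checking out that commit in an isolated copy of the tap gives the recipes as they stood then. The helper name `commit_as_of` is hypothetical; `git rev-list --before` filters on committer date:

```shell
#!/bin/sh
# Print the newest commit reachable from a ref whose committer date is
# at or before the given date. Arguments: repository path, date, ref.
commit_as_of() (
    repo=$1
    date=$2
    ref=${3:-HEAD}
    git -C "$repo" rev-list -n 1 --before="$date" "$ref"
)

# Example (paths are placeholders):
#   tap=/path/to/isolated/homebrew-science
#   git -C "$tap" checkout "$(commit_as_of "$tap" 2014-08-15 master)"
```

This only rewinds the recipes; anything already built in the Cellar is untouched, which is why the isolation mentioned above still matters.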

This gives you lightweight versioning and separation without a bigger engineering effort. Just brainstorming ideas to make the job a little easier.

@sjackman

Member

sjackman commented Sep 30, 2014

Brad, that's exactly what I'm thinking.

@jmchilton

Author

jmchilton commented Oct 1, 2014

I didn't express my concern very clearly - the problem with just rolling back the repository to when samtools 1.0 was the current install is that if there were bug fixes to the htslib (a samtools dependency) that came later - those would not be picked up when you do the samtools 1.0 install.

I actually came up with a solution to this - and that is to completely hijack the install procedure and install the dependencies myself. The algorithm is something like:

  • Checkout the old version of the target recipe.
  • Build a list of dependencies with versions as they existed at that time.
  • For each dependency - checkout master and recursively do the versioned install.

This ensures you always get the latest fixes of the versioned dependencies. I also need to hijack the install process to make everything effectively keg only and to persist that list of dependencies with versions so that when I build out the isolated environment at runtime I am grabbing the correct versions. I have been experimenting with this here with some help from @saketkc.
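
The steps above can be sketched as a small recursive walk. This is not the code referenced in this comment; it assumes a hypothetical directory layout, `$RECIPE_DB/<name>/<version>/deps`, where each `deps` file lists one `name version` pair per line as recorded when that version was current. The walk writes a dependencies-first install plan, which doubles as the persisted list of versioned dependencies:

```shell
#!/bin/sh
# ASSUMPTION: $RECIPE_DB/<name>/<version>/deps is a made-up stand-in for
# "the dependencies of this recipe as they existed at that time".

# Append "name@version" lines to the plan file, dependencies first, each once.
# Arguments: name, version, plan file. Cyclic dependencies are not handled.
vinstall_order() (
    name=$1
    version=$2
    plan=$3
    # Skip anything already scheduled.
    if grep -qxF "$name@$version" "$plan" 2>/dev/null; then
        return 0
    fi
    deps="$RECIPE_DB/$name/$version/deps"
    if [ -f "$deps" ]; then
        while read -r dep_name dep_version; do
            [ -n "$dep_name" ] || continue
            # Recurse into each pinned dependency before the target itself.
            vinstall_order "$dep_name" "$dep_version" "$plan" < /dev/null
        done < "$deps"
    fi
    echo "$name@$version" >> "$plan"
)
```

A real installer would replace the final `echo` with the actual build of that version from the up-to-date recipe, and keep the plan file next to the installed keg for use at runtime.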

% brew vinstall homebrew/science/samtools 1.0
% brew vinstall homebrew/science/samtools 0.1.19
% brew vinstall homebrew/science/samtools 1.1
% . <(brew env homebrew/science/samtools 0.1.19)
% which samtools
/home/john/.linuxbrew/Cellar/samtools/0.1.19/bin/samtools
% brew vdeps samtools 1.1 # vdeps and env requires install with vinstall
htslib@1.1
% brew vuninstall samtools 1.1
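
One way an env-style command could turn a persisted dependency list into a job environment, sketched under assumptions: the list is a hypothetical text file with one `name@version` per line (the format `brew vdeps` prints above), and only `bin` directories are handled. The function name `env_for_plan` is made up for illustration:

```shell
#!/bin/sh
# Print shell code that prepends the bin directory of every listed keg to PATH.
# Arguments: Cellar directory, file with one "name@version" per line.
env_for_plan() (
    cellar=$1
    plan=$2
    path_prefix=
    while IFS=@ read -r name version; do
        bin="$cellar/$name/$version/bin"
        # Kegs without a bin directory (libraries only) are skipped.
        if [ -d "$bin" ]; then
            path_prefix="$path_prefix$bin:"
        fi
    done < "$plan"
    printf 'PATH=%s$PATH\nexport PATH\n' "$path_prefix"
)
```

Because the output is plain shell, a job script could apply it with `eval "$(env_for_plan "$cellar" "$plan")"` without linking anything globally.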

I am getting pretty excited about this approach - I think it could work equally well with brew versions or the tagging approach outlined by @sjackman - but it is a bit intrusive obviously.

Thanks for the continued feedback all.

@sjackman

Member

sjackman commented Oct 1, 2014

@jmchilton I hadn't considered that issue (packaging bug fixes of dependencies), and it would work well with the tagging approach.

@guillermo-carrasco


guillermo-carrasco commented Oct 2, 2014

Hi everyone,

After reading the discussion (very constructive discussion by the way), I have a concern regarding your last approach @jmchilton . Please correct me if I'm wrong.

If I understood your algorithm correctly, when you want to roll back to a specific version of a tool, you'll do that for the specific tool, but its dependencies will be installed at their latest version, right?

For each dependency - checkout master and recursively do the versioned install.
This ensures you always get the latest fixes of the versioned dependencies

This is great if you want to go back to a specific version of a tool because that version had a feature that is not available anymore and you need it. But what happens, and this is my main concern, to reproducibility with this approach? I'll try to explain myself: Say I do a study using tool X version Y and this gives me some results that I publish in a paper. After some time, a bug is discovered in one of the dependencies of the tool and a fix is proposed. With your approach, if someone wants to reproduce my results after the fix was proposed for that particular dependency, it won't be possible because the dependency will be installed with the bug fixed, and if it is a major fix the results will probably not be the same.

This may sound stupid, "Why would you want to reproduce wrong results?" and well, maybe it is, I don't know. But the thing is that if the results were wrong due to the bug, the paper is wrong as well and the analysis needs to be redone. But there needs to be the possibility to exactly reproduce that error.

Of course in those cases one could also just install that particular dependency by hand and that would be it. Just brainstorming thoughts :-)

Great work with the brew-tests by the way :-D I'll have to take a look at them more seriously.

@jmchilton

Author

jmchilton commented Oct 2, 2014

@guillermo-carrasco Thanks for the comments - this is an important point. I want to make clear I am fast forwarding the repository to grab the latest recipe fixes - not the latest updates to the underlying software.

Imagine some piece of software CoolAppX that depends on tophat. Now imagine when coolappx.rb gets added to the repository tophat is at version 2.0.10. As time goes by, tophat is migrated to version 2.0.11 and then to 2.0.12 and over that time the tophat download URL changed so the old recipes are no longer valid. When one goes to install coolappx at its revision - she will not be able to because the tophat recipe has become broken. The approach I am trying to implement (and it may still have some bugs) would still target tophat version 2.0.10. I very much want to "reproduce wrong results" as you aptly put it. But if I don't fix the tophat download URL in those past versions I cannot even install tophat version 2.0.10 - let alone produce the wrong results.

The important distinction here is that we are grabbing only recipe fixes that allow it to be installed in the future. I would imagine this would be for things like URL changes, changes to the Homebrew "API", addition of bottles, adding compatibility for new (or old) platforms, build fixes for new (or old) platforms. The tophat example is very clean - the fixes didn't result in a change in the SHA1 hash - the same thing that would have been installed in the past will still be installed.
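
The "same SHA1" condition can be checked mechanically before an old recipe fix is accepted. A small sketch, with hypothetical helper names, that compares a downloaded file against the checksum pinned in the recipe, using `sha1sum` where available and `shasum -a 1` otherwise:

```shell
#!/bin/sh
# Print the SHA1 of a file using whichever common tool is installed.
sha1_of() {
    if command -v sha1sum >/dev/null 2>&1; then
        sha1sum "$1" | cut -d' ' -f1
    else
        shasum -a 1 "$1" | cut -d' ' -f1
    fi
}

# Succeed only if the file's SHA1 equals the pinned value.
# Arguments: file path, expected SHA1 (lowercase hex).
verify_sha1() {
    [ "$(sha1_of "$1")" = "$2" ]
}
```

For example, `verify_sha1 "$tarball" "$pinned_sha1" || echo "upstream changed the tarball" >&2`, where both variables are placeholders: a URL-only fix passes the check, while a silently re-released tarball does not.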

Certainly the approach is deeply imperfect - in large part because the core Homebrew dependencies are going to change over time - Docker and/or virtualization will be large steps forward toward more "perfect" recomputability.

@guillermo-carrasco


guillermo-carrasco commented Oct 2, 2014

Hi @jmchilton ,

Thanks for the clarification, I understand now. So you're not talking about downloading fixed tools, but fixed recipes, and that makes more sense, of course 👍 . This way one can reproduce results, wrong or correct.

@sjackman

Member

sjackman commented Oct 4, 2014

edirect 2.00 has changed the SHA1 of the tarball three times. I can only assume that the authors are changing the underlying software without bumping the version number. How should we handle non-conforming software packages like this?

https://github.com/Homebrew/homebrew-science/commits/master/edirect.rb

@saketkc

Contributor

saketkc commented Oct 4, 2014

How should we handle non-conforming software packages like this?

I have a suggestion (feel free to reject, of course):

I believe it would make more sense to create a 'tagged' receipt of all verified/installable components.
A verified receipt is essentially a json file probably being committed by a build bot after a travis build passes. Here is a short attempt: https://travis-ci.org/saketkc/homebrew-science/jobs/37026196 (Note the builds are failing for multiple reasons, but that is a side issue)

So every time a new commit is made to homebrew-science and the travis build succeeds, a new VERIFIED_INSTALLATIONS.json also gets committed (by bots maybe?). This is what its contents could look like:

{
  "time": 1411892756,
  "HEAD": "1ff9048b45bd6a114bf809d26e308688ba04c78c",
  "sha1": "xxxxxxxxxxxx",
  "version": "yyyy"
}

Ideally the sha1 should be in sync with one defined in the formula file. It should then be possible to keep track of such sha1 changes on subsequent builds.

I am not sure if that is in sync with the homebrew philosophy.

@sjackman

Member

sjackman commented Oct 4, 2014

This sounds very similar to brew bottles. The brew test bot builds each commit, builds a bottle (a tarball of the compiled software), and commits the SHA1 of the bottle to the formula repo. Really, we need a brew test bot running for homebrew science. Anybody want to take this on?
See http://bot.brew.sh

@saketkc

Contributor

saketkc commented Oct 4, 2014

Would love to!

@sjackman

Member

sjackman commented Oct 4, 2014

Fantastic. Are you working on Mac or Linux, or both?

@saketkc

Contributor

saketkc commented Oct 4, 2014

Both.

@sjackman

Member

sjackman commented Oct 5, 2014

Great. Bottles on Linux aren't quite ready for prime time just yet, but I'm working on it. The two problems that must be solved are bottling glibc so that we don't have to worry about which glibc version a particular distro has, and making bottles relocatable so that they can be installed in arbitrary locations. The latter is already handled by Homebrew, but most Homebrew installations use /usr/local, and many bottles require /usr/local.

@saketkc

Contributor

saketkc commented Oct 5, 2014

Since we went off topic, I have created a new issue for this here: #1231

@sjackman sjackman removed the Bioinformatics label Oct 5, 2014

@jmchilton

Author

jmchilton commented Mar 24, 2015

The response on #1215 makes it clear the maintainers are not interested in supporting older versions of recipes. This is of course reasonable and entirely fair, but nonetheless disappointing - if we wish to do this we will have to create a competing tap.

Thanks for the input all.

@jmchilton jmchilton closed this Mar 24, 2015

@sjackman

Member

sjackman commented Mar 24, 2015

I do want to support older versions, but I agree with @tdsmith that the git history is not the best way to accomplish that. I'm afraid that I don't have a better solution to suggest at the moment. If you need this feature immediately, a personal tap is probably the way to go.

@sjackman

Member

sjackman commented Jul 9, 2016

Now that we have binary bottles for about 200 Homebrew-Science formulae in Linuxbrew, you can install older versions of software from the bottles: http://linuxbrew.bintray.com/bottles-science/ Not a complete solution, but definitely an improvement.

@sjackman sjackman self-assigned this Jul 9, 2016

@brucellino


brucellino commented Jul 9, 2016

Hi all.

Coming late to the party. In the project I'm working on, we maintain the older versions of the code, but since there is transparent continuous integration, we also maintain the tests (positive and false results). Very similar to what #1231 was talking about I think - we are using Jenkins to do this.

Applications which pass tests are put into a CVMFS repository, which is version controlled, similar to, but not the same as git versioning. Future work will have us publish the recipes (which are in git repos) and artifacts (which are in the CVMFS repo) with persistent identifiers so that we can track usage and citation of the application as well.

@bgruening


bgruening commented Jul 11, 2016

@sjackman we were tackling the problem of building Docker images automatically out of brew recipes with this project: https://github.com/mulled/mulled
Unfortunately we all moved on to conda packages and this feature is not maintained; it should still work, though.
If anyone cares to maintain the brew part I would be happy to give an introduction.

@sjackman

Member

sjackman commented Jul 11, 2016

Mulled is a great idea. I'm not sure that I can volunteer to maintain the Linuxbrew port myself, but if there's any work I can do in Linuxbrew to better support mulled, I'd be happy to.
