
Add support for bioconda picard installation. #22

Closed
wants to merge 9 commits

Conversation

@karl616 (Contributor) commented Jul 27, 2018

  • rely on the bioconda wrapper being a symbolic link to the
    folder containing picard.jar

@leepc12 (Contributor) commented Jul 27, 2018

Thanks for your PR. We will test it and let you know.

BTW, there are some issues with readlink -f on macOS, but there is a workaround.
Example:

#DEST_DIR=$(readlink -f $2)   # GNU-only; BSD readlink on macOS has no -f
DEST_DIR=$(cd "$(dirname "$2")" && pwd -P)/$(basename "$2")   # portable, quoted to survive spaces

@karl616 (Contributor, Author) commented Jul 28, 2018 via email

@karl616 (Contributor, Author) commented Jul 30, 2018

I shifted it all into Python.
For testing, what do you normally do? Is the test example sufficient?

@karl616 (Contributor, Author) commented Aug 1, 2018

Hi @leepc12,
I'm setting up a small test set based on ENCSR356KRQ. That way I can at least make sure that the code I generate executes locally.

I'm a bit confused with regard to phantompeakqualtools: its documentation states that spp v1.14 is required, and you also provide the custom archive.

Here you are using spp v1.13; at least it looks that way in the environment file, and the custom installation of spp is disabled in the installation script.

==> two questions:

  1. What was changed in the custom v1.14? Can we use the stock version?
  2. What version of spp is needed by phantompeakqualtools? v1.14 or v1.13?

@karl616 (Contributor, Author) commented Aug 1, 2018

I understood one reason for the custom conda installation today: macs2 requires python2 and idr requires python3. One solution would be to have multiple environments and load them on demand. I came up with one design, but as this is my first try with WDL, I would appreciate feedback before I start testing:

task trim_adapter {
        ....
        # environment
        Boolean? useConda

        command {
                # activate the py3 environment only when requested
                if [[ ${select_first([useConda,false])} == "true" ]]; then
                        source activate encode-atac-seq-pipeline-python3
                fi
                python $(which encode_trim_adapter.py) \
                        ${write_tsv(fastqs)} \
                        --adapters ${write_tsv(adapters)} \
                        ${if paired_end then "--paired-end" else ""} \
                        ${if select_first([auto_detect_adapter,false]) then "--auto-detect-adapter" else ""} \
                        ${"--min-trim-len " + select_first([min_trim_len,5])} \
                        ${"--err-rate " + select_first([err_rate,'0.1'])} \
                        ${"--nth " + select_first([cpu,2])}
                # drop back out of the environment afterwards
                if [[ ${select_first([useConda,false])} == "true" ]]; then
                        source deactivate
                fi
        }
        ....
}

Alternatively, it could be made part of the Python script. I'm not sure which is best, or whether it is a viable solution at all. What do you think?

@vsoch commented Aug 1, 2018

Just a heads up: Gael announced that scikit-learn is dropping support for Python 2; see the red box here:

http://gael-varoquaux.info/programming/sprint-on-scikit-learn-in-paris-and-austin.html


And it's actually part of a larger effort - if you click the "other projects" link there:

http://python3statement.org/

This likely doesn't require immediate action, but I would suggest a general mindset of developing toward Python 3 and sunsetting Python 2, if at all possible. If macs2 requires Python 2, then macs2 should be updated for Python 3, and not the other way around.

@karl616 (Contributor, Author) commented Aug 1, 2018 via email

@leepc12 (Contributor) commented Aug 1, 2018

@karl616: spp 1.14 was released to fix compilation issues for R (>=3.3.0), but our pipeline uses R 3.2.x. I just wanted to keep the latest spp in the phantompeakqualtools repo when I migrated it from Google Code (currently 1.15 is the latest, though). Both 1.13 and 1.14 produce the same output for our pipeline, and 1.14 was harder to install (it needed more dependencies), so I chose 1.13 for the pipeline.

@leepc12 (Contributor) commented Aug 1, 2018

About Python versions: the old pipeline has two separate environments for py2 and py3, just like you implemented above. It activates the py3 env only for IDR tasks.

But for this new pipeline, I didn't want to have any Conda-specific things in the WDL file and input JSON. There is a workaround to use both py2 and py3 at the same time. One major drawback of this hack is that users need to remove their own py2-based Conda from their PATH and install a new py3-based Conda for the pipeline. All py3-based apps like IDR should be installed in the root environment and called explicitly with python3 app.py.

I hate this workaround too. We will eventually remove all py2 dependencies and move on to py3. MACS2 is the only problem.
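
For reference, a minimal sketch of that layout (the paths, env name, and app names here are illustrative, not the pipeline's actual installer):

# the py3-based Conda is the only one on PATH
export PATH="$HOME/miniconda3/bin:$PATH"

# py2 tools (e.g. MACS2) live in a py2 child environment
conda create -n encode-atac-seq-pipeline python=2

# py3 apps (e.g. IDR) sit in the root environment and are
# invoked explicitly through the py3 interpreter:
python3 app.py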

@vsoch commented Aug 1, 2018

If you could use a container, you wouldn't need to worry about these things.

@karl616 (Contributor, Author) commented Aug 2, 2018

@vsoch containers are very convenient, and WDL has the runtime docker parameter as well. Still, I'm one of the sometimes underprivileged users (no root access and restrictive admins), which is why I would want a pure conda solution.

@leepc12 I understand the problem now. Thanks for the explanation. It would be nice if WDL had built-in conda support. I know nextflow is going that way.

So no conda in the pipeline... Perhaps if IDR were moved from the root environment to its own environment, and the same linking trick were applied, it would be possible to avoid requiring a clean conda installation. I will try that.

@karl616 (Contributor, Author) commented Aug 2, 2018

My initial test on a small subset was positive. The requirements.txt file anticipates that phantompeakqualtools will be accepted into bioconda, but beyond that, and some extensive clean-up, I find this installation process easier.

It also installs on top of an existing conda instance.

@leepc12 (Contributor) commented Aug 2, 2018

Thanks @karl616. I tested your PR and it worked fine except for phantompeakqualtools. Once phantompeakqualtools is available on a conda repo, I will test again and let you know.

BTW, I cannot remove the MACS2 py egg extraction workaround in the installer script:
kundajelab/atac_dnase_pipelines#72

Karl Nordström added 3 commits August 3, 2018 09:21
…ipeline into addCondaPicardSupport

Conflicts:
	installers/install_dependencies.sh
@karl616 (Contributor, Author) commented Aug 3, 2018

OK, phantompeakqualtools is there. It should work. One thing I didn't mention earlier was that I also added cromwell as a dependency. It's not directly part of the pipeline, but having it in the bundle removes one step for the end-user.

I undid the removal of the egg extraction. In a similar vein, I wasn't sure about this:
https://github.com/karl616/atac-seq-pipeline/blob/fb5f4255e932394c3a81a112cce95bcd3162ec7d/installers/install_dependencies.sh#L29-L32
It is currently excluded and works on my system, but I don't have access to all the various backends you support.

@karl616 changed the title from "Add support for bioconda picard installtions." to "Add support for bioconda picard installations." Aug 3, 2018
@karl616 changed the title from "Add support for bioconda picard installations." to "Add support for bioconda picard installation." Aug 3, 2018
@karl616 (Contributor, Author) commented Aug 3, 2018

If this environment setup gets through the tests, we could also simplify the container generation (I think). (And I apologize if I dismissed the idea too harshly earlier; some of my systems are too backwards, and I was one step behind in the thought process.)

What should be possible is to simply outsource the generation of containers to BioContainers, which is an extension/partner of bioconda.

http://biocontainers.pro/

They have one container per bioconda recipe, but also provide multi-package containers:

http://biocontainers.pro/multi-package-containers

I have to figure out if it is possible with all these legacy dependencies, but if so, that would relieve you of supporting containers along with the pipeline.

BioContainers provides both Docker containers:

https://quay.io/organization/biocontainers

and Singularity containers:

https://depot.galaxyproject.org/singularity/

Perhaps this belongs in a separate pull request, and some of the bioconda recipes we added haven't been ported to BioContainers yet (I'm following this up), but before venturing forth I wanted to check if there is interest.

It requires some changes now, but I imagine it will make the process of updating software easier in the long run.

@karl616 (Contributor, Author) commented Aug 3, 2018

It works with older versions. Pull requests like these:

python2: https://github.com/BioContainers/multi-package-containers/pull/529/files
python3: https://github.com/BioContainers/multi-package-containers/pull/528/files

give you containers like these:

python2:
https://quay.io/repository/biocontainers/mulled-v2-c0d01ccc79bbd3fe061a321d7ead24b5abd34899
docker pull quay.io/biocontainers/mulled-v2-c0d01ccc79bbd3fe061a321d7ead24b5abd34899:723b6776d5a7f3816c8ef3035a855cff230f1e65-0
singularity:
https://depot.galaxyproject.org/singularity/mulled-v2-c0d01ccc79bbd3fe061a321d7ead24b5abd34899%3A723b6776d5a7f3816c8ef3035a855cff230f1e65-0

python3:
https://quay.io/repository/biocontainers/mulled-v2-4079a2d072f7abce3806c0603812aee84e36b5d7
docker pull quay.io/biocontainers/mulled-v2-4079a2d072f7abce3806c0603812aee84e36b5d7:fd6613618d8dafbebac12d49f3884cfa879c0ef5-0
singularity:
https://depot.galaxyproject.org/singularity/mulled-v2-4079a2d072f7abce3806c0603812aee84e36b5d7%3Afd6613618d8dafbebac12d49f3884cfa879c0ef5-0

They replicate the conda environments, which is good for reproducibility. The drawback is that you need two containers, and the naming is a bit clumsy.

Anyhow, my idea was that this could be handled within the WDL file:
http://cromwell.readthedocs.io/en/develop/RuntimeAttributes/#docker

What do you think?

@karl616 (Contributor, Author) commented Aug 7, 2018

@leepc12 the containers seem to require more work than I expected. BioContainers are built upon busybox and hence don't provide full bash functionality, which is required by built-in WDL functions like read_map. I'll see if I can find a way around it.
I'll open a new pull request if I come up with something and leave this one here for bioconda.

@leepc12 (Contributor) commented Aug 8, 2018

@karl616: Thanks for your work. Yes, let's talk more about the containers in another PR or issue.

BTW, your PR works fine, but I got slightly different MACS2 outputs between Conda and Docker due to some changes in numpy/BLAS, so I still need to figure this out. I will make a new branch derived from this PR and will let you know about progress.
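
One way to see which numpy/BLAS builds each side picked up (a quick diagnostic, not part of the pipeline):

conda list | grep -E 'numpy|openblas|mkl'        # package versions and build strings
python -c 'import numpy; numpy.show_config()'    # the BLAS numpy was compiled against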

@karl616 (Contributor, Author) commented Aug 9, 2018

@leepc12 Sounds good.
The difference is very inconvenient. I compared my original environment to the new one and there were a lot of differing versions.
To me this suggests that we should lock down the versions of more software, perhaps even all of them. One way to go is this:

https://conda.io/docs/user-guide/tasks/manage-environments.html#cloning-an-environment

That would ensure that versions stay the same.
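
For example, following that page (the env and file names are assumed):

conda list --explicit -n encode-atac-seq-pipeline > spec-file.txt    # exact package list, including builds
conda create --name encode-atac-seq-pipeline --file spec-file.txt    # recreate the identical environment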

I'm going back and setting up the old version anew in order to do a comparison. If you want assistance, perhaps you could share the output from conda list for the old and new encode-atac-seq-pipeline environments.

I suspect that we will see differences just by comparing our two installations. In my experience, conda tries to stay as up to date as possible when no version is specified.
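
To compare two installations directly (env names are placeholders):

diff <(conda list -n encode-atac-seq-pipeline-old) <(conda list -n encode-atac-seq-pipeline-new)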

@karl616 (Contributor, Author) commented Aug 9, 2018

Here are my files if you want to have a look:

First, two versions of the original pipeline, installed roughly one week apart:
encode-atac-seq-pipeline.orig.180802.txt
encode-atac-seq-pipeline.orig.180810.txt

And the new installation process:
encode-atac-seq-pipeline.new.180802.txt

My point is that only the software with fixed versions stays constant... and even then there might be new builds. Cython and biopython are two packages that changed just from installing the old environment twice...

I looked at the Dockerfile and there are quite a few differences from the conda environment... Would it be interesting to harmonize this?

I tried to do this quickly before and I messed it up...

@karl616 (Contributor, Author) commented Aug 10, 2018

leepc12 added a commit that referenced this pull request Aug 15, 2018
@leepc12 (Contributor) commented Aug 22, 2018

@karl616: I am very sorry about the very late update on this PR. We appreciate your work on it.
I am still struggling with dependency problems related to MACS2 and am still getting different outputs on different platforms, but I will finish it soon and will let you know.
Thanks again for your work.

@leepc12 mentioned this pull request Aug 27, 2018
@leepc12 closed this Sep 6, 2018