Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed human genome downloading and added auto-masking feature using dustmasker #57

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

taltman
Copy link

@taltman taltman commented Feb 22, 2017

Hi Derrick,

I've been using Kraken as of late, and decided to incorporate some fixes. First, I incorporated a suggested fix by Silas Kieser for downloading the human genome, and simplified it to use a single wget call:

https://groups.google.com/d/msg/kraken-users/wMNMSPo8Xtw/osYcrx90DgAJ

Next, I incorporated Adam Rivers' suggestion about how to use dustmasker to mask low-complexity regions into kraken-build, along with adding an option '--no-mask' to turn off masking if desired for reproducibility. The software reverts to no masking if dustmasker is not found.

https://groups.google.com/d/msg/kraken-users/jjRe21-qyvw/Kq8DXY45CQAJ

I also updated the documentation to reflect the new soft dependency on dustmasker, and documented the --no-mask option.

Please let me know what you think!

Cheers,

~Tomer

@salzberg
Copy link

Nice! Thanks Tomer - we use dustmasker ourselves in our pipeline, but we hadn't built it into Kraken as an option. I highly recommend 'dust'-ing any genomes before running Kraken (or any competing program) because of the confusion caused by low-complexity sequences.

@taltman
Copy link
Author

taltman commented Feb 23, 2017

Thanks for your feedback, Dr. Salzberg! As I've communicated to Derrick, Kraken was instrumental in me finishing my PhD, so I'm happy to be able to contribute back. Please let me know if you think that one of the Kraken devs will accept this pull request.

I'm curious about your thoughts on masking before or after mapping. I see that Heng Li advises masking after read mapping, and discarding reads that land in masked regions. Is this what you advise with Bowtie2, or do you mask DNA before building a Bowtie2 database? Thanks in advance!

https://www.biostars.org/p/170435/#170450

echo -n "Unpacking..."
tar zxf all.fna.tar.gz
rm all.fna.tar.gz
gunzip *.$EXTENSION
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default behavior of gunzip is to remove the compressed file, this causes an error when rm is called after.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @SheaML , thanks for your comment.

In the code, I removed the call to 'tar', because we are no longer downloading the all.fna.tar.gz file. Instead, we are downloading a set of gzipped files, as that is how RefSeq is now representing the data. Thus there's no call to 'rm' after a (sub-process) call to 'gunzip'. I think the pull diff shows it pretty clearly in terms of the red lines, which have been removed.

I'm new to GitHub pull requests, so please excuse me if I've confused something.

Copy link

@SheaML SheaML Mar 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @taltman the next line you added, line 46, "rm *.$EXTENSION" is the one causing the error for me. Sorry, I'm also a github neophyte.

@taltman
Copy link
Author

taltman commented Mar 10, 2017

@SheaML Right you are! I did have a commit locally that I failed to push. Resolved. Thanks for catching that and reporting it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants