Test alternative duplicate marking programs #113
Ran a test of different duplicate marking programs. Dataset: XVII.SM10.reseq_2, mapped using bowtie2, with chr1 extracted. The final number of mapped alignments is 60,753,598. Program versions:
Test 1. Threads = 20
Test 2. Threads = 2
Conclusion
Finally, I also want to point out a difference in implementation when using samblaster. Unlike picard and sambamba, samblaster requires SAM input in which alignments are grouped by read name (for example a name-sorted SAM file), and it writes its output in SAM format. This means that it should be implemented to take stdin directly from the mapper.
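Piping the mapper straight into samblaster could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper names `build_pipeline` and `run_pipeline` are hypothetical, the choice of `samtools view` as the BAM writer is an assumption, and the index and file names are placeholders rather than the commands used in the test above.

```python
import subprocess


def build_pipeline(index, fastq1, fastq2, out_bam, threads=2):
    """Return one argv list per stage: mapper -> samblaster -> BAM writer."""
    return [
        # bowtie2 writes SAM to stdout with the mates of a pair adjacent
        ["bowtie2", "-p", str(threads), "-x", index, "-1", fastq1, "-2", fastq2],
        # samblaster reads SAM on stdin and writes SAM on stdout by default
        ["samblaster"],
        # convert the duplicate-marked SAM stream to BAM (assumed final step)
        ["samtools", "view", "-b", "-o", out_bam, "-"],
    ]


def run_pipeline(stages):
    """Chain the stages with pipes and return each stage's exit code."""
    procs = []
    upstream = None
    for i, argv in enumerate(stages):
        last = i == len(stages) - 1
        proc = subprocess.Popen(
            argv, stdin=upstream, stdout=None if last else subprocess.PIPE
        )
        if upstream is not None:
            upstream.close()  # parent drops its copy so SIGPIPE can propagate
        upstream = proc.stdout
        procs.append(proc)
    return [p.wait() for p in procs]
```

With the three tools installed, something like `run_pipeline(build_pipeline("chr1_index", "reads_1.fq", "reads_2.fq", "marked.bam", threads=20))` would run the whole chain without landing an intermediate SAM file on disk.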
Do the reads that are not marked as duplicates still look like duplicates? I'd say whether they really are read duplicates is the most important factor in deciding which tool we should go for.
True, I am investigating this further.
From the supplementary material: "PICARD and SAMBLASTER are using different strategies to identify duplicates. Read pairs in which one read is mapped and the other unmapped are called “orphans”. SAMBLASTER compares orphans only to other orphans to find duplicates, and always marks both reads in an orphan pair as either duplicate or not duplicate. PICARD marks many more mapped reads in orphans as duplicate than SAMBLASTER, and marks no unmapped reads in orphans as duplicates. One possible explanation for this large number of mapped orphan duplicates is that Picard compares the mapped orphan reads to all mapped reads to determine if it is a duplicate."

So one difference is that samblaster only compares "orphan" alignments to other orphan alignments, while picard compares them against all mapped reads.

I also found that samblaster and picard differ in which alignment of a duplicate set they keep. Samblaster keeps the first alignment of the set found in the input file, while picard performs some sort of comparison between them and keeps the one with the higher quality. See this quote from samblaster's issues page:

"As to how the results will compare to Picard, the answer is the same for single-end reads as for paired-end reads. samblaster has much higher performance than Picard in terms of speed, but makes the trade-off that the first of a set of duplicates that are found in the input file is kept in the output file, while Picard will keep the "best" of a set of duplicates in the output file. In order to do this, Picard is forced to make two passes over an input file that has been landed to disk (not in a pipe), ergo the slower performance."

In summary:
picard MarkDuplicates 👎 Slow, no parallel processing
sambamba markdup 👎 Cannot read/write from stdin/out
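The "keep the first" versus "keep the best" trade-off described in the quote can be illustrated with a toy sketch. This is not either tool's actual algorithm: the function `mark_duplicates`, the position signature `(chrom, pos, strand)` and the single numeric `score` per alignment are all simplifying assumptions made for illustration.

```python
from collections import defaultdict


def mark_duplicates(alignments, keep="first"):
    """Return the set of read names flagged as duplicates.

    alignments: iterable of (name, chrom, pos, strand, score) in input order.
    keep="first": keep the first alignment seen per signature (samblaster-like).
    keep="best":  keep the highest-scoring alignment (Picard-like).
    """
    groups = defaultdict(list)
    for name, chrom, pos, strand, score in alignments:
        # Alignments sharing a signature are treated as one duplicate set.
        groups[(chrom, pos, strand)].append((name, score))

    duplicates = set()
    for members in groups.values():
        if keep == "first":
            kept = members[0][0]
        elif keep == "best":
            kept = max(members, key=lambda m: m[1])[0]
        else:
            raise ValueError("keep must be 'first' or 'best'")
        duplicates.update(name for name, _ in members if name != kept)
    return duplicates


# Hypothetical toy input: readA and readB share a signature.
reads = [
    ("readA", "chr1", 100, "+", 30),
    ("readB", "chr1", 100, "+", 60),
    ("readC", "chr1", 500, "-", 40),
]
first_dups = mark_duplicates(reads, keep="first")
best_dups = mark_duplicates(reads, keep="best")
```

Both strategies flag the same number of duplicates here but pick different reads, which is the kind of difference to expect when comparing samblaster output with picard output. Note that the "first" strategy needs no look-ahead, which is what allows a single pass over a piped stream.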
Nice overview! Some extra points:
It appears to me that sambamba markdup is strictly preferable to Picard MarkDuplicates. If you agree, then the question becomes whether the slightly different results for samblaster are acceptable. Would it make sense to support both (samblaster/sambamba) in the pipeline for a while, in order to see how the final results change when switching the duplicate marker?
Thanks for the comments! I have not checked memory usage; I will look into that further. So there is a bioconda package, good to know! I totally agree that sambamba is preferable to picard, so I would definitely replace picard with it. I could also include samblaster for now so we can test it out further.
Agreed, both sound good. For now we can indicate what we would use with a default option.
Ah, that explains our differing observations! :-) I’d prefer to get the Bioconda recipe to work on macOS as well instead of using the BioBuilds recipe. I’ll open an issue.
Memory test performed on a subset of the previous data, mapped to chr21. Memory usage results:
Commands analysed:
Oh, interesting! I must have remembered the memory usage of samblaster incorrectly. Then that is definitely not a concern.
Closed after PR #132
Consider whether possibly faster alternatives to Picard MarkDuplicates can be used: