Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test alternative duplicate marking programs #113

Closed
marcelm opened this issue Oct 7, 2019 · 12 comments
Closed

Test alternative duplicate marking programs #113

marcelm opened this issue Oct 7, 2019 · 12 comments
Assignees

Comments

@marcelm
Copy link
Collaborator

marcelm commented Oct 7, 2019

Consider whether possibly faster alternatives to Picard MarkDuplicates can be used:

  • sambamba markdup
  • samblaster
@marcelm marcelm mentioned this issue Oct 7, 2019
10 tasks
@pontushojer pontushojer self-assigned this Oct 11, 2019
@pontushojer
Copy link
Collaborator

Run test on different duplicate marking programs.

Dataset: XVII.SM10.reseq_2 mapped using bowtie2 and chr1 extracted. Final number of mapped alignments is 60,753,598 st.

Program versions:

  • Picard: 2.20.4
  • sambamba: 0.6.6
  • samblaster: 0.1.24

Test 1. Threads = 20

Program Real time User time %duplicates
Picard MarkDuplicates 20m57.394s 58m27.382s 67.534
sambamba markdup 3m9.613s 42m14.153s 67.534
samblaster 2m6.075s 0m59.907s 64.64

Test 2. Threads = 2

Program Real time User time %duplicates
Picard MarkDuplicates 21m27.768s 21m14.566s 67.534
sambamba markdup 26m37.955s 34m7.832s 67.534
samblaster 5m7.100s 1m25.233s 64.64

Conclusion

  • samblaster is faster than both sambamba and picard on both 2 and 20 threads.
  • picard does not benefit from running in parallel unlike sambamba and samblaster.
  • samblaster seems to be less sensitive in detecting duplicates (64.64% detected relative to 67.53%).

Finally in testing I also want to point out a difference in implementation when using samblaster. Unlike picard and sambamba, samblaster requires a name-sorted SAM file for input and outputs in SAM format. This means that it should be implemented to take stdin directly from the mapper, for example:

bowtie2 -1 r1.fq -2 r2.fq -r ref.fa | samblaster | samtools sort > out_mkdup_sorted.bam.

@FrickTobias
Copy link
Owner

The reads which are not marked as read duplicates, do they seem to be duplicates? I'd say whether they are read duplicates or not is the most important factor for deciding which we should go for.

@pontushojer
Copy link
Collaborator

The reads which are not marked as read duplicates, do they seem to be duplicates? I'd say whether they are read duplicates or not is the most important factor for deciding which we should go for.

True, I am investigating this further.

@pontushojer
Copy link
Collaborator

From supplementary

"PICARD and SAMBLASTER are using different strategies to identify duplicates. Read pairs in which one read is mapped and the other unmapped are called “orphans”. SAMBLASTER compares orphans only to other orphans to find duplicates, and always marks both reads in an orphan pair as either duplicate or not duplicate. PICARD marks many more mapped reads in orphans as duplicate than SAMBLASTER, and marks no unmapped reads in orphans as duplicates. One possible explanation for this large number of mapped orphan duplicates is that Picard compares the mapped orphan reads to all mapped reads to determine if it is a duplicate."

So there is some difference where samblaster would only compare "orphan" alignments to other orphan alignment while picard compares against all.

I also found that samblaster and picard does a different selection of which alignment to mark as duplicate. Samblaster selects the lexicographically first alignment while picard performs some sort of comparison between them to select the one with the higher quality alignment. See quote from samblasters issues page.

"As to how the results will compare to Picard, the answer is the same for single-end reads as for paired-end reads. samblaster has much higher performance than Picard in terms of speed, but makes the trade-off that the first of a set of duplicates that are found in the input file is kept in the output file, while Picard will keep the "best" of a set of duplicates in the output file. In order to do this, Picard is forced to make two passes over an input file that has been landed to disk (not in a pipe), ergo the slower performance."

In summary:
samblaster
👍 Fastest tool
👍 Can read/write from stdin/out
👎 Naive duplicate marking.
👎 (- No conda release )

picard MarkDuplicates
👍 Standard tool to duplicate marking
👍 Optimized duplicate marking

👎 Slow, no parallel processing
👎 High memory consumption
👎 Cannot read/write from stdin/out

sambamba markdup
👍 Same duplicate definitions as picard
👍 Faster than picard, parallel processing.

👎 Cannot read/write from stdin/out

@marcelm
Copy link
Collaborator Author

marcelm commented Oct 22, 2019

Nice overview! Some extra points:

  • Did you also check memory usage? I would expect samblaster to use more memory than Picard when piping the aligner’s input directly into samblaster vs. running MarkDuplicates on the coordinate-sorted BAM.

  • There is a Bioconda package for samblaster.

  • samblaster is not maintained anymore.

It appears to me that sambamba markdup is strictly preferable over Picard MarkDuplicates. If you agree, then the question becomes whether the slightly different results for samblaster are acceptable.

Would it make sense to support both (samblaster/sambamba) in the pipeline for a while, in order to see how the final results change when switching the duplicate marker?

@pontushojer
Copy link
Collaborator

Thanks for the comments!

I have not checked memory usage, I will look into that further.

So there is a bioconda package, good to know!

I totally agree that sambamba is preferable to picard, so I would definitely exchange that. I could also include samblaster for now so we can test it out further.

@FrickTobias
Copy link
Owner

Agreed, both sound good. For now we can indicate what we would use with a default option.

@pontushojer
Copy link
Collaborator

@marcelm
Copy link
Collaborator Author

marcelm commented Oct 23, 2019

Ah, that explains our differing observations! :-)

I’d prefer to get the Bioconda recipe to work on macOS as well instead of using the BioBuilds recipe. I’ll open an issue.

@pontushojer
Copy link
Collaborator

pontushojer commented Oct 28, 2019

Memory test performed on subset of previous data mapping to chr21.

Memory usage results:

Program Threads Maximum resident set size (kbytes)
Picard 1 4,322,800
Picard 10 40,202,300
Sambamba 1 1,303,864
Sambamba 10 1,262,708
Samblaster 1 42,428

Commands analysed:

java -Xmx6g -jar /sw/bioinfo/picard/2.20.4/rackham/picard.jar MarkDuplicates I=chr21.bam O=chr21.bam.mkdup_picard.bam M=mkdup_picard.metrics ASSUME_SORT_ORDER=coordinate

java -Xmx60g -jar /sw/bioinfo/picard/2.20.4/rackham/picard.jar MarkDuplicates I=chr21.bam O=chr21.bam.mkdup_picard.bam M=mkdup_picard.metrics ASSUME_SORT_ORDER=coordinate

sambamba markdup -t 1 --tmpdir /scratch/10448853 chr21.bam chr21.bam.mkdup_sambamba.bam

sambamba markdup -t 10 --tmpdir /scratch/10448855 chr21.bam chr21.bam.mkdup_sambamba.bam

samblaster -o chr21.bam.mkdup_samblaster.sam

@marcelm
Copy link
Collaborator Author

marcelm commented Oct 28, 2019

Oh, interesting! I’m must have remembered memory usage of samblaster incorrectly. Then that is definitely not a concern.

@pontushojer
Copy link
Collaborator

Closed after PR #132

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants