Test alternative duplicate marking programs #113

marcelm · 2019-10-07T11:46:51Z

Consider whether possibly faster alternatives to Picard MarkDuplicates can be used:

sambamba markdup
samblaster

pontushojer · 2019-10-15T07:22:52Z

Run test on different duplicate marking programs.

Dataset: XVII.SM10.reseq_2 mapped using bowtie2 and chr1 extracted. Final number of mapped alignments is 60,753,598 st.

Program versions:

Picard: 2.20.4
sambamba: 0.6.6
samblaster: 0.1.24

Test 1. Threads = 20

Program	Real time	User time	%duplicates
Picard MarkDuplicates	20m57.394s	58m27.382s	67.534
sambamba markdup	3m9.613s	42m14.153s	67.534
samblaster	2m6.075s	0m59.907s	64.64

Test 2. Threads = 2

Program	Real time	User time	%duplicates
Picard MarkDuplicates	21m27.768s	21m14.566s	67.534
sambamba markdup	26m37.955s	34m7.832s	67.534
samblaster	5m7.100s	1m25.233s	64.64

Conclusion

samblaster is faster than both sambamba and picard on both 2 and 20 threads.
picard does not benefit from running in parallel unlike sambamba and samblaster.
samblaster seems to be less sensitive in detecting duplicates (64.64% detected relative to 67.53%).

Finally in testing I also want to point out a difference in implementation when using samblaster. Unlike picard and sambamba, samblaster requires a name-sorted SAM file for input and outputs in SAM format. This means that it should be implemented to take stdin directly from the mapper, for example:

bowtie2 -1 r1.fq -2 r2.fq -r ref.fa | samblaster | samtools sort > out_mkdup_sorted.bam.

FrickTobias · 2019-10-15T15:19:34Z

The reads which are not marked as read duplicates, do they seem to be duplicates? I'd say whether they are read duplicates or not is the most important factor for deciding which we should go for.

pontushojer · 2019-10-16T11:40:26Z

The reads which are not marked as read duplicates, do they seem to be duplicates? I'd say whether they are read duplicates or not is the most important factor for deciding which we should go for.

True, I am investigating this further.

pontushojer · 2019-10-18T12:42:22Z

From supplementary

"PICARD and SAMBLASTER are using different strategies to identify duplicates. Read pairs in which one read is mapped and the other unmapped are called “orphans”. SAMBLASTER compares orphans only to other orphans to find duplicates, and always marks both reads in an orphan pair as either duplicate or not duplicate. PICARD marks many more mapped reads in orphans as duplicate than SAMBLASTER, and marks no unmapped reads in orphans as duplicates. One possible explanation for this large number of mapped orphan duplicates is that Picard compares the mapped orphan reads to all mapped reads to determine if it is a duplicate."

So there is some difference where samblaster would only compare "orphan" alignments to other orphan alignment while picard compares against all.

I also found that samblaster and picard does a different selection of which alignment to mark as duplicate. Samblaster selects the lexicographically first alignment while picard performs some sort of comparison between them to select the one with the higher quality alignment. See quote from samblasters issues page.

"As to how the results will compare to Picard, the answer is the same for single-end reads as for paired-end reads. samblaster has much higher performance than Picard in terms of speed, but makes the trade-off that the first of a set of duplicates that are found in the input file is kept in the output file, while Picard will keep the "best" of a set of duplicates in the output file. In order to do this, Picard is forced to make two passes over an input file that has been landed to disk (not in a pipe), ergo the slower performance."

In summary:
samblaster
👍 Fastest tool
👍 Can read/write from stdin/out
👎 Naive duplicate marking.
👎 (- No conda release )

picard MarkDuplicates
👍 Standard tool to duplicate marking
👍 Optimized duplicate marking

👎 Slow, no parallel processing
👎 High memory consumption
👎 Cannot read/write from stdin/out

sambamba markdup
👍 Same duplicate definitions as picard
👍 Faster than picard, parallel processing.

👎 Cannot read/write from stdin/out

marcelm · 2019-10-22T08:17:00Z

Nice overview! Some extra points:

Did you also check memory usage? I would expect samblaster to use more memory than Picard when piping the aligner’s input directly into samblaster vs. running MarkDuplicates on the coordinate-sorted BAM.
There is a Bioconda package for samblaster.
samblaster is not maintained anymore.

It appears to me that sambamba markdup is strictly preferable over Picard MarkDuplicates. If you agree, then the question becomes whether the slightly different results for samblaster are acceptable.

Would it make sense to support both (samblaster/sambamba) in the pipeline for a while, in order to see how the final results change when switching the duplicate marker?

pontushojer · 2019-10-23T07:58:36Z

Thanks for the comments!

I have not checked memory usage, I will look into that further.

So there is a bioconda package, good to know!

I totally agree that sambamba is preferable to picard, so I would definitely exchange that. I could also include samblaster for now so we can test it out further.

FrickTobias · 2019-10-23T09:54:52Z

Agreed, both sound good. For now we can indicate what we would use with a default option.

pontushojer · 2019-10-23T13:22:57Z

So the bioconda package for samblaster is only for linux. There is however a BioBuilds channel which includes a functional version for both osx and linux.

marcelm · 2019-10-23T13:29:24Z

Ah, that explains our differing observations! :-)

I’d prefer to get the Bioconda recipe to work on macOS as well instead of using the BioBuilds recipe. I’ll open an issue.

pontushojer · 2019-10-28T13:02:37Z

Memory test performed on subset of previous data mapping to chr21.

Memory usage results:

Program	Threads	Maximum resident set size (kbytes)
Picard	1	4,322,800
Picard	10	40,202,300
Sambamba	1	1,303,864
Sambamba	10	1,262,708
Samblaster	1	42,428

Commands analysed:

java -Xmx6g -jar /sw/bioinfo/picard/2.20.4/rackham/picard.jar MarkDuplicates I=chr21.bam O=chr21.bam.mkdup_picard.bam M=mkdup_picard.metrics ASSUME_SORT_ORDER=coordinate

java -Xmx60g -jar /sw/bioinfo/picard/2.20.4/rackham/picard.jar MarkDuplicates I=chr21.bam O=chr21.bam.mkdup_picard.bam M=mkdup_picard.metrics ASSUME_SORT_ORDER=coordinate

sambamba markdup -t 1 --tmpdir /scratch/10448853 chr21.bam chr21.bam.mkdup_sambamba.bam

sambamba markdup -t 10 --tmpdir /scratch/10448855 chr21.bam chr21.bam.mkdup_sambamba.bam

samblaster -o chr21.bam.mkdup_samblaster.sam

marcelm · 2019-10-28T13:17:12Z

Oh, interesting! I’m must have remembered memory usage of samblaster incorrectly. Then that is definitely not a concern.

pontushojer · 2019-10-29T15:43:51Z

Closed after PR #132

marcelm mentioned this issue Oct 7, 2019

Performance improvements #21

Closed

10 tasks

pontushojer self-assigned this Oct 11, 2019

marcelm mentioned this issue Oct 23, 2019

Get the bioconda samblaster recipe to work on macOS #125

Closed

pontushojer mentioned this issue Oct 29, 2019

Implement alternative duplicate markers #132

Merged

pontushojer closed this as completed Oct 29, 2019

mattmaurano mentioned this issue Apr 23, 2020

SAMBLASTER duplicate criteria mauranolab/mapping#189

Open

bunop mentioned this issue Apr 4, 2024

⚡ replace picard/markduplicates with sambamba/markdup cnr-ibba/nf-resequencing-mem#70

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test alternative duplicate marking programs #113

Test alternative duplicate marking programs #113

marcelm commented Oct 7, 2019

pontushojer commented Oct 15, 2019

FrickTobias commented Oct 15, 2019

pontushojer commented Oct 16, 2019

pontushojer commented Oct 18, 2019

marcelm commented Oct 22, 2019

pontushojer commented Oct 23, 2019

FrickTobias commented Oct 23, 2019

pontushojer commented Oct 23, 2019

marcelm commented Oct 23, 2019

pontushojer commented Oct 28, 2019 •

edited

Loading

marcelm commented Oct 28, 2019

pontushojer commented Oct 29, 2019

Test alternative duplicate marking programs #113

Test alternative duplicate marking programs #113

Comments

marcelm commented Oct 7, 2019

pontushojer commented Oct 15, 2019

FrickTobias commented Oct 15, 2019

pontushojer commented Oct 16, 2019

pontushojer commented Oct 18, 2019

marcelm commented Oct 22, 2019

pontushojer commented Oct 23, 2019

FrickTobias commented Oct 23, 2019

pontushojer commented Oct 23, 2019

marcelm commented Oct 23, 2019

pontushojer commented Oct 28, 2019 • edited Loading

marcelm commented Oct 28, 2019

pontushojer commented Oct 29, 2019

pontushojer commented Oct 28, 2019 •

edited

Loading