Request for support of Sentieon-Dedup #1936

Closed
1 task done
asp8200 opened this issue Jun 16, 2023 · 4 comments

Comments

@asp8200

asp8200 commented Jun 16, 2023

Name of the tool

Sentieon Dedup

Tool homepage

https://support.sentieon.com/manual/usages/general/#dedup-algorithm

Tool description

The Dedup algorithm performs the marking/removing of duplicate reads.

Tool output

test.md.cram.metrics.zip

Log filename pattern

No response

Data suitable for MultiQC plot(s)

I imagine that the metrics from Sentieon Dedup should be displayed pretty much like the corresponding metrics from GATK's MarkDuplicates.

Here is an example of how the first part of the metrics file from Dedup may look:

#SentieonCommandLine: /opt/sentieon/sentieon-genomics-202112.06/libexec/driver -t 2 -i test.paired_end.sorted.bam -r genome.fasta --algo Dedup --score_info test.md.score --metrics test.md.cram.metrics test.md.cram
LIBRARY	UNPAIRED_READS_EXAMINED	READ_PAIRS_EXAMINED	SECONDARY_OR_SUPPLEMENTARY_RDS	UNMAPPED_READS	UNPAIRED_READ_DUPLICATES	READ_PAIR_DUPLICATES	READ_PAIR_OPTICAL_DUPLICATES	PERCENT_DUPLICATION	ESTIMATED_LIBRARY_SIZE
testN	0	2820	2	2	0	828	0	0.293617	3807

#HISTOGRAM
BIN	VALUE
1.0	0.999986
2.0	1.476740
3.0	1.704038
...
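
To make the layout above concrete, here is a minimal sketch of how the metrics table could be read. It is not MultiQC code: the function names `parse_metrics_table` and `percent_duplication` are hypothetical, the sample string is rebuilt from the excerpt with the command line abbreviated, and the duplication formula is my understanding of how Picard defines PERCENT_DUPLICATION (the sample values are consistent with it).

```python
import io

COLUMNS = [
    "LIBRARY", "UNPAIRED_READS_EXAMINED", "READ_PAIRS_EXAMINED",
    "SECONDARY_OR_SUPPLEMENTARY_RDS", "UNMAPPED_READS",
    "UNPAIRED_READ_DUPLICATES", "READ_PAIR_DUPLICATES",
    "READ_PAIR_OPTICAL_DUPLICATES", "PERCENT_DUPLICATION",
    "ESTIMATED_LIBRARY_SIZE",
]

# Sample rebuilt from the excerpt above; the command line is abbreviated.
SAMPLE = "\n".join([
    "#SentieonCommandLine: (abbreviated)",
    "\t".join(COLUMNS),
    "\t".join(["testN", "0", "2820", "2", "2", "0", "828", "0", "0.293617", "3807"]),
    "",
    "#HISTOGRAM",
    "BIN\tVALUE",
    "1.0\t0.999986",
]) + "\n"


def parse_metrics_table(handle):
    """Return one dict per library row (hypothetical helper, not a MultiQC API)."""
    rows, header = [], None
    for raw in handle:
        line = raw.rstrip("\r\n")
        if header is None:
            # Skip everything until the tab-separated column header.
            if line.startswith("LIBRARY\t"):
                header = line.split("\t")
            continue
        if not line.strip():
            break  # a blank line ends the table; the histogram follows
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


def percent_duplication(row):
    """Recompute the duplication fraction; assumes pairs count as two reads."""
    duplicates = int(row["UNPAIRED_READ_DUPLICATES"]) + 2 * int(row["READ_PAIR_DUPLICATES"])
    examined = int(row["UNPAIRED_READS_EXAMINED"]) + 2 * int(row["READ_PAIRS_EXAMINED"])
    return duplicates / examined if examined else 0.0


rows = parse_metrics_table(io.StringIO(SAMPLE))
recomputed = percent_duplication(rows[0])
print(rows[0]["LIBRARY"], rows[0]["PERCENT_DUPLICATION"], round(recomputed, 6))
```

Recomputing the fraction from the raw counts is only a sanity check that the columns mean the same thing as in the MarkDuplicates output.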

And here is the complete set of metrics from GATK's MarkDuplicates for the same sample:

## htsjdk.samtools.metrics.StringHeader
# MarkDuplicates --INPUT test.bam --OUTPUT test.md.cram --METRICS_FILE test.md.cram.metrics --REMOVE_DUPLICATES false --TMP_DIR . --VALIDATION_STRINGENCY LENIENT --REFERENCE_SEQUENCE genome.fasta --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --FLOW_MODE false --FLOW_QUALITY_SUM_STRATEGY false --USE_END_IN_UNPAIRED_READS false --USE_UNPAIRED_CLIPPED_END false --UNPAIRED_END_UNCERTAINTY 0 --FLOW_SKIP_FIRST_N_FLOWS 0 --FLOW_Q_IS_KNOWN_END false --FLOW_EFFECTIVE_QUALITY_THRESHOLD 15 --ADD_PG_TAG_TO_READS true --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
## htsjdk.samtools.metrics.StringHeader
# Started on: Fri Jun 16 09:17:57 GMT 2023

## METRICS CLASS	picard.sam.DuplicationMetrics
LIBRARY	UNPAIRED_READS_EXAMINED	READ_PAIRS_EXAMINED	SECONDARY_OR_SUPPLEMENTARY_RDS	UNMAPPED_READS	UNPAIRED_READ_DUPLICATES	READ_PAIR_DUPLICATES	READ_PAIR_OPTICAL_DUPLICATES	PERCENT_DUPLICATION	ESTIMATED_LIBRARY_SIZE
test	8601	721	82	523429	3876	0	0	0.38594

## HISTOGRAM	java.lang.Double
set_size	all_sets	non_optical_sets
1.0	721	721

Most interesting data for the General Stats table

No response

Before submitting

  • I have included example data (zipped, not pasted) that can be used to write the module.

@asp8200
Author

asp8200 commented Jun 25, 2023

MultiQC already supports three Sentieon DNAseq modules, but not Dedup. Support for Dedup was, however, already requested in:

#1180

I plan to add the Dedup function here:

https://github.com/ewels/MultiQC/blob/master/multiqc/modules/sentieon/sentieon.py

And the Dedup function/module will probably be almost identical to the one for MarkDuplicates.

https://github.com/ewels/MultiQC/blob/master/multiqc/modules/picard/MarkDuplicates.py

In fact, I plan to just copy the MarkDuplicates module and adjust it for Sentieon 😆

@vladsavelyev
Member

Thank you again @asp8200 for raising the issue, contributing a test example and creating a PR!

We addressed this by supporting Sentieon directly in the Picard module, so that we can more easily expand to other Sentieon QC tools that match Picard tools: #2110

Hope that works. Please don't hesitate to leave comments or raise issues if anything got broken, or if you have other requests :)
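
As a rough illustration of why one module can cover both tools, the sketch below reads the two header styles shown earlier in this thread with a single function. It is not the code from #2110: `detect_tool` and `read_first_row` are hypothetical names, the command lines are abbreviated, and padding the short Picard row assumes that the missing trailing value is ESTIMATED_LIBRARY_SIZE, as it appears in the pasted sample.

```python
HEADER = "\t".join([
    "LIBRARY", "UNPAIRED_READS_EXAMINED", "READ_PAIRS_EXAMINED",
    "SECONDARY_OR_SUPPLEMENTARY_RDS", "UNMAPPED_READS",
    "UNPAIRED_READ_DUPLICATES", "READ_PAIR_DUPLICATES",
    "READ_PAIR_OPTICAL_DUPLICATES", "PERCENT_DUPLICATION",
    "ESTIMATED_LIBRARY_SIZE",
])

# Both samples are rebuilt from the excerpts in this issue (abbreviated).
SENTIEON_TEXT = "\n".join([
    "#SentieonCommandLine: (abbreviated)",
    HEADER,
    "\t".join(["testN", "0", "2820", "2", "2", "0", "828", "0", "0.293617", "3807"]),
    "",
])

PICARD_TEXT = "\n".join([
    "## htsjdk.samtools.metrics.StringHeader",
    "# MarkDuplicates (abbreviated)",
    "",
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics",
    HEADER,
    "\t".join(["test", "8601", "721", "82", "523429", "3876", "0", "0", "0.38594"]),
    "",
])


def detect_tool(text):
    """Guess which tool wrote the file from its comment lines (hypothetical)."""
    for line in text.splitlines():
        if line.startswith("#SentieonCommandLine:"):
            return "sentieon"
        if line.startswith("## METRICS CLASS") and "DuplicationMetrics" in line:
            return "picard"
    return None


def read_first_row(text):
    """Return the first library row as a dict, whatever precedes the header."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("LIBRARY\t") and i + 1 < len(lines):
            header = line.split("\t")
            values = lines[i + 1].split("\t")
            # Assumption: a short row lost empty trailing fields; pad them back.
            values += [""] * (len(header) - len(values))
            return dict(zip(header, values))
    return None


for text in (SENTIEON_TEXT, PICARD_TEXT):
    row = read_first_row(text)
    print(detect_tool(text), row["LIBRARY"], row["PERCENT_DUPLICATION"])
```

Because the column header is identical in both files, only the detection step needs to know about the tool-specific comment lines.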

@asp8200
Author

asp8200 commented Nov 13, 2023

Thanks, @vladsavelyev. So I guess that will be released as part of v1.18, right? When do you reckon that will hit the streets?

@ewels
Member

ewels commented Nov 13, 2023

Friday, if all goes to plan 🤞🏻
