Skip to content

Stitcher 5.2.9 Design Document

tamsen edited this page Sep 24, 2018 · 1 revision

Overview

The Stitcher takes paired reads from a bam, stitches them, and writes out a new bam. Reads that cannot be stitched are left unchanged. A stitched read is a single read, that has been bioinformatically stitched from a pair of reads. The stitched read contains one consensus sequence of base calls, one consensus sequence of base call qualities, one consensus cigar string, and one direction tag, which gives the original directions of the contributing reads. Optionally, The stitcher filters out reads that will not be used downstream, such as PCR duplicates or poorly mapped reads, in order to save on future processing time. Note that this document includes some references to extensions of current algorithms. These are highlighted in red, where not currently implemented.

Glossary

Pisces Glossary

Configuration

The Stitcher supports configuration of parameters so that its behavior can be fine tuned depending on the application context.

Format: dotnet Stitcher.dll [-options]

Example: dotnet Stitcher.dll –Bam C:\test.bam –OutFolder C:\OutFolder

SDS ID Specification
SDS-1 The Stitcher shall accept command line arguments as a whitespace-separated list of name and value pairs.
SDS-2 If an invalid command is given, the Stitcher shall exit with an error message describing the failed argument, the reason for failure, and the list of valid commands.
SDS-3 The Stitcher command line shall be capitalization invariant.
SDS ID Specification
SDS-4 The Stitcher shall require the command line arguments listed below:
Argument Name Type Default value Description
Bam string none File path for input bam
SDS ID Specification
SDS-5 The Stitcher shall optionally support the command line arguments listed below.
Argument Name Type Default value Description
OutFolder string none. By default the output destination will be the original bam folder destination for output bam
MinBaseCallQuality integer 20 in case of a stitching conflict, bases with qscore less than this value will automatically be disregarded in favor of the mate's bases.
FilterMinMapQuality integer 1 reads with map quality less than this value shall be filtered
FilterDuplicates bool true reads marked as duplicate reads shall be filtered
FilterForProperPairs bool false reads marked as not proper pairs shall be filtered
FilterUnstichablePairs bool false reads pairs with incompatible cigar strings shall be filtered
NifyUnstitchablePairs bool true read pairs with incompatible CIGAR strings shall be N-ified.
NifyDisagreement bool false whether or not to turn high-quality disagreeing overlap bases to Ns.
UseSoftClippedBases bool true allow bases soft clipped from the (non-probe) ends of reads to inform stitching.

Input

The Stitcher requires as input one BAM file. The BAM file is assumed to be sorted, i.e. alignment reads should be sorted by mapped reference position. This assumption allows The Stitcher to process BAMs in one pass without a large memory footprint. Most standard aligners produce sorted BAM files. Positions in BAM files are expected to use 0-based coordinate system.

SDS ID Specification
SDS-6 The Stitcher shall require as input one bam.
SDS-7 The Stitcher shall require as input an accompanying BAI file with the same file name in the same directory.

Output

The Stitcher outputs one (stitched) BAM file. The stitched file name shall be .stitched.bam. The bam shall be sorted.

SDS ID Specification
SDS-8 The Stitcher shall require as input one bam.
SDS-9 The Stitcher shall require as input an accompanying BAI file with the same file name in the same directory.
SDS-24 The output bam shall contain a single, merged read for each read pair that was successfully stitched. The original read pair shall not be outputted. Reads that were not successfully stitched should be outputted as their original read pair unless filtering unstitchable pairs is enabled. Singletons shall be outputted in their original form.
SDS-25 Merged reads in the bam outputted by the Stitcher shall be flagged as, Read 1, Unmated. Unmerged reads shall retain their original flags.

Design

The Stitcher is essentially a pipeline program. The Stitcher shall stream through the bam, holding on to each proper-pair read until its mate is found. Once the mate is found, the read and its mate shall be stitched. The stitched read will be held until the all possible upstream alignments have been written. Once this condition has been met, the stitched read shall be written to the output bam. Alignments that are not proper pairs or which fail mapping quality thresholds shall optionally be filtered. Paired reads shall be reconciled into one consensus read to be used for downstream processing. The XD tag will be used to give the directions of the stitched read. If no mate is found for a proper-pair read (that is flagged as having a proper mate in the bam), the Stitcher will assume the bam is corrupted, log an exception, and exit.

Error handling

SDS ID Specification
SDS-12 The Stitcher shall log an exception if a bam is detected to be corrupted.

Filtering

SDS ID Specification
SDS-13 The Stitcher shall filter reads with map quality less than FilterMinMapQuality from the final bam.
SDS-14 The Stitcher shall optionally filter out duplicate reads.
SDS-15 The Stitcher shall optionally filter out any reads not marked as proper pairs.
SDS-16 The Stitcher shall optionally filter out any reads that have incompatible cigar strings with its mate.

Processing Proper Pairs

A) If the pairs simply do not overlap, they may be evaluated separately or treated as a single read with a gap in the middle, to facilitate phasing. (see the "StitchGappedPairs" option) B) If the pairs do overlap, they are tested for compatibility, where compatible means the cigar strings match such that insertions and deletions are called at the same place and are of the same length within the stitched region. Indels must have the same start and endpoint, where ever closed. (insertions or deletions running off the edge of the overlap are considered acceptable if their mate-indel is fully anchored) Inserted content can disagree - The Stitcher will take the highest quality bases) Mismatch sections are of the same length. If a softclipped section falls within the overlap region and its sequence is compatible with its mate, it may contribute to the stitched section. (For example, if two bases are softclipped from the end of R1, but they match and overlap with the first two bases of an insertion coming from R2, we can call that insertion present in the stitched region)

If the reads are compatible, they are reconciled into one consensus read as follows:

Generating Consensus Reads

If the paired reads are compatible, the Stitcher shall reconcile the pair into one consensus read by region as follows below.

SDS ID Specification
SDS-17 Positions preceding overlap region, shall use the upstream read sequence, Qscores, and direction
SDS-22 Positions trailing overlap region, shall use the downstream read sequence, Qscores, and direction
SDS ID Specification
SDS-18 Matched/Mismatched Positions within overlap region, shall have the stitched direction, and for each position:
condition Base Qscore
Bases agree from read 1 Max(read1, read2)
Bases disagree and both qscores ≥ minimum basecall quality configured “N”, i.e. no call 0
Bases disagree and both qscores < minimum basecall quality Read1 if read1 qscore ≥ minimum basecall quality, otherwise read2 Qscore corresponding to read for which base was extracted
Bases disagree and either qscores < minimum basecall quality configured Read1 if read1 qscore ≥ minimum basecall quality, otherwise read2 Qscore corresponding to read for which base was extracted
SDS ID Specification
SDS-19 Insertion/Deletion Positions within overlap region, shall have the stitched direction, and for each position:
condition Base Qscore
Bases agree from read 1 Max(read1, read2)
Insertion agree in length, but disagree in content, and both qscores ≥ minimum basecall quality configured “N”, i.e. no call 0
Insertion agree in length, but disagree in content, and bases disagree and either qscores < minimum basecall quality configured choose inserted content from read1 if read1 qscore ≥ minimum basecall quality, otherwise read2 Qscore corresponding to read for which base was extracted
Deletions or insertions disagree in length, or one read has and indel and the other has no indel or a different indel. see un-stitchable strategy see un-stitchable strategy
SDS ID Specification
SDS-20 If (UseSoftClippedBases=TRUE), softclipped positions from the end of the reads within overlap region may inform read stitching, shall have the stitched direction, and for each position:
condition Base Qscore
Bases in both directions are softclipped. Aligner has given up. see un-stitchable strategy see un-stitchable strategy
Softclipped in one direction, and the softclipped base matches the inserted base from the other direction. “N”, i.e. no call 0
Softclipped in one direction, and the softclipped base disagrees with the base from the other direction Base from non-softclipped read. Qscore corresponding to read for which base was extracted

The consensus read shall have have a new XD tag:

SDS ID Specification
SDS-23 The rules for the XD tag are as follows:
Base category XD item
N consecutive forward bases NF
N consecutive reverse bases NR
N consecutive stitched NS

All bases relevant to the stitched cigar string must have a direction. This includes all bases in the stitched sequence, but also gaps (such as deletions or strings of N's that might result from reads that do not stitch easily). All these bases shall have a direction and shall be available to contribute to (directional) coverage counts made by Pisces. For example, the following pair of reads, would yield the following stitched cigar string and XD tag:

Original read pair:

Cigar Direction Coordinate -> 1 2 3 4 5 6 7 8 9
read1 2M2D4M Forward A A X X A A A A
read2 1M2D5M Reverse A X X A A A A A

Stitched read pair:

Cigar Direction Coordinate -> 1 2 3 4 5 6 7 8 9
read 2M2D5M 1F7S1R A A A X X A A A A

Incompatible Reads

For reads that have incompatible cigar strings, the stitcher shall

(A) if reads are not inconsistent (collapsable indels, equivalent softclips and inserted bases,) push it together and N-ify disagreements.

(B) if reads are inconsistent (can't even come up with an insertion length that makes sense) or have no way to anchor the overlap (overlapping softclips) If (FilterUnstichablePairs=TRUE) , these reads are omitted from the final bam. If (FilterUnstichablePairs=FALSE) , these reads are returned, unstitched, to the final bam.

SDS ID Specification
SDS-24 If (FilterUnstichablePairs=FALSE) The Stitcher shall treat un-stitchable reads as unpaired reads.
(if FilterUnstichablePairs=TRUE, see SDS-16)

Limitations

By design, the stitched read will contribute 1x coverage to down stream variant calling. Any overlapping read pair shall contribute 1x to coverage, as this is a more accurate reflection of what is happening on a molecular level. This might be counter intuitive to the user, who expects 2x coverage in the "overlap" region of the stitched read. Correspondingly, any unpaired reads included in the output bam will therefore count x2 to coverage, or be double-weighted relative to stitched reads.

It is possible that an MNV begins on a base with one read direction and ends on a base with another read direction. In this case, the MNVs direction should be considered stitched by the downstream variant caller (which exempts it from the strand bias filter).

The Stitcher does not do any form of indel realignment.

The Stitcher does not implement its own PCR duplicate detection algorithms, and as such is subject to errors made upstream in in the detection of PCR duplicates. (This is known to happen in the case of low-frequency variants and shot gun sequencing) Stitcher.exe optionally filters duplicates, when marked in the bam by an upstream duplicate-marking tool. The stitcher presumes if one read is a duplicate, so is its mate. (ie, if one read in a pair is a duplicate, both reads are considered duplicates.) This should always be true biologically, but is not a requirement of the mark duplicate tool.

The Stitcher is not multi-threaded, by default. Its expected that the calling pipeline will handle this, submitting 1 job per bam. By default, the Stitcher does not include indexing or sorting the output bam. This may be done with Samtools. In multi-threaded mode,the stitcher shall sort but not index the bams. The multi-threaded sort and samtools sort do not necessarily result in identical read orders when several reads have the same sort position. For unique output (ie, for test purposes) a unix sort may be performed.

The Stitcher does not include indexing of the output bam. This may be done with Samtools.

BAMs fed to Pisces and Stitcher.exe should always be sorted and then indexed.

General

5.2.10

5.2.9

5.2.7

5.2.5

5.2.0

5.1.6

5.1.3

Clone this wiki locally