Skip to content

Stitcher 5.1.6 Design Document

gwennberry edited this page Dec 2, 2016 · 4 revisions

Overview

The Stitcher takes paired reads from a bam, stitches them, and writes out a new bam.  Reads that cannot be stitched are left unchanged.  A stitched read is a single read that has been bioinformatically stitched from a pair of reads. The stitched read contains one consensus sequence of base calls, one consensus sequence of base call qualities, one consensus cigar string, and one direction tag, which gives the original directions of the contributing reads.  Optionally, The Stitcher filters out reads that will not be used downstream, such as PCR duplicates or poorly mapped reads, in order to save on future processing time.



Configuration

The Stitcher supports configuration of parameters so that its behavior can be fine tuned depending on the application context.

SDS ID Specification
SDS-1

The Stitcher shall accept command line arguments as a whitespace-separated list of name and value pairs.  For example:

Format: Stitcher.exe [-options]

Example: Stitcher.exe –Bam C:\test.bam –OutFolder C:\OutFolder

SDS-2 If an invalid command is given, the Stitcher shall exit with an error message describing the failed argument, the reason for failure, and the list of valid commands.
SDS-3 The Stitcher command line shall be capitalization invariant.

SDS-4

The Stitcher shall require the command line arguments listed below.

Argument Name Type Default value Description

-Bam

string

none

File path for input bam

-OutFolder string none Output destination for output bam

SDS-5

The Stitcher shall optionally support the command line arguments listed below.

Argument Name Type Default value Description
-MinBaseCallQuality integer 20 In case of a stitching conflict, bases with qscore less than this value will automatically be disregarded in favor of the mate's bases.
-FilterMinMapQuality integer 1 Reads with map quality less than this value shall be filtered
-FilterDuplicates bool true Whether reads marked as duplicate reads shall be filtered
-FilterForProperPairs bool false Whether reads marked as not proper pairs shall be filtered
-FilterUnstichablePairs bool false Whether reads pairs with incompatible cigar strings shall be filtered
-NifyUnstitchablePairs  bool true Whether read pairs with incompatible CIGAR strings shall be N-ified.
-NifyDisagreement bool false

Whether or not to turn high-quality disagreeing overlap bases to Ns.

-UseSoftClippedBases bool true Whether to allow bases soft clipped from the (non-probe) ends of reads to inform stitching.
-NifyDisagreement bool false Whether to turn high-quality disagreeing overlapping bases to Ns.
-Debug bool false Whether to run in debug (verbose) mode.
-LogFileName string StitcherLog.txt Name for stitcher log file.
-ThreadByChr bool false Whether to thread by chromosome (beta).
-DebugSummary bool false Whether we should output statistics at the end of the process summarizing overall read stitching outcomes.
-StitchProbeSoftclips bool false Whether to allow probe softclips taht overlap the mate to contribute to a stitched direction.
-MaxReadLength int 1024 The maximum expected length of individual reads, used to determine the maximum expected stitched read length (2*len - 1). For optimal performance, set as low as appropriate (i.e. the actual single-read length) for your data.
-v / -ver flag N/A Displays version information.

Input

The Stitcher requires as input one BAM file.  The BAM file is assumed to be sorted, i.e. alignment reads should be sorted by mapped reference position.  This assumption allows The Stitcher to process BAMs in one pass without a large memory footprint.  Most standard aligners produce sorted BAM files.  Positions in BAM files are expected to use 0-based coordinate system.

SDS ID Specification
SDS-6 The Stitcher shall require as input one bam.
SDS-7

The Stitcher shall require as input an accompanying BAI file with the same file name in the same directory.

Output

The Stitcher outputs one (stitched) BAM file. The stitched file name shall be .stitched.bam. 

SDS ID Specification
SDS-8 The Stitcher shall output one stitched bam.
SDS-9 The output bam name shall be .stitched.bam.
SDS-10 The output bam shall be sorted. NOT IMPLEMENTED IN THIS VERSION. (Workaround: samtools in pipeline).
SDS-11

The Stitcher shall output an accompanying BAI file with the corresponding file name (.stitched.bam.bai) in the same directory. NOT IMPLEMENTED IN THIS VERSION. (Workaround: samtools in pipeline).

SDS-24

The output bam shall contain a single, merged read for each read pair that was successfully stitched. The original read pair shall not be outputted. Reads that were not successfully stitched should be outputted as their original read pair unless filtering unstitchable pairs is enabled. Singletons shall be outputted in their original form.

SDS-25 Merged reads in the bam outputted by the Stitcher shall be flagged as:
  • Read 1
  • NOT properly paired (it is a single read)

Unmerged reads shall retain their original flags.

SDS-26

Merged reads shall inherit the following from the Read1 of the pair:

  • Name
  • RefID

They shall inherit the following from the component reads:

  • Read position - based on the minimum of the start positions of the component reads
  • Map quality - based on the maximum of the map qualities of the component reads
  • IsDuplex - set to true if either component read IsDuplex Note: duplex tag data is not preserved in merged reads in this version.

They shall be assigned the following as a result of stitching: Read sequence, Qualities, CIGAR, New tags - XD (Cigar Directions).

Unless otherwise explicitly mentioned above, all existing tags will be removed from the component reads in the merged read, as it would be unclear how to fairly and consistently reconcile tags from the two component reads.

Unmerged reads shall retain their existing tags.

Design

The Stitcher streams through the bam, holding on to each eligible read until its mate is found. Once the mate is found, the read and its mate shall be stitched.  

Alignments that are not proper pairs, are duplicates, or which fail mapping quality thresholds shall optionally be filtered. Unmapped or supplementary alignments will be automatically skipped.

Pairs of reads are processed once the pair is deemed complete. Complete pairs are those for which Read1, Read2, and any expected supplementary alignments have been collected. (Note: as a result, reads having supplementary alignments will never be processed and will either be outputted separately or not at all, depending on configuration settings. Supplementary reads are always skipped). Reads which do not overlap and reads which are improperly paired are optionally treated as incomplete. 

Stitching is then attempted on complete pairs. Paired reads that successfully stitch will be reconciled into one consensus read to be used for downstream processing.  The XD tag is used to give the directions of the stitched read. Read pairs which cannot be successfully stitched together will be outputted separately as their original alignments, or skipped if filtering unstitchable pairs.

Unpaired reads and incomplete pairs will be flushed at the end of each chromosome. Read pairs that represent fusions (i.e. reads are on different chromosomes) will not be stitched.

Process Flow

ac:link<ri:attachment ri:filename="Stitcher AlignmentPairFilter Implementation.pptx"></ri:attachment></ac:link>

The bulk of the processing behavior in the Stitcher is defined by the BamRewriter class, which itself calls an AlignmentPairFilter (in our case, the StitcherPairFilter) to collect alignments into pairs and determine which ones to keep, as well as a PairHandler to evaluate and manipulate a successfully matched pair.

BamRewriter

<ac:image ac:height="400"><ri:attachment ri:filename="image2016-11-28 13:25:10.png"></ri:attachment></ac:image>

StitcherPairFilter

Schematic of AlignmentPairFilter.TryPair functionality, with notes indicating the specific StitcherPairFilter behaviors in green.

<ac:image ac:height="400"><ri:attachment ri:filename="image2016-11-28 13:25:36.png"></ri:attachment></ac:image>

Detailed Design

Filtering

SDS ID Specification
SDS-13  The Stitcher shall filter reads with map quality less than FilterMinMapQuality from the final bam.
SDS-14 The Stitcher shall optionally filter out duplicate reads.
SDS-15 The Stitcher shall optionally filter out any reads not marked as proper pairs.
SDS-16 The Stitcher shall optionally filter out any reads that have incompatible cigar strings with its mate (i.e. "unstitchable" pairs).

Processing Proper Pairs

A) If the pairs simply do not overlap, they are outputted separately.

B) If the pairs do overlap, they are tested for compatibility, where compatible means the cigar strings match such that indels on one read do not overlap with match/mismatch segments on the other, or with disagreeing indels.

  • Compatibility is evaluated for each overlapping base.
  • An overlapping base is deemed incompatible if: it is an indel on one read and is either a match/mismatch on the other, or is the opposite type of indel (i.e. Read1 shows an insertion and Read2 shows a deletion at this position). 
  • Softclips are not considered disagreements.

Getting StitchedPositions Representations

The two reads are abstracted into an array of StitchedPositions, each of which can have cigar sub-strings for mapped (i.e. M or D) and unmapped (i.e. I or S) bases for each mate.

Redistributing Softclips

"Bookending" softclips (at either end of the component reads) are then redistributed so that they match up with the further-extending positions on the opposite read. In this way, the softclipped bases can be used to support match/mismatch and inserted bases on the opposite read. Optionally, softclipped bases can also support deletions – if allowing terminal softclips to support overlapping deletions, and if in the process of redistributing softclips we get to a deletion in the opposite read has mapped bases after it, we can "kick over" the softclip so that it supports the mapped bases on the other side, thereby also supporting the deletion. 

Reconciling StitchedPositions Between Reads

For each StitchedPosition in the combined representation of the two reads, the cigar operations for both reads are reconciled. Cigar operations are deemed to be compatible if they are the same for both read, empty for one of the reads, or softclips for one of the reads. If either read has an insertion while the opposite has no unmapped bases (insertion or softclipped), stitching has failed. Successfully reconciled softclips count toward stitched support only if configured to use softclipped bases, and only if it is not on the probe-end of the read or we are not ignoring probe softclips. The result of the reconciling step is a stitched CIGAR and directions array. Finally, if the stitching has created an internal softclip, we assert that stitching has failed (overlapping internal softclips are too nebulous for us to confidently call even the rest of the stitched read).

If the reads are compatible, they are reconciled into one consensus read as described in the section below.

Generating Consensus Reads

If the paired reads are compatible and successfully stitch into a merged read, the Stitcher will assemble a consensus sequence and qualities array based on the component reads. 

Worked examples with results directly from this release are given hereac:link<ri:page ri:content-title="Read Stitcher 5.1.6 Stitching Scenarios">ac:plain-text-link-body</ac:plain-text-link-body></ri:page></ac:link>

SDS ID Specification

Region

Rules

SDS-17 Positions preceding overlap region Use upstream read sequence, qscores, and direction.
SDS-18

Positions within overlap region

For each position:

Condition

Base

Qscore

Bases agree

Read1

Sum(read1, read2)

Bases disagree and configured to "Nify" disagreements “N”, i.e. no call 0

Bases disagree, not Nifying disagreements, and both qscores ≥ minimum basecall quality configured

Read1 if read1 qscore ≥ read2basecall quality, otherwise read2

0

<ac:inline-comment-marker ac:ref="30bef5e8-aac0-4b51-9217-20863380dfb6">Bases disagree, not Nifying disagreements, and both qscores < minimum basecall quality</ac:inline-comment-marker>

Read1 if read1 qscore ≥ read2basecall quality, otherwise read2

Qscore corresponding to read for which base was extracted

Bases disagree, not Nifying disagreements, and either qscores < minimum basecall quality configured

Read1 if read1 qscore ≥ read2 basecall quality, otherwise read2

Qscore corresponding to read for which base was extracted

Direction is "stitched".

SDS-22

Positions trailing overlap region

Use downstream sequence, qscores, and direction.

The consensus read shall have have a new XD tag. The rules for the XD tag are as follows:

SDS ID specification
SDS-23 **base category** **XD item**
N consecutive forward bases NF
N consecutive reverse bases NR
N consecutive stitched NS

All bases relevant to the stitched cigar string must have a direction.  This includes all bases in the stitched sequence, but also gaps (such as deletions or strings of N's that might result from reads that do not stitch easily). All these bases shall have a direction and shall be available to contribute to (directional) coverage counts made by Pisces.

For example, the following pair of reads, would yield the following stitched cigar string and XD tag:

Read pair:

Cigar Direction Coordinate -> 1 2 3 4 5 6 7 8 9
read1 2M2D4M Forward x x
read2 1M2D5M Reverse x x

Stitched read:

Cigar XD tag Coordinate -> 1 2 3 4 5 6 7 8 9
stitched read 2M2D5M 1F7S1R x x

Incompatible Reads

For reads that are completely un-stitchable (incompatible cigar strings), The Stitcher

  • If (FilterUnstichablePairs=TRUE), the Stitcher shall not try to stitch these reads. These reads are omitted from the final bam.

  • Otherwise:

    • Easy Case: dont stitch these reads. Treat them as unpaired reads.

    • Medium Case:

      • if reads are not inconsistent (collapseble indels, softclips,)

        • push it together and N-ify disagreements.
      • if reads are inconsistent (can't even come up with an insertion length that makes sense)

        • If one read is bad throughout the overlap (by mapping or by basecall q score), stitch, but take the other read through the overlap. Qscore and direction corresponding to the read that was taken.
        • If both reads are bad, or have some mix of acceptable data in the overlap (by mapping and basecall q score), stitch, but make the overlap entirely N. Score for each base is 0 and direction is "stitched".
    • Hard Case:
      • TBD, probably involves some local realignment.
SDS ID Specification
SDS-24

If (FilterUnstichablePairs=FALSE) The Stitcher shall treat un-stitchable reads as unpaired reads.

(if FilterUnstichablePairs=TRUE, see SDS-16)

Limitations

By design, the stitched read will contribute 1x coverage to down stream variant calling.  Any overlapping read pair shall contribute 1x to coverage, as this is a more accurate reflection of what is happening on a molecular level. This might be counter-intuitive to the user, who expects 2x coverage in the "overlap" region of the stitched read. Correspondingly, any unpaired reads included in the output bam will therefore count x2 to coverage, or be double-weighted relative to stitched reads.

It is possible that an MNV begins on a base with one read direction and ends on a base with another read direction. In this case, the MNVs direction should be considered stitched by the downstream variant caller (which exempts it from the strand bias filter).

The Stitcher does not do any form of indel realignment.

The Stitcher does not implement its own PCR duplicate detection algorithms, and as such is subject to errors made upstream in in the detection of PCR duplicates. (This is known to happen in the case of low-frequency variants and shot gun sequencing). The Stitcher optionally filters duplicates, when marked in the bam by an upstream duplicate-marking tool. The stitcher presumes if one read is a duplicate, so is its mate. (ie, if one read in a pair is a duplicate, both reads are considered duplicates.) This should always be true biologically, but is not a requirement of the mark duplicate tool.

The Stitcher is not multi-threaded by sample. Its expected that the calling pipeline will handle this, submitting 1 job per bam.

The Stitcher does not include indexing or sorting the output bam. This may be done with Samtools.

The Stitcher expects that input bams are sorted and indexed.

The Stitcher is not able to correctly process BAMs which have basecall quality scores of >46 (this is highly unlikely to occur, but may happen in the case of some kind of basecall quality post-processing). This limitation is due to the fact that Stitcher sums basecall qualities for agreeing overlapping bases between R1 and R2. The Phred formatting used in BAM quality score encoding does not support values over 93, so having a quality score of >46 in both agreeing bases would lead to invalid BAM quality encoding.

General

5.2.10

5.2.9

5.2.7

5.2.5

5.2.0

5.1.6

5.1.3

Clone this wiki locally