The Hierarchical Genome Assembly Process (HGAP) for long single-pass reads generated by the PacBio® Single Molecule Real Time (SMRT) sequencer was developed to allow the complete and accurate shotgun assembly of bacterial-sized genomes. The process relies on a succession of steps to generate de novo assemblies of a genome. Several implementations exist.
This document contains:
- A conceptual overview of the different steps involved in HGAP;
- Descriptions of the various implementations; and
- Sample results.
HGAP has three main steps: preassembly, assembly, and consensus polishing. Each is detailed below.
1. Preassembly
The goal of preassembly is to generate long and highly accurate sequences. This is accomplished by mapping single-pass reads to seed reads, which represent the longest portion of the read length distribution. A consensus sequence of the mapped reads is then generated, resulting in long and highly accurate fragments of the target genome. Quality trimming of the preassembled consensus sequences is crucial to the success of HGAP, because it improves the ability to detect sequence overlaps and therefore the accuracy of the subsequent assembly step.
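The idea of taking seed reads from the longest portion of the read length distribution can be illustrated with a minimal sketch. It assumes a hypothetical rule: pick the smallest length cutoff at which the reads at or above it still supply a chosen fold-coverage of the genome. The function names, the `target_seed_coverage` parameter, and its default of 30 are illustrative assumptions, not the behavior of any particular HGAP implementation.

```python
def seed_length_cutoff(read_lengths, genome_size, target_seed_coverage=30):
    """Return the smallest length cutoff such that reads at or above it
    still add up to target_seed_coverage times the genome size.

    Hypothetical selection rule, used here for illustration only.
    """
    needed_bases = genome_size * target_seed_coverage
    total = 0
    # Walk from the longest read downwards, accumulating bases.
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= needed_bases:
            return length
    # Not enough data to reach the target: every read becomes a seed.
    return min(read_lengths) if read_lengths else 0


def split_reads(read_lengths, cutoff):
    """Partition reads into seeds (>= cutoff) and the remaining shorter reads."""
    seeds = [n for n in read_lengths if n >= cutoff]
    others = [n for n in read_lengths if n < cutoff]
    return seeds, others
```

In this sketch the reads at or above the cutoff act as seeds, and the full set of single-pass reads would then be mapped to them to build the consensus. The mapping and consensus steps themselves are not shown.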
2. Assembly
The choice of assembly algorithm is at the user's discretion, but Overlap-Layout-Consensus (OLC) assemblers are typically better suited for the de novo assembly of multi-kb reads. The success of the assembly process depends on the total coverage and the length distribution of the trimmed preassembled reads in relation to the repeat content of the genome. Full genome closure typically depends on generating sufficient coverage of ultra-long trimmed preassembled reads that uniquely anchor the longest repeat regions in the genome assembly.
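The quantities mentioned above (total coverage, length distribution, and reads long enough to anchor a repeat) can be checked with a few lines of code before assembly. The following is a rough sketch: the function names and the `anchor` parameter (unique flanking bases required on each side of a repeat, default 500) are hypothetical. Being long enough is a necessary condition for a read to span a repeat, not a guarantee that it does.

```python
def coverage(read_lengths, genome_size):
    """Fold-coverage: total bases in the reads divided by the genome size."""
    return sum(read_lengths) / genome_size


def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def reads_long_enough_for_repeat(read_lengths, repeat_length, anchor=500):
    """Count reads that could span a repeat with `anchor` unique bases per side.

    `anchor` is an illustrative assumption, not a value taken from HGAP.
    """
    minimum = repeat_length + 2 * anchor
    return sum(1 for n in read_lengths if n >= minimum)
```

For example, with a longest repeat of 5 kb and the default anchor, only trimmed preassembled reads of 6 kb or more are counted.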
3. Consensus Polishing
To significantly reduce the remaining indel and base substitution errors in the draft assembly, a quality-aware consensus algorithm that uses the rich quality scores embedded in Pacific Biosciences' bas.h5 files is typically the best option. Four different per-base Quality Values (QV scores) represent the intrinsically calculated error probabilities for inserted, deleted, substituted and merged base calls in single-pass reads. These four values allow the Quiver algorithm to derive a highly accurate consensus for the final assembly, which frequently exceeds QV50 (99.999% accuracy).
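The relationship between a QV and an accuracy figure such as "QV50 (99.999% accuracy)" follows the standard Phred scale, where QV = -10 * log10(error probability). A small sketch of the conversion follows. The function names are illustrative, and this shows only the scale itself, not how Quiver computes its consensus.

```python
import math


def qv_to_error_probability(qv):
    """Phred scale: error probability = 10^(-QV / 10)."""
    return 10 ** (-qv / 10)


def qv_to_accuracy_percent(qv):
    """Accuracy in percent implied by a Phred-scaled quality value."""
    return 100 * (1 - qv_to_error_probability(qv))


def errors_to_qv(errors, bases):
    """Phred-scaled QV for a given number of errors over a number of bases."""
    if errors == 0:
        return float("inf")  # no observed errors: QV is unbounded
    return -10 * math.log10(errors / bases)
```

On this scale, QV50 corresponds to one expected error per 100,000 bases, so a 5 Mb bacterial assembly at QV50 would be expected to contain roughly 50 errors.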
|HGAP||A workflow to first preassemble reads, assemble the preassembled reads using Celera® Assembler, then polish using Quiver.|
|Falcon||An experimental diploid assembler, tested on multi-Gb genomes. See the 2014 AGBT presentation by Jason Chin.|
|Canu||A fork of the Celera Assembler designed for high-noise single-molecule sequencing.|