rhallPB edited this page Sep 1, 2016 · 46 revisions

The Hierarchical Genome Assembly Process (HGAP) was developed to allow the complete and accurate shotgun assembly of bacterial-sized genomes from the long single-pass reads generated by the PacBio® Single Molecule Real Time (SMRT) sequencer. The process relies on a succession of steps to generate de novo assemblies of a genome. Several implementations exist.

HGAP was recently published in Nature Methods. A video tutorial walkthrough (handheld version) in SMRT Portal is available.


HGAP Workflow

The three main steps involved in HGAP are preassembly, assembly, and consensus polishing, and are detailed below.

1. Preassembly
The goal of preassembly is to generate long and highly accurate sequences. This is accomplished by mapping single-pass reads to seed reads, which are the reads drawn from the longest portion of the read length distribution. A consensus sequence is then generated from the reads mapped to each seed, resulting in long and highly accurate fragments of the target genome. Quality trimming of the preassembled consensus sequences is crucial to the success of HGAP: it improves the ability to detect sequence overlaps and therefore the accuracy of the subsequent assembly step.
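As a rough illustration of how seed reads might be chosen from the long end of the length distribution, the sketch below picks a length cutoff so that the reads at or above it add up to a target fold-coverage of the genome. The function names, the 30-fold default, and the example numbers are assumptions made for illustration only; they are not taken from any HGAP implementation.

```python
def seed_length_cutoff(read_lengths, genome_size, target_seed_coverage=30):
    """Return a length cutoff such that reads at least this long together
    provide roughly `target_seed_coverage`-fold coverage of the genome.

    The 30-fold default is an illustrative assumption, not a documented value.
    """
    needed_bases = genome_size * target_seed_coverage
    total = 0
    # Walk from the longest read downwards, accumulating bases.
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= needed_bases:
            return length
    # Not enough data to reach the target: every read becomes a seed.
    return min(read_lengths) if read_lengths else 0


def split_seeds(read_lengths, cutoff):
    """Partition reads into seeds (>= cutoff) and the remaining shorter reads."""
    seeds = [n for n in read_lengths if n >= cutoff]
    others = [n for n in read_lengths if n < cutoff]
    return seeds, others


if __name__ == "__main__":
    lengths = [10000, 8000, 6000, 4000, 2000, 1000]  # toy data
    cutoff = seed_length_cutoff(lengths, genome_size=1000, target_seed_coverage=20)
    seeds, others = split_seeds(lengths, cutoff)
    print(cutoff, seeds, others)
```

In this sketch the shorter reads are the ones that would be mapped onto the seeds to build each consensus; the mapping and consensus steps themselves are not shown.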

2. Assembly
The choice of assembly algorithm is at the user's discretion, but Overlap-Layout-Consensus (OLC) assemblers are typically better suited to the de novo assembly of multi-kb reads. The success of the assembly depends on the total coverage and the length distribution of the trimmed preassembled reads in relation to the repeat content of the genome. Full genome closure typically requires sufficient coverage of ultra-long trimmed preassembled reads that uniquely anchor the longest repeat regions in the genome assembly.
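One way to check these quantities before running an assembler is to summarize the trimmed preassembled read lengths against the genome size and the longest expected repeat. The sketch below is a minimal example under stated assumptions: the function name, the dictionary keys, and the idea of counting reads longer than the longest repeat as a proxy for reads that could anchor it are all illustrative, and a read longer than a repeat does not necessarily span it.

```python
def assembly_input_summary(trimmed_lengths, genome_size, longest_repeat):
    """Summarize trimmed preassembled reads relative to genome and repeat size.

    `longest_repeat` is the length (in bases) of the longest repeat the user
    expects in the genome; it is an input assumption, not something computed here.
    """
    total_bases = sum(trimmed_lengths)
    # Reads longer than the longest repeat are the only ones that could span it.
    long_reads = [n for n in trimmed_lengths if n > longest_repeat]
    return {
        "total_coverage": total_bases / genome_size,
        "reads_longer_than_repeat": len(long_reads),
        "coverage_from_long_reads": sum(long_reads) / genome_size,
    }


if __name__ == "__main__":
    summary = assembly_input_summary(
        trimmed_lengths=[12000, 9000, 7000, 3000],  # toy data
        genome_size=10000,
        longest_repeat=8000,
    )
    for key, value in summary.items():
        print(key, value)
```

A low value for the long-read coverage suggests that the longest repeats may remain unresolved, whatever the total coverage.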

3. Consensus Polishing
To significantly reduce the remaining indel and base substitution errors in the draft assembly, a quality-aware consensus algorithm that uses the rich quality scores embedded in Pacific Biosciences' bas.h5 files is typically the best option. Four different per-base Quality Values (QV scores) represent the intrinsically calculated error probabilities for inserted, deleted, substituted and merged base calls in single-pass reads. These four values allow the Quiver algorithm to derive a highly accurate consensus for the final assembly, which frequently exceeds QV50 (99.999% accuracy).
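The relationship between a QV and accuracy follows the standard Phred scale, QV = -10 · log10(P_error), so QV50 corresponds to an error probability of 10^-5. The sketch below shows the conversion; the function names and the 5 Mb example genome are illustrative assumptions and are not part of Quiver.

```python
import math


def qv_to_error_probability(qv):
    """Phred scale: probability that a base call is wrong."""
    return 10 ** (-qv / 10)


def qv_to_accuracy_percent(qv):
    """Per-base accuracy in percent for a given QV."""
    return 100 * (1 - qv_to_error_probability(qv))


def error_probability_to_qv(p_error):
    """Inverse conversion: QV for a given per-base error probability."""
    return -10 * math.log10(p_error)


def expected_errors(qv, sequence_length):
    """Expected number of erroneous bases in a sequence of the given length."""
    return sequence_length * qv_to_error_probability(qv)


if __name__ == "__main__":
    print(qv_to_accuracy_percent(50))      # accuracy at QV50, in percent
    print(expected_errors(50, 5_000_000))  # hypothetical 5 Mb genome
```

At QV50, a hypothetical 5 Mb consensus would be expected to contain on the order of 50 erroneous bases.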

Implementations

Name      Description
HGAP      A workflow that first preassembles reads, then assembles the preassembled reads using Celera® Assembler, and finally polishes the assembly using Quiver.
          • Supports genomes of up to 100 Mb from SMRT Portal, which is part of SMRT Analysis.
          • Larger genomes are possible from the command line using either smrtpipe.py or the Makefile-based smrtmake.
Falcon    An experimental diploid assembler, tested on multi-Gb genomes. 2014 AGBT presentation by Jason Chin.
Canu      A fork of the Celera Assembler designed for high-noise single-molecule sequencing.