Large Genome Assembly with PacBio Long Reads

rhallPB edited this page Apr 20, 2016 · 59 revisions

PacBio long reads can be used in a number of ways to generate and improve de novo assemblies for large genomes. You can take several different approaches:

  1. PacBio-only de novo assembly. Using just PacBio reads from a long insert library, the reads are often preprocessed before being assembled using an Overlap-Layout-Consensus algorithm. The best known implementation of this is HGAP.

  2. Hybrid de novo assembly. Using a combination of PacBio and short read data, the reads are used together during assembly to generate a hybrid assembly.

  3. Gap filling. Starting with an existing mate-pair based assembly, the internal gaps (consisting of Ns) inside the scaffolds are filled using PacBio sequences.

  4. Scaffolding. Using an existing assembly (such as an assembly based on short read data), PacBio reads are used to join contigs.

    Figure 1. Illustration of PacBio assembly approaches

Below we discuss what software is available, choosing software, and additional considerations.

Software Options

Name Description
PacBio-only
  HGAP A workflow to first preassemble reads, assemble the preassembled reads using Celera® Assembler, then polish using Quiver.
  • Supports up to 100 Mb from SMRT Portal, which is part of SMRT Analysis.
  • Larger genomes are possible from the command line using either smrtpipe.py or the Makefile-based smrtmake.
  Falcon An experimental diploid assembler, tested on multi-Gb genomes. See the 2014 AGBT presentation by Jason Chin.
  Canu A fork of the Celera Assembler designed for high-noise single-molecule sequencing.
  Celera® Assembler Celera® Assembler 8.1 now offers a way to directly assemble subreads.
  Sprai A preassembly-based assembler that aims to generate longer contigs.
Hybrid
  pacBioToCA An error correction module in Celera® Assembler originally designed to align short reads to PacBio reads and generate consensus sequences. These error corrected reads can then be assembled by Celera® Assembler.
  ECTools A set of tools that uses contigs instead of short reads for correction.
  SPAdes A short read assembler that added PacBio hybrid assembly support as of version 3.0.
  Cerulean Cerulean starts with an assembly graph from ABySS and extends contigs by resolving bubbles in the graph using PacBio long reads. It has been run successfully on genomes <100 Mb.
  dbg2olc dbg2olc uses Illumina contigs as anchors to build an overlap graph with PacBio reads, allowing very fast performance.
Gap Filling
  PBJelly 2 PBJelly upgrades genomes by using PacBio reads to fill in gaps in scaffolds. It has been shown to work with genomes >1 Gb. It is part of the PBSuite of applications, which also includes PB Honey. See also PAG 2014: Kim Worley, "Improving Genomes using Long Reads and PB Jelly 2".

Considerations

Coverage and Choosing Software

The choice of algorithm depends on how much PacBio sequencing can be obtained and what types of short read data are available. We recommend PacBio-only de novo assembly when it is possible to get at least 50X PacBio coverage. HGAP performs best with at least the minimum recommended coverage; higher coverage makes more of the longest reads available for assembly. For larger genomes, PBcR in Celera Assembler 8.2 beta uses MHAP, which offers faster assembly times.

For a hybrid assembly involving both PacBio and short read sequencing, PBcR and ECTools can work well with around 20X PacBio coverage. If a high quality set of scaffolds exists, then PBJelly 2 can be used. We recommend at least 5X PacBio coverage to fill gaps. Higher coverage gives a better consensus in gap-filled regions and increases the number of addressable gaps, because random sampling at lower coverage can leave some regions without any reads.
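The effect of random sampling can be estimated with the idealized Lander-Waterman (Poisson) model, under which the expected fraction of bases covered by no read at mean coverage c is e^(-c). The following is a minimal sketch under that assumption; the function names and the 1 Gb genome size are illustrative, and real libraries deviate from uniform sampling.

```python
import math


def fraction_uncovered(coverage):
    """Expected fraction of bases with zero reads under a Poisson model."""
    return math.exp(-coverage)


def expected_uncovered_bases(genome_size, coverage):
    """Expected number of bases that no read covers."""
    return genome_size * fraction_uncovered(coverage)


if __name__ == "__main__":
    genome_size = 1_000_000_000  # hypothetical 1 Gb genome
    for cov in (1, 5, 20, 50):
        uncovered = expected_uncovered_bases(genome_size, cov)
        print(f"{cov:>3}X: ~{uncovered:,.0f} bases expected uncovered")
```

This estimate is optimistic for gap filling, since a gap is only addressable when reads reach across it into the flanking sequence, not merely when a base is touched by one read.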

Figure 2. PacBio algorithm suggestions from a PAG 2014 presentation by Mike Schatz
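The coverage thresholds given in this section (50X, around 20X, and 5X) can be summarized as a small helper. This is only a sketch of the guidance in the text, not a tool from SMRT Analysis: the function names are hypothetical, and treating "around 20X" as a hard cutoff is a simplification.

```python
def pacbio_coverage(total_pacbio_bases, genome_size):
    """Mean coverage = total sequenced bases / genome size."""
    return total_pacbio_bases / genome_size


def suggest_approach(coverage, have_short_reads=False, have_scaffolds=False):
    """Map PacBio coverage and available data to an approach from this page."""
    if coverage >= 50:
        return "PacBio-only de novo assembly"
    if coverage >= 20 and have_short_reads:
        return "hybrid de novo assembly"
    if coverage >= 5 and have_scaffolds:
        return "gap filling"
    return "more PacBio coverage or additional data needed"


if __name__ == "__main__":
    # Hypothetical project: 25 Gb of PacBio data for a 1 Gb genome.
    cov = pacbio_coverage(25_000_000_000, 1_000_000_000)
    print(cov, suggest_approach(cov, have_short_reads=True))
```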

Repetitive Content

One of the biggest challenges with de novo assembly is repeat content. In general, the solution is to work with insert sizes that can span repeats and identify unique anchoring sequence on each side. PacBio long reads are uniquely useful in sequencing long inserts, given that they can read from one end of the insert to the other.
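The idea of spanning a repeat with unique anchoring sequence on each side can be written as a simple coordinate check. This is an illustrative sketch only: the coordinates are made up, and the 500-base anchor length is an assumed value, not a recommendation from this page.

```python
def spans_repeat(read_start, read_end, repeat_start, repeat_end, min_anchor=500):
    """Return True if a read covers the whole repeat plus at least
    min_anchor unique bases on each side.

    Coordinates are 0-based and half-open; min_anchor is an assumed value.
    """
    return (read_start <= repeat_start - min_anchor
            and read_end >= repeat_end + min_anchor)


# Hypothetical read alignments and a hypothetical 6 kb repeat.
reads = [(0, 12_000), (4_000, 9_000), (6_500, 20_000)]
repeat = (5_000, 11_000)

spanning = [r for r in reads if spans_repeat(*r, *repeat)]
print(spanning)
```

Only the first read qualifies here: the second ends inside the repeat and the third starts inside it, so neither can anchor the repeat on both sides.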

Ploidy

Most existing assemblers were designed for haploid genomes. When a diploid genome has little structural variation between the chromosome copies, a haploid approach can work well, with the occasional structural heterozygosity appearing as separate contigs. In diploid genomes with larger structural variation, or in polyploid genomes, assemblies produced by haploid assemblers become increasingly fragmented. For these genomes, consider Falcon, though it is considered experimental code. Note also that Celera® Assembler can be configured to favor merging haplotypes.

If possible, select strains that minimize heterozygosity, which facilitates assembly. Options include inbred lines, doubled haploid strains, and other effectively haploid genomes. For example, the human hydatidiform mole that was sequenced is a doubled haploid genome.

Coverage Bias with Short-Read Data

Short read data has coverage bias in regions with extreme GC composition because short read technologies require amplification. Even if PCR-free sample preparation methods are used, ultimately there is bridge amplification during sequencing.

In addition, with error correction approaches such as PBcR, short reads made of simple repeats are difficult to use, because the k-mers used to seed overlaps occur at high frequency and are thus often filtered out (see PAG 2014, Mike Schatz slide 12).
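The filtering effect can be illustrated with a toy k-mer counter. This is a simplified sketch, not PBcR's actual implementation: the value of k, the frequency cutoff, and the sequences are assumptions chosen to keep the example small, and real tools use much larger k and data-derived thresholds.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def usable_seed_fraction(read, kmer_counts, k, max_count):
    """Fraction of a read's k-mers that survive a high-frequency filter."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0
    kept = [km for km in read_kmers if kmer_counts[km] <= max_count]
    return len(kept) / len(read_kmers)


k = 4          # toy value
max_count = 3  # toy cutoff: k-mers seen more often are discarded as seeds
unique_read = "ACGTTGCAAGCTTAGG"
repeat_read = "ATATATATATATATAT"  # simple dinucleotide repeat

counts = Counter(kmers(unique_read, k) + kmers(repeat_read, k))
print(usable_seed_fraction(unique_read, counts, k, max_count))
print(usable_seed_fraction(repeat_read, counts, k, max_count))
```

In this toy data every k-mer of the simple-repeat read exceeds the cutoff, so the read is left with no seeds at all, while the unique read keeps all of its k-mers.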

Computational Requirements

De novo assembly algorithms using PacBio reads generally use an overlap-layout-consensus algorithm to arrange long reads (such as Celera® Assembler, which HGAP and pacBioToCA both use). Because the overlap phase requires an all-by-all alignment, computation time at a given coverage scales quadratically with genome size. Genomes approaching one gigabase or larger therefore require significant computational resources. For example, the initial overlap step in preassembly for the 54X human assembly required 405,000 CPU hours. Compute times are also described in the pacBioToCA-based Drosophila assembly. There are efforts to reduce the computational burden, such as Dazzler (blog) and MHAP (blog post, webinar).
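The quadratic scaling follows from counting read pairs: at fixed coverage and read length the number of reads grows linearly with genome size, and a naive all-by-all comparison considers every pair. A back-of-the-envelope sketch, assuming a hypothetical 10 kb mean read length and 50X coverage:

```python
def estimated_read_count(genome_size, coverage, mean_read_length):
    """Approximate number of reads needed for the given coverage."""
    return genome_size * coverage // mean_read_length


def all_by_all_pairs(n_reads):
    """Number of unordered read pairs in a naive all-by-all comparison."""
    return n_reads * (n_reads - 1) // 2


mean_read_length = 10_000  # hypothetical mean read length in bases
coverage = 50

for genome_size in (100_000_000, 1_000_000_000):
    n = estimated_read_count(genome_size, coverage, mean_read_length)
    print(f"{genome_size:>13,} bp: {n:>9,} reads, {all_by_all_pairs(n):,} pairs")
```

A tenfold larger genome gives roughly a hundredfold more pairs. This is a worst-case count; practical overlappers avoid comparing most pairs through seeding and filtering, which is the kind of saving the efforts mentioned above aim for.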

Hybrid assembly using PBcR also adds a layer of computational complexity, since aligning 100X of short reads to PacBio reads is a computationally intensive task. One way to reduce computation time is to align short read contigs, rather than individual short reads, to the PacBio reads, as ECTools does; this effectively compresses the short read data. This type of approach also increases the mappability of the short read data, since assembled contigs are longer than the individual reads.

Draft Genome Quality

Gap filling of mate pair-based scaffolded assemblies is particularly sensitive to the quality of the starting assembly. When PacBio reads are aligned across gaps in the scaffolds, misassemblies in the scaffolds can result in improper alignments and in incorrectly filled or unfillable gaps.
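Before gap filling, it can help to take inventory of the gaps in the starting scaffolds. Since gaps appear as runs of N, they can be located with a regular expression. This is a minimal sketch for illustration, not how PBJelly itself works; the function name and the example scaffold are hypothetical.

```python
import re


def find_gaps(scaffold, min_len=1):
    """Return (start, end) coordinates of runs of N in a scaffold sequence.

    Coordinates are 0-based and half-open; runs shorter than min_len are skipped.
    """
    return [(m.start(), m.end())
            for m in re.finditer(r"[Nn]+", scaffold)
            if m.end() - m.start() >= min_len]


scaffold = "ACGTNNNNNGGCTANNACGT"  # toy scaffold with two gaps
gaps = find_gaps(scaffold)
print(gaps)
print(sum(end - start for start, end in gaps), "gap bases in total")
```

Comparing the number and total length of gaps before and after gap filling gives a quick check on how much of the scaffold was actually closed.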

Large Insert Libraries

Even though this is a discussion of assembly algorithms, key to a successful assembly is the longest reads possible through careful sample preparation. We recommend the largest insert libraries possible (e.g. 20 kb) using BluePippin™ size selection (see 20 kb Template Preparation Using BluePippin Size-Selection) and sequencing with the P6-C4 chemistry.

Datasets and Example Projects

Additional Links