HGAP in SMRT Analysis

lhon edited this page Jun 9, 2014 · 2 revisions

This page contains information about the current release of HGAP (SMRT Analysis 2.2).

There have been multiple iterations of the HGAP implementation in SMRT Analysis, with performance improvements added to each iteration. In SMRT Analysis 2.2, HGAP.3 was introduced, significantly speeding up HGAP execution. In most cases the speedup gained by HGAP.3 makes it the preferred protocol. In production environments, HGAP.2 might be preferred. We recommend using the latest version of SMRT Analysis to ensure you are getting the best performance with HGAP.

SMRT Analysis v2.1 has a new implementation of HGAP that speeds up preassembly by >10X. This is found in the RS_HGAP_Assembly.2 protocol. RS_HGAP_Assembly.1 is not available in SMRT Analysis v2.2.0 and later versions.

SMRT Analysis v2.2 contains a further improvement to HGAP that speeds up the overlap assembly stage. This new protocol is named RS_HGAP_Assembly.3. The preassembly stage of versions 2 and 3 is largely the same.

The table below summarizes the differences between the HGAP versions:

| | HGAP.3 | HGAP.2 | HGAP.1 |
|---|---|---|---|
| SMRT Portal protocol name | RS_HGAP_Assembly.3 | RS_HGAP_Assembly.2 | RS_HGAP_Assembly.1 |
| Status | New in SMRT Analysis 2.2 | | Deprecated in SMRT Analysis 2.2 |
| Description | Performance improvements by replacing the consensus step with pbutgcns | Performance improvements by replacing the correction step with pbdagcon | Initial HGAP production implementation |
| HGAP workflow components | | | |
| Alignment | BLASR | BLASR | BLASR |
| Correction | PB/dagcon | PB/dagcon (new) | AMOS/make-consensus |
| Overlap | CA/overlap | CA/overlap | CA/overlap |
| Layout | CA/unitigger | CA/unitigger | CA/unitigger |
| Consensus | PB/utgcns (new) | CA/utgcns | CA/utgcns |
| Polish | Quiver | Quiver | Quiver |

Important parameters

1. Genome Size
Set the target genome size as accurately as possible. It is used to determine both the Minimum Seed Read Length and the coverage of trimmed preassembled reads going into the assembly step.

2. Automatic Minimum Seed Read Length calculation
By default, the Minimum Seed Read Length is calculated automatically as the length cutoff at which the longest subreads provide at least 30X coverage of the target genome. To use a user-selected Minimum Seed Read Length instead, deselect the default option. If less than 30X coverage is available for the HGAP process, the algorithm falls back to the user-selected Minimum Seed Read Length (6 kb by default), so this setting must be lowered to 500 bp to allow all-vs-all PreAssembly at lower than 30X coverage.
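
The automatic calculation can be pictured with a short sketch. This is an illustration of the idea described above, not the SMRT Analysis implementation: the function name, its arguments, and the exact tie-breaking are assumptions. It walks the subreads from longest to shortest and returns the length at which the accumulated bases first reach the target coverage, falling back to the user-selected value when the data cannot reach that coverage.

```python
def min_seed_read_length(subread_lengths, genome_size,
                         target_coverage=30, fallback=6000):
    """Illustrative seed-length cutoff (hypothetical helper, not a SMRT Analysis API).

    Returns the largest cutoff such that subreads at least that long
    total >= target_coverage * genome_size bases. If all subreads together
    fall short, returns `fallback` (the user-selected value, 6 kb by default).
    """
    bases_needed = target_coverage * genome_size
    accumulated = 0
    # Longest subreads are consumed first, as in the description above.
    for length in sorted(subread_lengths, reverse=True):
        accumulated += length
        if accumulated >= bases_needed:
            return length
    return fallback


if __name__ == "__main__":
    lengths = [10000, 8000, 6000, 4000, 2000]  # toy data, not real subreads
    print(min_seed_read_length(lengths, genome_size=500))
```

Because the result depends directly on `genome_size`, an inaccurate genome size estimate shifts the cutoff, which is why the Genome Size parameter matters.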

Genome Size

At the moment, HGAP in SMRT Analysis supports genomes up to 130 Mb; further improvements to scaling the workflow will enable support for larger genomes.

Older versions of SMRT Analysis may have lower genome size limits. SMRT Analysis 2.0 was limited to a 10 Mb genome size. We do not recommend using older versions of SMRT Analysis since they can have significant performance limitations; please upgrade if possible.

Usage notes

For microbial assemblies we have seen improved assembly results using the latest workflows (HGAP.2 and HGAP.3).

The HGAP 2 & 3 workflows have been configured for larger genomes with potentially lower coverage, with defaults that may not make sense for high-coverage bacterial genomes. To assemble these high-coverage genomes using HGAP 2 or 3, consider changing the default Assembly->Target Coverage parameter from 30 to 15.

For samples with very high coverage (e.g. significantly greater than 100X), you may see a larger number of contigs because the excess data overwhelms the built-in contamination and chimera filtering that is part of the HGAP process. This can be addressed by using only the ~100X longest subreads for HGAP, which can be selected by increasing the minimum subread length.

Selecting specific reads in the filtering step

In cases where an external filtering procedure is used, e.g. for contaminant removal, it is possible to extend the filtering step to take a list of reads (polymerase reads, not subreads) that are to be included in the HGAP assembly. The list of reads should be a text file with one movieName/readID per line, for example:

m130419_183857_42161_c100470830070000001823071806131332_s1_p0/8537
m130419_183857_42161_c100470830070000001823071806131332_s1_p0/48643
m130419_183857_42161_c100470830070000001823071806131332_s1_p0/8537
...
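
A read list in this format could be produced by a small script such as the sketch below. The helper name `write_whitelist` and the validation pattern are assumptions made for illustration; the only thing taken from this page is the one-movieName/readID-per-line layout. The check rejects subread-style identifiers (which carry an extra `/start_end` suffix), and duplicate entries are dropped as a convenience, not because HGAP is known to require it.

```python
import re

# Assumed shape of a polymerase read ID: movieName/readID, with exactly one slash.
READ_ID = re.compile(r"^[^/\s]+/\d+$")


def write_whitelist(read_ids, path):
    """Write unique polymerase read IDs to `path`, one per line (hypothetical helper)."""
    seen = set()
    kept = []
    for read_id in read_ids:
        read_id = read_id.strip()
        if not READ_ID.match(read_id):
            # Subread IDs such as movie/8537/0_1200 end up here.
            raise ValueError(f"not a movieName/readID identifier: {read_id!r}")
        if read_id not in seen:
            seen.add(read_id)
            kept.append(read_id)
    with open(path, "w") as handle:
        handle.writelines(read_id + "\n" for read_id in kept)
    return kept
```

For example, `write_whitelist(ids, "/full/path/to/whitelist.txt")` would write the file whose full path is then given in settings.xml.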

The settings.xml file should then be edited to include the extra parameter under the filtering <moduleStage>.

<param name="whiteList" label="Minimum Subread Length">
	<value>full-path-to-read-list-file</value>
</param>
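
If settings.xml is edited by a script instead of by hand, the change might look like the following sketch. It rests on assumptions about the file layout that this page does not confirm: that the filtering stage is a `<moduleStage>` element with `name="filtering"`, and that its parameters sit inside a child `<module>` element when one exists. The function name is hypothetical, and the `label` attribute shown above is omitted here.

```python
import xml.etree.ElementTree as ET


def add_whitelist_param(settings_xml, whitelist_path):
    """Return settings XML text with a whiteList param added (illustrative sketch)."""
    root = ET.fromstring(settings_xml)

    # Assumption: the filtering stage is marked with name="filtering".
    stage = None
    for candidate in root.iter("moduleStage"):
        if candidate.get("name") == "filtering":
            stage = candidate
            break
    if stage is None:
        raise ValueError("no filtering moduleStage found")

    # Assumption: params belong to the stage's <module> child when present.
    module = stage.find("module")
    parent = module if module is not None else stage

    # Drop any earlier whiteList entry so the parameter appears only once.
    for param in parent.findall("param"):
        if param.get("name") == "whiteList":
            parent.remove(param)

    param = ET.SubElement(parent, "param", {"name": "whiteList"})
    ET.SubElement(param, "value").text = whitelist_path
    return ET.tostring(root, encoding="unicode")
```

Compare the output against a hand-edited settings.xml before running a job, since the real file may differ from the layout assumed here.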

There is also an HGAP Whitelisting Tutorial demonstrating its usage.
