HGAP in SMRT Analysis
This page contains information about the current release of HGAP (SMRT Analysis 2.2).
There have been multiple iterations of the HGAP implementation in SMRT Analysis, with performance improvements added to each iteration. In SMRT Analysis 2.2, HGAP.3 was introduced, significantly speeding up HGAP execution. In most cases the speedup gained by HGAP.3 makes it the preferred protocol. In production environments, HGAP.2 might be preferred. We recommend using the latest version of SMRT Analysis to ensure you are getting the best performance with HGAP.
SMRT Analysis v2.1 has a new implementation of HGAP that speeds up preassembly by >10X. This is found in the
RS_HGAP_Assembly.1 is not available in SMRT Analysis v2.2.0 and later versions.
SMRT Analysis v2.2 contains a further improvement to HGAP that speeds up the overlap assembly stage. This new protocol is named RS_HGAP_Assembly.3. The preassembly stage of versions 2 and 3 is largely the same.
The table below summarizes the differences between the HGAP versions:
| SMRT Portal Protocol name | RS_HGAP_Assembly.3 | RS_HGAP_Assembly.2 | RS_HGAP_Assembly.1 |
|---|---|---|---|
| Status | New in SMRT Analysis 2.2 | | Deprecated in SMRT Analysis 2.2 |
| Description | Performance improvements by replacing the consensus step with pbutgcns | Performance improvements by replacing the correction step with pbdagcon | Initial HGAP production implementation |
| HGAP workflow components | | | |
1. Genome Size
The target genome size is used to determine the Minimum Seed Read Length and the coverage of trimmed preassembled reads going into the assembly step, so set it as accurately as possible.
2. Automatic Minimum Seed Read Length calculation
By default, the Minimum Seed Read Length is calculated automatically as the length cutoff at which the longest subreads provide at least 30X coverage of the target genome. To use a user-selected Minimum Seed Read Length instead, deselect the default option. If less than 30X coverage is available for the HGAP process, the algorithm falls back to the user-selected Minimum Seed Read Length (6 kb by default), so this setting must be lowered to 500 bp to allow all-vs-all PreAssembly at less than 30X coverage.
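The cutoff calculation described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the code HGAP itself runs; the function name `min_seed_read_length` and its parameters are hypothetical, and subread lengths are assumed to be available as a plain list of integers.

```python
def min_seed_read_length(subread_lengths, genome_size,
                         target_coverage=30, fallback=6000):
    """Return the length cutoff at which the longest subreads reach
    target_coverage x genome_size bases.

    If all subreads together fall short of the target, return the
    user-selected fallback (6 kb here, mirroring the default above).
    """
    bases_needed = target_coverage * genome_size
    total = 0
    # Walk from the longest subread down, accumulating bases.
    for length in sorted(subread_lengths, reverse=True):
        total += length
        if total >= bases_needed:
            # Every subread of at least this length is a seed read.
            return length
    return fallback


# Toy example: five subreads and a 500 bp "genome".
lengths = [10000, 8000, 6000, 4000, 2000]
cutoff = min_seed_read_length(lengths, genome_size=500)
```

In the toy example, 30X of 500 bp is 15,000 bases, which the two longest subreads cover, so the cutoff is 8000. The same calculation with a different `target_coverage` gives a rough minimum subread length for keeping only the longest reads of a very high-coverage sample.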
At the moment, HGAP in SMRT Analysis supports genomes up to 130 Mb; further improvements to scaling the workflow will enable support for larger genomes.
Older versions of SMRT Analysis may have lower genome size limits. SMRT Analysis 2.0 was limited to a 10 Mb genome size. We do not recommend using older versions of SMRT Analysis since they can have significant performance limitations; please upgrade if possible.
For microbial assemblies we have seen improved assembly results using the latest workflows (HGAP.2 and HGAP.3).
The HGAP 2 & 3 workflows have been configured for larger genomes with potentially lower coverage, with defaults that may not make sense for high-coverage bacterial genomes. To assemble these high-coverage genomes using HGAP 2 or 3, consider changing the default Assembly->Target Coverage parameter from 30 to 15.
For samples with very high coverage (e.g. significantly greater than 100X), you may see a larger number of contigs because the excess reads overwhelm the built-in contamination and chimera filtering that is part of the HGAP process. This can be addressed by using only the longest subreads that together provide roughly 100X coverage, which can be selected by increasing the minimum subread length.
Selecting specific reads in the filtering step
In cases where an external filtering procedure is used, e.g. for contaminant removal, the filtering step can be extended to take a list of reads (polymerase reads, not subreads) to include in the HGAP assembly. The list should be a text file with one movieName/readID per line, for example:
m130419_183857_42161_c100470830070000001823071806131332_s1_p0/8537
m130419_183857_42161_c100470830070000001823071806131332_s1_p0/48643
m130419_183857_42161_c100470830070000001823071806131332_s1_p0/8537
...
The settings.xml file should then be edited to include the extra parameter under the filtering
<param name="whiteList" label="Minimum Subread Length">
    <value>full-path-to-read-list-file</value>
</param>
There is also an HGAP Whitelisting Tutorial demonstrating its usage.
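If the external filtering step reports subread names rather than polymerase read names, the read list can be produced with a short script. The sketch below assumes subread names follow the movieName/readID/start_end pattern and trims them to movieName/readID, dropping duplicates. The function names are hypothetical and not part of SMRT Analysis.

```python
def polymerase_read_ids(read_names):
    """Reduce read names to unique movieName/readID entries.

    Accepts polymerase read names (movieName/readID) or subread names
    (assumed to look like movieName/readID/start_end). Input order is
    preserved; blank or malformed names are skipped.
    """
    seen = set()
    ids = []
    for name in read_names:
        parts = name.strip().split("/")
        if len(parts) < 2:
            continue  # not a movieName/readID style name
        read_id = "/".join(parts[:2])
        if read_id not in seen:
            seen.add(read_id)
            ids.append(read_id)
    return ids


def write_whitelist(read_names, path):
    """Write one movieName/readID per line and return the entry count."""
    ids = polymerase_read_ids(read_names)
    with open(path, "w") as handle:
        for read_id in ids:
            handle.write(read_id + "\n")
    return len(ids)
```

The path of the file written by `write_whitelist` is what would go into the `<value>` element of the `whiteList` parameter.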