SMRT Pipe Reference Guide v2.1

fripp edited this page Nov 21, 2013 · 97 revisions
Clone this wiki locally

Introduction

This document describes the underlying command-line interface to SMRT Pipe, and is for use by bioinformaticians working with secondary analysis results.

SMRT Pipe is Pacific Biosciences’ underlying analysis framework for secondary analysis functions. SMRT Pipe is a python-based general-purpose workflow engine. It is easily extensible, and supports logging, distributed computation, error handling, analysis parameters, and temporary files.

In a typical installation of the SMRT Analysis Software, the SMRT Portal web application calls SMRT Pipe when a job is started. SMRT Portal provides a convenient and user-friendly way to analyze Pacific Biosciences’ sequencing data through SMRT Pipe. Power users will find that there is more flexibility and customization available by instead running SMRT Pipe analyses from the command line.

Note: Throughout this documentation, the path /opt/smrtanalysis is used to refer to the installation directory for SMRT Analysis (also known as $SEYMOUR_HOME). Replace this path with the path appropriate to your installation when using this document.

Installation

SMRT Pipe is installed as part of the SMRT Analysis software installation. For details, see SMRT Analysis Software Installation.

Using the Command Line

In a typical SMRT Analysis installation, SMRT Pipe is in your path after sourcing the setup.sh file. This file declares the $SEYMOUR_HOME environment variable and also sources two subsequent files, $SEYMOUR_HOME/analysis/etc/setup.sh and $SEYMOUR_HOME/common/etc/setup.sh. Do not declare $SEYMOUR_HOME in ~/.bashrc or any other environment setting file because it will cause conflicts.

Invoke the smrtpipe.py script in by executing:

. /path/to/smrtanalysis/etc/setup.sh && smrtpipe.py [--help] [options] --params=settings.xml xml:input.xml

Replace /path/to/smrtanalysis/ with the path to your SMRT Analysis installation. The is the same way smrtpipe.py is invoked in SMRT Portal using the job.sh script.

Logging messages are printed to stderr as well as a log file (log/smrtpipe.log). It is standard practice to pipe the stderr messages to a file using redirection in your shell, for example appending &> smrtpipe.err to the command line if running under bash.

Command Line Options

Following are some of the available options for invoking smrtpipe.py:

-D key=value
  • Overrides a configuration variable. Configuration variables are key-value pairs that are read from the global file smrtpipe.rc before starting an analysis. An example is the NPROC variable which controls the number of simultaneous processors to use during the analysis. To restrict SMRT Pipe to 4 processors, use -D NPROC=4.
--debug
  • Activates debugging output in the stderr and log outputs. To set this flag as a default, specify DEBUG=True in the smrtpipe.rc file.
--distribute
  • Distributes the computation across a compute cluster. For information on onfiguring SMRT Pipe for a distributed computation environment, see SMRT Analysis Software Installation.
--help
  • Displays information about command-line usage and options, and then exits.
--noreports
  • Turns off the production of XML/HTML/PNG reports.
--nohtml
  • Turns off the conversion of XML reports into HTML. (This conversion requires that Java be installed.)
--output=outputDir
  • Specifies a root directory to use for all SMRT Pipe outputs for this analysis. SMRT Pipe places outputs in this directory, as well as in data, results, and log subdirectories.
params=params.xml
  • Specifies a settings XML file for running the pipeline analysis. If this option is not specified, SMRT Pipe prints a message and then exits.
--totalCells
  • Specifies that if the number of cells in the job is less than totalCells, the job is not marked complete when it finishes. Data from additional cells will be appended to the outputs, until the number of cells reaches totalCells.
--recover
  • Attempts to rerun a SMRT Pipe analysis starting from the last successful stage. The same initial arguments should be specified in this case.
--version
  • Displays the version number of SMRT Pipe and then exits.
--kill
  • Kills a SMRT Pipe job running in the current directory. This works with output.

Utility Scripts

For convenience, you can create several utility scripts:

run_smrtpipe_singlenode.sh

SMRT_ROOT=/path/to/smrtanalysis/
.  $SMRT_ROOT/common/etc/setup.sh && smrtpipe.py  --params=settings.xml   xml:input.xml

run_smrtpipe_distribute.sh

SMRT_ROOT=/path/to/smrtanalysis/
.   $SMRT_ROOT/common/etc/setup.sh && smrtpipe.py  --distribute --params=settings.xml   xml:input.xml

run_smrtpipe_debug.sh

SMRT_ROOT=/path/to/smrtanalysis/
.   $SMRT_ROOT/common/etc/setup.sh && smrtpipe.py  --debug --params=settings.xml   xml:input.xml

Specifying SMRT Pipe Inputs

The input file is an XML file specifying the sequencing data to process. Generally, you specify the inputs as URIs (Universal Resource Identifiers) which are resolved by code internal to SMRT Pipe. In practice, this is most useful to large enterprise users that have a data management scheme and are able to modify the SMRT Pipe code to include their own resolver.

The simpler way to specify inputs is to fully resolve the path to each input file, which as of v2.0, is a bax.h5 file. For more information, see bas.h5 Reference Guide.

The script fofnToSmrtpipeInput.py is provided to convert a FOFN (a "file of file names" file) to the input format expected by SMRT Pipe. If my_inputs.fofn looks like

/mnt/data/2770276/0006/Analysis_Results/m130512_050747_42209_c100524962550000001823079609281357_s1_p0.2.bax.h5
/mnt/data/2770276/0006/Analysis_Results/m130512_050747_42209_c100524962550000001823079609281357_s1_p0.3.bax.h5
/mnt/data/2770276/0006/Analysis_Results/m130512_050747_42209_c100524962550000001823079609281357_s1_p0.1.bax.h5

or, for SMRT Pipe versions before v2.1:

/share/data/run_1/m100923_005722_00122_c15301919401091173_s0_p0.bas.h5
/share/data/run_2/m100820_063008_00118_c04442556811011070_s0_p0.bas.h5

then it can be converted to a SMRT Pipe input XML file by entering:

fofnToSmrtpipeInput.py my_inputs.fofn > my_inputs.xml

Following is the resulting XML file for SMRT Pipe v2.1:

<?xml version="1.0"?>
<pacbioAnalysisInputs>
  <dataReferences>
    <url ref="run:0000000-0000"><location>/mnt/data/2770276/0006/Analysis_Results/m130512_050747_42209_c100524
962550000001823079609281357_s1_p0.2.bax.h5</location></url>
    <url ref="run:0000000-0001"><location>/mnt/data/2770276/0006/Analysis_Results/m130512_050747_42209_c100524
962550000001823079609281357_s1_p0.3.bax.h5</location></url>
    <url ref="run:0000000-0002"><location>/mnt/data/2770276/0006/Analysis_Results/m130512_050747_42209_c100524
962550000001823079609281357_s1_p0.1.bax.h5</location></url>
  </dataReferences>
</pacbioAnalysisInputs>

For SMRT Pipe versions before v2.1:

<?xml version="1.0"?>
<pacbioAnalysisInputs>
 <dataReferences>
    <url ref="run:0000000-0000"><location>/share/data/
    /share/data/run_1 m100923_005722_00122_c15301919401091173_s0_p0.bas.h5
    <url ref="run:0000000-0001"><location>/share/data/
    /share/data/run_2/m100820_063008_00118_c04442556811011070_s0_p0.bas.h5
 </dataReferences>
</pacbioAnalysisInputs>

To run an analysis using these input files, use the following command:

smrtpipe.py --params=settings.xml xml:my_inputs.xml

The SMRT Pipe input format lets you specify annotations, such as job IDs, job names, and job comments, in a job-management environment. The fofnToSmrtpipeInput.py application has command-line options for setting these optional attributes.

Note: To get help for a script, run the script with the --help option and no additional arguments. For example:

fofnToSmrtpipeInput.py --help

Specifying SMRT Pipe Parameters

The --params option is the most important SMRT Pipe option, and is required for any sophisticated use. The option specifies an XML file that controls:

  • The analysis modules to run.
  • The order of execution.
  • The parameters used by the modules.

The general structure of the settings XML file is as follows:

<?xml version="1.0"?>
<smrtpipeSettings>

<protocol>
...global parameters...
</protocol>

<module id="module_1">
...parameters...
</module>

<module id="module_2">
...parameters...
</module>

</smrtpipeSettings>
  • The protocol element allows setting global parameters that could possibly be used by all modules.
  • Each module element defines an analysis module to run.
  • The order of the module elements defines the order in which the modules execute.

SMRT Portal protocol templates are located in: $SEYMOUR_HOME/common/protocols/.

SMRT Pipe modules are located in: $SEYMOUR_HOME/analysis/lib/pythonx.x/pbpy-0.1-py2.7.egg/pbpy/smrtpipe/modules/.

You specify parameters by entering a key-value pair in a param element.

  • The name of the key is in the name attribute of the param element.
  • The value of the key is contained in a nested value element.

For example, to set the parameter named reference, you specify:

<param name="reference">
  <value>/share/references/repository/celegans</value>
</param>

Note: To reference a parameter value in other parameters, use the notation ${variable} when specifying a value. For example, to reference a global parameter named home, use it in other parameters as ${home}. SMRT Pipe supports arbitrary parameters in the settings XML file, so the use of temporary variables like this can help readability and maintainability.

Following is a complete example of a settings file for running filtering, mapping, and consensus steps against the E. coli reference genome:

<?xml version="1.0" encoding="utf-8"?>
<smrtpipeSettings>
 <protocol>
  <param name="reference">
   <value>/share/references/repository/ecoli</value>
  </param>
 </protocol>

 <module name="P_Filter">
  <param name="minLength">
    <value>50</value>
  </param>
  <param name="readScore">
    <value>0.75</value>
  </param>
 </module>

 <module name="P_FilterReports" />

 <module name="P_Mapping">
  <param name="align_opts" hidden="true">
   <value>--minAccuracy=0.75 --minLength=50 -x </value>
  </param>
 </module>

 <module name="P_MappingReports" />
 <module name="P_Consensus" />
 <module name="P_ConsensusReports" />

</smrtpipeSettings>

SMRT Portal Protocols

Following are the secondary analysis protocols included in SMRT Analysis v2.1, with the SMRT Pipe module(s) called by each protocol. Many of these modules are described later in this document.

11k_Unrolled_Resequencing

  • Used for 11k plasmidbell resequencing against an unrolled reference.
  • Designed for troubleshooting the performance of 120 minute movies to get a more accurate estimate of the full polymerase read length.
* P_Filter
* BLASR_Unrolled
* P_MotifFinder
* P_ModificationDetection
* P_AnalysisHook

ecoliK12_RS_Resequencing:

  • Used for E. coli whole genome resequencing.  
  • Reads are filtered, mapped to the E. coli reference sequence, and consensus and variants are identified versus this reference.
* P_Filter
* P_Mapping
* P_GenomicConsensus
* P_AnalysisHook

lambda_RS_Resequencing:

  • Used for lambda resequencing.
  • Reads are filtered, mapped to the lambda reference sequence, and consensus and variants are identified versus this reference.
* P_Filter
* P_Mapping
* P_GenomicConcensus
* P_AnalysisHook

BridgeMapper_beta:

  • Used for whole-genome or targeted resequencing.
  • Returns split alignments of PacBio reads using BLASR. 
  • Reads are filtered by length and quality, mapped to a provided reference sequence, and consensus and variants are identified versus this reference using the Quiver algorithm.
* P_Filter
* P_Mapping

RS_AHA_Scaffolding:

  • Used for hybrid assembly of genomes up to 10 Mbp.
  • Improve existing assemblies up to 200 Mb in size by scaffolding with PacBio long reads to join contigs.
  • Reads are filtered and assembled with high confidence contigs into scaffolds using a combination of algorithms developed by Pacific Biosciences and the AMOS open-source project.
* P_Filter
* P_AHA

RS_cDNA_Mapping:

  • Aligns reads from cDNA to a genomic DNA reference using the third-party software tool GMAP.
  • Reads are filtered by length and quality and then mapped against the reference using GMAP to span introns.
* P_Filter
* P_GMAP

RS_CeleraAssembler:

  • Performs de novo assembly of genomes up to 200 Mbp using pacBioToCA for error correction and Celera® Assembler 7.0 for assembly.
  • Combines long reads (ideally from a 10 kb or longer insert library) with shorter, high-accuracy reads (Reads of Insert reads or reads from another sequencing technology).
  • This workflow (comprised of the P_PacBioToCA and P_CeleraAssembler modules) wraps the Celera® Assembler’s error correction and assembly programs. For full documentation of pacBioToCA and the Celera Assembler, see http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Main_Page.
  • The error correction may be run with external high confidence reads, such as those from Illumina® data, or from internally generated Reads of Insert reads.

Input:

  • settings.xml: Specifies the parameters.

  • input.xml: Specifies the inputs.

Sample settings.xml file:

<module id="P_PacBioToCA" label="PacBioToCA v1" editableInJob="true">
  <description>This module wraps pacBioToCA, the error correction pipeline of Celera Assembler v7.0</description>

If useCCSFastq is set to True, CSS reads are used as the high-confidence reads:

  <param name="useCCSFastq" label="Correct With CCS">
     <title>Use CCS reads to correct long reads</title>
     <value>True</value>
     <input type="checkbox" />
  </param>

To use a FASTQ file, set the value of shortReadFastqA to the full path.

  <param name="shortReadFastqA" label="FASTQ to Correct With">
     <title>(Optional) FASTQ file of reads to correct long reads with </title>
     <input type="text" />
     <value></value>
     <rule remote="api/protocols/resource-exists?paramName=shortReadFastqA" required="false" message="File does not exist" />
  </param>

The shortReadTechnology option selects the platform on which the reads were generated. This sets library feature flags to enable different correction, trimming and untagging algorithms. The default is illumina.

  <param name="shortReadTechnology" label="FASTQ Read Type">
     <title>Sequencing platform used to generate the FASTQ file, if specified</title>
     <value>illumina</value>
     <select>
       <option value="sanger">~1kb (e.g., Sanger and PacBio CCS)</option>
       <option value="454">~600bp (e.g., 454)</option>
       <option value="illumina">~100bp (e.g., Illumina)</option>
     </select>
  </param>

The shortReadType option selects the type of QV encoding (sanger, solexa, or illumina) in the FASTQ file.

  <param name="shortReadType" label="FASTQ Quality Value Encoding">
     <title>ASCII encoding of the quality values in the FASTQ file, if specified</title>
     <value>illumina</value>
     <select>
       <option value="sanger">Phred+33 (e.g., Sanger and PacBio fastq)
       </option>
       <option value="solexa">Solexa+64 (e.g., Solexa fastq)</option>
       <option value="illumina">Phred+64 (e.g., Illumina fastq)</option>
     </select>
  </param>

  <param name="pbReadMinLength" label="Min fragment length">
     <title>Minimum length of PacBio RS fragment to keep.</title>
     <input type="text" />
     <value>1000</value>
     <rule type="digits" message="Value must be an integer" required="true" />
  </param>

  <param name="specInPacBioToCA" label="Pre-defined spec file">
    <title>Enter the server path to an existing spec file</title>
    <input type="text" />
    <rule remote="api/protocols/resource-exists?paramName=specInPacBioToCA" required="false" message="File does not exist" />
  </param>
</Module>

<module id="P_CeleraAssembler" label="CeleraAssembler v1" editableInJob="true">
   <description>This module wraps the Celera Assembler v7.0</description>

  <param name="genomeSize" label="Genome Size (bp)">
     <title>Approximate genome size in base pairs</title>
     <value>5000000</value>
     <input type="text" />
     <rule type="digits" message="Must be a value between 1 and 200000000" min="1"
     required="true" max="200000000" />
  </param>

  <param name="defaultFrgMinLen" hidden="true">
    <input type="text" />
    <value>1000</value>
  </param>

  <param name="xCoverage" label="Target Coverage">
    <title>Fold coverage to target when picking frgMinLen for assembly. Typically 15 to 25.</title>
    <input type="text" />
    <value>15</value>
    <rule type="digits" message="Value must be an integer between 10 and 30, inclusive” min="10" max="30" />
  </param>

  <param name="ovlErrorRate" label="Overlapper error rate">
    <title>Overlapper error rate</title>
    <input type="text" />
    <value>0.015</value>
    <rule type="number" message="Value must be numeric" />
  </param>

  <param name="ovlMinLen" label="Overlapper min length">
    <title>Overlaps shorter than this length are not computed.</title>
    <input type="text" />
    <value>40</value>
    <rule type="digits" message="Value must be an integer" />
  </param>

  <param name="specInRunCA" label="Pre-defined spec file">
    <title>Enter the server path to an existing spec file</title>
    <input type="text" />
    <rule remote="api/protocols/resource-exists?paramName=specInRunCA" required="false" message="File does not exist" />
  </param>

 </module>
</smrtpipeSettings>

Output:

Note: Some of the worklflow outputs are produced by the Celera Assembler, and some by Pacific Biosciences’ software.

  • data/runCa.spec: The specification file used to run the assembly program. The P_CeleraAssembler module auto-generates the specification file based on the input data and selected parameters. Alternatively, you can provide an explicit specification file.

  • data/pacBioToCA.spec: The specification file used to run the error correction program. The P_PacBioToCA module auto-generates the specification file based on the input data and selected parameters. Alternatively, you can provide an explicit specification file.

  • data/celera-assembler.asm: The official output of Celera Assembler’s assembly program.

  • data/assembled_reads.cmp.h5: The pairwise alignment for each read against its assembled contig consensus.

  • data/assembled_summary.gff.gz: Summary information about each of the contigs.

  • data/castats.txt: Assembly statistics report.

To run the error correction and assembly modules, enter the following:

smrtpipe.py --params=settings.xml xml:input.xml >& smrtpipe.err

RS_Filter_Only:

  • Filters reads based on the minimum read length and read quality specified. No additional analysis is performed.
* P_Filter

RS_HGAP_Assembly.1, RS_HGAP_Assembly.2:

  • HGAP (Hierarchical Genome Assembly Process) performs high quality de novo assembly using a single PacBio library preparation.
  • HGAP consists of pre-assembly, de novo assembly with Celera® Assembler, and assembly polishing with Quiver.
  • HGAP.2 protocol is designed to position the software to scale against larger genomes.
* P_PreAssembler (HGAP.1)
* P_PreAssemblerDagcon (HGAP.2)
* P_CeleraAssembler
* P_Mapping
* P_AssemblyPolishing 

RS_Long_Amplicon_Analysis (BETA):

  • Used to determine phased consensus sequences for pooled amplicon data.
  • Up to 20 distinct amplicons can be pooled. Reads are clustered into high-level groups, then each group is phased and consensus is called using the Quiver algorithm.
  • Optionally splits reads by barcode if the sample is barcoded.
* P_AmpliconAssembly
* P_Barcode

RS_Minor_and_Compound_Variants (BETA):

  • A single-molecule analysis workflow for detection of minor variants and compound mutations relative to a reference using Reads of Insert data. Suitable for cancer amplicons and low-divergence viral samples.
* P_Filter
* BLASR_Minor_and_Compound_Variants
* P_CorrelatedVariants

RS_Modification_Detection:

  • A resequencing analysis that performs base-modification identification for 6-mA, 4-mC, and optionally TET-converted 5-mC. Also performs variant detection.
  • Reads are filtered by length and quality, mapped to a provided reference sequence, and consensus and variants are identified.
* P_Filter
* P_Mapping
* P_GenomicConsensus
* P_ModificationDetection

RS_Modification_and_Motif_Analysis:

  • A resequencing analysis that identifies common bacterial base modifications (6-mA, 4-mC, and optionally TET-converted 5-mC), and then analyzes the methyltransferase recognition motifs.
  • Reads are filtered by length and quality, mapped against a specified reference sequence, and then variants are called.
* P_Filter
* P_Mapping
* P_GenomicConsensus
* P_ModificationDetection
* P_MotifFinder

RS_PreAssembler:

  • Used to construct a set of highly accurate long reads for use in de novo assembly, using the hierarchical genome assembly process (HGAP).
  • Takes each read exceeding a minimum length, aligns all reads against it, trims the edges, and then takes the consensus.
* PreAssemblerSFilter
* P_PreAssembler

RS_ReadsOfInsert:

  • Generates reads from the insert sequence of single molecules, optionally splitting by barcode.
  • Used to estimate the length of the insert sequence loaded onto a SMRT® Cell.
  • Replaces the Circular Consensus Sequencing (CCS) protocol, which has been moved off the primary analysis instrument.
  • To obtain the closest approximation of CCS as it existed on-instrument, specify MinCompletePasses = 2 and MinPredictedAccuracy = 0.9 in the SMRT® Portal Reads of Insert protocol dialog box.
* P_CCS
* P_Barcode

RS_Resequencing:

  • Used for whole-genome or targeted resequencing.
  • Reads are filtered, mapped to a provided reference sequence, and consensus and variants are identified against this reference.
  • Haploid variants and small indels, but not diploid variants, are called during consensus.
* P_Filter
* P_Mapping
* P_GenomicConsensus

RS_Resequencing_ReadsOfInsert:

  • Used for whole-genome or targeted resequencing.
  • Reads are filtered, mapped to a provided reference sequence, and consensus and variants are identified against this reference.
  • Haploid variants and small indels, but not diploid variants, are called during consensus.
  • Uses Reads of Insert (formerly known as CCS) data during mapping.
* P_Filter
* P_CCS
* BLASR_De_Novo_CCS

RS_Resequencing_GATK_Barcode:

  • Used for heterozygous and homozygous SNP calling of targeted regions and whole genomes.
  • Reads are filtered by length and quality, mapped to a provided reference sequence, and consensus and variants are identified versus the reference using the GATK Unified Genotyper.
  • Note: This protocol is deprecated, and will be removed in a future release of SMRT Pipe.
* P_Filter
* BLASR_Barcode
* P_GATKVC

RS_Site_Acceptance_Test:

  • Site acceptance test workflow for lambda resequencing.
  • Generates a report displaying site acceptance test metrics.
* P_Filter
* P_Mapping
* P_GenomicConsensus

SMRT Pipe Modules and their Parameters

Following is an overview of some of the common modules included in SMRT Pipe and their parameters. Not all modules or parameters are listed here.

Developers interested in even finer control should look inside the validateSettings method for each python analysis module. By convention, all of the settings known to the analysis module are referenced in this method.

Global Parameters

Global parameters are potentially used in multiple modules. In the SMRT Pipe internals, they are accessed in the “global” namespace. Following are some common global parameters:

reference
  • Specifies the name of a reference repository entry or FASTA file for mapping reads. Required for resequencing workflows.
  • Default value = None
control
  • Specifies the name of a reference repository entry or FASTA file for mapping spike-in control reads. (Optional)
  • Default value = None
use_subreads
  • Specifies whether to divide reads into subreads using the adapter region boundaries found by the primary analysis software. (Optional)
  • Default value = True
num_stats_regions
  • Specifies how many regions to use when reporting region statistics such as depth of coverage and variant density. (Optional)
  • Default value = 500

P_Fetch Module

This module fetches the input data and generates a file of the file names of the input .pls files for downstream analysis. This module has no exposed parameters.

Output:

  • pls.fofn (File containing file names of the input .pls files)

P_Filter Module

This module filters and trims the raw reads produced by Pacific Biosciences’ primary analysis software. Options are available for taking the information found in the bas.h5 files and using this to pass reads and portions of reads forward.

Input:

  • bas.h5 files (pre v2.1) or bax.h5 files (post v2.1)

Output:

  • data/filtering_summary.csv: Includes raw metrics and filtering information for each read (not subread) found in the original bas.h5 files.
  • rgn.h5 (one for each input bas.h5 file): Filtering information generated by the module.

Parameters:

  • minLength Reads with a high quality region read length below this threshold are filtered out. (Optional)

  • maxLength Reads with a high quality region read length above this threshold are filtered out. (Optional)

  • minSubReadLength Subreads shorter than this length are filtered out.

  • maxSubReadLength Subreads longer than this length are filtered out.

  • minSNR Reads with signal-to-noise ratio below this threshold are filtered out. (Optional)

  • readScore Reads with a high quality region (Read Quality) score below this threshold are filtered out. (Optional)

  • trim Default value = True, Specifies whether to trim reads to the high-quality region. (Optional)

  • artifact Reads with a read artifact score less than this (negative) number are filtered out. No number indicates no artifact filtering. Reasonable thresholds are typically between -1000 and -200. (Optional)

P_PreAssembler Module

This module takes as input long reads and short reads in standard formats, aligns the short reads to the long reads, and outputs a consensus from the preassembled short reads using the long reads as seeds. Note: You must run the P_Fetch and P_Filter modules before running P_PreAssembler to get meaningful results.

Input:

  • Long reads ("seed reads"): PacBio pls.h5/bas.h5 file(s) and optionally associated rgn.h5 file(s).
  • Short reads: Can be one of the following:
    • Arbitrary high-quality reads in FASTQ format, such as Illumina® reads, without Ns.
    • PacBio pls.h5/bas.h5 file(s): The same reads as used for the long reads. This mode is the first step of HGAP (Hierarchical Genome Assembly Procedure.)
  • params.xml
  • input.xml

The module can run on bas.h5 files only, and on bas.h5 and FASTQ files. Following are sample XML inputs for both modes.

Sample input.xml,bas.h5-only input mode

  • Note: bas.h5 input files must have the suffix bas.h5.
<pacbioAnalysisInputs>
<?xml version="1.0"?>
<pacbioAnalysisInputs>
   <dataReferences>
      <url ref="run:0000000-0001">
   <location>
      /path/to/input.bas.h5
   </location>
</url>
   </dataReferences>
</pacbioAnalysisInputs>

Sample params.xml, bas.h5-only input mode

  • This XML parameter file was tested on 90X short reads and 24X long reads.
<module name="P_PreAssembler">
   <param name="useFastqAsShortReads">
     <value>False</value>
   </param>
   <param name="useFastaAsLongReads">
     <value>False</value>
   </param>
   <param name="useLongReadsInConsensus">
     <value>False</value>
   </param>
   <param name="useUnalignedReadsInConsensus">
     <value>False</value>
   </param>
   <param name="useCCS">
     <value>False</value>
   </param>
   <param name="minLongReadLength">
     <value>5000</value>
   </param>
   <param name="blasrOpts">
     <value> -minReadLength 200 -maxScore -1000 -bestn 24 -maxLCPLength 16 -nCandidates 24 </value>
   </param>
   <param name="consensusOpts">
     <value> -L </value>
   </param>
   <param name="layoutOpts">
     <value> --overlapTolerance 100 --trimHit 50 </value>
   </param>
   <param name="consensusChunks">
     <value>60</value>
   </param>
   <param name="trimFastq">
     <value>True</value>
   </param>
   <param name="trimOpts">
     <value> --qvCut=59.5 --minSeqLen=500 </value>
   </param>
</module>

Sample input.xml (FASTQ and bas.h5 input mode)

  • This parameter XML file was tested on 50X 100 bp Illumina® reads correcting 15X PacBio long reads.
<?xml version="1.0"?>
<pacbioAnalysisInputs>
   <dataReferences>
     <url ref="run:0000000-0001">
   <location>
     /path/to/input.bas.h5
   </location>
   </url>
     <url ref="fastq:/path/to/input.fastq"/>
   </dataReferences>
</pacbioAnalysisInputs>

Sample params.xml (FASTQ and bas.h5 input mode)

<?xml version="1.0" ?>
<smrtpipeSettings>
  <module name="P_Fetch"/>
  <module name="P_Filter">
    <param name="filters">
       <value>MinRL=1000,MinReadScore=0.80</value>
    </param>
    <param name="artifact">
       <value>-1000</value>
    </param>
  </module>
  <module name="P_PreAssembler">
    <param name="useFastqAsShortReads">
       <value>True</value>
    </param>
    <param name="useFastaAsLongReads">
       <value>False</value>
    </param>
    <param name="useLongReadsInConsensus">
       <value>False</value>
    </param>
    <param name="useUnalignedReadsInConsensus">
       <value>False</value>
    </param>
    <param name="blasrOpts">
       <value>-minMatch 8 -minReadLength 30 -maxScore -100 -minPctIdentity 70 -bestn 100</value>
    </param>
    <param name="layoutOpts">
       <value>--overlapTolerance=25</value>
    </param>
    <param name="consensusOpts">
       <value>-w 2</value>
    </param>
</module>
</smrtpipeSettings>

Output:

  • corrected.fasta,corrected.fastq: FASTQ or FASTA file of corrected long reads.
  • idmap.csv: csv file mapping corrected long read ids to original read ids.

P_PreAssemblerDagcon Module (Beta)

This module provides the primary difference in HGAP 2. P_PreAssemblerDagcon was designed as a drop-in replacement for the correction step in HGAP 1, providing the same functionality much faster and more efficiently than the P_PreAssembler module. It includes a simple, alignment-based chimera filter that reduces effects caused by missing SMRTbell™ adapters, such as spurious contigs in assemblies.

Note that the quality values in the FASTQ file for the corrected reads are a uniformly set to QV24. This is determined by mapping corrected reads to a known reference and appears to work well on a broad set of data. We are considering deriving QV values directly from the data for a future release.

As the HGAP 2 implementation was completely redesigned and includes much new code, it is labeled as "Beta" for this release.

Input:

  • Filtered subreads fasta file (generated by P_Filter)
  • params.xml
  • input.xml

The module is a much simpler design and can only be run using smrtpipe in combination with the filtered subreads module. Auto-seed cutoff still targets 30x seed reads.

Parameters:

  • targetChunks How many chunks to split the seed reads (target) into. In the example below the value is set to 6, which generates approximately 5x (30x/6) worth of sequence per split file, or chunk. If set to 1, then set splitBestn to the same value as totalBestn.

  • splitBestn Must be adjusted based on targetChunk. Roughly 1.5 - 2 the coverage found in a given split file, though may produce false positives in some cases, affecting correction so be careful.

  • totalBestn Default value = 24. Based on the total coverage of 30x. Default is sensible in most cases.

Sample input.xml,bas.h5-only input mode

  • Note: bas.h5 input files must have the suffix bas.h5.
<pacbioAnalysisInputs>
<?xml version="1.0"?>
<pacbioAnalysisInputs>
   <dataReferences>
      <url ref="run:0000000-0001">
   <location>
      /path/to/input.bas.h5
   </location>
</url>
   </dataReferences>
</pacbioAnalysisInputs>

Sample params.xml, bas.h5-only input mode

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<smrtpipeSettings>
    <module id="P_Filter" >
        <param name="minLength"><value>100</value></param>
        <param name="minSubReadLength"><value>500</value></param>
        <param name="readScore"><value>0.80</value></param>
    </module>
    <module id="P_PreAssemblerDagcon">
        <param name="computeLengthCutoff"><value>true</value></param>
        <param name="minLongReadLength"><value>6000</value></param>
        <param name="targetChunks"><value>6</value></param>
        <param name="splitBestn"><value>11</value></param>
        <param name="totalBestn"><value>24</value></param>
        <param name="blasrOpts"><value> -noSplitSubreads -minReadLength 200 -maxScore -1000 -maxLCPLength 16 </value></param>
    </module>
</smrtpipeSettings>

Output:

  • data/corrected.fasta,data/corrected.fastq: FASTQ and FASTA file of corrected long reads.
  • preassembler_report.json: JSON-formated pre-assembly report.
  • preassembler_report.html: HTML-formated pre-assembly report.

P_Mapping (BLASR) Module

This module aligns reads against a reference sequence, possibly a multi-contig reference. If the P_Filter module is run first, then only the reads which passed filtering are aligned.

Output:

  • data/aligned_reads.cmp.h5: The pairwise alignments for each read.
  • data/alignment_summary.gff: Summary information.

Parameters:

  • align_opts Default value = Empty string, Passes options to the underlying compareSequences.py script. (Optional)

  • --useCcs= Default value = None, A parameter sent to the underlying compareSequences.py script via the align_opts parameter value above. Values are {denovo|fullpass|allpass}. (Optional)

    • denovo: Maps just the de novo called sequence and report. (Does not include quality values.)

    • fullpass: Maps the de novo called sequence, then aligns full passes to the sequence that the de novo called sequence aligns to.

    • allpass: Maps the de novo called sequence, then aligns all passes (even ones that do not span the length of the template) to the sequence the de novo called sequence aligned to.

  • load_pulses: Default value = True, Specifies whether to load pulse metric information into the cmp.h5 file. (Optional)

  • maxHits: Default value = None, Attempts to find sub-optimal alignments and report up to this many hits per read. (Optional)

  • minAnchorSize: Default value = None, Ignores anchors smaller than this size when finding candidate hits for dynamic programming alignment. (Optional)

  • maxDivergence: Default value = None, Specifies maximum divergence between read and reference to allow a mapping. Divergence = (1 - accuracy).

  • output_bam: Default value = False, Specifies whether to output a BAM representation of the cmp.h5 file. (Optional)

  • output_sam: Default value = False, Specifies whether to output a SAM representation of the cmp.h5 file. (Optional)

  • gff2Bed: Default value = False, Specifies whether to output a BED representation of the depth of coverage summary. (Optional)

P_GenomicConsensus (Quiver) Module

This module takes the alignments generated by the P_Mapping module and calls the consensus sequence across the reads.

Input:

  • data/aligned_reads.cmp.h5: The pairwise alignments for each read.

  • data/alignment_summary.gff: Summary information.

Output:

  • data/aligned_reads.cmp.h5

  • data/variants.gff.gz: A gzipped GFF3 file containing variants versus the reference.

  • data/consensus.fastq.gz: The consensus sequence in FASTQ format.

  • data/alignment_summary.gff, data/variants.vc: Useful information about variants.

Parameters:

  • makeBed: Default value = True, Specifies whether to output a BED representation of the variants. (Optional)

  • makeVcf: Default value = True, Specifies whether to output a VCF representation of the variants. (Optional)

P_AssemblyPolishing

This module is used in HGAP to polish draft assemblies using Quiver.

Input:

  • data/aligned_reads.cmp.h5: The pairwise alignments for each read against the draft assembly.

  • data/alignment_summary.gff: Summary information.

Output:

  • data/polished_assembly.fasta.gz: The consensus sequence in FASTA format.

  • data/polished_assembly.fastq.gz: The consensus sequence in FASTQ format.

  • results/polished_report.html: HTML-formatted report for the polished assembly.

  • results/polished_report.xml: XML-formatted report for the polished assembly.

Parameters:

  • enableMapQVFilter Default value = True

P_AnalysisHook Module

This module allows you to call executable code as part of a SMRT Pipe analysis. P_AnalysisHook can be called multiple times in a settings XML file, allowing for an arbitrary number of calls to external (non-SMRT Pipe) code.

Parameters:

  • scriptDir: Default value = None, All executables in this directory are called serially with the command line exeCmd jobDir, where jobDir is the root of the SMRT Pipe output for this analysis. (Optional)

  • script: Default value = None, Path to an executable called with the command line exeCmd jobDir, where jobDir is the root of the SMRT Pipe output for this analysis. (Optional)

P_AHA (AHA Scaffolding) Module

This module scaffolds high-confidence contigs, such as those from Illumina® data, using Pacific Biosciences’ long reads.

Input:

P_AHA.py uses two kinds of input instead of one:

  • A FASTA file of high-confidence sequences to be scaffolded. These are typically contigs assembled from Illumina® short-read sequence data. They are passed to AHA as a reference sequence in the settings.xml input file.

  • Pacific Biosciences’ long reads, in HDF5 format. These are used to join the high-confidence contigs into a scaffold. Note that versions of the AHA Scaffolding algorithm prior to v2.1 accepted reads in FASTA format. After v2.1, users with FASTA formatted reads should use the underlying executable pbaha.py.

Sample settings.xml file for long reads, with only customer-facing parameters:

<?xml version="1.0"?>
<smrtpipeSettings>
  <global>
    <param name="reference">
        <value>/mnt/secondary-siv/references/ecoli_contig</value>
    </param>
  </global>
  <module name="P_Fetch"/>
  <module name="P_Filter">
    <param name="minLength">
        <value>50</value>
    </param>
    <param name="minSubReadLength">
        <value>50</value>
    </param>
    <param name="readScore">
        <value>0.75</value>
    </param>
  </module>
  <module name="P_FilterReports"/>
  <module name="P_AHA">
    <param name="fillin">
        <value>False</value>
    </param>
    <param name="blasrOpts">
        <value>-minMatch 10 -minPctIdentity 70 -bestn 10 -noSplitSubreads</value>
    </param>
    <param name="instrumentModel">
        <value>RS</value>
    </param>
    <param name="paramSchedule">
        <value>6,3,75,100;6,3,75,100;5,3,75,100;6,2,75,100;6,2,75,100;5,2,75,100</value>
    </param>
    <param name="maxIterations">
        <value>6</value>
    </param>
    <param name="description">
        <value>AHA ("A Hybrid Assembler") is the PacBio hybrid assembly algorithm. It is based on the open source assembly software package AMOS, with additional software components tailored to PacBio's long reads and error profile.</value>
    </param>
  </module>
</smrtpipeSettings>

Output:

  • data/scaffold.gml: A GraphML file that contains the final scaffold. This file can be readily parsed in python using the networkx package.

  • data/scaffold.fasta: A FASTA file with a single entry for each scaffold.

Parameters:

  • paramSchedule: Default value = None Specifies parameter schedules used for iterative hybrid assembly. Schedules are in comma-delimited tuples, separated by semicolons. Example: 6,3,75;6,3,75;6,2,75;6,2,75. The fields, in order, are:

    • Minimum alignment score. Higher is more stringent.
    • Minimum number of reads needed to link two contigs. (Redundancy)
    • Minimum subread length to participate in alignment.
    • Minimum contig length to participate in alignment.
  • fillin: Default value = False Specifies whether to use long reads.

  • blasrOpts: Default value = -minMatch 10 -minPctIdentity 60 -bestn 10 -noSplitSubreads Options passed directly to BLASR for aligning reads to contigs.

  • maxIterations: Default value = 6 Specifies the maximum number of iterations to use from paramSchedule. If paramSchedule has more than maxIterations, it will be truncated at maxIterations. If paramSchedule has less than maxIterations, the last iteration of paramSchedule is repeated.

  • cleanup: Default value = True Specifies whether to clean up intermediate files. This can be useful for debugging purposes.

  • runNucmer: Default value = True Specifies whether to use Nucmer to detect repeat locations. This can improve assemblies, but can be very slow on large highly repetitive genomes.

  • gapFillOpts: Default value = “” Options to be passed directly to gapFiller.py.

  • noScaffoldImages: Default value = True Specifies not producing SVG files of the scaffolds. Creating these files can be expensive for large assemblies, but is recommended for small assemblies.

To run P_AHA.py, enter the following:

smrtpipe.py --params=settings.xml xml:input.xml >& smrtpipe.err

Known Issues

  • Depending on the repetitive content of the high-confidence input contigs, a large fraction of the sequence in the contigs can be called repeats. To avoid this, turn off the split repeats step by setting the minimum repeat identity to a number greater than 100, for example:
<minRepeatIdentity>1000</minRepeatIdentity>

P_GATKVC (GATK Unified Genotyper) Module

This module wraps the Broad Institute's GATK Unified Genotyper for Bayesian diploid and haploid SNP calling, using base quality score recalibration and default settings. The module calls both homozygous and heterozygous SNPs.

Note: This module is deprecated, and will be removed in a future release of SMRT Pipe.

We recommend that you use a dbSNP file as a prior for base quality score recalibration. By default, the P_GATKVC (GATK Unified Genotyper) module uses a null prior. To use a prior, see the script vcfUploader.py, included with the SMRT Analysis installation.

Note: Indel calling and other options are not currently supported through SMRT Pipe.

P_ModificationDetection Module

This module uses the cmp.h5 output by the P_Mapping module to:

  1. Compare observed IPDs in the cmp.h5 file at each reference position on each strand with control IPDs. Control IPDs are supplied by either an in-silico computational model, or observed IPDs from unmodified “control” DNA.

  2. Generate modifications.csv and modifications.gff reporting statistics on the IPD comparison.

Predicted Kinetic Background Control vs Case-Control Analysis

By default, the control IPDs are generated per-base of the reference with an in-silico model of the expected IPD values for each position, based on sequence context. The computational model is called the Predicted IPD Background Control. Even in normal unmodified DNA, the IPD at any particular point will vary. Internal studies at Pacific Biosciences show that most of the variation in mean IPD across a genome can be predicted from a 12-base sequence context surrounding the active site of the DNA polymerase. The bounds of the relevant context window correspond to the window of DNA in contact with the polymerase, as seen in DNA/polymerase crystal structures. The module includes a pre-trained lookup table mapping 12-mer DNA sequences to mean IPDs observed in PacBio® data.

Filtering and Trimming

Some PacBio data features require special attention for good modification detection performance. The module inspects the alignment between the observed bases and the reference sequence. For an IPD measurement to be included in the analysis, the read sequence must match the reference sequence for K around the cognate base; currently, K = 1. The IPD distribution at some locus can be seen as a mixture of the “normal” incorporation process IPD (sensitive to the local sequence context and DNA modifications) and a contaminating “pause” process IPD which has a much longer duration (the mean is >10x longer than normal), but rarely happens (~1% of IPDs).

Pauses are defined as pulses with an IPD >10x longer than the mean IPD at that context. Heuristics are used to filter out the pauses.

Statistical Testing

The module tests the hypothesis that IPDs observed at a particular locus in the sample have longer means than IPDs observed at the same locus in unmodified DNA. If a Whole-Genome-Amplified dataset is generated, which removes DNA modifications, the module uses a case-control, two-sample t-test.

The module also provides a pre-calibrated Predicted Kinetic Background Control model which predicts the unmodified IPD, given a 12-base sequence context. In that case, the module uses a one-sample t-test, with an adjustment to account for error in the control model.

Input:

  • aligned_reads.cmp.h5: A standard cmp.h5 file with alignments and IPD information that supplies the kinetic data for modification detection.

  • Reference Sequence: The path to a SMRT Portal reference repository entry for the reference sequence used to perform alignments.

Output:

  • modifications.csv: Contains one row for each (reference position, strand) pair that appeared in the dataset with coverage of at least x. (x defaults to 3, but is configurable using the ipdSummary.py –minCoverage flag.) The reference position index is 1-based for compatibility with the GFF file in the R environment.

  • modifications.gff: Each template position/strand pair whose p-value exceeds the p-value threshold displays as a row. (The default threshold is p=0.01 or score=20.) The file is compliant with the GFF version 3 specification, and the template position is 1-based, per the GFF specification. The strand column refers to the strand carrying the detected modification, which is the opposite strand from those used to detect the modification.

The auxiliary data column of the GFF file contains other statistics useful for downstream analysis or filtering. This includes the coverage level of the reads used to make the call, and +/- 20 bp sequence context surrounding the site.

Results are generally indexed by reference position and reference strand. In all cases, the strand value refers to the strand carrying the modification in the DNA sample. The kinetic effect of the modification is observed in read sequences aligning to the opposite strand, so reads aligning to the positive strand carry information about modification on the negative strand and vice versa. The module always reports the strand containing the putative modification.

Parameters

  • identifyModifications: Default value = False, Specifies whether to use a multi-site model to identify the modification type.

  • tetTreated: Default value = False, Specifies whether the sample was TET-treated to amplify the signal of m5C modifications.

P_CorrelatedVariants (Minor and Compound Variants) Module

This module calls and correlates rare variants from a sample and provides support for determining whether or not sets of mutations are co-located. Note: This only includes SNPs, not indels.

The module takes high-coverage Reads of Insert reads that are aligned without quality scores to a similar reference. The module requires the following:

  • Reads of Insert reads only.
  • High coverage (at least 500x).
  • Alignment to reference without using quality scores.
  • The sample cannot be highly divergent from the reference.

The algorithm uses simple column counting and a plurality call. While it works well with higher depths (> 500x), it does suffer from reference bias, systematic alignment error and sizeable divergence from the reference.

Variants may not only coexist on the same molecule, but they may also be coselected for; that is, inherited together. This algorithm attempts to define a measurable relationship between a set of co-located variants found with some significance on a set of reads.

  1. The variant information is read from the GFF input file, then the corresponding cmp.h5 file is searched for reads that cover the variant. Reads that contain the variant are tracked and later assigned any other variants they contain, building a picture of the different haplotypes occuring within the read sets.

  2. The frequencies and coverage values for each haplotype are computed. These values will likely deviate (to the downside) from those found in the GFF file as the read set is constrained by whether or not they completely span the region defined by the variant set. Only reads that cover all variants are included in the frequency and coverage calculation.

  3. The frequency and coverage values are used to calculate an observed probability of each permutation within the variant set. These probabilities are used to compute the Mutual Information score for the set. Frequency informs mutual information, but does not define it. It is possible to have a lower frequency variant set with a higher mutual information score than a high frequency one.

Input:

  • A GFF file containing CCS-based variant calls at each position including read information: ID, start, and stop. (Start and stop are in (+) strand genomic coordinates.)

  • A cmp.h5 alignment file aligned without quality, and with a minimum accuracy of 95%.

  • (Optional) score: Include the mutual information score in the output. (Default value = Don't include)

  • (Optional)out: The output file name. (Default value = Output to screen)

Output:

  • data/rare_variants.gff(.gz): Contains rare variant information, accessible from SMRT Portal.

  • data/correlated_variants.gff: Accessible from SMRT Portal.

  • results/topRareVariants.xml: Viewable report based on the contents of the GFF file.

  • results/topRareVariants.xml: Viewable report based on the contents of the GFF file.

  • CSV file containing the location and count of co-variants. Example:

ref,haplotype,frequency,coverage,percent,mutinf
ref000001,285-G|297-G,133,2970,4.48,0.263799623501
ref000001,285-T|286-T,128,2971,4.31,0.256253924909
ref000001,285-G|406-G,103,2963,3.48,0.217737973781
ref000001,99-C|285-G,45,2963,1.52,0.113489812305
ref000001,286-T|406-G,43,2963,1.45,0.109404796397
ref000001,285-G|286-T,38,2971,1.28,0.0987697454578
ref000001,99-C|286-T,31,2963,1.05,0.0838430015349

P_MotifFinder (Motif Analysis) Module

This module finds sequence motifs containing base modifications. The primary application is finding restriction-modification systems in prokaryotic genomes. P_MotifFinder analyzes the output of the P_ModificationDetection module.

Input:

  • modifications.csv: Contains one row for each (reference position, strand) pair that appeared in the dataset with coverage of at least x.

  • modifications.gff: Each template position/strand pair whose p-value exceeds the p-value threshold displays as a row.

Output:

  • data/motif_summary.csv: A summary of the detected motifs, as well as the evidence for motifs.

  • data/motifs.gff: A reprocessed version of modifications.gff (from P_ModificationDetection) containing motif annotations.

Parameters:

  • minScore Default value = 35 Only consider detected modifications with a Modification QV above this threshold.

P_GMAP Module

This module maps PacBio reads onto a reference as if they were cDNA, allowing for large insertions corresponding to putative introns.

The way SMRT Pipe currently computes accuracy is incompatible with these large gaps. As a result, P_GMAP does not report an accuracy histogram like the other alignment modules.

GMAP is a third-party tool, and requires that we build a GMAP-type database before running the tool. The time to build the database, as well as its size, are prohibitive, and not all references will need it. The database is built on the fly once, the first time the module is run against a reference. This results in an extended execution time.

Input:

  • input.fofn(base files): File containing the names of the raw input files used for the analysis.

  • data/filtered_regions.fofn

  • The path to a reference in the PacBio reference repository.

Sample params.xml file:

<?xml version="1.0" ?>
  <smrtpipeSettings>
    <protocol id="my_protocol">
      <param name="reference">
        <value>/data/references/my_reference</value>
        <select>
          <import contentType="text/xml" element="reference" filter="state='active' type='sample'" isPath="true" name="name" value="directory"> /data/references/index.xml</import>
        </select>
      </param>
    </protocol>
    <module id="P_GMAP" label="GMAP v1">
  </smrtpipeSettings>

Output:

  • /data/alignment_summary.gff
  • /data/aligned_reads.sam
  • /data/aligned_reads.cmp.h5
  • /results/gmap_quality.xml

P_Barcode Module

This module provides access to the pbbarcode command-line tools, which you use to identify barcodes in PacBio reads.

Input:

  • Complete barcode FASTA file: A standard FASTA file with barcodes less than 48 bp in length. Based on the score mode you specify, the barcode file might need to contain an even number of barcodes. Example:
<param name="barcode.fasta">
  <value>/mnt/secondary/Smrtpipe/martin/prod/data/workflows/barcode_complete.fasta</value>
</param>
  • Barcode scoring method: This directly relates to the particular sample preparation used to construct the molecules. Depending on the scoring mode, the barcodes are grouped together in different ways. Valid options are:

    • symmetric: Supports barcode designs with two identical barcodes on both sides of a SMRTbell™ template. Example: For barcodes (A, B), molecules are labeled as A--A or B--B.

    • paired: Supports barcode designs with two distinct barcodes on each side of the molecule, with neither barcode appearing without its mate. Minimum example: (ALeft, ARight, BLeft, BRight), where the following barcode sets are checked: ALeft--ARight, BLeft--BRight.

Example:

<param name="mode">
  <value>symmetric</value>
</param>
  • Pad arguments: Defines how many bases to include from the adapter, and how many bases to include from the insert. Ideally, this is 0 and 0. This produces shorter alignments; however, if the adapter-calling algorithm slips a little one might lose a little sensitivity and/or specificity because of this. Do not set these unless you have a compelling use case. Examples:
<param name="adapterSidePad">
   <value>2</value>
</param>

<param name="insertSidePad">
   <value>2</value>
</param>

Output:

  • /data/*.bc.h5: Barcode calls and their scores for each ZMW.

  • /data/barcode.fofn: Contains a list of files.

  • /data/aligned_reads.cmp.h5

P_AmpliconAssembly (Amplicon Analysis) Module

This module finds de novo phased consensus sequences from a pooled set of (possibly diploid) amplicons.

Input:

  • bas.h5 files

Output:

  • data/amplicon_analysis.fasta: A FASTA file containing the consensus sequences of each haplotype of each amplicon.

  • data/amplicon_analysis.fastq: A FASTQ file containing the consensus sequences and base-confidence of each haplotype of each amplicon.

  • data/amplicon_analysis.csv: A .csv file containing one row per base in each consensus sequence, with extra metadata about base quality and coverage.

Parameters:

  • minLength Default value = 1000 Only use subreads longer than this threshold. Should be set to ~75% of the shortest amplicon length.

  • minReadScore Default value = 0.78 Only use reads with a ReadScore higher than this value.

  • maxReads Default value = 2000 Use at most this number of reads to find results. Values greater than 10000 may cause long run times.

P_CCS (Reads of Insert) Module

This module computes Read of Insert/CCS sequences from single-molecule reads. It is used to estimate the length of the insert sequence loaded onto a SMRT® Cell. Reads of Insert replaces the Circular Consensus Sequencing (CCS) protocol, which has been moved off the primary analysis instrument.

Input:

  • bas.h5 files

Output:

  • data/<movie_name>.fasta: A FASTA file containing the consensus sequences of each molecule passing quality filtering.

  • data/<movie_name>.fastq: A FASTQ file containing the consensus sequences and base quality of each molecule passing quality filtering.

  • data/<movie_name>.ccs.h5: A ccs.h5 (similar to a bas.h5) file containing a representation of the CCS sequences and quality values.

Parameters:

Note: Use the default values to obtain the closest approximation of CCS as it existed on-instrument.

  • minFullPasses Default value = 2 The raw sequence must make at least this number of passes over the insert sequence to emit a CCS read for this ZMW.

  • minPredictedAccuracy Default value = 0.9 The minimum allowed value of the predicted consensus accuracy to emit a CCS read for this ZMW.

P_BridgeMapper Module (Beta)

This module creates split alignments of Pacific Biosciences' reads for viewing with SMRT View. The split alignments can be used to infer the presence of assembly errors or structural variation. P_BridgeMapper works by first using BLASR to get primary alignments for filtered subreads. Then, P_BridgeMapper calls BLASR again, mapping any portions of those subreads not contained in the primary alignments.

Input:

  • Filtered_subreads.fastq: A FASTQ file containing subreads that passed quality filters.

Output:

  • data/split_reads.bridgemapper.gz: A gzipped, tab-separated file of split alignments. This file is consumed by SMRT View.

Parameters:

  • minRootLength Default value = 250 Only consider subreads with primary alignments longer than this threshold.

  • minAffixLength Default value = 50 Only report split alignments with secondary alignments longer than this threshold.

SMRT Pipe Tools

Tools are programs that run as part of SMRT Pipe. A module, such as P_Mapping, can call several tools (such as the mapping tools summarizeCoverage.py or compareSequences.py) to actually perform the underlying processing.

All the tools are located at $SEYMOUR_HOME/analysis/bin.

Use the --help option to see usage information for each tool. (Some tools are undocumented.)

Building the SMRT Pipe tools manually, without SMRT Portal, SMRT View, or Kodos

It is currently not possible to build the SMRT Pipe tools without SMRT Portal, SMRT View, or Kodos.

SMRT Pipe File Structure

Note: The output of a SMRT Pipe analysis includes more files than described here; interested users should explore the file structure. Following are details about the major files.

 <jobID>/job.sh
  • Contains the SMRT Pipe command line call for the job.
<jobID>/settings.xml
  • Contains the modules (and their associated parameters) to be run as part of the SMRT Pipe run.
<jobID>/metadata.rdf
  • Contains all important metadata associated with the job. This includes metadata propagated from primary results, links to all reports and data files exposed to users, and high-level summary metrics computed during the job. The file is an entry point to the job by tools such as SMRT Portal and SMRT View. metadata.rdf is formatted as an RDF-XML file using OWL ontologies. See http://www.w3.org/standards/semanticweb/ for an introduction to Semantic Web technologies.
<jobID>/input.fofn:  File containing the file names of the raw input files used for the analysis.
  • This file (“file of file names”) is generated early during a job and contains the file names of the raw input files used for the analysis.
<jobID>/input.xml
  • Used to specify the input files to be analyzed in a job, and is passed on to the command line.
log/smrtpipe.log
  • Contains debugging output from SMRT Pipe modules. This is typically shown by way of the View Log button in SMRT Portal.

Data Files

The Data directory is where most raw files generated by the pipeline are stored. (Note: The following are example output files - for more details about specific files, see the sections dealing with individual modules.)

aligned_reads.cmp.h5, aligned_reads.sam, aligned_reads.bam
  • Mapping and consensus data from secondary analysis.
alignment_summary.gff
  • Alignment data summarized on sequence regions.
variants.gff.gz
  • All sequence variants called from consensus sequence.
toc.xml
  • Deprecated - The master index information for the job outputs is now included in the metadata.rdf file.

Results/Reports Files

Modules with Reports in their name produce HTML reports with static PNG images using XML+XSLT. These reports are located in the results subdirectory. The underlying XML document for each report is also preserved there; these can be useful files for data-mining the outputs of SMRT Pipe.

The Reference Repository

The reference repository is a file-based data store used by SMRT Analysis to manage reference sequences and associated information. The full description of all of the attributes of the reference repository is beyond the scope of this document, but you need to use some basic aspects of the reference repository in most SMRT Pipe analyses.

Example: Analysis of multi-contig references can only be handled by supplying a reference entry from a reference repository.

It is simple to create and use a reference repository:

  • A reference repository can be any directory on your system. You can have as many reference repositories as you wish; the input to SMRT Pipe is a fully resolved path to a reference entry, so this can live in any accessible reference repository.

Starting with the FASTA sequence genome.fasta, you upload the sequence to your reference repository using the following command:

referenceUploader -c -p/path/to/repository -nGenomeName
-fgenome.fasta

where:

  • /path/to/repository is the path to your reference repository.
  • GenomeName is the name to use for the reference entry that will be created.
  • genome.fasta is the FASTA file containing the reference sequence to upload.

For a large genome, we highly recommended that you produce the BLASR suffix array during this upload step. Use the following command:

referenceUploader -c -p/path/to/repository -nHumanGenome -fhuman.fasta --saw='sawriter -welter'

There are many more options for reference management. Consult the MAN page entry for referenceUploader by entering referenceUploader -h.

To learn more about what is being stored in the reference entries, look at the directory containing a reference entry. You will find a metadata description (reference.info.xml) of the reference and its associated files. For example, various static indices for BLASR and SMRT View are stored in the sequence directory along with the FASTA sequence.


For Research Use Only. Not for use in diagnostic procedures. © Copyright 2010 - 2013, Pacific Biosciences of California, Inc. All rights reserved. Information in this document is subject to change without notice. Pacific Biosciences assumes no responsibility for any errors or omissions in this document. Certain notices, terms, conditions and/or use restrictions may pertain to your use of Pacific Biosciences products and/or third party products. Please refer to the applicable Pacific Biosciences Terms and Conditions of Sale and to the applicable license terms at http://www.pacificbiosciences.com/licenses.html. P/N 001-353-082-05