SMRT Pipe Reference Guide v2.0

rhallPB edited this page May 14, 2013 · 13 revisions
Clone this wiki locally

Introduction

This document describes the underlying command-line interface to SMRT Pipe, and is for use by bioinformaticians working with secondary analysis results.

SMRT Pipe is Pacific Biosciences’ underlying analysis framework for secondary analysis functions. SMRT Pipe is a python-based general-purpose workflow engine. It is easily extensible, and supports logging, distributed computation, error handling, analysis parameters, and temporary files.

In a typical installation of the SMRT Analysis Software, the SMRT Portal web application calls SMRT Pipe when a job is started. SMRT Portal provides a convenient and user-friendly way to analyze Pacific Biosciences’ sequencing data through SMRT Pipe. Power users will find that there is more flexibility and customization available by instead running SMRT Pipe analyses from the command line.

  • The latest version of SMRT Pipe is available here.

  • SMRT Pipe can also be accessed using the Secondary Analysis Web Services API. For details, see Secondary Analysis Web Services API.

Note: Throughout this documentation, the path /opt/smrtanalysis is used to refer to the installation directory for SMRT Analysis (also known as $SEYMOUR_HOME). Replace this path with the path appropriate to your installation when using this document.

Installation

SMRT Pipe is installed as part of the SMRT Analysis software installation. For details, see SMRT Analysis Software Installation.

Using the Command Line

In a typical SMRT Analysis installation, SMRT Pipe is in your path after sourcing the setup.sh file. To do so, enter the following:

. /opt/smrtanalysis/etc/setup.sh

Note: Make sure to replace /opt/smrtanalysis with the path to your SMRT Analysis installation.

To check that SMRT Pipe is available, enter the following:

smrtpipe.py --help

This displays a help message describing how to run smrtpipe.py and all of the available command-line options.

You invoke SMRT Pipe with the following command:

smrtpipe.py [--help] [options] --params=settings.xml xml:inputFile

Logging messages are printed to stderr as well as a log file (log/smrtpipe.log). It is standard practice to pipe the stderr messages to a file using redirection in your shell, for example appending &> smrtpipe.err to the command line if running under bash.

Command Line Options

Following are some of the available options for invoking smrtpipe.py:

-D key=value
  • Overrides a configuration variable. Configuration variables are key-value pairs that are read from the global file smrtpipe.rc before starting an analysis. An example is the NPROC variable which controls the number of simultaneous processors to use during the analysis. To restrict SMRT Pipe to 4 processors, use -D NPROC=4.
--debug
  • Activates debugging output in the stderr and log outputs. To set this flag as a default, specify DEBUG=True in the smrtpipe.rc file.
--distribute
  • Distributes the computation across a compute cluster. For information on onfiguring SMRT Pipe for a distributed computation environment, see SMRT Analysis Software Installation.
--help
  • Displays information about command-line usage and options, and then exits.
--noreports
  • Turns off the production of XML/HTML/PNG reports.
--nohtml
  • Turns off the conversion of XML reports into HTML. (This conversion requires that Java be installed.)
--output=outputDir
  • Specifies a root directory to use for all SMRT Pipe outputs for this analysis. SMRT Pipe places outputs in this directory, as well as in data, results, and log subdirectories.
params=params.xml
  • Specifies a settings XML file for running the pipeline analysis. If this option is not specified, SMRT Pipe prints a message and then exits.
--totalCells
  • Specifies that if the number of cells in the job is less than totalCells, the job is not marked complete when it finishes. Data from additional cells will be appended to the outputs, until the number of cells reaches totalCells.
--recover
  • Attempts to rerun a SMRT Pipe analysis starting from the last successful stage. The same initial arguments should be specified in this case.
--version
  • Displays the version number of SMRT Pipe and then exits.
--kill
  • Kills a SMRT Pipe job running in the current directory. This works with output.

Specifying SMRT Pipe Inputs

The input file is an XML file specifying the sequencing data to process. Generally, you specify the inputs as URIs (Universal Resource Identifiers) which are resolved by code internal to SMRT Pipe. In practice, this is most useful to large enterprise users that have a data management scheme and are able to modify the SMRT Pipe code to include their own resolver.

The simpler way to specify inputs is to fully resolve the path to each input file, which as of version 2.0.0, is a bax.h5 file. (For more information, see bas.h5 Reference Guide at http://files.pacb.com/software/instrument/2.0.0/bas.h5%20Reference%20Guide.pdf)

The script fofnToSmrtpipeInput.py is provided to convert a FOFN (a "file of file names" file) to the input format expected by SMRT Pipe. If my_inputs.fofn looks like

/mnt/data/2770276/0006/Analysis_Results/m130512_050747_42209_c100524962550000001823079609281357_s1_p0.2.bax.h5
/mnt/data/2770276/0006/Analysis_Results/m130512_050747_42209_c100524962550000001823079609281357_s1_p0.3.bax.h5
/mnt/data/2770276/0006/Analysis_Results/m130512_050747_42209_c100524962550000001823079609281357_s1_p0.1.bax.h5

or, for SMRT Pipe versions <2.0.0:

/share/data/run_1/m100923_005722_00122_c15301919401091173_s0_p0.bas.h5
/share/data/run_2/m100820_063008_00118_c04442556811011070_s0_p0.bas.h5

then it can be converted to a SMRT Pipe input XML file by entering:

fofnToSmrtpipeInput.py my_inputs.fofn > my_inputs.xml

Following is the resulting XML file for SMRT Pipe version 2.0.0:

<?xml version="1.0"?>
<pacbioAnalysisInputs>
  <dataReferences>
    <url ref="run:0000000-0000"><location>/mnt/data/2770276/0006/Analysis_Results/m130512_050747_42209_c100524
962550000001823079609281357_s1_p0.2.bax.h5</location></url>
    <url ref="run:0000000-0001"><location>/mnt/data/2770276/0006/Analysis_Results/m130512_050747_42209_c100524
962550000001823079609281357_s1_p0.3.bax.h5</location></url>
    <url ref="run:0000000-0002"><location>/mnt/data/2770276/0006/Analysis_Results/m130512_050747_42209_c100524
962550000001823079609281357_s1_p0.1.bax.h5</location></url>
  </dataReferences>
</pacbioAnalysisInputs>

...and for SMRT Pipe versions <2.0.0:

<?xml version="1.0"?>
<pacbioAnalysisInputs>
 <dataReferences>
    <url ref="run:0000000-0000"><location>/share/data/
    /share/data/run_1 m100923_005722_00122_c15301919401091173_s0_p0.bas.h5
    <url ref="run:0000000-0001"><location>/share/data/
    /share/data/run_2/m100820_063008_00118_c04442556811011070_s0_p0.bas.h5
 </dataReferences>
</pacbioAnalysisInputs>

To run an analysis using these input files, use the following command:

smrtpipe.py --params=settings.xml xml:my_inputs.xml

The SMRT Pipe input format lets you specify annotations, such as job IDs, job names, and job comments, in a job-management environment. The fofnToSmrtpipeInput.py application has command-line options for setting these optional attributes.

Note: To get help for a script, execute the script with the --help option and no additional arguments. For example:

fofnToSmrtpipeInput.py --help

Specifying SMRT Pipe Parameters

The --params option is the most important SMRT Pipe option, and is required for any sophisticated use. The option specifies an XML file that controls:

  • The analysis modules to run.
  • The order of execution.
  • The parameters used by the modules.

The general structure of the settings XML file is as follows:

<?xml version="1.0"?>
<smrtpipeSettings>

<protocol>
...global parameters...
</protocol>

<module id="module_1">
...parameters...
</module>

<module id="module_2">
...parameters...
</module>

</smrtpipeSettings>
  • The protocol element allows setting global parameters that could possibly be used by all modules.
  • Each module element defines an analysis module to run.
  • The order of the module elements defines the order in which the modules execute.

SMRT Portal protocol templates are located in: $SEYMOUR_HOME/common/protocols/.

SMRT Pipe modules are located in: $SEYMOUR_HOME/analysis/lib/pythonx.x/pbpy-0.1-py2.7.egg/pbpy/smrtpipe/modules/.

You specify parameters by entering a key-value pair in a param element.

  • The name of the key is in the name attribute of the param element.
  • The value of the key is contained in a nested value element.

For example, to set the parameter named reference, you specify:

<param name="reference">
  <value>/share/references/repository/celegans</value>
</param>

Note: To reference a parameter value in other parameters, use the notation ${variable} when specifying a value. For example, to reference a global parameter named home, use it in other parameters as ${home}. SMRT Pipe supports arbitrary parameters in the settings XML file, so the use of temporary variables like this can help readability and maintainability.

Following is a complete example of a settings file for running filtering, mapping, and consensus steps against the E coli reference genome:

<?xml version="1.0" encoding="utf-8"?>
<smrtpipeSettings>
 <protocol>
  <param name="reference">
   <value>/share/references/repository/ecoli</value>
  </param>
 </protocol>

 <module name="P_Filter">
  <param name="minLength">
    <value>50</value>
  </param>
  <param name="readScore">
    <value>0.75</value>
  </param>
 </module>

 <module name="P_FilterReports" />

 <module name="P_Mapping">
  <param name="align_opts" hidden="true">
   <value>--minAccuracy=0.75 --minLength=50 -x </value>
  </param>
 </module>

 <module name="P_MappingReports" />
 <module name="P_Consensus" />
 <module name="P_ConsensusReports" />

</smrtpipeSettings>

SMRT Portal Protocols

Following are the secondary analysis protocols included in SMRT Analysis v2.0, with the SMRT Pipe module(s) called by each protocol. Many of these modules are described later in this document.

RS_AHA_Scaffolding
  • P_Filter
  • HybridAssembly
RS_ALLORA_Assembly
  • AlloraSFilter
  • Assembly
RS_ALLORA_Assembly_EC
  • AlloraSFIlter
  • Assembly
RS_CeleraAssembler
  • P_PacBioToCA
  • P_CeleraAssembler
RS_Filter_Only
  • P_Filter
RS_Minor_and_Compound_Variants
  • P_Filter
  • BLASR_Minor_and_Compound_Variants
  • P_CorrelatedVariants
RS_Modification_Detection
  • P_Filter
  • P_Mapping
  • P_GenomicConsensus
  • P_Modification Detection
RS_Modification_and_Motif_Analysis
  • P_Filter
  • P_Mapping
  • P_GenomicConsensus
  • P_MotifFinder
RS_PreAssembler
  • PreAssemblerSFilter
  • P_PreAssembler
RS_PreAssembler_Allora
  • PreAssemblerSFilter
  • AlloraWithPreAssembler
RS_Resequencing
  • P_Filter
  • P_Mapping
  • P_GenomicConsensus
RS_Resequencing_CCS
  • P_Filter
  • BLASR_De_Novo_CCS
  • GenomicConcensus_Plurality
RS_Resequencing_CCS_GATK
  • P_Filter
  • BLASR_De_Novo_CCS
  • P_GATKVC
RS_Resequencing_GATK
  • P_Filter
  • P_Mapping
  • P_GATKVC
RS_Resequencing_GATK_Barcode
  • P_Filter
  • BLASR_Barcode
  • P_GATKVC
RS_Site_Acceptance_Test
  • P_Filter
  • P_Mapping
  • P_GenomicConsensus
RS_cDNA_Mapping
  • P_Filter
  • P_GMAP
11k_Unrolled_Resequencing
  • P_Filter
  • BLASR_Unrolled
  • P_MotifFinder
  • P_Modification Detection
  • P_AnalysisHook
ecoliK12_RS_Resequencing
  • P_Filter
  • P_Mapping
  • P_GenomicConsensus
  • P_AnalysisHook
lambda_RS_Resequencing
  • P_Filter
  • P_Mapping
  • P_GenomicConcensus
  • P_AnalysisHook

SMRT Pipe Modules and their Parameters

Following is an overview of some of the common modules included in SMRT Pipe and their parameters. Not all modules or parameters are listed here.

Developers interested in even finer control should look inside the validateSettings method for each python analysis module. By convention, all of the settings known to the analysis module are referenced in this method.

Global Parameters

Global parameters are potentially used in multiple modules. In the SMRT Pipe internals, they are accessed in the “global” namespace. Following are some common global parameters:

reference
  • Specifies the name of a reference repository entry or FASTA file for mapping reads. Required for resequencing workflows.
  • Default value: None
control
  • Specifies the name of a reference repository entry or FASTA file for mapping spike-in control reads. Optional
  • Default value: None
use_subreads
  • Specifies whether to divide reads into subreads using the adapter region boundaries found by the primary analysis software. Optional
  • Default value: True
num_stats_regions
  • Specifies how many regions to use when reporting region statistics such as depth of coverage and variant density. Optional
  • Default value: 500

P_Fetch Module

This module fetches the input data and generates a file of the file names of the input .pls files for downstream analysis. This module has no exposed parameters.

Output:

  • pls.fofn (file of file names of the input .pls files)

P_Filter Module

This module filters and trims the raw reads produced by Pacific Biosciences’ primary analysis software. Options are available for taking the information found in the bas.h5 files and using this to pass reads and portions of reads forward.

Input:

  • bas.h5 files

Output:

  • data/filtering_summary.csv: Includes raw metrics and filtering information for each read (not subread) found in the original bas.h5 files.
  • rgn.h (one for each input bas.h5 file): Filtering information generated by the module.

Parameters:

  • minLength Reads with a high quality region read length below this threshold are filtered out. (Optional)

  • maxLength Reads with a high quality region read length above this threshold are filtered out. (Optional)

  • minSubReadLength Subreads shorter than this length are filtered out.

  • maxSubReadLength Subreads longer than this length are filtered out.

  • minSNR Reads with signal-to-noise ratio below this threshold are filtered out. (Optional)

  • readScore Reads with a high quality region (Read Quality) score below this threshold are filtered out. (Optional)

  • trim Default value = True, Specifies whether to trim reads to the high-quality region. (Optional)

  • artifact Reads with a read artifact score less than this (negative) number are filtered out. No number indicates no artifact filtering. Reasonable thresholds are typically between -1000 and -200. (Optional)

P_PreAssembler Module

This module takes as input long reads and short reads in standard formats, aligns the short reads to the long reads, and outputs a consensus from the preassembled short reads using the long reads as seeds. Note: You must run the P_Fetch and P_Filter modules before running P_PreAssembler to get meaningful results.

Input:

  • Long reads ("seed reads"): PacBio pls.h5/bas.h5 file(s) and optionally associated rgn.h5 file(s).
  • Short reads: Can be one of the following:
    • PacBio CCS pls.h5/bas.h5 file(s), without associated rgn.h5 file(s).
    • Arbitrary high-quality reads in FASTQ format, such as Illumina reads, without Ns.
    • PacBio pls.h5/bas.h5 file(s): The same reads as used for the long reads. This mode is the first step of HGAP (Hierarchical Genome Assembly Procedure.)
  • params.xml
  • input.xml

The module can run on bas.h5 files only, and on bas.h5 and FASTQ files. Following are sample XML inputs for both modes.

Sample input.xml,bas.h5-only input mode

  • Note: bas.h5 input files must have the suffix bas.h5.
<pacbioAnalysisInputs>
<?xml version="1.0"?>
<pacbioAnalysisInputs>
   <dataReferences>
      <url ref="run:0000000-0001">
   <location>
      /path/to/input.bas.h5
   </location>
</url>
   </dataReferences>
</pacbioAnalysisInputs>

Sample params.xml, bas.h5-only input mode

  • This XML parameter file was tested on 90X short reads and 24X long reads.
<module name="P_PreAssembler">
   <param name="useFastqAsShortReads">
     <value>False</value>
   </param>
   <param name="useFastaAsLongReads">
     <value>False</value>
   </param>
   <param name="useLongReadsInConsensus">
     <value>False</value>
   </param>
   <param name="useUnalignedReadsInConsensus">
     <value>False</value>
   </param>
   <param name="useCCS">
     <value>False</value>
   </param>
   <param name="minLongReadLength">
     <value>5000</value>
   </param>
   <param name="blasrOpts">
     <value> -minReadLength 200 -maxScore -1000 -bestn 24 -maxLCPLength 16 -nCandidates 24 </value>
   </param>
   <param name="consensusOpts">
     <value> -L </value>
   </param>
   <param name="layoutOpts">
     <value> --overlapTolerance 100 --trimHit 50 </value>
   </param>
   <param name="consensusChunks">
     <value>60</value>
   </param>
   <param name="trimFastq">
     <value>True</value>
   </param>
   <param name="trimOpts">
     <value> --qvCut=59.5 --minSeqLen=500 </value>
   </param>
</module>

Sample params.xml (bas.h5-only input mode, CCS)

<?xml version="1.0" ?>
<smrtpipeSettings>
  <module name="P_Fetch"/>
  <module name="P_PreAssembler">
   <param name="useFastqAsShortReads">
     <value>False</value>
   </param>
   <param name="useFastaAsLongReads">
     <value>False</value>
   </param>
   <param name="useLongReadsInConsensus">
     <value>False</value>
   </param>
   <param name="useUnalignedReadsInConsensus">
     <value>False</value>
   </param>
   <param name="blasrOpts">
     <value>-advanceHalf -noSplitSubreads -ignoreQuality -minMatch 10 -minPctIdentity 70 -bestn 20</value>
   </param>
   <param name="layoutOpts">
     <value>--overlapTolerance=25</value>
   </param>
</module>
</smrtpipeSettings>

Sample input.xml (FASTQ and bas.h5 input mode)

  • This parameter XML file was tested on 50X 100bp Illumina® reads correcting 15X PacBio long reads.
<?xml version="1.0"?>
<pacbioAnalysisInputs>
   <dataReferences>
     <url ref="run:0000000-0001">
   <location>
     /path/to/input.bas.h5
   </location>
   </url>
     <url ref="fastq:/path/to/input.fastq"/>
   </dataReferences>
</pacbioAnalysisInputs>

Sample params.xml (FASTQ and bas.h5 input mode)

<?xml version="1.0" ?>
<smrtpipeSettings>
  <module name="P_Fetch"/>
  <module name="P_Filter">
    <param name="filters">
       <value>MinRL=1000,MinReadScore=0.80</value>
    </param>
    <param name="artifact">
       <value>-1000</value>
    </param>
  </module>
  <module name="P_PreAssembler">
    <param name="useFastqAsShortReads">
       <value>True</value>
    </param>
    <param name="useFastaAsLongReads">
       <value>False</value>
    </param>
    <param name="useLongReadsInConsensus">
       <value>False</value>
    </param>
    <param name="useUnalignedReadsInConsensus">
       <value>False</value>
    </param>
    <param name="blasrOpts">
       <value>-minMatch 8 -minReadLength 30 -maxScore -100 -minPctIdentity 70 -bestn 100</value>
    </param>
    <param name="layoutOpts">
       <value>--overlapTolerance=25</value>
    </param>
    <param name="consensusOpts">
       <value>-w 2</value>
    </param>
</module>
</smrtpipeSettings>

Output:

  • corrected.fasta,corrected.fastq: FASTQ or FASTA file of corrected long reads.
  • idmap.csv: csv file mapping corrected long read ids to original read ids

P_Mapping (BLASR) Module

This module aligns reads against a reference sequence, possibly a multi-contig reference. If the P_Filter module is run first, then only the reads which passed filtering are aligned.

Output:

  • data/aligned_reads.cmp.h5: The pairwise alignments for each read.
  • data/alignment_summary.gff: Summary information.

Parameters:

  • align_opts Default value = Empty string, Passes options to the underlying compareSequences.py script. (Optional)

  • --useCcs= Default value = None, A parameter sent to the underlying compareSequences.py script via the align_opts parameter value above. Values are {denovo|fullpass|allpass}. (Optional)

    • denovo: Maps just the de novo called sequence and report. (Does not include quality values.)

    • fullpass: Maps the de novo called sequence, then aligns full passes to the sequence that the de novo called sequence aligns to.

    • allpass: Maps the de novo called sequence, then aligns all passes (even ones that don't span the length of the template) to the sequence the de novo called sequence aligned to.

  • load_pulses: Default value = True, Specifies whether to load pulse metric information into the cmp.h5 file. (Optional)

  • maxHits: Default value = None, Attempts to find sub-optimal alignments and report up to this many hits per read. (Optional)

  • minAnchorSize: Default value = None, Ignores anchors smaller than this size when finding candidate hits for dynamic programming alignment. (Optional)

  • maxDivergence: Default value = None, Specifies maximum divergence between read and reference to allow a mapping. Divergence = (1 - accuracy).

  • output_bam: Default value = False, Specifies whether to output a BAM representation of the cmp.h5 file. (Optional)

  • output_sam: Default value = False, Specifies whether to output a SAM representation of the cmp.h5 file. (Optional)

  • gff2Bed: Default value = False, Specifies whether to output a BED representation of the depth of coverage summary. (Optional)

P_GenomicConsensus (Quiver) Module

This module takes the alignments generated by the P_Mapping module and calls the consensus sequence across the reads.

Input:

  • data/aligned_reads.cmp.h5: The pairwise alignments for each read.

  • data/alignment_summary.gff: Summary information.

Output:

  • data/aligned_reads.cmp.h5

  • data/variants.gff.gz: A gzipped GFF3 file containing variants versus the reference.

  • data/consensus.fastq.gz: The consensus sequence in FASTQ format.

  • data/alignment_summary.gff, data/variants.vc: Useful information about variants.

Parameters:

  • makeBed: Default value = True, Specifies whether to output a BED representation of the variants. (Optional)

  • makeVcf: Default value = True, Specifies whether to output a VCF representation of the variants. (Optional)

P_AnalysisHook Module

This module allows you to call executable code as part of a SMRT Pipe analysis. P_AnalysisHook can be called multiple times in a settings XML file, allowing for an arbitrary number of calls to external (non-SMRT Pipe) code.

Parameters:

  • scriptDir: Default value = None, All executables in this directory are called serially with the command line exeCmd jobDir, where jobDir is the root of the SMRT Pipe output for this analysis. (Optional)

  • script: Default value = None, Path to an executable called with the command line exeCmd jobDir, where jobDir is the root of the SMRT Pipe output for this analysis. (Optional)

Assembly (Allora Assembly) Module

This module takes the trimmed reads which pass filtering and attempts to assemble them into contiguous sequences (contigs).

Input:

  • input.xml with bas.h5

  • You can also run Allora on raw FASTA or FASTQ files using the following syntax: smrtpipe.py --params=settings.xml input.fast[a|q]

Sample settings.xml file:

<?xml version="1.0" ?>
<smrtpipeSettings>
  <module id="Assembly" label="Allora v1">
   <param label="Overlap permissiveness" name="overlapScoreThreshold">
    <title>
      Overlap permissiveness affects how strictly or permissively reads and contigs will be joined.
    </title>
    <value>700</value>
    <select>
     <option value="1300">Least permissive</option>
     <option value="1000">Less permissive</option>
     <option value="700">Normal</option>
     <option value="500">More permissive</option>
     <option value="300">Most permissive</option>
    </select>
   </param>
   <param label="Expected genome size (bp)" name="genomeSize">
    <title>
      The expected genome size helps to estimate coverage and can lead to more accurate assembly.
    </title>
    <value>50000000</value>
    <rule message="Value must be positive" min="1" type="number"/>
   </param>
   <param label="Minimum number of iterations" name="minIterations">
    <title>
     The minimum number of iterations before the algorithm halts.
    </title>
    <value>1</value>
    <rule max="1000" message="Value must be an integer between 0 and 1000" min="0" type="digits"/>
   </param>
   <param label="Maximum number of iterations" name="maxIterations">
    <title>
     The maximum number of iterations before the algorithm halts.
    </title>
    <value>10</value>
    <rule max="1000" message="Value must be an integer between 0 and 1000" min="0" type="digits"/>
   </param>
   <param name="detectChimeras">
    <title>
     Whether to detect chimeras using all-vs-all read comparison.
    </title>
    <value>True</value>
   </param>
   <param name="detectChimerasOptions">
    <value>--detector=Iterative:threshold=2</value>
   </param>
   <param name="trimLayouts">
    <value>False</value>
   </param>
   <param label="Write an ACE output file." name="outputAce">
    <value>False</value>
   </param>
   <param label="Write an AMOS bank directory." name="outputBank">
    <value>True</value>
   </param>
   <param hidden="true" name="autoParameters">
    <value>True</value>
   </param>
  </module>
</smrtpipeSettings>

Output:

  • data/assembled.fsta: A FASTA file containing the assembled contig consensus sequences.
  • data/assembled_reads.cmp.h5: The pairwise alignments for each read against its assembled contig consensus.
  • data/assembled_summary.gff: Summary information about each of the contigs.
  • data/assembled.ace: The assembly, in ACE format. (Optional)
  • data/assembled.bnk.tar.gz: The assembly, as a compressed AMOS bank. (Optional)

Parameters:

  • overlapScoreThreshold: Default value = 700, The score threshold for accepting an overlap between two reads. The suggested range is between 300 and 1500, with 700 being a typical threshold. (Optional)

  • genomeSize: Default value = 100,000 bases, A rough estimate of the expected size of this genome. This provides an estimate of expected coverage and modulates parameters based on genome size. (Optional, but strongly recommended.)

  • maxIterations: Default value = 10, Specifies the maximum number of iterations for progressive assembly. (Optional)

  • minIterations: Default value = 4, Specifies the minimum number of iterations for progressive assembly. (Optional)

  • outputAce: Default value = False, Specifies whether to output an ACE representation of the assembly. (Optional)

  • outputBank: Default value = False, Specifies whether to output an AMOS representation of the assembly. (Optional)

  • autoParameters: Default value = False, Specifies whether alignment parameters should be automatically set by the module. Note: You can sometimes improve the accuracy of contigs output from the module by feeding these contigs to the Resequencing protocol as a reference sequence, and re-aligning the reads to this reference. The resulting consensus will often improve on the original de novo assembly accuracy.

HybridAssembly (AHA Scaffolding) Module

This module scaffolds high-confidence contigs, such as those from Illumina® data, using Pacific Biosciences’ long reads.

Input:

  • settings.xml: Specifies the parameters to run.
  • input.xml: Specifies the inputs.

HybridAssembly.py uses two kinds of input instead of one:

  • A FASTA file of high-confidence sequences to be scaffolded. These are typically contigs assembled from Illumina® short-read sequence data.

  • Pacific Biosciences’ long reads, in HDF5 or FASTA format. These are used to join the high-confidence contigs into a scaffold.

Sample input.xml file:

<?xml version="1.0"?>
<pacbioAnalysisInputs>
<dataReferences>
<!-- High-confidence sequences fasta file -->
<url ref="assembled_contigs:test_contigs.fsta"/>
<!-- PacBio reads, either in fasta or in bas.h5 format. -->
<url ref="file:test_reads.fsta" />
</dataReferences>
</pacbioAnalysisInputs>

Sample settings.xml file for long reads, with only customer-facing parameters:

<?xml version="1.0"?>
<smrtpipeSettings>

  <!-- HybridAssembly 1.2.0 parameter file for long reads -->
  <module name="HybridAssembly">

  <!-- General options -->
  <!-- Parameter schedules are used for iterative hybrid assembly. 
They are given in comma delimited tuples separate by semicolons. The fields in order are:
- Minimum alignment score (aka Z-score). Higher is more stringent.
- Minimum number of reads needed to link two contigs. (Redundancy)
- Minimum subread length to participate in alignment.
- Minimum contig length to participate in alignment.
If a tuple contains less than 4 fields, defaults will be used for the remaining fields. -->

  <paramSchedule>6,3,75;6,3,75;6,2,75;6,2,75</paramSchedule>

<!-- Untangling occurs after the main scaffolding step. 
Valid values are "bambus" and "pacbio" (recommended and the default). -->
  <untangler>pacbio</untangler>

<!-- Gap fillin can be turned on by setting to True or off by setting to False -->
  <fillin>False</fillin>

<!-- These options allow long reads -->
  <longReadsAsStrobe>True</longReadsAsStrobe>
  <blasrOpts>-minMatch 10 -minPctIdentity 70 -bestn 10 -noSplitSubreads</blasrOpts>

<!-- Parallelization options -->
  <numberProcesses>16</numberProcesses>
  </module>
</smrtpipeSettings>

Output:

  • data/scaffold.gml: A GraphML file that contains the final scaffold. This file can be readily parsed in python using the networkx package.

  • data/scaffold.fasta: A FASTA file with a single entry for each scaffold.

To run HybridAssembly.py, enter the following:

smrtpipe.py --params=settings.xml xml:input.xml >& smrtpipe.err

Parameters:

  • paramSchedule: Default value = None, Specifies parameter schedules used for iterative hybrid assembly. Schedules are in comma-delimited tuples, separated by semicolons. Example: 6,3,75;6,3,75;6,2,75;6,2,75. The fields, in order, are:

    • Minimum alignment score. Higher is more stringent.
    • Minimum number of reads needed to link two contigs. (Redundancy)
    • Minimum subread length to participate in alignment.
    • Minimum contig length to participate in alignment.
  • untangler: Default value = pacbio, Untangling occurs after the main scaffolding step. Valid values are bambus and pacbio. (Recommended)

  • fillin: Default value = False, Specifies whether to use long reads.

  • blasrOpts: Default value = -minMatch 10 -minPctIdentity 60 -bestn 10 -noSplitSubreads, Options passed directly to BLASR for aligning reads to contigs.

  • maxIterations: Default value = 6, Specifies the maximum number of iterations to use from paramSchedule. If paramSchedule has more than maxIterations, it will be truncated at maxIterations. If paramSchedule has less than maxIterations, the last iteration of paramSchedule is repeated.

  • cleanup: Default value = True, Specifies whether to clean up intermediate files. This can be useful for debugging purposes.

  • runNucmer: Default value = True, Specifies whether to use Nucmer to detect repeat locations. This can improve assemblies, but can be very slow on large highly repetitive genomes.

  • maxContigsLength: Default value = 170000000, Specifies the maximum total length of contigs. Note: Overriding this value could be risky.

  • maxNumContigs: Default value = 20000, Specifies the maximum number of contigs. Note: Overriding this value could be risky.

  • maxNumReads: Default value = 2000000, Specifies the maximum number of reads. Note: Overriding this value could be risky.

  • gapFillOpts: Default value = “”, Options to be passed directly to gapFiller.py.

  • noScaffoldImages: Default value = True, Specifies not producing SVG files of the scaffolds. Creating these files can be expensive for large assemblies, but is recommended for small assemblies.

Known Issues

  • Depending on the repetitive content of the high-confidence input contigs, a large fraction of the sequence in the contigs can be called repeats. To avoid this, turn off the split repeats step by setting the minimum repeat identity to a number greater than 100, for example:
<minRepeatIdentity>1000</minRepeatIdentity>

P_GATKVC (GATK Unified Genotyper) Module

This module wraps the Broad Institute's GATK Unified Genotyper for Bayesian diploid and haploid SNP calling, using base quality score recalibration and default settings. The module calls both homozygous and heterozygous SNPs.

We recommend that you use a dbSNP file as a prior for base quality score recalibration. By default, the P_GATKVC (GATK Unified Genotyper) module uses a null prior. To use a prior, see the script vcfUploader.py, included with the SMRT Analysis installation.

Note: Indel calling and other options are not currently supported through SMRT Pipe.

P_Modification Detection Module

This module uses the cmp.h5 output by the P_Mapping module to:

  1. Compare observed IPDs in the cmp.h5 file at each reference position on each strand with control IPDs. Control IPDs are supplied by either an in-silico computational model, or observed IPDs from unmodified “control” DNA.

  2. Generate modifications.csv and modifications.gff reporting statistics on the IPD comparison.

Predicted Kinetic Background Control vs Case-Control Analysis

By default, the control IPDs are generated per-base of the reference with an in-silico model of the expected IPD values for each position, based on sequence context. The computational model is called the Predicted IPD Background Control. Even in normal unmodified DNA, the IPD at any particular point will vary. Internal studies at Pacific Biosciences show that most of the variation in mean IPD across a genome can be predicted from a 12-base sequence context surrounding the active site of the DNA polymerase. The bounds of the relevant context window correspond to the window of DNA in contact with the polymerase, as seen in DNA/polymerase crystal structures. The module includes a pre-trained lookup table mapping 12-mer DNA sequences to mean IPDs observed in PacBio® data.

Filtering and Trimming

Some PacBio data features require special attention for good modification detection performance. The module inspects the alignment between the observed bases and the reference sequence. For an IPD measurement to be included in the analysis, the read sequence must match the reference sequence for K around the cognate base; currently, K = 1. The IPD distribution at some locus can be seen as a mixture of the “normal” incorporation process IPD (sensitive to the local sequence context and DNA modifications) and a contaminating “pause” process IPD which has a much longer duration (the mean is >10x longer than normal), but rarely happens (~1% of IPDs).

Pauses are defined as pulses with an IPD >10x longer than the mean IPD at that context. Heuristics are used to filter out the pauses.

Statistical Testing

The module tests the hypothesis that IPDs observed at a particular locus in the sample have longer means than IPDs observed at the same locus in unmodified DNA. If a Whole-Genome-Amplified dataset is generated, which removes DNA modifications, the module uses a case-control, two-sample t-test.

The module also provides a pre-calibrated Predicted Kinetic Background Control model which predicts the unmodified IPD, given a 12-base sequence context. In that case, the module uses a one-sample t-test, with an adjustment to account for error in the control model.

Input:

  • aligned_reads.cmp.h5: A standard cmp.h5 file with alignments and IPD information that supplies the kinetic data for modification detection.

  • Reference Sequence: The path to a SMRT Portal reference repository entry for the reference sequence used to perform alignments.

Output:

  • modifications.csv: Contains one row for each (reference position, strand) pair that appeared in the dataset with coverage of at least x. (x defaults to 3, but is configurable using the ipdSummary.py –minCoverage flag.) The reference position index is 1-based for compatibility with the GFF file in the R environment.

  • modifications.gff: Each template position/strand pair whose p-value exceeds the p-value threshold displays as a row. (The default threshold is p=0.01 or score=20.) The file is compliant with the GFF version 3 specification, and the template position is 1-based, per the GFF specification. The strand column refers to the strand carrying the detected modification, which is the opposite strand from those used to detect the modification.

The auxiliary data column of the GFF file contains other statistics useful for downstream analysis or filtering. This includes the coverage level of the reads used to make the call, and +/- 20 bp sequence context surrounding the site.

Results are generally indexed by reference position and reference strand. In all cases, the strand value refers to the strand carrying the modification in the DNA sample. The kinetic effect of the modification is observed in read sequences aligning to the opposite strand, so reads aligning to the positive strand carry information about modification on the negative strand and vice versa. The module always reports the strand containing the putative modification.

Parameters

  • identifyModifications: Default value = False, Specifies whether to use a multi-site model to identify the modification type.

  • tetTreated: Default value = False, Specifies whether the sample was TET-treated to amplify the signal of m5C modifications.

RS_Celera Assembler Workflow

This workflow (comprised of the P_PacBioToCA and P_CeleraAssembler modules) wraps the Celera® Assembler’s error correction and assembly programs.

For full documentation of pacBioToCA and the Celera Assembler, see http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Main_Page.

The error correction may be run with external high confidence reads, such as those from Illumina® data, or from internally generated CCS reads.

Input:

  • settings.xml: Specifies the parameters.

  • input.xml: Specifies the inputs.

Sample settings.xml file:

<module id="P_PacBioToCA" label="PacBioToCA v1" editableInJob="true">
  <description>This module wraps pacBioToCA, the error correction pipeline of Celera Assembler v7.0</description>

If useCCSFastq is set to True, CSS reads are used as the high-confidence reads:

  <param name="useCCSFastq" label="Correct With CCS">
     <title>Use CCS reads to correct long reads</title>
     <value>True</value>
     <input type="checkbox" />
  </param>

To use a FASTQ file, set the value of shortReadFastqA to the full path.

  <param name="shortReadFastqA" label="FASTQ to Correct With">
     <title>(Optional) FASTQ file of reads to correct long reads with </title>
     <input type="text" />
     <value></value>
     <rule remote="api/protocols/resource-exists?paramName=shortReadFastqA" required="false" message="File does not exist" />
  </param>

To use a FASTQ file, set the value of shortReadTechnology to the correct quality encoding value: sanger, 454, or illumina. (Illumina® encoding is shown in the example.)

  <param name="shortReadTechnology" label="FASTQ Read Type">
     <title>Sequencing platform used to generate the FASTQ file, if specified</title>
     <value>illumina</value>
     <select>
       <option value="sanger">~1kb (e.g., Sanger and PacBio CCS)</option>
       <option value="454">~600bp (e.g., 454)</option>
       <option value="illumina">~100bp (e.g., Illumina)</option>
     </select>
  </param>

To use a FASTQ file, set the value of shortReadType to the correct quality encoding value: sanger, solexa, or illumina. (Illumina® encoding is shown in the example.)

  <param name="shortReadType" label="FASTQ Quality Value Encoding">
     <title>ASCII encoding of the quality values in the FASTQ file, if specified</title>
     <value>illumina</value>
     <select>
       <option value="sanger">Phred+33 (e.g., Sanger and PacBio fastq)
       </option>
       <option value="solexa">Solexa+64 (e.g., Solexa fastq)</option>
       <option value="illumina">Phred+64 (e.g., Illumina fastq)</option>
     </select>
  </param>

  <param name="pbReadMinLength" label="Min fragment length">
     <title>Minimum length of PacBio RS fragment to keep.</title>
     <input type="text" />
     <value>1000</value>
     <rule type="digits" message="Value must be an integer" required="true" />
  </param>

  <param name="specInPacBioToCA" label="Pre-defined spec file">
    <title>Enter the server path to an existing spec file</title>
    <input type="text" />
    <rule remote="api/protocols/resource-exists?paramName=specInPacBioToCA" required="false" message="File does not exist" />
  </param>
</Module>

<module id="P_CeleraAssembler" label="CeleraAssembler v1" editableInJob="true">
   <description>This module wraps the Celera Assembler v7.0</description>

  <param name="genomeSize" label="Genome Size (bp)">
     <title>Approximate genome size in base pairs</title>
     <value>5000000</value>
     <input type="text" />
     <rule type="digits" message="Must be a value between 1 and 200000000" min="1"
     required="true" max="200000000" />
  </param>

  <param name="defaultFrgMinLen" hidden="true">
    <input type="text" />
    <value>1000</value>
  </param>

  <param name="xCoverage" label="Target Coverage">
    <title>Fold coverage to target when picking frgMinLen for assembly. Typically 15 to 25.</title>
    <input type="text" />
    <value>15</value>
    <rule type="digits" message="Value must be an integer between 10 and 30, inclusive” min="10" max="30" />
  </param>

  <param name="ovlErrorRate" label="Overlapper error rate">
    <title>Overlapper error rate</title>
    <input type="text" />
    <value>0.015</value>
    <rule type="number" message="Value must be numeric" />
  </param>

  <param name="ovlMinLen" label="Overlapper min length">
    <title>Overlaps shorter than this length are not computed.</title>
    <input type="text" />
    <value>40</value>
    <rule type="digits" message="Value must be an integer" />
  </param>

  <param name="specInRunCA" label="Pre-defined spec file">
    <title>Enter the server path to an existing spec file</title>
    <input type="text" />
    <rule remote="api/protocols/resource-exists?paramName=specInRunCA" required="false" message="File does not exist" />
  </param>

 </module>
</smrtpipeSettings>

Output:

Note: Some of the worklflow outputs are produced by the Celera Assembler, and some by Pacific Biosciences’ software.

  • data/runCa.spec: The specification file used to run the assembly program. The P_CeleraAssembler module auto-generates the specification file based on the input data and selected parameters. Alternatively, you can provide an explicit specification file.

  • data/pacBioToCA.spec: The specification file used to run the error correction program. The P_PacBioToCA module auto-generates the specification file based on the input data and selected parameters. Alternatively, you can provide an explicit specification file.

  • data/celera-assembler.asm: The official output of Celera Assembler’s assembly program.

  • data/assembled_reads.cmp.h5: The pairwise alignment for each read against its assembled contig consensus.

  • data/assembled_summary.gff.gz: Summary information about each of the contigs.

  • data/castats.txt: Assembly statistics report.

To run the error correction and assembly modules, enter the following:

smrtpipe.py --params=settings.xml xml:input.xml >& smrtpipe.err

P_CorrelatedVariants (Minor and Compound Variants) Module

This module calls and correlates rare variants from a sample and provides support for determining whether or not sets of mutations are co-located. Note: This only includes SNPs, not indels.

The module takes high-coverage CCS reads that are aligned without quality scores to a similar reference. The module requires the following:

  • CCS reads only.
  • High coverage (at least 500x).
  • Alignment to reference without using quality scores.
  • The sample cannot be highly divergent from the reference.

The algorithm uses simple column counting and a plurality call. While it work well with higher depths (> 500x), it does suffer from reference bias, systematic alignment error and sizeable divergence from the reference.

Variants may not only coexist on the same molecule, but they may also be coselected for; that is, inherited together. This algorithm attempts to define a measurable relationship between a set of co-located variants found with some significance on a set of reads.

  1. The variant information is read from the GFF input file, then the corresponding cmp.h5 file is searched for reads that cover the variant. Reads that contain the variant are tracked and later assigned any other variants they contain, building a picture of the different haplotypes occuring within the read sets.

  2. The frequencies and coverage values for each haplotype are computed. These values will likely deviate (to the downside) from those found in the GFF file as the read set is constrained by whether or not they completely span the region defined by the variant set. Only reads that cover all variants are included in the frequency and coverage calculation.

  3. The frequency and coverage values are used to calculate an observed probability of each permutation within the variant set. These probabilities are used to compute the Mutual Information score for the set. Frequency informs mutual information, but does not define it. It is possible to have a lower frequency variant set with a higher mutual information score than a high frequency one.

Input:

  • A GFF file containing CCS-based variant calls at each position including read information: ID, start, and stop. (start and stop are in (+) strand genomic coordinates.)

  • A cmp.h5 alignment file aligned without quality, and with a minimum accuracy of 95%.

  • (Optional) score: Include the mutual information score in the output. (Default: Don't include.)

  • (Optional)out: The output file name. (Default: Output to screen)

Output:

  • data/rare_variants.gff(.gz): Contains rare variant information, accessible from SMRT Portal.

  • data/correlated_variants.gff: Accessible from SMRT Portal.

  • results/topRareVariants.xml: Viewable report based on the contents of the GFF file.

  • results/topRareVariants.xml: Viewable report based on the contents of the GFF file.

  • CSV file containing the location and count of co-variants. Example:

ref,haplotype,frequency,coverage,percent,mutinf
ref000001,285-G|297-G,133,2970,4.48,0.263799623501
ref000001,285-T|286-T,128,2971,4.31,0.256253924909
ref000001,285-G|406-G,103,2963,3.48,0.217737973781
ref000001,99-C|285-G,45,2963,1.52,0.113489812305
ref000001,286-T|406-G,43,2963,1.45,0.109404796397
ref000001,285-G|286-T,38,2971,1.28,0.0987697454578
ref000001,99-C|286-T,31,2963,1.05,0.0838430015349

P_MotifFinder (Motif Analysis) Module

This module finds sequence motifs containing base modifications. The primary application is finding restriction-modification systems in prokaryotic genomes. P_MotifFinder analyzes the output of the P_ModificationDetection module.

Input:

  • modifications.csv: Contains one row for each (reference position, strand) pair that appeared in the dataset with coverage of at least x.

  • modifications.gff: Each template position/strand pair whose p-value exceeds the p-value threshold displays as a row.

Output:

  • data/motif_summary.csv: A summary of the detected motifs, as well as the evidence for motifs.

  • data/motifs.gff: A reprocessed version of modifications.gff (from P_ModificationDetection) containing motif annotations.

Parameters:

  • minScore Default value = 35 Only consider detected modifications with a Modification QV above this threshold.

P_GMAP Module

This module maps PacBio reads onto a reference as if they were cDNA, allowing for large insertions corresponding to putative introns.

The way SMRT Pipe currently computes accuracy is incompatible with these large gaps. As a result, P_GMAP does not report an accuracy histogram like the other alignment modules.

GMAP is a third-party tool, and requires that we build a GMAP-type database before running the tool. The time to build the database, as well as its size, are prohibitive, and not all references will need it. The database is built on the fly once, the first time the module is run against a reference. This results in an extended execution time.

Input:

  • input.fofn(base files): Names of the raw input files used for the analysis.

  • data/filtered_regions.fofn

  • The path to a reference in the PacBio reference repository.

Sample params.xml file:

<?xml version="1.0" ?>
  <smrtpipeSettings>
    <protocol id="my_protocol">
      <param name="reference">
        <value>/data/references/my_reference</value>
        <select>
          <import contentType="text/xml" element="reference" filter="state='active' type='sample'" isPath="true" name="name" value="directory"> /data/references/index.xml</import>
        </select>
      </param>
    </protocol>
    <module id="P_GMAP" label="GMAP v1">
  </smrtpipeSettings>

Output:

  • /data/alignment_summary.gff
  • /data/aligned_reads.sam
  • /data/aligned_reads.cmp.h5
  • /results/gmap_quality.xml

P_Barcode Module

This module provides access to the pbbarcode command-line tools, which you use to identify barcodes in PacBio reads.

Input:

  • Complete barcode FASTA file: A standard FASTA file with barcodes less than 48 bp in length. Based on the score mode you specify, the barcode file might need to contain an even number of barcodes. Example:
<param name="barcode.fasta">
  <value>/mnt/secondary/Smrtpipe/martin/prod/data/workflows/barcode_complete.fasta</value>
</param>
  • Barcode scoring method: This directly relates to the particular sample preparation used to construct the molecules. Valid options are:

    • symmetric: Supports barcode designs with two identical barcodes on both sides of a SMRTbell™ template. Example: For barcodes (A, B), molecules are labeled as A--A or B--B.

    • asymmetric: Supports barcode designs with different barcodes on each side of a molecule, with no constraints on which barcodes must appear together. Example: For barcodes (A,B,C), the following barcode sets are checked, A--B, A--C, and B--C. There is no orientation, hence there is no difference between A--B and B--A; A--B is arbitrarily chosen because A appears before B in the barcode list.

    • paired: Each set of two barcodes, (1,2; 3,4, ... (2n-1, n)) are considered as pairs. In contrast, the barcodes, (1,2,3,4,5, ..., n) are considered separate and aligned alone yielding a n*(n-1)/2 possible barcode labels.

Example:

<param name="mode">
  <value>symmetric</value>
</param>
  • Pad arguments: Defines how many bases and to include from the adapter and how many bases to include from the insert. Ideally, this is 0 and 0. This produces shorter alignments; however, if the adapter-calling algorithm slips a little one might lose a little sensitivity and/or specificity because of this. Do not set these unless you have a compelling use case. Examples:
<param name="adapterSidePad">
   <value>2</value>
</param>

<param name="insertSidePad">
   <value>2</value>
</param>

Output:

  • /data/*.bc.h5: Barcode calls and their scores for each ZMW.

  • /data/barcode.fofn: Contains a list of files.

  • /data/aligned_reads.cmp.h5

SMRT Pipe Tools

Tools are programs that run as part of SMRT Pipe. A module, such as P_Mapping, can call several tools (such as the mapping tools summarizeCoverage.py or compareSequences.py) to actually perform the underlying processing.

All the tools are located at $SEYMOUR_HOME/analysis/bin.

Use the --help option to see usage information for each tool. (Some tools are undocumented.)

Building the SMRT Pipe tools manually, without SMRT Portal, SMRT View, or Kodos

Requirements:

BOOST C++ Libraries (http://www.boost.org/); usually located at /usr/include if it came with the Operating System. For this procedure, we assume that the BOOST Library is located here: /usr/local/boost_1_47_0:

$ export BOOST=/usr/local/boost_1_47_0

To prepare the source files:

Assume that the user saved the source tarball at $HOME/Downloads/smrtpipe-sources-2.0.0.tar.gz:

$ tar zxf $HOME/Downloads/smrtpipe-sources-2.0.0.tar.gz -C $HOME

The file structure should look like this before starting to build the software:

├── assembly
│   ├── build
│   ├── cpp
│   ├── java
│   ├── papers
│   ├── pbpy
│   ├── seymour
│   ├── smrtpipe-doc
│   └── third-party
├── bioinformatics
│   ├── doc
│   ├── lib
│   ├── release-utils
│   ├── third-party
│   └── tools
└── common
    └── ConsensusCore

To determine where to install smrtpipe.py, and set the environment variable SEYMOUR_HOME to this location:

$ mkdir $HOME/smrtpipe-build
$ export SEYMOUR_HOME=$HOME/smrtpipe-build/

To build the software:


$ cd $HOME/smrtpipe-sources-2.0.0/assembly
$ make
$ make install

To test the smrtpipe build:

$ . $SEYMOUR_HOME/etc/setup.sh
$ smrtpipe.py –h

This should display the man page for smrtpipe.py.

SMRT Pipe File Structure

Note: The output of a SMRT Pipe analysis includes more files than described here; interested users should explore the file structure. Following are details about the major files.

 <jobID>/job.sh
  • Contains the SMRT Pipe command line call for the job.
<jobID>/settings.xml
  • Contains the modules (and their associated parameters) to be run as part of the SMRT Pipe run.
<jobID>/metadata.rdf
  • Contains all important metadata associated with the job. This includes metadata propagated from primary results, links to all reports and data files exposed to users, and high-level summary metrics computed during the job. The file is an entry point to the job by tools such as SMRT Portal and SMRT View. metadata.rdf is formatted as an RDF-XML file using OWL ontologies. See http://www.w3.org/standards/semanticweb/ for an introduction to Semantic Web technologies.
<jobID>/input.fofn
  • This file (“file of file names”) is generated early during a job and contains the file names of the raw input files used for the analysis.
<jobID>/input.xml
  • Used to specify the input files to be analyzed in a job, and is passed on to the command line.
<jobID>/vis.jnlp
  • Deprecated - no longer generated in v1.4.0. To visualize data, install SMRT View and choose File > Open Data from Server.
log/smrtpipe.log
  • Contains debugging output from SMRT Pipe modules. This is typically shown by way of the View Log button in SMRT Portal.

Data Files

The Data directory is where most raw files generated by the pipeline are stored. (Note: The following are example output files - for more details about specific files, see the sections dealing with individual modules.)

aligned_reads.cmp.h5, aligned_reads.sam, aligned_reads.bam
  • Mapping and consensus data from secondary analysis.
alignment_summary.gff
  • Alignment data summarized on sequence regions.
variants.gff.gz
  • All sequence variants called from consensus sequence.
toc.xml
  • Deprecated - The master index information for the job outputs is now included in the metadata.rdf file.

Results/Reports Files

Modules with Reports in their name produce HTML reports with static PNG images using XML+XSLT. These reports are located in the results subdirectory. The underlying XML document for each report is preserved there as well; these can be useful files for data-mining the outputs of SMRT Pipe.

The Reference Repository

The reference repository is a file-based data store used by SMRT Analysis to manage reference sequences and associated information. The full description of all of the attributes of the reference repository is beyond the scope of this document, but you need to use some basic aspects of the reference repository in most SMRT Pipe analyses.

Example: Analysis of multi-contig references can only be handled by supplying a reference entry from a reference repository.

It is simple to create and use a reference repository:

  • A reference repository can be any directory on your system. You can have as many reference repositories as you wish; the input to SMRT Pipe is a fully resolved path to a reference entry, so this can live in any accessible reference repository.

Starting with the FASTA sequence genome.fasta, you upload the sequence to your reference repository using the following command:

referenceUploader -c -p/path/to/repository -nGenomeName
-fgenome.fasta

where:

  • /path/to/repository is the path to your reference repository.
  • GenomeName is the name to use for the reference entry that will be created.
  • genome.fasta is the FASTA file containing the reference sequence to upload.

For a large genome, we highly recommended that you produce the BLASR suffix array during this upload step. Use the following command:

referenceUploader -c -p/path/to/repository -nHumanGenome -fhuman.fasta --Saw='sawriter -welter'

There are many more options for reference management. Consult the MAN page entry for referenceUploader by entering referenceUploader -h.

To learn more about what is being stored in the reference entries, look at the directory containing a reference entry. You will find a metadata description (reference.info.xml) of the reference and its associated files. For example, various static indices for BLASR and SMRT View are stored in the sequence directory along with the FASTA sequence.


For Research Use Only. Not for use in diagnostic procedures. © Copyright 2010 - 2013, Pacific Biosciences of California, Inc. All rights reserved. Information in this document is subject to change without notice. Pacific Biosciences assumes no responsibility for any errors or omissions in this document. Certain notices, terms, conditions and/or use restrictions may pertain to your use of Pacific Biosciences products and/or third party products. Please refer to the applicable Pacific Biosciences Terms and Conditions of Sale and to the applicable license terms at http://www.pacificbiosciences.com/licenses.html.