Home

Lythimus edited this page Sep 9, 2011 · 28 revisions
Clone this wiki locally

#PARSES

#Introduction Next generation sequencing technologies such as the Illumina Genome Analyzer have made RNA-Sequencing a viable method for interrogating the transcriptome of an organism under a variety of conditions. In an RNA-Sequencing experiment, RNA is extracted and reverse transcribed to cDNA for sequencing via next generation sequencing technology. The reads produced represent the portions of the genome that were expressed under the given conditions of the biological experiment in question. Typically, these reads are then mapped back to the host organism's reference genome in order to find which portions of the genome were expressed and then the data is analyzed further. With RNA-Sequencing of human DNA, it is also important to take into consideration the splicing junctions that reads can span, and specialized tools have been developed to solve this mapping problem. These tools largely ignore the reads that do not map well to the reference genome of the host organism and focus on only those reads that can be attributed to host DNA. However, important information could be lost by ignoring these additional reads. Contamination is an important factor that must be considered for any biological experiment, including RNA-Seq. Some of the reads that do not map to the host genome could be indicative of contamination that could ultimately affect the outcome of the experiment. Other reads may contain viral sequences that could be implicated in the condition in question for the RNA-Seq experiment. As an example, we found significant evidence of Epstein-Barr virus transcription in type I Burkitt's lymphoma cells. Many of the tools that are necessary to analyze exogenous sequences in RNA-Sequencing experiments already exist; PARSES is a pipeline built from existing tools in order to automate the analysis of exogenous sequences present in RNA-Sequencing experiments.
#Implementation ![Data flow through PARSES](https://github.com/downloads/Lythimus/PARSES/Pipeline%20Figure%20v2.png)

Our pipeline leverages existing tools for sequence similarity search and metagenomics analysis in order to explore the presence of exogenous sequences in RNA-Sequencing data. The goal is to search the sequence reads against a very large, broad database in order to find the most likely origin of the reads which do not match well to the host genome. BLAST+ provides a fast search but does not always yield optimal results and it is difficult and time consuming to interpret the output of BLAST+ searches that yield many results. We employ a tool for metagenomics that provides convenient visualization and taxonomic analysis of BLAST+ results called MEGAN. The biological researcher can interactively refine the views provided by MEGAN and extract reads of interest that were assigned to specific taxonomical categories for further investigation. Since these reads are relatively short, by transcriptional standards, de novo sequence assembly can be performed in order to reconstruct the transcripts in an extracted taxonomical group in order to investigate the longer transcripts further. This process allows the researcher to proceed from RNA-Sequencing data to de novo assembled transcripts of exogenous DNA with minimal interaction and supervision of the process and without having to learn all of the tools involved in the mapping and processing of sequences. The latest versions of the following resources utilized by PARSES are automatically downloaded, configured, and executed:

##Sequence Alignment The first step of PARSES is to perform the typical mapping of all RNA-Seq reads to the host genome using a sequence alignment tool. We chose novoalign for this purpose since it is a very accurate alignment tool that can align illumina sequencing data to a large genome using a reasonable amount of computing resources. Novoalign partitions the reads into four categories:
  • Uniquely mapped reads are those which map to a single location with a significantly higher score than other locations in the genome
  • Repeat mapped reads are those that map to multiple locations with similarly significant scores
  • Unmapped reads are those that do not map to any location in the genome with a significant score
  • Quality Controlled reads are those where the base qualities are not sufficiently high to have confidence in any mapping.

The Uniquely mapped reads and Repeat mapped reads have been clearly identified as belonging to the human genome and are the usual data sets that RNA-Seq analysis will handle while ignoring the Unmapped reads and Quality Controlled reads. For our investigation into exogenous sequences, we set the Uniquely mapped reads and Repeat mapped reads aside and concentrate on the Unmapped reads for further investigation. These Unmapped reads are extracted from the novoalign results and set aside for further analysis. On average 12.5% of our reads from an exemplary set of RNA-Seq experiments in human were identified as Unmapped. This is a nearly 10-fold decrease in the number of reads to consider in further steps of the processing pipeline.

We also align our reads using the TopHat tool which identifies splice junctions in addition to the reads that are directly mappable back to the human genome. TopHat uses a faster aligner which is less accurate but applies this aligner multiple times in order to find reads that span splice junctions. The next step of PARSES is to use an intermediate file from the TopHat output in order to remove reads from the Unmapped category of novoalign that were identified by TopHat as splice junction reads. These reads would not have been mapped by novoalign to the human reference genome, due to the presence of introns, but TopHat has identified them as human reads that span splice junctions. This processing further reduces the number of reads to consider by an average of 45.2% and leaves us with roughly 5.6% of the original reads from the Illumina sequencing to consider further. Hence the filtering of identified human reads by these two aligners has resulted in a 20-fold decrease in the amount of data we need to consider further as possibly coming from exogenous sequences. Of course the novoalign and TopHat results are maintained for use by the researcher in other standard RNA-Seq analysis that looks at the endogenous sequences.

##Sequence Searching BLAST+ is utilized to search the possible exogenous sequence reads against the NT database, one of the largest sequence databases maintained by NCBI. The run time and memory requirements of BLAST+ are impacted not only by the size of the database being searched but also by the number of query sequences. This observation explains the need for the prior processing steps using sequence aligners to remove identified endogenous sequence reads from further consideration because it would be far too computationally intensive to search all of the Illumina reads against the NT database. BLAST+ results are stored in the BLASTTAB format with the taxon ID programmatically extracted from the GI to TaxID Nucleotide database appended to each hit via a perl script for compatibility with MEGAN in the next step of PARSES. We found this method to be most efficient though it is lacking the graphical display of the sequence pair alignment which is included in both the Pairwise and BLASTXML formats. ##Taxonomical Analysis MEGAN is a tool developed for metagenomic analysis which we have repurposed in PARSES to categorize potential exogenous sequence results. The BLAST+ output above is imported into MEGAN which uses a lowest common ancestor algorithm to assign reads to taxa. Reads that can be assigned at a species level are assigned to taxa near the leaves of the NCBI tree, whereas widely conserved sequences are assigned to higher-order taxa further up the tree. Reads that are not assigned by the least common ancestor algorithm are grouped separately in a not assigned category. Similarly, a no hits category is maintained for those reads that did not return any significant alignments to the NT database. The researcher can interact with MEGAN to extract FASTA files for all of the reads in a given taxonomic grouping for further analysis. This determination would be based on prior biological knowledge of the experiment at hand, or hypotheses that the researcher wants to test given the taxonomic groupings displayed by MEGAN. ##De Novo Assembly Each group of reads extracted by the user is subsequently passed to ABySS, a de novo parallel sequence assembler, to assemble the reads into longer transcripts for further analysis. Memory usage scales super-linearly with the number of reads, so it is important for the user to be involved in the process of selecting the categories from MEGAN to perform de novo assembly on. Hence, this is the only portion of PARSES that is not automated and requires user intervention. The user can also choose to include reads from the "not assigned" or "no hits" category with the reads being de novo assembled in case they were portions of longer transcripts that simply were not found by BLAST+. Such reads could possibly contribute to the formation of longer transcripts for further analysis. ##Reanalysis Once a group of reads has been assembled into longer transcripts they can be looped back to the prior step of BLAST+ analysis to find more significant hits in the NT database and give the researcher a better idea of which exogenous transcripts were found in the samples. The results can then be run back through MEGAN to perform further analysis and comparisons on each set of assembled reads. This provides the researcher with a more refined view of the exogenous sequences present in his or her samples.
#Requirements PARSES is intended to be executed with Solexa data on a *NIX-based desktop computer. It may not be used for any sort of financial gain. Licenses for both MEGAN and Novoalign are strictly for non-profit research at non-profit institutions and academic usage. ##Supported Data Types * fasta * sanger * solexa * illumina1.3 * illumina1.5 ##System Requirements PARSES's system requirements are directly dependent on the size of the data set being processed. It is recommended to be run on a Linux or OS X 64-bit machine with at least 4GBs of memory. You must also have root privileges to the machine if you are installing software. ##Software Requirements * * Working directory is the directory in which the FASTQ sequence file is contained. * [gcc](http://adcdownload.apple.com/Developer_Tools/xcode_3.2.6_and_ios_sdk_4.3__final/xcode_3.2.6_and_ios_sdk_4.3.dmg) * [Ruby](http://www.ruby-lang.org/en/downloads) * [Rake](http://rake.rubyforge.org) * abyssKmerOptimizer.pl marked as executable in same folder as rakefile * addTaxon.pl marked as executable in same folder as rakefile * fac.pl marked as executable in same folder as rakefile * addTaxon.pl marked as executable in same folder as rakefile * parallelBlast.sh marked as executable in same folder as rakefile * Xextractspans.pl marked as executable in same folder as rakefile * Xfilterspans.pl marked as executable in same folder as rakefile * Xnovotonm.pl marked as executable in same folder as rakefile
#Installation Execute all commands from the directory of the data. Place all scripts for PARSES into a single directory and mark as executable. All latest versions of programs will be installed, in the event of an error during installation a repository of the programs may be used which is not guaranteed to be up to date. The repository can be activated by including the repo=true command. Installation will automatically be performed during any execution but it can be manually performed by evoking any of the following installation commands:
rake -f /rake/file/location install #installs and indexes all resources.
rake -f /rake/file/location novoalignInstall
rake -f /rake/file/location bowtieInstall
rake -f /rake/file/location hgInstall
rake -f /rake/file/location novoIndex
rake -f /rake/file/location bowtieIndex
rake -f /rake/file/location samtoolsInstall
rake -f /rake/file/location tophatInstall
rake -f /rake/file/location abyssInstall
rake -f /rake/file/location blastInstall
rake -f /rake/file/location ntInstall
rake -f /rake/file/location meganInstall
rake -f /rake/file/location giTaxIdNuclInstall
rake -f /rake/file/location parallelIteratorInstall
rake -f /rake/file/location indexName=mouse indexGlob="/usr/share/mouse/chr*.fa" otherIndex

#Benchmarks

##Abbreviated PARSES Benchmarks PARSES Wall Time To Process One Read PARSES Itemized Processing Time for Normalized Data Sets PARSES Itemized Accumulated Processing Time

##BLAST

BLASTTAB format

BLASTTAB BLASTN with 5000 reads

Pairwise format

Pairwise BLASTN with 5000 reads


#Examples Example executions.

##First Execution rake -f /rake/file/location seq=NameYouGiveToYourSequence file=YourSequenceFileName.fastq type=illumina1.3

##Subsequent Executions rake -f /rake/file/location seq=NameYouGiveToYourSequence

##Install using repository of links rake -f /rake/file/location repo=true install

##Install index of other host organism's genome rake -f /rake/file/location indexName=mouse indexGlob="/usr/share/mouse/chr*.fa" otherIndex

##Set non-human, indexed organism as host and execute rake -f /rake/file/location seq=NameYouGiveToYourSequence indexName=mouse indexGlob="/usr/share/mouse/chr*.fa"

##Run to Specified Point rake -f /rake/file/location seq=NameYouGiveToYourSequence file=YourSequenceFileName.fastq type=illumina1.3 localAlignContigs

##Run truncated version of PARSES (only execute the specified task and ignore prerequisites) rake -f /rake/file/location seq=NameYouGiveToYourSequence file=YourSequenceFileName.fastq type=illumina1.3 truncate=true localAlignContigs

##Run truncated version of PARSES (only execute the specified task and ignore prerequisites) and override the file naming schema rake -f /rake/file/location seq=NameYouGiveToYourSequence file=YourFileToProcess type=illumina1.3 truncate=true forcefile=true localAlignContigs

##List All Tasks rake -f /rake/file/location -T

##Output Files

datafile.fastq

  • alignSequence (Novoalign)

datafile.fastq.novo

  • removeHuman (Xnovotonm)

datafile.fastq.novo.NM.fasta

  • removeSpans (Tophat)

datafile.fastq.novo.NM.fasta.nospans

  • localAlignReads (BLAST+)

datafile.fastq.novo.NM.fasta.nospans.blast

  • metaGenomeAnalyzeReads (MEGAN)

datafile.fastq.novo.NM.fasta.nospans.blast.megan.rma

  • denovoAssembleCluster (ABySS)

datafile.fastq.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa

  • localAlignContigs (BLAST+)

datafile.fastq.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa.blast

  • metaGenomeAnalyzeContigs (MEGAN)

datafile.fastq.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa.blast.megan.rma

## Clean and Clobber

rake -f /rake/file/location seq=PARSES_SRR036397vi_1 clean

rake -f /rake/file/location seq=PARSES_SRR036397vi_1 clobber

File Name Run Clean Clobber
.PARSES_SRR036397vi_1 + + +
.time +
PARSES_SRR036397vi_1.log + +
SRR036397vi_1.fastq + + +
SRR036397vi_1.fastq.novo + +
SRR036397vi_1.fastq.novo.NM.fasta + +
SRR036397vi_1.fastq.novo.NM.fasta.nospans + +
SRR036397vi_1.fastq.novo.NM.fasta.nospans.aa +
SRR036397vi_1.fastq.novo.NM.fasta.nospans.aa.blast +
SRR036397vi_1.fastq.novo.NM.fasta.nospans.ab +
SRR036397vi_1.fastq.novo.NM.fasta.nospans.ab.blast +
SRR036397vi_1.fastq.novo.NM.fasta.nospans.blast + +
SRR036397vi_1.fastq.novo.NM.fasta.nospans.blast.megan.rma + +
SRR036397vi_1.fastq.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa + +
SRR036397vi_1.fastq.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa.blast + +
SRR036397vi_1.fastq.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa.blast.megan.rma + + +
SRR036397vi_1.fastq.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa.blast.pdf + + +
SRR036397vi_1.fastq.novo.NM.fasta.nospans.blast.pdf + +
SRR036397vi_1.fastq.novo.NM.fasta.nospans.details +
SRR036397vi_1.fastq.novo.NM.fasta.nospans.mergedBlast +
SRR036397vi_1.fastq.novo.NM.fastq + +
SRR036397vi_1.fastq.novo.counts +
reads-No_hits.fasta + +
reads-No_hits.fasta.1.kmer.contigs.coverage +
reads-No_hits.fasta.1.kmer.contigs.fa +
reads-No_hits.fasta.10.kmer.contigs.coverage +
reads-No_hits.fasta.10.kmer.contigs.fa +
reads-No_hits.fasta.11.kmer.contigs.coverage +
reads-No_hits.fasta.11.kmer.contigs.fa +
reads-No_hits.fasta.12.kmer.contigs.coverage +
reads-No_hits.fasta.12.kmer.contigs.fa +
reads-No_hits.fasta.13.kmer.contigs.coverage +
reads-No_hits.fasta.13.kmer.contigs.fa +
reads-No_hits.fasta.14.kmer.contigs.coverage +
reads-No_hits.fasta.14.kmer.contigs.fa +
reads-No_hits.fasta.15.kmer.contigs.coverage +
reads-No_hits.fasta.15.kmer.contigs.fa +
reads-No_hits.fasta.16.kmer.contigs.coverage +
reads-No_hits.fasta.16.kmer.contigs.fa +
reads-No_hits.fasta.17.kmer.contigs.coverage +
reads-No_hits.fasta.17.kmer.contigs.fa +
reads-No_hits.fasta.18.kmer.contigs.coverage +
reads-No_hits.fasta.18.kmer.contigs.fa +
reads-No_hits.fasta.19.kmer.contigs.coverage +
reads-No_hits.fasta.19.kmer.contigs.fa +
reads-No_hits.fasta.2.kmer.contigs.coverage +
reads-No_hits.fasta.2.kmer.contigs.fa +
reads-No_hits.fasta.20.kmer.contigs.coverage +
reads-No_hits.fasta.20.kmer.contigs.fa +
reads-No_hits.fasta.21.kmer.contigs.coverage +
reads-No_hits.fasta.21.kmer.contigs.fa +
reads-No_hits.fasta.22.kmer.contigs.coverage +
reads-No_hits.fasta.22.kmer.contigs.fa +
reads-No_hits.fasta.23.kmer.contigs.coverage +
reads-No_hits.fasta.23.kmer.contigs.fa +
reads-No_hits.fasta.24.kmer.contigs.coverage +
reads-No_hits.fasta.24.kmer.contigs.fa +
reads-No_hits.fasta.25.kmer.contigs.coverage +
reads-No_hits.fasta.25.kmer.contigs.fa +
reads-No_hits.fasta.26.kmer.contigs.coverage +
reads-No_hits.fasta.26.kmer.contigs.fa +
reads-No_hits.fasta.26.kmer.contigs.fa.kmerOptimized.fa + +
reads-No_hits.fasta.3.kmer.contigs.coverage +
reads-No_hits.fasta.3.kmer.contigs.fa +
reads-No_hits.fasta.4.kmer.contigs.coverage +
reads-No_hits.fasta.4.kmer.contigs.fa +
reads-No_hits.fasta.5.kmer.contigs.coverage +
reads-No_hits.fasta.5.kmer.contigs.fa +
reads-No_hits.fasta.6.kmer.contigs.coverage +
reads-No_hits.fasta.6.kmer.contigs.fa +
reads-No_hits.fasta.7.kmer.contigs.coverage +
reads-No_hits.fasta.7.kmer.contigs.fa +
reads-No_hits.fasta.8.kmer.contigs.coverage +
reads-No_hits.fasta.8.kmer.contigs.fa +
reads-No_hits.fasta.9.kmer.contigs.coverage +
reads-No_hits.fasta.9.kmer.contigs.fa +
reads-root.fasta + +
reads-root.fasta.1.kmer.contigs.coverage +
reads-root.fasta.1.kmer.contigs.fa +
reads-root.fasta.10.kmer.contigs.coverage +
reads-root.fasta.10.kmer.contigs.fa +
reads-root.fasta.11.kmer.contigs.coverage +
reads-root.fasta.11.kmer.contigs.fa +
reads-root.fasta.12.kmer.contigs.coverage +
reads-root.fasta.12.kmer.contigs.fa +
reads-root.fasta.13.kmer.contigs.coverage +
reads-root.fasta.13.kmer.contigs.fa +
reads-root.fasta.14.kmer.contigs.coverage +
reads-root.fasta.14.kmer.contigs.fa +
reads-root.fasta.15.kmer.contigs.coverage +
reads-root.fasta.15.kmer.contigs.fa +
reads-root.fasta.16.kmer.contigs.coverage +
reads-root.fasta.16.kmer.contigs.fa +
reads-root.fasta.17.kmer.contigs.coverage +
reads-root.fasta.17.kmer.contigs.fa +
reads-root.fasta.18.kmer.contigs.coverage +
reads-root.fasta.18.kmer.contigs.fa +
reads-root.fasta.19.kmer.contigs.coverage +
reads-root.fasta.19.kmer.contigs.fa +
reads-root.fasta.2.kmer.contigs.coverage +
reads-root.fasta.2.kmer.contigs.fa +
reads-root.fasta.20.kmer.contigs.coverage +
reads-root.fasta.20.kmer.contigs.fa +
reads-root.fasta.21.kmer.contigs.coverage +
reads-root.fasta.21.kmer.contigs.fa +
reads-root.fasta.22.kmer.contigs.coverage +
reads-root.fasta.22.kmer.contigs.fa +
reads-root.fasta.23.kmer.contigs.coverage +
reads-root.fasta.23.kmer.contigs.fa +
reads-root.fasta.24.kmer.contigs.coverage +
reads-root.fasta.24.kmer.contigs.fa +
reads-root.fasta.25.kmer.contigs.coverage +
reads-root.fasta.25.kmer.contigs.fa +
reads-root.fasta.26.kmer.contigs.coverage +
reads-root.fasta.26.kmer.contigs.fa +
reads-root.fasta.26.kmer.contigs.fa.kmerOptimized.fa + +
reads-root.fasta.3.kmer.contigs.coverage +
reads-root.fasta.3.kmer.contigs.fa +
reads-root.fasta.4.kmer.contigs.coverage +
reads-root.fasta.4.kmer.contigs.fa +
reads-root.fasta.5.kmer.contigs.coverage +
reads-root.fasta.5.kmer.contigs.fa +
reads-root.fasta.6.kmer.contigs.coverage +
reads-root.fasta.6.kmer.contigs.fa +
reads-root.fasta.7.kmer.contigs.coverage +
reads-root.fasta.7.kmer.contigs.fa +
reads-root.fasta.8.kmer.contigs.coverage +
reads-root.fasta.8.kmer.contigs.fa +
reads-root.fasta.9.kmer.contigs.coverage +
reads-root.fasta.9.kmer.contigs.fa +

#Task List

##Basic Tasks

  1. alignSequence # Novoalign - Align reads in to base genome
  2. removeHuman # Xnovotonm - Harvest non-base organism reads
  3. removeSpans # Tophat - Align spanning reads in order to remove base organism reads
  4. localAlignReads # BLAST - Associate reads with organisms.
  5. metaGenomeAnalyzeReads # MEGAN - Separate reads into taxonomies.
  6. denovoAssembleCluster # ABySS - Assemble reads associated with clusters of taxonomies.
  7. localAlignContigs # BLAST - Associate contigs with organisms.
  8. metaGenomeAnalyzeContigs # MEGAN - Separate contigs into taxonomies.

##Special Tasks

  • clean # Remove any temporary products.
  • clobber # Remove any generated file.
  • reserialize # Automatically saving any settings changes which may have been made

##Installation Tasks

  • abyssInstall # Install latest version of ABySS with Google Sparsehash.
  • blastInstall # Install latest version of BLAST+.
  • bowtieIndex # Create an index for bowtie/tophat of the human genome database.
  • bowtieInstall # Install latest version of Bowtie.
  • giTaxIdNuclInstall # Install GI to Taxonomy ID database.
  • hgInstall # Install latest version of human genome database.
  • install # Install latest version of everything.
  • meganInstall # Install latest version of MEGAN.
  • novoIndex # Create an index for novoalign of the human genome database.
  • novoalignInstall # Install latest version of Novoalign.
  • ntInstall # Install latest version of the NT database.
  • otherIndex # Index the genome of a specified organism with Novo and Bowtie.
  • parallelIteratorInstall # Install latest version of Parallel::Iterator for perl.
  • samtoolsInstall # Install latest version of Samtools.
  • tophatInstall # Install latest version of Tophat.

#Analyzing Results ##MEGAN The primary method for viewing results involves perusing the RMA file produced by MEGAN--though a PDF is also produced with standard LCA parameters. To familiarize yourself with MEGAN you can watch the [MEGAN Introduction Youtube video](http://www.youtube.com/watch?v=i7-OCW-DctY).

Your results will resemble the following:

You may begin to refine the least common ancestor parameters to get the view of the data you desire at this point. There are four parameters which can be tuned nearly in close to real-time (depending on the size of the data set) to achieve this:

Min Support
can be used to set a threshold for the minimum support that a taxon requires, that is, the number of reads that must be assigned to it so that it appears in the result. Any read that is assigned to a taxon that does not have the required support is counted as unassigned. By default, the minimum number of reads required for a taxon to appear in the result is 5.
Min Score
can be used to set a minimum threshold for the bit score of hits. Any hit in the input data set that scores less than the given threshold is ignored.
Top Percentage
can be used to set a threshold for the maximum percentage by which the score of a hit may fall below the best score achieved for a given read. Any hit that falls below this threshold is discarded.
Win Score
can be used to try and separate matches due to sequence identity and ones due to homology. If a win score is set, then, for a given read, if any match exceeds the win score, only matches exceeding the win score (“winners”) are used to place the given read. The hope is that secondary, homology-induced matches are discarded in the presence of stronger primary matches.

For a detailed example of what results changing LCA parameters will have click the image below (or here for a PDF version).

##Report Reporting is currently stored in logs. ##Cleaning Up Clean and Clobber are both supported in PARSES as an extension of Rake though their functions have been altered slightly. Clean removes all temporary files produced by PARSES which are not directly used as input to tasks. Clobber on the other hand removes all but the original data file, the configuration file, and the final MEGAN files produced. An example of the files produced can be found in the examples section. If a sequence is not specified, all sequence files will be cleaned/clobbered if their configuration file exists.
#Logging A log file is produced in the data directory with the filename `sequenceName.log`. It not only logs all commands executed, in addition to errors returned, but it also contains report information on the results of the data analysis. Below is a list of all information logged at each step of PARSES. ##Non-task specific * Sequence name. * Read length. * Data type. * Timing information for each portion of PARSES. * Arguments used for each program.

##Initialization

  • Information regarding the specifications of the machine on which PARSES is executing. (log)
  • Command used to invoke PARSES. (log)
  • Total number of reads. (log)

##alignSequence (Novoalign)

  • Number of reads which align to the human genome. (log)
  • Percentage of total reads which align to the human genome. (log)

##removeHuman (Xnovotonm)

  • Number of reads left after removing human reads. (log)
  • Percentage of total reads left. (log)

##removeSpans (Tophat)

  • Number of reads left after removing splicing junctions of human reads. (log)
  • Percentage of reads left after removing splicing regions. (log)

##localAlignReads (BLAST+)

##metaGenomeAnalyzeReads (MEGAN)

  • Potential number of exogenous reads. (log)
  • Number of reads assigned to a taxonomy. (log)
  • Percentage of total reads assigned to a taxonomy. (log)
  • Number of reads not assigned to a taxonomy. (log)
  • Percentage of total reads not assigned to a taxonomy. (log)
  • Number of reads which did not significantly align. (log)
  • Percentage of total reads which did not significantly align. (log)
  • Image of taxonomy overview with default LCA parameters.

##denovoAssembleCluster (ABySS)

  • Coverage threshold. (log)
  • Median kmer coverage. (log)
  • Number of kmers in contigs. (log)
  • Number of contigs. (log)
  • Optimum kmer. (log)

##localAlignContigs (BLAST+)

##metaGenomeAnalyzeContigs (MEGAN)

  • Potential number of exogenous contigs. (log)
  • Number of contigs assigned to a taxonomy. (log)
  • Number of contigs not assigned to a taxonomy. (log)
  • Number of contigs which did not significantly align. (log)

Below is an example of a log file (truncated for display purposes):

# Logfile created on Mon Nov 29 20:21:37 -0600 2010 by logger.rb/22285
I, [2010-12-27T14:54:16.761320 #94508]  INFO -- : Begin run for seq=akata file=s_4_sequence_Akata.txt type=ill...
I, [2010-12-27T14:54:16.849927 #94508]  INFO -- : Executing PARSES v0.30 in a Darwin environment with 64GB of ...
I, [2010-12-27T17:56:20.865114 #94508]  INFO -- : tophat -p 24 --solexa1.3-quals --output-dir akata_tophat_out...
I, [2010-12-27T17:56:20.865442 #94508]  INFO -- : samtools view -h -o akata_tophat_out/accepted_hits.sam akata...
I, [2010-12-27T17:56:20.865475 #94508]  INFO -- : Xextractspans.pl akata_tophat_out/accepted_hits.sam
I, [2010-12-27T17:56:20.865506 #94508]  INFO -- : Xfilterspans.pl s_4_sequence_Akata.txt.novo.NM.fasta akata_t...

#Configuration ##PARSES Configuration A configuration file is produced for PARSES in `$HOME/.PARSES`. It contains the paths to the human genome and NT databases as well as the paths to the Bowtie and TopHat indices. It has the following form:
--- !ruby/object:Settings
bowtieIndex: /usr/share/hgChrAll
humanGenomeDatabase: /usr/share
novoIndex: /usr/share/hgChrAll.ndx
ntDatabase: /usr/share/nt/nt
##Sequence Configuration In addition, configuation files are produced for each sequence in `$(pwd)/.sequenceName`. Settings for each program are chosen by default but can be changed via the sequence file which has the following form:
--- !ruby/object:Sequence
abyssPath: s_4_sequence_Akata.txt.novo.NM.fasta.nospans.blast.megan.rma
abyssPathGlob: reads-*.fasta
blast1Path: s_4_sequence_Akata.txt.novo.NM.fasta.nospans
blast2Path: s_4_sequence_Akata.txt.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa
blastOutputFormat: 0
blastPathGlob: reads-*.fasta.*.kmer.contigs.fa.kmerOptimized.fa
dataType: illumina1.3
eValue1: 1.0e-06
eValue2: 100
expansionNumber: 10
filePath: s_4_sequence_Akata.txt
imageFileType: jpg
maxKmerLength: 38
maxMatches: 0
megan1Path: s_4_sequence_Akata.txt.novo.NM.fasta.nospans.blast
megan2Path: s_4_sequence_Akata.txt.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa.blast
minKmerLength: 7
minScoreByLength: 0
minSupport: 5
novoalignPath: s_4_sequence_Akata.txt.novo
pipeEndPath: s_4_sequence_Akata.txt.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa.blast.megan.rma
readLength: 38
removeNonMappedPath: s_4_sequence_Akata.txt.novo.NM.fasta
topPercent: 10.0
useCogs: "false"
useGos: "false"
winScore: 0.0

It is recommended the path variables not be adjusted. Everything except expansionNumber, which is the number of times to execute the expansion command when generating a picture from MEGAN is straight-forward.

##System Configuration In addition, the amount of RAM, number of CPU cores, CPU architecture, operating system, default shell, and existence of locate database is automatically computed each execution but not stored. ##Tuning Parameters See table below for Tunable parameters. Where B means consider tuning and ? means research parameter further.

Process | Tune? | Cand? | Parameter | Description | Set value | Default value

  • | - | - | - | - | - | - ABySS | Y | Y | erode | Erode bases at the ends of blunt contigs with coverage less than this threshold | | Tuned by ABySS. ABySS | N | B | no-bubbles | Do not pop bubbles ABySS | N | B | bubbles | Pop bubbles shorter than this threshold. ABySS | Y | Y | kmer | Length to divide reads into when assembling ABySS | N | ? | chastity | Discard unchaste reads | | TRUE ABySS | N | B | trim-masked | Trim masked bases from the ends of reads | | TRUE ABySS | N | B | trim-quality | Trim bases from the ends of reads whose quality is less than the threshold ABySS | N | B | trim-length | Maximum length of dangling edges to trim ABySS | N | B | coverage | Remove contigs with mean k-mer coverage less than this threshold ABySS | N | B | erode-strand | Erode bases at the ends of blunt contigs with coverage less than this threshold on either strand BLAST | N | Y | E-value | Expectation value threshold for saving hits. | 0.001/100 | 10 BLAST | N | B | word_Size | Length of best perfect match. BLAST | N | B | gapOpen | Cost to open a gap. BLAST | N | B | gapExtend | Cost to extend a gap. BLAST | N | B | penalty | Penalty for a nt mismatch. BLAST | N | B | reward | Reward for a nt match. BLAST | N | Y | use_index | Use MegaBLAST database index. BLAST | N | B | dust | Filter query sequence with DUST | | 20 64 1 BLAST | N | ? | filtering_db | BLAST database containing filtering elements BLAST | N | B | window_masker_taxid | Enable WindowMasker filtering using a Taxonomic ID BLAST | N | B | window_masker_db | Enable WindowMasker filtering using this repeats database BLAST | N | ? | soft_masking | Apply filtering locations as soft masks | | TRUE BLAST | N | N | lcase_masking | Use lower case filtering in query and subject sequence(s)? BLAST | N | N | gilist | Restrict search of database to list of GI's BLAST | N | N | seqidlist | Restrict search of database to list of SeqId's BLAST | N | B | negative_gilist | Restrict search of database to everything except the listed GIs BLAST | N | ? | entrez_query | Restrict search with the given Entrez query BLAST | N | ? | db_soft_mask | Filtering algorithm ID to apply to the BLAST database as soft masking BLAST | N | B | perc_identity | Percent identity. | | 0 BLAST | N | B | culling_limit | If the query range of a hit is enveloped by that of at least this many higher-scoring hits, delete the hit. BLAST | N | ? | best_hit_overhang | Best Hit algorithm overhang value (recommended value: 0.1) BLAST | N | ? | best_hit_score_edge | Best Hit algorithm score edge value (recommended value: 0.1) BLAST | N | B | max_target_seqs | Maximum number of aligned sequences to keep. BLAST | N | B | dbsize | Effective length of the database. BLAST | N | B | searchsp | Effective length of search space BLAST | N | B | import_search_search_strategy | File name to record the search strategy used. BLAST | N | ? | xdrop_ungap | X-dropoff value (in bits) for ungapped extensions BLAST | N | ? | xdrop_gap | X-dropoff value (in bits) for preliminary gapped extensions BLAST | N | ? | xdrop_gap_final | X-dropoff value (in bits) for final gapped alignment BLAST | N | B | no_greedy | Use non-greedy dynamic programming extension BLAST | N | ? | min_raw_gapped_score | Minimum raw gapped score to keep an alignment in the preliminary gapped and traceback stages BLAST | N | B | ungapped | Perform ungapped alignment only? BLAST | N | ? | window_size | Multiple hits window size, use 0 to specify 1-hit algorithm BLAST | N | ? | off_diagonal_range | Number of off-diagonals to search for the 2nd hit, use 0 to turn off | | 0 BLAST | Y | Y | num_threads | Number of threads (CPUs) to use in the BLAST search. | Computed per machine. | 1 BLAST | N | ? | remote | Execute search remotely. MEGAN | N | Y | minScore | The Min Score item can be used to set a minimum threshold for the bit score of hits. Any hit in the input data set that scores less than the given threshold is ignored. | 35 | 35 MEGAN | N | Y | topPercent | The Top Percentage item can be used to set a threshold for the maximum percentage by which the score of a hit may fall below the best score achieved for a given read. Any hit that falls below this threshold is discarded. | 10 | 10 MEGAN | N | Y | winScore | The Win Score item can be used to try and separate matches due to sequence identity and ones due to homology. If a win score is set, then, for a given read, if any match exceeds the win score, only matches exceeding the win score (“winners”) are used to place the given read. The hope is that secondary, homology-induced matches are discarded in the presence of stronger primary matches. | 0 | 0 MEGAN | N | Y | minSupport | The Min Support item can be used to set a threshold for the minimum support that a taxon requires, that is, the number of reads that must be assigned to it so that it appears in the result. Any read that is assigned to a taxon that does not have the required support is counted as unassigned. By default, the minimum number of reads required for a taxon to appear in the result is 5. | 1% of reads or five if more than 500 reads. | 5 MEGAN | N | ? | minMatches? MEGAN | N | Y | minScoreByLength? | Divide minimum score by the number of reads in the dataset. MEGAN | N | B | useseed | Use the classification of gene function consisting of a collection of biologically defined subsystems known as SEED. | FALSE MEGAN | N | B | usekegg | Use the collection of metabolic pathways and other pathways known as KEGG. MEGAN | N | Y | usegos | | FALSE MEGAN | N | Y | usecogs | | FALSE MEGAN | N | N | pairedreads | | N MEGAN | N | Y | maxMatches | | 100 | 100 MEGAN | N | Y | db | Store data in Postgres database. MEGAN | N | B | totalreads | Set the total number of reads in the analysis Novoalign | N | ? | (H)ard Clip | Hard clip trailing bases with quality <= 2 Novoalign | Y | ? | Minimum (I)nformation Content | Sets the minimum information content for a read in base pairs. Measure uses base qualities to determine information content of the read in bits and divides by 2 to get effective length in bases. | | log4(Ng) + 5 where Ng is the length of the reference genome Novoalign | N | B | Tru(n)cate | Truncate reads to the specified length before alignment. | | 150 Novoalign | N | ? | (p)olyclonal filter threshold | Sets polyclonal filter thresholds. The first pair of values (n,t) sets the number of bases and threshold for the first 20 base pairs of each read. If there are n or more bases with phred quality below t then the read is flagged as polyclonal and will not be aligned. The alignment status is 'QC'. The second pair of values applies to the entire read rather than just the first 20bp and is entered as fraction of bases in the read below the threshold. Setting -p -1 disables the filter. | | -1,10 -1.00,10 Novoalign | N | ? | (a)dapter stripping | Enables adapter stripping from 3' end of reads before aligning. The second adapter is used for the second read in paired end mode. | | TCGTATGCCGTCTTCTGCTTG Novoalign | N | ? | homopolymer filter score | Sets homopolymer and optional dinucleotide filter score. Any read that matches a homopolymer or dinuc with score less than or equal to this threshold will skipped (reported as QC). | | 20 Novoalign | Y | Y | alignmen(t) score | Sets the maximum alignment score acceptable for the best alignment. | | Automatically set based on read length, genome size and other factors (see manual) Novoalign | N | B | (g)ap opening penalty | Sets the gap opening penalty. | | 40 Novoalign | N | B | gap e(x)tend penalty | Sets the gap extend penalty. | | 15 Novoalign | N | ? | (u)nconverted CHG/CHH penalty | Penalty for unconverted CHG or CHH cytosine in bisulfite alignment mode. | | 0 Novoalign | N | ? | (b)isulphite alignment mode | Sets Bisulphite alignment mode. | | 4 Novoalign | Y | Y | (N)umber bp in source not in reference | Sets the number of bp of source DNA that are not represented in the reference sequences (index). This value is used in calculation of prior probability that the read originated in sequence that we cannot align to because it is not in the reference sequence set. Set to zero to disable inclusion of this in quality calculations. | | The number of bases coded as N's in the reference genome Novoalign | N | B | Softclip | Turns off soft clipping for alignments. | | TRUE for SAM Novoalign | N | B | FullNW | Turns off soft clipping for alignments. | | TRUE for Native report and pairwise Novoalign | N | B | (R)epeat score diff | Sets score difference between best and second best alignment for calling a repeat | | 5 Novoalign | N | ? | (r)epeat strategy | Sets strategy for reporting repeats. | | None Novoalign | N | B | alignment (Q)uality | Sets lower limit on alignment quality for reporting. | | 0 Novoalign | N | B | alignm(e)nt limit | Sets a limit on number of alignments for a single read. | | 1000 Novoalign | N | B | (m)iRNA mode | Sets miRNA mode. In this mode each alignment to a read is given an additional score based on nearby alignment to the opposite strand of the read. Optional parameter sets maximum distance in bp between alignment and it's reverse complement, Default 100bp. Setting miRNA mode changes the default report mode to 'All'. Novoalign | N | B | read trimming (s)tep size | Turns on read trimming and sets trimming step size. Default step size is 2bp. Unaligned reads are trimmed until they align or fail the QC tests. TopHat | N | B | minor-anchor-length | The "anchor length". TopHat will report junctions spanned by reads with at least this many bases on each side of the junction. Note that individual spliced alignments may span a junction with fewer than this many bases on one side. However, every junction involved in spliced alignments is supported by at least one read with this many bases on each side. | | 8 TopHat | N | B | splice-mismatches | The maximum number of mismatches that may appear in the "anchor" region of a spliced alignment. | | 0 TopHat | N | B | min-intron-length | The minimum intron length. TopHat will ignore donor/acceptor pairs closer than this many bases apart. | | 70 TopHat | N | B | max-intron-length | The maximum intron length. When searching for junctions ab initio, TopHat will ignore donor/acceptor pairs farther than this many bases apart, except when such a pair is supported by a split segment alignment of a long read. | | 50000 TopHat | N | B | min-isoform-fraction | TopHat filters out junctions supported by too few alignments. Suppose a junction spanning two exons, is supported by S reads. Let the average depth of coverage of exon A be D, and assume that it is higher than B. If S / D is less than the minimum isoform fraction, the junction is not reported. A value of zero disables the filter. | | 0.15 TopHat | Y | Y | num-threads | Use this many threads to align reads. | Computed on a machine basis. | 1 TopHat | N | B | max-multihits | Instructs TopHat to allow up to this many alignments to the reference for a given read, and suppresses all alignments for reads with more than this many alignments. | | 40 TopHat | N | N | closure-search | Enables the mate pair closure-based search for junctions. Closure-based search should only be used when the expected inner distance between mates is small (<= 50bp) | | FALSE TopHat | N | B | no-coverage-search | Disables the coverage based search for junctions. TopHat | N | B | coverage-search | Enables the coverage based search for junctions. Use when coverage search is disabled by default (such as for reads 75bp or longer), for maximum sensitivity. TopHat | N | B | microexon-search | With this option, the pipeline will attempt to find alignments incident to microexons. Works only for reads 50bp or longer. TopHat | N | B | butterfly-search | TopHat will use a slower but potentially more sensitive algorithm to find junctions in addition to its standard search. Consider using this if you expect that your experiment produced a lot of reads from pre-mRNA, that fall within the introns of your transcripts. TopHat | N | B | library-type | TopHat will treat the reads as strand specific. Every read alignment will have an XS attribute tag. Consider supplying library type options below to select the correct RNA-seq protocol. TopHat | N | B | segment-mismatches | Read segments are mapped independently, allowing up to this many mismatches in each segment alignment. | | 2 TopHat | N | B | segment-length | Each read is cut up into segments, each at least this long. These segments are mapped independently. | | 25 TopHat | N | B | min-closure-exon | During closure search for paired end reads, exonic hops in the potential splice graph must be at least this long. | | 50 TopHat | N | B | min-closure-intron | The minimum intron length that may be found during closure search. | | 50 TopHat | N | B | max-closure-intron | The maximum intron length that may be found during closure search. | | 5000 TopHat | N | B | min-coverage-intron | The minimum intron length that may be found during coverage search. | | 50 TopHat | N | B | max-coverage-intron | The maximum intron length that may be found during coverage search. | | 20000 TopHat | N | B | min-segment-intron | The minimum intron length that may be found during split-segment search. | | 50 TopHat | N | B | max-segment-intron | The maximum intron length that may be found during split-segment search. | | 500000
##Versioning Version number is simply incremented every release appropriately with respect to how the extent of the features which were add or bugs which were fixed. Ideally version 1.0 will be GUI-based.