update to documentation
rob-p committed May 7, 2015
1 parent c5dbe53 commit 6e56d55
Showing 1 changed file with 79 additions and 68 deletions.
147 changes: 79 additions & 68 deletions doc/source/salmon.rst
@@ -1,31 +1,33 @@
Salmon
================

Salmon is a tool for **wicked** transcript quantification from RNA-seq data. It
requires a set of target transcripts (either from a reference or *de-novo*
assembly) to quantify. All you need to run Salmon is a fasta file containing
your reference transcripts and a (set of) fasta/fastq file(s) containing your
reads. Optinonally, Salmon can make use of pre-computed alignments (in the
form of a SAM/BAM file) to the transcripts rather than the raw reads.

The read-based mode of Salmon runs in two phases; indexing and quantification.
The indexing step is independent of the reads, and only need to be run one for
a particular set of reference transcripts. The quantification step, obviously,
is specific to the set of RNA-seq reads and is thus run more frequently. For a
more complete description of all available options in Salmon, see below.

The alignment-based mode of Salmon does not require indexing. Rather, you can
Salmon is a tool for **wicked-fast** transcript quantification from RNA-seq
data. It requires a set of target transcripts (either from a reference or
*de-novo* assembly) to quantify. All you need to run Salmon is a FASTA file
containing your reference transcripts and a (set of) FASTA/FASTQ file(s)
containing your reads. Optionally, Salmon can make use of pre-computed
alignments (in the form of a SAM/BAM file) to the transcripts rather than the
raw reads.

The **lightweight-alignment**-based mode of Salmon runs in two phases: indexing and
quantification. The indexing step is independent of the reads, and only needs to
be run once for a particular set of reference transcripts. The quantification
step, obviously, is specific to the set of RNA-seq reads and is thus run more
frequently. For a more complete description of all available options in Salmon,
see below.

The **alignment**-based mode of Salmon does not require indexing. Rather, you can
simply provide Salmon with a FASTA file of the transcripts and a SAM/BAM file
containing the alignments you wish to use for quantification.

Using Salmon
------------

As mentioned above, there are two "modes" of operation for Salmon. The first,
like Sailfish, requires you to build an index for the transcriptome, but then
subsequently processes reads directly. The second mode simply requires you to
provide a FASTA file of the transcriptome and a ``.sam`` or ``.bam`` file
containing a set of alignments.
requires you to build an index for the transcriptome, but then subsequently
processes reads directly. The second mode simply requires you to provide a
FASTA file of the transcriptome and a ``.sam`` or ``.bam`` file containing a
set of alignments.

.. note:: Read / alignment order

@@ -36,40 +38,74 @@ containing a set of alignments.
**not** be sorted by target or position. If your reads or alignments
do not appear in a random order with respect to the target transcripts,
please randomize / shuffle them before performing quantification with
salmon.
Salmon.
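
One way to randomize paired-end FASTQ files while keeping mates in sync is sketched below. This is not part of Salmon: the helper name ``shuffle_pairs`` and the file names are hypothetical, and the sketch assumes 4-line FASTQ records, no tab characters in any line, and the GNU ``shuf`` utility.

```shell
# Sketch: shuffle paired-end FASTQ records while keeping mates together.
# Assumes 4-line records, no tab characters in any line, and GNU shuf.
# shuffle_pairs is a hypothetical helper, not a Salmon command.
shuffle_pairs() {
    in1=$1; in2=$2; out1=$3; out2=$4
    tmp=$(mktemp) || return 1
    # Collapse each 4-line record onto a single tab-separated line.
    paste - - - - < "$in1" > "$tmp.1"
    paste - - - - < "$in2" > "$tmp.2"
    # Put mates side by side (8 columns) and shuffle whole pairs.
    paste "$tmp.1" "$tmp.2" | shuf > "$tmp"
    # Split the columns back into two 4-line-per-record files.
    cut -f 1-4 "$tmp" | tr '\t' '\n' > "$out1"
    cut -f 5-8 "$tmp" | tr '\t' '\n' > "$out2"
    rm -f "$tmp" "$tmp.1" "$tmp.2"
}

# Example (hypothetical file names):
#   shuffle_pairs reads1.fq reads2.fq reads1.shuffled.fq reads2.shuffled.fq
```

Because both mates travel on the same line during the shuffle, record *i* of the first output file still pairs with record *i* of the second.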

.. note:: Number of Threads

The number of threads that salmon can effectively make use of depends
The number of threads that Salmon can effectively make use of depends
upon the mode in which it is being run. In alignment-based mode, the
main bottleneck is in parsing and decompressing the input BAM file.
We make use of the `Staden IO <http://sourceforge.net/projects/staden/files/io_lib/>`_
library for SAM/BAM/CRAM I/O (CRAM is, in theory, supported, but has not been
thoroughly tested). This means that multiple threads can be effectively used
to aid in BAM decompression. However, we find that throwing more than a
few threads at file decompression does not result in increased processing
speed. Thus, alignment-based salmon will only ever allocate up to 4 threads
speed. Thus, alignment-based Salmon will only ever allocate up to 4 threads
to file decompression, with the rest being allocated to quantification.
If these threads are starved, they will sleep (the quantification threads
do not busy wait), but there is a point beyond which allocating more threads
will not speed up alignment-based quantification. We find that allocating
8 --- 12 threads results in the maximum speed; threads allocated above this
limit will likely spend most of their time idle / sleeping.

For read-based salmon, the story is somewhat different. Generally,
performance continues to improve as more threads are made available. This
is because the determiniation of the potential mapping locations of each
read is, generally, the slowest step in read-based quantification. Since
this process is trivially parallelizable (and well-parallelized within
salmon), more threads generally equates to faster quantification. However,
there may still be a limit to the return on invested threads. Specifically,
writing to the mapping cache (see `Misc`_ below) is done via a single
thread. With a huge number of quantification threads or in environments
with a very slow disk, this may become the limiting step. If you're certain
that you have more than the required number of observations, or if you have
reason to suspect that your disk is particularly slow on writes, then you
can disable the mapping cache (``--disableMappingCache``), and potentially
increase the parallelizability of read-based salmon.
For lightweight-alignment-based Salmon, the story is somewhat different.
Generally, performance continues to improve as more threads are made
available. This is because the determination of the potential mapping
locations of each read is, generally, the slowest step in
lightweight-alignment-based quantification. Since this process is
trivially parallelizable (and well-parallelized within Salmon), more
threads generally equate to faster quantification. However, there may
still be a limit to the return on invested threads. Specifically, writing
to the mapping cache (see `Misc`_ below) is done via a single thread. With
a huge number of quantification threads or in environments with a very slow
disk, this may become the limiting step. If you're certain that you have
more than the required number of observations, or if you have reason to
suspect that your disk is particularly slow on writes, then you can disable
the mapping cache (``--disableMappingCache``), and potentially increase the
parallelizability of lightweight-alignment-based Salmon.
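
The thread guidance above can be turned into a small helper, sketched here under assumptions. ``cap_threads`` is a hypothetical function, the default ceiling of 12 is taken from the 8 to 12 range quoted in the note, and the ``-p`` thread option in the commented commands should be confirmed against the help output of your Salmon version.

```shell
# Sketch: cap the thread count for alignment-based mode, where the note above
# reports little benefit beyond roughly 8 to 12 threads.
# cap_threads is a hypothetical helper; the default ceiling of 12 is taken
# from that note and can be overridden with a second argument.
cap_threads() {
    available=$1
    ceiling=${2:-12}
    if [ "$available" -gt "$ceiling" ]; then
        echo "$ceiling"
    else
        echo "$available"
    fi
}

# Example (not run here; -p is assumed to be the thread option):
#   lightweight-alignment-based mode, use every available core:
#     salmon quant -i transcripts_index -l <LIBTYPE> -1 reads1.fa -2 reads2.fa \
#         -o transcripts_quant -p "$(nproc)"
#   alignment-based mode, use a capped number of threads:
#     salmon quant -t transcripts.fa -l <LIBTYPE> -a aln.bam \
#         -o salmon_quant -p "$(cap_threads "$(nproc)")"
```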

Lightweight-alignment-based mode
--------------------------------

One of the novel and innovative features of Salmon is its ability to accurately
quantify transcripts using *lightweight* alignments. Lightweight alignments
are mappings of reads to transcript positions that are computed without
performing a base-to-base alignment of the read to the transcript. Lightweight
alignments are typically much faster to compute than traditional (or full)
alignments, and can sometimes provide superior accuracy by being more robust
to errors in the read or genomic variation from the reference sequence.
If you want to use Salmon in lightweight-alignment-based mode, then you first
have to build a Salmon index for your transcriptome. Assume that
``transcripts.fa`` contains the set of transcripts you wish to quantify. First,
you run the Salmon indexer:

::

    > ./bin/salmon index -t transcripts.fa -i transcripts_index

Then, you can quantify any set of reads (say, paired-end reads in files
``reads1.fa`` and ``reads2.fa``) directly against this index using the Salmon
``quant`` command as follows:

::

    > ./bin/salmon quant -i transcripts_index -l <LIBTYPE> -1 reads1.fa -2 reads2.fa -o transcripts_quant

You can, of course, pass a number of options to control things such as the
number of threads used or the different cutoffs used for counting reads.
Just as with the alignment-based mode, after Salmon has finished running, there
will be a directory (here ``transcripts_quant``, the name passed to ``-o``) that
contains a file called ``quant.sf`` holding the quantification results.
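
Since the index depends only on the transcripts, the two steps can be wrapped so that indexing happens once and is reused for every sample. The following is a minimal sketch under assumptions: ``quantify_sample`` is a hypothetical wrapper, ``salmon`` is assumed to be on ``PATH``, and the ``LIBTYPE`` variable must hold a library type string valid for your data.

```shell
# Sketch: build the index only if it is missing, then quantify one sample.
# quantify_sample is a hypothetical wrapper; salmon is assumed to be on PATH
# and the LIBTYPE variable must be set to a valid library type string.
quantify_sample() {
    transcripts=$1; index=$2; reads1=$3; reads2=$4; outdir=$5
    # The index depends only on the transcripts, so reuse it across samples.
    if [ ! -d "$index" ]; then
        salmon index -t "$transcripts" -i "$index" || return 1
    fi
    salmon quant -i "$index" -l "$LIBTYPE" -1 "$reads1" -2 "$reads2" -o "$outdir" || return 1
    # Report where the quantification results are expected to be.
    echo "$outdir/quant.sf"
}

# Example (hypothetical file names):
#   LIBTYPE=<LIBTYPE>
#   quantify_sample transcripts.fa transcripts_index s1_1.fa s1_2.fa s1_quant
#   quantify_sample transcripts.fa transcripts_index s2_1.fa s2_2.fa s2_quant
```

The second call skips indexing because the index directory already exists.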

Alignment-based mode
--------------------
@@ -95,15 +131,15 @@ mode, and a description of each, run ``salmon quant --help-alignment``.
.. note:: Genomic vs. Transcriptomic alignments

Salmon expects that the alignment files provided are with respect to the
transcripts given in the corresponding fasta file. That is, salmon expects
transcripts given in the corresponding fasta file. That is, Salmon expects
that the reads have been aligned directly to the transcriptome (like RSEM,
eXpress, etc.) rather than to the genome (as does, e.g. Cufflinks). If you
have reads that have already been aligned to the genome, there are
currently 3 options for converting them for use with Salmon. First, you
could convert the SAM/BAM file to a FAST{A/Q} file and then use the

mdshw5 Jan 1, 2016

Contributor

@rob-p Would it be much work to ingest BAM input for the lightweight alignment mode?

vals Jan 1, 2016

Contributor

This is high on my wishlist too. My sequencing facility delivers data in CRAM files. It feels a bit silly that converting from CRAM to FASTQ takes longer than running quantification... And explodes disk usage quite a bit.

mdshw5 Jan 2, 2016

Contributor

If there were a great, streaming, way to chain a few commands together to achieve this it would be lower on my wish-list...

rob-p Jan 2, 2016

Author Collaborator

@vals --- Yes, it's absolutely silly that converting formats takes longer than quant (I feel this every time I run fastq-dump). Is it the case that one cannot convert from BAM/CRAM to fastq in a streaming manner? I don't have much experience with dealing with unaligned BAMs as input, but since salmon already deals with alignments, we have a multithreaded BAM parser in the codebase. If there really is no good way to go from BAM -> fastq on the fly, I can look into supporting unaligned BAMs as input; it would just take some refactoring.

vals Jan 2, 2016

Contributor

I'll give it some time this evening to try to make a streamed CRAM -> quant.sf workflow.

In the new samtools 1.3 there is a brand new samtools fastq that can do streaming conversion to fastq. Though I just found out that this command completely ignores the sort order of the CRAM and outputs paired reads in an unmatched fashion if the CRAM is sorted by coordinate... Which our facility does because those are more compressible I think.

roryk Jan 2, 2016

Contributor

https://github.com/gt1/biobambam2 can do streaming conversion too.

read-based mode of salmon described below. Second, given the converted
lightweight-alignment-based mode of Salmon described below. Second, given the converted
FAST{A/Q} file, you could re-align these converted reads directly to the
transcripts with your favorite aligner and run salmon in alignment-based
transcripts with your favorite aligner and run Salmon in alignment-based
mode as described above. Third, you could use a tool like `sam-xlate <https://github.com/mozack/ubu/wiki>`_
to try and convert the genome-coordinate BAM files directly into transcript
coordinates. This avoids the necessity of having to re-map the reads. However,
@@ -112,39 +148,14 @@ mode, and a description of each, run ``salmon quant --help-alignment``.
.. topic:: Multiple alignment files

If your alignments for the sample you want to quantify appear in multiple
.bam/.sam files, then you can simply provide the salmon ``-a`` parameter
.bam/.sam files, then you can simply provide the Salmon ``-a`` parameter
with a (space-separated) list of these files. Salmon will automatically
read through these one after the other quantifying transcripts using the
alignments contained therein. However, it is currently the case that these
separate files must (1) all be of the same library type and (2) all be
aligned with respect to the same reference (i.e. the @SQ records in the
header sections must be identical).
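
Requirement (2) can be checked before quantification. The sketch below compares the ``@SQ`` header records of several files; ``same_reference`` is a hypothetical helper, it relies on ``samtools view -H`` to print each header, and the ``-t`` transcript option in the commented command is assumed from the alignment-based usage described earlier.

```shell
# Sketch: verify that several BAM/SAM files share identical @SQ header records
# before passing them together to -a. same_reference is a hypothetical helper;
# it relies on `samtools view -H` to print the header of each file.
same_reference() {
    first=$1
    shift
    expected=$(samtools view -H "$first" | grep '^@SQ')
    for f in "$@"; do
        found=$(samtools view -H "$f" | grep '^@SQ')
        if [ "$found" != "$expected" ]; then
            echo "header mismatch: $f" >&2
            return 1
        fi
    done
    return 0
}

# Example (not run here; file names are hypothetical):
#   same_reference rep1.bam rep2.bam && \
#       salmon quant -t transcripts.fa -l <LIBTYPE> -a rep1.bam rep2.bam -o salmon_quant
```

This only covers the reference check; confirming that all files share one library type is still up to you.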

Read-based mode
---------------

If you want to use salmon like sailfish, then you first have to build an salmon
index for your transcriptome. Again, assume that ``transcripts.fa`` contains
the set of transcripts you wish to quantify. First, you run the salmon
indexer:

::
> ./bin/salmon index -t transcripts.fa -i transcripts_index

Then, you can quantify any set of reads (say, paired-end reads in files
reads1.fa and reads2.fa) directly against this index using the salmon ``quant``
command as follows:

::

> ./bin/salmon quant -i transcripts_index -l <LIBTYPE> -1 reads1.fa -2 reads2.fa -o transcripts_quant

You can, of course, pass a number of options to control things such as the
number of threads used or the different cutoffs used for counting reads.
Just as with the alignment-based mode, after salmon has finished running, there
will be a directory called ``salmon_quant``, that contains a file called
``quant.sf`` containing the quantification results.

What's this ``LIBTYPE``?
------------------------
@@ -257,10 +268,10 @@ Misc
----

Salmon deals with reading from compressed read files in the same way as
sailfish --- by using process substitution. Say in the read-based salmon
example above, the reads were actually in the files ``reads1.fa.gz`` and
``reads2.fa.gz``, then you'd run the following command to decompress the reads
"on-the-fly":
Sailfish --- by using process substitution. Say that, in the
lightweight-alignment-based Salmon example above, the reads were actually in the
files ``reads1.fa.gz`` and ``reads2.fa.gz``; you'd then run the following
command to decompress the reads "on-the-fly":

::
