Salmon v0.7.0
Salmon 0.7.0
Version 0.7.0 of Salmon introduces a considerable number of improvements and new features. The main changes with respect to v0.6.0 are listed below.
- Renaming of default auxiliary directory: The default name for the auxiliary directory has been changed from `aux` to `aux_info`. This eliminates an annoying issue when copying quantification results to a Windows machine, as Windows forbids certain directory names (`aux` among them).
- Automatic library type detection: Salmon can now guess the type of library being provided. To use this feature, pass the automatic type, denoted by `A`, as the library type (e.g. `-l A` or `--libType A`). You must still provide either `-r` for single-end reads or `-1` and `-2` for paired-end reads. When automatic library type detection is enabled, Salmon examines how the first 50,000 reads map and uses their compatibility with the different library types to guess the type of the library (e.g. `IU`, `ISR`, etc.). This library type is then applied to all subsequent reads. Salmon writes the guessed library type to the console, and at the end of the run it also records this information in the `meta_info.json` file in the `aux_info` sub-directory of the quantification directory.
- Output unmapped read information: Salmon can optionally output information about reads that were unmapped during quantification. If you pass the `--writeUnmappedNames` flag to Salmon, it will create a file called `unmapped_names.txt` under the `aux_info` directory that contains the names of the unmapped reads. Salmon writes only the name of each unmapped read, so you will have to go back to the FASTA/Q file for the sequence itself. For single-end reads, a read is either mapped (and so does not appear in the file) or unmapped, in which case Salmon writes the read name to the file followed by the character 'u'. For paired-end reads, 'u' means that neither end maps. The other possibilities are 'm1' (only read 1 mapped, so read 1 is an orphan), 'm2' (only read 2 mapped, so read 2 is an orphan), and 'm12' (both reads 1 and 2 mapped, but never to the same transcript). From this information and the original read set, one can recover the unmapped sequences.
- Modification to default computation of effective lengths: In version 0.7.0, Salmon computes effective transcript lengths using the approach of kallisto [1]. That is, the effective length of a transcript is computed as the original length of that transcript (say `l`) minus the mean of the conditional fragment length distribution (the fragment length distribution restricted to lengths less than or equal to `l`). The effective lengths computed in this manner are similar to effective lengths computed in more traditional ways; the biggest differences are for short transcripts, which typically receive less extreme corrections under this approach.
- Modification to sequence-specific bias correction: As of version 0.7.0, Salmon adopts a new bias correction methodology. The new model uses a variable-length Markov model (VLMM) to model the sequence-specific bias and is closely based on the approach introduced by Roberts et al. [2]. This model accounts separately for sequence-specific biases at the 5' and 3' ends of sequenced fragments. The correction of sequence-specific bias is enabled with the `--seqBias` flag.
- New fragment-GC bias correction: As of version 0.7.0, Salmon can correct for fragment-GC bias. This bias is separate from sequence-specific bias and results from preferential sequencing of fragments based on the GC content of the fragment itself. A thorough investigation of numerous samples suggests that this bias is prevalent in existing RNA-seq samples, and that its effect can be even greater than that of sequence-specific bias (and persists even after the removal of 5' and 3' sequence-specific bias) [3]. The correction of fragment-GC bias is enabled with the `--gcBias` flag. Sequence-specific and fragment-GC bias can be corrected at the same time by passing both flags; in this case, a conditional fragment-GC bias model is used (although the two biases are different, they are not completely independent).
- New read parser: Salmon now relies on FQFeeder (which, in turn, relies on kseq and moodycamel's concurrent queue). As a result, Salmon now supports reading directly from gzipped input files. The previous approach (i.e. redirecting input to Salmon via process substitution) is still supported and may, in fact, be faster in some situations. However, direct support for compressed files makes commands easier to type and reduces friction in environments where process substitution syntax is not directly supported.
- Removal or hiding of deprecated options: A number of options that are no longer functional, or that will no longer be functional after the planned removal of FMD-based mapping mode, have been either removed or hidden from the help menu. These include most of the options related to FMD-based mapping mode, as well as some other features that are now deprecated. Specifically, the `--useFSPD` option would, in conjunction with other options and on certain datasets, result in non-deterministic crashes. Further, a new and improved positional bias model (currently experimental, but which can be enabled with the `--posBias` flag) is in testing and slated for the next release. We hope that removing some of the now-defunct options will reduce the number of reasonable choices the user has to make.
- Modification of the variational Bayesian prior: The default variational Bayesian prior has been modified in form and value. As of v0.7.0, Salmon uses a per-nucleotide prior rather than a per-transcript prior (a per-transcript prior can be enabled by passing the `--perTranscriptPrior` option to Salmon). By default, this prior is 0.001 per nucleotide (so, e.g., a transcript with an effective length of 1,000 will have a prior count of 1). The value of the prior itself can be modified with the `--vbPrior` option. We note that this prior is, on average, substantially larger than the prior used in previous versions of Salmon. The larger prior will often result in Salmon reporting more transcripts as expressed (though at very low abundance), but it typically increases robustness at low abundances.
- Bug fix: Fixed a bug in variational Bayesian mode (`--useVBOpt`) that would sometimes result in `nan` in the output. The bug was the result of attempting to compute estimated read counts for very low abundance transcripts, where the digamma function would be evaluated at its pole. Now a small minimum abundance is required and checked before evaluating the digamma function.
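The status codes written with `--writeUnmappedNames` (described in the list above) lend themselves to a small post-processing script. The following is a minimal Python sketch, not part of Salmon itself. It assumes that each line of `unmapped_names.txt` holds a read name and its status code separated by whitespace; the function name `parse_unmapped_names` and the sample read names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

# Status codes as described in the release notes above.
STATUS_MEANING = {
    "u": "unmapped (for paired-end reads: neither end mapped)",
    "m1": "only read 1 mapped (read 1 is an orphan)",
    "m2": "only read 2 mapped (read 2 is an orphan)",
    "m12": "both reads mapped, but never to the same transcript",
}


def parse_unmapped_names(lines: Iterable[str]) -> Dict[str, str]:
    """Return a mapping from read name to status code.

    ASSUMPTION: each non-empty line is '<read name><whitespace><status>'.
    """
    statuses: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = line.rsplit(None, 1)  # split off the last whitespace-separated token
        if len(parts) != 2 or parts[1] not in STATUS_MEANING:
            raise ValueError(f"unrecognized line: {line!r}")
        name, status = parts
        statuses[name] = status
    return statuses


# Hypothetical sample content; a real run would read aux_info/unmapped_names.txt.
sample = ["read_17 u", "read_42 m1", "read_99 m12", "read_100 u"]
statuses = parse_unmapped_names(sample)
counts = Counter(statuses.values())
```

The resulting read names can then be used to pull the corresponding records out of the original FASTA/Q files.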
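The effective-length rule described in the list above (transcript length minus the mean of the fragment length distribution restricted to lengths no greater than the transcript) can be illustrated with a short Python sketch. This is not Salmon's actual implementation; the function name `effective_length`, the representation of the distribution as a list indexed by fragment length, and the fallback for the case where no probability mass lies below the transcript length are all assumptions made for illustration.

```python
from typing import Sequence


def effective_length(length: int, frag_len_pmf: Sequence[float]) -> float:
    """Transcript length minus the conditional mean fragment length.

    frag_len_pmf[k] is the probability of observing a fragment of length k
    (index 0 is unused). Only fragment lengths <= `length` are considered.
    """
    max_len = min(length, len(frag_len_pmf) - 1)
    mass = sum(frag_len_pmf[1:max_len + 1])
    if mass <= 0.0:
        # ASSUMPTION: with no usable fragment lengths, apply no correction.
        return float(length)
    cond_mean = sum(k * frag_len_pmf[k] for k in range(1, max_len + 1)) / mass
    return length - cond_mean


# Toy distribution: fragments are 100 or 200 bases long with equal probability.
pmf = [0.0] * 201
pmf[100] = 0.5
pmf[200] = 0.5

long_tx = effective_length(1000, pmf)   # conditional mean uses both lengths
short_tx = effective_length(150, pmf)   # only the 100-base fragments fit
```

In this toy example the short transcript is corrected by the mean of the fragments that can actually fit inside it, rather than by the global mean, which is why short transcripts receive less extreme corrections.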
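The arithmetic behind the per-nucleotide variational Bayesian prior mentioned in the list above can be sketched as follows. This is an illustration only, assuming that the prior count of a transcript is the prior value multiplied by its effective length, and that a per-transcript prior applies the same value to every transcript regardless of length; the function name `vb_prior_count` is hypothetical.

```python
def vb_prior_count(effective_length: float,
                   vb_prior: float = 0.001,
                   per_transcript: bool = False) -> float:
    """Prior count assigned to one transcript.

    ASSUMPTION: a per-nucleotide prior scales with effective length, while a
    per-transcript prior is the same constant for every transcript.
    """
    if per_transcript:
        return vb_prior
    return vb_prior * effective_length


# Default per-nucleotide prior on a transcript with effective length 1,000.
per_nucleotide = vb_prior_count(1000)
# Same prior value applied per transcript instead.
flat = vb_prior_count(1000, per_transcript=True)
```

Under these assumptions, longer transcripts receive proportionally larger prior counts, consistent with the example of a prior count of 1 for an effective length of 1,000.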
1: Bray, Nicolas L., et al. "Near-optimal probabilistic RNA-seq quantification." Nature biotechnology 34.5 (2016): 525-527.
2: Roberts, Adam, et al. "Improving RNA-Seq expression estimates by correcting for fragment bias." Genome biology 12.3 (2011): 1.
3: Love, Michael I., John B. Hogenesch, and Rafael A. Irizarry. "Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation." bioRxiv (2015): 025767.