Releases: COMBINE-lab/salmon

Salmon v0.9.1

29 Nov 04:06

Salmon 0.9.1 Release Notes

Note: Version 0.9.1 fixes a warning with the indexer that was introduced by an API change that occurred due to an updated Fasta/q parser. The warning does not affect the indexing process, but nonetheless, the proper API should be obeyed. Also, v0.9.1 fixes a very small but long-standing indexing bug that would cause a single k-mer (the lexicographically largest) to not be indexed properly. The Salmon v0.9.0 release notes are recapitulated below for the convenience of those upgrading directly from v0.8.2.

As always, the newest release is easily installable via bioconda and Docker.

New features

  • During indexing, Salmon will now discard duplicate transcripts (i.e., transcripts with exactly the same sequence) by default. The information about the duplicate transcripts is written to a file in the index directory called duplicate_clusters.tsv. This is a two-column TSV file where the first column lists the name of a retained transcript and the second column lists the name of a discarded duplicate transcript (i.e., a transcript with identical sequence to the retained transcript, but which was discarded). Note: If you wish to retain multiple identical transcripts in the input (the prior behavior), this can be achieved by passing the Salmon indexing command the --keepDuplicates flag.

  • This is not a new feature, per se, but it brings further parity between the alignment-based and mapping-based modes. It is now possible to dump the equivalence class files (with --dumpEq) when using Salmon in alignment-based mode.

  • The range-factorization has been merged into the master branch. This allows using the data-driven likelihood factorization, which can improve quantification accuracy on certain classes of "difficult" transcripts. Currently, this feature interacts best (i.e., yields the most considerable improvements) when using alignment-based mode and when enabling error modeling (--useErrorModel), though it can yield improvements in the mapping-based mode as well. This feature will also interact constructively with selective-alignment, which should land in the next (non-bug fix) release.

  • Added the quantmerge command. This allows producing a multi-sample TSV file with aggregated abundance metrics over samples from many different quantification runs. This can be useful to ease e.g. uploading of quantified data to certain online analysis tools like Degust.
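As a rough illustration of how the duplicate_clusters.tsv file described in the first bullet might be consumed, the following Python sketch groups discarded duplicates under the transcript that was retained. The function name and the `has_header` switch are hypothetical, and whether the file begins with a header row is an assumption to check against your own index directory; the header and transcript names in the example are made up.

```python
import csv
import io
from collections import defaultdict


def read_duplicate_clusters(handle, has_header=True):
    """Map each retained transcript name to the duplicates discarded in its favor.

    `has_header` is an assumption: pass False if the file has no header row.
    """
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row, if the file has one
    clusters = defaultdict(list)
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        retained, discarded = row[0], row[1]
        clusters[retained].append(discarded)
    return dict(clusters)


# Tiny in-memory stand-in for <index_dir>/duplicate_clusters.tsv (made-up names).
example = io.StringIO(
    "RetainedTxp\tDuplicateTxp\n"
    "txA\ttxB\n"
    "txA\ttxC\n"
    "txD\ttxE\n"
)
clusters = read_duplicate_clusters(example)
print(clusters)
```

A mapping like this can be used to copy the abundance reported for a retained transcript onto the names of its discarded duplicates, if downstream tools expect every input name to be present.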

Other improvements, features and changes

  • The multi-threaded read parser used by Salmon has been updated to considerably improve CPU utilization. Specifically, the previous queue management strategy (busy waiting) has been replaced by an intelligent, bounded, exponential-backoff strategy. Many improvements (and much of the code) come from a series of blog posts by David Geier. In practice, this means that performance will be the same as with the prior implementation if your disks can feed reads to Salmon quickly enough, but if they can't, considerably less CPU time will be wasted waiting on input (i.e., processing speed will be better matched to I/O throughput).

  • In addition to the improved parser behavior, some of the noisy logger messages in the parser have been eliminated. In "pathological" situations with very fast disks and slow CPUs (or vice-versa), the previous parser may have generated an inordinate amount of output, creating large log files and otherwise slowing down processing. This should no longer happen.

  • Salmon will now terminate early (with a non-zero exit code) and report a meaningful error message if a corrupt input file is detected. Previously, corrupted compressed input files could have caused the parser to hang indefinitely. This behavior was fixed upstream in kseq, and the current parser wraps this detection with a descriptive exception message.

  • Renamed the --allowOrphans flag to --allowOrphansFMD, and added a --discardOrphansQuasi flag. This is a bit messy currently (the default in FMD mapping is to discard orphans, and in quasi-mapping it is to keep them). These flags do the obvious things and are documented further in the command line help. We are considering how best to clean up and simplify these flags in future releases.

  • Many other small improvements and bug fixes.
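The idea behind replacing busy waiting with a bounded exponential backoff can be sketched in a few lines of Python. This is only an illustration of the general technique, not Salmon's parser (which is written in C++); the function name, the sleep bounds, and the toy producer are all assumptions made for the example.

```python
import queue
import threading
import time


def pop_with_backoff(q, is_done, min_sleep=1e-6, max_sleep=1e-3):
    """Pop one item from `q`, sleeping between failed attempts.

    The sleep interval doubles after each miss but never exceeds `max_sleep`
    (the "bounded" part). Returns None once the producer is done and `q` is empty.
    """
    sleep = min_sleep
    while True:
        try:
            return q.get_nowait()
        except queue.Empty:
            if is_done():
                # The producer finished; pick up anything added just before it did.
                try:
                    return q.get_nowait()
                except queue.Empty:
                    return None
            time.sleep(sleep)  # yield the CPU instead of spinning
            sleep = min(sleep * 2, max_sleep)


reads = queue.Queue()
done = threading.Event()


def producer():
    for i in range(5):
        time.sleep(0.002)  # simulate a slow disk
        reads.put(f"read{i}")
    done.set()


worker = threading.Thread(target=producer)
worker.start()
consumed = []
while True:
    item = pop_with_backoff(reads, done.is_set)
    if item is None:
        break
    consumed.append(item)
worker.join()
print(consumed)
```

When the producer keeps the queue full, the first `get_nowait()` succeeds and no sleeping happens, which matches the note that throughput is unchanged on fast disks; the backoff only matters when the consumer would otherwise spin.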

Salmon v0.9.0

26 Nov 02:48

Salmon 0.9.0 Release Notes

As always, the newest release is easily installable via bioconda and Docker.

New features

  • During indexing, Salmon will now discard duplicate transcripts (i.e., transcripts with exactly the same sequence) by default. The information about the duplicate transcripts is written to a file in the index directory called duplicate_clusters.tsv. This is a two-column TSV file where the first column lists the name of a retained transcript and the second column lists the name of a discarded duplicate transcript (i.e., a transcript with identical sequence to the retained transcript, but which was discarded). Note: If you wish to retain multiple identical transcripts in the input (the prior behavior), this can be achieved by passing the Salmon indexing command the --keepDuplicates flag.

  • This is not a new feature, per se, but it brings further parity between the alignment-based and mapping-based modes. It is now possible to dump the equivalence class files (with --dumpEq) when using Salmon in alignment-based mode.

  • The range-factorization has been merged into the master branch. This allows using the data-driven likelihood factorization, which can improve quantification accuracy on certain classes of "difficult" transcripts. Currently, this feature interacts best (i.e., yields the most considerable improvements) when using alignment-based mode and when enabling error modeling (--useErrorModel), though it can yield improvements in the mapping-based mode as well. This feature will also interact constructively with selective-alignment, which should land in the next (non-bug fix) release.

  • Added the quantmerge command. This allows producing a multi-sample TSV file with aggregated abundance metrics over samples from many different quantification runs. This can be useful to ease e.g. uploading of quantified data to certain online analysis tools like Degust.

Other improvements, features and changes

  • The multi-threaded read parser used by Salmon has been updated to considerably improve CPU utilization. Specifically, the previous queue management strategy (busy waiting) has been replaced by an intelligent, bounded, exponential-backoff strategy. Many improvements (and much of the code) come from a series of blog posts by David Geier. In practice, this means that performance will be the same as with the prior implementation if your disks can feed reads to Salmon quickly enough, but if they can't, considerably less CPU time will be wasted waiting on input (i.e., processing speed will be better matched to I/O throughput).

  • In addition to the improved parser behavior, some of the noisy logger messages in the parser have been eliminated. In "pathological" situations with very fast disks and slow CPUs (or vice-versa), the previous parser may have generated an inordinate amount of output, creating large log files and otherwise slowing down processing. This should no longer happen.

  • Salmon will now terminate early (with a non-zero exit code) and report a meaningful error message if a corrupt input file is detected. Previously, corrupted compressed input files could have caused the parser to hang indefinitely. This behavior was fixed upstream in kseq, and the current parser wraps this detection with a descriptive exception message.

  • Renamed the --allowOrphans flag to --allowOrphansFMD, and added a --discardOrphansQuasi flag. This is a bit messy currently (the default in FMD mapping is to discard orphans, and in quasi-mapping it is to keep them). These flags do the obvious things and are documented further in the command line help. We are considering how best to clean up and simplify these flags in future releases.

  • Many other small improvements and bug fixes.

Salmon v0.8.2

17 Mar 19:37

The main purpose of this release is to fix a bug (introduced in v0.8.0) that would prevent Salmon from being able to properly load 64-bit indices (i.e., when the size of the transcriptome exceeds what an int32_t can represent). If you are affected by this bug, it has been fixed in this release. Further, the bug existed only in the loading code. Hence, 64-bit indices made with Salmon v0.8.1 can now be properly loaded in Salmon v0.8.2.

Bug Fixes

  • This release fixes a bug that would prevent Salmon from being able to properly load 64-bit indices.

Minor enhancements and improvements

  • Changed the default size of the asynchronous queues used for logging. Previously, the queues were made unnecessarily large. This inflated memory usage, and the large allocation slowed down startup time a bit (only really noticeable with small indices). Salmon should now use a bit less memory and start up a bit faster.

  • Removed unused extraneous posWeight in equivalence classes, leading to a small bump in speed and a small decrease in memory usage.

  • The build system has been enhanced for this release to allow SHA256 verification of all packages downloaded during build. This will allow the new version to be made available via Homebrew (which has not accepted PRs on Salmon since v0.7.2 since it did not validate the non-vendored, downloaded packages).

Salmon v0.8.1

06 Mar 15:15

Even though the changes and fixes are minor, it is recommended you update to Salmon 0.8.1 if you are using a previous version.

Bug Fixes

  • This release includes a fix for a bug that would cause Salmon (in alignment-based mode only) to write the inferential sampling (i.e., Gibbs sampling or bootstrap) files to the current working directory, rather than the quantification directory. This release resolves that bug, and now the inferential samples are properly written to the target quantification directory in both modes.

Changes & improvements

  • The indexer now computes a signature (SHA256 sum) of both the sequence and headers in the fasta file used to build the index. These signatures are propagated into the quantification estimates in the meta_info.json file. This will allow one to verify the exact index used for quantification, and will be utilized in software we are working on for metadata propagation across RNA-seq pipelines. However, this means that indices built for previous versions of Salmon will need to be re-built for v0.8.1.

  • JEMalloc is now built with the --disable-debug flag. This avoids spurious zone allocator warnings on MacOS Sierra.

  • Bumped version of JEMalloc to 4.5.0.

  • Bumped to the latest version of spdlog.

  • Bumped included version of libcuckoo.

  • Bumped included version of sparsepp (via RapMap).

  • Bumped included version of RapMap.

  • Internal refactoring and improvement of option parsing and argument handling code. This is the first phase of a larger-scale unification of the quasi-mapping-based and alignment-based modes that will be made over the next few releases.
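To make the idea of separate sequence and header signatures concrete, here is a minimal Python sketch that computes one SHA256 digest over the sequence lines of a FASTA file and another over its header lines. The notes do not say exactly which bytes the indexer hashes, so the choices below (headers without the leading ">", sequences without line breaks) are assumptions, and the digests are not expected to match the values Salmon records in meta_info.json.

```python
import hashlib
import io


def fasta_signatures(handle):
    """Return (sequence_digest, header_digest) as hex SHA256 strings.

    Assumption: headers are hashed without the leading '>' and sequences
    without line breaks; the real indexer may hash different bytes.
    """
    seq_hash = hashlib.sha256()
    name_hash = hashlib.sha256()
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name_hash.update(line[1:].encode("utf-8"))
        else:
            seq_hash.update(line.encode("utf-8"))
    return seq_hash.hexdigest(), name_hash.hexdigest()


# Two made-up FASTA files: same sequence (wrapped differently), different header.
fasta_a = io.StringIO(">tx1 some description\nACGTACGT\nGGCC\n")
fasta_b = io.StringIO(">tx1_renamed\nACGTACGTGGCC\n")
sig_a = fasta_signatures(fasta_a)
sig_b = fasta_signatures(fasta_b)
print(sig_a[0] == sig_b[0], sig_a[1] == sig_b[1])
```

Keeping the two digests separate makes it possible to tell a renamed-but-identical reference apart from one whose sequence content actually changed.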

Salmon v0.8.0

23 Jan 06:40

Bug Fixes

  • Fixed a bug in .gtf-based gene aggregation output that could cause a transcript to be attributed to the wrong gene if the transcript was not present in the gtf file.
  • Fixed bug that required a qualified path be provided when writing the quasi-mapping file (i.e., .sam).
  • Fixed a bug that could cause the SAM header to fail to be written when writing quasi-mappings to stdout.
  • Fixed behavior of --numPreAuxModelSamples so that it is consistent between quasi-mapping and alignment-based mode (and has an effect in both).
  • Fixed a "short style" option collision.
  • Fixed a bug that would cause bias correction not to be run if only the --posBias flag was passed.

Minor changes & improvements

  • Bumped to the latest version of spdlog.
  • Bumped included version of libcuckoo.
  • Bumped included version of sparsepp (via RapMap).
  • Bumped included version of RapMap.
  • meta_info.json now contains more information about the length classes used for positional bias correction when enabled (these length classes are now data driven.)
  • meta_info.json now records if equivalence classes were dumped, and if so, what properties were dumped as well (e.g. rich weights).
  • meta_info.json now includes the end as well as beginning time of each run.
  • Improvements to fragment-GC bias modeling for fragments that fall very close to the beginning or end of transcripts.
  • Added .gff and .gff3 (and capitalized variants of all) as recognized file formats for gene aggregation mode.
  • Changed the default prior mean and standard deviation of the fragment length distribution to better match more recent protocols and libraries.
  • Made slight improvements to the computation of the conditional fragment probabilities (i.e., P(f | t) in the model). Now the probability of a fragment length is conditioned on the transcript length, and the probability of a start position takes that length into account.

New features

  • Some important new indexing improvements due to improvements in RapMap; read more below.

  • Substantial overhaul and improvements to the posterior Gibbs sampler. The methodology now generally follows that of mmseq [1]. Specifically, the new (uncollapsed) sampler improves estimates of sampling variance (and uses the same methodology as before to account for inferential uncertainty).

  • Added --thinningFactor flag that lets the user specify how many Gibbs samples should be skipped between saved samples. Increasing this causes the Gibbs chain to run longer to generate a given number of target samples (but potentially reduces the autocorrelation between samples). The default is 16.

  • Added --meta flag, that automatically selects internal options optimized for metagenomic & microbiomic quantification.

  • Added --dumpEqWeights option that includes the rich equivalence class weights in the output file when equivalence classes are written to file.

  • Added experimental --noLengthCorrection option. This is intended to be used when quantifying based on protocols (e.g., Lexogen Quantseq) where the number of sequenced fragments / tags deriving from a target are assumed to be independent of that target's length. (This feature is still experimental, and requires more testing, so please provide feedback if you use it).

  • Added new --quasiCoverage option. This is analogous to the --coverage option, but the latter applies only to mapping under the FMD-based index (which is no longer recommended). This option enforces that a certain fraction of the read is covered by exact matches (specifically, maximum mappable prefixes) in order to consider a mapping as valid. The value is expressed as a number between 0 and 1; a larger value is more stringent, and less likely to allow spurious mappings, but can reduce sensitivity.

    New features due to changes and improvements in RapMap

    • New hash map for default index - The default quasiindex command now uses the sparsepp sparse hash map. While providing very similar lookup performance to the prior hash map implementation, sparsepp provides a number of benefits. Specifically, it uses substantially less memory (typically ~50% less) and, crucially, the memory usage grows gradually with the number of keys. A big problem with the previous implementation (Google's dense hash map) is that, on resize, the map would double and memory usage would jump by a factor of 3 (a new map of twice the size of the old, plus the original map from which to copy the keys). This means that even if you had enough memory to hold the final map, you might not be able to build it. Sparsepp, on the other hand, exhibits memory usage that scales almost linearly with the number of items in the map. For more details on the performance characteristics of the new default hash used in the index, please see the sparsepp benchmarks.
    • New frugal perfect hash index - The vastly improved memory usage of the new default quasi index essentially obviates the previous perfect-hash-based index. Specifically, since that perfect hash also stored the keys (to validate queries from outside the universe on which the hash was built), the size of the resulting index was similar; it simply required less memory to build. However, sparsepp achieves very similar memory usage to the previous perfect-hash-based index. Instead of removing the perfect-hash-based index entirely, the -p/--perfectHash flag now tells the quasiindex command to build a frugal perfect-hash-based index. This index uses a number of aggressive space-saving techniques, which result in a much smaller memory footprint (but it is also slower to construct and has slower lookups than the default index). For large references, the new frugal perfect-hash-based index exhibits a memory reduction (over the new, reduced-memory, default index) of 40-50% (hence, it shows close to this savings over the old perfect-hash-based index as well). Also, for large references, the size of the index on disk is ~40% smaller. The cost of this substantial size reduction is that the frugal perfect-hash-based index takes 2-2.5 times longer to build, and lookups are slower. This slower lookup speed can, conceivably, reduce quasi-mapping speed a bit, but the speed hit (if there is one) is dataset dependent. This new indexing scheme should allow the construction of quasi indices on substantially larger references for a fixed RAM budget, and also reduces the memory required to retain the index in memory during mapping. Note: This type of index is specifically recommended if you need to build an index on a large set of targets (e.g., for metagenomic or microbiomic use).
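The effect of the --thinningFactor flag described above can be illustrated with a small, generic Python sketch of chain thinning. This is not Salmon's sampler: the function name and the toy "chain" are hypothetical, and the sketch assumes that a thinning factor of k means every k-th draw is saved.

```python
import itertools


def thinned_samples(draw_next, num_samples, thinning_factor=16):
    """Run a chain until `num_samples` draws are saved, keeping every
    `thinning_factor`-th draw (assumed interpretation of the factor).

    Returns the saved draws and the total number of chain iterations used.
    """
    if thinning_factor < 1:
        raise ValueError("thinning_factor must be at least 1")
    kept = []
    iterations = 0
    while len(kept) < num_samples:
        value = draw_next()
        iterations += 1
        if iterations % thinning_factor == 0:
            kept.append(value)
    return kept, iterations


# Toy chain that just counts 1, 2, 3, ... so the thinning is easy to see.
counter = itertools.count(1)
kept, iterations = thinned_samples(lambda: next(counter), num_samples=3, thinning_factor=4)
print(kept, iterations)
```

Under this interpretation the chain length grows linearly with the thinning factor for a fixed number of saved samples, which is the cost the release note refers to.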

References

[1] Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads. Turro E, Su S-Y, Goncalves A, Coin L, Richardson S and Lewin A. Genome Biology, 2011 Feb; 12:R13. doi: 10.1186/gb-2011-12-2-r13.

Salmon 0.7.2

31 Aug 23:10

Bug Fixes

  • Removed an attempt to copy an unnecessary file (related to EMPH, the old perfect hash function) in fetchRapMap.sh
  • Fixed default alignment-mode options to be consistent with mapping-mode
    • Default maximum fragment length increased from 800 to 1000
    • Default alignment-mode auxiliary directory is now named aux_info as in mapping-mode
  • Fixed description of library types in lib_format_counts.json
    • Fixed string representation of single-end stranded reads (SF and SR rather than F & R as before)
    • Fixed duplicate entries in lib_format_counts.json
  • Fixed various other typos in the --help menus, including the removal of a duplicate option in alignment-based mode

Minor changes & improvements

  • Bumped to the latest version of spdlog.
  • Bumped to the latest version of Staden IO
  • The automatically detected library type is now applied slightly earlier, so that fewer fragments that map inconsistently with what is eventually considered to be the library format will be considered.
  • meta_info.json now contains information about the number of posterior samples, regardless of whether these were obtained with bootstrapping or posterior Gibbs sampling.

New features

  • Added the ability to write out mapping information to SAM format — When Salmon is run in mapping mode (with the quasi index), you can now have it write out information about the quasi-mappings it uses for quantification. This behavior is enabled with the option --writeMappings. If this option is provided with no arguments, it will, by default, write the mapping information to stdout (this can then be piped to e.g. samtools and converted to BAM format). Optionally, you may also provide this argument with a filename, in which case the mapping information will be written to that file in SAM format.
    • Note: In the 0.7.2 release, the file provided to --writeMappings must use a qualified path (e.g. --writeMappings=./out.sam rather than --writeMappings=out.sam). This constraint is already addressed on develop and will be fixed in the next release. Further, note that, because --writeMappings has an implicit argument (stdout), any explicit argument must be directly adjacent to the option; i.e. --writeMappings=./out.sam is OK, but --writeMappings ./out.sam is not. This is a fundamental limitation of how boost program_options handles implicit values.
    • Note: The mapping information is computed and written before library type compatibility checks take place, thus the mapping file will contain information about all mappings of the reads considered by Salmon, even those that may later be filtered out due to incompatibility with the library type.
  • Added the ability to perform automatic library type detection in alignment-based mode.
    • Note: The implementation of this feature involves opening the BAM file, peeking at the first record, and then closing it to determine if the library should be treated as single-end or paired-end. Thus, in alignment-based mode automatic library type detection will not work with an input stream. If your input is a regular file, everything should work as expected; otherwise, you should provide the library type explicitly in alignment-based mode.
    • Note: The automatic library type detection is performed on the basis of the alignments in the file. Thus, for example, if the upstream aligner has been told to perform strand-aware mapping (i.e. to ignore potential alignments that don't map in the expected manner), but the actual library is unstranded, automatic library type detection cannot detect this. It will attempt to detect the library type that is most consistent with the alignments that are provided.

Thanks

This release contains fixes to bugs reported by (or features suggested by) the following people:

Salmon v0.7.1

22 Aug 05:09

Salmon 0.7.1

Salmon 0.7.1 is a bug-fix release. It maintains all of the major changes and improvements implemented in v0.7.0, but improves program behavior and fixes a few bugs (most of which predate v0.7.0). It is recommended that all users upgrade to v0.7.1 when possible.

New minor feature:

  • The --gencode flag has been added as an optional flag to the index building step. When building the quasi index, this will expect the transcript fasta to be in Gencode format. Instead of using the entire line following > as the transcript name, it will split the header at the first occurrence of the | character, and will use the first token preceding this | as the transcript name. This will make the names used from the fasta file properly correspond to the name the transcripts are given in the GTF. Thanks to @nicolasstransky for suggesting this feature.
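The header handling described for --gencode amounts to keeping the text before the first | character. A minimal Python sketch of that rule follows; the function name is hypothetical and the header shown is a made-up example in the Gencode style, not a real record.

```python
def gencode_transcript_name(header_line):
    """Return the part of a FASTA header that precedes the first '|'.

    If the header contains no '|', the whole header (minus '>') is returned.
    """
    header = header_line.strip()
    if header.startswith(">"):
        header = header[1:]
    return header.split("|", 1)[0]


# Made-up header in the Gencode style (fields separated by '|').
header = ">ENST00000000001.1|ENSG00000000001.1|GENE1|1500|"
name = gencode_transcript_name(header)
print(name)
```

The resulting name is what would need to match the transcript identifiers in the GTF for gene-level aggregation to line up.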

Improved program behavior:

  • Salmon invoked with no options returns exit code 1 (addresses #71)
  • Salmon help has been made (hopefully) less confusing (addresses #72)
  • Salmon invoked with no options now prints the help message (addresses #73)
  • Salmon's different help menus now have consistent return codes (addresses #74)

Bug fixes

  • Fixed a bug that would cause Salmon to crash when setting --incompatPrior to 0 in alignment-based mode. Thanks to @zhangchipku for reporting this and providing a test case. Further, the behavior of --incompatPrior 0 has been modified in both alignment and read based mode to be more consistent with the behavior one might expect (i.e. fragments aligning / mapping in a manner other than as specified by the library type are completely ignored). Previously, a floating-point precision issue when using --incompatPrior 0 would prevent these fragments from being completely ignored. Additionally, messages concerning zero probability fragments have been summarized to one per thread (rather than one per such fragment).
  • Fixed a bug that could cause Salmon to crash when using --seqBias or --gcBias. This bug was triggered only when run on processors with hardware lock elision (HLE) enabled. It seemed to be triggered only on very specific versions of Xeon processors (in conjunction with specific versions of linux distributions). Nonetheless, this bug could trigger a call to unlock() from a thread not owning the corresponding lock, which could lead pthread to segfault. This bug has been addressed by moving the unlock() call and also switching the lock type from a mutex to a custom "tryable" spin-lock, which doesn't exhibit such behavior. Thanks to @nicolasstransky for uncovering and reporting this issue, and providing a test case that would reproduce it (on appropriate hardware).

Salmon v0.7.0

15 Aug 03:12

Salmon 0.7.0

Version 0.7.0 of Salmon introduces a considerable number of improvements and new features. The main changes with respect to v0.6.0 are listed below.

  • Renaming of default auxiliary directory — The default name for the auxiliary directory has been changed from aux to aux_info. This eliminates a very annoying issue when copying quantification results to a Windows machine, as Windows forbids certain directory names (among which aux is one).
  • Automatic library type detection — Salmon now has the ability to guess the type of library being provided. To use this feature, provide the automatic type as the library type, which is denoted by A (e.g. -l A or --libType A). You must still provide either -r for single-end reads or -1 and -2 for paired-end reads. When automatic library type detection is enabled, Salmon will examine the manner in which the first 50,000 reads map, and will use compatibility with different library types to guess the type of the library (e.g. IU, ISR, etc.). This library type will then be applied to all subsequent reads. Salmon will write to the console the library type it guessed, and it will also record this information at the end of the run in the meta_info.json file in the aux_info sub-directory of the quantification directory.
  • Output unmapped read information — Salmon can optionally output information about reads that were unmapped during quantification. If you pass the --writeUnmappedNames flag to Salmon, then it will create a file called unmapped_names.txt under the aux_info directory that will contain the names of the unmapped reads. Salmon writes only the name of the unmapped read, so you will have to go back to the FASTA/Q file for the sequence itself. For single-end reads, the read is either mapped (and so doesn't appear in the file) or is unmapped, in which case Salmon will write the read name to the file followed by the character 'u'. For paired-end reads, 'u' means that neither end maps. The other possibilities are 'm1' (only read 1 mapped — read 1 is an orphan), 'm2' (only read 2 mapped — read 2 is an orphan), 'm12' (both reads 1 and 2 mapped, but never to the same transcript). From this information, and the original read set, one can recover the unmapped sequences.
  • Modification to default computation of effective lengths — In version 0.7.0, Salmon computes effective transcript lengths using the approach of kallisto [1]. That is, the effective length of a transcript is computed as the original length of that transcript (say l), minus the mean of the conditional fragment length distribution (the fragment length distribution for all lengths less than or equal to l). The effective lengths computed in this manner are similar to effective lengths computed in more traditional ways, but the biggest differences are for short transcripts, which typically receive less extreme corrections under this approach.
  • Modification to sequence-specific bias correction — As of version 0.7.0, Salmon adopts a new bias correction methodology. The new model uses a variable-length Markov model (VLMM) to model the sequence-specific bias and is closely based on the approach introduced by Roberts et al. [2]. This model accounts separately for sequence-specific biases at both the 5' and 3' ends of sequenced fragments. The correction of sequence-specific bias is enabled with the --seqBias flag.
  • New fragment-GC bias correction — As of version 0.7.0, Salmon includes the ability to correct for fragment-GC bias. This bias is separate from sequence-specific bias, and is a result of preferential sequencing of fragments based on the GC content of the fragment itself. A thorough investigation of numerous samples suggests that this is a prevalent bias in existing RNA-seq samples, the effect of which can be even greater than that of sequence-specific bias (and persists even after the removal of 5' and 3' sequence-specific bias) [3]. The correction of fragment-GC bias is enabled with the --gcBias flag. Sequence-specific and fragment-GC bias can both be corrected at the same time by passing both flags; in this case, a conditional fragment-GC bias model is used (despite the fact that the biases are different, they are not completely independent).
  • New read parser — Salmon now relies on FQFeeder (which, in turn relies on kseq and moodycamel's concurrent queue). As such, this means that Salmon now supports reading from gzipped input files directly. The previous approach (i.e. redirecting input to Salmon via process substitution) is still supported, and may, in fact, be faster in some situations. However, direct support for compressed files makes commands easier to type and reduces friction in some environments where process substitution syntax is not directly supported.
  • Removal or hiding of deprecated options — A number of options that are no longer functional, or will no longer be functional with the planned removal of FMD-based mapping mode, have been either removed or hidden from the help menu. These include most of the options related to FMD-based mapping mode, as well as some other features that are now deprecated. Specifically, the --useFSPD option would, in conjunction with other options and on certain datasets, result in non-deterministic crashes. Further, a new and improved positional bias model (currently experimental, but which can be enabled with the --posBias flag) is currently in testing and slated for the next release. We hope that removing some of the now-defunct options will reduce the number of reasonable choices the user has to make.
  • Modification of the variational Bayesian prior — The default variational Bayesian prior has been modified in form and value. Rather than a per-transcript prior, as of v0.7.0, Salmon uses a per-nucleotide prior instead (a per-transcript prior can be enabled by passing the --perTranscriptPrior option to Salmon). By default, this prior is 0.001 nucleotides per-base (so, e.g. a transcript with an effective length of 1,000 will have a prior count of 1). The value of the prior itself can be modified with the --vbPrior option. We note that this prior is, on average, substantially larger than the prior used in previous versions of Salmon. This larger prior will often result in Salmon reporting more expressed transcripts (though at a very low abundance), but this typically increases the robustness at low abundances.
  • Bug fix — Fixed a bug in variational Bayesian mode (--useVBOpt) that would sometimes result in nan in the output. The bug was the result of attempting to compute estimated read counts for very low abundance transcripts, where evaluating the digamma function would cause evaluation at the function's pole. Now a small minimum abundance is required and checked before evaluating the digamma function.
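The effective-length rule described above (transcript length minus the mean of the fragment length distribution restricted to lengths no greater than the transcript) can be written out as a short Python sketch. The function name, the toy fragment length distribution, and the handling of transcripts shorter than every possible fragment are assumptions made for illustration; this is not Salmon's code.

```python
def effective_length(transcript_length, frag_len_pmf):
    """Transcript length minus the mean fragment length, conditioned on
    fragments no longer than the transcript.

    frag_len_pmf[i] is the probability of observing a fragment of length i.
    """
    upper = min(transcript_length, len(frag_len_pmf) - 1)
    mass = sum(frag_len_pmf[: upper + 1])
    if mass <= 0:
        # No fragment fits: leave the length uncorrected (an assumption here).
        return float(transcript_length)
    cond_mean = sum(i * frag_len_pmf[i] for i in range(upper + 1)) / mass
    return transcript_length - cond_mean


# Toy distribution: fragments of length 100, 200, or 300 (made-up values).
pmf = [0.0] * 301
pmf[100], pmf[200], pmf[300] = 0.25, 0.5, 0.25

long_tx = effective_length(1000, pmf)   # uses the full distribution
short_tx = effective_length(150, pmf)   # only length-100 fragments fit
print(long_tx, short_tx)
```

With this toy distribution the unconditional mean is 200, so subtracting it from a 150-base transcript would give a negative length; conditioning on fragments that fit avoids that, which is one way to see why short transcripts receive less extreme corrections.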

[1] Bray, Nicolas L., et al. "Near-optimal probabilistic RNA-seq quantification." Nature biotechnology 34.5 (2016): 525-527.

[2] Roberts, Adam, et al. "Improving RNA-Seq expression estimates by correcting for fragment bias." Genome biology 12.3 (2011): 1.

[3] Love, Michael I., John B. Hogenesch, and Rafael A. Irizarry. "Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation." bioRxiv (2015): 025767.

Salmon v0.6.0

01 Jan 20:16

This is a fairly major new release of Salmon (thus the major version bump). It includes some new features and makes minor but backward-incompatible changes to the output format. Many of these changes track the latest changes to Sailfish.

Note for OSX binary:

If you receive a message that a library cannot be found (i.e. if you run into an @rpath issue), try running Salmon using the following command:

$ DYLD_FALLBACK_LIBRARY_PATH=<PATH_TO_SALMON>/lib <PATH_TO_SALMON>/bin/salmon

If this works, you can add the library path to the DYLD_FALLBACK_LIBRARY_PATH variable automatically by placing the line:

export DYLD_FALLBACK_LIBRARY_PATH=<PATH_TO_SALMON>/lib:$DYLD_FALLBACK_LIBRARY_PATH

in your ~/.profile file.

Major Changes

  • Default index --- The quasi index has been made the default type. This means that it is no longer necessary to provide the --type option to the index command. The fmd index remains enabled, but may be removed in a future version. We urge you to move over to the quasi index if you are not already using it.
  • Sequence-specific bias correction --- The old bias correction methodology has been removed from Salmon and replaced with a new sequence-specific bias correction model. Bias correction is enabled with the --biasCorrect flag. The new model has numerous benefits over the old. First, it should more accurately correct for sequence specific biases, leading to better estimates in biased samples. Second, it should not suffer from the same pathological "over-correction" failure cases of the old model --- if there is no substantial bias in the sample, it should have only a minimal effect on quantification results.
  • New output format --- The new output format adds another column, EffectiveLength, which records the effective length of each transcript. EffectiveLength is the third column, so the TPM and NumReads columns have each shifted one position to the right. Also, the quant.sf output file has been simplified and now contains no comment lines. The first row in the file is an (un-commented) header that lists the column names, and the subsequent rows are the quantification estimates.
  • Information about the command used --- Since the comment lines have been removed from the quant.sf file, the information they held (and more), which can sometimes be useful, is now written to other locations. A JSON-formatted file called cmd_info.json in the top-level output directory contains the relevant command-line parameters (which used to appear in the quant.sf comments).
  • Meta-information about the run --- Quite a bit of useful information appears in the file aux/meta_info.json under the main quantification directory. This file records information such as the number of reads processed, the number mapped, the percentage mapped, and which type of posterior sampling (e.g. Gibbs / bootstrap), if any, was performed.
  • Auxiliary parameters from the run --- In addition to the meta_info.json file, the aux/ directory of the main quantification directory contains other useful files. Specifically, it contains gzipped binary data for any bootstrap or Gibbs samples that were generated, and gzipped binary data about the fragment length distribution and bias parameters (the latter is only meaningful if bias correction was performed).
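
Because quant.sf now starts with a plain header row, it can be read with any tab-separated parser. The sketch below is illustrative only: the `Name` and `Length` column names are assumptions (the notes name only EffectiveLength, TPM and NumReads), the sample rows are made up, and the helper names `read_quant` and `load_run_info` are hypothetical. No particular keys are assumed inside the two JSON files.

```python
import csv
import io
import json
from pathlib import Path


def read_quant(handle):
    """Parse a headered, tab-separated quant.sf stream into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        rows.append({
            "Name": row["Name"],            # assumed column name
            "Length": int(row["Length"]),   # assumed column name
            "EffectiveLength": float(row["EffectiveLength"]),
            "TPM": float(row["TPM"]),
            "NumReads": float(row["NumReads"]),
        })
    return rows


def load_run_info(quant_dir):
    """Load cmd_info.json and aux/meta_info.json from a quantification directory."""
    quant_dir = Path(quant_dir)
    with open(quant_dir / "cmd_info.json") as fh:
        cmd_info = json.load(fh)
    with open(quant_dir / "aux" / "meta_info.json") as fh:
        meta_info = json.load(fh)
    return cmd_info, meta_info


# Made-up rows, used only to show the column layout.
SAMPLE = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1500\t1320.5\t12.5\t100\n"
    "tx2\t800\t620.0\t0.0\t0\n"
)

rows = read_quant(io.StringIO(SAMPLE))
expressed = [r["Name"] for r in rows if r["TPM"] > 0]
print(expressed)
```

For a real run, `read_quant(open("<quant_dir>/quant.sf"))` and `load_run_info("<quant_dir>")` would be the corresponding calls, with `<quant_dir>` standing in for the output directory.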

Minor Changes

  • Position-specific start distribution --- Modeling of the position-specific start distribution has been improved, and the way that it is enabled / disabled has been changed. This model is off by default, but can be enabled with the --useFSPD flag.

Bug Fixes

  • This release fixes a bug where the mapping location of a fragment may have been miscalculated by a small number of bases in certain cases. This in turn could lead to a small shift in the fragment length distribution and in the resulting quantification estimates.

Acknowledgements

  • Special thanks go to Ayush Sengupta for helping out with the implementation of sequence-specific bias correction.
  • Special thanks go to Mike Love for testing the effectiveness of the sequence-specific bias correction implementation (in Sailfish, but this uses the same model) on some experimental (GEUVADIS) data!

Note

As you may note, there are two DebianSqueeze binaries listed below. The binary called SalmonBeta-0.6.0_DebianSqueeze.tar.gz is the "standard" binary, which is built to use the JEMalloc memory allocator. In certain situations (involving files on NFS) this allocator has been observed to segfault upon program termination. This doesn't seem to affect the results, which have already been written by the time the segfault occurs. However, if you encounter this problem, you can try SalmonBeta-0.6.0_DebianSqueeze_tcmalloc.tar.gz, which is built to use the TCMalloc memory allocator instead and doesn't seem to suffer from the same issue.

Salmon v0.5.1

13 Nov 08:31

This release includes some important improvements and fixes for the following bugs:

  • A very rare bug in which boost::hash_combine() would exhibit pathologically bad behavior. This caused the concurrent hash map to continue size-doubling, which could consume massive amounts of memory. [Thanks to Nick Schurch for finding this bug and for sharing data that reproduces it].
  • In the stats.tsv file, the effective length was actually the log of the effective length --- this has been fixed.
  • On small datasets with high multi-mapping, conflicts in updating the stored "auxiliary" weights within each equivalence class could negatively affect inference. Though such conflicts are rare, the updates to these weights have been made atomic so that they are always consistent. [Thanks to Tom Smith for providing data that triggers this bug].
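
The last fix above concerns conflicting concurrent updates to shared weights. The sketch below illustrates the general idea only; it is not Salmon's implementation. It swaps in a Python lock as a stand-in for atomic updates, and the class and function names are hypothetical.

```python
import threading


class EquivalenceClassWeights:
    """Auxiliary weights for one equivalence class, guarded by a lock."""

    def __init__(self, n_transcripts):
        self._weights = [0.0] * n_transcripts
        self._lock = threading.Lock()

    def add(self, index, amount):
        # The read-modify-write happens under the lock, so concurrent
        # updates to the same slot cannot overwrite one another.
        with self._lock:
            self._weights[index] += amount

    def snapshot(self):
        with self._lock:
            return list(self._weights)


def worker(weights, n_updates):
    # Each worker repeatedly contributes to both transcripts' weights.
    for _ in range(n_updates):
        weights.add(0, 1.0)
        weights.add(1, 0.5)


weights = EquivalenceClassWeights(2)
threads = [threading.Thread(target=worker, args=(weights, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(weights.snapshot())
```

With every update serialized, the final weights are the same regardless of how the threads interleave, which is the consistency property the fix is after.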

This release also includes improvements in quasi-mapping which reduce mapping ambiguity and improve results for very similar expressed transcripts that reside on opposite strands.