Salmon v0.8.0

@rob-p rob-p released this 23 Jan 06:40

Bug Fixes

  • Fixed a bug in .gtf-based gene aggregation output that could cause a transcript to be attributed to the wrong gene if the transcript was not present in the gtf file.
  • Fixed a bug that required a qualified path to be provided when writing the quasi-mapping file (i.e., .sam).
  • Fixed a bug that could cause the SAM header to fail to be written when writing quasi-mappings to stdout.
  • Fixed behavior of --numPreAuxModelSamples so that it is consistent between quasi-mapping and alignment-based mode (and has an effect in both).
  • Fixed a "short style" option collision.
  • Fixed a bug that would cause bias correction not to be run if only the --posBias flag was passed.

Minor changes & improvements

  • Bumped to the latest version of spdlog.
  • Bumped included version of libcuckoo.
  • Bumped included version of sparsepp (via RapMap).
  • Bumped included version of RapMap.
  • meta_info.json now contains more information about the length classes used for positional bias correction when it is enabled (these length classes are now data-driven).
  • meta_info.json now records if equivalence classes were dumped, and if so, what properties were dumped as well (e.g. rich weights).
  • meta_info.json now includes the end as well as beginning time of each run.
  • Improvements to fragment-GC bias modeling for fragments that fall very close to the beginning or end of transcripts.
  • Added .gff and .gff3 (and capitalized variants of all) as recognized file formats for gene aggregation mode.
  • Changed the default prior mean and standard deviation of the fragment length distribution to better match more recent protocols and libraries.
  • Made slight improvements to the computation of the conditional fragment probabilities (i.e., P(f | t) in the model). Now the probability of a fragment length is conditioned on the transcript length, and the probability of a start position takes that length into account.
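The factorization described in the last item above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Salmon's implementation: the function name `conditional_fragment_prob`, the dictionary-based length distribution, and the uniform start-position model are all hypothetical choices made for this example.

```python
def conditional_fragment_prob(frag_len, transcript_len, length_pmf):
    """Illustrative P(f | t) = P(len | L_t) * P(start | len, L_t).

    length_pmf maps fragment length -> probability (an assumed, global
    fragment length distribution). Start positions are assumed uniform
    over the positions where a fragment of this length fits.
    """
    if frag_len < 1 or frag_len > transcript_len:
        return 0.0
    # Condition the length distribution on the transcript length:
    # only lengths that fit on the transcript keep probability mass.
    mass = sum(p for length, p in length_pmf.items()
               if 1 <= length <= transcript_len)
    if mass == 0.0:
        return 0.0
    p_len = length_pmf.get(frag_len, 0.0) / mass
    # The number of valid start positions depends on the fragment length.
    n_starts = transcript_len - frag_len + 1
    return p_len / n_starts


pmf = {100: 0.25, 200: 0.5, 300: 0.25}
print(conditional_fragment_prob(200, 250, pmf))
```

With this toy distribution, a 300-base fragment gets probability zero on a 250-base transcript, and the remaining mass is renormalized over the lengths that fit.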

New features

  • Important indexing improvements inherited from RapMap; see the RapMap section below.

  • Substantial overhaul and improvements to the posterior Gibbs sampler. The methodology now generally follows that of mmseq [1]. Specifically, the new (uncollapsed) sampler improves estimates of sampling variance (and uses the same methodology as before to account for inferential uncertainty).

  • Added --thinningFactor flag that lets the user specify how many Gibbs samples should be skipped between saved samples. Increasing this causes the Gibbs chain to run longer to generate a given number of target samples (but potentially reduces the autocorrelation between samples). The default is 16.

  • Added --meta flag, which automatically selects internal options optimized for metagenomic & microbiomic quantification.

  • Added --dumpEqWeights option that includes the rich equivalence class weights in the output file when equivalence classes are written to file.

  • Added experimental --noLengthCorrection option. This is intended to be used when quantifying based on protocols (e.g., Lexogen QuantSeq) where the number of sequenced fragments / tags deriving from a target is assumed to be independent of that target's length. (This feature is still experimental and requires more testing, so please provide feedback if you use it.)

  • Added new --quasiCoverage option. This is analogous to the --coverage option, but the latter applies only to mapping under the FMD-based index (which is no longer recommended). This option enforces that a certain fraction of the read is covered by exact matches (specifically, maximum mappable prefixes) in order to consider a mapping as valid. The value is expressed as a number between 0 and 1; a larger value is more stringent, and less likely to allow spurious mappings, but can reduce sensitivity.
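The coverage criterion behind --quasiCoverage can be illustrated with a small Python sketch. It assumes that exact matches are given as half-open `[start, end)` intervals on the read and that coverage is the fraction of read bases in their union; the function names and the interval representation are hypothetical and are not taken from the Salmon or RapMap source.

```python
def covered_fraction(read_len, matches):
    """Fraction of read bases covered by the union of exact-match
    intervals, each given as a half-open (start, end) pair."""
    if read_len <= 0:
        return 0.0
    covered = 0
    last_end = 0
    for start, end in sorted(matches):
        # Clip to the read and skip bases already counted.
        start = max(start, last_end, 0)
        end = min(end, read_len)
        if end > start:
            covered += end - start
            last_end = end
    return covered / read_len


def passes_quasi_coverage(read_len, matches, min_coverage):
    """Accept a mapping only if enough of the read is exactly matched."""
    return covered_fraction(read_len, matches) >= min_coverage


# Three overlapping or disjoint matches on a 100-base read.
matches = [(0, 31), (20, 60), (80, 100)]
print(covered_fraction(100, matches))
print(passes_quasi_coverage(100, matches, 0.9))
```

Raising the threshold rejects mappings supported by only a few short matches, which is the stringency versus sensitivity trade-off described above.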

    New features due to changes and improvements in RapMap

    • New hash map for default index - The default quasiindex command now uses the sparsepp sparse hash map. While providing lookup performance very similar to the prior hash map implementation, sparsepp offers a number of benefits. Specifically, it uses substantially less memory (typically ~50% less) and, crucially, its memory usage grows gradually with the number of keys. A major problem with the previous implementation (Google's dense hash map) was that, on resize, the map would double in size and memory usage would jump by a factor of 3 (a new map twice the size of the old one, plus the original map from which to copy the keys). This means that even if you had enough memory to hold the final map, you might not have been able to build it. Sparsepp, on the other hand, exhibits memory usage that scales almost linearly with the number of items in the map. For more details on the performance characteristics of the new default hash map used in the index, please see the sparsepp benchmarks.
    • New frugal perfect hash index - The vastly improved memory usage of the new default quasi index essentially obviates the previous perfect-hash-based index. Specifically, since that perfect hash also stored the keys (to validate queries from outside the universe on which the hash was built), the size of the resulting index was similar; it simply required less memory to build. However, sparsepp achieves memory usage very similar to that of the previous perfect-hash-based index. Instead of removing the perfect-hash-based index entirely, the -p/--perfectHash flag now tells the quasiindex command to build a frugal perfect-hash-based index. This index uses a number of aggressive space-saving techniques, which result in a much smaller memory footprint (but it is also slower to construct and has slower lookups than the default index). For large references, the new frugal perfect-hash-based index exhibits a memory reduction of 40-50% over the new, reduced-memory default index (hence, it shows close to the same savings over the old perfect-hash-based index as well). Also, for large references, the size of the index on disk is ~40% smaller. The cost of this substantial size reduction is that the frugal perfect-hash-based index takes 2-2.5 times longer to build, and lookups are slower. The slower lookups can, conceivably, reduce quasi-mapping speed a bit, but the speed hit (if there is one) is dataset dependent. This new indexing scheme should allow the construction of quasi indices on substantially larger references for a fixed RAM budget, and it also reduces the memory required to hold the index during mapping. Note: This type of index is specifically recommended if you need to build an index on a large set of targets (e.g., for metagenomic or microbiomic use).
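The resize behavior described for the previous dense hash map can be made concrete with a little arithmetic. The Python sketch below is a toy model, not the dense hash map or sparsepp code: the function name, the initial capacity of 16 slots, and the 0.5 maximum load factor are illustrative assumptions. It counts slots rather than bytes and tracks the peak reached while the old and new tables coexist during a doubling resize.

```python
def peak_slots_doubling(num_keys, initial_capacity=16, max_load=0.5):
    """Toy model of a table that doubles when it gets too full.

    Returns (final_capacity, peak_slots), where peak_slots counts the
    old table plus the new, twice-as-large table that exist together
    while keys are copied during a resize.
    """
    capacity = initial_capacity
    peak = capacity
    while num_keys > capacity * max_load:
        # During a resize: old table + new table of twice the size.
        peak = max(peak, capacity + 2 * capacity)
        capacity *= 2
    return capacity, peak


final_capacity, peak = peak_slots_doubling(1000)
print(final_capacity, peak)  # 2048 3072
```

In this model the finished table needs 2048 slots, but building it briefly requires 3072, three times the size of the table being replaced. This is the gap between "enough memory to hold the final map" and "enough memory to build it" that the gradual growth of sparsepp avoids.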

References

[1] Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads. Turro E, Su S-Y, Goncalves A, Coin L, Richardson S and Lewin A. Genome Biology, 2011 Feb; 12:R13. doi: 10.1186/gb-2011-12-2-r13.