Salmon v0.9.0

@rob-p rob-p released this 26 Nov 02:48

Salmon 0.9.0 Release Notes

As always, the newest release is easily installable via bioconda and Docker.

New features

  • During indexing, Salmon will now discard duplicate transcripts (i.e., transcripts with exactly the same sequence) by default. The information about the duplicate transcripts is written to a file in the index directory called duplicate_clusters.tsv. This is a two-column TSV file where the first column lists the name of a retained transcript and the second column lists the name of a discarded duplicate transcript (i.e., a transcript with identical sequence to the retained transcript, but which was discarded). Note: If you wish to retain multiple identical transcripts in the input (the prior behavior), this can be achieved by passing the Salmon indexing command the --keepDuplicates flag.

  • This is not a new feature per se, but it brings further parity between the alignment-based and mapping-based modes. It is now possible to dump the equivalence class files (--dumpEq) when using Salmon in alignment-based mode.

  • Range-factorization has been merged into the master branch. This allows using the data-driven likelihood factorization, which can improve quantification accuracy on certain classes of "difficult" transcripts. Currently, this feature yields the largest improvements when using alignment-based mode with error modeling enabled (--useErrorModel), though it can yield improvements in the mapping-based mode as well. This feature will also interact constructively with selective-alignment, which should land in the next (non-bug fix) release.

  • Added the quantmerge command. This produces a multi-sample TSV file with abundance metrics aggregated over samples from many different quantification runs. This can ease, for example, uploading quantified data to online analysis tools such as Degust.
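
The two-column layout of duplicate_clusters.tsv described in the first item above can be read with a few lines of Python. This is a minimal sketch, not part of Salmon: the function name read_duplicate_clusters is hypothetical, and the presence of a header row (and the header names used in the sample data) is an assumption, so pass has_header=False if your file has none.

```python
import csv
import io
from collections import defaultdict


def read_duplicate_clusters(handle, has_header=True):
    """Map each retained transcript name to its discarded duplicates.

    `handle` is any open text file or file-like object holding a
    two-column, tab-separated table: retained name, duplicate name.
    """
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row (assumed to be present)
    clusters = defaultdict(list)
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        retained, duplicate = row[0], row[1]
        clusters[retained].append(duplicate)
    return dict(clusters)


# Made-up sample content; the header names are an assumption.
example = "RetainedTxp\tDuplicateTxp\ntxA\ttxB\ntxA\ttxC\ntxD\ttxE\n"
clusters = read_duplicate_clusters(io.StringIO(example))
```

For a real index, the same function could be called on open("<index_dir>/duplicate_clusters.tsv"), for instance to attribute a retained transcript's abundance to the names that were collapsed into it.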

Other improvements, features and changes

  • The multi-threaded read parser used by Salmon has been updated to considerably improve CPU utilization. Specifically, the previous queue management strategy (busy waiting) has been replaced by an intelligent, bounded, exponential-backoff strategy. Many improvements (and much of the code) come from a series of blog posts by David Geier. In practice, this means that performance will be the same as the prior implementation if your disks can feed reads to Salmon quickly enough, but if they can't, considerably less CPU time will be wasted waiting on input (i.e., processing speed will be better matched to I/O throughput).

  • In addition to the improved parser behavior, some of the noisy logger messages in the parser have been eliminated. In "pathological" situations with very fast disks and slow CPUs (or vice-versa), the previous parser may have generated an inordinate amount of output, creating large log files and otherwise slowing down processing. This should no longer happen.

  • Salmon will now terminate early (with a non-zero exit code) and report a meaningful error message if a corrupt input file is detected. Previously, corrupted compressed input files could have caused the parser to hang indefinitely. This behavior was fixed upstream in kseq, and the current parser wraps this detection with a descriptive exception message.

  • Renamed the --allowOrphans flag to --allowOrphansFMD, and added a --discardOrphansQuasi flag. This is a bit messy currently (the default in FMD mapping is to discard orphans, and in quasi-mapping it is to keep them). These flags do the obvious things and are documented further in the command line help. We are considering how best to clean up and simplify these flags in future releases.

  • Many other small improvements and bug fixes.
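
To illustrate the general idea behind the parser change in the first item above, here is a minimal Python sketch of bounded exponential backoff for a consumer polling a queue. It is not Salmon's implementation (which is C++); the function name pop_with_backoff and all parameter values are hypothetical, chosen only to show the technique. Instead of spinning on an empty queue, the consumer sleeps between attempts, doubling the delay each time up to a fixed cap.

```python
import queue
import time


def pop_with_backoff(q, initial_delay=0.0001, max_delay=0.01,
                     max_attempts=20, sleep=time.sleep):
    """Try to pop one item, backing off while the queue is empty.

    Returns (item, delays), where `delays` lists the sleep durations
    used. `item` is None if every attempt found the queue empty.
    """
    delay = initial_delay
    delays = []
    for _ in range(max_attempts):
        try:
            return q.get_nowait(), delays
        except queue.Empty:
            sleep(delay)                       # yield the CPU instead of spinning
            delays.append(delay)
            delay = min(delay * 2, max_delay)  # double the wait, but keep it bounded
    return None, delays


# Usage: an item is already available, so no waiting happens.
q = queue.Queue()
q.put("read-chunk-1")
item, delays = pop_with_backoff(q)
```

When the producer keeps up, the first attempt succeeds and the cost matches busy waiting; when it does not, the consumer spends most of its time asleep, and the cap keeps the added latency bounded.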