
salmon v0.12.0

@rob-p rob-p released this 06 Dec 05:25
· 816 commits to master since this release

Release Notes

Major Release (including major updates to Alevin and improvements to mapping validation)

We are very excited to release a major upgrade to the single-cell framework of the Salmon tool --- Alevin.

Alevin is a droplet based single-cell RNA-seq data quantification tool which currently supports the following protocols:

  1. Drop-seq
  2. 10x-Chromium v2 (v1 via wrapper)
  3. 10x-Chromium v3
  4. CEL-Seq2

With the latest release, the UMI deduplication step has been completely reworked and is now driven by a new, efficient and robust algorithm. Instead of discarding gene-ambiguous reads, the new algorithm uses the UMI networks generated from transcript-level equivalence classes to better deduplicate the UMIs, while still correcting for UMI collisions. In our latest preprint, we also show that including the gene-ambiguous reads in the analysis significantly improves the accuracy of the quantified gene count matrix. Moreover, Alevin introduces a new categorization of the genes into informative tiers, allowing a concise assessment of the quality of the evidence behind each UMI count in each cell. Along with many other minor bug fixes, the latest release adds two more ways of selecting an initial whitelist, so that the Alevin pipeline can be started more robustly.

New Flags and Features for Alevin:

  • Along with the existing customizable CB and UMI length command line flags, Alevin now supports two more single-cell protocols without explicit configuration: --chromiumV3 for the v3 chemistry of 10x data, which works the same as the v2 chemistry except that the UMI length has been increased from 10 to 12; and --celseq2 for CEL-Seq2 data, where both the CB and UMI lengths are configured to 6 by default.

  • With the latest release, Alevin uses --validateMappings and --minScoreFraction with a value of 0.8 as its default (although tweakable) mapping-based options. This significantly improves the mapping rate of the algorithm while providing a good tradeoff between sensitivity and specificity.

  • By default, Alevin now dumps the gene-tiers categorization matrix with the name quants_tier_mat.gz, where the row and column order stays the same as quants_mat.gz.

  • The --forceCells X command line flag forces the Alevin pipeline to use the top X Cellular Barcodes in the initial whitelisting part of the pipeline, skipping the knee method.

  • --expectCells X command line flag uses the 10x approach of selecting the whitelist barcodes, putting an upper bound on the total number of expected cells -- skipping the knee method. In brief, it only allows CBs with frequency more than 1/10th of the top 1% of the CBs as the initial whitelist.

  • A new command line flag, --numCellBootstraps X, has been added to perform multiple rounds of optimization by bootstrapping the number of mapped reads in the equivalence classes. Alevin dumps the mean and the variance of each entry in the Cell-Gene count matrix into two files, quants_mean_mat.gz and quants_var_mat.gz. Note: The syntax for parsing the generated binary files stays the same as for quants_mat.gz, but the order of the rows in the mean/variance matrices is stored in a separate file named quants_boot_rows.txt. The column order stays the same as in quants_mat.gz.

  • Alevin performs intelligent whitelisting downstream of the quantification pipeline and has to make some assumptions, such as choosing a fraction of reads from which to learn the low-confidence CBs. As a result, it might erroneously exit if the data yields no mapped or deduplicated reads for a CB among the low-confidence CBs. The problem does not occur when an external whitelist is provided. If the error does occur and the user is confident that it can be treated as a warning, it can be skipped by running Alevin with the --debug flag.

  • raw_cb_frequency.txt now includes the frequency of all the observed Cellular Barcodes instead of only the whitelisted ones.

  • Alevin no longer supports the --naive command line flag.

  • By default, the command line flag --debug is now set to true. NOTE: the pipeline will not exit when observed Cellular Barcodes from the high-confidence region have relatively few mapped reads; instead, it will continue with a warning. It is the user's responsibility to take note of the warnings generated by the pipeline.
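
The two whitelisting options above (--forceCells and --expectCells) can be illustrated with a small sketch. This is not Alevin's implementation: the function names are hypothetical, the input is assumed to be a plain mapping from barcode to observed frequency, and the reading of "1/10th of the top 1% of the CBs" (the frequency at the edge of the top 1% of the X expected barcodes, divided by 10) is one plausible interpretation of the description.

```python
from collections import Counter


def force_cells(cb_counts, x):
    """Hypothetical analogue of --forceCells X: keep the X most frequent barcodes."""
    return [cb for cb, _ in Counter(cb_counts).most_common(x)]


def expect_cells(cb_counts, x):
    """Hypothetical analogue of --expectCells X.

    Assumption: only the X most frequent barcodes are candidates (the upper
    bound), and a candidate is kept if its frequency exceeds one tenth of the
    frequency found at the edge of the top 1% of those candidates.
    """
    candidates = Counter(cb_counts).most_common(x)
    if not candidates:
        return []
    # Index of the barcode at the boundary of the top 1% (at least the first one).
    idx = min(max(int(0.01 * x) - 1, 0), len(candidates) - 1)
    threshold = candidates[idx][1] / 10
    return [cb for cb, freq in candidates if freq > threshold]


# Toy barcode frequencies (made-up values for illustration only).
counts = {"AAAC": 1000, "AAAG": 900, "AAAT": 800, "AACA": 50, "AACC": 5}
top_two = force_cells(counts, 2)
expected = expect_cells(counts, 4)
```

In both cases the knee method is bypassed: the first function relies only on rank, while the second derives a frequency cutoff from the most abundant barcodes.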

New Flags and Features for Salmon:

  • Note: Mapping validation (--validateMappings) is a recommended flag, and may become a default in future releases. Considerable improvements to mapping validation have been implemented. For paired-end reads, mapping validation will now consider multiple equally-best chains when scoring chains of MMPs. When reads map to multiple positions on a single transcript, this allows for improved mapping, since multiple positions for the individual reads will be propagated to the algorithm that selects the best mapping for the pair (which can take into account the expected pairing constraints).

  • Added a new flag in mapping validation mode, --maxMMPExtension (default value of 7). This flag limits the length by which an MMP (a match between the suffix array and the read) can be extended. Smaller values for this parameter can potentially increase the mapping sensitivity at the cost of requiring more suffix array lookups. The default value should generally work well, and increases the sensitivity with respect to unconstrained mapping validation with little impact on runtime. This heuristic is meant to approximate some of the ideas from selective alignment. Note that this flag can be used in conjunction with --consensusSlack to increase the sensitivity of mapping in mapping validation mode (which is safe from the perspective of specificity, as these mappings will be scored anyway). For example, setting --maxMMPExtension 5 --consensusSlack 7 would shorten maximum extensions even more, and consider many more potential loci when chaining, which could lead to more sensitivity. However, the default values have been tuned to provide fairly high sensitivity for minimal extra computational expense.

  • Added a new flag in mapping validation mode --mimicStrictBT2. This flag attempts to mimic the very strict mapping parameters with which Bowtie2 is invoked when it is used with RSEM. Specifically, it disallows orphans, indels in alignments, and dovetailing reads. It also sets the minimum score fraction (--minScoreFraction) to 0.8. We do not generally recommend using this flag, as these parameters tend to be overly strict and can eliminate many valid mappings / alignments. However, if one wishes to attempt to mimic that behavior with mapping validation, this "meta-flag" sets the relevant corresponding parameters. We are still optimizing details of the parameters set by this meta-flag, but it already obtains the desired effect across a large variety of datasets.

  • Salmon now supports a flag --discardOrphans in alignment-based mode. If run with this flag in alignment mode, salmon will simply ignore orphaned alignments for the purposes of quantification (matching the behavior of other tools like RSEM and eXpress).

  • Salmon now only waits 1 second for version information before timing out.
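
To make the role of --minScoreFraction in mapping validation concrete, here is a minimal sketch of the kind of filter it implies: a mapping is kept only if its alignment score reaches the given fraction of the best score the read could possibly obtain. The function name, the single-read setting, and the match score of 2 are assumptions for illustration; they are not taken from Salmon's source.

```python
def passes_min_score_fraction(score, read_len, match_score=2, min_score_fraction=0.8):
    """Return True if an alignment score clears the minimum-score threshold.

    Assumption: the best possible score for a read is read_len * match_score,
    i.e. every base matches and there are no gaps.
    """
    if not 0.0 < min_score_fraction <= 1.0:
        raise ValueError("min_score_fraction must lie in (0, 1]")
    best_possible = read_len * match_score
    return score >= min_score_fraction * best_possible


# A 100 bp read has a best possible score of 200 under these assumptions,
# so with a fraction of 0.8 the cutoff is 160.
kept = passes_min_score_fraction(score=170, read_len=100)
dropped = passes_min_score_fraction(score=150, read_len=100)
```

Raising the fraction makes validation stricter (fewer, higher-quality mappings), which is why the strict meta-flag described above pins it at 0.8.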

Bug Fixes:

  • Fixed two small bugs (caught by @fataltes) in the MMP chaining algorithm. This can improve the predicted mapping locations in some difficult corner cases.

  • Fixed a bug that made it difficult to set the alignment scoring parameters (match score, mismatch cost, gap open cost and gap extension cost) in mapping validation mode. Previously, these were parsed as int8_t variables; now they are parsed as int16_t and then explicitly checked against the required bounds.

  • Fixed the bioconda-based OSX build: In v0.11.3, the binary created via bioconda for OSX would segfault. This was the result of a bug that only seemed to affect the C++ compiler in older versions of OSX (like that used in bioconda). This has been addressed in the current release, and you should be able to obtain a working bioconda build of 0.12.0 for OSX.
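
The scoring-parameter fix above follows a common pattern: read the value into a wider integer type, then check explicitly that it fits the narrower range that is actually required. The sketch below shows that pattern in Python. The function name is hypothetical, and using the int8_t range (-128 to 127) as the "required bounds" is an assumption; the release notes do not state the exact limits.

```python
# Assumed bounds: the range of a signed 8-bit integer.
INT8_MIN, INT8_MAX = -128, 127


def parse_scoring_param(name, text):
    """Parse a scoring parameter as a full integer, then bounds-check it."""
    value = int(text)  # wide parse: never truncates or wraps
    if not INT8_MIN <= value <= INT8_MAX:
        raise ValueError(
            f"{name}={value} is outside the allowed range [{INT8_MIN}, {INT8_MAX}]"
        )
    return value


match_score = parse_scoring_param("match score", "2")

try:
    parse_scoring_param("gap open cost", "300")
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
```

Parsing wide and validating afterwards turns an out-of-range value into an explicit error instead of a silently misread parameter.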