Releases: COMBINE-lab/salmon

salmon v0.13.0

05 Mar 04:12
23417df

Salmon 0.13.0 release notes

Change to default behavior

Starting from this version of salmon, dovetailed mappings (see the Bowtie2 manual for a description) are not accepted by default by the built-in mapping (with or without --validateMappings). Moreover, v0.13.0 has no flag to allow dovetailed mappings. The --allowDovetail option was added in v0.13.1 to enable this behavior, if desired.

Exotic library types (e.g. MU, MSF, MSR) are no longer supported. If you need support for such a library type, please submit a feature request describing the use-case.

Improvements and new flags

Again, there have been significant improvements to mapping validation. Through broad benchmarking across many samples, we have worked to considerably improve the algorithm and its sensitivity. We note that mapping validation will likely be turned on by default in future releases, and we strongly encourage all users to make use of this feature and report their experiences with it.

Along with the default mapping validation (enabled via --validateMappings), there are two "meta" flags that enable mapping validation parameters meant to mimic configurations in which users might be interested.

  • --mimicBT2 : This flag is a "meta-flag" that sets the parameters related to mapping and mapping validation to mimic alignment using Bowtie2 (with the flags --no-discordant and --no-mixed), but using the default scoring scheme and allowing both mismatches and indels in alignments.

  • --mimicStrictBT2 : This flag is a "meta-flag" that sets the parameters related to mapping and mapping validation to mimic alignment using Bowtie2 (with the flags suggested by RSEM), but using the default scoring scheme and allowing both mismatches and indels in alignments. These settings essentially disallow indels in the resulting alignments.

In addition to these "meta-flags", a few other flags have been introduced that can alter the behavior of mapping:

  • --recoverOrphans : This flag (which should only be used in conjunction with mapping validation) performs orphan "rescue" for reads. That is, if mappings are discovered for only one end of a fragment, or if the mappings for the ends of the fragment don't fall on the same transcript, then this flag will cause salmon to look upstream or downstream of the discovered mapping (anchor) for a match for the opposite end of the given fragment. This is done by performing "infix" alignment within the maximum fragment length upstream or downstream of the anchor mapping using edlib.

  • --hardFilter : This flag (which should only be used with mapping validation) turns off soft filtering and range-factorized equivalence classes, and removes all but the equally highest scoring mappings from the equivalence class label for each fragment. While we recommend using soft filtering (the default) for quantification, this flag can produce easier-to-understand equivalence classes if that is the primary object of study.

  • --skipQuant : Related to the above, this flag will stop execution before the actual quantification algorithm is run.

  • --bandwidth : This flag (which is only meaningful in conjunction with mapping validation) sets the bandwidth parameter of the relevant calls to ksw2's alignment function. This determines how wide an area around the diagonal of the DP matrix is calculated.

  • --maxMMPExtension : This flag (which should only be used with mapping validation) limits the length that a mappable prefix of a fragment may be extended before another search along the fragment is started. Smaller values for this flag can improve the sensitivity of mapping, but could increase run time.
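
To illustrate what a bandwidth parameter does in a dynamic-programming alignment, here is a minimal Python sketch. It is not salmon's or ksw2's code: ksw2 computes affine-gap alignment scores, whereas this sketch uses plain edit distance to stay short, and the function name banded_edit_distance is hypothetical. Only cells within `bandwidth` of the main diagonal are filled in.

```python
def banded_edit_distance(a, b, bandwidth):
    """Edit distance restricted to cells with |i - j| <= bandwidth.

    Returns (distance, cells_computed). The distance is None when the final
    cell lies outside the band. Illustrative only; not ksw2's algorithm.
    """
    n, m = len(a), len(b)
    if abs(n - m) > bandwidth:
        return None, 0
    inf = float("inf")
    # Row 0: turning the empty prefix of `a` into b[:j] costs j insertions.
    prev = {j: j for j in range(0, min(m, bandwidth) + 1)}
    cells = len(prev)
    for i in range(1, n + 1):
        cur = {}
        lo, hi = max(0, i - bandwidth), min(m, i + bandwidth)
        for j in range(lo, hi + 1):
            best = i if j == 0 else inf
            if j > 0 and (j - 1) in prev:      # match or substitution
                best = min(best, prev[j - 1] + (a[i - 1] != b[j - 1]))
            if j > 0 and (j - 1) in cur:       # insertion
                best = min(best, cur[j - 1] + 1)
            if j in prev:                      # deletion
                best = min(best, prev[j] + 1)
            cur[j] = best
        cells += len(cur)
        prev = cur
    return prev[m], cells


narrow = banded_edit_distance("kitten", "sitting", 2)
wide = banded_edit_distance("kitten", "sitting", 10)
print(narrow, wide)
```

A narrow band fills fewer cells, which is where the speed-up comes from. The trade-off is that an alignment whose optimal path strays further from the diagonal than the band allows will be scored sub-optimally or missed.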

The default setting for --numPreAuxModelSamples has been lowered from 1,000,000 to 5,000. This simply means that the basic models (and crucially the read alignment error model) will start being applied much earlier in the online algorithm. This has very little effect on samples with a decent number of fragments, but can considerably improve estimates (especially in alignment-based mode) for samples with only a small number of fragments.

The definition of --consensusSlack has changed. Instead of being an absolute number, it is now a fractional value (between 0 and 1) that describes the number of "hits" (i.e. suffix array intervals) that a mapping may miss and still be considered valid for chaining.

Improvements and changes to alevin

  • With this release, alevin will dump summary statistics of a single cell experiment into the file alevin_meta_info.json inside the aux folder of the output directory.

  • The EquivalenceClassBuilder object will now have a single cell SCRGValue templatization, which will marginally reduce the memory used by the object.

  • Salmon's --initUniform flag has been linked with alevin. If enabled on the command line (default false), it initializes the EM step with a uniform prior instead of with the unique equivalence class evidence.

  • Alevin can directly consume the bfh file format generated using --dumpBfh. It provides an independent entry point into alevin's UMI deduplication step, in place of the raw FASTQ files.

  • A bug in the UMI deduplication step has been fixed. Previously the vertices in the maximum connected components of an arborescence were not being removed.

  • The custom mode of the single cell protocol for alevin does not need an explicit protocol-specific command line flag. However, the full triplet of command line options --umiLength, --barcodeLength and --end has to be specified to enable the custom mode.

  • The maximum allowable length of a barcode and/or the UMI has been set to 20 for the custom mode of a single cell experiment.

  • A new command line option --keepCBFraction has been added, which expects a value in the range (0, 1]. This parameter forces alevin to use the specified fraction of all the cellular barcodes observed in the input reads after sequence correction.
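
As a rough illustration of what keeping a fraction of the observed barcodes could look like, the Python sketch below ranks barcodes by frequency and keeps the top share. The ranking by frequency, the rounding with ceil, and the name keep_top_fraction are assumptions made for this example. The notes above only state that the value must lie in (0, 1].

```python
import math


def keep_top_fraction(barcode_counts, fraction):
    """Return the most frequent barcodes, keeping `fraction` of all observed ones.

    Assumption: barcodes are ranked by read count (ties broken alphabetically).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must lie in the range (0, 1]")
    ranked = sorted(barcode_counts, key=lambda cb: (-barcode_counts[cb], cb))
    n_keep = math.ceil(fraction * len(ranked))
    return ranked[:n_keep]


counts = {"AAAC": 900, "AAAG": 850, "AACC": 40, "AAGG": 3}
print(keep_top_fraction(counts, 0.5))
```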

Bug fixes, deprecations and removals

  • Fixed a rare bug that could cause salmon and alevin to "hang" when many read files were provided as input and the number of records in a read file was a divisor of the mini-batch size. Thanks to @rbenel for finding a dataset that triggers this bug and reporting it in #329.

  • The --strictIntersect flag led to unnecessary complexity in the codebase, and it seems, was not really used by anyone, so it was removed to simplify and streamline the code.

  • The --useFSPD flag has been deprecated for many releases and was removed.

salmon v0.12.0

06 Dec 05:25

Release Notes

Major Release (including major updates to Alevin and improvements to mapping validation)

We are very excited to release a major upgrade to the single-cell framework of the Salmon tool --- Alevin.

Alevin is a droplet based single-cell RNA-seq data quantification tool which currently supports the following protocols:

  1. Drop-seq
  2. 10x-Chromium v2 (v1 via wrapper)
  3. 10x-Chromium v3
  4. CEL-Seq2

With the latest release, the UMI deduplication step has been completely changed, and it is now driven by a new, efficient and robust algorithm. The latest algorithm, instead of discarding gene-ambiguous reads, utilizes the UMI networks generated by transcript level equivalence classes to better deduplicate the UMIs; while still correcting for UMI collisions. We also show that including the gene ambiguous reads into the analyses significantly improves the accuracy of the quantification of the gene count matrix in our latest preprint. Moreover, Alevin introduces a new categorization of the genes into informative tiers, allowing concise assessment of the quality of evidence that led to each UMI count in each cell. Along with many other minor bug fixes, the latest release adds two more ways of selecting an initial whitelist for starting the Alevin pipeline more robustly.

New Flags and Features for Alevin:

  • Along with the already present customizable CB and UMI length command line flags, Alevin now supports two more single-cell protocols without explicit configuration: --chromiumV3 for the v3 chemistry of 10x data, which works the same as v2 chemistry except that the UMI length has been increased from 10 to 12; and --celseq2 for CEL-Seq2 data, where both the CB and UMI lengths are configured to 6 by default.

  • With the latest release, Alevin uses --validateMappings and --minScoreFraction with a value of 0.8 as the default (although tweakable) mapping-based options. This significantly improves the mapping rate of the algorithm while providing a good tradeoff between sensitivity and specificity.

  • By default, Alevin now dumps the gene-tiers categorization matrix with the name quants_tier_mat.gz, where the row and column order stays the same as quants_mat.gz.

  • --forceCells X command line flag forces the Alevin pipeline to use top X number of Cellular Barcodes in initial whitelisting part of the pipeline -- skipping the knee method.

  • --expectCells X command line flag uses the 10x approach of selecting the whitelist barcodes, putting an upper bound on the total number of expected cells -- skipping the knee method. In brief, it only allows CBs with frequency more than 1/10th of the top 1% of the CBs as the initial whitelist.

  • A new command line flag --numCellBootstraps X has been added to perform multiple rounds of optimization by bootstrapping the number of mapped reads in the equivalence classes. Alevin dumps the mean and the variance of each entry in the Cell-Gene count matrix within two files, quants_mean_mat.gz and quants_var_mat.gz. Note: The syntax for parsing the generated binary files stays the same as quants_mat.gz, but the order of the rows in the mean/variance matrix is stored in a different file with the name quants_boot_rows.txt, where column order stays the same as quants_mat.gz.

  • Alevin performs intelligent whitelisting downstream of the quantification pipeline, and has to make some assumptions, such as choosing a fraction of reads from which to learn the low confidence CBs. As a result, it might erroneously exit if the data yield no mapped or deduplicated reads for a CB among the low confidence CBs. The problem does not occur when an external whitelist is provided. If the error occurs and the user is confident that it can be treated as a warning, it can be skipped by running Alevin with the --debug flag.

  • raw_cb_frequency.txt now includes the frequency of all the observed Cellular Barcodes instead of only the whitelisted ones.

  • Alevin no longer supports the --naive command line flag.

  • By default, the command line flag --debug is now set to true. NOTE: the pipeline will not exit when observed Cellular Barcodes from the High Confidence Region have relatively few (mapped) reads; instead, it will continue with a warning. It is the user's responsibility to take note of the warnings generated by the pipeline.
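
The two whitelist shortcuts described above (--forceCells and --expectCells) can be sketched in Python as follows. This is a hedged reading of the one-sentence descriptions, not Alevin's implementation: the function names are hypothetical, and the choice of the reference count (the barcode at the 1% rank among the top X) is an assumption about what "the top 1% of the CBs" means.

```python
def rank_barcodes(counts):
    """Sort (barcode, count) pairs by decreasing count, then by barcode."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def force_cells(counts, x):
    """Keep exactly the top-x barcodes by frequency."""
    return [cb for cb, _ in rank_barcodes(counts)[:x]]


def expect_cells(counts, x):
    """Keep at most x barcodes whose count exceeds 1/10th of a reference count.

    Assumption: the reference is the count of the barcode sitting at the
    1% rank among the top-x barcodes.
    """
    top = rank_barcodes(counts)[:x]
    if not top:
        return []
    ref_index = min(len(top) - 1, int(0.01 * len(top)))
    cutoff = top[ref_index][1] / 10
    return [cb for cb, count in top if count > cutoff]


# 100 well-covered barcodes followed by 100 barcodes with 50 reads each.
counts = {f"CB{i:03d}": 1000 - i for i in range(100)}
counts.update({f"CB{i:03d}": 50 for i in range(100, 200)})
print(len(force_cells(counts, 150)), len(expect_cells(counts, 150)))
```

In this toy data the forced whitelist always has 150 entries, while the expected-cells rule drops the low-count barcodes and keeps only the well-covered ones.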

New Flags and Features for Salmon:

  • Note : Mapping validation (--validateMappings) is a recommended flag, and may become a default in future releases. Considerable improvements to mapping validation have been implemented. For paired-end reads, mapping validation will now consider multiple equally-best chains when scoring chains of MMPs. When reads map to multiple positions on a single transcript, this allows for improved mapping, since multiple positions for the individual reads will be propagated to the algorithm that selects the best mapping for the pair (which can take into account the expected pairing constraints).

  • Added new flag in mapping validation mode --maxMMPExtension (default value of 7). This flag limits the length of the MMP by which a match between the suffix array and read can be extended. Smaller values for this parameter can potentially increase the mapping sensitivity at the cost of requiring more suffix array lookups. The default value should generally work well, and increases the sensitivity with respect to unconstrained mapping validation with little impact on runtime. This heuristic is meant to approximate some of the ideas from selective alignment. Note that this flag can be used in conjunction with --consensusSlack to increase the sensitivity of mapping in mapping validation mode (which is safe from the perspective of specificity as these mappings will be scored anyway). For example, setting --maxMMPExtension 5 --consensusSlack 7 would shorten maximum extensions even more, and consider many more potential loci when chaining, which could lead to more sensitivity. However, the default values have been tuned to provide fairly high sensitivity for minimal extra computational expense.

  • Added a new flag in mapping validation mode --mimicStrictBT2. This flag attempts to mimic the very strict mapping parameters with which Bowtie2 is invoked when it is used with RSEM. Specifically, it disallows orphans, indels in alignments, and dovetailing reads. It also sets the minimum score fraction (--minScoreFraction) to 0.8. We do not generally recommend using this flag, as these parameters tend to be overly strict and can eliminate many valid mappings / alignments. However, if one wishes to attempt to mimic that behavior with mapping validation, this "meta-flag" sets the relevant corresponding parameters. We are still optimizing details of the parameters set by this meta-flag, but it already obtains the desired effect across a large variety of datasets.

  • Salmon now supports a flag --discardOrphans in alignment-based mode. If run with this flag in alignment mode, salmon will simply ignore orphaned alignments for the purposes of quantification (matching the behavior of other tools like RSEM and eXpress).

  • Salmon now only waits 1 second for version information before timing out.

Bug Fixes:

  • Fixed two small bugs (caught by @fataltes) in the MMP chaining algorithm. This can improve the predicted mapping locations in some difficult corner cases.

  • Fixed a bug that made it difficult to set the alignment scoring parameters (match score, mismatch cost, gap open cost and gap extension cost) in mapping validation mode. Previously, these were parsed as int8_t type variables, now they are parsed as int16_t and then they are explicitly checked to adhere to the required bounds.

  • Fixed the bioconda-based OSX build: In v0.11.3, the binary created via bioconda for OSX would segfault. This was the result of a bug that only seemed to affect the C++ compiler in older versions of OSX (like that used in bioconda). This has been addressed in the current release, and you should be able to obtain a working bioconda build of 0.12.0 for OSX.
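
The scoring-parameter fix above follows a common pattern: parse the value into a wider integer type, then check the bounds explicitly. A minimal Python sketch of that pattern is shown below. The function name parse_score is hypothetical, and the 8-bit signed range used as the bounds is an assumption, since the notes only mention "the required bounds".

```python
INT8_MIN, INT8_MAX = -128, 127  # assumed bounds for this illustration


def parse_score(text, lo=INT8_MIN, hi=INT8_MAX):
    """Parse a scoring parameter as a wide integer, then range-check it."""
    value = int(text)
    if not lo <= value <= hi:
        raise ValueError(f"score {value} is outside the allowed range [{lo}, {hi}]")
    return value


print(parse_score("2"), parse_score("-4"))
```

Parsing wide and checking afterwards means an out-of-range value is rejected with a clear message instead of being silently misread.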

Major Alevin pre-release

24 Oct 13:54
Pre-release

** NOTE ** : The full v0.12.0 has been released. Please do not use this executable or tag any longer; prefer v0.12.0 instead.

This pre-release includes major upgrades to the Alevin framework of Salmon including:

  • Use of mapping validation for the mapping stage, making the overall analysis more accurate.
  • New, improved and much more efficient UMI deduplication algorithm, which utilizes the transcript multi-mapped reads.
  • New categorization of the genes into informative tiers, allowing concise assessment of the quality of evidence that led to each UMI count in each cell.

salmon v0.11.3

30 Aug 01:56

Note: Though not technically a bug fix or improvement in salmon itself, we were also able to successfully build salmon again in bioconda under OSX (previously, boost/icu related linking issues were preventing this and only linux builds were available). This means you can upgrade to this version of salmon in bioconda under OSX.

Deprecation note: With the upgrade of bioconda to conda-build 3, all compilers testing salmon builds (on our internal machines, on Travis CI, on our internal CI, in Docker images, and in bioconda) are now invoked in C++14 mode. Moving forward, the minimum required version of GCC supported is 5.2, and C++14 support will be assumed. Technically, this version of salmon can be compiled with C++11 (GCC >= 4.8.2) if the appropriate flags are passed and the preprocessor directive SALMON_DEPRECATED_COMPILER is defined. However, the guarded code will be removed in the next release, and compilation with C++11 will no longer be possible.

Bug fixes & Improvements:

This release implements the following bug fixes:

  1. Fixed a bug that caused salmon to infer the library type as unstranded (U) when run in alignment-based mode with single-end reads. Note: This bug only occurred in alignment-based mode, and only occurred with single-end reads, but the result is that libraries were always detected as unstranded. This has been fixed and single-end libraries in alignment based mode are now properly detected (at least as accurately as is possible using the same heuristic as for all other types of automated detection).

  2. Potential improvement to the accuracy of NumReads, when salmon is run in alignment-based mode. This fix avoids an extra extrapolation of the estimated number of reads from the TPM and total library size in alignment-based mode, which could result in a slight decrease in precision.

Special thanks to Jeremy Simon and Travis Ptacek (UNC Neuroscience Center Bioinformatics Core) for bringing the above to our attention, for providing test data to help reproduce the issues, and for verifying the fixes.

  3. Avoid writing NaN to lib_format_counts.json. Previously, strand_mapping_bias would be written as NaN for paired-end libraries where reads mapped but none mapped concordantly. That is, when all of the mapped reads were orphans, the strand_mapping_bias resulted in a value of NaN, which is not technically valid JSON. This case is now reported as 0.0. This fix addresses #279 --- thanks to @kurtwheeler.

  4. Fixed a minor thread synchronization bug that could cause the number of skipped cell barcodes to be reported incorrectly (off by a small number) in Alevin's log.
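
The NaN issue in lib_format_counts.json is easy to reproduce with any JSON writer. The Python sketch below shows why NaN is a problem for strict JSON and how replacing it with 0.0 (the value salmon now reports) avoids it. The helper json_safe is hypothetical and is not part of salmon.

```python
import json
import math


def json_safe(value, fallback=0.0):
    """Replace a NaN float with a fallback value that strict JSON can hold."""
    if isinstance(value, float) and math.isnan(value):
        return fallback
    return value


raw_bias = float("nan")  # stands in for a ratio computed with no concordant mappings

# Python's default encoder emits the bare token NaN, which strict parsers reject.
permissive = json.dumps({"strand_mapping_bias": raw_bias})

# With allow_nan=False the encoder refuses NaN, so the value is sanitized first.
strict = json.dumps({"strand_mapping_bias": json_safe(raw_bias)}, allow_nan=False)
print(permissive)
print(strict)
```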

and the following improvements:

  1. Read libraries are now written in lib_format_counts.json as a proper JSON list, rather than as a single string in a custom format. Thanks to @Miserlou for this suggestion.

  2. Numerous improvements were made to the TravisCI configuration (and associated parts of the build system). Salmon is now compiled in Travis CI under both GCC 5 and 7 to improve testing of compatibility. Many thanks to @junaruga for the pull requests implementing these improvements!

  3. Bumped included versions of fmt and spdlog, and improved pretty printing of log and console messages in a number of places.

New feature:

This release also introduces a new "debug" flag for Alevin (activated by --debug), thanks to the suggestions from @habilzare and @patrickvdb made in issue #253. When debug mode is activated, Alevin will run in a relatively lenient manner, relaxing the following assumptions and not raising an error when they are violated. This allows the pipeline to run to completion even when a violation occurs (though warnings are issued):

  • All externally provided whitelisted barcodes, if given through --whitelist flag, must have some reads in the FASTQ file assigned to it.
  • All the "High Confidence" cells must have at least 10 reads confidently mapped to them.
  • All the whitelisted cellular barcodes must have at least one deduplicated UMI.

That is, Alevin will normally terminate if any of the above assumptions is violated. With the --debug flag, it will instead run to completion and report the violations as warnings.
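
The difference between the strict and the lenient behavior can be sketched as a check that either raises or warns. The Python example below models only the second assumption (at least 10 confidently mapped reads per "High Confidence" cell). The function and its arguments are hypothetical and do not mirror Alevin's internals.

```python
import warnings


def check_high_confidence_cells(reads_per_cell, min_reads=10, debug=False):
    """Return the barcodes with too few reads; raise unless debug is set."""
    failing = sorted(cb for cb, n in reads_per_cell.items() if n < min_reads)
    if failing:
        message = f"{len(failing)} high confidence cell(s) have fewer than {min_reads} reads"
        if debug:
            warnings.warn(message)       # lenient: report and keep going
        else:
            raise RuntimeError(message)  # strict: terminate the run
    return failing
```

Called with debug=True, the check returns the offending barcodes and lets the caller continue. With debug=False, the same input stops the run with an error.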

salmon v0.11.2

04 Aug 17:30

Release notes

  • Moved incompatPrior option from "basic" to "advanced" category.

  • Made minor speed improvements to validateMappings, including using
    faster hash table for alignment caching and reverse-complement on
    demand.

  • Fixed a bug that caused inflated posterior variance for likely non-expressed (often point estimate of 0 TPM) transcripts when bias correction was enabled (thanks to @kvittingseerup for finding this and pointing it out to us).

salmon v0.11.1

27 Jul 22:27

Salmon v0.11.1 Release Notes

This version adds no features over v0.11.0, but improves some behavior and fixes a bug that was previously introduced that prevented alevin from properly handling DropSeq data.

Changes:

  • This version of salmon (and likely all future versions) drops support for the FMD indexing mode. This considerably reduces the maintenance burden, and paves the way for future additions and engineering improvements. If you need to use the FMD index, please use v0.11.0. In the future, indexing and mapping against large sequence databases will be enabled using the pufferfish index, which will be integrated into salmon.

  • This version removes the often-overlooked and potentially confusing behavior of applying an implicit library type to read libraries that appear before a library type flag. Salmon now requires that the library type flag appear on the command line before the library to which it applies. It will also produce more verbose and meaningful error messages if this is overlooked.
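
A small Python sketch of the ordering rule: scan the arguments and check that a library type flag has been seen before any read-file flag. The flag spellings used here (-l/--libType, -r/--unmatedReads, -1/--mates1, -2/--mates2) come from salmon's command line rather than from these notes. The checker itself is a hypothetical helper and not how salmon parses its arguments.

```python
LIBTYPE_FLAGS = {"-l", "--libType"}
READ_FLAGS = {"-r", "--unmatedReads", "-1", "--mates1", "-2", "--mates2"}


def libtype_precedes_reads(argv):
    """True if every read-file flag appears after a library type flag."""
    seen_libtype = False
    for arg in argv:
        if arg in LIBTYPE_FLAGS:
            seen_libtype = True
        elif arg in READ_FLAGS and not seen_libtype:
            return False
    return True


good = ["salmon", "quant", "-i", "index", "-l", "A", "-r", "reads.fq", "-o", "out"]
bad = ["salmon", "quant", "-i", "index", "-r", "reads.fq", "-l", "A", "-o", "out"]
print(libtype_precedes_reads(good), libtype_precedes_reads(bad))
```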

Bug fixes:

  • This version resolves a couple of bugs in alevin.
    • Resolves issue #258 (thanks to @pophipi), where alevin erroneously exited in dropseq mode with an error in parsing the UMI.
    • dumpfeatures was previously seg-faulting if all the features were not used for intelligent whitelisting. Now, automatic detection of the number of features is done when alevin is asked to dump the features.

Salmon v0.11.0

16 Jul 04:35

Version 0.11.0 introduces further enhancements and bug fixes, and makes some modifications to the default parameters.

Note : Though we provide a pre-compiled binary here, we strongly suggest installing the latest version of salmon through Bioconda (or building via source).

New / enhanced features

  • The sensitivity and accuracy of mapping validation (enabled via the --validateMappings flag) has been further enhanced. This flag allows salmon to score mappings and validate their quality, both reducing the instances of spurious mapping and improving assignment of reads to the correct transcript in complex mapping situations.

  • Variational Bayesian (VB) optimization is now the default algorithm for the offline phase of salmon. While salmon always uses stochastic collapsed VB inference during the online phase, the default optimization algorithm during the offline phase was Expectation Maximization (EM), and the use of VB optimization had to be explicitly enabled with --useVBOpt. Now, VB optimization is the default. This decision was made as the result of testing that suggested that the VB algorithm is often more accurate, and, specifically, is less likely to give non-zero (even if small) TPM to a non-expressed transcript. One can make salmon use the EM algorithm (reverting to the old behavior) by passing the --useEM flag. Further, for backwards compatibility, salmon still accepts the --useVBOpt flag, but it may be removed in the future. It is worth noting that the VB optimization algorithm employs a prior (which can be set using the --vbPrior flag). The default prior has been tuned for typical use-cases, and should work well. However, if you have a mechanism for validation, it may be worth exploring this option to see if a sparser (smaller) or less-sparse (larger) prior is useful in your setting.

  • The new flag --sigDigits (default value = 3) tells salmon how many significant digits to use when writing out effective lengths and estimated counts (it will still use more digits for TPM). Since more significant digits are not often necessary, this can considerably reduce output size if one is processing many samples. Of course, using this flag, one can always request higher-precision output.

  • The new flag --consensusSlack allows one to modify the behavior of the mapping consensus mechanism. Passing a larger value will allow more "liberal" mapping. This is an advanced flag, and it is not likely that it needs to be set by the casual user. The consensus slack is set to 1 by default when mapping validation is enabled, and to 0 otherwise.
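
To make the effect of --sigDigits concrete, the Python sketch below rounds values to a given number of significant digits. It only illustrates the idea of significant-digit rounding and is not how salmon formats its output. The helper name round_sig is hypothetical.

```python
def round_sig(value, sig_digits=3):
    """Round a number to `sig_digits` significant digits."""
    return float(f"{value:.{sig_digits}g}")


print(round_sig(1234.5678))      # 1230.0
print(round_sig(0.00123456))     # 0.00123
print(round_sig(98765.4321, 5))  # 98765.0
```

Three significant digits keep the leading digits of a value regardless of its magnitude, which is why the output shrinks without distorting large and small estimates differently.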

Bug fixes

  • This release fixes a bug in mapping validation (i.e. the --validateMappings flag) that could result in a segmentation fault in rare situations.

  • This release fixes a bug that was present when using the built in gene-level aggregation. The quant.genes.sf gene lengths and effective lengths columns are corrected, being generally wrong for multi-isoform genes since 0.6.0 (with an internals-varying permutation of isoforms before the fix being weighted 1, 2, 3, ... times what they should be in the TPM-weighted averaging denominator, resulting in lengths that were too short, possibly even shorter than the shortest isoform). This miscalculation applies only to the lengths and effective lengths; TPM and NumRead columns remain unchanged (and correct). We thank Shawn Cokus (Cokus@ucla.edu), UCLA MCDB/BSCRC for discovering this issue and bringing it to our attention. Note: While we are maintaining the built-in gene-level aggregation code, tximport is, and has been, the recommended way to aggregate transcript-level abundances to the gene-level. It provides several benefits over the built-in methodology, including the ability to derive lengths looking across replicates (rather than processing each replicate individually).
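
The gene-level length described above is a TPM-weighted average of the isoform lengths. The Python sketch below works through that calculation as a sanity check. The function name gene_length is hypothetical, and the fallback to an unweighted mean when all TPMs are zero is an assumption made for this example, not something the notes specify.

```python
def gene_length(isoforms):
    """TPM-weighted average length over a gene's isoforms.

    `isoforms` is a list of (tpm, length) pairs.
    """
    total_tpm = sum(tpm for tpm, _ in isoforms)
    if total_tpm == 0:
        # Assumed fallback: plain mean of the isoform lengths.
        return sum(length for _, length in isoforms) / len(isoforms)
    return sum(tpm * length for tpm, length in isoforms) / total_tpm


isoforms = [(30.0, 1000), (10.0, 2000)]
print(gene_length(isoforms))  # (30*1000 + 10*2000) / 40 = 1250.0
```

A correctly weighted average always lies between the shortest and the longest isoform length. A gene length shorter than the shortest isoform is therefore a sign that the denominator is wrong, as in the bug described above.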

salmon v0.10.2

07 Jun 22:25

v0.10.2 is a bug-fix release following closely behind v0.10.1. It introduces no new features, but addresses issue #232 which can pop up in rare circumstances in alignment mode. Thanks to @francicco for helping discover and provide test cases for this issue. The v0.10.1 release notes are repeated below. Please either build the latest version from source, or grab a pre-built binary for your operating system via bioconda.

Salmon 0.10.0 is a major feature release. It includes a family of algorithms to perform single cell analysis, but also a number of new features and performance enhancements. We highly recommend that all users upgrade when they have the chance.

Note : Due to the inclusion of the SHA512 hash in the salmon index (see in other changes below), existing salmon indices should be rebuilt.

alevin

Welcome alevin to the salmon family !

You can find a tutorial describing how to use alevin here.

Working under the salmon engine, alevin brings new algorithms and infrastructure to perform single-cell quantification and analysis based on 3' tagged-end sequencing. The alevin mode is activated by using the alevin command, and currently supports quantification of Drop-seq (--dropseq) and 10x v1/2 (--chromium) single-cell protocols (v1 chemistry requires use of a special wrapper). Alevin works on raw-FASTA/Q files and performs the following tasks:

  • Initial Whitelisting: If not given --whitelist (an already known set of whitelisted barcodes e.g. as produced by Cell Ranger), alevin finds a rough estimate for the set of the whitelisted CB (Cellular Barcodes) based on their frequency.

  • Barcode Correction: In the first pass over the CB file, alevin constructs a dictionary for the correction of CB (if not on the whitelist) by correcting CB within 1-edit distance of the whitelisted CB. In case of multiple whitelist candidates, preference is given to SNP over indels. Optionally, a probabilistic model can be used to soft-assign barcodes, although that behavior is disabled by default (--noSoftMap is true).

  • UMI Correction & Deduplication: alevin introduces a novel method for deduplicating the UMIs (Unique Molecule identifiers) present in a sample. Alevin's algorithm uses equivalence-class-level information to infer when the same UMI must arise from different isoforms of a gene (to avoid over-collapsing UMI counts), but also accounts for the fact that collisions between UMIs within a gene are expected to be very rare (i.e. if UMIs arise within different equivalence classes of a gene, they are most likely to derive from different positions in the same underlying molecule). To use a baseline (i.e. simple gene-level) UMI deduplication algorithm, alevin can be used with --naive to disable its collision correction.

  • CB classification: Alevin uses various features in a machine-learning-based framework to classify the set of observed CBs that are likely to derive from valid captured cells (i.e. final whitelisting). This approach to CB classification is like that performed by the method of Petukhov et al. Alevin uses features like the abundance of mitochondrial genes (--mrna), ribosomal genes (--rrna) and others for classification.

  • Cell-Gene count Matrix: By default, alevin outputs a cell-by-gene matrix in a compressed binary format. However, --dumpCsvCounts can be used to dump a human-readable count matrix.

  • Other features: --dumpfq does fast concatenation of the corrected CB to the read names of the sequence-containing fastq file; --dumpFeatures dumps the features and counts used by alevin to perform the ML-based CB classification; --dumpBfh dumps the full CB-Eqclass-UMI-Count data-structure used internally by alevin.
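
The barcode correction step above can be partly illustrated with a short Python sketch that looks up whitelist candidates one substitution (SNP) away from an observed barcode. It deliberately leaves out indels and the optional soft-assignment model, and it returns all candidates rather than choosing among them. The function name and the return convention are hypothetical.

```python
def snp_candidates(barcode, whitelist, alphabet="ACGT"):
    """Whitelisted barcodes reachable from `barcode` by at most one substitution.

    A barcode that is already whitelisted is returned unchanged.
    """
    if barcode in whitelist:
        return [barcode]
    found = set()
    for i, base in enumerate(barcode):
        for alt in alphabet:
            if alt != base:
                candidate = barcode[:i] + alt + barcode[i + 1:]
                if candidate in whitelist:
                    found.add(candidate)
    return sorted(found)


whitelist = {"ACGTACGT", "TTTTCCCC"}
print(snp_candidates("ACGTACGA", whitelist))  # ['ACGTACGT']
```

Building such a lookup once, in a first pass over the barcodes, lets every later read be corrected with a single dictionary access.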

Note : We are actively developing and improving alevin, and are happy and excited to get feedback from the community. If you encounter an issue when using alevin, please be sure to tag your GitHub issue with the alevin tag when reporting the issue via GitHub.

mapping validation

Mapping validation is a new feature that allows salmon to validate its mappings via a traditional (affine-gap penalty) alignment procedure; it is enabled by passing the flag --validateMappings. This validation is made efficient (and fast) through a combination of :

  • using the very-efficient and highly-vectorized alignment implementation of @lh3's ksw2 library.

  • devising a novel caching heuristic that avoids re-aligning reads when sub-problems are redundant (this turns out to be a major computational bottleneck when aligning against the transcriptome).

Using the --validateMappings flag has two main potential benefits. First, this will help prevent salmon from considering potentially spurious mappings (i.e., mappings supported by only a few MMPs but which nonetheless would not support a high-quality read alignment). Second, this will help assign more appropriate mapping scores to reads that map to similar (but not identical) reference sequences --- essentially helping to appropriately down-weight sub-optimal mappings. Along with this flag, salmon introduces flags to set the match score (--ma), mismatch penalty (--mp), and gap open (--go) and extension (--ge) scores used when computing the alignment. It also allows the user to specify the minimum relative alignment score that will be considered as a valid mapping (--minScoreFraction). While these can all be customized, the defaults should be reasonable for typical use cases.
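
As a worked example of how these scoring parameters interact, the Python sketch below scores an alignment with an affine gap penalty and compares it against a minimum score fraction. It rests on several assumptions: penalties are given as positive numbers and subtracted, a gap of length n costs go + n * ge, the best achievable score is the match score times the read length, and the numeric values are purely illustrative rather than salmon's defaults.

```python
def alignment_score(matches, mismatches, gap_lengths, ma, mp, go, ge):
    """Affine-gap score: reward matches, penalize mismatches and gaps."""
    gap_cost = sum(go + ge * length for length in gap_lengths)
    return ma * matches - mp * mismatches - gap_cost


def passes_min_score(score, read_length, ma, min_score_fraction):
    """True if `score` reaches the required fraction of the best possible score."""
    return score >= min_score_fraction * (ma * read_length)


# Illustrative parameters for a 100-base read (not salmon's defaults).
ma, mp, go, ge = 2, 4, 5, 3
good = alignment_score(97, 3, [], ma, mp, go, ge)      # 194 - 12 = 182
poor = alignment_score(80, 10, [10], ma, mp, go, ge)   # 160 - 40 - 35 = 85
print(good, passes_min_score(good, 100, ma, 0.8))
print(poor, passes_min_score(poor, 100, ma, 0.8))
```

With a fraction of 0.8 the threshold is 160 out of a possible 200, so the first alignment is kept and the second is rejected.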

other changes

  • Salmon now enables the alignment error model by default in alignment-based mode. This means that the --useErrorModel flag is no longer valid, since its behavior is now the default. This flag has been removed, and a new flag added in its place. Passing alignment-based salmon the --noErrorModel flag will turn off the alignment error model in alignment-based mode.

  • Related to the above; the alignment error model works best in conjunction with range factorization. Thus, the default behavior is now to turn on range-based factorization in alignment mode (in conjunction with the error model).

  • New default VB prior : The default per-nucleotide VB prior has been changed to 1e-5. While this is still an ongoing area of research, a considerable amount of testing is suggesting that variational Bayesian optimization with a sparsity inducing prior regularly leads to more accurate abundance estimates than the default EM algorithm. While we are leaving the EM algorithm as the default for the offline-phase in the current release, this may change in future versions. We encourage users who may not already be doing so to explore the variational Bayesian-based offline optimization feature of salmon (enabled with --useVBOpt).

  • The library type compatibility is now enforced strictly. Previously, mappings that disagreed with the inferred or provided library type simply had their probability decreased. Now, the default behavior is to discard such mappings. The new behavior is equivalent to running with the option --incompatPrior 0. The older behavior can be obtained by setting --incompatPrior to a small non-zero value.

  • The library format count statistics are now computed in a different (and hopefully less confusing) manner. Specifically, rather than being computed over the number of mappings of each type, the statistics are computed over the number of fragments that have at least one mapping of that type. This means that, e.g., if a fragment maps to 2 places in the forward orientation and 1 place in the reverse-complement orientation, this will now contribute only 1 count each to the forward and reverse-complement compatibilities. This should help reduce any reference bias when computing these summary statistics.

  • The default value of --gcSpeedSamp has been set to 5.

  • Inclusion of SHA512 hashes for the salmon index : When indexing, salmon now computes both SHA256 and SHA512 hashes for the reference. This is done to allow future-compatibility with GA4GH hashes (which will use a truncated variant of SHA512).

  • The default k-mer class (used for certain operations within salmon) has been migrated from the jellyfish implementation to a custom implementation. This results in a small performance increase on our testing systems under linux, and a moderate performance increase under OSX.

  • Salmon is now compiled in C++14 mode (i.e. --std=c++14) by default rather than C++11 mode. This is the last salmon release that will support C++11 (by compiling with -DCONDA_BUILD=TRUE). Moving forward, C++14 compliance will be considered the minimum requirement to compile salmon from source and C++14 features will be used in new code.
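
The change to the library format count statistics listed above (counting fragments rather than mappings) can be illustrated with a small Python sketch. The data layout and function names are hypothetical and are not taken from salmon's code; each fragment is represented simply as the list of orientations of its mappings.

```python
from collections import Counter


def count_per_mapping(fragments):
    """Older style: every mapping contributes one count to its orientation."""
    counts = Counter()
    for orientations in fragments:
        counts.update(orientations)
    return counts


def count_per_fragment(fragments):
    """Newer style: a fragment contributes at most one count per orientation."""
    counts = Counter()
    for orientations in fragments:
        counts.update(set(orientations))
    return counts


# One fragment with 2 forward mappings and 1 reverse-complement mapping.
fragments = [["forward", "forward", "reverse_complement"]]
print(count_per_mapping(fragments))   # forward: 2, reverse_complement: 1
print(count_per_fragment(fragments))  # forward: 1, reverse_complement: 1
```

With per-fragment counting, a fragment that multi-maps heavily in one orientation no longer dominates the summary.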

version 0.10.1 fixes

This version addresses issues #228 and #229. The first issue could result in a segfault under OSX when running salmon with the new --validateMappings flag, and was the result of an errant <= in place of a < in a sorting comparator. Issue #229 likely predates v0.10.0 considerably, and could occur in VBOpt mode when the normalizer of a rich equivalence class had too small a numeric value. To address this, the numeric cutoffs have been adjusted so that both the normalizer and its inverse can be properly represented.
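
The comparator problem behind issue #228 is a general one: C++ sorting routines such as std::sort require a strict weak ordering, and a comparator built on <= breaks that requirement because it reports an element as ordered before itself. The Python sketch below only checks that one property (irreflexivity) for the two operators; it is an illustration of the concept, not a reproduction of salmon's comparator.

```python
import operator


def is_irreflexive(comp, values):
    """A strict ordering must never report a value as ordered before itself."""
    return not any(comp(v, v) for v in values)


values = [3, 1, 2, 2]
print(is_irreflexive(operator.lt, values))  # True: '<' is a valid strict ordering
print(is_irreflexive(operator.le, values))  # False: '<=' violates the requirement
```

In C++, violating this requirement is undefined behavior, which is consistent with the problem appearing as a segfault on only some platforms.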

salmon v0.10.1

01 Jun 16:16

v0.10.1 is a bug-fix release following closely behind v0.10.0. It introduces no new features, but addresses the issues mentioned below. Thanks to @knokknok for discovering and helping to resolve these issues. For this purpose, the v0.10.0 release notes are repeated below. Please either build the latest version from source, or grab a pre-built binary for your operating system via bioconda.

Salmon 0.10.0 is a major feature release. It includes a family of algorithms to perform single cell analysis, as well as a number of new features and performance enhancements. We highly recommend that all users upgrade when they have the chance.

Note : Due to the inclusion of the SHA512 hash in the salmon index (see "other changes" below), existing salmon indices should be rebuilt.

alevin

Welcome alevin to the salmon family !

You can find a tutorial describing how to use alevin here.

Working under the salmon engine, alevin brings new algorithms and infrastructure to perform single-cell quantification and analysis based on 3' tagged-end sequencing. The alevin mode is activated by using the alevin command, and currently supports quantification of Drop-seq (--dropseq) and 10x v1/2 (--chromium) single-cell protocols (v1 chemistry requires use of a special wrapper). Alevin works on raw-FASTA/Q files and performs the following tasks:

  • Initial Whitelisting: If not given --whitelist (an already known set of whitelisted barcodes e.g. as produced by Cell Ranger), alevin finds a rough estimate for the set of the whitelisted CB (Cellular Barcodes) based on their frequency.

  • Barcode Correction: In the first pass over the CB file, alevin constructs a dictionary for the correction of CB (if not on the whitelist) by correcting CB within 1-edit distance of the whitelisted CB. In case of multiple whitelist candidates, preference is given to SNP over indels. Optionally, a probabilistic model can be used to soft-assign barcodes, although that behavior is disabled by default (--noSoftMap is true).

  • UMI Correction & Deduplication: alevin introduces a novel method for deduplicating the UMIs (Unique Molecular Identifiers) present in a sample. Alevin's algorithm uses equivalence-class-level information to infer when the same UMI must arise from different isoforms of a gene (to avoid over-collapsing UMI counts), but also accounts for the fact that collisions between UMIs within a gene are expected to be very rare (i.e. if UMIs arise within different equivalence classes of a gene, they are most likely to derive from different positions in the same underlying molecule). To use a baseline (i.e. simple gene-level) UMI deduplication algorithm, alevin can be used with --naive to disable its collision correction.

  • CB classification: Alevin uses various features in a machine-learning-based framework to classify the set of observed CBs that are likely to derive from valid captured cells (i.e. final whitelisting). This approach to CB classification is like that performed by the method of Petukhov et al. Alevin uses features like the abundance of mitochondrial genes (--mrna), ribosomal genes (--rrna) and others for classification.

  • Cell-Gene count Matrix: By default, alevin outputs a cell-by-gene matrix in a compressed binary format. However, --dumpCsvCounts can be used to dump a human-readable count matrix.

  • other features: --dumpfq does fast concatenation of corrected CB to the read names of the sequence-containing FASTQ file; --dumpFeatures dumps the features and counts used by alevin to perform the ML-based CB classification; --dumpBfh dumps the full CB-Eqclass-UMI-Count data-structure used internally by alevin.
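
The barcode correction step in the list above (correcting a CB within 1-edit distance of the whitelist, preferring substitutions over indels) can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions: the function names are hypothetical, ties among equally good candidates are broken lexicographically for determinism, and alevin's actual dictionary construction and soft-assignment model are not shown.

```python
ALPHABET = "ACGT"


def substitutions(barcode):
    """All barcodes that differ from `barcode` by one substituted base."""
    for i, base in enumerate(barcode):
        for alt in ALPHABET:
            if alt != base:
                yield barcode[:i] + alt + barcode[i + 1:]


def indels(barcode):
    """All barcodes reachable by one deletion or one insertion."""
    for i in range(len(barcode)):
        yield barcode[:i] + barcode[i + 1:]
    for i in range(len(barcode) + 1):
        for alt in ALPHABET:
            yield barcode[:i] + alt + barcode[i:]


def correct_barcode(barcode, whitelist):
    """Return a whitelisted barcode within 1 edit, or None if there is none."""
    if barcode in whitelist:
        return barcode
    # Substitution candidates are tried first, indels only as a fallback.
    for candidates in (substitutions(barcode), indels(barcode)):
        hits = sorted({c for c in candidates if c in whitelist})
        if hits:
            return hits[0]  # assumption: arbitrary but deterministic tie-break
    return None


print(correct_barcode("ACGA", {"ACGT", "ACG"}))  # ACGT
```

Here "ACGA" is one substitution away from "ACGT" and one deletion away from "ACG", so the substitution candidate wins.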

Note : We are actively developing and improving alevin, and are happy and excited to get feedback from the community. If you encounter an issue when using alevin, please be sure to tag your GitHub issue with the alevin tag when reporting the issue via GitHub.

mapping validation

Mapping validation is a new feature that allows salmon to validate its mappings via a traditional (affine-gap penalty) alignment procedure; it is enabled by passing the flag --validateMappings. This validation is made efficient (and fast) through a combination of :

  • using the very-efficient and highly-vectorized alignment implementation of @lh3's ksw2 library.

  • devising a novel caching heuristic that avoids re-aligning reads when sub-problems are redundant (this turns out to be a major computational bottleneck when aligning against the transcriptome).

Using the --validateMappings flag has two main potential benefits. First, this will help prevent salmon from considering potentially spurious mappings (i.e., mappings supported by only a few MMPs but which nonetheless would not support a high-quality read alignment). Second, this will help assign more appropriate mapping scores to reads that map to similar (but not identical) reference sequences --- essentially helping to appropriately down-weight sub-optimal mappings. Along with this flag, salmon introduces flags to set the match score (--ma), mismatch penalty (--mp), and gap open (--go) and extension (--ge) scores used when computing the alignment. It also allows the user to specify the minimum relative alignment score that will be considered as a valid mapping (--minScoreFraction). While these can all be customized, the defaults should be reasonable for typical use cases.

other changes

  • Salmon now enables the alignment error model by default in alignment-based mode. This means that the --useErrorModel flag is no longer valid, since its behavior is now the default. This flag has been removed, and a new flag added in its place. Passing alignment-based salmon the --noErrorModel flag will turn off the alignment error model in alignment-based mode.

  • Related to the above; the alignment error model works best in conjunction with range factorization. Thus, the default behavior is now to turn on range-based factorization in alignment mode (in conjunction with the error model).

  • New default VB prior : The default per-nucleotide VB prior has been changed to 1e-5. While this is still an ongoing area of research, a considerable amount of testing is suggesting that variational Bayesian optimization with a sparsity inducing prior regularly leads to more accurate abundance estimates than the default EM algorithm. While we are leaving the EM algorithm as the default for the offline-phase in the current release, this may change in future versions. We encourage users who may not already be doing so to explore the variational Bayesian-based offline optimization feature of salmon (enabled with --useVBOpt).

  • The library type compatibility is now enforced strictly. Previously, mappings that disagreed with the inferred or provided library type simply had their probability decreased. Now, the default behavior is to discard such mappings. The new behavior is equivalent to running with the option --incompatPrior 0. The older behavior can be obtained by setting --incompatPrior to a small non-zero value.

  • The library format count statistics are now computed in a different (and hopefully less confusing) manner. Specifically, rather than being computed over the number of mappings of each type, the statistics are computed over the number of fragments that have at least one mapping of that type. This means that, e.g., if a fragment maps to 2 places in the forward orientation and 1 place in the reverse-complement orientation, this will now contribute only 1 count each to the forward and reverse-complement compatibilities. This should help reduce any reference bias when computing these summary statistics.

  • The default value of --gcSpeedSamp has been set to 5.

  • Inclusion of SHA512 hashes for the salmon index : When indexing, salmon now computes both SHA256 and SHA512 hashes for the reference. This is done to allow future-compatibility with GA4GH hashes (which will use a truncated variant of SHA512).

  • The default k-mer class (used for certain operations within salmon) has been migrated from the jellyfish implementation to a custom implementation. This results in a small performance increase on our testing systems under linux, and a moderate performance increase under OSX.

  • Salmon is now compiled in C++14 mode (i.e. --std=c++14) by default rather than C++11 mode. This is the last salmon release that will support C++11 (by compiling with -DCONDA_BUILD=TRUE). Moving forward, C++14 compliance will be considered the minimum requirement to compile salmon from source and C++14 features will be used in new code.
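
The effect of --incompatPrior described in the list above can be pictured with a short Python sketch. It assumes, as a simplification, that the prior acts as a multiplicative factor on the probability of a mapping that disagrees with the library type; the function names and tuple layout are hypothetical.

```python
def mapping_weight(probability, compatible, incompat_prior=0.0):
    """Compatible mappings keep their probability; others are scaled by the prior."""
    return probability if compatible else probability * incompat_prior


def retained_mappings(mappings, incompat_prior=0.0):
    """Keep only mappings whose weight stays above zero."""
    kept = []
    for name, probability, compatible in mappings:
        weight = mapping_weight(probability, compatible, incompat_prior)
        if weight > 0.0:
            kept.append((name, weight))
    return kept


mappings = [("t1", 0.5, True), ("t2", 0.5, False)]
print(retained_mappings(mappings))                       # strict: only t1 survives
print(retained_mappings(mappings, incompat_prior=1e-5))  # older style: t2 down-weighted
```

With a prior of 0 the incompatible mapping is discarded outright, whereas a small non-zero prior keeps it with a much reduced weight.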

version 0.10.1 fixes

This version addresses issues #228 and #229. The first issue could result in a segfault under OSX when running salmon with the new --validateMappings flag, and was the result of an errant <= in place of a < in a sorting comparator. Issue #229 likely predates v0.10.0 considerably, and could occur in VBOpt mode when the normalizer of a rich equivalence class had too small a numeric value. To address this, the numeric cutoffs have been adjusted so that both the normalizer and its inverse can be properly represented.
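
The numeric issue behind #229 can be illustrated in isolation: a very small floating-point value may itself be representable while its reciprocal is not. The Python sketch below clamps a normalizer from below so that its inverse stays finite. The cutoff used here (the smallest positive normal double) is a hypothetical choice for illustration and is not the cutoff used in salmon.

```python
import math
import sys

# Hypothetical cutoff: the smallest positive normal double, whose reciprocal is finite.
MIN_NORMALIZER = sys.float_info.min


def safe_inverse(normalizer):
    """Clamp the normalizer from below so that 1 / normalizer stays finite."""
    return 1.0 / max(normalizer, MIN_NORMALIZER)


tiny = 5e-324  # smallest positive subnormal double
print(math.isinf(1.0 / tiny))            # True: the raw inverse overflows
print(math.isfinite(safe_inverse(tiny)))  # True: the clamped inverse is finite
```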

salmon v0.10.0

29 May 21:55

Salmon 0.10.0 is a major feature release. It includes a family of algorithms to perform single cell analysis, as well as a number of new features and performance enhancements. We highly recommend that all users upgrade when they have the chance.

Note : Due to the inclusion of the SHA512 hash in the salmon index (see "other changes" below), existing salmon indices should be rebuilt.

alevin

Welcome alevin to the salmon family !

Working under the salmon engine, alevin brings new algorithms and infrastructure to perform single-cell quantification and analysis based on 3' tagged-end sequencing. The alevin mode is activated by using the alevin command, and currently supports quantification of Drop-seq (--dropseq) and 10x v1/2 (--chromium) single-cell protocols (v1 chemistry requires use of a special wrapper). Alevin works on raw-FASTA/Q files and performs the following tasks:

  • Initial Whitelisting: If not given --whitelist (an already known set of whitelisted barcodes e.g. as produced by Cell Ranger), alevin finds a rough estimate for the set of the whitelisted CB (Cellular Barcodes) based on their frequency.

  • Barcode Correction: In the first pass over the CB file, alevin constructs a dictionary for the correction of CB (if not on the whitelist) by correcting CB within 1-edit distance of the whitelisted CB. In case of multiple whitelist candidates, preference is given to SNP over indels. Optionally, a probabilistic model can be used to soft-assign barcodes, although that behavior is disabled by default (--noSoftMap is true).

  • UMI Correction & Deduplication: alevin introduces a novel method for deduplicating the UMIs (Unique Molecular Identifiers) present in a sample. Alevin's algorithm uses equivalence-class-level information to infer when the same UMI must arise from different isoforms of a gene (to avoid over-collapsing UMI counts), but also accounts for the fact that collisions between UMIs within a gene are expected to be very rare (i.e. if UMIs arise within different equivalence classes of a gene, they are most likely to derive from different positions in the same underlying molecule). To use a baseline (i.e. simple gene-level) UMI deduplication algorithm, alevin can be used with --naive to disable its collision correction.

  • CB classification: Alevin uses various features in a machine-learning-based framework to classify the set of observed CBs that are likely to derive from valid captured cells (i.e. final whitelisting). This approach to CB classification is like that performed by the method of Petukhov et al. Alevin uses features like the abundance of mitochondrial genes (--mrna), ribosomal genes (--rrna) and others for classification.

  • Cell-Gene count Matrix: By default, alevin outputs a cell-by-gene matrix in a compressed binary format. However, --dumpCsvCounts can be used to dump a human-readable count matrix.

  • other features: --dumpfq does fast concatenation of corrected CB to the read names of the sequence-containing FASTQ file; --dumpFeatures dumps the features and counts used by alevin to perform the ML-based CB classification; --dumpBfh dumps the full CB-Eqclass-UMI-Count data-structure used internally by alevin.
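
For comparison with the UMI deduplication item in the list above, a baseline gene-level deduplication can be sketched as counting the distinct UMIs seen for each cell and gene. This Python snippet is a generic illustration of that baseline idea; the record layout and function name are hypothetical, and it is not claimed to match exactly what --naive does internally.

```python
from collections import defaultdict


def naive_umi_counts(records):
    """Count distinct UMIs per (cell barcode, gene) pair.

    `records` is an iterable of (cell_barcode, gene, umi) tuples.
    """
    seen = defaultdict(set)
    for cell, gene, umi in records:
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


records = [
    ("c1", "g1", "AAAA"),
    ("c1", "g1", "AAAA"),  # duplicate of the record above, collapsed
    ("c1", "g1", "CCCC"),
    ("c1", "g2", "AAAA"),
]
print(naive_umi_counts(records))
```

In this baseline, reads sharing a UMI within a gene always collapse to a single count, regardless of which equivalence classes they fall into.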

Note : We are actively developing and improving alevin, and are happy and excited to get feedback from the community. If you encounter an issue when using alevin, please be sure to tag your GitHub issue with the alevin tag when reporting the issue via GitHub.

mapping validation

Mapping validation is a new feature that allows salmon to validate its mappings via a traditional (affine-gap penalty) alignment procedure; it is enabled by passing the flag --validateMappings. This validation is made efficient (and fast) through a combination of :

  • using the very-efficient and highly-vectorized alignment implementation of @lh3's ksw2 library.

  • devising a novel caching heuristic that avoids re-aligning reads when sub-problems are redundant (this turns out to be a major computational bottleneck when aligning against the transcriptome).

Using the --validateMappings flag has two main potential benefits. First, this will help prevent salmon from considering potentially spurious mappings (i.e., mappings supported by only a few MMPs but which nonetheless would not support a high-quality read alignment). Second, this will help assign more appropriate mapping scores to reads that map to similar (but not identical) reference sequences --- essentially helping to appropriately down-weight sub-optimal mappings. Along with this flag, salmon introduces flags to set the match score (--ma), mismatch penalty (--mp), and gap open (--go) and extension (--ge) scores used when computing the alignment. It also allows the user to specify the minimum relative alignment score that will be considered as a valid mapping (--minScoreFraction). While these can all be customized, the defaults should be reasonable for typical use cases.
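
To make the scoring parameters above concrete, the following Python sketch computes an affine-gap alignment score from summary counts. It assumes the common convention in which penalties are positive magnitudes and a gap of length k costs go + k * ge; whether salmon uses exactly this convention is an assumption here, and the numeric values are illustrative rather than salmon's defaults.

```python
def affine_gap_score(matches, mismatches, gap_lengths, ma, mp, go, ge):
    """Score an alignment given match/mismatch counts and a list of gap lengths.

    Assumed convention: each gap of length k costs go + k * ge.
    """
    gap_cost = sum(go + ge * length for length in gap_lengths)
    return matches * ma - mismatches * mp - gap_cost


# 95 matches, 3 mismatches and one gap of length 2 (illustrative values).
score = affine_gap_score(95, 3, [2], ma=2, mp=4, go=5, ge=3)
print(score)  # 190 - 12 - 11 = 167
```

Because opening a gap is charged separately from extending it, one long gap is penalized less than several short gaps of the same total length.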

other changes

  • Salmon now enables the alignment error model by default in alignment-based mode. This means that the --useErrorModel flag is no longer valid, since its behavior is now the default. This flag has been removed, and a new flag added in its place. Passing alignment-based salmon the --noErrorModel flag will turn off the alignment error model in alignment-based mode.

  • Related to the above; the alignment error model works best in conjunction with range factorization. Thus, the default behavior is now to turn on range-based factorization in alignment mode (in conjunction with the error model).

  • New default VB prior : The default per-nucleotide VB prior has been changed to 1e-5. While this is still an ongoing area of research, a considerable amount of testing is suggesting that variational Bayesian optimization with a sparsity inducing prior regularly leads to more accurate abundance estimates than the default EM algorithm. While we are leaving the EM algorithm as the default for the offline-phase in the current release, this may change in future versions. We encourage users who may not already be doing so to explore the variational Bayesian-based offline optimization feature of salmon (enabled with --useVBOpt).

  • The library type compatibility is now enforced strictly. Previously, mappings that disagreed with the inferred or provided library type simply had their probability decreased. Now, the default behavior is to discard such mappings. The new behavior is equivalent to running with the option --incompatPrior 0. The older behavior can be obtained by setting --incompatPrior to a small non-zero value.

  • The library format count statistics are now computed in a different (and hopefully less confusing) manner. Specifically, rather than being computed over the number of mappings of each type, the statistics are computed over the number of fragments that have at least one mapping of that type. This means that, e.g., if a fragment maps to 2 places in the forward orientation and 1 place in the reverse-complement orientation, this will now contribute only 1 count each to the forward and reverse-complement compatibilities. This should help reduce any reference bias when computing these summary statistics.

  • The default value of --gcSpeedSamp has been set to 5.

  • Inclusion of SHA512 hashes for the salmon index : When indexing, salmon now computes both SHA256 and SHA512 hashes for the reference. This is done to allow future-compatibility with GA4GH hashes (which will use a truncated variant of SHA512).

  • The default k-mer class (used for certain operations within salmon) has been migrated from the jellyfish implementation to a custom implementation. This results in a small performance increase on our testing systems under linux, and a moderate performance increase under OSX.

  • Salmon is now compiled in C++14 mode (i.e. --std=c++14) by default rather than C++11 mode. This is the last salmon release that will support C++11 (by compiling with -DCONDA_BUILD=TRUE). Moving forward, C++14 compliance will be considered the minimum requirement to compile salmon from source and C++14 features will be used in new code.
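
As an illustration of the SHA256/SHA512 item in the list above, the Python sketch below computes both digests for a sequence along with a truncated SHA512 variant. Exactly what salmon hashes (sequences, names, or both) is not specified in these notes, so hashing a single sequence string is an assumption, as is the truncation shown (first 24 bytes, base64url-encoded); the function name is hypothetical.

```python
import base64
import hashlib


def reference_digests(sequence):
    """Return SHA256, SHA512 and a truncated SHA512 digest of a sequence string."""
    data = sequence.encode("ascii")
    sha512 = hashlib.sha512(data)
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "sha512": sha512.hexdigest(),
        # Assumed truncation for illustration: first 24 bytes, base64url-encoded.
        "sha512_truncated": base64.urlsafe_b64encode(sha512.digest()[:24]).decode("ascii"),
    }


digests = reference_digests("ACGTACGT")
for name, value in digests.items():
    print(name, value)
```

Because the truncated digest is derived from the SHA512 digest, storing SHA512 in the index means such an identifier can be derived later without re-reading the reference.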