Salmon v0.11.0

@rob-p released this Jul 16, 2018 · 4 commits to master since this release

Version 0.11.0 introduces further enhancements and bug fixes, and makes some modifications to the default parameters.

Note: Though we provide a pre-compiled binary here, we strongly suggest installing the latest version of salmon through Bioconda (or building from source).

New / enhanced features

  • The sensitivity and accuracy of mapping validation (enabled via the --validateMappings flag) have been further enhanced. This flag allows salmon to score mappings and validate their quality, both reducing instances of spurious mapping and improving the assignment of reads to the correct transcript in complex mapping situations.

  • Variational Bayesian (VB) optimization is now the default algorithm for the offline phase of salmon. While salmon always uses stochastic collapsed VB inference during the online phase, the default optimization algorithm during the offline phase was previously Expectation Maximization (EM), and the use of VB optimization had to be explicitly enabled with --useVBOpt. Now, VB optimization is the default. This decision was made as the result of testing that suggested the VB algorithm is often more accurate and, specifically, is less likely to assign a non-zero (even if small) TPM to a non-expressed transcript. One can make salmon use the EM algorithm (reverting to the old behavior) by passing the --useEM flag. Further, for backwards compatibility, salmon still accepts the --useVBOpt flag, but it may be removed in the future. It is worth noting that the VB optimization algorithm employs a prior (which can be set using the --vbPrior flag). The default prior has been tuned for typical use cases and should work well. However, if you have a mechanism for validation, it may be worth exploring this option to see whether a sparser (smaller) or less sparse (larger) prior is useful in your setting.

  • The new flag --sigDigits (default value = 3) tells salmon how many significant digits to use when writing out effective lengths and estimated counts (it will still use more digits for TPM). Since more significant digits are often unnecessary, this can considerably reduce output size when processing many samples. Of course, one can always use this flag to request higher-precision output.

  • The new flag --consensusSlack allows one to modify the behavior of the mapping consensus mechanism. Passing a larger value will allow more "liberal" mapping. This is an advanced flag, and it is unlikely to need adjustment by the casual user. The consensus slack is set to 1 by default when mapping validation is enabled, and to 0 otherwise.
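To make the EM/VB distinction above concrete, here is a toy sketch (in Python, not salmon's actual C++ implementation) of the basic EM update over equivalence classes that --useEM selects for the offline phase; the function and variable names are illustrative only:

```python
# Toy EM update for transcript abundances over equivalence classes.
# An equivalence class is a set of transcripts plus a read count.

def em_abundances(eq_classes, n_transcripts, iters=100):
    """eq_classes: list of (transcript_ids, read_count) pairs."""
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(iters):
        counts = [0.0] * n_transcripts
        for tids, n_reads in eq_classes:
            denom = sum(theta[t] for t in tids)
            if denom == 0.0:
                continue
            for t in tids:
                # E-step: fractionally assign reads by current abundances.
                counts[t] += n_reads * theta[t] / denom
        total = sum(counts)
        # M-step: renormalize expected counts into abundances.
        theta = [c / total for c in counts]
    return theta

# Two transcripts: 90 reads map only to transcript 0, 10 map to both.
abund = em_abundances([((0,), 90), ((0, 1), 10)], n_transcripts=2)
```

In this toy example the ambiguous reads are progressively reassigned to the transcript with unique support, so the second transcript's estimate shrinks toward zero; a VB variant would replace the M-step renormalization with an update involving the Dirichlet prior controlled by --vbPrior.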

Bug fixes

  • This release fixes a bug in mapping validation (i.e. the --validateMappings flag) that could result in a segmentation fault in rare situations.

  • This release fixes a bug in the built-in gene-level aggregation. The gene length and effective length columns of quant.genes.sf have been corrected; they were generally wrong for multi-isoform genes since v0.6.0. Before the fix, an internally varying permutation of a gene's isoforms was weighted 1, 2, 3, ... times what it should have been in the denominator of the TPM-weighted averaging, yielding gene lengths that were too short, possibly even shorter than the shortest isoform. This miscalculation affected only the length and effective length columns; the TPM and NumReads columns were (and remain) correct. We thank Shawn Cokus (Cokus@ucla.edu), UCLA MCDB/BSCRC, for discovering this issue and bringing it to our attention. Note: while we continue to maintain the built-in gene-level aggregation code, tximport is, and has been, the recommended way to aggregate transcript-level abundances to the gene level. It provides several benefits over the built-in methodology, including the ability to derive lengths by looking across replicates (rather than processing each replicate individually).
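As a sketch of the corrected computation (an illustration of the TPM-weighted average described above, not the actual salmon source), a gene's effective length can be computed from its isoforms as:

```python
# A gene's (effective) length is the TPM-weighted average of its
# isoforms' (effective) lengths. The pre-fix bug over-weighted some
# isoforms in the denominator, shrinking the reported gene lengths.

def gene_effective_length(isoforms):
    """isoforms: list of (effective_length, tpm) pairs for one gene."""
    total_tpm = sum(tpm for _, tpm in isoforms)
    if total_tpm == 0.0:
        # Fall back to a simple mean when the gene is unexpressed.
        return sum(length for length, _ in isoforms) / len(isoforms)
    return sum(length * tpm for length, tpm in isoforms) / total_tpm

# A two-isoform gene: the highly expressed short isoform dominates.
length = gene_effective_length([(500.0, 90.0), (1500.0, 10.0)])  # -> 600.0
```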

Salmon v0.10.2

@rob-p released this Jun 7, 2018 · 50 commits to master since this release

v0.10.2 is a bug-fix release following closely behind v0.10.1. It introduces no new features, but addresses issue #232 which can pop up in rare circumstances in alignment mode. Thanks to @francicco for helping discover and provide test cases for this issue. The v0.10.1 release notes are repeated below. Please either build the latest version from source, or grab a pre-built binary for your operating system via bioconda.

Salmon 0.10.0 is a major feature release. It includes a family of algorithms for single-cell analysis, as well as a number of new features and performance enhancements. We highly recommend that all users upgrade when they have the chance.

Note: Due to the inclusion of the SHA512 hash in the salmon index (see "other changes" below), existing salmon indices should be rebuilt.

alevin

Welcome, alevin, to the salmon family!

You can find a tutorial describing how to use alevin here.

Working under the salmon engine, alevin brings new algorithms and infrastructure for single-cell quantification and analysis based on 3' tagged-end sequencing. The alevin mode is activated via the alevin command, and currently supports quantification of Drop-seq (--dropseq) and 10x v1/v2 (--chromium) single-cell protocols (v1 chemistry requires use of a special wrapper). Alevin works on raw FASTA/Q files and performs the following tasks:

  • Initial whitelisting: If not given --whitelist (an already-known set of whitelisted barcodes, e.g. as produced by Cell Ranger), alevin derives a rough estimate of the set of whitelisted CBs (cellular barcodes) based on their frequency.

  • Barcode correction: In the first pass over the CB file, alevin constructs a dictionary for correcting CBs not on the whitelist to whitelisted CBs within an edit distance of 1. When there are multiple whitelist candidates, preference is given to SNPs over indels. Optionally, a probabilistic model can be used to soft-assign barcodes, though that behavior is disabled by default (--noSoftMap is true).

  • UMI correction & deduplication: alevin introduces a novel method for deduplicating the UMIs (unique molecular identifiers) present in a sample. Alevin's algorithm uses equivalence-class-level information to infer when the same UMI must arise from different isoforms of a gene (to avoid over-collapsing UMI counts), but also accounts for the fact that collisions between UMIs within a gene are expected to be very rare (i.e. if UMIs arise within different equivalence classes of a gene, they most likely derive from different positions in the same underlying molecule). To use a baseline (i.e. simple gene-level) UMI deduplication algorithm, alevin can be run with --naive to disable its collision correction.

  • CB classification: Alevin uses various features in a machine-learning-based framework to classify the set of observed CBs that are likely to derive from valid captured cells (i.e. final whitelisting). This approach to CB classification is similar to that of Petukhov et al. Alevin uses features such as the abundance of mitochondrial genes (--mrna), ribosomal genes (--rrna), and others for classification.

  • Cell-gene count matrix: By default, alevin outputs a cell-by-gene matrix in a compressed binary format. However, --dumpCsvCounts can be used to dump a human-readable count matrix.

  • Other features: --dumpfq performs fast concatenation of corrected CBs to the read names of the sequence-containing FASTQ file; --dumpFeatures dumps the features and counts used by alevin to perform the ML-based CB classification; --dumpBfh dumps the full CB-EqClass-UMI-count data structure used internally by alevin.
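The barcode-correction step above can be sketched as follows; this is an illustration of 1-edit-distance correction with SNPs preferred over indels, not alevin's actual data structures:

```python
# Correct a cellular barcode (CB) to a whitelist entry within edit
# distance 1, trying single-nucleotide substitutions before indels.

def neighbors_subst(bc, alphabet="ACGT"):
    """All strings one substitution away from bc."""
    for i in range(len(bc)):
        for c in alphabet:
            if c != bc[i]:
                yield bc[:i] + c + bc[i + 1:]

def neighbors_indel(bc, alphabet="ACGT"):
    """All strings one deletion or one insertion away from bc."""
    for i in range(len(bc)):
        yield bc[:i] + bc[i + 1:]          # deletion
    for i in range(len(bc) + 1):
        for c in alphabet:
            yield bc[:i] + c + bc[i:]      # insertion

def correct_barcode(bc, whitelist):
    if bc in whitelist:
        return bc
    # SNP candidates are preferred over indel candidates.
    for gen in (neighbors_subst, neighbors_indel):
        hits = set(gen(bc)) & whitelist
        if hits:
            return min(hits)  # deterministic tie-break, for illustration
    return None

wl = {"ACGT", "TTTT"}
corrected = correct_barcode("ACGA", wl)  # one substitution away -> "ACGT"
```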

Note: We are actively developing and improving alevin, and are happy and excited to get feedback from the community. If you encounter an issue when using alevin, please be sure to tag your GitHub issue with the alevin tag when reporting the issue via GitHub.

mapping validation

Mapping validation is a new feature that allows salmon to validate its mappings via a traditional (affine-gap penalty) alignment procedure; it is enabled by passing the flag --validateMappings. This validation is made efficient (and fast) through a combination of:

  • using the very efficient and highly vectorized alignment implementation of @lh3's ksw2 library.

  • devising a novel caching heuristic that avoids re-aligning reads when sub-problems are redundant (such redundancy turns out to be a major computational bottleneck when aligning against the transcriptome).

Using the --validateMappings flag has two main potential benefits. First, it helps prevent salmon from considering potentially spurious mappings, i.e., mappings supported by only a few MMPs (maximal mappable prefixes) that would nonetheless not support a high-quality read alignment. Second, it helps assign more appropriate mapping scores to reads that map to similar (but not identical) reference sequences, essentially helping to appropriately down-weight sub-optimal mappings. Along with this flag, salmon introduces flags to set the match score (--ma), mismatch penalty (--mp), and gap open (--go) and extension (--ge) scores used when computing the alignment. It also allows the user to specify the minimum relative alignment score that will be considered a valid mapping (--minScoreFraction). While these can all be customized, the defaults should be reasonable for typical use cases.
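A minimal sketch of the thresholding implied by --minScoreFraction follows; the parameter defaults here are illustrative assumptions, not necessarily salmon's:

```python
# A mapping passes validation if its alignment score reaches a given
# fraction of the best possible score for the read (all bases matching).

def passes_validation(aln_score, read_len, match_score=2,
                      min_score_fraction=0.65):
    best_possible = match_score * read_len
    return aln_score >= min_score_fraction * best_possible

# A 100 bp read, with an (assumed) match score of 2 per base:
assert passes_validation(150, 100)       # 150 >= 0.65 * 200
assert not passes_validation(120, 100)   # 120 <  0.65 * 200
```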

other changes

  • Salmon now enables the alignment error model by default in alignment-based mode. This means that the --useErrorModel flag is no longer valid, since its behavior is now the default. That flag has been removed, and a new flag added in its place: passing the --noErrorModel flag to salmon in alignment-based mode will turn off the alignment error model.

  • Related to the above: the alignment error model works best in conjunction with range factorization. Thus, the default behavior is now to turn on range-based factorization in alignment mode (in conjunction with the error model).

  • New default VB prior: The default per-nucleotide VB prior has been changed to 1e-5. While this is still an active area of research, a considerable amount of testing suggests that variational Bayesian optimization with a sparsity-inducing prior regularly leads to more accurate abundance estimates than the default EM algorithm. While we are leaving the EM algorithm as the default for the offline phase in the current release, this may change in future versions. We encourage users who may not already be doing so to explore the variational Bayesian offline optimization feature of salmon (enabled with --useVBOpt).

  • Library-type compatibility is now enforced strictly. Previously, mappings that disagreed with the inferred or provided library type simply had their probability decreased. Now, the default behavior is to discard such mappings. The new behavior is equivalent to running with the option --incompatPrior 0. The older behavior can be obtained by setting --incompatPrior to a small non-zero value.

  • The library format count statistics are now computed in a different (and hopefully less confusing) manner. Specifically, rather than being computed over the number of mappings of each type, the statistics are computed over the number of fragments that have at least one mapping of that type. This means that, e.g., if a fragment maps to 2 places in the forward orientation and 1 place in the reverse-complement orientation, it will now contribute only 1 count each to the forward and reverse-complement compatibilities. This should help reduce any reference bias when computing these summary statistics.

  • The default value of --gcSpeedSamp has been set to 5.

  • Inclusion of SHA512 hashes for the salmon index: When indexing, salmon now computes both SHA256 and SHA512 hashes of the reference. This is done to allow future compatibility with GA4GH hashes (which will use a truncated variant of SHA512).

  • The default k-mer class (used for certain operations within salmon) has been migrated from the jellyfish implementation to a custom implementation. This results in a small performance increase on our testing systems under Linux, and a moderate performance increase under OS X.

  • Salmon is now compiled in C++14 mode (i.e. --std=c++14) by default rather than C++11 mode. This is the last salmon release that will support C++11 (by compiling with -DCONDA_BUILD=TRUE). Moving forward, C++14 compliance will be considered the minimum requirement to compile salmon from source and C++14 features will be used in new code.
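The fragment-level library-format counting described above can be sketched as follows (illustrative code, not salmon's internals):

```python
# Each fragment contributes at most one count per library-format type,
# regardless of how many mappings of that type it has.

from collections import Counter

def format_counts(fragments):
    """fragments: list of per-fragment lists of mapping orientations."""
    counts = Counter()
    for mappings in fragments:
        for fmt in set(mappings):  # deduplicate types within a fragment
            counts[fmt] += 1
    return counts

# One fragment with two forward ('SF') mappings and one reverse-complement
# ('SR') mapping contributes a single count to each type.
c = format_counts([["SF", "SF", "SR"]])
```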

version 0.10.1 fixes

This version addresses issues #228 and #229. The first issue could result in a segfault under OSX when running salmon with the new --validateMappings flag, and was the result of an errant <= in place of a < in a sorting comparator. Issue #229 likely predates v0.10.0 considerably, and could occur in VBOpt mode when the normalizer of a rich equivalence class had too small a numeric value. To address this, the numeric cutoffs have been adjusted so that both the normalizer and its inverse can be properly represented.

Salmon v0.10.1

@rob-p released this Jun 1, 2018 · 65 commits to master since this release

v0.10.1 is a bug-fix release following closely behind v0.10.0. It introduces no new features, but addresses the issues mentioned below. Thanks to @knokknok for discovering and helping to resolve these issues. For this purpose, the v0.10.0 release notes are repeated below. Please either build the latest version from source, or grab a pre-built binary for your operating system via bioconda.

Salmon 0.10.0 is a major feature release. It includes a family of algorithms for single-cell analysis, as well as a number of new features and performance enhancements. We highly recommend that all users upgrade when they have the chance.

Note: Due to the inclusion of the SHA512 hash in the salmon index (see "other changes" below), existing salmon indices should be rebuilt.

alevin

Welcome, alevin, to the salmon family!

You can find a tutorial describing how to use alevin here.

Working under the salmon engine, alevin brings new algorithms and infrastructure for single-cell quantification and analysis based on 3' tagged-end sequencing. The alevin mode is activated via the alevin command, and currently supports quantification of Drop-seq (--dropseq) and 10x v1/v2 (--chromium) single-cell protocols (v1 chemistry requires use of a special wrapper). Alevin works on raw FASTA/Q files and performs the following tasks:

  • Initial whitelisting: If not given --whitelist (an already-known set of whitelisted barcodes, e.g. as produced by Cell Ranger), alevin derives a rough estimate of the set of whitelisted CBs (cellular barcodes) based on their frequency.

  • Barcode correction: In the first pass over the CB file, alevin constructs a dictionary for correcting CBs not on the whitelist to whitelisted CBs within an edit distance of 1. When there are multiple whitelist candidates, preference is given to SNPs over indels. Optionally, a probabilistic model can be used to soft-assign barcodes, though that behavior is disabled by default (--noSoftMap is true).

  • UMI correction & deduplication: alevin introduces a novel method for deduplicating the UMIs (unique molecular identifiers) present in a sample. Alevin's algorithm uses equivalence-class-level information to infer when the same UMI must arise from different isoforms of a gene (to avoid over-collapsing UMI counts), but also accounts for the fact that collisions between UMIs within a gene are expected to be very rare (i.e. if UMIs arise within different equivalence classes of a gene, they most likely derive from different positions in the same underlying molecule). To use a baseline (i.e. simple gene-level) UMI deduplication algorithm, alevin can be run with --naive to disable its collision correction.

  • CB classification: Alevin uses various features in a machine-learning-based framework to classify the set of observed CBs that are likely to derive from valid captured cells (i.e. final whitelisting). This approach to CB classification is similar to that of Petukhov et al. Alevin uses features such as the abundance of mitochondrial genes (--mrna), ribosomal genes (--rrna), and others for classification.

  • Cell-gene count matrix: By default, alevin outputs a cell-by-gene matrix in a compressed binary format. However, --dumpCsvCounts can be used to dump a human-readable count matrix.

  • Other features: --dumpfq performs fast concatenation of corrected CBs to the read names of the sequence-containing FASTQ file; --dumpFeatures dumps the features and counts used by alevin to perform the ML-based CB classification; --dumpBfh dumps the full CB-EqClass-UMI-count data structure used internally by alevin.

Note: We are actively developing and improving alevin, and are happy and excited to get feedback from the community. If you encounter an issue when using alevin, please be sure to tag your GitHub issue with the alevin tag when reporting the issue via GitHub.

mapping validation

Mapping validation is a new feature that allows salmon to validate its mappings via a traditional (affine-gap penalty) alignment procedure; it is enabled by passing the flag --validateMappings. This validation is made efficient (and fast) through a combination of:

  • using the very efficient and highly vectorized alignment implementation of @lh3's ksw2 library.

  • devising a novel caching heuristic that avoids re-aligning reads when sub-problems are redundant (such redundancy turns out to be a major computational bottleneck when aligning against the transcriptome).

Using the --validateMappings flag has two main potential benefits. First, it helps prevent salmon from considering potentially spurious mappings, i.e., mappings supported by only a few MMPs (maximal mappable prefixes) that would nonetheless not support a high-quality read alignment. Second, it helps assign more appropriate mapping scores to reads that map to similar (but not identical) reference sequences, essentially helping to appropriately down-weight sub-optimal mappings. Along with this flag, salmon introduces flags to set the match score (--ma), mismatch penalty (--mp), and gap open (--go) and extension (--ge) scores used when computing the alignment. It also allows the user to specify the minimum relative alignment score that will be considered a valid mapping (--minScoreFraction). While these can all be customized, the defaults should be reasonable for typical use cases.

other changes

  • Salmon now enables the alignment error model by default in alignment-based mode. This means that the --useErrorModel flag is no longer valid, since its behavior is now the default. That flag has been removed, and a new flag added in its place: passing the --noErrorModel flag to salmon in alignment-based mode will turn off the alignment error model.

  • Related to the above: the alignment error model works best in conjunction with range factorization. Thus, the default behavior is now to turn on range-based factorization in alignment mode (in conjunction with the error model).

  • New default VB prior: The default per-nucleotide VB prior has been changed to 1e-5. While this is still an active area of research, a considerable amount of testing suggests that variational Bayesian optimization with a sparsity-inducing prior regularly leads to more accurate abundance estimates than the default EM algorithm. While we are leaving the EM algorithm as the default for the offline phase in the current release, this may change in future versions. We encourage users who may not already be doing so to explore the variational Bayesian offline optimization feature of salmon (enabled with --useVBOpt).

  • Library-type compatibility is now enforced strictly. Previously, mappings that disagreed with the inferred or provided library type simply had their probability decreased. Now, the default behavior is to discard such mappings. The new behavior is equivalent to running with the option --incompatPrior 0. The older behavior can be obtained by setting --incompatPrior to a small non-zero value.

  • The library format count statistics are now computed in a different (and hopefully less confusing) manner. Specifically, rather than being computed over the number of mappings of each type, the statistics are computed over the number of fragments that have at least one mapping of that type. This means that, e.g., if a fragment maps to 2 places in the forward orientation and 1 place in the reverse-complement orientation, it will now contribute only 1 count each to the forward and reverse-complement compatibilities. This should help reduce any reference bias when computing these summary statistics.

  • The default value of --gcSpeedSamp has been set to 5.

  • Inclusion of SHA512 hashes for the salmon index: When indexing, salmon now computes both SHA256 and SHA512 hashes of the reference. This is done to allow future compatibility with GA4GH hashes (which will use a truncated variant of SHA512).

  • The default k-mer class (used for certain operations within salmon) has been migrated from the jellyfish implementation to a custom implementation. This results in a small performance increase on our testing systems under Linux, and a moderate performance increase under OS X.

  • Salmon is now compiled in C++14 mode (i.e. --std=c++14) by default rather than C++11 mode. This is the last salmon release that will support C++11 (by compiling with -DCONDA_BUILD=TRUE). Moving forward, C++14 compliance will be considered the minimum requirement to compile salmon from source and C++14 features will be used in new code.

version 0.10.1 fixes

This version addresses issues #228 and #229. The first issue could result in a segfault under OSX when running salmon with the new --validateMappings flag, and was the result of an errant <= in place of a < in a sorting comparator. Issue #229 likely predates v0.10.0 considerably, and could occur in VBOpt mode when the normalizer of a rich equivalence class had too small a numeric value. To address this, the numeric cutoffs have been adjusted so that both the normalizer and its inverse can be properly represented.

Salmon v0.10.0

@rob-p released this May 29, 2018 · 87 commits to master since this release

Salmon 0.10.0 is a major feature release. It includes a family of algorithms for single-cell analysis, as well as a number of new features and performance enhancements. We highly recommend that all users upgrade when they have the chance.

Note: Due to the inclusion of the SHA512 hash in the salmon index (see "other changes" below), existing salmon indices should be rebuilt.

alevin

Welcome, alevin, to the salmon family!

Working under the salmon engine, alevin brings new algorithms and infrastructure for single-cell quantification and analysis based on 3' tagged-end sequencing. The alevin mode is activated via the alevin command, and currently supports quantification of Drop-seq (--dropseq) and 10x v1/v2 (--chromium) single-cell protocols (v1 chemistry requires use of a special wrapper). Alevin works on raw FASTA/Q files and performs the following tasks:

  • Initial whitelisting: If not given --whitelist (an already-known set of whitelisted barcodes, e.g. as produced by Cell Ranger), alevin derives a rough estimate of the set of whitelisted CBs (cellular barcodes) based on their frequency.

  • Barcode correction: In the first pass over the CB file, alevin constructs a dictionary for correcting CBs not on the whitelist to whitelisted CBs within an edit distance of 1. When there are multiple whitelist candidates, preference is given to SNPs over indels. Optionally, a probabilistic model can be used to soft-assign barcodes, though that behavior is disabled by default (--noSoftMap is true).

  • UMI correction & deduplication: alevin introduces a novel method for deduplicating the UMIs (unique molecular identifiers) present in a sample. Alevin's algorithm uses equivalence-class-level information to infer when the same UMI must arise from different isoforms of a gene (to avoid over-collapsing UMI counts), but also accounts for the fact that collisions between UMIs within a gene are expected to be very rare (i.e. if UMIs arise within different equivalence classes of a gene, they most likely derive from different positions in the same underlying molecule). To use a baseline (i.e. simple gene-level) UMI deduplication algorithm, alevin can be run with --naive to disable its collision correction.

  • CB classification: Alevin uses various features in a machine-learning-based framework to classify the set of observed CBs that are likely to derive from valid captured cells (i.e. final whitelisting). This approach to CB classification is similar to that of Petukhov et al. Alevin uses features such as the abundance of mitochondrial genes (--mrna), ribosomal genes (--rrna), and others for classification.

  • Cell-gene count matrix: By default, alevin outputs a cell-by-gene matrix in a compressed binary format. However, --dumpCsvCounts can be used to dump a human-readable count matrix.

  • Other features: --dumpfq performs fast concatenation of corrected CBs to the read names of the sequence-containing FASTQ file; --dumpFeatures dumps the features and counts used by alevin to perform the ML-based CB classification; --dumpBfh dumps the full CB-EqClass-UMI-count data structure used internally by alevin.

Note: We are actively developing and improving alevin, and are happy and excited to get feedback from the community. If you encounter an issue when using alevin, please be sure to tag your GitHub issue with the alevin tag when reporting the issue via GitHub.

mapping validation

Mapping validation is a new feature that allows salmon to validate its mappings via a traditional (affine-gap penalty) alignment procedure; it is enabled by passing the flag --validateMappings. This validation is made efficient (and fast) through a combination of:

  • using the very efficient and highly vectorized alignment implementation of @lh3's ksw2 library.

  • devising a novel caching heuristic that avoids re-aligning reads when sub-problems are redundant (such redundancy turns out to be a major computational bottleneck when aligning against the transcriptome).

Using the --validateMappings flag has two main potential benefits. First, it helps prevent salmon from considering potentially spurious mappings, i.e., mappings supported by only a few MMPs (maximal mappable prefixes) that would nonetheless not support a high-quality read alignment. Second, it helps assign more appropriate mapping scores to reads that map to similar (but not identical) reference sequences, essentially helping to appropriately down-weight sub-optimal mappings. Along with this flag, salmon introduces flags to set the match score (--ma), mismatch penalty (--mp), and gap open (--go) and extension (--ge) scores used when computing the alignment. It also allows the user to specify the minimum relative alignment score that will be considered a valid mapping (--minScoreFraction). While these can all be customized, the defaults should be reasonable for typical use cases.

other changes

  • Salmon now enables the alignment error model by default in alignment-based mode. This means that the --useErrorModel flag is no longer valid, since its behavior is now the default. That flag has been removed, and a new flag added in its place: passing the --noErrorModel flag to salmon in alignment-based mode will turn off the alignment error model.

  • Related to the above: the alignment error model works best in conjunction with range factorization. Thus, the default behavior is now to turn on range-based factorization in alignment mode (in conjunction with the error model).

  • New default VB prior: The default per-nucleotide VB prior has been changed to 1e-5. While this is still an active area of research, a considerable amount of testing suggests that variational Bayesian optimization with a sparsity-inducing prior regularly leads to more accurate abundance estimates than the default EM algorithm. While we are leaving the EM algorithm as the default for the offline phase in the current release, this may change in future versions. We encourage users who may not already be doing so to explore the variational Bayesian offline optimization feature of salmon (enabled with --useVBOpt).

  • Library-type compatibility is now enforced strictly. Previously, mappings that disagreed with the inferred or provided library type simply had their probability decreased. Now, the default behavior is to discard such mappings. The new behavior is equivalent to running with the option --incompatPrior 0. The older behavior can be obtained by setting --incompatPrior to a small non-zero value.

  • The library format count statistics are now computed in a different (and hopefully less confusing) manner. Specifically, rather than being computed over the number of mappings of each type, the statistics are computed over the number of fragments that have at least one mapping of that type. This means that, e.g., if a fragment maps to 2 places in the forward orientation and 1 place in the reverse-complement orientation, it will now contribute only 1 count each to the forward and reverse-complement compatibilities. This should help reduce any reference bias when computing these summary statistics.

  • The default value of --gcSpeedSamp has been set to 5.

  • Inclusion of SHA512 hashes for the salmon index: When indexing, salmon now computes both SHA256 and SHA512 hashes of the reference. This is done to allow future compatibility with GA4GH hashes (which will use a truncated variant of SHA512).

  • The default k-mer class (used for certain operations within salmon) has been migrated from the jellyfish implementation to a custom implementation. This results in a small performance increase on our testing systems under Linux, and a moderate performance increase under OS X.

  • Salmon is now compiled in C++14 mode (i.e. --std=c++14) by default rather than C++11 mode. This is the last salmon release that will support C++11 (by compiling with -DCONDA_BUILD=TRUE). Moving forward, C++14 compliance will be considered the minimum requirement to compile salmon from source and C++14 features will be used in new code.

Salmon v0.9.1

@rob-p released this Nov 29, 2017 · 331 commits to master since this release

Salmon 0.9.1 Release Notes

Note: Version 0.9.1 fixes a warning with the indexer that was introduced by an API change due to an updated FASTA/Q parser. The warning does not affect the indexing process but, nonetheless, the proper API should be obeyed. Also, v0.9.1 fixes a very small but long-standing indexing bug that would cause a single k-mer (the lexicographically largest) not to be indexed properly. The Salmon v0.9.0 release notes are recapitulated below for the convenience of those upgrading directly from v0.8.2.

As always, the newest release is easily installable via bioconda and Docker.

New features

  • During indexing, Salmon will now discard duplicate transcripts (i.e., transcripts with exactly the same sequence) by default. The information about the duplicate transcripts is written to a file in the index directory called duplicate_clusters.tsv. This is a two-column TSV file where the first column lists the name of a retained transcript and the second column lists the name of a discarded duplicate transcript (i.e., a transcript with identical sequence to the retained transcript, but which was discarded). Note: If you wish to retain multiple identical transcripts in the input (the prior behavior), this can be achieved by passing the Salmon indexing command the --keepDuplicates flag.

  • This is not a new feature, per se, but it brings further parity between the alignment-based and mapping-based modes. It is now possible to dump the equivalence class files (via --dumpEq) when using Salmon in alignment-based mode.

  • The range-factorization has been merged into the master branch. This allows using the data-driven likelihood factorization, which can improve quantification accuracy on certain classes of "difficult" transcripts. Currently, this feature interacts best (i.e., yields the largest improvements) when using alignment-based mode with error modeling enabled (--useErrorModel), though it can yield improvements in the mapping-based mode as well. This feature will also interact constructively with selective-alignment, which should land in the next (non-bug-fix) release.

  • Added the quantmerge command. This allows producing a multi-sample TSV file with aggregated abundance metrics over samples from many different quantification runs. This can be useful to ease e.g. uploading of quantified data to certain online analysis tools like Degust.
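Conceptually, quantmerge performs a keyed join of one abundance column across per-sample quantification files. The sketch below illustrates that join; the sample names and values are invented, while the column layout shown matches salmon's quant.sf output:

```python
import csv
import io

# Join one abundance column (TPM) across per-sample quant.sf files,
# keyed by transcript name. The data below is invented for illustration.
header = "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
samples = {
    "sampleA": header + "tx1\t500\t350.0\t10.0\t100\ntx2\t800\t650.0\t5.0\t80\n",
    "sampleB": header + "tx1\t500\t350.0\t12.0\t120\ntx2\t800\t650.0\t4.0\t60\n",
}

merged = {}
for name, text in samples.items():
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        merged.setdefault(row["Name"], {})[name] = float(row["TPM"])

print(merged["tx1"])  # {'sampleA': 10.0, 'sampleB': 12.0}
```

The resulting table (one row per transcript, one column per sample) is the shape expected by tools like Degust.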

Other improvements, features and changes

  • The multi-threaded read parser used by Salmon has been updated to considerably improve CPU utilization. Specifically, the previous queue management strategy (busy waiting) has been replaced by an intelligent, bounded, exponential-backoff strategy. Many improvements (and much of the code) come from a series of blog posts by David Geier. In practice, this means that performance will be the same as the prior implementation if your disks can feed reads to Salmon quickly enough; if they can't, considerably less CPU time will be wasted waiting on input (i.e., processing speed will be better matched to I/O throughput).
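The bounded exponential-backoff idea can be sketched in a few lines. This is a conceptual illustration only, not the actual parser code; all names are invented:

```python
import time

def poll_with_backoff(try_pop, max_wait=0.01, initial_wait=1e-5):
    """Poll try_pop until it yields an item, sleeping with bounded
    exponential backoff between failed attempts instead of busy-waiting."""
    wait = initial_wait
    while True:
        item = try_pop()
        if item is not None:
            return item
        time.sleep(wait)
        wait = min(wait * 2, max_wait)  # back off, but never beyond the bound

# Toy producer that succeeds on the third poll.
attempts = {"n": 0}
def try_pop():
    attempts["n"] += 1
    return "read_chunk" if attempts["n"] >= 3 else None

print(poll_with_backoff(try_pop))  # read_chunk
```

When the producer keeps up, the first poll succeeds and no sleep occurs; when it lags, the consumer sleeps progressively longer (up to the bound) rather than spinning on the CPU.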

  • In addition to the improved parser behavior, some of the noisy logger messages in the parser have been eliminated. In "pathological" situations with very fast disks and slow CPUs (or vice-versa), the previous parser may have generated an inordinate amount of output, creating large log files and otherwise slowing down processing. This should no longer happen.

  • Salmon will now terminate early (with a non-zero exit code) and report a meaningful error message if a corrupt input file is detected. Previously, corrupted compressed input files could have caused the parser to hang indefinitely. This behavior was fixed upstream in kseq, and the current parser wraps this detection with a descriptive exception message.

  • Renamed the --allowOrphans flag to --allowOrphansFMD, and added a --discardOrphansQuasi flag. This is a bit messy currently (the default in FMD mapping is to discard orphans, while in quasi-mapping it is to keep them). These flags do the obvious things and are documented further in the command-line help. We are considering how best to clean up and simplify these flags in future releases.

  • Many other small improvements and bug fixes.

Salmon v0.9.0

@rob-p rob-p released this Nov 26, 2017 · 337 commits to master since this release

Salmon 0.9.0 Release Notes

As always, the newest release is easily installable via bioconda and Docker.

New features

  • During indexing, Salmon will now discard duplicate transcripts (i.e., transcripts with exactly the same sequence) by default. The information about the duplicate transcripts is written to a file in the index directory called duplicate_clusters.tsv. This is a two-column TSV file where the first column lists the name of a retained transcript and the second column lists the name of a discarded duplicate transcript (i.e., a transcript with identical sequence to the retained transcript, but which was discarded). Note: If you wish to retain multiple identical transcripts in the input (the prior behavior), this can be achieved by passing the Salmon indexing command the --keepDuplicates flag.

  • This is not a new feature, per se, but it brings further parity between the alignment-based and mapping-based modes. It is now possible to dump the equivalence class files (via --dumpEq) when using Salmon in alignment-based mode.

  • The range-factorization has been merged into the master branch. This allows using the data-driven likelihood factorization, which can improve quantification accuracy on certain classes of "difficult" transcripts. Currently, this feature interacts best (i.e., yields the largest improvements) when using alignment-based mode with error modeling enabled (--useErrorModel), though it can yield improvements in the mapping-based mode as well. This feature will also interact constructively with selective-alignment, which should land in the next (non-bug-fix) release.

  • Added the quantmerge command. This allows producing a multi-sample TSV file with aggregated abundance metrics over samples from many different quantification runs. This can be useful to ease e.g. uploading of quantified data to certain online analysis tools like Degust.

Other improvements, features and changes

  • The multi-threaded read parser used by Salmon has been updated to considerably improve CPU utilization. Specifically, the previous queue management strategy (busy waiting) has been replaced by an intelligent, bounded, exponential-backoff strategy. Many improvements (and much of the code) come from a series of blog posts by David Geier. In practice, this means that performance will be the same as the prior implementation if your disks can feed reads to Salmon quickly enough; if they can't, considerably less CPU time will be wasted waiting on input (i.e., processing speed will be better matched to I/O throughput).

  • In addition to the improved parser behavior, some of the noisy logger messages in the parser have been eliminated. In "pathological" situations with very fast disks and slow CPUs (or vice-versa), the previous parser may have generated an inordinate amount of output, creating large log files and otherwise slowing down processing. This should no longer happen.

  • Salmon will now terminate early (with a non-zero exit code) and report a meaningful error message if a corrupt input file is detected. Previously, corrupted compressed input files could have caused the parser to hang indefinitely. This behavior was fixed upstream in kseq, and the current parser wraps this detection with a descriptive exception message.

  • Renamed the --allowOrphans flag to --allowOrphansFMD, and added a --discardOrphansQuasi flag. This is a bit messy currently (the default in FMD mapping is to discard orphans, while in quasi-mapping it is to keep them). These flags do the obvious things and are documented further in the command-line help. We are considering how best to clean up and simplify these flags in future releases.

  • Many other small improvements and bug fixes.

Salmon v0.8.2

@rob-p rob-p released this Mar 17, 2017 · 397 commits to master since this release

The main purpose of this release is to fix a bug (introduced in v0.8.0) that would prevent Salmon from being able to properly load 64-bit indexes (i.e. when the size of the transcriptome is > int32_t). If you are affected by this bug, it has been fixed in this release. Further, the bug existed only in the loading code. Hence, 64-bit indices made with Salmon v0.8.1 can now be properly loaded in Salmon v0.8.2.

Bug Fixes

  • This release fixes a bug that would prevent Salmon from being able to properly load 64-bit indices.

Minor enhancements and improvements

  • Changed the default size of the asynchronous queues used for logging. Previously, the queues were made unnecessarily large, which inflated memory usage and slowed down startup a bit (only really noticeable with small indices). Salmon should now use a bit less memory and start up a bit faster.

  • Removed unused extraneous posWeight in equivalence classes, leading to a small bump in speed and a small decrease in memory usage.

  • The build system has been enhanced for this release to allow SHA256 verification of all packages downloaded during build. This will allow the new version to be made available via Homebrew (which has not accepted PRs on Salmon since v0.7.2, because it did not validate the non-vendored, downloaded packages).

Salmon v0.8.1

@rob-p rob-p released this Mar 6, 2017 · 421 commits to master since this release

Even though the changes and fixes are minor, it is recommended you update to Salmon 0.8.1 if you are using a previous version.

Bug Fixes

  • This release fixes a bug that would cause Salmon (in alignment-based mode only) to write the inferential sampling (i.e., Gibbs sampling or bootstrap) files to the current working directory, rather than the quantification directory. The inferential samples are now properly written to the target quantification directory in both modes.

Changes & improvements

  • The indexer now computes a signature (SHA256 sum) of both the sequence and headers in the fasta file used to build the index. These signatures are propagated into the quantification estimates in the meta_info.json file. This will allow one to verify the exact index used for quantification, and will be utilized in software we are working on for metadata propagation across RNA-seq pipelines. However, this means that indices built with previous versions of Salmon will need to be re-built for v0.8.1.
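The idea of hashing the headers and sequences separately can be sketched as below. This is a conceptual illustration only; Salmon's exact canonicalization of the fasta content may differ, and the records are invented:

```python
import hashlib

# Accumulate two separate SHA256 digests over a fasta file's headers and
# sequences. The records here are invented for illustration.
records = [(">tx1 some description", "ACGTACGT"), (">tx2", "GGGTTTAA")]

seq_hash, name_hash = hashlib.sha256(), hashlib.sha256()
for header, seq in records:
    name_hash.update(header.encode())  # signature of the names/headers
    seq_hash.update(seq.encode())      # signature of the sequence content

print(name_hash.hexdigest()[:12], seq_hash.hexdigest()[:12])
```

Keeping two digests means one can distinguish "same sequences, renamed transcripts" from a genuinely different reference when comparing the signatures recorded in meta_info.json.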

  • JEMalloc is now built with the --disable-debug flag. This avoids spurious zone allocator warnings on macOS Sierra.

  • Bumped version of JEMalloc to 4.5.0.

  • Bumped to the latest version of spdlog.

  • Bumped included version of libcuckoo.

  • Bumped included version of sparsepp (via RapMap).

  • Bumped included version of RapMap.

  • Internal refactoring and improvement of option parsing and argument handling code. This is the first phase of a larger-scale unification of the quasi-mapping-based and alignment-based modes that will be made over the next few releases.

Salmon v0.8.0

@rob-p rob-p released this Jan 23, 2017 · 449 commits to master since this release

Bug Fixes

  • Fixed a bug in .gtf-based gene aggregation output that could cause a transcript to be attributed to the wrong gene if the transcript was not present in the gtf file.
  • Fixed a bug that required a qualified path be provided when writing the quasi-mapping file (i.e., .sam).
  • Fixed a bug that could cause the SAM header to fail to be written when writing quasi-mappings to stdout.
  • Fixed behavior of --numPreAuxModelSamples so that it is consistent between quasi-mapping and alignment-based mode (and has an effect in both).
  • Fixed a "short style" option collision.
  • Fixed a bug that would cause bias correction not to be run if only the --posBias flag was passed.

Minor changes & improvements

  • Bumped to the latest version of spdlog.
  • Bumped included version of libcuckoo.
  • Bumped included version of sparsepp (via RapMap).
  • Bumped included version of RapMap.
  • meta_info.json now contains more information about the length classes used for positional bias correction when enabled (these length classes are now data-driven).
  • meta_info.json now records if equivalence classes were dumped, and if so, what properties were dumped as well (e.g. rich weights).
  • meta_info.json now includes the end as well as beginning time of each run.
  • Improvements to fragment-GC bias modeling for fragments that fall very close to the beginning or end of transcripts.
  • Added .gff and .gff3 (and capitalized variants of all) as recognized file formats for gene aggregation mode.
  • Changed the default prior mean and standard deviation of the fragment length distribution to better match more recent protocols and libraries.
  • Made slight improvements to the computation of the conditional fragment probabilities (i.e., P(f | t) in the model). Now the probability of a fragment length is conditioned on the transcript length, and the probability of a start position takes that length into account.
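The conditioning described in the last bullet can be sketched numerically. The raw fragment-length distribution below is invented, and this illustrates the idea rather than Salmon's actual model code:

```python
# P(f | t) = P(len(f) | len(t)) * P(start | len(f), len(t)): the length
# distribution is renormalized over lengths that fit in the transcript, and
# the start position is uniform over positions that leave room for the
# fragment. The raw length pmf is invented for illustration.
raw_len_pmf = {100: 0.2, 200: 0.5, 300: 0.3}

def frag_prob(frag_len, tlen):
    feasible = {l: p for l, p in raw_len_pmf.items() if l <= tlen}
    z = sum(feasible.values())
    if frag_len not in feasible or z == 0:
        return 0.0
    p_len = feasible[frag_len] / z         # length given transcript length
    p_start = 1.0 / (tlen - frag_len + 1)  # uniform over valid start positions
    return p_len * p_start

# On a 250 bp transcript, 300 bp fragments are impossible, and the remaining
# probability mass is renormalized over {100, 200}.
print(frag_prob(300, 250))  # 0.0
print(frag_prob(200, 250))
```

The key point is the renormalization: on a short transcript, long fragment lengths receive zero probability and the remaining lengths absorb their mass, rather than using a single global length distribution for all transcripts.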

New features

  • Some important new indexing improvements due to improvements in RapMap; read more below.

  • Substantial overhaul and improvements to the posterior Gibbs sampler. The methodology now generally follows that of mmseq [1]. Specifically, the new (uncollapsed) sampler improves estimates of sampling variance (and uses the same methodology as before to account for inferential uncertainty).

  • Added --thinningFactor flag that lets the user specify how many Gibbs samples should be skipped between saved samples. Increasing this causes the Gibbs chain to run longer to generate a given number of target samples (but potentially reduces the autocorrelation between samples). The default is 16.
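Thinning itself is simple to illustrate; the chain values below are stand-ins for real Gibbs samples:

```python
# With thinning factor k, keep every k-th sample of the chain. Delivering n
# saved samples therefore requires ~n*k iterations, which is why a larger
# --thinningFactor lengthens the run but lowers the autocorrelation between
# the samples that are kept.
def thin(chain, k):
    return chain[k - 1 :: k]

chain = list(range(1, 65))  # 64 raw iterations, stand-ins for real samples
saved = thin(chain, 16)     # the default thinning factor of 16
print(saved)  # [16, 32, 48, 64]
```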

  • Added the --meta flag, which automatically selects internal options optimized for metagenomic & microbiomic quantification.

  • Added --dumpEqWeights option that includes the rich equivalence class weights in the output file when equivalence classes are written to file.

  • Added experimental --noLengthCorrection option. This is intended to be used when quantifying based on protocols (e.g., Lexogen Quantseq) where the number of sequenced fragments / tags deriving from a target are assumed to be independent of that target's length. (This feature is still experimental, and requires more testing, so please provide feedback if you use it).

  • Added new --quasiCoverage option. This is analogous to the --coverage option, but the latter applies only to mapping under the FMD-based index (which is no longer recommended). This option enforces that a certain fraction of the read is covered by exact matches (specifically, maximum mappable prefixes) in order to consider a mapping as valid. The value is expressed as a number between 0 and 1; a larger value is more stringent, and less likely to allow spurious mappings, but can reduce sensitivity.
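The coverage criterion can be sketched as follows. This is a conceptual illustration with invented match intervals, and it assumes the intervals are non-overlapping:

```python
# Sum the bases of a read covered by exact matches (maximum mappable
# prefixes) and require the covered fraction to meet the threshold.
# Intervals are half-open [start, end) and assumed non-overlapping.
def passes_coverage(read_len, match_intervals, threshold):
    covered = sum(end - start for start, end in match_intervals)
    return covered / read_len >= threshold

# A 100 bp read with exact matches over [0, 40) and [55, 90): 75% covered.
matches = [(0, 40), (55, 90)]
print(passes_coverage(100, matches, 0.7))  # True
print(passes_coverage(100, matches, 0.8))  # False
```

As the bullet notes, raising the threshold trades sensitivity for stringency: the 0.8 check above rejects a mapping that the 0.7 check accepts.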

    New features due to changes and improvements in RapMap

    • New hash map for default index - The default quasiindex command now uses the sparsepp sparse hash map. While providing very similar lookup performance to the prior hash map implementation, sparsepp provides a number of benefits. Specifically, it uses substantially less memory (typically ~50% less) and, crucially, the memory usage grows gradually with the number of keys. A big problem with the previous implementation (Google's dense hash map) is that, on resize, the map would double and memory usage would jump by a factor of 3 (a new map of twice the size of the old, plus the original map from which to copy the keys). This means that even if you had enough memory to hold the final map, you might not be able to build it. Sparsepp, on the other hand, exhibits memory usage that scales almost linearly with the number of items in the map. For more details on the performance characteristics of the new default hash used in the index, please see the sparsepp benchmarks here.
    • New frugal perfect hash index - The vastly improved memory usage of the new default quasi index essentially obviates the previous perfect-hash-based index. Specifically, since that perfect hash also stored the keys (to validate queries from outside the universe on which the hash was built), the size of the resulting index was similar; it simply required less memory to build. However, sparsepp achieves very similar memory usage to the previous perfect-hash-based index. Instead of removing the perfect-hash-based index entirely, the -p/--perfectHash flag now tells the quasiindex command to build a frugal perfect-hash-based index. This index uses a number of aggressive space-saving techniques, which result in a much smaller memory footprint (but it is also slower to construct and has slower lookups than the default index). For large references, the new frugal perfect-hash-based index exhibits a memory reduction (over the new, reduced-memory, default index) of 40-50% (hence, it shows close to this savings over the old perfect-hash-based index as well). Also, for large references, the size of the index on disk is ~40% smaller. The cost of this substantial size reduction is that the frugal perfect-hash-based index takes 2-2.5 times longer to build, and lookups are slower. This slower lookup speed can, conceivably, reduce quasi-mapping speed a bit, but the speed hit (if there is one) is dataset dependent. This new indexing scheme should allow the construction of quasi indices on substantially larger references for a fixed RAM budget, and also reduces the memory required to hold the index during mapping. Note: This type of index is specifically recommended if you need to build an index on a large set of targets (e.g., for metagenomic or microbiomic use).
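The resize-cost arithmetic described above for the dense hash map can be made concrete with a toy calculation; the load factor and per-item overhead values are illustrative assumptions:

```python
# Peak memory (in arbitrary "slots") during a doubling resize of a dense,
# copy-based hash map, vs. a map whose memory tracks occupancy gradually.
def dense_peak(n_items, load_factor=0.5):
    # At resize time the old table (n slots) and the new table (2n slots)
    # coexist while keys are copied: peak = 3x the pre-resize table size.
    table = n_items / load_factor
    return table + 2 * table

def gradual_peak(n_items, overhead=1.1):
    # Sparsepp-style growth: memory stays close to the item count.
    return n_items * overhead

print(dense_peak(1_000_000))    # 6000000.0 slots at the resize spike
print(gradual_peak(1_000_000))  # roughly 1.1M slots
```

This is why the dense map could fail to build an index that would comfortably fit in memory once built: the transient 3x spike, not the final size, is what has to fit in RAM.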

References

[1] Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads. Turro E, Su S-Y, Goncalves A, Coin L, Richardson S and Lewin A. Genome Biology, 2011 Feb; 12:R13. doi: 10.1186/gb-2011-12-2-r13.

Salmon 0.7.2

@rob-p rob-p released this Aug 31, 2016 · 590 commits to master since this release

Bug Fixes

  • Removed an attempt to copy an unnecessary file (related to EMPH, the old perfect hash function) in fetchRapMap.sh
  • Fixed default alignment-mode options to be consistent with mapping-mode
    • Default maximum fragment length increased from 800 to 1000
    • Default alignment-mode auxiliary directory is now named aux_info as in mapping-mode
  • Fixed description of library types in lib_format_counts.json
    • Fixed string representation of single-end stranded reads (SF and SR rather than F & R as before)
    • Fixed duplicate entries in lib_format_counts.json
  • Fixed various other typos in the --help menus, including the removal of a duplicate option in alignment-based mode

Minor changes & improvements

  • Bumped to the latest version of spdlog.
  • Bumped to the latest version of Staden IO.
  • The automatically detected library type is now applied slightly earlier, so that fewer fragments that map inconsistently with the eventually-determined library format will be counted.
  • meta_info.json now contains information about the number of posterior samples, regardless of whether these were obtained with bootstrapping or posterior Gibbs sampling.

New features

  • Added the ability to write out mapping information in SAM format. When Salmon is run in mapping mode (with the quasi index), you can now have it write out information about the quasi-mappings it uses for quantification. This behavior is enabled with the option --writeMappings. If this option is provided with no arguments, it will, by default, write the mapping information to stdout (this can then be piped to e.g. samtools and converted to BAM format). Optionally, you may also provide this argument with a filename, in which case the mapping information will be written to that file in SAM format.
    • Note: In the 0.7.2 release, the file provided to --writeMappings must use a qualified path (e.g. --writeMappings=./out.sam rather than --writeMappings=out.sam); this constraint is already addressed on develop and will be fixed in the next release. Further, note that, because --writeMappings has an implicit argument (stdout), any explicit argument must be directly adjacent to the option; i.e. --writeMappings=./out.sam is OK, but --writeMappings ./out.sam is not (this is a fundamental limitation of how boost program_options handles implicit values).
    • Note: The mapping information is computed and written before library type compatibility checks take place, thus the mapping file will contain information about all mappings of the reads considered by Salmon, even those that may later be filtered out due to incompatibility with the library type.
  • Added the ability to perform automatic library type detection in alignment-based mode.
    • Note: The implementation of this feature involves opening the BAM file, peeking at the first record, and then closing it to determine if the library should be treated as single-end or paired-end. Thus, in alignment-based mode, automatic library type detection will not work with an input stream. If your input is a regular file, everything should work as expected; otherwise, you should provide the library type explicitly in alignment-based mode.
    • Note: The automatic library type detection is performed on the basis of the alignments in the file. Thus, for example, if the upstream aligner has been told to perform strand-aware mapping (i.e., to ignore potential alignments that don't map in the expected manner), but the actual library is unstranded, automatic library type detection cannot detect this. It will attempt to detect the library type that is most consistent with the alignments that are provided.

Thanks

This release contains fixes to bugs reported by (or features suggested by) the following people: