
@rob-p released this Apr 22, 2020

This is a minor release, but it nonetheless adds a few important features and fixes an outstanding bug.

This release incorporates all of the improvements and additions of 1.2.0, which are significant and which are covered in detail here.

New features:

  • salmon learned a new command line option --mismatchSeedSkip. This option can be used to tune seeding sensitivity for selective-alignment. The default value is 5, and should work well in most cases, but it can be tuned if the user wants. After a k-mer hit is extended to a uni-MEM, the uni-MEM extension can terminate for one of 3 reasons: the end of the read, the end of the unitig, or a mismatch. If the extension ends because of a mismatch, this is likely the result of a sequencing error. To avoid looking up many k-mers that will likely fail to be located in the index, the search procedure skips by a factor of mismatchSeedSkip until it either (1) finds another match or (2) is k bases past the mismatch position. This value controls that skip length. A smaller value can increase sensitivity, while a larger value can speed up seeding.
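The skip-ahead logic described above can be sketched as follows (an illustrative Python sketch of the described behavior, not salmon's actual implementation; the function and parameter names are hypothetical):

```python
def next_seed_position(read, pos, k, skip, index_kmers):
    """After a uni-MEM ends in a mismatch at `pos`, probe every `skip`
    bases until a k-mer is found in the index or we are k bases past
    the mismatch. Illustrative sketch only."""
    probe = pos + skip
    limit = pos + k  # stop once we are k bases past the mismatch
    while probe <= min(limit, len(read) - k):
        if read[probe:probe + k] in index_kmers:
            return probe  # found an indexed seed; resume extension here
        probe += skip
    return min(limit, len(read) - k) + 1  # no seed found in the skipped window
```

A smaller skip probes more positions (higher sensitivity); a larger skip probes fewer positions (faster seeding).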

  • salmon learned about the environment variable SALMON_NO_VERSION_CHECK. If this environment variable is set (to either 1 or TRUE) then salmon will skip checking for an updated version, regardless of whether or not it is passed the --no-version-check flag on the command line. This makes it easy to e.g. set the environment variable to control this behavior for instances running on a cluster. This addresses issue 486, and we thank @cihanerkut for the suggestion.
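The resulting precedence between the environment variable and the command-line flag can be sketched as (illustrative Python of the described behavior; `version_check_enabled` is a hypothetical name, not salmon's source):

```python
import os

def version_check_enabled(no_version_check_flag: bool) -> bool:
    """The env var suppresses the version check regardless of the
    command-line flag; otherwise the flag decides. Sketch only."""
    env = os.environ.get("SALMON_NO_VERSION_CHECK", "").strip().upper()
    if env in ("1", "TRUE"):
        return False
    return not no_version_check_flag
```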

Improvements:

  • This is a change in default behavior: as raised in issue 505, salmon would refuse to index sequences containing duplicate decoy entries unless the --keepDuplicates flag was passed, forcing the user to remove the duplicate decoys first. Since indexing duplicate sequences does not make sense, we have decided that duplicate decoy sequences will always be discarded, regardless of the status of the --keepDuplicates flag; this lifts the burden on the user of ensuring that the decoy sequences are free of duplicates. The behavior is now: any decoy sequence that is a duplicate of a previously-observed sequence (whether that sequence is a decoy or a non-decoy target) is discarded. The number of discarded duplicate decoys (if > 0) is reported to the log. Thanks to @tamuanand for raising the issue that led to this improvement.

  • During the build process, salmon (and pufferfish) now check directly if std::numeric_limits<__int128> is defined or not, and set the pre-processor flags accordingly. This should address an issue that was reported building under clang on OSX 10.15 (seemingly, earlier versions of the compiler turned on vendor-specific extensions under the -std=c++14 flag, while the newer version does not).

Bug fixes:

  • Addressed / fixed a possibly un-initialized variable (sopt.noSA) in argument parsing.

@rob-p released this Apr 10, 2020 · 9 commits to master since this release

Improvements and changes

Improvements


  • A dramatic reduction in the intermediate disk space required when building the salmon index. This improvement is due to the changes implemented by @iminkin in TwoPaCo (which pufferfish, and hence salmon, uses for constructing the colored, compacted dBG) addressing the issue here. This means that for larger references, or references with many "contigs" (transcripts), the intermediate disk space requirements are reduced by up to 2 orders of magnitude!

  • Reduction in the memory required for indexing, especially when indexing with a small value of k. This improvement comes from (1) fixing a bug that was resulting in an unnecessarily-large allocation when "pufferizing" the output of TwoPaCo and (2) improving the storage of some intermediate data structures used during index construction. These improvements should help reduce the burden of constructing a decoy-aware index with small values of k. The issue of reducing the number of intermediate files created (which can hurt performance on NFS-mounted drives) is being worked on upstream, but is not yet resolved.

alevin

  • This release introduces support for the quantification of CITE-seq / feature barcoding based single-cell protocols! A full, end-to-end tutorial is soon-to-follow on the alevin-tutorial website.

New flags and options:


  • Salmon learned a new option (currently Beta) --softclip: This flag allows soft-clipping at the beginning and end of reads when they are scored with selective-alignment. If used in conjunction with the --writeMappings flag, the CIGAR strings in the resulting SAM output will designate any soft-clipping that occurs at the beginning or end of the read. Note: To pass the selective-alignment filter, the read must still obtain a score of at least (maximum achievable score * minScoreFraction), but soft-clipping allows omitting a poor-quality sub-alignment at the beginning or end of the read with no change to the resulting score for the rest of the alignment (rather than forcing a negative score for these sub-alignments).

  • Salmon learned a new option --decoyThreshold <thresh>: For an alignment to an annotated transcript to be considered invalid, it must have an alignment score s such that s < (decoyThreshold * bestDecoyScore). A value of 1.0 means that any alignment strictly worse than the best decoy alignment will be discarded. A smaller value will allow reads to be allocated to transcripts even if they strictly align better to the decoy sequence. The previous behavior of salmon was to discard any mappings to annotated transcripts that were strictly worse than the best decoy alignment. This is equivalent to setting --decoyThreshold 1.0, which is the default behavior.
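The filtering rule can be sketched as (illustrative Python of the stated rule; the function name is hypothetical):

```python
def passes_decoy_filter(aln_score: float, best_decoy_score: float,
                        decoy_threshold: float = 1.0) -> bool:
    """An alignment to an annotated transcript is invalid when its
    score is strictly below decoy_threshold * best_decoy_score.
    Sketch of the described rule, not salmon's source."""
    return aln_score >= decoy_threshold * best_decoy_score
```

With the default threshold of 1.0, a transcript alignment tying the best decoy alignment is kept; lowering the threshold keeps alignments that score somewhat worse than the best decoy.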

  • Salmon learned a new option --minAlnProb <prob>: When selective alignment is carried out on a read, each alignment A is assigned a probability given by $e^{-(scoreExp * (bestScore - score(A)))}$, where the default scoreExp is just 1.0. Depending on how much worse a given alignment is compared to the best alignment for a read, this can result in an exceedingly small alignment probability. The --minAlnProb option lets one set the alignment probability below which an alignment's probability will be truncated to 0. This allows skipping the alignments for a fragment that are unlikely to be true (and which could increase the difficulty of inference in some cases). The default value is 1e-5.
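The probability assignment and truncation can be sketched as (illustrative Python of the stated formula; the function name is hypothetical):

```python
import math

def alignment_prob(score, best_score, score_exp=1.0, min_aln_prob=1e-5):
    """Probability assigned to an alignment, truncated to 0 when it
    falls below min_aln_prob. Sketch of the stated formula."""
    p = math.exp(-score_exp * (best_score - score))
    return p if p >= min_aln_prob else 0.0
```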

  • Salmon learned a new flag --disableChainingHeuristic: Passing this flag will turn off the heuristic of Li 2018 that is used to speed up the MEM chaining step, where the inner loop of the chaining algorithm is terminated after a small number of previous pointers for a given MEM have been found. Passing this flag can improve the sensitivity of alignment to sequences that are highly repetitive (especially those with overlapping repetition), but it can make the chaining step somewhat slower.
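The effect of the heuristic can be illustrated with a toy chaining dynamic program (a simplified sketch; salmon's actual chaining scores gaps and overlaps far more elaborately):

```python
def chain_scores(anchors, match_score=1, max_skip=25):
    """Simplified MEM-chaining DP. anchors: list of (read_pos, ref_pos)
    seed positions sorted by read_pos. The inner loop over predecessors
    stops after max_skip consecutive failures to improve the score (the
    Li 2018 heuristic); max_skip=None disables the heuristic, trading
    speed for sensitivity. Illustrative sketch only."""
    n = len(anchors)
    f = [match_score] * n  # best chain score ending at anchor i
    for i in range(1, n):
        skipped = 0
        for j in range(i - 1, -1, -1):
            ri, qi = anchors[i]
            rj, qj = anchors[j]
            if rj < ri and qj < qi and f[j] + match_score > f[i]:
                f[i] = f[j] + match_score
                skipped = 0
            else:
                skipped += 1
                if max_skip is not None and skipped > max_skip:
                    break
    return f
```

In the toy example below, an aggressive skip cutoff misses a chainable predecessor that disabling the heuristic recovers; this mirrors why --disableChainingHeuristic can help with overlapping repeats.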

  • Salmon learned a new flag --auxTargetFile <file>. The file passed to this option should list targets (i.e. sequences indexed during indexing, or aligned against in the provided BAM file) to which the auxiliary models (sequence-specific, fragment-GC, and position-specific bias correction) should not be applied, one target name per line. Unlike decoy sequences, this list is provided to the quant command, and can differ between runs if so desired. Also unlike decoy sequences, the auxiliary targets will be quantified (e.g. they will have entries in quant.sf and can have reads assigned to them). To aid in metadata tracking of targets marked as auxiliary, the aux_info directory contains a new file, aux_target_ids.json, listing the indices of targets that were treated as "auxiliary" targets in the current run.

  • The equivalence class output is now gzipped when written (and written to aux_info/eq_classes.txt.gz rather than aux_info/eq_classes.txt). To detect this behavior, an extra property gzipped is written to the eq_class_properties entry of aux_info/meta_info.json. Apart from being gzipped to save space, however, the format is unchanged. So, you can simply read the file using a gzip stream, or, alternatively, simply unzip the file before reading it.
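Reading the new gzipped file from Python might look like the following sketch (only the compression changed, so parsing of the contents is unchanged; the fallback to the old filename is for pre-existing output directories):

```python
import gzip
import os

def read_eq_classes(quant_dir):
    """Return the lines of the equivalence-class file, preferring the
    new gzipped name and falling back to the old uncompressed one."""
    gz = os.path.join(quant_dir, "aux_info", "eq_classes.txt.gz")
    txt = os.path.join(quant_dir, "aux_info", "eq_classes.txt")
    if os.path.exists(gz):
        with gzip.open(gz, "rt") as fh:  # transparent gzip stream
            return fh.read().splitlines()
    with open(txt, "rt") as fh:
        return fh.read().splitlines()
```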

  • Added special handling for reading SAM files that were, themselves, produced by salmon. Specifically, when reading SAM files produced by salmon, the AS tag will be used to assign appropriate conditional probabilities to different mappings for a fragment (rather than looking for a CIGAR string, which is not computed).
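Extracting the AS tag from a SAM record follows the standard SAM optional-field format (AS:i:<int>); a minimal sketch:

```python
def alignment_score(sam_line: str):
    """Return the integer AS (alignment score) tag from one SAM record,
    or None if absent. Optional fields start after the 11 mandatory
    SAM columns."""
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("AS:i:"):
            return int(field[5:])
    return None
```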

  • The versionInfo.json file generated during indexing now records the specific version of salmon that was used to build the index. The indexVersion field is already a version identifier that is incremented when the index changes in a binary-incompatible way; the new field additionally lets one know the exact salmon version that was used to build the index.

alevin

  • A couple of new flags have been added to support feature-barcoding-based quantification in the alevin framework.

    • index command
      • --features: Performs indexing on a TSV file instead of a regular FASTA reference file. Each line of the TSV file should contain a feature name and its nucleotide sequence, separated by a tab.
    • alevin command
      • --featureStart: The start index (0 based) of the feature barcode in the R2 read file. (Typically 0 for CITE-seq and 10 for 10x feature barcoding).
      • --featureLength: The length of the feature barcode in the R2 read file. (Typically 15 for both CITE-seq and 10x feature barcoding).
      • --citeseq: Used for quantifying feature-barcoded data, where the feature barcodes (allowing 1-edit distance) are aligned instead of the mRNA reads.
      • No --tgMap is needed when using the --citeseq protocol.
  • --end 3 has been enabled in this release. It is useful for protocols where the UMI comes before the cellular barcode (CB). Note: --end 3 does not start reading from the 3' end of the R1 file. Instead, alevin still counts from the 5' end, but samples the UMI before the CB. That is, --end 5 represents CB+UMI while --end 3 represents UMI+CB, and all sequence beyond |CB| + |UMI| bases is ignored, no matter what value is set for the flag --end.
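The barcode geometry described above can be sketched as (illustrative Python of the described behavior; the function names are hypothetical, not alevin's API):

```python
def split_barcodes(r1: str, cb_len: int, umi_len: int, end: int = 5):
    """Extract (CB, UMI) from an R1 read. Both --end 5 and --end 3 read
    from the 5' end; they differ only in order (CB+UMI vs UMI+CB), and
    bases beyond cb_len + umi_len are ignored either way."""
    if end == 5:
        return r1[:cb_len], r1[cb_len:cb_len + umi_len]
    umi, cb = r1[:umi_len], r1[umi_len:umi_len + cb_len]
    return cb, umi

def feature_barcode(r2: str, feature_start: int, feature_length: int):
    """Slice the feature barcode out of an R2 read, e.g. start 0 /
    length 15 for CITE-seq, start 10 / length 15 for 10x feature
    barcoding (matching --featureStart / --featureLength above)."""
    return r2[feature_start:feature_start + feature_length]
```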

Bug fixes

  • Fixed an issue (upstream in pufferfish) that actually arose from bbhash; specifically, unexpected behavior of bbhash during minimal perfect hash (MPHF) construction. bbhash may create temporary files during MPHF construction, and it used the current working directory to do this, with no option to override that behavior. We have fixed this in our copy of the bbhash code, and the salmon index command will now use the provided output directory as temporary working space for bbhash. This has been reported upstream in bbhash as issue 19.

  • Fixed an issue with long target names (raised in issue 451) not being allowed in the index. Previously, in the pufferfish-based index, target names of length > 255 were clipped to 255 characters. While this is not normally a problem, pipelines that attempt to encode significant metadata in the target name may be affected by this limit. With this release, target names of up to 65,536 characters are supported. Thanks to @chilampoon for raising this issue.

  • Fixed an issue where the computed alignment score could be wrong (too high) when there were MEMs in the highest-scoring chain that overlapped in the query and the reference by different amounts. This was relatively infrequent, but has now been fixed. Thanks to @cdarby for reporting the issue and providing a test case to fix it!

  • Fixed an issue where, in rare situations, usage of the alignment cache could cause non-determinism in the score for certain alignments, which could result in small fluctuations in the number of assigned fragments. The fix involves both addressing a bug in ksw2, where an incorrect alignment score for global alignment could be returned in certain rare situations depending on how the bandwidth parameter is set, and also being more stringent about which alignments are inserted into the alignment cache and which mappings are searched for in it. Many thanks to @csoneson for raising this issue and finding a dataset containing enough of the corner cases to track down and fix the issue. Thanks to @mohsenzakeri for isolating the underlying cases and figuring out how to fix them.

alevin

  • The file generated when --dumpBfh is set contained reversed UMI sequences relative to those present in the original reads. This was a legacy bug, introduced when shifting from the jellyfish-based 2-bit encoding to the AlevinKmer-class-based 2-bit encoding, and it has been fixed in this release.

  • Fixed an issue where the --writeUnmappedNames did not work properly with alevin. This addresses issue 501.

Other notes

  • As raised in issue 500, the salmon executable, since v1.0.0, assumes the SSE4 instruction set. While this feature has been standard on processors for a long time, some older hardware may not have this feature set. This compile flag was removed from the pufferfish build in this release, as we noticed no speed regressions in its absence. However, please let us know if you notice any non-trivial speed regressions (which we do not expect).

@rob-p released this Dec 19, 2019 · 96 commits to master since this release

salmon 1.1.0 release notes

Note : This version contains some important fixes; please see below for detailed information.

Note : On our testing machines, this version of salmon was index-compatible with version 1.0.0. That is, it is likely that you need not re-build your index from what you built with 1.0.0. However, it is not clear that this compatibility is guaranteed by the cereal library. If you encounter difficulty loading a previously-built index, please consider re-building with the latest version before filing a bug report.

Note : If you want to build from source and use a version of the (header-only) cereal library already installed on your system, please make sure it is cereal v1.3.0. The current findCereal.cmake file does not support version restrictions, and we are working to improve this for proper automatic detection and enforcement of this constraint in future releases.

As always, a pre-compiled linux executable is included below and the latest release is available via Bioconda.

Improvements

  • SHA512 sums are now properly propagated forward to meta_info.json.

  • Bumped the included version of the cereal serialization library. The components used by salmon should be backward compatible in terms of reading output from the previous version (i.e. should not require index re-building).

  • The flag --keepFixedFasta was added to the index command. If this flag is passed, then a "fixed" version of the fasta file will be retained in the index directory. This file is created during indexing, but is normally deleted when indexing is complete. It contains the input fasta without duplicate sequences (unless --keepDuplicates was used), with the headers as understood by salmon, with N nucleotides replaced, etc.

  • Introduced a few small optimizations upstream (in pufferfish) to speed up selective-alignment; more are on the way (thanks to @mohsenzakeri).

Bug fixes

  • The bug described directly below led to the discovery of a different but related bug that could cause the extracted sequence used for bias correction to be incorrect. The code was assuming zero-initialization of memory which was not necessarily occurring. Note: This bug affects runs performed under mapping-based mode (i.e. when the input was not coming from a BAM file) and when --seqBias or --gcBias flags (or both) were used. Depending upon the initialization of the underlying memory, the bug may lead to unexpected results and diminished accuracy. The bug was present in versions 0.99.0 beta 1 through 1.0.0 (inclusive), and if you processed data using these versions in mapping-based mode with the flags mentioned above, we encourage you to reprocess this data with the newest version, just in case. We apologize for any inconvenience.

  • Fixed a bug that would occur when the input FASTA file contained short sequences (<= length k) near the end of the file and bias correction (sequence-specific or fragment-GC) was enabled. The problem was particularly acute when the short sequence was immediately preceded by a very long target, causing inordinate warning messages to be printed to the log and slowing index loading considerably. Furthermore, sequence copies were provided to short transcripts and decoy sequences even though they are not needed, resulting in unnecessary memory waste. The bug was due to a missing parenthesization to enforce the desired operator precedence. This fix should speed up index loading and reduce memory usage when using the --seqBias or --gcBias flags. Huge thanks to @mdshw5 for finding an input that triggered this behavior (which didn't show up in testing), and for helping to track down the cause.

  • Fixed a bug that could occur in computing the Beta function component of the chaining score with very long queries. This should not have shown up at all with Illumina-length reads, but nonetheless the adjustment conceptually corrects the scoring for all cases. Thanks @mohsenzakeri.

Nov 6, 2019
updating version for hca

@k3yavi released this Oct 31, 2019 · 136 commits to master since this release

This is a major stable release of salmon and brings a lot of exciting new features with extensive benchmarking in the latest preprint.

This new version of salmon is based on a fundamentally different indexing data structure (pufferfish) than the previous version. It also adopts a different mapping algorithm; a variant of selective-alignment. The new indexing data structure makes it possible to index the transcriptome as well as a large amount of "decoy" sequence in small memory. It also makes it possible to index the entire transcriptome and the entire genome in "reasonable" memory (currently ~18G in dense mode and ~14G in sparse mode, though these sizes may improve in the future), which provides a much more comprehensive set of potential decoy sequences. In the new index, the transcriptome and genome are on "equal footing", which helps to avoid bias toward either the transcriptome or the genome during mapping.

Note : To construct the ccDBG from the reference sequence, which is subsequently indexed with pufferfish, salmon makes use of (a very slightly modified version of) the TwoPaCo software. TwoPaCo implements a very efficient algorithm for building a ccDBG from a collection of reference sequences. One of the key parameters of TwoPaCo is the size of the Bloom filter used to record and filter possible junction k-mers. To ease the indexing procedure, salmon will attempt to automatically set a reasonable estimate for the Bloom filter size, based on an estimate of the number of distinct k-mers in the reference and using a default FPR of 0.1% over TwoPaCo's default 5 filters. To quickly obtain an estimate of the number of distinct k-mers, salmon makes use of (a very slightly modified version of) the ntCard software; specifically the nthll implementation.
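For reference, the standard Bloom filter sizing formula gives a sense of how such an estimate scales with the distinct-k-mer count (illustrative only; the exact heuristic used by salmon/TwoPaCo may differ):

```python
import math

def bloom_filter_bits(n_distinct_kmers: int, fpr: float = 0.001) -> int:
    """Standard Bloom filter sizing: the number of bits m needed for n
    elements at false-positive rate p is m = -n * ln(p) / (ln 2)^2.
    Illustrative of the kind of estimate derived from an nthll-style
    distinct-k-mer count; not salmon's actual code."""
    return math.ceil(-n_distinct_kmers * math.log(fpr) / (math.log(2) ** 2))
```

At the default 0.1% FPR this works out to roughly 14–15 bits per distinct k-mer.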

Changes since v0.99.0 beta2

A bug related to alevin index parsing has been fixed. Specifically, if the length of any decoy target was less than the k-mer length, alevin would dump gene counts for decoy targets. Thanks to @csoneson for reporting this; it has been fixed in this stable release.

Changes since v0.99.0 beta1

Allow passing of explicit filter size to the indexing command via the -f parameter (default is to estimate required filter size using nthll).

Fix bug that prevented dumping SAM output, if requested, in alevin mode.

Correctly enabled strictFilter mode in alevin, improving single-cell mapping quality.

Changes since v0.14.1

The indexing methodology of salmon is now based on pufferfish. Thus, any previous indices need to be re-built. However, the new indexing methodology is considerably faster and more parallelizable than the previous approach, so providing multiple threads to the index command should make relatively short work of this task.

The new version of salmon adopts a new and modified selective-alignment algorithm that is, nonetheless, very similar to the selective-alignment algorithm described in Alignment and mapping methodology influence transcript abundance estimation. In this release of salmon, selective-alignment is enabled by default (and, in fact, mapping without selective-alignment is disabled). We may explore, in the future, ways to allow disabling selective-alignment under the new mapping approach, but at this point, it is always enabled.

As a consequence of the above, range factorization is enabled by default.

There is a new command-line flag --softclipOverhangs which allows reads that overhang the end of transcripts to be softclipped. The softclipped region will neither add to nor detract from the match score. This is more permissive than the default strategy, which requires the overhanging bases of the read to be scored as a deletion under the alignment.

There is a new command-line flag --hitFilterPolicy which determines the policy by which hits or chains of hits are filtered in selective alignment, prior to alignment scoring. Filtering hits after chaining (the default) is more sensitive, but more computationally intensive, because it performs the chaining dynamic program for all hits. Filtering before chaining is faster, but some true hits may be missed. The NONE option is not recommended, but is the most sensitive. It does not filter any chains based on score, though all methods retain only the highest-scoring chains per transcript for subsequent alignment scoring. The options are BEFORE, AFTER, BOTH and NONE.

There is a new command-line flag --fullLengthAlignment, which performs selective-alignment over the full length of the read, beginning from the (approximate) initial mapping location and using extension alignment. This is in contrast with the default behavior which is to only perform alignment between the MEMs in the optimal chain (and before the first and after the last MEM if applicable). The default strategy forces the MEMs to belong to the alignment, but has the benefit that it can discover indels prior to the first hit shared between the read and reference.

The -d/--dumpEqWeights flag now dumps the information associated with whichever type of factorization is being used for quantification (the default now is range-factorized equivalence classes). The --dumpEq flag now always dumps simple equivalence classes. This means that no associated conditional probabilities are written to the file, and if range-factorization is being used, then all of the range-factorized equivalence classes that correspond to the same transcript set are collapsed into a simple equivalence class label and the corresponding counts are summed.
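The collapsing of range-factorized classes into simple equivalence classes can be sketched as (illustrative Python; the data layout here is hypothetical, not salmon's on-disk format):

```python
from collections import defaultdict

def collapse_to_simple(range_factorized):
    """Collapse range-factorized equivalence classes into simple ones:
    classes sharing the same transcript set are merged, their counts
    summed, and the conditional-probability bins dropped. Sketch of
    the described --dumpEq behavior."""
    simple = defaultdict(int)
    for (tx_set, _cond_prob_bin), count in range_factorized.items():
        simple[tx_set] += count
    return dict(simple)
```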

There has been a change to the default behavior of the VB prior. The default VB prior is now evaluated on a per-transcript rather than per-nucleotide basis. The previous behavior is enabled by passing the --perNucleotidePrior option to the quant command.

Considerable improvements have been made to fragment length modeling in the case of single-end samples.

Alevin now contains a flag --quartzseq2 to support the Quartz-Seq2 protocol (thanks @dritoshi).

bug fix: When provided with the --dumpFeatures flag, alevin dumps featureDump.txt. The column header of this file was inconsistent with the values; this has been fixed, and the ArborescenceCount field now occurs as the last column.

bug fix: The mtx output overflowed the gene-count boundary when the total number of genes was exactly a multiple of 8. This has been fixed in the latest release.

The following command-line flags have been removed (since, given the new index, they no longer serve a useful function): --allowOrphansFMD, --consistentHits, --quasiCoverage.

Oct 29, 2019
missed sver

@k3yavi released this Oct 4, 2019 · 274 commits to master since this release

This release is a replica of v0.14.1, plus a hot-fix for a bug in alevin's output mtx format. Thanks to @pinin4fjords for reporting this; it fixes issue #431.
NOTE: This release doesn't support the features in the beta release of 0.99. If you are interested in the new indexing and mapping scheme described in the v0.99 release notes, please wait for v1.0.0 or try the v0.99 beta2.

Pre-release

@rob-p released this Sep 9, 2019 · 181 commits to develop since this release

This is the second beta version of the next major release of salmon.

This new version of salmon is based on a fundamentally different indexing data structure (pufferfish) than the previous version. It also adopts a different mapping algorithm; a variant of selective-alignment. The new indexing data structure makes it possible to index the transcriptome as well as a large amount of "decoy" sequence in small memory. It also makes it possible to index the entire transcriptome and the entire genome in "reasonable" memory (currently ~18G in dense mode and ~14G in sparse mode, though these sizes may improve in the future), which provides a much more comprehensive set of potential decoy sequences. In the new index, the transcriptome and genome are on "equal footing", which helps to avoid bias toward either the transcriptome or the genome during mapping.

Since it constitutes such a major change (and advancement) in the indexing and alignment methodology, we are releasing beta versions of this new release of salmon to give users the ability to try it out and to provide feedback before it becomes the "default" version you get via e.g. Bioconda. Since it is not currently possible to have both releases and "betas" in Bioconda, you can get the pre-compiled executables below, or build this version directly from the develop branch of the salmon repository.

Note : To construct the ccDBG from the reference sequence, which is subsequently indexed with pufferfish, salmon makes use of (a very slightly modified version of) the TwoPaCo software. TwoPaCo implements a very efficient algorithm for building a ccDBG from a collection of reference sequences. One of the key parameters of TwoPaCo is the size of the Bloom filter used to record and filter possible junction k-mers. To ease the indexing procedure, salmon will attempt to automatically set a reasonable estimate for the Bloom filter size, based on an estimate of the number of distinct k-mers in the reference and using a default FPR of 0.1% over TwoPaCo's default 5 filters. To quickly obtain an estimate of the number of distinct k-mers, salmon makes use of (a very slightly modified version of) the ntCard software; specifically the nthll implementation.

Changes since v0.99.0 beta1

  • Allow passing of explicit filter size to the indexing command via the -f parameter (default is to estimate required filter size using nthll).

  • Fix bug that prevented dumping SAM output, if requested, in alevin mode.

  • Correctly enabled strictFilter mode in alevin, improving single-cell mapping quality.

Changes since v0.14.1

  • The indexing methodology of salmon is now based on pufferfish. Thus, any previous indices need to be re-built. However, the new indexing methodology is considerably faster and more parallelizable than the previous approach, so providing multiple threads to the index command should make relatively short work of this task.

  • The new version of salmon adopts a new and modified selective-alignment algorithm that is, nonetheless, very similar to the selective-alignment algorithm described in Alignment and mapping methodology influence transcript abundance estimation. In this release of salmon, selective-alignment is enabled by default (and, in fact, mapping without selective-alignment is disabled). We may explore, in the future, ways to allow disabling selective-alignment under the new mapping approach, but at this point, it is always enabled.

  • As a consequence of the above, range factorization is enabled by default.

  • There is a new command-line flag --softclipOverhangs which allows reads that overhang the end of transcripts to be softclipped. The softclipped region will neither add to nor detract from the match score. This is more permissive than the default strategy, which requires the overhanging bases of the read to be scored as a deletion under the alignment.

  • There is a new command-line flag --hitFilterPolicy which determines the policy by which hits or chains of hits are filtered in selective alignment, prior to alignment scoring. Filtering hits after chaining (the default) is more sensitive, but more computationally intensive, because it performs the chaining dynamic program for all hits. Filtering before chaining is faster, but some true hits may be missed. The NONE option is not recommended, but is the most sensitive. It does not filter any chains based on score, though all methods retain only the highest-scoring chains per transcript for subsequent alignment scoring. The options are BEFORE, AFTER, BOTH and NONE.

  • There is a new command-line flag --fullLengthAlignment, which performs selective-alignment over the full length of the read, beginning from the (approximate) initial mapping location and using extension alignment. This is in contrast with the default behavior which is to only perform alignment between the MEMs in the optimal chain (and before the first and after the last MEM if applicable). The default strategy forces the MEMs to belong to the alignment, but has the benefit that it can discover indels prior to the first hit shared between the read and reference.

  • The -d/--dumpEqWeights flag now dumps the information associated with whichever type of factorization is being used for quantification (the default now is range-factorized equivalence classes). The --dumpEq flag now always dumps simple equivalence classes. This means that no associated conditional probabilities are written to the file, and if range-factorization is being used, then all of the range-factorized equivalence classes that correspond to the same transcript set are collapsed into a simple equivalence class label and the corresponding counts are summed.

  • There has been a change to the default behavior of the VB prior. The default VB prior is now evaluated on a per-transcript rather than per-nucleotide basis. The previous behavior is enabled by passing the --perNucleotidePrior option to the quant command.

  • Considerable improvements have been made to fragment length modeling in the case of single-end samples.

  • Alevin now contains a flag --quartzseq2 to support the Quartz-Seq2 protocol (thanks @dritoshi).

  • bug fix: When provided with the --dumpFeatures flag, alevin dumps featureDump.txt. The column header of this file was inconsistent with the values; this has been fixed, and the ArborescenceCount field now occurs as the last column.

  • bug fix: The mtx output overflowed the gene-count boundary when the total number of genes was exactly a multiple of 8. This has been fixed in the latest release.

  • The following command-line flags have been removed (since, given the new index, they no longer serve a useful function): --allowOrphansFMD, --consistentHits, --quasiCoverage.

Pre-release

@rob-p rob-p released this Sep 6, 2019 · 186 commits to develop since this release

This is the first beta version of the next major release of salmon.

This new version of salmon is based on a fundamentally different indexing data structure (pufferfish) than the previous version. It also adopts a different mapping algorithm; a variant of selective-alignment. The new indexing data structure makes it possible to index the transcriptome as well as a large amount of "decoy" sequence in small memory. It also makes it possible to index the entire transcriptome and the entire genome in "reasonable" memory (currently ~18G in dense mode and ~14G in sparse mode, though these sizes may improve in the future), which provides a much more comprehensive set of potential decoy sequences. In the new index, the transcriptome and genome are on "equal footing", which helps to avoid bias toward either the transcriptome or the genome during mapping.

Since it constitutes such a major change (and advancement) in the indexing and alignment methodology, we are releasing beta versions of this new release of salmon to give users the ability to try it out and to provide feedback before it becomes the "default" version you get via e.g. Bioconda. Since it is not currently possible to have both releases and "betas" in Bioconda, you can get the pre-compiled executables below, or build this version directly from the develop branch of the salmon repository.

Note : To construct the ccDBG from the reference sequence, which is subsequently indexed with pufferfish, salmon makes use of (a very slightly modified version of) the TwoPaCo software. TwoPaCo implements a very efficient algorithm for building a ccDBG from a collection of reference sequences. One of the key parameters of TwoPaCo is the size of the Bloom filter used to record and filter possible junction k-mers. To ease the indexing procedure, salmon will attempt to automatically set a reasonable estimate for the Bloom filter size, based on an estimate of the number of distinct k-mers in the reference and using a default FPR of 0.1% over TwoPaCo's default 5 filters. To quickly obtain an estimate of the number of distinct k-mers, salmon makes use of (a very slightly modified version of) the ntCard software; specifically the nthill implementation.
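The relationship between a distinct-k-mer estimate, a target false-positive rate, and a Bloom filter's size can be illustrated with the standard sizing formula m = -n * ln(p) / (ln 2)^2. This is a generic sketch of that formula only; it is not salmon's or TwoPaCo's actual internal computation, which also accounts for details such as TwoPaCo's multiple-filter rounds:

```python
import math

def bloom_bits(n_distinct_kmers: int, fpr: float) -> int:
    """Standard Bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits."""
    return math.ceil(-n_distinct_kmers * math.log(fpr) / (math.log(2) ** 2))

# e.g. an estimate of ~100M distinct k-mers at a 0.1% false-positive rate
m = bloom_bits(100_000_000, 0.001)
print(f"{m / 8 / 1024**2:.0f} MiB")
```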

Changes since v0.14.1

  • The indexing methodology of salmon is now based on pufferfish. Thus, any previous indices need to be re-built. However, the new indexing methodology is considerably faster and more parallelizable than the previous approach, so providing multiple threads to the index command should make relatively short work of this task.

  • The new version of salmon adopts a new and modified selective-alignment algorithm that is, nonetheless, very similar to the selective-alignment algorithm described in Alignment and mapping methodology influence transcript abundance estimation. In this release of salmon, selective-alignment is enabled by default (and, in fact, mapping without selective-alignment is disabled). We may explore, in the future, ways to allow disabling selective-alignment under the new mapping approach, but at this point, it is always enabled.

  • As a consequence of the above, range factorization is enabled by default.

  • There is a new command-line flag --softclipOverhangs which allows reads that overhang the end of transcripts to be softclipped. The softclipped region will neither add to nor detract from the match score. This is more permissive than the default strategy, which would require the overhanging bases of the read to be scored as a deletion under the alignment.

  • There is a new command-line flag --hitFilterPolicy which determines the policy by which hits or chains of hits are filtered in selective alignment, prior to alignment scoring. The options are BEFORE, AFTER, BOTH and NONE. Filtering hits after chaining (AFTER, the default) is more sensitive, but more computationally intensive, because it performs the chaining dynamic program for all hits. Filtering before chaining (BEFORE) is faster, but some true hits may be missed. The NONE option does not filter any chains based on score; it is the most sensitive, but is not recommended. Under all policies, only the highest-scoring chains per transcript are retained for subsequent alignment scoring.

  • There is a new command-line flag --fullLengthAlignment, which performs selective-alignment over the full length of the read, beginning from the (approximate) initial mapping location and using extension alignment. This is in contrast with the default behavior, which is to perform alignment only between the MEMs in the optimal chain (and before the first and after the last MEM, if applicable). The default strategy forces the MEMs to belong to the alignment, while full-length alignment does not, which has the benefit that it can discover indels prior to the first hit shared between the read and the reference.

  • The -d/--dumpEqWeights flag now dumps the information associated with whichever type of factorization is being used for quantification (the default now is range-factorized equivalence classes). The --dumpEq flag now always dumps simple equivalence classes. This means that no associated conditional probabilities are written to the file, and if range-factorization is being used, then all of the range-factorized equivalence classes that correspond to the same transcript set are collapsed into a simple equivalence class label and the corresponding counts are summed.

  • There has been a change to the default behavior of the VB prior. The default VB prior is now evaluated on a per-transcript rather than a per-nucleotide basis. The previous behavior can be enabled by passing the --perNucleotidePrior option to the quant command.

  • Considerable improvements have been made to fragment length modeling in the case of single-end samples.

  • Alevin now contains a flag --quartzseq2 to support the Quartz-Seq2 protocol (thanks @dritoshi).

  • bug fix: When Alevin is provided with the --dumpFeatures flag, it dumps featureDump.txt. The column headers of this file were inconsistent with the values; this has been fixed, and the ArborescenceCount field now occurs as the last column.

  • The following command-line flags have been removed (since, given the new index, they no longer serve a useful function): --allowOrphansFMD, --consistentHits, --quasiCoverage.
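The collapsing described above for --dumpEq (dropping the conditional-probability refinement of range-factorized classes and summing counts per transcript set) amounts to a group-by over transcript-set labels. The sketch below uses a made-up in-memory representation for illustration; it is not salmon's actual data structure or file format:

```python
from collections import defaultdict

# Range-factorized classes: (transcript-set label, probability bucket) -> count.
# The bucket index refines the label; --dumpEq drops it and sums the counts.
range_factorized = {
    (("t1", "t2"), 0): 10,
    (("t1", "t2"), 1): 5,
    (("t3",), 0): 7,
}

simple = defaultdict(int)
for (txp_set, _bucket), count in range_factorized.items():
    simple[txp_set] += count  # collapse to a simple equivalence class

print(dict(simple))  # {('t1', 't2'): 15, ('t3',): 7}
```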


@k3yavi k3yavi released this Jun 26, 2019 · 280 commits to master since this release

This is primarily a bugfix release. For the recently-added features and capabilities, please refer to the 0.14.0 release notes.

The following bugs have been fixed in v0.14.1 :

  • If the number of skipped CBs was too high, the reported whitelist could sometimes include barcodes from the skipped CB ids. Thanks to @Ryan-Zhu for bringing this up; it has been fixed.
  • Multiple bugs in --dumpMtx format, thanks to @Ryan-Zhu and @alexvpickering for pointing these out.
    • Expression values could sometimes be reported in scientific notation, which can break certain downstream parsers. This has been changed to fixed-precision decimal output (the C++ default precision of 6).
    • The column ids were 0-indexed while mtx assumes 1-indexing. This has been fixed to report the indices starting from 1.
    • The reported expressions in the mtx file were in column-major format, which doesn't align with the binary-format counts or with quants_mat_cols.txt. This has been fixed to report row-major order, which now aligns with the order of the genes as reported in quants_mat_cols.txt.
  • Alevin could fail without reporting an error when the number of low-confidence cellular barcodes was < 1. Alevin has been fixed to not perform intelligent whitelisting if the number of low-confidence cellular barcodes is < 200.
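The three mtx fixes above (fixed-precision values, 1-based indices, row-major order) can be seen together in a toy writer for the sparse entries. This is a hypothetical illustration of the output conventions, not alevin's code:

```python
# Toy cells-by-genes count matrix; zero entries are omitted from sparse output.
counts = [
    [0.0, 1.5e-03, 2.0],   # cell 0
    [3.25, 0.0, 0.0],      # cell 1
]

lines = []
for i, row in enumerate(counts):        # row-major traversal: cells, then genes
    for j, value in enumerate(row):
        if value != 0.0:
            # 1-based indices; fixed 6-digit precision, never scientific notation
            lines.append(f"{i + 1} {j + 1} {value:.6f}")

print("\n".join(lines))
```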