
RapMap v0.4.0

@rob-p rob-p released this 24 Sep 21:07
· 166 commits to develop-salmon since this release

This new release of RapMap is accompanied by a substantial cleanup of the underlying codebase (including a few bug fixes). The new version should better handle cases where there are almost equally-good mappings on the forward and reverse-complement strands (but where one mapping is slightly better than the other). It also introduces a few more user-facing features (see New Features below).

Important note:

The quasi-indices from previous versions of RapMap are not binary compatible with the new version (see below). Please re-build your indices before using RapMap v0.4.0.

New Features

  • New hash map for default index - The default quasiindex command now uses the sparsepp sparse hash map. While providing very similar lookup performance to the prior hash map implementation, sparsepp provides a number of benefits. Specifically, it uses substantially less memory (typically ~50% less) and, crucially, its memory usage grows gradually with the number of keys. A big problem with the previous implementation (Google's dense hash map) is that, on resize, the map would double in size and memory usage would jump by a factor of 3 (a new map twice the size of the old one, plus the original map from which to copy the keys). This means that even if you had enough memory to hold the final map, you might not be able to build it. Sparsepp, on the other hand, exhibits memory usage that scales almost linearly with the number of items in the map. For more details on the performance characteristics of the new default hash used in the index, please see the sparsepp benchmarks.
  • New frugal perfect hash index - The vastly improved memory usage of the new default quasi index essentially obviates the previous perfect-hash-based index. Specifically, since that perfect hash also stored the keys (to validate queries from outside the universe on which the hash was built), the size of the resulting index was similar; it simply required less memory to build. However, sparsepp achieves very similar memory usage to the previous perfect-hash-based index. Instead of removing the perfect-hash-based index entirely, the -p/--perfectHash flag now tells the quasiindex command to build a frugal perfect-hash-based index. This index uses a number of aggressive space-saving techniques, which result in a much smaller memory footprint (but it is also slower to construct and has slower lookups than the default index). For large references, the new frugal perfect-hash-based index exhibits a memory reduction (over the new, reduced-memory, default index) of 40-50% (hence, it shows close to the same savings over the old perfect-hash-based index as well). Also, for large references, the size of the index on disk is ~40% smaller. The cost of this substantial size reduction is that the frugal perfect-hash-based index takes 2-2.5 times longer to build, and lookups are slower. This slower lookup speed can, conceivably, reduce quasi-mapping speed a bit, but the speed hit (if there is one) is dataset dependent. This new indexing scheme should allow the construction of quasi indices on substantially larger references for a fixed RAM budget, and it also reduces the memory required to hold the index during mapping.
  • New options to the quasimap command - The following options have been introduced to the quasimap command:
    • sensitive mode - the -e / --sensitive flag will turn off some NIP-based jumping in the algorithm and will allow reads to compete for mapping using MMP-based coverage profiles. This can increase the sensitivity and specificity of difficult-to-map reads.
    • quasi coverage - the -z/--quasiCoverage option takes a number c, with 0 <= c <= 1, that allows the user to specify that a read will only be considered "mappable" if at least a fraction c of the read is covered by maximum-mappable-prefixes (MMPs). Note that the condition that the coverage must be in terms of MMPs is rather stringent, and so this parameter is not to be interpreted as the fraction of nucleotides that would be covered under an optimal alignment. Nonetheless, it allows enforcing the requirement that a single k-length hit should not be sufficient evidence of mapping, and can reduce false-positive mappings when similar but distinct sequences are present in the sample but not the reference (the --quasiCoverage option implies sensitive mode, but not vice-versa).
    • quiet flag - the -q / --quiet flag will disable all non-warning/non-error output of the quasimapping command to the console.
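
To make the quasi-coverage threshold concrete, here is a minimal Python sketch of the idea, not RapMap's actual implementation. It assumes the MMP hits on a read are already available as half-open [start, end) intervals in read coordinates; the function names and the interval representation are hypothetical and chosen only for illustration.

```python
def covered_fraction(read_len, intervals):
    """Fraction of read positions covered by at least one [start, end) interval.

    `intervals` stands in for the read positions spanned by MMP hits
    (a hypothetical representation, not RapMap's internal one).
    """
    if read_len <= 0:
        raise ValueError("read_len must be positive")
    covered = 0
    last_end = 0  # right edge of everything counted so far
    for start, end in sorted(intervals):
        start = max(start, last_end, 0)  # skip positions already counted
        end = min(end, read_len)         # clip to the read
        if end > start:
            covered += end - start
            last_end = end
    return covered / read_len


def is_mappable(read_len, intervals, c):
    """Apply a coverage threshold c, with 0 <= c <= 1, as described above."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must lie in [0, 1]")
    return covered_fraction(read_len, intervals) >= c


# A single 31-long hit on a 100 bp read covers 31% of the read,
# so it fails a threshold of c = 0.5.
print(is_mappable(100, [(10, 41)], 0.5))
# Two overlapping hits covering positions 0-59 pass the same threshold.
print(is_mappable(100, [(0, 40), (30, 60)], 0.5))
```

Under these assumptions, the first call shows how a nonzero c stops a lone k-length hit from counting as sufficient evidence of mapping.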

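The resize problem described for the previous default hash map can be illustrated with a little arithmetic. The following Python sketch models a generic table that doubles its capacity when it fills up; the initial capacity, the load factor, and the slot-based accounting are assumptions made for illustration and are not the actual parameters of Google's dense hash map or of sparsepp.

```python
def doubling_build_profile(n_keys, initial_capacity=32, max_load=0.5):
    """Return (final_slots, peak_slots) for a table that doubles on resize.

    During each resize the old table and the new, twice-as-large table
    are alive at the same time, so the transient footprint is three
    times the old capacity.
    """
    capacity = initial_capacity
    peak = capacity
    while n_keys > capacity * max_load:
        transient = capacity + 2 * capacity  # old table + new table
        peak = max(peak, transient)
        capacity *= 2
    return capacity, peak


final_slots, peak_slots = doubling_build_profile(1_000_000)
print(final_slots, peak_slots, peak_slots / final_slots)
```

In this model, building a table for one million keys briefly needs 1.5 times the slots of the finished table, which is why a machine able to hold the final map could still fail to build it.
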
Other changes

  • Removal of quality strings from the SAM output - RapMap now outputs * in place of the quality string of a read in the output SAM file. This is consistent with the SAM standard, and produces output that is considerably less verbose (faster to write and takes up less space), and which also compresses to BAM much better. If quality strings for particular reads are desired, they can always be retrieved from the corresponding read IDs and the original file (we may provide a tool for this in the future).
  • CIGAR string of unmapped reads - The CIGAR string of unmapped reads is now reported as * rather than NS (i.e. soft clipping of length N, where N is the read length).
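
As a rough illustration of the two output changes above, the Python sketch below builds a SAM line for an unmapped single-end read with * in both the CIGAR and QUAL columns, and shows one way quality strings could be restored afterwards from a lookup keyed by read ID. This is not RapMap code or a RapMap tool: the function names, the single-end flag value, the MAPQ of 0, and the in-memory dictionary of qualities are all assumptions made for the example.

```python
def unmapped_sam_line(read_name, seq):
    """Format the 11 mandatory SAM fields for an unmapped single-end read."""
    fields = [
        read_name,  # QNAME
        "4",        # FLAG: 0x4 = segment unmapped
        "*",        # RNAME
        "0",        # POS
        "0",        # MAPQ (assumed value for this sketch)
        "*",        # CIGAR: unavailable, rather than e.g. "100S"
        "*",        # RNEXT
        "0",        # PNEXT
        "0",        # TLEN
        seq,        # SEQ
        "*",        # QUAL: omitted
    ]
    return "\t".join(fields)


def restore_quality(sam_line, qualities_by_name):
    """Fill in QUAL from a {read ID: quality string} lookup, if possible."""
    fields = sam_line.split("\t")
    qual = qualities_by_name.get(fields[0])
    if fields[10] == "*" and qual is not None and len(qual) == len(fields[9]):
        # SAM stores SEQ and QUAL in reference orientation, so qualities
        # of reverse-strand records (FLAG 0x10) must be reversed.
        if int(fields[1]) & 0x10:
            qual = qual[::-1]
        fields[10] = qual
    return "\t".join(fields)


line = unmapped_sam_line("read1", "ACGT")
print(line)
print(restore_quality(line, {"read1": "IIII"}))
```

For real data, the lookup would be filled from the original FASTQ file, and paired-end reads would need additional flag bits that this sketch does not model.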