Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for 'meta' alignment operations (OP_PAD and OP_HARD_CLIP) #64

Merged
merged 26 commits into from
Oct 1, 2022

Conversation

MillironX
Copy link
Member

@MillironX MillironX commented Jan 27, 2022

Types of changes

This PR implements the following changes:

  • ✨ New feature (A non-breaking change which adds functionality).
  • 🐛 Bug fix (A non-breaking change, which fixes an issue).
  • 💥 Breaking change (fix or feature that would cause existing functionality to change).

📋 Additional detail

Adds support for the OP_PAD operation, and fixes the implementation of the OP_HARD_CLIP operation. I have dubbed these two operations 'meta' operations, as they contain information about the origin and context of the alignment, but have no bearing on the content (i.e. reference or query) of the alignment.

New features

  1. Added function ismetaop, which returns true if passed OP_PAD or OP_HARD_CLIP
  2. Creating an alignment with a OP_PAD operation no longer throws an error
  3. Added validity check for OP_HARD_CLIP or OP_SOFT_CLIP in the wrong place
    According to the SAM specification, these have to be present at the end of an alignment, with soft clips optionally buffering hard clips from the rest of the alignment

Changed functionality

  1. cigar(::Alignment) now uses the AlignmentAnchor.alnpos to print operation length
  2. OP_HARD_CLIP is no longer considered an insert operation, and isinsertop(OP_HARD_CLIP) now returns false
  3. The help text for OP_PAD indicates it's supported now
  4. Trying to create an alignment with a hard or soft clip in an improper place will throw an error
    (I know I said this above, but it's probably the most likely addition to break something, even though it does conform to spec)

Justification

BioAlignments should be able to handle all of the operations defined by the SAM specification. As it currently is, BioAlignments is unable even to parse the alignments in Section 1.1 "An example" of the specification. While the work on clips may seem out-of-scope for a change adding an operation support, the way clips work and pads work are very similar and needed to be considered together.

Example

Using query sequence "r002" from Section 1.1 of the SAM specification:

BioAlignments 2.0.0

julia> using BioAlignments

julia> Alignment("3S6M1P1I4M", 1, 9)
ERROR: The P CIGAR operation is not yet supported.
Stacktrace:
 [1] Alignment(cigar::String, seqpos::Int64, refpos::Int64)
...

This PR

julia> using BioAlignments

julia> Alignment("3S6M1P1I4M", 1, 9)
Alignment:
  aligned range:
    seq: 0-14
    ref: 8-18
  CIGAR string: 3S6M1P1I4M

My testing also indicates that this fixes #56,

BioAlignments 2.0.0

julia> using BioAlignments, BioSequences

julia> aligned_sequence = AlignedSequence(dna"AGAAGTTTATCTGTGTGAACTTCTTGGCTTAGTATCGTTGAGAAGAATCGAGAGATTAGTGCAGTTTAAA",
       Alignment("75H25M75H", 1, 1))
---------------------------------------------------------------------------·························---------------------------------------------------------------------------
AGAAGTTTATCTGTGTGAACTTCTTGGCTTAGTATCGTTGAGAAGAATCGAGAGATTAGTGCAGTTTAAAError showing value of type AlignedSequence{LongDNASeq}:
ERROR: BoundsError: attempt to access LongDNASeq at index [71]
Stacktrace:
...

This PR

julia> using BioAlignments, BioSequences

julia> aligned_sequence = AlignedSequence(dna"AGAAGTTTATCTGTGTGAACTTCTTGGCTTAGTATCGTTGAGAAGAATCGAGAGATTAGTGCAGTTTAAA",
              Alignment("75H25M75H", 1, 1))
·························
AGAAGTTTATCTGTGTGAACTTCTT

☑️ Checklist

  • 🎨 The changes implemented is consistent with the julia style guide.
  • 📘 I have updated and added relevant docstrings, in a manner consistent with the documentation styleguide.
  • 📘 I have added or updated relevant user and developer manuals/documentation in docs/src/.
  • 🆗 There are unit tests that cover the code changes I have made.
  • 🆗 The unit tests cover my code changes AND they pass.
  • 📝 I have added an entry to the [UNRELEASED] section of the manually curated CHANGELOG.md file for this repository.
  • 🆗 All changes should be compatible with the latest stable version of Julia.
  • 💭 I have commented liberally for any complex pieces of internal code.

@MillironX MillironX marked this pull request as draft May 4, 2022 20:43
@MillironX MillironX marked this pull request as ready for review May 4, 2022 21:50
Copy link
Member

@kescobo kescobo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for putting in the effort to add this - huge win!

I don't feel confident in the thoroughness of my review, but I don't see anything that looks obviously bad, and the tests still pass so I'm good with this. I would love for someone else that uses this functionality more regularly to take a look as well, though.

src/alignment.jl Show resolved Hide resolved
@MillironX
Copy link
Member Author

@jakobnissen, @CiaranOMara, could I get a review from one of you?

@codecov
Copy link

codecov bot commented May 5, 2022

Codecov Report

Base: 87.60% // Head: 88.20% // Increases project coverage by +0.59% 🎉

Coverage data is based on head (96f8ee7) compared to base (b92ccf3).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #64      +/-   ##
==========================================
+ Coverage   87.60%   88.20%   +0.59%     
==========================================
  Files          16       16              
  Lines        1138     1178      +40     
==========================================
+ Hits          997     1039      +42     
+ Misses        141      139       -2     
Impacted Files Coverage Δ
src/alignment.jl 84.66% <100.00%> (+4.19%) ⬆️
src/operations.jl 88.23% <100.00%> (+0.73%) ⬆️
src/pairwise/alignment.jl 96.99% <100.00%> (+0.41%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Member

@CiaranOMara CiaranOMara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.

It might be worth separating 349e204 into a new PR while a decision is made about how contributions are merged around #44.

BTW, nicely scoped commits with clear messages!

src/alignment.jl Outdated Show resolved Hide resolved
src/alignment.jl Outdated Show resolved Hide resolved
@jakobnissen
Copy link
Member

LGTM. Thanks for putting in the work.
W.r.t #44 - we should roll back the change, then merge #72, then we can apply this.
This debacle all suggests to me that moving forward, we need to be more careful about what is API and what is internals, but let's have that change later.

@MillironX
Copy link
Member Author

I agree with @jakobnissen, we should postpone merging this until #44 and #72 are both in master, rather than reverting 349e204.

@kescobo kescobo modified the milestone: Version 3.0 Jun 16, 2022
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
According to the SAM spec, "H can only be present as the first and/or last
operation." In addition, it's not even classified as an insert operation
anymore.

Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
…tion

Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
The test suite included logic for treating ends of alignment differently,
but due to the loop condition, this logic was always false. Remove that
logic entirely.

Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
MillironX and others added 9 commits October 1, 2022 13:57
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
The `count_aligned` function was updated by @alyst to use the new `alnpos`
field in cc8237c ("count_aligned(): optimize with alnpos"). That
optimization assumes that all anchor types consume either sequence or
reference positions as they consume alignment positions, which isn't true
for meta-operations. Switch back to the old way, which does not make any
assumptions about how many bases each operation consumes.

Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
Co-authored-by: Kevin Bonham <kevbonham@gmail.com>
Signed-off-by: Thomas A. Christensen II <25492070+MillironX@users.noreply.github.com>
@MillironX
Copy link
Member Author

Failed downstream is expected: this is a breaking change due to the inclusion of #44.

@MillironX MillironX merged commit c7bb5a8 into BioJulia:master Oct 1, 2022
@MillironX MillironX deleted the operations-fix branch October 1, 2022 19:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ref2seq returns a position outside of the sequence length
4 participants