Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add alignment string position support #44

Merged
merged 8 commits into from
Mar 4, 2021
Merged

Conversation

alyst
Copy link
Contributor

@alyst alyst commented Sep 22, 2020

Alignment string position support

Types of changes

This PR implements the following changes:

  • ✨ New feature (A non-breaking change which adds functionality).
  • πŸ› Bug fix (A non-breaking change, which fixes an issue).
  • πŸ’₯ Breaking change (fix or feature that would cause existing functionality to change).

πŸ“‹ Additional detail

  • Mapping positions from/to the input or reference sequence onto the common alignment string is quite useful (e.g. when comparing annotations attached to specific positions of the aligned strings). In addition to seq2ref() and ref2seq(), this PR adds 4 new functions to the public API: seq2agn, ref2agn, agn2seq, agn2ref, which allow to convert between input/reference/alignment positions in any direction.

  • Technically, keeping track of alignment positions is implemented as an AlignmentAnchor.agnpos field. Updating the pairwise alignment algorithms to support alignment positions was easy, only the common traceback utility macros had to be changed. It also should not affect the performance or memory consumption.

  • I also did a few minor cleanups (@inbounds were beneficial, refactored position conversion to be more generic and using Julia 1.x approach (avoid generated functions, rely on constant propagation, argument splatting)).

  • Provide a runnable example of use of your addition. This lets reviewers
    and others try out the feature before it is merged or makes it's way to release.

β˜‘οΈ Checklist

  • 🎨 The changes implemented is consistent with the julia style guide.
  • πŸ“˜ I have updated and added relevant docstrings, in a manner consistent with the documentation styleguide.
  • πŸ“˜ I have added or updated relevant user and developer manuals/documentation in docs/src/.
  • πŸ†— There are unit tests that cover the code changes I have made.
  • πŸ†— The unit tests cover my code changes AND they pass.
  • πŸ“ I have added an entry to the [UNRELEASED] section of the manually curated CHANGELOG.md file for this repository.
  • πŸ†— All changes should be compatible with the latest stable version of Julia.
  • πŸ’­ I have commented liberally for any complex pieces of internal code.

@alyst
Copy link
Contributor Author

alyst commented Sep 22, 2020

The failures due to Registry.toml problems, the testing even didn't start.
Also: is it possible to update CI to test on latest stable Julia and also on nightly (with allow_failures). I guess only LTS (is it 1.0 or 1.1?) Julia should be kept from the previous versions.

@alyst
Copy link
Contributor Author

alyst commented Sep 23, 2020

Some other notes on the PR:

  • In this PR I haven't yet updated the manual -- but if the maintainers think it's reasonable to include the alignment position functionality, I'll update it. The terminology (e.g. alignment position) also has to be clarified before committing to update the docs.
  • The PairwiseAlignment iterator interface could be updated to return the alignment position. But that would break the API, so I don't know what to do about that.
  • In addition to the alignment position, it would be nice to have API to query the input/reference sequence ranges aligned. Partly it exists as Alignment.firstref/lastref, but so far it's not public API and it doesn't cover input sequence positions.
  • Not a part of this PR, but something that I noticed: there Alignment, AlignedSequence and PairwiseAlignment types. Judging from the names, PariwiseAlignment and Alignment are expected to be from the same type hierarchy, but in fact these are very different types. Perhaps, it's possible to rename Alignment into something like AlignmentDefinition or AlignmentTransform (and fix the .jl filename and field names of that type accordingly (e.g. into transform -- right now in some parts of the code one may read aln.a.aln.anchors)) to avoid the confusion.

@alyst
Copy link
Contributor Author

alyst commented Oct 3, 2020

I don't know who would be the best to review, but since you were committing recently, maybe you can take a look, @benjward ? Or maybe you can suggest a more suitable reviewer? Thanks in advance!

@alyst
Copy link
Contributor Author

alyst commented Mar 3, 2021

@benjward @jakobnissen If you would have time, could you please review this PR?

@alyst alyst closed this Mar 3, 2021
@alyst alyst reopened this Mar 3, 2021
Copy link
Member

@jakobnissen jakobnissen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for just getting around to it now. I can't wrap my head around the logic, but despite a few minor issues, it looks good.

I'm not 100% sure what this functionality is useful for, though. Does the concept of an "alignment position" make sense?

Comment on lines 27 to 30
seq2agn,
ref2agn,
agn2seq,
agn2ref,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the rest of the codebase, "alignment" is abbreviated aln. Perhaps a search-replace for "agn"->"aln" would be best.

Map a position `i` from sequence to reference.
"""
function seq2ref(aln::Alignment, i::Integer)::Tuple{Int,Operation}
idx = findanchor(aln, i, Val{true})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idiomatically, you usually use Val by doing findanchor(aln, i, Val(true)), and then in the type signature that uses Val, you say

findanchor(aln::Alignment, i::Integer, ::Val{my_bool}) where my_bool

, i.e. using ::Val{Foo} instead of ::Type{Val{Foo}}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was never sure whether Val(...) generates a code to allocate that object or not, so to avoid allocation I was using type approach, but I'll change it to Val().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no allocation - the two things are completely equivalent. It's just the norm to use instances instead of types whenever you have a zero-field struct. (e.g. Julia's Base and the rest of BioJulia does it).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, that was not my code. %) In the refactored functions Val is replaced by passing function references.

src/alignment.jl Show resolved Hide resolved

Map a position `i` from sequence to reference.
"""
seq2ref(aln::Alignment, i::Integer) = pos2pos(aln, i, seqpos, refpos)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - no variable named seqpos / refpos

src/alignment.jl Outdated
return j
end
# invariant (activate this for debugging)
#@assert anchors[lo].$pos < i ≀ anchors[hi].$pos
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this assert be re-enabled? Otherwise it should be deleted, I think.

Copy link
Contributor Author

@alyst alyst Mar 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be still relevant, because the function logic was not touched, I've just added support for aligned position + made it a normal function, not @generated.
I'd rather keep it here as a comment: it nicely documents the constraint of the algorithm, but the context might be performance-critical to enable it.
I'll update it to use the current way of getting the position (pos(anchors...))) and check internally (without pushing here) whether it doesn't throw in tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I checked and tests do not throw if the line is enabled.

@@ -113,7 +113,7 @@ function Base.convert(::Type{Char}, op::Operation)
end

function Base.print(io::IO, op::Operation)
write(io, convert(Char, op))
write(io, isvalid(op) ? convert(Char, op) : '?')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we support invalid operations? Perhaps it's a better design to not allow invalid operations to exist. I'm not sure though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The time has passed, I don't know why I've added it. Maybe when I'll be revisiting the PR, I'll see why.
Otherwise I agree, it could be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to keep this change:

  • to be in sync with show() (see below) that doesn't throw
  • in general, if you cannot print incorrect object, it's hard to figure out what is the problem. If we just throw an exception the user would not see the context (sequence position).

src/alignment.jl Outdated Show resolved Hide resolved
@alyst
Copy link
Contributor Author

alyst commented Mar 3, 2021

I'm not 100% sure what this functionality is useful for, though. Does the concept of an "alignment position" make sense?

For multiple sequence alignments it definitely does, because it's actually the column of MSA.

But it also does make sense for pairwise one. If you want to display the pairwise alignment, you may need to know what is the position of a specific letter from either sequence in the aligned sequence.
My use case was to correctly display the post-translational modifications (phosphorylation, ubiquitination etc) of inidividual amino acids on the pairwise-aligned homologous protein sequences.
There could be regions that are present only on the first sequence or regions that are present only on the second one. So you definitely need an aligned position, which preserves everything.

@alyst
Copy link
Contributor Author

alyst commented Mar 3, 2021

@jakobnissen It looks like at the moment CI seems broken. Should this package be updated to use GitHab actions for CI?
I can try submitting the PR for that, but maybe it would be better if someone who has better knowledge of BioJulia ecosystem is doing that.

@jakobnissen
Copy link
Member

Yeah... BioJulia is strapped for dev manpower at the moment, so there is a lot of cleanup and housekeeping that needs to be done in all sorts of packages, and just haven't been done. If you make a PR, I'll merge it! You can look at e.g. BioSequences for how to do this.

@alyst alyst mentioned this pull request Mar 3, 2021
otherwise Anchors with invalid operation (and these anchors are
created temporarily during traceback) would throw upon printing
@codecov
Copy link

codecov bot commented Mar 3, 2021

Codecov Report

Merging #44 (c8e434c) into master (3e3910c) will decrease coverage by 0.41%.
The diff coverage is 87.50%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #44      +/-   ##
==========================================
- Coverage   88.82%   88.41%   -0.42%     
==========================================
  Files          17       17              
  Lines        1056     1079      +23     
==========================================
+ Hits          938      954      +16     
- Misses        118      125       +7     
Impacted Files Coverage Ξ”
src/BioAlignments.jl 100.00% <ΓΈ> (ΓΈ)
src/alignment.jl 79.16% <76.92%> (-4.62%) ⬇️
src/alignedseq.jl 96.92% <100.00%> (+0.31%) ⬆️
src/anchors.jl 83.33% <100.00%> (+8.33%) ⬆️
src/operations.jl 90.32% <100.00%> (ΓΈ)
src/pairwise/algorithms/common.jl 91.17% <100.00%> (+4.21%) ⬆️
src/pairwise/algorithms/hamming_distance.jl 100.00% <100.00%> (ΓΈ)
src/pairwise/alignment.jl 99.06% <100.00%> (-0.05%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Ξ” = absolute <relative> (impact), ΓΈ = not affected, ? = missing data
Powered by Codecov. Last update 3e3910c...cc8237c. Read the comment docs.

- add position getters for AlignmentAnchor
- convert findanchor() from generated function to normal one with
  position getter as a param (in Julia constant propagation we trust)
- generic pos2pos() function that gets called by seq2ref/ref2seq()
@jakobnissen
Copy link
Member

@alyst Ping me when it's ready to be merged :)

- add alnpos to each anchor
- add mapping to/from alnpos
- add alnpos support to traceback
@alyst
Copy link
Contributor Author

alyst commented Mar 4, 2021

@jakobnissen Thank you! I think the PR is good to go.
I have addressed your suggestions, also I have added tests to the new aln2xxx/xxx2aln() methods that were not covered.
There are still new lines in the PR that are not covered, but they were derived from the existing exceptions that got more specific after alnpos was added.
They were not covered by the tests before. So I think it's fine to merge this PR, and address the coverage later by introducing more test cases.

@jakobnissen jakobnissen merged commit bc9ad52 into BioJulia:master Mar 4, 2021
@jakobnissen
Copy link
Member

Thanks for putting in the work!

@alyst alyst deleted the agn_pos branch March 4, 2021 11:07
@timholy
Copy link
Contributor

timholy commented Nov 24, 2021

alyst added a commit to alyst/BioAlignments.jl that referenced this pull request Nov 25, 2021
@alyst alyst mentioned this pull request Nov 25, 2021
11 tasks
@MillironX MillironX mentioned this pull request Jun 8, 2022
MillironX added a commit that referenced this pull request Oct 1, 2022
Downstream tests are failing because the breaking change from
#44 was merged here.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

None yet

3 participants