Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sequence view #120

Merged
merged 12 commits into from
Feb 25, 2021
Merged

Add sequence view #120

merged 12 commits into from
Feb 25, 2021

Conversation

jakobnissen
Copy link
Member

@jakobnissen jakobnissen commented Jul 26, 2020

This adds a sequence view SeqView as discussed in #102 .

Because a LongSequence and SeqView share the same underlying encoding, many of the old LongSequence methods has been changed to now take a SeqOrView (which is just Union{LongSequence, SeqView})

You can construct a view with

  • SeqView(s::LongSequence, ::UnitRange)
  • @view my_seq[1:5]
  • SeqView(::Vector{UInt64}, ::UnitRange) (this one doesn't boundscheck)

Mutating methods that change the length of the view are intentionally not implemented, such as

  • filter!
  • map!
  • resize!
  • ungap!

@codecov
Copy link

codecov bot commented Jul 26, 2020

Codecov Report

Merging #120 (ea3201f) into master (1cfc0aa) will decrease coverage by 0.16%.
The diff coverage is 85.36%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #120      +/-   ##
==========================================
- Coverage   83.21%   83.04%   -0.17%     
==========================================
  Files          44       45       +1     
  Lines        2949     3061     +112     
==========================================
+ Hits         2454     2542      +88     
- Misses        495      519      +24     
Flag Coverage Δ
unittests 83.04% <85.36%> (-0.17%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/BioSequences.jl 50.00% <ø> (ø)
src/biosequence/biosequence.jl 54.54% <ø> (ø)
src/longsequences/constructors.jl 85.93% <0.00%> (+3.67%) ⬆️
src/minhash.jl 68.62% <33.33%> (-2.21%) ⬇️
src/biosequence/transformations.jl 79.16% <54.54%> (-4.44%) ⬇️
src/longsequences/transformations.jl 93.81% <72.72%> (-6.19%) ⬇️
src/longsequences/hash.jl 88.65% <75.00%> (-11.35%) ⬇️
src/longsequences/counting.jl 91.30% <88.23%> (ø)
src/longsequences/seqview.jl 93.54% <93.54%> (ø)
src/geneticcode.jl 73.00% <100.00%> (-0.41%) ⬇️
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1cfc0aa...09d7e92. Read the comment docs.

@TransGirlCodes
Copy link
Member

Is editing a LongSequence, making it smaller say as a result of ungap! or resize! going to present a problem for a previously created view of say @view seq[90:100] if the sequence is resized to only 90bp?

@jakobnissen
Copy link
Member Author

jakobnissen commented Aug 2, 2020

Is editing a LongSequence, making it smaller say as a result of ungap! or resize! going to present a problem for a previously created view of say @view seq[90:100] if the sequence is resized to only 90bp?

Yes, the view is completely unsafe. In theory it could even segfault if the underlying array is truncated. Perhaps I could add another boundscheck when accessing a view that the underlying buffer must be long enough? That would slow down boundschecked access just a little.
Edit: On second thought, that will help with the segfaults, but still not guard against out-of-bounds indexing, since an e.g. 10-nt long nucleotide seq will still have 64 full bits in the buffer despite only using 20/40 bits. Perhaps better to just state that views are unsafe.

An alternative is for the view to contain not a reference to the vector, but a reference to the longsequence itself.

Edit. On third thought, the views of Base do not bounds check this way, so let's just keep the unsafe behaviour.

@TransGirlCodes
Copy link
Member

Ok, I'm happy with whatever behaviour closest resembles what julians already expect from a view, since that will be most consistent. We'll just make sure to document any pitfalls properly.

@jakobnissen
Copy link
Member Author

@CiaranOMara @benjward As far as I can see, this PR is ready to go. I'll leave this PR up for a few months if you want to give it a test run. Maybe there are some convenient constructors I've forgotten to implement, or a well-hidden bug or whatever.

@CiaranOMara
Copy link
Member

I think it’s fair to say that each BioSequence would have their view counterpart. To setup a consistent naming pattern for other views and to be more specific about what view SeqView provides, what about renaming SeqView as SubLongSequence? This name would also be more in line with Julia’s naming of SubArray.

@jakobnissen
Copy link
Member Author

That's a good point. Perhaps my name is too generic. LongSubSequence is probably better (though technically, it doesn't need to be long, but the same could be said of LongSequence).

We probably will never need views for e.g. kmers, but surely there are other sequence types than LongSequence. So yeah, we should rename it

@CiaranOMara
Copy link
Member

I was proposing "Sub" as a prefix convention.

@jakobnissen jakobnissen merged commit 2527208 into BioJulia:master Feb 25, 2021
@jakobnissen
Copy link
Member Author

I think this has been hanging around long enough - I'm merging this now, and will begin using master branch as my daily driver. Hopefully I can hammer out any remaining bugs that way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants