Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement broadcasting for biosequences #135

Open
jakobnissen opened this issue Feb 28, 2021 · 7 comments · May be fixed by #171
Open

Implement broadcasting for biosequences #135

jakobnissen opened this issue Feb 28, 2021 · 7 comments · May be fixed by #171

Comments

@jakobnissen
Copy link
Member

TL;DR

For v3, we should add broadcasting to BioSequence, and make my_seq[1:4] = DNA_A error.

Rationale

I was working on #133 , but can't get it to work properly. When calling getindex and setindex! with BioSequences, there always seems to be an edge case where dispatch goes the wrong way, or something is not implemented, or you get a method ambiguity error. I remember thinking the same when doing #120 .

I hate method ambiguity errors. They're a critical error in the sense that they completely break your workflow. They can't be found statically, and they're extremely hard to test for. The LongSequence test suite is compresensive, but I can guarantee you, I can find method ambiguity in something as simple as setindex! there.

Why is this so hard? Why is it so hard for BioSequences in particular?

I've come to think one of the reasons is that my_seq[inds] = x has two distinct semtantic meaning in BioSequences right now, namely these two:

julia> my_seq = dna"GGCTA";

julia> my_seq[2:4] = dna"AAC"; my_seq
5nt DNA Sequence:
GAACA

julia> my_seq[2:4] = DNA_M; my_seq
5nt DNA Sequence:
GMMMA

Having two distinct semantic meanings covered by the same function is a bad idea. Here's what happens if you do it.

  1. Implement both methods. But oh, we need to have them be different signatures to dispatch to the right one. Let's implement setindex!(::BioSequence, ::Any, ::Vector{Bool}) and setindex!(::BioSequence, ::BioSequence, ::Vector{Bool}) separately. This works ONLY because one signature is more specific than the other, NOT because the signatures are incompatible. Note that we already loses some flexibility, because we now can't do my_seq[1:2] = [DNA_A, DNA_C], since that would call the wrong method.
  2. Make new biosequence e.g. LongSequence or LongSubSeq. We need to specialize one of the setindex! methods, so we add setindex!(::LongSequence, ::Any, ::Vector{Bool}).
  3. That new method is now more specific than the fallback (of course), but that now messes up the original dispatch, that was built on one method being more specific than the other. Method ambiguity, so we add setindex!(::LongSequence, ::LongSequence, ::Vector{Bool}). We lose even more flexibility, and the original method error may still be exposed if you try to do my_seq[1:2] = mer"TA". Also, we have to duplicate efforts by having this new method, whose content is identical to the original one.
  4. Repeat with a new biological sequence. Same as above, but much worse, since any combination of types can now error

We've gone down a wrong path. I can see two solutions:

  1. We never implement two semantically different methods with a type intersection between their arguments. That means we discard setindex!(::BioSequence, ::Any, ::Vector{Bool}), and instead have a method where we restrict the middle argument to be e.g. Union{BioSymbol, Char} or something, and the other method to have Union{BioSequence, AbstractVector}. The disadvantage here is that we lose some ducktyping.
  2. We never implement two semantically different method within the same function, full stop. That means we get rid of setindex!(::BioSequence, ::Any, ::Vector{Bool}) (and a few other functions like it), and replace them with new functions. I think we can do this by overloading broadcasting. The disadvantage here is that I don't know how complex implementing broadcasting is.
@jakobnissen jakobnissen added this to the v3.0.0 milestone Feb 28, 2021
@jakobnissen jakobnissen mentioned this issue Mar 15, 2021
5 tasks
@TransGirlCodes
Copy link
Member

TransGirlCodes commented Jul 17, 2021

My instinct with this, is to simply look at what can be done, indexing with Vectors, and basically have getindex and setindex for BioSequences let you do the same/have the same semantics. Then any functionality that throws out, we think about a new function or something.

@TransGirlCodes
Copy link
Member

julia> a = [1,2,3,4,5]
5-element Array{Int64,1}:
 1
 2
 3
 4
 5

julia> a[4:5] = [-1, -2]
2-element Array{Int64,1}:
 -1
 -2

julia> a
5-element Array{Int64,1}:
  1
  2
  3
 -1
 -2

julia> a[4:5] = -6
ERROR: ArgumentError: indexed assignment with a single value to many locations is not supported; perhaps use broadcasting `.=` instead?
Stacktrace:
 [1] setindex_shape_check(::Int64, ::Int64) at ./indices.jl:258
 [2] macro expansion at ./multidimensional.jl:795 [inlined]
 [3] _unsafe_setindex!(::IndexLinear, ::Array{Int64,1}, ::Int64, ::UnitRange{Int64}) at ./multidimensional.jl:789
 [4] _setindex! at ./multidimensional.jl:785 [inlined]
 [5] setindex!(::Array{Int64,1}, ::Int64, ::UnitRange{Int64}) at ./abstractarray.jl:1153
 [6] top-level scope at REPL[9]:1

Ah well, there we go! I would say let's go that route!

@TransGirlCodes
Copy link
Member

julia> my_seq[2:4] = DNA_M
ERROR: MethodError: no method matching length(::DNA)
Closest candidates are:
  length(::CompositeException) at task.jl:41
  length(::Base.MethodList) at reflection.jl:869
  length(::ExponentialBackOff) at error.jl:259
  ...
Stacktrace:
 [1] setindex!(::LongSequence{DNAAlphabet{4}}, ::DNA, ::UnitRange{Int64}) at /Users/bward/repos/github/BioJulia/BioSequences/src/biosequence/indexing.jl:78
 [2] top-level scope at REPL[32]:1

@jakobnissen Ok, so it looks like this is already not possible anymore, probably due to the interface rework. It likely got removed then.
So it looks like it's just to implement broadcasting now? Which I'm happy to have a crack at.

@jakobnissen
Copy link
Member Author

Exactly, I removed it for v3. I'd be very happy if you could give a go at implementing broadcasting, I got pretty much nowhere when I tried! (the Julia docs are not exactly easy to understand on that point, see JuliaLang/julia#32066).

I'd want to be a little careful to make sure that biosequences have a broadcast style guarantees that the result is a biosequence instead of a vector - but maybe we just need to implement something and see how it feels.

@TransGirlCodes
Copy link
Member

So as far as I can tell, we need a style, and then we need rules, that dictate when the output should be a sequence or not.

So dna"AAAA" .= DNA_G should clearly return a sequence, but many vanilla broadcastings e.g. dna"AAAA" .== DNA_G, I would say are fine to just return vectors as per usual. What broadcasting does currently work with biosequences currently does it by coercing the sequence to a DNA[] array first, which obviously we don't want, because waste.

@TransGirlCodes
Copy link
Member

So, since broadcasting is a bit... complex, I'm going to be documenting my progress here so I can ask questions and the wider julia community can poke holes in what I'm doing: https://discourse.julialang.org/t/attempting-to-customise-broadcasting-for-biosequences/64892

@TransGirlCodes TransGirlCodes linked a pull request Jul 20, 2021 that will close this issue
11 tasks
@jakobnissen jakobnissen removed this from the v3.0.0 milestone Oct 31, 2021
@jakobnissen
Copy link
Member Author

Before this issue is resolved, we should remember to fix the method ambiguity that currently exists between these two methods:

Base.Broadcast.BroadcastStyle(s1::BioSequences.PFMBroadcastStyle, s2::Base.Broadcast.BroadcastStyle) in B
ioSequences at /Users/jakobnissen/code/BioSequences.jl/src/search/pwm.jl:74
Base.Broadcast.BroadcastStyle(::S, ::Base.Broadcast.Unknown) where S<:Base.Broadcast.BroadcastStyle in Ba
se.Broadcast at broadcast.jl:133

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants