Erroneous reverse_complement behaviour? #110
Comments
@mmattocks I'm seeing the same thing (BioJulia/FASTX.jl#23). I wonder if the same fix for
@cjprybol That's mega ungood; this is a pretty bad bug if you've had one of these functions in a pipeline. I tried your suggestion by inserting orphan!(seq) after line 169 in transformations.jl, but the bug persists. I'll hopefully look at this more soon.
That's a bummer that there isn't an obvious and easy temporary fix. I'd been hunting this bug for the past couple of days after a coworker found errors in my reports. When taking a slice of the sequence, the new subsequence still references the data of the original sequence. As long as there isn't any offset the function behaves fine, but the function needs to be adjusted to be aware of the slices (or
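The slice behaviour described above can be illustrated with a toy model. This is a minimal sketch, not the BioSequences.jl implementation: `ToySeq`, `slice`, `tostring`, `revcomp_buggy` and `revcomp_fixed` are all hypothetical names invented for illustration. It only assumes what the comment states, namely that a slice keeps a reference to the parent's data and differs from it by an offset.

```julia
# Toy model only: NOT the BioSequences.jl types. A sequence is a window
# (`part`) onto a backing vector that slices share with their parent.
struct ToySeq
    data::Vector{Char}
    part::UnitRange{Int}
end

ToySeq(s::AbstractString) = ToySeq(collect(s), 1:length(s))

# Slicing copies nothing: the slice keeps the parent's data and narrows `part`.
slice(seq::ToySeq, r::UnitRange{Int}) = ToySeq(seq.data, seq.part[r])

tostring(seq::ToySeq) = String(seq.data[seq.part])

function toy_complement(c::Char)
    c == 'A' && return 'T'
    c == 'T' && return 'A'
    c == 'C' && return 'G'
    c == 'G' && return 'C'
    error("unexpected symbol $c")
end

# Offset-unaware: walks the first length(part) symbols of the backing data,
# which is only right when the window starts at position 1.
revcomp_buggy(seq::ToySeq) =
    String([toy_complement(seq.data[i]) for i in length(seq.part):-1:1])

# Offset-aware: walks the slice's own window in reverse.
revcomp_fixed(seq::ToySeq) =
    String([toy_complement(seq.data[i]) for i in reverse(seq.part)])

parent = ToySeq("AT"^15 * "CG"^5)
s = slice(parent, 20:40)      # "TATATATATATCGCGCGCGCG"
revcomp_fixed(s)              # "CGCGCGCGCGATATATATATA"
revcomp_buggy(s)              # "TATATATATATATATATATAT"
```

On an unsliced sequence the two functions agree, which is consistent with the observation that the function behaves fine as long as there is no offset.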
Just did some similar tests. Interestingly, it seems I was unaffected by this bug when I was using my genome sampling functions that called reverse_complement() with slices, so the serialised dataset I have been working with is OK. This suggests that BioSequences 1.0 didn't have this bug, but I'm not sure.
That's correct. Based on timing, it looks like these changes only landed in time for v2.0.0.
OK, from your PR I can see why this occurs. However, I see two solutions that are better than the hotfix:
Personally I prefer solution 2. @jakobnissen what do you think?
That makes a lot of sense, thanks @benjward! I pushed an update to the PR to implement proposal 1. I gave implementing proposal 2 a shot, but I got lost in how to handle the Vector{UInt64} data.
Thanks, if you're happy to use that fix whilst I'll take a look this evening and see if I can come up with something to fix
Sorry for just seeing this now. Yes, the problem is indeed what you said. It's a little worrying to me that I missed writing tests for such a simple thing. It might be worth my going back to check whether subsequences are handled correctly for other transformations. I think proposal 1 (i.e.
@jakobnissen I'm trying to recall the conversations we had about the performance of this thing. The original implementation of reverse:

function Base.reverse(seq::LongSequence{A}) where {A<:NucleicAcidAlphabet}
    data = Vector{UInt64}(undef, seq_data_len(A, length(seq)))
    i = 1
    next = lastbitindex(seq)
    stop = bitindex(seq, 0)
    r = rem(offset(next) + bits_per_symbol(seq), 64)
    if r == 0
        @inbounds while next - stop > 0
            x = seq.data[index(next)]
            data[i] = reversebits(x, BitsPerSymbol(seq))
            i += 1
            next -= 64
        end
    else
        @inbounds while next - stop > 64
            j = index(next)
            x = (seq.data[j] << (64 - r)) | (seq.data[j - 1] >> r)
            data[i] = reversebits(x, BitsPerSymbol(seq))
            i += 1
            next -= 64
        end
        if next - stop > 0
            j = index(next)
            x = seq.data[j] << (64 - r)
            if r < next - stop
                x |= seq.data[j - 1] >> r
            end
            data[i] = reversebits(x, BitsPerSymbol(seq))
        end
    end
    return LongSequence{A}(data, 1:length(seq), false)
end

Worked for slices that share data, but I also recall that some of those
I did some experimenting and thought something like this:

@inline function reverse_data_copy2!(pred, dst::Vector{UInt64}, src::Vector{UInt64}, range, B::BT) where {
    BT <: Union{BitsPerSymbol{2}, BitsPerSymbol{4}, BitsPerSymbol{8}}}
    j = 1
    @inbounds @simd for i in range
        dst[j] = pred(reversebits(src[i], B))
        j = j + 1
    end
end
testseq = LongSequence{DNAAlphabet{2}}("ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGATATATATATATATAT")
tsslice = testseq[490:610]
tscp = typeof(tsslice)(unsigned(length(tsslice)))
reverse_data_copy2!(pred, tscp.data, tsslice.data, index(lastbitindex(tsslice)):-1:index(firstbitindex(tsslice)), BioSequences.BitsPerSymbol(tsslice))
BioSequences.zero_offset!(tscp)

But it doesn't actually work.
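Both snippets in this comment lean on a per-word `reversebits` step. For readers unfamiliar with the trick, here is a rough sketch of what such a step could look like for 2-bit symbols. It is an assumption-laden illustration, not the BioSequences.jl code: `reverse_2bit_symbols` and `reverse_2bit_symbols_naive` are hypothetical names, and only `bswap` and the bit operators from Base are used.

```julia
# Hypothetical helper, not the BioSequences.jl `reversebits`: reverse the
# order of the 32 two-bit symbols packed into one UInt64.
function reverse_2bit_symbols(x::UInt64)
    # Reverse the order of the 8 bytes.
    x = bswap(x)
    # Swap the two nibbles inside every byte.
    x = ((x & 0xf0f0f0f0f0f0f0f0) >> 4) | ((x & 0x0f0f0f0f0f0f0f0f) << 4)
    # Swap the two symbols inside every nibble.
    x = ((x & 0xcccccccccccccccc) >> 2) | ((x & 0x3333333333333333) << 2)
    return x
end

# Slow reference: move symbol k (counted from the low bits) to position 31 - k.
function reverse_2bit_symbols_naive(x::UInt64)
    y = zero(UInt64)
    for k in 0:31
        sym = (x >> (2k)) & 0x03
        y |= sym << (2 * (31 - k))
    end
    return y
end
```

Reversing whole words this way only lines up when the slice starts on a word boundary. For a slice with a non-zero offset, the reversed symbols end up at the mirrored offset, so some shifting is still needed afterwards; that part is not shown here.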
@jakobnissen The only other place I can find
but I think patching this will fix anything that would be affected by this particular bug. I'll let you and @benjward work out the best path forward with regard to optimization, as you both understand this codebase much better than I do. And thanks to you both for developing this package!
This behaviour has been fixed on master. A bugfix release will follow. Thanks for pointing this out. I think the The in-place reverse_complement! is so much faster than it was in late v1 / early v2, and is intensely SIMD vectorized in a way the old-style reversing code was not, as the style of loop it requires is harder for SIMD generation. A lot of my projects use the convention
Thank you all for your attention to this bug and work on the package!
Noticed something strange in some of my tests as I prepare packages for submission to BioJulia. For one test, I was using reverse_complement on a particular sequence, and the output of the function is definitely wrong. This happens on my local machines but not on the Travis testing images, which is how I noticed it, so I'm not sure that this is replicable at all. I just want to make sure it's not a general phenomenon.
Expected Behavior
Ordinary reverse complement produced: from
TATATATATATCGCGCGCGCG
expect
CGCGCGCGCGATATATATATA
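As a sanity check on the expected output, the reverse complement can be computed on a plain string, independently of any BioSequences.jl code path. This is a minimal sketch; `dna_complement` and `naive_revcomp` are hypothetical helper names, not part of the package.

```julia
# Reference implementation on plain strings; independent of BioSequences.jl.
function dna_complement(c::Char)
    c == 'A' && return 'T'
    c == 'T' && return 'A'
    c == 'C' && return 'G'
    c == 'G' && return 'C'
    error("unexpected symbol $c")
end

# Reverse the string, then complement each symbol.
naive_revcomp(s::AbstractString) = join(dna_complement(c) for c in reverse(s))

naive_revcomp("TATATATATATCGCGCGCGCG")  # "CGCGCGCGCGATATATATATA"
```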
Current Behavior
Output of reverse_complement is
TATATATATATATATATATAT
whereas output of reverse_complement! is correct
Possible Solution / Implementation
No idea!
Steps to Reproduce (for bugs)
Your Environment
Status ~/.julia/environments/v1.4/Project.toml
[6e4b80f9] BenchmarkTools v0.5.0
[71d2fffc] BioBackgroundModels v0.1.0 [../../../../../srv/git/BioBackgroundModels]
[7e6ae17a] BioSequences v2.0.4
[a93c6f00] DataFrames v0.21.2
[31c24e10] Distributions v0.23.4
[c2308a5c] FASTX v1.1.2
[5fb14364] OhMyREPL v0.5.5
[92933f4c] ProgressMeter v1.3.1
[295af30f] Revise v2.7.2
[4c63d2b9] StatsFuns v0.9.5