Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Erroneous reverse_complement behaviour? #110

Closed
mmattocks opened this issue Jun 19, 2020 · 13 comments
Closed

Erroneous reverse_complement behaviour? #110

mmattocks opened this issue Jun 19, 2020 · 13 comments

Comments

@mmattocks
Copy link

mmattocks commented Jun 19, 2020

Noticed something strange in some of my tests as I prepare packages for submission to BioJulia. For one test, I was using reverse_complement on a particular sequence, and the output of the function is definitely wrong. This happens on my local machines but not on the Travis testing images, which is how I noticed this, so I'm not sure that this is replicable at all- just want to make sure it's not a general phenomenon.

Expected Behavior

Ordinary reverse complement produced: from
TATATATATATCGCGCGCGCG
expect
CGCGCGCGCGATATATATATA

Current Behavior

Output of reverse_complement is
TATATATATATATATATATAT
whereas output of reverse_complement! is correct

Possible Solution / Implementation

No idea!

Steps to Reproduce (for bugs)

julia> testseq=LongSequence{DNAAlphabet{2}}("ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGATATATATATATATAT")
1000nt DNA Sequence:
ATATATATATATATATATATATATATATATATATATATA…GCGCGCGCGCGCGCGCGCGCGCGATATATATATATATAT

julia> testseq[490:510]
21nt DNA Sequence:
TATATATATATCGCGCGCGCG

julia> reverse_complement(testseq[490:510])
21nt DNA Sequence:
TATATATATATATATATATAT

julia> reverse_complement!(testseq[490:510])
21nt DNA Sequence:
CGCGCGCGCGATATATATATA

Your Environment

Status ~/.julia/environments/v1.4/Project.toml
[6e4b80f9] BenchmarkTools v0.5.0
[71d2fffc] BioBackgroundModels v0.1.0 [../../../../../srv/git/BioBackgroundModels]
[7e6ae17a] BioSequences v2.0.4
[a93c6f00] DataFrames v0.21.2
[31c24e10] Distributions v0.23.4
[c2308a5c] FASTX v1.1.2
[5fb14364] OhMyREPL v0.5.5
[92933f4c] ProgressMeter v1.3.1
[295af30f] Revise v2.7.2
[4c63d2b9] StatsFuns v0.9.5

@cjprybol
Copy link
Contributor

@mmattocks I'm seeing the same thing BioJulia/FASTX.jl#23

I wonder if the same fix for reverse_complement! is also needed for reverse_complement 22d54a9

@mmattocks
Copy link
Author

@cjprybol That's mega ungood, this is a pretty bad bug if you've had one of these functions in a pipeline. I tried your suggestion by inserting orphan!(seq) after line 169 in transformations.jl, but the bug persists. I'll look at this more soon hopefully

@cjprybol
Copy link
Contributor

That's a bummer that there isn't an obvious and easy temporary fix. I'd been hunting this bug for the past couple of days after a coworker found errors in my reports.

When taking a slice of the sequence, the new subsequence still references the data of the original sequence. As long as there isn't any offset the function behaves fine, but the function needs to be adjusted to be aware of the slices (or parts as they are defined in the LongSequence type

julia> x = BioSequences.randdnaseq(7)
7nt DNA Sequence:
TTAGATT

julia> y = x[1:3]
3nt DNA Sequence:
TTA

julia> BioSequences.reverse_complement(BioSequences.reverse_complement(y)) == y
true

julia> y2 = x[2:4]
3nt DNA Sequence:
TAG

julia> BioSequences.reverse_complement(BioSequences.reverse_complement(y2)) == y2
false

julia> y2.data === y.data === x.data
true

julia> y.part
1:3

julia> y2.part
2:4

julia> x.part
1:7

@mmattocks
Copy link
Author

Just did some similar tests. Interestingly, it seems I was unaffected by this bug when I was using my genome sampling functions that called reverse_complement() with slices, so the serialised dataset I have been working with is ok. This suggests that BioSequences 1.0 didn't have this bug, but I'm not sure.

@cjprybol
Copy link
Contributor

That's correct, based on timing it looks like these changes only came in time for v2.0.0
ee06cfc

@TransGirlCodes
Copy link
Member

TransGirlCodes commented Jun 24, 2020

Ok from your PR I can see why this occurs. reverse_complent! does the right thing and calls orphan! before reverse_data! - any sequence that has orphan! called on it "owns" its data from then on, and the "offset" of the first element in a UInt64 register is always 0, so reverse_data! never has to worry about "slice sequences" that start partway through a vector.

However reverse_complement does not call orphan! rather it constructs a blank sequence that will contain the result, and then calls reverse_data_copy!, but reverse_data_copy! also does not consider sub-sequences, it reverses and copies an entire vector, which is not what should happen for sequences that are slices. So your hotfix works but you're going to be copying sequence data like mad due to your slices.

I see two solutions more optimal than the hotfix:

  1. Re-defining reverse_complement such that it simply makes a shallow copy of the input sequence, before doing reverse_complement! on the copy. This is simple and quick, but potentially has the performance drawback of copying the sequence shallow, before doing orphan to copy data, and to then traverse said data and reverse it in place.

  2. Work on reverse_data_copy! to let it work on sub-ranges of an input vector. This has the advantage of doing copying+reverse+pred(in this case complement) of the input data into a new sequence's vector in a single pass.

Personally I prefer solution 2. @jakobnissen what do you think?

@cjprybol
Copy link
Contributor

That makes a lot of sense, thanks @benjward! I pushed an update to the PR to implement proposal 1. I gave implementing proposal 2 a shot but I got lost in how to handle the Vector{UInt64} data

@TransGirlCodes
Copy link
Member

TransGirlCodes commented Jun 24, 2020

Thanks, if you're happy to use that fix whilst I'll take a look this evening and see if I can come up with something to fix reverse_data_copy!. The bug has v. serious consequences that might go uncaught by undisciplined users not accustomed to sanity checking their pipelines. So I'd rather fix the root cause than apply a bandaid. But I'll make sure to benchmark the PR and a reverse_data_copy! fix.

@jakobnissen
Copy link
Member

Sorry for just seeing this now. Yes, the problem is indeed what you said. It's a little worrying to me that I missed writing tests for such a simple thing - it might be worth for me to go back and check whether subsequences are handled correctly for other transformations.

I think proposal 1 (i.e. reverse_complement(x) = reverse_complement!(copy(x)) could be merged right now. It's not terribly slow, according to my quick benchmarks. Could be better, but that can wait a bit.

@TransGirlCodes
Copy link
Member

TransGirlCodes commented Jun 24, 2020

@jakobnissen I'm trying to recall the conversations we had about the performance of this thing.

The original implementation of reverse:

function Base.reverse(seq::LongSequence{A}) where {A<:NucleicAcidAlphabet}
    data = Vector{UInt64}(undef, seq_data_len(A, length(seq)))
    i = 1
    next = lastbitindex(seq)
    stop = bitindex(seq, 0)
    r = rem(offset(next) + bits_per_symbol(seq), 64)
    if r == 0
        @inbounds while next - stop > 0
            x = seq.data[index(next)]
            data[i] = reversebits(x, BitsPerSymbol(seq))
            i += 1
            next -= 64
        end
    else
        @inbounds while next - stop > 64
            j = index(next)
            x = (seq.data[j] << (64 - r)) | (seq.data[j - 1] >> r)
            data[i] = reversebits(x, BitsPerSymbol(seq))
            i += 1
            next -= 64
        end
        if next - stop > 0
            j = index(next)
            x = seq.data[j] << (64 - r)
            if r < next - stop
                x |= seq.data[j - 1] >> r
            end
            data[i] = reversebits(x, BitsPerSymbol(seq))
        end
    end
    return LongSequence{A}(data, 1:length(seq), false)
end

Worked for slices that share data, but I also recall that some of those next - stop > blah blah loops don't generate nice SIMD code, which your code does now - even the in-place version!

I did some experimenting and thought something like this:

@inline function reverse_data_copy2!(pred, dst::Vector{UInt64}, src::Vector{UInt64}, range, B::BT) where {
    BT <: Union{BitsPerSymbol{2}, BitsPerSymbol{4}, BitsPerSymbol{8}}}
    j = 1
    @inbounds @simd for i in range
        dst[j] = pred(reversebits(src[i], B))
        j = j + 1
    end
end

testseq = LongSequence{DNAAlphabet{2}}("ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCGCGCGCGCGCGCGCGCGCGCGCGATATATATATATATAT")

tsslice = testseq[490:610]

tscp = typeof(tsslice)(unsigned(length(tsslice)))
reverse_data_copy2!(pred, tscp.data, tsslice.data, index(lastbitindex(tsslice)):-1:index(firstbitindex(tsslice)), BioSequences.BitsPerSymbol(tsslice))
BioSequences.zero_offset!(tscp)

But it doesn't actually work.

@cjprybol
Copy link
Contributor

@jakobnissen The only other place I can find reverse_data_copy! being used is in the internal function _reverse. If I'm following the code calls correctly, it doesn't look like reverse_data_copy! can be called without either a call to orphan! or a shallow copy first. We could add a loop to the transformation tests to check for correctness with a full sequence and again with a slice to address your concern of:

it might be worth for me to go back and check whether subsequences are handled correctly for other transformations

but I think patching this will fix anything that would be affected by this particular bug. I'll let you and @benjward work out the best path forward in regards to optimization, as you both understand this code-base much better than I. And thanks to you both for developing this package!

@TransGirlCodes
Copy link
Member

This behaviour has been fixed on master. A bugfix release will follow. Thanks for pointing this out. I think the reverse_complement(x) = reverse_complement!(copy(x)) is probably the easiest route after trying some experiments today.

The in-place reverse_complement! is so much faster than it was in late v1 early v2, and is intensely SIMD vectorized in a way the old style reversing code was not, as the style of the loop it requires are harder for SIMD generation. A lot of my projects use the convention work(stuff) = work!(copy(stuff)) anyway, unless there's a more obvious and optimal way of doing it.

@mmattocks
Copy link
Author

Thank you all for your attention to this bug and work on the package!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants