Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove integer conversion and arithmetic #38

Merged
merged 3 commits into from
Mar 21, 2021

Conversation

jakobnissen
Copy link
Member

@jakobnissen jakobnissen commented Jun 4, 2020

As discussed on Slack

Currently, BioSymbols can be converted to integers:

julia> convert(Integer, DNA_M)
0x03

The problem with this is that convert is called automatically by Julia in many circumstances, which can lead to unexpected behaviour:

julia> a = dna"AAA"; a[1] = 3; a
3nt DNA Sequence:
MAA

The fact that DNA_M is internally stored as 0x03 is somewhat arbitrary implementation detail and should not be a part of the API for BioSymbols or BioSequences. Similarly, it's not entirely sensical why

julia> DNA_M + 1
DNA_G

In this PR, integer/biosymbol conversion and arithmetic is removed. This includes addition, subtraction, binary operations (|, ~ and &), and conversion.

  • To get an integer from a biosymbol x, use encoded_data(x).
  • To get a biosymbol from an integer, use encode(T, x).
  • Any future BioSymbols that might not be implemented as primitive types should implement:
    *encoded_data_eltype(::Type{T}),
    *encoded_data(x::T)
    *encode(::Type{T}, x)
    The two latter methods fall back to using reinterpret.

Edit: This is very breaking and will with almost 100% certainty create problems for other BioJulia packages. I can fix those, if we agree this should be merged

@codecov
Copy link

codecov bot commented Jun 4, 2020

Codecov Report

Merging #38 into master will increase coverage by 1.58%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master       #38      +/-   ##
===========================================
+ Coverage   98.41%   100.00%   +1.58%     
===========================================
  Files           3         3              
  Lines         126       112      -14     
===========================================
- Hits          124       112      -12     
+ Misses          2         0       -2     
Flag Coverage Δ
#unittests 100.00% <100.00%> (+1.58%) ⬆️
Impacted Files Coverage Δ
src/BioSymbols.jl 100.00% <100.00%> (+6.89%) ⬆️
src/aminoacid.jl 100.00% <100.00%> (ø)
src/nucleicacid.jl 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 04ee588...ae96f92. Read the comment docs.

@jakobnissen
Copy link
Member Author

Re-added logical operations for nucleotides. I guess it actually is useful and meaningful to have DNA_A | DNA_G == DNA_R.

@CiaranOMara
Copy link
Member

Concerning BioJulia/BioSequences.jl#102, I think "encode" means the compression of the 8-bit primitive.

What about using a method structure like this instead, which neatly skips around the need to choose a name?

function (s::Type{BioSymbol})(bits::UInt8)
    return reinterpret(s, bits)
end
function (s::Type{BioSymbol})(char::Char)
    return ...
end

@TransGirlCodes
Copy link
Member

Re-added logical operations for nucleotides. I guess it actually is useful and meaningful to have DNA_A | DNA_G == DNA_R.

Yeah, I think keeping the logical and bitwise ops is a good idea. The docs describe how the symbols are encoded - DNA being "one-hot", and the logical ops then make implementing a lot of things like complements and set memberships and so on a breeze.

@CiaranOMara
Copy link
Member

@jakobnissen, apologies, today I realise it is the 8-bit symbol encoding that is compressed.

@jakobnissen
Copy link
Member Author

Can we merge this, @benjward ?

@TransGirlCodes
Copy link
Member

TransGirlCodes commented Aug 18, 2020

@jakobnissen I think this is fine, will be a major release though. I'm happy with use of encode and encoded_data here but would be as happy just with constructor methods. encode makes sense to us from BioSequences, but explicit constructor methods might make more sense to the non-specialist julia user. We can also of course have both and one just calls the other.

@jakobnissen jakobnissen merged commit 4d11c1e into BioJulia:master Mar 21, 2021
@jakobnissen jakobnissen mentioned this pull request May 9, 2021
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants