Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify BGZF text mode encoding limitations #2517

Merged
merged 5 commits into from
Jan 10, 2020

Conversation

peterjc
Copy link
Member

@peterjc peterjc commented Jan 9, 2020

This pull request addresses issue #2512, and documents a limitation (very reasonably under Python 2, but likely a surprise under Python 3) that BGZF in text mode does not do universal new lines mode.

  • I hereby agree to dual licence this and any previous contributions under both
    the Biopython License Agreement AND the BSD 3-Clause License.

  • I have read the CONTRIBUTING.rst file, have run flake8 locally, and
    understand that AppVeyor and TravisCI will be used to confirm the Biopython unit
    tests and style checks pass with these changes.

  • I have added my name to the alphabetical contributors listings in the files
    NEWS.rst and CONTRIB.rst as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

@peterjc
Copy link
Member Author

peterjc commented Jan 9, 2020

This also partly addresses #2490 by fixing that for BGZF.

[Update: Cherry-picked that tested commit from this branch direct to the master]

@codecov
Copy link

codecov bot commented Jan 9, 2020

Codecov Report

Merging #2517 into master will increase coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2517      +/-   ##
==========================================
+ Coverage   84.78%   84.78%   +<.01%     
==========================================
  Files         320      320              
  Lines       52344    52344              
==========================================
+ Hits        44379    44380       +1     
+ Misses       7965     7964       -1
Impacted Files Coverage Δ
Bio/bgzf.py 91.57% <ø> (ø) ⬆️
Bio/motifs/matrix.py 82.24% <0%> (+0.27%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7cd81c0...a944300. Read the comment docs.

@peterjc
Copy link
Member Author

peterjc commented Jan 9, 2020 via email

@peterjc
Copy link
Member Author

peterjc commented Jan 9, 2020

Thinking more, universal new line mode has a corner case with CR at the end of one compression block, and LF at the start of the next.

Moreover, that same problem likely applies to most unicode encodings - especially variable length encodings like UTF8.

The BGZF code might only be reliable on single-byte encodings like latin1 (which is what it currently is hard coded to use). https://en.wikipedia.org/wiki/Single-byte_encoding

Perhaps I should deprecate the text mode support...

@peterjc
Copy link
Member Author

peterjc commented Jan 10, 2020

Reading https://docs.python.org/3.6/library/codecs.html I still think only single-byte encodings will work, i.e. latin1 aka iso-8859-1 or perhaps other charmaps like the Windows centric cp1252 which again only defines 256 characters.

@peterjc peterjc changed the title Support user-specific encoding in BGZF text mode Clarify BGZF text mode encoding limitations Jan 10, 2020
@peterjc
Copy link
Member Author

peterjc commented Jan 10, 2020

@mdehoon could you look at this for comment?

Who else has a good grasp of encodings?

@peterjc
Copy link
Member Author

peterjc commented Jan 10, 2020

Thanks!

@peterjc peterjc merged commit d3029cd into biopython:master Jan 10, 2020
@peterjc peterjc deleted the bgzf_encoding branch January 10, 2020 16:40
@peterjc
Copy link
Member Author

peterjc commented Jan 10, 2020

Hah - skimmed it again when closing #2512, and I realise I left the latin1/utf8 mismatch in place. The tests are not broad enough... probably an omission on the writing side.

@peterjc
Copy link
Member Author

peterjc commented Jan 10, 2020

So right now it insists on latin1 on reading (because we can't deal with multi-byte characters split between blocks), but the code still takes the default encoding on output.

We could probably support any codec on output... but for symmetry putting it back to the previous release's behaviour and using latin1 again is safer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants