Clarify BGZF text mode encoding limitations #2517
This also partly addresses #2490 by fixing it for BGZF. [Update: Cherry-picked that tested commit from this branch directly to master.]
Codecov Report
@@ Coverage Diff @@
## master #2517 +/- ##
==========================================
+ Coverage 84.78% 84.78% +<.01%
==========================================
Files 320 320
Lines 52344 52344
==========================================
+ Hits 44379 44380 +1
+ Misses 7965 7964 -1
|
Thinking about this, universal new line mode should be OK: it only alters the within-block part of the virtual offset.
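For readers unfamiliar with BGZF virtual offsets, here is a minimal standalone sketch of how the two parts are packed: the compressed block's start offset goes in the upper 48 bits and the within-block offset in the lower 16 bits. The helper names `pack_virtual_offset` and `unpack_virtual_offset` are made up for illustration and are not the Biopython API.

```python
def pack_virtual_offset(block_start, within_block):
    """Combine a block start (48 bits) and within-block offset (16 bits)."""
    if not 0 <= within_block < 2**16:
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start < 2**48:
        raise ValueError("block start offset must fit in 48 bits")
    return (block_start << 16) | within_block


def unpack_virtual_offset(virtual_offset):
    """Split a virtual offset back into (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


# Two positions in the same block share the block start and differ
# only in the low 16 bits.
a = pack_virtual_offset(100000, 10)
b = pack_virtual_offset(100000, 9)
print(unpack_virtual_offset(a))  # (100000, 10)
print(unpack_virtual_offset(b))  # (100000, 9)
```

Anything that changes how many characters come out of a block, such as newline translation, can therefore only shift the low 16 bits, provided the change never straddles a block boundary.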
|
Thinking more, universal new line mode has a corner case with CR at the end of one compression block and LF at the start of the next. Moreover, the same problem likely applies to most Unicode encodings, especially variable-length encodings like UTF-8, where a multi-byte character could be split between blocks. The BGZF code might only be reliable on single-byte encodings like latin1 (which it is currently hard-coded to use). https://en.wikipedia.org/wiki/Single-byte_encoding Perhaps I should deprecate the text mode support...
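To make both corner cases concrete, here is a rough sketch in plain Python that does not use `Bio.bgzf` at all. It assumes each block is decoded (and newline-translated) on its own, which is a simplification of what a block-based reader does; the helper names are hypothetical.

```python
def decode_blocks(blocks, encoding):
    """Decode each block independently, then join the pieces."""
    return "".join(block.decode(encoding) for block in blocks)


def universal_newlines(text):
    """Naive universal newline translation applied to one piece of text."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Case 1: a two-byte UTF-8 character split across a block boundary.
utf8_data = "café".encode("utf-8")            # b"caf\xc3\xa9"
utf8_blocks = [utf8_data[:4], utf8_data[4:]]  # b"caf\xc3" and b"\xa9"
try:
    decode_blocks(utf8_blocks, "utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The same text in latin1 is one byte per character, so any split is safe.
latin1_data = "café".encode("latin1")
latin1_ok = decode_blocks([latin1_data[:3], latin1_data[3:]], "latin1") == "café"

# Case 2: CR ends one block and LF starts the next.
crlf_blocks = [b"line1\r", b"\nline2\n"]
per_block = "".join(universal_newlines(b.decode("latin1")) for b in crlf_blocks)
whole = universal_newlines(b"".join(crlf_blocks).decode("latin1"))

print(utf8_ok, latin1_ok)  # False True
print(repr(per_block))     # 'line1\n\nline2\n' (spurious blank line)
print(repr(whole))         # 'line1\nline2\n'
```

An incremental decoder could carry state across blocks, but then a virtual offset pointing into the middle of the file would no longer be enough on its own to resume decoding.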
Reading https://docs.python.org/3.6/library/codecs.html I still think only single-byte encodings will work, i.e. |
b07676d to 919222a
@mdehoon could you look at this and comment? Who else has a good grasp of encodings?
Thanks! |
Hah - skimmed it again when closing #2512, and I realise I left the latin1/utf8 mismatch in place. The tests are not broad enough... probably an omission on the writing side. |
So right now it insists on latin1 on reading (because we can't deal with multi-byte characters split between blocks), but the code still takes the default encoding on output. We could probably support any codec on output... but for symmetry it is safer to restore the previous release's behaviour and use latin1 again.
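As a small illustration of why the asymmetry matters, this sketch uses plain `str.encode` and `bytes.decode` rather than the BGZF classes, and assumes the default output encoding is UTF-8 (which depends on the platform):

```python
original = "café"

# Writing with UTF-8 but reading back as latin1 corrupts non-ASCII text.
mismatched = original.encode("utf-8").decode("latin1")

# Using latin1 on both sides round-trips cleanly.
symmetric = original.encode("latin1").decode("latin1")

print(mismatched)  # cafÃ©
print(symmetric)   # café

# The cost of latin1 on output: characters outside U+0000 to U+00FF fail.
try:
    "€".encode("latin1")
    euro_ok = True
except UnicodeEncodeError:
    euro_ok = False
print(euro_ok)  # False
```

Pure ASCII text is unaffected either way, which may be why tests that only use ASCII would not catch the mismatch.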
This pull request addresses issue #2512, and documents a limitation (quite reasonable under Python 2, but likely a surprise under Python 3): BGZF in text mode does not support universal newlines mode.
I hereby agree to dual licence this and any previous contributions under both the Biopython License Agreement AND the BSD 3-Clause License.
I have read the CONTRIBUTING.rst file, have run flake8 locally, and understand that AppVeyor and TravisCI will be used to confirm the Biopython unit tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files NEWS.rst and CONTRIB.rst as part of this pull request, am listed already, or do not wish to be listed. (This acknowledgement is optional.)