Clarify BGZF text mode encoding limitations #2517

peterjc · 2020-01-09T11:40:53Z

This pull request addresses issue #2512 and documents a limitation (very reasonable under Python 2, but likely a surprise under Python 3): BGZF in text mode does not support universal newlines mode.

I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
I have read the CONTRIBUTING.rst file, have run flake8 locally, and
understand that AppVeyor and TravisCI will be used to confirm the Biopython unit
tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files
NEWS.rst and CONTRIB.rst as part of this pull request, am listed
already, or do not wish to be listed. (This acknowledgement is optional.)

peterjc · 2020-01-09T11:44:39Z

This also partly addresses #2490 by fixing that for BGZF.

[Update: Cherry-picked that tested commit from this branch direct to the master]

codecov · 2020-01-09T11:56:24Z

Codecov Report

Merging #2517 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #2517      +/-   ##
==========================================
+ Coverage   84.78%   84.78%   +<.01%     
==========================================
  Files         320      320              
  Lines       52344    52344              
==========================================
+ Hits        44379    44380       +1     
+ Misses       7965     7964       -1

| Impacted Files | Coverage Δ | |
|---|---|---|
| Bio/bgzf.py | `91.57% <ø> (ø)` | ⬆️ |
| Bio/motifs/matrix.py | `82.24% <0%> (+0.27%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7cd81c0...a944300.

peterjc · 2020-01-09T17:57:24Z

Thinking about this, universal newlines mode should be OK: it only alters the within-block part of the virtual offset.
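
To make the "within-block part" concrete, here is a minimal standalone sketch of how a BGZF virtual offset packs a 48-bit block start and a 16-bit within-block offset. The helper names and the sample block contents are illustrative assumptions for this sketch, not the code in `Bio/bgzf.py`.

```python
def make_virtual_offset(block_start_offset, within_block_offset):
    """Pack a block start (48 bits) and a within-block offset (16 bits)."""
    if not 0 <= within_block_offset < 2**16:
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start_offset < 2**48:
        raise ValueError("block start offset must fit in 48 bits")
    return (block_start_offset << 16) | within_block_offset


def split_virtual_offset(virtual_offset):
    """Return (block_start_offset, within_block_offset)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


# Hypothetical decompressed block contents with Windows line endings.
raw_block = b"a\r\nb\r\nc"
translated = raw_block.replace(b"\r\n", b"\n")

# Position of "c" before and after newline translation.
raw_within = raw_block.index(b"c")          # 6
translated_within = translated.index(b"c")  # 4

# Only the low 16 bits differ; the block start (here 1000) is untouched.
raw_offset = make_virtual_offset(1000, raw_within)
translated_offset = make_virtual_offset(1000, translated_within)
```

Both offsets point into the same compressed block, which is why translation within a single block looks harmless at first glance.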

peterjc · 2020-01-09T21:16:40Z

Thinking more, universal newlines mode has a corner case: a CR at the end of one compression block and an LF at the start of the next.
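
A small sketch of that corner case, using plain byte strings to stand in for two decompressed blocks. The `translate_newlines` helper and the block contents are assumptions made for illustration; this is not how `Bio/bgzf.py` handles newlines.

```python
def translate_newlines(data):
    """Naive universal-newlines translation: CRLF and lone CR both become LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# A CRLF pair that happens to straddle a block boundary.
block_one = b"line one\r"
block_two = b"\nline two\r\n"

# Translating each block independently sees a lone CR, then a lone LF.
per_block = translate_newlines(block_one) + translate_newlines(block_two)

# Translating the joined stream sees a single CRLF pair.
whole_stream = translate_newlines(block_one + block_two)

print(per_block)     # contains a spurious blank line
print(whole_stream)  # the intended result
```

Per-block translation turns one line ending into two, so any newline handling would need to carry state across block boundaries.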

Moreover, that same problem likely applies to most Unicode encodings, especially variable-length encodings like UTF-8.

The BGZF code might only be reliable on single-byte encodings like latin1 (which it is currently hard coded to use). https://en.wikipedia.org/wiki/Single-byte_encoding
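
The difference can be sketched with plain byte strings, again standing in for decompressed blocks. The sample text and split position are assumptions chosen so that a two-byte UTF-8 character lands on the boundary.

```python
text = "café"

# UTF-8 encodes "é" as two bytes, so a block boundary can fall inside it.
utf8_bytes = text.encode("utf-8")
utf8_block_one, utf8_block_two = utf8_bytes[:4], utf8_bytes[4:]

try:
    utf8_block_one.decode("utf-8")
    utf8_split_ok = True
except UnicodeDecodeError:
    # The first block ends with an incomplete multi-byte sequence.
    utf8_split_ok = False

# latin1 is one byte per character, so every split point is safe.
latin1_bytes = text.encode("latin1")
latin1_split_ok = all(
    latin1_bytes[:i].decode("latin1") + latin1_bytes[i:].decode("latin1") == text
    for i in range(len(latin1_bytes) + 1)
)
```

Decoding block by block only works when no character can span two blocks, which is the case for single-byte encodings.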

Perhaps I should deprecate the text mode support...

peterjc · 2020-01-10T10:42:53Z

Reading https://docs.python.org/3.6/library/codecs.html I still think only single-byte encodings will work, i.e. latin1 (aka iso-8859-1), or perhaps other charmaps like the Windows-centric cp1252, which again defines at most 256 characters.

peterjc · 2020-01-10T15:27:09Z

@mdehoon could you look at this for comment?

Who else has a good grasp of encodings?

peterjc · 2020-01-10T16:40:18Z

Thanks!

peterjc · 2020-01-10T16:44:27Z

Hah - skimmed it again when closing #2512, and I realise I left the latin1/utf8 mismatch in place. The tests are not broad enough... probably an omission on the writing side.

peterjc · 2020-01-10T16:52:25Z

So right now it insists on latin1 on reading (because we can't deal with multi-byte characters split between blocks), but the code still takes the default encoding on output.
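
A rough illustration of why the asymmetry matters, assuming the default output encoding is UTF-8 (as the latin1/utf8 mismatch mentioned above suggests). The sample string is made up, and plain `encode`/`decode` calls stand in for the BGZF writer and reader.

```python
original = "naïve café\n"

# Writer side: default encoding, assumed here to be UTF-8.
written = original.encode("utf-8")

# Reader side: hard coded latin1.
read_back = written.decode("latin1")

# Each non-ASCII character comes back as two latin1 characters.
mismatch = read_back != original

# Using latin1 on both sides round-trips cleanly.
symmetric = original.encode("latin1").decode("latin1")
```

No exception is raised on the mismatched path, since every byte is valid latin1, so the corruption would be silent.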

We could probably support any codec on output... but for symmetry putting it back to the previous release's behaviour and using latin1 again is safer.

peterjc mentioned this pull request Jan 9, 2020

Remove/refactor redundant sys.version_info checks #2490

Closed

peterjc added 4 commits January 10, 2020 11:40

Document latin1 and new-line limits to BGZF text mode

61f7213

Clarify use of latin1 in BGZF test

5f5ba65

Remove check for gzip bug in old versions of Python

9615d73

More use of context manager with gzip

919222a

peterjc force-pushed the bgzf_encoding branch from b07676d to 919222a Compare January 10, 2020 11:40

peterjc changed the title ~~Support user-specific encoding in BGZF text mode~~ Clarify BGZF text mode encoding limitations Jan 10, 2020

Must update test_SeqIO_index.py too

a944300

mdehoon approved these changes Jan 10, 2020

View reviewed changes

peterjc merged commit d3029cd into biopython:master Jan 10, 2020

peterjc deleted the bgzf_encoding branch January 10, 2020 16:40

peterjc mentioned this pull request Jan 10, 2020

Unicode encoding in BGZF #2512

Closed

peterjc mentioned this pull request Jan 10, 2020

Actually restrict to latin1 for BGZF output #2532

Merged

3 tasks
