Unicode encoding in BGZF #2512

peterjc · 2020-01-08T11:49:44Z

The work and discussion in #2468 leaves the BGZF code in a state where it uses a mix of latin1 (used in the Bio._py3k conversion functions) and the default encoding.

Properly, when used in text mode, the BGZF code should follow the gzip library and take a user-specified encoding.
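For reference, this is the standard library gzip behaviour being pointed at: in text mode, `gzip.open` accepts an `encoding` argument for both writing and reading. The sketch below is illustrative only; the helper name `roundtrip_text` is made up for this example and is not part of Biopython or BGZF.

```python
import gzip
import os
import tempfile


def roundtrip_text(text, encoding):
    """Write text to a temporary gzip file and read it back (hypothetical helper)."""
    fd, path = tempfile.mkstemp(suffix=".gz")
    os.close(fd)
    try:
        # Text mode ("wt"/"rt") takes a user-specified encoding.
        with gzip.open(path, "wt", encoding=encoding) as handle:
            handle.write(text)
        with gzip.open(path, "rt", encoding=encoding) as handle:
            return handle.read()
    finally:
        os.remove(path)


sample = "naïve café"
recovered = roundtrip_text(sample, "utf-8")
```

The same text survives the round trip as long as the reader uses the encoding the writer used.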

Self-assigning issue.

peterjc · 2020-01-09T11:16:13Z

This turns out to be complicated. Python 2 text mode didn't do universal newlines by default, but Python 3 does. This means that on examples like Tests/SamBam/ex1.bam, which contains Windows-style line endings, we run into complications in the tests: the BGZF code does not do universal newlines, but gzip.open does in text mode.

$ python
Python 3.7.4 (default, Aug 13 2019, 15:17:50) 
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> with gzip.open("ex1.bam", "rb") as handle: raw_bytes = handle.read()
... 
>>> with gzip.open("ex1.bam", "rt") as handle: open_as_text = handle.read()
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xxx/miniconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 100: invalid start byte
>>> with gzip.open("ex1.bam", "rt", encoding="latin1") as handle: open_as_latin1 = handle.read()
... 
>>> len(raw_bytes)
456614
>>> len(raw_bytes.decode("latin1"))
456614
>>> len(open_as_latin1)
456610
>>> raw_bytes.decode("latin1").replace("\r\n", "\n").replace("\r", "\n") == open_as_latin1
True
>>>

Sadly, attempting to implement universal newlines in BGZF text mode would be a major piece of work due to the special file offsets (the entire point of BGZF). However, we do need to document this limitation.
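To make the offset problem concrete, here is a minimal sketch using only `io.TextIOWrapper` (not the BGZF code itself). The sample bytes are invented for illustration. With `newline=None` (universal newlines) each `\r\n` collapses to `\n`, so character positions in the decoded text no longer match byte positions in the raw data; with `newline=""` nothing is translated and the positions still agree.

```python
import io

# Invented sample data with Windows-style line endings.
raw = b"@HD\tVN:1.0\r\nread1\r\nread2\r\n"


def decode(raw_bytes, newline):
    """Decode bytes as latin1 text using the given newline handling."""
    wrapper = io.TextIOWrapper(
        io.BytesIO(raw_bytes), encoding="latin1", newline=newline
    )
    return wrapper.read()


translated = decode(raw, None)  # universal newlines: "\r\n" becomes "\n"
untranslated = decode(raw, "")  # no newline translation

byte_offset = raw.index(b"read2")
print(byte_offset, untranslated.index("read2"), translated.index("read2"))
```

Any offset computed from the translated text would point at the wrong place in the uncompressed data, which is why newline translation and offset-based seeking do not mix easily.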

peterjc · 2020-01-10T16:41:25Z

Closed by #2517, sticking with hard-coded latin1 due to the nature of the blocked compression.
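As an aside, one property that makes latin1 convenient for this kind of data (a general fact about the codec, not a summary of the reasoning in #2517): it maps every byte value 0 to 255 to exactly one character, so decoding never fails and lengths are preserved. UTF-8 has neither property, and latin1 in turn cannot encode characters above U+00FF. A quick check:

```python
all_bytes = bytes(range(256))

# latin1 decodes any byte string, one character per byte, and round-trips.
as_text = all_bytes.decode("latin1")
roundtrip_ok = as_text.encode("latin1") == all_bytes

# The byte from the traceback above is not a valid UTF-8 start byte.
try:
    bytes([0x84]).decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# latin1 cannot represent characters beyond U+00FF, e.g. the euro sign.
try:
    "\u20ac".encode("latin1")
    latin1_rejects_euro = False
except UnicodeEncodeError:
    latin1_rejects_euro = True
```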

peterjc · 2020-01-10T16:56:14Z

Almost closed by #2517, but I spotted an oversight, fixed in #2532.

peterjc · 2021-09-23T14:45:38Z

I should have closed this back in Jan 2020.

It may not be impossible, but it would certainly be non-trivial to implement.

peterjc self-assigned this Jan 8, 2020

peterjc added the Enhancement label Jan 8, 2020

peterjc mentioned this issue Jan 8, 2020

Remove use of Bio._py3k (Python 2 / 3 compatibility) #2420

Closed

peterjc mentioned this issue Jan 9, 2020

Clarify BGZF text mode encoding limitations #2517

Merged

peterjc mentioned this issue Jan 10, 2020

Actually restrict to latin1 for BGZF output #2532

Merged

peterjc mentioned this issue Sep 23, 2021

BgzfReader argument mode not tested if a fileobj is specified #3748

Closed

peterjc closed this as completed Sep 23, 2021

peterjc mentioned this issue Sep 28, 2021

Added encoding to bgzf writer #1489

Closed

peterjc mentioned this issue Jan 3, 2022

Can bgzf._as_bytes calls be effectively replaced with bytes() in Python3? #3832

Closed

ikravets mentioned this issue Sep 4, 2023

bgzf: add utf-8 support #4436

Open
