Unicode encoding in BGZF #2512

Closed
peterjc opened this issue Jan 8, 2020 · 4 comments · May be fixed by #4436

@peterjc
Member

peterjc commented Jan 8, 2020

The work and discussion in #2468 leaves the BGZF code in a state where it uses a mix of latin1 (used in the Bio._py3k conversion functions) and the default encoding.

Properly, when used in text mode, the BGZF code should follow the gzip library and take a user-specified encoding.
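For reference, here is a minimal sketch of the gzip behaviour being referred to: in text mode, `gzip.open` accepts `encoding` (and `newline`) arguments. The helper name `roundtrip_text` and the temporary file name are hypothetical, purely for illustration, and nothing here is BGZF code.

```python
import gzip
import os
import tempfile


def roundtrip_text(text, encoding):
    """Write text to a gzip file and read it back with the same encoding.

    Hypothetical helper; newline="" switches off newline translation
    so the text comes back exactly as written.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.txt.gz")
        with gzip.open(path, "wt", encoding=encoding, newline="") as handle:
            handle.write(text)
        with gzip.open(path, "rt", encoding=encoding, newline="") as handle:
            return handle.read()


# The same text survives a round trip under either encoding,
# as long as the reader uses the encoding the writer used.
utf8_result = roundtrip_text("café\r\n", "utf-8")
latin1_result = roundtrip_text("café\r\n", "latin1")
```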

Self-assigning issue.

@peterjc
Member Author

peterjc commented Jan 9, 2020

This turns out to be complicated. Python 2 text mode didn't use universal newlines mode by default, but Python 3 does. This means that on examples like Tests/SamBam/ex1.bam, which has Windows line endings, we run into complications in the tests, as the BGZF code does not do universal newlines mode, but gzip.open does in text mode.

$ python
Python 3.7.4 (default, Aug 13 2019, 15:17:50) 
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> with gzip.open("ex1.bam", "rb") as handle: raw_bytes = handle.read()
... 
>>> with gzip.open("ex1.bam", "rt") as handle: open_as_text = handle.read()
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xxx/miniconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 100: invalid start byte
>>> with gzip.open("ex1.bam", "rt", encoding="latin1") as handle: open_as_latin1 = handle.read()
... 
>>> len(raw_bytes)
456614
>>> len(raw_bytes.decode("latin1"))
456614
>>> len(open_as_latin1)
456610
>>> raw_bytes.decode("latin1").replace("\r\n", "\n").replace("\r", "\n") == open_as_latin1
True
>>> 

Sadly, attempting to implement universal newlines mode in BGZF text mode would be a major piece of work due to the special file offsets (the entire point of BGZF). However, we do need to document this limitation.
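To illustrate why newline translation and offsets do not mix, here is a small sketch using plain bytes rather than a real BGZF file. The sample data and the helper `translate_newlines` are made up for illustration; the helper applies the same replacement as the session above. The point is only that an offset into the decompressed bytes stops matching the position in the translated text.

```python
def translate_newlines(text):
    """Mimic universal newlines: turn \\r\\n and lone \\r into \\n."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Made-up sample data with Windows line endings.
raw = b"line one\r\nline two\r\nline three\r\n"
translated = translate_newlines(raw.decode("latin1"))

# Position of the third line in the raw bytes versus the translated text.
byte_offset = raw.index(b"line three")         # 20
text_offset = translated.index("line three")   # 18

# Each translated \r\n drops one character, so the two positions drift apart.
drift = byte_offset - text_offset              # 2
```

An offset recorded against the raw data would therefore land two characters too far into the translated text, and the drift grows with every line ending before it.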

@peterjc
Member Author

peterjc commented Jan 10, 2020

Closed by #2517 - sticking with hard-coded latin1 due to the nature of the blocked compression.
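One property of latin1 that seems relevant here (my reading, not something spelled out in #2517): it maps every byte value to exactly one character, so decoding never fails and string positions line up with byte positions. A quick sketch with made-up data:

```python
# Every possible byte value decodes under latin1, one character per byte.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin1")

same_length = len(as_text) == len(all_bytes)        # True
round_trips = as_text.encode("latin1") == all_bytes  # True

# By contrast, UTF-8 uses a variable number of bytes per character,
# so character positions and byte positions can differ.
utf8_length = len("é".encode("utf-8"))      # 2
latin1_length = len("é".encode("latin1"))   # 1

# Arbitrary binary data is often not valid UTF-8 at all,
# as in the UnicodeDecodeError in the session above.
try:
    b"\x84".decode("utf-8")
    utf8_decodes = True
except UnicodeDecodeError:
    utf8_decodes = False
```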

@peterjc
Member Author

peterjc commented Jan 10, 2020

Almost closed by #2517; I spotted an oversight, which is fixed in #2532.

@peterjc
Member Author

peterjc commented Sep 23, 2021

I should have closed this back in Jan 2020.

It may not be impossible, but it would certainly be non-trivial to implement.
