Can bgzf._as_bytes calls be effectively replaced with bytes() in Python3? #3832

MatthewRalston · 2022-01-02T05:42:05Z

Setup

I am reporting a problem with Biopython version, Python version, and operating
system as follows:

>import sys; print(sys.version)
3.10.1 ... [GCC 11.1.0]
>import platform; print(platform.python_implementation()); print(platform.platform())
CPython
Linux-5.15.12-arch1-1-x86_64-with-glibc2.33
>import Bio; print(Bio.__version__)
1.79

Was using Bio.bgzf in my pet project (MatthewRalston/kmerdb) and suddenly my locked version of Biopython 1.79 was producing errors about the function being no longer available.

Expected behaviour

I expected _as_bytes to be an available helper to cast data correctly for formatting a metadata block. I wouldn't even go so far as to say I should have been using it, but the function was used throughout the codebase and figured it would be safe to use.

For reference I was subclassing/inheriting methods from bgzf.BgzfWriter into my wrapper class for a custom bgzf format. I wouldnt dare post the attrocious code I'm using to do this here.

Actual behavior

bgzf._as_bytes() has been removed.

Question

Are we able to just use bytes() in python3 now to cast data to be written with _write_block? Sadly, I wasn't able to get desired behavior from write() when I last checked, so I had been using _write_block to write a YAML metadata block(s) to the top of my custom format.

The text was updated successfully, but these errors were encountered:

peterjc · 2022-01-03T18:02:06Z

Functions with a single leading underscore are considered private by convention in Python, and thus subject to change without warning - they are not considered part of the stable public API.

The _as_bytes function was last used in Bio/bgzf.py in Biopython 1.76, defined for Python 3 at https://github.com/biopython/biopython/blob/biopython-176/Bio/_py3k/__init__.py#L78 as follows:

    def _as_bytes(s):
        """Turn byte string or unicode string into a bytes string (PRIVATE).
        The Python 2 version returns a (byte) string.
        """
        if isinstance(s, bytes):
            return s
        # Assume it is a unicode string
        # Note ISO-8859-1 aka Latin-1 preserves first 256 chars
        return codecs.latin_1_encode(s)[0]

Basically if you have a (unicode) string, encode it appropriately as a bytes string. You can do that with an encode method (as here), or calling the built-in function bytes(...) but should think carefully about the encoding.

I suspect whatever issue you were having with the public API may be to do with properly encoding unicode data into bytes. If you can pin this down to a short self contained example, please do open another issue about that.

(If you want to know more about why the BGZF module does not support [missing caveat added: arbitrary encodings for] unicode, see #2512)

MatthewRalston · 2022-01-03T18:20:28Z

Thanks Peter! I value your response, it does make sense why the _as_bytes function would change so rapidly and I was aware of that going in. I'm happy you closed the issue too, it was kind of a non-issue issue more about usage than reporting a "bug".

I'm still unable to use the "write" function, even after using a flush. Here is a short example from kmerdb. This is how I am currently writing my metadata "block" to the bgzf file before dumping the rest of the contents to the file with write . I'd love to use write() as commented below but my usage still seems to be wrong!

        # I've needed to implement a basic block_writer, maintaining compatibility with the Biopython bgzf submodule.
        #self.write(bytes(yaml.dump(metadata, sort_keys=False), 'utf-8'))
        
        for i in range(self.metadata["metadata_blocks"]):
            metadata_slice = metadata_bytes[:65536]
            metadata_bytes = metadata_bytes[65536:]
            self._write_block(metadata_slice)

peterjc · 2022-01-03T23:19:11Z

This is a guess, but using UTF-8 as an encoding here worries me (see discussion linked to from #2512), the problem being multi-byte unicode letters could be split between BGZF blocks. If so, then any failures could be data dependent.

If you could narrow this down to a self contained example, that'd be something I could look into. Thanks.

MatthewRalston · 2022-01-04T15:05:42Z

Thanks Peter, you were correct. I was setting mode="wb" for the BgzfWriter and needing to encode all my text as UTF-8 for my metadata header. I am now using mode="w" and using the write method directly as intended.

The write method was working perfectly when I was writing single str lines to the file, but wasn't working when I was dumping the metadata at the top of the file (with UTF-8 encoding as above), even after flushing. Thanks for the responses, totally RTFM and user error.

peterjc closed this as completed Jan 3, 2022

peterjc mentioned this issue Jan 4, 2022

kmerdb uses mode="wb" with BgzfWriter subclass MatthewRalston/kmerdb#50

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can bgzf._as_bytes calls be effectively replaced with bytes() in Python3? #3832

Can bgzf._as_bytes calls be effectively replaced with bytes() in Python3? #3832

MatthewRalston commented Jan 2, 2022

peterjc commented Jan 3, 2022 •

edited

MatthewRalston commented Jan 3, 2022

peterjc commented Jan 3, 2022

MatthewRalston commented Jan 4, 2022 •

edited

Can bgzf._as_bytes calls be effectively replaced with bytes() in Python3? #3832

Can bgzf._as_bytes calls be effectively replaced with bytes() in Python3? #3832

Comments

MatthewRalston commented Jan 2, 2022

Setup

Expected behaviour

Actual behavior

Question

peterjc commented Jan 3, 2022 • edited

MatthewRalston commented Jan 3, 2022

peterjc commented Jan 3, 2022

MatthewRalston commented Jan 4, 2022 • edited

peterjc commented Jan 3, 2022 •

edited

MatthewRalston commented Jan 4, 2022 •

edited