Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can bgzf._as_bytes calls be effectively replaced with bytes() in Python3? #3832

Closed
MatthewRalston opened this issue Jan 2, 2022 · 4 comments

Comments

@MatthewRalston
Copy link

Setup

I am reporting a problem with Biopython version, Python version, and operating
system as follows:

>import sys; print(sys.version)
3.10.1 ... [GCC 11.1.0]
>import platform; print(platform.python_implementation()); print(platform.platform())
CPython
Linux-5.15.12-arch1-1-x86_64-with-glibc2.33
>import Bio; print(Bio.__version__)
1.79

Was using Bio.bgzf in my pet project (MatthewRalston/kmerdb) and suddenly my locked version of Biopython 1.79 was producing errors about the function being no longer available.

Expected behaviour

I expected _as_bytes to be an available helper to cast data correctly for formatting a metadata block. I wouldn't even go so far as to say I should have been using it, but the function was used throughout the codebase and figured it would be safe to use.

For reference I was subclassing/inheriting methods from bgzf.BgzfWriter into my wrapper class for a custom bgzf format. I wouldnt dare post the attrocious code I'm using to do this here.

Actual behavior

bgzf._as_bytes() has been removed.

Question

Are we able to just use bytes() in python3 now to cast data to be written with _write_block? Sadly, I wasn't able to get desired behavior from write() when I last checked, so I had been using _write_block to write a YAML metadata block(s) to the top of my custom format.

@peterjc
Copy link
Member

peterjc commented Jan 3, 2022

Functions with a single leading underscore are considered private by convention in Python, and thus subject to change without warning - they are not considered part of the stable public API.

The _as_bytes function was last used in Bio/bgzf.py in Biopython 1.76, defined for Python 3 at https://github.com/biopython/biopython/blob/biopython-176/Bio/_py3k/__init__.py#L78 as follows:

    def _as_bytes(s):
        """Turn byte string or unicode string into a bytes string (PRIVATE).
        The Python 2 version returns a (byte) string.
        """
        if isinstance(s, bytes):
            return s
        # Assume it is a unicode string
        # Note ISO-8859-1 aka Latin-1 preserves first 256 chars
        return codecs.latin_1_encode(s)[0]

Basically if you have a (unicode) string, encode it appropriately as a bytes string. You can do that with an encode method (as here), or calling the built-in function bytes(...) but should think carefully about the encoding.

I suspect whatever issue you were having with the public API may be to do with properly encoding unicode data into bytes. If you can pin this down to a short self contained example, please do open another issue about that.

(If you want to know more about why the BGZF module does not support [missing caveat added: arbitrary encodings for] unicode, see #2512)

@peterjc peterjc closed this as completed Jan 3, 2022
@MatthewRalston
Copy link
Author

Thanks Peter! I value your response, it does make sense why the _as_bytes function would change so rapidly and I was aware of that going in. I'm happy you closed the issue too, it was kind of a non-issue issue more about usage than reporting a "bug".

I'm still unable to use the "write" function, even after using a flush. Here is a short example from kmerdb. This is how I am currently writing my metadata "block" to the bgzf file before dumping the rest of the contents to the file with write . I'd love to use write() as commented below but my usage still seems to be wrong!

        # I've needed to implement a basic block_writer, maintaining compatibility with the Biopython bgzf submodule.
        #self.write(bytes(yaml.dump(metadata, sort_keys=False), 'utf-8'))
        
        for i in range(self.metadata["metadata_blocks"]):
            metadata_slice = metadata_bytes[:65536]
            metadata_bytes = metadata_bytes[65536:]
            self._write_block(metadata_slice)

@peterjc
Copy link
Member

peterjc commented Jan 3, 2022

This is a guess, but using UTF-8 as an encoding here worries me (see discussion linked to from #2512), the problem being multi-byte unicode letters could be split between BGZF blocks. If so, then any failures could be data dependent.

If you could narrow this down to a self contained example, that'd be something I could look into. Thanks.

@MatthewRalston
Copy link
Author

MatthewRalston commented Jan 4, 2022

Thanks Peter, you were correct. I was setting mode="wb" for the BgzfWriter and needing to encode all my text as UTF-8 for my metadata header. I am now using mode="w" and using the write method directly as intended.

The write method was working perfectly when I was writing single str lines to the file, but wasn't working when I was dumping the metadata at the top of the file (with UTF-8 encoding as above), even after flushing. Thanks for the responses, totally RTFM and user error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants