add_file_from_memory() can't add binary files #68

Closed
anordal opened this issue Apr 26, 2018 · 3 comments · Fixed by #85

Comments

@anordal

anordal commented Apr 26, 2018

add_file_from_memory() fails if given a bytestring of nonzero length (regardless of content):

import libarchive

with libarchive.file_writer('bytes.tar', 'pax') as ar:
	content = b'bytes'
	ar.add_file_from_memory('bytes.bin', len(content), content)
Traceback (most recent call last):
  File "bytes.py", line 5, in <module>
    ar.add_file_from_memory('bytes.bin', len(content), content)
  File "/usr/lib/python3.6/site-packages/libarchive/write.py", line 105, in add_file_from_memory
    write_data(archive_pointer, chunk, len(chunk))
TypeError: object of type 'int' has no len()

Of course, converting the bytestring to string as a workaround is not always an option.

A better test case would include some actual invalid UTF-8 (like b'\x80') to assert binary cleanliness, but as demonstrated, that was not the problem here.
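For context on the traceback: in Python 3, iterating over a bytes object yields int values rather than one-byte bytestrings, so a loop over the data ends up calling len() on an int. A minimal illustration that does not need libarchive at all:

```python
content = b'bytes'

# Iterating over bytes in Python 3 yields integers, not 1-byte strings.
chunks = list(content)
print(chunks)  # [98, 121, 116, 101, 115]

# Calling len() on one of those integers reproduces the error in the traceback.
try:
    len(chunks[0])
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Wrapping the bytestring in a list makes iteration see a single bytes chunk.
wrapped = list([content])
print(wrapped)  # [b'bytes']
```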

Python 3.6.4
python3-libarchive-c 2.7

@anordal
Author

anordal commented Apr 27, 2018

Found this workaround:

 
 with libarchive.file_writer('bytes.tar', 'pax') as ar:
 	content = b'bytes'
-	ar.add_file_from_memory('bytes.bin', len(content), content)
+	ar.add_file_from_memory('bytes.bin', len(content), [content])

Or, why not do exactly the same inside the function to fix the problem (untested):

--- a/libarchive/write.py
+++ b/libarchive/write.py
@@ -99,6 +99,10 @@ class ArchiveWrite(object):
             entry_set_perm(archive_entry_pointer, permission)
             write_header(archive_pointer, archive_entry_pointer)
 
+            # Make bytestrings work #68
+            if isinstance(entry_data, bytes):
+                entry_data = [entry_data]
+
             for chunk in entry_data:
                 if not chunk:
                     break
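The same idea can be sketched as a standalone helper, which is easier to try out than a patched install. `as_chunks` is a hypothetical name, not part of python-libarchive-c, and treating bytearray like bytes is an assumption that goes slightly beyond the patch:

```python
def as_chunks(entry_data):
    """Return an iterable of chunks, wrapping a bare bytestring in a list.

    Hypothetical helper, not part of python-libarchive-c.
    """
    if isinstance(entry_data, (bytes, bytearray)):
        return [entry_data]
    return entry_data


print(as_chunks(b'bytes'))         # [b'bytes']
print(as_chunks([b'by', b'tes']))  # [b'by', b'tes']
```

Callers that already pass a list or generator of chunks are unaffected, since anything that is not a bare bytestring is returned as-is.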

@srandall52
Contributor

AFAICT your "workaround" is in fact the one and only correct way to use this method.
The documentation is weak, and probably could read, "entry_data: binary content of entry as an iterable yielding bytes or bytearray objects."
As for the test case, it is badly broken and only "works" by accident, and it clearly sets a bad example.
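To illustrate the "iterable yielding bytes" reading of entry_data, here is a minimal sketch of a generator that slices one buffer into fixed-size chunks. `iter_chunks` and the chunk size are assumptions for illustration only; the commented-out call follows the call shape used earlier in this thread.

```python
def iter_chunks(data, chunk_size=4):
    """Yield successive bytes slices of data (hypothetical helper)."""
    for start in range(0, len(data), chunk_size):
        yield bytes(data[start:start + chunk_size])


content = b'binary \x80 payload'
chunks = list(iter_chunks(content))
print(chunks)

# With python-libarchive-c installed, the generator could be passed directly:
# with libarchive.file_writer('bytes.tar', 'pax') as ar:
#     ar.add_file_from_memory('bytes.bin', len(content), iter_chunks(content))
```

Since the write loop quoted in the patch context stops at the first empty chunk, a generator like this should not yield empty bytestrings in the middle of the data.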

@MartinFalatic
Contributor

I found out the hard way that if you feed a unicode string to entry_data (e.g., your_data being passed as [your_data]) you will get VERY strange output. Specifically, it looks like doubly-encoded UTF-16: "A" (0x41 in ASCII) is 0x0041 in Unicode, and then it appears to get re-encoded as 0x00000041.

So, if your_data is unicode, .encode() it first. In Python 3 you can just check if it's an instance of str.

This also pops up if you are using the unicode_literals import on Python 2 and strings are involved.

But what caused me the most trouble is that, regardless of the import above... json.dumps() in Python 2.7 can return either a non-unicode string or a unicode one, depending on the options. In Python 3, json.dumps() returns str... and you'll have the same problem if you don't encode() it to bytes.
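A minimal sketch of the encode-first approach for json.dumps() output on Python 3. UTF-8 is an assumed encoding choice, and the commented-out call follows the call shape used earlier in this thread. The size is taken from the encoded bytes rather than from the str, since the two differ for non-ASCII text.

```python
import json

# json.dumps() returns str on Python 3; ensure_ascii=False keeps the "é" literal.
text = json.dumps({'name': 'caf\u00e9'}, ensure_ascii=False)
payload = text.encode('utf-8')  # bytes, safe to hand to the archive writer

print(type(text).__name__, len(text))
print(type(payload).__name__, len(payload))

# With python-libarchive-c installed:
# with libarchive.file_writer('data.tar', 'pax') as ar:
#     ar.add_file_from_memory('data.json', len(payload), [payload])
```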

The more I think about it, the more I wonder if this is simply a bug. For Python 3 at least, entry_data should ONLY be a list of bytes objects. I'm trying to think of why you'd want to let the library try to encode non-bytes data, given that it will fail badly in the effort and then blithely pass that broken data to the system libarchive.
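One way to get that stricter behavior without changing the library would be to validate chunks before they reach it. A sketch, assuming a hypothetical wrapper named `bytes_only` (not part of python-libarchive-c):

```python
def bytes_only(chunks):
    """Yield chunks unchanged, raising TypeError on anything not bytes-like.

    Hypothetical wrapper, not part of python-libarchive-c.
    """
    for chunk in chunks:
        if not isinstance(chunk, (bytes, bytearray)):
            raise TypeError(
                'entry_data chunks must be bytes or bytearray, got %s'
                % type(chunk).__name__
            )
        yield chunk


print(list(bytes_only([b'ok', bytearray(b'fine')])))

# A str chunk is rejected up front instead of being written in a mangled form.
try:
    list(bytes_only(['text']))
except TypeError as exc:
    rejected = str(exc)
    print(rejected)
```

Passing a bare bytestring through the same wrapper also fails early, because iterating over bytes yields int values.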

MartinFalatic pushed a commit to MartinFalatic/python-libarchive-c that referenced this issue Apr 14, 2019
Changaco pushed a commit to MartinFalatic/python-libarchive-c that referenced this issue Oct 20, 2019
Changaco added a commit to MartinFalatic/python-libarchive-c that referenced this issue Oct 20, 2019