xsd:base64Binary can't be converted to a string #646

Closed
hsolbrig opened this issue Aug 19, 2016 · 3 comments
Labels: bug, help wanted, parsing

@hsolbrig
Contributor

hsolbrig commented Aug 19, 2016

The XSDToPython conversion function in term.py tries to decode base64 literals and treat them as Python strings. This fails whenever the decoded bytes are not valid UTF-8, for example when they are raw binary data or text in (or resembling) a different character encoding.

Example:

>>> ttl1 = """
@prefix fhir: <http://hl7.org/fhir/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://hl7.org/fhir/Binary/example> fhir:value "JVBERi0xLjUNJeLjz9MNCjEwIDAgb2JqDTw8L0xpbmVhcml6ZWQgMS9MIDEzMDA2OC9PIDEyL0Ug MTI1NzM1L04gMS9UIDEyOTc2NC9IIFsgNTQ2IDIwNF0+Pg1lbmRvYmoNICAgICAgICAgICAg"^^xsd:base64Binary . """
>>> rdflib.Graph().parse(data=ttl1, format="turtle")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    ...
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/rdflib/term.py", line 589, in __new__
    lexical_or_value = lexical_or_value.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
>>>

What is happening is that the XSDToPython table in term.py (around line 1482) converts base64Binary to bytes:

>>> x = base64.b64decode("JVBERi0xLjUNJeLjz9MNCjEwIDAgb2JqDTw8L0xpbmVhcml6ZWQgMS9MIDEzMDA2OC9PIDEyL0Ug MTI1NzM1L04gMS9UIDEyOTc2NC9IIFsgNTQ2IDIwNF0+Pg1lbmRvYmoNICAgICAgICAgICAg") 
>>> x
b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n10 0 obj\r<</Linearized 1/L 130068/O 12/E 125735/N 1/T 129764/H [ 546 204]>>\rendobj\r            '

term.py then attempts to convert the bytes value into a string, assuming that the original data was some sort of UTF-8 string to begin with (!!!):

    if py3compat.PY3 and isinstance(lexical_or_value, bytes):
        lexical_or_value = lexical_or_value.decode('utf-8')

This doesn't work. In the above example, x is not valid UTF-8, although it happens to decode as latin1.

>>> x.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

If you were able to determine that it is, in fact, an encoded string, you could solve this issue with:

>>> x.decode('latin1')
'%PDF-1.5\r%âãÏÓ\r\n10 0 obj\r<</Linearized 1/L 130068/O 12/E 125735/N 1/T 129764/H [ 546 204]>>\rendobj\r            '

But you can't. Take another example - I want to encode some arbitrary binary data:

>>> binary_data = bytes([0, 17, 243, 117, 2])
>>> binary_data
b'\x00\x11\xf3u\x02'

Encode it as a base64 literal, which I can place in an arbitrary block of RDF:

>>> base64.b64encode(binary_data).decode('utf-8')
'ABHzdQI='
>>> ttl2 = """@prefix ex: <http://example.org/> .
 @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

 ex:foo ex:v "ABHzdQI="^^xsd:base64Binary ."""
>>> rdflib.Graph().parse(data=ttl2, format="turtle")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    ...
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/rdflib/term.py", line 589, in __new__
    lexical_or_value = lexical_or_value.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 2: invalid continuation byte
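
Both failures can be reproduced without rdflib. The sketch below uses only the standard library; `try_decode` is a hypothetical helper introduced here for illustration. It also shows why a successful latin1 decode proves nothing: latin1 assigns a character to every possible byte value, so any byte sequence at all "decodes".

```python
import base64


def try_decode(raw, encoding):
    """Hypothetical helper: return the decoded text, or None if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# The arbitrary binary payload from the second example.
raw = base64.b64decode("ABHzdQI=")

as_utf8 = try_decode(raw, "utf-8")      # fails: 0xf3 is not followed by a continuation byte
as_latin1 = try_decode(raw, "latin1")   # "succeeds", but the result is not meaningful text

# latin1 accepts every byte value, so success says nothing about the data being text.
all_bytes_as_latin1 = try_decode(bytes(range(256)), "latin1")
```

Here `as_utf8` is `None` while `as_latin1` is a five-character string, even though the payload was never text.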

Recommendation: rdflib Graph parsers should leave base64Binary literals alone. The user needs to understand the context in order to convert them into anything else (that is why they are binary to begin with...)
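
As an illustration of what "understanding the context" might look like on the user side, here is a minimal sketch that takes the untouched lexical form, decodes it explicitly, and only then decides what the bytes are. `describe_payload` and its return labels are hypothetical names, and the magic-prefix check is just one possible heuristic, not anything rdflib provides.

```python
import base64


def describe_payload(lexical):
    """Hypothetical helper: decode a base64Binary lexical form and guess its kind.

    Returns a (kind, raw_bytes) tuple. The caller, not the parser, decides
    what to do with the bytes.
    """
    raw = base64.b64decode(lexical)
    if raw.startswith(b"%PDF-"):
        return "pdf", raw
    return "unknown", raw


# First 12 characters of the PDF literal from the first example.
pdf_kind, pdf_raw = describe_payload("JVBERi0xLjUN")

# The arbitrary binary literal from the second example.
other_kind, other_raw = describe_payload("ABHzdQI=")
```

Neither call ever turns the bytes into a `str`, so no character encoding has to be assumed.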

Replace lines 1465 and 1466:

     URIRef(
        _XSD_PFX + 'base64Binary'): lambda s: base64.b64decode(py3compat.b(s)),

with

     URIRef(_XSD_PFX + 'base64Binary') : None,
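
To show the intent of that change, here is a simplified stand-in for a datatype-to-converter table. It is not rdflib's actual code: `CONVERTERS` and `to_python` are hypothetical names, and the sketch assumes that a `None` entry means "no conversion, keep the lexical form", which is what the proposed replacement implies.

```python
import base64

XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical, simplified converter table (not rdflib's XSDToPython).
CONVERTERS = {
    XSD + "integer": int,
    XSD + "base64Binary": None,  # assumption: None means "leave the lexical form alone"
}


def to_python(lexical, datatype):
    """Convert a lexical form using the table, or return it unchanged."""
    convert = CONVERTERS.get(datatype)
    if convert is None:
        return lexical
    return convert(lexical)


number = to_python("42", XSD + "integer")
blob = to_python("ABHzdQI=", XSD + "base64Binary")

# Decoding stays an explicit, user-side step.
blob_bytes = base64.b64decode(blob)
```

With this shape the parser never has to guess an encoding; the base64 text survives parsing intact and the caller decodes it when needed.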
@joernhees joernhees added bug Something isn't working parsing Related to a parsing. help wanted Extra attention is needed labels Sep 16, 2016
@joernhees joernhees added this to the rdflib 4.2.2 milestone Sep 16, 2016
@nateprewitt
Contributor

I ran into this today working with a triplestore containing images encoded as base64. I'm in agreement with @hsolbrig that attempting to do anything automatically is going to cause unexpected, and likely unwanted, behaviour. It's probably best to leave these as they are and allow the user to choose the next steps.

If that sounds acceptable, I'm happy to implement the above suggested fix and some tests in a PR tonight.

@gromgull
Member

This would break compatibility for people who rely on the automatic decoding happening, so I'm moving this to 5.0.

@gromgull gromgull modified the milestones: rdflib 5.0.0, rdflib 4.2.2 Jan 19, 2017
@gromgull
Member

fixed in 695a670
