The XSDToPython conversion function in term.py tries to decode base64 literals and treat them as Python strings. This doesn't work when the decoded data is in (or happens to resemble) a character encoding other than UTF-8.
Example:
>>> import rdflib
>>> ttl1 = """
@prefix fhir: <http://hl7.org/fhir/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://hl7.org/fhir/Binary/example> fhir:value "JVBERi0xLjUNJeLjz9MNCjEwIDAgb2JqDTw8L0xpbmVhcml6ZWQgMS9MIDEzMDA2OC9PIDEyL0Ug MTI1NzM1L04gMS9UIDEyOTc2NC9IIFsgNTQ2IDIwNF0+Pg1lbmRvYmoNICAgICAgICAgICAg"^^xsd:base64Binary . """
>>> rdflib.Graph().parse(data=ttl1, format="turtle")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/rdflib/term.py", line 589, in __new__
    lexical_or_value = lexical_or_value.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
>>>
What is happening is that the XSDToPython table in term.py (around line 1482) converts base64Binary to bytes:
>>> import base64
>>> x = base64.b64decode("JVBERi0xLjUNJeLjz9MNCjEwIDAgb2JqDTw8L0xpbmVhcml6ZWQgMS9MIDEzMDA2OC9PIDEyL0Ug MTI1NzM1L04gMS9UIDEyOTc2NC9IIFsgNTQ2IDIwNF0+Pg1lbmRvYmoNICAgICAgICAgICAg")
>>> x
b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n10 0 obj\r<</Linearized 1/L 130068/O 12/E 125735/N 1/T 129764/H [ 546 204]>>\rendobj\r            '
term.py then attempts to convert the bytes value into a string, assuming that the original data was some sort of UTF-8 string to begin with(!!!):
if py3compat.PY3 and isinstance(lexical_or_value, bytes):
    lexical_or_value = lexical_or_value.decode('utf-8')
This doesn't work. In the above example, x is latin1 encoded.
>>> x.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
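A minimal sketch of why guessing the encoding is unreliable, using only the standard library. The helper name `try_decode` is hypothetical and is not part of rdflib. latin1 maps every possible byte to a code point, so it "succeeds" on any input, including binary data that was never text.

```python
import base64

# Payload from the example above; it is the start of a PDF file, not text.
payload = base64.b64decode(
    "JVBERi0xLjUNJeLjz9MNCjEwIDAgb2JqDTw8L0xpbmVhcml6ZWQgMS9MIDEzMDA2OC9PIDEyL0Ug"
    "MTI1NzM1L04gMS9UIDEyOTc2NC9IIFsgNTQ2IDIwNF0+Pg1lbmRvYmoNICAgICAgICAgICAg"
)


def try_decode(data: bytes, encoding: str):
    """Hypothetical helper: return decoded text, or None if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


utf8_text = try_decode(payload, "utf-8")      # fails on the byte 0xe2
latin1_text = try_decode(payload, "latin-1")  # succeeds, but proves nothing

# latin-1 accepts every byte value, so a successful decode is not evidence
# that the data was latin-1 text to begin with.
all_bytes_text = try_decode(bytes(range(256)), "latin-1")
```

A successful latin1 decode therefore says nothing about whether the literal holds text or an image, a PDF, or any other binary format.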
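If you were able to determine that it is, in fact, an encoded string, you could solve this issue by decoding it with the right codec (latin1 in the example above).
But you can't. Take another example: I want to encode some arbitrary binary data as a base64 literal, which I can place in an arbitrary block of RDF: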
>>> base64.b64encode(binary_data).decode('utf-8')
'ABHzdQI='
>>> ttl2 = """@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ex:foo ex:v "ABHzdQI="^^xsd:base64Binary ."""
>>> rdflib.Graph().parse(data=ttl2, format="turtle")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/rdflib/term.py", line 589, in __new__
    lexical_or_value = lexical_or_value.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 2: invalid continuation byte
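The same failure can be reproduced without rdflib. This is only a sketch: how `binary_data` was originally built is not shown in this report, so here it is recovered by decoding the literal itself, which is an assumption about the original input.

```python
import base64

# Assumption: recover the bytes from the literal; the original construction
# of binary_data is not shown in the report.
binary_data = base64.b64decode("ABHzdQI=")

# Base64 round-trips the bytes exactly, whatever they contain.
round_trip = base64.b64encode(binary_data).decode("ascii")

# Treating the payload as UTF-8 text fails, as in the traceback above.
try:
    binary_data.decode("utf-8")
    decoded_ok = True
    bad_offset = None
except UnicodeDecodeError as exc:
    decoded_ok = False
    bad_offset = exc.start  # index of the first offending byte
```

The base64 layer is lossless for any byte sequence; only the extra UTF-8 decoding step fails.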
Recommendation: rdflib Graph parsers should leave base64Binary literals alone. The user needs to understand the context in order to convert them into anything else (that is why they are binary to begin with...)
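A rough sketch of the behaviour recommended above, under stated assumptions: `convert_lexical` and `payload_as_text` are hypothetical names for illustration and are not rdflib's API or its actual fix. The converter strips only the base64 wrapper and returns bytes; any text decoding is a separate step taken by a caller who knows the encoding.

```python
import base64

XSD_BASE64 = "http://www.w3.org/2001/XMLSchema#base64Binary"


def convert_lexical(lexical: str, datatype: str):
    """Hypothetical converter: map a literal's lexical form to a Python value."""
    if datatype == XSD_BASE64:
        # Remove the base64 wrapper only; the payload stays as bytes.
        return base64.b64decode(lexical)
    return lexical


def payload_as_text(value: bytes, encoding: str) -> str:
    """Caller-side step: decode only when the caller knows the encoding."""
    return value.decode(encoding)


# Arbitrary binary payload: returned untouched as bytes.
raw = convert_lexical("ABHzdQI=", XSD_BASE64)

# Payload the caller knows to be UTF-8 text: decoded explicitly.
greeting = payload_as_text(convert_lexical("aGVsbG8=", XSD_BASE64), "utf-8")
```

Keeping the two steps apart means parsing never fails on valid base64, and an encoding error can only occur where the caller has asked for text.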
I ran into this today working with a triplestore containing images encoded as base64. I agree with @hsolbrig that attempting to do anything automatically is going to cause unexpected, and likely unwanted, behaviour. It's probably best to leave these as they are and allow the user to choose the next steps.
If that sounds acceptable, I'm happy to implement the above suggested fix and some tests in a PR tonight.
Replace lines 1465 and 1466:
with