Only use unicode-compatible values for lexical space #674
Hey @joernhees, wanted to check in and see if you had any thoughts on this. Thanks! |
I think this should definitely go against 5.0. I will try to spend some time and get a 4.2.2 bug-fix release out soon. Then we can make 5.0 master. |
Thanks for the update @gromgull! I'll keep an eye on things and I'll rebase this onto master when the branch is ready for 5.0.0 commits. |
hmm, keeping base64 strings as they are and requiring the user to base64 decode them manually (as this would enforce) is one way to solve this, but i think we can do better... The part of #646 that probably is wrong is treating the value as a unicode string later on. It should be treated as a byte string after decoding. In other words, if you have a picture / pdf inside your RDF serialized as a base64 encoded literal, i'd see it as RDFLib's responsibility to give you picture / pdf data in form of a byte string, no? Probably a bit difficult with Literals being unicode subclasses... |
@joernhees I'm not tied to the idea of requiring them to be manually decoded, but I do think the approach provides the least amount of discomfort for the most users. The three primary issues I see with automatically decoding are the subtle modification of the Literal value, the issues with With automatic decoding, I can run a SPARQL query and get the Literal back as its Base64-encoded value. When I try to compare that with the value of my Literal node in Python, it will differ. It will likely be hard to determine what's happening here without digging into this issue tracker. We also have the problem of the value content possibly not being string compatible, which isn't true for the rest of RDFLib's literals. If I create a loop treating my literals as strings, RDFLib may crash here, which can be painful, especially with large or changing datasets. Regarding returning bytes, I'm generally in favor of bytes over encoded strings, however this should preferably be done uniformly. Returning bytes here introduces the issue of the literal having a type of As for Literals subclassing All that said, I can look into amending the patch for a more agreeable solution if this one is a non-starter. |
Literals subclass unicode because they all have some string representation (lexical space). That does not stop us from having any Python object as value (value space). For instance, an xsd:integer has the string representation "2", but as its value a Python int with the value 2. In the same way, Literals can have a lexical encoding which is the base64-encoded byte stream, and a value object that is the decoded bytes. |
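To make the lexical space / value space split concrete, here is a minimal Python 3 sketch. `SketchLiteral` is a hypothetical stand-in invented for illustration; it is not RDFLib's `Literal` and makes no claim about how RDFLib stores its value.

```python
import base64


class SketchLiteral(str):
    """Hypothetical literal: the str content is the lexical form."""

    def __new__(cls, lexical, value):
        obj = super().__new__(cls, lexical)
        obj.value = value  # value space: any Python object
        return obj


# xsd:integer style: lexical form "2", value the int 2
two = SketchLiteral("2", int("2"))

# xsd:base64Binary style: lexical form is base64 text, value is decoded bytes
raw = b"\x89PNG\r\n\x1a\n"  # PNG signature, not valid UTF-8
lexical = base64.b64encode(raw).decode("ascii")
blob = SketchLiteral(lexical, base64.b64decode(lexical))
```

Under this assumption the object always behaves as text through its lexical form, while `blob.value` carries the raw bytes, so string handling never has to touch the binary payload.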
@gromgull, apologies, the last portion of my findings may have been hastily misattributed. I hadn't dug into the code base much before this evening. For strings, it seems RDFLib doesn't maintain an initial distinction between the lexical space and value space, but rather tries to convert the value space back into unicode. The lexical value will always be overwritten by the response from The only reason your given example of I'm not immediately sure how to address this, but I can try to work out a more robust solution. It may be a few days until I have time to sit down. |
5ff7517 to 58a7da3
Alright, this is probably a more acceptable solution. The patch now addresses a more core problem of the assumption This doesn't address the possibility of an initial |
Hmm, the Python 2 build is failing when trying to install an external dependency. I don't see anything in the diff that immediately points to a problem. |
Yeah, reverting the commit to 14243b6 still produces the same issue. It looks like something has changed upstream. A brief look points to a possible conflict between the requirement of pyparsing <= 1.5.7 and the current version of setuptools, which now requires pyparsing >= 2.1.0. I ran the test suite in Python 2 with pyparsing at 2.1.0 without issues. I'll throw up another PR removing the hard versioning in a moment. |
58a7da3 to e4f3f20
Funny, I had builds that worked 18 hours ago. Looking at the log, they got setuptools 12.0.5; in this PR build it's 34.something. |
e4f3f20 to 0525357
@gromgull, they pushed out 34.0.0 yesterday (~20 hours ago), which removes all of setuptools' vendored dependencies. They're now using requirements from pypa, via a library called packaging, which is what was causing the conflict. The first command of the .travis.yml upgrades pip and setuptools to the latest versions. Anyway, I just restarted the build, so this should be ready for a glance when you have a moment :) Also, I don't believe this would be considered breaking anymore, so it could likely be a candidate for another 4.x release. |
0525357 to 34d112e
Looks good!
is maybe more readable if you actually just call it but it's not super important! |
Hmm, I'm a bit torn here. My knee-jerk reaction is that bigger blocks in the try introduce a higher likelihood of masking exceptions. In particular, that will be easy to overlook with future additions or changes to those branch statements. That said, it seems pretty unlikely this function will change much. I updated the branches and I think it makes the
try:
    if isinstance(value, bytes):
        value.decode('utf-8')
    elif py3compat.PY3:
        str(value)
    else:
        unicode(value)
except UnicodeError:
    return False
return True |
This allows us to avoid accidentally overriding our initial lexical value with one that is not unicode-compatible after normalization. This is specifically relevant for arbitrary binary data with bytes outside of the defined unicode range.
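As a rough illustration of that guard, here is a Python 3 only sketch. The names `is_unicode_compatible` and `choose_lexical` are hypothetical, made up for this example; the real patch also handles Python 2 through `py3compat`, and its normalization step is not reproduced here.

```python
import base64


def is_unicode_compatible(value):
    """Hypothetical helper: True if value can be turned into a str."""
    try:
        if isinstance(value, bytes):
            value.decode("utf-8")
        else:
            str(value)
    except UnicodeError:
        return False
    return True


def choose_lexical(original, value):
    """Keep the original lexical form unless the converted value is text-safe."""
    if not is_unicode_compatible(value):
        return original  # e.g. binary payloads stay base64-encoded
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return str(value)


text_lexical = base64.b64encode(b"hello").decode("ascii")
binary_lexical = base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii")

kept_text = choose_lexical(text_lexical, base64.b64decode(text_lexical))
kept_binary = choose_lexical(binary_lexical, base64.b64decode(binary_lexical))
```

With these inputs, the text payload is replaced by its decoded form while the PNG signature bytes fail the UTF-8 check, so the base64 string is left untouched.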
I agree, it's not that much cleaner. Leave it as it is! |
34d112e to e798da6
Great, I fixed the merge conflict so this should be ready to go whenever you all are set :) |
Thanks a lot! |
No problem, glad to help 🎉 |
awesome, thanks |
* master: (44 commits)
  quote cleanup OCD
  serializer/parser alias for 'ntriples'
  serializer/parser alias for 'ttl'
  cleanup
  remove outdated always skipped test
  a bit of changelog
  add a NTSerializer sub-class for nt11 (#700)
  Restrict normalization to unicode-compatible values (#674)
  fixes for turtle/trig namespace handling
  skip serialising empty default graph
  skip round-trip test, unfixable until 5.0
  prefix test for #428
  Added additional trig unit tests to highlight some currently occurring issues.
  remove ancient and broken 2.3 support code. (#681)
  updating deprecated testing syntax (#697)
  docs: clarify the use of an identifier when persisting a triplestore (#654)
  removing pyparsing version requirement (#696)
  made min/max aggregate functions support all literals (#694)
  actually fix projection from sub-queries
  added dawg tests for #607
  ...
This is an attempt to address #646. Right now, rdflib assumes that base64 encoded strings contain unicode content upon decode. This is often not the case, specifically for things like images and PDFs.
This will allow the user to still retrieve the base64 string as it's stored in the triplestore and choose how to handle the data from there.
I'm currently filing this against master, but it does seem like a somewhat breaking change, so I can refile against the 5.0.0 branch if needed.
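For the "choose how to handle the data from there" part of the description, a user-side sketch might look like the following. It uses only the standard library; `decode_base64_lexical` and `as_text_or_bytes` are hypothetical helper names, and the input is assumed to be the base64 string read back from the store.

```python
import base64


def decode_base64_lexical(lexical):
    """Decode a base64 lexical form to bytes, rejecting stray characters."""
    return base64.b64decode(lexical, validate=True)


def as_text_or_bytes(data):
    """Return str when the payload is valid UTF-8, otherwise the raw bytes."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data


greeting = as_text_or_bytes(decode_base64_lexical("aGVsbG8="))
image_head = as_text_or_bytes(decode_base64_lexical("iVBORw0KGgo="))
```

Here `greeting` comes back as text, while `image_head` stays a byte string because a PNG signature is not valid UTF-8; whether to fall back to bytes or raise is a choice left to the caller.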