Only use unicode-compatible values for lexical space #674

nateprewitt · 2016-12-03T03:43:35Z

This is an attempt to address #646. Right now, rdflib assumes that base64 encoded strings contain unicode content upon decode. This is often not the case, specifically for things like images and PDFs.

This will allow the user to still retrieve the base64 string as it's stored in the triplestore and choose how to handle the data from there.

I'm currently filing this against master, but it does seem like a somewhat breaking change, so I can refile against the 5.0.0 branch if needed.

nateprewitt · 2016-12-19T04:59:34Z

Hey @joernhees, wanted to check in and see if you had any thoughts on this. Thanks!

gromgull · 2017-01-12T13:59:16Z

I think this should defn. go against 5.0. I will try to spend some time and get a 4.2.2 bug-fix release out SOON. Then we can make 5.0 master.

nateprewitt · 2017-01-12T16:56:12Z

Thanks for the update @gromgull! I'll keep an eye on things and I'll rebase this onto master when the branch is ready for 5.0.0 commits.

joernhees · 2017-01-19T14:47:10Z

hmm, keeping base64 strings as they are and requiring the user to base64 decode them manually (as this would enforce) is one way to solve this, but i think we can do better...

The part of #646 that probably is wrong is treating the value as a unicode string later on. It should be treated as a byte string after decoding. In other words, if you have a picture / pdf inside your RDF serialized as a base64 encoded literal, i'd see it as RDFLib's responsibility to give you picture / pdf data in form of a byte string, no? Probably a bit difficult with Literals being unicode subclasses...

nateprewitt · 2017-01-19T18:21:21Z

@joernhees I'm not tied to the idea of requiring them to be manually decoded, but I do think the approach provides the least amount of discomfort for the most users.

The three primary issues I see with automatically decoding are the subtle modification of the Literal value, the issues with bytes objects between Python 2/3, and as you noted, the fact that Literal is defined to be unicode in the class hierarchy.

With automatically decoding, I can run a SPARQL query and get the Literal back as its Base64-encoded value. When I try to compare that with the value of my Literal node in Python, it will differ. It will likely be hard to determine what's happening here without digging into this issue tracker. We also have the problem of the value content possibly not being string compatible, which isn't true for the rest of RDFLib's literals. If I create an loop treating my literals as strings, RDFLib may crash here which can be painful especially with large/changing datasets.

Regarding returning bytes, I'm generally in favor of bytes over encoded-strings, however this should preferably be done uniformly. Returning bytes here introduces the issue of the literal having a type of str in Python 2 and bytes (and not str/unicode) in Python 3. That means code has the possibility of diverting unexpectedly when running a program with different versions. This is something to be expected when switching versions but I've found things like isinstance(value, str) or str(value) are common stumbling blocks, even for those experienced with Python. It's easy for unchanged code that was working to suddenly break without a immediate explanation to why.

As for Literals subclassing unicode, returning bytes fundamentally undermines some core premises in the library. While we may be able to remove that single cast to unicode from XSDToPython, most of the Literal class' functionality relies on it being castable with unicode. We lose the ability to use all equivalences (__eq__,__lt__, etc), and other basic functionality.

All that said, I can look into amending the patch for a more agreeable solution if this one is a non-starter.

gromgull · 2017-01-19T18:48:43Z

Literals subclass unicode because they all have some string representation (lexical space). That does not stop us from having any python object as value (value space). For instance, xsd:integers have the string representation as a "2", but a as value a python int with the value 2.

The same way, Literals can have a lexical encoding which is the base64 encoded byte-stream, and a value-object that is the decoded bytes.

nateprewitt · 2017-01-20T00:51:48Z

@gromgull, apologies, the last portion of my findings may have been hastily misattributed. I hadn't dug into the code base much before this evening.

For strings, it seems RDFLib doesn't maintain an initial distinction between the lexical space and value space, but rather tries to convert the value space back into unicode. The lexical value will always be overwritten by the response from XSDToPython here, when the return value is not None. That means we now have both the lexical space and value space set as binary data. When the Literal constructor calls unicode.__new__(cls, lexical_or_value), lexical_or_value is set to a value of bytes and thus crashes.

The only reason your given example of "2" works is because Python can convert an integer 2 into a unicode string. The current __new__ method is handing an int (not a unicode or str object) to unicode.__new__ which is why you're getting the expected lexical format. This can be verified pretty quickly by stepping through Literal's __new__ with pdb and checking the type of lexical_or_value immediately before the unicode constructor is called.

I'm not immediately sure how to address this, but I can try to work out a more robust solution. It may be a few days until I have time to sit down.

nateprewitt · 2017-01-23T18:08:02Z

Alright, this is probably a more acceptable solution. The patch now addresses a more core problem of the assumption _value is unicode compatible. This will now tests if the string can be converted to unicode before electing it as a normalized successor to the initial lexical_or_value value.

This doesn't address the possibility of an initial lexical_or_value value that isn't unicode compatible (non-printable), but I'm not sure if that can be handled in a sensical way.

nateprewitt · 2017-01-23T18:09:19Z

Hmm, the Python 2 build is failing when trying to install an external dependency. I don't see anything in the diff that immediately points to a problem.

nateprewitt · 2017-01-23T18:32:59Z

Yeah, reverting the commit to 14243b6 still produces the same issue. It looks like something has changed upstream.

A brief look is pointing to a possible conflict between the requirement of pyparsing <= 1.5.7 and the current version of setuptools. It now requires pyparse >= 2.1.0. I ran the test sweet in Python 2 with pyparsing at 2.1.0 without issues. I'll throw up another PR removing the hard versioning in a moment.

gromgull · 2017-01-24T08:44:50Z

Funny, I had builds that worked 18 hours ago. Looking at the log, they got setuptool 12.0.5, in this PR build it's 34.something.

nateprewitt · 2017-01-24T11:39:18Z

@gromgull, they pushed out 34.0.0 yesterday (~20 hours ago) which removes all of setuptools vendored dependencies. They're now using requirements from pypa, via a library called packaging, which is what was causing the conflict. The first command of the .travis.yml upgrades pip and setuptools to the latest versions.

Anyway, I just restarted the build, so this should be ready for a glance when you have a moment :) Also I don't believe this would be considered breaking anymore, so it likely could be a candidate for another 4.x release.

gromgull · 2017-01-24T13:29:42Z

Looks good!

coding_func, param = getattr(value, 'decode'), 'utf-8'

is maybe more readable if you actually just call it value.decode - move the try/except outside the 3-branch if statement and just return from inside?

but it's not super important!

nateprewitt · 2017-01-24T14:05:03Z

Hmm, I'm a bit torn here. My knee jerk reaction is bigger blocks in the try introduces a higher likelihood of masking exceptions. Particularly, it'll be easy to overlook with future additions/changes to those branch statements.

That said, it seems pretty unlikely this function will change much. I updated the branches and I think it makes the decode line easier to grock, but the rest of the code (for me at least) becomes a bit fuzzier. I think I'm -.5 on this but can make the change if you feel strongly about it.

    try:
        if isinstance(value, bytes):
            value.decode('utf-8')
        elif py3compat.PY3:
            str(value)
        else:
            unicode(value)
    except UnicodeError:
        return False
    return True

This allows us to avoid accidentally overriding our intial lexical value with one that is not unicode compatible after normalization. This is specifically relevant for arbitrary binary data with bytes outside of the defined unicode range.

gromgull · 2017-01-24T14:06:49Z

I agree, it's not that much cleaner. Leave it as it is!

nateprewitt · 2017-01-24T14:08:38Z

Great, I fixed the merge conflict so this should be ready to go whenever you all are set :)

gromgull · 2017-01-24T14:12:50Z

Thanks a lot!

nateprewitt · 2017-01-24T14:13:25Z

No problem, glad to help 🎉

joernhees · 2017-01-25T22:51:27Z

awesome, thanks

* master: (44 commits) quote cleanup OCD serializer/parser alias for 'ntriples' serializer/parser alias for 'ttl' cleanup remove outdated always skipped test a bit of changelog add a NTSerializer sub-class for nt11 (#700) Restrict normalization to unicode-compatible values (#674) fixes for turtle/trig namespace handling skip serialising empty default graph skip round-trip test, unfixable until 5.0 prefix test for #428 Added additional trig unit tests to highlight some currently occurring issues. remove ancient and broken 2.3 support code. (#681) updating deprecated testing syntax (#697) docs: clarify the use of an identifier when persisting a triplestore (#654) removing pyparsing version requirement (#696) made min/max aggregate functions support all literals (#694) actually fix projection from sub-queries added dawg tests for #607 ...

gromgull added this to the rdflib 5.0.0 milestone Jan 12, 2017

joernhees added bug Something isn't working enhancement New feature or request in-resolution parsing Related to a parsing. labels Jan 19, 2017

joernhees added discussion and removed in-resolution labels Jan 19, 2017

nateprewitt force-pushed the 646_dont_decode_base64 branch from 5ff7517 to 58a7da3 Compare January 23, 2017 18:00

nateprewitt changed the title ~~Remove automatic decoding for base64 values~~ Only use unicode-compatible values for lexical space Jan 23, 2017

nateprewitt force-pushed the 646_dont_decode_base64 branch from 58a7da3 to e4f3f20 Compare January 23, 2017 21:29

nateprewitt force-pushed the 646_dont_decode_base64 branch from e4f3f20 to 0525357 Compare January 24, 2017 11:32

nateprewitt force-pushed the 646_dont_decode_base64 branch from 0525357 to 34d112e Compare January 24, 2017 11:42

nateprewitt force-pushed the 646_dont_decode_base64 branch from 34d112e to e798da6 Compare January 24, 2017 14:07

gromgull merged commit 695a670 into RDFLib:master Jan 24, 2017

nateprewitt deleted the 646_dont_decode_base64 branch January 24, 2017 14:12

joernhees modified the milestones: rdflib 4.2.2, rdflib 5.0.0 Jan 25, 2017

pyup-bot mentioned this pull request Jan 29, 2017

Update rdflib to 4.2.2 mytardis/mytardis#815

Merged

This was referenced Mar 16, 2017

Initial Update mozilla/amo-validator#510

Closed

Update rdflib to 4.2.2 mozilla/amo-validator#515

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Only use unicode-compatible values for lexical space #674

Only use unicode-compatible values for lexical space #674

nateprewitt commented Dec 3, 2016

nateprewitt commented Dec 19, 2016

gromgull commented Jan 12, 2017

nateprewitt commented Jan 12, 2017

joernhees commented Jan 19, 2017

nateprewitt commented Jan 19, 2017

gromgull commented Jan 19, 2017

nateprewitt commented Jan 20, 2017

nateprewitt commented Jan 23, 2017

nateprewitt commented Jan 23, 2017

nateprewitt commented Jan 23, 2017

gromgull commented Jan 24, 2017

nateprewitt commented Jan 24, 2017

gromgull commented Jan 24, 2017

nateprewitt commented Jan 24, 2017

gromgull commented Jan 24, 2017

nateprewitt commented Jan 24, 2017

gromgull commented Jan 24, 2017

nateprewitt commented Jan 24, 2017

joernhees commented Jan 25, 2017

Only use unicode-compatible values for lexical space #674

Only use unicode-compatible values for lexical space #674

Conversation

nateprewitt commented Dec 3, 2016

nateprewitt commented Dec 19, 2016

gromgull commented Jan 12, 2017

nateprewitt commented Jan 12, 2017

joernhees commented Jan 19, 2017

nateprewitt commented Jan 19, 2017

gromgull commented Jan 19, 2017

nateprewitt commented Jan 20, 2017

nateprewitt commented Jan 23, 2017

nateprewitt commented Jan 23, 2017

nateprewitt commented Jan 23, 2017

gromgull commented Jan 24, 2017

nateprewitt commented Jan 24, 2017

gromgull commented Jan 24, 2017

nateprewitt commented Jan 24, 2017

gromgull commented Jan 24, 2017

nateprewitt commented Jan 24, 2017

gromgull commented Jan 24, 2017

nateprewitt commented Jan 24, 2017

joernhees commented Jan 25, 2017