Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only use unicode-compatible values for lexical space #674

Merged
merged 1 commit into from
Jan 24, 2017

Conversation

nateprewitt
Copy link
Contributor

This is an attempt to address #646. Right now, rdflib assumes that base64 encoded strings contain unicode content upon decode. This is often not the case, specifically for things like images and PDFs.

This will allow the user to still retrieve the base64 string as it's stored in the triplestore and choose how to handle the data from there.

I'm currently filing this against master, but it does seem like a somewhat breaking change, so I can refile against the 5.0.0 branch if needed.

@nateprewitt
Copy link
Contributor Author

Hey @joernhees, wanted to check in and see if you had any thoughts on this. Thanks!

@gromgull gromgull added this to the rdflib 5.0.0 milestone Jan 12, 2017
@gromgull
Copy link
Member

I think this should defn. go against 5.0. I will try to spend some time and get a 4.2.2 bug-fix release out SOON. Then we can make 5.0 master.

@nateprewitt
Copy link
Contributor Author

Thanks for the update @gromgull! I'll keep an eye on things and I'll rebase this onto master when the branch is ready for 5.0.0 commits.

@joernhees joernhees added bug Something isn't working enhancement New feature or request in-resolution parsing Related to a parsing. labels Jan 19, 2017
@joernhees
Copy link
Member

hmm, keeping base64 strings as they are and requiring the user to base64 decode them manually (as this would enforce) is one way to solve this, but i think we can do better...

The part of #646 that probably is wrong is treating the value as a unicode string later on. It should be treated as a byte string after decoding. In other words, if you have a picture / pdf inside your RDF serialized as a base64 encoded literal, i'd see it as RDFLib's responsibility to give you picture / pdf data in form of a byte string, no? Probably a bit difficult with Literals being unicode subclasses...

@nateprewitt
Copy link
Contributor Author

@joernhees I'm not tied to the idea of requiring them to be manually decoded, but I do think the approach provides the least amount of discomfort for the most users.

The three primary issues I see with automatically decoding are the subtle modification of the Literal value, the issues with bytes objects between Python 2/3, and as you noted, the fact that Literal is defined to be unicode in the class hierarchy.

With automatically decoding, I can run a SPARQL query and get the Literal back as its Base64-encoded value. When I try to compare that with the value of my Literal node in Python, it will differ. It will likely be hard to determine what's happening here without digging into this issue tracker. We also have the problem of the value content possibly not being string compatible, which isn't true for the rest of RDFLib's literals. If I create an loop treating my literals as strings, RDFLib may crash here which can be painful especially with large/changing datasets.

Regarding returning bytes, I'm generally in favor of bytes over encoded-strings, however this should preferably be done uniformly. Returning bytes here introduces the issue of the literal having a type of str in Python 2 and bytes (and not str/unicode) in Python 3. That means code has the possibility of diverting unexpectedly when running a program with different versions. This is something to be expected when switching versions but I've found things like isinstance(value, str) or str(value) are common stumbling blocks, even for those experienced with Python. It's easy for unchanged code that was working to suddenly break without a immediate explanation to why.

As for Literals subclassing unicode, returning bytes fundamentally undermines some core premises in the library. While we may be able to remove that single cast to unicode from XSDToPython, most of the Literal class' functionality relies on it being castable with unicode. We lose the ability to use all equivalences (__eq__,__lt__, etc), and other basic functionality.

All that said, I can look into amending the patch for a more agreeable solution if this one is a non-starter.

@gromgull
Copy link
Member

Literals subclass unicode because they all have some string representation (lexical space). That does not stop us from having any python object as value (value space). For instance, xsd:integers have the string representation as a "2", but a as value a python int with the value 2.

The same way, Literals can have a lexical encoding which is the base64 encoded byte-stream, and a value-object that is the decoded bytes.

@nateprewitt
Copy link
Contributor Author

@gromgull, apologies, the last portion of my findings may have been hastily misattributed. I hadn't dug into the code base much before this evening.

For strings, it seems RDFLib doesn't maintain an initial distinction between the lexical space and value space, but rather tries to convert the value space back into unicode. The lexical value will always be overwritten by the response from XSDToPython here, when the return value is not None. That means we now have both the lexical space and value space set as binary data. When the Literal constructor calls unicode.__new__(cls, lexical_or_value), lexical_or_value is set to a value of bytes and thus crashes.

The only reason your given example of "2" works is because Python can convert an integer 2 into a unicode string. The current __new__ method is handing an int (not a unicode or str object) to unicode.__new__ which is why you're getting the expected lexical format. This can be verified pretty quickly by stepping through Literal's __new__ with pdb and checking the type of lexical_or_value immediately before the unicode constructor is called.

I'm not immediately sure how to address this, but I can try to work out a more robust solution. It may be a few days until I have time to sit down.

@nateprewitt
Copy link
Contributor Author

Alright, this is probably a more acceptable solution. The patch now addresses a more core problem of the assumption _value is unicode compatible. This will now tests if the string can be converted to unicode before electing it as a normalized successor to the initial lexical_or_value value.

This doesn't address the possibility of an initial lexical_or_value value that isn't unicode compatible (non-printable), but I'm not sure if that can be handled in a sensical way.

@nateprewitt
Copy link
Contributor Author

Hmm, the Python 2 build is failing when trying to install an external dependency. I don't see anything in the diff that immediately points to a problem.

@nateprewitt
Copy link
Contributor Author

Yeah, reverting the commit to 14243b6 still produces the same issue. It looks like something has changed upstream.

A brief look is pointing to a possible conflict between the requirement of pyparsing <= 1.5.7 and the current version of setuptools. It now requires pyparse >= 2.1.0. I ran the test sweet in Python 2 with pyparsing at 2.1.0 without issues. I'll throw up another PR removing the hard versioning in a moment.

@nateprewitt nateprewitt changed the title Remove automatic decoding for base64 values Only use unicode-compatible values for lexical space Jan 23, 2017
@gromgull
Copy link
Member

Funny, I had builds that worked 18 hours ago. Looking at the log, they got setuptool 12.0.5, in this PR build it's 34.something.

@nateprewitt
Copy link
Contributor Author

@gromgull, they pushed out 34.0.0 yesterday (~20 hours ago) which removes all of setuptools vendored dependencies. They're now using requirements from pypa, via a library called packaging, which is what was causing the conflict. The first command of the .travis.yml upgrades pip and setuptools to the latest versions.

Anyway, I just restarted the build, so this should be ready for a glance when you have a moment :) Also I don't believe this would be considered breaking anymore, so it likely could be a candidate for another 4.x release.

@gromgull
Copy link
Member

Looks good!

coding_func, param = getattr(value, 'decode'), 'utf-8'

is maybe more readable if you actually just call it value.decode - move the try/except outside the 3-branch if statement and just return from inside?

but it's not super important!

@nateprewitt
Copy link
Contributor Author

Hmm, I'm a bit torn here. My knee jerk reaction is bigger blocks in the try introduces a higher likelihood of masking exceptions. Particularly, it'll be easy to overlook with future additions/changes to those branch statements.

That said, it seems pretty unlikely this function will change much. I updated the branches and I think it makes the decode line easier to grock, but the rest of the code (for me at least) becomes a bit fuzzier. I think I'm -.5 on this but can make the change if you feel strongly about it.

    try:
        if isinstance(value, bytes):
            value.decode('utf-8')
        elif py3compat.PY3:
            str(value)
        else:
            unicode(value)
    except UnicodeError:
        return False
    return True

This allows us to avoid accidentally overriding our intial lexical
value with one that is not unicode compatible after normalization.
This is specifically relevant for arbitrary binary data with bytes
outside of the defined unicode range.
@gromgull
Copy link
Member

I agree, it's not that much cleaner. Leave it as it is!

@nateprewitt
Copy link
Contributor Author

Great, I fixed the merge conflict so this should be ready to go whenever you all are set :)

@gromgull gromgull merged commit 695a670 into RDFLib:master Jan 24, 2017
@gromgull
Copy link
Member

Thanks a lot!

@nateprewitt nateprewitt deleted the 646_dont_decode_base64 branch January 24, 2017 14:12
@nateprewitt
Copy link
Contributor Author

No problem, glad to help 🎉

@joernhees joernhees modified the milestones: rdflib 4.2.2, rdflib 5.0.0 Jan 25, 2017
@joernhees
Copy link
Member

awesome, thanks

joernhees added a commit that referenced this pull request Jan 25, 2017
* master: (44 commits)
  quote cleanup OCD
  serializer/parser alias for 'ntriples'
  serializer/parser alias for 'ttl'
  cleanup
  remove outdated always skipped test
  a bit of changelog
  add a NTSerializer sub-class for nt11 (#700)
  Restrict normalization to unicode-compatible values (#674)
  fixes for turtle/trig namespace handling
  skip serialising empty default graph
  skip round-trip test, unfixable until 5.0
  prefix test for #428
  Added additional trig unit tests to highlight some currently occurring issues.
  remove ancient and broken 2.3 support code. (#681)
  updating deprecated testing syntax (#697)
  docs: clarify the use of an identifier when persisting a triplestore (#654)
  removing pyparsing version requirement (#696)
  made min/max aggregate functions support all literals (#694)
  actually fix projection from sub-queries
  added dawg tests for #607
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working discussion enhancement New feature or request parsing Related to a parsing.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants