`Graph.serialize` escapes ampersands (&) in URIs, but doesn't decode them on parse #222

Closed
jsalonen opened this Issue Jul 4, 2012 · 4 comments

2 participants

@jsalonen
jsalonen commented Jul 4, 2012

When serializing rdflib Graphs with serialize, ampersand characters are (incorrectly) escaped as &amp;. The following short script demonstrates the problem:

from rdflib import Graph, Namespace, URIRef

g = Graph()
DC = Namespace("http://purl.org/dc/elements/1.1/")
g.add( (URIRef("http://www.example.com/param1=val1&param2=val2"), DC['title'], "Test") )
data =  g.serialize()

g.serialize(file('ampersand.rdf', 'wt'))

g2 = Graph()
g2.parse(file('ampersand.rdf', 'rt'))
print g2.serialize()

As a result I get:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
   xmlns:ns1="http://purl.org/dc/elements/1.1/"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>
  <rdf:Description rdf:about="http://www.example.com/param1=val1&amp;amp;param2=val2">
    <ns1:title rdf:resource="Test"/>
  </rdf:Description>
</rdf:RDF>

So as a result, the ampersand gets encoded twice, which isn't what I expect.

I'm using RDFLib 3.2.1 and Python 2.7.2

@gjhiggins
RDFLib member

RDFLib is processing your url correctly.

The entity expansion that has been applied to your url "http://www.example.com/param1=val1&amp;param2=val2" is only pertinent in the context of a serialized graph or when the url is written in an HTML document.

Note that replacing & with &amp; is only done when writing the URL in HTML, where "&" is a
special character (along with "<" and ">"). When writing the same URL in a plain text email
message or in the location bar of your browser, you would use "&" and not "&amp;". With
HTML, the browser translates "&amp;" to "&" so the Web server would only see "&" and not
"&amp;" in the query string of the request.

(from http://htmlhelp.com/tools/validator/problems.html)

If you are comfortable realigning your expectation (as opposed to arguing that RDFLib should handle this special case of malformation), please close this ticket.

@jsalonen
jsalonen commented Jul 6, 2012

Thanks for your reply.

Seems like I was somewhat mistaken. &amp; indeed is a valid XML entity (http://www.w3.org/TR/REC-xml/) and thus, it is totally valid to escape & as &amp; in RDF/XML.

What still puzzles me is the behaviour of RDFLib when deserializing: & remains escaped in the Graph when an RDF/XML document is parsed. This means that the second time I serialize the graph, &amp; will become &amp;amp;.

If this is the expected behaviour, then what is the preferred way to deal with this in RDFLib? For instance, is there a way to specify for parse that I want it to decode XML entities or do I just need to do this manually after every call to parse? Documentation for parse didn't give any further insight.

@gjhiggins
RDFLib member

hmmm, perhaps I should have written "only pertinent in the context of a graph serialized in RDF/XML format". And the escaping isn't just "totally valid" but mandatory: "MUST NOT appear in their literal form"

The entities are escaped when the graph is serialized to XML and unescaped when read back in, so if what you are feeding a graph is a (valid) XML serialization, then the entities are /required/ to be escaped. But in the example you give, you're not feeding it XML, you're feeding it a string --- in which the ampersand /should not/ be escaped because escaping is "only pertinent in the context of a graph serialized in RDF/XML format". There, fixed it for me :-)
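
The escape-on-write, unescape-on-read behaviour can be seen with the standard library alone. This is a minimal sketch, assuming `xml.etree.ElementTree` as a stand-in for an XML serializer and parser (it is not RDFLib's own code path); the `Description` element, the `about` attribute, and the `round_trip` helper are names made up for illustration:

```python
import xml.etree.ElementTree as ET

# A plain URI string, and the same URI with the ampersand already entity-escaped.
RAW_URI = "http://www.example.com/param1=val1&param2=val2"
PRE_ESCAPED_URI = "http://www.example.com/param1=val1&amp;param2=val2"


def round_trip(uri):
    """Write uri as an XML attribute value, then parse that XML back."""
    elem = ET.Element("Description", {"about": uri})
    xml_text = ET.tostring(elem, encoding="unicode")  # escaping happens here
    parsed = ET.fromstring(xml_text).get("about")     # unescaping happens here
    return xml_text, parsed


xml_raw, parsed_raw = round_trip(RAW_URI)
xml_pre, parsed_pre = round_trip(PRE_ESCAPED_URI)

print(xml_raw)     # the single "&" is written as "&amp;"
print(parsed_raw)  # and read back as the original string
print(xml_pre)     # the pre-escaped "&amp;" is written as "&amp;amp;"
print(parsed_pre)  # and read back as "&amp;", exactly what was put in
```

In both cases the round trip returns the string that was supplied. The doubled `&amp;amp;` only appears because the input string already contained the text `&amp;`.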

As the example below demonstrates, when a graph is fed with properly serialized (entity-escaped) XML then round-tripping works just as one would expect.

Looking at how ampersand is handled by the (not-XML) serialization formats of ntriples and notation3 might shed a bit more light.

import rdflib
# Pretty much your example, except I wrapped the string in a Literal
g = rdflib.Graph()
DC = rdflib.Namespace("http://purl.org/dc/elements/1.1/")
g.add((rdflib.URIRef("http://www.example.com/param1=val1&amp;param2=val2"),
    DC['title'], rdflib.Literal("Test")))
data = g.serialize()

with open('ampersand.rdf', 'wb') as fp:
    fp.write(data)

g2 = rdflib.Graph()
g2.parse('ampersand.rdf')

print(g2.serialize(format="pretty-xml").decode('utf8'))
# results in double-escaping
"""\
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:ns1="http://purl.org/dc/elements/1.1/"
>
  <rdf:Description
        rdf:about="http://www.example.com/param1=val1&amp;amp;param2=val2">
    <ns1:title>Test</ns1:title>
  </rdf:Description>
</rdf:RDF>
"""

# Not XML, so no escaping
print(g2.serialize(format="nt").decode('utf8'))
"""\
<http://www.example.com/param1=val1&amp;param2=val2>
    <http://purl.org/dc/elements/1.1/title> "Test" .
"""

# Not XML, so no escaping
print(g2.serialize(format="n3").decode('utf8'))
"""\
@prefix ns1: <http://purl.org/dc/elements/1.1/> .

<http://www.example.com/param1=val1&amp;param2=val2> ns1:title "Test" .
"""

# Try with correctly-serialized, singly-escaped XML, same URL as in the original example
# but this time, in the context of an XML serialization ...

rdfxml = """\
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:ns1="http://purl.org/dc/elements/1.1/"
>
  <rdf:Description
    rdf:about="http://www.example.com/param1=val1&amp;param2=val2">
    <ns1:title rdf:resource="file:///tmp/Test"/>
  </rdf:Description>
</rdf:RDF>"""

g3 = rdflib.Graph()
g3.parse(data=rdfxml)
g3.add((rdflib.URIRef("http://www.example.com/param1=valI&param2=valII"),
        DC['title'], rdflib.Literal("Test")))

print(g3.serialize().decode('utf8'))

# Correctamundo
"""\
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
   xmlns:ns1="http://purl.org/dc/elements/1.1/"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>
  <rdf:Description
        rdf:about="http://www.example.com/param1=valI&amp;param2=valII">
    <ns1:title>Test</ns1:title>
  </rdf:Description>
  <rdf:Description
        rdf:about="http://www.example.com/param1=val1&amp;param2=val2">
    <ns1:title rdf:resource="file:///tmp/Test"/>
  </rdf:Description>
</rdf:RDF>
"""

# ditto
print(g3.serialize(format="nt").decode('utf8'))
"""\
<http://www.example.com/param1=valI&param2=valII>
    <http://purl.org/dc/elements/1.1/title> "Test" .
<http://www.example.com/param1=val1&param2=val2>
    <http://purl.org/dc/elements/1.1/title> <file:///tmp/Test> .
"""

# ditto
print(g3.serialize(format="n3").decode('utf8'))
"""\
@prefix ns1: <http://purl.org/dc/elements/1.1/> .

<http://www.example.com/param1=val1&param2=val2>
    ns1:title <file:///tmp/Test> .

<http://www.example.com/param1=valI&param2=valII>
    ns1:title "Test" .
"""

The above code is Python2.7/3 idiomatic.

@jsalonen jsalonen closed this Jul 6, 2012
@jsalonen jsalonen reopened this Jul 6, 2012
@jsalonen
jsalonen commented Jul 6, 2012

WOW! Thank you for a very comprehensive and helpful answer!

And I agree with you: RDFLib is working correctly and as expected, so there is no bug whatsoever. I just need to change the way I encode ampersands in my application.
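
One way to make that application-side change is to decode any pre-escaped ampersands before the string is handed to `URIRef`. This is a rough sketch using only the standard library; `to_plain_uri` is a hypothetical helper name, and it assumes the incoming strings never contain the literal text `&amp;` on purpose:

```python
from xml.sax.saxutils import unescape


def to_plain_uri(text):
    """Decode one level of XML entities (&amp;, &lt;, &gt;) in a URI string.

    Hypothetical helper: call it on strings that may have been copied out of
    XML or HTML, before passing the result to rdflib.URIRef.
    """
    return unescape(text)


escaped = "http://www.example.com/param1=val1&amp;param2=val2"
plain = "http://www.example.com/param1=val1&param2=val2"

print(to_plain_uri(escaped))  # the "&amp;" is decoded to "&"
print(to_plain_uri(plain))    # a string with no entities is left unchanged
```

The serializer then adds the single level of escaping that RDF/XML requires, and the non-XML formats show the bare `&`.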

Closing the issue.

@jsalonen jsalonen closed this Jul 6, 2012