Graph parse module does not handle URI / IRI with non-ascii characters #1429
Comments
Is this on Windows only and therefore related to Issue #1430? |
Sorry I left out my environment details. I am on Ubuntu 20.04, and my rdflib version is |
OK, thanks, so this is a separate issue from #1430. Yes, it would be great to see a PR here! If you can add one with a test using your example IRI, and perhaps another IRI with non-ASCII characters somewhere other than the last part of the IRI, that would be great. You might also put a note in the update linking back to that SO article so people will know where to look for a discussion about more comprehensive solutions. |
I looked at the SO discussion and distilled it down to this. It works as expected in the "restore-pyrdfa" branch, so parking it here for maybe later:

def test_issue1429_uri_utf8():
    """
    https://stackoverflow.com/a/40654295
    netloc should be encoded using IDNA;
    non-ascii URL path should be encoded to UTF-8 and then percent-escaped;
    non-ascii query parameters should be encoded to the encoding of a page
    URL was extracted from (or to the encoding server uses), then
    percent-escaped.
    """

    def iri2uri(iri):
        """
        Convert an IRI to a URI (Python 3).
        https://stackoverflow.com/a/42309027
        """
        from urllib.parse import urlsplit, urlunsplit, quote

        uri = ""
        if isinstance(iri, str):
            (scheme, netloc, path, query, fragment) = urlsplit(iri)
            scheme = quote(scheme)
            # Host names are converted to their ASCII (punycode) form
            netloc = netloc.encode("idna").decode("utf-8")
            # quote() encodes to UTF-8 and percent-escapes, leaving "/" intact
            path = quote(path)
            # Note: with the default safe="/", "=" and "&" are escaped as well
            query = quote(query)
            fragment = quote(fragment)
            uri = urlunsplit((scheme, netloc, path, query, fragment))
        return uri

    from rdflib import Graph

    g = Graph()

    # Original example, (https://dbpedia.org/page) retrieve HTML
    #                                        ^^^^
    try:
        g.parse(iri2uri("https://dbpedia.org/page/Almería"))
    except Exception as e:
        assert (
            repr(e)
            == """PluginException("No plugin registered for (text/html, <class 'rdflib.parser.Parser'>)")"""
        )

    # Retrieve RDF (https://dbpedia.org/resource)
    #                                   ^^^^^^^^
    g.parse(iri2uri("https://dbpedia.org/resource/Almería"))

    # Old example from SO, no longer resolves but is correctly converted
    # g.parse(iri2uri("http://国立極地研究所.jp/english/")) |
@gjhiggins why park this in the "restore-pyrdfa" branch? This fix will work for non-HTML sources with non-ASCII chars in the IRI, perhaps something like |
Add an iri-to-uri conversion utility to encode IRIs to URIs for `Graph.parse()` sources. Added a couple of tests because feeding it with a suite of IRIs to check seems overkill (not that I could find one). Fixes #1429 Co-authored-by: Iwan Aucamp <aucampia@gmail.com>
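For readers who want to see what such a conversion produces without touching the network, here is a minimal sketch based on the snippet earlier in this thread. The name `iri_to_uri_sketch` is hypothetical and is not rdflib API, and `https://example.org/...` is a made-up IRI used only for illustration. One deliberate difference from the snippet above, which is my assumption rather than anything confirmed in the thread: `=` and `&` are kept as safe characters in the query so that key/value separators survive.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri_sketch(iri: str) -> str:
    """Hypothetical helper (not rdflib API): turn an http(s) IRI into a URI."""
    scheme, netloc, path, query, fragment = urlsplit(iri)
    # Host name: IDNA (punycode) encoding yields pure ASCII
    netloc = netloc.encode("idna").decode("ascii")
    # Path: UTF-8 encode, then percent-escape; "/" is safe by default
    path = quote(path)
    # Query: assumption, keep "=" and "&" so parameters stay separable
    query = quote(query, safe="=&")
    fragment = quote(fragment)
    return urlunsplit((scheme, netloc, path, query, fragment))


# The IRI from this issue, converted without making any request
print(iri_to_uri_sketch("https://dbpedia.org/resource/Almería"))
# https://dbpedia.org/resource/Almer%C3%ADa
```

The result can then be passed to `Graph.parse()` in place of the raw IRI, as the test in the earlier comment does.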
To reproduce behavior:
I was able to bypass it locally by editing the function in `parser.py` according to the help here.
Is there a better way to deal with this scenario or should this be a PR?