Graph parse module does not handle URI / IRI with non-ascii characters #1429

PootieT · 2021-10-08T14:48:48Z

To reproduce behavior:

from rdflib import Graph

g = Graph()
g.parse("https://dbpedia.org/page/Almería")

I was able to bypass it locally by editing the function in parser.py

def _create_input_source_from_location(file, format, input_source, location):
    # Fix for Windows problem https://github.com/RDFLib/rdflib/issues/145
    path = pathlib.Path(location)
    if path.exists():
        location = path.absolute().as_uri()

    base = pathlib.Path.cwd().as_uri()

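    # I added the two lines below: percent-encode the last path segment.
    # (Assumes `quote` is imported from urllib.parse at the top of parser.py.)
    concept = location.split('/')[-1]
    location = location.replace(concept, quote(concept))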
    absolute_location = URIRef(location, base=base)

    if absolute_location.startswith("file:///"):
        filename = url2pathname(absolute_location.replace("file:///", "/"))
        file = open(filename, "rb")
    else:
        input_source = URLInputSource(absolute_location, format)

    auto_close = True
    # publicID = publicID or absolute_location  # Further to fix
    # for issue 130

    return absolute_location, auto_close, file, input_source

according to the help here

Is there a better way to deal with this scenario or should this be a PR?

nicholascar · 2021-10-12T12:42:19Z

Is this on Windows only and therefore related to Issue #1430?

PootieT · 2021-10-12T13:50:20Z

Sorry, I left out my environment details. I am on Ubuntu 20.04 with rdflib 6.0.1 and Python 3.8.5, so this is not exactly the same as the Windows issue.

nicholascar · 2021-10-12T14:09:58Z

OK, thanks, so a separate issue from #1430.

Yes, it would be great to see a PR here! If you can add one with a test using your example IRI, and perhaps another IRI with non-ASCII characters somewhere other than the last part of the IRI, that would be great.

You might also put a note in the update linking back to that SO article so people will know where to look for a discussion of more comprehensive solutions.
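To illustrate the case raised here (non-ASCII characters outside the last path segment), the following is a minimal sketch, not the rdflib implementation. The helper name `quote_path_segments` and the `example.org` IRI are hypothetical. It percent-encodes the whole path rather than only the final segment, and it keeps `%` safe so an already-encoded IRI is not escaped twice.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_path_segments(iri: str) -> str:
    """Percent-encode the path of an IRI (hypothetical helper, for illustration)."""
    parts = urlsplit(iri)
    # safe="/%" keeps the segment separators and leaves existing escapes untouched.
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


# Non-ASCII characters in two different segments are both encoded as UTF-8 bytes.
print(quote_path_segments("https://example.org/Almería/población"))
# https://example.org/Almer%C3%ADa/poblaci%C3%B3n
```

A test along these lines needs no network access, so it could sit next to a test that actually fetches the DBpedia resource.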

ghost · 2021-12-14T12:39:19Z

I looked at the SO discussion and distilled it down to this. It works as expected in the "restore-pyrdfa" branch, so I am parking it here for maybe later:

def test_issue1429_uri_utf8():
    """
    https://stackoverflow.com/a/40654295

    netloc should be encoded using IDNA;
    non-ascii URL path should be encoded to UTF-8 and then percent-escaped;
    non-ascii query parameters should be encoded to the encoding of a page
    URL was extracted from (or to the encoding server uses), then
    percent-escaped.

    """
    def iri2uri(iri):
        """
        Convert an IRI to a URI (Python 3).
        https://stackoverflow.com/a/42309027
        """
        from urllib.parse import urlsplit, urlunsplit, quote

        uri = ""
        if isinstance(iri, str):
            (scheme, netloc, path, query, fragment) = urlsplit(iri)
            scheme = quote(scheme)
            netloc = netloc.encode("idna").decode("utf-8")
            path = quote(path)
            query = quote(query)
            fragment = quote(fragment)
            uri = urlunsplit((scheme, netloc, path, query, fragment))

        return uri

    from rdflib import Graph

    g = Graph()

    # Original example, (https://dbpedia.org/page) retrieve HTML
    #                                        ^^^^
    try:
        g.parse(iri2uri("https://dbpedia.org/page/Almería"))
    except Exception as e:
        assert (
            repr(e)
            == """PluginException("No plugin registered for (text/html, <class 'rdflib.parser.Parser'>)")"""
        )

    # Retrieve RDF (https://dbpedia.org/resource)
    #                                   ^^^^^^^^
    g.parse(iri2uri("https://dbpedia.org/resource/Almería"))

    # Old example from SO, no longer resolves but is correctly converted
    # g.parse(iri2uri("http://国立極地研究所.jp/english/"))
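One detail of the distilled `iri2uri` above may be worth checking before it is reused: with its default `safe="/"`, `quote()` also escapes `=` and `&`, so an IRI with query parameters would come out with its query string mangled. The sketch below is a hypothetical variant (the name `iri_to_uri`, the chosen `safe` sets, and the `example.org` IRIs are assumptions, not rdflib API) that keeps the usual delimiters intact and can be checked offline.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Hypothetical variant of iri2uri that preserves URL delimiters."""
    scheme, netloc, path, query, fragment = urlsplit(iri)
    # Host name: IDNA-encode (assumes a plain host, no non-ASCII userinfo).
    netloc = netloc.encode("idna").decode("ascii")
    # Path: UTF-8 percent-encode; keep "/" and existing "%" escapes.
    path = quote(path, safe="/%")
    # Query and fragment: additionally keep the key/value delimiters.
    query = quote(query, safe="=&%/?:@")
    fragment = quote(fragment, safe="%/?")
    return urlunsplit((scheme, netloc, path, query, fragment))


# The default quote() call escapes the query delimiters as well:
print(quote("q=Almería&lang=es"))
# q%3DAlmer%C3%ADa%26lang%3Des

# The variant leaves "=" and "&" alone and only encodes the non-ASCII text:
print(iri_to_uri("https://example.org/search?q=Almería&lang=es"))
# https://example.org/search?q=Almer%C3%ADa&lang=es
```

Which characters belong in each `safe` set is a judgment call; RFC 3986 lists the characters allowed per component, and a stricter or looser set may suit a given server better.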

nicholascar · 2021-12-15T01:21:22Z

@gjhiggins why park this in the "restore-pyrdfa" branch? This fix will work for non-HTML sources with non-ASCII chars in the IRI, perhaps something like http://example.com/thing/Almería.ttl for instance. So this is a generic fix for parse()

Add an iri-to-uri conversion utility to encode IRIs to URIs for `Graph.parse()` sources. Added a couple of tests because feeding it with a suite of IRIs to check seems overkill (not that I could find one). Fixes #1429 Co-authored-by: Iwan Aucamp <aucampia@gmail.com>

This was referenced Dec 15, 2021

fix for issue1429 #1507

Closed

Issue 1484 fix #1508

Closed

ghost mentioned this issue May 9, 2022

Fixes #1429, add iri2uri #1902

Merged

8 tasks

aucampia closed this as completed in #1902 May 19, 2022
