Graph parse module does not handle URI / IRI with non-ascii characters #1429

Closed
PootieT opened this issue Oct 8, 2021 · 5 comments · Fixed by #1902

Comments

@PootieT

PootieT commented Oct 8, 2021

To reproduce behavior:

from rdflib import Graph

g = Graph()
g.parse("https://dbpedia.org/page/Almería")

I was able to work around it locally by editing this function in parser.py:

def _create_input_source_from_location(file, format, input_source, location):
    # Fix for Windows problem https://github.com/RDFLib/rdflib/issues/145
    path = pathlib.Path(location)
    if path.exists():
        location = path.absolute().as_uri()

    base = pathlib.Path.cwd().as_uri()

    # I added the two lines below (they need `from urllib.parse import quote`)
    concept = location.split('/')[-1]
    location = location.replace(concept, quote(concept))
    absolute_location = URIRef(location, base=base)

    if absolute_location.startswith("file:///"):
        filename = url2pathname(absolute_location.replace("file:///", "/"))
        file = open(filename, "rb")
    else:
        input_source = URLInputSource(absolute_location, format)

    auto_close = True
    # publicID = publicID or absolute_location  # Further to fix
    # for issue 130

    return absolute_location, auto_close, file, input_source

according to the help here

Is there a better way to deal with this scenario or should this be a PR?
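One caveat with the two added lines, shown as a minimal sketch: `str.replace` rewrites every occurrence of the last segment, so a host name that happens to contain the same text gets percent-encoded as well. The helper names `quote_by_replace` and `quote_last_segment` and the `é.example` host are hypothetical, used only for illustration; neither function is part of rdflib.

```python
from urllib.parse import quote


def quote_by_replace(location: str) -> str:
    """Same logic as the two lines added to parser.py above."""
    concept = location.split("/")[-1]
    return location.replace(concept, quote(concept))


def quote_last_segment(location: str) -> str:
    """Percent-encode only the text after the final '/' (hypothetical helper)."""
    head, sep, tail = location.rpartition("/")
    return head + sep + quote(tail)


# Both variants agree on the IRI from this issue.
print(quote_by_replace("https://dbpedia.org/page/Almería"))
print(quote_last_segment("https://dbpedia.org/page/Almería"))

# They differ when the last segment also appears earlier in the string:
# the replace-based variant percent-encodes the host as well.
print(quote_by_replace("http://é.example/é"))
print(quote_last_segment("http://é.example/é"))
```

Neither variant touches non-ASCII characters in earlier path segments or in the host, so this is only a narrow patch for the reported IRI.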

@nicholascar
Member

Is this on Windows only and therefore related to Issue #1430?

@PootieT
Author

PootieT commented Oct 12, 2021

Sorry, I left out my environment details. I am on Ubuntu 20.04 with rdflib 6.0.1 and Python 3.8.5, so this is not exactly the same as the Windows issue.

@nicholascar
Member

OK, thanks, so a separate issue from #1430.

Yes, it would be great to see a PR here! If you can add one with a test using your example IRI, and perhaps another IRI with non-ASCII characters somewhere other than the last part of the IRI, that would be great.

You might also put a note in the update linking back to that SO article, so people will know where to look for a discussion of more comprehensive solutions.
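A rough sketch of what such a test could check, without touching the network: a table of IRIs, including one with the non-ASCII characters in a middle path segment. The `encode_path` helper and the `example.com` IRIs are assumptions made for this sketch; this is not rdflib's API and not necessarily how the eventual fix is written.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_path(iri: str) -> str:
    """Percent-encode the whole path component (hypothetical helper)."""
    parts = urlsplit(iri)
    # quote() keeps "/" by default, so segment boundaries are preserved.
    return urlunsplit(parts._replace(path=quote(parts.path)))


CASES = [
    # Non-ASCII in the last segment (the IRI from this issue, /resource form).
    ("https://dbpedia.org/resource/Almería",
     "https://dbpedia.org/resource/Almer%C3%ADa"),
    # Non-ASCII in a segment that is not the last one.
    ("http://example.com/Almería/data.ttl",
     "http://example.com/Almer%C3%ADa/data.ttl"),
    # Pure ASCII input should come back unchanged.
    ("http://example.com/plain/path", "http://example.com/plain/path"),
]

for iri, expected in CASES:
    assert encode_path(iri) == expected, (iri, encode_path(iri))
```

Encoding the whole path rather than one segment is what makes the second case pass.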

@ghost

ghost commented Dec 14, 2021

I looked at the SO discussion and distilled it down to this. It works as expected in the "restore-pyrdfa" branch, so I'm parking it here for maybe later:

def test_issue1429_uri_utf8():
    """
    https://stackoverflow.com/a/40654295

    netloc should be encoded using IDNA;
    non-ascii URL path should be encoded to UTF-8 and then percent-escaped;
    non-ascii query parameters should be encoded to the encoding of a page
    URL was extracted from (or to the encoding server uses), then
    percent-escaped.

    """
    def iri2uri(iri):
        """
        Convert an IRI to a URI (Python 3).
        https://stackoverflow.com/a/42309027
        """
        from urllib.parse import urlsplit, urlunsplit, quote

        uri = ""
        if isinstance(iri, str):
            (scheme, netloc, path, query, fragment) = urlsplit(iri)
            scheme = quote(scheme)
            # Host names are IDNA-encoded, which yields pure ASCII
            netloc = netloc.encode("idna").decode("utf-8")
            # Path: UTF-8 encode, then percent-escape ("/" is kept)
            path = quote(path)
            # Keep "=" and "&" so key=value pairs survive quoting
            query = quote(query, safe="=&")
            fragment = quote(fragment)
            uri = urlunsplit((scheme, netloc, path, query, fragment))

        return uri

    from rdflib import Graph

    g = Graph()

    # Original example, (https://dbpedia.org/page) retrieve HTML
    #                                        ^^^^
    try:
        g.parse(iri2uri("https://dbpedia.org/page/Almería"))
    except Exception as e:
        assert (
            repr(e)
            == """PluginException("No plugin registered for (text/html, <class 'rdflib.parser.Parser'>)")"""
        )

    # Retrieve RDF (https://dbpedia.org/resource)
    #                                   ^^^^^^^^
    g.parse(iri2uri("https://dbpedia.org/resource/Almería"))

    # Old example from SO, no longer resolves but is correctly converted
    # g.parse(iri2uri("http://国立極地研究所.jp/english/"))
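The host and query rules from the docstring can be checked offline. The snippet below is only a sketch using the standard library: the query string `q=é&x=1` is made up for illustration, and the choice of `safe="=&"` is an assumption about which delimiters should survive, not something taken from the SO answer.

```python
from urllib.parse import quote

# Host rule: IDNA encoding turns a non-ASCII host into an ASCII "xn--" form.
host = "国立極地研究所.jp".encode("idna").decode("ascii")
print(host)

# Query rule: with the default safe="/" quote() also escapes "=" and "&",
# which destroys the key=value structure of the query string.
default_q = quote("q=é&x=1")
print(default_q)

# Passing the delimiters as safe characters keeps the structure intact.
kept_q = quote("q=é&x=1", safe="=&")
print(kept_q)
```

This only matters for IRIs that actually carry a query string; the two dbpedia.org examples above do not.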

@nicholascar
Member

@gjhiggins why park this in the "restore-pyrdfa" branch? This fix will work for non-HTML sources with non-ASCII characters in the IRI, something like http://example.com/thing/Almería.ttl for instance. So this is a generic fix for parse().

aucampia added a commit that referenced this issue May 19, 2022
Add an iri-to-uri conversion utility to encode IRIs to URIs for `Graph.parse()` sources. Added a couple of tests because feeding it with a suite of IRIs to check seems overkill (not that I could find one).

Fixes #1429

Co-authored-by: Iwan Aucamp <aucampia@gmail.com>